PMXBOT Log file Viewer


#mongodb logs for Saturday the 27th of September, 2014

[00:00:33] <Boomtime> in your case, the information you want is trivially pre-computable
[00:02:00] <atrigent> Boomtime: it seems to me that you're blinding yourself to the fact that this query is obviously running massively slower than it needs to just because I mentioned SQL
[00:02:58] <atrigent> there is absolutely nothing SQL-y in what I'm trying to do
[00:03:01] <Boomtime> no, it's running massively slower than it needs to because it shouldn't be run at all
[00:03:37] <Boomtime> if you run this aggregation twice, will the answer be different?
[00:04:42] <Boomtime> the aggregation you quoted is trivially pre-computable, you can maintain the answer much more easily than re-computing it every time
[00:05:12] <atrigent> Boomtime: ok, are you disputing this because this query can not actually, in a purely technical sense, be made faster, or because of some idea of how mongo "should" be used?
[00:05:29] <atrigent> because if it's the latter then I'm not interested in discussing this anymore with you
[00:05:42] <Boomtime> it's both
[00:06:02] <Boomtime> your query as it stands cannot be made faster, it requires walking the entire collection
[00:06:19] <Boomtime> which is usually a sign of using MongoDB incorrectly
[00:06:54] <Boomtime> if you had a $match at the start that made the result more dynamic then you would have something that is not so easily pre-computable
[00:07:09] <Boomtime> but then you would also be making use of an index
[00:09:13] <Boomtime> what you are trying to achieve is absolutely trivial, but you insist on doing it in the worst possible way and then complaining that SQL does it better that way; you mentioned SQL, not me
[00:11:13] <atrigent> well, to answer your question about whether subsequent queries will return the same results, the answer is obviously no
[00:11:52] <atrigent> but it probably won't change a huge amount, and having this data up-to-date isn't crucial, which is why I'll probably wind up caching it somehow
[00:12:08] <Boomtime> now we're getting somewhere...
[00:12:19] <Boomtime> how can it change?
[00:15:19] <Boomtime> @atrigent: the count of any given work_id can only change when you insert or delete a document from that collection
[00:15:42] <atrigent> Boomtime: yes, I do realize where you're going with this
[00:15:45] <Boomtime> maintain your totals at those moments
[00:16:07] <Boomtime> you can even use an aggregation to ensure your totals stay accurate
[00:16:28] <Boomtime> but you only need to update one specific work_id when you insert/delete
[00:16:48] <Boomtime> so your aggregation gets to use an efficient $match as the first pipeline stage
[00:17:17] <Boomtime> and the remaining single stage then solves to a count of results, skipping any need to load them at all
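The counter-maintenance pattern Boomtime describes can be sketched in plain Python, with dicts standing in for a counters collection (in MongoDB the increments would be `$inc` updates on a separate collection; all names here are hypothetical):

```python
from collections import Counter

# Hypothetical stand-ins: "docs" is the main collection, "totals" is a
# counters collection keyed by work_id.
docs = []
totals = Counter()

def insert_doc(doc):
    docs.append(doc)
    totals[doc["work_id"]] += 1   # maintain the total at insert time

def delete_doc(doc):
    docs.remove(doc)
    totals[doc["work_id"]] -= 1   # ...and at delete time

def recompute(work_id):
    # Equivalent of an aggregation with a $match on one work_id followed
    # by a count -- it only ever touches one group, never the whole collection.
    return sum(1 for d in docs if d["work_id"] == work_id)

insert_doc({"work_id": 1})
insert_doc({"work_id": 1})
insert_doc({"work_id": 2})
delete_doc({"work_id": 1})

assert totals[1] == recompute(1) == 1
assert totals[2] == recompute(2) == 1
```

The maintained total and the per-group recount always agree, which is the invariant the aggregation check would verify.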
[01:07:30] <SoulBlade> in the 2.4.x series, is there a mechanism to send multiple updates in a single db call
[01:08:02] <SoulBlade> basically condition / $set tuples
[01:09:02] <SoulBlade> i think i have to do it one by one
[01:15:51] <Boomtime> @SoulBlade: you are correct - many drivers have a multi-update in the API for 2.4, but under the bonnet they are sending the updates one by one
[01:16:04] <SoulBlade> thanks Boomtime
[01:16:15] <Boomtime> note that the drivers may send the updates "all at once" but as multiple update commands on the wire
[01:16:39] <Boomtime> this is much faster because they process the results concurrently, but the wire protocol still sees separate commands
[01:16:49] <SoulBlade> in 2.6, is it using Bulk and doing multiple op.find({_id : id1}).update(doc1); op.find({_id : id2}).update(doc2); etc.
[01:17:10] <SoulBlade> yea - i can parallelize as well, but might as well let the driver
[01:17:19] <Boomtime> "op.find({_id : id1}).update(doc1)" <- wut?
[01:17:32] <SoulBlade> http://blog.mongodb.org/post/84922794768/mongodbs-new-bulk-api
[01:17:42] <Boomtime> oh, i see, the server always does this sort of thing internally
[01:17:55] <SoulBlade> where op is an unordered bulk op or ordered bulk op
[01:18:14] <Boomtime> yes, those APIs send all the commands to the server in a single blob
[01:18:30] <Boomtime> note they are still processed internally by the server the same way
[01:18:48] <Boomtime> they are just sent and replied to as a single operation
[01:18:53] <SoulBlade> so i just save the messaging overhead
[01:18:58] <Boomtime> right
[01:19:07] <SoulBlade> that could still add up i suppose
[01:19:13] <SoulBlade> well actually it does
[01:19:21] <SoulBlade> esp if the primary is elsewhere
[01:19:23] <Boomtime> it can, definitely
[01:19:45] <Boomtime> the expense of course is slightly more complex logic on the client
[01:20:00] <SoulBlade> yep
[01:20:03] <Boomtime> but if you have a pile of commands that you know you need to send, then generally this is not a problem
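The "messaging overhead adds up, especially if the primary is elsewhere" point is just round-trip arithmetic. A back-of-envelope sketch with entirely hypothetical numbers:

```python
# Sending N updates one by one pays one network round trip each, while a
# single batched call pays roughly one round trip total. All numbers
# below are made up for illustration.
n_ops = 100
rtt_ms = 40.0          # e.g. primary in a remote data centre
per_op_ms = 0.1        # server-side cost per update

one_by_one = n_ops * (rtt_ms + per_op_ms)
batched = rtt_ms + n_ops * per_op_ms

assert batched < one_by_one
print(f"one-by-one: {one_by_one:.0f} ms, batched: {batched:.0f} ms")
# -> one-by-one: 4010 ms, batched: 50 ms
```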
[01:20:53] <SoulBlade> i saw a big boost when doing an insert([doc1,doc2,doc3]) vs. insert(doc1); insert(doc2); etc. in my setup
[01:21:08] <SoulBlade> so i think once i upgrade to 2.6 moving to bulk api will def help
[01:21:22] <Boomtime> insert using an array does not use the bulk API
[01:21:41] <Boomtime> the driver just runs the commands concurrently
[01:22:16] <Boomtime> this results in a definite and obvious boost, but the bulk API can do a little better
[01:22:17] <SoulBlade> really? so it just sends individual inserts in parallel
[01:22:19] <SoulBlade> ?
[01:22:41] <Boomtime> the new bulk API is accessed only via the methods quoted in the blog
[01:23:04] <Boomtime> at the moment, most drivers still require you to construct the bulk API yourself
[01:23:09] <Boomtime> "db.collection('documents').initializeOrderedBulkOp();"
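The chained `op.find(...).update(...)` shape quoted above can be mocked in plain Python, backed by a list of dicts so it runs without a server. This is a toy illustration of the calling pattern and the per-operation results, not the real driver API:

```python
class BulkOp:
    """Toy stand-in for initializeOrderedBulkOp(), backed by a list of dicts."""
    def __init__(self, store):
        self.store = store
        self.queued = []

    def find(self, query):
        # Return a small helper so calls chain like op.find(...).update(...)
        bulk = self
        class _Matched:
            def update(self, changes):
                bulk.queued.append((query, changes))
        return _Matched()

    def execute(self):
        # One "round trip": apply everything, return a per-op result list
        results = []
        for query, changes in self.queued:
            n = 0
            for doc in self.store:
                if all(doc.get(k) == v for k, v in query.items()):
                    doc.update(changes["$set"])
                    n += 1
            results.append({"nMatched": n})
        self.queued = []
        return results

store = [{"_id": 1, "x": 0}, {"_id": 2, "x": 0}]
op = BulkOp(store)
op.find({"_id": 1}).update({"$set": {"x": 10}})
op.find({"_id": 2}).update({"$set": {"x": 20}})
results = op.execute()
assert results == [{"nMatched": 1}, {"nMatched": 1}]
assert store[0]["x"] == 10 and store[1]["x"] == 20
```

Note how each queued operation gets its own result entry, which is the advantage over a single array insert discussed further down.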
[01:23:15] <SoulBlade> that's surprising given the boost i saw.
[01:23:18] <SoulBlade> cool
[01:23:25] <SoulBlade> im ok with writing code
[01:23:39] <Boomtime> the array insert will be the single largest boost, the bulk API might get you a little more on top
[01:23:44] <SoulBlade> im kind of surprised at the array of docs actually being parallel
[01:24:30] <Boomtime> there isn't actually a huge amount different between parallel commands and the bulk API commands, at least not until you have more than, say, 100 inserts to do
[01:24:52] <Boomtime> however, the bulk API lets you do mixed commands
[01:25:10] <Boomtime> whereas the array insert means they are all definitely inserts
[01:26:04] <Boomtime> the bulk API also lets you inspect the individual result of each op, including multi-updates, whereas the array method almost always cannot
[01:27:19] <Boomtime> in short: array insert is implemented in parallel by the client, bulk API is implemented in parallel by the server
[01:27:19] <SoulBlade> so looking at the insert w/ array, it seems multiple documents are added to a single insert command
[01:27:46] <Boomtime> ah yes, i forgot about that
[01:27:57] <Boomtime> the trouble is that you get only a single answer
[01:28:52] <Boomtime> if a document insert fails you have a lot of fiddling to do to figure out which one and what happened.. the bulk API has a separate answer per request contained
[01:28:54] <SoulBlade> yea thats the problem w. that one, but i am ok with that limitation for the boost
[01:29:01] <SoulBlade> yea - i look forward to that :)
[01:30:46] <SoulBlade> sigh.. so yea i guess bulk update will be slow for me still on 2.4 - unfortunate
[01:30:59] <SoulBlade> but ill switch over to bulk op when i can
[01:31:22] <SoulBlade> have you migrated a 2.4 to 2.6? can it be done by just adding a 2.6 member to a 2.4 replica set?
[01:31:32] <SoulBlade> trying to avoid downtime
[01:31:45] <Boomtime> yes, the upgrade can be done in a rolling fashion
[01:31:58] <Boomtime> with no downtime so long as you have a replica-set
[01:32:04] <SoulBlade> that i do!
[01:32:05] <SoulBlade> nice
[01:33:12] <Boomtime> do you use auth?
[01:34:57] <SoulBlade> yes
[01:35:04] <SoulBlade> i have a keyfile
[01:35:35] <SoulBlade> will that be a problem?
[01:35:44] <Boomtime> auth is a little different in 2.6 and requires special attention during upgrade
[01:36:01] <Boomtime> http://docs.mongodb.org/manual/release-notes/2.6-upgrade/
[01:36:45] <Boomtime> i may have understated the auth changes, from that doc above: "MongoDB 2.6 includes significant changes to the authorization model,.."
[01:37:23] <Boomtime> "After you begin to upgrade a MongoDB deployment that uses authentication to 2.6, you cannot modify existing user data until you complete the authorization user schema upgrade."
[01:38:55] <SoulBlade> interesting
[01:39:06] <SoulBlade> well i dont really crud users in system.users - i just have one that i use to connect to the DB
[01:39:47] <SoulBlade> and i doubt id modify any of the data - and truth be told, i probably don't need the auth anyway since my app is colocated at the moment
[01:56:28] <atrigent> Boomtime: I am aware of ways to make this work, and I'm aware that counts can be maintained as I add and remove things, but I am still curious why the query is so much faster in mysql
[01:56:53] <atrigent> I realized it might have something to do with how the data is stored
[01:56:59] <atrigent> if it's not the index
[02:00:57] <Boomtime> you have asked a far more complicated question than you realise, and the last time i started to answer it you rejected the answer
[02:04:20] <Boomtime> "I have problem X, what is a fast solution?" versus "I have solution X, how do I make it fast?"
[02:05:02] <Boomtime> With the latter question there might be a system which implements "solution X" and is thus fast by default
[02:05:35] <Boomtime> Porting that solution to a different, unrelated system is an unfair starting point
[02:08:54] <SoulBlade> hmm i must have split during a good convo
[02:15:02] <atrigent> Boomtime: it looked to me like you were just saying "this isn't how it's done", whereas I was looking for technical reasons why this can't be implemented in mongo
[02:16:33] <atrigent> I have a problem where I find it hard to accept the status quo if it's less than ideal :)
[02:16:35] <Boomtime> "... can't be implemented in mongo" <- it can, and you know because you've done it, and there are multiple ways
[02:16:44] <Boomtime> are you trolling?
[02:17:22] <atrigent> but if there's some core aspect of mongo, such as the fact that it is schemaless, that makes this SLOW (sorry, I meant slow), then that's understandable
[02:22:07] <Boomtime> what you are trying to do would be covered by this: https://jira.mongodb.org/browse/SERVER-4507
[02:22:37] <atrigent> I did see that one as well
[02:23:11] <atrigent> when it says "sorted by the _id", it means sorted by the _id that you specify in the $group, right?
[02:26:46] <atrigent> oh I see, so essentially if it's sorted then when you hit another value, you know that you're done with the previous value, so you can do the accumulator operations that group
[02:29:28] <atrigent> but if you're just doing a $sum then you can just maintain a single value for each group, can't you?
[02:30:55] <Boomtime> that is the solution i first proposed
[02:31:37] <atrigent> I mean for the $group operation
[02:32:13] <Boomtime> you really are very insistent on doing this the SQL way aren't you? you are in for a lot of pain
[02:32:23] <atrigent> I might not be understanding what this issue is saying
[02:33:03] <Boomtime> It's complicated
[02:33:53] <atrigent> Boomtime: right now I'm just trying to understand why this is so much slower in mongo
[02:35:04] <Boomtime> it isn't, it's super fast for me
[02:35:32] <SoulBlade> whats the query / operation ?
[02:35:40] <SoulBlade> i think i was split during the early parts of the discussion
[02:36:17] <atrigent> Boomtime: the aggregation isn't :)
[02:36:22] <atrigent> which is what I'm asking about
[02:36:57] <Boomtime> i know, you are insistent on having a SQL solution, rather than solving the actual problem
[02:37:37] <atrigent> argh
[02:39:31] <Boomtime> there is no way to optimize your aggregation, it trawls the entire collection, i've said that, that's the answer to why your aggregation is slow, there is nothing more to say
[02:39:33] <atrigent> Boomtime: do you know why it's so much faster in mysql or not? I'm not "insisting" anything right now, I'm just, out of genuine curiosity, looking for an answer to that question
[02:40:55] <atrigent> hmmm....
[02:42:07] <atrigent> ok, I guess this is kind of an unfair question if I don't share the sql query I'm using
[02:42:24] <Boomtime> wow
[02:42:55] <atrigent> I'll spare you that though
[02:43:36] <Boomtime> Why are you changing from SQL?
[02:44:24] <atrigent> Boomtime: because we're using meteor
[02:45:08] <Boomtime> ok, nm, the "work_id" field, how many documents typically share the same work_id? like, average and maximum
[02:51:30] <atrigent> > db.annotations.aggregate([{ $group: { _id: '$work_id', count: { $sum: 1 } } }, { $group: { _id: 1, avg: { $avg: '$count'}, max: { $max: '$count' } } }]).result
[02:51:36] <atrigent> [ { "_id" : 1, "avg" : 177.87234042553192, "max" : 1501 } ]
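The two-stage pipeline above can be mirrored in plain Python to show what it computes: group sizes per work_id, then the average and maximum of those sizes. The data here is made up:

```python
from collections import Counter

# Hypothetical annotation documents; only work_id matters here.
annotations = [{"work_id": w} for w in [1, 1, 1, 2, 2, 3]]

# Stage 1: $group by work_id with count: {$sum: 1}
counts = Counter(d["work_id"] for d in annotations)

# Stage 2: $group over all groups, taking $avg and $max of the counts
avg = sum(counts.values()) / len(counts)
mx = max(counts.values())

assert counts == {1: 3, 2: 2, 3: 1}
assert avg == 2.0 and mx == 3
```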
[02:53:11] <atrigent> this definitely is the sort of thing that would work fine in a sql database
[02:53:19] <Boomtime> goodo, and how big are the documents? bson size if possible, you can use .collection.stats() to tell you averages
[02:54:13] <Boomtime> btw, with only 1501 maximum, and in particular an average of just 180 you could use an array to embed all of them in a single document unless they are quite large
[02:54:37] <atrigent> "avgObjSize" : 1286.244976076555,
[02:55:04] <Boomtime> hmm.. that might be a bit tricky on the larger ones
[02:55:43] <Boomtime> 1500 * 1286 = ~2MB
[02:56:32] <Boomtime> have you tested with using one document to hold all common work_id documents in an embedded array?
[02:57:01] <Boomtime> that would probably make a bunch of things easier if you can do it
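Boomtime's "~2MB" estimate can be checked against MongoDB's 16MB per-document BSON cap, using the numbers quoted above (a rough average, so this is only a feasibility sketch):

```python
# Rough feasibility check for embedding all same-work_id docs in one
# parent document, using the stats quoted in the log.
avg_obj_size = 1286      # bytes, from avgObjSize
max_per_work = 1501      # largest group

worst_case = avg_obj_size * max_per_work        # ~1.9 MB
bson_limit = 16 * 1024 * 1024                   # 16 MB per-document cap

assert worst_case < bson_limit
print(f"worst case ~{worst_case / 2**20:.1f} MB of a 16 MB limit")
```

It fits, but with little headroom if the larger groups also have larger-than-average documents, which is why the embedding idea is "a bit tricky on the larger ones".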
[02:57:53] <atrigent> hmm
[03:12:11] <atrigent> ok well it's way past time I should've gone home, so I'm going to do that
[03:12:19] <atrigent> maybe I'll be back :)
[12:36:04] <b1001> hi guys. Got a problem where I need to find all documents that match a criteria and don't have a field called translated.
[12:36:17] <b1001> db.tweets.find({'lang':'ar'},{'translated':{$exists:false}})
[12:36:57] <kali> db.tweets.find({'lang':'ar', 'translated':{$exists:false}})
[12:38:27] <b1001> doh.. that was pretty stupid.. thanks kali
[12:45:20] <bwin> people please answer this question http://stackoverflow.com/questions/26074738/mongodb-multiple-operators
[12:52:58] <dazzled> can anyone lead me on the right path http://stackoverflow.com/questions/26072379/adding-entries-onto-a-database-document-instead-of-creating-a-new-one
[13:21:44] <b1001> I have a problem where db.tweets.find({'lang':'ar', 'translated':{'$exists':'false'}}).count() gives a different result in pymongo and mongo shell.
[13:22:44] <b1001> the mongo shell one is without '' around false.. I actually found the one post where I put a field 'translated' with the value 'gibberish'
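The likely culprit in the discrepancy above: the *string* 'false' is truthy, so `{'$exists': 'false'}` is read as "field exists", the opposite of the intended `{'$exists': False}`. The truthiness point can be shown in two lines of Python:

```python
# A quoted 'false' is a non-empty string and therefore truthy, while the
# bare boolean False is not -- which is why the quoted query matched the
# one document that *has* the field.
assert bool('false') is True
assert bool(False) is False
```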
[13:44:20] <testerbit> If mongodb does not find a search query, will it return undefined to the node driver? Or what does it return?
[13:53:50] <edrocks> anyone have experience with infinite scroll or pagination? Specifically with a ranked view not based on time
[17:42:44] <nexact> hello, will a field named _id in an inserted document automatically be converted to an ObjectID() ? I'm trying to import data and I want to keep the id that i've previously generated ... thanks
[17:51:23] <kali> nexact: nope, if you provide an _id field, it will be kept as is and it can be about anything. your driver will add an objectId _id if you insert a doc without any _id, that's all it does.
[17:51:40] <nexact> kali: alright, thanks
[18:43:18] <hydrajump> hi if you have a 3 instance mongod setup, so 1 primary and 2 secondaries. When you have say a nodejs app using mongoose. It is setup to talk to the primary mongod instance. If the primary fails and one of the secondaries is elected a new primary, how does the nodejs app know how to read/write to that new primary? I don't understand the client config for a replicaset
[18:45:42] <kali> hydrajump: when connecting to a replica set node, the driver pulls the replica set configuration and remembers it
[18:46:56] <kali> hydrajump: but in order to be fully resilient, you'd better give the driver the three hosts when it's possible, so if the expected primary is down, the client can still discover the replica set setup
[18:49:44] <hydrajump> kali: cool thanks for explaining and providing all three makes total sense!
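Listing all members, as kali suggests, typically means a connection string like the following (hostnames, port, database, and replica-set name here are all placeholders):

```
mongodb://db1.example.com:27017,db2.example.com:27017,db3.example.com:27017/mydb?replicaSet=rs0
```

With every member named, the driver can still discover the replica set even if the current primary is down when the client starts.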
[18:52:17] <hydrajump> kali: another Q. I can't find a chart or info on the difference between mongod and mongod enterprise? Is the latter commercial and what else is different from the "Regular" version?
[19:07:51] <hydrajump> Only thing i've found about enterprise is that apparently it won't work on ubuntu 14.04 :(
[19:08:21] <kali> hydrajump: no idea
[19:08:32] <hydrajump> kali: no worries ;)
[19:09:10] <hydrajump> kind of strange that SSL requires compiling yourself or using enterprise but on ubuntu 12.04. Doesn't make sense when 14.04 is LTS
[21:13:04] <edrocks> how can I sort by a number then by time? I tried sort({"mycounter":-1,"_id";-1) but it just ends up with the time
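The snippet quoted above has a stray semicolon and a missing closing brace; the intended shell form is presumably `sort({"mycounter": -1, "_id": -1})`. The compound ordering it asks for (counter descending, then _id, i.e. insertion time, descending) can be mimicked on plain dicts:

```python
# Hypothetical documents: sort by mycounter descending, ties broken by
# _id descending (ObjectIds embed a timestamp, so _id order ~ time order).
docs = [
    {"_id": 3, "mycounter": 5},
    {"_id": 1, "mycounter": 7},
    {"_id": 2, "mycounter": 5},
]
ranked = sorted(docs, key=lambda d: (-d["mycounter"], -d["_id"]))
assert [d["_id"] for d in ranked] == [1, 3, 2]
```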