#mongodb logs for Saturday the 28th of May, 2016

[00:42:51] <pyCasso> how would i write a function that adds all quantities based on an item number?
[00:43:57] <pyCasso> db.orders.aggregate([{"$project" : {"_id" : "$item_number", "fulltotal" : {"$sum" : "$price"} } }]);
[00:46:25] <Boomtime> @pyCasso: $group not $project
[00:47:15] <Boomtime> also, if the collection is large that will take a long time since it requires reading every single document
[00:47:27] <pyCasso> I tried that and the fulltotal value returns 0
[00:47:43] <pyCasso> really? oh didn't know that
[00:49:28] <pyCasso> how would you write the query? I think to first query all items with a specific item number, then get the sum of those items' quantities to get the total items sold
[00:51:52] <pyCasso> i'm new to using mongodb so would like to know how to maximize performance. i typically just get a list of objects and use python to filter down the results
[01:03:01] <StephenLynx> pyCasso, you have projection
[01:03:06] <StephenLynx> and aggregation
[01:03:17] <StephenLynx> projection is just a basic binary filter
[01:03:26] <StephenLynx> a field is either placed on the result or not.
[01:03:45] <StephenLynx> aggregation allows for more complex operations, like grouping data, unwinding arrays and more
[01:04:00] <StephenLynx> but its usually not as fast as regular finds.
[01:04:30] <StephenLynx> let me see what you need
[01:04:50] <StephenLynx> you want to take fieldA and sum it with fieldB?
[01:05:02] <StephenLynx> and just read this value?
[01:05:24] <StephenLynx> yeah, use aggregation and use $group to do that.
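(A minimal sketch of the $group pipeline being described here, assuming the "orders" collection and the "item_number" and "price" fields from pyCasso's snippet plus a hypothetical "quantity" field; it only produces real totals once those fields hold numbers, which comes up later in the log.)

    db.orders.aggregate([
        { "$group" : {
            "_id" : "$item_number",                       // one result document per item number
            "total_quantity" : { "$sum" : "$quantity" },  // total units sold for that item
            "fulltotal" : { "$sum" : "$price" }           // total price for that item
        } }
    ]);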
[01:06:06] <Boomtime> @pyCasso: can you put a sample document in a gist/pastebin so we can see what you're starting with?
[01:08:31] <pyCasso> Boomtime http://pastebin.com/W1HAjiTR
[01:09:08] <pyCasso> the objects at top are from my mongo collection and the function i currently have is at the bottom
[01:10:36] <Boomtime> why the $unwind?
[01:10:41] <Boomtime> the field is scalar
[01:16:01] <pyCasso> i honestly don't know, i was just playing around. i'm new
[01:16:24] <Boomtime> ok, well, let's start at the top...
[01:16:34] <Boomtime> every field in your documents is a string
[01:16:48] <Boomtime> (except _id)
[01:17:10] <Boomtime> if you want to use math operators like $sum then you need numeric field values
[01:17:26] <Boomtime> ie. "price" : "60.00" <- this is a string
[01:17:31] <Boomtime> "price" : 60.00 <- this is a number
[01:18:06] <Boomtime> get rid of the $unwind, it's superfluous at best - change the field values to what they actually are and it should work
[01:24:06] <pyCasso> could I wrap the price string with Number("$price") ?
[01:26:50] <pyCasso> Number("$price"), parseInt() and parseFloat() all return NaN, Boomtime.
[01:35:43] <Boomtime> are you learning javascript as well?
[01:36:04] <Boomtime> oh, you're passing that to mongodb?
[01:36:56] <Boomtime> remember, mongodb itself generally does not run javascript (there are very specific exceptions), the pipeline you have could be equally expressed in C#, Java, python, etc
[01:37:15] <Boomtime> the pipeline must be valid JSON only
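(A hedged sketch of the fix Boomtime describes. Number("$price") fails because it is evaluated as client-side JavaScript on the literal string "$price" before the pipeline is ever sent to the server, hence the NaN; the data itself has to be stored as numbers. The one-off shell migration below is illustrative only, reusing the "orders" collection and field names from earlier.)

    // one-off migration: rewrite the string fields as real numbers
    db.orders.find().forEach(function (doc) {
        db.orders.update(
            { "_id" : doc._id },
            { "$set" : {
                "price" : parseFloat(doc.price),          // "60.00" (string) -> 60.00 (number)
                "quantity" : parseInt(doc.quantity, 10)
            } }
        );
    });

After that, the $group pipeline sketched earlier should return non-zero totals without any $unwind.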
[02:40:47] <crodjer> kurushiyama: Thanks. We realized that hashed indexes don't support uniqueness, so they don't fit our case anyway.
[02:48:04] <jiffe> if I have a key on a field and want to select N documents in order, assuming that collection is huge, whats the best way to do that?
[02:48:39] <jiffe> col.find().sort().limit() seems to take a very long time
[02:49:08] <jiffe> the field I'm sorting on is indexed
[02:50:52] <jiffe> nevermind seems to be coming back quick now, maybe something else was holding it up
[02:52:33] <Boomtime> @jiffe: possible, or perhaps there are multiple indexes to use and the planner is occasionally choosing poorly - use .explain(true) to see how many plans are available
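(A small sketch of the .explain(true) check Boomtime suggests; "col" and "some_field" are placeholders for jiffe's actual collection and indexed field.)

    db.col.find({ "some_field" : { "$gte" : "some_value" } })
          .sort({ "some_field" : 1 })
          .limit(20)
          .explain(true)   // verbose output: shows the winning plan and any rejected candidate plans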
[04:47:27] <jiffe> Boomtime: I only have 1 index on this collection
[04:47:45] <jiffe> I haven't had it react slowly since so not sure why it was doing that earlier
[04:47:59] <jiffe> I'm running into another issue with the perl mongodb module
[04:49:35] <jiffe> I'm storing binary gzipped data and with an older version of the perl module it worked fine but it doesn't seem to anymore with the newer version
[07:50:34] <kurushiyama> jiffe: Another question is whether the index is in sorting order.
[14:39:32] <noman> Hi
[14:40:01] <noman> i just came here to ask why mongodb rocks! :D
[14:41:15] <Zelest> :D
[14:42:08] <noman> i mean seriously i'm so hyped with mongodb i just want to talk to the community
[14:42:19] <noman> it fits perfectly with my multiplayer game project
[16:50:53] <jiffe> alright so it's happening again now, these queries have generally returned right away but now this query has taken several minutes and is still going
[16:51:02] <jiffe> db.message_data.find({ message_identifier: { '$gte': '004621cc068c8f24dfeab79b3d11e912.12019af3f8bffba0aedc7aaefefc7509fcfa4399.1417782247' } }).sort({ message_identifier: 1 }).limit(20)
[16:52:25] <Zelest> silly question perhaps, but I assume you have an index on message_identifier?
[16:52:35] <jiffe> I do
[16:52:39] <Zelest> also, how does "$gte" behave on strings :o
[16:53:43] <jiffe> it seemed to work fine for the first 400000 documents I've pulled so far
[16:54:12] <jiffe> I wouldn't think doing a string comparison would be much different from a numeric comparison
[17:01:35] <jiffe> here's an explain on the query http://nsab.us/public/mongodb
[17:12:08] <jiffe> if I connect directly to maildump4, which is one of the mongod's, and run the query it comes back right away; through mongos it just sits there though
[17:17:41] <jiffe> there are different versions, it looks like 3.0.8 and 3.0.9 between the mongod's and the new mongos's are 3.0.12, not sure how much of a difference that makes
[17:19:29] <jiffe> I might just run through and upgrade everything
[17:27:30] <jiffe> using mongos on maildump1 v3.0.9 does the same thing though so I'm guessing thats not it
[17:34:55] <jokke> i need some help with beamers two-screen-functionality
[17:35:29] <jokke> i've included the pdfpages package and set the beameroption show notes on second screen
[17:35:41] <jokke> still notes are on a separate pdf page
[17:36:19] <jokke> oh
[17:36:21] <jokke> lol
[17:36:28] <jokke> totally wrong channel :D
[17:38:09] <jiffe> so how does one troubleshoot a query working on the individual replica set members but not through mongos ?
[18:36:23] <kurushiyama> jiffe: Uhm, that is quite a long identifier, and we are talking of a lexicographic comparison. Compared to an int or int64 comparison, that is much more complex. I might be wrong, but iirc the average time for string comparison is O((n-1)/2), with a very long n in this case, multiplied by the number of documents.
[18:39:45] <kurushiyama> jiffe: And, as far as I can see, the query does not include your shard key, resulting in a behaviour known as "scatter/gather".
[18:41:23] <jiffe> kurushiyama: message_identifier is the shard key
[18:41:28] <kurushiyama> o.O
[18:42:06] <kurushiyama> Ah, sorry, my fault, the $gte. But the scatter/gather problem remains.
[18:42:33] <jiffe> and for the most part you shouldn't have to scan the whole string to do the comparison, and if I go to the individual shards this query returns right away, mongos seems to be the problem
[18:43:31] <kurushiyama> jiffe: Here is what happens: since the target shard can not be reliably determined for the $gte, your query is sent to _all_ shards which aren't ruled out by the individual shards' key ranges.
[18:43:55] <jiffe> well there are only two shards
[18:43:58] <kurushiyama> Then, the query is executed on those shards, which report their results back to mongos.
[18:44:24] <kurushiyama> Only then are those results sorted.
[18:45:01] <jiffe> yeah, so since I'm limiting to 20 documents, each shard would send back a max of 20 documents and mongos would have to sort through 40 documents to return the results
[18:45:04] <kurushiyama> So, we have a string comparison which is less than ideal, plus a scatter/gather.
[18:45:09] <kurushiyama> jiffe: No
[18:45:16] <kurushiyama> The limit would be done on the mongos.
[18:45:27] <jiffe> that makes no sense
[18:45:35] <kurushiyama> jiffe: It is that way.
[18:45:48] <jiffe> the query would return hundreds of millions of results without the limit
[18:46:07] <kurushiyama> No, since the limit is then applied on mongos.
[18:46:26] <jiffe> yeah, but the shards would return hundreds of millions of results to mongos
[18:46:36] <kurushiyama> Aye. guess why it takes minutes.
[18:46:43] <jiffe> it wasn't though
[18:47:11] <jiffe> I was able to go through the first 400000 documents before it started stalling
[18:47:34] <kurushiyama> 400k docs are not a large result set.
[18:48:08] <jiffe> no, I've been running this query several times changing the value of message_identifier
[18:48:18] <Zelest> anyone happen to know if mongodb-php-library has any support for GridFS yet? :o
[18:48:25] <jiffe> they've all come back quickly until this specific identifier
[18:49:05] <kurushiyama> jiffe: Well, let us put it differently, then. You assume a bug in mongos? Or just arbitrary stalling?
[18:49:26] <Zelest> repairdb? drop and re-create index? *shrugs*
[18:49:31] <kurushiyama> Zelest: Derick actually should know. And I am pretty sure it is documented.
[18:49:35] <jiffe> well that's what I'm trying to figure out, why it stopped working all of a sudden
[18:50:04] <jiffe> what you're saying makes sense, although it doesn't make sense to not limit on the shards because you'll never need more than limit() documents from any given shard
[18:50:30] <kurushiyama> jiffe: From my experience, 99.999% of the cases where somebody claims "the tool is failing", it is caused by misuse based on wrong assumptions.
[18:50:49] <kurushiyama> jiffe: Lets take that for granted.
[18:51:01] <Zelest> kurushiyama, yeah, did a quick search through the docs without much result.. and google is tricky, as it basically only shows result for the legacy driver :P
[18:51:04] <kurushiyama> jiffe: Let us say you have to sort only 40 docs.
[18:52:06] <Zelest> nvm, looks like GridFS exists :D
[18:52:17] <kurushiyama> jiffe: Still, you have a string comparison with O((n-1)/2) over "hundreds of millions of docs", and the mongos can not return a result until all shards which got this query are finished.
[18:52:22] <Zelest> https://github.com/mongodb/mongo-php-library/tree/master/src/GridFS :)
[18:52:57] <jiffe> kurushiyama: well that should be fine because for all message_identifiers I've tried, if I go to mongod directly it always comes back right away
[18:53:01] <kurushiyama> jiffe: with an n =85
[18:53:39] <kurushiyama> jiffe: I'd have a close look at the key ranges.
[18:54:52] <jiffe> now that you mention it, looking at the op taking place on the shard via db.currentOp() through mongos I see the query and orderby but nothing about limit
[18:55:17] <jiffe> so I'm not sure why my initial queries came back without a problem
[18:55:29] <jiffe> maybe I'll rerun those and see what its doing
[18:55:36] <kurushiyama> As said: I'd have a close look at the key ranges.
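(A hedged sketch of how the key ranges kurushiyama mentions could be inspected from a mongos; the database name "maildump" is an assumption.)

    sh.status()   // summary of shards, databases and chunk distribution

    // or list the chunk boundaries for the collection directly:
    db.getSiblingDB("config").chunks.find(
        { "ns" : "maildump.message_data" },
        { "min" : 1, "max" : 1, "shard" : 1 }
    ).sort({ "min" : 1 })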
[18:55:50] <jiffe> regardless I don't see any reason why you shouldn't be able to push the limit to the shards also
[18:56:26] <kurushiyama> I know there was a good reason for it. I just can not remember...
[18:56:59] <kurushiyama> cheeser: Can you remember?
[18:58:50] <kurushiyama> jiffe: Nevertheless, this is not going to help you. Looking at this key, it looks like a composite, delimited by "."
[18:59:26] <kurushiyama> jiffe: I assume you are looking for messages newer than a given date?
[18:59:57] <jiffe> no I'm just trying to go through all messages in batches
[19:00:22] <kurushiyama> jiffe: So why don't you?
[19:00:57] <jiffe> is there a better way? doing a search this seemed like the most reasonable restartable approach
[19:01:16] <kurushiyama> db.foo.find().skip().limit()
[19:01:32] <kurushiyama> maybe a sort
[19:01:39] <jiffe> skip won't require scanning through all the messages?
[19:01:44] <kurushiyama> no string comparison
[19:01:56] <kurushiyama> worth a try, imho
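(A minimal sketch of the skip/limit batching kurushiyama suggests, reusing the "message_data" collection from the earlier query; the batch size is arbitrary and a .sort() could be added for a stable order. Note that skip() still has to walk past the skipped documents, which fits the slowness jiffe reports further down.)

    var batchSize = 1000;
    for (var offset = 0; ; offset += batchSize) {
        var batch = db.message_data.find()
                      .skip(offset)
                      .limit(batchSize)
                      .toArray();
        if (batch.length === 0) break;   // no documents left
        // ... process this batch ...
    }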
[19:02:23] <kurushiyama> what do you do with those messages? updates?
[19:02:48] <jiffe> no, building statistical information
[19:03:04] <jiffe> so just reads
[19:03:05] <kurushiyama> jiffe: How about using the aggregation framework for it.
[19:03:40] <kurushiyama> jiffe: Building statistical info is sorta what it was invented for.
[19:03:54] <jiffe> well I have to parse out a bunch of information from the message
[19:03:59] <kurushiyama> ?
[19:04:18] <kurushiyama> Describe _precisely_ what you need to parse
[19:04:44] <kurushiyama> A sample doc wont hurt, either.
[19:05:53] <kurushiyama> And the desired result
[19:08:13] <Zelest> bah
[19:08:23] <Zelest> "Support for GridFS is forthcoming." :(
[19:10:01] <jiffe> well the messages are mail messages, the headers have some spam markings I need to pull out and I'm also running the full content through a spam classifier
[19:10:46] <kurushiyama> jiffe: The whole message is in a single field?
[19:11:05] <kurushiyama> jiffe: I need to see one, remove the message payload, if you wish.
[19:11:11] <jiffe> no it's broken up between several documents because our max message size is 64MB
[19:11:22] <kurushiyama> ew
[19:11:25] <jiffe> its also gzipped up
[19:11:32] <kurushiyama> gzipped ew
[19:12:28] <jiffe> yeah we have 2 mail backups, one is a snapshot, this is a message dump so all email that comes in and out of our system goes here with no plans of deletion
[19:12:31] <kurushiyama> jiffe: ??IWK-????1, to be precise
[19:13:39] <kurushiyama> ok, so those message parts are related by said message identifier, I guess.
[19:14:09] <jiffe> yeah, there's the message_identifier and chunk_id to keep it ordered
[19:14:32] <kurushiyama> ew
[19:14:37] <kurushiyama> ;)
[19:15:02] <jiffe> heh, it was the way it had to be done to work with size limitations
[19:15:16] <kurushiyama> gridfs and metadata
[19:16:10] <jiffe> we do have some metadata in a separate db for searching but not what I need so I'm just going through the docs in mongodb
[19:16:42] <kurushiyama> well, you basically reimplemented gridfs and metadata. ;)
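(For comparison: GridFS stores a file as one fs.files metadata document plus ordered fs.chunks documents, which is essentially the message_identifier + chunk_id scheme described above; the filename used below is illustrative only.)

    var fileDoc = db.fs.files.findOne({ "filename" : "some-message-identifier" })   // metadata document
    db.fs.chunks.find({ "files_id" : fileDoc._id }).sort({ "n" : 1 })               // ordered content chunks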
[19:16:59] <kurushiyama> Ok, the stats can be directly derived from the messages headers?
[19:17:25] <jiffe> yeah, the headers are gzipped up too
[19:17:46] <kurushiyama> You know my reaction, I guess ;)
[19:18:30] <jiffe> yeah, well it works for our purposes, we use elasticsearch to search for the messages and get identifiers and then pull content from mongodb by identifier
[19:19:31] <jiffe> we originally had this setup differently using mongodb for search but there were some bugs at the time that caused problems which is when we threw in the elasticsearch layer
[19:22:15] <jiffe> I just tried db.message_data.find().skip(1000000).limit(20) and that seems to take quite a while also
[19:22:32] <kurushiyama> jiffe: I do not see any other option than to work it down this way.
[19:22:37] <jiffe> it might be better to bypass mongos and just query the shards directly and merge things myself
[23:06:24] <kurushiyama> jiffe: I doubt that, and I doubt that it is a good idea.
[23:12:38] <kurushiyama> jiffe: Maybe you should try larger batches.
[23:15:29] <ule> hi there
[23:15:59] <ule> I need to build a query to find a specific string inside an array from my collection object
[23:16:08] <ule> can anybody pls help?
[23:16:22] <ule> I have the order_item_id from this: http://s33.postimg.org/mobyvat27/Screenshot_from_2016_05_28_19_05_41.png
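(A hedged sketch for ule's question; the collection and field names ("orders", "order_items", "order_item_id") are assumptions, since the screenshot is not reproduced in the log. A query value is matched against array elements directly; $elemMatch is for arrays of sub-documents.)

    // array of plain values: an equality match tests each element
    db.orders.find({ "order_items" : "some_order_item_id" })

    // array of sub-documents: match on a field of the elements
    db.orders.find({ "order_items" : { "$elemMatch" : { "order_item_id" : "some_order_item_id" } } })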
[23:18:59] <jiffe> kurushiyama: why don't you think it is a good idea?
[23:20:25] <StephenLynx> because when you use sharding, the individual shards are designed to not be queried directly, I think.
[23:22:54] <jiffe> well I'd be doing what mongos does but in a brute force kind of way
[23:23:17] <jiffe> I suppose if the balancer ran things could move around between queries or something
[23:58:08] <kurushiyama> jiffe: I have my strongest doubts that a makeshift solution will beat a system specifically designed for the same purpose, maintained over years, bugtracked and what not.