[00:49:28] <pyCasso> how would you write the query? I think I'd first query all items with a specific item number, then sum those items' quantities to get the total items sold
[00:51:52] <pyCasso> i'm new to using mongodb so would like to know how to maximize performance. i typically just get a list of objects and use python to filter down the results
[01:03:01] <StephenLynx> pyCasso, you have projection
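A minimal sketch of the query pyCasso describes, assuming a hypothetical collection named orders with numeric item_number and quantity fields: match the item first, then let the server compute the total instead of filtering in Python.

    db.orders.aggregate([
      { $match: { item_number: 12345 } },                    // only the item in question
      { $group: { _id: "$item_number",
                  totalSold: { $sum: "$quantity" } } }       // server-side total of quantities
    ])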
[01:17:10] <Boomtime> if you want to use math operators like $sum then you need numeric field values
[01:17:26] <Boomtime> ie. "price" : "60.00" <- this is a string
[01:17:31] <Boomtime> "price" : 60.00 <- this is a number
[01:18:06] <Boomtime> get rid of the $unwind, it's superfluous at best - change the field values to what they actually are and it should work
[01:24:06] <pyCasso> could I wrap the price string with Number("$price") ?
[01:26:50] <pyCasso> Number("$price"), parseInt() and parseFloat() all return NaN, Boomtime.
[01:35:43] <Boomtime> are you learning javascript as well?
[01:36:04] <Boomtime> oh, you're passing that to mongodb?
[01:36:56] <Boomtime> remember, mongodb itself generally does not run javascript (there are very specific exceptions), the pipeline you have could be equally expressed in C#, Java, python, etc
[01:37:15] <Boomtime> the pipeline must be valid JSON only
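Since the pipeline cannot call JavaScript on the server, one workaround is to convert the stored strings once from the shell (client-side JavaScript), so that $sum then operates on real numbers. A sketch, assuming a hypothetical collection named items:

    // $type 2 = string; this loop runs in the shell, not inside the pipeline
    db.items.find({ price: { $type: 2 } }).forEach(function (doc) {
      db.items.update({ _id: doc._id },
                      { $set: { price: parseFloat(doc.price) } });
    });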
[02:40:47] <crodjer> kurushiyama: Thanks. We realized that hashed indexes don't support uniqueness, so they don't fit our case anyway.
[02:48:04] <jiffe> if I have a key on a field and want to select N documents in order, assuming that collection is huge, what's the best way to do that?
[02:48:39] <jiffe> col.find().sort().limit() seems to take a very long time
[02:49:08] <jiffe> the field I'm sorting on is indexed
[02:50:52] <jiffe> nevermind seems to be coming back quick now, maybe something else was holding it up
[02:52:33] <Boomtime> @jiffe: possible, or perhaps there are multiple indexes to use and the planner is occasionally choosing poorly - use .explain(true) to see how many plans are available
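For reference, a sketch of what Boomtime suggests; the sort field here is a placeholder:

    db.col.find({}).sort({ ts: 1 }).limit(20).explain(true)
    // queryPlanner.winningPlan shows the index actually used,
    // rejectedPlans shows what else the planner considered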
[04:47:27] <jiffe> Boomtime: I only have 1 index on this collection
[04:47:45] <jiffe> I haven't had it react slowly since so not sure why it was doing that earlier
[04:47:59] <jiffe> I'm running into another issue with the perl mongodb module
[04:49:35] <jiffe> I'm storing binary gzipped data and with an older version of the perl module it worked fine but it doesn't seem to anymore with the newer version
[07:50:34] <kurushiyama> jiffe: Another question is whether the index matches the sort order.
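To check that point in the shell (index and field names are placeholders): a single-field index can be walked in either direction, but for a compound sort the directions have to line up with the index definition.

    db.col.getIndexes()                       // see what already exists
    db.col.createIndex({ ts: 1 })             // serves sort({ ts: 1 }) and sort({ ts: -1 })
    db.col.createIndex({ user: 1, ts: -1 })   // serves sort({ user: 1, ts: -1 }), but not sort({ user: 1, ts: 1 })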
[14:42:08] <noman> i mean seriously i'm so hyped with mongodb, i just want to talk to the community
[14:42:19] <noman> it fits perfectly with my multiplayer game project
[16:50:53] <jiffe> alright so it's happening again now, these queries have generally returned right away but this one has taken several minutes and is still going
[16:52:39] <Zelest> also, how does "$gte" behave on strings :o
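For what it's worth, $gte on strings compares lexicographically (byte order of the UTF-8 strings in 3.0), which a throwaway test collection shows quickly; the values here are made up:

    db.lextest.insert([{ v: "apple" }, { v: "banana" }, { v: "10" }, { v: "9" }])
    db.lextest.find({ v: { $gte: "b" } })   // matches only "banana"
    // note "10" < "9" lexicographically, so numeric-looking strings do not compare numerically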
[16:53:43] <jiffe> it seemed to work fine for the first 400000 documents I've pulled so far
[16:54:12] <jiffe> I wouldn't think doing a string comparison would be much different from a numeric comparison
[17:01:35] <jiffe> here's an explain on the query http://nsab.us/public/mongodb
[17:12:08] <jiffe> if I connect directly to maildump4, which is one of the mongods, and run the query it comes back right away; through mongos it just sits there though
[17:17:41] <jiffe> the versions differ: the mongods are on 3.0.8 and 3.0.9 and the new mongos instances are 3.0.12, not sure how much of a difference that makes
[17:19:29] <jiffe> I might just run through and upgrade everything
[17:27:30] <jiffe> using the mongos on maildump1 (v3.0.9) does the same thing though, so I'm guessing that's not it
[17:34:55] <jokke> i need some help with beamer's two-screen functionality
[17:35:29] <jokke> i've included the pdfpages package and set the beamer option "show notes on second screen"
[17:35:41] <jokke> still, the notes end up on a separate pdf page
[17:38:09] <jiffe> so how does one troubleshoot a query working on the individual replica set members but not through mongos ?
[18:36:23] <kurushiyama> jiffe: Uhm, that is quite a long identifier, and we are talking about a lexicographic comparison. Compared to an int or int64 comparison, that is much more complex. I might be wrong, but iirc the average time for a string comparison is O((n-1)/2), with a very long n in this case, multiplied by the number of documents.
[18:39:45] <kurushiyama> jiffe: And, as far as I can see, the query does not include your shard key, resulting in a behaviour known as "scatter/gather".
[18:41:23] <jiffe> kurushiyama: message_identifier is the shard key
[18:42:06] <kurushiyama> Ah, sorry, my fault: it is there, but only as a $gte range. So the scatter/gather problem remains.
[18:42:33] <jiffe> and for the most part you shouldn't have to scan the whole string to do the comparison, and if I go to the individual shards this query returns right away; mongos seems to be the problem
[18:43:31] <kurushiyama> jiffe: Here is what happens: since the target shard can not be reliably determined because of the $gte, your query is sent to _all_ shards which aren't ruled out by the individual shards' key ranges.
[18:43:58] <kurushiyama> Then, the query is executed on those shards, which report their results back to mongos.
[18:44:24] <kurushiyama> Only then are those results sorted.
[18:45:01] <jiffe> yeah, so since I'm limiting to 20 documents, each shard would send back a max of 20 documents and mongos would have to sort through 40 documents to return the results
[18:45:04] <kurushiyama> So, we have a string comparison which is less than ideal, plus a scatter/gather.
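A way to see both effects from the shell: an equality match on the full shard key can be routed to a single shard, while a $gte range is broadcast to every shard whose chunk ranges might contain matches, and the shards section of the explain output lists each shard consulted. Collection and field names are taken from the log; the identifier value is a placeholder.

    // targeted: exact shard key value, routed to one shard
    db.message_data.find({ message_identifier: "some-id" }).explain()

    // scatter/gather: range on the shard key, potentially sent to several shards
    db.message_data.find({ message_identifier: { $gte: "some-id" } })
                   .sort({ message_identifier: 1 }).limit(20).explain(true)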
[18:47:11] <jiffe> I was able to go through the first 400000 documents before it started stalling
[18:47:34] <kurushiyama> 400k docs are not a large result set.
[18:48:08] <jiffe> no, I've been running this query several times changing the value of message_identifier
[18:48:18] <Zelest> anyone happen to know if mongodb-php-library has any support for GridFS yet? :o
[18:48:25] <jiffe> they've all come back quickly until this specific identifier
[18:49:05] <kurushiyama> jiffe: Well, let us put it differently, then. You assume a bug in mongos? Or just arbitrary stalling?
[18:49:26] <Zelest> repairdb? drop and re-create index? *shrugs*
[18:49:31] <kurushiyama> Zelest: Derick actually should know. And I am pretty sure it is documented.
[18:49:35] <jiffe> well that's what I'm trying to figure out, why it stopped working all of a sudden
[18:50:04] <jiffe> what you're saying makes sense, although it doesn't make sense not to limit on the shards, because you'll never need more than limit() documents from any given shard
[18:50:30] <kurushiyama> jiffe: From my experience, 99.999% of the cases where somebody claims "the tool is failing", it is caused by misuse based on wrong assumptions.
[18:50:49] <kurushiyama> jiffe: Let's take that for granted.
[18:51:01] <Zelest> kurushiyama, yeah, did a quick search through the docs without much result.. and google is tricky, as it basically only shows results for the legacy driver :P
[18:51:04] <kurushiyama> jiffe: Let us say you have to sort only 40 docs.
[18:52:06] <Zelest> nvm, looks like GridFS exists :D
[18:52:17] <kurushiyama> jiffe: Still, you have a string comparison with O((n-1)/2) over "hundreds of millions of docs", and the mongos can not return a result until all shards which got this query are finished.
[18:52:57] <jiffe> kurushiyama: well that should be fine because for all message_identifiers I've tried, if I go to mongod directly it always comes back right away
[18:53:39] <kurushiyama> jiffe: I'd have a close look at the key ranges.
[18:54:52] <jiffe> now that you mention it, looking at the op taking place on the shard via db.currentOp() through mongos I see the query and orderby but nothing about limit
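For reference, a sketch of filtering db.currentOp() down to the operation in question; the namespace here is a placeholder, since the database name never appears in the log:

    db.currentOp({ op: "query", ns: "maildump.message_data" })
    // each matching entry shows the query being run and secs_running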
[18:55:17] <jiffe> so I'm not sure why my initial queries came back without a problem
[18:55:29] <jiffe> maybe I'll rerun those and see what its doing
[18:55:36] <kurushiyama> As said: I'd have a close look at the key ranges.
[18:55:50] <jiffe> regardless I don't see any reason why you shouldn't be able to push the limit to the shards also
[18:56:26] <kurushiyama> I know there was a good reason for it. I just can not remember...
[18:56:59] <kurushiyama> cheeser: Can you remember?
[18:58:50] <kurushiyama> jiffe: Nevertheless, this is not going to help you. Looking at this key, it looks like a composite, delimited by "."
[18:59:26] <kurushiyama> jiffe: I assume you are looking for messages newer than a given date?
[18:59:57] <jiffe> no I'm just trying to go through all messages in batches
[19:08:23] <Zelest> "Support for GridFS is forthcoming." :(
[19:10:01] <jiffe> well the messages are mail messages, the headers have some spam markings I need to pull out and I'm also running the full content through a spam classifier
[19:10:46] <kurushiyama> jiffe: The whole message is in a single field?
[19:11:05] <kurushiyama> jiffe: I need to see one, remove the message payload, if you wish.
[19:11:11] <jiffe> no, it's broken up across several documents because our max message size is 64MB
[19:12:28] <jiffe> yeah, we have 2 mail backups: one is a snapshot, and this is a message dump, so all email that comes in and out of our system goes here with no plans of deletion
[19:12:31] <kurushiyama> jiffe: ??IWK-????1, to be precise
[19:13:39] <kurushiyama> ok, so those message parts are related by said message identifier, I guess.
[19:14:09] <jiffe> yeah, there's the message_identifier and chunk_id to keep it ordered
[19:16:10] <jiffe> we do have some metadata in a separate db for searching but not what I need so I'm just going through the docs in mongodb
[19:16:42] <kurushiyama> well, you basically reimplemented GridFS and its metadata. ;)
[19:16:59] <kurushiyama> Ok, so the stats can be derived directly from the messages' headers?
[19:17:25] <jiffe> yeah, the headers are gzipped up too
[19:17:46] <kurushiyama> You know my reaction, I guess ;)
[19:18:30] <jiffe> yeah, well it works for our purposes, we use elasticsearch to search for the messages and get identifiers and then pull content from mongodb by identifier
[19:19:31] <jiffe> we originally had this setup differently using mongodb for search but there were some bugs at the time that caused problems which is when we threw in the elasticsearch layer
[19:22:15] <jiffe> I just tried db.message_data.find().skip(1000000).limit(20) and that seems to take quite a while also
[19:22:32] <kurushiyama> jiffe: I do not see any other option than to work it down this way.
[19:22:37] <jiffe> it might be better to bypass mongos and just query the shards directly and merge things myself
[23:06:24] <kurushiyama> jiffe: I doubt that, and I doubt that it is a good idea.
[23:12:38] <kurushiyama> jiffe: Maybe you should try larger batches.
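On the batching point, a sketch of the range-based walk jiffe is already doing, which avoids skip() (skip still has to walk past every skipped document); the batch size is arbitrary and the resume logic assumes all chunks for the previous identifier were already handled:

    var last = "";                         // resume point from the previous batch
    var batch = db.message_data.find({ message_identifier: { $gt: last } })
                               .sort({ message_identifier: 1 })
                               .limit(1000)   // larger batches mean fewer round trips through mongos
                               .toArray();
    if (batch.length > 0) {
        last = batch[batch.length - 1].message_identifier;
    }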
[23:16:22] <ule> I have the order_item_id from this: http://s33.postimg.org/mobyvat27/Screenshot_from_2016_05_28_19_05_41.png
[23:18:59] <jiffe> kurushiyama: why don't you think it is a good idea?
[23:20:25] <StephenLynx> because when you use sharding, the shards are designed not to be queried directly, I think.
[23:22:54] <jiffe> well I'd be doing what mongos does but in a brute force kind of way
[23:23:17] <jiffe> I suppose if the balancer ran things could move around between queries or something
[23:58:08] <kurushiyama> jiffe: I have my strongest doubts that a makeshift solution will beat a system specifically designed for the same purpose, maintained over years, bugtracked and what not.