#mongodb logs for Saturday the 28th of May, 2016

[00:42:51] <pyCasso> how would i write a function that adds all quantities based on an item number?
[00:43:57] <pyCasso> db.orders.aggregate([{"$project" : {"_id" : "$item_number", "fulltotal" : {"$sum" : "$price"} } }]);
[00:46:25] <Boomtime> @pyCasso: $group not $project
[00:47:15] <Boomtime> also, if the collection is large that will take a long time since it requires reading every single document
[00:47:27] <pyCasso> I tried that and the fulltotal value returns 0
[00:47:43] <pyCasso> really? oh didn't know that
[00:49:28] <pyCasso> how would you write the query? I think to first query all items with a specific item number, then get the sum of those items' quantities to get the total items sold
[00:51:52] <pyCasso> i'm new to using mongodb so would like to know how to maximize performance. i typically just get a list of objects and use python to filter down the results
[01:03:01] <StephenLynx> pyCasso, you have projection
[01:03:06] <StephenLynx> and aggregation
[01:03:17] <StephenLynx> projection is just a basic binary filter
[01:03:26] <StephenLynx> a field is either placed on the result or not.
[01:03:45] <StephenLynx> aggregation allows for more complex operations, like grouping data, unwinding arrays and more
[01:04:00] <StephenLynx> but its usually not as fast as regular finds.
[01:04:30] <StephenLynx> let me see what you need
[01:04:50] <StephenLynx> you want to take fieldA and sum it with fieldB?
[01:05:02] <StephenLynx> and just read this value?
[01:05:24] <StephenLynx> yeah, use aggregation and use $group to do that.
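(A minimal sketch of the $group pipeline being described here, assuming the "orders" collection and the "item_number" and "price" fields from pyCasso's snippet plus a hypothetical "quantity" field; it only produces real totals once those fields hold numbers, which comes up later in the log.)

    db.orders.aggregate([
        { "$group" : {
            "_id" : "$item_number",                       // one result document per item number
            "total_quantity" : { "$sum" : "$quantity" },  // total units sold for that item
            "fulltotal" : { "$sum" : "$price" }           // total price for that item
        } }
    ]);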
[01:06:06] <Boomtime> @pyCasso: can you put a sample document in a gist/pastebin so we can see what you're starting with?
[01:08:31] <pyCasso> Boomtime http://pastebin.com/W1HAjiTR
[01:09:08] <pyCasso> the objects at top are from my mongo collection and the function i currently have is at the bottom
[01:10:36] <Boomtime> why the $unwind?
[01:10:41] <Boomtime> the field is scalar
[01:16:01] <pyCasso> i honestly don't know, i was just playing around. i'm new
[01:16:24] <Boomtime> ok, well, let's start at the top...
[01:16:34] <Boomtime> every field in your documents is a string
[01:16:48] <Boomtime> (except _id)
[01:17:10] <Boomtime> if you want to use math operators like $sum then you need numeric field values
[01:17:26] <Boomtime> ie. "price" : "60.00" <- this is a string
[01:17:31] <Boomtime> "price" : 60.00 <- this is a number
[01:18:06] <Boomtime> get rid of the $unwind, it's superfluous at best - change the field values to what they actually are and it should work
[01:24:06] <pyCasso> could I wrap the price string with Number("$price") ?
[01:26:50] <pyCasso> Number("$price"), parseInt() and parseFloat() all return NaN, Boomtime.
[01:35:43] <Boomtime> are you learning javascript as well?
[01:36:04] <Boomtime> oh, you're passing that to mongodb?
[01:36:56] <Boomtime> remember, mongodb itself generally does not run javascript (there are very specific exceptions), the pipeline you have could be equally expressed in C#, Java, python, etc
[01:37:15] <Boomtime> the pipeline must be valid JSON only
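(A hedged sketch of the fix Boomtime describes. Number("$price") fails because it is evaluated as client-side JavaScript on the literal string "$price" before the pipeline is ever sent to the server, hence the NaN; the data itself has to be stored as numbers. The one-off shell migration below is illustrative only, reusing the "orders" collection and field names from earlier.)

    // one-off migration: rewrite the string fields as real numbers
    db.orders.find().forEach(function (doc) {
        db.orders.update(
            { "_id" : doc._id },
            { "$set" : {
                "price" : parseFloat(doc.price),          // "60.00" (string) -> 60.00 (number)
                "quantity" : parseInt(doc.quantity, 10)
            } }
        );
    });

After that, the $group pipeline sketched earlier should return non-zero totals without any $unwind.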
[02:40:47] <crodjer> kurushiyama: Thanks. We realized that hashed indexes don't support uniqueness, so they don't fit our case anyway.
[02:48:04] <jiffe> if I have a key on a field and want to select N documents in order, assuming that collection is huge, whats the best way to do that?
[02:48:39] <jiffe> col.find().sort().limit() seems to take a very long time
[02:49:08] <jiffe> the field I'm sorting on is indexed
[02:50:52] <jiffe> nevermind seems to be coming back quick now, maybe something else was holding it up
[02:52:33] <Boomtime> @jiffe: possible, or perhaps there are multiple indexes to use and the planner is occasionally choosing poorly - use .explain(true) to see how many plans are available
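(A small sketch of the .explain(true) check Boomtime suggests; "col" and "some_field" are placeholders for jiffe's actual collection and indexed field.)

    db.col.find({ "some_field" : { "$gte" : "some_value" } })
          .sort({ "some_field" : 1 })
          .limit(20)
          .explain(true)   // verbose output: shows the winning plan and any rejected candidate plans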
[04:47:27] <jiffe> Boomtime: I only have 1 index on this collection
[04:47:45] <jiffe> I haven't had it react slowly since so not sure why it was doing that earlier
[04:47:59] <jiffe> I'm running into another issue with the perl mongodb module
[04:49:35] <jiffe> I'm storing binary gzipped data and with an older version of the perl module it worked fine but it doesn't seem to anymore with the newer version
[07:50:34] <kurushiyama> jiffe: Another question is whether the index is in sorting order.
[14:39:32] <noman> Hi
[14:40:01] <noman> i just came here to ask why mongodb rocks! :D
[14:41:15] <Zelest> :D
[14:42:08] <noman> i mean seriously i'm so hyped with mongodb i just want to talk to the community
[14:42:19] <noman> it fits perfectly with my multiplayer game project
[16:50:53] <jiffe> alright so it's happening again now, these queries have generally returned right away but now this query has taken several minutes and is still going
[16:51:02] <jiffe> db.message_data.find({ message_identifier: { '$gte': '004621cc068c8f24dfeab79b3d11e912.12019af3f8bffba0aedc7aaefefc7509fcfa4399.1417782247' } }).sort({ message_identifier: 1 }).limit(20)
[16:52:25] <Zelest> silly question perhaps, but I assume you have an index on message_identifier?
[16:52:35] <jiffe> I do
[16:52:39] <Zelest> also, how does "$gte" behave on strings :o
[16:53:43] <jiffe> it seemed to work fine for the first 400000 documents I've pulled so far
[16:54:12] <jiffe> I wouldn't think doing a string comparison would be much different from a numeric comparison
[17:01:35] <jiffe> here's an explain on the query http://nsab.us/public/mongodb
[17:12:08] <jiffe> if I connect directly to maildump4, which is one of the mongod's, and run the query it comes back right away; through mongos it just sits there though
[17:17:41] <jiffe> there are different versions, it looks like 3.0.8 and 3.0.9 between the mongod's and the new mongos's are 3.0.12, not sure how much of a difference that makes
[17:19:29] <jiffe> I might just run through and upgrade everything
[17:27:30] <jiffe> using mongos on maildump1 v3.0.9 does the same thing though so I'm guessing thats not it
[17:34:55] <jokke> i need some help with beamers two-screen-functionality
[17:35:29] <jokke> i've included the pdfpages package and set the beameroption show notes on second screen
[17:35:41] <jokke> still notes are on a separate pdf page
[17:36:19] <jokke> oh
[17:36:21] <jokke> lol
[17:36:28] <jokke> totally wrong channel :D
[17:38:09] <jiffe> so how does one troubleshoot a query working on the individual replica set members but not through mongos ?
[18:36:23] <kurushiyama> jiffe: Uhm, that is quite a long identifier, and we are talking of a lexicographic comparison. Compared to an int or int64 comparison, that is much more complex. I might be wrong, but iirc the average time for string comparison is O((n-1)/2), with a very long n in this case, multiplied by the number of documents.
[18:39:45] <kurushiyama> jiffe: And, as far as I can see, the query does not include your shard key, resulting in a behaviour known as "scatter/gather".
[18:41:23] <jiffe> kurushiyama: message_identifier is the shard key
[18:41:28] <kurushiyama> o.O
[18:42:06] <kurushiyama> Ah, sorry, my fault, the $gte. But the scatter/gather problem remains.
[18:42:33] <jiffe> and for the most part you shouldn't have to scan the whole string to do the comparison, and if I go to the individual shards this query returns right away, mongos seems to be the problem
[18:43:31] <kurushiyama> jiffe: Here is what happens: since the target shard can not be reliably determined for the $gte, your query is sent to _all_ shards which aren't ruled out by the individual shards' key ranges.
[18:43:55] <jiffe> well there are only two shards
[18:43:58] <kurushiyama> Then, the query is executed on those shards, which report their results back to mongos.
[18:44:24] <kurushiyama> Only then are those results sorted.
[18:45:01] <jiffe> yeah, so since I'm limiting to 20 documents, each shard would send back a max of 20 documents and mongos would have to sort through 40 documents to return the results
[18:45:04] <kurushiyama> So, we have a string comparison which is less than ideal, plus a scatter/gather.
[18:45:09] <kurushiyama> jiffe: No
[18:45:16] <kurushiyama> The limit would be done on the mongos.
[18:45:27] <jiffe> that makes no sense
[18:45:35] <kurushiyama> jiffe: It is that way.
[18:45:48] <jiffe> the query would return hundreds of millions of results without the limit
[18:46:07] <kurushiyama> No, since the limit is then applied on mongos.
[18:46:26] <jiffe> yeah, but the shards would return hundreds of millions of results to mongos
[18:46:36] <kurushiyama> Aye. guess why it takes minutes.
[18:46:43] <jiffe> it wasn't though
[18:47:11] <jiffe> I was able to go through the first 400000 documents before it started stalling
[18:47:34] <kurushiyama> 400k docs are not a large result set.
[18:48:08] <jiffe> no, I've been running this query several times changing the value of message_identifier
[18:48:18] <Zelest> anyone happen to know if mongodb-php-library has any support for GridFS yet? :o
[18:48:25] <jiffe> they've all come back quickly until this specific identifier
[18:49:05] <kurushiyama> jiffe: Well, let us put it differently, then. You assume a bug in mongos? Or just arbitrary stalling?
[18:49:26] <Zelest> repairdb? drop and re-create index? *shrugs*
[18:49:31] <kurushiyama> Zelest: Derick actually should know. And I am pretty sure it is documented.
[18:49:35] <jiffe> well that's what I'm trying to figure out, why it stopped working all of a sudden
[18:50:04] <jiffe> what you're saying makes sense, although it doesn't make sense to not limit on the shards because you'll never need more than limit() documents from any given shard
[18:50:30] <kurushiyama> jiffe: From my experience, 99.999% of the cases where somebody claims "the tool is failing", it is caused by misuse based on wrong assumptions.
[18:50:49] <kurushiyama> jiffe: Lets take that for granted.
[18:51:01] <Zelest> kurushiyama, yeah, did a quick search through the docs without much result.. and google is tricky, as it basically only shows result for the legacy driver :P
[18:51:04] <kurushiyama> jiffe: Let us say you have to sort only 40 docs.
[18:52:06] <Zelest> nvm, looks like GridFS exists :D
[18:52:17] <kurushiyama> jiffe: Still, you have a string comparison with O((n-1)/2) over "hundreds of millions of docs", and the mongos can not return a result until all shards which got this query are finished.
[18:52:22] <Zelest> https://github.com/mongodb/mongo-php-library/tree/master/src/GridFS :)
[18:52:57] <jiffe> kurushiyama: well that should be fine because for all message_identifiers I've tried, if I go to mongod directly it always comes back right away
[18:53:01] <kurushiyama> jiffe: with an n =85
[18:53:39] <kurushiyama> jiffe: I'd have a close look at the key ranges.
[18:54:52] <jiffe> now that you mention it, looking at the op taking place on the shard via db.currentOp() through mongos I see the query and orderby but nothing about limit
[18:55:17] <jiffe> so I'm not sure why my initial queries came back without a problem
[18:55:29] <jiffe> maybe I'll rerun those and see what its doing
[18:55:36] <kurushiyama> As said: I'd have a close look at the key ranges.
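(A hedged sketch of how the key ranges kurushiyama mentions could be inspected from a mongos; the database name "maildump" is an assumption.)

    sh.status()   // summary of shards, databases and chunk distribution

    // or list the chunk boundaries for the collection directly:
    db.getSiblingDB("config").chunks.find(
        { "ns" : "maildump.message_data" },
        { "min" : 1, "max" : 1, "shard" : 1 }
    ).sort({ "min" : 1 })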
[18:55:50] <jiffe> regardless I don't see any reason why you shouldn't be able to push the limit to the shards also
[18:56:26] <kurushiyama> I know there was a good reason for it. I just can not remember...
[18:56:59] <kurushiyama> cheeser: Can you remember?
[18:58:50] <kurushiyama> jiffe: Nevertheless, this is not going to help you. Looking at this key, it looks like a composite, delimited by "."
[18:59:26] <kurushiyama> jiffe: I assume you are looking for messages newer than a given date?
[18:59:57] <jiffe> no I'm just trying to go through all messages in batches
[19:00:22] <kurushiyama> jiffe: So why don't you?
[19:00:57] <jiffe> is there a better way? doing a search this seemed like the most reasonable restartable approach
[19:01:16] <kurushiyama> db.foo.find().skip().limit()
[19:01:32] <kurushiyama> maybe a sort
[19:01:39] <jiffe> skip won't require scanning through all the messages?
[19:01:44] <kurushiyama> no string comparison
[19:01:56] <kurushiyama> worth a try, imho
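(A minimal sketch of the skip/limit batching kurushiyama suggests, reusing the "message_data" collection from the earlier query; the batch size is arbitrary and a .sort() could be added for a stable order. Note that skip() still has to walk past the skipped documents, which fits the slowness jiffe reports further down.)

    var batchSize = 1000;
    for (var offset = 0; ; offset += batchSize) {
        var batch = db.message_data.find()
                      .skip(offset)
                      .limit(batchSize)
                      .toArray();
        if (batch.length === 0) break;   // no documents left
        // ... process this batch ...
    }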
[19:02:23] <kurushiyama> what do you do with those messages? updates?
[19:02:48] <jiffe> no, building statistical information
[19:03:04] <jiffe> so just reads
[19:03:05] <kurushiyama> jiffe: How about using the aggregation framework for it.
[19:03:40] <kurushiyama> jiffe: Building statistical info is sorta what it was invented for.
[19:03:54] <jiffe> well I have to parse out a bunch of information from the message
[19:03:59] <kurushiyama> ?
[19:04:18] <kurushiyama> Describe _precisely_ what you need to parse
[19:04:44] <kurushiyama> A sample doc wont hurt, either.
[19:05:53] <kurushiyama> And the desired result
[19:08:13] <Zelest> bah
[19:08:23] <Zelest> "Support for GridFS is forthcoming." :(
[19:10:01] <jiffe> well the messages are mail messages, the headers have some spam markings I need to pull out and I'm also running the full content through a spam classifier
[19:10:46] <kurushiyama> jiffe: The whole message is in a single field?
[19:11:05] <kurushiyama> jiffe: I need to see one, remove the message payload, if you wish.
[19:11:11] <jiffe> no it's broken up between several documents because our max message size is 64MB
[19:11:22] <kurushiyama> ew
[19:11:25] <jiffe> its also gzipped up
[19:11:32] <kurushiyama> gzipped ew
[19:12:28] <jiffe> yeah we have 2 mail backups, one is a snapshot, this is a message dump so all email that comes in and out of our system goes here with no plans of deletion
[19:12:31] <kurushiyama> jiffe: ??IWK-????1, to be precise
[19:13:39] <kurushiyama> ok, so those message parts are related by said message identifier, I guess.
[19:14:09] <jiffe> yeah, there's the message_identifier and chunk_id to keep it ordered
[19:14:32] <kurushiyama> ew
[19:14:37] <kurushiyama> ;)
[19:15:02] <jiffe> heh, it was the way it had to be done to work with size limitations
[19:15:16] <kurushiyama> gridfs and metadata
[19:16:10] <jiffe> we do have some metadata in a separate db for searching but not what I need so I'm just going through the docs in mongodb
[19:16:42] <kurushiyama> well, you basically reimplemented gridfs and metadata. ;)
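(For comparison: GridFS stores a file as one fs.files metadata document plus ordered fs.chunks documents, which is essentially the message_identifier + chunk_id scheme described above; the filename used below is illustrative only.)

    var fileDoc = db.fs.files.findOne({ "filename" : "some-message-identifier" })   // metadata document
    db.fs.chunks.find({ "files_id" : fileDoc._id }).sort({ "n" : 1 })               // ordered content chunks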
[19:16:59] <kurushiyama> Ok, the stats can be directly derived from the messages headers?
[19:17:25] <jiffe> yeah, the headers are gzipped up too
[19:17:46] <kurushiyama> You know my reaction, I guess ;)
[19:18:30] <jiffe> yeah, well it works for our purposes, we use elasticsearch to search for the messages and get identifiers and then pull content from mongodb by identifier
[19:19:31] <jiffe> we originally had this setup differently using mongodb for search but there were some bugs at the time that caused problems which is when we threw in the elasticsearch layer
[19:22:15] <jiffe> I just tried db.message_data.find().skip(1000000).limit(20) and that seems to take quite a while also
[19:22:32] <kurushiyama> jiffe: I do not see any other option than to work it down this way.
[19:22:37] <jiffe> it might be better to bypass mongos and just query the shards directly and merge things myself
[23:06:24] <kurushiyama> jiffe: I doubt that, and I doubt that it is a good idea.
[23:12:38] <kurushiyama> jiffe: Maybe you should try larger batches.
[23:15:29] <ule> hi there
[23:15:59] <ule> I need to build a query to find a specific string inside an array from my collection object
[23:16:08] <ule> can anybody pls help?
[23:16:22] <ule> I have the order_item_id from this: http://s33.postimg.org/mobyvat27/Screenshot_from_2016_05_28_19_05_41.png
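(A hedged sketch for ule's question; the collection and field names ("orders", "order_items", "order_item_id") are assumptions, since the screenshot is not reproduced in the log. A query value is matched against array elements directly; $elemMatch is for arrays of sub-documents.)

    // array of plain values: an equality match tests each element
    db.orders.find({ "order_items" : "some_order_item_id" })

    // array of sub-documents: match on a field of the elements
    db.orders.find({ "order_items" : { "$elemMatch" : { "order_item_id" : "some_order_item_id" } } })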
[23:18:59] <jiffe> kurushiyama: why don't you think it is a good idea?
[23:20:25] <StephenLynx> because when you use sharding, the individual shards are designed to not be queried directly, I think.
[23:22:54] <jiffe> well I'd be doing what mongos does but in a brute force kind of way
[23:23:17] <jiffe> I suppose if the balancer ran things could move around between queries or something
[23:58:08] <kurushiyama> jiffe: I have my strongest doubts that a makeshift solution will beat a system specifically designed for the same purpose, maintained over years, bugtracked and what not.