#mongodb logs for Monday the 2nd of May, 2016

[01:09:42] <csd_> Why is it that Mongo is able to easily store large binary blobs but that relational DBs (AFAIAW) can't?
[07:27:16] <hemangpatel> Hello, why is there a port range restriction in pymongo? https://github.com/mongodb/mongo-python-driver/blob/master/pymongo/uri_parser.py#L143
[07:31:10] <joannac> hemangpatel: that's a system port restriction
[07:33:09] <joannac> hemangpatel: port numbers are 16 bits, which means 0-65535 inclusive
[07:33:11] <hemangpatel> joannac, My mongodb running on 10594 port
[07:33:28] <joannac> hemangpatel: ...yes, that's less than 65535
[07:33:55] <hemangpatel> joannac, oh sorry
[10:45:05] <jokke> hi
[10:45:05] <jokke> nevermind
[10:46:24] <jokke> actually i do have a question
[10:47:14] <jokke> if i use an embedded doc as _id, can i influence the sort order as i could do when manually defining an additional index with createIndex
[10:47:45] <jokke> seems redundant to add an index that contains just the members of the _id document
[10:47:59] <jokke> and defines sorting order
[11:05:03] <kurushiyama> jokke: Since _id is indexed by default, you can do sth like db.coll.find(query).sort({"_id.field":1})
[11:49:23] <jokke> kurushiyama: my queries are very slow... i benchmarked different schemas for the same query (fetch the last sample). And even for the flat schema (sample per document) i only get to around 14 queries per second.
[11:49:50] <jokke> It's significantly worse for the other schemas 3.5 - 8 queries per second
[11:51:09] <jokke> the worst being my flat_panel schema which stores samples of the same group in one document for every time interval
[11:51:42] <jokke> this is the query (aggregation): https://p.jreinert.com/t23U20/ruby
[11:53:11] <jokke> this is how the documents are structured in the flat_panel schema: https://p.jreinert.com/uzeJ/javascript
[11:53:16] <kurushiyama> jokke: Well, benchmarking is one part. What about the corresponding indices? That aggregation is slow by definition: each document gets touched, and unwound
[11:53:27] <jokke> oh
[11:53:30] <jokke> wrong paste
[11:53:45] <jokke> https://p.jreinert.com/OXpxF/javascript
[11:53:48] <jokke> there
[11:53:49] <kurushiyama> The first one ;)
[11:54:25] <kurushiyama> Guess I told you earlier ;)
[11:54:30] <jokke> well not every one
[11:54:56] <jokke> ah i can speed it up
[11:55:00] <kurushiyama> Yes
[11:55:04] <kurushiyama> multiple ways
[11:55:17] <jokke> i can already sort and limit by the top level timestamp
[11:55:30] <kurushiyama> an early match with an elemMatch reduces the number of docs processed
[11:56:07] <kurushiyama> That would be my first.
[11:56:24] <kurushiyama> And I honestly doubt a flat model is slower
[11:56:44] <jokke> it's way slower with inserts
[11:56:49] <jokke> and the index is huge
[11:57:29] <kurushiyama> Wait a sec. You tell me that a doc insert is slower than a doc update?
[11:57:34] <jokke> yeah
[11:57:50] <kurushiyama> With all due respect: BS.
[11:57:56] <jokke> haha
[11:58:10] <jokke> i believe in data
[11:58:31] <kurushiyama> jokke: I do not believe _your_ data. Show me the test setup.
[11:58:45] <jokke> https://p.jreinert.com/j9m/ruby
[11:59:28] <jokke> what do you mean with elemMatch btw?
[11:59:42] <jokke> i can't really match unfortunately
[11:59:52] <jokke> oh
[11:59:54] <jokke> sure can
[12:00:26] <kurushiyama> In order to limit the number of docs processed, you want an early match. In your case, documents in which one of the array elements matches a certain condition
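
A sketch of the early match kurushiyama is suggesting, in shell syntax. The collection and field names are assumptions (jokke's real schema is only in the linked paste); the point is that filtering before the $unwind lets most documents be discarded without being unwound:

    // hypothetical flat_panel documents: { _id: {...}, values: [ { ts: ..., value: ... }, ... ] }
    var startTs = ISODate("2016-05-02T00:00:00Z"), endTs = ISODate("2016-05-03T00:00:00Z")

    db.flat_panel.aggregate([
      // early $match: only documents with at least one matching array element go further
      { $match: { values: { $elemMatch: { ts: { $gte: startTs, $lt: endTs } } } } },
      { $unwind: "$values" },
      { $match: { "values.ts": { $gte: startTs, $lt: endTs } } },
      { $sort: { "values.ts": -1 } },
      { $limit: 1 }   // "fetch the last sample"
    ])
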
[12:00:35] <jokke> yeah
[12:03:33] <jokke> hm i can't seem to be able to speed it up
[12:03:51] <jokke> trying it like this atm: https://p.jreinert.com/j6Pq/ruby
[12:08:17] <kurushiyama> me no ruby. Have to validate sth, first
[12:09:44] <jokke> kurushiyama: regarding inserts being slower than a doc update: wt is super fast with $push, also updating a flat_panel doc won't trigger an index update afaik.
[12:09:58] <jokke> nothing in the index changes
[12:10:30] <kurushiyama> jokke: I am verifying this, but tbh, I guess we are talking orders of magnitude by which an update is slower for a lot of operations.
[12:13:59] <kurushiyama> jokke: Note the plural
[12:21:40] <kurushiyama> jokke: To give you an impression: inserting 100k docs took 56.84 secs. For updates, I have smoked a cigarette, and we are not halfway through.
[12:22:30] <kurushiyama> jokke: And those were single-threaded inserts vs single-threaded updates. With multiple threads, this ratio is going to be worse for updates due to document level locking.
[12:23:54] <jokke> you can try to convince me that inserts are faster as much as you want :D
[12:24:12] <jokke> i've run my tests, i've verified the data
[12:24:29] <kurushiyama> jokke: Ok, I am out then. You simply seem to know better, regardless of the facts and better knowledge. Good luck.
[12:24:37] <jokke> haha
[12:24:47] <jokke> what facts?
[12:25:05] <jokke> no need to take it personally dude
[12:25:26] <kurushiyama> Good luck: http://hastebin.com/uxuvidaruw.coffee
[12:26:00] <kurushiyama> jokke: I do not take it personally. But you are asking for help, ignoring my advice nevertheless. That's a waste of time for both of us.
[12:26:14] <jokke> i'm just a bit irritated because i asked about my aggregation and why it's slow. before i know it my whole schema approach is being questioned.
[12:26:37] <kurushiyama> jokke: Since modelling defines performance.
[12:26:42] <jokke> kurushiyama: i've repeatedly told you why i won't go with flat documents
[12:27:09] <kurushiyama> jokke: As said: good luck. I guess you should try the above code. Might be an eye opener.
[12:28:39] <StephenLynx> let me guess, his model goes like, 3 or 4 layers deep?
[12:30:53] <Derick> StephenLynx: https://p.jreinert.com/OXpxF/javascript
[12:31:54] <StephenLynx> yeah, me neither
[12:32:24] <Derick> StephenLynx: let me guess, flat model?
[12:32:32] <StephenLynx> not entirely, but flatter.
[12:32:41] <StephenLynx> i'd made 'values' a separate collection
[12:32:48] <Derick> yes
[12:33:01] <StephenLynx> hm, value don't have to be an object either
[12:33:04] <Derick> an unwind on values so to speak
[12:33:58] <StephenLynx> why is it putting r and i on an object instead of directly on the object that contains value?
[12:34:31] <StephenLynx> I wouldn't have made _id an object either
[12:34:49] <StephenLynx> hm, I guess I would end up with a flat model after all.
[12:34:53] <Derick> :-)
[12:34:56] <Derick> exactly!
[12:36:49] <jokke> that would result in an index closer to 1TB
[12:37:07] <jokke> also slower writes
[12:37:17] <jokke> i don't see any benefit
[12:37:24] <jokke> except for faster reads
[12:37:28] <Derick> slower writes?
[12:37:33] <jokke> yes
[12:37:33] <Derick> that sounds... wrong
[12:37:46] <Derick> as opposed to adding a new item to that values array?
[12:37:55] <jokke> yup
[12:38:00] <Derick> I don't believe it.
[12:38:02] <jokke> :D
[12:38:09] <jokke> i'm not making it up
[12:38:36] <jokke> https://p.jreinert.com/5n6So/
[12:38:48] <Derick> yeah, you are. It can't logically be faster, mostly because WiredTiger is append-only anyway. It doesn't do in-document updates; it always replaces the whole thing with an append
[12:39:13] <Derick> jokke: I can make up results too
[12:39:28] <jokke> why would i do that??
[12:39:59] <jokke> i don't know _why_ i get the results i do but that's what happens
[12:41:31] <jokke> maybe it has to do with sharding?
[12:41:39] <jokke> i really don't know
[12:42:17] <jokke> but the resulting documents are exactly like they should be in every schema
[12:45:00] <jokke> i'd be happy to be proven wrong, really
[12:45:10] <jokke> ahhh
[12:45:18] <jokke> Derick: i think i know now
[12:45:29] <jokke> i'm _not_ pushing every value
[12:45:31] <jokke> i buffer them
[12:45:39] <jokke> and update them in a bulk
[12:45:53] <jokke> that explains it
[12:45:59] <jokke> but that's how i get the data too
[12:46:07] <jokke> so it would make sense
[12:46:52] <jokke> i get packages of values grouped together. this group i use as top level document and store the individual values inside
[12:49:47] <jokke> so if a group contains 50 values i only need 1/50th of the write operations (update) as i would with a flat collection (insert)
[12:50:23] <jokke> i guess that's a rather important detail i forgot to mention...
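
What jokke describes, sketched with hypothetical names: the batch of samples that arrives together is appended with a single $push/$each update, so a group of 50 values costs one write instead of 50 inserts:

    // 'buffered' stands for the ~50 samples received as one group (the shape is a guess)
    var buffered = [ { ts: ISODate("2016-05-02T11:00:00Z"), r: 1, i: 2, value: 3.14 } /* , ... */ ]
    var panelId = "panel-1", intervalStart = ISODate("2016-05-02T11:00:00Z")

    db.flat_panel.updateOne(
      { _id: { panel_id: panelId, timestamp: intervalStart } },  // one doc per panel and interval
      { $push: { values: { $each: buffered } } },
      { upsert: true }
    )
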
[13:22:49] <jokke> StephenLynx: why wouldn't you make _id an object?
[13:23:18] <StephenLynx> KIS
[13:23:26] <jokke> hm
[13:23:48] <jokke> but then i'd end up with some weird string concatenation as id
[13:24:27] <jokke> and it'd be a lot less flexible
[13:24:37] <StephenLynx> no
[13:24:46] <StephenLynx> I would just keep it as an objectId
[13:24:58] <jokke> yeah but that's a useless index
[13:25:02] <StephenLynx> not really.
[13:25:17] <StephenLynx> you could reference it by the _id still
[13:25:20] <jokke> well apart from making sure it's unique
[13:25:26] <jokke> yeah
[13:25:34] <concernedcitizen> hi guys, I'm trying to make a $where query like: '$where': 'this.1-bedroom.length > 0'. But when I run this query, I get a SyntaxError: Unexpected Number.
[13:25:51] <jokke> concernedcitizen: $gt
[13:25:54] <concernedcitizen> I believe it might be something to do with using '1-bedroom'.
[13:25:57] <jokke> mp
[13:25:59] <jokke> no
[13:26:06] <jokke> ah
[13:26:21] <jokke> i've never seen > inside a query
[13:26:21] <concernedcitizen> 'this.studio.length > 0' works
[13:26:26] <jokke> ok
[13:26:45] <concernedcitizen> I think its some kind of javascript querying
[13:26:53] <jokke> ah ok
[13:26:57] <jokke> i'd suppose it's slow
[13:27:41] <jokke> does { 'this.1-bedroom.length': { $gt: 0 } } work?
[13:27:44] <concernedcitizen> yeah. I tried to do a {'1-bedroom': {'$size': {'$not': 0}}} but apparently $size can't have a range
[13:27:55] <concernedcitizen> hmm lemme try that
[13:28:07] <jokke> uh without this obviously
[13:28:28] <jokke> ah
[13:28:30] <jokke> aaah
[13:28:32] <jokke> ok
[13:28:46] <jokke> so no i think i finally get it. 1-bedroom is an array?
[13:28:50] <jokke> *now
[13:29:11] <concernedcitizen> nope it doesn't work, all results are 0
[13:29:19] <jokke> yeah size is tricky. I've bumped into that a few times...
[13:29:23] <kurushiyama> http://hastebin.com/iyaguxihey.coffee So much for updates being faster than inserts.
[13:29:23] <concernedcitizen> 1-bedroom is a key
[13:29:34] <jokke> kurushiyama: were you here before?
[13:29:36] <jokke> ah no
[13:29:37] <jokke> ok
[13:29:41] <jokke> never mind then
[13:29:45] <concernedcitizen> :(
[13:29:45] <cheeser> kurushiyama: who thought they would be?
[13:29:56] <kurushiyama> cheeser: Jokke did.
[13:30:08] <jokke> check the backlog
[13:30:18] <jokke> i figured out why
[13:31:17] <concernedcitizen> whoops i meant to say it should be 'this.1-bedroom.sizes.length' where sizes is an array in the 1-bedroom key
[13:31:47] <jokke> cheeser: { $not: { 1-bedroom: { $size: 0 } } } ?
[13:32:06] <jokke> oh
[13:32:17] <jokke> sorry concernedcitizen ^
[13:32:21] <jokke> wrong hl
[13:32:40] <concernedcitizen> nope doesn't work too.
[13:32:50] <jokke> { $not: { '1-bedroom.sizes': { $size: 0 } } } it should be in that case
[13:33:13] <concernedcitizen> unknown top level operator: $not
[13:33:16] <jokke> ah
[13:33:17] <jokke> ok
[13:33:33] <jokke> { '1-bedroom.sizes': { $not: { $size: 0 } } } ?
[13:33:52] <cheeser> why not just $gt 0?
[13:34:01] <jokke> cheeser: can't use gt with $size
[13:34:25] <Derick> and $not doesn't use an index
[13:34:26] <concernedcitizen> that works @jokke
[13:34:31] <cheeser> ah!
[13:34:43] <concernedcitizen> bedroom.sizes: > $not > $size > 0
[13:34:55] <jokke> concernedcitizen: great
[13:35:07] <concernedcitizen> but $not doesn't use an index?
[13:35:10] <jokke> Derick: woah, that's good to know
[13:35:19] <concernedcitizen> is there a way to speed this up if thats the case?
[13:35:28] <jokke> concernedcitizen: in your case it shouldn't matter
[13:35:31] <Derick> then again, with $size an index isn't used either
[13:35:35] <jokke> ^
[13:35:42] <Derick> concernedcitizen: you should store the size separately in a field
[13:36:15] <concernedcitizen> ohh!
[13:36:42] <concernedcitizen> like 'number_size' and then do a {'number_size': { '$gt': 0 }} ?
[13:36:59] <jokke> yup
[13:37:08] <jokke> but you need to keep it up to date of course
[13:37:38] <cheeser> gross
[13:38:22] <Derick> cheeser: it can be useful when you're doing a $push combined with an $inc
[13:38:34] <cheeser> so long as you always remember to do so
[13:38:40] <Derick> sure
[13:38:40] <jokke> yeah.
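
A sketch of Derick's suggestion, using the field name concernedcitizen proposed; the collection and filter here are made up. The counter is maintained atomically in the same update that grows the array, so the "non-empty" check becomes an ordinary, indexable range query:

    var listingId = ObjectId(), newSize = 42

    // keep number_size in sync whenever the array grows
    db.listings.updateOne(
      { _id: listingId },
      { $push: { "1-bedroom.sizes": newSize }, $inc: { number_size: 1 } }
    )

    // then, instead of { $not: { $size: 0 } } (which cannot use an index):
    db.listings.find({ number_size: { $gt: 0 } })
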
[13:40:31] <jokke> oh god... https://p.jreinert.com/tTGRz/
[13:40:47] <jokke> that's in queries per second
[13:41:01] <jokke> https://p.jreinert.com/QV8/ruby
[13:41:01] <cheeser> nice
[13:41:01] <concernedcitizen> thanks jokke cheeser
[13:41:05] <concernedcitizen> and Derick
[13:41:37] <jokke> i'd thought the second query would've been faster...
[13:42:37] <jokke> is there some obvious optimization i've overlooked?
[13:42:58] <jokke> the only index i have is _id
[13:45:23] <jokke> i'm also not quite familiar with the concept of compound indexes. If _id is my only index, doesn't it mean that if i query with a subset of _id, it's as slow as if there were no index on the fields at all?
[13:46:37] <jokke> hm but here i query using the whole index (_id contains the fields panel_id and timestamp)
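
For context on the question above (it goes unanswered in the log): the default _id index treats an embedded-document _id as a single key, so broadly only a match on the complete embedded document can use it; a query on "_id.panel_id" alone cannot. A separate compound index on the subfields, the "redundant" index jokke mentioned earlier, is what covers prefix queries. A hedged sketch:

    // hypothetical: cover queries on panel_id alone, or panel_id plus a timestamp range
    db.flat_panel.createIndex({ "_id.panel_id": 1, "_id.timestamp": -1 })

    var panelId = "panel-1"
    var from = ISODate("2016-05-01T00:00:00Z"), to = ISODate("2016-05-02T00:00:00Z")

    db.flat_panel.find({
      "_id.panel_id": panelId,
      "_id.timestamp": { $gte: from, $lt: to }
    }).sort({ "_id.timestamp": -1 })
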
[14:13:23] <jokke> interesting
[14:14:28] <jokke> the ruby mongo driver behaves differently for find(foo: 'bar').aggregate([...]) and aggregate([{ '$match' => { foo: 'bar' } }, ...])
[14:15:49] <cheeser> find().aggregate() feels weird.
[14:17:49] <jokke> but it's possible
[14:42:07] <nicolas_FR> hi there
[14:46:34] <nicolas_FR> in this json list : [{objName:"container", propA: [{objName:"subA", propB:[{objName:"subAB", propC:[] } ] } ] }] , how to get the first propC element from propB.objName="subAB" from propA.objName="subA" from objName="container" ?
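
nicolas_FR's question goes unanswered in the log. One way to express it as an aggregation, assuming each element of that list is stored as a document in a collection (called coll here) and MongoDB 3.2 for $arrayElemAt:

    db.coll.aggregate([
      { $match: { objName: "container" } },
      { $unwind: "$propA" },
      { $match: { "propA.objName": "subA" } },
      { $unwind: "$propA.propB" },
      { $match: { "propA.propB.objName": "subAB" } },
      // first element of propC for the matching subAB
      { $project: { firstPropC: { $arrayElemAt: [ "$propA.propB.propC", 0 ] } } }
    ])
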
[15:09:28] <deshymers> Hi, I am trying to use the aggregation framework. I am grouping my results by intervals in a timestamp, so every 10 minutes, and I am sorting these by that timestamp to show the most current result
[15:10:29] <deshymers> this is working fine; however, now a requirement has come up to show a certain result before all others, and this is where I am stuck
[15:12:41] <deshymers> the closest I can think of is a solution like what is done in SQL when ordering by country and putting a specific country first, however I am having trouble finding an example of this using the aggregation framework
[15:13:00] <Derick> do it in the app?
[15:13:16] <zsoc> NoSQL != SQL
[15:14:31] <deshymers> Derick: I'm returning result from mongo that may not have that data available because of the interval grouping
[15:15:48] <saml> give me your aggregate
[15:16:44] <zsoc> So you want to have your results grouped but with sort of custom 'sticky' document but the data that points that document out isn't exposed outside of your aggregate?
[15:17:42] <deshymers> saml: https://gist.github.com/deshymers/c0c0ca03cfe033d9c83b05d7c8af71a0
[15:18:42] <deshymers> zsoc: no it is exposed, I thought derick meant to order the results in the app after the result
[15:19:01] <saml> what are sticky documents?
[15:19:15] <saml> you can sort by sticky field then createdAt ?
[15:20:14] <deshymers> I need to sort by a specific value in a field
[15:20:23] <saml> {$sort: {score:1, createdAt:-1}}
[15:20:36] <saml> assumming score is that field
[15:20:43] <deshymers> so in the above aggregate, I want event = 'status' to be above all other results
[15:21:39] <Derick> you need to add a $project with a condition to see if event=status, and only in that case project 1
[15:21:42] <Derick> otherwise 0
[15:21:46] <Derick> then you can sort on that
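
Derick's idea as a hedged sketch; the real stages and field names are in deshymers' gist, so these are placeholders appended after the existing grouping stages:

    db.events.aggregate([
      // ...existing $match / $group / interval stages from the gist...
      { $project: {
          event: 1, count: 1, interval: 1,                        // whatever the earlier stages emit
          isStatus: { $cond: [ { $eq: [ "$event", "status" ] }, 1, 0 ] }
      } },
      { $sort: { isStatus: -1, interval: -1 } }                   // 'status' results first, then by time
    ])
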
[15:22:04] <deshymers> interesting
[15:27:37] <zsoc> Anyone use mlab? Availability wise.
[16:22:13] <crazyphil> is there some guide out there on the Internet that details how "quiet" mongo-s operation should be?
[16:27:26] <silviolucenajuni> Hi everyone. Good afternoon.
[16:31:57] <starfly> And good morning
[16:47:00] <silviolucenajuni> Someone here did the certification exam ?
[17:17:44] <bogus-nick-irc> Folks, any ideas why a mongodump > mongorestore onto a different host would be ~10% smaller than on the original host? I made sure to db.dropDatabase() the databases I was migrating on the new host before running mongorestore
[17:18:40] <cheeser> probably due to disk layout efficiencies in writing out the new data.
[17:19:09] <cheeser> on the old server, document moves will leave gaps in the file stores that may or may not be reused
[17:20:28] <bogus-nick-irc> cheeser - THANKS! I was so worried for a min haha!
[17:20:48] <bogus-nick-irc> I'm not a mongo whiz by any stretch, just a sysadmin working on #pulp
[17:39:02] <StephenLynx> whats pulp?
[17:40:22] <KodiakFiresmith> StephenLynx - not sure if you are coming from the Redhat world or even the Linux world so not sure where to start. Basically it's kind of like a WSUS server for Linux (if you're a windows guy). It's pretty sweet
[17:40:51] <StephenLynx> Ive been using centOS for a while now
[17:41:11] <StephenLynx> and working exclusively on linux for almost 2 years now
[17:41:19] <StephenLynx> from what I see on the site
[17:41:27] <StephenLynx> it helps to manage a repository?
[17:41:50] <KodiakFiresmith> stephenlynx - this site doesn't do it justice by any means: http://www.pulpproject.org/
[17:42:06] <StephenLynx> yeah, that's what I was looking at
[17:42:28] <KodiakFiresmith> StephenLynx - yep - it takes outside repos, internal repos etc, condenses them into a local mirror
[17:42:38] <KodiakFiresmith> But it's growing beyond that even
[17:42:44] <KodiakFiresmith> debian support is being flirted with
[17:42:49] <KodiakFiresmith> Docker image repo
[17:42:59] <KodiakFiresmith> ISO (or any other arbitrary file type) repos
[17:43:06] <KodiakFiresmith> kickstart trees and kickstart file repos
[17:43:52] <KodiakFiresmith> I really like it so far - the reason I came to #mongodb today was because I just did my first hardware cutover - mongo has been a treat to work with
[17:48:35] <cheeser> sounds like bintray.
[19:42:48] <edrocks> is there any way to set an array value at a certain position?
[19:43:21] <StephenLynx> 'array.x'
[19:43:33] <edrocks> I tried that it added a field x
[19:43:36] <StephenLynx> or $ if you matched an element on the match block
[19:43:39] <StephenLynx> ........
[19:43:46] <StephenLynx> replace x by the index
[19:43:50] <edrocks> I did
[19:43:56] <edrocks> myfield.2.3
[19:44:05] <StephenLynx> hm
[19:44:13] <StephenLynx> was myfield already an array?
[19:44:19] <edrocks> yes
[19:44:21] <StephenLynx> and hold on
[19:44:29] <StephenLynx> are you talking about an array of objects?
[19:44:40] <edrocks> yes
[19:44:47] <StephenLynx> try just myfield.2
[19:44:50] <edrocks> well whoops, it'd be `myfield.2.someprop.3`
[19:44:51] <StephenLynx> and see what happens
[19:45:11] <StephenLynx> I think it can still be done, but I would flatten that model.
[19:45:12] <edrocks> myfield contains an array of objects each with a field containing a second array of objects
[19:45:17] <StephenLynx> yeah, that's bad.
[19:45:21] <edrocks> why?
[19:45:25] <StephenLynx> it's too complex.
[19:45:32] <StephenLynx> it's slower to process and harder to use.
[19:46:08] <StephenLynx> you can have arrays of objects, but arrays of objects with arrays is one step over the line, IMO
[19:46:24] <StephenLynx> even arrays of objects is considered a stretch sometimes.
[19:46:41] <edrocks> it works if I add in my second prop
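
For reference, the kind of update edrocks ends up with; the document shape, collection name, and values are assumptions based on the conversation:

    var docId = ObjectId(), newValue = 7

    // set the 4th element of 'someprop' inside the 3rd element of 'myfield'
    db.coll.updateOne(
      { _id: docId },
      { $set: { "myfield.2.someprop.3": newValue } }
    )

    // or match the outer element in the filter and use the positional $ operator
    db.coll.updateOne(
      { _id: docId, "myfield.kind": "foo" },
      { $set: { "myfield.$.someprop.3": newValue } }
    )
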
[19:46:47] <StephenLynx> yes
[19:46:52] <StephenLynx> it works alright.
[19:47:11] <StephenLynx> but still you will be hindered when you need to perform complex operations or need performant indexes.
[19:47:22] <edrocks> they aren't big arrays
[19:47:47] <edrocks> each array is only 1-5 maybe 7 elements
[19:47:56] <edrocks> with 1-4 in the second array
[19:49:47] <StephenLynx> do you have lots of documents?
[19:49:54] <edrocks> yes
[19:50:03] <StephenLynx> then you will need good indexes.
[19:50:23] <StephenLynx> it won't work as well if you index by these nested elements.
[19:50:26] <StephenLynx> afaik
[19:50:31] <edrocks> I'm using the document id to update it though
[19:50:59] <StephenLynx> which is slower than an insert.
[19:51:18] <StephenLynx> so adding elements to these nested items is slower than adding documents to a new collection
[19:51:38] <edrocks> I don't need to add elements just update them
[19:53:42] <StephenLynx> kek
[19:54:58] <edrocks> StephenLynx: wouldn't it use the id index to update them?
[19:55:51] <StephenLynx> that would be possible, even though it wouldn't be the only option.
[19:56:02] <StephenLynx> you could add any index.
[20:07:37] <kurushiyama> edrocks: Still, an update is orders of magnitude slower than an insert. This situation worsens if you concurrently add data.
[20:16:41] <GothAlice> edrocks: Or if the data grows beyond the record padding limit. Then you're moving whole records around, not just modifying one in-place.
[20:17:31] <GothAlice> Not to mention the impossibility of limited projection of array contents, doubly so for nested array contents. (Only one $ operator per query…)
[20:18:08] <cheeser> WT doesn't do in place updates, fwiw.
[20:18:34] <GothAlice> Good to know.
[20:19:26] <GothAlice> cheeser: Wait, LSM is enabled by default?
[20:19:37] <kurushiyama> Yeah, it is all c-o-w. What actually would be interesting are the conditions for the "old" space to be reused.
[20:19:47] <kurushiyama> GothAlice: It is copy-on-write.
[20:19:58] <cheeser> GothAlice: LSM is not enabled in the WT bundled in mongodb
[20:20:34] <GothAlice> Okay, phew, had me worried for a moment there. ;P
[20:20:43] <cheeser> my work here is done!
[20:22:28] <edrocks> what would you do instead of arrays? have a bunch of separate documents referencing the parents id?
[20:23:47] <GothAlice> Having one level of nesting is A-OK. Having two is tenuous unless _insanely carefully controlled_ to ensure you don't run afoul of the nearly limitless list of problems with having multiple arrays in a single document, or nested arrays; that's why it's generally so strongly frowned upon.
[20:24:01] <GothAlice> It _can_ work, but there are limits, and the simpler the structure is kept, the better for everyone.
[20:24:12] <kurushiyama> edrocks: That really depends on your use case. But basically, yes. Most of the time, it is rather questionable if there is a "parent" in the classical sense. Quite often, I find that it is more of a generic relation than a parent-child relation in the narrow sense.
[20:25:03] <edrocks> well the second nested array is mostly unused besides the first element (but the data does need to be there and be viewable)
[20:25:06] <GothAlice> For example, an array/list of simple types within an array of compound types (I.e. [{foo: [1, 2, 3]}]) is manageable. Nested compound types? Not gonna fly.
[20:25:23] <kurushiyama> edrocks: can you pastebin a sample?
[20:25:28] <edrocks> kurushiyama: sure
[20:25:45] <kurushiyama> GothAlice: Quite an understatement ;)
[20:27:21] <GothAlice> kurushiyama: :P My forums, which I often point to as a compact example of when and when not to nest/embed, stores voter ObjectIds in a nested array within the replies within the thread. But it's still document→array(compound)→array(simple) at absolute maximum.
[20:28:54] <StephenLynx> on lynxchan I don't think I ever go beyond document>array(compound)
[20:29:00] <edrocks> kurushiyama: http://pastebin.com/6eGbRG2j
[20:29:23] <edrocks> it's for shipping quote data(think priceline for shipping)
[20:29:25] <kurushiyama> GothAlice: I even found that structure to be problematic. Ofc, it is highly optimized for queries and fulfilling the corresponding use case. But it has to be very carefully crafted. Which I am sure _you_ can do...
[20:29:32] <StephenLynx> and its very rare for arrays to have compound objects, usually its simple values
[20:29:50] <edrocks> it's complicated data
[20:29:52] <GothAlice> kurushiyama: Heh, quite so. As a "rule of thumb", "don't do that" is quite valid.
[20:30:05] <edrocks> I've never done it anywhere else and do not intend to
[20:30:48] <kurushiyama> edrocks: Give us a bit of information. Ok, I get these are grouped quotes. What I do not get is by which attribute they are grouped.
[20:30:50] <GothAlice> edrocks: Yeah, at a glance I'd pivot "QuoteGroups" out into its own collection, then the nested "quotes" for a given QuoteGroup document will be a) singly nested, and b) still "optimizing the record count" by pre-grouping.
[20:31:27] <edrocks> kurushiyama: a quote group contains quotes coming from a single location
[20:31:51] <edrocks> each quote is the actual price, transit time, etc
[20:31:51] <kurushiyama> Then I'd definitely go with GothAlice's suggestion of single docs per quote.
[20:32:19] <GothAlice> "QuoteGroup"
[20:32:42] <kurushiyama> Because you will most likely look up the quotes for a known location, right?
[20:33:00] <edrocks> kurushiyama: no they are looked up by the document id
[20:33:15] <kurushiyama> edrocks: Of the quote?
[20:33:24] <edrocks> kurushiyama: of the whole outer most document
[20:34:13] <kurushiyama> edrocks: Well, there you have it. ;) GothAlice: Point taken ;)
[20:34:28] <GothAlice> edrocks: This is likely because it was impossible to ask the DB this with your current structure, so was never a consideration in your head-space, but split apart at the QuoteGroup level as I suggest would allow you to answer the specific question: "What quotes were given for project X at location Y?"
[20:34:43] <kurushiyama> +1
[20:34:56] <GothAlice> Without needing to inspect, interrogate, load, or otherwise touch the quotes at other locations for the given top-level thing.
[20:35:04] <GothAlice> Currently, that's impossible. But split, it's not.
[20:35:47] <edrocks> admittedly I've added very little analytics so I'll probably have to change it like you recommend
[20:37:19] <edrocks> you're really not going to like it when I tell you each quote also has another nested Rates array containing different pricing levels
[20:37:25] <GothAlice> Before really stepping in to design my data structures, I first sit down and plan out exactly what questions I need the data to answer. With MongoDB, this informs the actual design more than any other factor.
[20:37:43] <GothAlice> edrocks: <humour>You'll be first against the wall when the revolution comes.</humour>
[20:38:09] <edrocks> GothAlice: at least it's worked well so far :)
[20:38:20] <GothAlice> Except, clearly, you aren't actually asking the data any questions.
[20:38:25] <GothAlice> You're treating it as a JSON BLOB.
[20:38:41] <GothAlice> (Thus: various definitions of "well".)
[20:39:26] <edrocks> GothAlice: wouldn't splitting it all up make querying for everything slow though? i.e. if I have to look up 20 subdocuments for each document
[20:39:40] <GothAlice> I can't imagine the edge cases you've been ignoring involving the indexes of individual array elements potentially changing as things are added and removed, and thus an update might accidentally update the wrong array element.
[20:40:21] <GothAlice> edrocks: The idea of building a single, complete, comprehensive object representing the entire state of that order and all related data is a fundamentally flawed approach.
[20:40:35] <edrocks> the quote groups are all completely replaced when they are updated. I just send the updates as I get them (over 1-10 seconds)
[20:41:01] <kurushiyama> GothAlice: Interesting decision to do deep nesting for the forums. Was it by design or evolutionary?
[20:41:14] <edrocks> kurushiyama: evolution
[20:41:34] <edrocks> at first it was just one quote group then I added multi locations then I added different service levels
[20:41:46] <GothAlice> Design, kurushiyama. A "forum" is a standalone thing, a "thread" contains its replies. This gives an upper bound of around 2.3 million English words of text per thread before needing to worry about "continuation" threads.
[20:43:01] <GothAlice> This is explicitly due to the querying needs: when looking at a thread, you naturally want the replies. (Single level nesting = can "paginate" or range query the array to project subsets.) When looking at a single reply, you also need to know about the thread (title, permissions, etc.) so the embedded "relationship" there is absolutely perfect. Single nesting also preserves updating individual replies by ID.
[20:44:13] <GothAlice> An array being treated as a set ($addToSet and friends) for the voter information works a-ok with the single $ limitation, since those updates are always applied to a given reply by ID.
[20:46:44] <GothAlice> (Single complex type nesting, to be specific.)
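
Roughly the voter update GothAlice is describing, with guessed collection and field names: the reply is addressed by its id in the filter, so the single positional $ is enough, and the voter array behaves as a set:

    var threadId = ObjectId(), replyId = ObjectId(), voterId = ObjectId()

    db.threads.updateOne(
      { _id: threadId, "replies._id": replyId },
      { $addToSet: { "replies.$.voters": voterId } }
    )
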
[20:50:13] <kurushiyama> GothAlice: Thanks for the insight. I would have guessed that when looking for the replies for a given thread, it would have been easier to simply embed the votings in the answer and go with a flat model otherwise. Given the context, however, this makes sense.
[20:52:05] <GothAlice> kurushiyama: There are some trade-offs, however. Certain statistics (view count, number of replies, number of votes, etc.) are tracked as counters in the Thread record and/or individual embedded replies, with a per-thread pre-aggregated copy of the individual reply statistics. I.e. updating a reply view count also increments the thread view count.
[20:52:15] <GothAlice> This due to the fact that you can't .count() out of an array. ;)
[20:52:37] <GothAlice> (And also because it reduces most analytics queries to O(n-threads-in-range).)
[20:52:54] <GothAlice> (Instead of O(n-threads * m-replies))
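
The counter pattern from a few lines up, sketched with hypothetical names: one update bumps both the embedded reply's counter and the thread's pre-aggregated total, since (as GothAlice notes) you can't .count() out of an array:

    var threadId = ObjectId(), replyId = ObjectId()

    db.threads.updateOne(
      { _id: threadId, "replies._id": replyId },
      { $inc: { "replies.$.views": 1, views: 1 } }
    )
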
[20:54:58] <kurushiyama> GothAlice: I see and was about to say that some operations would be rather nasty if not pre-aggregated. Your data model is highly optimized for your use cases. Which is what I love about MongoDB ;)
[20:56:06] <kurushiyama> edrocks: Let me try to get a grasp on your data. A single doc as you have shown us represents a location, for which zero or more quotes (in the economical sense) are made, right?
[20:56:48] <edrocks> kurushiyama: each quotegroup contains quotes from a specific location
[20:58:19] <kurushiyama> edrocks: Knowing your business: Those are quotes of companies for shipping something to said location, right?
[20:58:33] <edrocks> kurushiyama: I'm making a picture one second
[20:59:09] <kurushiyama> edrocks: I am sorry, but I am very bad at modelling data when I can not wrap my head around the problem at hand ;)
[21:00:01] <GothAlice> kurushiyama: On one work dataset, either we could aggregate a million records to cover a month's analytics events, or… we pre-aggregate so reports only query 30*24=720 records for the same month. (Hourly pre-aggregation.)
[21:00:24] <GothAlice> Constant time reports for fun and profit.
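
A sketch of that hourly pre-aggregation, with invented collection and field names: each incoming event upserts into one document per (metric, hour), so a month's report scans roughly 30 * 24 = 720 documents instead of every raw event:

    // called once per incoming event
    db.stats_hourly.updateOne(
      { metric: "pageviews", hour: ISODate("2016-05-02T13:00:00Z") },
      { $inc: { count: 1 } },
      { upsert: true }
    )

    // monthly report: about 720 documents
    db.stats_hourly.find({
      metric: "pageviews",
      hour: { $gte: ISODate("2016-05-01"), $lt: ISODate("2016-06-01") }
    })
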
[21:03:33] <GothAlice> kurushiyama: Also… meddling Puppeteers…
[21:04:21] <kurushiyama> GothAlice: Well... it feels... wrong. I see the benefit, but storing aggregation data... I always felt more comfortable with downsampling.
[21:04:46] <GothAlice> kurushiyama: https://gist.github.com/amcgregor/1ca13e5a74b2ac318017
[21:04:49] <kurushiyama> GothAlice: ;)
[21:04:55] <GothAlice> That may be what you describe as "downsampling".
[21:05:19] <edrocks> kurushiyama: https://drive.google.com/file/d/0B3SnjDk4Ny4-NEFTcGdkZlotU2s/view?usp=sharing
[21:05:58] <edrocks> each quotegroup contains quotes for various carriers(shipping companies) coming from a single location
[21:09:25] <kurushiyama> GothAlice: Interesting.
[21:11:21] <kurushiyama> GothAlice: General Products approach to data modelling ;)
[21:12:12] <edrocks> kurushiyama: I have to go get a haricut but we can talk later. did you get the picture?
[21:12:27] <kurushiyama> edrocks: I think I get it now. The quotes make up a chain of transportations to get the shipment done?
[21:12:41] <kurushiyama> edrocks: ttyl!
[21:12:47] <edrocks> kurushiyama: yea they are shipping quotes from various shipping companies
[21:13:10] <edrocks> kurushiyama: it's to compare which company to use and which warehouse is best to ship something from
[21:15:54] <kurushiyama> GothAlice: Can you share your experience with the performance of adding data? Most of my customers have to deal with several thousand events/minute. I guess we could do a lot on the UX side, but slowing down insert performance could easily translate to additional shards. I guess the loss is significant, isn't it?
[21:21:28] <GothAlice> kurushiyama: With two "consumers" and three "producers", I was able to get around two million bidirectional "requests" per second across MongoDB. That is very specifically, on the "producer" side: insert job record into real collection, insert notification into capped collection; and on the "consumer" side watch for the notification, load the job record, run the referenced function, update the job record with the result of running the
[21:21:28] <GothAlice> function, with three "notifications" inserted per job.
[21:22:32] <GothAlice> That's two million enqueued and executed tasks per _second_.
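
The producer half of the pattern GothAlice outlines; the collection and field names here are illustrative, not hers (her gists linked later in the log describe the real implementation):

    // one-time setup: a capped collection that consumers tail for notifications
    db.createCollection("task_notifications", { capped: true, size: 16 * 1024 * 1024 })

    // enqueue: the durable job record goes in a normal collection,
    // the lightweight "wake up" notification goes in the capped one
    var job = { _id: ObjectId(), fn: "resize_image", args: ["photo.jpg"], state: "pending" }
    db.tasks.insertOne(job)
    db.task_notifications.insertOne({ task: job._id, ts: new Date() })
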
[21:22:36] <kurushiyama> GothAlice: Bam.
[21:23:11] <GothAlice> And that was benchmarked on MongoDB 2… 6? I think it was 2.6. Five years ago at this point. ;P
[21:23:36] <kurushiyama> GothAlice: May I ask of what scale of hardware we are talking about?
[21:24:16] <GothAlice> Multiple processes on a single unibody MacBook Pro. Core i5, 8GB RAM.
[21:24:26] <kurushiyama> You kidding?
[21:24:42] <GothAlice> Did it live as part of a "this is what we've been working on" presentation during a Facebook developers mini-conference. Knocked socks off.
[21:25:28] <GothAlice> (We used it for messaging, immediate RPC, and scheduled RPC for a Facebook game we were working on at the time for the TV show The Bachelor. ;)
[21:25:37] <kurushiyama> GothAlice: On my 2011 MBA, I get maybe 2.5k inserts/s
[21:25:54] <GothAlice> Ah, but that's a single process inserting, no?
[21:26:16] <kurushiyama> GothAlice: Right, but still nowhere near your throughput.
[21:26:22] <GothAlice> If you run "top" or equivalent to watch CPU load, you may notice mongod not doing much with that level of stress.
[21:26:45] <GothAlice> Oh: also, SSD disk.
[21:27:02] <GothAlice> (And a MacBook Air is woefully underpowered.)
[21:27:39] <kurushiyama> The disk-on-bus makes the disks rather fast... ;)
[21:27:49] <GothAlice> Makes IO fast.
[21:27:55] <GothAlice> Doesn't change the thermal profile of your CPU, though.
[21:28:09] <kurushiyama> Which reminds me of getting an egg ;)
[21:28:54] <GothAlice> Basically: don't run stress tests on an Air. Its small profile and lack of substantial active cooling means the CPU will simply refuse to "spin up" to Turbo Boost speeds. Especially once it gets to the temperature where it can cook an egg.
[21:29:18] <Derick> hmm, eggs
[21:31:00] <kurushiyama> GothAlice: Ah... wait. With mgo, I was somewhere at like 100k inserts/s (though I could have started a burger bbq on the CPU). And of course I do not do stress tests on an MBA. That's what I have a lab for ;) But 2M ops/s on a laptop is nevertheless impressive.
[21:31:28] <GothAlice> kurushiyama: https://gist.github.com/amcgregor/4207375 are extracts from my slides in that presentation (switching the project to MongoDB was my pet project there) with a link to a more complete implementation in the comments. :)
[21:31:30] <kurushiyama> Is there a cooling device in the shape of a BBQ grill?
[21:37:50] <kurushiyama> GothAlice: Jeez. A distributed task queue with automatic failover. Adding some shard tags into the mix, I can think of quite some great applications for that.
[21:45:06] <GothAlice> Yes indeed. :3
[21:45:43] <GothAlice> kurushiyama: It gets so much better, though. marrow.task can distribute Python generators.
[21:46:01] <GothAlice> I.e. have one worker emitting values that one or more workers, somewhere else, are consuming.
[21:46:11] <GothAlice> (Using actual generators.)
[21:47:00] <kurushiyama> GothAlice: Well, I am a bit out of Python (mainly doing Go now and did Java for a long time before that). But as I get it, it is distributable closures (more or less)?
[21:47:36] <GothAlice> They can be used that way, yes. Generators are the layer below coroutines. (A generator you can send() to… is a coroutine.)
[21:49:51] <kurushiyama> GothAlice: dang. distribute them, have them report back...
[21:50:15] <GothAlice> "Deferred processing pipelines" is how I'll be advertising it once the web framework projects calm down a bit.
[22:01:02] <kurushiyama> GothAlice: ? Calm down a bit? I have the feeling it is getting more crazy day by day...
[23:20:40] <hardwire> so the online docs for MongoDB are now claiming that MongoDB Enterprise has different features in the core software than the open-source version, like an in-memory datastore (beta).
[23:21:11] <hardwire> Are they starting to put commercial only code in a fork of the core server?
[23:21:24] <hardwire> That'd be sorta depressing.
[23:43:22] <cheeser> gotta pay the bills