[01:09:42] <csd_> Why is it that Mongo is able to easily store large binary blobs but relational DBs (AFAIK) can't?
[07:27:16] <hemangpatel> Hello, why is there a port range restriction in pymongo? https://github.com/mongodb/mongo-python-driver/blob/master/pymongo/uri_parser.py#L143
[07:31:10] <joannac> hemangpatel: that's a system port restriction
[07:33:09] <joannac> hemangpatel: port numbers are 16 bits, which means 0-65535 inclusive
[07:33:11] <hemangpatel> joannac, my mongodb is running on port 10594
[07:33:28] <joannac> hemangpatel: ...yes, that's less than 65535
[10:47:14] <jokke> if i use an embedded doc as _id, can i influence the sort order as i could do when manually defining an additional index with createIndex
[10:47:45] <jokke> seems redundant to add an index that contains just the members of the _id document
[11:05:03] <kurushiyama> jokke: Since _id is indexed by default, you can do sth like db.coll.find(query).sort({"_id.field":1})
[11:49:23] <jokke> kurushiyama: my queries are very slow... i benchmarked different schemas for the same query (fetch the last sample). And even for the flat schema (sample per document) i only get to around 14 queries per second.
[11:49:50] <jokke> It's significantly worse for the other schemas 3.5 - 8 queries per second
[11:51:09] <jokke> the worst being my flat_panel schema which stores samples of the same group in one document for every time interval
[11:51:42] <jokke> this is the query (aggregation): https://p.jreinert.com/t23U20/ruby
[11:53:11] <jokke> this is how the documents are structured in the flat_panel schema: https://p.jreinert.com/uzeJ/javascript
[11:53:16] <kurushiyama> jokke: Well, benchmarking is one part. What about the corresponding indices? That aggregation is slow by definition: each document gets touched, and unwound
[12:00:26] <kurushiyama> In order to limit the number of docs processed, you want an early match. In your case, documents in which one of the array elements matches a certain condition
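(A rough mongo shell sketch of the "early match" kurushiyama suggests. The collection and field names (panels, panel_id, time, samples) are assumptions, since the pasted schema isn't reproduced here; the point is only that $match comes before $unwind so only relevant documents get unwound:)

    db.panels.aggregate([
      // filter first, so only matching documents are unwound
      { $match: { panel_id: "p1", time: { $gte: ISODate("2016-05-01T00:00:00Z") } } },
      { $unwind: "$samples" },
      { $sort: { "samples.timestamp": -1 } },
      // "fetch the last sample"
      { $limit: 1 }
    ])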
[12:03:33] <jokke> hm i can't seem to be able to speed it up
[12:03:51] <jokke> trying it like this atm: https://p.jreinert.com/j6Pq/ruby
[12:08:17] <kurushiyama> me no ruby. Have to validate sth, first
[12:09:44] <jokke> kurushiyama: regarding inserts being slower than a doc update: wt is super fast with $push, also updating a flat_panel doc won't trigger an index update afaik.
[12:10:30] <kurushiyama> jokke: I am verifying this, but tbh, I guess we are talking orders of magnitude slower for updates, for a lot of operations.
[12:21:40] <kurushiyama> jokke: To give you an impression: inserting 100k docs took 56.84 secs. For updates, I have smoked a cigarette, and we are not halfway through.
[12:22:30] <kurushiyama> jokke: And those were single-threaded inserts vs single-threaded updates. With multiple threads, this ratio is going to be worse for updates due to document level locking.
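(A sketch of the kind of comparison kurushiyama describes, not his actual hastebin code, in the mongo shell with made-up collection names: one document per sample versus $push-ing every sample into one growing document:)

    // inserts: one new document per sample
    for (var i = 0; i < 100000; i++) {
      db.ins_test.insert({ n: i, ts: new Date() })
    }

    // updates: every sample is $push-ed into a single document
    db.upd_test.insert({ _id: 1, samples: [] })
    for (var i = 0; i < 100000; i++) {
      db.upd_test.update({ _id: 1 }, { $push: { samples: { n: i, ts: new Date() } } })
    }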
[12:23:54] <jokke> you can try to convince me that inserts are faster as much as you want :D
[12:24:12] <jokke> i've run my tests, i've verified the data
[12:24:29] <kurushiyama> jokke: Ok, I am out then. You simply seem to know better, regardless of the facts and better knowledge. Good luck.
[12:25:05] <jokke> no need to take it personally dude
[12:25:26] <kurushiyama> Good luck: http://hastebin.com/uxuvidaruw.coffee
[12:26:00] <kurushiyama> jokke: I do not take it personally. But you are asking for help and ignoring my advice nevertheless. That's a waste of time for both of us.
[12:26:14] <jokke> i'm just a bit irritated because i asked about my aggregation and why it's slow. before i know it my whole schema approach is being questioned.
[12:26:37] <kurushiyama> jokke: Since modelling defines performance.
[12:26:42] <jokke> kurushiyama: i've repeatedly told you why i won't go with flat documents
[12:27:09] <kurushiyama> jokke: As said: good luck. I guess you should try the above code. Might be an eye opener.
[12:28:39] <StephenLynx> let me guess, his model goes like, 3 or 4 layers deep?
[12:38:48] <Derick> yeah, you are. It can't logically be faster. Mostly because WiredTiger is append-only anyway. It doesn't do in-document updates, but always replaces the whole thing with an append
[12:39:13] <Derick> jokke: I can make up results too
[13:25:34] <concernedcitizen> hi guys, I'm trying to make a $where query like: '$where': 'this.1-bedroom.length > 0'. But when I run this query, I get a SyntaxError: Unexpected number.
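(The error comes from JavaScript itself: a property whose name starts with a digit, or contains a hyphen, cannot be reached with dot notation inside a $where expression. A hedged sketch of two ways around it, assuming a hypothetical collection "listings" in which "1-bedroom" is an array field:)

    // bracket notation makes the $where parse
    db.listings.find({ $where: "this['1-bedroom'].length > 0" })

    // usually cheaper: skip $where and just test that a first element exists
    db.listings.find({ "1-bedroom.0": { $exists: true } })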
[13:45:23] <jokke> i'm also not quite familiar with the concept of compound indexes. If _id is my only index, doesn't it mean that if i query with a subset of _id, it's as slow as if there were no index for the fields at all?
[13:46:37] <jokke> hm but here i query using the whole index (_id contains the fields panel_id and timestamp)
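(A hedged note: an embedded-document _id is indexed as one value, not as a compound index over its sub-fields, so a query on "_id.panel_id" alone generally cannot use the _id index the way a prefix of a compound index could, while an exact match on the whole embedded document can. A sketch of the explicit compound-index alternative, using the field names from the discussion:)

    // explicit compound index: a query on panel_id alone, or panel_id plus a
    // timestamp range/sort, can use the index prefix
    db.samples.createIndex({ panel_id: 1, timestamp: -1 })

    db.samples.find({ panel_id: "p1" })
      .sort({ timestamp: -1 })
      .limit(1)        // "fetch the last sample" straight off the index

    // compare the plans
    db.samples.find({ "_id.panel_id": "p1" }).explain("executionStats")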
[14:46:34] <nicolas_FR> in this json list : [{objName:"container", propA: [{objName:"subA", propB:[{objName:"subAB", propC:[] } ] } ] }] , how to get the first propC element from propB.objName="subAB" from propA.objName="subA" from objName="container" ?
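(One hedged way to answer nicolas_FR's question with the aggregation framework (MongoDB 3.2+ for $arrayElemAt), assuming the documents live in a hypothetical collection "containers":)

    db.containers.aggregate([
      { $match: { objName: "container" } },
      { $unwind: "$propA" },
      { $match: { "propA.objName": "subA" } },
      { $unwind: "$propA.propB" },
      { $match: { "propA.propB.objName": "subAB" } },
      // first element of propC
      { $project: { first: { $arrayElemAt: ["$propA.propB.propC", 0] } } }
    ])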
[15:09:28] <deshymers> Hi, I am trying to use the aggregate framework and I am grouping my results by intervals in a timestamp, so every 10 minutes, and I am sorting these by that timestamp to show the most current result
[15:10:29] <deshymers> this is working fine, however now a requirement has come up to show a certain result before all others, and this is where I am stuck
[15:12:41] <deshymers> the closest I can think of is a solution like what's done in SQL when ordering by country and putting a specific country first, however I am having trouble finding an example of this using the aggregate framework
[15:16:44] <zsoc> So you want to have your results grouped but with a sort of custom 'sticky' document, except the data that singles that document out isn't exposed outside of your aggregate?
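(The SQL trick deshymers is thinking of, ORDER BY CASE WHEN country = 'X' THEN 0 ELSE 1 END, has a direct aggregation analogue: compute a sort key with $cond and sort on it first. A sketch with invented collection and field names:)

    db.events.aggregate([
      // group into 10-minute buckets (bucket computed upstream; elided here)
      { $group: { _id: "$bucket", latest: { $max: "$ts" } } },
      // 0 for the result that must come first, 1 for everything else
      { $project: { latest: 1, pin: { $cond: [{ $eq: ["$_id", "the-special-bucket"] }, 0, 1] } } },
      { $sort: { pin: 1, latest: -1 } }
    ])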
[16:47:00] <silviolucenajuni> Has anyone here done the certification exam?
[17:17:44] <bogus-nick-irc> Folks, any ideas why a mongodump > mongorestore onto a different host would be ~10% smaller than on the original host? I made sure to db.dropDatabase() the databases I was migrating on the new host before running mongorestore
[17:18:40] <cheeser> probably due to disk layout efficiencies in writing out the new data.
[17:19:09] <cheeser> on the old server, document moves will leave gaps in the file stores that may or may not be reused
[17:20:28] <bogus-nick-irc> cheeser - THANKS! I was so worried for a min haha!
[17:20:48] <bogus-nick-irc> I'm not a mongo whiz by any stretch, just a sysadmin working on #pulp
[17:40:22] <KodiakFiresmith> StephenLynx - not sure if you are coming from the Redhat world or even the Linux world so not sure where to start. Basically it's kind of like a WSUS server for Linux (if you're a windows guy). It's pretty sweet
[17:40:51] <StephenLynx> I've been using CentOS for a while now
[17:41:11] <StephenLynx> and working exclusively on linux for almost 2 years now
[17:42:59] <KodiakFiresmith> ISO (or any other arbitrary file type) repos
[17:43:06] <KodiakFiresmith> kickstart trees and kickstart file repos
[17:43:52] <KodiakFiresmith> I really like it so far - the reason I came to #mongodb today was because I just did my first hardware cutover - mongo has been a treat to work with
[20:07:37] <kurushiyama> edrocks: Still, an update is orders of magnitude slower than an insert. This situation worsens if you concurrently add data.
[20:16:41] <GothAlice> edrocks: Or if the data grows beyond the record padding limit. Then you're moving whole records around, not just modifying one in-place.
[20:17:31] <GothAlice> Not to mention the impossibility of limited projection of array contents, doubly so for nested array contents. (Only one $ operator per query…)
[20:18:08] <cheeser> WT doesn't do in place updates, fwiw.
[20:22:28] <edrocks> what would you do instead of arrays? have a bunch of separate documents referencing the parents id?
[20:23:47] <GothAlice> Having one level of nesting is A-OK. Having two is tenuous unless _insanely carefully controlled_ to ensure you don't run afoul of the nearly limitless list of problems with having multiple arrays in a single document, or nested arrays, that's why it's generally so strongly frowned upon.
[20:24:01] <GothAlice> It _can_ work, but there are limits, and the simpler the structure is kept, the better for everyone.
[20:24:12] <kurushiyama> edrocks: That really depends on your use case. But basically, yes. Most of the times, it is rather questionable if there is a "parent" in the classical sense. Quite often, I find that it is more of a generic relation than a parent-child relation in the narrow sense.
[20:25:03] <edrocks> well the second nested array is mostly unused besides for the first element(but the data does need to be there and be viewable)
[20:25:06] <GothAlice> For example, an array/list of simple types within an array of compound types (I.e. [{foo: [1, 2, 3]}]) is manageable. Nested compound types? Not gonna fly.
[20:25:23] <kurushiyama> edrocks: can you pastebin a sample?
[20:25:45] <kurushiyama> GothAlice: Quite an understatement ;)
[20:27:21] <GothAlice> kurushiyama: :P My forums, which I often point to as a compact example of when and when not to nest/embed, stores voter ObjectIds in a nested array within the replies within the thread. But it's still document→array(compound)→array(simple) at absolute maximum.
[20:28:54] <StephenLynx> on lynxchan I don't think I ever go beyond document>array(compound)
[20:29:23] <edrocks> it's for shipping quote data(think priceline for shipping)
[20:29:25] <kurushiyama> GothAlice: I even found that structure to be problematic. Ofc, it is highly optimized for queries and fulfilling the according use case. But it has to be very carefully crafted. Which I am sure _you_ can do...
[20:29:32] <StephenLynx> and its very rare for arrays to have compound objects, usually its simple values
[20:29:52] <GothAlice> kurushiyama: Heh, quite so. As a "rule of thumb", "don't do that" is quite valid.
[20:30:05] <edrocks> I've never done it anywhere else and do not intend to
[20:30:48] <kurushiyama> edrocks: Give us a bit of information. Ok, I get these are grouped quotes. What I do not get is by which attribute they are grouped.
[20:30:50] <GothAlice> edrocks: Yeah, at a glance I'd pivot "QuoteGroups" out into its own collection, then the nested "quotes" for a given QuoteGroup document will be a) singly nested, and b) still "optimizing the record count" by pre-grouping.
[20:31:27] <edrocks> kurushiyama: a quote group contains quotes coming from a single location
[20:31:51] <edrocks> each quote is the actual price, transit time, etc
[20:31:51] <kurushiyama> Then I'd definitely go with GothAlice's suggestion of single docs per quote.
[20:33:24] <edrocks> kurushiyama: of the whole outermost document
[20:34:13] <kurushiyama> edrocks: Well, there you have it. ;) GothAlice: Point taken ;)
[20:34:28] <GothAlice> edrocks: This is likely because it was impossible to ask the DB this with your current structure, so was never a consideration in your head-space, but split apart at the QuoteGroup level as I suggest would allow you to answer the specific question: "What quotes were given for project X at location Y?"
[20:34:56] <GothAlice> Without needing to inspect, interrogate, load, or otherwise touch the quotes at other locations for the given top-level thing.
[20:35:04] <GothAlice> Currently, that's impossible. But split, it's not.
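(A rough sketch of the split GothAlice is suggesting: one document per QuoteGroup in its own collection, referencing the top-level document. All field names are assumed, since edrocks's actual structure isn't pasted here:)

    var projectId = ObjectId()          // stands in for the _id of the top-level document

    db.quote_groups.insert({
      project_id: projectId,
      location: "warehouse-east",
      quotes: [                         // a single level of nesting remains
        { carrier: "ACME", price: 120.50, transit_days: 3 },
        { carrier: "FastShip", price: 140.00, transit_days: 2 }
      ]
    })

    // "What quotes were given for project X at location Y?"
    db.quote_groups.find({ project_id: projectId, location: "warehouse-east" })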
[20:35:47] <edrocks> admittedly I've added very little analytics so I'll probably have to change it like you recommend
[20:37:19] <edrocks> you're really not going to like it when I tell you each quote also has another nested Rates array containing different pricing levels
[20:37:25] <GothAlice> Before really stepping in to design my data structures, I first sit down and plan out exactly what questions I need the data to answer. With MongoDB, this informs the actual design more than any other factor.
[20:37:43] <GothAlice> edrocks: <humour>You'll be first against the wall when the revolution comes.</humour>
[20:38:09] <edrocks> GothAlice: at least it's worked well so far :)
[20:38:20] <GothAlice> Except, clearly, you aren't actually asking the data any questions.
[20:38:25] <GothAlice> You're treating it as a JSON BLOB.
[20:38:41] <GothAlice> (Thus: various definitions of "well".)
[20:39:26] <edrocks> GothAlice: wouldn't splitting it all up make querying for everything slow though? ie if I have to lookup 20 sub documents for each document
[20:39:40] <GothAlice> I can't imagine the edge cases you've been ignoring involving the indexes of individual array elements potentially changing as things are added and removed, and thus an update might accidentally update the wrong array element.
[20:40:21] <GothAlice> edrocks: The idea of building a single, complete, comprehensive object representing the entire state of that order and all related data is a fundamentally flawed approach.
[20:40:35] <edrocks> the quote groups are all completely replaced when they are updated. I just send the updates as I get them (over 1-10 seconds)
[20:41:01] <kurushiyama> GothAlice: Interesting decision to do deep nesting for the forums. Was it by design or evolutionary?
[20:41:34] <edrocks> at first it was just one quote group then I added multi locations then I added different service levels
[20:41:46] <GothAlice> Design, kurushiyama. A "forum" is a standalone thing, a "thread" contains its replies. This gives an upper bound of around 2.3 million English words of text per thread before needing to worry about "continuation" threads.
[20:43:01] <GothAlice> This is explicitly due to the querying needs: when looking at a thread, you naturally want the replies. (Single level nesting = can "paginate" or range query the array to project subsets.) When looking at a single reply, you also need to know about the thread (title, permissions, etc.) so the embedded "relationship" there is absolutely perfect. Single nesting also preserves updating individual replies by ID.
[20:44:13] <GothAlice> An array being treated as a set ($addToSet and friends) for the voter information works a-ok with the single $ limitation, since those updates are always applied to a given reply by ID.
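(A sketch of the update pattern GothAlice describes, with placeholder ids and assumed field names: the reply is addressed by its _id, and the single positional $ then points at that reply, so $addToSet lands on its votes array:)

    var threadId = ObjectId(), replyId = ObjectId(), voterId = ObjectId()   // placeholders

    db.threads.update(
      { _id: threadId, "replies._id": replyId },      // match exactly one embedded reply
      { $addToSet: { "replies.$.votes": voterId } }   // $ resolves to that reply; no duplicate votes
    )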
[20:46:44] <GothAlice> (Single complex type nesting, to be specific.)
[20:50:13] <kurushiyama> GothAlice: Thanks for the insight. I would have guessed that when looking for the replies for a given thread, it would have been easier to simply embed the votes in the reply and go with a flat model otherwise. Given the context, however, this makes sense.
[20:52:05] <GothAlice> kurushiyama: There are some trade-offs, however. Certain statistics (view count, number of replies, number of votes, etc.) are tracked as counters in the Thread record and/or individual embedded replies, with a per-thread pre-aggregated copy of the individual reply statistics. I.e. update a reply view count also increments the thread view count.
[20:52:15] <GothAlice> This is due to the fact that you can't .count() out of an array. ;)
[20:52:37] <GothAlice> (And also because it reduces most analytics queries to O(n-threads-in-range).)
[20:52:54] <GothAlice> (Instead of O(n-threads * m-replies))
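(A sketch of the pre-aggregated counters, field names assumed: one write bumps both the reply's counter and the thread-level copy, so analytics never have to count inside the array:)

    var threadId = ObjectId(), replyId = ObjectId()   // placeholders, as in the sketch above

    db.threads.update(
      { _id: threadId, "replies._id": replyId },
      { $inc: { views: 1, "replies.$.views": 1 } }    // thread total and per-reply count in one write
    )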
[20:54:58] <kurushiyama> GothAlice: I see and was about to say that some operations would be rather nasty if not pre-aggregated. Your data model is highly optimized for your use cases. Which is what I love about MongoDB ;)
[20:56:06] <kurushiyama> edrocks: Let me try to get a grasp on your data. A single doc as you have shown us represents a location, for which zero or more quotes (in the economical sense) are made, right?
[20:56:48] <edrocks> kurushiyama: each quotegroup contains quotes from a specific location
[20:58:19] <kurushiyama> edrocks: Knowing your business: Those are quotes of companies for shipping something to said location, right?
[20:58:33] <edrocks> kurushiyama: I'm making a picture one second
[20:59:09] <kurushiyama> edrocks: I am sorry, but I am very bad at modelling data when I can not wrap my head around the problem at hand ;)
[21:00:01] <GothAlice> kurushiyama: On one work dataset, either we could aggregate a million records to cover a month's analytics events, or… we pre-aggregate so reports only query 30*24=720 records for the same month. (Hourly pre-aggregation.)
[21:00:24] <GothAlice> Constant time reports for fun and profit.
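(A sketch of the hourly pre-aggregation GothAlice describes, with an assumed collection "stats_hourly": each event upserts an $inc into its hour bucket, so a month's report reads at most about 720 documents instead of every raw event:)

    // on each analytics event, bump the counter for its hour bucket
    var hour = new Date()
    hour.setMinutes(0, 0, 0)

    db.stats_hourly.update(
      { metric: "page_view", hour: hour },
      { $inc: { count: 1 } },
      { upsert: true }
    )

    // constant-time monthly report: 30 * 24 = 720 buckets at most
    db.stats_hourly.find({
      metric: "page_view",
      hour: { $gte: ISODate("2016-05-01T00:00:00Z"), $lt: ISODate("2016-06-01T00:00:00Z") }
    })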
[21:04:21] <kurushiyama> GothAlice: Well... it feels... wrong. I see the benefit, but storing aggregation data... I always felt more comfortable with downsampling.
[21:12:47] <edrocks> kurushiyama: yea they are shipping quotes from various shipping companies
[21:13:10] <edrocks> kurushiyama: it's to compare which company to use and which warehouse is best to ship something from
[21:15:54] <kurushiyama> GothAlice: Can you share experience about performance of adding data? Most of my customers have to deal with several thousand events/minute. I guess we could do a lot on the UX side, but slowing down insert performance could easily translate to additional shards. I guess the loss is significant, isn't it?
[21:21:28] <GothAlice> kurushiyama: With two "consumers" and three "producers", I was able to get around two million bidirectional "requests" per second across MongoDB. That is very specifically, on the "producer" side: insert job record into real collection, insert notification into capped collection; and on the "consumer" side watch for the notification, load the job record, run the referenced function, update the job record with the result of running the
[21:21:28] <GothAlice> function, with three "notifications" inserted per job.
[21:22:32] <GothAlice> That's two million enqueued and executed tasks per _second_.
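(A bare-bones mongo shell sketch of the pattern GothAlice describes: a real job collection plus a capped collection carrying notifications that consumers tail. Names are placeholders; the fuller implementation is in the gist she links further down:)

    // one-time setup: small capped collection for notifications
    db.createCollection("queue", { capped: true, size: 1024 * 1024 })

    // producer: insert the job record, then a notification
    var jobId = ObjectId()
    db.jobs.insert({ _id: jobId, fn: "resize_image", args: [42], state: "pending" })
    db.queue.insert({ job: jobId })

    // consumer: tailable cursor blocks waiting for new notifications
    var cur = db.queue.find()
                .addOption(DBQuery.Option.tailable)
                .addOption(DBQuery.Option.awaitData)
    while (cur.hasNext()) {
      var note = cur.next()
      var job = db.jobs.findOne({ _id: note.job })
      // ... run the referenced function here, then record the result
      db.jobs.update({ _id: job._id }, { $set: { state: "done" } })
    }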
[21:24:42] <GothAlice> Did it live as part of a "this is what we've been working on" presentation during a Facebook developers mini-conference. Knocked socks off.
[21:25:28] <GothAlice> (We used it for messaging, immediate RPC, and scheduled RPC for a Facebook game we were working on at the time for the TV show The Bachelor. ;)
[21:25:37] <kurushiyama> GothAlice: On my 2011 MBA, I get maybe 2.5k inserts/s
[21:25:54] <GothAlice> Ah, but that's a single process inserting, no?
[21:26:16] <kurushiyama> GothAlice: Right, but still nowhere near your throughput.
[21:26:22] <GothAlice> If you run "top" or equivalent to watch CPU load, you may notice mongod not doing much with that level of stress.
[21:27:55] <GothAlice> Doesn't change the thermal profile of your CPU, though.
[21:28:09] <kurushiyama> Which reminds me of getting an egg ;)
[21:28:54] <GothAlice> Basically: don't run stress tests on an Air. Its small profile and lack of substantial active cooling means the CPU will simply refuse to "spin up" to Turbo Boost speeds. Especially once it gets to the temperature where it can cook an egg.
[21:31:00] <kurushiyama> GothAlice: Ah... wait. With mgo, i was somewhere at like 100k inserts/s (though I could have started a burger bbq on the CPU). And of course I do not do stress tests on an MBA. That's what I have a lab for ;) But 2M ops/s on a laptop is nevertheless impressive.
[21:31:28] <GothAlice> kurushiyama: https://gist.github.com/amcgregor/4207375 are extracts from my slides in that presentation (switching the project to MongoDB was my pet project there) with a link to a more complete implementation in the comments. :)
[21:31:30] <kurushiyama> Is there a cooling device in the shape of a BBQ grill?
[21:37:50] <kurushiyama> GothAlice: Jeez. A distributed task queue with automatic failover. Adding some shard tags into the mix, I can think of quite some great applications for that.
[21:47:00] <kurushiyama> GothAlice: Well, I am a bit out of Python (mainly doing Go now and did Java for a long time before that). But as I get it, it is distributable closures (more or less)?
[21:47:36] <GothAlice> They can be used that way, yes. Generators are the layer below coroutines. (A generator you can send() to… is a coroutine.)
[21:49:51] <kurushiyama> GothAlice: dang. distribute them, have them report back...
[21:50:15] <GothAlice> "Deferred processing pipelines" is how I'll be advertising it once the web framework projects calm down a bit.
[22:01:02] <kurushiyama> GothAlice: ? Calm down a bit? I have the feeling it is getting more crazy day by day...
[23:20:40] <hardwire> so the online docs for MongoDB are now claiming that MongoDB enterprise has different features in the core software than the open-source version, like in-memory datastore (beta).
[23:21:11] <hardwire> Are they starting to put commercial only code in a fork of the core server?