[05:46:35] <earendel> hi. i've just created a dump of a db via mongodump -h <host> -o <dir> .. now i'm a little bit perplexed about the size. the total dump is ~ 250MB, while the original db files are like 1.7GB (4 files ending in .0 to .3) .. how come there's such a big difference?
[06:55:52] <joannac> you might want to re-evaluate if you really need text search on those fields
[06:56:13] <earendel> also i'm starting to realize a document db was not the right choice here.. the use case is simply logging irc chat data, so i have the same structure on each entry.. yet the field names are stored in every entry anyway..
[06:56:36] <earendel> yeah. without the search it would be totally useless.
[06:57:23] <earendel> i think even with search, a simple csv file would be faster to search through
[12:03:02] <adrian_lc> hi, does anyone know why the update result in the mongo shell is in the format {"acknowledged" : true, "matchedCount" : 1.0, "modifiedCount" : 0.0} while pymongo gives {'updatedExisting': True, u'nModified': 1, u'ok': 1, u'n': 1}? I wanted to implement some logic based on modifiedCount, but nModified doesn't seem to match that behaviour
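A minimal shell sketch of the distinction being asked about, using a hypothetical "test" collection: an update that matches a document but leaves it unchanged reports matchedCount 1 / modifiedCount 0, and pymongo exposes the same counts as UpdateResult.matched_count / UpdateResult.modified_count (derived from n / nModified in the raw reply).

    // hypothetical collection, for illustration only
    db.test.insertOne({ _id: 1, status: "new" })
    // setting a field to the value it already has: matched, but not modified
    db.test.updateOne({ _id: 1 }, { $set: { status: "new" } })
    // { "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 0 }
    // in pymongo: result.matched_count == 1, result.modified_count == 0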
[12:51:45] <lenochka> how can I search for all documents which have a list of embedded documents, where none of the embedded docs has a specific year - 2016?
[13:01:42] <StephenLynx> and you need to query for a range.
[13:02:08] <StephenLynx> have you read all query operators?
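One way to express that query in the shell, sketched against a hypothetical "movies" collection whose documents carry an embedded "releases" array with a "year" field: $elemMatch matches documents where some array element has year 2016, and wrapping it in $not inverts the match.

    // documents whose "releases" array contains NO element with year 2016
    db.movies.find({
        releases: { $not: { $elemMatch: { year: 2016 } } }
    })
    // note: this also matches documents with no "releases" array at all;
    // add { releases: { $exists: true } } to the filter if that matters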
[14:45:40] <Shokora> hi i have a question about projection in the MongoDB PHP Library
[14:46:21] <Shokora> I have pretty big documents, and for an index page I'm making I would like only the id and a few other fields from each document, because of memory issues
[14:46:43] <Shokora> I specified an array and passed it with $options['projection'] to my find method, but it seems to use more memory after doing that than before
[14:47:02] <Shokora> are the projections applied in memory? is there a way to do the projection so that it is less memory intensive?
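For what it's worth, projections are applied by the server, so only the projected fields travel over the wire; if memory use goes up, the projection is probably not reaching the server in the shape it expects. A minimal sketch of the intent in the shell, with invented field names (the PHP library's ['projection' => [...]] option takes the same document):

    // return only _id, title and createdAt; everything else is stripped server-side
    db.articles.find(
        {},
        { title: 1, createdAt: 1 }   // _id is included unless excluded with _id: 0
    )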
[15:50:41] <Doyle> If one of three config servers is offline, the cluster availability isn't impacted, just the metadata, right? Or does the metadata have to be committed against all three for a chunk to be accessible?
[15:51:13] <Derick> you're correct - it's only that the metadata can't be changed
[15:56:04] <Doyle> If the first config server in your mongos connection string goes down, is there a timeout that mongos waits for before trying the second server in the string?
[15:59:43] <wrkrcoop> so i just created a table called users with a field called username
[15:59:52] <wrkrcoop> i then ran db.users.createIndex({"username"})
[16:00:11] <wrkrcoop> 2016-09-19T08:49:48.348-0700 E QUERY [thread1] SyntaxError: missing : after property id @(shell):1:32
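The shell error is because {"username"} is not a valid object literal; createIndex expects a document mapping field names to sort directions. A minimal sketch:

    // 1 = ascending, -1 = descending
    db.users.createIndex({ username: 1 })
    // optionally enforce uniqueness:
    // db.users.createIndex({ username: 1 }, { unique: true })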
[16:20:17] <StephenLynx> 127.0.0.1 means it will be only reachable from localhost.
[16:20:30] <StephenLynx> 0.0.0.0 means from all interfaces.
[16:20:35] <wrkrcoop> because mongod outputs a line that says 'connection accepted from 127.0.0.1'
[16:20:58] <wrkrcoop> so i have to provide a host for this other db i'm trying out: mongo: ./cayley init --db=mongo --dbpath="<HOSTNAME>:<PORT>" -- where HOSTNAME and PORT point to your Mongo instance.
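For reference on the 127.0.0.1 vs 0.0.0.0 point above: the interfaces mongod accepts connections on are controlled by its bind address. A sketch with example values only (27017 is the default port):

    # listen on localhost only:
    mongod --bind_ip 127.0.0.1 --port 27017
    # listen on all interfaces:
    mongod --bind_ip 0.0.0.0 --port 27017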
[16:25:27] <spacecrab> if anyone is bored and feels like chiming in on this issue i'd be happy to discuss https://github.com/gravcat/mongodb/issues/1 -- when i was trying to make a replica set as fast as possible for the fun of it, i found that rs.add() worked exactly how i needed, but rs.initiate(), which i attempted first, did not.
[16:26:10] <spacecrab> if no takers, i'll just dig through the docs at some point and resolve the issue on my own :p thought it might be fun to talk about hypothetical scenarios
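On the rs.initiate() vs rs.add() point: rs.initiate() also accepts a full configuration document, so a whole replica set can be brought up in one call instead of initiating one node and then adding the rest. A sketch with placeholder hostnames:

    rs.initiate({
        _id: "rs0",                          // must match the --replSet name
        members: [
            { _id: 0, host: "host1:27017" },
            { _id: 1, host: "host2:27017" },
            { _id: 2, host: "host3:27017" }
        ]
    })
    // or initiate with defaults on one member and grow the set:
    // rs.initiate(); rs.add("host2:27017"); rs.add("host3:27017")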
[18:12:13] <kuku1g> hi guys, can someone tell me if splitting up collections makes sense? for example, i have sensor data where some records contain geospatial data and some don't. should I split them into separate collections?
[18:13:14] <kuku1g> i feel like reading the data again would be more complicated. but would it speed up queries that only go to one or the other collection?
[18:14:59] <kuku1g> with "reading the data again" I mean reading both geospatial and "normal" data at the same time
[18:15:22] <StephenLynx> on your current model, both of these go into the same document?
[18:15:44] <kuku1g> they go to the same collection yes. they are different documents however
[18:16:19] <kuku1g> i have a lot of "small" documents atm
[18:16:27] <StephenLynx> do they refer to the same pivoting point?
[18:16:53] <kuku1g> Yeah, the data logically belongs together
[18:17:41] <StephenLynx> what's the pivoting point?
[18:19:08] <GothAlice> kuku1g: The note about "speeding up queries" encourages me to link http://www.devsmash.com/blog/mongodb-ad-hoc-analytics-aggregation-framework — an excellent article on pivoting time series data (their example is sensor buoys) in different ways, comparing the performance and storage costs. This is against mmapv1, not WiredTiger, of course; however, pre-aggregation would allow for O(1) reporting.
[18:19:44] <GothAlice> (O(1) in the sense that for a given period of time, say, a report covering 24h, with hourly time-slices, only 24 records would ever be evaluated to answer the report query. Constant time.)
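A sketch of what such an hourly pre-aggregated bucket might look like, with invented field names: one document per sensor per hour, its counters maintained as events arrive, so a 24-hour report only ever touches 24 documents.

    // hypothetical hourly bucket in a "stats_hourly" collection
    {
        sensor: "buoy-17",
        h: ISODate("2016-09-19T18:00:00Z"),   // timestamp snapped to the hour
        count: 3600,                          // readings received this hour
        sum: 72514.2,
        min: 18.1,
        max: 22.7
    }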
[18:22:11] <kuku1g> The thing is that I do not want to create reports or aggregate any data.
[18:22:30] <kuku1g> I am logging sensor data to use that data and run machine learning algorithms on it after pulling it out of mongodb
[18:22:58] <kuku1g> I think aggregating my data per minute, he calls it "Document Per Buoy Per Hour" in your link, might be fine because I usually read out large chunks of data. I can see how that speeds up my queries a lot.
[18:23:30] <GothAlice> kuku1g: Yikes. Without pre-aggregation, your queries are guaranteed to have variable performance, mostly dependent on good index coverage and the RAM state of those indexes.
[18:24:22] <kuku1g> My use case is read only. I write data once and read it out afterwards. Some corrections (= updates) might happen, like once every few months for a single log file (a single log file is 800MB to 8GB of json - so there's a lot of overhead)
[18:25:08] <kuku1g> as I mentioned, I have TBs of this data. I will always query minutes or tens of minutes of it.
[18:25:44] <GothAlice> Note that we perform aggregation into hourly buckets at work _and_ store the original events, too. The original events are preserved for a period of time for auditing purposes using capped collections, and using capped collections also lets other processes "listen live". The events (and pre-aggregated stats for reports, dashboard widgets, etc.) are certainly read-only.
[18:26:16] <GothAlice> kuku1g: In your case, per-minute buckets would potentially _greatly_ reduce the amount of data. If you're getting a sensor reading at 100Hz, for example, you're saving 100:1 every minute with pre-aggregation.
[18:26:51] <kuku1g> GothAlice: With aggregation you mean that you put events together in a bucket right?
[18:27:27] <kuku1g> Oh I see. Yeah I definitely have to use this then. That's a great starting point. Thanks
[18:27:46] <GothAlice> Aggregation is one thing; pre-aggregation basically performs any calculations early and stores them instead of, or in addition to, the original data, benefitting from the constant time querying later.
[18:28:43] <GothAlice> And sorry, always get orders of magnitude wrong. 100Hz sample rate would result in a 6000:1 savings using per-minute buckets.
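A sketch of the write path for per-minute pre-aggregation, again with invented names: each incoming reading upserts its minute bucket with $inc/$min/$max, so a minute's worth of 100Hz samples collapses into a single document.

    // 'ts' is the reading's timestamp, 'value' its measurement (both hypothetical)
    var minute = new Date(ts);
    minute.setSeconds(0, 0);                  // snap to the start of the minute
    db.stats_minute.updateOne(
        { sensor: "buoy-17", m: minute },
        {
            $inc: { count: 1, sum: value },
            $min: { min: value },
            $max: { max: value }
        },
        { upsert: true }
    )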
[18:28:57] <kuku1g> GothAlice: Would you de-normalize even further and create more "views" on the data? Like splitting the data up by minutes as well as hours?
[18:30:05] <GothAlice> You create "bucket sizes" that match how you will want to present the data. If your charts have three zoom or granularity levels, e.g. per-minute, per-hour, and per-day, you can avoid the 24x more expensive processing of all the hourly bucket data for daily reports by pre-aggregating into daily buckets, too.
[18:30:20] <GothAlice> Note however, that this is all a trade-off.
[18:30:35] <kuku1g> GothAlice: What about mixing geospatial queries with time-range queries? In my use case, I will have to match the geolocation of my sensor data to bounding boxes (one at a time obviously) and then query +15 minutes and -15 minutes of the timestamp of the geospatial records found
[18:30:37] <GothAlice> Slightly more work on every event (to update one or more buckets) vs. potentially huge amounts of work all at once later.
[18:30:37] <StephenLynx> yeah, more work on inserts.
[18:30:49] <StephenLynx> in general, it really pays off in some cases.
[18:32:14] <GothAlice> kuku1g: Also possible. You could have a "locations" field in each bucket which uses $addToSet to add each distinct location. You can then easily perform two queries: one to find by geospatial, the second to find the ±15 minute results from there?
[18:34:30] <GothAlice> Your request is clearly two queries, regardless of underlying storage: one geospatial, the other time-based. (Though some approaches can make this worse, i.e. two queries per sensor, etc.)
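A sketch of the two-step query described above, assuming invented collection and field names, per-minute buckets keyed on "m", and a "locations" array of legacy [lng, lat] pairs maintained with $addToSet: first find buckets whose locations fall inside the bounding box, then pull everything within ±15 minutes of each hit.

    // step 1: buckets with at least one recorded location inside the bounding box
    var hits = db.buckets.find({
        locations: { $geoWithin: { $box: [ [ 8.40, 49.00 ], [ 8.50, 49.10 ] ] } }   // example box
    }).toArray();

    // step 2: for each hit, fetch the surrounding +-15 minutes of buckets
    hits.forEach(function (hit) {
        var from = new Date(hit.m.getTime() - 15 * 60 * 1000);
        var to   = new Date(hit.m.getTime() + 15 * 60 * 1000);
        var window = db.buckets.find({ sensor: hit.sensor, m: { $gte: from, $lte: to } });
        // ... hand 'window' off to the ML pipeline
    });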
[18:39:18] <kuku1g> I thought I'd have buckets of, for example, 0..59 in my document, and each of them represents a minute within an hour. so when I query for timestamp "19-09-2016 - 20:28" + 15 minutes, what do i do? how do i treat the subdocument that contains the buckets as one big chunk of data?
[18:39:23] <GothAlice> kuku1g: If you only had buckets of one minute, you can easily query 10 minutes at once. You're processing 10x as many records, however. My previous example was assuming buckets of one hour, but queries covering a granularity of 24 hours. 24x as many records. Using only buckets of per-minute, but desired granularity of one day would result in a 1440x increase in the number of records processed to answer queries of that granularity.
[18:40:14] <GothAlice> https://gist.github.com/amcgregor/1ca13e5a74b2ac318017 is an example from some older code at work.
[18:40:32] <GothAlice> The second file, sample.py, represents a pre-aggregated bucket record.
[18:41:30] <GothAlice> You should be able to see lots of different counters (in this case, click data, so browser, platform, etc.) and the "h" hourly period field, which is a date/time with the minutes/seconds set to zero / snapped to the hour.
[18:42:55] <GothAlice> Note that there isn't one document per hour per metric in my latest example, but one document for all metrics per hour per distinct job. No arrays of sub-documents to worry about.
[18:43:08] <kuku1g> GothAlice: I need a granularity of hours. My log data does not span a whole day. Each log file contains sensor data that was logged at most ~8 hours each.
[18:43:25] <GothAlice> In my case, I can literally query the "h" field for $gte and $lt the target time.
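Concretely, with buckets keyed on an "h" field like in the gist, a time window is just a range query over "h"; a sketch with example values:

    // all hourly buckets for one job between 00:00 and 08:00 UTC on 2016-09-19
    db.stats_hourly.find({
        job: jobId,   // hypothetical reference, defined elsewhere
        h: {
            $gte: ISODate("2016-09-19T00:00:00Z"),
            $lt:  ISODate("2016-09-19T08:00:00Z")
        }
    })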
[18:44:43] <kuku1g> GothAlice: I see. that's actually sweet. I might adapt that to what I need. http://pastebin.com/ERbETVP1 this ain't exactly what you mean right?
[18:47:50] <GothAlice> Refer back to the original blog post I linked.
[18:49:39] <kuku1g> lmao. now I get the whole concept.
[18:50:49] <kuku1g> GothAlice: one more concern though. you said you would add all distinct locations (they are all distinct because they are taken from a GPS trace of a moving object) to a list. Why so? Why not keep the original GPS events in the "events" field?
[18:52:12] <GothAlice> kuku1g: Could you gist an example of your data, with, say, three measurements included?
[18:53:05] <GothAlice> Not pre-aggregated or anything, but what you consider to be the data relevant for three specific events and their surrounding context. The last pastebin is too abstract. ;P
[18:55:09] <kuku1g> gonna give you a coffee for your stolen 15 minutes lol :)
[18:56:14] <GothAlice> While I do have a Patreon, I tend not to point at it in channels I help support. Somewhat ironic lack of self-marketing, there, but I'm not here to profit, I'm here to help. ;P
[19:07:59] <kuku1g> I can't give away any of our data really
[19:08:14] <GothAlice> kuku1g: Fake the numbers if you have to. XP
[19:08:22] <kuku1g> No, just a description of our data layout really. Not MongoDB related
[19:10:13] <GothAlice> As you are using "table" terminology, and listing fields like a CSV file (instead of using JSON notation), I'll also link you http://www.javaworld.com/article/2088406/enterprise-java/how-to-screw-up-your-mongodb-schema-design.html
[19:31:15] <kuku1g> I can't see what you're missing from the latest pastebin except the descriptions, though?
[19:31:22] <kuku1g> my data do not have arrays or the like
[19:32:02] <kuku1g> there are just hundreds of thousands, up to xx million events of the same structure i posted in the pastebin before.
[19:32:27] <kuku1g> Remember, that's the "event by event" view of course. All events one by one.
[19:33:17] <GothAlice> If you can give me 40 minutes, I have a bit of a deadline for something at work, but then I can dive into helping you properly. :)
[19:39:40] <GothAlice> StephenLynx: I need to better randomize the list or something to keep enough funny / tongue-in-cheek ones mixed in to keep readers interested, apparently. ;^P
[19:41:47] <StephenLynx> not having a philosophy might as well be a philosophy in itself ¯\_(ツ)_/¯
[19:42:05] <StephenLynx> one of the reasons I go with free software and not open source.
[19:42:17] <StephenLynx> open source is a dozen rules that no one ever remembers
[19:42:28] <StephenLynx> free software is 4 simple things.
[19:43:21] <StephenLynx> >Github private repositories for hosting of Marrow related services, such as package index, documentation site, wiki, etc.
[19:43:31] <StephenLynx> you can do that on any system running gitlab for free btw.
[19:44:08] <GothAlice> StephenLynx: For various definitions of "free".
[19:44:37] <StephenLynx> you talking about free software or the private repository feature?
[19:44:47] <GothAlice> However, while I appreciate the critique of my Patreon milestones, feel free to PM them to me. One reason I've avoided linking it in the past is to avoid making it the topic of discussion.
[19:45:13] <StephenLynx> not like we're disrupting anything going on
[20:06:52] <kuku1g> GothAlice, StephenLynx: so i think that's how you want me to model the data: http://pastebin.com/D47FrbUi how far off am I?
[20:07:11] <synthmeat> no, this totally disrupts my bi-yearly mongodb question that ends up being a mongoose mongoosery
[20:07:56] <kuku1g> is the second model superior? if yes, why?