[11:22:17] <mo-irc> Hey guys, it seems that we're having a discrepancy between the index size on our primary node and our two secondaries. Our primary has a different index file size than both of our secondaries, yet all three have the same object count. We've also noticed that the indexes for all three databases grow at the same rate from the point where we saw the change in index size. I'm wondering if there is any problem that can arise from this and if this is
[11:22:17] <mo-irc> something that needs to be resolved, as it could cause a major issue. The specific stat we're looking at is from the following query: `dbStats.indexSize`.
[11:31:50] <tantamount> When aggregating, how do I pick the first element in a collection?
[11:32:28] <tantamount> {bar: "$foo.0.bar"} doesn't seem to walk the first element
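A minimal sketch for the first-element question above, assuming MongoDB 3.2+ and a pymongo handle named `db` (collection and field names are illustrative): the `$arrayElemAt` operator can pull a single element out of an array inside a `$project`, which the `"$foo.0.bar"` path cannot do in aggregation expressions.

```python
# Sketch, not tantamount's actual pipeline: take the first element of the `foo`
# array with $arrayElemAt (MongoDB 3.2+), then walk into its `bar` sub-field.
pipeline = [
    {"$project": {"first_foo": {"$arrayElemAt": ["$foo", 0]}}},
    {"$project": {"bar": "$first_foo.bar"}},
]
results = list(db.some_collection.aggregate(pipeline))  # `db` and collection name assumed
```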
[12:13:31] <cheeser> mo-irc: that might be a question of on disk storage efficiency
[12:13:41] <cheeser> i.e., the secondary indexes are more compact
[12:15:49] <mo-irc> cheeser: we have many databases and we only see this for a single database and this discrepancy is confusing us.
[14:04:33] <tantamount> When performing a grouping aggregation, how can I form a new field whose values come from two sources instead of just one? I want to combine two sets of values into one
[14:06:01] <tantamount> i.e. all_users: {$addToSet: ["$source_users", "$target_users"]}
[14:17:45] <tantamount> I guess I need a projection here
[14:22:50] <tantamount> Is there any way to specify a projection that includes everything already specified in the input document, without having to specify every field again manually?
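A hedged sketch of one way to get the combined field asked about above, using the `source_users`/`target_users` names from the example: `$setUnion` merges two arrays into one deduplicated array per document. On servers of this era there is no projection shorthand for "keep every existing field"; fields have to be listed explicitly, or the whole input document can be carried along under `$$ROOT`, as shown here.

```python
# Sketch (pymongo syntax; stage placement is illustrative): build all_users from both
# source arrays with $setUnion, and keep the original document reachable under "doc"
# via $$ROOT so later stages don't need every field re-listed by hand.
pipeline = [
    {"$project": {
        "doc": "$$ROOT",  # the full input document, available to later stages
        "all_users": {"$setUnion": ["$source_users", "$target_users"]},
    }},
]
```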
[15:07:21] <ams_> I want to validate BSON data in Python, looking at using pymongo's bson module. Is there anything else I should consider?
[15:31:11] <GothAlice> ams_: There are existing tools to allow you to declaratively define a schema, then handle incoming and outgoing BSON data. MongoEngine is my tool of choice for that. Full disclosure: I'm a contributor.
[15:31:38] <ams_> GothAlice: I just want something to read raw .bson data (output on mongodump) and make sure it's actually valid
[15:31:55] <GothAlice> If you are _really_ wanting to roll your own—can't blame you, it's fun and educational!—you'll still need some way to define the validation rules. I might recommend https://github.com/marrow/schema#readme which is a standalone declarative schema system + validation + typecasting library not tied to any database. Full disclosure: I wrote this. :P
[15:32:29] <ams_> I'm not talking schema validation, just validation that the bson is actually valid bson
[15:33:42] <GothAlice> Fastest way to do that: immediately mongorestore it. Testing, in your own toolchain, tools that are already extensively tested elsewhere doesn't sound like an efficient or necessary thing. :/
[15:33:55] <GothAlice> I.e. if mongodump exits with a zero status code, you can be pretty sure.
[15:35:31] <GothAlice> I had the opportunity to geek-slap a developer on my team when I noticed our own application's test suite was growing to include tests for bare pymongo behaviour. That was silly.
[15:36:03] <ams_> GothAlice: We had a case where mongodump was dumping corrupted bson data (or mongo db had been corrupted), so we're not really comfortable trusting it
[15:36:49] <ams_> This isn't for a unit test, it's our backup tool which dumps individual collections for historic backup. Once we dump we want to make sure what we dump is valid before we delete older ones
[15:36:52] <GothAlice> If your at-rest data is corrupted, yeah, there's not much you can do to protect yourself. You could add '--repair' as an option to mongodump, but I imagine that'd slow the process down.
[15:37:05] <ams_> The case we had, --repair did nothing
[15:37:14] <GothAlice> That's indicative of a much deeper problem.
[15:37:28] <ams_> Yup, but we're using mongo now ;-)
[15:37:45] <GothAlice> However, again, why write a custom tool instead of simply immediately attempting to mongorestore the backup to a temporary location?
[15:38:13] <GothAlice> If that fails (and can fail early with --stopOnError) the data is questionable?
[15:38:38] <ams_> Bear in mind that pymongo's bson module can import (and therefore validate) .bson, so my `custom tool` is effectively a try/except.
[15:40:53] <ams_> I just don't like putting load on my production database or running a whole separate database we have to maintain just to verify backups
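The try/except validation ams_ describes would look roughly like this; a sketch only, using the `bson` package that ships with pymongo (the file path is made up):

```python
import bson
from bson.errors import InvalidBSON

def bson_file_is_valid(path):
    """Return True if every document in a mongodump .bson file decodes cleanly."""
    try:
        with open(path, "rb") as handle:
            for _doc in bson.decode_file_iter(handle):
                pass  # decoding each document is the validation step
    except InvalidBSON:
        return False
    return True

print(bson_file_is_valid("backups/critical_collection.bson"))  # illustrative path
```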
[15:41:10] <GothAlice> I.e. string interning will hurt. Additionally, you'd have to un-chunk the on-disk format yourself. (Or, if it actually is a single BSON blob per on-disk collection file, you'd want as much RAM as you'd need to fit three times the collection size.)
[15:41:24] <GothAlice> ams_: I spin up temporary nodes all the time. It's very handy. :D
[15:42:09] <ams_> Yeah it's do-able, I'm just wondering if I can avoid it :)
[15:43:05] <ams_> looking at the load i'm getting when importing 100mb of bson though, I think we may have to go with your solution
[15:44:31] <GothAlice> ^ That can spin up a complete 2-shard, 3-replicas-per-shard cluster on a single machine. I use this to help with testing out sharding indexes. Customize to taste, though. :)
[15:46:21] <GothAlice> So your backup script becomes something like: https://gist.github.com/amcgregor/2a434343eaa4ed19a22a
[15:50:04] <GothAlice> ams_: To reduce impact, you can have the backups performed on a machine other than the actual database node itself. Then the "validate it" mongod cluster.sh provides won't impact the rest of the DB connections at all. Additionally, you can "nice" and "ionice" mongod, and if your dataset fits in RAM easily, you can mount the cluster.sh data directory as tmpfs (ramdisk).
[15:50:25] <GothAlice> Ramdisk would make things much, much faster, but again, only if you have the RAM for it. ;)
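The linked gist isn't reproduced here, but the dump-then-verify flow being discussed might look something like the following sketch: dump a collection, restore it into a throwaway mongod (assumed to already be listening on a scratch port, e.g. one started by the cluster.sh mentioned above), and treat a non-zero exit status as a questionable backup. Hosts, ports, and paths are illustrative.

```python
import subprocess

def dump_and_verify(db_name, coll, out_dir="backups", scratch_port=37017):
    """Dump one collection, then mongorestore it into a scratch mongod to verify it."""
    dump = subprocess.run(
        ["mongodump", "--db", db_name, "--collection", coll, "--out", out_dir])
    if dump.returncode != 0:
        return False
    restore = subprocess.run(
        ["mongorestore", "--host", "localhost", "--port", str(scratch_port),
         "--stopOnError", "--drop", out_dir])
    return restore.returncode == 0  # zero exit status == backup restored cleanly
```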
[16:07:33] <ams_> ha, i was verifying all my backups (30+) so that's why it was taking me so long :-). Now I just verify one it's 3 seconds, so I think I can stick with pymongo.
[16:08:15] <GothAlice> XD Bounds checking; not just a good idea!
[16:10:27] <GothAlice> ams_: As an interesting note, I don't backup using snapshots like that. When things get to a certain size, it's just not practical to hammer the DB server during each backup period. Instead, we have hidden replicas join the set and synchronize. Our oplog can cover ~70 hours of time, so we only really need to connect the backups that frequently, but they're always connected.
[16:10:57] <GothAlice> One is a live replica, so it has the current data within a few hundred milliseconds. A second is a 24 hour delayed replica allowing for recovery from user error.
[16:11:13] <GothAlice> Both are in the office, and can be queried at the office, making some reporting much faster for our staff. :)
[16:11:34] <GothAlice> Multiple birds with one feature: replication.
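A sketch of the member settings being described, as they might appear in a replica set config document of that era (host names are invented; the delay option was spelled slaveDelay at the time, and delayed members must have priority 0):

```python
members = [
    {"_id": 0, "host": "db1.example.com:27017"},      # primary-eligible data-centre node
    {"_id": 1, "host": "db2.example.com:27017"},      # primary-eligible data-centre node
    {"_id": 2, "host": "office1.example.com:27017",   # live in-office replica
     "priority": 0, "hidden": True},
    {"_id": 3, "host": "office2.example.com:27017",   # 24-hour delayed in-office replica
     "priority": 0, "hidden": True, "slaveDelay": 86400},
]
```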
[16:12:15] <ams_> GothAlice: I guess our use case is a little different. We have a load (100gb+) of non-critical (can be restored from elsewhere) data that we can lose. We then have some very critical data (100mb) that we can't lose. For this case we want historical (verified!) backups.
[16:12:53] <ams_> We also had a case (can't find the JIRA ref now, it is fixed) where a bug in mongo corrupted all of our replicas. So we're a little reluctant to rely on replicas to back up critical data.
[16:13:53] <GothAlice> Our own processes had a tendency, until 3.2, to segfault WiredTiger. ^_^ Certain setups just aren't as heavily tested, I guess.
[16:15:04] <ams_> ah actually, i'm misremembering, we didn't have a replica setup at the time. But concluded if it could happen to one there was no reason it couldn't happen to another.
[16:15:53] <GothAlice> The joy of a properly configured replica set is that if one member goes down and doesn't come back up a la that ticket, the rest of the cluster can happily continue operating. (High availability.)
[16:16:31] <ams_> Sure but if all are hit by the same issue then we'd have lost our data
[16:16:32] <GothAlice> Once the "broken" node is online again, it'll reconnect, see that it isn't the latest, and either catch up (if possible) from the new primary or fully re-sync from the primary or a secondary.
[16:18:03] <GothAlice> "If my entire data center goes dark" is something you can plan for. As is planning at the rack level, or single machine level. You run disks in RAID to ensure the failure of one doesn't stop everything, the same holds true for database servers and whole data centres, at appropriate scale. ;P
[16:19:52] <ams_> We run in one data centre, so a power cut + SERVER-20304 would have lost all our data, without any replica set. Which is what has led me down this path.
[16:20:45] <GothAlice> Additionally, that's wrong. In the case I described at work, even if Texas (where our main DC is) gets nuked, our data is safe _because_ of replication. There's a set of replicas in Montréal. :P
[16:20:47] <ams_> When we hit it, the node was not in a replica set. But nothing led me to believe it was a no-replica-set-only thing.
[16:21:07] <ams_> Wrong for you, I'm talking about me - we run in one data centre
[16:21:20] <GothAlice> … do you work for a company with an office?
[16:21:35] <GothAlice> The Montréal replicas are literally at our office. :P
[16:21:51] <GothAlice> There's a Mac Mini in one corner doing continuous integration and those replicas.
[16:22:19] <ams_> But 100mb/100gb is all that's really important, let's say we have 5 data centers with this setup (all unrelated, but running the same setup). I don't want 5 x 100gb of data being replicated down here for 5 x 100mb.
[16:23:37] <ams_> Also, I can't really explain to someone that we lost their critical data because it was only being backed up to a mac mini ;-)
[16:24:13] <GothAlice> Uhm. Why the 5x? The "offsite backup replicas" in the office would connect to a primary (there can be only one) once. Replication is the only real path to high availability in MongoDB, and you seem really stuck in thinking it _reduces_ availability.
[16:24:55] <GothAlice> You also seem to have the misconception that every node talks to every other node and shares all data that way (the 5x thing?) which is not the case. In a given replica set, there is only one primary.
[16:25:40] <GothAlice> You also have full control over where each secondary pulls its data: https://docs.mongodb.org/manual/tutorial/configure-replica-set-secondary-sync-target/
[16:27:02] <GothAlice> So in the case of having two replicas in the office, one pulls from the primary (most up-to-date) and the 24 hour delayed one pulls from the in-office secondary (because a little lag when we're adding 24 hours of lag isn't a problem, and it saves on office bandwidth as you mention).
[16:27:42] <GothAlice> (That last bit is called Chained Replication: https://docs.mongodb.org/manual/tutorial/manage-chained-replication/)
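A sketch of the sync-target piece of that setup via pymongo (host names invented): chained replication is governed by the replica set's settings.chainingAllowed flag, and a specific secondary can be pointed at another secondary with the replSetSyncFrom admin command covered by the linked tutorial.

```python
from pymongo import MongoClient

# Connect directly to the delayed in-office secondary and tell it to sync from the
# other in-office secondary rather than from the remote primary.
secondary = MongoClient("office2.example.com", 27017)
secondary.admin.command("replSetSyncFrom", "office1.example.com:27017")
```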
[16:28:22] <ams_> I'm not explaining very well.. the single mongodb deploy I'm describing will be deployed to 5 separate customers, all with their own separate data
[16:28:46] <GothAlice> Aye, but each as a single node, right?
[16:29:10] <GothAlice> I.e. customer A gets mongod A, customer B gets mongod B, …
[16:30:02] <tantamount> During aggregation projection, how can I check if a field is in a collection? I can't use $setIsSubset because the field is an object, not an array! I can't wrap the object in an array because it does not resolve paths in arrays, i.e. ['$user'] is just an array with the string '$user' in it, not the user object!
[16:30:15] <ams_> snapshots of the critical data, yes
[16:32:40] <GothAlice> "field in a collection" is nonsensical at the level of an aggregation projection; you're dealing with documents in that collection, not the collection.
[16:33:28] <tantamount> I don't mean collection like that
[16:33:41] <tantamount> How can I see if an object is in an array
[16:33:50] <tantamount> I cannot wrap it in a single element array because it doesn't resolve fields that way
[16:35:15] <GothAlice> $unwind on the array, $match what you're hoping to match, $group to re-pack the array. If the array might not even be present, you can use ifNull to provide a default empty array to unwind on. If the field might be an array of things, or a single thing, that gets into the realm of "fix your data first, then try again". ;)
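A sketch of the unwind/match/re-group shape GothAlice describes, assuming the field names tantamount gives later (a scalar `user` and an array `users`); documents whose `users` array is missing or empty drop out at the $unwind unless that is handled separately (e.g. with the $ifNull default mentioned above):

```python
pipeline = [
    {"$unwind": "$users"},  # one output document per array element
    {"$project": {"user": 1, "users": 1,
                  "matched": {"$cond": [{"$eq": ["$users", "$user"]}, 1, 0]}}},
    {"$group": {"_id": "$_id",
                "user": {"$first": "$user"},
                "users": {"$push": "$users"},             # re-pack the array
                "user_in_users": {"$max": "$matched"}}},  # 1 if any element equalled user
]
```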
[16:36:12] <GothAlice> tantamount: But an example of the document structure you are trying to query would be extremely useful in trying to figure out what you're doing. Gist or pastebin an example?
[16:36:26] <tantamount> It's as simple as I stated
[16:36:37] <tantamount> I have a field, user and it's a DBRef
[16:36:43] <tantamount> And I have a field, users, it's a collection of DBRefs
[16:36:46] <tantamount> I want to see if user is in users
[16:38:27] <tantamount> Because at this point I'd rather throw in the towel than go down that road
[16:38:42] <tantamount> I'm already 8 stages deep in this aggregation nightmare
[16:38:43] <GothAlice> That's, unfortunately, not a technical problem I can help you with.
[16:39:30] <tantamount> And having to unwind just to perform a simple "in" test _just because one of my fields is not an array_ is the kind of banal insanity that will actually break the camel's back
[16:40:04] <cheeser> have you shared sample docs and your agg pipeline yet?
[16:40:26] <GothAlice> tantamount: An actual example document is way more useful than the text description given so far, which is confusing and mis-uses terminology. Help us help you, here.
[16:41:01] <tantamount> I misused collection as array
[16:41:18] <tantamount> There is nothing more to share
[16:41:40] <GothAlice> It's also very important to understand that the aggregation pipeline processes a single object (document) per stage. If you need to deal with a field that is an array, you need to unwind it to get each array element to be considered by the pipeline. This isn't "banal insanity", it's a parallelizable processing pipeline.
[16:41:43] <tantamount> If there is no operator to test if a value is in an array then I must find a way to cast the value to an array
[16:42:25] <tantamount> That would completely destroy my aggregation
[16:42:50] <GothAlice> You haven't shown us an aggregation to destroy, yet. >_<
[16:43:20] <tantamount> I am not ultimately trying to select documents that match a user in that collection, the only thing that will happen in that case is I will output a 1 instead of a 0 so that I can ultimately AGGREGATE using grouping
[16:44:40] <tantamount> Do you honestly think I'm at liberty to dump my data and partial solution?
[16:50:31] <tantamount> It has been used extensively elsewhere during the aggregation
[16:52:57] <tantamount> What is the easiest way to convert a value to a single-element array?
[16:58:08] <regreddit> tantamount, you can not even post a notional document, so that those trying to help you can understand the structure of a document? One with made up data?
[16:58:55] <tantamount> What would be the POINT? Every field I haven't mentioned is not RELEVANT to the problem
[16:59:44] <tantamount> The only way to know if a value is in an array is to use set operators and set operators only accept array arguments
[16:59:53] <Derick> tantamount: the point is that people in general prefer concrete data to play with, as opposed to an abstraction.
[16:59:59] <cheeser> we're trying to help but we can't force you to give the context that might help.
[17:00:12] <Derick> tantamount: I would also recommend you stop using CAPS to make your point, and using words like "fucked".
[17:00:24] <tantamount> I would recommend you stop being over-sensitive
[17:00:25] <GothAlice> I am not able rightly to apprehend the kind of confusion of ideas that could provoke this approach to problem solving.
[17:01:08] <tantamount> The truth is, you *might* be right that it would still be possible to calculate all totals correctly with a series of unwinds but it would still devolve into madness
[17:01:32] <tantamount> The only way to keep the solution sane is with set operators
[17:01:43] <cheeser> wow. ok. well, i'm done with this one. that's a terrible attritude for someone seeking free help.
[17:02:17] <tantamount> What is? Not wanting to use unwind?
[17:04:25] <tantamount> It's not as if I have some sort of aversion to using unwind; I'm using it to get this far, but only because it made sense. Past this point it's breaking documents into such tiny pieces that it almost becomes impossible to reason about what they represent
[17:05:26] <tantamount> That isn't really a satisfactory solution when the only reason I can't use set operations is because Mongo doesn't support searching for a scalar in an array even though it supports comparing arrays against arrays
[17:06:17] <tantamount> Even a hack to convert a value into a single-element array would be preferable to a series of additional unwind and group steps
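For what it's worth, one route that avoids both the extra unwinds and the single-element-array hack (a sketch, not something offered in the channel, using the `user`/`users` field names from above and 2.6-era operators) is to map the array to booleans and collapse it with $anyElementTrue:

```python
# Membership test inside a $project: map each element of `users` to a boolean saying
# whether it equals `user`, then $anyElementTrue collapses that to a single true/false,
# which $cond turns into the 1/0 flag described above.
is_member = {"$anyElementTrue": [{
    "$map": {"input": "$users", "as": "u", "in": {"$eq": ["$$u", "$user"]}}
}]}
stage = {"$project": {"user_in_users": {"$cond": [is_member, 1, 0]}}}
```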
[17:12:22] <tantamount> How are the other set operations "flat"?
[17:12:45] <cheeser> projections are about field selection not matching criteria
[17:13:55] <tantamount> I understand that, perhaps what I didn't understand was $in
[17:15:08] <tantamount> The other idea would be to permit the comparison set operations like setEquals and setIsSubset to allow any argument type as their first argument
[17:54:40] <GothAlice> At least with a blog post about it, there'd probably be actual examples. >_<
[17:55:28] <GothAlice> thedoza: SSL requires either Enterprise MongoDB or recompiling MongoDB from source yourself with the relevant option enabled.
[17:55:52] <GothAlice> thedoza: On Gentoo this is pretty easy, add the "ssl" use flag and re-install.
[17:56:02] <GothAlice> thedoza: Elsewhere… less easy.
[17:59:10] <thedoza> Thanks for the response.. If I add the option from the shell it seems to take it (--sslMode preferSSL) and it asks for the PEM etc
[17:59:27] <thedoza> it's only if I use the config file that it seems to complain
[18:00:35] <thedoza> OpenSSL version: OpenSSL 1.0.1k-fips 8 Jan 2015
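For reference, the YAML-format equivalent of the command-line flags thedoza mentions would look roughly like this (paths are illustrative; the older key=value config format does not understand these options, which is one common reason a config file "complains"):

```yaml
net:
  port: 27017
  ssl:
    mode: preferSSL
    PEMKeyFile: /etc/ssl/mongodb.pem   # illustrative path to the combined cert+key PEM
```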
[18:22:34] <MacWinner> hi, when you turn on wiredtiger, does it get turned on for all my databases? e.g. my activity database as well as my gridfs database
[19:11:05] <regreddit> MacWinner, yes, but you'll want to mongodump your db's before upgrading/switching
[19:11:27] <regreddit> then start the mongod process with the WiredTiger storage engine in your config
[19:11:44] <regreddit> then mongorestore your DBs, and they will be converted to the WiredTiger engine
[19:13:05] <regreddit> MacWinner, if you are using a replica set, you can upgrade a secondary to WiredTiger, and its initial sync with the primary will convert it to WiredTiger
[19:13:16] <regreddit> that's the way I did it when I upgraded
[19:15:28] <MacWinner> regreddit. got it, thanks! here is my plan for my replica set: shut down a secondary, update its config to YAML (since I'm on the old config format right now). Start the secondary, make sure everything works with the new YAML config. Shut down the secondary, upgrade binaries to 3.2.1, update the YAML to use wiredTiger, delete the data files. Start the secondary and let it sync.
[19:15:53] <MacWinner> do that for all secondaries.. then failover primary.. then upgrade replica protocol version
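The storage-engine part of the YAML config MacWinner mentions would be along these lines (dbPath is illustrative):

```yaml
storage:
  dbPath: /var/lib/mongodb   # illustrative; the existing data files here get removed
  engine: wiredTiger         # takes effect when the node re-syncs with empty data files
```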
[19:15:56] <regreddit> i believe there is a minimum version you can sync to 3.x with
[19:52:46] <Waheedi> any documentation or examples on this? I've been trying to use it to change the ReadPreference for DBClientReplicaSet: mongo::ReadPreferenceSetting(mongo::ReadPreference ReadPreference_Nearest);
[21:26:41] <Waheedi> what are other available options for building with scons other than --use-system-all?
[22:41:47] <l1ghtsaber> Hello, is there a core-developers channel somewhere? I've got a couple of doubts about the internal APIs.
[22:53:32] <cheeser> l1ghtsaber: your best bet is the mongodb-dev Google group
[23:40:57] <MacWinner> if you're using wiredtiger and do show dbs, does this show the decompressed size of the database? or compressed?