[00:15:19] <gottreu> I have a mongodb instance on a VPS. Can I safely hand out mongodb accounts to strangers on the internet without fear?
[00:25:42] <StephenLynx> if you set authentication properly, yes
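A minimal sketch of what "set authentication properly" can look like in the mongo shell; user names, passwords, and database names are placeholders, and mongod must be started with security.authorization enabled:

    // administrative account, created first on the admin database
    use admin
    db.createUser({
      user: "admin",
      pwd: "a-strong-password",
      roles: [ { role: "userAdminAnyDatabase", db: "admin" } ]
    })

    // a per-stranger account confined to its own database
    use guestdb
    db.createUser({
      user: "guest1",
      pwd: "another-strong-password",
      roles: [ { role: "readWrite", db: "guestdb" } ]
    })

With authorization enabled, an account like guest1 can read and write only guestdb and cannot run admin commands or touch other databases.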
[04:58:41] <jr3> can someone provide some clarity on "the dataset should fit in memory"? Does this mean that if I have a db that's 16 gigs I should have 32 gigs to bring it all into memory?
[05:04:44] <Boomtime> @jr3: where does it say "the dataset should fit in memory"?
[05:04:55] <Boomtime> perhaps you mean "the working set should fit in memory"?
[05:05:29] <jr3> how do I determine what my working set is?
[05:05:32] <Boomtime> working set is a bit trickier to work out - there is some documentation on how to go about that but it gets a bit waffly
[05:05:52] <Boomtime> you can start with how broadly you hit indexes, and how many of them you hit
[05:06:12] <Boomtime> you really want the pertinent parts of indexes to fit into memory together
[05:06:34] <Boomtime> db.stats() and the like are your friend
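A rough sizing pass in the mongo shell, assuming a collection called orders (the name is illustrative):

    db.stats()                      // dataSize and indexSize for the current database
    db.orders.stats().indexSizes    // per-index sizes, in bytes, for one collection
    db.orders.totalIndexSize()      // total size of all indexes on that collection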
[05:47:31] <Waheedi> i currently have two kvm-hosts, each with two guests running mongod (replica set). the two hosts are in different physical locations, and i have applications that are mainly doing reads on the db in both locations. what's the best way to make the applications read only from the hosts in their own physical location? I can take care of the writes problem :)
[05:47:47] <Waheedi> would nearest read preference be my best option?
[06:44:01] <Boomtime> there is also a threshold value, within which all servers will be considered equal - this value is up to the driver, but i think they default to 15ms
[06:44:35] <Boomtime> anyway, whatever the value is, if all servers measure ping within that threshold range then they are all equal according to 'nearest'
[06:44:40] <Waheedi> yeah i just wanted to make sure that ping is a major factor
[06:45:15] <Waheedi> I'm having 80ms on some far away nodes
[06:45:32] <Waheedi> so 15ms would definitely default to my closest secondaries or primary!
[06:46:07] <Boomtime> ok, so example time; if your app measures ping to server 'A' at 10ms, and server 'B' at 30ms, then nearest will pretty much always go to server 'A'
[06:46:47] <Waheedi> and do you know how often it updates its ping values?
[06:47:05] <Boomtime> but if server 'B' improves a little, and server 'A' gets a bit burdened and slows time slightly, making pings come out at 13ms for server 'A' and 25ms for server 'B', then suddenly they are 'equals'
[06:47:19] <Boomtime> slows time? i meant, slows down
[06:47:41] <Boomtime> update rates are driver specific as well... but i think there is a guide in the spec
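For reference, a hedged sketch of how this is usually expressed; the exact option name and default threshold can vary by driver, but in the mongo shell and a standard connection string it looks roughly like this:

    // mongo shell: prefer the lowest-latency member for reads
    db.getMongo().setReadPref("nearest")

    // connection string (hostnames illustrative); localThresholdMS is the
    // "treat pings within this window as equal" value discussed above
    // mongodb://local-node.example.com,remote-node.example.com/?replicaSet=rs0&readPreference=nearest&localThresholdMS=15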
[07:08:01] <Boomtime> all writes go to the primary regardless, so that value doesn't matter; a query must go to a functional and accessible member first and foremost, and then to the most appropriate one if there are multiple candidates
[09:48:01] <m3t4lukas> grepwood: then please use the ubuntu system monitoring tools. you could use top to see whether the process is running and to check RAM usage, you could view the logs of mongod, and you could use df to analyze your disk space
[09:48:25] <grepwood> m3t4lukas, please tell me it's not a graphical program
[09:49:00] <m3t4lukas> there are also mongostat and mongotop for diagnosing and monitoring purposes
[10:23:32] <Derick> it's btw not weird if you look at so many other things with geo
[10:23:53] <Derick> bounding box coordinates, can be any of : n, s, e, w; n, e, s, w; ... and reversed versions
[10:24:14] <Derick> and IMO, having it follow x, y, z (like in proper charts) also makes sense to me
[10:26:33] <Derick> http://www.macwright.org/lonlat/ is a nice FAQ on it
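In practice that means GeoJSON points in MongoDB are stored longitude first (x, then y); a small sketch with illustrative values:

    db.places.insert({
      name: "somewhere",
      loc: { type: "Point", coordinates: [ -122.4, 37.8 ] }   // [longitude, latitude]
    })
    db.places.createIndex({ loc: "2dsphere" })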
[13:49:34] <jamiejamie> Hey, I've got a pretty simple database to make, but I'm really unsure how to structure it. It's a database that'll store some information about stuff in a game, for example.. "cards, abilities, heroes". I'm considering just putting these all in an "items" collection, since I want to be able to search all from a single box and just get the closest match, is that really bad practice, just having one big collection?
[13:49:37] <jamiejamie> Or should I separate into cards, heroes, abilities collections, and then have an "items" collection which just has records that points to the ID of its respective item
[13:49:57] <jamiejamie> Almost like an index collection I guess, which would just purely be used for autocomplete
[14:03:26] <m3t4lukas> jamiejamie: only put polymorphic stuff into the same collection, at least as far as best practice is concerned. You can put into one collection whatever you think needs to be together
[14:04:26] <jamiejamie> I mean it could work in both cases I guess m3t4lukas, they're all technically "items", but I feel like smaller collections by item "type" is still the way to go, but maybe that's SQL talking to me
[14:04:47] <jamiejamie> I guess I came here for reassurance that splitting into smaller collections like that is OK
[14:05:19] <deathanchor> splitting now ensures you don't have to split it later :D
[14:18:38] <dretnx> I mean the way stream works in node
[14:18:44] <cheeser> does the node.js api support that? i think the java api just takes an InputStream
[14:28:26] <StephenLynx> dretnx, yes. that implementation is on mongo side.
[14:28:56] <StephenLynx> the node driver just implements the spec.
[14:32:59] <StephenLynx> the important question is whether your driver implements a high-level behavior on top of the spec or whether you will have to implement the stream abstraction yourself.
[14:55:00] <m3t4lukas> jamiejamie: I'd definitely use different collections. Even if it's just for the sake of aggregation performance
[15:45:37] <Svedrin> is there an operator or stage in the aggregation pipeline that does the equivalent of "for( var varname in this.data ){ emit(varname, this.data[varname]) } "?
[15:46:10] <Svedrin> so, kinda like $unwind, just for objects instead of arrays
[15:55:40] <GothAlice> Svedrin: Well, $project can sorta do that. It lets you re-name fields, including un-nesting them if desired. The top level is always a document, though, not a bare value, unlike map/reduce which may use bare values. I think. Haven't used map/reduce since aggregates were introduced. :P
[15:56:42] <cheeser> but with a $project, you'll have to list each field explicitly.
[15:56:47] <Svedrin> GothAlice, I didn't find a way to use $project without having to explicitly specify all the field names... which I can't really do because data can have kinda-arbitrary keys
[15:57:10] <cheeser> i should write a $flatten. see just how hard it is to write new operators.
[15:58:05] <Svedrin> the documentation states »you don't want to store {key: value}, store [{key: "key", value: "value"}] instead«
[15:58:13] <Svedrin> but I don't really understand the difference
[15:58:45] <Svedrin> what is it that makes a list of {key: k, value: v} dicts easier to process than a {k: v} dict itself? :/
[15:59:29] <GothAlice> I have a slightly higher level example of that mandate from the documentation.
[16:00:39] <GothAlice> When you have "arbitrary" fields, if you ever need to search the contents of those fields, or search for the presence of one of those fields, you either have to manually construct indexes for each one (which is kinda nuts, and very expensive), or, for the latter, you have to use $exists, which can't use an index.
[16:01:25] <GothAlice> The "good" version stores the name and value as sub-document fields in an array/list. It's then trivial to index both the field names and values, while still maintaining the ability to add and remove them, and update them individually using $elemMatch updates.
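A sketch of the two shapes being compared; collection and field names are illustrative:

    // hard to index: arbitrary top-level keys
    // { _id: 1, data: { colour: "red", size: "XL" } }

    // the key/value ("attribute") form GothAlice describes
    db.items.insert({
      _id: 1,
      data: [ { key: "colour", value: "red" },
              { key: "size",   value: "XL" } ]
    })
    db.items.createIndex({ "data.key": 1, "data.value": 1 })   // one index covers every attribute

    // find by attribute
    db.items.find({ data: { $elemMatch: { key: "colour", value: "red" } } })

    // update a single attribute in place
    db.items.update(
      { _id: 1, "data.key": "size" },
      { $set: { "data.$.value": "L" } }
    )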
[16:06:12] <GothAlice> cheeser: https://jira.mongodb.org/browse/SERVER-15815 is now more than a year old. *prods this to get fixed before adding new things*
[16:06:33] <GothAlice> Since that ticket is solely responsible for marrow.task not being released.
[16:06:57] <StephenLynx> oh yeah, that expiration of tailable cursors is ass.
[16:07:27] <GothAlice> I find it mildly amusing that it flies in the face of how queries are documented to work. Tailable cursors are special snowflakes.
[16:07:28] <StephenLynx> it also boned me when I had to add some sort of IPC between the daemon and the terminal
[16:07:41] <StephenLynx> I ended up using unix sockets
[16:07:55] <GothAlice> (They shouldn't be special snowflakes.)
[16:08:02] <StephenLynx> which caused me a different plethora of issues, but at least it worked.
[16:08:39] <cheeser> well, TCs were added later to support replication. not entirely surprising they'd be a slightly different code path.
[16:08:51] <GothAlice> cheeser: marrow.task being a pure-MongoDB alternative to Celery or other distributed RPC / task workers, with support for things Celery can only dream of, like deferred multi-host generator chains.
[16:08:52] <cheeser> still, the inconsistency is a bit of a pita, i'm sure.
[16:09:13] <cheeser> i was just thinking of writing a messaging/queuing POC on mongodb actually.
[16:09:55] <GothAlice> https://gist.github.com/amcgregor/4207375 < m.task has basically existed for 4 years, and this has been a problem the entire time.
[16:09:55] <StephenLynx> GothAlice, why not using TCP for multi-host messaging?
[16:10:18] <GothAlice> StephenLynx: 1.9 million DRPC calls per second with two consumers (workers) and four producers on one host.
[16:10:25] <GothAlice> StephenLynx: Because performance isn't the bottleneck. :P
[16:10:41] <StephenLynx> ok, but mongo just can't do that due to that expiration on tailable cursors.
[16:10:48] <GothAlice> (And being query-able has all sorts of benefits.)
[16:11:09] <GothAlice> A task worker/runner never "stops", really, so timeouts are a non-issue.
[16:11:17] <GothAlice> At least for fire-and-forget tasks.
[16:11:37] <GothAlice> It's the producer waiting on the result of one of those tasks that gets taken out back and shot by the timeout behaviour.
[16:12:32] <GothAlice> This mostly matters for the aforementioned distributed generators, where multiple workers are (basically) waiting on each-other.
[16:14:34] <GothAlice> Fun fact: from the project those presentation slides came from, we calculated out that one million simultaneously active games would require 8GB of capped collection to avoid roll-over. I've never heard of people using capped collections that large before. XP
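For context, a minimal sketch of the capped-collection-plus-tailable-cursor pattern being discussed (names and sizes are illustrative, legacy mongo shell syntax):

    // a large capped collection used as a message log
    db.createCollection("task_log", { capped: true, size: 8 * 1024 * 1024 * 1024 })

    // a worker tailing it; awaitData keeps the cursor blocking for new documents
    var cur = db.task_log.find()
                         .addOption(DBQuery.Option.tailable)
                         .addOption(DBQuery.Option.awaitData)
    // note: the cursor can still die/time out, which is the complaint above
    while (cur.hasNext()) { printjson(cur.next()) }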
[16:31:30] <tantamount> Is it possible to sort a collection in a document during an aggregation? That is, I want to sort a field rather than the entire collection of documents
[16:32:03] <GothAlice> tantamount: Hmm, $unwind, $sort, $group would do it.
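A sketch of that pipeline, sorting an array field called scores inside each document (field names are assumed):

    db.games.aggregate([
      { $unwind: "$scores" },
      { $sort: { _id: 1, scores: -1 } },                          // order the values within each _id
      { $group: { _id: "$_id", scores: { $push: "$scores" } } }   // reassemble the array in that order
    ])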
[16:34:04] <GothAlice> cheeser: Wagh; $sample was added, yay, but doesn't accept a PRNG seed so it's impossible to have deterministic behaviour. :(
[16:36:26] <GothAlice> Consider: "featured products today" with a seed that changes at midnight.
[16:37:28] <GothAlice> (That'd let you, with say, the current Julian calendar date as the seed, easily go to any point in time to see what the featured products were on that day.)
[16:39:12] <cheeser> GothAlice: file a Jira (tee hee) but that's not quite what $sample was intended for
[16:39:55] <GothAlice> Random sampling of documents implies a RNG or PRNG, docs mention a PRNG, it needs to be seeded. Not allowing that seed to be passed through is unintentionally restrictive on use of the stage.
[16:39:55] <cheeser> it's intended to get a representative sample of your documents for schema evaluation (both in Compass and the BI Connector)
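For reference, the 3.2 stage takes only a size; there is nowhere to pass a seed (collection name illustrative):

    db.products.aggregate([ { $sample: { size: 3 } } ])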
[16:41:21] <GothAlice> https://jira.mongodb.org/browse/SERVER-22068 is close.
[16:41:49] <GothAlice> https://jira.mongodb.org/browse/SERVER-22069 okay, ouch, this rabbit hole just keeps going. XD
[16:42:35] <cheeser> 3.4 planning is going on now so now would be the time to file/vote
[16:42:48] <cheeser> *might* be a 3.2.x fix depending...
[16:43:30] <GothAlice> "We've done quite a bit of testing to confirm the WiredTiger PRNG doesn't cycle and returns uniform results across the space…" — that's worrying. Uniform distribution is also un-random. 1,1,1,1,1,1,1,1,1 is just as likely a random sequence as any other, despite apparent bias. :P
[16:45:39] <GothAlice> Entropy correlation testing is its own thing. The key is that the next generated value shouldn't be predictable, not that it isn't the same as the last few generated. ;)
[17:50:56] <Waheedi> its really huge this mongo :)
[17:51:17] <GothAlice> A database engine is no simple affair.
[17:51:30] <Waheedi> surely but not that huge too :P
[17:51:31] <NoOutlet> The Advocacy Hub advertises it like "Nominate yourself for the Masters Program." I don't consider myself a master, but I do consider you one.
[17:52:23] <NoOutlet> But if you're not interested in the Advocacy Hub and all that stuff, no worries.
[17:53:03] <GothAlice> NoOutlet: There are too few hours in the day; I haven't investigated it, but I will do so in the near future.
[17:53:31] <GothAlice> Waheedi: You have me curious. I'm now collecting the source for MongoDB, Postgresql, and MariaDB, and will now run sloccount across them for comparison. :)
[17:57:34] <GothAlice> Well, that was just the "core server"; none of the fancy additions like JSON support and such.
[17:59:04] <Waheedi> thank you for that GothAlice, its always good to know
[17:59:55] <Waheedi> the weather is a bit crazy in Sacramento today the trees about to flee :)
[18:02:17] <NoOutlet> Well I have to go. I'm at work and probably chatting on IRC is not okay. Hopefully I'll have some more free time soon though and I've got a MongoDB project that I'm working on so I'll see you all later.
[18:24:26] <GothAlice> Also, any way to challenge the MongoDB U course certifications? :D
[18:24:54] <StephenLynx> what even is this advocacy hub?
[18:26:58] <GothAlice> http://s.webcore.io/2t3V3C2Z1o3g < basically a gamified education and outreach portal. Complete challenges, rank up. Challenges range from "go over here and read this important bit of documentation" to "write a blog post".
[19:04:08] <zivester> hmm.. i don't think a mongodump/mongorestore can restore to a new database with the --archive option, is that true?
[19:57:24] <Waheedi> at which version did ReadPreferenceSetting get introduced to dbclient? I need to use the nearest read preference and I'm not sure which version closest to 1:2.0.4-1ubuntu2.1 has ReadPreferenceSetting enabled
[20:00:21] <GothAlice> Waheedi: Sharding is required for readPreference support, i.e. the old replication mechanism used "slave_ok" instead. That'd make it 2.2.
[20:02:02] <GothAlice> Rather, the new replication mechanism, not sharding itself. (Replication vs. master/slave.)
[20:03:12] <MacWinner> GothAlice, are you upgrading to 3.2 with WiredTiger?
[20:03:37] <GothAlice> MacWinner: I am. My stress test issues have been resolved for that version, and I'm using WT with compression in production to great effect now.
[20:04:10] <MacWinner> awesome! what's your compression ratio looking like?
[20:05:21] <GothAlice> That'll be very dataset-dependent. On one production set, which is mostly natural language text, I'm getting a ~60-70% reduction. On my pre-aggregate data which already squeezes every possible byte out through key compression, the results are far less impressive. (10% or so.)
[20:05:32] <MacWinner> also, do you recommend first upgrading to WT on my 3.0.8 replica set, then upgrading to 3.2? or upgrade to 3.2+WT in a rolling basis
[20:06:05] <GothAlice> I always recommend upgrading, then testing, one major version upgrade at a time. There are things like index and authentication migrations that need to be performed, and safest is to do so one step at a time in a controlled manner.
[20:06:19] <GothAlice> I wouldn't switch to the WT engine until 3.2, however.
[20:06:50] <GothAlice> Mixed-version clusters (across major versions) are problematic for the same differing-schema reason.
[20:07:36] <MacWinner> i'm trying to find a good guide from someone who has done this. i feel like i have a pretty vanilla setup with a 3-node replica set on 3.0.8.. not using WT
[20:08:30] <GothAlice> https://docs.mongodb.org/manual/release-notes/3.2-upgrade/ is a complete outline of the process.
[20:08:41] <GothAlice> Each major manual version includes step-by-step instructions for upgrading.
[20:39:53] <MacWinner> should this upgrade plan from 3.0.8 to 3.2.1 work for a replica set? 1) shut down secondary node 2) upgrade binaries 3) convert config to YAML (including WT) 4) delete data files 5) start secondary node..
[20:40:13] <MacWinner> I'm thinking this will just go through the standard resync process for a secondary node.. but it will use WT as the storage engine
[20:40:30] <MacWinner> after all nodes are upgraded in this fashion, then I can upgrade the replication protocol version
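A hedged sketch of the YAML config for step 3, assuming a replica set named rs0 and default paths (adjust to your environment):

    # mongod.conf
    storage:
      dbPath: /var/lib/mongodb
      engine: wiredTiger
      wiredTiger:
        collectionConfig:
          blockCompressor: snappy
    replication:
      replSetName: rs0
    net:
      port: 27017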
[20:48:52] <Doyle> Hey. Backup question. In a sharded cluster, if you have to recover from a config server backup, is the cluster able to re-discover any migrations/splits, etc that happened after the backup took place?
[20:52:09] <Boomtime> @Doyle: migrations and splits are stored in the config servers, if you restore a config server to a point before the data was moved then it's gone - just like if you restore your harddrive to a point before you copied some data to it, how can it 'rediscover' what you copied to it?
[20:52:12] <GothAlice> Doyle: So… you're asking if a snapshot can predict the future?
[20:52:32] <cheeser> my crystal ball says ... "maybe?"
[20:52:46] <GothAlice> My 8-Ball came back with "Looks dubious."
[20:52:54] <Doyle> Well, if you have a busy sharded cluster, what's the point of a config server backup?
[21:29:30] <cornfeedhobo> hello. i am still tackling an issue that is driving me nuts.
[21:30:11] <cornfeedhobo> I have a cluster that i am playing with replication on, and while testing fail-over scenarios, I have found that the instructions on https://docs.mongodb.org/manual/tutorial/configure-linux-iptables-firewall/ don't seem to work
[21:31:13] <cornfeedhobo> i can get the cluster happy and all, but if i take a node down, give it a new ip, start it back up, and update the /etc/hosts entry for the node that was cycled, then everything just hangs saying it is waiting for a config
[21:32:20] <Boomtime> @cornfeedhobo: you modify the /etc/hosts entry on the node that changed? nowhere else?
[21:32:40] <cornfeedhobo> Boomtime: no, on the master. sorry, should have been more clear
[21:33:02] <Boomtime> ok, how many members do you have? just two?
[21:33:47] <magicantler> Could I use sharding to act as a sort of load balancer for users? Do a hashed shard on a consistent field where the values are 1, 2, or 3, and give each user a value?
[21:34:01] <Boomtime> so out of 4 members, how many of them know about the new name you put in /etc/hosts?
[21:34:31] <Boomtime> also, please don't use the term 'master' that is old terminology and actually refers to a different mode
[21:36:17] <cornfeedhobo> Boomtime: good point re: master. will use "primary". currently, i have only tested by updating the host entry on the primary. the reason was that in testing, i found that after bringing the cycled node back up, everything works/connects properly if i flush iptables for ~30s before putting the rules back how they were.
[21:37:33] <cornfeedhobo> that is where i have become lost. everything works, but i just have to flush iptables for 30s for the cycled node to receive a config/resync
[21:37:47] <Boomtime> the replica-set config uses names? or ips?
[21:40:49] <magicantler> i'd like to map the client app to the proper mongod instance, before having mongos reroute
[21:40:54] <cornfeedhobo> Boomtime: i am hopping in the mongo shell and checking rs.status() until i see it stop complaining about being unable to retrieve a config
[21:41:54] <Boomtime> @magicantler: if you have a mongos at all, then the app connects to those only
[21:42:28] <magicantler> Boomtime: one in front of each api, but i want to intelligently map client calls to correct API, to avoid a network hop
[21:42:43] <magicantler> Boomtime: Based on the chunks or shard-hash
[21:43:01] <Waheedi> so read preference settings were usable starting from v2.4.0+
[21:43:02] <Boomtime> @cornfeedhobo: ok, that should be much faster than 30 seconds - check the primary server log (where you change the hosts) and see the connections it is trying to make
[21:43:20] <Boomtime> it should print both the name and the IP it resolved
[21:43:49] <Boomtime> if the IP is wrong then you know your hosts isn't being picked up real quick, if the IP is right then maybe you have some other problem
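Two quick shell checks that show which names/addresses the set is actually using (run against the primary):

    rs.conf().members.map(function (m) { return m.host })       // hostnames in the replica-set config
    rs.status().members.map(function (m) { return { name: m.name, state: m.stateStr } })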
[21:44:06] <Waheedi> I'm on 2.0.4 which sounds a bit too old,. what would be the right way to go for upgrading my code to work with 2.4.0?
[21:44:37] <Boomtime> @magicantler: you just want a specific application to connect to a specific mongos?
[21:45:15] <magicantler> Yes, I'd like the client app to hit the api which lies on the same mongod sharded instance as the given primary for that specific user's hash
[21:46:12] <cornfeedhobo> Boomtime: good idea! i will try that now. brb
[21:46:13] <magicantler> Boomtime: basically, i want to intelligently map to the correct ip address that its primary exists on at runtime, through the app, and i want to update it if the chunks rebalance
[21:47:35] <Waheedi> I'm not sure what the changes are, and for me to know the api changes would take a few days :/
[21:48:07] <Waheedi> between 2.0.4 and 2.4.x, but I will need to do that anyway
[21:48:13] <Boomtime> @magicantler: connecting to a specific mongos is trivial; it's just a hostname, put the appropriate hostname of the mongos to connect to in that application connection-string
[21:48:51] <magicantler> Boomtime: I know, but i'd like to have the client correct to the "best" mongos, i.e. the one that lies on the same primary for that given call
[21:48:55] <Boomtime> @magicantler: but your other words don't make a lot of sense - what do you mean about 'given primary'? in a sharded cluster you do not connect to a primary
[21:49:26] <magicantler> Right, so I have 3 servers, with 3 shards (rotating primary, secondary, and arbiter between them)
[21:49:51] <magicantler> Boomtime: and i'm hashing user_id values as the shard key, but i'd like that user's future calls to be sent to the proper mongos server consistently
[21:49:59] <magicantler> as each server also has an api and mongos (beefy servers)
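For reference, the hashed-shard-key setup being described looks roughly like this in the mongo shell (the gamedb.users namespace and field name are placeholders):

    sh.enableSharding("gamedb")
    db.users.createIndex({ user_id: "hashed" })
    sh.shardCollection("gamedb.users", { user_id: "hashed" })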
[21:50:21] <Derick> is each server running two mongod instances?
[21:50:58] <Waheedi> magicantler: you want to set read preference settings to nearest for your apps/clients
[21:51:20] <magicantler> Each server has 3 mongod instances + a mongos
[21:51:23] <Waheedi> which is pain in the *** for me to do as I'm on 2.0.4 version :)
[21:51:36] <magicantler> Waheedi: They are in same datacenter
[21:51:44] <Derick> magicantler: don't do that, they will fight for memory
[21:51:54] <magicantler> Derick: That's what I keep hearing, but they are massive servers.
[21:52:05] <Derick> magicantler: okay, feel free to ignore the advice.
[21:52:13] <magicantler> Derick: Is there no way around it?
[22:06:56] <magicantler> :[! Well is there any way to get the routing table that mongos uses? Then i could hand that back to the client and send calls preemptively
[22:07:32] <Boomtime> i think you have a misunderstanding about what mongos does
[22:08:14] <Boomtime> mongos may (and will frequently) talk to more than one shard in the course of serving a single query
[22:08:47] <Boomtime> the 'routing table' you are asking for would be the contents of the config server - a complicated set of metadata about how the database is distributed across shards
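That metadata can be inspected directly if you're curious; a sketch against the config database (namespace illustrative), though it isn't something an application should route on:

    var cfg = db.getSiblingDB("config")
    cfg.chunks.find({ ns: "gamedb.users" }, { min: 1, max: 1, shard: 1 }).limit(5)
    sh.status()   // human-readable summary of chunk ranges per shard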
[22:10:24] <magicantler> wow actually yes, I didn't grasp it right. i thought it could just use the config meta-data to know where to go
[22:12:13] <Boomtime> data is partitioned across shards, it is the mongos that locates all the relevant data for a query and makes it look like it came from a single place
[22:12:41] <Boomtime> so mongos is a 'router' but not the simple network router you are thinking of
[22:13:30] <magicantler> right, totally with you on that, but we're using documents and queries that will only map to specific users anyway, and thus won't span more than one shard.
[22:13:38] <magicantler> i have weird requirements for this project.
[22:14:02] <magicantler> it won't span more than one shard, b/c it's a single document per user.
[22:15:09] <Boomtime> is a user document only accessible from a single location?
[22:16:14] <magicantler> i really feel like we're making a hacky load balancer and backup router (in case a primary goes down) via sharding.
[22:17:04] <Derick> sharding is not meant for failover, that's what replication is for. You're messing that all up with your two data nodes per server
[22:17:16] <Boomtime> if you only write a document in a single location, and only ever read it from that same location.. what is the sharding for?
[22:17:17] <magicantler> well each shard is a replica set.
[22:17:33] <magicantler> the sharding is to expand parallel reads + writes. distribute out users
[22:18:29] <Boomtime> but you're overlaying that with your own rules about which user can operate on which server
[22:18:49] <Boomtime> you've already made up your mind which server to use before you even involve the database
[22:23:09] <magicantler> true, and solid point. so i guess i could just use 3 replica sets, split across the 3 servers instead, but then i'd have to manage 3 mongodbclient connections on my own
[22:24:23] <Derick> one data node -> one physical server
[22:24:37] <Derick> don't try to reinvent things yourself and go against how MongoDB is supposed to be used.
[22:24:44] <magicantler> Yeah... boss man said no to that..
[23:05:01] <cornfeedhobo> Boomtime: okay. I have tried what you recommended. when i change ips on the "failing" node, the logs output a notice about failure to contact. once I update the /etc/hosts entry for the failing node on the primary, the log entry stops appearing, but the failing node stays in a perpetual state of "DOWN"
[23:23:36] <jiffe> does that ring a bell with anyone? two of the three members in a replica set are dying with this and won't start up again
[23:56:41] <regreddit> i have found what i think is a bug, but i'm not sure where, or how bad it is, but it's a blocker for me - can i describe it here, to see where i should file it? I can also easily replicate it and have a test case, if someone is willing to test it
[23:57:51] <regreddit> i'll paste the replication test case in a gist