[00:02:40] <uuanton> @GothAlice is there a way then to drop the local database and then "copy/sync" the new oplog to the secondaries?
[00:03:41] <Bajix> Not seeing how we'd do that for periods in a range either
[00:04:45] <GothAlice> uuanton: Why are you thinking to copy an oplog around?
[00:04:52] <GothAlice> uuanton: That's… never a good idea.
[00:06:05] <uuanton> I mean the secondaries resync from scratch even though they have the same data as the primary from the snapshot
[00:06:45] <GothAlice> uuanton: Based on an _old snapshot, outside the oplog time range_. So that behaviour is completely correct.
[00:06:48] <uuanton> but because I drop the local database, with the production oplog and replset settings, there are no common points in the oplogs
[00:07:46] <GothAlice> uuanton: Which is why I recommended the approach of spinning up a restore of the snapshot as a brand new replica set instead of attempting to preserve the old configuration.
[00:08:39] <GothAlice> "Basically, restore a single member, convince it it's not a replica set, re set up the replica set, optionally with a fresh snapshot/backup to restore to the new secondaries. That's the process, and the last link there outlines how to force it if it won't cooperate. ;)"
[00:09:38] <GothAlice> The key factor of not having the new secondaries sync the first time is creating a _new_ snapshot, with current oplog time, which hopefully won't have expired by the time it's transferred and restored on the new secondaries.
[00:10:28] <GothAlice> If a new snapshot isn't an option, then syncing the first time must be.
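A minimal sketch of the restore sequence GothAlice describes, mixing OS-shell and mongo-shell steps; the dbpath, hostnames, and the new set name "rs_restored" are placeholders, not values from the conversation:

    # OS shell: start mongod on the restored snapshot WITHOUT --replSet
    mongod --dbpath /data/restore --port 27017

    // mongo shell: drop the old replication state (oplog + replset config) kept in "local"
    use local
    db.dropDatabase()

    # OS shell: restart with a NEW replica set name
    mongod --dbpath /data/restore --port 27017 --replSet rs_restored

    // mongo shell: initiate the new set, then add fresh secondaries
    // (restore the new snapshot to them first if possible, so they skip the initial sync)
    rs.initiate({ _id: "rs_restored", members: [{ _id: 0, host: "db1.example.com:27017" }] })
    rs.add("db2.example.com:27017")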
[00:14:59] <GothAlice> Bajix: I'll have to ponder this. This, BTW, is the huge reason I don't ever store multiperiod data which requires further processing; I can highly recommend investigating pre-aggregation for this statistic in the future. Switching to pre-aggregation may also work as a solution, here, but it's a bit more heavyweight than crafting a cunning query or aggregate.
[00:15:30] <Bajix> Well, there could be multiple period durations as well
[00:15:47] <Bajix> This was an after the fact request
[00:16:34] <GothAlice> (Basically; any time the user performs an event which updates their lastActive, also atomically increment a user count and $addToSet the user's identifier to a pre-aggregate record for the "current period", with a query to not increment if they're already in the set.)
[00:16:51] <GothAlice> Pre-aggregation like that is how we do this activity tracking at work.
[00:17:53] <GothAlice> http://www.devsmash.com/blog/mongodb-ad-hoc-analytics-aggregation-framework being the article I generally link to introduce this approach; it covers several approaches for storing event data including some backing numbers for storage space, indexes, and different query performance.
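A minimal mongo-shell sketch of that pre-aggregation pattern; the "activity" collection, hourly periods, and the userId variable are illustrative rather than taken from Bajix's schema:

    // One pre-aggregate document per hour: { _id: <hour>, total: <count>, users: [<ids>] }
    var period = new Date();
    period.setUTCMinutes(0, 0, 0);              // snap to the current hour

    // 1. Make sure this period's record exists (a no-op after the first call).
    db.activity.update(
        { _id: period },
        { $setOnInsert: { total: 0, users: [] } },
        { upsert: true }
    );

    // 2. Count the user only if they aren't already in this period's set.
    db.activity.update(
        { _id: period, users: { $ne: userId } },
        { $inc: { total: 1 }, $addToSet: { users: userId } }
    );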
[00:17:55] <Bajix> So, basically $addToSet the rounded down hour with every update
[00:18:23] <GothAlice> https://gist.github.com/amcgregor/1ca13e5a74b2ac318017#file-eek-py-L29-L47 < this part of my aggregate example code
[00:18:39] <GothAlice> (Snapped to hours, in that example.)
[00:19:06] <Bajix> From what I can tell, it doesn't even look possible to use $push twice per document
[00:19:26] <GothAlice> Taking your existing data and processing it to populate that pre-aggregate collection can be done client-side and as inefficiently as desired. (Since it only has to happen once, when first implementing this approach.)
[00:19:56] <GothAlice> Nope. The amplification problem; $unwind is the only thing that creates variable amounts of additional data to process, everything else preserves or reduces.
[00:20:00] <Bajix> From what I can tell, $push & $addToSet would be the only options for building an array that can be unwound, neither of which can push multiples
[00:21:10] <yoofoo> I'm new at mongodb. We are looking at mongodb for an MLM company. I need to know if it is a good choice or not. Most MLM companies use SQL for the relational/hierarchical nature of MLM. However, mongodb seems to provide more flexibility and performance while also offering support for relational/hierarchical data. Can I achieve the middle ground with mongodb and come out ahead in the end? Please advise.
[00:21:11] <Bajix> That's sort of what I thought going into this
[00:21:49] <GothAlice> yoofoo: MLM describes a graph; I'd recommend using a real graph database to store the connections. For other data storage needs, though, MongoDB can be great. (Even SQL doesn't do graphs well.)
[00:21:51] <Bajix> But it does seem like aggregate is so much more performant as to warrant pre-seeding
[00:22:26] <Bajix> Can $currentDate do current hour at all?
[00:22:41] <GothAlice> Bajix: Pre-aggregation is the "right way" to seed this data, in all likelihood. It'd also let you have one place to update all user-event-in-time-period related statistics.
[00:22:44] <Bajix> If I wanted to do something like addToSet + currentDate rounded down
[00:23:35] <yoofoo> GothAlice, Thx. what would you recommend for the graph db?
[00:24:27] <GothAlice> Bajix: The expression fed to $addToSet can be as complicated as you need, but I'm still not seeing where you're really going with that. If doctoring up the original event data is what you're thinking, I'd recommend keeping the precalculated data separate. I.e. I store hits in db.hits, but only keep N days of individual events for auditing purposes. But db.analytics are kept forever.
[00:24:39] <GothAlice> yoofoo: Neo4J is what we use at work; it's Java, but good people.
[00:25:26] <GothAlice> Certain questions about your data, i.e. the six-degrees problem, can only really be answered in a sane way by a real graph DB. ;)
[00:25:40] <Bajix> I was just going to add this to my periodic updates to atomically add the current hour, rounded down
[00:25:57] <GothAlice> Bajix: To every user actively logged in? That doesn't sound very scalable. ;)
[00:26:21] <GothAlice> Pre-aggregation is strongly (not "all"… not quite) about scaling your analytics.
[00:27:32] <GothAlice> No matter how many tracked clicks we get at work, there will only ever be one pre-aggregate record per time period per job we're tracking. Querying a two-week time-frame for a job will _never_ query more than 336 records, period, making aggregation over the pre-aggregated data for report display a constant-time affair.
[00:28:32] <Bajix> GothAlice: It would be to their Tokens, which just contain session data, so we wouldn't be doing insane things to the user collection
[00:28:55] <GothAlice> Mixing of concerns, though. Makes my skin creep. ;P
[00:29:13] <GothAlice> Data vs. instrumentation of that data.
[00:30:26] <GothAlice> The query time and scalability of including it in the data itself is dependent on the popularity of your site/tool/service. The more users, the slower your reporting will get, and if you go the $unwind approach, it's geometrically worse over time, not linearly.
[00:32:09] <GothAlice> Pre-aggregated, identifying the number of active users for any single period (i.e. right now, for a dashboard counter thingy) is O(1). A single exact record fetch on the (hopefully indexed) time period.
[00:32:34] <GothAlice> Not pre-aggregated, it's O(active-user-count).
[00:34:00] <Bajix> The token's storing only how they logged in, how long their session is, whether they're a guest user, and what websocket channel they're using
[00:34:21] <Bajix> It's not the "session data" itself
[00:34:23] <GothAlice> Bajix: But you do see that the number of records being evaluated grows unbounded?
[00:34:48] <GothAlice> Re-phrasing that in English: the more active your site, the more data that needs to be ploughed through to give you an answer to that analytic question.
[00:35:10] <Bajix> How would you get around that though?
[00:36:08] <GothAlice> "Number of active users right now?" becomes db.activity.find({period: utcnow().replace(minute=0, second=0, microsecond=0)}, {total: 1}).first().total
[00:36:47] <GothAlice> "Active users in the last 24 hours?" becomes an aggregate matching the time range of periods $sum'ing total, with the number of records evaluated being… exactly 24.
[00:36:49] <Bajix> Well I have that atomically updated on the websocket channel
[00:39:55] <GothAlice> Your approach of bundling time periods into each user's login record would technically… work…
[00:40:02] <Bajix> I need the Tokens regardless for other reasons though
[00:40:12] <Bajix> Such as producing average session durations within a time span
[00:40:24] <Bajix> or comparing login method percentiles
[00:40:57] <GothAlice> Well, trick is, you can fire off the pre-aggregate upsert with a writeConcern of None. It can be re-built (worst case, this has never happened to us but we test the rebuild process anyway ;) from the token data as needed, the result is never used anywhere, etc.
[00:41:18] <GothAlice> So where your user data is required for each request, reading the count of active users… probably won't be.
[00:41:33] <Bajix> I always had the impression though that it would be preferable to have more docs as opposed to using addToSet and having massive subdocuments
[00:41:54] <GothAlice> (Also, for the… scope of the potential query here, running a separate query to read a single integer out of the DB is going to be way, way more performant than running a whole aggregate or map/reduce!)
[00:43:04] <GothAlice> {_id: <date>, total: 0, users: []} — that'd be an example empty pre-aggregate record.
[00:43:19] <Bajix> It's a little more complicated than that
[00:43:35] <GothAlice> For active user count? Oh, right, you are tracking both active and logged in but inactive.
[00:43:36] <Bajix> The tokens build in idempotency of the session duration updating
[00:44:13] <GothAlice> I care not for your tokens; the problem revolves around statistics pulled from their creation and last modification dates, the rest is unimportant. The pre-aggregated records are stored in their own collection.
[00:44:17] <Bajix> They could have multiple open web socket connections, and without tokens to maintain a cursor I wouldn't be able to do idempotent updating
[00:48:33] <GothAlice> Pre-aggregated requires no $unwind.
[00:49:16] <GothAlice> The user ID tracking is solely for the purpose of preventing double-accounting if the application pings back too frequently; I'm not sure how your application exactly tracks user activity, so I go with the safer approach out of the box. Obv. customize any and all of this to fit!
[00:55:08] <Bajix> One of my recent jobs had that... I couldn't talk about the fact that a termination agreement existed
[00:56:02] <GothAlice> Well, that's often to ensure other investors in a project don't get butthurt over each other's negotiated bonuses. ;)
[00:56:42] <GothAlice> Similar weird thing at work with job postings that don't actually mention a company, because the company can't admit they're looking to fill the position. Usually because they currently have someone in the position, and their contract might mention such behaviour…
[00:56:50] <GothAlice> HR is almost as weird as some NDAs.
[00:58:47] <Bajix> Left me real sour... I built a pipeline that accepts JSON/CSV as inputs... the director of engineering wanted me to parse JSON, convert it to CSV, then parse the CSV.... you can probably understand why I told him no
[00:59:08] <GothAlice> I'm glad I wasn't drinking anything when I read that.
[00:59:13] <GothAlice> Or I'd need a new keyboard.
[00:59:42] <GothAlice> I often have to explain patiently why I can't just give staff dumps of unmodified data as Excel workbooks. ¬_¬
[00:59:58] <GothAlice> I always have to ask, no, really, what exactly do you need from this data?
[01:00:32] <Bajix> This was a company where I was the first engineer, and where I had architected everything
[01:01:04] <Bajix> Then they got funding, and decided to hire dinosaurs that didn't know the technology to manage things...
[01:01:32] <GothAlice> See also: the mythical man month.
[01:01:47] <Waheedi> I have a replica set with 4 nodes; node 1 is the primary and 2, 3, 4 are secondaries
[01:02:35] <Bajix> Yea. It set them back 10 months by firing me, and cost many millions
[01:02:56] <Waheedi> 4 was added recently from 1. When I do rs.status() from 1, 2, or 3 I can see 1,2,3,4 and they are syncing fine, but when I check from 4 it seems it's stuck somewhere
[01:03:16] <GothAlice> Waheedi: what do the logs on that node say?
[01:03:17] <Waheedi> from the log of node 4: "replSet info Couldn't load config yet. Sleeping 20sec and will try again"
[01:04:14] <Bajix> I'm twitter creeping you, just FYI
[01:04:17] <GothAlice> Bajix: For the most part that name is me everywhere, except on YouTube. *shakes a fist* There I'm TheRealGothAlice. ;)
[01:04:35] <Bajix> I'm either Jester831, Bajix or Bajix831
[01:05:15] <GothAlice> Waheedi: Could you gist the rs.status() from a working secondary and the non-working one? Also, is authentication enabled on this cluster?
[02:02:56] <GothAlice> Freman: You do realize that Go and PHP define variables in very different ways? Google is also not informing me as to what golang/mongo (cli) actually is, or where it comes from.
[02:03:34] <Freman> no, I mean in the encoded bson
[02:04:28] <GothAlice> If whatever that is differed in any way on how it encodes the BSON specification, it wouldn't be able to communicate. BSON is also binary, not ASCII, so… uh… not pasteable into IRC, generally.
[02:07:35] <GothAlice> Since BSON is, as mentioned, a ratified standard with pretty good library coverage, the immediate culprit that jumps to mind has to be user error.
[02:08:53] <Freman> it's the way the libraries are putting their queries together
[02:09:19] <Freman> it's not mongo's fault except that the mongo cli uses "query" but the documentation for the wire protocol is "$query"
[02:09:24] <GothAlice> Also link the library? Still can't find anything calling itself golang/mongo involving a command line interface.
[02:11:08] <Freman> no, the cli that comes with mongo, and the mgo golang library for mongo
[02:11:37] <Freman> http://pastebin.com/Xsv7HDTU is the code echoing that and https://github.com/facebookgo/dvara/blob/master/protocol.go is where readDocument comes from
[02:11:51] <Freman> the document on the wire varies from library to library
[02:12:02] <Freman> (that's better, "on the wire")
[02:12:59] <Freman> not a serious complaint, I'm just amazed mongo works for everyone
[02:13:14] <Freman> and annoyed that I had to modify my code to work for both
[02:16:27] <cheeser> i think you're conflating things here, Freman
[02:17:27] <cheeser> when you say 'the mongo cli uses "query",' what are you referring to?
[02:23:06] <Freman> gimme a second, I'll do this better
[02:24:20] <yoofoo> When using mongoDB, what is the recommended guideline/practice for mapping the mongodb data using graphDB? For example, leave the data in mongodb flat (with no relationship) and use graphdb to map all the relationships? or what's the guideline? please advise.
[02:27:51] <yoofoo> let me rephrase my last question
[02:29:33] <yoofoo> in MLM, when using graphDB+mongoDB, what is the recommended guideline/practice for mapping the mongodb data using graphDB? For example, leave the data in mongodb flat (with no relationships) and use graphdb to map all the relationships? or what's the guideline? please advise.
[02:36:23] <Freman> This is the on the wire bson (https://docs.mongodb.org/v3.0/reference/mongodb-wire-protocol/#op-query) for the query from php http://pastebin.com/beuNTRRY and the same query from mongo cli http://pastebin.com/bAesaJC0
[02:43:28] <cheeser> i think what you're seeing there is the difference in how drivers submit queries to the server. the command format changed on the server end.
[02:43:33] <joannac> Freman: And? Why do you care?
[02:43:43] <cheeser> these details are largely irrelevant to driver users
[03:34:08] <Freman> Looks that simple; mongo still answers both forms. I'm just making sure the queries use indexes; worst case scenario I fail to parse and they get told to check their query again
[10:29:14] <noncom> I am rather new to mongo, so maybe I don't know something essential, but it'd be nice if someone explained: what does "local" actually mean then?
[10:31:40] <noncom> ah, it's really rather basic... found an answer on SO that it's an oplog to restore back in time if I need to
[11:11:25] <CustosLimen> Lope, can you pastebin the complaint ?
[11:12:20] <Lope> hmm, I'm going to try a reboot. Cos I tried to paste a multi-line script in my terminal and it might have written some weird chars into that location. So just in case I corrupted something in the kernel I'm gonna reboot.
[11:27:09] <Lope> same situation: echo never > /sys/kernel/mm/transparent_hugepage/enabled && cat /sys/kernel/mm/transparent_hugepage/enabled shows me "always madvise [never]"
[11:28:37] <Lope> oh, but mongodb is not complaining anymore. apparently the output I'm getting now is acceptable for it.
[11:49:04] <torak> hello everyone. I am trying to create a backend with mongo for my android app. I am using node.js for coding, but how can I upload the node scripts to the server? Is it safe to use ftp? Or is there any better way to do that?
[11:53:30] <StephenLynx> that is just a linux question.
[11:53:44] <StephenLynx> unrelated to both mongo and node.
[12:57:05] <CustosLimen> on rhel, where mongodb installs /etc/init.d/mongod - how can I get multiple instances going ?
[12:57:14] <CustosLimen> or is there no specific mechanism ?
[13:26:48] <synthmeat> CustosLimen: you can just run "mongod" as any user, but be careful to run it with a new config so it doesn't wreck your existing database/logs/dirs
[13:27:40] <CustosLimen> synthmeat, not worried about the user so much - but it's ok - the documentation did say that for other processes (mongos, arbiter, config server, etc) I should just make a new init script based on the existing one
[13:28:32] <synthmeat> CustosLimen: it's more maintainable just to run it with a new config to me (for, say, development purposes). ymmv
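A sketch of what a second instance's config might look like; the port, paths, and filenames here are illustrative, not taken from CustosLimen's setup:

    # /etc/mongod2.conf -- a second instance needs its own port, dbPath and log file
    net:
      port: 27018
    storage:
      dbPath: /var/lib/mongo2
    systemLog:
      destination: file
      path: /var/log/mongodb/mongod2.log
      logAppend: true

    # start it alongside the packaged instance (or copy /etc/init.d/mongod and point it here)
    mongod --config /etc/mongod2.conf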
[13:28:49] <CustosLimen> synthmeat, each instance will use its own config
[13:29:07] <CustosLimen> but they will all be on same server
[13:29:28] <synthmeat> CustosLimen: unrelated, but why would you want to do that in the first place?
[13:30:30] <synthmeat> (few use cases do come to mind, like using different storage engine or alike)
[13:32:00] <CustosLimen> also, in addition to that, each server will have a mongos process
[13:32:05] <synthmeat> ah, ok. never went web-scale myself :)
[13:32:36] <CustosLimen> so at max there will be 4 mongodb-related processes on one server, but only one will be data-carrying
[14:31:39] <Tomasso> why is the following query, which should calculate a simple average, always returning 0.000000? db.getCollection('Accuchecks').aggregate({$group: {_id: "$algorithm",accuracy: { $avg: "$difference.close"}}})
[14:32:36] <StephenLynx> what number did you expect?
[14:33:24] <StephenLynx> try just getting the sum and amount of documents.
[14:33:44] <Tomasso> StephenLynx: different from 0.. since almost all values of difference.close are different from 0
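A hedged diagnostic along the lines StephenLynx suggests: check the raw sum and document count, and whether difference.close is actually stored as a number (averaging non-numeric or missing values won't produce the expected result):

    db.getCollection('Accuchecks').aggregate([
        { $group: {
            _id: "$algorithm",
            total: { $sum: "$difference.close" },   // $sum stays 0 if the values aren't numeric
            docs:  { $sum: 1 }
        } }
    ]);
    // Eyeball one document to confirm the field's type and path:
    db.getCollection('Accuchecks').findOne({}, { "difference.close": 1 });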
[15:36:44] <StephenLynx> when you are referencing it in a query or a simple projection, you use it as cheeser posted.
[15:37:19] <StephenLynx> when you are referencing it in an operation as the value of something, and not the key, you must include the $.
[15:38:10] <StephenLynx> like {amount:"$aggregatedValue"} will output the value of aggregatedValue in the field amount
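A tiny illustration of that key-vs-value distinction, with made-up collection and field names:

    // Field name used as a key (no $): a find() projection
    db.stats.find({}, { aggregatedValue: 1 });

    // Field referenced as a value (needs $): an aggregation expression
    db.stats.aggregate([
        { $project: { amount: "$aggregatedValue" } }
    ]);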
[16:26:58] <jfhbrook> I have some find(find).skip(skip).limit(limit) calls that don't appear to be returning results in a deterministic order; is this expected and/or what could cause this?
[16:27:29] <Tomasso> grgrg, is there something like $unwind but for objects and not arrays? I still get 0
[16:28:43] <StephenLynx> jfhbrook, if you are not sorting the results
[16:29:02] <StephenLynx> I don't think mongo makes an effort to give you results in a certain order
[16:29:34] <jfhbrook> StephenLynx: suppose they're sorted, but not on literally all fields, and some fields might have the same value
[16:32:38] <jfhbrook> it's proprietary code, you understand
[16:32:44] <StephenLynx> what do you mean a nondeterministic ordering?
[16:33:19] <StephenLynx> m8, if you are not creative enough to show pieces of proprietary code without actually revealing too much, you got bigger issues.
[16:34:19] <jfhbrook> I mean if I have 2 elements a and b, and a and b are equivalent based on my sorting algorithm, sometimes the order will be [a, b] and other times [b, a]
[16:35:12] <StephenLynx> i assume in that case they won't be moved.
[16:35:34] <StephenLynx> so it depends on the order they were in the first place.
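A common fix, sketched with illustrative names: add a unique tiebreaker such as _id to the sort, so documents with equal sort keys always come back in the same order and skip/limit pages stop overlapping:

    db.items.find(query)
        .sort({ createdAt: 1, _id: 1 })   // _id breaks ties deterministically
        .skip(skip)
        .limit(limit);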
[17:23:05] <julio> Hi! I'll have a system where my users can add their tasks. My categories are: Morning, Afternoon, Evening and Night. Each user can put a task in a category, for example: morning. User1 - Morning: clean the bedroom, wash clothes. Afternoon: make lunch, dry clothes. Evening: study for exam, prepare dinner. Night: watch tv, go to sleep. Which is the best way to store this data: one single database with collections separated by ID, or one collection for each
[17:23:05] <julio> user? I'll need to find some information. On day X, how many tasks did user Y do? On day X, how many times did user Y clean the bathroom?
[17:24:12] <StephenLynx> a single collection for all users.
[17:24:24] <StephenLynx> having dynamic table/collection generation is one of the worst mistakes you can make.
[17:46:27] <julio> what about indexes, what is best... _id on the users collection, and user_id and date on tasks? I'll need to find tasks made by user X on day Y...
[17:54:55] <StephenLynx> but bcrypt is not bad either.
[17:57:49] <GothAlice> StephenLynx: As a rather important note, pbkdf2 is a key derivation function, not a hashing function. It has some peculiarities beyond just salting. Security is hard, man. ;P
[17:58:24] <GothAlice> bcrypt/scrypt tend to be better from the "it's more secure with less understanding" perspective. Black boxes that deal with passwords.
[17:58:25] <StephenLynx> I know, but it works really well for that purpose.
[17:59:06] <StephenLynx> and from what I heard, has a higher ceiling than bcrypt
[18:02:04] <GothAlice> bcrypt uses 2^cost rounds; pbkdf2 I believe just uses the iteration count straight.
[18:02:46] <GothAlice> But no, the major key point here is that you're not trying to derive a cryptographic key for use with block ciphers with the user's password, you're trying to verify the user knows a secret. :| Very different problems.
[18:03:24] <julio> StephenLynx, yes. My searches are based on: user_id where date = X and period = Y
[18:03:29] <GothAlice> Which is why I'm a fan of: https://en.wikipedia.org/wiki/Secure_Remote_Password_protocol
[18:03:32] <julio> StephenLynx, I'll read about pbkdf2.
[18:03:40] <StephenLynx> julio, I suggest you make that index on the user_id
[18:03:58] <StephenLynx> not having the others indexed shouldn't be an immediate concern, but the user_id will be
[18:04:10] <julio> just user_id at tasks collection?
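A hedged sketch, assuming the collection is called "tasks" and the field names match the ones used above; a compound index in this order serves "tasks by user X on day Y (and period Z)" queries:

    db.tasks.createIndex({ user_id: 1, date: 1, period: 1 });
    // _id on the users collection is indexed automatically; no extra index needed there.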
[19:12:04] <tinylobsta> I switched from using embedded documents to using references because I ran into a use case where I needed to run a comparison that required having three nested loops. I'm curious, however, how efficient having nested queries is (I'm using mongoose on node) relative to having a nested loop structure?
[19:12:41] <StephenLynx> I strongly suggest that you don't use mongoose.
[19:13:10] <tinylobsta> is it really that inefficient?
[19:13:22] <tinylobsta> i'd have to tear apart my entire app to move away from it
[19:16:24] <tinylobsta> so this is what i should leverage, then
[19:16:37] <stickperson> for a $project stage in an aggregate query, is it possible to say “only give me this field if it exists”?
[19:18:16] <GothAlice> stickperson: Hmm, {optionalfield: 1} as a projection always includes an "optionalfield" field in the result, even if the source document didn't have it?
[19:18:42] <GothAlice> It certainly will if you project using the renaming syntax: {myfield: "$optionalfield"}
[19:18:49] <stickperson> GothAlice: only when i do an aggregate query. when i just use find() it doesn’t get returned
[19:19:05] <GothAlice> stickperson: Could you gist/pastebin your aggregate? Or at least, the project stage?
[19:26:35] <GothAlice> stickperson: Yup, in this example it's the $group doing you in. Fields defined within its stage will always be given a value, even if that value has to be null.
[19:28:02] <stickperson> GothAlice: damn. oh well, not a huge issue. will just have to add some more logic elsewhere
[21:34:07] <magicantler> suppose we insert lots of documents, then delete them right away. is this a bad measure of performance since maybe it doesn't have time to build indexes? @joannac
[21:34:08] <joannac> magicantler: you think I can tell you that without knowing anything about your data or hardware?
[21:34:49] <joannac> magicantler: I don't understand your question.
[21:35:05] <uuanton> joannac you mean ssh to each secondary and restart mongo ?
[21:35:16] <magicantler> if i insert a shit load of documents, then try and delete them right away, will it be slower than if i had waited 10 minutes before deleting?
[21:35:49] <GothAlice> magicantler: Indexes are updated as the inserts are applied, so there's no impact there.
[21:36:11] <GothAlice> It's not like it inserts it, then for an indeterminate amount of time the record won't show up because the indexes haven't been updated.
[21:36:37] <uuanton> oh GothAlice maybe you know the smart way to restore a whole replica set to new data, I talked to you yesterday
[21:37:04] <uuanton> I no longer care if the secondaries have to resync again from scratch
[21:37:49] <magicantler> GothAlice: I figured it built them behind the scenes, and until it was built, the lookup just ran slower
[21:38:00] <GothAlice> uuanton: Then the process I described yesterday should work great; restore a snapshot, demote it from being a replica set, promote it back to being a *new* replica set, then spin up new secondaries.
[21:38:15] <joannac> magicantler: the insert won't succeed until the index entries are updated
[21:39:02] <GothAlice> magicantler: Such an approach as to defer index updating wouldn't result in slowness, it'd result in the record not being there for any query using the index. That's a no-go anti-feature in a database engine. ;)
[21:39:22] <joannac> To be fair, we do have background indexes
[21:39:25] <uuanton> sorry, I don't quite understand the steps. I believe I need to stop the secondaries first, right?
[21:39:34] <GothAlice> joannac: But that's just incremental initial construction, no?
[21:39:43] <GothAlice> I.e. the index doesn't actually "exist" for use until it's done.
[21:40:15] <joannac> right, that's index creation, not updates
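For reference, a background build in the 3.0-era shell looks like the sketch below (collection and fields are illustrative); reads and writes aren't blocked while it builds, but, as noted, the index isn't used by queries until the build finishes:

    db.events.createIndex({ userId: 1, createdAt: -1 }, { background: true });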
[21:41:13] <GothAlice> uuanton: You don't necessarily need to, no. Especially if you want read-only access to the older data while the restore and new replication is set up.
[21:41:45] <GothAlice> But that'll depend on quite a number of things, like if you're changing hostnames during the restore.
[21:42:33] <uuanton> GothAlice, actually the goal is to access the new data as soon as possible. While the secondaries are being rebuilt with new data, can the primary still serve traffic?
[21:42:43] <GothAlice> At work we use incremental numeric DNS names, like r01s01.db.example.com — replica 1 of set 1. If we restore, we bump the set number, using a CNAME on db.example.com to point to the active cluster.
[21:43:48] <GothAlice> uuanton: Yes, though you need to make sure you have a sufficient oplog size to handle the incoming changes. I.e. it needs to be large enough to record all of the changes that happen while the secondaries are syncing, or the sync will never finish successfully.
[21:45:14] <uuanton> wouldn't the secondaries start from scratch?
[21:45:43] <GothAlice> They would, but step one is to copy the existing data. While that's happening, changes accumulate in the oplog. Once the initial data sync is complete, it then "catches up" by reading through the oplog and applying the changes listed there.
[21:46:03] <GothAlice> Finally, during normal operation, it "tails" the oplog to apply changes as they happen.
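Two shell helpers that can be used to check whether the oplog window is comfortably larger than the expected sync time:

    // How many hours of changes the oplog currently spans:
    rs.printReplicationInfo()
    // How far behind each secondary is right now:
    rs.printSlaveReplicationInfo()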
[21:51:32] <GothAlice> uuanton: To protect your data, if a primary can't see (ping) a majority of the other nodes it knows about it calmly turns itself into a read-only secondary. It has no confidence that it's actually still the primary, since the missing majority could vote against it and it'd never know.
[21:52:37] <GothAlice> And if that happened, and it did still think it was a primary, it might accept changes to be written that the other one doesn't know about, and you have what's referred to as a "split brain problem". Once the nodes could see each-other again, the equivalent of all hell would break loose. ;P
[21:53:13] <GothAlice> MongoDB prefers to not open gates to netherworlds, so does the safe thing.
[22:10:16] <uuanton> additional secondaries that you delayed?
[22:10:44] <uuanton> if someone deletes records, how are you going to restore the data?
[22:11:21] <GothAlice> Indeed, we have two replicas in our office. One is "live" (and lets us run reports against our data without needing to pop out to the internet for each query) the other is delayed by 24 hours.
[22:11:38] <GothAlice> The first lets us continue operating even if the rest of the world is on fire. The second helps recover from user error.
[22:13:23] <uuanton> in my office we have staging where all testing and development occurs
[22:13:57] <GothAlice> We stage on the same infrastructure that production runs on, just in a separately scaled set of instances. I like my testing environments to be as close to real as possible. :)
[22:14:06] <uuanton> right now we mongodump production and mongorestore to staging and it takes almost 6 hours
[22:15:13] <GothAlice> Ouch. A faster approach would be to have an in-house secondary you disconnect from the set and demote back to standalone, with a filesystem snapshot made prior to this such that when you're done testing you can roll back the snapshot and it'll catch up with the rest of the cluster again. (Again, assuming sufficient oplog size to cover the testing period.)
[22:15:59] <uuanton> but staging is a replica set too
[22:17:37] <uuanton> the goal of staging is not only testing and development, but to have infrastructure similar to production, so that if you need to upgrade mongodb or do something, you can practice on staging
[22:18:33] <uuanton> and the secondaries located around the world make it hard to test without the same staging infrastructure
[22:18:33] <GothAlice> Sorry, I'm not seeing the difficulty. The process of taking an in-house secondary of your production data, demoting it out of the set, then promoting it to be the primary of a new in-house staging set, would still be faster than mongodump/restore.
[22:19:25] <GothAlice> I.e. it'd take five minutes before it started being able to answer queries.
[22:22:01] <GothAlice> In our case at work we have enough oplog for 50 hours of database activity (at our current level of utilization). That'd let us easily have a "live" (not delayed) secondary in the office that is filesystem snapshotted while a secondary, then demoted and promoted to being the primary of a new in-house set, every 24 or 48 hours (I'd go for 24; 48 doesn't leave comfortable enough headroom for us. ;)
[22:22:43] <GothAlice> Basically letting it breathe for an hour around midnight each night to "catch up" with production, only to re-snapshot, re-demote, and re-promote.
[22:23:15] <GothAlice> (We don't actually do this, though, we mongodump and restore about once a week, but our data takes tens of minutes to load, not hours. ;)
[22:23:39] <uuanton> but I don't think they would let me remove and add secondaries from production to restore staging
[22:24:15] <GothAlice> It's a pretty natural way to get MongoDB to synchronize information, given replication is how MongoDB synchronizes information. ;P
[22:24:48] <GothAlice> (And if you add it as a non-voting hidden member, the rest of the set won't even be aware. You can even reduce load on your primary by having this staging secondary replicate and catch up from another secondary.)
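A sketch of adding such a member, with a placeholder hostname and member _id; hidden members must have priority 0, and votes: 0 keeps it out of elections:

    rs.add({
        _id: 4,                                // any unused member id in the set
        host: "staging.example.com:27017",     // placeholder hostname
        hidden: true,
        priority: 0,
        votes: 0
    })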
[22:25:16] <uuanton> another reason why this is not possible is that production and staging are on completely different networks
[22:26:49] <GothAlice> AWS do offer VPNs, though I haven't tried running a replica intermittently over such an arrangement.
[22:27:17] <GothAlice> I have a secondary at home, too, that does the midnight catchup routine. (But not for promotion/demotion or anything, just as a mother backup.)
[23:39:09] <Doyle> Hey. Anyone know if the slow listDatabases issue was confirmed in MMAP? Seems that it was unable to be tested. https://jira.mongodb.org/browse/SERVER-20961?jql=project%20in%20(SERVER%2C%20TOOLS)%20AND%20fixVersion%20%3D%203.0.9%20AND%20resolution%20%3D%20Fixed%20
[23:43:46] <Boomtime> @Doyle: all the listed effects concerned wiredtiger only - the fixes also only affected wiredtiger code
[23:44:16] <Boomtime> are you seeing something that you think is a similar problem in mmap?
[23:46:18] <Doyle> listDatabases hung. Connecting to a router via robomongo was not possible. The only thing happening in the logs was an index build and a big query. They both finished at about the same time, and the routers became available again