PMXBOT Log file Viewer


#mongodb logs for Tuesday the 21st of October, 2014

[00:20:27] <morenoh149> can I get a recommendation for a gui for mongodb?
[00:20:34] <morenoh149> http://docs.mongodb.org/ecosystem/tools/administration-interfaces/ which do you use?
[00:21:22] <LouisT> i used JMongoBrowser a while back.. now it's UMongo
[00:22:16] <morenoh149> LouisT: hmm says cluster. would it work fine for a single mongo instance?
[00:22:38] <LouisT> yea
[00:22:47] <LouisT> well, it was back in the day.. but it should be fine still
[00:31:54] <morenoh149> https://github.com/fluent/fluentd
[00:32:01] <morenoh149> https://github.com/paralect/robomongo/
[00:32:07] <morenoh149> https://github.com/bobthecow/genghis/
[00:32:12] <morenoh149> https://github.com/jeromelebel/MongoHub-Mac
[00:32:21] <morenoh149> those are the most popular ones by stars on github
[00:47:29] <TylerWalts> I'm looking to scale out from a single datacenter with a 3-node replica set and add 2 more data centers, each with its own replica set, for better read performance in those regions. Is there any way to also perform writes to a secondary member of a replica set and have them eventually sync with the rest of the nodes?
[00:51:19] <TylerWalts> Shorter question: Is it possible to configure mongo to allow writes to a secondary member of a replica set?
[00:51:30] <joannac> TylerWalts: no
[00:51:46] <TylerWalts> joannac: Thanks
[00:51:59] <Boomtime> TylerWalts: you can never write to a secondary, you want to use sharding to achieve what you describe
[00:52:31] <TylerWalts> Boomtime: Thank you for the tip
[04:20:34] <shoerain> morenoh149: jesus that's a lot of mongodb UIs...
[05:29:02] <jrbaldwin> if i run: db.users.ensureIndex({'email':1},{unique:true, dropDups:true}) ——> how do i turn off dropDups for future data written to users collection
[05:29:23] <jrbaldwin>  db.users.ensureIndex({'email':1},{unique:true, dropDups:false}) ??
[05:31:00] <Boomtime> it's a unique index, once you have a unique index you can't get duplicates in future
[05:31:15] <jrbaldwin> ok
[05:31:16] <Boomtime> dropDups only makes sense while the index is being built
[05:31:29] <jrbaldwin> i have my schemas built in mongoose
[05:31:44] <jrbaldwin> would it be best to specify dropDups in the mongoose schema?
[05:32:16] <Boomtime> is the index normally created before any data is inserted?
[05:33:14] <jrbaldwin> if mongoose rebuilds the index everytime the node server is restarted then yes
[05:33:31] <jrbaldwin> i am trying to build an index for a large dataset with some duplicates
[05:34:36] <Boomtime> dropDups has no effect on data you insert later
[05:35:01] <Boomtime> once the index is built, unique:true means it will reject any attempt to insert duplicates
[05:35:14] <jrbaldwin> ok
[05:35:20] <jrbaldwin> thanks
[05:35:27] <Boomtime> while the index is building, dropDups:true means it will just straight up delete any documents that conflict
[05:35:42] <Boomtime> dropDups:true is for mad people
[05:36:10] <jrbaldwin> yeah...
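A minimal pymongo sketch of the behaviour described above, assuming a local mongod and a hypothetical test.users collection; the unique index is what rejects future duplicates, while dropDups only ever applied during the initial build (and was removed entirely in MongoDB 3.0):

    from pymongo import MongoClient, ASCENDING
    from pymongo.errors import DuplicateKeyError

    users = MongoClient().test.users  # hypothetical local collection

    # Build the unique index once; from here on, duplicate emails are rejected.
    users.create_index([("email", ASCENDING)], unique=True)

    users.insert_one({"email": "a@example.com"})
    try:
        users.insert_one({"email": "a@example.com"})
    except DuplicateKeyError:
        print("duplicate rejected by the unique index")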
[10:14:54] <disaster_> hi all
[10:15:33] <disaster_> I'm trying to set up authentication in mongodb but it doesn't seem to work
[10:15:51] <disaster_> I have followed this help : http://docs.mongodb.org/manual/tutorial/add-user-administrator/
[10:16:28] <disaster_> But after I enabled auth and restarted mongodb, I can auth with my account but I don't have admin rights in mongodb
[10:16:40] <disaster_> I can't display collections in the admin database
[10:16:52] <disaster_> I can't create users, etc
[10:17:06] <disaster_> Have you any idea about this ?
[10:22:42] <joannac> disaster_: what does db.auth(user, pass) return
[10:23:08] <joannac> and then, what does use admin; db.system.users.find() return
[10:29:20] <sachinaddy> hi.. I'm adding a user from the frontend on a localhost mongodb server, but when I connect to mongodb from robomongo, I'm not able to see that user.
[10:34:38] <Climax777> How does MMS generate alerts on the metrics it gathers? I can see how it may work for metrics as measured when the pings come in. But what about the avg/sec metrics?
[10:41:45] <joannac> sachinaddy: adding mongodb users?
[10:42:17] <joannac> Climax777: well, it knows when the last ping it got was. why can't it generate avg/sec metrics?
[10:42:42] <Climax777> Avg/sec, or any avg for that matter, is dependent on previous data
[10:44:02] <joannac> oh
[10:44:10] <joannac> it's not average since the beginning of time
[10:44:29] <Climax777> no it is probably based on the last minute/5 minutes
[10:44:32] <joannac> it's average over whatever timescale you're looking at
[10:44:40] <joannac> so i don't understand
[10:44:50] <joannac> t=0, #queries = 100
[10:44:59] <joannac> t = 5min, #queries = 5000
[10:45:10] <Climax777> I'm not talking about the graphs, I'm talking about configurable alerts. I can configure an alert for when page faults are xxx avg/sec
[10:45:24] <joannac> queries = (5000-100)/60/5 queries/sec
[10:45:42] <joannac> how is that different?
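A tiny illustration of the arithmetic joannac sketches above, computed in Python from two cumulative counter samples (the numbers are the hypothetical ones from the example):

    def avg_per_sec(prev_count, curr_count, interval_minutes):
        # Average operations per second between two counter samples.
        return (curr_count - prev_count) / (interval_minutes * 60.0)

    # counter read 100 at t=0 and 5000 five minutes later:
    print(avg_per_sec(100, 5000, 5))  # ~16.3 queries/sec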
[10:46:43] <Climax777> It means alerts for these metrics can't be generated when receiving a new ping (event driven). Somehow, either by querying the last 5 minutes' data or by polling periodically, the alerts must be generated/updated
[10:47:23] <Climax777> That can quickly escalate if there are a number of these alerts set up
[10:48:01] <Climax777> So my question is actually, how can this be done more efficiently
[10:48:39] <joannac> why? are you implementing MMS?
[10:49:04] <Climax777> Not for monitoring MongoDB, but for monitoring some M2M device metrics.
[10:50:00] <Climax777> Also if no new pings come in (the device is dead or something), averages will decay, and should affect the alerts
[10:50:58] <Climax777> It doesn't seem to be trivial
[10:55:03] <Climax777> relevant stackoverflow: http://stackoverflow.com/questions/26484963/mongodb-mms-like-alert-generation
[10:55:58] <disaster_> @joannac: If i run use admin and then db.auth("admin", "password")
[10:56:03] <disaster_> It returns 1
[10:56:37] <disaster_> and when I use db.system.users.find(), it returns
[10:57:23] <disaster_> { "_id" : "admin.admin", "user" : "admin", "db" : "admin", "credentials" : { "MONGODB-CR" : "1c6421f54c02df81746601d0f7c108f" }, "roles" : [ { "role" : "userAdminAnyDatabase", "db" : "admin" } ] }
[10:57:37] <joannac> okay
[10:57:43] <joannac> you should be able to add a new user
[10:58:55] <joannac> Climax777: i don't know what to tell you. if you want to be able to alert on metrics that must be calculated, you have to actually calculate them
[10:59:15] <joannac> disaster_: try adding a new user
[11:03:16] <Climax777> joannac: I want to try getting everything event driven; this, however, seems impossible to do with only an event-driven design
[11:04:26] <sebastianlulu> hey guys i have a db architecture/model question.
[11:04:46] <Climax777> joannac: calculation is already done with pre-aggregation
[11:05:22] <disaster_> @joannac: I can create user accounts on the command line
[11:05:57] <joannac> disaster_: cool. that's it. if you want other privileges, you need to give them to that user, or (better) create another user
[11:06:10] <joannac> userAdminAnyDatabase lets you administer user access, but that's it
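A hedged pymongo sketch of what joannac suggests: authenticate as the userAdminAnyDatabase account and use it to create a second, more privileged user. The host, user names, and passwords are placeholders; the role names are standard built-ins:

    from pymongo import MongoClient

    client = MongoClient("mongodb://admin:password@localhost:27017/admin")

    # createUser is a server command; userAdminAnyDatabase is enough to run it.
    client.admin.command(
        "createUser",
        "opsadmin",
        pwd="s3cret",
        roles=[{"role": "root", "db": "admin"}],
    )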
[11:06:57] <joannac> Climax777: i don't understand why it can't be event driven. you get a new ping (event), you do your calculations, and then check alerts
[11:07:09] <joannac> oh wait, i see the problem
[11:07:49] <Climax777> Some of the metrics will decay even without new pings coming in.
[11:07:58] <joannac> yeah, you're not doing it right. why do you want an event-driven system?
[11:08:31] <Climax777> What is the alternative? poll for every account, every alert for every device metric
[11:08:35] <joannac> like you said, what happens when a server dies and there are no more events?
[11:08:59] <sebastianlulu> lets say i have one schema that can embed one out of 3 possible schemas
[11:09:09] <sebastianlulu> how would you implement that?
[11:09:19] <joannac> Climax777: yes
[11:09:20] <Climax777> The server is least of my problems, the units (being mobile and all) will definitely miss some pings
[11:09:37] <Climax777> I'm not sure if that is even feasible
[11:09:53] <joannac> I can assure you it is, since that's what MMS does :P
[11:10:33] <Climax777> true, but what would you suggest. one background task, iterating through all alerts, or one per MMS group?
[11:10:40] <Climax777> or even more fine grained
[11:10:43] <joannac> I'm not sure how else you could do realtime monitoring
[11:10:57] <joannac> Climax777: design decision. what makes sense for you
[11:11:57] <Climax777> thanks joannac
[11:13:07] <Climax777> one could make the assumption that if no pings come in, no alerts will further be generated. I'm just thinking that a down unit alert always trumps other metrics
[11:14:02] <Climax777> and with that we're back to event driven (closest to real time) design
[11:14:08] <Climax777> thanks for bouncing ideas
[11:16:58] <sebastianlulu> if i have schema A and it can contain either B or C - but has to contain one of the two possibilities - how can i do that?
[11:17:20] <joannac> can't. do it in your app
[11:17:27] <joannac> or use an ORM
[11:17:32] <sebastianlulu> using mongoose
[11:17:41] <Climax777> sebastianlulu: save an extra discriminatorKey field
[11:17:49] <Climax777> or something like _type
[11:18:06] <Climax777> and on app side you can determine which kind it is, and also in other queries
[11:18:14] <sebastianlulu> ok so A has a type field
[11:18:31] <Climax777> Either A, or( B and C)
[11:19:00] <sebastianlulu> but when i define the scheme
[11:19:24] <Climax777> to elaborate, you can say A contains a type B/C or you can say this sub document is type B/C
[11:19:35] <sebastianlulu> yeah i get it
[11:19:49] <Climax777> Have a look here: https://github.com/briankircho/mongoose-schema-extend
[11:20:27] <Climax777> It just doesn't work that great at this stage for embedding, but it gives you an idea
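A plain-pymongo sketch of the discriminator idea described above (the _type tag, the collection, and the document shapes are invented for illustration; mongoose's discriminators and the schema-extend plugin wrap the same pattern):

    from pymongo import MongoClient

    docs = MongoClient().test.articles  # hypothetical collection

    # Schema A embeds either a B-shaped or a C-shaped sub-document,
    # tagged with _type so the application knows how to interpret it.
    docs.insert_one({"title": "one", "payload": {"_type": "B", "b_field": 1}})
    docs.insert_one({"title": "two", "payload": {"_type": "C", "c_field": "x"}})

    # Queries can then select on the embedded type.
    for doc in docs.find({"payload._type": "B"}):
        print(doc["title"])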
[11:21:02] <sebastianlulu> looks good Climax!
[11:21:43] <Climax777> :)
[11:22:23] <sebastianlulu> you rock guys thanx!
[11:22:42] <sebastianlulu> joannac +1
[11:27:28] <sachinaddy> how to enable oplog in mongodb?
[12:27:44] <mlowicki> Hi, is this the right place to ask about MMS? I'm adding a new deployment after a couple of months. In the deployment view I get "Deployment is empty" but the monitoring agent is working fine (syncing at least once per min). Any idea?
[12:28:18] <mlowicki> when I want to add a new server I get info that AWS hasn't been configured..
[12:28:24] <mlowicki> but I use my own servers
[14:18:41] <mlowicki> I want to upgrade MongoDB from 2.6.3 to 2.6.5 on debian wheezy. Launched sudo apt-get update and then sudo apt-get install -y mongodb-org but mongod --version still returns 2.6.3
[14:19:10] <mlowicki> dpkg --list | grep mongo shows that mongodb-org is in 2.6.5 (metapackage) but version of mongodb-org-server is 2.6.3
[15:28:32] <legrandin> hey guys, how can i get a day_of_year calculation out of a datetime field
[15:35:01] <mike_edmr> Q: just upgraded 2.4->2.6. some of my indexes appear to be gone. Re-added one, watched it build, then it disappeared.
[15:35:14] <mike_edmr> added it again, watched it build, then it disappeared
[15:35:29] <cheeser> well, that's weird
[15:35:32] <mike_edmr> what the hell could be causing it to quietly disappear when it gets done building??
[15:36:37] <GothAlice> mike_edmr: Possibly that it detected the index could not help a query. I.e. only one out of every 100 records has a value to be indexed in a non-sparse index. (Such an index would be a hindrance to most queries, not a help.)
[15:36:56] <GothAlice> However I wasn't aware that MongoDB does that type of rather glaring optimization. It may not.
[15:36:59] <mike_edmr> this one is absolutely essential and prevents people logging in when it's not there
[15:37:07] <GothAlice> Ouchies.
[15:37:21] <mike_edmr> because their access tokens cant be looked up before a timeout happens
[15:37:32] <GothAlice> Double ouchies.
[15:39:15] <GothAlice> mike_edmr: If you are reproducing the problem in development, try enabling the audit log in mongod, forcing an index build, and see what happens?
[15:39:50] <GothAlice> mike_edmr: As an aside, were you using map/reduce queries in 2.4?
[15:40:19] <mike_edmr> did in older versions of app, not currently using map/reduce
[15:42:48] <GothAlice> Alas, the audit log is probably your best hope for identifying the issue. I've never heard of or encountered your unique and beautiful problem. :/
[15:45:15] <mike_edmr> ty for the tip
[15:56:51] <MxxCon_> does anyone here have experience with Ansible's `mongodb_user` module?
[17:08:30] <doug1> According to http://docs.mongodb.org/manual/tutorial/deploy-shard-cluster/, the config servers should be installed first, followed by the mongos instance. However, if I do that, how can I add the admin user when there's nowhere to store anything yet?
[17:09:58] <GothAlice> doug1: If you have authentication enabled in your cluster, your sharding/replication behaviour is secured using a pre-shared key, not in-database authentication.
[17:11:28] <GothAlice> doug1: When bootstrapping a new cluster I've spun them up w/o authentication enabled, created the initial replication configuration, got replication working, then enabled authentication, added my admin user to the primary in isolation, then configured the PSKs. After restarting the cluster the authentication data would be propagated.
[17:11:29] <doug1> GothAlice: I have the key, but I need to create the admin user.
[17:11:50] <doug1> I'm trying to automate this with chef, so would like to minimise manual steps
[17:12:50] <GothAlice> You could automate it. Spin up your initial master, configure its sharding and authentication (w/o authentication enabled at this point), *then* enable authentication and spin up the slaves.
[17:14:40] <GothAlice> (In fact, not a single step there can't be automated. We do this in-house, but alas, I can't share our scripten.)
[17:18:23] <doug1> GothAlice: Nah, that's not gonna fly
[17:18:55] <doug1> plus, i have two shards, each with a master and two replicas
[17:19:57] <GothAlice> Much of our MongoDB-based automation is template-generated JavaScript passed to the mongo interactive shell. Also note: rebalancing shards is cheap and easy if the dataset is empty. As with most IT problems, breaking it down into discrete steps can help clear things up. ;)
[17:20:18] <doug1> But... can I add the admin user to the mongos if there are no shards yet?
[17:20:59] <GothAlice> I add it to the master mongod prior to enabling sharding or replication.
[17:21:16] <GothAlice> (Well, prior to enabling *final* secure replication, and prior to any sharding.)
[17:21:38] <doug1> Sigh. The documentation says to bring up the mongod LAST
[17:22:10] <doug1> If it's last, there is no _before_
[17:22:28] <doug1> wait, that made no sense. omg mongo is confusing
[17:24:00] <GothAlice> doug1: It can be, but don't lose hope. I play around with my sharding/replication setups on a single testing VM (by finagling port numbers). Damn, my home internet connection must be down or I'd check my Exocortex code; it automatically constructs local three-node replication clusters for testing. T_T
[17:24:18] <GothAlice> And it's a simple bash script.
[17:24:26] <doug1> I have to get this working with chef... using the community Edelight one
[17:27:18] <GothAlice> Waaaitaminute. Are you trying to replicate shards, or shard a replication set? (I.e. client -> replication -> sharding or client -> sharding -> replication)
[17:27:56] <GothAlice> Sharding replication sets seems weird to me. (You could have three replication boxen for shard 1, but six for shard 2 for no reason.)
[17:28:04] <doug1> GothAlice: Uhm...
[17:28:11] <doug1> It's two shards, each with 3 copies of the data
[17:28:28] <GothAlice> Why not three copies of the data, each split into two shards? Eh? Eh? ;^)
[17:28:43] <doug1> GothAlice: Why not? Because I have no idea what I am doing
[17:29:27] <doug1> Because I work for a startup that expects me to automate the shit out of it and understand it without training and the docs are sketchy and we had the mongo folks come up and they weren't interested at all in automating mongo
[17:30:37] <cheeser> https://university.mongodb.com/courses/M102/about
[17:30:37] <GothAlice> Step one before "flipping all the switches" (sharding, replication, authentication) is getting a simpler setup automated. Without authentication is infinitely easier. Most deployments pretty explicitly do not need authentication.
[17:30:59] <GothAlice> (You secure MongoDB through explicit firewall routing.)
[17:31:17] <doug1> Gotta have auth or the ops guy will have a kitten
[17:31:28] <doug1> no point in automating simpler setup because that's not what will be tested
[17:31:55] <cheeser> baby steps...
[17:32:29] <GothAlice> … considering you'll need to include the password in your code, any exploit that gains access to disk or memory will leak the DB password, making the password pointless from a security standpoint—that is, unless you fail to firewall MongoDB. If you don't firewall it (i.e. only allow connections from host X) you're doing the internet wrong and your ops guy will have a litter of kittens.
[17:33:16] <doug1> GothAlice: the app code will only have an app user. there are ways to obfuscate the admin password
[17:33:59] <GothAlice> doug1: Not if it's ever used. ;) The deobfuscation code has to be bundled with it in order for you to ever use the value.
[17:34:19] <doug1> GothAlice: Not if we use EC2 IAM roles
[17:35:46] <GothAlice> … I worked for a startup for a while. 8 million SLoC later we had some pretty epic EC2 automation for Postgres—requiring no physical storage at all! Of course, we also failed and folded—the project wasn't to automate Postgres. ;P
[17:37:10] <doug1> GothAlice: That might work here if we didn't have a 1970's style division between ops and dev. Dev doesn't touch prod, which means dev has to give something shrink wrapped and boxed to QA and ops to deploy.
[17:37:23] <GothAlice> The only legitimate use I've had to date for authentication in MongoDB (since actual host-to-host security is handled properly via the firewall and IPSec) is to isolate multiple simultaneous users of a single cluster. I.e. my "cheap cheap cheap hosting" service uses a shared cluster set up this way. (It's also not sharded, just replicated, as these users' quotas won't ever require it.)
[17:37:33] <doug1> I was in ops until last week when I said I'd had enough. I was moved to eng.
[17:38:43] <GothAlice> Vagrant FTW. Here, have a complete, stable VM image representing "an application server" (automatically configuring the front-end proxy on startup) and another representing "a database cluster member" (automatically identifying the existing cluster and joining it, or recovering from latest snapshot + oplog replay.)
[17:38:50] <doug1> or.. if it needs to be boxed for other people to deploy.....
[17:39:55] <GothAlice> Packaging up a new VM snapshot is a 'git flow release' hook, or running 'tox' locally to build a temporary VM for local testing. Arguments (i.e. the DNS name to load the initial configuration from) supplied to the Linux kernel and processed by init scripts. :)
[17:42:31] <GothAlice> However, authentication seems to be the key sticking point you're running into with your sharding setup. It's also something you probably don't actually need, given EC2's excellent control over firewalls. (Love the tag-based access control.) Hell, if you're running this on EC2 or any other "cloud" provider, you don't even have physical control over your data. Authentication (beyond external access control) is doubly pointless. :D
[17:43:35] <doug1> not gonna go the 1970's gold image route...
[17:44:11] <GothAlice> Not going the "golden master" route increases the likelihood that production and staging environments will differ in subtle, but issue-masking ways.
[17:45:00] <GothAlice> For continuous integration it's also helpful to have an *exact* snapshot of the failed build that is reproducible on any machine with VTx support. ;)
[17:49:46] <GothAlice> https://gist.github.com/amcgregor/4947201 — waaay old, but should demonstrate some of the ideas.
[17:52:20] <doug1> GothAlice: how's that related?
[17:54:10] <GothAlice> It's to demonstrate that a lot of what one might have as application-level deployment automation can in fact be VM startup automation, using VM snapshots as your "boxed up" solution.
[17:55:49] <doug1> GothAlice: Think we're about as opposed as we can be on that. :)
[18:24:22] <mike_edmr> are there any major changes going from 2.4->2.6 which would have destroyed my write performance and cause excessive global write locks?
[18:24:37] <mike_edmr> because thats what I'm seeing now with all my indexes back to normal
[18:24:57] <mike_edmr> updates and inserts are piling up waiting for a write lock on the db
[18:29:01] <mike_edmr> hello?
[19:17:10] <mike_edmr> are there any major changes going from 2.4->2.6 which would have destroyed my write performance and cause excessive global write locks?
[19:17:21] <mike_edmr> anybody?
[19:17:33] <GothAlice> mike_edmr: Do you use map/reduce?
[19:17:36] <mike_edmr> no.
[19:18:05] <mike_edmr> I have got 200 or so queries on average now waiting for a write lock on the database
[19:18:06] <GothAlice> Welp. That's one idea thrown out. Have you attempted a full database repair, and if so, I assume the symptoms persisted.
[19:18:12] <mike_edmr> before 2.4
[19:18:20] <mike_edmr> I have not attempted a full database repair
[19:18:29] <mike_edmr> how long does that take, and does it bring the db offline?
[19:18:48] <mike_edmr> currently my app is somewhat useable because only writes are going very slow
[19:18:57] <GothAlice> Ouch; this is production data? A repair would require being offline for as long as it takes to copy and re-pack the on-disk stripes.
[19:19:04] <mike_edmr> yes it is production
[19:19:15] <GothAlice> T_T Always test things like this on a clone of production data in staging.
[19:19:51] <mike_edmr> it worked fine on staging
[19:20:18] <mike_edmr> with handfuls of users and test suites
[19:20:26] <mike_edmr> in production the writes pile up and its crap
[19:22:27] <GothAlice> O rly? Any diff between their configurations? (This includes sysctl settings at the kernel level.) You could try enabling slow query logging "live" (no downtime needed) and watching the logs to see exactly what MongoDB thinks it's doing during those writes. There is also an audit log you can enable (requires a quick service restart to enable AFAIK) which may provide more details.
[19:25:10] <GothAlice> Slow writes on my own cluster are generally attributable to the on-disk stripes becoming sparse, but with no holes large enough for the data to insert. (I.e. it needs to investigate and reject potentially thousands of locations for the data.) Re-packing (via --repair or compact) fixes that. Both repairing and compacting block use of the database.
[19:25:15] <GothAlice> (Though compact should be generally faster.)
[19:26:04] <GothAlice> Note that compact will not free space, unlike repair, and will require up to 2GB of additional disk space during operation. Repair requires free space equal to your current dataset size + 2GB.
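A hedged sketch of the "live" diagnostics mentioned above, using pymongo against a placeholder database: the profiler can be enabled without a restart, its output lands in system.profile, and compact is a per-collection command (which, on the MMAPv1 engine of that era, blocked the database while it ran):

    from pymongo import MongoClient

    db = MongoClient().mydb  # placeholder database name

    # Profile level 1 records operations slower than slowms milliseconds.
    db.command("profile", 1, slowms=100)

    # Reproduce the slow writes, then inspect the most recent entries.
    for op in db.system.profile.find().sort("ts", -1).limit(5):
        print(op.get("op"), op.get("ns"), op.get("millis"))

    # Re-pack a fragmented collection.
    db.command("compact", "mycollection")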
[19:36:46] <GothAlice> mike_edmr: Are your writes using non-default write concerns? Your writes could all be waiting for on-disk sync()s that are slow to complete.
[19:37:01] <GothAlice> (Having journalling enabled or not will have an impact on this.)
[19:37:10] <mike_edmr> some of them were at majority
[19:37:25] <mike_edmr> they have been moved back to 0 with no difference
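For reference, a hedged sketch of how write concerns are selected per operation in pymongo (collection and field names are placeholders); w="majority" waits for replica-set acknowledgement, while w=0 is fire-and-forget:

    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    coll = MongoClient().mydb.events  # placeholder collection

    # Wait until a majority of the replica set acknowledges the write.
    coll.with_options(write_concern=WriteConcern(w="majority")).insert_one({"n": 1})

    # Unacknowledged write: fastest, but errors are silently dropped.
    coll.with_options(write_concern=WriteConcern(w=0)).insert_one({"n": 2})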
[19:38:39] <GothAlice> I'd be curious to see if 'diff' finds a difference between production and staging mongod.conf or the output of `sysctl -a` from each machine.
[19:38:55] <GothAlice> Open file handle limits are something I ran into a problem with a long time ago.
[19:39:57] <GothAlice> mike_edmr: Finally, how does your production and staging environment's on-disk storage differ, if at all? I.e. production runs on EC2 EBS volumes, staging runs on the EC2 ephemeral volume?
[19:45:29] <mike_edmr> 19:37 < mike_edmr> they have been moved back to 0 with no difference
[19:45:32] <mike_edmr> oops
[19:45:44] <mike_edmr> sysctl not available on those machines
[19:47:32] <mike_edmr> staging and prod were identical 3 vm replicasets
[19:47:44] <mike_edmr> issue in staging is we didn't have enough load to see the issue
[19:48:00] <mike_edmr> prod we have 5000 req per minute
[19:49:07] <morenoh149> what does the __v field mean?
[19:49:20] <morenoh149> { _id: 5446b881fa549606ddd0b16f, name: 'fluffy', __v: 0 },
[19:50:05] <morenoh149> mike_edmr: would https://loader.io/ help?
[19:50:17] <mike_edmr> in retrospect, yes
[19:52:26] <morenoh149> http://stackoverflow.com/questions/12495891/what-is-the-v-field-in-mongodb
[20:02:53] <mike_edmr> we have reverted to 2.4 and...
[20:06:25] <mike_edmr> no more operations waiting for write locks
[20:06:37] <mike_edmr> everythings nice and fast
[20:06:47] <GothAlice> morenoh149: Yeah, _id is the only MongoDB-enforced key on documents, AFAIK. (Excluding GridFS which is its own bucket of bolts.)
[20:06:58] <GothAlice> Er, attribute, not key.
[20:08:02] <mike_edmr> perhaps after more testing on the staging cluster i'll be able to share something
[20:08:15] <mike_edmr> on why 2.6 shat the proverbial bed
[20:18:57] <speaker1234> if I have a list of objects, will mongo split them into separate records or do I need to do that?
[20:19:22] <GothAlice> speaker1234: MongoDB allows you to store lists within your documents, so it depends entirely on what you want it to do.
[20:19:50] <GothAlice> speaker1234: If you do want separate records, then either do an iterative series of separate inserts or, if your driver offers it, use bulk_insert.
[20:20:33] <speaker1234> what I have is a bunch of individual records, all the same format, stored in a list so I could transport them as a single element
[20:20:39] <speaker1234> all json
[20:21:38] <speaker1234> one thought was I should convert them from json to python native before injecting them into the database as a simple sanity check that I was getting good data
[20:21:59] <speaker1234> and then with pymongo reinject them into mongo
[20:31:49] <GothAlice> speaker1234: If you're using Python, I can highly recommend MongoEngine, which includes a bulk_insert utility. And yeah, it's a good idea to json.loads() before dumping into the DB.
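A minimal sketch of that flow with pymongo, assuming the payload arrives as a JSON array of records (all names here are placeholders): json.loads is the sanity check, insert_many is the bulk write.

    import json
    from pymongo import MongoClient

    payload = '[{"task": "a", "state": "queued"}, {"task": "b", "state": "queued"}]'

    records = json.loads(payload)        # fails loudly on malformed input
    assert isinstance(records, list) and records

    tasks = MongoClient().test.tasks     # hypothetical collection
    result = tasks.insert_many(records)  # one round trip for the whole batch
    print(len(result.inserted_ids), "documents inserted")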
[20:33:33] <speaker1234> okay, I need you to go slowly and speak small words because I'm new to mongodb. I installed the stock package for mongo from the mongo repository
[20:34:38] <speaker1234> and I was going to install pymongo
[20:35:24] <speaker1234> is mongoengine that much better? all I'm doing is inserting records, and retrieving them according to certain parameters as if it were a priority queue
[20:35:57] <GothAlice> MongoEngine is a layer on top of pymongo that gives you some goodies. https://github.com/marrow/marrow.task/blob/develop/marrow/task/model/query.py#L9-L58 is one of the goodies I added to MongoEngine. (Still need to submit that patch upstream, though.)
[20:36:00] <speaker1234> later I'll create reports on the state of the data for different customers.
[20:36:12] <GothAlice> That link, BTW, allows you to treat MongoDB *as* a queue (like redis).
[20:37:49] <speaker1234> the database entries go through several states such as queued, in process, delivered, failed, retry.
[20:38:04] <speaker1234> The usual stuff
[20:38:17] <speaker1234> looking at the patch, it looks like it does most of what I need
[20:38:28] <GothAlice> Yeah… look down. >:P
[20:39:05] <GothAlice> https://github.com/marrow/marrow.task/blob/develop/marrow/task/model/task.py#L162-L171 shows how to use that TaskQuerySet with a MongoEngine document schema.
[20:40:24] <GothAlice> Almost all of my states are programatically calculated, rather than storing a single string value to represent state: https://github.com/marrow/marrow.task/blob/develop/marrow/task/model/task.py#L221-L240
[20:40:25] <speaker1234> this is going to take a bit of study. It also tells me that I need to move the entries, once they are finished, on to a different database so that the queue won't grow overly large
[20:42:15] <GothAlice> Indeed. marrow.task handles this in two ways: collections that act like queues have an inherently limited size, and are called capped collections for this reason. They're implemented as a ring buffer. (I.e. it gets to the end, returns to the beginning and keeps writing.) The second is that the Task records themselves get set to expire:
[20:42:16] <GothAlice> https://github.com/marrow/marrow.task/blob/develop/marrow/task/model/task.py#L188
[20:42:58] <speaker1234> I think that's where our models diverge. Tasks never expire. And I need to handle something like 2 million tasks a day
[20:43:31] <GothAlice> That's the real trick, my Tasks only expire after they are complete, and only 30 days after that.
[20:43:36] <speaker1234> because the systems of the far end sometimes get busy, tasks can queue up for a number of days but they still need to be delivered
[20:43:40] <GothAlice> https://github.com/bravecollective/core/blob/develop/brave/core/account/model.py#L198 < MongoDB can handle expiry of completed tasks for you.
[20:44:26] <GothAlice> speaker1234: Not an issue. The "remote" in your case would search the Tasks collection for ones that haven't been completed, ignoring the queue, and once caught up would then listen to the queue again.
[20:45:36] <speaker1234> I get my task lists from the database upstream (customer does geolocation stuff for handing out tasks to workers)
[20:45:53] <GothAlice> I.e. for task in Task.objects(complete=None): pass # do something \n for notice in TaskMessage.objects(task__gt=last_task.id): pass # process tasks added during catch-up
[20:46:31] <GothAlice> The length of the queue would be determined by how far behind you need to let the queue get before giving up and doing ^ as a more aggressive catch-up.
[20:46:44] <speaker1234> but when I pull things off the queue, I either pull them off sequentially (low-cost service) or pull them off by customer name sequentially for the higher priority, i.e. tasks from customers who spent more $$
[20:47:34] <speaker1234> the idea is to have one fetcher that would always take the new, untried, tasks and a second for the stuff that filled the first time around
[20:47:43] <speaker1234> field/failed
[20:47:49] <speaker1234> filled/failed
[20:47:50] <GothAlice> marrow.task internally handles priority levels by having the queue runner picking up tasks rapidly hand tasks off to a local distributor (i.e. concurrent.futures) that has been made priority-aware.
[20:47:53] <speaker1234> joys of speech recognition
[20:48:09] <GothAlice> Heh; it's been impressively great so far.
[20:48:13] <ejb> Hi, I need some design guidance for creating a weighted tag system. Essentially, every doc has an array of tags and I need to query and sort the docs based on how many tags are matched. The weighted aspect is that the tags in every query will vary in weight. So the tag weight for one query may be { foo: 10, bar: 5 } while the next one is { baz: 10, foo: 2 }. Does this make sense?
[20:48:46] <speaker1234> GothAlice, okay, I learn by doing. So I'm going to download the engine and your patch (where do I put it?)
[20:49:05] <GothAlice> m.task handles task retrying, that above "catch-up" process, and even RPC calling (and iteratively retrieving) remote generators, i.e. to indicate progress upstream.
[20:49:10] <ejb> In the former, docs tagged with foo should rank higher than those with bar. And in the latter, docs tagged with baz should rank higher than those with foo.
[20:49:29] <GothAlice> speaker1234: m.task is currently in the process of being extracted and sanitized from my work's proprietary codebase. It's incomplete.
[20:49:31] <speaker1234> can you give me a stone-simple insert/pull-off example I can work with and experiment from?
[20:49:46] <ejb> I'm new to mongo so any guidance would be much appreciated.
[20:49:53] <GothAlice> speaker1234: However the queue runner I linked first is fully functional.
[20:50:02] <mike_edmr> ejb: switch to postgres badum pshhh
[20:50:30] <mike_edmr> jk
[20:50:33] <GothAlice> ejb: Store your tags as a list of subdocuments, i.e. [{t: 'foo', w: 1.0}, {t: 'bar', w: 2.0}]
[20:50:39] <speaker1234> mike_edmr, as much fun as it may be to tease us with little knowledge, it isn't very kind. :-)
[20:50:57] <speaker1234> mike_edmr, just think, we may know something you don't and we may return the favor. :-)
[20:51:06] <GothAlice> ejb: Then index on tags.t (in this sample)
[20:51:08] <mike_edmr> i just wanted to take a jab at mongo because it made my life terrible today
[20:51:13] <mike_edmr> i would help if i could
[20:51:43] <ejb> GothAlice: the weights change with each query
[20:52:18] <ejb> GothAlice: the weights aren't an attribute of the stored tags
[20:52:24] <GothAlice> ejb: A ha.
[20:52:36] <GothAlice> ejb: Then your problem simplifies down to careful aggregate query construction.
[20:53:44] <GothAlice> How many weights on average?
[20:53:59] <ejb> probably 8
[20:54:16] <ejb> could be up to say, 20
[20:54:33] <GothAlice> Cool, so not *too* intense. I assume you only care about documents that match at least one of the tags referenced by the weights? (Or do you require all?)
[20:54:56] <ejb> Yep, just those matching at least one
[21:00:30] <GothAlice> Hmm.
[21:00:35] <GothAlice> This is an interesting challenge.
[21:00:47] <GothAlice> Interesting enough that I must gist my solution.
[21:04:38] <ejb> GothAlice: I have another variable to make it even more interesting ;)
[21:06:56] <ejb> These documents also have a geometry attribute (GeoJSON Point ... lat/lon). I want to weigh distance ($geoNear) into the sort order.
[21:07:17] <GothAlice> I'll leave that bit as an exercise for the reader. Think I'm nearly done.
[21:07:30] <ejb> GothAlice: lovely.
[21:12:21] <GothAlice> This aggregate query just keeps getting more funky. Now I've even got a $let in there.
[21:20:26] <ejb> GothAlice: Sorry, but I actually need to get going. Can you give me your GH username so I can find the gist when you're done?
[21:20:34] <GothAlice> ejb: amcgregor
[21:20:47] <GothAlice> I'm having difficulty internally debating between several different approaches.
[21:21:06] <ejb> GothAlice: Much appreciated. I'm sure I'll have some questions... so I'll be back to discuss. Thanks
[21:21:24] <GothAlice> $let w/ $sum of a Python-generate chain of $cond's, or $unwind, or $map…
[21:22:10] <GothAlice> ejb: https://gist.github.com/amcgregor/702c18ac18570ca3e231
[21:22:14] <GothAlice> I'll update it as I progress.
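One way the "$sum of a Python-generated chain of $cond's" idea could be sketched with pymongo; the collection, the tags layout (a plain array of strings), and the per-query weights are all assumptions, and GothAlice's gist remains the authoritative version:

    from pymongo import MongoClient

    coll = MongoClient().test.docs      # hypothetical {tags: [...]} documents
    weights = {"foo": 10, "bar": 5}     # weights vary per query

    pipeline = [
        # Keep only documents matching at least one weighted tag.
        {"$match": {"tags": {"$in": list(weights)}}},
        # Score = sum of the weights of the tags the document actually has.
        {"$project": {"tags": 1, "score": {"$add": [
            {"$cond": [{"$setIsSubset": [[tag], "$tags"]}, weight, 0]}
            for tag, weight in weights.items()
        ]}}},
        {"$sort": {"score": -1}},
    ]

    for doc in coll.aggregate(pipeline):
        print(doc["score"], doc["tags"])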
[21:23:23] <ejb> GothAlice: Cool. I'll drop back in tomorrow
[21:39:22] <speaker1234> is there a simple commandline tool that can dump the content of a database so I can see what's there?
[21:39:33] <GothAlice> speaker1234: mongoexport?
[21:39:47] <speaker1234> Okay. Didn't find that yet
[21:43:22] <speaker1234> GothAlice, thanks, that did the trick
[21:43:29] <GothAlice> :)
[21:43:57] <speaker1234> I'm just trying to get familiar with the basics and I'll probably move to your module after I make a very simple queue work
[21:48:10] <diegows> where is the log output documented? I can't find it... I don't remember the meaning of some information about slow queries
[21:49:41] <speaker1234> are there any good examples of atomic find-modify-write cycles in mongo?
[21:51:57] <Boomtime> speaker1234: what do you want to know? the update command is atomic, for example
[21:52:38] <speaker1234> I'm just learning so I'm not sure what I don't know. using pymongo. I want to find a record, change a field, and write it back to the database as an atomic operation
[21:52:52] <speaker1234> is that what update does?
[21:53:09] <GothAlice> diegows: /var/log/mongo… usually. Or syslog, or nowhere, all depends on your configuration and mongod command line arguments.
[21:53:34] <GothAlice> speaker1234: One of the most powerful patterns in MongoDB is "update-if-not-different", allowing you to implement things like locks on documents atomically.
[21:53:54] <Boomtime> speaker1234: yes
[21:54:14] <GothAlice> speaker1234: Task.objects(owner=None).update(set__owner=me) — set me as the owner only if no other owner is set; guaranteed not to conflict with other simultaneous attempts to do this. (One will win, the others will lose.)
[21:54:23] <Boomtime> speaker1234: use $set to set only specific fields in a single operation
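A minimal pymongo sketch of that kind of atomic, single-document update (collection, field names, and state values are placeholders); no separate read is needed, and only the listed fields change:

    from pymongo import MongoClient

    tasks = MongoClient().test.tasks  # hypothetical collection

    result = tasks.update_one(
        {"state": "queued"},                # match criteria
        {"$set": {"state": "in_process"}},  # change only this field
    )
    print(result.modified_count, "document updated")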
[21:54:36] <diegows> GothAlice, I was taling about documentation :)
[21:54:58] <speaker1234> looking at http://api.mongodb.org/python/current/api/pymongo/collection.html update method
[21:55:13] <Boomtime> speaker1234: your update will match any document whose owner=None
[21:56:01] <Boomtime> if there are 2 such documents present, then running the update twice will update different documents
[21:56:11] <GothAlice> diegows: The mongod configuration documentation goes to some length to describe the available logging options. http://docs.mongodb.org/manual/reference/configuration-options/#systemLog and http://docs.mongodb.org/manual/reference/configuration-options/#auditLog
[21:56:22] <speaker1234> okay, I'm still missing some basic information. For example the definition of a spec or document
[21:56:54] <GothAlice> speaker1234: In simple terms: a mapping or dictionary of values.
[21:57:00] <GothAlice> Any mapping or dictionary of values.
[21:57:37] <diegows> GothAlice, thanks
[21:59:21] <kellyp> heyo, is it possible to get an the _id field as a string inside of a aggregation $project?
[22:00:11] <speaker1234> so is spec a dictionary of a key and the value that key must have?
[22:00:13] <GothAlice> kellyp: $concat [ '$_id', '' ] maybe?
[22:00:53] <kellyp> I tried that and it returned "$concat only supports strings, not OID"
[22:01:46] <GothAlice> kellyp: T'was worth a shot. ;)
[22:02:04] <kellyp> :) thanks
[22:02:29] <GothAlice> ObjectID support in aggregate queries has been a bane of my existence, as I generally avoid adding explicit 'created' dates to my documents. ('Cause you can just pull that from the _id. Not so much in an aggregate query, though.)
[22:03:06] <GothAlice> speaker1234: I'm not quite clear on what you are referring to as a 'spec'.
[22:03:13] <speaker1234> How would you specify a set of values for the spec? In the updated example, the spec is {'x':'y'} but what if you wanted to match more than 'y'. What if you want to match something like {'x':'y'|'g'|'h'|'z'}
[22:03:47] <speaker1234> or {'x': 0<y<20}
[22:03:55] <GothAlice> {'x': {'$in': ['y', 'g', 'h', 'z']}} # x must be one of these values
[22:04:28] <speaker1234> GothAlice, I'm assuming that it doesn't need to be string literals for the values?
[22:04:37] <GothAlice> {'x': {'$gt': 0, '$lt': 20}} # 0<y<20
[22:04:44] <GothAlice> ^ not at all!
[22:04:47] <GothAlice> http://docs.mongodb.org/master/core/read-operations-introduction/#query-interface
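Spelled out as runnable pymongo queries (collection and data are placeholders): $in expresses "one of these values" and $gt/$lt express the open numeric range.

    from pymongo import MongoClient

    coll = MongoClient().test.things  # hypothetical collection

    # x must be one of several discrete values:
    for doc in coll.find({"x": {"$in": ["y", "g", "h", "z"]}}):
        print(doc)

    # 0 < x < 20, an open numeric range:
    for doc in coll.find({"x": {"$gt": 0, "$lt": 20}}):
        print(doc)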
[22:05:02] <speaker1234> Thank you for the missing piece.
[22:05:29] <GothAlice> I enjoy MongoDB querying much more than SQL. :3
[22:05:46] <speaker1234> GothAlice, to answer your question, 'spec' is the first argument to update in pymongo.
[22:05:56] <GothAlice> Ah, yes.
[22:06:19] <GothAlice> http://docs.mongodb.org/master/core/write-operations-introduction/#update
[22:06:43] <GothAlice> Core docs call it "update criteria" (like "query criteria" for find).
[22:06:54] <speaker1234> GothAlice, oh I am orders of magnitude more successful with mongo in the first two hours of using it than I was in the first two days of using sqlite
[22:07:32] <GothAlice> Interesting SQLite fact: SQLite ignores your schema if you choose to write a non-conforming value to a field. (I.e. the column may be `int`, but you can write strings to it.)
[22:07:38] <GothAlice> SQLite is almost as schemaless as Mongo. ;)
[22:07:45] <speaker1234> honest, I didn't even look at mongo until about two hours ago
[22:08:03] <speaker1234> this is a tool worth learning deeply
[22:08:08] <GothAlice> Yes, yes it is.
[22:08:28] <GothAlice> My first MongoDB map-reduce: https://gist.github.com/amcgregor/1623352
[22:08:42] <speaker1234> I hope it will help me make money in my consulting practice. :-)
[22:09:03] <speaker1234> ever the mercenary...
[22:09:10] <GothAlice> If nothing else it'll let you rapidly prototype ideas like a bat out of hell. :D
[22:09:38] <speaker1234> It's also easy to use when your hands are broken. This is why I write code in Python. It's not just a good language, it is something I can dictate with minimal overhead
[22:10:33] <GothAlice> I managed to use Mac OS's built-in recognition to code JavaScript and Python for a while. Quite handy. "function hello open parenthesis name equals double quote world double quote close parenthesis open curly brace newline …"
[22:10:37] <GothAlice> I've felt your pain. ;)
[22:11:18] <speaker1234> I've been living with it for 20 years. And the problem is, people assume that writing code as you gave as an example, is an acceptable solution.
[22:11:32] <GothAlice> Oh! I have a presentation for you, if you weren't the one who gave it…
[22:11:50] <speaker1234> There's a big guy that showed programming using speech recognition at a Python convention. Forget his name
[22:12:17] <GothAlice> https://www.youtube.com/watch?v=OWyMA_bT7UI
[22:12:30] <GothAlice> He had a really complex multi-machine setup, though.
[22:12:40] <GothAlice> (Skinny fellow.)
[22:12:53] <speaker1234> yes. He uses what is called the "grunt, belch, fart" school of speech recognition
[22:13:07] <GothAlice> Yes.
[22:13:12] <GothAlice> It's amazing to behold.
[22:13:15] <speaker1234> he is actually brutally harassing the language model and is reducing its effectiveness for ordinary speech.
[22:13:37] <speaker1234> I'm working on a different technique which works with the language model and is more intentional in its interface than expressive
[22:13:57] <speaker1234> look up on YouTube togglename and you will find a very raw demonstration of the first of the techniques
[22:14:07] <GothAlice> Can't argue with success, though. I could imagine his approach is speedier once you're acclimatized (or if you develop the linguistic set yourself from scratch, you'll know it to start with).
[22:14:19] <speaker1234> editing video is not something one can do easily if your hands don't work.
[22:15:35] <Moogly2012> hello
[22:15:42] <GothAlice> speaker1234: True that. I'm thinking that a metric ton of Automator bindings as Speakable Items *might* work for iMovie, but professional software? Not even close.
[22:15:45] <GothAlice> Moogly2012: Howdy!
[22:15:55] <speaker1234> https://docs.google.com/presentation/d/1nKJu5m9B19FjpvWVZlvY0hJTKG61ohvNG5zkrViRyvM/edit?usp=sharing
[22:16:23] <speaker1234> this is the presentation showing some of the next stage. It's a way a template expansion can be driven through a question-and-answer session
[22:16:33] <speaker1234> as a result, you can do very high level operations for very few words
[22:16:40] <speaker1234> which is the sweet spot for speech recognition
[22:16:57] <GothAlice> speaker1234: Slide 3: No cursor insertion point support? (i.e. drop into the __init__ arglist?)
[22:16:57] <Moogly2012> hi GothAlice
[22:18:01] <rainabba_> I have a dedicated mongodb instance on AWS and I need to move my data to a shared instance with Modulous.io where I expect that I'm lacking many permissions. Should I bother trying CopyDB (http://docs.mongodb.org/manual/reference/command/copydb/) ? Any suggestions on what likely would work that I should try if not that?
[22:18:01] <speaker1234> it's a work in progress. :-) There are other things I need to add like docstrings and integration into a programming environment
[22:18:45] <morenoh149> how do you query for every row in a table?
[22:18:59] <morenoh149> with sql I would do `select * from tablename;`
[22:18:59] <speaker1234> and as for the init arg list, see slide 6
[22:19:01] <GothAlice> rainabba_: mongodump from source, mongorestore to destination. You can prune the sub-folder under the "dump" folder that gets created to remove collections you aren't allowed to write to to avoid errors during import.
[22:19:12] <speaker1234> GothAlice, and as for the init arg list, see slide 6
[22:19:01] <GothAlice> morenoh149: db.collectionName.find() in the shell. A standard "find" query from a driver (i.e. from Python) will provide a cursor you can iterate over to retrieve all results.
[22:20:06] <rainabba_> GothAlice: Thank you. Will give that a go.
[22:21:18] <GothAlice> rainabba_: At work I actually have to run one mongodump per collection I wish to export, since the production user has no access to the system collections in the database, so the pruning is done during export rather than import. It's a PITA, though.
[22:21:24] <Moogly2012> can anyone help me find a way to get the highest value of a field? I've tried sorting, but it returns back empty values (empty string), and I can't seem to do the right incantations to block them out
[22:21:30] <Moogly2012> thanks to anyone who can
[22:22:18] <Moogly2012> just wasted a good amount of time on Google / MongoDB docs to no success
[22:22:38] <GothAlice> Moogly2012: {myfield: {$not: null, $exists: 1}} — add this to your query to exclude them with extreme prejudice. :) (Think I've got that right; the driver I use hides the raw queries from me.)
[22:22:45] <rainabba_> GothAlice: Dump creates a folder-structure where ever I'm running the command from so as long as I can connect to source and destination from there, I don't need to worry about moving the intermediate data?
[22:22:47] <Moogly2012> I'll try it thanks
[22:22:54] <GothAlice> rainabba_: Correct.
[23:23:07] <speaker1234> GothAlice, another update question: 1) can multi be limited to a finite number of matches and, 2) does update return the records it updated?
[22:23:13] <GothAlice> rainabba_: Run 'man mongodump' or `mongodump --help` for details of the setup.
[22:23:49] <Moogly2012> wow that worked
[22:23:56] <GothAlice> Moogly2012: :)
[22:24:13] <Moogly2012> is there a way to set how it sorts it from ascending to descending?
[22:24:40] <joannac> Moogly2012: add a sort clause
[22:24:53] <GothAlice> Moogly2012: MongoDB, like most databases, has four-state logic. Values may be "truthy" (1, true, non-empty string), "falsy" (0, false, empty string), null (explicitly devoid of value), or completely missing (value not defined).
[22:24:59] <joannac> Moogly2012: sort({a:1}) sorts ascending on a, sort({a:-1}) is descending
[22:25:20] <GothAlice> Moogly2012: This means when trying to exclude "empty" and "missing" values, you have to handle both cases to be safe.
[22:26:15] <GothAlice> Moogly2012: Sorting on your field -1 (descending) and limiting to a single result will find your "maximum" value. :)
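Put together in pymongo, the "maximum via descending sort and limit 1" approach looks roughly like this (collection and field names are placeholders; the filter assumes myfield holds real numbers rather than strings, which is exactly the trap that comes up just below):

    from pymongo import MongoClient

    coll = MongoClient().test.scores  # hypothetical collection

    cursor = (
        coll.find({"myfield": {"$ne": None, "$gt": 0}})
            .sort("myfield", -1)   # descending
            .limit(1)              # single largest value
    )
    print(next(cursor, None))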
[22:27:43] <Moogly2012> I still get negative numbers, been able to exclude empty values, but when it comes to sorting I have no such luck for whatever reason
[22:28:40] <Moogly2012> been doing db.thecollection.find().sort({myfield: -1}).limit(10)
[22:28:48] <Moogly2012> that's my initial attempt
[22:29:13] <Moogly2012> but I had gotten back empty values, now I just get negative values with $not: null, $exists: 1 included
[22:29:34] <joannac> Moogly2012: and you have positive values in your collection?
[22:29:38] <GothAlice> Moogly2012: If you're only getting negative numbers, I suspect all your numbers are negative. ;)
[22:29:55] <Moogly2012> I've seen some positive values in the collection yes
[22:30:06] <Moogly2012> when running a basic find() query
[22:30:24] <GothAlice> Moogly2012: Add "$gt: 0" to the query on myfield.
[22:30:52] <Moogly2012> I tried that earlier, I'll try it again
[22:31:20] <Moogly2012> I don't understand how negative numbers would be greater than 0
[22:31:25] <Moogly2012> LOL
[22:31:29] <GothAlice> They're strings, not numbers.
[22:31:31] <GothAlice> That's how.
[22:31:32] <joannac> are they strings?
[22:31:33] <Moogly2012> ah
[22:31:42] <Moogly2012> that's entirely possible yes
[22:31:46] <GothAlice> :P
[22:31:59] <GothAlice> Step #1 when diagnosing database problems should always be: check your data.
[22:32:07] <joannac> if you want to sort numerically, your data needs to be numbers
[22:32:10] <joannac> :p
[22:32:39] <Moogly2012> I'm accustomed to other databases, never used mongo, so I don't even know how to check for data types
[22:32:44] <GothAlice> Yeah, seeing negatives with a "$gt: 0" = you're using strings. ;)
[22:32:46] <shoerain> hm, 1. are there examples of migration scripts in mongodb (changing field names, removing fields, updating values, etc) and 2. examples of #1 using mongoose?
[22:33:40] <shoerain> I have some examples from work that look less than ideal, but I could just have a hard time following callbacks.
[22:34:03] <Moogly2012> odd, in some of the results the field is a string, while in others, it's an int
[22:34:27] <Moogly2012> I guess that's how mongo rolls
[22:34:38] <GothAlice> shoerain: Most of those operations are quite simple and straightforward. Changing isn't (you'll actually need to load the record up and issue an update per-record), but removing and mutating values is quite easy. $set to assign a new value, $unset to remove, and various atomic operations such as appending to lists, incrementing / decrementing, etc.
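A hedged sketch of those migration primitives with pymongo (collection and field names are invented): $rename handles straight field renames server-side, $unset removes, $set writes constants, while rewrites that depend on the old value need the per-record loop mentioned above.

    from pymongo import MongoClient

    coll = MongoClient().test.profiles  # hypothetical collection

    coll.update_many({}, {"$rename": {"fullname": "name"}})   # rename a field everywhere
    coll.update_many({}, {"$unset": {"legacy_flag": ""}})     # drop an obsolete field
    coll.update_many({"plan": {"$exists": False}},
                     {"$set": {"plan": "free"}})              # backfill a constant

    # Value rewrites derived from the old value need a per-document pass.
    for doc in coll.find({"name": {"$exists": True}}, {"name": 1}):
        if isinstance(doc.get("name"), str):
            coll.update_one({"_id": doc["_id"]},
                            {"$set": {"name": doc["name"].strip()}})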
[22:35:17] <GothAlice> Moogly2012: MongoDB enforces no schemas. I use MongoEngine (Python driver wrapper) to provide those for me where needed.
[22:35:31] <GothAlice> Moogly2012: Even though I use an ODM, though, I still end up rolling a lot of manual queries.
[22:35:34] <Moogly2012> GothAlice: recommended way to remove documents with a specific empty field? Maybe that might help
[22:36:45] <GothAlice> Moogly2012: db.collection.find({somefield: {$exists: 0}}).delete() ?
[22:36:55] <Moogly2012> I'll try that
[22:36:59] <GothAlice> Waaait.
[22:37:08] <Moogly2012> I've got a backup
[22:37:09] <GothAlice> Run the query by itself (no .delete()) to see what you'll be doing.
[22:37:11] <GothAlice> ;P
[22:37:18] <Moogly2012> will do
[22:37:39] <GothAlice> The number of times I've accidentally my whole database with queries I thought were rock-solid I can't count on my hands.
[22:38:07] <Moogly2012> hahaha
[22:38:24] <Moogly2012> I once dropped a very important table... on a game server
[22:38:34] <Moogly2012> wasn't backed up, big mistake
[22:39:46] <GothAlice> The last screwup was an unbounded update() that nuked three different attributes worth of data for an entire collection. Before that was the incident that made me drop MySQL like a hot potato: after a lengthy EC2 failure I spent 36 hours straight (without sleep) recovering the data by reverse engineering the on-disk structure… prior to getting 8 teeth removed. (Wisdom teeth and impacted molars.)
[22:40:11] <Moogly2012> ouch
[22:40:27] <GothAlice> The latter screwup was a doozie. Learned enough about innodb on-disk formats to make me literally fear storing data that way ever again.
[22:42:06] <GothAlice> (directory-per-db? Not if you use innodb! It lies straight to your face about it and pretends to work, but pools important data structures outside the DB folders.)
[22:43:30] <Moogly2012> haven't done much with databases since, just toying with mongo now
[22:43:41] <Moogly2012> not since my last screw up, just since about a year or two
[22:44:15] <Moogly2012> mongo didn't really seem to like appending .delete() to that find
[22:44:40] <Moogly2012> ['delete'] is not a function (shell):1
[22:44:42] <Moogly2012> odd
[22:44:49] <GothAlice> Moogly2012: That's because I'm a terrible person who uses abstractions waaaay too much. db.collection.delete({criteria})
[22:44:58] <Moogly2012> I tried that too
[22:45:00] <GothAlice> Er, .remove
[22:45:05] <Moogly2012> oh right
[22:45:06] <GothAlice> Ugh. I *was* on my game today. ;)
[22:45:07] <Boomtime> :D
[22:45:07] <Moogly2012> LOL
[22:45:15] <Moogly2012> then I came and ruined it
[22:45:17] <Moogly2012> sorry
[22:45:28] <Moogly2012> I had to mention mysql didn't I
[22:46:15] <Moogly2012> it ran, but I still got one oddball of a field that still has no value
[22:46:26] <GothAlice> Ah, but it is present, yes?
[22:46:38] <GothAlice> (I.e. visible in the record, with a value of '' or null?)
[22:46:44] <Moogly2012> yeah
[22:48:11] <GothAlice> Then you'll need to adjust your removal criteria to handle that case. The one I gave you only handles if the value is actually missing. :)
[22:48:11] <GothAlice> (You'll want to either run several remove() queries, or use $or to combine the criteria.)
[22:52:58] <GothAlice> For an advanced look at one of the niftiest things MongoDB can do, see: https://github.com/bravecollective/forums/blob/develop/brave/forums/component/thread/model.py#L99-L161
[22:53:57] <GothAlice> This is for forum software that stores all replies to a thread within the thread document. These methods handle appending a comment, retrieving a single nested comment, and updating a specific comment in the thread. (At the bottom are some queries that get either the first or last comment in a thread.)
[22:54:24] <GothAlice> $slice: -1 = win
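A rough pymongo rendering of that embedded-comment pattern (the thread and comment shapes are invented): append a reply with $push, read back only the newest reply with a $slice projection, and edit one specific reply through the positional $ operator.

    from bson import ObjectId
    from pymongo import MongoClient

    threads = MongoClient().forums.threads  # hypothetical collection

    thread_id = threads.insert_one({"title": "demo", "comments": []}).inserted_id

    comment_id = ObjectId()
    threads.update_one(
        {"_id": thread_id},
        {"$push": {"comments": {"_id": comment_id, "message": "hello"}}},
    )

    # Fetch only the most recent comment of the thread.
    last = threads.find_one({"_id": thread_id}, {"comments": {"$slice": -1}})
    print(last["comments"])

    # Update one specific embedded comment in place.
    threads.update_one(
        {"_id": thread_id, "comments._id": comment_id},
        {"$set": {"comments.$.message": "hello, edited"}},
    )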
[22:54:38] <Boomtime> what happens if the thread grows large? do you have a cutoff?
[22:55:27] <GothAlice> Boomtime: We calculated it out. Migrating the entirety of the gaming group's forums (a few thousand users at the time) from the phpBB fork they were using (eveBB) to this would require a total of a single document. (< 16MB for *everything*.)
[22:55:53] <GothAlice> I estimated that the average thread size would have to grow by over 1000x before I would even consider adding automatic splitting of thread replies.
[22:56:05] <GothAlice> (16MB is a LOT of text!)
[22:56:59] <GothAlice> Also the migration (properly splitting the threads) ran great and now they're up to > 10,000 users. :)
[22:57:12] <Boomtime> yes, 16MB is lots of text, but the limit is not unreasonable for a popular forum thread to reach - what happens at that point?
[22:57:34] <Moogly2012> brb my nets being funky thanks for the help GothAlice I will possibly be back when my net is not being obnoxious
[22:57:34] <GothAlice> Boomtime: The thread effectively locks.
[22:57:47] <GothAlice> Moogly2012: No worries. It never hurts to help.
[22:58:09] <Boomtime> goodo, as long as it doesn't die and you have a response strategy, that is all that matters
[22:58:37] <Boomtime> your system sounded good, i did not read it all though
[22:58:46] <Boomtime> (the code i mean)
[22:59:32] <GothAlice> Boomtime: By a growth of 1000x before I would have a concern I mean the average number of replies would have to grow to 40,000 on a single thread (based on the average length of a reply across all forums) before we hit 10MB per thread. Lots of headroom to actually implement proper thread spanning. ;)
[23:00:04] <GothAlice> I was more eager to add things like live push updates and other "sexy" features. ;^)
[23:01:12] <GothAlice> (That's about 43 words per reply on average.)
[23:08:11] <GothAlice> Huh; now that I check it, by the time I left the project they were already up to 82 words per comment on average. (1,323 threads containing 14,281 comments with 5,734 up-votes, viewed 380,187 times, containing 1,175,505 words.)
[23:09:24] <doug1> I'm getting "Error parsing INI config file: unknown option auth" ... I have auth = true in mongodb.conf. Why?
[23:09:44] <GothAlice> doug1: Which version of MongoDB?
[23:09:50] <doug1> GothAlice: 2.6.1
[23:09:52] <doug1> oops 2.6.5
[23:10:11] <doug1> on a mongos instance...
[23:10:50] <GothAlice> I would strongly recommend upgrading to the YAML-based configuration. http://docs.mongodb.org/manual/reference/configuration-options/
[23:11:08] <doug1> GothAlice: Can't. Using the community chef cookbook
[23:12:28] <GothAlice> Ouch. It may also be that running as mongos, authentication isn't supported, or wasn't supported under the old configuration syntax. You either can't enable auth there, or must upgrade to YAML. :(
[23:12:51] <doug1> Sigh. Where would I add the admin user then?
[23:14:52] <GothAlice> http://docs.mongodb.org/v2.4/reference/program/mongos/#bin.mongos — yup, no --auth option. You'd add it to the mongod (shard-specific) configurations, not the management process.
[23:15:20] <GothAlice> Note that enabling auth will require you to use a pre-shared key to secure communication between the shards.
[23:15:27] <GothAlice> doug1: ^
[23:15:44] <doug1> i need a valium
[23:15:51] <joannac> yeah, --keyFile
[23:15:58] <GothAlice> That's a good idea, doug1.
[23:15:58] <doug1> i have that.
[23:16:11] <doug1> it's not clear to me which instance I enable auth on.
[23:16:13] <joannac> then remove --auth from your mongos conf file?
[23:16:21] <joannac> --keyFile enables auth
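Put together, the setup joannac and GothAlice describe might look roughly like this in the YAML configuration format linked above; the file paths are placeholders, and setting a keyFile implies authentication across the cluster, so mongos needs no separate auth option:

    # mongod.conf on each shard member -- placeholder path
    security:
      keyFile: /etc/mongodb/cluster.key    # shared secret between all members
      authorization: enabled               # require client authentication

    # mongos.conf -- no auth option here; the same keyFile both authenticates
    # the router to the shards and enables auth on the router
    security:
      keyFile: /etc/mongodb/cluster.key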
[23:16:32] <GothAlice> Ah, tricky.
[23:16:33] <doug1> joannac: are you sure?
[23:16:51] <doug1> because I need to precreate the admin user, I think
[23:17:43] <Boomtime> "joannac: --keyFile enables auth" -> yes
[23:18:08] <joannac> very sure
[23:18:20] <joannac> and no, you can use the localhost exception afterwards
[23:18:37] <joannac> but if you want to create the admin user first to be safe, go for it
[23:19:58] <speaker1234> I'm missing something about update. How do you get the list of documents you changed?
[23:21:52] <joannac> speaker1234: do a find beforehand
[23:22:16] <speaker1234> but doesn't that leave you open to a race condition
[23:23:55] <speaker1234> I guess I could create an external lock for all access to the database while I do the find and update
[23:25:05] <joannac> back a step, what's the usecase?
[23:25:09] <GothAlice> joannac: localhost exception; I KNEW I was forgetting something important earlier, with another user who had authenticated sharding/replication questions.
[23:25:35] <joannac> you might be one of the rare people who wants findAndModify?
[23:26:15] <GothAlice> speaker1234: It's less of a race condition than one might think. If you apply the same atomic operation (i.e. set X to 27 or append Y to a list) client-side after confirming the success of the update, you should be fine.
[23:26:18] <Boomtime> "documents", plural
[23:26:26] <speaker1234> that's right. findAndModify was mentioned earlier. I have a bunch of records, each one indicating some work to be done. I change the state of the record and hand it off to a worker bee
[23:27:17] <GothAlice> speaker1234: https://github.com/marrow/marrow.task/blob/develop/marrow/task/model/task.py#L30-L90 is an example of implementing atomic locking on MongoDB records, FYI.
[23:27:34] <speaker1234> I've got a pretty good shot at triggering that race condition
[23:27:46] <GothAlice> speaker1234: Also directly applicable to your use case, as this is part of a distributed task execution system. ;)
[23:27:47] <speaker1234> if for no other reason than I have that kind of luck
[23:28:20] <Boomtime> if a race condition exists, you may as well assume you'll hit it
[23:28:40] <speaker1234> GothAlice, you keep throwing me all this code I don't quite understand yet. :-)
[23:28:49] <speaker1234> I'll figure it out though.
[23:28:55] <Boomtime> can you do an update that specifically sets an identifier field on documents that it touches? allowing you to find the modified ones afterwards?
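Boomtime's suggestion, roughly sketched with pymongo; the collection and field names ("cdn.jobs", "state", "claim") are illustrative:

    import uuid
    from pymongo import MongoClient

    jobs = MongoClient().cdn.jobs  # hypothetical database/collection names

    claim = uuid.uuid4().hex  # unique tag for this batch of updates

    # Set the tag in the same update that changes state, so the documents this
    # process actually touched can be found again afterwards by the tag alone.
    jobs.update_many(
        {"state": "pending"},
        {"$set": {"state": "claimed", "claim": claim}},
    )
    mine = list(jobs.find({"claim": claim}))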
[23:29:04] <GothAlice> speaker1234: https://gist.github.com/amcgregor/4207375 is the slides from the presentation I gave on this which truly simplify this type of thing down to the bare essentials. Link in the comments there to the more complete code which marrow.task is based on.
[23:31:52] <GothAlice> https://gist.github.com/amcgregor/4207375#file-4-job-handler-py-L7-L8 sets the `state` to "running", the `acquired` date to now, and the current "owner" of the job to the current process, but *only if* the job is still in the pending state, with no owner when we execute that query. Update-if-not-different is awesome. (Two processes can execute that statement at exactly the same time, only one will win.)
[23:32:30] <GothAlice> Note that this code doesn't even bother to get the record first, it waits for success on that lock before trying to find() the actual document.
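The pattern described in the gist, reduced to a minimal pymongo sketch; the database/collection names and the identity list are illustrative, not the actual marrow.task code:

    import os
    import socket
    from datetime import datetime
    from pymongo import MongoClient

    tasks = MongoClient().queue.tasks  # hypothetical database/collection names

    # An "identity" along the lines of the gist: [hostname, pid, parent pid].
    identity = [socket.gethostname(), os.getpid(), os.getppid()]

    def try_acquire(task_id):
        """Try to lock one task; return its document if this process won."""
        # Update-if-not-different: only a still-pending, unowned document
        # matches, so when two workers race, exactly one update succeeds.
        result = tasks.update_one(
            {"_id": task_id, "state": "pending", "owner": None},
            {"$set": {
                "state": "running",
                "acquired": datetime.utcnow(),
                "owner": identity,
            }},
        )
        if result.modified_count == 1:
            return tasks.find_one({"_id": task_id})  # we own it; load and run
        return None  # another worker won the race; skip this task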
[23:34:11] <speaker1234> I think I get the gist of it. You are still doing some magic that I don't understand the source of, for example the owner identity and the use of __class__ objects.
[23:34:53] <speaker1234> To be truthful, I'm seriously tempted to use portalocker on an external file just out of expediency
[23:35:01] <GothAlice> For the most part you can ignore my Python-fu. An "identity" (like https://gist.github.com/amcgregor/4207375#file-1-example-job-record-js-L10) is just a combination of [hostname, pid, parent_pid].
[23:35:19] <speaker1234> performance wouldn't be horrible if I was careful about defining the lock
[23:35:44] <GothAlice> speaker1234: Expert at read-many, write-one locks? Locking isn't easy.
[23:36:00] <speaker1234> it would work okay in a wsgi environment
[23:36:03] <doug1> Still don't get it. Where would I create the admin user?
[23:36:05] <Boomtime> the database already has a lock, and it's multi-host compatible, use it
[23:36:31] <speaker1234> Actually, I was, in a previous life, a deep OS weenie. I could untangle lock and memory management problems others were baffled by
[23:36:44] <GothAlice> Boomtime: Not to mention optionally redundant (replicated) and optionally partitioned (sharding) which are both monumental tasks by themselves. ;)
[23:36:51] <Boomtime> GothAlice beat me to it, seriously do not implement your own lock system if you can even remotely avoid it
[23:37:10] <speaker1234> but in this case, I would use exclusive locks based on the type of state change
[23:37:24] <doug1> someone shoot me
[23:37:30] <GothAlice> speaker1234: That's exactly what https://gist.github.com/amcgregor/4207375#file-4-job-handler-py-L7-L8 does. ;)
[23:37:42] <joannac> doug1: I'm not sure where you're stuck
[23:37:56] <doug1> joannac: where would I add the admin user?
[23:38:04] <joannac> in the admin database?
[23:38:11] <GothAlice> doug1: Spin up the new cluster with auth enabled, SSH into the master and use the mongo shell to add the user.
[23:38:21] <doug1> joannac: on which server type? config server? data node? router?
[23:38:30] <doug1> GothAlice: gotta automate
[23:38:39] <joannac> on the mongoS
[23:38:49] <joannac> and then also on the primary of each shard
[23:38:58] <doug1> joannac: someone just said to do it on the .... omg
[23:39:02] <joannac> so you can do replica set maintenance
[23:39:05] <doug1> joannac: why? doesnt the mongos do that for me?
[23:39:13] <joannac> no
[23:39:22] <joannac> we don't propagate users mongoS to mongoD
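In practice that might look roughly like the following pymongo sketch; the user name, password, and roles are placeholders, and it should be run from localhost so the localhost exception applies before auth is fully locked down:

    from pymongo import MongoClient

    # Run once against the mongos and once against the primary mongod of each
    # shard, since users created through mongos are not pushed down to the
    # shards themselves.
    admin = MongoClient("localhost", 27017).admin
    admin.command(
        "createUser", "siteAdmin",
        pwd="changeme",
        roles=[{"role": "root", "db": "admin"}],
    )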
[23:39:38] <doug1> egads
[23:39:48] <doug1> i've been at this for 4 months. sigh
[23:40:01] <doug1> mongo folks came out and didn't care
[23:40:11] <joannac> doug1: :o
[23:40:14] <speaker1234> GothAlice, maybe I'm just tired but I'm not grokking how that lock works. Are you using the record ID and a separate container to record the lock status?
[23:40:34] <doug1> ok, on the router box but DON'T use auth =
[23:41:17] <joannac> doug1: who was it? I doubt they "didn't care"... at least i hope not
[23:41:26] <GothAlice> speaker1234: Because of the way MongoDB implements its "queue" analog I had to split storage of the actual task data into its own collection and only use the "capped collection" (queue) for push notification of events on those tasks. Thus the runners get woken up and pushed a new task ID, then execute that lock in a thread/process pool to try to grab the task so it can be worked on. Only IDs get passed around cross-process this way.
[23:42:46] <GothAlice> speaker1234: You would have to determine the size of your capped collection based on the estimated number of waiting jobs. For the Facebook game I wrote those presentation slides for, we estimated an 8GB capped collection to handle 1M simultaneous users' active gameplay.
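A minimal sketch of creating such a notification collection with pymongo; the database name, collection name, and size are illustrative (not the 8GB figure from the game):

    from pymongo import MongoClient

    db = MongoClient().queue  # hypothetical database name

    # Capped collections keep insertion order and can be tailed like a queue;
    # size it for the expected backlog of task notifications.
    if "task_events" not in db.list_collection_names():
        db.create_collection("task_events", capped=True, size=64 * 2**20)  # 64 MB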
[23:42:49] <doug1> joannac: they had no solutions to provide regarding the automated installation of a mongo cluster. Their main priority seemed to be seeing MMS. When asked about chef they said they heard the Edelight cookbook works, which it plainly does not. That's the one I've spent 4 months trying to get to work
[23:42:50] <speaker1234> So maybe the trick for me is to find a record, update the state for just that record ID, and if that fails go back and get another record
[23:43:01] <doug1> s/seeing/selling
[23:43:32] <doug1> I logged into the MMS GUI yesterday. I can't even work out how to start a cluster with it
[23:43:59] <joannac> doug1: did you deploy hosts? install the agent?
[23:44:21] <speaker1234> doug1, remember that the only intuitive user interface is the mammalian nipple and as any nursing mom will tell you, there are some that still don't get it
[23:45:01] <GothAlice> speaker1234: That is the general approach that the runners use. They have a thread constantly pulling in "new job" messages and handing the IDs off to a pool of actual task executors which then do the locking, as quickly as possible. (In my case all runners are notified of all enqueued tasks and all locally queue the task for execution. Failure, due to the lock, is very fast.)
[23:45:18] <doug1> joannac: I created a host ... then what happens?
[23:45:31] <doug1> apparently there's no docs on MMS yet either
[23:45:34] <GothAlice> doug1: MMS is for monitoring, not construction of your cluster.
[23:45:57] <doug1> GothAlice: errr
[23:46:00] <GothAlice> I.e. you need a cluster which you then install the agent onto in order to get statistics and monitoring.
[23:46:12] <joannac> doug1: deploy onto it?
[23:46:19] <joannac> GothAlice: you're out of date ;)
[23:46:24] <doug1> now I can't find the damn web page
[23:46:26] <GothAlice> joannac: I may be. ^_^;
[23:46:38] <doug1> There is some new mongo product for hosting clusters
[23:46:45] <joannac> doug1: mms.mongodb.com
[23:46:48] <GothAlice> Rackspace and many other providers offer "rollouts", i.e. you can simply tell Rackspace you want a three-node cluster and it goes off and does it. Not very customizable in my experience, though. (I.e. no automation of enabling authentication.)
[23:46:56] <speaker1234> GothAlice, okay great. So I can use the find-and-update, success-or-failure trick to guarantee that I get the record. Now which would be better: grabbing one record at a time, or finding a set of records that match and updating one at a time to find out which ones I "own"?
[23:47:16] <doug1> Ok, what's the name of the mongo product for a hosted cluster?
[23:47:30] <joannac> doug1: I don't understand the question
[23:47:39] <doug1> sigh
[23:47:45] <joannac> you have a host. you probably don't want to deploy a sharded cluster on a single host
[23:47:58] <doug1> sigh x 2
[23:48:12] <joannac> doug1: I'm trying to help. Please use your words
[23:48:20] <doug1> "MMS can deploy MongoDB on any internet connected servers, but on Amazon Web Services (AWS), MMS does even more."
[23:48:35] <doug1> that doesn't sound like monitoring to me.
[23:49:00] <doug1> "Supported instance type provisioning: m3.*, c3.*, r3.*, g2.2xlarge, hs1.8xlarge, i2.xlarge"
[23:49:10] <joannac> doug1: after finding some hosts, and deploying a cluster / replica set, you can monitor them
[23:49:14] <doug1> That's at mms.mongodb.org
[23:49:27] <doug1> "MMS reliably orchestrates the tasks you currently perform manually — provisioning a new cluster, upgrades, restoring your system to a point in time, and many other operational tasks."
[23:49:43] <doug1> ie hosted mongo
[23:49:58] <joannac> yes
[23:50:10] <joannac> hosted on AWS
[23:50:13] <doug1> Sure
[23:50:16] <joannac> via your own instances
[23:50:17] <GothAlice> speaker1234: Tasks tend to be atomic, i.e. each one can complete independently of the rest. How you distribute tasks within your runner (i.e. threads, subprocesses, or pure linear execution) will determine which approach you take. If you fill local queues to feed threads and subprocesses, you can batch it all together as each thread or subprocess will individually handle a single task.
[23:50:28] <joannac> it's not like MongoLab, if that's what you're expecting
[23:50:33] <doug1> I signed into the GUI. I tried to do something... like start a cluster... no such functionality could apparently be found
[23:51:07] <GothAlice> speaker1234: A pure linear approach would only ever tail the capped collection, executing each task as it gets them. (You could then use a tiny 'catchup' script to re-inject tasks into the capped collection that were skipped because they were added before the runners were launched; or just increase the size of your capped collection to compensate.)
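The "pure linear" approach, sketched with pymongo; it assumes each event document carries a task_id field, and the collection names and run_task helper are hypothetical:

    import time
    from pymongo import MongoClient, CursorType

    db = MongoClient().queue         # hypothetical database name
    events = db.task_events          # capped collection of "new task" events
    tasks = db.tasks                 # the full task documents live elsewhere

    def run_task(doc):
        """Hypothetical task executor."""
        print("running", doc["_id"])

    # A tailable-await cursor blocks waiting for new inserts, like `tail -f`.
    while True:
        cursor = events.find(cursor_type=CursorType.TAILABLE_AWAIT)
        while cursor.alive:
            for event in cursor:
                task = tasks.find_one({"_id": event["task_id"]})
                if task is not None:
                    run_task(task)
            time.sleep(1)  # no new data yet; the cursor stays open, so just wait
        # the cursor died (e.g. the capped collection was empty); recreate it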
[23:51:21] <speaker1234> GothAlice, there will be a bunch of parallel requests. Think 100 worker bees wanting some more work
[23:52:11] <GothAlice> speaker1234: I'm basically saying that you can load the data at effectively any point: when your worker is initially notified of the task or when your worker actually gets around to running the task.
[23:52:36] <doug1> The MMS UI seems to be totally devoid of anything except monitoring.
[23:52:48] <GothAlice> speaker1234: My "worker bees" each have their own thread pool, each thread handling the locking, loading, and execution of a single task.
[23:52:49] <speaker1234> GothAlice, ah thanks. Let me see how it breaks :-)
[23:53:31] <speaker1234> my worker bees are on 10 other machines. It's an industry-specific CDN
[23:54:17] <GothAlice> speaker1234: I'm hoping by the end of this coming week-end to have marrow.task in an operational state, if you can wait that long. Patches are welcome and encouraged! ;)
[23:54:33] <GothAlice> (As are feature requests.)
[23:54:51] <speaker1234> I need to do a test run by the start of next week. This is so I can prove to customers that we have a viable infrastructure; then I will have more money to go back and revisit implementation details.
[23:55:56] <speaker1234> I'm having to do the simple approximation first, prove that the idea works, get the money to do more, and then revisit with a second implementation. In other words, what The Mythical Man-Month tells you will happen
[23:56:17] <speaker1234> but I have enough scar tissue to avoid most of the second system syndrome
[23:56:30] <doug1> Where do I get the agent for MMS? This is so confusing
[23:56:55] <doug1> This http://www.mongodb.com/blog/post/installing-mms-monitoring-less-5-minutes just talks about it in the context of monitoring, not building a cluster
[23:57:02] <speaker1234> GothAlice, once I prove things, I have one customer that wants to use a queue similar to what you have for this capability, and I'll at least be able to either hire someone to help or do the work and contribute it myself
[23:57:14] <GothAlice> Yeah, that's pretty tight. https://gist.github.com/amcgregor/4773553 is the "more complete" implementation of the snippets from the presentation. Won't run out of the box (has some hanging references to internal code) but should be pretty quick to patch and get running.
[23:57:42] <doug1> what am I missing here...
[23:57:48] <joannac> doug1: do me a favour
[23:57:57] <joannac> look at the top left of your MMS screen
[23:58:07] <joannac> do you see the word "classic" there?
[23:58:21] <GothAlice> joannac: I now see my problem there…
[23:58:38] <joannac> GothAlice: problem?
[23:58:51] <GothAlice> Not realizing MMS does so much more now. ;)
[23:58:58] <joannac> GothAlice: :)