#mongodb logs for Friday the 24th of October, 2014

[00:00:15] <hejbacon> Boomtime: joannac http://pastebin.com/bc4qamQM
[00:17:11] <freeone3000> How do I get around "Error getting results.MongoError: exception: aggregation result exceeds maximum document size (16MB)"?
[00:17:41] <GothAlice> Adjust the query so that the results fit within a single document, or $out the results to a real collection.
[00:17:46] <GothAlice> freeone3000: ^
[00:18:42] <freeone3000> It's an aggregation pipeline. The output I care about is a single number, I just have to run a $where, $where to do an and, a $group for uniqueness, then a $group for count.
[00:19:11] <freeone3000> Apparently if any stage of this goes over 16MB, including the results after the *first* $where, it'll not give me results.
[00:19:39] <GothAlice> freeone3000: Having two $group operations is dubious to me, but. Individual stages have 100MB of RAM to work with unless you enable the option to let them use on-disk temporary files.
[00:20:25] <GothAlice> freeone3000: Could you pastebin/gist your aggregate query?
[00:20:41] <freeone3000> GothAlice: It looks like https://gist.github.com/freeone3000/6a46ac80c7dc661feeda
[00:21:12] <GothAlice> Yup, that explains it.
[00:21:16] <GothAlice> Combine your duplicate stages.
[00:21:22] <GothAlice> (One $match, one $group.)
[00:22:02] <freeone3000> GothAlice: How would I combine those two groups? {"_id":"$requestInfo.userId", "userCount": {"$sum":1}} gives me results like [{"_id":"user1","userCount":22}]
[00:23:02] <GothAlice> And your desired output would be?
[00:23:15] <GothAlice> Oh, right, I see what's going on now.
[00:23:35] <freeone3000> GothAlice: I want the total number of unique userIds.
[00:25:33] <GothAlice> [{$match: {dateAdded: {$gte: startDate}, source: {$in: routes}}}, {$project: {uid: "$requestInfo.userId"}}, {$group: {_id: "$uid"}}, {$group: {_id: 1, userCount: {$sum: 1}}}]
[00:26:02] <GothAlice> Might need to move the project in front of the match simply to avoid loading all that extra data and blowing out the 100MB-per-stage limit.
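[Editor's note: a minimal, runnable mongo-shell sketch of the combined pipeline above; the "events" collection name and the startDate/routes values are assumptions standing in for what freeone3000's application supplies.]

```javascript
var startDate = ISODate("2014-10-01T00:00:00Z");
var routes = ["search", "email"];

db.events.aggregate([
    // One combined $match instead of two back-to-back filter stages.
    { $match: { dateAdded: { $gte: startDate }, source: { $in: routes } } },
    // $project early so later stages carry only the field being counted.
    { $project: { uid: "$requestInfo.userId" } },
    // First $group: one document per distinct user id.
    { $group: { _id: "$uid" } },
    // Second $group: collapse the distinct ids into a single counter.
    { $group: { _id: 1, userCount: { $sum: 1 } } }
], { allowDiskUse: true }); // the opt-in for on-disk temporaries mentioned above
```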
[00:32:05] <GothAlice> We do statistical aggregation at work, but we pre-aggregate all the values that user-triggered queries run against for performance. (Processing four weeks of click data to produce one chart takes ~50ms against a dataset pre-aggregated into one-hour slices by client, invoice, job, and click source.)
[00:32:50] <GothAlice> Processing clicks pre-aggregated into hourly chunks means I have far, far fewer records to handle.
[00:34:46] <freeone3000> These are already pre-aggregated by conversation. We just have a bit of load.
[00:35:48] <freeone3000> (140 countries, 20m users, ~500qps)
[00:36:12] <GothAlice> If your aggregate is running out of memory, that's a query and dataset problem, not a load problem. ;^P
[00:36:48] <GothAlice> Have you tried what I gave above?
[00:37:19] <GothAlice> freeone3000: [{$match: {dateAdded: {$gte: startDate}, source: {$in: routes}}}, {$project: {uid: "$requestInfo.userId"}}, {$group: {_id: "$uid"}}, {$group: {_id: 1, userCount: {$sum: 1}}}]
[00:38:03] <freeone3000> GothAlice: Yeah. It may or may not be working.
[00:38:09] <GothAlice> Heh.
[00:38:51] <GothAlice> The trick is combining the $match queries; you didn't actually want to "find all records with a dateAdded greater than startDate", store the temporary results, then filter those results to only matching routes.
[00:39:05] <GothAlice> Also $project to reduce the dataset down to only the values you need for the calculation.
[00:40:04] <GothAlice> freeone3000: My aggregate queries tend to look like: https://gist.github.com/amcgregor/7bb6f20d2b454753f4f7#file-aggregate-py-L6
[00:42:45] <freeone3000> GothAlice: Ah. Yes, perfectly successful. Thanks.
[00:42:50] <GothAlice> np
[00:45:07] <GothAlice> freeone3000: https://gist.github.com/amcgregor/1ca13e5a74b2ac318017 is one of my aggregates a bit more like your test case here. The DOTW conversion is hideous. XD
[00:46:11] <GothAlice> (If anyone has any suggestions on how to improve that aggregate, I'm all ears. ^_^)
[01:39:17] <zak_> HI.... late to the party on mongo... is a document with a field which is an array of documents queryable at all...so { title : "something", authors : [{name : "bob", ...}, ...] , ... }
[01:39:41] <joannac> sure, authors.name
[01:39:54] <GothAlice> Also http://docs.mongodb.org/v2.6/reference/operator/query/elemMatch/
[01:40:45] <zak_> maybe i mistyped that idea, so i have a collection of documents like that
[01:40:50] <zak_> and i want to get all the authors out
[01:41:54] <GothAlice> You could do it with a standard query, but with duplicates: db.foo.find({}, {authors: 1}) — loop through the result with a nested loop over .authors
[01:42:02] <joannac> db.collection.find({}, {authors:1, _id:0})
[01:44:03] <GothAlice> Or you could do an aggregate query to only get the *unique* authors out: db.foo.aggregate([{$project: {a: "$authors"}}, {$unwind: '$a'}, {$group: {_id: '$a'}}])
[01:44:10] <GothAlice> (I think that's right.)
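[Editor's note: a runnable sketch of the unique-authors aggregate above, with throwaway sample data; "books" is an assumed collection name. $group de-duplicates only subdocuments that are exactly equal, field order included.]

```javascript
db.books.insert({ title: "something", authors: [{ name: "bob" }, { name: "eve" }] });
db.books.insert({ title: "other",     authors: [{ name: "bob" }] });

db.books.aggregate([
    { $project: { a: "$authors" } }, // carry only the array
    { $unwind: "$a" },               // one document per author entry
    { $group: { _id: "$a" } }        // collapse identical author subdocuments
]);
// => { "_id" : { "name" : "eve" } }
//    { "_id" : { "name" : "bob" } }
```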
[01:52:52] <zak_> GothAlice, joannac, i got kicked off for a second -- so that query returns something like a Map (book => (book.id, book.authors)). what's the mongo pattern here, is it better to have a second "authors" collection and refer to them like foreign keys
[01:53:22] <GothAlice> zak_: The answer to that question lies in how much duplication of author information you have.
[01:53:35] <zak_> lots
[01:53:45] <GothAlice> Then yes, splitting it out is an extremely good idea.
[01:54:05] <zak_> but unfortunately it's not reliable, i have to make a model of some kind to disambiguate them
[01:54:15] <GothAlice> You would instead store a list of author IDs (rather than a list of author documents).
[01:54:26] <zak_> so i guess the model will be represented as another collection
[01:55:08] <GothAlice> At work we have a process to merge (automatically or manually) records we come across that are duplicates. How are you "model"ing your data?
[01:55:14] <GothAlice> (Language, driver, ODM?)
[01:56:47] <zak_> right now it's scala + reactivemongo
[01:56:53] <zak_> and i'm using mongo to store and index all these flat files i have that have details on books
[01:56:57] <zak_> they're basically books
[01:57:55] <GothAlice> You had authors: [{name: "bob", ...}] in your model, so {name: "bob", …} would be one of these "authors" collection-level documents.
[01:59:03] <GothAlice> Alas, though, I am unfamiliar with scala and reactivemongo.
[02:00:42] <zak_> i'm sure the insert/select capabilities are the same as any. i was doing this with postgres a few days ago, but i inserted millions of records in minutes using mongo, so i'm convinced i should be learning it
[02:00:46] <zak_> seems like it's really easy to build collections that are not at all queryable
[02:01:22] <zak_> whereas i'm used to building dbs with a ton of tables and massive inner join nonsense, so i guess it's the same problem
[02:01:52] <GothAlice> Yes. MongoDB seems to penalize bad data architecture more than other databases. Then again, the "flat file" mentality is pretty… restrictive. You end up inventing approaches like entity-attribute-value to work around the limitations, and you'd pretty much never do that in MongoDB.
[02:02:17] <GothAlice> http://docs.mongodb.org/ecosystem/use-cases/product-catalog/ is a good article about that, actually.
[02:03:22] <GothAlice> When I first told people I'm writing forums software that stores all replies to a thread within the thread, my relational friends looked at me like I was a crazy person. ^_^
[02:05:49] <zak_> GothAlice, well that seems like a great application of this right, since the root level document being the post itself is good
[02:06:02] <zak_> but what do you do when you want to get all comments by a given user
[02:06:05] <GothAlice> But the data duplication will be crippling.
[02:06:21] <GothAlice> Wait, yours will be. Yeah, this is perfect for forums. ^_^
[02:06:35] <GothAlice> $elemMatch
[02:07:25] <GothAlice> Well, that's "all threads and the first comment by that person on the thread". Aggregate $match, $unwind, $match, $project for the "all posts by a user" bit.
[02:07:59] <zak_> yeah i think i get it
[02:08:45] <GothAlice> ($match to find all threads commented in by that user, $unwind to continue comment-by-comment-by-thread, $match to find just that user's comments, $project to clean up the data to just the fields I want.)
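[Editor's note: a sketch of that $match/$unwind/$match/$project pattern against a hypothetical "threads" collection with embedded comments.]

```javascript
db.threads.aggregate([
    { $match: { "comments.author": "bob" } },  // only threads bob commented in
    { $unwind: "$comments" },                  // one document per comment
    { $match: { "comments.author": "bob" } },  // drop everyone else's comments
    { $project: { title: 1, comment: "$comments.comment" } } // just the fields wanted
]);
```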
[02:11:08] <GothAlice> zak_: https://github.com/bravecollective/forums/blob/develop/brave/forums/component/thread/model.py#L99-L192 are some of the other things you can do. (I.e. find a specific comment anywhere, add a comment, modify one in-place, get the first and last (or any arbitrary index) comment, etc.)
[02:12:22] <zak_> hey thanks for the link, i've been looking at solr for a new project too so i'm sure this will be good to look at
[02:13:09] <GothAlice> But! Forums and categories of forums are stored separately, you'll note. :)
[02:13:25] <GothAlice> (https://github.com/bravecollective/forums/blob/develop/brave/forums/component/thread/model.py#L70-L80)
[08:52:48] <Mmike> kali: I can combine master and replset options, right?
[08:53:05] <Mmike> That is, i can have 3 boxes in replset, and each of the members can be a master and have a slave attached to it?
[08:53:16] <Mmike> (not that I know why anyone would use such a setup, but is it possible?)
[09:00:11] <alphamarc> Hi all
[09:00:58] <alphamarc> What are the greatest advantages of Mongo (and NoSQL in general) when it comes to store log information
[09:01:01] <alphamarc> lots of it
[09:01:21] <alphamarc> Can you point me to some ressources I could read ?
[09:03:30] <alphamarc> I would like to have a fast and generic system to store logs for different services within a firm.
[09:10:51] <kali> Mmike: master/slave is more or less being deprecated, and it becomes less and less relevant with the recent availability of huge replica sets (50 nodes)
[09:11:58] <Mmike> kali: still, you can have a slave on each of your replset nodes - that particular replset node is then 'master', right?
[09:54:15] <Tark> db.struct.find({'number': {$type: 1}}) returns the whole collection. All the values are doubles! Holy hell! I want ints. What should I do?
[10:16:28] <Tark> hm, I always have a 64-bit version
[10:16:37] <Tark> already*
[10:20:12] <jabclab> hey all, is it possible to use functionality from an npm lib within a map() function of mapReduce (using http://mongodb.github.io/node-mongodb-native/contents.html)?
[10:20:35] <jabclab> or would you recommend getting the raw data out and manipulating in code?
[10:43:05] <Tark> hm... "Works as Designed" https://jira.mongodb.org/browse/SERVER-2844
[10:43:47] <Tark> I think, I need to add |int in the templates... Design is so design
[10:43:59] <Tark> thanks for helpin
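[Editor's note: what Tark ran into is that the mongo shell stores bare numeric literals as doubles (BSON $type 1). Wrapping a value in NumberInt() (or NumberLong()) stores an actual integer. A minimal illustration against his "struct" collection:]

```javascript
db.struct.insert({ number: 5 });            // shell default: stored as a double
db.struct.insert({ number: NumberInt(5) }); // stored as a 32-bit integer

db.struct.find({ number: { $type: 1 } });   // matches only the first document
db.struct.find({ number: { $type: 16 } });  // matches only the NumberInt one
```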
[10:59:36] <Mmike> heh
[10:59:37] <Mmike> Fri Oct 24 09:16:12.118 [initandlisten] ERROR: can't use --slave or --master replication options with --replSet
[10:59:48] <Mmike> but seems that mongod doesn't really care about that...
[10:59:53] <Mmike> as I got my cluster running
[12:55:07] <Luser> hey guys. I'm getting involved in a project which uses mongodb on Monday, and I would like to get some experience with it this weekend before I start. Could you drop me a link or two with learning resources?
[12:55:54] <cofeineSunshine> Luser: official university: https://university.mongodb.com/
[12:57:15] <Luser> Looks nice, but I'd have to wait for courses to begin.
[12:59:16] <Di> What platform project?
[13:01:08] <Luser> It uses ruby on rails for backend.
[13:03:44] <Di> Try this https://github.com/mongodb/mongo-ruby-driver/wiki/Tutorial but you'll need to read mongo's query language documentation anyway
[13:08:50] <Luser> Thank you Di i will check it out for sure.
[13:38:03] <inad922> hello
[13:38:18] <inad922> How can I connect to mongodb via a socks proxy?
[13:39:55] <leenug> Hi, is anyone able to see why the following query doesn't return any results: http://laravel.io/bin/LN8LO, there are plenty in the collection, I think I may be doing something wrong with dates, thanks
[13:58:24] <shoerain> hmm, what's the equivalent non nested version of `db.campaigns.find({'topics.included': true }).count()` ? I thought it would be `{topics: { included : true }}` or `{topics: [{ included : true }]}`, but I get 0 count for both.
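[Editor's note: shoerain's two forms are not equivalent. Dot notation reaches into embedded documents (and array elements) field by field, while an embedded-document query requires an exact match of the whole subdocument. A sketch with assumed data:]

```javascript
db.campaigns.insert({ topics: { included: true, name: "a" } });

db.campaigns.find({ "topics.included": true }).count();    // 1: matches the nested field
db.campaigns.find({ topics: { included: true } }).count(); // 0: "name" makes the exact match fail
```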
[14:04:21] <shoerain> leenug: does some subset of the query work? Also you can merge $gt and $lt together, like so: http://cookbook.mongodb.org/patterns/date_range/
[14:22:57] <kadosh> hi
[14:23:13] <kadosh> I'm looking for a way to recover admin password
[14:23:40] <kadosh> I used to log in with db.auth("admin","pass")
[14:23:58] <kadosh> but now I can't do anything on the data base
[14:49:48] <shoerain> kadosh: would you be looking for something like this: http://dba.stackexchange.com/a/63295 ?
[16:06:42] <locojay_> hi, I need to run a repairDatabase() since disk size and db size do not match. i only have 200gb left on a 700gb db so i'm using --repairpath, but I get: "You must use a --repairpath that is a subdirectory of --dbpath when using journaling"
[16:07:57] <locojay_> don't get this
[16:15:31] <locojay_> storagesize and filesize are 500gb off. i need to clean somehow
[16:25:20] <locojay_> symlinking doesn't help
[16:28:38] <GothAlice> locojay_: I'm reeeally not sure what you'd be symlinking, and why that might have "helped", but there will always be a mismatch between the on-disk size and the actual size of the data contained within.
[16:28:48] <GothAlice> MongoDB pre-allocates stripes using ever increasing power-of-two sizes.
[16:29:03] <locojay_> i removed unused files of 100gb's from gridfs
[16:29:12] <locojay_> i think this should be reflected on my disk
[16:29:16] <GothAlice> Oh, no.
[16:29:28] <GothAlice> Not at all. There are now holes in the on-disk files that will get filled in as you add new data.
[16:29:35] <locojay_> what if i need to add more docs and will get a size problem?
[16:29:50] <GothAlice> Those two files will 99.9% be inserted into an existing hole.
[16:30:05] <GothAlice> (GridFS chops files up, too, into separate records. This lets them be packed more efficiently.)
[16:30:33] <locojay_> k so you say i should not be worried about the actual filesize but just worry if i have enough diskspace - dataSize?
[16:30:36] <GothAlice> However, what you want is "compact". http://docs.mongodb.org/manual/reference/command/compact/
[16:30:46] <GothAlice> locojay_: Generally, yes.
[16:31:23] <locojay_> thanks
[16:31:28] <GothAlice> In development I use --smallfiles a lot, but in production I don't.
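[Editor's note: compact is a per-collection command and blocks the database it runs against, which is why locojay_ runs it on a secondary below. A sketch, with the GridFS chunks collection as an assumed target:]

```javascript
// Defragment one collection; repeat per collection you care about.
db.runCommand({ compact: "fs.chunks" });
```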
[16:41:35] <darkblue_b> hm interesting.. what is the VACUUM equivalent ?
[16:42:22] <darkblue_b> (I run analytics data so its mostly read-only, but still it could come up)
[16:42:42] <GothAlice> Compact.
[16:42:55] <locojay_> running compact on a secondary atm
[16:43:13] <locojay_> disk got freed :) thanks
[16:43:15] <GothAlice> (At least, vacuum full = compact. MongoDB normally self-vacuums dead objects.)
[16:43:30] <GothAlice> (Think it does this once a minute.)
[16:44:17] <sub_pop> Is there a way to remove a value for a key from a bson_t struct?
[16:44:39] <GothAlice> sub_pop: In the raw struct, yes, there's probably a BSON helper lib function to assist with that.
[16:45:12] <GothAlice> If it's already stored in MongoDB, $unset
[16:45:17] <sub_pop> I'm staring at the API doc for libbson and cannot find it.
[16:45:26] <sub_pop> Yea, I'm working with libbson directly ^_^
[16:45:35] <sub_pop> Not *actually* using a mongodb
[16:46:55] <sub_pop> I see bson_copy_to_excluding()
[17:00:56] <jaequery> hi
[17:01:32] <jaequery> ive heard stories of how mongo is not really suitable for huge databases? is this really true or just misguided information?
[17:01:45] <jaequery> im talking like 100gb's+
[17:01:49] <GothAlice> jaequery: I have 24TiB (growing at 12-24GiB per day) in MongoDB without issue.
[17:02:05] <jaequery> wow
[17:02:12] <jaequery> what type of service are you running?
[17:02:13] <LouisT> Soooo massively misguided information heh
[17:02:52] <LouisT> But you also have to remember it greatly depends on your hardware..
[17:02:52] <GothAlice> jaequery: I record everything, and have been since 2001. (Transparent HTTP proxy, natural language processing for automated tagging and content extraction, etc., etc.)
[17:03:54] <jaequery> do you ever come across times where you need to restructure your collection definitions? and how would you handle that?
[17:03:59] <GothAlice> Non-HTTP content gets slurped in on a regular basis (i.e. from my chat logs).
[17:04:25] <GothAlice> jaequery: With extreme difficulty.
[17:04:50] <GothAlice> $unset (to remove fields), $set (to add new fields), etc. work, but across the entire dataset those queries can take hours.
[17:04:59] <GothAlice> (Especially adding. Adding is the worst.)
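[Editor's note: a sketch of those whole-dataset migrations in the 2.6-era shell (update() with multi: true; updateMany() came later). The collection and field names are assumptions.]

```javascript
db.items.update(
    {},                            // match every document
    { $set:   { newField: null },  // add a field (the slow one, per above)
      $unset: { oldField: "" } },  // remove a field
    { multi: true }                // apply to all matches, not just the first
);
```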
[17:05:27] <jaequery> in SQL, you'd just create a new table and join after. can the same be said for mongo?
[17:05:40] <GothAlice> However generally I don't need to ever do those against the whole dataset. Each document in the primary metadata collection has a different structure related to the type of data it references.
[17:06:05] <GothAlice> (I.e. music has {artist: '', …} while books have {author: '', …})
[17:08:26] <GothAlice> jaequery: For the most part, since I won't have data for newly added fields most of the time, I don't worry about adding a "null" or "empty string" value to all of the data. There's no point. It'll be used and present on any record I update where I set it.
[17:10:24] <GothAlice> jaequery: The version of my exocortex (the name of the project) that last used SQL used multiple table inheritance. It was unbelievably painful, but I hadn't learned of EAV at the time.
[17:20:24] <obeardly> Hello Mongoers; can anyone here tell me if it is possible to assign a role to multiple users at once?
[17:21:09] <GothAlice> obeardly: In theory you could execute an update statement against the system.users collection that would update several records at once. $push is the one you want to "add" a value to a list.
[17:21:31] <obeardly> GothAlice: thank you
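[Editor's note: a purely illustrative sketch of GothAlice's suggestion. In 2.6 users live in admin.system.users with roles stored as {role, db} subdocuments; editing that collection directly bypasses the supported helpers (db.grantRolesToUser() is the per-user route).]

```javascript
db.getSiblingDB("admin").system.users.update(
    { user: { $in: ["alice", "bob"] } },                    // the users to change
    { $push: { roles: { role: "readWrite", db: "app" } } }, // "add a value to a list"
    { multi: true }
);
```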
[17:23:15] <jaequery> gothalice: could you elaborate on how you utilize EAV w/ mongo?
[17:23:24] <GothAlice> jaequery: You don't. ;)
[17:23:41] <GothAlice> entity = {_id: ObjectId(…), attribute: value}
[17:23:56] <GothAlice> Where you can have as many attributes and values as you want, up to the document size limit. ;^)
[17:24:19] <GothAlice> EAV is a solution to make SQL like Mongo.
[17:24:32] <GothAlice> (At the expense of some pretty insane JOINs.)
[17:35:43] <ejb> GothAlice: Remember my query? I have that list of generated $cond's (in Python - or JS in my case)
[17:35:51] <GothAlice> ejb: Aye?
[17:35:54] <ejb> GothAlice: And I'm $add'ing those up
[17:36:43] <ejb> GothAlice: That's fine. But if I add another item to the $add list like this: {$divide: [{$subtract: [radius, '$distance']}, radius]}
[17:37:21] <ejb> GothAlice: When I $sum the score in the $group stage, the number is a whole number, not a float like it is in the $project stage
[17:43:47] <ejb> GothAlice: hm, I guess I fixed it.
[17:43:59] <GothAlice> What did you do?
[17:46:31] <ejb> Not sure. I broke the $cond array out into its own variable then did push(divideExp)
[17:46:34] <ejb> and it worked
[17:46:50] <ejb> I was chaining _.map().push() before
[20:06:56] <GothAlice> d/dx(e^Ϝ) == Ϝe (derivative of e to the wau is "wau e"; oh math jokes)
[20:08:57] <baegle> Is there a way to create a database, fill it with data, and then set it to read-only?
[20:09:42] <GothAlice> You can semi-enforce read-only access by only accessing the database on a replication secondary. (Writes must go to the primary.)
[20:10:36] <GothAlice> baegle: Otherwise you could have a boolean field on your documents indicating read/write status, and filter your update()s on that. (I.e. only allow updates on records whose ro: field is false or rw: property is true.)
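[Editor's note: a sketch of the boolean-flag idea; the "docs" collection and the ro field are assumptions. Filtering every update on the flag means flagged documents are never modified:]

```javascript
db.docs.insert({ _id: 1, body: "original", ro: true });

db.docs.update(
    { _id: 1, ro: { $ne: true } },     // filter excludes read-only documents
    { $set: { body: "new content" } }
); // nMatched: 0 - the flagged document is untouched
```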
[20:10:49] <baegle> OK, so the answer is no.
[20:11:15] <GothAlice> The answer is yes, through a variety of means all under your own control.
[20:11:52] <baegle> No, the answer is I can prevent applications from writing to the database by various mechanisms but I cannot create a database, fill it with data, and then set it to read-only
[20:12:26] <GothAlice> Only connecting to a secondary has the effect of making the entire connection read-only. Access your read-only database through a secondary, and it's really read-only.
[20:12:29] <GothAlice> (Writes will explode.)
[20:13:12] <baegle> the connection is read-only in that sense
[20:13:16] <GothAlice> Thus careful management of your database connections (i.e. using a connection to the full cluster for most, a direct connection to a secondary for read-only data) can secure your data in the way you want.
[20:13:38] <baegle> the equivalent conversation about file systems would be that no, you cannot set -w, but you CAN mount the entire FS with ro.
[20:13:55] <baegle> in that conversation the question "Can I set a file to read-only?" has the answer of "no"
[20:14:03] <GothAlice> baegle: Think of it more like having a filesystem, then mount -o bind,ro a subsection of it.
[20:14:18] <GothAlice> You can mount -o bind a single file if you wish, making that one file read-only.
[20:14:34] <baegle> but that's not actually true because the file system also has to be mounted writable for it to have a replicant, correct?
[20:14:38] <GothAlice> (Not really, but this is an imaginary hypothetical.)
[20:14:46] <baegle> It's not like I can shut-down the writable database and leave the read-only one there
[20:14:58] <GothAlice> baegle: Actually, you can. It's a terrible idea, but you can.
[20:15:19] <GothAlice> Spin up two replica members and an arbiter, populate the primary, kill it, the secondary will freak out and enter r/o mode.
[20:15:54] <GothAlice> At the same time, since it can't get a majority of votable nodes, it likely won't accept new connections either.
[20:16:01] <GothAlice> (But that would need to be tested.)
[20:16:54] <GothAlice> Really, though, at work we store frequently-accessed purely readonly data in SQLite whose on-disk file is non-writable.
[20:17:35] <GothAlice> (Putting the r/o data as close to the application as possible.)
[20:22:57] <GothAlice> (My managers keep using the words "big data"… I tried to make it clear to them we do not qualify for that phrase, yet.)
[20:24:17] <mike_edmr> ugh. buzzwords.
[20:24:23] <GothAlice> Yeeah.
[20:25:02] <mike_edmr> great article the other day that touched on 'big data fever'
[20:25:03] <mike_edmr> http://spectrum.ieee.org/robotics/artificial-intelligence/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts
[20:25:29] <mike_edmr> how we're in a phase of heightened expectations for what it can do, and theres likely going to be a bit of a backlash
[20:26:10] <GothAlice> At work here I've been tasked with producing analytics across our job/click/referral data. The key metric is cost per applicant. There is zero possibility we can calculate this value, but they still want it.
[20:26:25] <mike_edmr> heh
[20:26:32] <GothAlice> (We don't know if the user clicks on 'apply' and actually fills out the form after the tracked click to the job detail on the foreign ATS.)
[20:26:34] <GothAlice> Painful.
[20:27:56] <GothAlice> So they were, like, well, we *can* measure intent to apply, right? No. Intent to apply would be clicking apply regardless of actually filling out the form. We can't get that either… just cost per click. So we're calling "cost per click" (CPC) "intent to apply" anyway. >_<
[20:28:46] <GothAlice> mike_edmr: Save me from managers. Please. ;)
[20:29:32] <mike_edmr> GothAlice: when i figure out how to save myself from mine, you'll be next in line
[20:31:28] <GothAlice> Great article, BTW. That guy uses average sentence lengths that exceed mine, which is kinda nuts. XD
[20:45:04] <GothAlice> On the natural language bit, I just asked my phone "what was the phase of the moon on July 12th in Argentina?" (waxing gibbous, 99%), "what is the specific heat of hydrogen cyanide" (1.328 J/g K gas, 2.612 J/g K liquid), "what is the molecular structure of lithium cobalt oxide?" (it showed me), and "what is the speed of light in furlongs per fortnight" (about 1.8 trillion). The modern era is kinda mind-blowing when you think about it.
[20:55:24] <daidoji> fo sho
[20:58:40] <dbb> .. new to mongodb.. I am just walking through some examples.. question about roles/users in 2.6
[20:59:09] <GothAlice> dbb: What's the q?
[21:00:10] <dbb> I started the mongo shell as dbb, but mongod is running from mongodb user
[21:00:30] <dbb> I want to just make a few things and try them.. will I get tangled on this somehow ?
[21:00:36] <GothAlice> dbb: The mongo shell is a MongoDB "client", like a webapp would be. It connects to the mongod process over a network connection, even on the local machine.
[21:00:50] <GothAlice> (It's just a local network connection.)
[21:01:20] <dbb> ok yes.. is it advisable to make a dbb user and make it an admin, to simplify the exercises
[21:01:24] <dbb> it's my own machine
[21:01:46] <dbb> if so, do I have to pass all the params in a JSON struct?
[21:01:49] <GothAlice> As long as you either don't care about your playtesting data set, or have backups, you can always stop mongod, nuke /var/lib/mongodb/*, and start mongod back up again to start from scratch. When I run mongod locally for development, I don't even bother with authentication at all. (I do make sure it only accepts local connections, though. --bind_ip 127.0.0.1 or similar.)
[21:02:15] <dbb> yes local connections only now
[21:02:37] <dbb> when I make a new user from the shell command line, the JSON struct is required?
[21:02:57] <dbb> I take it that db.createUser() is preferred in 2.6
[21:03:01] <GothAlice> dbb: Almost all data in MongoDB, even when using the shell, are passed around as JSON/BSON structures. You can use the helpers, and you can also directly query against the collections that the helpers manipulate.
[21:03:39] <GothAlice> A la: http://docs.mongodb.org/manual/tutorial/add-user-to-database/#create-the-new-user
[21:03:40] <dbb> I have a VM with rockmongo in it.. but I won't want to depend on that going forward..
[21:03:51] <dbb> ah hm
[21:04:24] <voidDotClass> Can you have nested information in a document? E.g a Family document which has nested documents for each child?
[21:04:37] <voidDotClass> or a hotel document with nested documents for each room?
[21:04:54] <dbb> I have read that nesting is unlimited in mongo.. but that doesn't sound right
[21:05:10] <dbb> you are trying to make a relation in a flat world
[21:05:32] <GothAlice> Yes. MongoDB stores rich, deep structures, and provides ways to query and manipulate that data. However often it's best to step back from the data and really examine how best to model it. Nesting is unlimited, but not useful beyond a certain point. Don't do that.
[21:05:58] <voidDotClass> What would be a better way to query for the children of a particular family, in my above example?
[21:06:10] <dbb> two collections perhaps
[21:06:13] <GothAlice> For example, consider simple forums. You have threads/topics, replies/comments to those threads, forums (that group threads), and categories (of forums).
[21:06:53] <voidDotClass> dbb: that would be ugly
[21:07:05] <GothAlice> You could store something like: {category: "General", forums: [{forum: "Random", threads: [{title: "General randomness.", comments: [{author: "bob", comment: "Oh boy."}]}]}]}
[21:07:16] <GothAlice> It'd basically be impossible to use it that way, though.
[21:07:50] <GothAlice> Instead, comments make sense to store in a thread (the finest-grained thing you can look at is a thread) while storing the forums and categories separately.
[21:08:15] <dbb> forums is a collection and categories is a collection ?
[21:08:24] <GothAlice> Yes.
[21:08:30] <GothAlice> And threads is a collection.
[21:08:31] <dbb> voidDotClass: yuh huh
[21:08:33] <voidDotClass> GothAlice: What about if forums was an array of forum ids, so if you got a category, you'd have a list of its forum ids, and could then query for those ids on the forum collection?
[21:08:58] <GothAlice> voidDotClass: Doing it that way suffers some problems.
[21:09:03] <GothAlice> But it is a valid approach.
[21:09:09] <voidDotClass> what problems?
[21:09:17] <voidDotClass> what would be the best approach then?
[21:09:31] <voidDotClass> i.e say you wanted to query the forums within a particular category
[21:09:49] <GothAlice> {category: "General", forums: [ObjectId(…), …]} — this would let you manually order the forums within the category (good), but require two lookups if you want to search forums on anything else than category.
[21:10:11] <dbb> voidDotClass: "how do I use a document-oriented flat store to model my relational problem"
[21:10:58] <GothAlice> voidDotClass: And would require updating of a record in another collection if you add or remove a forum.
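[Editor's note: the two-lookup shape being discussed, against assumed "categories" and "forums" collections. $in does not preserve the stored order, so the category's manual ordering has to be re-applied client-side:]

```javascript
var cat = db.categories.findOne({ category: "General" });
var forums = db.forums.find({ _id: { $in: cat.forums } }).toArray();

// Re-sort into the category's manual order (compare ObjectIds by hex string).
var order = cat.forums.map(function (id) { return id.str; });
forums.sort(function (a, b) {
    return order.indexOf(a._id.str) - order.indexOf(b._id.str);
});
```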
[21:11:28] <voidDotClass> In your example, what's ObjectId? is it an array, or a function..?
[21:11:56] <GothAlice> voidDotClass: It's a constructor for an ObjectId datatype. MongoDB doesn't use auto-increment integer keys, it uses something that scales a whole lot better.
[21:12:10] <GothAlice> In my example, it's p-code (since I don't supply an example ID.)
[21:12:25] <voidDotClass> I see.
[21:12:50] <GothAlice> voidDotClass: For further reading: http://docs.mongodb.org/manual/reference/object-id/
[21:13:52] <voidDotClass> GothAlice: Is that the same as my above example then, where I said it could category could store an array of forum ids?
[21:14:00] <GothAlice> Yes.
[21:14:07] <GothAlice> (That's what it would look like.)
[21:14:09] <voidDotClass> I see. Thanks
[21:14:45] <voidDotClass> Is there any ORM or such for Java in Mongo, which would take care of automatically updating the forums array on categories, if any forums were added or removed to a category?
[21:15:07] <voidDotClass> Anyone else know?
[21:15:21] <GothAlice> I'm sure there must be a client driver that simulates triggers for you. MongoEngine on Python does.
[21:15:33] <voidDotClass> Triggers?
[21:15:45] <GothAlice> "Database triggers", i.e. "when I insert into this collection, run this function", etc.
[21:16:41] <voidDotClass> I know, I was curious if Mongo supports triggers.
[21:16:55] <GothAlice> It does not itself, no. Client drivers provide that behaviour.
[21:17:01] <voidDotClass> Ahh.
[21:17:22] <voidDotClass> Overall, I have to say I'm leaning towards Mongo a lot more than Couch.
[21:18:12] <voidDotClass> Mongo seems easier to use than couch. Only thing I'm concerned about is clustering. Do I need to set up 5 db nodes even when I only need one, i.e when I have 0 users?
[21:18:33] <GothAlice> voidDotClass: https://gist.github.com/amcgregor/18e3180215736369b49e is an example of one of my triggers, and how to use it. Very simple in Python/MongoEngine.
[21:18:59] <voidDotClass> thanks.
[21:19:45] <GothAlice> voidDotClass: Replication is to handle reliability and failover, as well as backups and offloading read-only queries, and sharding is to help scale. You don't need these, but they do become useful in production environments.
[21:20:14] <GothAlice> voidDotClass: https://gist.github.com/amcgregor/c33da0d76350f7018875 is a tiny script to start a small authenticated, replicated, and sharded cluster on one machine. :)
[21:20:35] <GothAlice> (just follow the instructions and edit your copy before running)
[21:21:21] <voidDotClass> GothAlice: If the cluster is run on one machine.. i.e one node on AWS, how will it scale?
[21:21:49] <GothAlice> voidDotClass: Oh, it won't. It'll be pretty terrible, in fact. That script is to help test application response to an environment like that in development.
[21:22:09] <GothAlice> I.e. "did we set up our sharding rules right"
[21:22:12] <voidDotClass> so in practice, you would need to start 5 nodes / machines?
[21:22:20] <GothAlice> A minimal set is three.
[21:22:27] <joannac> GothAlice: replication is not to scale reads! askasya.com/post/canreplicashelpscaling
[21:22:44] <GothAlice> This script constructs 10 processes, in theory three of them would be shared on machines doing other things, so that setup would need 7 nodes.
[21:23:55] <voidDotClass> So, i read somewhere that mongo can automatically start new instances to help scaling. Does that include auto starting new nodes on AWS / load balancing?
[21:24:15] <voidDotClass> or will i need to take care of starting new nodes on AWS manually?
[21:24:50] <GothAlice> voidDotClass: There are ways of automating that, yes. MMS seems to have that capability these days, but not sure about actually spinning up new VMs.
[21:25:00] <voidDotClass> MMS?
[21:25:01] <GothAlice> voidDotClass: http://mms.mongodb.com
[21:25:11] <joannac> voidDotClass: where's the "somewhere"?
[21:25:24] <joannac> I haven't heard of such a thing
[21:25:27] <voidDotClass> on the case study for foursquare, i think
[21:25:38] <voidDotClass> http://www.mongodb.com/customers/foursquare
[21:26:38] <voidDotClass> "Instead of writing its own sharding layer, Foursquare can rely on MongoDB's automated scaling infrastructure and spin up new nodes as its application grows"
[21:26:42] <dbb> I will certainly be doing this also.. though I expect to write ansible scripts to do that
[21:26:52] <GothAlice> voidDotClass: They likely rolled their own automation to scale VM nodes. That article mostly mentions auto-failover.
[21:27:21] <dbb> I expect I will create a master node, and then inform it as each slave is created
[21:27:23] <GothAlice> And the auto-sharding rebalancing thing, yeah.
[21:27:37] <GothAlice> dbb: You can do it one at a time like that, yes.
[21:27:49] <dbb> crawl before walk before run :-)
[21:27:50] <GothAlice> dbb: The example script I gave you tells each replication set about all nodes in that set all at once.
[21:27:51] <voidDotClass> Lets say if you have your own automation and you spin up a new VM, how will you add that VM to your mongo cluster?
[21:28:26] <GothAlice> voidDotClass: My VMs announce their startup to a smaller offsite management cluster, which then pings back to the other members to let them know to add the member.
[21:28:38] <joannac> voidDotClass: thanks for the link. I'm not actually sure what that refers to; I think it refers to the fact you can add a new shard at any time, but MongoDB won't just start new instances automatically (because that would be a bit weird)
[21:28:43] <dbb> oh neat
[21:29:12] <voidDotClass> GothAlice: The other members = members in your offsite management cluster or your main cluster?
[21:29:47] <GothAlice> https://gist.github.com/amcgregor/4342022 is the "name pool" collection for my automation, and https://gist.github.com/amcgregor/2032211 a monitoring document (also in MongoDB).
[21:30:04] <GothAlice> voidDotClass: Main cluster. The management cluster would just record the new member in the name pool.
[21:30:15] <GothAlice> (That's how hosts get assigned DNS names, FYI. They ask for one.)
[21:30:27] <voidDotClass> what's the purpose of using the management cluster?
[21:30:36] <voidDotClass> why can't you just talk to the main cluster directly?
[21:30:40] <GothAlice> We've had cross-zone failures on Amazon EC2 in the past.
[21:30:47] <GothAlice> So we simply don't trust it.
[21:31:08] <GothAlice> The management cluster can, in theory, spin up VMs on any number of providers, making it much more resilient.
[21:31:11] <dbb> it seems to neatly solve some instantiation corners
[21:31:15] <dbb> I like it
[21:31:16] <voidDotClass> Ahh
[21:31:27] <GothAlice> (We currently use Rackspace and Amazon.)
[21:31:30] <voidDotClass> How many nodes do you have in your main cluster GothAlice ?
[21:31:41] <voidDotClass> approx
[21:31:51] <dbb> more than you ?
[21:31:59] <voidDotClass> obviously
[21:31:59] <GothAlice> voidDotClass: Peak was ~190 cores. Average is ~60 or so cores.
[21:32:15] <GothAlice> (Each VM having 4 cores on average.)
[21:32:17] <voidDotClass> how many mongo nodes per core?
[21:32:23] <voidDotClass> or per vm
[21:32:37] <GothAlice> One mongod per vm, with the exception of the config servers which are piggybacked.
[21:32:51] <dbb> yes makes sense
[21:33:20] <voidDotClass> so you give 4 cores to each mongodb node?
[21:33:32] <GothAlice> The servers also don't have permanent storage. (Amazon ephemeral storage is fscking awesome, and insanely fast compared to EBS volumes which are the bane of my existence.)
[21:33:38] <dbb> I suspect that depends alot on the work
[21:34:18] <voidDotClass> GothAlice: Do you use any particular AMI for AWS? Or anything else to setup your vms when you launch them?
[21:34:20] <dbb> btw GA, I talked to my wizened admin neighbor about DNS.. he believes that I can do my cluster of Amazon VMs without actually naming them
[21:34:59] <dbb> the 'remote' manager might help with that, too
[21:35:00] <GothAlice> dbb: You certainly can. It's not advised to do that, but…
[21:35:15] <GothAlice> My management cluster, you'll note, spends a lot of effort to manage DNS names. ;)
[21:35:36] <dbb> on each VM creation, send a msg to the remote with some profile info and the 10.x IP addr
[21:35:50] <dbb> then instruct the master to add the node
[21:35:57] <GothAlice> voidDotClass: We run our own homogenous AMI. (One AMI for all servers regardless of role.) Eases distributing updates across the cluster.
[21:36:13] <voidDotClass> I see.
[21:36:53] <voidDotClass> GothAlice: What AWS instance type and how many instances would you recommend for a completely new service, that has 0 traffic but might grow or might stay low / zero?
[21:37:09] <voidDotClass> or what ram / cpu per vm
[21:37:14] <GothAlice> voidDotClass: On EC2 you can pass arguments to the VM at the time you request it, too. (We use that to inform the VM which role/branch to git checkout into.)
[21:37:19] <dbb> what work do you expect the VM to do ?
[21:37:27] <GothAlice> ^ That's the question.
[21:37:37] <voidDotClass> Run MongoDb nodes?
[21:37:43] <dbb> uuhhh
[21:37:46] <GothAlice> Doing what?
[21:38:02] <voidDotClass> The usual database work, querying, inserting, updating
[21:38:09] <voidDotClass> more querying and inserting than updating
[21:38:33] <GothAlice> Storing large volumes of data? Running costly queries? Map/reduce? How many queries per minute/hour/etc estimate?
[21:39:16] <GothAlice> dbb: Hey, ignorance can be corrected and is nothing to be ashamed of. The ability to pass on knowledge is what made us human.
[21:40:25] <voidDotClass> No mapReduce or costly queries. Volume of data and queries will be 0, and will depend on traffic.
[21:40:52] <GothAlice> voidDotClass: So for a basic start you could begin with a single instance with a minimum of 2GB RAM, must be 64-bit (for a variety of reasons), and two cores. I'd recommend three in a replica set to make sure your data doesn't accidentally itself.
[21:41:01] <GothAlice> (Remember I don't trust AWS. ;)
[21:41:14] <voidDotClass> Three mongo nodes?
[21:41:18] <GothAlice> Aye.
[21:41:30] <voidDotClass> all on the same instance?
[21:41:37] <GothAlice> No, on three instances.
[21:41:40] <GothAlice> Otherwise there is no redundancy.
[21:41:42] <voidDotClass> Ah
[21:41:43] <GothAlice> http://docs.mongodb.org/manual/tutorial/deploy-replica-set/
[21:42:01] <voidDotClass> Thanks so much for your help.
[21:42:06] <GothAlice> You can start with one and later replicate, but your data is "at risk" as long as it's only on one node.
[21:42:21] <GothAlice> It never hurts to help.
[21:43:25] <voidDotClass> how would you determine if more VMs are needed? by checking the CPU / memory usage on each VM?
[21:43:38] <GothAlice> voidDotClass: It can seem silly to only have, say, one application server and *three* database servers, especially when first starting, but…
[21:45:22] <GothAlice> voidDotClass: Generally you want your entire dataset to fit in RAM. When you can't scale a single machine up (by adding more RAM) any more, you have to split (shard) your data into separate chunks on separate servers. There's a break-even point to it that I'll leave as an exercise for the reader. (Where the cost of adding ram is more than the cost of adding a new VM.)
[21:46:08] <GothAlice> However CPU load is a thing, too. That can be somewhat helped with sharding (because sub-sections of a query are run across different machines), and potentially improved by altering your data design to simplify expensive queries.
[21:47:00] <voidDotClass> GothAlice: But sharding is automatic in mongo, right? I.e I'll just need to spin up a VM, start the node, have it ping the master, and the rest will be taken care of automatically?
[21:48:12] <GothAlice> voidDotClass: Somewhat; but the word "automatic" is misleading, here.
[21:48:32] <GothAlice> When you add a new node to a sharding cluster, data will re-balance across the cluster to spread things around as evenly as possible.
[21:48:54] <GothAlice> But it does this according to your sharding index—something you configure yourself. What you choose to shard *on* in a given collection will have a huge impact on if sharding is even useful or not.
[21:49:21] <voidDotClass> I see. I have some more reading to do on the sharding then.
[21:49:54] <GothAlice> voidDotClass: I spend a lot of my data architecting just meditating on the structures, plans, and repercussions.
[21:50:16] <joannac> if I could offer one piece of advice: think long and hard about your shard key
[21:50:27] <GothAlice> It is of utterly critical importance.
[21:50:55] <GothAlice> Wrong key and all of your data may end up on shard A and not spread around. (Or related data gets spread around when it should actually be on the same server for performance reasons, etc., etc.)
[21:51:31] <voidDotClass> I see.
[21:51:42] <voidDotClass> Thanks again for the help.
[22:06:28] <GothAlice> voidDotClass: You're welcome. :)
[22:06:55] <Synt4x`> I have a bunch of test scores by date, example { 2014/10/05, mean, median, sample, [ {name, score}, {name, score}, {name, score}, ... ] }
[22:06:59] <voidDotClass> It seems cassandra's scaling is more automatic, seems like it's just a matter of spinning up a VM, and that's it.
[22:07:10] <Synt4x`> I want to track data from the individuals as they take tests over time (they take one every day)
[22:07:20] <Synt4x`> things like percentile, %diff from mean, %diff from median, etc.
[22:07:44] <GothAlice> joannac: Could you confirm for me that if you shard on _id ObjectIds, documents will be distributed effectively round-robin to the shards?
[22:07:53] <Synt4x`> should I store this in the {name, score} pair, so {name, score, percentile, %diff_mean, %diff_median}, {name, score, percentile, %diff_mean, %diff_median}...
[22:08:01] <Synt4x`> or calculate it each time I want to grab it?
[22:08:08] <Synt4x`> not sure what's best DB/programming practice
[22:08:35] <GothAlice> Synt4x`: Depends on how frequently you need access to those stats (which would be "pre-aggregated" in a sense) vs. how expensive it is to calculate.
[22:08:43] <GothAlice> If rare, more expense is acceptable.
[22:09:24] <GothAlice> We try to pre-aggregate data just enough to keep our main user-facing queries (i.e. the ones the user actually has to wait for to see something) below 100ms.
[22:09:35] <Synt4x`> GothAlice: I might make the results available to the public, meaning they would be on a web server where people could look them up all the time, so it could end up getting read quite a bit
[22:09:37] <GothAlice> (total for all queries per request)
[22:10:13] <Synt4x`> so I wasn't sure just how expensive it is to store them vs to calculate them (which takes 2 $groups, and a couple of for loops atm)
[22:10:22] <GothAlice> Then the cost of storage is negligible compared to the potential cost of calculating repeatedly. Storing would be a good idea in that situation, and it makes your data more useful with less work on the part of the consumers. :)
[22:10:49] <GothAlice> (Whereas for something like a monthly report, there's little point in storing the values.)
[22:10:56] <Synt4x`> ok thanks :), that was the approach I was taking and wanted to make sure it wasn't something super newb :-p
[22:10:57] <Synt4x`> ty
[22:11:31] <GothAlice> MongoDB asks not what you think your data should look like but instead asks, how are you going to *use* the darn stuff?
[22:11:34] <GothAlice> ;^)
[22:14:00] <dbb> aha - I see this in the mongo log Failed to authenticate dbb@mongo_fdw_regress with mechanism MONGODB-CR: AuthenticationFailed UserNotFound Could not find user dbb@mongo_fdw_regress
[22:14:31] <dbb> I am trying out the postgres foreign data wrapper.. something isn't right yet..
[22:14:53] <dbb> user plus database name ?
[22:19:23] <GothAlice> MongoDB authentication can be non-obvious. You have system-level users (i.e. a global admin) and database-level users (e.g. application user). Which database you "authenticate against" can be different than the one you wish to actually use. (I.e. to authenticate as the global admin on anything other than the admin table, you need to explicitly say authenticate as admin from the admin db but use 'foo'.)
[22:19:48] <GothAlice> s/table/database
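[Editor's note: a sketch of "authenticate against one database, use another" in the mongo shell; the user, password, and database names are assumptions.]

```javascript
var admin = db.getSiblingDB("admin");
admin.auth("admin", "pass");        // credentials are stored in the admin db
var appdb = db.getSiblingDB("foo"); // ...but the work happens elsewhere
appdb.stats();
```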
[22:20:24] <GothAlice> I'm not familiar with pg-fdw, though. T_T
[22:20:45] <GothAlice> dbb: In which database did you add your dbb user?
[22:21:35] <dbb> hmmm
[22:21:52] <dbb> user is per-database in mongo ?
[22:21:58] <GothAlice> Yes.
[22:22:03] <dbb> I might have messed that up
[22:22:05] <dbb> looking
[22:22:18] <GothAlice> (This is extremely useful, BTW. All of my clients have an "admin" user they can use that is only admin on their database.)
[22:22:31] <GothAlice> (But isn't actually allowed to create new users…)
[22:28:56] <dbb> oh hm - apparently I was in the admin database when I made the dbb user, which has role 'root'
[22:29:19] <GothAlice> You may wish to remove that user and add the user to the correct database. :)
[22:29:24] <dbb> when I use the test db, that db.createUser() doesn't work, because there is no role root
[22:29:35] <dbb> ok - getting used to this.. thx
[22:29:53] <GothAlice> dbb: Pro tip: create a generic global admin on the admin db. Use that to populate the other db users.
[22:30:02] <dbb> in pg I have a general purpose superuser called dbb
[22:30:13] <dbb> and roles are global.. but no matter
[22:30:23] <GothAlice> Sample MongoDB roles for an admin: https://gist.github.com/amcgregor/c33da0d76350f7018875#file-cluster-sh-L78
[22:35:29] <dbb> http://paste.debian.net/128603/
[22:35:32] <dbb> success !
[22:35:48] <dbb> break - document steps - mtg :-)
[22:36:32] <GothAlice> The number of catgirls being killed by processing a rich, potentially deeply nested document system through flat SQL tables is too darn high.
[22:37:15] <dbb> ooohh it will be ok :-)
[22:38:03] <dbb> this will be running clustering analysis on tons of strings, anyway.. so no deeply nested structures here
[22:38:31] <GothAlice> (I nearly have an aneurism every time a client asks me "can I have that data in an Excel file?" ;)
[22:39:32] <dbb> side story - the academics here do SOOO much R.. the planners who get paid lots of money.. ? excel
[22:39:35] <GothAlice> dbb: Other than learning (which is why I do most of the things I do) any particular reason to not store that data in pg itself?
[22:40:08] <dbb> I had my conf call with the clients ... it appears they want to use map/reduce to do the clustering analysis..
[22:40:36] <GothAlice> Then why access via pg at all? Why not Mongo direct?
[22:41:55] <GothAlice> dbb: Use https://gist.github.com/amcgregor/7bb6f20d2b454753f4f7 to convince them they really want aggregate queries, not map/reduce. (Same query in both… one is clearly more understandable!)
[22:41:58] <GothAlice> ;^P
[22:42:14] <dbb> ooohhhh that would be great
[22:42:28] <dbb> if I get stuck on some horrible thing, I can fall back to fixing it/understanding it/backing it up on pg
[22:42:41] <dbb> this same setup will work if I am talking to a master node in a cluster, across a wire
[22:47:03] <dbb> (I have to sign in to github to see that gist.. that means they track what I read.. )
[22:49:02] <dbb> I forked that GIST and I will give it a try
[22:52:57] <GothAlice> dbb: It's a private gist, that's why.
[22:53:03] <GothAlice> dbb: (code from work)
[22:53:35] <dbb> I do not trust github - I am very appreciative of this channel and GA
[22:53:40] <GothAlice> And yes, the Python code at the end is a method of storing aggregate queries within documents.
[22:54:14] <GothAlice> (Our report engine builds reports from documents which describe them, including the aggregate pipeline stages.)
[22:54:26] <dbb> these are my test notes from today http://paste.debian.net/128605/
[22:55:41] <dbb> ok ! off to meetup... thx very very much .. more later 8-)
[22:55:53] <GothAlice> You're welcome, and enjoy. :)
[23:52:57] <mongoooo> hello! i am trying to find a good way to store/check days of the week in mongo. my current solution is a weekday field which is an array of booleans for each weekday m-f (e.g., [true, false, false, false, false] to indicate Monday) -- does anyone have any suggestions for the best way to query this array for a match (e.g., to show two users who selected Monday)?
[23:54:51] <GothAlice> mongoooo: Oh my yes.
[23:54:54] <GothAlice> Give me a moment.
[23:55:14] <GothAlice> mongoooo: See: https://gist.github.com/amcgregor/1ca13e5a74b2ac318017
[23:57:20] <mongoooo> GothAlice: thanks! i will take a look!!
[23:58:09] <GothAlice> We do click analytics at work. Day-of-the-week is somewhat painful due to differences of interpretation over when the week actually starts. ;)
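[Editor's note: for mongoooo's boolean-array layout, array elements can be addressed by position with dot notation, so "users who selected Monday" is a query on index 0. A sketch with an assumed "users" collection:]

```javascript
db.users.insert({ name: "a", weekday: [true, false, false, false, false] });
db.users.insert({ name: "b", weekday: [true, false, true,  false, false] });

db.users.find({ "weekday.0": true }); // both documents: each selected Monday (index 0)
```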