[01:18:55] <scrandaddy> Hey guys. I have a tailable cursor to a capped collection that I want to constantly be processing on. How can I design this in a scalable way so that I can have multiple workers operating on the data as it comes in?
[01:20:50] <scrandaddy> Do I need a master worker to maintain a single tailable cursor and dispatch new data to workers or can I do it with independent workers?
[01:21:34] <scrandaddy> Each maintaining their own cursors
[01:51:12] <Boomtime> scrandaddy: you can do either, you just need to ensure you process each "job" (or whatever it is) only once
[02:15:40] <scrandaddy> Boomtime: are there any conventions?
[02:16:15] <scrandaddy> If I push them to a task queue maybe I should just do that to begin with and skip the capped collection?
[02:17:08] <Boomtime> I would certainly recommend you do one or the other, not both, otherwise you just have two task queues
[02:17:26] <Boomtime> which, by the way, I see all the time
[09:57:45] <h4ml3t> Is it advisable to create a static class to access the Datastore instance? Is there boilerplate code for this somewhere?
[11:27:40] <arch4> Is there a good pattern for maintaining an ordered list of objects containing a ref in mongo? i.e. [{ref: id, order: 1}, {ref: id, order: 2}] such that the list can be retrieved and rendered in order + updated relatively efficiently
[11:30:31] <arch4> for a group of objects that could number around 1000; the ref document is quite small, 5 fields of no more than 100 chars
[12:50:04] <Grim__> hi. i need to delete a large amount of documents (about 14 million) from a collection on a production server without killing it too much. what's the best way to do it? I tried deleting in bulk, but a single operation seems to take a lot of time for even as few as 5 documents
[13:01:16] <cofeineSunshine> Grim__: copy the docs that should survive to another collection, then drop the old collection and rename the new one to the right name
[13:01:33] <cofeineSunshine> this will be faster than deleting docs
[13:02:04] <Grim__> cofeineSunshine: the database is being updated all the time, how will I be sure all the data was copied?
[13:04:12] <cofeineSunshine> make new records go to a new collection with a temporary name
[13:04:31] <cofeineSunshine> and at the same time move data to the new collection from the old one
[13:04:47] <cofeineSunshine> then remove the old one and rename the temp collection
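A rough pymongo sketch of that copy-then-rename approach; the database and collection names and the survival filter below are placeholders, and (as noted above) writes arriving during the copy need to be redirected to the temporary collection:

    from pymongo import MongoClient

    db = MongoClient()["mydb"]    # database and collection names are placeholders
    survivors = {"keep": True}    # stand-in filter for the documents that survive

    batch = []
    for doc in db.events.find(survivors):
        batch.append(doc)
        if len(batch) == 1000:                  # copy in reasonably sized batches
            db.events_tmp.insert_many(batch)
            batch = []
    if batch:
        db.events_tmp.insert_many(batch)

    db.events.drop()                  # drop the old collection...
    db.events_tmp.rename("events")    # ...and rename the new one into its place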
[13:10:34] <solars> hi, quick question, I'm frequently getting this auth error on a slave: https://gist.github.com/solars/257250f58a1257b21d6d
[13:10:43] <solars> and I cannot explain it, because the credentials are valid
[13:10:52] <solars> can anyone give me a hint what to look at?
[13:14:28] <solars> hm ok this seems to be a sync problem
[13:31:25] <arch4> how can I resolve a list of nested references [{ref: _id, .. more fields}, {ref: _id, ... more fields}] such that I have the list of original documents grouped with the other fields? Is there anything better than peeling the refs and running {$in : refIds} then comparing the two arrays?
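A minimal pymongo sketch of that peel-the-refs-and-$in approach; the "docs" collection name and the exact shape of the ref list are assumptions:

    from pymongo import MongoClient

    db = MongoClient()["mydb"]

    def resolve_refs(ref_list):
        """ref_list looks like [{"ref": <_id>, "order": 1, ...}, ...]."""
        ref_ids = [item["ref"] for item in ref_list]
        # One round trip fetches every referenced document...
        by_id = {doc["_id"]: doc for doc in db.docs.find({"_id": {"$in": ref_ids}})}
        # ...then each original document is attached back to its ref, in order.
        return [
            dict(item, doc=by_id.get(item["ref"]))
            for item in sorted(ref_list, key=lambda i: i.get("order", 0))
        ]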
[15:59:51] <ernetas_> After upgrading to 2.6, MongoDB cannot be stopped using upstart anymore.
[16:00:40] <Forest> Hello, I would like to ask you one question about indexes :) Classic property indexes are implemented as B+ trees, I know that, but are the geospatial indexes using geohashes, or are they implemented as R-trees in the latest Mongo 2.6? I need it for my bachelor thesis :)
[16:00:54] <ernetas_> What could be the issue? It simply fails to detect that mongod is running and always thinks that it is stopped. It can successfully start it, though. This is running with default everything on Ubuntu 14.04, MongoDB 2.6 from the Mongo repos; the configuration is in YAML now, but rather default too: https://gist.github.com/ernetas/e6b2da59bc72880db315
[16:01:56] <ernetas_> /data/db/mongo*.lock file specifies a correct PID...
[16:03:42] <ernetas> Nevermind. Forking was the issue.
[16:51:41] <GothAlice> Has anyone attempted to store hierarchical (tiered) huffman trees in MongoDB for de-duplication and compression of highly redundant data by class?
[16:51:56] <GothAlice> (I.e. to determine the PDF base huffman tree analyze the probabilities of all patterns in all PDFs and take the top 5%, then each individual PDF will reference the base and its own per-document huffman tree for the remainder of its data…)
[17:46:59] <libbaloa> hey there guys, are mongodb capped collections a good solution to use as a write-back cache?
[17:48:19] <libbaloa> i want to process a bunch of http requests offline and need a place to store them temporarily. i was looking at redis before, but then found capped collections and they seem like they might work. one problem with redis was that i was limited by the max connections of the redis cloud solution i was looking at, and the mongo one doesn't have connection limits
[18:03:20] <GothAlice> libbaloa: I make heavy use of capped collections for a variety of purposes (capped log storage, distributed RPC message bus, etc.) Capped collections aren't restricted by connection count (they're effectively just like any other collection) excepting that they are restricted by allocated disk space.
[18:03:59] <GothAlice> They act as a ring buffer (when writing hits the end, start at the beginning and overwrite as you go) so you need to carefully design for your processing lag. Also note that records inserted into a capped collection are never allowed to grow (since this would require moving them) so updates can only ever keep the document the same size or make it smaller.
[18:04:31] <GothAlice> (Typically for your use case you'd have a "processed" flag that defaults to false which a "worker" would set true in order to claim ownership of the work unit.)
[18:05:44] <GothAlice> db.work.update({_id: ObjectId(…), processed: false}, {$set: {processed: true}}) — this is an atomic operation, and if you check the result and no records were updated then a different worker has already accepted the work and the current worker can try the next.
[18:07:15] <GothAlice> For details, see: https://gist.github.com/amcgregor/4207375 (and the linked reference implementation in the comments)
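A minimal pymongo sketch of that claim-by-flag pattern; the "work" collection name is an assumption, and the gist above contains the full reference implementation:

    from pymongo import MongoClient

    work = MongoClient()["mydb"]["work"]

    def try_claim(job_id):
        # The filter and the $set apply atomically: only one worker can flip
        # "processed" from false to true for a given job.
        result = work.update_one(
            {"_id": job_id, "processed": False},
            {"$set": {"processed": True}},
        )
        # modified_count == 0 means another worker already claimed this job.
        return result.modified_count == 1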
[18:08:35] <libbaloa> GothAlice: would I want to use a tailable cursor with these?
[18:17:06] <GothAlice> It's one approach, and one that my dRPC system uses.
[18:17:32] <GothAlice> libbaloa: With tailing cursors, you get IMAP IDLE-style push notifications of inserts. (For another analogy, similar to "tail -f" in UNIX.)
[18:17:52] <GothAlice> If you have multiple workers, you'll need an atomic lock like the one I described above to prevent duplication of work.
[18:18:29] <GothAlice> You'll also need to measure your "backlog" — if the backlog grows beyond the capped collection size, you'll lose work.
[18:19:15] <GothAlice> (This is why my dRPC system uses two collections: one collection of actual task data, another collection only to notify of new jobs and completed jobs; the notification queue / capped collection can be wholly rebuilt from the real task collection.)
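A rough pymongo sketch of a worker tailing a capped notification collection in that two-collection arrangement; the collection and field names here are hypothetical:

    import time
    from pymongo import MongoClient, CursorType

    db = MongoClient()["mydb"]    # "notifications" and "tasks" are hypothetical names

    def claim_and_run(task_id):
        # Atomically claim the task in the real task collection (the claim-by-flag
        # pattern sketched earlier), then do the actual work.
        claimed = db.tasks.update_one(
            {"_id": task_id, "processed": False},
            {"$set": {"processed": True}},
        ).modified_count
        if claimed:
            print("processing", task_id)    # placeholder for the real work

    while True:
        # TAILABLE_AWAIT blocks server-side when it reaches the end of the data,
        # giving the IMAP IDLE / "tail -f" style behaviour mentioned above.
        cursor = db.notifications.find(cursor_type=CursorType.TAILABLE_AWAIT)
        while cursor.alive:
            for note in cursor:
                claim_and_run(note["task_id"])
        # The cursor dies if the collection is empty or the worker fell off the
        # end of the ring buffer; back off briefly and start tailing again.
        time.sleep(1)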
[18:59:06] <thisGuy> Hi everyone! Can I submit a question here?
[19:00:19] <thisGuy> Great :) so I'm trying to create a Meteor app to manage events (and I'm kind of new to mongodb), so I have this array of objects and every object has a property called "managed". I just need to make it true whenever the user selects it (I've handled that part already). I've used forEach but I just can't figure out how to make the update... This is that part of the db structure http://pastie.org/9740896 I hope you can give me a hand.
[19:04:38] <GothAlice> thisGuy: Because you have event IDs (which are hopefully unique within any given array of driverEvents!) you can explicitly update the "managed" field of a specific sub-document quite easily:
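A rough sketch of such an update, written here in pymongo syntax (the same query and update documents apply from the Meteor/shell side). The nesting, with driverEvents as the array under realTimeData.events, is an assumption from the field names in this conversation, and the eventId of 9 is just an example value:

    from bson import ObjectId
    from pymongo import MongoClient

    records = MongoClient()["mydb"]["records"]    # collection name is an assumption

    result = records.update_one(
        {
            "_id": ObjectId("…"),    # placeholder: the _id of the record to update
            # $elemMatch picks out the one driverEvents element with eventId 9...
            "realTimeData.events.driverEvents": {"$elemMatch": {"eventId": 9}},
        },
        # ...and the positional "$" applies the $set to that same element.
        {"$set": {"realTimeData.events.driverEvents.$.managed": True}},
    )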
[19:06:55] <GothAlice> That *should* be roughly correct. (I use an ODM abstraction layer which handles $elemMatch updates for me.)
[19:07:58] <GothAlice> So if the user "checks" the checkbox for eventId 3 of record ObjectId('foo'), you'd fill in the … and replace the "9" in my example above. (This operation is atomic and, with the right indexes, should be blindingly fast since no data has to move around.)
[19:09:11] <GothAlice> (You'd probably want a compound unique index on: _id, realTimeData.events.driverEvents.eventId)
[19:10:09] <GothAlice> thisGuy: Was that helpful? Any questions?
[19:12:19] <thisGuy> GothAlice: Thanks a lot! I will try with that, now one question, can I use any number of properties inside $elemMatch? Because there can be more than one event with the same eventId. Right now I'm using some additional values like the date
[19:12:38] <GothAlice> That's exactly what $elemMatch is for.
[19:14:01] <GothAlice> (If you simply queried the fields directly, a la {realTimeData.events.driverEvents.eventId: 27, realTimeData.events.driverEvents.start.date: 42}, you'd get records that have an event with an ID of 27 and contain an event (which doesn't have to be the same one!) with the date 42. That isn't right; you want both criteria to match on the same sub-document.)
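To make that concrete, a small sketch of the two query documents (pymongo/dict syntax, field names as above):

    # Matches documents where SOME driverEvents element has eventId 27 and SOME
    # element (possibly a different one) has start.date 42. Usually not what you want.
    naive = {
        "realTimeData.events.driverEvents.eventId": 27,
        "realTimeData.events.driverEvents.start.date": 42,
    }

    # Matches only documents where a SINGLE driverEvents element satisfies both.
    strict = {
        "realTimeData.events.driverEvents": {
            "$elemMatch": {"eventId": 27, "start.date": 42},
        },
    }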
[19:15:22] <GothAlice> Note! You can cheat and use ObjectIds as unique IDs on child documents. I do this on my forums. https://github.com/bravecollective/forums/blob/develop/brave/forums/component/thread/model.py#L73 — a thread has an array of comments, and when adding one (line 100-104) I generate a new ID for it prior to insertion.
[19:16:02] <GothAlice> (Then $push the resulting sub-document to the array. At no point other than display of a whole thread do I need to load *all* comments. :)
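A minimal pymongo sketch of that pattern; the collection and field names are assumptions, and the linked model.py is the real implementation:

    from datetime import datetime
    from bson import ObjectId
    from pymongo import MongoClient

    threads = MongoClient()["forums"]["threads"]    # collection name is an assumption

    def add_comment(thread_id, message):
        comment = {
            "_id": ObjectId(),           # generated client-side, before insertion
            "message": message,
            "created": datetime.utcnow(),
        }
        threads.update_one({"_id": thread_id}, {"$push": {"comments": comment}})
        return comment["_id"]            # later lookups can address this comment directly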
[19:17:52] <DustinBlackman> Hi there, was hoping to get a quick answer to a question about background indexing if possible please. :) At http://docs.mongodb.org/manual/core/index-creation/#index-creation-background under Behavior there's a line saying "Queries will not use partially-built indexes: the index will only be usable once the index build is complete." Is this going to take down my index while it builds it, or will it build a secondary in the background and then replace the first once it's finished?
[19:17:57] <thisGuy> GothAlice: Thank you, I will definitely try it! I didn't know there was an active community here :)
[19:18:13] <GothAlice> thisGuy: We have our moments. ;^)
[19:18:44] <GothAlice> DustinBlackman: The index will be ignored by the query planner; it'll continue to build in the background, though.
[19:19:14] <GothAlice> If you are in the process of *rebuilding* an index, that index is already toast and won't be used until rebuilding is complete.
[19:20:02] <GothAlice> thisGuy: https://github.com/bravecollective/forums/blob/develop/brave/forums/component/thread/model.py#L124-L127 — shows finding a thread (and single comment) by comment ID. :)
[19:21:09] <DustinBlackman> Rebuilding I understand kills it completely, that's fair. And so it's ignored by the query planner. So if the index is 'updating' for example when I insert new data, the entire index will be ignored until it's done?
[19:21:27] <thisGuy> GothAlice: Great, I'll take a look at that. Thanks
[19:21:49] <GothAlice> Most operations in MongoDB are atomic; inserting a record adds a new node to the index tree (or hash table, if using hashed indexes) without interruption to other queries.
[19:22:30] <GothAlice> Full-text indexing may have a noticeable delay before the inserted record appears in the index, however, due to the additional processing (stemming, reduplication, ranking) needed.
[19:23:03] <GothAlice> (However this is done along with the insert itself, so AFAIK, and I may be wrong, the record itself won't be available until its index is built and committed.)
[19:24:07] <DustinBlackman> GothAlice: Ah I see! So as long as background indexing is enabled, my search abilities will still work, just the new data won't be available until the background index is complete?
[19:24:51] <GothAlice> Well, again not quite. If you are fully rebuilding an index (or constructing a new one), that index is ignored and you'll encounter expensive collection scans in order to perform your query.
[19:25:10] <GothAlice> Background indexing only has an impact on new build/rebuild operations.
[19:25:22] <GothAlice> (Not insertion or update of existing records.)
[19:25:30] <DustinBlackman> No no, not rebuilding, just simply updating with new data/updating data.
[19:25:46] <GothAlice> Those updates happen atomically along with the insert/update.
[19:26:23] <GothAlice> So yes; if you perform a query in the next time slice after an insert or update, the query might not reflect the insert/update.
[19:26:40] <DustinBlackman> Fantastic! That's really helpful. Would background-indexing _id (as we use it a bit in our queries) be a smart move? Or would it be best to just leave that alone?
[19:26:48] <GothAlice> (Which is made worse if you add replication to it; querying a slave is pretty much guaranteed to be just a tiny bit behind.)
[19:27:05] <GothAlice> _id is automatically indexed; no need to worry on that one. :)
[19:27:49] <DustinBlackman> GothAlice: You've truly been a big help, can't thank you enough. :)
[19:36:02] <GothAlice> DustinBlackman: An important point I feel needs to be re-iterated after reviewing the above is that background building of indexes is an option to the command to construct the index, not an "option" of the index itself. (Again, that choice will have no impact on insert/update index update operations.)
[19:39:00] <GothAlice> (Unlike "sparse", which is an option of the index itself, as an example.)
[19:44:03] <DustinBlackman> GothAlice: Ah, so it won't affect insert/update at all, that's the part I misunderstood?
[19:47:57] <DustinBlackman> GothAlice: It only applies to creation and rebuilding?
[20:22:52] <tomservo291> an interview with the MongoDB CTO said that WiredTiger supports multi-document transactions with its MVCC implementation... but I can't find anything (blog, announcements, mailing lists, jira or driver api docs) that makes any mention of adding multi-document transactions to the client drivers in 2.8... why not?
[20:24:30] <GothAlice> tomservo291: Certain aspects of issue management on JIRA (and how the MongoDB folks use it) result in comment history (and potentially whole tickets) that are viewable only to MongoDB staff. Considering the scale of such a change (and the likely enterprise need driving it) it may be hidden in this way.
[20:26:16] <GothAlice> OTOH, MongoDB itself does not currently offer transaction support, nor was I aware of any plans to ever support it, so this sounds like a feature of an abstraction layer utilizing atomic updates and careful use of ordered or unordered bulk operations.
[20:26:59] <GothAlice> (I.e. along the lines of current two-phase commit emulation.)
[20:27:48] <tomservo291> Since 2.8 is already in RC phases and there's no mention of it, unfortunately I have to assume it's not in 2.8. I'll just have to cross my fingers for 3.0 (in the same interview the CTO claimed that if WT works well it will likely become the default storage engine in 3.0); hopefully then they'd support transactions at the client driver level
[20:28:38] <tomservo291> This is the interview I'm speaking of: http://www.zdnet.com/mongodb-cto-how-our-new-wiredtiger-storage-engine-will-earn-its-stripes-7000036047/
[20:29:23] <GothAlice> tomservo291: Ah, thanks for the link. I must needs read it thoroughly tonight. (Likely transactions would be implemented in a similar way to the existing Bulk methods, just with transactional safety. A la: http://docs.mongodb.org/manual/core/bulk-write-operations/)
[20:32:59] <GothAlice> Adding a flag to the existing Bulk objects, i.e. {transactional: true}, would be a nicely pragmatic approach.
[20:35:03] <tomservo291> From that link though, it's limited to a single collection... i would hope that WT's document-level MVCC would span multiple collections
[20:35:53] <GothAlice> Indeed; some form of db.initializeTransactionBulkOp() vs. the per-collection one right now. (Would return a new Database object scoped to the transaction.) Blue-skying is fun. :)
[20:37:41] <GothAlice> You could then nest actual BulkOp classes to perform ordered/unordered bulk operations within the scope of the transactions. That's how I'd _like_ to use it, at least! ^_^
[20:38:22] <tomservo291> Alright, thanks for the insights
[21:10:16] <blizzow> I'm trying to set up a sharded cluster with three replica set shards that each have three members, how big (CPU/RAM/disk) should my config servers be? I see mongos servers don't need much headroom, but I see no guidelines for config servers. Anyone here have suggestions?
[21:12:17] <GothAlice> blizzow: Config servers I typically co-host with the application servers. They are very light-weight.
[21:15:02] <blizzow> GothAlice: application being mongo itself or my application? I could understand hosting a mongo router with my application, but not a mongo config server.
[21:15:30] <GothAlice> "Your" application, in this scenario.
[21:16:46] <GothAlice> I also place the router on the application servers. The key is that I want information about which shard to send the query to as close to the application as possible (so that the roundtrips needed to look that up are as quick as possible.)
[21:58:28] <Mmike> How do I properly check what replSetGetStatus returns, if it fails?
[21:58:55] <Mmike> sometimes it fails with "not running with --replSet", sometimes with "replset still initializing", and so on
[22:01:22] <jsjc> what would be the best approach to use mongodb for a queue?? I have tasks that need to be done and will be processed in a distributed fashion, so I need some sort of locking mechanism… I am a newbie and some clues to guide me through would be great.
[22:03:23] <GothAlice> jsjc: Here, have a presentation I gave on that: https://gist.github.com/amcgregor/4207375 (with link to working sample codebase in the comments)
[22:04:01] <GothAlice> It demonstrates the types of records you'd need, and how to implement locking as well as actual tailing (infinite tailing!) of the notification capped collection.
[22:04:30] <GothAlice> jsjc: If you want timeouts waiting for data, you'll want to add your vote and yourself as a watcher to: https://jira.mongodb.org/browse/SERVER-15815
[22:05:07] <Mmike> seems I will need to parse the string mongo is giving me :/ no errorcode, nothing :/
[22:05:14] <jsjc> timeouts, it's a nice improvement.. I use it with redis :)
[22:05:33] <GothAlice> Mmike: Are you running that from the mongo shell, or via remote command execution?
[22:06:09] <Mmike> GothAlice, actually, via pymongo
[22:07:05] <GothAlice> Mmike: With the error bouncing around like that I suspect one of two things: either your app is handing the query occasionally to a secondary that hasn't been set up yet (if the cluster is still initializing), or you somehow have two mongod processes listening on the same port, one configured as replset, the other not.
[22:07:23] <GothAlice> (The latter is a possibility on some platforms that don't exclusively lock listening TCP ports to one process.)
[22:07:30] <Mmike> for instance, when I connect to a mongobox that's not replset'ed, I get: command SON([('replSetGetStatus', 1)]) failed: not running with --replSet
[22:08:22] <GothAlice> That's the result of capturing an OperationFailed exception, no?
[22:08:33] <GothAlice> Or OperationalFailure, or whatever it's actually called.
[22:08:39] <Mmike> so when I connect to the mongodb server and do 'replSetGetStatus' I was hoping to have a sane interface on what is going on. But I don't :/ When I call that via the mongo shell I get nicely formatted json
[22:08:59] <Mmike> correct, that's what's under OperationFailure exception
[22:09:00] <GothAlice> Mmike: Run the "command" in the mongo shell without parentheses. :)
[22:11:32] <Mmike> well, a list of all the error messages that replSetGetStatus can return would be also peachy :/
[22:11:58] <GothAlice> In your case if an OperationFailure is raised as a result of that command, you're not in a replica set. The OperationFailure will contain the exact JSON data the shell outputs.
[22:12:29] <GothAlice> Note the "ok": 0 in the shell output (in the event the RS isn't set up) — that causes pymongo to raise an exception with the value of "errmsg" as the first argument.
[22:15:44] <GothAlice> Well, this is actually pretty good. *Why* there was a failure is less important than the simple fact that whatever you were trying to do didn't work.
[22:17:22] <GothAlice> Mmike: From http://docs.mongodb.org/manual/reference/command/replSetGetStatus/#dbcmd.replSetGetStatus — not running in a replica set configuration is the only error condition for this function.
[22:18:10] <Mmike> nope, I need to know why there is a failure. If I'm not running with --replSet, then I'm an idiot and need to reconfigure mongodb
[22:19:42] <GothAlice> Ah, those can be handled using locale-independent substring searches. Search for the replSetInitiate keyword 'in' the message. (if "replSetInitiate" in str(e): …)
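A small pymongo sketch of that check, using the two error strings quoted earlier in this conversation; as noted above there is no error code to switch on, so the substring test is the workaround:

    from pymongo import MongoClient
    from pymongo.errors import OperationFailure

    client = MongoClient()

    try:
        status = client.admin.command("replSetGetStatus")
    except OperationFailure as exc:
        if "not running with --replSet" in str(exc):
            print("mongod was not started with --replSet; reconfigure it")
        elif "still initializing" in str(exc):
            print("replica set exists but is still initializing; try again shortly")
        else:
            print("replSetGetStatus failed:", exc)
    else:
        print("replica set state:", status["myState"])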
[22:20:46] <Mmike> and hope that this particular one is the only one mentioning replSetInitiate
[22:20:54] <GothAlice> Technically that member isn't actually part of the replica set yet, so the "the only error is that it's not a replica" thing remains true. ;)
[22:22:33] <Mmike> would be nice if I could call replSetInitiate in a way that it won't return until the replset is actually initiated
[22:48:19] <Synt4x`> in the mongo shell itself, I did a query db.X.find() and got something that looks like {'name of thing': [ {},{},{},{},{}...]} , how do I find the length of the list in there
[22:49:03] <GothAlice> var record = db.X.findOne(…); record.nameOfThing.length
[22:51:04] <Synt4x`> thanks GothAlice: i'll try it now
[22:51:38] <GothAlice> (use record['name of thing'] if the name actually does contain spaces. Pro tip: don't do that. ;)
[22:52:39] <Synt4x`> haha yea it doesn't, it's _ between :-p
[22:53:42] <GothAlice> Synt4x`: Note that the keys are encoded on every document; depending on your use case, the overhead of storing all those strings can add up quite a bit. Underscores just add one extra byte per… (finally an actually valid reason to prefer feedingCamelCase over underscores…)
[22:57:04] <Synt4x`> GothAlice: interesting, I had never considered that. The initial DB isn't mine; I'm working with someone else's data and they use some _ and __ as conventions, but maybe I'll re-create it internally without them
[22:58:03] <GothAlice> I strip all field names down to the first unique character. That makes every key exactly 7 bytes. (1 marker, 4 length, 1 character, 1 null.)
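A toy illustration of that key-shortening idea; the actual mapping lives in GothAlice's ODM layer, and these particular field names are made up:

    # Map long application-level field names to short stored keys and back.
    FIELD_MAP = {"username": "u", "game_id": "g", "score": "s"}
    REVERSE_MAP = {short: name for name, short in FIELD_MAP.items()}

    def to_storage(doc):
        return {FIELD_MAP.get(k, k): v for k, v in doc.items()}

    def from_storage(doc):
        return {REVERSE_MAP.get(k, k): v for k, v in doc.items()}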
[23:02:38] <Synt4x`> GothAlice: what if tables have overlapping fields? for instance I have one table game and one table player, and in each instance of player (referring to his results in a game), I include a game__id referring to the game those results came from
[23:03:05] <GothAlice> Synt4x`: Collections, not tables, and each collection is allocated its own namespace.
[23:03:16] <GothAlice> (As is each index on each collection.)
[23:03:39] <Synt4x`> Collections** sorry, I'm not sure what you mean by allocated their own namespace
[23:03:58] <blizzow> If I have a sharded cluster behind a firewall and want external hosts to have access to the cluster, I know my replica set members and routers need external IP addresses, do my config servers need external addresses as well?
[23:04:30] <jsjc> I have created a script, and locally I see in mongostat how it runs quickly with a bunch of getmore/s, but when I run it remotely from another computer it doesn't; it goes slowly, 1 getmore/s every few secs… What could be the reason??
[23:05:11] <jsjc> Might not be very efficient but I want to return the whole collection
[23:07:50] <GothAlice> jsjc: Problem: you're returning the whole collection. Locally the roundtrip delay on getmore() is essentially zero; you're effectively streaming straight out of a memory-mapped file. Over a real network, you need to consider the size of the data to transfer.
[23:08:22] <GothAlice> Synt4x`: Having the same field name between collections will have zero impact. It's fully allowed; the keys of each document in each collection are handled separately.
[23:08:50] <jsjc> Thanks GothAlice I will reduce the number of fields to see if I can speed it up.
[23:09:43] <GothAlice> jsjc: Compare this query run against a remote mongod instance: db.example.find().forEach(function(r){}) — vs — db.example.find({}, {_id: 1}).forEach(function(r){})
[23:09:56] <Boomtime> blizzow: clients connect to mongos only
[23:10:07] <GothAlice> jsjc: I suspect you'll find that while the second does take a moment to iterate the whole collection, it'll be substantially (orders of magnitude) faster than the first.
[23:22:49] <Synt4x`> GothAlice: I know it does, for instance the game collection has an '_id' for each one, player has an '_id' for each one as well (totally fine), but player has a 'game__id' in it which references which game his results were from
[23:25:23] <GothAlice> Synt4x`: So… call the reference "g" instead? I'm not seeing the issue. :|
[23:30:59] <Synt4x`> GothAlice: ah ok of course, no issues obviously, sigh silly