[00:03:30] <disappearedng> Hey guys, I have a 100MB JSON file. I am interested in slapping a database on top without specifying any schema and being able to search within this JSON file. Can mongo do this?
[00:06:14] <Boomtime> disappearedng: does the json file consist of an array of documents or just a single big object?
[00:07:59] <disappearedng> Boomtime: http://stackoverflow.com/questions/15171622/mongoimport-of-json-file from the looks of this I can just call mongo import right?
[00:09:12] <Boomtime> I don't know if your file is a JSON array or not
[00:09:26] <Boomtime> you have 100MB of JSON, is the entire thing an array of documents?
[00:09:40] <Boomtime> if so, then yes, you can import the file directly
[00:10:13] <Boomtime> if not, then import may import some of it, up to the point where it no longer makes sense - or it may not be able to import any of it
[00:10:49] <Boomtime> it depends on whether you have one single enormous document, or an array of smaller documents
[00:13:48] <cook01> Has anyone ever run into a situation where a basic query that includes $elemMatch returns more results than one that does not? For example:
[00:14:30] <cook01> You have documents that look like: { array : [ { _id : …
[00:26:52] <Boomtime> cook01: if you change the $nin to $in in both queries (making it a positive match), you observe that the number of documents returned by the $elemMatch variant is less (possibly zero), right?
[00:27:42] <Boomtime> i.e. $elemMatch now does what you "expect" in that it returns fewer matches than the straight query, do you understand this?
[00:28:42] <cook01> Boomtime: Ahhh, got it. The $nin threw me for a loop
[00:28:44] <Boomtime> to state that another way: the set of matches for the $elemMatch variant is smaller than the set of matches for the "any" match variant
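A minimal sketch of the distinction Boomtime is describing, assuming a recent pymongo and an illustrative collection shaped like the paraphrased example above (names are placeholders): with $nin, the $elemMatch form asks "does *some* element fall outside the list?", while the dotted form asks "does *no* element fall inside the list?", so the $elemMatch variant matches more documents.

```python
# Hedged sketch of the $elemMatch / $nin distinction discussed above.
# Collection and field names are illustrative, not from the channel log.
from pymongo import MongoClient

coll = MongoClient().test.elemmatch_demo
coll.drop()
coll.insert_many([
    {"name": "a", "array": [{"_id": 1}, {"_id": 2}]},   # contains 1, but also something else
    {"name": "b", "array": [{"_id": 2}, {"_id": 3}]},   # contains no 1 at all
])

# "some element is not in [1]" -- matches both documents, because
# document "a" still has an element (_id: 2) outside the list.
print(coll.count_documents({"array": {"$elemMatch": {"_id": {"$nin": [1]}}}}))  # 2

# "no element is in [1]" -- matches only document "b".
print(coll.count_documents({"array._id": {"$nin": [1]}}))                       # 1
```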
[01:16:13] <disappearedng> I am getting this error for a JSON file that is 100MB in size. Any workarounds? https://gist.github.com/vicngtor/128c82d2c0e31a95d65f
[01:23:56] <joannac> objects can't be larger than 16MB
[01:24:04] <joannac> disappearedng: your objects are too large
[01:25:15] <disappearedng> joannac: so if I have a file like this: https://search-prod.weedmaps.com:9201/weedmaps/dispensary/_search?size=10 (but imagine this being a lot bigger with size=99999), how can I use mongo to import all of this?
[01:25:55] <Boomtime> disappearedng: as i said to you before, mongodb is a document store, you don't have documents, you have a "document" singular
[01:26:21] <Boomtime> even if you get it to import the result is meaningless, since you only have 1 document
[01:26:36] <Boomtime> no matter what query you would construct it either finds that 1 document or it doesn't
[01:27:04] <Boomtime> you need your data to be in some sort of document structure to be meaningful
[01:27:24] <Boomtime> right now you basically just have a very large text file
[01:28:11] <disappearedng> Boomtime: Oh ok I didn't really understand you just now
[01:29:02] <Boomtime> a json array would have a structure something like this: [ { x:1, y:1, ...}, { b:1, a:"hello", ... }, { ... } ]
[01:29:16] <Boomtime> note the opening square brackets
[01:29:27] <disappearedng> Boomtime: well I can edit this file to be like that
[01:29:37] <Boomtime> maybe, it might even be close already
[01:29:59] <Boomtime> you need to understand your data structure to start with though, you can't find meaning in noise
[01:31:01] <Boomtime> mongoimport should still work without the square brackets, so long as the documents otherwise form something that could be parsed as an array
[01:31:44] <disappearedng> yeah and there's a jsonArray option
[01:35:49] <Boomtime> disappearedng: the link you provided appears to be a single document
[01:36:02] <disappearedng> Boomtime: yeah I am editing it right now
[01:36:07] <Boomtime> it starts with a curly brace and ends with a matching curly brace
[01:36:25] <Boomtime> everything seems to match up - is your 100MB just a sequence of these?
[01:36:44] <Boomtime> if so, you should get away with just having a comma after each one
[01:37:02] <Boomtime> this will make the entire file correctly formatted as a big JSON array
[01:37:25] <Boomtime> then mongoimport will see a huge sequence of documents and import each one
[01:48:30] <disappearedng> Boomtime: basically "hits:" is a gigantic array
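Given that the file is a single Elasticsearch-style response whose "hits" holds the real records, a hedged alternative to hand-editing the file into a JSON array is to pull the array out in a few lines of Python and insert each element as its own document, which sidesteps the 16MB single-document limit mentioned earlier. File name, db/collection names, and the exact nesting are assumptions.

```python
# Hedged sketch: turn a single Elasticsearch-style response object
# (where the "hits" array holds the real records) into one MongoDB
# document per record.
import json
from pymongo import MongoClient

with open("dump.json") as fh:
    blob = json.load(fh)                      # the whole 100MB object

records = blob["hits"]["hits"]                # the "gigantic array" mentioned above (nesting assumed)

coll = MongoClient().test.dispensaries
if records:
    coll.insert_many(records, ordered=False)  # one document per array element
print(coll.count_documents({}))
```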
[02:51:54] <tully> can you add an arbiter function to a bulk insert that gets called before each document is written somehow?
[02:52:44] <tully> i.e. the boolean returned should decide whether a write attempt is made for each
[02:53:46] <tully> and whether a succeeded/failed report can be made in an async way for the bulk insert
[07:26:46] <Ceseuron> Anyone know how to unstick a "stuck" shard drain? I got a shard drain going with 4 chunks left and it's been at that number all day now.
[07:27:07] <Ceseuron> Lots of movechunk notices in the logs about an "aborted" transfer.
[08:10:40] <adsf> err is this correct date format?
[09:40:03] <Zelest> i wish there was more riceos in here :(
[11:00:18] <agend> hi - have a question about Bulk operations - if I have to make, let's say, 10k upserts - and the docs wouldn't fit in my memory - what kind of speed-up may i expect between doing upserts one by one vs using the Bulk builder?
[11:02:05] <agend> i mean - would it be 3x faster or more like 15% ?
[11:07:09] <agend> and the truth is i need much more per second - so not sure i can get that with mongodb, i mean sharding would probably help, but not sure how many nodes I would need, thinking about trying cassandra for this kind of throughput
[11:10:20] <kali> agend: i just inserted 312k docs in a database, it took 22s. on my laptop, and they are not particularly small.
[11:10:52] <kali> so i would say the performance you're looking for is achievable, depending on about a dozen factors
[11:11:27] <cofeineSunshine> including disk partition block size
[11:11:46] <agend> kali: u made inserts - i need upserts - problem is on my testing system i get 20k upserts in 5 minutes
[11:12:18] <agend> kali: and not sure what i could do to get 100k upserts per second
[11:17:19] <agend> kali: not true actually - it's 20k events, and one event creates about 76 docs to upsert - so it's about 1.5 mil upserts in 5 minutes - and I need 760k upserts in 1 second
[11:19:05] <agend> kali: kind of - i'm processing events in real time and must pre-aggregate them - because i need to read stats fast later for presentation
[11:19:55] <agend> kali: but i can't just write raw events, because later processing, let's say, one year of events would take a long time
[11:19:57] <kali> agend: I don't expect everybody to agree here, but I don't think mongodb (or any document oriented db) is a good fit for that
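For reference, a hedged sketch of the batched-upsert pattern agend is asking about, assuming pymongo 3+ (bulk_write/UpdateOne; the 2.6-era driver's equivalent is initialize_unordered_bulk_op()). The collection name, event shape, and counter layout are invented for illustration. The speed-up over one-at-a-time upserts comes mostly from fewer round trips, so the factor depends on network latency, write concern, and indexes rather than any fixed multiplier.

```python
# Hedged sketch of batched upserts for pre-aggregated counters.
# All names and the $inc layout are assumptions, not agend's actual schema.
from pymongo import MongoClient, UpdateOne

coll = MongoClient().test.stats

events = [{"player": "p1", "metric": "wins", "n": 1}]   # stand-in event stream

def explode(event):
    # hypothetical fan-out: one incoming event becomes several counter keys
    yield ("%(player)s:%(metric)s" % event, event["n"])
    yield ("%(player)s:total" % event, event["n"])

batch = []
for event in events:
    for key, delta in explode(event):
        batch.append(UpdateOne({"_id": key}, {"$inc": {"count": delta}}, upsert=True))
        if len(batch) >= 1000:
            coll.bulk_write(batch, ordered=False)   # one round trip per 1000 upserts
            batch = []
if batch:
    coll.bulk_write(batch, ordered=False)
```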
[11:44:35] <kacper_> hi, one question. I want to learn ANGULAR + NODE + MONGO + EXPRESS.. And I want to do it the proper way, so please tell me what would be better and absolutely pro: doing it with or without mean.js? Thanks!
[12:06:50] <Siyfion> kacper_: Well MEAN and Meteor aren't the same.
[12:07:20] <Siyfion> kacper_: Perhaps you'd be better asking that in #angularjs or #mongodb
[12:07:53] <Siyfion> kacper_: Argh.. my bad, wrong window! lol
[12:09:00] <Siyfion> kacper_: Personally I'd recommend learning each part of the MEAN stack individually, as you're much more likely to fully understand it all than if you use a boilerplate generator such as mean.js
[12:14:43] <mellow0324> But which particular element of the stack do you recommend starting with? I'm thinking Node is the obvious choice. But maybe Mongo?
[12:43:12] <nick__> Hi, I just heard about the "mongodb" database, could someone please guide me through installing mongodb on ubuntu? (I know I can google it, but I want to do it right.) Please help me with this
[14:13:05] <jar3k> I tried to import a CSV file (80MB) using mongoimport and I got an error: "exception: BSONObj size: 28347377 (0xF18BB001) is invalid. Size must be between 0 and 16793600 (16MB) First element: id: 181696"
[14:50:01] <scruz> aggregate question: my documents have a property (let’s call it ‘list’) that may be null/nonexistent, an empty array, or any valid subset of [1…5] with no repeated elements.
[14:52:13] <scruz> i need to *both* get a count of documents that have a nonempty value for list (reported) and those that have an empty value (missing), as well as have counts of each valid value in list. i want to know if i can do this in the same aggregation step or not
[14:53:34] <PirosB3> scruz: well you can also do total - one of the two
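A hedged sketch of one way to get both counts scruz describes, assuming a 2.6-era server (no $facet) and a recent pymongo; the field name "list" comes from the description above, everything else is a placeholder. It uses two small pipelines, one per question.

```python
# Hedged sketch: "reported vs missing" counts plus per-value counts, as two
# small aggregations (2.6-era servers have no $facet, so one pass per question).
from pymongo import MongoClient

coll = MongoClient().test.reports

# reported vs missing: does 'list' hold at least one element?
reported_vs_missing = list(coll.aggregate([
    {"$project": {"reported": {"$gt": [{"$size": {"$ifNull": ["$list", []]}}, 0]}}},
    {"$group": {"_id": "$reported", "count": {"$sum": 1}}},
]))

# per-value counts across all non-empty lists ($unwind skips empty/missing arrays)
per_value = list(coll.aggregate([
    {"$unwind": "$list"},
    {"$group": {"_id": "$list", "count": {"$sum": 1}}},
    {"$sort": {"_id": 1}},
]))
print(reported_vs_missing, per_value)
```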
[15:01:04] <Lope> If I have a replica set with one master, and one slave node that is forced to always be a slave, and sometimes the master server is off, and sometimes the slave server is off, is that okay? I just want the slave to sync its DB (like a backup) whenever they're both on and can connect to each other.
[15:13:33] <FIFOd[a]> { [MongoError: cursor killed or timed out] name: 'MongoError' } This error message hasn't been googleable. It appears 1s after my last Cursor.nextObject(function(err, msg) {}); Any ideas? (CC GothAlice)
[15:14:42] <GothAlice> FIFOd[a]: Could you pastebin your exact query?
[15:15:54] <FIFOd[a]> error is hitting on line 14.
[15:16:37] <GothAlice> … a ha. Is this deferred use of a query standard practice with your driver?
[15:17:17] <GothAlice> You construct the cursor on line 2, then use it an indeterminate amount of time later in an event callback. This is probably a very bad idea.
[15:17:38] <FIFOd[a]> Why would that be a bad idea?
[15:18:14] <cheeser> cursors timeout after 10 minutes
[15:18:25] <GothAlice> It also looks like you're trying to replicate capped collection (tailable cursor) behaviour entirely client-side. Why are you not passing in the message to migrate to the nextMessage handler instead of trying to pull it from a (likely dead) cursor?
[15:18:38] <FIFOd[a]> The error occurs after ~55 minutes
[15:19:13] <GothAlice> FIFOd[a]: Is it consistently reproducible, or what's the margin of error on that time estimate?
[15:20:44] <FIFOd[a]> GothAlice: I'm not sure I understand "replicate capped collection (tailable cursor)"
[15:21:26] <GothAlice> Then not *consistently* reproducible. Yeah, this sounds more and more like a real timeout issue. You really should either look into capped collections (I use them for async notification of processing events), which can stream-process records without timeouts like this while also allowing for efficient re-querying, catch-up, and replay, and/or restructure what you've got there to either pass in the record, or pass in the ID to .findOne().
[15:21:43] <GothAlice> (10 minutes on 60 minutes is a pretty high stddev. ;)
[15:22:39] <FIFOd[a]> GothAlice: well I can reproduce it 100% of the time :P
[15:22:56] <GothAlice> (I.e. pass the record or an efficient way to look up the record rather than having to query a long-lived cursor for it.) Constant and consistent are different things. XP
[15:23:01] <FIFOd[a]> GothAlice: also this is a 1 time migration so I'm not sure why capped collections
[15:24:20] <GothAlice> So much async hoop jumping for a one time chunk of code… T_T
[15:24:36] <FIFOd[a]> GothAlice: I wrote it in Go the first time but DBAs were not happy
[15:24:46] <GothAlice> Why not as a mongo shell js file?
[15:24:55] <GothAlice> (I.e. run the migration as close to the data as possible.)
[15:25:14] <FIFOd[a]> GothAlice: that would work but I don't know it well enough to move subdocuments around
[15:26:24] <FIFOd[a]> GothAlice: when I'm happy the previous message has been migrated successfully
[15:26:45] <FIFOd[a]> Since we are moving from just flat list of messages to a nested data structure
[15:27:05] <FIFOd[a]> So the next message may depend on the previous one
[15:27:47] <FIFOd[a]> the events are just to ensure a previous message is written to mongo
[15:28:42] <FIFOd[a]> GothAlice: I'm not sure the cursor isn't being exhausted either
[15:28:54] <GothAlice> So that's triggered somewhere deep (likely in a closure callback) in migrateAMessage, then. Hmm. Your use case has reinforced my belief that no-one should ever Async All The Things™. If you ran this in a straight for loop across the records it'd be infinitely easier to diagnose. :'(
[15:29:13] <FIFOd[a]> GothAlice: 100% yes, but the problem is MongoClient is async
[15:29:32] <FIFOd[a]> So I can't just for loop and fire off a bunch of migrateAMessage() calls
[15:30:01] <FIFOd[a]> GothAlice: yes it all happens in a closure callback based on a insert or an update
[15:33:28] <GothAlice> Alas, I don't use MongoClient. All I can suggest is restructuring your migration to make it less halting (i.e. make it multi-pass) to speed up use of the cursor. And/or make your cursor "retrying" (i.e. track the last seen ID; if there's an error, re-query from there, as $gt works on IDs) and keep going until actually exhausted, but that's just a hotfix for an otherwise undiagnosed issue.
[15:36:22] <GothAlice> Who knows, depending on if you host your data yourself or not (i.e. do you use compose.io, mongolab.com, etc?) your host may have a regular process that runs and kills *all* long-running cursors beyond a certain limit. (I do this on my own hosting, too.)
[15:36:41] <FIFOd[a]> GothAlice: I'm running it locally on my mac
[15:39:35] <GothAlice> FIFOd[a]: Not sure how much this may help you, but here's an example cursor retry setup. It's just Python and leaves the async up to MongoDB itself… https://gist.github.com/amcgregor/4773553#file-1-db-py-L22-L52
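For comparison, a minimal Python sketch of the retrying-cursor idea described above: iterate in _id order, remember the last _id seen, and re-issue the query from that point if the cursor gets reaped. Collection name and migrate() are placeholders.

```python
# Hedged sketch of a "retrying cursor": resume from the last seen _id on timeout.
from pymongo import MongoClient, ASCENDING
from pymongo.errors import CursorNotFound

coll = MongoClient().test.messages
last_id = None

def migrate(doc):
    """Placeholder for the real per-document migration."""

while True:
    query = {"_id": {"$gt": last_id}} if last_id is not None else {}
    try:
        for doc in coll.find(query).sort("_id", ASCENDING):
            migrate(doc)                 # the slow per-record work
            last_id = doc["_id"]
        break                            # cursor exhausted cleanly: done
    except CursorNotFound:
        continue                         # cursor was reaped; resume from last_id
```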
[15:40:29] <FIFOd[a]> Yeah I'm going to have to try something like that, but I'd like to understand why it's getting killed :(
[16:13:58] <FIFOd[a]> Hopefully nobody I work with is in here >.>
[16:14:00] <GothAlice> At least you aren't being forced to write Clueless. (.clu code) One day my language will rule the world, but today is not that day. XD
[16:16:22] <GothAlice> For example, that's how you define the "return" statement within Clueless itself. (It's a metaprogramming programming language.)
[16:17:19] <GothAlice> (That uses itself to define itself.)
[16:35:09] <GothAlice> Huzzah, found it back. https://gist.github.com/amcgregor/a816599dc9df860f75bd (Hello world with flow control and binary operator macro samples.) The only flow control Clueless offers in the core runtime is the nested block, "forever" loop, _procedure_ call, and exception handling. FIFOd[a]: Be very glad! XD
[17:15:22] <FIFOd[a]> GothAlice: that's the only one
[17:17:10] <FIFOd[a]> GothAlice: that message seems to only appear right before it dies, in the 100k line file it only appears 3 times
[17:21:33] <testerbit> how come when I drop a collection my database size does not get smaller?
[18:01:18] <GothAlice> testerbit: MongoDB is like a filesystem; it allocates "stripes" on disk which it fills with data, but when data is deleted (even whole collections) the database stripes simply have their data bits marked as "deleted", for re-use later. To reclaim the space, make sure you have at least as much disk space free as your dataset size + 2.1GB and run compact.
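A hedged sketch of the reclaim step, with placeholder names; note that on the MMAPv1 engine compact defragments a collection in place, while repairDatabase is the heavier operation that actually rewrites the data files (and needs the free disk space mentioned above).

```python
# Hedged sketch: reclaiming space after deletes/drops, as described above.
# Database and collection names are placeholders.
from pymongo import MongoClient

db = MongoClient().test
db.command("compact", "events")      # per-collection defragmentation
# db.command("repairDatabase")       # whole-database rewrite (blocks, needs free disk space)
```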
[18:02:06] <GothAlice> FIFOd[a]: (apologies for the delay) Then I think I know your problem. The cursor *is* being idle cleaned up, and you'll need to code up retrying (like my Python example). When you perform the query, MongoDB will send you a batch of records. You then slowly iterate through that batch, but by the time you have consumed the whole batch, the query is dead.
[18:02:45] <GothAlice> FIFOd[a]: (When a batch of data from a cursor is exhausted, the driver finally asks the server for more via a getMore call. It is this call, specifically, that is failing.)
[18:07:20] <GothAlice> FIFOd[a]: My own code avoids this particular issue by streaming the absolute minimum amount of data needed into a worker queue as quickly as possible, then having workers chew on that queue separate from the originating query.
[18:07:33] <GothAlice> (The minimum generally just being the record's ID for findOne() within the worker.)
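A minimal sketch of that two-pass pattern, assuming a recent pymongo; collection name and migrate() are placeholders. The long-lived cursor only ever streams _ids, so the slow work never keeps a server-side cursor waiting.

```python
# Hedged sketch: drain only the _ids quickly, then fetch each full record with
# find_one() so no cursor has to survive the slow migration work.
from pymongo import MongoClient

coll = MongoClient().test.messages

def migrate(doc):
    """Placeholder for the real per-document migration."""

# Pass 1: cheap, fast scan -- nothing but _ids cross the wire.
ids = [doc["_id"] for doc in coll.find({}, {"_id": 1})]

# Pass 2: the slow work, one independent point query at a time.
for _id in ids:
    doc = coll.find_one({"_id": _id})
    if doc is not None:
        migrate(doc)
```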
[18:23:21] <Lope> If I have a replica set with one master, and one slave node that is forced to always be a slave, and sometimes the master server is off, and sometimes the slave server is off, is that okay? I just want the slave to sync its DB (like a backup) whenever they're both on and can connect to each other.
[18:24:57] <GothAlice> Lope: That is a valid arrangement. Be sure to mark the slave as a "hidden" replica that way queries don't get sent there.
[18:26:01] <GothAlice> Lope: Note, however, that if you shut down the secondary then start it up after a substantial delay it may be forced to perform a full synchronization with the primary which can unexpectedly saturate your network.
[18:31:07] <GothAlice> Lope: You can also save processing power and storage space by having the backup secondary not produce indexes. See: http://docs.mongodb.org/manual/reference/replica-configuration/#local.system.replset.members[n].buildIndexes
[18:33:14] <GothAlice> Lope: See http://docs.mongodb.org/manual/core/replica-set-delayed-member/ for some discussion about having a "delayed" secondary and for notes on the oplog size (which affects how long your secondary can be offline before it needs to hose your network to re-synchronize everything)
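Pulling those suggestions together, a hedged sketch of adding such a backup-only member via replSetReconfig (host names and connection details are placeholders; hidden members must have priority 0, and buildIndexes can only be set when the member is first added).

```python
# Hedged sketch: add a hidden, never-primary, index-free backup secondary.
from pymongo import MongoClient

client = MongoClient("primary.example.com", 27017)

config = client.local.system.replset.find_one()   # current replica set config
config["version"] += 1
config["members"].append({
    "_id": 2,
    "host": "backup.example.com:27017",
    "priority": 0,          # can never be elected primary
    "hidden": True,         # clients never read from it
    "buildIndexes": False,  # optional: saves CPU and disk on the backup
})
client.admin.command("replSetReconfig", config)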
[19:09:02] <FIFOd[a]> GothAlice: I see. I'm going to disable the timeout and see if that does it.
[19:57:56] <florinandrei> in db.serverStatus(), in the network section, what is the meaning of numRequests? I thought it's the total from all values under opcounters, but the numbers don't add up. What am I missing?
[20:00:11] <GothAlice> florinandrei: I think that really ought to be mentioned more explicitly in the documentation, but http://docs.mongodb.org/manual/reference/command/serverStatus/#serverStatus.network.numRequests
[20:00:48] <GothAlice> These are network-level requests (including health notifications and other ping-like behaviour, I'd suspect) not database-level requests.
[20:01:20] <GothAlice> I.e. a query might generate one query op, but several network-level requests (the initial query + each getMore)
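A quick way to see that relationship, sketched with pymongo (connection details assumed): numRequests counts network-level requests, including every getMore and other chatter, so it should come out at or above the summed opcounters.

```python
# Hedged sketch: compare network.numRequests with the summed opcounters.
from pymongo import MongoClient

status = MongoClient().admin.command("serverStatus")
op_total = sum(status["opcounters"].values())
print("opcounters total:    ", op_total)
print("network.numRequests: ", status["network"]["numRequests"])
```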
[20:14:33] <GothAlice> florinandrei: https://gist.github.com/amcgregor/d70290dfe00dfb04762c are the stats from one of my smaller clusters. (There's a rather noticeable pattern of usage, there.)
[20:16:27] <quuxman> I recently upgraded our DB from 2.4 to 2.6.5 and have noticed a considerable drop in performance in several common queries our app makes. The same query in 2.6.5 scans 10x more rows and takes 10x longer
[20:16:58] <GothAlice> quuxman: Could you pastebin the query and the resulting explain under both?
[20:17:37] <quuxman> Sure, it's a large query that contains an '$in' clause with large lists of IDs, which I'll omit
[20:17:52] <GothAlice> quuxman: Just include one as a sample, if possible.
[20:18:16] <quuxman> oh whatever, it's not like the IDs are sensitive information
[20:18:49] <GothAlice> quuxman: ObjectIds contain creation timestamp information that may be somewhat sensitive, depending on application.
[20:29:19] <florinandrei> GothAlice: so if I use MongoDB as a logging server, and I want to keep an eye on the number of records sent by the clients, numRequests is not the best metric
[20:29:42] <GothAlice> florinandrei: No. For logging, you'd care about the number of inserts.
[20:30:12] <GothAlice> (No it isn't, that is. Silly English binary questions. ;)
[20:30:31] <florinandrei> GothAlice: in your gist, it seems like numRequests is pretty close to being the sum of everything under opcounters. I don't think that's the case with my instance. Why?
[20:30:58] <GothAlice> florinandrei: I almost never have queries with large numbers of results, thus very few getMore requests.
[20:31:22] <GothAlice> florinandrei: I tune my queries to a level that satisfies my OCD. ;)
[20:34:46] <quuxman> But you can see that in 2.6 it scans 53k rows and takes 276 milliseconds, but in 2.4 it only scans 8k rows and takes 25 ms
[20:35:03] <quuxman> It also looks like 2.6 is applying several indexes, where 2.4 has a simpler approach of just using created_-1
[20:35:48] <GothAlice> florinandrei: Of course, I'm being blind and didn't notice that getmore is treated as its own op. Still, the network counter (as opposed to the op counters) covers network-level requests, which include the db-level operations as well as all of the other communication that goes on. It's a superset.
[20:36:17] <quuxman> I'm wondering if I can solve this with hints, different indexes, or maybe even downgrading back to 2.4
[20:36:42] <GothAlice> quuxman: Well, downgrading is less viable. Have you tried hinting?
[20:38:09] <GothAlice> Hmm; "dupsTested": 26967 — looks like MongoDB tried to play it smart, but the way you've structured your $or segments results in substantial duplication of effort, checking the same record several times against each part of the $or.
[20:39:23] <quuxman> You mean with "hint: {created: -1}"?
[20:44:39] <quuxman> GothAlice: oh rad, that did the trick. Back down to 27 ms. Should've thought of that
[20:45:09] <quuxman> this is clearly a case of 2.6 trying to be too smart ;-P
[20:45:18] <GothAlice> quuxman: Yeah; that query of yours is probably the hairiest set of $or's I've seen in a while. XP
[20:46:04] <quuxman> We could definitely optimize it quite a bit, but that would take some engineering, data migration, and more work. One of the things I love about MongoDB is you can throw nasty stuff at it like that and it still generally performs reasonably well
[20:46:10] <GothAlice> florinandrei: Alas, I'm at work and can't spend the time to dive into the mongod source to see *exactly* what it adds to the numRequests counter…
[20:46:58] <florinandrei> GothAlice: no problem, thanks. I'll start tracking the inserts for the clients.
[21:01:46] <quuxman> Hm, now I'm having trouble with actually using hint with pymongo. In the docs it says in version 2.7.2 it's a method of cursor, but it's missing for me (I in fact have 2.7.2)
[21:02:32] <quuxman> If I use the { '$query': q, '$hint': {'created': -1} } syntax, I get "Bson command ... failed: exception: unknown top level operator: $hint"
[21:02:43] <quuxman> but that works with my local version of MongoDB
[21:07:49] <Synt4x`> any idea why $addToSet wouldnt work here? db.hu_summary_db.insert({'_id': {'player':c, 'site':'$site', 'sport':'$sport'}, 'opponent':{ '$addToSet': {'name':'$_id.name', 'score':'$_id.score', 'opp_score':'$opp_score','entry':'$entry','winnings':'$winnings'}}})
[21:08:03] <Synt4x`> I'm seeing the following: bson.errors.InvalidDocument: key '$addToSet' must not start with '$'
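The error itself is the clue here: $addToSet is an update operator, so it cannot sit inside a document handed to insert(). A hedged sketch of the usual upsert shape, with placeholder values (the '$site'/'$_id.name' style references in the snippet above only have meaning inside an aggregation $group, not in a plain insert or update):

```python
# Hedged sketch: accumulate opponents into a per-player summary via an upsert.
# All values are placeholders.
from pymongo import MongoClient

coll = MongoClient().test.hu_summary_db
coll.update_one(
    {"_id": {"player": "some_player", "site": "some_site", "sport": "nba"}},
    {"$addToSet": {"opponent": {"name": "opp", "score": 3, "opp_score": 1}}},
    upsert=True,
)
```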
[21:18:38] <cheeser> or load() from in the shell, iirc
[21:21:51] <quuxman> Argh, this is totally baffling. When I run this against our live DB hosted by compose.io, with the same hint that greatly improves performance on my local copy, it does a full table scan and is even slower than without the hint
[22:26:45] <quuxman> I don't know how to get .explain(true) output from pymongo
[22:45:10] <quuxman> of course I was doing it wrong
[22:48:10] <quuxman> blarg, pymongo API is significantly different from the native one
[22:49:19] <quuxman> why is .hint() missing from cursor??
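A hedged sketch of both operations from pymongo (field and index names are loose assumptions based on the discussion): Cursor.hint() and Cursor.explain() are the supported route, rather than a top-level '$hint' key inside the query document.

```python
# Hedged sketch: force an index and inspect the chosen plan from pymongo.
from pymongo import MongoClient, DESCENDING

coll = MongoClient().test.items
query = {"created": {"$lt": 1234567890}}   # stand-in for the real $in/$or query

cursor = coll.find(query).sort("created", DESCENDING).hint([("created", DESCENDING)])
print(cursor.explain())                    # plan actually used, with the hint applied
```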
[23:11:46] <sobel> good morning. i found exactly the docs i needed for aggregate() on the web, and then discovered i can't use it because my mongodb is 2.0.4
[23:12:09] <sobel> and i am further stumped by trouble finding any docs online for 2.0.4. can anyone help?
[23:12:28] <GothAlice> sobel: Your version of MongoDB is dangerously behind.
[23:12:50] <sobel> GothAlice: tell me about it. it's not even mine. i'll be advising an upgrade.
[23:13:00] <GothAlice> Also, aggregate queries aren't even a thing in that version.
[23:15:35] <joannac> sobel: hope for what? I'm not sure what you're looking for?
[23:28:41] <GothAlice> sobel: Not for that version. And by dangerous, I mean there are 3,851 closed issues between your version and current 2.6. ;)
[23:29:29] <GothAlice> http://bit.ly/1uxqgUk for the JIRA ticket list. (Apologies for the bit.ly on that; JIRA loves to produce URLs long enough not to be pasteable in IRC.)
[23:29:34] <quuxman> whenever I run my query from a web server, I get exception: command SON(...) exception: unknown top level operator: $hint", but when running interactively with as close as I can get to the exact same environment (same version of pymongo, same DB connection), it works
[23:34:25] <GothAlice> sobel: http://bit.ly/1zwYqsQ for just the critical / blocking issues in the 2.0.x series above yours, including a C++ client driver crash fix, segfault correction in map/reduce, un-breaking of sharding behaviour, and a correction on a journalling assertion error.
[23:38:36] <GothAlice> quuxman: That tells me the server itself isn't understanding the command. https://github.com/mongodb/mongo-python-driver/blob/master/pymongo/cursor.py#L977-L982 — you'll have to deal with Python's attribute mangling and call __query_options and __query_spec yourself. (With name mangling they'd be something like _Cursor__query_spec.)