[00:55:33] <eskatrem> Hey, how does one use mongo with python flask? Somehow pymongo doesn't work with apache (and my .wsgi is just like the one described here: http://api.mongodb.org/python/current/examples/mod_wsgi.html#pymongo-and-mod-wsgi)
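For context, a minimal sketch of what a Flask + pymongo entry point for mod_wsgi can look like; the module, database, and route names here are hypothetical, and this is not the asker's actual file:

```python
# myapp.wsgi: a minimal sketch; "mydb" and the route are hypothetical.
from flask import Flask
from pymongo import MongoClient

application = Flask(__name__)             # mod_wsgi looks up the name "application"
client = MongoClient("localhost", 27017)  # one client per WSGI process
db = client["mydb"]

@application.route("/ping")
def ping():
    # Touch the server once so a broken connection fails loudly under Apache.
    return "ok: %s" % db.command("ping")["ok"]
```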
[13:08:36] <Thijsc> Hey all, I'm building 2.8.0-rc0 and running into something. Not sure if it's a bug, something missing in the docs, or a misunderstanding on my end.
[13:08:50] <Thijsc> When I run scons install (after first running core and tools) I get this error: Source `src/mongo-tools/bsondump' not found, needed by target `/root/build/mongodb/bin/bsondump'.
[13:09:05] <Thijsc> Is there some kind of separate step that needs to be done first in 2.8?
[13:51:53] <kexmex> how do i prevent mongo from logging the contents of slow-running commands to the log file?
[13:53:54] <uberg0su> Hi everyone, I'm trying to set up a replica set connection using mgo (the golang driver), but it seems that it does not handle rs.stepDown() by default (I'm not sure if that's possible anyway); when it happens I get an EOF error message. Any advice on how to handle this is welcome.
[15:45:43] <Mmike> Hola, lads. Where can I find a list and explanation of the error messages returned by rs.add()?
[15:45:51] <Mmike> I don't see that anywhere in the docs.
[15:49:29] <Mmike> The reason I'm asking is that the 'ok' key sometimes returns '1' after a successful operation, and sometimes '0'
[16:01:40] <brennon_> I'd like to use the aggregation framework to build a list of all fields on all documents in a collection. This is pretty straightforward with map-reduce...is there an operator, etc. that lets me get at actual field names in the aggregation framework?
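For reference, the map-reduce approach brennon_ alludes to looks roughly like this; a sketch using pymongo's inline_map_reduce (present in older pymongo releases, removed in pymongo 4.x), with hypothetical database and collection names:

```python
# Sketch: collect every top-level field name in a collection via map-reduce.
# Database/collection names are hypothetical placeholders.
from pymongo import MongoClient
from bson.code import Code

coll = MongoClient()["mydb"]["things"]

map_fn = Code("function () { for (var key in this) { emit(key, null); } }")
reduce_fn = Code("function (key, values) { return null; }")

# Each distinct emitted key comes back as one result document's _id.
results = coll.inline_map_reduce(map_fn, reduce_fn)
field_names = sorted(doc["_id"] for doc in results)
print(field_names)
```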
[16:57:53] <whomp> so i'm creating an aws setup for a rails server with a mongodb backend, and my biggest issue is importing a very large dataset (26-50 million rows) every hour. any tips on what instance i should get? how important is it to separate my rails instance from my mongodb setup?
[17:01:28] <whomp> as for the queries, there's just one, which returns the nearest 50-150 rows of data for a given lat-lon pair
[17:08:53] <GothAlice> whomp: Is the previous data being nuked prior to each hourly import? (Or are those imports always adding to the set?) If no to the former, what amount of data overlap is there between imports? Is pre-aggregation (I.e. grouping locations within a degree of each-other into a single record) viable?
[17:09:18] <GothAlice> *if yes to the former — I'm terrible with boolean questions. XD
[17:09:24] <whomp> GothAlice, i load all of the new data into a new table and then switch them quickly
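A rough sketch of that load-then-switch step in pymongo terms, assuming hypothetical database/collection names and a stand-in document shape; renameCollection with dropTarget lets the server swap the staging collection over the live one in a single rename:

```python
# Sketch of the "load into a new collection, then switch" step whomp describes.
# Database/collection names and the sample documents are hypothetical.
from pymongo import MongoClient

db = MongoClient()["weather"]
staging = db["forecasts_staging"]

staging.drop()                          # start with an empty staging collection
staging.insert_many([                   # stand-in for the hourly import
    {"loc": [-73.6, 45.5], "temp": 2.5},
    {"loc": [-79.4, 43.7], "temp": 1.0},
])

# dropTarget=True replaces the live collection in one server-side rename,
# so readers switch from the old data to the new data at once.
staging.rename("forecasts", dropTarget=True)
```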
[17:12:27] <GothAlice> Is that data (to insert in bulk) being generated throughout the prior hour? (I avoid bulk load strategies like the plague due to performance reasons. You thrash IO during those imports.)
[17:13:31] <whomp> i think it all gets updated at once, on the hour
[17:17:45] <hfp_work> Hi all, I don't understand how this `mongo` command is not valid: https://gist.github.com/anonymous/31f1bc9750dad50d3224
[17:18:13] <GothAlice> hfp_work: I've never seen a mongodb username containing symbols before.
[17:18:47] <GothAlice> However… isn't user specified as: user@sub1-db.example.com/foo — i.e. in the URL portion?
[17:18:53] <hfp_work> GothAlice: even if I remove it and just keep `test`, it returns the same invalid argument error
[17:19:19] <hfp_work> Well I used this as reference http://linux.die.net/man/1/mongo
[17:19:57] <GothAlice> hfp_work: You have a very telling issue in that paste. 'mongod --help' — you were attempting to actually run mongod, not mongo. Double-check your command at the beginning of your command line.
[17:20:09] <whomp> GothAlice, so how should i set this aws shindig up?
[17:20:31] <hfp_work> GothAlice: No I'm not, I'm trying to connect to a distant mongodb that has its own running `mongod` instance
[17:20:51] <GothAlice> hfp_work: Line 1 of your paste and line 3 of your paste do not match.
[17:22:19] <hfp_work> GothAlice: I know, it doesn't make sense to me either. I'm running `mongo` so why is it even talking about `mongod`?
[17:22:39] <GothAlice> hfp_work: Run "mongo --help" — does it give you mongod help output? If so, your install is borked in a quite magical way.
[17:23:38] <hfp_work> GothAlice: wtf..? Yes it does seem to give the help for `mongod`
[17:23:50] <hfp_work> I used machomebrew to install it
[17:24:03] <bazineta> whomp I'd allocate a server instance that will allow you around 8000 IOPS to start. Memory probably isn't important for your use case. Run your import and find out just how bad things are during the background flush, and they will be, I presume, quite bad. From that, figure out how much of a server instance you'll need to get the IOPS you require. Go with the new GP SSD volumes.
[17:24:09] <GothAlice> hfp_work: I, too, use homebrew. Mine doesn't exhibit your behaviour.
[17:24:28] <hfp_work> GothAlice: I'll just wipe it clean and reinstall, see what happens
[17:25:35] <GothAlice> whomp: Note, however, that taking a step back and re-evaluating a bulk load strategy can help reduce that "quite bad" IO/locking situation. Stream load wherever possible, for the love of the penguins. ;)
[17:26:17] <whomp> GothAlice, interesting... can you point me towards some resources on stream-loading?
[17:26:30] <bazineta> whomp What GothAlice said. If you can spread that out at all, life will be much better. One massive import of that many records, you're going to need to bring some cash to the game.
[17:27:37] <GothAlice> whomp: This is one of the reasons I spent so much time asking about your data; trying to understand your use case, and how it may be changed (acceptably; i.e. if combining records is a possibility) to reduce the impact. Simply throwing hardware at it is one solution, but it's rarely a "good" solution.
[17:29:08] <GothAlice> whomp: Streaming comes down to this: instead of recording whatever event triggers a change in the data (you haven't described this part, so I need to be pretty generic) and queuing up the changes for bulk processing once an hour, update the data in the DB when the event happens. (The event may be a form submission, pushing a button, receiving updated GPS data from sensors, whatever.)
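As an illustration of that idea, a sketch in pymongo: upsert the affected record when the event fires, instead of queuing it for an hourly bulk load. The collection, field names, and document shape here are hypothetical:

```python
# Sketch of the stream-load idea: apply each update as it arrives instead
# of queuing an hourly bulk import. Names and document shape are hypothetical.
from pymongo import MongoClient

coll = MongoClient()["weather"]["forecasts"]

def on_forecast_update(station_id, lon, lat, forecast):
    # Upsert: replace the record for this position, or create it if missing.
    coll.update_one(
        {"station_id": station_id},
        {"$set": {"loc": [lon, lat], "forecast": forecast}},
        upsert=True,
    )

on_forecast_update("YUL", -73.74, 45.47, {"temp": 2.5})
```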
[17:29:19] <whomp> GothAlice, most of the new rows replace ones existing in the database. it's basically forecast data for a bunch of fixed positions, and the forecasts get updated each hour
[17:29:30] <bazineta> Yes, immutable constraints are one of EC2's favorite things; they're perfectly willing to work the problem to the last penny of your money.
[17:29:50] <GothAlice> whomp: If you still want the "hourly" approach, each hour copyDatabase() the "current" one, and apply the streamed updates to this copy. Then at the beginning of the next hour, like you currently do, just delete the old one, rename, and re-copyDatabase.
[17:30:19] <whomp> we can afford 2-3 hours of lag btw
[17:30:25] <GothAlice> whomp: Why is a JSON file being generated and then bulk loaded, instead of the application that generates the JSON inserting the records directly itself?
[17:30:57] <whomp> we get a large .csv file from the forecasting server, which is run by some government 3rd party
[17:31:16] <whomp> then i convert it with sed to json and put it in with mongoimport
[17:31:33] <GothAlice> It's especially concerning since JSON isn't a streaming format (like YAML can be), so loading such large files is also inherently expensive. Loading directly from the CSV using mongoimport may be better. (CSV can be streamed.)
[17:31:40] <whomp> this is not at all a defense of what i do, i'm just explaining :)
[17:34:55] <GothAlice> A Python script to do it would be… six lines or so? (Excluding however you choose to wrap the document literal.)
[17:35:37] <GothAlice> Or you could curl | mongoimport with --fields set (which I think lets you map to embedded documents, but I'm not sure)
[17:35:53] <GothAlice> (I.e. do the flat to nested conversion within mongoimport itself.)
[17:38:00] <bazineta> depends what you're comfortable with. Python is nice, I'd probably use node.js just as a preference, C++, whatever. Alice's piped solution would be certainly the smallest if it worked.
[17:38:42] <whomp> here's what i do currently: https://gist.github.com/michaeleisel/dd15e1da8bd659cd4f37
[17:38:48] <GothAlice> (Since this'll be IO-bound, Python is reasonable. Python is less reasonable when it comes to parallelism due to the global interpreter lock.)
[17:39:04] <bazineta> but basically the loop is: stream in the csv, munge the record, insert
[17:42:57] <GothAlice> Instead of using the "csv" Python module to parse the data, use that one. (Also use .raw instead of .text on that HTTP request!)
[17:44:11] <GothAlice> (Both requests and pygrib there are external libraries. It's recommended to isolate them using something like 'virtualenv', which is a light-weight pseudo-chroot for Python.)
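Putting those pieces together, a minimal sketch of the stream-in-the-CSV loop with requests and pymongo; the URL, column layout, batch size, and collection names are assumptions rather than whomp's actual feed:

```python
# Sketch: stream the CSV over HTTP and insert as we go, in small batches,
# instead of materialising a giant JSON file for mongoimport.
# URL, column names and collection are hypothetical placeholders.
import csv
import requests
from pymongo import MongoClient

coll = MongoClient()["weather"]["forecasts_staging"]

resp = requests.get("http://example.com/forecast.csv", stream=True)
resp.raise_for_status()

lines = (line.decode("utf-8") for line in resp.iter_lines() if line)
reader = csv.DictReader(lines)   # assumes the feed has a header row

batch = []
for row in reader:
    # "Munge the record": convert strings to numbers, build the document.
    batch.append({
        "loc": [float(row["lon"]), float(row["lat"])],
        "temp": float(row["temp"]),
    })
    if len(batch) >= 1000:
        coll.insert_many(batch)
        batch = []
if batch:
    coll.insert_many(batch)
```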
[17:50:34] <whomp> awesome, i'll check this out, ty GothAlice bazineta :)
[20:50:21] <whomp> GothAlice, should i start with a dedicated mongo instance and a dedicated rails instance?
[20:50:42] <whomp> if so, how can i guarantee that they'll communicate quickly with each other? i'm on aws btw
[20:58:37] <GothAlice> whomp: AWS (and Rackspace, and most other cloud providers) offer "internal" network interfaces in addition to the "public" network interface on each VM.
[20:58:53] <GothAlice> You can avoid bandwidth costs (within the same zone on AWS) by using the internal network.
[20:59:15] <whomp> what does that mean? there's a different ip for internal use?
[20:59:58] <whomp> and if i get two instances in the same zone, i'm guaranteed they'll have quick and cheap communication like this? is it better (faster, cheaper, etc.) to do separate instances than a single instance?
[21:00:14] <GothAlice> For example, one of my hosts has the eth0 address of 50.56.yy.xxx, and internal eth1 address of 10.181.yyy.xx. Only eth0 is "billed".
[21:01:01] <GothAlice> whomp: Pretty much, yes. There are durability reasons to separate, as well as memory cache contention and other issues if you try to colocate your app on the same "box" as your DB.
[21:04:36] <GothAlice> whomp: Oh, one last weird item. On whichever VM you set up the DB on, make sure it has at least as much swap space as RAM. This is required to allow the Linux kernel to make use of all of the available RAM, mostly for caches. Without swap the kernel will be more resistant to allocating all of it.
[21:05:05] <whomp> GothAlice, should swap be on a separate partition?
[21:05:22] <GothAlice> whomp: Generally, yes, but if you actually start diving into the swap space you have other issues. ;)
[21:05:47] <GothAlice> (I.e. any swapping at all will be kryptonite for the database.)
[21:06:08] <whomp> GothAlice, so i want swap, but i'll also try not to use it?
[21:06:14] <GothAlice> It's paradoxical that you need swap to make the most efficient use of RAM while at the same time not wanting to swap. ;)
[21:06:55] <GothAlice> Basically monitor RAM, and if "active + wired" ever grows too high (squeezing out "cache" RAM), you may need to take action.
[21:07:45] <whomp> if i switch over to a sql db, because that's what it looks like engine yard prefers, is that strategy of loading in new rows one-by-one to the server still good?
[21:08:55] <GothAlice> whomp: If you can do it in a transaction, it's even better. You can also coalesce multiple inserts into one (INSERT … VALUES …) to do them in batches.
[21:09:20] <GothAlice> (I'd recommend the upload-in-batches-within-a-transaction approach.)
[21:09:26] <whomp> GothAlice, so i would do like 100 at a time?
[21:09:36] <whomp> would it be even faster than row-by-row in mongo?
[21:09:41] <GothAlice> But until the transaction is committed, none of the data will "show up" for other clients.
[21:09:47] <GothAlice> Likely not too much faster.
[21:10:06] <GothAlice> The process a DB needs to follow to insert data will be relatively similar between them.
[21:10:39] <whomp> ok, so i would create a big query string for those 100 rows in python and then run it?
[21:11:27] <GothAlice> … no. You'd construct a compiled query, then repeatedly use it. This will greatly reduce the amount of overhead DB-side for parsing the statements. (It'll only have to do this once.)
[21:12:08] <GothAlice> *prepared statement, not "compiled query"
[21:15:36] <GothAlice> (Not to mention, also helping to secure your queries from injection attacks. Never not use prepared statements.)
[21:19:24] <GothAlice> The prepared approach would likely, for initial prototyping, require single-record inserts. You could "upgrade" it later to batch, if the measured overhead is too high.
[21:19:54] <GothAlice> (I.e. optimization without measurement is by definition premature ;)
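A sketch of the batched, parameterised insert being described here, using Python's stdlib sqlite3 purely as a stand-in for whatever SQL database gets chosen; the table and column names and the sample rows are hypothetical:

```python
# Sketch: parameterised (prepared-style) inserts, executed in batches
# inside a transaction. sqlite3 is just a stand-in SQL backend here.
import sqlite3

conn = sqlite3.connect("forecasts.db")
conn.execute("CREATE TABLE IF NOT EXISTS forecasts (lat REAL, lon REAL, temp REAL)")

rows = [(45.5, -73.6, 2.5), (43.7, -79.4, 1.0)]   # example batch of tuples

with conn:  # the with-block commits the transaction (or rolls back on error)
    # The ? placeholders keep values out of the SQL text: no injection risk,
    # and the statement is parsed once and reused for every row.
    conn.executemany("INSERT INTO forecasts (lat, lon, temp) VALUES (?, ?, ?)", rows)
```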
[21:53:09] <GothAlice> linuxhiker: MongoDB operations tend to be "atomic" in both the sense that the whole query either works or doesn't, and in the sense that MongoDB treats updates in a linear fashion; it's not generally possible for two connections to literally simultaneously modify a single document. This is what powers update-if-not-different behaviour.
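For illustration, the update-if-not-different pattern looks roughly like this in pymongo; collection and field names are hypothetical. The filter only matches when the stored value differs, and the single-document update itself is atomic:

```python
# Sketch: only touch the document when the stored forecast actually differs.
# Collection and field names are hypothetical.
from pymongo import MongoClient

coll = MongoClient()["weather"]["forecasts"]
new_forecast = {"temp": 2.5}

result = coll.update_one(
    {"station_id": "YUL", "forecast": {"$ne": new_forecast}},
    {"$set": {"forecast": new_forecast}},
)
# modified_count == 0 means the value was already current (or no such document).
print(result.modified_count)
```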
[22:21:08] <saml> how can I copy a document from one remote db to another?
[22:23:17] <Boomtime> saml: i don't think there is a way to get two instances to transfer just one specific document between each other directly, you'll need to retrieve the doc and then insert it
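A sketch of that retrieve-then-insert approach with two pymongo clients; the host names, database/collection names, and the query are placeholders:

```python
# Sketch: fetch one document from the source deployment and write it to the
# destination. Hosts, db/collection names and the query are placeholders.
from pymongo import MongoClient

src = MongoClient("mongodb://source.example.com")["app"]["items"]
dst = MongoClient("mongodb://dest.example.com")["app"]["items"]

doc = src.find_one({"slug": "example"})
if doc is not None:
    # replace_one with upsert keeps the original _id and overwrites any copy.
    dst.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```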
[23:32:50] <dlong5> http://docs.mongodb.org/manual/core/cursors/ says "Batch size will not exceed the maximum BSON document size." Is this a hard or soft limit?
[23:35:05] <Boomtime> "maximum BSON document size" (which is 16MB) is a hard limit
[23:37:36] <dlong5> K. We're using an older version of mongodb and we're getting a batch greater than the max, which causes the driver to throw an error. We'll likely have to upgrade.
[23:53:15] <dlong5> I'm using a server with a 4MB limit. I'm running a query with a 5MB total result size with documents between 100KB and 1MB. The second page I'm getting from the server is just over the 4MB size.
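One relevant knob, sketched with pymongo: the cursor's batch size can be capped explicitly so each server reply stays well under the message size limit. Whether that sidesteps this particular 4MB issue depends on the driver and server versions involved; the names below are hypothetical:

```python
# Sketch: cap documents-per-batch so a single reply stays small.
# With documents up to ~1MB each, a batch of 3 keeps every reply under ~4MB.
# Database/collection names are hypothetical.
from pymongo import MongoClient

coll = MongoClient()["app"]["bigdocs"]

for doc in coll.find({}).batch_size(3):
    print(doc["_id"])
```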