[06:05:51] <hdm> I am looking for the equivalent of multiple emit()'s [m/r] for the aggregation framework, $project isnt quite what I need, any suggestions?
[06:07:26] <hdm> example would be a doc with fields {a,b,c}, i need to count how many a/b/c's are seen across all docs matching the query, with individual counts per field
[06:08:28] <hdm> could use $project : { a : { $cond : [ "$a", 1, 0 ] }, b : { $cond : [ "$b", 1, 0 ] }, c : { $cond : [ "$c", 1, 0 ] } } to normalize it
[06:09:22] <hdm> but then i still need to count those somehow - grouping would help with _id : { a : $a, b : $b, c : $c }, but it would return one result for every unique combination and still have to be added together in the app
[06:09:58] <hdm> if anyone has ideas let me know. the easy fix is to go back to m/r and use multiple emits
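For the record, one way to get a single set of per-field counts without multiple emits is to feed that $cond normalization into a $group on a constant _id; a minimal sketch (the collection name and the $match contents are placeholders, not from the channel):

    db.docs.aggregate([
        { $match: { /* the original query */ } },
        { $project: {
            a: { $cond: [ "$a", 1, 0 ] },
            b: { $cond: [ "$b", 1, 0 ] },
            c: { $cond: [ "$c", 1, 0 ] }
        } },
        { $group: {
            _id: null,             // one bucket instead of one result per unique combination
            a: { $sum: "$a" },     // count of matching docs with a truthy "a"
            b: { $sum: "$b" },
            c: { $sum: "$c" }
        } }
    ])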
[08:40:26] <jammanbo> So, vague question, but is there any chance that even a pre-release that supports a geospatial polygon type/index would become available in the foreseeable future (like a year or something), or is such a thing just not even on the radar?
[10:52:26] <navaru> Someone around? I have a quick question about TTL, expires (working with mongoose)
[14:50:12] <bwellsnc> Hello everyone, I have been doing some research and thought I would get the opinion of the room. What is the best backup strategy for a sharded cluster? Thanks
[15:44:32] <konr_trab> How should I run https://gist.github.com/941172 in mongo? `mongo [params] < script`?
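For reference, the stock mongo shell can run a saved script either non-interactively (mongo dbname /path/to/script.js) or from inside an open session; a minimal sketch of the latter (the path is a placeholder, not taken from the gist):

    // from an already-open mongo shell; runs the file in the current shell context
    load("/path/to/script.js")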
[17:17:21] <konr_trab> how can I debug a mapReduce script I send to mongo? 'print' doesn't actually print anything inside a function
[17:20:22] <mids> print should be sent to your log file
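To illustrate that point: print()/printjson() called inside map/reduce functions runs on the server, so the output lands in mongod's log rather than in the client shell; a minimal sketch, assuming a collection named events and a field named type (both placeholders):

    var mapFn = function () {
        printjson(this);          // written to the mongod log, not the client shell
        emit(this.type, 1);
    };
    var reduceFn = function (key, values) {
        return Array.sum(values);
    };
    db.events.mapReduce(mapFn, reduceFn, { out: { inline: 1 } });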
[18:45:42] <jtomasrl> is it correct to repeat values from one collection inside a document of another collection?
[18:47:36] <jtomasrl> i have a collection named items with the item name and price, and inside another collection named users i have orders, an array of user orders that includes the item name and price from the items collection and an order date. is that correct?
[18:49:22] <IAD> but remember the max document size (16 MB)
[18:50:18] <jtomasrl> i think im collecting that data on a data warehouse so i'll "empty" those orders once a month or week
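For what it's worth, the shape being described would look roughly like this (collection and field names are illustrative); copying the item name/price into each order is ordinary denormalization, and the main hard constraint is keeping the embedded orders array under the 16 MB document limit:

    // items collection: canonical item data
    db.items.insert({ name: "widget", price: 9.99 })
    // users collection: each order embeds a copy of the item name/price plus an order date
    db.users.insert({
        username: "alice",
        orders: [ { item: "widget", price: 9.99, date: new Date() } ]
    })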
[19:04:40] <RONNCC> so...i seem to be writing too fast - what should i do? I have mongodb running on my computer and am writing ~500 mb every few minutes... all the data piles up in my ram and then my swap... ram ~ 2gb, swap ~7 gb... what should i do? --> sharding?
[19:17:30] <hdm> RONNCC: get more ram, reduce indexes, get faster disk i/o, then consider sharding
[19:18:54] <hdm> a cluster of 2gb ram systems isnt going to help much, you generally want to tackle resources as a matter of ram -> disk i/o -> cpu, you can use sharding to help with any of those three, but I found it easier to just bump specs on primary systems first
[19:20:14] <hdm> at some point ram stops being a bottleneck (flushing to disk eats your disk i/o), ssds or sharding helps there, or adding more spindles and raid
[19:20:48] <hdm> check your "iostat" output, if your disk is spending all of its time doing reads, you need to reduce your indexes or move to SSDs
[19:20:59] <RONNCC> hdm: this is a corporate env. i can boost ram to 12, but that doesn't matter. on some 10 documents it ate 2 gb of ram, 7 gb of swap o-o. then disk i/o, how do i do that... i'm using a corporate vm. i can ask for cores/ram/harddrive..... should i ask for hd's to mount? can i do something like raid for mongo ... would i have to get a software raid controller >_>" ... and i'm running 6 cpus at the moment. only ~2 get used by mongo w
[19:21:01] <hdm> (doing reads even though you are primarily doing writes)
[19:21:27] <hdm> you may be misreading the stats, it will map 2gb of memory, but only use a small portion of that
[19:21:50] <hdm> youre only going to see a single cpu core get used per mongod on average
[19:22:03] <hdm> cpu usually isnt the bottleneck though
[19:22:11] <hdm> the output of mongostat and iostat
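Typical invocations of the two tools being referred to (the 2-second interval is just a convenient default, not from the channel):

    iostat -x 2     # extended per-device stats, refreshed every 2 seconds (watch %util and await)
    mongostat       # defaults to localhost:27017; watch the faults, flushes and locked % columns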
[19:22:14] <RONNCC> hdm: .... no, it uses everything ._.
[19:30:19] <hdm> thats only ~12Gb even though res is 65.6Gb
[19:30:23] <RONNCC> so it's of the form {time: {IP: number, ... etc}}
[19:30:32] <hdm> i could get by with ~20Gb of ram even if it wasnt pleasant
[19:30:34] <RONNCC> the IP and internal data _MAY_ be a repeat with a different time
[19:31:11] <hdm> RONNCC: do you want to manage a giant cluster and be able to query everything at once, or keep it usable on small systems and break things into multiple queries?
[19:31:27] <RONNCC> uhh.. i don't think i have a giant cluster?
[19:31:33] <RONNCC> i have one vm that computes and has mongodb running atm
[19:32:40] <hdm> you need to figure out what the actual hw is, IOPS, etc
[19:33:23] <hdm> lets just say you dont have a lot of resources then; you need to make sure your indexes stay small, so they fit into ram, and that your limited disk i/o doesnt drag you down too much
[19:33:35] <hdm> how many records are you looking at per month?
[19:34:10] <hdm> how many entries per day, average size of the entry, etc
[19:34:26] <hdm> you can see average size in db.stats() -> "avgObjSize" : 1200.4287256402074,
[19:35:49] <hdm> ok, how many of those per day / per month etc
[19:36:00] <RONNCC> so i have ~ 100 files i need to parse. they have ~ 200000 lines in each. so each entry is of the form {time: {data1: 1, data2: ...}}
[19:36:03] <hdm> and how far back do you want to look for your most common query/use case
[19:36:21] <RONNCC> this isn't really a business this is because i have a ton of data that i want to run analytics on....
[19:36:36] <RONNCC> so essentially i'm parsing a file and throwing all of the data into a db
[19:36:44] <RONNCC> so i can later run queries on it for criteria
[19:37:04] <hdm> how you query will determine how you store it, since anything not in an index is slow
[19:37:11] <hdm> but anything in an index costs ram
[19:37:32] <RONNCC> uhhm... i want to query all fields
[19:37:34] <hdm> you can break the data into multiple databases or collections, but then you have to query each one individually
[19:38:01] <hdm> you can, but it will be slow, so figure out what fields are most important, and what records need to be in the same db or collection for your use case
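A minimal sketch of that tradeoff (collection and field names are placeholders, not from the channel): index only the fields the common query actually filters on, then check what those indexes cost in RAM.

    // index just the fields the common query filters on
    db.samples.ensureIndex({ time: 1, ip: 1 })
    // every index costs resident memory; keep the running total (bytes) well under RAM
    db.samples.totalIndexSize()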
[19:38:08] <RONNCC> this is an example of the entries
[19:42:29] <hdm> ok, a month of that data would be about 60Gb (~30m records, take into account the journal overhead, etc) and if you indexed based on just _id (default), i'd guess around ~8gb of ram for the index, but that isnt a good thing to index on if you want to query it
[19:42:44] <hdm> you might want to limit it to one download a day instead
[19:42:52] <hdm> and then work on it with a new db per month
[19:43:58] <hdm> unless you index /everything/ that should fit into ~12gb of ram and generally be manageable
[19:45:10] <hdm> you wont be able to work with a full month of that data if you download it hourly, you may not be able to even load it into a single db within an hour
[19:45:17] <hdm> once the index starts to hit disk
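A quick way to sanity-check the one-database-per-month sizing from the shell (database and collection names are placeholders; sizes come back in bytes):

    var m = db.getSiblingDB("metrics_2012_11");    // one database per month
    m.samples.totalIndexSize()                     // should stay comfortably under physical RAM
    m.stats().dataSize                             // total size of the stored documents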
[19:45:26] <RONNCC> the file is 18mb but when i run sys.getsizeof() recursively
[19:45:34] <RONNCC> on the dict entry it comes to like 500 mb
[22:21:56] <kgraham> hello, quick question about mongos/configdb's: i have three configdb's, and their order in the mongos config file for our worker machines in ec2 (which are autoscaled) has been set for a while. I've recently tried to make it so that, depending upon what region the machine is launched in, it would connect to that region's config db to reduce the load. when I attempted this, I got this error ... http://pastebin.com/PQJibGm8
[23:15:41] <teotwaki> Hi, quick question, I see that the C++ driver is said to use some "core MongoDB code", does this mean that the C++ driver is also licensed under AGPL?