#mongodb logs for Saturday the 10th of November, 2012

[00:55:43] <John> Hi all. Does anyone know how I could find all geo elements within a bounding box but reduce the result to 1 in a 10 mile radius?
[01:03:58] <redir> can one do ttl expirations on a gridfs collection?
[01:06:04] <redir> looks like no
[06:05:51] <hdm> I am looking for the equivalent of multiple emit()'s [m/r] for the aggregation framework; $project isn't quite what I need, any suggestions?
[06:07:26] <hdm> an example would be a doc with fields {a,b,c}; i need to count how many a/b/c's are seen across all docs matching the query, with individual counts per field
[06:08:28] <hdm> could use $project : { a : { $cond : [ "$a", 1, 0 ] }, b : { $cond : [ "$b", 1, 0 ] }, c : { $cond : [ "$c", 1, 0 ] } } to normalize it
[06:09:22] <hdm> but then i still need to count those somehow - grouping would help with _id : { a : $a, b : $b, c : $c }, but it would return one result for every unique combination and still have to be added together in the app
[06:09:58] <hdm> if anyone has ideas let me know. the easy fix is to go back to m/r and use multiple emits
[06:21:12] <hdm> sorted it out: $group : { '_id' : 'stats', 'a' : { $sum : '$a' }}, ... }
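A minimal sketch of the full pipeline hdm describes, combining the $project/$cond normalization with the $group/$sum counting; the collection name "docs" and the $match criteria are placeholders:

    // normalize a/b/c to 1 or 0, then sum each field across all matching docs
    db.docs.aggregate([
        { $match: { /* your query criteria */ } },
        { $project: {
            a: { $cond: [ "$a", 1, 0 ] },
            b: { $cond: [ "$b", 1, 0 ] },
            c: { $cond: [ "$c", 1, 0 ] }
        } },
        { $group: {
            _id: "stats",
            a: { $sum: "$a" },
            b: { $sum: "$b" },
            c: { $sum: "$c" }
        } }
    ])

This returns a single document with per-field totals, avoiding the one-result-per-unique-combination problem mentioned above.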
[08:40:26] <jammanbo> So, vague question, but is there any chance that even a pre-release that supports a geospatial polygon type/index would become available in the foreseeable future (like a year or something), or is such a thing just not even on the radar?
[10:52:26] <navaru> Someone around? I have a quick question about TTL, expires (working with mongoose)
[11:48:00] <meonkeys> I guess not.
[11:48:52] <ron> no, you guess wrong.
[11:49:04] <ron> what you need to do is ask your question and wait. not ask to ask and wait.
[11:49:08] <ron> that's just how IRC works.
[11:49:16] <ron> :)
[11:49:43] <meonkeys> heh, true that.
[14:50:12] <bwellsnc> Hello everyone, I have been doing some research and thought I would get the opinion of the room. What is the best backup strategy for a sharded cluster? Thanks
[15:44:32] <konr_trab> How should I run https://gist.github.com/941172 in mongo? `mongo [params] < script`?
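For reference (not answered in the log): the mongo shell can run a script file either via stdin redirection, as suggested above, or by passing the file name as an argument; the database and file names below are placeholders:

    mongo localhost/mydb script.js      # run script.js against the "mydb" database
    mongo localhost/mydb < script.js    # equivalent, using stdin redirection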
[17:17:21] <konr_trab> how can I debug a mapReduce script I send to mongo? 'print' doesn't actually print anything inside a function
[17:20:22] <mids> print should be sent to your log file
[17:22:07] <konr_trab> whoa! thanks
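To illustrate mids' point: print() inside server-side map/reduce functions writes to the mongod log rather than to the shell. A small sketch, using a hypothetical "events" collection with a "type" field:

    // count documents per "type"; the print() output shows up in mongod's log
    var mapFn = function () {
        print("mapping " + this._id);      // goes to the server log, not the shell
        emit(this.type, 1);
    };
    var reduceFn = function (key, values) {
        return Array.sum(values);
    };
    db.events.mapReduce(mapFn, reduceFn, { out: { inline: 1 } });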
[18:45:42] <jtomasrl> is it correct if i repeat values of a collection inside a value of another collection?
[18:47:36] <jtomasrl> i have a collection named items with the item name and price, and inside another collection named users i have orders, which is an array of user orders that include the item name and price from the items collection and an order date. is that correct?
[18:48:08] <IAD> yep
[18:49:22] <IAD> but remember about max document size (16 mb)
[18:50:18] <jtomasrl> i think im collecting that data on a data warehouse so i'll "empty" those orders once a month or week
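A sketch of the layout being discussed, with hypothetical values: each order embeds a copy of the item's name and price plus an order date, which is fine as long as a user document stays under the 16 MB limit IAD mentions:

    // "items" collection: one document per item
    db.items.insert({ name: "widget", price: 9.99 })

    // "users" collection: each order duplicates the item name and price at order time
    db.users.insert({
        username: "jdoe",
        orders: [ { item: "widget", price: 9.99, date: new Date() } ]
    })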
[19:04:40] <RONNCC> so...i seem to be writing too fast - what should i do? I have mongodb running on my computer and am writing ~500 mb every few minutes... all the data piles up in my ram and then my swap... ram ~ 2gb, swap ~7 gb... what should i do? --> sharding?
[19:17:30] <hdm> RONNCC: get more ram, reduce indexes, get faster disk i/o, then consider sharding
[19:18:54] <hdm> a cluster of 2gb ram systems isnt going to help much, you generally want to tackle resources as a matter of ram -> disk i/o -> cpu, you can use sharding to help with any of those three, but I found it easier to just bump specs on primary systems first
[19:20:14] <hdm> at some point ram stops being a bottleneck (flushing to disk eats your disk i/o), ssds or sharding helps there, or adding more spindles and raid
[19:20:48] <hdm> check your "iostat" output, if your disk is spending all of its time doing reads, you need to reduce your indexes or move to SSDs
[19:20:59] <RONNCC> hdm: this is a corporate env. i can boost ram to 12, but that doesn't matter. on some 10 documents it ate 2 gb of ram, 7 gb of swap o-o. then disk i/o how do i do that... i'm using a corporate vm. i can ask for cores/ram/hard drive..... should i ask for hd's to mount? can i do something like raid for mongo ... would i have to get a software raid controller >_>" ... and i'm running 6 cpus at the moment. only ~2 get used by mongo w
[19:21:01] <hdm> (doing reads even though you are primarily doing writes)
[19:21:27] <hdm> you may be misreading the stats, it will map 2gb of memory, but only use a small portion of that
[19:21:50] <hdm> youre only going to see a single cpu core get used per mongod on average
[19:22:03] <hdm> cpu usually isnt the bottleneck though
[19:22:11] <hdm> check the output of mongostat and iostat
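Both tools hdm mentions are run from the command line; the 5-second interval below is just an example:

    mongostat 5    # per-server stats (inserts/sec, resident memory, faults, ...)
    iostat -xm 5   # extended per-device disk stats in MB/s (part of sysstat)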
[19:22:14] <RONNCC> hdm:.... not it uses everything ._.
[19:22:15] <RONNCC> *no
[19:22:31] <RONNCC> the ram + swap ... and then the cpus go to like 2% because no more memories
[19:22:31] <RONNCC> D:
[19:22:34] <hdm> thats because you are swapping, which triggers a system meltdown
[19:22:37] <hdm> turn off swap
[19:22:39] <RONNCC> yeah.......
[19:22:43] <RONNCC> because it ate all the ram
[19:22:44] <RONNCC> ..
[19:22:53] <RONNCC> i'll ask for 12 gb but i think it's going to eat that too...
[19:23:00] <RONNCC> so how do i flush faster?
[19:23:09] <RONNCC> .....aka is there any thing like raid that mongodb does
[19:23:13] <RONNCC> besides making me do raid?
[19:23:18] <hdm> not necessarily, check mongostat: the item you want is "res" not "vsize"
[19:23:26] <hdm> and even that isnt always whats used
[19:23:40] <hdm> no, if you are blocked on i/o, you need SSDs or more spindles
[19:24:05] <hdm> raid-0 of a bunch of small disks actually helps a lot
[19:24:23] <hdm> the bigger your index gets, the more data it has to read while doing inserts to update the index
[19:24:31] <hdm> so you need to make sure your index size is always < ram per node
[19:24:48] <hdm> you can find the index size by doing db.stats() in mongo shell
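The numbers hdm keeps referring to are all in the db.stats() output (indexSize and avgObjSize both appear later in this log); a quick check from the mongo shell, with a placeholder database name:

    // "mydb" is a placeholder database name
    var stats = db.getSiblingDB("mydb").stats();
    stats.indexSize     // total index size in bytes -- aim to keep this under available RAM
    stats.avgObjSize    // average document size in bytes, useful for capacity planning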
[19:25:27] <hdm> ive got a single box that handles getting on 2 billion rows, but it took a while to find the right balance
[19:26:15] <RONNCC> uhhh....what?
[19:26:27] <RONNCC> ok mongostat... do you mean db.collection.stats()?
[19:26:37] <hdm> sure, that works too
[19:26:53] <hdm> mongostat is a command line tool that monitors mongodb
[19:27:00] <hdm> it will tell you how much ram is actually being used and for what
[19:27:08] <RONNCC> i see
[19:27:10] <hdm> db.stats() will show you how much ram it actually needs for the index of that db
[19:27:23] <hdm> the 'res' value of mongostat may be bigger than db.stats() index size and thats fine
[19:27:26] <RONNCC> uhhh under which column?*/row
[19:27:28] <hdm> only the index size is really mandatory
[19:27:31] <RONNCC> oh also how do i get a cloak?
[19:27:43] <RONNCC> i need one when i use webchat from work
[19:27:49] <hdm> got me
[19:27:54] <RONNCC> hmm
[19:28:40] <hdm> example of mongostat output: 0 0 0 0 0 1 0 1432g 2866g 65.6g 0 local:0.0% 0 0|0 0|0 62b 3k 1 13:25:2
[19:28:55] <hdm> only 65.6 gb is in ram on this system, but it has almost 3T mapped
[19:29:00] <RONNCC> oh ahh
[19:29:11] <RONNCC> i was thinking db.coll.stats() again... what's the point of that
[19:29:28] <hdm> > db.stats() shows -> "indexSize" : 12237338064,
[19:29:45] <hdm> thats how much ram i would need if i queried all of the indexes at once
[19:29:52] <hdm> before it would have to dig into the disk again
[19:29:52] <RONNCC> ahh......
[19:30:06] <RONNCC> so how should i organize my data
[19:30:11] <RONNCC> it's a time series
[19:30:19] <hdm> thats only ~12Gb even though res is 65.6Gb
[19:30:23] <RONNCC> so it's of the form {time:{IP: number ... etc}
[19:30:32] <hdm> i could get by with ~20Gb of ram even if it wasnt pleasant
[19:30:34] <RONNCC> the IP and internal data _MAY_ be a repeat with a different time
[19:31:11] <hdm> RONNCC: do you want to manage a giant cluster and be able to query everything at once, or keep it usable on small systems and break things into multiple queries?
[19:31:27] <RONNCC> uhh.. i don't think i have a giant cluster?
[19:31:33] <RONNCC> i have one vm that computes and has mongodb running atm
[19:31:33] <RONNCC> ...
[19:31:37] <hdm> well, you could go buy one, but its an option
[19:31:50] <RONNCC> uhhh i don't think my company would pay for a giant cluster xD
[19:31:53] <hdm> so how much ram / individual disks can you spend on this project?
[19:32:01] <RONNCC> uhhhmmm i have no clue
[19:32:03] <hdm> 12G / 1 disk?
[19:32:07] <RONNCC> i'll ask for their limits
[19:32:11] <RONNCC> it's a vm
[19:32:13] <RONNCC> so i have
[19:32:17] <RONNCC> a 20 gb mounted on /txt
[19:32:23] <RONNCC> and a 1 tb mounted on /data
[19:32:34] <RONNCC> uhhh
[19:32:39] <RONNCC> uhh /test not /txt
[19:32:40] <hdm> you need to figure out what the actual hw is, IOPS, etc
[19:33:23] <hdm> lets just say you dont have a lot of resources then; you need to make sure your indexes stay small, so they fit into ram, and that your limited disk i/o doesnt drag you down too much
[19:33:35] <hdm> how many records are you looking at per month?
[19:34:10] <hdm> how many entries per day, average size of the entry, etc
[19:34:26] <hdm> you can see average size in db.stats() -> "avgObjSize" : 1200.4287256402074,
[19:35:15] <RONNCC> uhhhmmm
[19:35:15] <RONNCC> so
[19:35:21] <RONNCC> i don't know why but 201
[19:35:49] <hdm> ok, how many of those per day / per month etc
[19:36:00] <RONNCC> so i have ~ 100 files i need to parse. they have ~ 200000 lines in each. so each entry is of the form {time: {data1: 1, data2: ...}}
[19:36:03] <hdm> and how far back do you want to look for your most common query/use case
[19:36:21] <RONNCC> this isn't really a business this is because i have a ton of data that i want to run analytics on....
[19:36:36] <RONNCC> so essentially i'm parsing a file and throwing all of the data into a db
[19:36:44] <RONNCC> so i can later run queries on it for criteria
[19:37:04] <hdm> how you query will determine how you store it, since anything not in an index is slow
[19:37:11] <hdm> but anything in an index costs ram
[19:37:32] <RONNCC> uhhm... i want to query all fields
[19:37:34] <hdm> you can break the data into multiple databases or collections, but then you have to query each one individually
[19:38:01] <hdm> you can, but it will be slow, so figure out what fields are most important, and what records need to be in the same db or collection for your use case
[19:38:08] <RONNCC> this is an example of the entries
[19:38:09] <RONNCC> https://reputation.alienvault.com/reputation.data
[19:38:11] <hdm> for example, break it into a new collection based on month
[19:38:25] <RONNCC> there's 200000 lines
[19:38:29] <RONNCC> so i split that into
[19:38:33] <hdm> how much time do those files represent?
[19:39:22] <hdm> is that file updated daily, hourly, monthly, etc
[19:39:28] <RONNCC> {day: IP: 109.234.78.117, Type: Scanning Host ,Country: DE, Region: cityname, Coordinates: (51.0,,9.0) Misc: (11 ,4,2)}
[19:39:34] <RONNCC> uhhh... i have ~ 100 of those files
[19:39:38] <RONNCC> i want to put them all in a db ....
[19:39:43] <RONNCC> so i can query common characteristics
[19:39:50] <hdm> you said time series, where does the time come from?
[19:39:59] <hdm> yeah, i get that, youve said that a few times
[19:40:05] <RONNCC> well i download that everyday
[19:40:07] <hdm> but you need to figure out how to store it and that depends on the time interval
[19:40:09] <RONNCC> so... its what date
[19:40:14] <RONNCC> and 24 time s aday
[19:40:18] <RONNCC> *24 times per day
[19:40:25] <RONNCC> so...... it's <day>:<hour>
[19:40:28] <RONNCC> sorry uhh
[19:40:35] <RONNCC> <day>:<hour>-<minute>
[19:42:29] <hdm> ok, a month of that data would be about 60Gb (~30m records, take into account the journal overhead, etc) and if you indexed based on just _id (default), i'd guess around ~8gb of ram for the index, but that isn't a good thing to index on if you want to query it
[19:42:44] <hdm> you might want to limit it to one download a day instead
[19:42:52] <hdm> and then work on it with a new db per month
[19:43:58] <hdm> unless you index /everything/ that should fit into ~12gb of ram and generally be manageable
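A rough sketch of the per-month layout hdm is suggesting, using hypothetical database/collection/field names based on the entry format described above; each month's index stays small enough to fit in RAM, at the cost of querying months separately:

    // one database per month, e.g. "rep_2012_11" (hypothetical naming scheme)
    var monthDb = db.getSiblingDB("rep_2012_11");
    monthDb.entries.insert({
        time: ISODate("2012-11-10T19:00:00Z"),
        ip: "109.234.78.117",
        type: "Scanning Host",
        country: "DE"
    });
    // index only the fields you actually query on; this is what must stay under RAM
    monthDb.entries.ensureIndex({ time: 1 });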
[19:44:19] <RONNCC> ...... uhhh what?
[19:44:24] <hdm> im giving up :)
[19:44:26] <RONNCC> but i need all of the data ._.
[19:44:45] <RONNCC> it's weird
[19:45:10] <hdm> you wont be able to work with a full month of that data if you download it hourly, you may not be able to even load it into a single db within an hour
[19:45:17] <hdm> once the index starts to hit disk
[19:45:26] <RONNCC> the file is 18mb but when i run sys.getsizeof() recursively
[19:45:34] <RONNCC> on the dict entry it comes to like 500 mb
[19:45:36] <RONNCC> ...how is that possible?
[19:46:00] <hdm> mongo allocates files based on incremental sizes, capping at 2gb files
[19:46:18] <RONNCC> uhh.... so i have to insert a ton of entries
[19:46:19] <RONNCC> >_>
[19:46:42] <hdm> good luck, you need to do a bit more planning if you want it to work
[19:49:56] <RONNCC> ...
[20:48:45] <mattbillenstein> hi all
[20:48:57] <mattbillenstein> how do I clone a database from a secondary?
[20:49:33] <mattbillenstein> db.cloneDatabase(<otherhost>) returns an error message: query failed ciddb.system.namespaces
[20:49:42] <mattbillenstein> do I need to set a secondary read preference somehow?
[21:02:23] <joe_p> mattbillenstein: https://jira.mongodb.org/browse/SERVER-5577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
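The ticket joe_p links tracks cloneDatabase failing against secondaries. One workaround sometimes used instead (an assumption here, not something stated in the log or the ticket) is dumping from the secondary and restoring into the target; host and database names below are placeholders:

    mongodump --host secondary.example.com --db mydb --out /tmp/dump
    mongorestore --host target.example.com --db mydb /tmp/dump/mydb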
[21:30:54] <mrpro> hmm
[21:31:00] <mrpro> my windows mongod logs roll
[21:31:01] <mrpro> but linux doesnt
[21:31:04] <mrpro> what am i missing
[21:38:31] <joe_p> mrpro: http://www.mongodb.org/display/DOCS/Logging#Logging-Rotatingthelogfiles
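For reference, the page joe_p links describes the manual rotation mrpro is missing on Linux: mongod only rotates its log when asked, either via the logRotate command or by sending the process SIGUSR1 (e.g. from a cron job):

    // from the mongo shell; renames the current log file with a timestamp suffix
    db.adminCommand({ logRotate: 1 })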
[21:39:23] <mrpro> lame
[21:39:25] <mrpro> thx
[21:43:10] <silverfix> Can I do this: Collection.update({'_id':id}, {$set:{name:value}}); ?
[21:43:25] <silverfix> (in python)
[22:21:56] <kgraham> hello, quick question about mongos/configdb's. i have three configdb's, and their order in the mongos config file for our worker machines in ec2 (which are autoscaled) has been set for a while. I've recently tried to make it so that, depending upon what region the machine is launched in, it would connect to that region's config db to reduce the load. when I attempted this, I got this error ... http://pastebin.com/PQJibGm8
[22:22:02] <kgraham> any ideas, thanks much
[23:15:41] <teotwaki> Hi, quick question, I see that the C++ driver is said to use some "core MongoDB code", does this mean that the C++ driver is also licensed under AGPL?
[23:28:32] <ckd> teotwak: http://www.apache.org/licenses/LICENSE-2.0
[23:29:58] <teotwaki> ckd: yes, I understand that, but because it uses AGPL files, wouldn't that apply to the whole C++ client driver?
[23:29:59] <silverfix> OperationFailure: database error: invalid operator: $or <- any hints ?
[23:30:41] <ckd> silverfix: what is the query you're trying?
[23:31:04] <Derick> ckd: issue should be fixed now - RC2 on Monday
[23:31:09] <silverfix> ckd: in python: cursor = collection.find({'sentiment': None, 'last_access': {'$in': [{'$lt': five_min_ago}, None]}})
[23:31:25] <ckd> Derick: it's Saturday, stop working!
[23:31:28] <silverfix> ckd: http://codepad.org/3qVEUlF8
[23:31:48] <Derick> ckd: hehe, i wasn't :P just saw you active here and thought I'd give an update
[23:31:56] <ckd> Derick: FINE!
[23:32:03] <Derick> night ! :)
[23:32:11] <ckd> Derick: later :)
[23:35:25] <ckd> silverfix: does "None" actually work?
[23:36:31] <silverfix> ckd: yes
[23:37:53] <silverfix> ckd: mmm the error is replicable under mongo console
[23:38:03] <silverfix> db.labeler.find({'sentiment':null, 'last_access': {'$or':[{'$lt':new Date()},null]}})
[23:38:03] <silverfix> error: { "$err" : "invalid operator: $or", "code" : 10068 }
[23:38:03] <silverfix> >
[23:38:17] <ckd> oh, is None equivalent to null in python?
[23:38:30] <silverfix> yes but ^ it is javascript
[23:47:51] <ckd> silverfix: oops, forgot i had this open… try this: http://pastie.org/5358189
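The "invalid operator: $or" error comes from nesting $or inside the last_access value; $or is a top-level query operator. The pastie link has since expired, so the corrected query below is a reconstruction from the error, not its actual contents:

    // $or belongs at the top level of the query document
    db.labeler.find({
        sentiment: null,
        $or: [
            { last_access: { $lt: new Date() } },
            { last_access: null }
        ]
    })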