[00:32:26] <TkTech> addisonj: (Not myself, but we evaluated it for our MongoDB replicas)
[00:33:03] <addisonj> just curious how performance was for high loads, I am pretty far into going AWS, mostly for more scalable mongo, but was curious what others thought
[00:34:20] <addisonj> it seems like their higher end instances do offer some decent IO, and they have some fancy management stuff for repl sets
[00:39:05] <TkTech> addisonj: We currently run on AWS EC2 and are happy with it. Within the same region the IO bottleneck is pretty much non-existent with our fairly high update and insert load.
[00:39:33] <addisonj> yeah, thats mostly what I have heard about mongo on EC2, do you raid EBS?
[00:39:35] <TkTech> addisonj: What ends up killing us is the map reduces on 2.0.6, which for whatever reason throw "cannot yield" warnings every couple thousand record iterations
[00:40:13] <TkTech> addisonj: Yes and no, the primary runs on Amazon's new high-memory instance with 60-something GB of RAM.
[00:40:24] <TkTech> addisonj: The secondaries are used to create backups
[00:41:16] <TkTech> The improvement with the SSDs is orders of magnitude
[00:41:38] <TkTech> The instance storage is also way faster than EBS
[00:42:46] <addisonj> thats the other thing about EC2, so much flexibility in scaling up, joyent/rackspace/everyone else don't give that, but just the configuration/management overhead is making me feel different atm
[00:43:36] <TkTech> Well, most of the MongoDB cloud providers are just reselling AWS and OpenStack providers.
[01:25:12] <tomlikestorock> anyone here use mongoengine to connect to replicasets?
[05:11:28] <Hotroot1> Hello, would appreciate some help, getting really pissed off here.
[07:51:38] <stabilo> thx by the way. That does it now
[08:08:16] <_johnny> stabilo: yeah, i had a "run in" with python regex as well. besides what NodeX said, if you want to do a { field: /^something/i } (case insensitive), you can use { "field": re.compile("^something", re.IGNORECASE) }
[08:08:24] <_johnny> it's evaluated the same as in python
[08:15:57] <stabilo> _johnny: looks at least cleaner, but the re.compile() should take more time than just defining regex = '^8' (I want to match files that have a filename starting with 8)
[08:20:03] <_johnny> stabilo: how so? it is my understanding that a compiled regex obj is more efficient. at least when used multiple times. i'm not aware how pymongo translates it however
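A minimal sketch of the two query forms being discussed, assuming pymongo and made-up database/collection/field names; both end up as a BSON regex on the wire, and only the anchored, case-sensitive form can use an index:

    import re
    from pymongo import MongoClient

    coll = MongoClient()["test"]["files"]  # hypothetical collection

    # Anchored, case-sensitive prefix match: can use an index on "filename".
    with_regex_operator = coll.find({"filename": {"$regex": "^8"}})

    # Same query with a pre-compiled pattern object; pymongo serializes it
    # to an equivalent BSON regex. Adding re.IGNORECASE makes it case
    # insensitive, which generally defeats the index.
    pattern = re.compile("^8")
    with_compiled_pattern = coll.find({"filename": pattern})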
[08:42:51] <neil__g> i have a P-S-A replica set configured. When I do batch imports, the secondary's replication delay just increases and increases until it's an hour behind the master. Is there any way to force a sync or give replication priority?
[09:24:58] <NodeX> one approach is to process your collection every minute
[09:25:15] <justin_j> so you're doing about 500k per minute?
[09:25:31] <NodeX> it's made easy by the nature of an ObjectId being timestamp coordinated
[09:25:59] <NodeX> you can do find({ _id : { $gt : ObjectId("LAST_ID") } }).limit(5000)...
[09:26:15] <NodeX> I dont know because I dont run it every minute
[09:27:17] <NodeX> if I was doing it on a minute basis I would estimate traffic per minute and limit by that
[09:27:17] <justin_j> that's neat using the ObjectId
[09:27:38] <NodeX> when you start it you must use 24 zeros (an all-zero ObjectId) if you dont have a last ObjectId yet
[09:28:09] <NodeX> the good thing about doing it that way is there is already an index on _id so you dont need any more :)
[09:28:23] <justin_j> which is always a good thing
[09:29:13] <NodeX> yes, mongo's memory hogging with indexes can be annoying
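A rough sketch of the ObjectId-based batching described above, assuming pymongo and a hypothetical collection name; it relies only on the built-in _id index:

    from bson import ObjectId
    from pymongo import MongoClient

    coll = MongoClient()["test"]["events"]  # hypothetical collection

    # Start from the all-zero ObjectId if no last id has been recorded yet.
    last_id = ObjectId("0" * 24)

    while True:
        batch = list(coll.find({"_id": {"$gt": last_id}}).sort("_id", 1).limit(5000))
        if not batch:
            break
        for doc in batch:
            pass  # process each document here
        last_id = batch[-1]["_id"]  # remember where this run stopped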
[09:29:53] <justin_j> the latest issue I've got is that I've used a doc structure for storing unique user counts (and other counts) and I'm trying to allow filtering access by attributes (color=blue, item=shirt) etc
[09:30:17] <justin_j> yeah, mongo takes what it can!
[09:30:59] <NodeX> so you want an infinite way of analysing your data?
[09:31:15] <NodeX> you want to analyse it in any way you see fit at any time I mean?
[09:31:36] <justin_j> yes, the data is aggregated into a time series collection
[09:35:41] <justin_j> then, for a given day I find the doc and iterate through the fields. My idea was that I can do a string match on attributes to filter.
[09:36:43] <justin_j> Now, you can imagine that a user would have a blue shirt in the morning and then a red shirt in the afternoon so his unique count would be present in both 'u's in the doc
[09:37:21] <justin_j> if I filter wholly on ',color=blue,item=shirt,' or ',color=red,item=shirt,' then everything is fine - he's only counted once
[09:37:41] <justin_j> if I only specify ',item=shirt,' then it'll count him twice
[09:54:19] <NodeX> no because intersect would not work on different keys
[09:54:26] <justin_j> that was the intuition of my first get-around
[09:54:46] <NodeX> color=blue,item=shirt and item=shirt are not equal so the intersect would fail
[09:54:47] <_johnny> ah, okay, sorry for meddling. i didn't quite understand your "color=red,item=shirt" schema :)
[09:55:08] <justin_j> so, my first solution was to say, 'ok, he'll be counted twice if we're just looking for shirts - can we put an offset somewhere to fix that?'
[09:55:13] <NodeX> justin_j : you're gona have to somehow store the uid
[09:57:14] <justin_j> Hmmm, that's going to be a lot of uid's - potentially 50k+ per day per app
[09:59:10] <justin_j> the other thing I tried was storing and updating all combinations of attributes given
[09:59:44] <justin_j> that works but it blows up quickly, 8 attributes = 256 sets of counts
[10:35:12] <Cubud> If you don't want 2 people making an appointment for a client at the same time, but want people to be able to edit a client while someone else raises an appointment then perhaps you need a ClientAppointments collection
[10:35:56] <Cubud> If you want to be able to make simultaneous changes to the appointments for a single client then have a collection where each appointment is its own document
[10:36:08] <Cubud> rather than one document holding lots of appointments for a client
[10:36:18] <Cubud> I tend to think of it as locking granularity
[10:36:51] <Cubud> But I know nothing of MongoDB yet, these are just standard programming guidelines
[10:36:52] <NodeX> if however you are going to query that a lot it can be expensive so it may make sense to have hot data embedded
[10:52:00] <NodeX> http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields#RetrievingaSubsetofFields-RetrievingaSubrangeofArrayElements <---- will get you last 50 elements of your messages
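For reference, a small pymongo sketch of the $slice projection that page describes (collection, field, and _id values are made up); a negative count returns elements from the end of the array:

    from pymongo import MongoClient

    coll = MongoClient()["chat"]["threads"]  # hypothetical collection

    # Fetch one thread but only the last 50 entries of its "messages" array.
    doc = coll.find_one({"_id": "thread-1"}, {"messages": {"$slice": -50}})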
[11:02:53] <_johnny> Cubud: also if you do a findOne() and sort desc on obj id?
[11:03:06] <Cubud> I created 5million users "1@home.com" "2@home.com" etc
[11:03:07] <NodeX> caching is maintained by the operating system, if you insert a record it -may- still be in memory and it -may- not, if you access that document a few times it will be in MRU
[11:11:50] <_johnny> right, no, i'm sorry, i didn't say it right
[11:12:25] <_johnny> i mean, from what i can tell from the docs, indexes only work for regex queries if you start them with ^ and don't use case insensitive matching
[11:12:45] <NodeX> yeh, you must use a prefix else it wont use an index
[11:14:10] <_johnny> when using a prefix, definitely yes
[11:14:45] <NodeX> I currently use it for a 28million doc collection (UK Postcodes) with a prefix and can find (using a suggestive ajax style approach) a match in a few ms
[11:15:55] <_johnny> funny, i do something a bit similar. i have 2 million address records, where i do ajax search complete for street names. i use a street name collection as well to look up the street names first, to work around the case insensitivity
[11:16:18] <_johnny> i was thinking of auto titlecasing it, which would work for most cases, but not all :(
[11:16:38] <_johnny> so if someone types "her", it would query for ^Her
[11:16:44] <NodeX> I did "idx_field" and uppercased it, and "display_field"
[11:16:57] <NodeX> waste of space but I had loads so it didnt matter
[11:17:05] <_johnny> that's actually a really good idea
[11:17:40] <_johnny> i'll just add that to the lookup collection. it's still relatively small, and the mongodb isn't in production yet
[11:17:56] <NodeX> it will increase performance a heck of a lot
[11:18:18] <_johnny> yeah, you're absolutely right. i don't know why i hadn't thought of that
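A minimal sketch of the idx_field/display_field trick, assuming pymongo and made-up names: store an uppercased copy purely for the anchored, case-sensitive prefix match, and return the original for display:

    import re
    from pymongo import MongoClient

    coll = MongoClient()["geo"]["streets"]  # hypothetical collection
    coll.create_index("idx_name")

    # At insert time, keep the display value and an uppercased search key.
    name = "Herningvej"
    coll.insert_one({"display_name": name, "idx_name": name.upper()})

    # At query time, uppercase the user's input and do an anchored prefix
    # match, which can use the index; show the nicely cased display field.
    prefix = re.escape("her".upper())
    for doc in coll.find({"idx_name": {"$regex": "^" + prefix}}, {"display_name": 1}):
        print(doc["display_name"])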
[11:19:14] <_johnny> how long did it take you to fill those postcodes in? :) i had an excessive xml schema which i had to parse when inserting to mongo (new schema). took 5 hours for 2 million elements. that's something i'd rather not do again, lol
[11:19:14] <NodeX> I did it a couple years ago when I delved into the viability of mongo for a job board and wrote one as POC
[11:19:58] <NodeX> I have the postcode data that I have been building for years in Json that I exported from SQL, it took about 2 hours to insert and a few to index
[11:20:36] <NodeX> I recently did the same thing with the world GEO database
[11:21:02] <_johnny> while i was loading the data in, i ran a periodical query to check places near a certain point. was fun to see how the "world" got populated as it went along :p
[11:21:12] <NodeX> with a 2d index so I can get the closest "large place" (town/ city/landmark) to a given lat/long
[11:21:42] <_johnny> how do you define large place? predefined or on condition?
[11:22:00] <NodeX> but I also made it so that I could regex it
[11:22:03] <Cubud> I have to go out for an hour or so now, but here is my really simple test app source if you want to see what code is being executed that gets progressively slower
[11:22:12] <NodeX> the GEO data has a "type" defined with it
[11:22:27] <_johnny> i was thinking of running a map/reduce gimmick to find the most crowded places in the most crowded cities, and vice versa the most stranded :)
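A hedged sketch of the "closest large place" lookup, assuming pymongo, a legacy 2d index, and invented field names and type values:

    from pymongo import MongoClient, GEO2D

    coll = MongoClient()["geo"]["places"]  # hypothetical collection

    # Legacy 2d index on a [lng, lat] pair stored in "loc".
    coll.create_index([("loc", GEO2D)])

    # Nearest "large" places to a given point, filtered on a "type" field.
    nearest = coll.find(
        {"loc": {"$near": [-0.1278, 51.5074]},
         "type": {"$in": ["city", "town", "landmark"]}}
    ).limit(5)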
[12:47:57] <_johnny> phpmoadmin might be what you're looking for. personally i've used rockmongo a bit, but not extensively
[12:50:54] <s1n4> _johnny: I dont like a web based program as a mongo admin ui
[12:51:03] <stabilo> hrm. what is the best way to get a gridfile using a query object. I first thought that the collection.find() method will do it, but it returns only a cursor (over dictionaries) and I don't know how to get a gridfile from that. One solution would be to extract the exact filename from the dictionary and make a second call to the database like db.fs.files.get_version(filename). But that sucks (TWO calls to the DB for ONE file)
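One possible answer, as a sketch: newer pymongo versions expose find()/find_one() directly on gridfs.GridFS, so the metadata query and the file retrieval become a single call (the filter shown is hypothetical):

    import gridfs
    from pymongo import MongoClient

    db = MongoClient()["test"]
    fs = gridfs.GridFS(db)

    # Query fs.files and get a file-like GridOut object back in one step;
    # returns None if nothing matches the filter.
    grid_out = fs.find_one({"filename": {"$regex": "^8"}})
    if grid_out is not None:
        data = grid_out.read()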
[12:51:14] <s1n4> _johnny: I'm looking for something like pgadmin3
[12:51:49] <_johnny> ah, i thought pgadmin was web based
[13:40:36] <circlicious> so i am trying to do mapreduce for the first time
[13:40:48] <circlicious> from reduce i would like to return all fields of the documents from the collection i am working with - does that make sense?
[13:43:16] <NodeX> this is the current object (document)
[13:57:01] <circlicious> yes, actually i want on a specific item_id, but for now i am doing all, as i think adding the item_id =x then will be easy anyway
[13:57:24] <NodeX> why don't you distinct "created_at"
[13:58:03] <circlicious> distinct gives me records with distinct created_at
[13:58:21] <NodeX> then add the optional query param
[13:58:23] <circlicious> i only want records that have same created_at with another document
[14:02:30] <circlicious> 1 thing i can now do is, map reduce can give me the created_at that appear 2 or more times. i can then query once for each created_at then
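A rough sketch of that two-step approach, assuming a pymongo version that still exposes Collection.map_reduce and a hypothetical collection name: the map-reduce counts documents per created_at, then only the timestamps seen two or more times are queried again:

    from bson.code import Code
    from pymongo import MongoClient

    coll = MongoClient()["test"]["items"]  # hypothetical collection

    mapper = Code("function () { emit(this.created_at, 1); }")
    reducer = Code("function (key, values) { return Array.sum(values); }")

    counts = coll.map_reduce(mapper, reducer, "created_at_counts")

    # created_at values that occur in two or more documents...
    dup_times = [d["_id"] for d in counts.find({"value": {"$gte": 2}})]

    # ...then fetch the actual documents sharing those created_at values.
    dups = coll.find({"created_at": {"$in": dup_times}})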
[14:46:39] <circlicious> can you help me with https://gist.github.com/53039e8b3f209759d091 ?
[14:46:42] <Cubud_> _johnny: Yes, I am doing 5 million individual selects. I am trying to see how quickly I will be able to serve a DB request. For inserting it is 20,000 per second (wow)
[14:46:54] <Cubud_> But selecting gets progressively slower
[14:47:17] <Cubud_> It's as if MongoDB is sequentially scanning the rows rather than using the index, because it takes longer to select data near the end
[14:49:40] <NodeX> Cubud_ : use explain to see what it's doing
[14:51:33] <Cubud_> I will see if the library supports it
[14:53:38] <circlicious> NodeX: did you understand my problem?
[14:57:20] <estebistec> In explain-plan output, is allPlans indicating indexes that were actually used in the query or simply considered by mongo because they all have related fields?
[14:57:32] <estebistec> if there is multiple index usage with de-dup going on it isn't clear
[14:57:33] <NodeX> I would just guess that your index is spilling to disc for whatever reason
[14:58:01] <NodeX> using findOne() 1000 times is probably not going to happen in your app
[14:58:20] <NodeX> the disk will be thrashing all over the place looking for the data as no MRU / LRU will be touched
[14:59:06] <NodeX> if the whole working set were in RAM it would be alot faster
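If the library wraps pymongo, the explain check suggested above is roughly this (database, collection, and email are made-up values); the plan shows whether the query is an index lookup or a full collection scan:

    from pymongo import MongoClient

    coll = MongoClient()["test"]["users"]  # hypothetical collection

    # Ask the server how it would execute the lookup and inspect the plan.
    plan = coll.find({"email": "1@home.com"}).explain()
    print(plan)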
[15:12:28] <_johnny> NodeX: i'm about to parse all my xml again, wanting to put it in a mongoimport compliant format. i'm trying to store it as json, but even for smaller files (36kb) i catch the exception BSON representation of supplied JSON is too large. i've added \n's as it looked like mongoimport reads lines at a time, but still same garble. am i doing something completely upside down ?
[15:14:30] <circlicious> no one wants to help me ;(
[15:27:05] <Cubud_> How can I check when Mongo has finished rebuilding an index?
[15:27:53] <_johnny> d'oh, forgot to use --jsonArray, my bad :(
[15:53:28] <therealkoopa> If I want to provide users with the ability to reset their password, I'm doing it by generate reset tokens that are used. Is it better in mongo to create a separate small collection with pending reset tokens, that reference the user. Or to store the reset token information on the users collection?
[15:55:41] <Cubud> because you are going to load the user in order to reset the password anyway
[15:57:00] <therealkoopa> Cubud: I was worried about searching for the reset key if the users collection is massive.
[15:57:34] <Cubud> I get them to put in their email + reset key
[15:57:48] <Cubud> or username + key (whichever you have keyed the user table on)
[15:58:17] <Cubud> Having the key alone is a bit weak
[15:58:20] <estebistec> asking again, briefly as I can: in explain output, are all the indexes in allPlans actually being used or are they candidate plans?
[15:58:54] <Cubud> but you won't guess a key + emailaddress, especially if it has an expiry of 30 minutes
[16:04:06] <therealkoopa> Cubud: Okay, I like that. Thanks
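A hedged sketch of the email-plus-code approach, with all field names and the 30-minute window invented for illustration; the code lives on the user document, so the lookup only needs the existing email index:

    import secrets
    from datetime import datetime, timedelta
    from pymongo import MongoClient

    users = MongoClient()["app"]["users"]  # hypothetical collection

    def issue_reset_code(email):
        code = secrets.token_hex(16)
        users.update_one(
            {"email": email},
            {"$set": {"reset_code": code,
                      "reset_expires": datetime.utcnow() + timedelta(minutes=30)}})
        return code  # emailed to the user

    def check_reset_code(email, code):
        # Lookup is by email (already indexed); the code and expiry are just
        # compared on the single matching document.
        user = users.find_one({"email": email})
        return (user is not None
                and user.get("reset_code") == code
                and user.get("reset_expires", datetime.min) > datetime.utcnow())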
[16:04:12] <NodeX> estebistec : how large is your user collection?
[16:04:34] <NodeX> and do you really want an index just to search user reset key ?
[16:04:48] <estebistec> NodeX, I'm testing out my indexes with 10k docs
[16:05:00] <estebistec> we'll have several M in reality
[16:05:10] <therealkoopa> NodeX: Well right now about 3. Hopefully when we deploy, it's going to be massive :). I'm a little worried about searching by reset key. I'll probably somehow encode the email address in the key, and search on that? Um
[16:15:30] <NodeX> so sending your "sniffed" data means nothing if you can't send me what my app salted in the first place to compare it against
[16:15:57] <jjanovich> question, my server is set in EDT timezone, but looks like mongo is returning in a different timezone
[16:16:04] <jjanovich> how can I get those to sync up
[16:16:34] <Cubud> the server sends an email with some key in it; if that key is sufficient to get the user in, then it would also be sufficient to get the person that sniffed the key in, wouldn't it?
[16:18:12] <therealkoopa> Okay, so it's either store it as key/value in redis with an expiration, which is nice and easy. Or base64 encode some magic and have to deal with the timeout manually. Think they are both about the same on a 'goodness' level?
[16:19:21] <Cubud> NodeX, using the email address as my key gives me 12,500 lookups per second without any speed degradation
[16:21:09] <jjanovich> does mongodb store all datetime as UTC and I have to convert, or is there a way to force it to save in the local timezone
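BSON datetimes are stored as UTC milliseconds, so conversion to a display timezone happens client-side; a small Python sketch (the zone name is just an example, and by default pymongo hands back naive datetimes that represent UTC):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+

    # A value as returned by the driver: naive, but meaning UTC.
    utc_value = datetime(2012, 8, 15, 16, 16, 4)

    # Attach UTC and convert to the local zone for display.
    local_value = utc_value.replace(tzinfo=timezone.utc).astimezone(
        ZoneInfo("America/New_York"))
    print(local_value.isoformat())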
[16:24:01] <Cubud> NodeX: Very few faults - http://pastebin.com/PqcfvLet
[16:26:38] <Cubud> I think that seeing the high speed of selecting against a specific key is enough to convince me to use Mongo :)
[16:27:41] <therealkoopa> Cubud: You don't have to worry about storing reset tokens on the user if throwing redis in front.
[16:28:19] <therealkoopa> And it seems silly to have the reset token be an index
[16:43:44] <Cubud> therealkoopa - I suggested that you don't use an index for the reset code
[16:43:53] <Cubud> think ResetCode rather than ResetKey
[16:44:04] <Cubud> NodeX: Nearly done, 2 more mins :)
[16:44:15] <tsally> if I write a new document to a sharded collection, how do i know when a read of this document is guaranteed to succeed?
[16:44:57] <_johnny> ok, so i have a question about mongoimport. how on earth does anyone manage to import anything? i have a 20mb array file which is too large. any tools for splitting this seemingly "huge" file?
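One workaround sketch (file names are made up): rewrite the big JSON array as newline-delimited JSON, one document per line, which mongoimport accepts without --jsonArray and without hitting the single-BSON size limit:

    import json

    # Read the oversized array once, then write one document per line.
    with open("big_array.json") as src:
        docs = json.load(src)

    with open("line_delimited.json", "w") as dst:
        for doc in docs:
            dst.write(json.dumps(doc) + "\n")

    # Then: mongoimport --db test --collection addresses --file line_delimited.json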
[17:20:08] <jrdn> weird question, should i cache things like "names" in memory when querying for an _id field? we have reports and are storing object ids (b/c names change often)… since i'm assuming it's in memory anyway, is it worth it to cache them anyway to get rid of what is like 40 queries on one report page?
[17:21:19] <jrdn> i.e.) http://docs.mongodb.org/manual/faq/fundamentals/#does-mongodb-handle-caching not sure if it's even worth it to build a cache layer right now
[19:36:55] <icedstitch> I mean, this is useful, http://media.mongodb.org/zips.json . i was wondering if there were other ones
[19:44:39] <kchodorow> icedstitch: here's an example using chess games http://www.kchodorow.com/blog/2012/01/26/hacking-chess-with-the-mongodb-pipeline/
[19:45:02] <kchodorow> pretty much any data feed that gives you json will work
[19:48:28] <estebistec> wow, so with pymongo .find({'a': re.compile('^AB')}) is much faster than .find({'a': {'$regex': '^AB'}})
[21:15:03] <tomlikestorock> why would my python script that connects to a replica set hang at the end of execution and not close the connection to mongo?
[21:18:11] <Bilge> How do you make the index of an embedded document unique only within the containing document and not across the entire collection?
[21:22:36] <Bilge> In fact I don't even understand what it is that is unique about a compound index comprising a root document's id and a child document's field
[21:23:43] <Bilge> I thought making a unique compound index comprising the parent id and an embedded document's field would be a way to make that embedded document's field unique only in the context of the parent document but it does not
[21:23:49] <Bilge> There appears to be nothing unique about it at all
[21:40:49] <crudson1> Bilge: a unique index simply means that a combination of the keys specified must be unique in the collection
[21:41:41] <Bilge> So then I SHOULD NOT be able to SPECIFY that the parent ID and the embedded field is UNIQUE and yet STILL have the same embedded field with the same value in the same document!
[21:43:32] <Bilge> You can have a myriad of duplicate embedded field values regardless of the unique index constraint
[21:44:34] <Bilge> The only way to make it really unique is to create a single-field unique index for the field in the embedded document, e.g. { embedded.field : 1 } but this makes the field unique across all documents, not just the parent document to which the embedded document belongs
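A small demonstration of the behaviour Bilge describes, with invented collection and field names; the unique constraint on a multikey index is only enforced between documents, so duplicate values inside one document's array are not rejected:

    from pymongo import ASCENDING, MongoClient

    coll = MongoClient()["test"]["parents"]  # hypothetical collection
    coll.create_index([("_id", ASCENDING), ("children.name", ASCENDING)], unique=True)

    # Succeeds: the repeated "alice" lives inside a single document, and a
    # document never conflicts with its own index entries.
    coll.insert_one({"_id": 1, "children": [{"name": "alice"}, {"name": "alice"}]})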
[22:21:44] <lacker> another question: in general, are there some types of error where a client like the ruby connection object will need to be recreated after that sort of error?
[22:32:43] <tomlikestorock> In my web app (pyramid based), is it good practice to open a replicaset connection on every request, and attach it to the request object, or should I open one for the app and attach a reference to every connection? I've also noticed that the replicasetconnection class causes my web app scripts to hang because it doesn't delete itself when the script is done running? How do I prevent this?
[22:33:38] <wereHamster> tomlikestorock: or pool connections and take one that is available or wait for one to become available
[22:34:07] <wereHamster> generally it's a bad idea to open a new connection to the database for each request.
[22:34:14] <wereHamster> either pool them or open a single one
[22:35:17] <tomlikestorock> wereHamster: right, that's what I figured. Okay. Now, what's the deal with the ReplicaSetConnection preventing my scripts from finishing?
[22:35:22] <tomlikestorock> they just hang at the end :(
[22:35:51] <wereHamster> no idea. Which programming language are you using? Also, it's past midnight. I'm off.
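For the hang specifically, one thing worth trying (a guess, based on the old ReplicaSetConnection keeping background monitor threads alive): close the connection explicitly when the script finishes. Host names and the set name below are placeholders:

    from pymongo import ReplicaSetConnection  # older pymongo API

    conn = ReplicaSetConnection("host1:27017,host2:27017", replicaSet="rs0")
    try:
        db = conn["mydb"]
        # ... do the script's work here ...
    finally:
        # Shut down the connection's background monitoring so the
        # interpreter can exit instead of waiting on its threads.
        conn.close()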
[22:39:40] <Smith_> I am trying to run this example "https://github.com/mongodb/node-mongodb-native/blob/master/examples/simple.js" on Node.js v0.8.7 and getting this error "Error: TypeError: Cannot read property 'BSON' of undefined" on line 9's code. Any suggestions please?
[22:39:49] <winterpk> Can someone please tell me what the radius variable is supposed to be in this example: db.places.find({"loc" : {"$within" : {"$center" : [center, radius]}}})
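With the legacy 2d/$center syntax, the radius is expressed in the same units as the stored coordinates (degrees, if they are longitude/latitude pairs); a pymongo sketch with made-up values:

    from pymongo import MongoClient

    places = MongoClient()["test"]["places"]  # assumes a 2d index on "loc"

    center = [-73.99, 40.73]   # [longitude, latitude]
    radius = 0.5               # same units as the coordinates, i.e. degrees here

    results = places.find({"loc": {"$within": {"$center": [center, radius]}}})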