PMXBOT Log file Viewer


#mongodb logs for Tuesday the 21st of August, 2012

[00:04:39] <crudson1> skiz: probably put in for debugging the threading that was added
[00:25:53] <addisonj> anyone here run mongodb on joyent cloud?
[00:32:04] <TkTech> addisonj: Tried it, what's up?
[00:32:26] <TkTech> addisonj: (Not myself, but we evaluated it for our MongoDB replicas)
[00:33:03] <addisonj> just curious how performance was for high loads, I am pretty far into going AWS, mostly for more scalable mongo, but was curious what others thought
[00:34:20] <addisonj> it seems like their higher end instances do offer some decent IO, and they have some fancy management stuff for repl sets
[00:39:05] <TkTech> addisonj: We currently run on AWS EC2 and are happy with it. Within the same region the IO bottleneck is pretty much non-existent with our fairly high update and insert load.
[00:39:33] <addisonj> yeah, thats mostly what I have heard about mongo on EC2, do you raid EBS?
[00:39:35] <TkTech> addisonj: What ends up killing us is the map reduces on 2.0.6, which for whatever reason chuck "cannot yield" warnings every couple thousand record iterations
[00:40:13] <TkTech> addisonj: Yes and no, the primary runs on Amazon's new High I/O instance with 60-something GB of RAM.
[00:40:24] <TkTech> addisonj: The secondaries are used to create backups
[00:40:29] <addisonj> with the SSDs?
[00:40:31] <TkTech> Aye
[00:40:34] <addisonj> nice
[00:40:52] <TkTech> Our dataset is many times our available RAM, and page faults were becoming a bottleneck
[00:40:56] <addisonj> how big is your working set?
[00:40:58] <TkTech> Prior to that, it was on an EBS RAID
[00:41:06] <TkTech> Near a TB now I believe?
[00:41:16] <TkTech> The improvement with the SSD's is orders of magnitude
[00:41:38] <TkTech> The instance storage is also way faster than EBS
[00:42:46] <addisonj> thats the other thing about EC2, so much flexibility in scaling up, joyent/rackspace/everyone else don't give that, but just the configuration/management overhead is making me feel different atm
[00:43:36] <TkTech> Well, most of the MongoDB cloud providers are just reselling AWS and OpenStack providers.
[01:25:12] <tomlikestorock> anyone here use mongoengine to connect to replicasets?
[05:11:28] <Hotroot1> Hello, would appreciate some help, getting really pissed off here.
[05:11:33] <Hotroot1> https://gist.github.com/b1da92e2bbcb7f2d944c
[05:11:41] <Hotroot1> dbpotions is a collection
[05:11:57] <Hotroot1> Using mongolian
[05:12:18] <Hotroot1> The forEach on the rows returned never goes off
[05:40:48] <Guest90591> hello there
[06:29:17] <anybody> hi guys, how do you expand "has more" in the command line?
[06:32:39] <anybody> it
[07:05:53] <stabilo> Someone currently here that can answer questions regarding pymongo?
[07:34:33] <[AD]Turbo> hola
[07:39:35] <stabilo> so. regarding pymongo... let the following statement be a query that is valid in the mongo console
[07:39:39] <stabilo> > db.fs.files.find( { filename : /^8/ } )
[07:40:07] <stabilo> this returns 11 results for my collection
[07:41:02] <stabilo> Now, how do I construct the very same query from within python? I assumed it has to be
[07:42:12] <stabilo> dbase.fs.files.find( { "filename" : "/^8/" } )
[07:42:25] <stabilo> but that returns 0 results
[07:42:59] <stabilo> whereas dbase.fs.files.find() would return 11 results
[07:43:26] <NodeX> err
[07:43:34] <NodeX> "" makes it a string
[07:43:43] <fleetfox> hey guys. { $ne : true } will return true for false and where key does not exist?
[07:43:43] <stabilo> yes
[07:43:48] <NodeX> try this .. dbase.fs.files.find( { "filename" : /^8/ } )
[07:44:07] <NodeX> fleetfox : try $exists
[07:44:29] <fleetfox> so i need $exists and bool checking?
[07:44:31] <NodeX> {key : { $exists : true } }
[07:44:45] <fleetfox> i need to check where it's false or not set
[07:44:45] <stabilo> NodeX: this gives an invalid syntax error
[07:44:55] <stabilo> pointing to the first /
[07:45:11] <NodeX> this is on the shell ?
[07:45:29] <stabilo> from within a python script
[07:46:09] <stabilo> for the mongo console it's 09:39 < stabilo> > db.fs.files.find( { filename : /^8/ } )
[07:46:33] <NodeX> fleetfox : $elemMatch
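For fleetfox's case (key is false or not set at all), spelling out both conditions with $or keeps the intent explicit; note that { key : { $ne : true } } does already match missing keys as well as false ones. A minimal pymongo sketch with invented names:

    from pymongo import MongoClient

    coll = MongoClient()["test"]["flags"]   # hypothetical collection

    # matches documents where the key is explicitly false or not present at all
    query = {"$or": [{"key": False}, {"key": {"$exists": False}}]}
    docs = list(coll.find(query))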
[07:46:44] <NodeX> right, I dont know python sorry
[07:46:59] <NodeX> perhaps try single quotes ?
[07:47:23] <stabilo> I tried various combination of quotes and double quotes already :)
[07:47:31] <NodeX> http://stackoverflow.com/questions/3483318/performing-regex-queries-with-pymongo
[07:47:33] <stabilo> combinations*
[07:47:46] <NodeX> db.collectionname.find({'files':{'$regex':'^File'}})
[07:48:04] <NodeX> google "pymongo regex" .. first answer
[07:48:07] <NodeX> :/
[07:48:11] <stabilo> -.-
[07:49:05] <stabilo> I always wonder why such fundamental things cannot be found in the docs
[07:49:18] <NodeX> that's what google is for ;)
[07:51:38] <stabilo> thx by the way. That does it now
[08:08:16] <_johnny> stabilo: yeah, i had a "run in" with python regex as well. besides what NodeX said, if you want to do a { field: /^something/i } (case insensitive), you can use { "field": re.compile("^something", re.IGNORECASE) }
[08:08:24] <_johnny> it's evaluated the same as in python
[08:15:57] <stabilo> _johnny: looks at least cleaner, but the re.compile() should take more time than just defining regex = '^8' (I want to match files that have a filename starting with 8)
[08:20:03] <_johnny> stabilo: how so? it is my understanding that a compiled regex obj is more efficient. at least when used multiple times. i'm not aware how pymongo translates it however
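For reference, a minimal pymongo sketch of the two regex approaches discussed above (connection details and names are illustrative only):

    import re
    from pymongo import MongoClient

    db = MongoClient()["test"]   # hypothetical database

    # $regex takes the pattern as a plain string, without surrounding slashes or quotes
    docs = db.fs.files.find({"filename": {"$regex": "^8"}})

    # equivalent: pass a compiled Python pattern; re.IGNORECASE plays the role of /i
    docs_ci = db.fs.files.find({"filename": re.compile("^8", re.IGNORECASE)})

    for doc in docs:
        print(doc["filename"])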
[08:42:51] <neil__g> i have a P-S-A replica set configured. When I do batch imports, the secondary's replication delay just increases and increases until it's an hour behind the master. Is there any way to force a sync or give replication priority?
[08:47:33] <Bartzy> Hello
[08:48:16] <Bartzy> I update a document and using upsert - if the document was only updated, can I get the _id without doing find again ?
[08:50:02] <justin_j> Good morning! anybody here?
[08:50:18] <Bartzy> yes
[08:51:23] <justin_j> greetings. I've got a question about mongo - do you use it in anger?
[08:52:27] <justin_j> wait, that wasn't my question - just asking if you use mongo a lot
[08:52:28] <justin_j> :-)
[08:52:56] <justin_j> looking to pick the brains of anyone using mongo for analytics
[09:13:25] <justin_j> anyone here using mongo for analytics?
[09:14:47] <NodeX> yup
[09:15:16] <justin_j> cool - are you dealing with incremental counters or unique users at all?
[09:15:49] <NodeX> I distinct everything to unique users
[09:16:05] <NodeX> then i run through each user and create a mapped timeline of what that user did
[09:16:16] <justin_j> gotcha
[09:17:02] <justin_j> that sounds fairly close to my use case although I'm doing it differently
[09:18:03] <justin_j> I keep a collection of users and track whether they've been seen today
[09:18:17] <justin_j> so I do an $inc only if they haven't been seen
[09:20:03] <NodeX> I keep a history collection and track everything that goes on everywhere
[09:20:28] <NodeX> then when I run my analytics I parse out what the users have been doing and archive my history collection
[09:20:51] <NodeX> I typically parse 500k documents at a time and then export them elsewhere in case I need the data for something else later
[09:21:48] <justin_j> interesting
[09:22:16] <justin_j> I'm processing the data as it lands attempting to place it into aggregate buckets and so on
[09:22:21] <NodeX> amazon just announced glacier too which is really cheap per GB to archive
[09:22:29] <justin_j> yes, saw that
[09:22:44] <NodeX> I initially tried it your way but my apps were too busy to cope with it
[09:23:14] <NodeX> and I realised it wouldn't scale that well, I also dont need realtime analytics
[09:23:27] <justin_j> yeah, I'm aiming for realtime
[09:24:14] <NodeX> it depends how busy your app is
[09:24:19] <justin_j> current pipeline is that I receive events from apps, queue them, and then the aggregator reads from the queue
[09:24:31] <NodeX> even apps that claim realtime analytics are a minute or so behind
[09:24:34] <justin_j> indeed
[09:24:58] <NodeX> one approach is to process your collection every minute
[09:25:15] <justin_j> so you're doing about 500k per minute?
[09:25:31] <NodeX> it's made easy by the nature of an ObjectId being timestamp coordinated
[09:25:59] <NodeX> you can do _id : { $gt : ObjectId("LAST_ID") }).limit(5000)...
[09:26:15] <NodeX> I dont know because I dont run it every minute
[09:27:17] <NodeX> if I was doing it on a minute basis I would estimate traffic per minute and limit by that
[09:27:17] <justin_j> that's neat using the ObjectId
[09:27:38] <NodeX> when you start it you must use 24x0's if you dont have an object ID
[09:28:09] <NodeX> the good thing about doing it that way is there is already an index on _id so you dont need anymore :)
[09:28:23] <justin_j> which is always a good thing
[09:29:13] <NodeX> yes, mongo's memory hogging with indexes can be annoying
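A rough pymongo sketch of the ObjectId-based batching NodeX describes above (collection name, batch size and the processing step are assumptions):

    from bson import ObjectId
    from pymongo import MongoClient

    coll = MongoClient()["analytics"]["history"]   # hypothetical collection

    # start from 24 zeros when there is no checkpoint yet, as suggested above
    last_id = ObjectId("0" * 24)

    while True:
        batch = list(coll.find({"_id": {"$gt": last_id}}).sort("_id", 1).limit(5000))
        if not batch:
            break
        for doc in batch:
            pass                       # hypothetical per-document processing
        last_id = batch[-1]["_id"]     # _id is already indexed, so no extra index needed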
[09:29:53] <justin_j> the latest issue I've got is that I've used a doc structure for storing unique user counts (and other counts) and I'm trying to allow filtering access by attributes (color=blue, item=shirt) etc
[09:30:17] <justin_j> yeah, mongo takes what it can!
[09:30:59] <NodeX> so you want an infinite way of analysing your data?
[09:31:15] <NodeX> you want to analyse it in anyway you see fit at any time I mean?
[09:31:36] <justin_j> yes, the data is aggregated into a time series collection
[09:31:40] <justin_j> with a doc for each day
[09:31:58] <justin_j> and then structured to contain unique counts and so on
[09:32:04] <justin_j> that's all working fine
[09:32:26] <justin_j> allowing filtering on attributes is turning out to be difficult - I thought I had a working solution, but it's flawed
[09:32:34] <NodeX> what;s the flaw?
[09:32:50] <justin_j> it doesn't work :-)
[09:32:55] <justin_j> see if I can explain,
[09:34:21] <justin_j> so, I store something like: { ',color=blue,item=shirt,': { c: 645434, u: 1232 }, 'color=red,item=shirt': {c: 64324, u: 343 } }
[09:34:49] <justin_j> where 'c' is raw count and 'u' is uniques.
[09:34:57] <justin_j> in one doc
[09:35:41] <justin_j> then, for a given day I find the doc and iterate through the fields. My idea was that I can do a string match on attributes to filter.
[09:36:43] <justin_j> Now, you can imagine that a user would have a blue shirt in the morning and then a red shirt in the afternoon so his unique count would be present in both 'u's in the doc
[09:37:21] <justin_j> if I filter wholly on ',color=blue,item=shirt,' or ',color=red,item=shirt,' then everything is fine - he's only counted once
[09:37:41] <justin_j> if I only specify ',item=shirt,' then it'll count him twice
[09:38:20] <justin_j> that's the problem
[09:38:39] <NodeX> out of interest why dont you store the item and colr in the attribs (with the c and u)?
[09:38:43] <NodeX> color *
[09:41:08] <justin_j> could do, so something like { c: 43235, u: 5432, attiribs: {color: 'blue', item='shirt'} }
[09:41:17] <justin_j> it would need to be a list of those I suppose
[09:41:41] <justin_j> still left with the double counting problem though
[09:41:48] <NodeX> I was thinking next to the c and ui
[09:41:50] <NodeX> u *
[09:42:13] <justin_j> that would prevent 'c' and 'u' being used as attributes - those attributes are provided by the client
[09:42:20] <NodeX> c : 1234, u: 1234, color: 'blue', item : 'shirt'
[09:42:46] <justin_j> in an idea world it would be cleaner
[09:42:50] <justin_j> *ideal
[09:43:17] <NodeX> your update would look like .. {u:UID, color:'blue', item:'shirt'} {$inc : {c :1}}
[09:44:38] <justin_j> apologies, my 'u' is the unique count, not the UID
[09:44:46] <NodeX> what's the "c" ?
[09:45:03] <justin_j> just the running number of events that have those attributes
[09:45:15] <justin_j> and 'u' is the number of unique users that have those attributes
[09:45:30] <justin_j> the doc is for a single day
[09:45:37] <justin_j> so, it's aggregated
[09:45:48] <NodeX> so you (currently) query on color=COLOR,item=ITEM and loop the results?
[09:46:14] <justin_j> that's the idea
[09:46:47] <justin_j> for a given day, I get the doc and iterate through the fields matching on color and/or item, summing the 'u' and 'c' fields
[09:47:06] <justin_j> that yields the totals for the day, raw count and number of unique users
[09:48:02] <NodeX> sorry my brain isn't working yet
[09:48:17] <NodeX> and you want the total uniques for say color=red,item=shirt for that day ?
[09:48:58] <justin_j> yes
[09:49:33] <justin_j> so for: { ',color=blue,item=shirt,': { c: 645434, u: 1232 }, 'color=red,item=shirt': {c: 64324, u: 343 } }
[09:50:11] <NodeX> in that case I would distinct the color=.. part and add the totals ( I guess that's what you're doing)
[09:50:26] <justin_j> if I specify I want the unique users for blue shirts I just iterate through looking for attributes color=blue, item=shirt
[09:50:58] <justin_j> yes, that's what I'm doing but there's an issue
[09:51:16] <NodeX> if you store the color and the item as I suggested you can query in any fashoin you like
[09:51:21] <NodeX> fashion *
[09:51:32] <justin_j> a user may, in one day, have a blue shirt *and* a red shirt
[09:52:02] <justin_j> which means they are present in both 'u' counts - if I just look for 'item=shirt' then it will count them twice
[09:52:04] <NodeX> unless you store the UID with a view/insert you're never going to get round that
[09:52:39] <justin_j> yeah, that's the music I'm not wanting to hear :-)
[09:52:49] <NodeX> :/
[09:53:25] <NodeX> in your iteration do you know what users have been counted already at any point?
[09:53:50] <_johnny> i haven't read your entire discussion, but would union - intersect not give you the right number?
[09:54:11] <justin_j> hi johnny
[09:54:19] <NodeX> no because intersect would not work on different keys
[09:54:26] <justin_j> that was the intuition of my first get-around
[09:54:46] <NodeX> color=blue,item=shirt and item=shirt are not equal so the intersect would fail
[09:54:47] <_johnny> ah, okay, sorry for meddling. i didn't quite understand your "color=red,item=shirt" schema :)
[09:55:08] <justin_j> so, my first solution was to say, 'ok, he'll be counted twice if we're just looking for shirts - can we put an offset somewhere to fix that?'
[09:55:13] <NodeX> justin_j : you're gona have to somehow store the uid
[09:57:14] <justin_j> Hmmm, that's going to be a lot of uid's - potentially 50k+ per day per app
[09:59:10] <justin_j> the other thing I tried was storing and updating all combinations of attributes given
[09:59:44] <justin_j> that works but it blows up quickly, 8 attributes = 256 sets of counts
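One way to act on NodeX's point about storing the uid, sketched in pymongo (all names invented): keep the set of user ids per attribute bucket per day with $addToSet and union the sets at query time, so a user seen under several attribute combinations still counts once. The cost is exactly the one justin_j worries about: the uid arrays grow with the number of distinct users per day.

    import datetime
    from pymongo import MongoClient

    coll = MongoClient()["analytics"]["daily"]     # hypothetical collection
    today = datetime.date.today().isoformat()

    def record_event(uid, color, item):
        # one doc per (day, color, item); $addToSet stores each uid at most once per bucket
        coll.update_one(
            {"day": today, "color": color, "item": item},
            {"$inc": {"c": 1}, "$addToSet": {"uids": uid}},
            upsert=True,
        )

    def unique_users(day, **attrs):
        # union the per-bucket uid sets so a user seen in several buckets counts once
        users = set()
        for doc in coll.find({"day": day, **attrs}, {"uids": 1}):
            users.update(doc.get("uids", []))
        return len(users)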
[10:06:50] <SpanishMagician> hi all
[10:06:57] <justin_j> hi
[10:07:19] <SpanishMagician> new to mongodb, can someone lend a hand? :)
[10:07:31] <justin_j> shoot
[10:07:40] <SpanishMagician> ok, thanks
[10:08:14] <SpanishMagician> I need a database, there are users, users have clients and you make call and events for those clients
[10:08:28] <SpanishMagician> what would be an optimal approach?
[10:09:28] <justin_j> what are the events?
[10:10:03] <SpanishMagician> not programming events, event as in a show, a presentation, a meeting...
[10:10:17] <SpanishMagician> sorry for the confusion
[10:11:07] <justin_j> have a collection for users, one for clients and one for events
[10:11:23] <SpanishMagician> and one for calls?
[10:11:30] <justin_j> sure
[10:11:38] <SpanishMagician> then use references, right?
[10:11:53] <justin_j> you can do
[10:12:04] <justin_j> the user doc would have a list of clients
[10:12:10] <justin_j> the client a list of events etc
[10:12:22] <Cubud> 200,000 record inserts per second? WOW!
[10:12:52] <SpanishMagician> better than having the calls and events in the client collection?
[10:13:27] <justin_j> it depends upon how many calls and events you can have per client - you could do it like that
[10:14:07] <SpanishMagician> so what is better separate collections or one collection for clients calls and events?
[10:15:19] <NodeX> depends how frequent the data is accessed
[10:15:38] <SpanishMagician> its for a CRM
[10:16:44] <NodeX> great, doesn't tell me how frequent it's accessed though ;)
[10:16:52] <SpanishMagician> hahaha
[10:17:16] <SpanishMagician> well, I don't think there will be too many users and information, it's a small app
[10:17:24] <SpanishMagician> not intended for too many people
[10:17:40] <NodeX> probably be okay with a collection for each then
[10:17:41] <SpanishMagician> yet if it grows I don't want performance issues
[10:18:52] <NodeX> you can store them as embedded if you're worried about performance
[10:19:24] <justin_j> seriously though, I don't think performance is going to be an issue :-)
[10:20:10] <SpanishMagician> are there online examples of different databases?
[10:20:41] <NodeX> schema is left up to youi
[10:20:43] <NodeX> you *
[10:20:50] <NodeX> I have written a few CRM's using mongo
[10:21:06] <NodeX> it's more than capable of it
[10:21:22] <SpanishMagician> and how did you organize the databases?
[10:21:30] <NodeX> the collections?
[10:22:12] <NodeX> collections -> (tables)
[10:22:55] <SpanishMagician> well, i mean the database for the crm and its collections
[10:23:08] <NodeX> I dont understand, what does "organise" mean?
[10:23:29] <NodeX> do you mean what collections do I use and why?
[10:24:00] <SpanishMagician> yes
[10:24:23] <NodeX> it depends on who I am building the CRM for
[10:24:28] <NodeX> and what they want
[10:24:40] <NodeX> .. how long is a piece of string!
[10:25:02] <SpanishMagician> let's assume something simple, users, users have clients and make calls and appointments to those clients
[10:25:44] <NodeX> I would have a users collection and a clients collection and an appointments collection
[10:26:37] <SpanishMagician> what about calls?
[10:26:49] <NodeX> what about them?
[10:26:59] <SpanishMagician> a collection for calls?
[10:27:04] <NodeX> define calls?
[10:27:15] <SpanishMagician> by calls I mean users call their clients to make appointments
[10:27:16] <NodeX> does my database "call" the client on the phone?
[10:27:29] <NodeX> no my user does
[10:27:29] <Cubud> I would have one for each entity that can be referenced directly
[10:27:48] <SpanishMagician> no, user calls client and write the results and maybe make an appointment
[10:27:49] <Cubud> PurchaseOrder=Yes ; PurchaseOrderLine=No
[10:28:49] <NodeX> ok so why the need for a calls collection then?
[10:29:38] <SpanishMagician> note the result of the call
[10:29:48] <NodeX> isn't that an appointment ?
[10:30:20] <SpanishMagician> call is you call the client, maybe he's not interested or unreachable, appointment is you go visit the client
[10:30:34] <NodeX> it's your app, not mine, I would probably change my "appointments" collection into an "activities" collection and do it all in there
[10:30:34] <SpanishMagician> sorry for the confusion
[10:30:51] <Cubud> I would not put calls in the client
[10:31:30] <NodeX> it's all down to your app and how you query the data as to the most efficient way to do things
[10:31:35] <NodeX> same as *SQL solutions
[10:31:35] <Cubud> yep
[10:32:04] <SpanishMagician> best approach is have a collection for users, clients, calls and appointments, right?
[10:32:15] <Cubud> If you want to show an individual call on screen then I'd go for a separate collection
[10:32:16] <NodeX> no need for calls and appointments
[10:32:23] <NodeX> have "activities"
[10:32:29] <SpanishMagician> ok
[10:32:39] <NodeX> type: 'call', response :"client said go away"
[10:32:50] <NodeX> type:'appointment', date:'tomorrow'
[10:32:58] <SpanishMagician> cubud: yes, a call would show on screen by itself
[10:33:03] <NodeX> index on type, client_id
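A minimal pymongo version of the users/clients/activities layout sketched above (database and field names are illustrative):

    from datetime import datetime
    from pymongo import MongoClient

    db = MongoClient()["crm"]          # hypothetical database

    client_id = db.clients.insert_one({"name": "ACME Ltd"}).inserted_id

    # one "activities" collection for both calls and appointments, indexed on type + client_id
    db.activities.create_index([("type", 1), ("client_id", 1)])

    db.activities.insert_one({
        "client_id": client_id,
        "type": "call",
        "response": "client said go away",
        "created_at": datetime.utcnow(),
    })
    db.activities.insert_one({
        "client_id": client_id,
        "type": "appointment",
        "date": datetime(2012, 8, 22),
    })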
[10:33:20] <elio_> Hello everyone!
[10:33:36] <Cubud> Then I would have it as it's own entity
[10:34:15] <SpanishMagician> japer?
[10:34:25] <Cubud> Order+lines are a single entity, you won't need to have 2 users updating 2 different lines at the same time
[10:35:01] <SpanishMagician> I understand
[10:35:12] <Cubud> If you don't want 2 people making an appointment for a client at the same time, but want people to be able to edit a client while someone else raises an appointment then perhaps you need a ClientAppointments collection
[10:35:56] <Cubud> If you want to be able to make simultaneous changes to the appointments for a single client then have a collection where each appointment is its own document
[10:36:08] <Cubud> rather than one document holding lots of appointments for a client
[10:36:18] <Cubud> I tend to think of it as locking granularity
[10:36:51] <Cubud> But I know nothing of MongoDB yet, these are just standard programming guidelines
[10:36:52] <NodeX> if however you are going to query that a lot it can be expensive so it may make sense to have hot data embedded
[10:37:32] <SpanishMagician> thank you guys
[10:37:57] <SpanishMagician> I appreciate your help
[10:37:58] <Cubud> Or a logically separate thing
[10:38:06] <SpanishMagician> I think I have a better understanding now
[10:38:08] <Cubud> 1: Client 2: ClientCalendar
[10:38:26] <Cubud> 2 separate areas of a single logical thing
[10:39:00] <elio_> Cubud, may I ask you how would you model in mongo user conversations
[10:39:14] <elio_> I have a document with an array of usernames
[10:39:35] <elio_> and an array of a nested document which holds the author and content of the message
[10:40:26] <Cubud> I don't do anything in Mongo yet
[10:40:37] <Cubud> I am more accustomed to SQL and domain-driven
[10:40:45] <elio_> ok ok
[10:40:48] <elio_> thanks anyway
[10:40:53] <Cubud> sorry :)
[10:40:54] <NodeX> elio_ : what's the problem
[10:41:13] <Cubud> user conversations = like on IRC?
[10:41:56] <elio_> NodeX, I am trying to model user conversations on mongodb
[10:42:09] <elio_> I have a document Collection
[10:42:24] <elio_> which holds an array of usernames (involved in the conversation)
[10:42:26] <Cubud> If a "user conversation" is a chat then I'd just have a list of actions
[10:42:38] <Cubud> Person X said Y to #MongoDB
[10:42:47] <NodeX> define user conversations?
[10:42:47] <elio_> mmmm
[10:42:51] <Cubud> Peson Y said Z to Cubud
[10:43:09] <Cubud> Person A set mode +v on #MongoDB
[10:43:13] <Cubud> Do you mean like that?
[10:43:26] <elio_> I was defining something much more simple
[10:43:42] <elio_> a conversation is a group of users that exchange text messages
[10:43:43] <elio_> between them
[10:44:26] <Cubud> to groups or individuals?
[10:44:37] <Cubud> or both?
[10:45:02] <elio_> whatsapp style
[10:45:12] <elio_> groups of 2 or more
[10:45:13] <Cubud> Does that do groups?
[10:45:20] <Cubud> Then what I said would work just fine
[10:45:53] <Cubud> Like a script of things to do in order to replay a conversation (or a subsection of it based on a time period)
[10:46:06] <Cubud> X said ..... to #Group1
[10:46:12] <NodeX> *yawn*
[10:46:14] <Cubud> Y said ..... to @X
[10:46:27] <_johnny> NodeX: did you pull an all nighter ? :)
[10:46:29] <NodeX> elio_ : what is the problem
[10:46:44] <NodeX> no, this back and forth defining something is boring me lol
[10:46:56] <Cubud> Just like on twitter, except the client only gets what is destined for that user based on username or group memberships
[10:46:58] <elio_> NodeX, I don't know what would be the best way of defining
[10:47:01] <_johnny> hehe
[10:47:03] <elio_> the collection
[10:47:28] <Cubud> Collection 1: Users Collection 2: Groups Collection 3: Messages
[10:47:33] <Cubud> That's all there is to it
[10:47:45] <elio_> but the mongo style is not to have to do joins
[10:48:00] <Cubud> You wouldn't have to
[10:48:05] <NodeX> again, what are you trying to achieve
[10:48:20] <Cubud> When posting the message to a group you grab the group, see who is in it, then send it to those people
[10:48:33] <elio_> NodeX, a chat like application
[10:48:38] <Cubud> message to #Group 1 (Recipients are A,B,C)
[10:48:56] <NodeX> and what is wrong with this chat like application
[10:49:41] <elio_> NodeX, it has nothing in special I want to know how experienced mongo db users model this problem on the database
[10:50:10] <NodeX> are you just storing say room, user, time, text ?
[10:50:42] <elio_> Conversation: new mongoose.Schema({
[10:50:42] <elio_> users: [String],
[10:50:43] <elio_> messages: [{ author: String, body: String }]
[10:50:43] <elio_> }
[10:50:46] <elio_> I have this for now
[10:50:51] <elio_> something simple
[10:51:05] <elio_> date should be stored I guess
[10:51:17] <elio_> when querying I will want to obtain only the last 50 messages
[10:51:17] <NodeX> one document per chat room?
[10:51:25] <elio_> yep
[10:52:00] <NodeX> http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields#RetrievingaSubsetofFields-RetrievingaSubrangeofArrayElements <---- will get you last 50 elements of your messages
[10:52:09] <NodeX> that schema seems fine
[10:52:39] <Cubud> I would have "Message" Sender, Recipients, GroupChatName (for memorandum purposes), Body
[10:53:18] <elio_> great NodeX didn't know about slice
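In pymongo the $slice projection from the linked page looks roughly like this (collection and database names are assumed, the fields follow elio_'s schema):

    from pymongo import MongoClient

    conversations = MongoClient()["chat"]["conversations"]   # hypothetical names

    # fetch only the last 50 elements of the messages array for one conversation
    doc = conversations.find_one(
        {"users": "elio"},
        {"users": 1, "messages": {"$slice": -50}},
    )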
[10:53:20] <Cubud> When sending to a group you would grab the group, see who is in it, then put those people in the recipients list
[10:54:17] <Cubud> I wonder, is there a slice on time rather than only indexes?
[10:56:09] <NodeX> slice on indexes?
[10:56:17] <Cubud> 5 = first 5
[10:56:18] <NodeX> you mean arrays ?
[10:56:21] <Cubud> -5 = last 5
[10:56:55] <Cubud> db.posts.find({}, {comments:{$slice: 5}}) // first 5 comments
[10:57:06] <Cubud> perhaps like this...
[10:57:22] <Cubud> db.posts.find({}, {comments:{$slice: 30s}}) // first 30 seconds
[10:57:31] <NodeX> no
[10:57:32] <Cubud> db.posts.find({}, {comments:{$slice: -30s}}) // last 30 seconds
[10:57:37] <NodeX> not possible
[10:57:51] <Cubud> That's a shame, that would be nice for when you don't know the volume
[10:58:08] <NodeX> the common attack to that is to store the count of the array
[10:58:33] <Cubud> but how could you know how many were in the past 5 minutes?
[10:58:52] <Cubud> Or since a specific date. Index I suppose?
[10:59:06] <Cubud> PS: It is official, I love MongoDB
[10:59:39] <NodeX> you have to store that data
[11:00:53] <Cubud> You mean just store the timestamp of the action, or something else?
[11:01:00] <NodeX> yup
[11:01:11] <Cubud> ok :)
[11:01:43] <Cubud> It seems that fetching data in MongoDB gets slower the closer the record you seek is to the end of the data
[11:01:51] <Cubud> (i.e. most recently inserted)
[11:02:13] <NodeX> slower than what?
[11:02:53] <_johnny> Cubud: also if you do a findOne() and sort desc on obj id?
[11:03:06] <Cubud> I created 5million users "1@home.com" "2@home.com" etc
[11:03:07] <NodeX> caching is maintained by the operating system, if you insert a record it -may- still be in memory and it -may- not, if you access that document a few times it will be in MRU
[11:03:12] <Cubud> Was very fast :)
[11:03:40] <Cubud> But when I FindOne with EmailAddress = "1@home.com" it is quicker than when I look for 5000000@home.com
[11:03:47] <Cubud> I have EmailAddress indexed
[11:03:54] <_johnny> NodeX: would one way to force it into MRU be to use the "return inserted document" when adding to the collection?
[11:03:59] <NodeX> the 500000 might have spilled to disk
[11:04:17] <Cubud> I am now iterating from 0..5000000 and doing a FindOne for that email address
[11:04:31] <Cubud> 1000 queries took 0.8s
[11:05:03] <Cubud> For the range 32000 to 33000 it took 7 seconds
[11:05:08] <_johnny> *cough* sharding *cough*
[11:05:14] <Cubud> 33000 to 34000 it took 11 seconds
[11:05:23] <Cubud> for 5 million objects?
[11:05:28] <Cubud> Sharding? Surely not?
[11:05:29] <_johnny> that does soudn odd, Cubud
[11:05:34] <_johnny> sound*
[11:05:34] <NodeX> very odd
[11:05:40] <NodeX> how are you doing the query
[11:06:22] <Cubud> http://pastebin.com/JmhHEhfp
[11:06:46] <NodeX> I have a 140m document collection and I can skip/limit and grab 1000 records deep inside the collection in well under a second
[11:08:57] <_johnny> the largest i have is 2.3 million, but also super fast
[11:09:25] <_johnny> maybe if you used regex i could understand it
[11:10:16] <_johnny> NodeX: speaking of which, are there any "best practice" on indexing for regex purpose? or is that not possible at all?
[11:10:43] <_johnny> like, using a lookup collection (which is what i do now)
[11:11:07] <NodeX> regex purpose?
[11:11:16] <NodeX> I dont follow sorry
[11:11:50] <_johnny> right, no, i'm sorry, i didn't say it right
[11:12:25] <_johnny> i mean, from what i can tell from the docs, indexes only work when you do regex queries if you start them with ^, and don't use case insensitive matching
[11:12:45] <NodeX> yeh, you must use a prefix else it wont use an index
[11:13:00] <_johnny> ok, gotcha
[11:13:02] <_johnny> thanks
[11:13:29] <NodeX> it's fairly fast at regex
[11:14:10] <_johnny> when using a prefix, definitely yes
[11:14:45] <NodeX> I currently use it for a 28million doc collection (UK Postcodes) with a prefix and can find (using a suggestive ajax style approach) a match in a few ms
[11:15:55] <_johnny> funny, i do something a bit similar. i have 2 million address records, where i do ajax search complete for street names. i use a street name collection as well to look up the street names first, to work around the case-insensitive matching
[11:16:18] <_johnny> i was thinking of auto titlecasing it, which would work for most cases, but not all :(
[11:16:38] <_johnny> so if someone types "her", it would query for ^Her
[11:16:44] <NodeX> I did "idx_field" and uppercased it, and "display_field"
[11:16:57] <NodeX> waste of space but I had loads so it didnt matter
[11:17:05] <_johnny> that's actually a really good idea
[11:17:06] <_johnny> lol
[11:17:40] <_johnny> i'll just add that to the lookup collection. it's still relatively small, and the mongodb isn't in production yet
[11:17:56] <NodeX> it will increase performance a heck of a lot
[11:18:18] <_johnny> yeah, you're absolutely right. i don't know why i hadn't thought of that
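The uppercased idx_field / display_field trick NodeX describes, sketched in pymongo (collection, field names and the sample street are invented):

    from pymongo import MongoClient

    streets = MongoClient()["geo"]["streets"]   # hypothetical collection

    # keep an uppercased copy purely for the index; display_field preserves the original casing
    streets.create_index("idx_field")
    streets.insert_one({"idx_field": "HERLEV HOVEDGADE", "display_field": "Herlev Hovedgade"})

    def suggest(prefix):
        # an anchored, case-sensitive regex on the uppercased copy can use the index
        cursor = streets.find({"idx_field": {"$regex": "^" + prefix.upper()}})
        return [d["display_field"] for d in cursor]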
[11:19:14] <_johnny> how long did it take you to fill those postcodes in? :) i had an excessive xml schema which i had to parse when inserting to mongo (new schema). took 5 hours for 2 million elements. there's something i'd rather not do again, lol
[11:19:14] <NodeX> I did it a couple years ago when I delved into the viability of mongo for a job board and wrote one as POC
[11:19:58] <NodeX> I have the postcode data that I have been building for years in Json that I exported from SQL, it took about 2 hours to insert and a few to index
[11:20:12] <_johnny> nice :)
[11:20:36] <NodeX> I recently did the same thing with the world GEO database
[11:21:02] <_johnny> while i was loading the data in, i ran a periodical query to check places near a certain point. was fun to see how the "world" got populated as it went along :p
[11:21:09] <_johnny> yea?
[11:21:12] <NodeX> with a 2d index so I can get the closest "large place" (town/ city/landmark) to a given lat/long
[11:21:42] <_johnny> how do you define large place? predefined or on condition?
[11:22:00] <NodeX> but I also made it so that I could regex it
[11:22:03] <Cubud> I have to go out for an hour or so now, but here is my really simple test app source if you want to see what code is being executed that gets progressively slower
[11:22:04] <Cubud> http://pastebin.com/Edy86HHz
[11:22:12] <NodeX> the GEO data has a "type" defined with it
[11:22:27] <_johnny> i was thinking of running a map/reduce gimmick to find the most crowded places in the most crowded cities, and vice versa the most stranded :)
[11:22:32] <_johnny> NodeX: ah, okay :)
[11:22:46] <NodeX> I use it as a reverse Geocode for my apps as they all rely on lat/long and geo hashes
[11:23:05] <_johnny> right. sounds neat :)
[11:23:23] <NodeX> MongoServer.Create("mongodb://localhost/?safe=true"); <---- can you not use a unix socket in your app ?
[11:23:28] <NodeX> *driver*
[11:24:01] <_johnny> looks like C#
[11:24:21] <NodeX> pass
[11:24:48] <NodeX> isn't that what MAC apps are written in .. like iphone stuff?
[11:25:06] <_johnny> http://www.mongodb.org/display/DOCS/CSharp+Driver+Tutorial#CSharpDriverTutorial-Connectionstrings doesn't look like it, no
[11:25:24] <_johnny> no, that's objective-c. C# is part of the .NET framework (windows)
[11:25:42] <NodeX> ah
[11:27:27] <_johnny> Cubud: i'm not that into it, but to me it looks like you're iterating all, rather than skipping?
[11:28:08] <_johnny> you pull findOne 5 mio times
[11:28:23] <_johnny> but you don't reset the start time
[11:30:26] <_johnny> NodeX: a specific reason to use upper case rather than lower case, or just preference?
[11:30:54] <NodeX> No specific reason
[11:31:14] <NodeX> was faster to type at the time probably (the conversion function)
[11:33:20] <_johnny> ok. is there an upper/lower in mongo internally? like for .update({}, {$set: { new_field: old_field.toUpperCase() } })?
[11:33:45] <NodeX> pass, I did it in my app?
[11:33:52] <NodeX> minus the "?"
[11:34:06] <_johnny> i was afraid you might say that :)
[11:41:17] <algernon> _johnny: .toUpperCase() works in the js shell, yes.
[11:41:51] <algernon> but, I doubt you can use the old field like that.
[11:42:05] <_johnny> yup, i got so far. looking at $refset now :)
[11:45:02] <stabilo> I have 10 items in the gridfs, all having filenames starting with 8. I want to query only one. So I do
[11:45:05] <stabilo> db.fs.files.find( { filename : /^8/ }, null, 0, 1)
[11:45:41] <stabilo> which should limit the number of results to 1 for my understanding. Apparently it does not what I want since I get all 10 files
[11:47:15] <stabilo> findOne is not preferred since I want to get an GridOut object rather than a dictionary
[11:47:19] <algernon> that's because find() accepts query, fields, limit, skip
[11:47:26] <algernon> at least, on the shell
[11:47:49] <stabilo> interesting. wait
[11:47:51] <algernon> so you're passing limit=0, skip=1, by the looks of it.
[11:52:13] <stabilo> thank you. This gives me enough motivation to really start hating the pymongo doc.
[11:52:14] <_johnny> algernon: you're right. the this object in a cli update is the entire shell. hehe :(
[11:52:32] <stabilo> Since there skip and limit are swapped
[11:53:09] <stabilo> and I wonder why my $§( is not $)&ck"$I working
[11:53:44] <stabilo> to make it short: the doc is bogus
[11:54:44] <stabilo> I should rather consult the sources for documentation in the first place
[12:31:06] <Bartzy|work> I update a document and using upsert - if the document was only updated, can I get the _id without doing find again ?
[12:32:23] <NodeX> Driver specific that is
[12:33:08] <NodeX> save() will do it in the php driver for example
[12:34:04] <algernon> on the wire: no, you can't. your driver may do it for you transparently though.
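In pymongo, for instance, the driver surfaces the _id of an upserted document, and find_one_and_update can return the document in both the insert and the update case (a sketch, not Bartzy's actual code):

    from pymongo import MongoClient, ReturnDocument

    coll = MongoClient()["test"]["things"]   # hypothetical collection

    # upserted_id is only set when the upsert actually inserted a new document
    result = coll.update_one({"key": "a"}, {"$inc": {"n": 1}}, upsert=True)
    print(result.upserted_id)                # None if an existing document was updated

    # to get the _id in both cases, have the server return the document itself
    doc = coll.find_one_and_update(
        {"key": "a"},
        {"$inc": {"n": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    print(doc["_id"])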
[12:37:25] <s1n4> hey I'm looking for an GUI tool, something like pgadmin but for mongodb
[12:37:31] <s1n4> does anyone know one?
[12:37:44] <s1n4> a GUI*
[12:47:03] <s1n4> any ideas?
[12:47:04] <_johnny> s1n4: http://www.mongodb.org/display/DOCS/Admin+UIs
[12:47:57] <_johnny> phpmoadmin might be what you're looking for. personally i've used rockmongo a bit, but not extensively
[12:50:54] <s1n4> _johnny: I dont like a web based program as a mongo admin ui
[12:51:03] <stabilo> hrm. what is the best way to get a gridfile using a query object? I first thought that the collection.find() method would do it, but it returns only a cursor (over dictionaries) and I don't know how to get a gridfile from that. One solution would be to extract the exact filename from the dictionary and make a second call to the database like db.fs.files.get_version(filename). But that sucks (TWO calls to the DB for ONE file)
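For what stabilo is after, pymongo's gridfs module can run the query and yield GridOut objects directly, avoiding the second round trip (a sketch; the filter is the regex from earlier):

    import gridfs
    from pymongo import MongoClient

    db = MongoClient()["test"]               # hypothetical database
    fs = gridfs.GridFS(db)

    # find() queries fs.files under the hood but yields GridOut objects, not plain dicts
    grid_out = next(fs.find({"filename": {"$regex": "^8"}}).limit(1), None)
    if grid_out is not None:
        data = grid_out.read()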
[12:51:14] <s1n4> _johnny: I'm looking for something like pgadmin3
[12:51:49] <_johnny> ah, i thought pgadmin was web based
[12:51:58] <_johnny> which OS are you on?
[12:52:50] <fleetfox> pgadmin is standalone app
[12:52:58] <s1n4> _johnny: ubuntu
[12:56:39] <s1n4> _johnny: what do you suggest?
[13:01:25] <_johnny> for ubuntu i dunno
[13:02:07] <_johnny> try some of those from the docs, like umongo
[13:10:06] <s1n4> ok, thanks
[13:10:48] <zykes-> how can I do "relation" ish things, as in saying what category a product belongs to ?
[13:14:36] <_johnny> zykes-: storing the category in the product doc is not an option?
[13:14:42] <wereHamster> zykes-: google 'mongodb schema design'
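Both patterns that search will turn up, sketched in pymongo (names invented): embed the category in the product document, or store a reference to a categories collection and resolve it with a second query in the application.

    from pymongo import MongoClient

    db = MongoClient()["shop"]               # hypothetical database

    # option 1: embed the category directly in the product document
    db.products.insert_one({"name": "Widget", "category": "tools"})

    # option 2: reference a categories collection by _id (the "relation-ish" way)
    cat_id = db.categories.insert_one({"name": "tools"}).inserted_id
    db.products.insert_one({"name": "Gadget", "category_id": cat_id})

    # resolving the reference is a second query done by the application, not a join
    product = db.products.find_one({"name": "Gadget"})
    category = db.categories.find_one({"_id": product["category_id"]})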
[13:34:19] <circlicious> hi
[13:35:43] <NodeX> hi
[13:38:15] <circlicious> hey NodeX
[13:38:35] <circlicious> was about to ask my question but then i thought i am too tired
[13:38:49] <_johnny> :)
[13:39:22] <NodeX> lolkol
[13:40:36] <circlicious> so i am trying to do mapreduce for the first time
[13:40:48] <circlicious> from reduce i would like to return all fields of the documents from the collection i am working with - does that make sense?
[13:43:16] <NodeX> this is the current object (document
[13:43:23] <NodeX> in a map/reduce **
[13:44:35] <circlicious> not like that, wait let me show you code, i think that will explain
[13:47:56] <circlicious> NodeX: https://gist.github.com/4a201775f56dddcfc37f
[13:51:14] <NodeX> circlicious : you will need to loop 'this' and capture the key and push that
[13:52:37] <circlicious> hm, actually i am confused again. reduce will reduce into 1 document
[13:52:49] <circlicious> so i cannot get all the duplicate documents, can i?
[13:53:42] <NodeX> no
[13:53:49] <NodeX> what are you trying to achieve
[13:55:49] <circlicious> trying to get documents that have same created_at
[13:56:24] <NodeX> without knowing the created_at ?
[13:56:32] <circlicious> right
[13:56:38] <NodeX> and you want all of them
[13:57:01] <circlicious> yes, actually i want on a specific item_id, but for now i am doing all, as i think adding the item_id =x then will be easy anyway
[13:57:24] <NodeX> why don't you distinct "created_at"
[13:58:03] <circlicious> distinct gives me records with distinct created_at
[13:58:21] <NodeX> then add the optional query param
[13:58:23] <circlicious> i only want records that have same created_at with another document
[13:58:52] <NodeX> db.foo.distinct("created_at",{item_id:123});
[13:59:15] <circlicious> yes but it will also return documents that have unique created_at
[13:59:22] <circlicious> i dont want those
[13:59:35] <NodeX> [14:56:29] <circlicious> yes, actually i want on a specific item_id,
[13:59:39] <circlicious> and if 2 or more documents have same created_at i want all those documents
[14:00:11] <circlicious> ok, forget that. there i meant i need to get documents with same created_at on item_id = x
[14:00:32] <circlicious> but right now i am doing on entire collection, will add item_id=x filtering later when i know how to do it anyway
[14:00:53] <circlicious> am i making sense ? :(
[14:00:54] <NodeX> are you on the aggregation framework or are you dead set on map/reducing it
[14:01:21] <circlicious> i have read through aggregation page, i dont think they'll help me
[14:01:35] <circlicious> so i tried map reduce, but dunno how to get all documents with all fields
[14:01:41] <circlicious> i am really confused now :S
[14:02:19] <NodeX> http://csanz.posterous.com/look-for-duplicates-using-mongodb-mapreduce
[14:02:24] <NodeX> maybe that will help
[14:02:30] <circlicious> 1 thing i can now do is, map reduce can give me the created_at that appear 2 or more times. i can then query once for each created_at then
[14:02:37] <circlicious> but will be slow?
[14:02:40] <circlicious> k let mee read
[14:03:14] <circlicious> "I wrote a simple script, which failed because our data is too big" :)
[14:03:33] <circlicious> i think my data will also become big soon ;D and i have no idea how i will handle it, but i'll try and manage
[14:05:11] <circlicious> NodeX: that wont help, from that i'll show u what i need
[14:09:54] <circlicious> NodeX: https://gist.github.com/53039e8b3f209759d091
[14:17:28] <circlicious> anyone ? :(
[14:21:44] <circlicious> btw on that post, i dont understand cond: { "created_at"},
[14:21:57] <circlicious> should had been 'created_at': some_val, no?
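For what circlicious describes, the aggregation framework can group on created_at and keep only the groups with two or more documents, which avoids map/reduce entirely. A pymongo sketch (collection name invented; the commented $match is the optional item_id filter mentioned above):

    from pymongo import MongoClient

    coll = MongoClient()["test"]["items"]    # hypothetical collection

    pipeline = [
        # {"$match": {"item_id": 123}},      # optional per-item filter
        {"$group": {
            "_id": "$created_at",
            "ids": {"$push": "$_id"},
            "count": {"$sum": 1},
        }},
        {"$match": {"count": {"$gte": 2}}},  # keep only duplicated created_at values
    ]

    for group in coll.aggregate(pipeline):
        # group["ids"] holds the _ids of every document sharing this created_at
        dupes = list(coll.find({"_id": {"$in": group["ids"]}}))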
[14:45:50] <Cubud_> Back :)
[14:46:39] <circlicious> can you help me with https://gist.github.com/53039e8b3f209759d091 ?
[14:46:42] <Cubud_> _johnny: Yes, I am doing 5 million individual selects. I am trying to see how quickly I will be able to serve a DB request. For inserting it is 20,000 per second (wow)
[14:46:54] <Cubud_> But selecting gets progressively slower
[14:47:17] <Cubud_> It's as if MongoDB is sequentially scanning the rows rather than using the index, because it takes longer to select data near the end
[14:49:40] <NodeX> Cubud_ : use explain to see what it's doing
[14:51:33] <Cubud_> I will see if the library supports it
[14:53:38] <circlicious> NodeX: did you understand my problem?
[14:54:16] <NodeX> circlicious : no
[14:54:36] <NodeX> Cubud_ : what sort of specs does your testing server have?
[14:54:40] <NodeX> (Ram mainly)
[14:55:33] <Cubud_> 4gb
[14:55:39] <Cubud_> 64bit os
[14:55:46] <Cubud_> 2.6GHz CPU
[14:55:47] <Cubud_> Win 7
[14:56:30] <NodeX> is the index fully built?
[14:56:44] <NodeX> i.e. not still running in the background
[14:56:50] <Cubud_> I have no idea
[14:56:58] <NodeX> how did you build it?
[14:57:02] <Cubud_> Would it be rebuilt as it is inserted
[14:57:08] <NodeX> no
[14:57:11] <Cubud_> I created it before inserting any data
[14:57:16] <NodeX> ok kewl
[14:57:20] <estebistec> In explain-plan output, is allPlans indicating indexes that were actually used in the query or simply considered by mongo because they all have related fields?
[14:57:32] <estebistec> if there is multiple index usage with de-dup going on it isn't clear
[14:57:33] <NodeX> I would just guess that your index is spilling to disc for whatever reason
[14:58:01] <NodeX> using findOne() 1000 times is probably not going to happen in your app
[14:58:20] <NodeX> the disk will be thrashing all over the place looking for the data as no MRU / LRU will be touched
[14:59:06] <NodeX> if the whole working set were in RAM it would be a lot faster
[14:59:25] <Cubud_> It is very likely to happen
[14:59:35] <Cubud_> Imagine it to be a kind of Google Analytics
[14:59:48] <Cubud_> Millions of people accessing unrelated URLs
[15:00:00] <NodeX> not on a single machine it's not
[15:00:15] <Cubud_> No, but each machine would have more than 5 million URLs on it
[15:00:24] <NodeX> at that point your bottleneck would be network card anyway not disc
[15:01:32] <NodeX> can you run mongostat next time you do the benchamrk
[15:01:40] <NodeX> benchmark.. and pastebin it
[15:01:42] <Cubud_> At the start the web app + DB would be on the same machine
[15:01:50] <Cubud_> Only upscaling if it actually became popular :)
[15:02:14] <circlicious> after i map_reduce, how can i filter?
[15:04:19] <circlicious> if map_reudce returns documents like {_id: ..., {'value'=>}
[15:04:44] <circlicious> {_id: ..., value => {count => x}}
[15:04:51] <circlicious> i only want where x is 2 or more
[15:08:34] <Cubud_> What am I looking for in MongoStat?
[15:09:54] <Cubud_> http://pastebin.com/4ZkifeUK - Stats
[15:12:28] <_johnny> NodeX: i'm about to parse all my xml again, wanting to put it in a mongoimport compliant format. i'm trying to store it as json, but even for smaller files (36kb) i catch the exception BSON representation of supplied JSON is too large. i've added \n's as it looked like mongoimport reads lines at a time, but still same garble. am i doing something completely upside down ?
[15:14:30] <circlicious> no one wants to help me ;(
[15:27:05] <Cubud_> How can I check when Mongo has finished rebuilding an index?
[15:27:53] <_johnny> d'oh, forgot to use --jsonArray, my bad :(
[15:31:53] <NodeX> Cubud_ : it will have
[15:32:08] <NodeX> it would've built it as you added the emails
[15:33:12] <Cubud_> Okay, so as I select index.ToString() + "@MyDomain.com" it finds them quickly when index is low
[15:33:22] <Cubud_> but the higher value index is the longer it takes to find the data
[15:33:31] <Cubud_> even if I loop through multiple times
[15:33:56] <Cubud_> so a loop of 0..10,000 within an outer loop of 10 times
[15:34:22] <Cubud_> The speed degradation is consistent even over secondary outer iterations
[15:34:44] <Cubud_> I am going to test it with 50 million records
[15:36:28] <NodeX> run mongostat while you test
[15:38:39] <Cubud_> What am I looking for in the output?
[15:38:54] <Cubud_> Mongo inserts are impressively fast :)
[15:39:26] <NodeX> on query look for faults
[15:53:28] <therealkoopa> If I want to provide users with the ability to reset their password, I'm doing it by generating reset tokens. Is it better in mongo to create a separate small collection with pending reset tokens that reference the user, or to store the reset token information on the users collection?
[15:55:30] <Cubud> I'd put it in the user
[15:55:41] <Cubud> because you are going to load the user in in order to reset the password anyway
[15:57:00] <therealkoopa> Cubud: I was worried about searching for the reset key if the users collection is massive.
[15:57:34] <Cubud> I get them to put in their email + reset key
[15:57:48] <Cubud> or username + key (whichever you have keyed the user table on)
[15:58:17] <Cubud> Having the key alone is a bit weak
[15:58:20] <estebistec> asking again, briefly as I can: in explain output, are all the indexes in allPlans actually being used or are they candidate plans?
[15:58:31] <Cubud> You might guess a key
[15:58:54] <Cubud> but you won't guess a key + emailaddress, especially if it has an expiry of 30 minutes
[16:04:06] <therealkoopa> Cubud: Okay, I like that. Thanks
[16:04:12] <NodeX> estebistec : how large is your user collection?
[16:04:34] <NodeX> and do you really want an index just to search user reset key ?
[16:04:48] <estebistec> NodeX, I'm testing out my indexes with 10k docs
[16:05:00] <estebistec> we'll have several M in reality
[16:05:10] <therealkoopa> NodeX: Well right now about 3. Hopefully when we deploy, it's going to be massive :). I'm a little worried about searching by reset key. I'll probably somehow encode the email address in the key, and search on that? Um
[16:05:39] <NodeX> it will still require an index
[16:06:06] <NodeX> I would (for sanity) use a front facing cache and not mongo for the initial lookup
[16:06:33] <NodeX> something like redis or memcache with the key of whatever and value of the UID
[16:06:45] <NodeX> (expiring of course)
[16:08:01] <therealkoopa> Ah, yea. So the reset key will be in redis and the value will be the user. Then I can look up the user quickly in mongo.
[16:08:08] <Cubud> Do you send an email with a URL containing the key to reset?
[16:08:12] <therealkoopa> Yea
[16:08:59] <Cubud> make the link http://mysite.com/Account/ResetPassword?User=cubud&ResetKey=123456789
[16:09:09] <Cubud> More secure, fast to load
[16:09:42] <therealkoopa> What if the reset key is just the base64 encoded username? Too easy to guess?
[16:10:18] <Cubud> yes
[16:10:19] <NodeX> the most secure way is to salt the email + user agent + session_id
[16:10:42] <Cubud> The more secure way than that is to use a crypto-safe random number generator
[16:10:46] <NodeX> that way someone would have to hijack the session, guess the user agent, email - can add IP for good measure if you like too
[16:11:17] <Cubud> user + random number + a timestamp of when it was generated, so it can only be reset within X minutes
[16:11:40] <NodeX> that stops brute force
[16:11:42] <Cubud> So you have to guess the random number + find a user that has requested a reset + within X minutes of them doing so
[16:11:45] <NodeX> doesn't stop hijacking
[16:11:48] <NodeX> or sniffing
[16:12:38] <Cubud> How would salting prevent sniffing?
[16:13:14] <NodeX> err salting the session id plus the user agent is far more secure than not salting it
[16:13:33] <Cubud> but with sniffing you don't need to understand the data, you just need to possess it
[16:14:02] <Cubud> The resulting code would be as meaningless as a random number, but still as simple to use to access the account
[16:14:05] <therealkoopa> Would you worry about them using a different browser for the reset? That's just a thought.
[16:14:10] <NodeX> untrue
[16:14:18] <Cubud> Then please explain :)
[16:14:26] <Cubud> I like being wrong :)
[16:14:27] <NodeX> I dont need to, head to #security and ask
[16:14:50] <Cubud> Oh okay, I just thought you might back up your statement with facts ;-)
[16:14:51] <NodeX> the reason for a salt is to re-check it at the server end
[16:15:22] <jjanovich> hello
[16:15:30] <NodeX> so sending your "sniffed" data means nothing if you can't send me what my app salted in the first place to compare it against
[16:15:57] <jjanovich> question, my server is set in EDT timezone, but looks like mongo is returning in a different timezone
[16:16:04] <jjanovich> how can I get those to sync up
[16:16:34] <Cubud> the server sends an email with some key in it, if that key is sufficient to get the user in then it would be sufficient to get in the person that sniffed the key wouldn't it?
[16:18:12] <therealkoopa> Okay, so it's either store it as key/value in redis with an expiration, which is nice and easy, or base64 encode some magic and have to deal with the timeout manually. Think they are both about the same on a 'goodness' level?
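The redis-in-front approach NodeX suggests could look roughly like this (redis-py assumed; key layout, TTL and field names are made up):

    import os
    import redis
    from bson import ObjectId
    from pymongo import MongoClient

    users = MongoClient()["app"]["users"]    # hypothetical collection
    r = redis.Redis()

    def issue_reset_token(email):
        user = users.find_one({"email": email})
        if user is None:
            return None
        token = os.urandom(16).hex()                             # crypto-safe random value
        r.setex("pwreset:" + token, 30 * 60, str(user["_id"]))   # expires after 30 minutes
        return token

    def redeem_reset_token(token):
        uid = r.get("pwreset:" + token)
        return users.find_one({"_id": ObjectId(uid.decode())}) if uid else None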
[16:19:21] <Cubud> NodeX, using the email address as my key gives me 12,500 lookups per second without any speed degradation
[16:21:09] <jjanovich> does mongodb store all datetime as UTC and I have to convert, or is there a way to force it to save in the local timezone
[16:24:01] <Cubud> NodeX: Very few faults - http://pastebin.com/PqcfvLet
[16:26:38] <Cubud> I think that seeing the high speed of selecting against a specific key is enough to convince me to use Mongo :)
[16:27:41] <therealkoopa> Cubud: You don't have to worry about storing reset tokens on the user if throwing redis in front.
[16:28:19] <therealkoopa> And it seems silly to have the reset token be an index
[16:28:26] <therealkoopa> Or indexed, rather
[16:30:49] <NodeX> Cubud : can you do db.your_collection.getIndexes();
[16:30:55] <NodeX> into a pastebin
[16:36:01] <Cubud> 1 sec
[16:43:44] <Cubud> therealkoopa - I suggested that you don't use an index for the reset code
[16:43:53] <Cubud> think ResetCode rather than ResetKey
[16:44:04] <Cubud> NodeX: Nearly done, 2 more mins :)
[16:44:15] <tsally> if I write a new document to a sharded collection, how do i know whena read of this document is guarenteed to succeed?
[16:44:57] <_johnny> ok, so i have a question about mongoimport. how on earth do anyone manage to import anything? i have a 20mb array file which is too large. any tools for splitting this seemingly "huge" file?
[16:45:40] <Cubud> NodeX: http://pastebin.com/DFGu436R
[16:53:04] <NodeX> can you do db.your_collection.findOne({Email:"some@email"}).explain();
[16:54:05] <wereHamster> tsally: when reading from the master, immediately
[16:55:55] <Cubud> okay, 1 sec :)
[16:59:19] <Cubud> NodeX: Error - http://pastebin.com/B1ND731A
[16:59:25] <Cubud> Doesn't know what explain() is
[17:01:55] <NodeX> can you do db.your_collection.find({Email:"some@email"}).skip(0).limit(1).explain();
[17:04:22] <Cubud> That worked :) - http://pastebin.com/n9UqAFTz
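The same check from pymongo, for reference (names assumed):

    from pymongo import MongoClient

    coll = MongoClient()["test"]["users"]    # hypothetical collection

    # explain() shows whether the Email index is used and how many documents are scanned
    plan = coll.find({"Email": "12345@MyDomain.com"}).limit(1).explain()
    print(plan)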
[17:05:16] <ramsey> Derick: does that new beta driver also fix this issue? https://jira.mongodb.org/browse/PHP-377
[17:05:50] <NodeX> the only thing I can say is that your data is non sequential on disk and the reads to get the data are thrashing the disk
[17:06:25] <Cubud> I would say that is correct, because it is 1@MyDomain.com through to 1000000@MyDomain.com
[17:06:40] <Cubud> so it will be distributed all over the place
[17:06:44] <NodeX> the disk will be going mad
[17:06:55] <NodeX> more RAM would have the woking set inside it and it would read from RAM
[17:07:02] <NodeX> and MRU would be fast too
[17:07:21] <Cubud> I will try finding 10000 onwards
[17:11:46] <konr_trab> Is there a conditional operator equivalent to $equal?
[17:12:18] <NodeX> 'key':'val'
[17:12:44] <Cubud> hmmm, I did 100000 to 200000 and the speed was consistent
[17:12:54] <ramsey> Derick: nevermind... we can't really test the beta 1.3.0 driver right now, since our setup requires authentication :-)
[17:12:56] <Cubud> so I changed it back to 1 to 15000 and now that is fast too
[17:13:11] <NodeX> MRU
[17:13:18] <Cubud> Well, I am happy with the speed I am seeing anyway
[17:13:28] <Cubud> Thanks very much for your help NodeX, I really appreciate it!
[17:13:33] <NodeX> the query is warm so it's in the cache
[17:13:37] <NodeX> no probs
[17:15:29] <suri> hello
[17:15:41] <suri> somebody there?
[17:17:08] <suri> i'm trying to use mongodb on django1.4.. as i know still there is no nonrel support for 1.4..
[17:17:23] <suri> is there other way for it?
[17:20:08] <jrdn> weird question, should i cache things like "names" in memory when querying for an _id field? we have reports and are storing object ids (b/c names change often)… since i'm assuming it's in memory anyway, is it worth it to cache them anyway to get rid of what is like 40 queries on one report page?
[17:21:19] <jrdn> i.e.) http://docs.mongodb.org/manual/faq/fundamentals/#does-mongodb-handle-caching not sure if it's even worth it to build a cache layer right now
[17:23:45] <NodeX> 40 queries in one page?
[17:23:55] <NodeX> Yes I should cache
[17:24:56] <jrdn> and yeah, they are essentially a query on _id (indexed obviously) and returning just a name
[17:25:28] <Rtja> heya guys, where can i find a JSON.parse alternative in the mongo shell?
[17:27:46] <_johnny> when querying on two fields in a collection, is it sufficient (albeit not optimal) to index only one of the fields?
[17:28:10] <_johnny> Rtja: printjson() is what you want?
[17:28:22] <_johnny> or do you mean the other way?
[17:29:11] <Rtja> i mean string to object conversion :3
[17:29:35] <_johnny> printjson() or tojson() will do that
[17:35:25] <Rtja> _johnny, it returns a string, but i need an object
[17:36:05] <NodeX> it's already an object in the shell
[17:36:26] <NodeX> db.foo.find().forEach()...
[17:41:26] <Rtja> but i need to return JSON.parse("string here"), i don't need to parse documents
[17:41:50] <Rtja> for example i want convert uri string to object
[17:45:34] <_johnny> if you don't need documents, what is it you're using mongodb for? can you give an example?
[17:45:34] <NodeX> that's a javascript issue not a mongo issue
[18:24:46] <Almindor> how would you go about importing a huge table (6 billion entries) from M$ SQL into MongoDB?
[18:24:59] <Almindor> we used freeTDS/odbc/python before but it's simply too slow for this huge thing
[18:25:13] <NodeX> which part is slow?
[18:25:20] <NodeX> the read or the write
[18:25:21] <Almindor> I think freeTDS/connection :D
[18:25:45] <Almindor> but it might be the writes too I guess, never really timed the whole thing
[18:26:19] <NodeX> how slow is slow?
[18:26:20] <Almindor> I was wondering if there might be some non-obvious way of getting data into mongodb
[18:26:28] <Almindor> about 100/s or so
[18:26:34] <Almindor> documents
[18:26:58] <Almindor> but it's not the python part
[18:27:00] <Almindor> CPU usage is low
[18:27:09] <Almindor> mostly io or net I guess
[18:27:32] <NodeX> same machine?
[18:27:37] <Almindor> no that's the problem
[18:27:46] <Almindor> that's why I'm thinking about using an intermediate format
[18:27:55] <NodeX> tbh I would dump the lot to json/csv or something
[18:27:56] <Almindor> just dump, compress, copy and import
[18:27:59] <Almindor> yeah
[18:28:04] <NodeX> it will be quicker in the long run
[18:28:09] <Almindor> yes I think so too
[18:28:26] <Almindor> I guess we should be able to make a huge json file out of it
[18:28:39] <NodeX> or break it into parts ;)
[18:28:42] <Almindor> :D
[18:28:43] <Almindor> yeah
[18:28:46] <Almindor> monthly files
[18:28:51] <Almindor> it could even get cron-jobed :D
[18:28:54] <NodeX> say a billion in each in case it gets interrupted
[18:29:01] <NodeX> (had that happen a few times on large data)
[18:29:14] <Almindor> well hopefully our 3 replicaset rig will hold up
[18:33:02] <Almindor> nodex: another issue is that it's not as simple as a push, I need to "re-link" IDs
[18:33:16] <Almindor> nodex: so it has a lot of lookups too, I think 2-3 per row at this point
[18:33:26] <NodeX> in the SQL ?
[18:33:30] <Almindor> in mongo
[18:33:35] <NodeX> ouch
[18:33:55] <NodeX> no way of replicating on the SQL machine while you import, then dumping and moving the data ?
[18:33:58] <Almindor> this is an events collection and they are connected to devices which in turn are connected to users :)
[18:34:53] <Almindor> replicating on the SQL machine?
[18:37:02] <NodeX> putting your Mongo onto the machine where the SQL is
[18:37:06] <NodeX> save the network load
[18:38:52] <Almindor> NodeX: mongo on a windows server? you sure a csv dump isn't faster ::D
[18:39:46] <Almindor> I guess we'll try it with a month's set of data and see how long it takes on the mongo db machines to import it
[18:40:21] <NodeX> in my experience, the longer a process runs the more it slows down, for some reason
[18:40:29] <NodeX> I know it's quicker to do things in batches
[18:40:43] <Almindor> I think mongo has an index/memory usage issue
[18:41:00] <Almindor> it grew to 20GB+ of RAM usage on such imports for us before
[18:41:37] <NodeX> I remember something from a talk by the craigslist guy about it taking ages for him too
[18:41:55] <NodeX> he had a workaround for it, check out the video on the 10gen website
[18:46:49] <Almindor> hmm thanks
[18:48:18] <Almindor> this one? http://www.10gen.com/presentations/mongosf-2012/mongodb-at-craigslist-one-year-later
[18:48:42] <Almindor> seems like it
[18:49:19] <NodeX> it was around 18 months ago I watched it
[19:22:08] <neil__g> hello. could anyone confirm for me that 2.0.7 is a drop-in replacement for 2.0.1? anything i need to be aware of in upgrading?
[19:22:11] <neil__g> pls
[19:23:40] <tsally> what are the performance implications of using safe writes. do drivers just poll getLastError() over and over again?
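For context on tsally's question: with the legacy write protocol a "safe" write meant the driver sent one getLastError command after each write, so the cost is an extra round trip per write rather than repeated polling. A hedged pymongo sketch of the two write concerns (assumes a local mongod; collection name is illustrative):

    # Sketch: acknowledged vs unacknowledged writes in pymongo.
    from pymongo import MongoClient, WriteConcern

    db = MongoClient()["test"]

    acked = db.get_collection("events", write_concern=WriteConcern(w=1))    # wait for ack
    unacked = db.get_collection("events", write_concern=WriteConcern(w=0))  # fire and forget

    acked.insert_one({"x": 1})    # returns after the server confirms the write
    unacked.insert_one({"x": 2})  # returns as soon as the message is sent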
[19:31:26] <icedstitch> Is there an example dataset for playing with mongodb?
[19:32:07] <jrdn> yeah a json document
[19:36:15] <icedstitch> jrdn: Uhm. Any favorites?
[19:36:55] <icedstitch> I mean, this is useful, http://media.mongodb.org/zips.json . i was wondering if there were other ones
[19:44:39] <kchodorow> icedstitch: here's an example using chess games http://www.kchodorow.com/blog/2012/01/26/hacking-chess-with-the-mongodb-pipeline/
[19:45:02] <kchodorow> pretty much any data feed that gives you json will work
[19:48:28] <estebistec> wow, so with pymongo .find({'a': re.compile('^AB')}) is much faster than .find({'a': {'$regex': '^AB'}})
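Both spellings end up as the same server-side regular expression, and an anchored, case-sensitive prefix such as ^AB can use an index on the field. A small pymongo sketch of the two equivalent forms (collection and field names are illustrative):

    # Sketch: two equivalent ways to send an anchored regex from pymongo.
    import re
    from pymongo import MongoClient

    coll = MongoClient()["test"]["things"]

    # 1) a compiled Python pattern, converted to a BSON regex by the driver
    list(coll.find({"a": re.compile("^AB")}))

    # 2) the explicit $regex operator form
    list(coll.find({"a": {"$regex": "^AB"}}))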
[20:03:02] <icedstitch> thank you kchodorow
[21:15:03] <tomlikestorock> why would my python script that connects to a replica set hang at the end of execution and not close the connection to mongo?
[21:18:11] <Bilge> How do you make the index of an embedded document unique only to the containing document and not the entire collection?
[21:20:02] <ron> I don't think you can.
[21:22:05] <Bilge> Neither do I
[21:22:13] <Bilge> And this makes me VERY SAD
[21:22:36] <Bilge> In fact I don't even understand what it is that is unique about a compound index comprising a root document's id and a child document's field
[21:23:43] <Bilge> I thought making a unique compound index comprising the parent id and an embedded document's field would be a way to make that embedded document's field unique only in the context of the parent document but it does not
[21:23:49] <Bilge> There appears to be nothing unique about it at all
[21:40:49] <crudson1> Bilge: a unique index simply means that a combination of the keys specified must be unique in the collection
[21:41:41] <Bilge> So then I SHOULD NOT be able to SPECIFY that the parent ID and the embedded field is UNIQUE and yet STILL have the same embedded field with the same value in the same document!
[21:41:44] <Bilge> It seems like a bug
[21:42:24] <Bilge> There is NOTHING UNIQUE about a compound index comprising a field from the parent document and a field from an embedded document
[21:43:32] <Bilge> You can have a myriad of duplicate embedded field values regardless of the unique index constraint
[21:44:34] <Bilge> The only way to make it really unique is to create a single-field unique index for the field in the embedded document, e.g. { embedded.field : 1 } but this makes the field unique across all documents, not just the parent document to which the embedded document belongs
[21:44:40] <Bilge> ARE YOU FOLLOWING ME?
[21:44:45] <Bilge> Because this shit is really frustrating
[21:45:38] <Bilge> Either this is a massive bug or I am massively misunderstanding a core concept of indexing
[21:47:03] <Bilge> To put it another way
[21:47:35] <Bilge> ensureIndex({ _id: 1, 'embedded.field' : 1 }, { unique: true }) === ensureIndex({ _id: 1, 'embedded.field' : 1 }, { unique: false })
[21:47:42] <Bilge> In this context unique does NOTHING
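What Bilge is running into is documented behaviour rather than a bug: a unique index constrains key combinations across separate documents, but it does not stop a single document's array from holding the same embedded value twice. A pymongo sketch of the effect, with $addToSet as one way to keep duplicates out of an array (field names are illustrative):

    # Sketch: unique compound index vs duplicates inside one document's array.
    from pymongo import MongoClient, ASCENDING

    coll = MongoClient()["test"]["parents"]
    coll.create_index([("_id", ASCENDING), ("embedded.field", ASCENDING)], unique=True)

    # Not rejected: the duplicate values live inside the *same* document.
    coll.insert_one({"_id": 1, "embedded": [{"field": "x"}, {"field": "x"}]})

    # $addToSet only appends the subdocument if an identical one is not already present.
    coll.update_one({"_id": 1}, {"$addToSet": {"embedded": {"field": "y"}}})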
[21:49:10] <kida78> is there a bug in the mongodb shell that doesn't allow you to do queries while comparing to a float / double?
[21:49:44] <kida78> db.users.count({"percent" : { $gte : 0.2 }})
[21:49:46] <kida78> for example
[22:10:50] <lacker> is there a good way to get a graph of average mongo query latency over time
[22:10:54] <lacker> like will MMS provide that?
[22:21:44] <lacker> another question: in general, are there some types of error where a client like the ruby connection object will need to be recreated after that sort of error?
[22:32:43] <tomlikestorock> In my web app (Pyramid based), is it good practice to open a replica set connection on every request and attach it to the request object, or should I open one for the app and attach a reference to it on every request? I've also noticed that the ReplicaSetConnection class causes my web app scripts to hang because it doesn't delete itself when the script is done running. How do I prevent this?
[22:33:38] <wereHamster> tomlikestorock: or pool connections and take one that is available or wait for one to become available
[22:34:07] <wereHamster> generally it's a bad idea to open a new connection to the database for each request.
[22:34:14] <wereHamster> either pool them or open a single one
[22:35:17] <tomlikestorock> wereHamster: right, that's what I figured. Okay. Now, what's the deal with the ReplicaSetConnection preventing my scripts from finishing?
[22:35:22] <tomlikestorock> they just hang at the end :(
[22:35:51] <wereHamster> no idea. Which programming language are you using? Also, it's past midnight. I'm off.
[22:35:56] <tomlikestorock> pymongo
[22:36:03] <tomlikestorock> (python)
[22:36:05] <tomlikestorock> ah, okay
[22:36:12] <wereHamster> I don't do python. I'd rather sleep than use python :)
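On the hang tomlikestorock describes: the replica-set client keeps a background monitor thread running, so a short-lived script should close the client explicitly when it finishes. A hedged sketch with current pymongo (the URI and names are illustrative; older versions spelled the class ReplicaSetConnection):

    # Sketch: close the client so its monitoring threads stop and the script can exit.
    from pymongo import MongoClient

    client = MongoClient("mongodb://host1,host2/?replicaSet=rs0")  # illustrative URI
    try:
        client["app"]["jobs"].find_one()
    finally:
        client.close()  # shuts down background monitoring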
[22:38:34] <winterpk> Good morning
[22:39:40] <Smith_> I am trying to run this example "https://github.com/mongodb/node-mongodb-native/blob/master/examples/simple.js" on Node.js v0.8.7 and getting this error "Error: TypeError: Cannot read property 'BSON' of undefined" on line 9's code. Any suggestions please?
[22:39:49] <winterpk> Can someone please tell me what the radius variable is supposed to be in this example: db.places.find({"loc" : {"$within" : {"$center" : [center, radius]}}})
[22:40:08] <winterpk> integer in miles?
[22:50:43] <winterpk> does anyone know the answer to my problem?
[22:52:22] <Smith_> How is this related to Mongo? The variable's units are defined by the data or the definitions of the problem
[22:52:53] <winterpk> It is a mongo query using 2d index with $center
[22:53:00] <winterpk> which is also a mongo keyword
[22:54:21] <winterpk> Lol how is it anything but a mongo question?
[22:54:30] <Smith_> ???
[22:54:46] <Smith_> The formula for a center is agnostic to its units
[22:55:17] <winterpk> Guess I don't get geospatial indexing totally
[22:55:28] <Smith_> also check http://stackoverflow.com/questions/4143556/in-mongodbs-query-what-unit-is-center-radius-in
[22:56:27] <winterpk> thanks that helps
[22:56:47] <Smith_> no problem
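To make the units concrete: with a 2d index the $center radius is expressed in the same units as the stored coordinates (degrees for longitude/latitude data), while $centerSphere takes radians. A pymongo sketch (collection name and coordinates are illustrative):

    # Sketch: $center radius uses the same units as the stored coordinates.
    from pymongo import MongoClient, GEO2D

    places = MongoClient()["test"]["places"]
    places.create_index([("loc", GEO2D)])

    center = [-73.97, 40.77]  # lon/lat in degrees
    radius = 0.5              # also in degrees, because the data is in degrees

    # $geoWithin is the current spelling of the $within used in the question above.
    list(places.find({"loc": {"$geoWithin": {"$center": [center, radius]}}}))

    # $centerSphere expects radians: distance in km divided by the earth's radius (~6378.1 km).
    list(places.find({"loc": {"$geoWithin": {"$centerSphere": [center, 5 / 6378.1]}}}))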
[23:31:44] <Smith_> What is the meaning of native parameter (true/false) when opening a DB? (or native in general in the context of mongodb?)
[23:49:20] <tomlikestorock> is there some way that my replicasetconnection from pymongo can close its own monitor thread upon deletion?