PMXBOT Log file Viewer

#mongodb logs for Tuesday the 11th of June, 2013

[06:08:17] <shashi> Hey, so I have a basic normalization question. I have Projects having many Tasks and Users, Tasks have many Assignees (who are Users). Now when a user logs in, I want to show all the projects he is part of and all the tasks he is assigned. Also, of course when viewing a task (or project), I show who the assignees are. Should I create a new collection to
[06:08:17] <shashi> store Project - Tasks, Project - Users, Tasks - Assignee relationships? How is this done in day to day practice with mongo?
[06:11:06] <shashi> s/Should I create a new/Should I create multiple new/
[07:25:42] <ncopa> hi
[07:26:10] <ncopa> guys, i am packaging mongodb for alpine linux
[07:26:16] <ncopa> (which uses uclibc)
[07:26:31] <ncopa> after a few patches i made it build
[07:26:32] <ncopa> but...
[07:26:40] <ncopa> >>> mongodb*: Package size: 104.3 MB
[07:26:44] <ncopa> 104MB!!!!
[07:26:55] <ncopa> that excludes the libs and the server
[07:27:44] <ncopa> http://sprunge.us/TIjQ
[07:28:47] <ncopa> is that intentional?
[07:29:06] <ProLoser> is there an easy way to export from one db to another?
[07:30:42] <ron> mongodump?
[07:30:51] <ron> mongoexport?
[07:31:26] <kali> db.copyDatabase
[07:31:35] <ProLoser> if i kill the mongo connection will i lose it?
[07:31:40] <ProLoser> it's 2 separate connections
[07:32:29] <ncopa> is it normal that each /usr/bin/mongo* file is 9.5MB? or have I dont something wrong (link static when it should been dynamic)
[07:33:26] <ncopa> done*
[07:42:46] <ProLoser> fuccck
[07:43:29] <ron> I guess you got the answer to that.
[07:45:54] <ProLoser> i only have the db info to the destination, not the source, the source is in mongo_hub, but it hangs when i export
[07:54:55] <kali> ncopa: mine are 16MB...
[07:59:18] <ncopa> kali: ok, i assume the bloat is intentional then... :-/
[07:59:23] <ncopa> thanks
[07:59:27] <ProLoser> anyone use mongohub?
[07:59:41] <ncopa> not really surprised, its easy to make things bloaty with c++
[08:01:01] <ncopa> i bet with some dynamic linking the size of /usr/bin/mongo* could be reduced to 1/10 or so
[08:02:51] <kali> ProLoser: i don't, but the current maintainer is a colleague
[08:03:06] <ProLoser> fotonaut?
[08:03:12] <ProLoser> i am trying to use its export feature
[08:03:16] <ProLoser> or lack thereof
[08:07:06] <Torsten> does someone know if there's a dedicated irc channel for mongoose?
[08:27:18] <Nevaron_> Hi, is anyone here?
[08:27:43] <Nevaron_> Hi
[08:32:55] <ncopa> hi, i think there are 311 in here
[08:33:05] <ncopa> or 310. bye
[08:34:12] <Derick> Nevaron_: just ask your question, people will answer if they can help
[08:34:29] <Nevaron_> Ok
[08:34:37] <Nevaron_> well I'm having a major problem with aggregate
[08:34:43] <Nevaron_> I'm trying to use group
[08:34:54] <Nevaron_> but I don't know how to select fields
[08:35:04] <Nevaron_> how do I translate this sql:
[08:35:13] <Nevaron_> select * from myTable group by name
[08:35:45] <Nevaron_> if i do { $group: { _id: "$name" } } I only get the name field in the results
[08:49:16] <kali> Nevaron_: you need to project/aggregate the other fields you want... look for $sum, $max, $min, etc
[08:49:47] <Nevaron_> yeah but what if I don't need a sum, max or anything
[08:49:53] <Nevaron_> just the field's value
[08:50:35] <kali> Nevaron_: which one ? you're grouping, so there are potentially several values
[08:50:58] <kali> Nevaron_: if they're all equal, you can use $max or $min
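A minimal shell sketch of what kali describes, assuming the collection is called myTable and the extra field to carry along is called city:

    db.myTable.aggregate([
        // one result document per distinct name, like GROUP BY name in SQL
        { $group: {
            _id: "$name",
            city: { $max: "$city" },   // carries a field along; fine when every value in the group is equal
            count: { $sum: 1 }         // number of documents sharing this name
        } }
    ])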
[08:51:07] <Nevaron_> OK
[08:51:22] <Nevaron_> do you have a few minutes to help me figure out my query?
[09:35:58] <Nevaron_> Can anyone help me with this please: http://stackoverflow.com/q/17040100/1313143
[09:44:08] <vargadanis> hello everyone!
[09:44:30] <vargadanis> is there maybe a reference DB somewhere that I could use to play around with mongodb?
[09:44:56] <vargadanis> it is not the most exciting thing to build up a db for learning purposes :)
[09:45:04] <ron> vargadanis: http://try.mongodb.org/
[09:45:49] <vargadanis> ron, I was thinking more on the lines of a DB export that I could import locally and fiddle around with it
[09:46:18] <ron> ingrate.
[09:46:20] <ron> :)
[09:50:26] <vargadanis> http://lmgtfy.com/?q=define%3Aingrate
[09:51:00] <Nodex> there are a few data sets knocking around, depends what you're after really
[09:51:43] <vargadanis> Nodex, to be honest, does not really matter... something on which I could experiment with queries, updates etc...
[09:52:04] <vargadanis> it would be nice if it had embedded documents, maybe on multiple levels
[09:53:00] <Nodex> http://www.json-generator.com/
[09:53:03] <Nodex> make your own ;)
[09:56:07] <vargadanis> sweet!
[09:56:19] <ron> O_O
[09:56:29] <ron> I can't believe he accepted that as an answer.
[09:58:02] <vargadanis> why not? :D that's exactly what I was looking for.. some lazy tool that generates data for me hehe
[09:58:20] <vargadanis> lazyness is good for health :)
[09:58:49] <Nodex> :D
[10:00:59] <stennie> @vargadanis: twitter is a good source of json data which you can import to experiment with. eg: https://search.twitter.com/search.json?q=mongodb
[10:01:58] <stennie> the manual also has example data + queries for aggregation framework: http://docs.mongodb.org/manual/tutorial/aggregation-examples/
[10:16:54] <foofoobar> Hi, is it slower to search for a string if the string is a sentence?
[10:17:56] <Derick> you'd want to look at the experimental text search in that case (I suppose you meant a regex search there?)
[10:18:16] <foofoobar> Derick, okay, so then I will think of a different way
[10:19:29] <Derick> had a look at text search?
[10:19:38] <Derick> (you didn't answer my question either btw)
[10:23:15] <shashi> Hello, I have a simple normalization question, I asked before but no one was around to answer I guess.
[10:23:15] <shashi> I am making an app that has Tasks which have many Users (as Assignees). I want to display all the tasks a user is assigned as well as all the users assigned to a task. What is the right way to store data in Mongo for this?
[10:23:56] <ron> oh, there are so many ways to solve it...
[10:24:21] <ron> it'd be difficult to know which one would suit you best though, without going into quite a bit more details.
[10:24:31] <foofoobar> Derick: no.. Let me explain what I want to do: Users should be able to search for a city and all results which are near this city should be shown.
[10:24:43] <foofoobar> To get a geo location for the city I plan to do the following:
[10:24:53] <Derick> ah...
[10:24:55] <ron> but for bidirectional relations between entities, I'd consider either using an external indexing, or using a link entity.
[10:25:24] <foofoobar> 1) Look in mongodb if there is a {shortname: $city}. If not, I make an API call to nominatim (openstreetmap api), get the record and write it into the database
[10:25:45] <foofoobar> the problem is that there are some cities which have the same name, so the "display_name" given by nominatim is something like:
[10:26:00] <foofoobar> "Plittersdorf, Bad Godesberg, Bonn, Nordrhein-Westfalen, Deutschland"
[10:26:23] <shashi> ron: I thought of storing assignees inside Tasks, but again, to fetch tasks for a particular user, you'll need to go through all tasks to see if the user is assigned. Right?
[10:26:24] <ron> shashi: nothing to look up really. it's just a collection where you say {source: entity id, target: entity_id} if you need a directed link or just {entities: [id1, id2, id3...]
[10:26:30] <ron> forgot the } at the end.
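A sketch of the link-entity collection ron describes; the collection and field names (assignments, task_id, user_id) are made up:

    var someTaskId = ObjectId(), someUserId = ObjectId()   // stand-ins for real _ids
    // one document per task<->user link
    db.assignments.insert({ task_id: someTaskId, user_id: someUserId })
    // index both directions so either lookup stays cheap
    db.assignments.ensureIndex({ task_id: 1 })
    db.assignments.ensureIndex({ user_id: 1 })
    // all tasks for a user, and all users for a task
    db.assignments.find({ user_id: someUserId })
    db.assignments.find({ task_id: someTaskId })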
[10:26:42] <shashi> Oh.
[10:26:45] <shashi> Heh
[10:26:59] <foofoobar> Derick: so I thought of the following: I save two values: ShortName (just "Plittersdorf") and the whole DisplayName (Plittersdorf, Bad ....)
[10:27:03] <ron> shashi: yes, which is why I'd suggest using an external indexer, though in some cases mongo's internal indexing would suffice.
[10:27:19] <ron> again, difficult to say without digging deep into the use case and possible future use cases.
[10:27:22] <foofoobar> Now when a user searches for "Plittersdorf" I will make a lookup for the ShortName. If there are more than one let the user decide
[10:27:26] <foofoobar> If not take the first
[10:27:34] <shashi> ron: That's what I ended up doing actually. Just wanted to ask you all once if there's anything I could do better
[10:27:34] <foofoobar> is this a good approach or am I overlooking something I could do with mongodb?
[10:28:13] <Derick> foofoobar: one sec, on the phone
[10:28:14] <ron> shashi: 'better' is a relative term.
[10:28:19] <foofoobar> Derick: okay, np
[10:28:21] <Derick> foofoobar: i have used OSM extensively
[10:28:47] <foofoobar> Derick: nice, then I have found the right person to talk with :)
[10:29:39] <shashi> ron: haha, I understand you'll need to know more about the app to give a more relevant answer. But this is about how much I have thought about the app itself :P
[10:30:41] <shashi> ron: you mean I store {entities: [id1, id2, id3 ...]} both in the tasks and users collection?
[10:30:47] <remonvv> \o
[10:31:07] <shashi> * in both
[10:31:26] <ron> shashi: the main concern is the query types that you need, possible future queries and which query is done more often. it would also depend on how you update the data. even in a 'small' use case there are many relevant questions.
[10:33:45] <shashi> ron: okay, so there's no usual way to do this. What I did is create a collection that simulates an "edge table" {left_id: source, right_id: target, edgeData: arbitrary_data}
[10:34:21] <ron> shashi: the only thing I can tell you for sure is that you DON"T want to keep a bi-directional link between the entities INSIDE the entities.
[10:34:26] <shashi> ^ that's the format of the documents I store
[10:34:38] <ron> in that case, you'll end up in inconsistency hell.
[10:35:53] <shashi> Oh yes. When I delete a task, I'll need to make sure all the related entries go away etc :/
[10:36:02] <Derick> foofoobar: right
[10:36:21] <ron> shashi: yes, that's a way to do it. I'd probably use an external indexing, but that's my preference. I'd normally need to have external indexing for other reasons, so I might as well use it to solve that as well.
[10:36:22] <foofoobar> Derick: can we private message? Because it's slightly offtopic
[10:36:46] <remonvv> shashi, stuck to MongoDB? Sounds like a graph db or rdbms might be a better fit for you.
[10:36:59] <Derick> foofoobar: sorry, in channel only
[10:37:18] <foofoobar> Derick: okay ;) So I want to realise a "search results which are nearby <city>"
[10:37:23] <Derick> right
[10:37:30] <Derick> so you store city + coordinates?
[10:37:37] <shashi> remonvv: I can manage with mongo I think, I am using meteor. So, yes, stuck to mongo
[10:37:51] <foofoobar> Derick: not yet, I have the OSM data set, but this does not contain a simple city+coords
[10:38:02] <foofoobar> I figured out that the OSM nominatim has this feature
[10:38:03] <Derick> I know :)
[10:38:06] <Derick> yes, it does
[10:38:30] <foofoobar> so I want to do it like "if entry is stored in db, return this, else send request to nominatim-api"
[10:38:37] <Derick> right
[10:38:39] <foofoobar> my problem is how I should store this in mongodb
[10:38:55] <foofoobar> because I get values back like "Plittersdorf, Bad Godesberg, Nordrhein-Westphalen, Deutschland"
[10:39:00] <Derick> well, the question is, how do you want to use the data that you store, how are you going to query against it?
[10:39:10] <foofoobar> And a user does not search for exactly this, he searches for "plittersdorf"
[10:39:14] <Derick> right
[10:39:32] <foofoobar> so I was thinking of saving two entries: 1) ShortName 2) DisplayName
[10:39:36] <shashi> ron: what is an external indexer? Have a pointer?
[10:39:43] <Derick> foofoobar: and coordinates I suppose? :-)
[10:39:51] <foofoobar> Derick: yes, of course ;)
[10:39:57] <foofoobar> So the user searches for ShortName
[10:40:05] <Derick> and plittersdorf already returns multiple results
[10:40:07] <foofoobar> And if this is found, he can choose between the values which were found
[10:40:11] <foofoobar> right
[10:40:19] <foofoobar> (if there are multiple results for this city name)
[10:40:25] <Derick> right
[10:40:30] <foofoobar> Is that a good approach ?
[10:40:35] <ron> shashi: Solr, ElasticSearch, etc
[10:40:38] <Derick> and you want to query later against the full name too?
[10:40:52] <Derick> but yes, so far, so good
[10:41:11] <foofoobar> Derick: The user searches for "plittersdorf" and gets .. lets say 3 results. Now I display the 3 full DisplayNames, so he can take the right one
[10:41:12] <Derick> will you have more than one entry for "plittersdorf" if two users pick a different Long name?
[10:41:18] <foofoobar> Then I will query for the full DisplayName
[10:41:20] <shashi> ron: Ah, cool.
[10:41:22] <remonvv> shashi. Okay ;) I would give your task object an array of assigned user ids.
[10:41:37] <foofoobar> I will gist an example, one moment
[10:41:40] <Derick> k
[10:41:54] <remonvv> Provided that's the actual use case that will give you the best performance for the most common tasks.
[10:42:51] <remonvv> On the assumptions that: a) there are significantly fewer users than tasks eventually, b) you most commonly need to query work for a specific user rather than users for specific work.
[10:42:56] <shashi> remonvv: Let's say it will give good performance for half of the tasks, the other half is showing a list of tasks which the user is assigned
[10:43:58] <remonvv> shashi, that's not a problem. You have two routes for that. Simply execute two queries or, the more common approach in NoSQL, denormalization :
[10:44:00] <foofoobar> Derick: https://gist.github.com/anonymous/82673bc928b46542d3d0
[10:44:20] <remonvv> {taskName:.., assignees:[{userId:..., userName:..., avatarUrl...}, ...]}
[10:44:25] <foofoobar> Derick: If a user searches for "Plittersdorf" I get multiple results, so I display him both "DisplayNames" and he can choose the right one
[10:44:43] <remonvv> Make assumptions like "a user name is unlikely to change", etc.
[10:44:51] <foofoobar> Derick: If he searches for "Bonn" there is only one result, so he instantly gets this returned
[10:44:58] <remonvv> For visualisation purposes that gives you single queries for most things.
[10:45:12] <foofoobar> Derick: if he searches for "XYZ" which is not in the database, I will make an API call to nominatim and save this into the database
[10:45:21] <shashi> remonvv: hmm, okay what's wrong with storing user _ids instead of the whole user document?
[10:45:51] <remonvv> shashi, there's nothing wrong with it. And I'm certainly not arguing that you embed the user document in the task document.
[10:46:16] <shashi> remonvv: also, I think if I am querying "list of tasks for user X", I'll first need to get all the tasks and go through their assignees to see if X is part of them, no?
[10:46:17] <remonvv> If you just store an ID you're absolutely certain every task->user query will actually require 2 separate queries.
[10:46:36] <Derick> foofoobar: k
[10:46:40] <shashi> remonvv: oh, that way
[10:46:42] <remonvv> shashi, yes, and that's a single fast query.
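A sketch of the embedded-assignees layout remonvv is advocating; the collection name tasks and the field names are assumptions:

    var someUserId = ObjectId()   // stand-in for a real user _id
    db.tasks.insert({
        taskName: "write the report",
        assignees: [ { userId: someUserId, userName: "shashi" } ]
    })
    // a multikey index on the embedded id makes "tasks for user X" a single fast query
    db.tasks.ensureIndex({ "assignees.userId": 1 })
    db.tasks.find({ "assignees.userId": someUserId })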
[10:46:45] <foofoobar> Derick: sounds reasonable this approach?
[10:46:49] <Derick> foofoobar: yes
[10:46:58] <foofoobar> I was wondering if there is a better way, but this is like a local cache
[10:47:14] <foofoobar> so I will save bandwidth and will not exceed the quota for the nominatim requests
[10:47:14] <remonvv> shashi, I'm telling you to store in the task document what you most typically need for visualisation (username, avatar if supported) but nothing more.
[10:47:23] <Derick> foofoobar: just two tips, use a 2dsphere index for the coordinates and a geojson storage container (instead of an array with lon, lat)
[10:47:36] <Derick> foofoobar: yes, otherwise the sysadms get angry :-)
[10:47:41] <remonvv> That way if you query task X your UI can immediately display assignees. Only when people click on that user will you have to load the profile or something.
[10:47:49] <shashi> remonvv: Ah, okay. Hmm, but I did not really understand why you made the assumption that eventually there will be more tasks than users. Can you explain?
[10:47:50] <remonvv> NoSQL floats on denormalization, don't resist.
[10:48:07] <foofoobar> geojson storage container? haven't heard of it before
[10:48:24] <foofoobar> The benefits are that I can use things like $near etc. for this, right?
[10:48:25] <shashi> remonvv: Okay, yes, that makes sense.
[10:48:26] <Derick> foofoobar: http://docs.mongodb.org/manual/core/2dsphere/
[10:48:30] <remonvv> shashi, most systems that have tasks and users have that pattern. E.g. a project management database would have dozens of users but thousands of tasks.
[10:48:40] <remonvv> If your system has the opposite pattern simply do the opposite schema ;)
[10:48:49] <remonvv> If it's roughly equal choose either.
[10:48:57] <Derick> foofoobar: see also http://derickrethans.nl/talks/mongo-geo-stockholm13
[10:49:06] <remonvv> Only use edge tables if your performance for common queries is very bad.
[10:49:11] <foofoobar> Derick: okay, thanks
[10:49:11] <shashi> remonvv: I was asking why you made the assumption though :P
[10:49:26] <shashi> remonvv: Cool.
[10:49:46] <remonvv> shashi, experience. Most systems that manage some sort of tasks have less users than they have tasks.
[10:50:02] <remonvv> tasks are created by users, users are not created by tasks, hence one should generally outnumber the other.
[10:50:17] <shashi> remonvv: I meant why you lead to the conclusion that userids be contained in tasks from there.
[10:51:39] <shashi> remonvv: s/there/the assumption/
[10:51:51] <shashi> remonvv: s/meant/meant to ask/
[10:52:18] <remonvv> Simple, if a user can have X tasks and a task can have Y users you want to embed task IDs into user if X>Y and vice versa.
[10:52:26] <remonvv> You want embedded structures to be as small as possible.
[10:52:33] <remonvv> Having huge arrays makes indexes huge.
[10:52:39] <shashi> remonvv: Ah, right.
[10:52:40] <remonvv> And less selective.
[10:52:58] <Nodex> indexes over 1kb dont get indexed either
[10:53:04] <remonvv> Makes sense right? If you grab a task in this schema you get a task and a few users.
[10:53:10] <Nodex> s/indexes/fields
[10:53:15] <remonvv> If you grab a user with 1093 assigned tasks you get a HUGE document.
[10:53:21] <shashi> remonvv: Yes it does.
[10:53:24] <remonvv> That has to be read, moved when updated and so forth.
[10:53:25] <remonvv> ;)
[10:53:44] <remonvv> NodeX, true but every array element is indexed separately.
[10:57:19] <foofoobar> Derick: nice slides! the examples are very good
[10:58:00] <foofoobar> Derick: Is it right that as soon as I use a field with the structure { type: <type>, coordinates: <coords> } it is interpreted as a geojson field?
[10:58:43] <Derick> yes, as long as you use a 2dsphere index on it
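A sketch of what Derick recommends, with made-up collection/field names and placeholder coordinates:

    db.cities.insert({
        shortName: "Plittersdorf",
        displayName: "Plittersdorf, Bad Godesberg, Bonn, Nordrhein-Westfalen, Deutschland",
        loc: { type: "Point", coordinates: [ 7.16, 50.70 ] }   // GeoJSON order is [ longitude, latitude ]
    })
    db.cities.ensureIndex({ loc: "2dsphere" })
    // results near the chosen city, closest first, within roughly 25 km
    db.cities.find({
        loc: { $near: { $geometry: { type: "Point", coordinates: [ 7.16, 50.70 ] },
                        $maxDistance: 25000 } }
    })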
[11:05:18] <deepy> is there any good way I can check the amount of inserts/s I can do on my dev machine?
[11:06:01] <deepy> I think it seems to do at most 2,000 inserts per second, which seems low to me since I'm setting _id myself
[11:06:22] <deepy> I'm using the java driver
[11:08:04] <deepy> okay, even worse, I'm doing 1.7k
[11:08:19] <quattro> starting mongodb with --configsvr only thing that does is change the port?
[11:12:02] <agend> hi
[11:12:20] <agend> i have a question
[11:12:52] <agend> i'm going to keep stats data in mongodb, and the data will be agregated in hrs, days, months
[11:13:43] <agend> what is the best way to get a sum of some value in these different collections in one go?
[11:14:48] <agend> i mean I'd like to ask for sum for hours in some range and then some days in other range, and some months in anoter range
[11:15:08] <kali> agend: aggregation framework
[11:15:11] <agend> and i'd like to avoid making 3 queries
[11:15:37] <kali> agend: or map/reduce if aggregation framework is not expressive enough
[11:15:39] <agend> can the aggregation framework aggregate across collections?
[11:15:44] <kali> ha ! no.
[11:15:52] <kali> sorry, i had missed that bit
[11:15:57] <agend> can map reduce do it?
[11:16:04] <kali> kindof, yes
[11:17:01] <kali> it's not what you want to hear, but be aware mongodb is not a good fit for time data
[11:17:29] <agend> what should i use then?
[11:17:37] <kali> rrd or whisper
[11:17:39] <Derick> agend: maybe, but you should not do an operation over multiple collections ever
[11:18:11] <agend> Derick: maybe what? i dont follow
[11:18:11] <kali> (whisper being the library behind graphite tools)
[11:18:25] <Derick> agend: operations in MongoDB *never* happen on more than one collection at a time
[11:18:33] <Derick> with M/R you can cheat
[11:18:49] <kali> +1 it's going to be ugly, cumbersome, and inefficient
[11:19:26] <agend> i'm going to use mongo with python, and was thinking about sending js code to mongo to avoid few db hits
[11:19:31] <Derick> I would suggest that you research whether you can pre-aggregate instead of storing the full stats
[11:19:54] <kali> irk, js calls
[11:20:07] <agend> i want to pre-aggregate, thats why i will have clicks_hr, clicks_day, clicks_month
[11:20:26] <Derick> agend: why do you need differente collections for that?
[11:20:42] <agend> it's too slow to query hr
[11:21:16] <Derick> if you pre-aggregate you should be able to get one simple query
[11:21:19] <agend> and when i have to get sum of clicks for something what is not full day/month i have to query few collections
[11:21:39] <Derick> but you haven't shown *what* you are storing, and how... so it's a bit theoretical for now
[11:22:01] <deepy> If I'm just inserting a document with an _id using the java driver, is 5k inserts/s considered good speed? I'm trying it with both my laptops and Windows/Linux gives roughly the same numbers :/
[11:22:02] <agend> what if i need sum for 2013-01-01 15:00 to 2013-07-12 17:00
[11:22:30] <remonvv> deepy, safe writes?
[11:22:36] <Derick> deepy: depends on your hardware really...
[11:22:46] <Derick> agend: depends on how you store your data
[11:23:13] <deepy> This one is a Lenovo W520 with 8GB of RAM, I was expecting slightly better speed :/
[11:23:44] <agend> Derick: like i said, it follows the mongodb use case Hierarchical Aggregation
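A sketch of the pre-aggregation pattern being discussed: one document per counter per day with hourly buckets, updated in place (collection and field names are made up):

    // record one click: bump the daily total and the bucket for hour 15, creating the document if needed
    db.clicks_day.update(
        { _id: "site42:2013-06-11" },
        { $inc: { total: 1, "hours.15": 1 } },
        { upsert: true }
    )
    // any range of hours within the day is then a single-document read
    db.clicks_day.findOne({ _id: "site42:2013-06-11" })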
[11:23:53] <deepy> remonvv: I'm not sure, I'm doing coll.insert(new BasicDBObject("_id", i));
[11:24:41] <deepy> I've not really touched any settings whatsoever
[11:27:10] <kali> 5k/s is not that bad, honestly
[11:27:36] <fredix> hi
[11:27:50] <fredix> when a GridFile is open, should I close it ?
[11:28:14] <deepy> kali: { _id: i } is all I'm inserting, this seems rather small to me and 5k/s seems rather bad becuase of that
[11:28:25] <deepy> well, id is a long
[11:28:35] <deepy> err, i is a long that increases with every iteration
[11:29:22] <kali> deepy: are yu after a real life use case, or just trying to bench ?
[11:29:29] <Derick> deepy: the default is write acknowledged inserts... you might want to turn that off
[11:29:39] <kali> and try bulk insert
[11:32:30] <deepy> kali: if it can only handle 5k/s then it's slower than what I wrote myself and I can't replace it
[11:32:34] <deepy> and I don't want to use what I wrote myself
[11:33:46] <deepy> unacknowledged does 60k/s
[11:34:00] <deepy> Can I speed this up further :D?
[11:34:28] <Nodex> sharding :)
[11:34:54] <Derick> batchinserts
[11:35:14] <deepy> can I shard on a single computer and gain any increase in speed?
[11:36:20] <Nodex> depends on your CPU
[11:58:52] <stefuNz> Hey, i need to store a change history from my entities (from mysql) and i'm evaluating if mongodb would be suitable. those are literally millions (100 million+) change-entities that i would need to store and i would need to be able to query them by entity_id (the identifier of my entity), the change type and the change date (like in mysql WHERE entity_id = 123 AND type = 'THATFIELDOVERTHERE' AND when > '14 days ago') - can mongodb
[11:58:53] <stefuNz> handle those situations well?
[12:03:36] <kali> stefuNz: so far so good
[12:05:57] <stefuNz> kali: it would be possible that the data size would be greater than my memory, how would that behave? i summarized the question in this stackoverflow question: http://stackoverflow.com/questions/17042760/storing-changes-on-entities-in-mysql-scalability -- i'm already using mongodb for some other stuff, but i'm not an expert.
[12:10:54] <kali> stefuNz: it would be slower. depending on what you're using it for, it may still be good enough as long as the index can fit in memory
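stefuNz's query translated to the shell, as a sketch (the collection name changes is an assumption):

    db.changes.ensureIndex({ entity_id: 1, type: 1, when: 1 })
    db.changes.find({
        entity_id: 123,
        type: "THATFIELDOVERTHERE",
        when: { $gt: new Date(Date.now() - 14 * 24 * 3600 * 1000) }   // "14 days ago"
    })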
[12:11:23] <tiriel> Hello everyone!
[12:11:52] <tiriel> what would be the recommended way of rotating mongo journals on a production environment?
[12:12:11] <kali> tiriel: http://docs.mongodb.org/manual/reference/command/logRotate/
[12:12:22] <stefuNz> kali: thank you! do you have another idea how this could be reasonably stored (i mean maybe another database system?)
[12:13:11] <kali> stefuNz: i think mongodb is a good fit for this
[12:13:15] <tiriel> those are mongo logs, not mongo journals
[12:13:41] <kali> tiriel: ha, sorry
[12:13:56] <kali> i'm not sure i understand the question in that case :)
[12:14:27] <stefuNz> kali: OK i will try to evaluate how big the index would be ;) thank you
[12:31:13] <tiriel> oh, nevermind, it was an outstanding problem we had and it seems to have gone away now that we updated to the latest version of mongodb
[12:34:57] <zeen> i'm using python/mongokit and a bit confused about a relationship i'm trying to do. i'm very new to nosql way of things so learning. i have World and i have Island, for each island it will belong to a world. I know there are no real joins in mongo. what is the data type for the world_id? should the field be world, or world_id
[12:35:18] <zeen> http://pastie.org/8034285
[12:35:26] <zeen> i tried 'World' but it failed
[12:46:22] <zeen> actually, nm, i just forgot to import the World
[12:49:06] <ron> OMG
[12:49:10] <ron> YOU IMPORTED THE WORLD?!
[12:49:12] <ron> O M G
[12:49:33] <zeen> :)
[12:50:22] <deepy> Isn't that requirement a little heavy?
[12:51:40] <ecornips> Question re replication: I want to replicate a smallish data set (20GB) and keep it up to date across 4 different data centres (UK, US, AU, JP), where each data centre has the entirety of the information. Any of the DCs is able to be Master (low-volume, async updates). Is this possible? The election process looks like it supports Master election quite well. But not sure about the replication aspects across data centres. Is this something that should work ok -
[12:51:41] <ecornips> especially across continents?
[12:53:07] <kali> ecornips: it should work
[12:53:47] <ecornips> kali: would there be problems with the higher latency (e.g. 150-200ms between some DCs, others will be < 50ms)?
[12:54:39] <kali> ecornips: other people around here probably have more experience than i have with this
[12:54:57] <kali> ecornips: i haven't play with this (yet ?)
[12:55:02] <ecornips> kali: ok thanks, no worries :)
[12:55:53] <ecornips> kali: by chance have you ever done local replication (replica sets I guess), where a node has been down for a couple days, e.g. some hardware failure? I'm quite curious about those sorts of impacts
[12:56:06] <ecornips> as in, will it recover by itself ok?
[12:56:27] <ecornips> I've done a bit of testing locally, and things seem to work quite well, but I'm curious for other peoples experiences
[13:01:56] <starfly> deepy: all in a day's work for the NSA… :)
[13:04:32] <starfly> ecornips: I haven't replicated in that scenario with MongoDB, but have with other technologies. If you are able to tolerate the latency with writes from an application standpoint, it should work fine with MongoDB
[13:05:18] <ecornips> starfly: writes will be few and far between, so very tolerant there. If you don't mind me asking, what other technologies have you used in similar scenarios?
[13:05:33] <starfly> Oracle and filesystems
[13:06:02] <ecornips> ah ok - Oracle over geographies, must've been an interesting experience :)
[13:06:34] <n06> hey do you guys know how much overhead the system profiler adds to I/O operations? Trying to gauge how efficiently im writing to my db
[13:06:43] <starfly> ecornips: yes, both with Oracle Streams and Data Guard, from Europe to US to Japan to Singapore; all worked fine
[13:07:11] <ecornips> nice
[13:07:51] <starfly> ecornips: care and feeding from time to time (there's no avoiding WAN issues sometime, even when private)
[13:19:03] <n06> second question: can i read from replset secondaries while writing to the primary?
[13:23:04] <Derick> sure, you might end up getting a little stale data though
[13:23:46] <n06> Derick, thanks, in my case stale data shouldn't be an issue
[13:24:01] <n06> i just need high throughput
[13:26:23] <starfly> n06: if you need high write throughput, consider spreading your writes around with sharding
[13:26:42] <n06> I already have starfly, but thank you for the suggestion
[13:27:00] <starfly> n06: good deal
[13:27:06] <n06> im trying to tune my read preferences. I think ill go for primaryPreferred
[13:33:34] <kali> n06: you don't have to pick one for your whole app...
[13:34:05] <n06> really? so read preference is set within each query?
[13:34:27] <kali> sure
[13:34:37] <n06> ok thanks
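From the shell, for instance, the preference can be attached to a single cursor; the drivers expose the same per-query option:

    // this query prefers the primary but may read from a secondary if no primary is available
    db.mycollection.find({ status: "done" }).readPref("primaryPreferred")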
[14:21:46] <hectron> Hey guys, I had a question regarding scaling and MongoDB.
[14:22:30] <starfly> hectron: ask away, maybe someone will know an answer
[14:22:55] <hectron> I'm going to be building an application which is going to interact with a thousands of users on a daily basis, and I'm going to require an equal amount of read/writes. I intend to perform on-the-fly analytics on the data collected.
[14:23:02] <hectron> Is Mongo the choice for my web-app?
[14:23:43] <Nodex> depends on the type of "on the fly" analytics you're after
[14:24:36] <hectron> I'm likely going to need to use summary tables. I expect there to be billions of reads/writes in my application daily.
[14:24:59] <starfly> hectron: not sure of the pricing (likely very expensive), but SAP Hana might be an interesting avenue to pursue
[14:25:33] <Nodex> do you have the infrastructure to accommodate billions of reads/writes?
[14:25:40] <hectron> Definitely.
[14:25:59] <Nodex> that will probably entila sharding and replicas
[14:26:03] <Nodex> entail*
[14:26:32] <starfly> hectron: I'd be dubious about ability of MongoDB to scale up for that
[14:27:19] <starfly> hectron: if open source is a goal, perhaps Cassandra
[14:27:19] <Nodex> I wouldn't :)
[14:27:28] <hectron> I'm just a level 1 developer here, so I was given the task to research viable options. We wanted to leverage Mongo and set up sharding/replicas. However, I do not know much more than what I just stated. We have our own datacenters, so I'm sure that we have the infrastructure.
[14:27:48] <hectron> Ah, I see. I'll definitely research Cassandra as well.
[14:28:12] <Nodex> you won't get the same level of aggregations from cassandra
[14:28:15] <Derick> or a combination...
[14:28:22] <starfly> Hana is built for in-memory, on-the-fly analytics
[14:28:33] <Nodex> and it's really a different datastore concept
[14:28:40] <starfly> agree
[14:29:29] <hectron> ANd what would be the difference between Cassandra and MongoDB?
[14:30:01] <starfly> Cassandra is more of a name-value datastore, not the bells and whistles of MongoDB, but also can scale up much better
[14:30:10] <Nodex> cassandra is a column style store
[14:31:36] <starfly> Hana is also a column store at the lowest level, but you can reassemble row store representations
[14:31:59] <hectron> Interesting. I will definitely refer to these things when I report to my manager.
[14:32:08] <hectron> You guys are awesome!
[14:32:26] <hectron> I have only gotten this quality of responses from the Drupal community and Linode support.
[14:33:58] <starfly> best of luck, hectron, sounds like you'll have a full plate and an interesting use case
[14:34:18] <hectron> Thank you guys!
[15:01:13] <agend> hi
[15:01:44] <agend> is there any way to flatten map-reduce result - i dont want the value sub object ???
[15:02:16] <agend> any way to do it with finalize function?
[15:06:09] <parox> hey, if i have a mongo collection is there an easy way to get all the unique keys?
[15:07:03] <Nodex> distinct?
[15:07:30] <starfly> db.collection.find({},{_id:1})
[15:07:54] <parox> yes, assuming that not all models always contain every key
[15:08:13] <n06> how bad is it for read speeds to shard a collection by hashed _id field
[15:08:52] <Nodex> what's a model?
[15:09:53] <parox> sorry documents*
[15:10:27] <Nodex> someone has made something on github to get all of your keys but I can't remember what it's called
[15:10:43] <Nodex> it's a trivial thing to accomplish in a script anyway.. < 5 lines
[15:11:15] <parox> even if not all documents contain all keys?
[15:11:53] <Nodex> yes, just find() all docs
[15:16:59] <parox> if you mean find() and then parse keys, i think that would be slow with a few thousand docs
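A server-side sketch (roughly what the github tool Nodex mentions does): map-reduce over the keys of every document. It still scans the collection, but only the key names travel back to the client:

    db.mycollection.mapReduce(
        function () { for (var k in this) { emit(k, 1); } },    // map: one emit per top-level key
        function (key, values) { return Array.sum(values); },   // reduce: count how often each key appears
        { out: { inline: 1 } }
    )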
[15:18:36] <timstermatic> Hi there. I'm using the AF and I want to group numbers by rounding them up. So 119.10609436035156 would be 119
[15:25:20] <starfly> timstermatic: I'm not familiar with a rounding function, but perhaps you can project result as a string and perform a substring on that to at least truncate to an integer
[15:30:32] <starfly> future feature… :)
[15:32:20] <timstermatic> I could maybe use divide by 1
[15:32:24] <timstermatic> $divide
[15:32:44] <ron> try dividing by 0.
[15:32:46] <timstermatic> I'm happy just to get what is before the decimal point
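Another workaround that was not mentioned in the channel: $mod exists as an aggregation expression, so subtracting the remainder truncates a positive number to its integer part (the collection and field name price are assumptions):

    db.sales.aggregate([
        { $project: {
            // 119.10609436035156 - (119.10609436035156 mod 1) = 119
            rounded: { $subtract: [ "$price", { $mod: [ "$price", 1 ] } ] }
        } },
        { $group: { _id: "$rounded", count: { $sum: 1 } } }
    ])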
[15:47:06] <barnes> anyone managed to do logrotate via Linux logrotate script on mongod 2.4.4?
[15:48:30] <barnes> in the log i get Tue Jun 11 17:15:36.701 ERROR: failed to rename '/var/log/mongo/mongod.log' to '/var/log/mongo/mongod.log.2013-06-11T15-15-36': errno:2 No such file or directory
[15:48:34] <barnes> Tue Jun 11 17:15:36.702 Fatal Assertion 16782
[15:51:51] <Swimming_Bird> is there a way to limit the cursor for the distinct command?
[15:53:02] <Derick> distinct, being a command, doesn't return a cursor
[15:53:12] <Derick> and the command doesn't support "limit"
[15:53:22] <starfly> barnes: you've verified that /var/log/mongo directory exists and the /var/log/mongod.log file exists (admittedly, simple questions)?
[15:53:23] <Derick> you can however get around this by using the aggregation framework
[15:54:03] <Derick> Swimming_Bird: as pipeline operators you'd use: group on the key, (sort on the key), limit by x
[15:54:18] <Swimming_Bird> @Derick: would that be performant for large collections?
[15:54:29] <Derick> how large is large?
[15:54:46] <Swimming_Bird> millions of documents
[15:54:58] <Swimming_Bird> so medium largeness i suppose
[15:55:17] <Swimming_Bird> basically i dont want to scan the full collection
[15:55:28] <Derick> you will need that for distinct
[15:55:29] <Swimming_Bird> based on the index the query should be pretty quick
[15:55:39] <Derick> hmm
[15:55:40] <Derick> good point
[15:55:42] <Derick> let me check!
[15:56:45] <Derick> Swimming_Bird: hmm, no, the index doesn't come into play here I think
[15:57:10] <barnes> starfly: yes, I think it is because mongod tries to rename it to /var/log/mongo/mongod.log.2013-06-11T15-15-36 and logrotate to mongod.log.1
[15:57:23] <Swimming_Bird> basically i've got a group members collection, i want to do: where user_id: {"$in": […]}, order_by: 'updated_at': -1, limit: 5, distinct: 'group_id'
[15:57:36] <Derick> k
[15:58:09] <Swimming_Bird> the problem i'm getting is if i pass multiple user ids who are in the same group, i'll get 2 of the same group back and not get 5 unique groups
[15:58:10] <barnes> starfly: it actually works if i do kill -SIGUSR1 <PID> from the command line
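The same rotation can be triggered from the shell instead of sending SIGUSR1:

    db.adminCommand({ logRotate: 1 })   // mongod renames the current log file and opens a fresh one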
[15:58:26] <Swimming_Bird> i'll check out the aggregation framework maybe though
[15:58:30] <Swimming_Bird> never used before
[15:59:19] <Swimming_Bird> looks like aggregate supports indexes and limits. so it may be the way to go
[15:59:50] <Derick> you should be able to do : match, group_by: group_id, sort: { updated_at: -1 }, limit 5
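A sketch of the pipeline Derick outlines, using the collection and field names Swimming_Bird gave:

    db.group_members.aggregate([
        { $match: { user_id: { $in: [ /* the user ids */ ] } } },
        { $sort: { updated_at: -1 } },
        // one row per group, remembering its most recent activity
        { $group: { _id: "$group_id", updated_at: { $first: "$updated_at" } } },
        { $sort: { updated_at: -1 } },
        { $limit: 5 }
    ])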
[16:11:56] <Swimming_Bird> @Derick: do you know why they have the _id: null in the second example? http://docs.mongodb.org/manual/reference/aggregation/match/#stage._S_match
[16:13:00] <saml> hey do you use ObjectID or use some custom String id?
[16:13:02] <Derick> Swimming_Bird: to put all documents in the same group-by bucket
[16:13:28] <Swimming_Bird> how does _id: null accomplish that?
[16:13:46] <Swimming_Bird> Ohh, _id: isn't a real id
[16:13:52] <Derick> _id is the selector for group-by-buckets
[16:13:56] <Swimming_Bird> it's something you make up for the grouping
[16:14:05] <Derick> yes
[16:15:25] <barnes> starfly: https://jira.mongodb.org/browse/SERVER-4905 seems to handle standard logrotate in the future
[16:16:35] <starfly> thanks, barnes
[16:16:45] <saml> is casbah deprecated over ractivemongo?
[16:16:47] <saml> reactive
[17:00:52] <Derick> timstermatic: my local build now has a $round :-)
[17:05:28] <starfly> cool
[18:16:48] <grouch-dev> Hi all.
[18:17:24] <grouch-dev> trying to execute a delete in a replica-set but the delete query always fails, with mongo stacktrace.
[18:18:08] <grouch-dev> The stacktrace I assume is due to the "logOp but not primary"
[18:19:56] <grouch-dev> during the delete, our secondaries seem to be under extreme load from the replication
[18:20:25] <grouch-dev> we start seeing things like "thinks we are down" on all the servers, then the primary loses majority
[18:20:54] <grouch-dev> and the only operation is the single delete request that was issued from the primary
[18:21:08] <grouch-dev> other than that these boxes are just sitting there running mongod
[18:23:19] <starfly> grouch-dev: approximately how many documents are you attempting to delete? How large is the collection you're deleting from?
[18:23:43] <grouch-dev> the collection contains 10,000,000 documents
[18:23:50] <grouch-dev> the delete is for 200k
[18:24:00] <starfly> index used on the delete?
[18:24:46] <grouch-dev> yes, the index is based on a status field (string), and a date field
[18:25:00] <grouch-dev> so we first run the count() and get 200k
[18:25:15] <grouch-dev> and we verified index hit using explain()
[18:25:54] <grouch-dev> then tried to delete the records
[18:26:11] <grouch-dev> if we backup the dataset, and restore onto a single machine, this process works fine.
[18:26:38] <grouch-dev> however, our pre-prod and prod environments consist of 4 mongos and an arbiter
[18:26:47] <starfly> how many distinct status values are there? Is status first in the combined status-date index?
[18:29:15] <grouch-dev> I think there are only 2 states. I am trying to get an answer to the second question...please bear with me
[18:30:56] <grouch-dev> yes, the string is the first entry in the index
[18:31:40] <starfly> you could have a selectivity issue regarding the index. With that many documents, you'd need to have the date field first in a combined index, otherwise, MongoDB has to effectively scan a very large portion of the index, effectively rendering it suboptimal (putting it mildly)
[18:32:15] <grouch-dev> ok, there are 4 states
[18:32:33] <starfly> same answer, though, selectivity would be a problem...
[18:33:07] <grouch-dev> I am not sure I understand
[18:34:34] <starfly> with indexes (almost regardless of the database technology), you want your most selective field (field with the highest number of distinct values) to be the first field in any combined index, so your query optimizer can work from the most selective to the least selective, which greatly optimizes runtime performance
[18:35:18] <grouch-dev> makes sense
[18:35:55] <starfly> so, no doubt there's added overhead with replication, but I think you more have an issue with the way that index is defined
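A sketch of the reindexing starfly suggests; the collection and field names are guesses based on the description above:

    // put the more selective date field first in the compound index, then delete by range
    db.jobs.dropIndex({ status: 1, created_at: 1 })
    db.jobs.ensureIndex({ created_at: 1, status: 1 })
    db.jobs.remove({ created_at: { $lt: ISODate("2013-05-01") }, status: "done" })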
[18:36:09] <asciiduck> Currently I'm stowing some log data into a mongo collection so I can use mapReduce on the data. One thing I would like to do is map ip addresses from the log data to the country of origin so I can reduce and find numbers of based on country of origin. I'm trying to do that by making a call out to some of the APIs that offer that information, the problem I am encountering is that the javascript engine that mongo uses for processing map reduce functio
[18:36:49] <grouch-dev> starfly: a more broad question would be "why can't mongo handle a simple delete query"
[18:37:27] <starfly> grouch-dev: you are deleting from a 10 million document collection, if it's not indexed properly, what is simple might take forever to execute
[18:38:06] <grouch-dev> starfly: yes, I could handle that, but it just simply cant handle that command
[18:38:59] <grouch-dev> starfly: I am wondering is there a real reason why (which requires much more education on my part) or is it just a bug?
[18:39:30] <grouch-dev> starfly: I would not expect the database to dump a stacktrace and fail on a delete
[18:40:51] <starfly> grouch-dev: agree that there is likely a bug in the way that the command is being handled, but its performance is entirely predictable if not indexed properly
[18:40:59] <grouch-dev> starfly: also, it does delete "some" of the records...approx 10k per run...
[18:41:17] <starfly> grouch-dev: sounds like a buffering issue
[18:42:07] <starfly> grouch-dev: anyway, you might want to try revising the index, you can probably see things work as you'd like (and improve other query performance)
[18:55:45] <grouch-dev> starfly: I really appreciate your input starfly. I currently have a task to reimplement the indexing in the application, and that insight will be part of that effort. The application uses spring-mongo-data and @Index attribute on sub-documents, so 90% of our indexing is broken anyhow...
[18:57:56] <leifw> anyone know how to determine whether a client object in the ruby driver is connected to a single mongod or to a replica set?
[18:59:46] <starfly> good luck, grouch-dev
[19:12:06] <leifw> or rather, if it's connected to a mongod that is part of a replica set
[19:12:35] <ron> remonvv
[19:17:45] <starfly> leifw: you could examine all sessions for all mongod instances (using the REST API or from mongo shell) that are in the replica set in question
[19:18:31] <leifw> I am trying to do this through the ruby driver because I want to skip a ruby driver test when I'm running against a replica set
[19:20:16] <starfly> leifw: so you'd prefer to assume things are available on the back end if you've upped the odds with a replica set?
[19:21:21] <leifw> no, I'm trying to skip some ruby unit tests
[19:21:25] <leifw> this is not for production
[19:22:01] <leifw> I'm testing tokumx by running some of the driver tests against it, but there are some changed assumptions so for now I just want to skip the driver tests that I know won't work
[19:22:37] <starfly> but you're unable to skip the unit tests across the board?
[19:23:07] <leifw> I just can't figure out what ruby expression will tell me whether the server that the ruby driver is connected to is standalone or the primary of a replica set
[19:23:22] <leifw> if I can figure that out I can skip the test with an if statement
[19:33:45] <ProLoser> hallo
[19:33:48] <ProLoser> anyone use mongohub?
[19:34:12] <ron> I imagine someone does.
[19:40:27] <starfly> leifw: short of sending an rs.conf() command over the wire and interpreting the results (null if not RS) in Ruby, can't think of anything
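A variation on starfly's idea: the isMaster command (available from every driver, including Ruby) includes a setName field only when the server is a replica-set member. From the shell the check looks like this:

    var hello = db.isMaster()
    // setName is only present for replica-set members; it is absent on a standalone mongod
    var inReplicaSet = (hello.setName !== undefined)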
[20:28:16] <ikurei> hi. i need to ask a noob question, im new to nosql and mongo and i cant figure this out
[20:28:43] <starfly> ikurei: shoot
[20:29:25] <ikurei> is there any easy way to add a property to a property of a document? like, i have a game with this structure { name : "LOL", platform: "PS5", prices : { "amazon" : 50, "play": 50 } }
[20:29:39] <ikurei> i need to add a new property to that "prices" property
[20:30:12] <ikurei> what would be the canonical way to do it? i can do it in a couple of dirty ugly ways, but it's too easy a problem to not have an easy solution
[20:32:19] <kali> $set: { "prices.steam" : 42 }
[20:32:20] <starfly> ikurei: you can read the document, update the structure in your code, and update it in MongoDB (using the same _id)
[20:33:06] <ikurei> starfly: that was my only working idea. and that's what i meant, there had to be an easier way
[20:33:38] <kali> but you'd bette change the structure to prices: [{ store: "amazon", price: 50}, {store: "play", price: 50} ]
[20:33:53] <n06> I have a really strange error popping up as im trying to setup my production cluster. I cant add any shards to the mongos
[20:34:22] <n06> "errmsg" : "exception: SyncClusterConnection::insert prepare failed: 10276 DBClientBase::findN: transport error: <host>:27020 ns: admin.$cmd query: { fsync: 1 } <host>:27020:{}"
[20:34:35] <n06> redacted just for peace of mind
[20:34:36] <ikurei> kali: thanks a lot. i tried doing that, but without the quotes around prices.steam
[20:35:41] <ikurei> kali: also that was my first design, but i need to do fast accesses to it from the node code. i guess that design will be way better for queries
[20:35:51] <ikurei> thanks a lot to both!
[20:36:19] <kali> ikurei: it's usually better to avoid using "variable stuff" as key name in mongodb
[20:37:04] <ikurei> kali: why so?
[20:37:09] <kali> ikurei: you're going to need an index for amazon, an index for "play", and a new index every time you add a store
[20:37:39] <nemothekid> I have a collection with ~20,000,000 documents and I need to update all of them once a day. Its stored on a single database with SSD and enough RAM for the whole dataset. The problem is updates are really slow (so slow that we now batch remove and insert) and while this is happening the database latency for other queries goes through the roof. I thought about using another data store like Redis, but I need to be able to sort these
[20:37:39] <nemothekid> documents for top-N queries. Lastly we thought about inserting to a seperate collections and renaming, but if we eventually decide to shard this collection again, that will be a huge problem. Just looking for advice/similar solutions
[20:38:00] <ikurei> hummm... yep, i see it's an awful way. i forget this is a damned database.
[20:38:04] <ikurei> thanks for the advice
[20:38:11] <kali> ikurei: also, it will lead you to concatenating strings to craft queries, which is a first step towards injection vulnerabilities
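A sketch of the array layout kali recommends (the collection name games is an assumption):

    // one sub-document per store; a single multikey index then covers every store
    db.games.update(
        { name: "LOL", platform: "PS5" },
        { $push: { prices: { store: "steam", price: 42 } } }
    )
    db.games.ensureIndex({ "prices.store": 1, "prices.price": 1 })
    // e.g. games sold on steam for 45 or less
    db.games.find({ prices: { $elemMatch: { store: "steam", price: { $lte: 45 } } } })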
[20:39:53] <kali> nemothekid: this is bad. really bad. mass updates are database killers
[20:40:17] <kali> nemothekid: is there any way you can avoid these mass updates ?
[20:41:17] <nemothekid> kali: well I don't really need updates, I just need to replace the data. So if I could have an insert that would kill the previous document instead of dropping the newer one that would be optimal
[20:41:47] <kali> nemothekid: what about adding the date in the id ?
[20:42:42] <starfly> nemothekid: you could use an active-passive database design, update the passive collection while the active copy takes reads, then swap when the passive collection has been updated. Separate the hosts, storage, etc. to avoid interaction.
[20:42:46] <nemothekid> I thought about that but then I'm left with a collection that I'm adding 20mill documents per day, and the indexes are growing, and the sort performance will start to suffer as well
[20:43:56] <kali> nemothekid: active/passive can be an option, in that case... switch db based on the current day parity...
[20:45:25] <nemothekid> That could work, and it would work if we decided to start sharding again as well. Don't like the changes I would have to make to the application, but it seems like the best option for now. Thanks
[20:47:17] <starfly> nemothekid: we do something similar with a backend for a REST app and it works well, although the setup for swapping is (of course) some work
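One way to implement the swap starfly describes while the collection is unsharded, as a sketch with made-up names (renameCollection is not available on sharded collections, which is the caveat nemothekid raised):

    // rebuild into a staging collection, index it, then swap it in atomically
    db.daily_stats_staging.drop()
    // ... bulk-insert the fresh documents into daily_stats_staging here ...
    db.daily_stats_staging.ensureIndex({ score: -1 })
    db.daily_stats_staging.renameCollection("daily_stats", true)   // true drops the old target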
[22:33:47] <generic_nick> hi, i'm using the pymongo driver and trying to set the read_preference to SECONDARY_PREFERRED
[22:33:47] <generic_nick> it works on my sharded collection
[22:33:54] <generic_nick> but on my unsharded collection mongotop/mongostat are showing no reads whatsoever
[22:33:59] <generic_nick> any ideas or anyone have the same issue?
[22:52:38] <rpcesar> im getting this weird error in my map reduce after switching from a 2.0.5 single server setup to a 2.4 sharded replica set. Every other process seems to be working as far as I can tell. The error I am seeing is : exception: stale config on lazy receive :: caused by :: $err: "[copilot.SearchPartials] shard version not ok in Client::Context: this shard contains v...", code: 9996
[22:53:20] <rpcesar> im not having much luck finding that error, google returns very limited results, and those seem to be bugs that were closed in other versions
[23:51:16] <dberg> I'm trying to set a cluster locally for testing purposes with a config server, 1 mongos and 3 mongod running as a replica set.
[23:51:37] <dberg> starting the servers seem fine
[23:52:08] <dberg> if I log into one of the mongod and run rs.initiate({...}) it seems fine from the logs.
[23:52:22] <dberg> I can see that one is master and the other 2 are secondary.
[23:52:36] <dberg> but from the documentation should I be doing this from mongos?
[23:52:59] <dberg> using sh.addShard()
[23:53:00] <dberg> ?
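For what it's worth, both steps are normally needed: rs.initiate() is run once against one of the mongod members, and the set is then registered from mongos with sh.addShard(). The host names and set name below are placeholders:

    // connected to one of the replica-set members:
    rs.initiate({ _id: "rs0", members: [
        { _id: 0, host: "localhost:27018" },
        { _id: 1, host: "localhost:27019" },
        { _id: 2, host: "localhost:27020" }
    ] })
    // then, connected to mongos:
    sh.addShard("rs0/localhost:27018")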