[08:47:50] <kali> k610: mmm not really. the syntax is quite different, but it is query-based, whereas the redis interface is closer to "hashmap and list on server", with an interface similar to what the standard api provides
[09:10:29] <christkv1> kali: I consider redis a swiss army knife of data structures and more often than not people use both mongodb and redis
[09:13:29] <kali> christkv1: redis was not ready for HA operation at that time, i don't know where it stands now
[09:15:25] <tpae> quick question.. i've been reading the docs, but can't seem to find good practice for adding indices.. when should i add an index, and when should i re-index?
[09:25:35] <kali> ppetermann: sure. from a development pov, it's a nice theory. but from the deployment and operations side, it's nice to have flexible tools and limit the number of infrastructure services
[09:26:08] <NodeX> "increments field by the number value if field is present in the object, otherwise sets field to the number value. This can also be used to decrement by using a negative value."
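(A minimal shell sketch of the $inc behavior NodeX quotes; the "stats" collection and "views" field are invented for illustration:)

    db.stats.update({_id: 1}, {$inc: {views: 1}})   // sets views to 1 if absent, otherwise adds 1
    db.stats.update({_id: 1}, {$inc: {views: -5}})  // a negative value decrements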
[09:26:54] <NodeX> tpae : you should never need to re-index
[09:27:16] <NodeX> and you should add indexes at the start really as they're blocking operations that stop reads/writes to your data while they're building
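(A sketch of adding an index up front, with collection and field names assumed; in MongoDB of this era, ensureIndex also accepted {background: true} to trade a slower build for not blocking reads/writes on the primary:)

    db.users.ensureIndex({email: 1})                      // foreground build: fast but blocking
    db.users.ensureIndex({email: 1}, {background: true})  // background build: non-blocking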
[09:27:21] <ppetermann> kali: right tool for the right job.
[10:20:06] <loganBS> Hi all. I'm puzzled about compound shard key. Is there a limit on number of fields which form a shard key? For example, can I use a compound shard key with 6 fields?
[10:30:06] <christkv1> loganBS: post 2.0 the max key size for an index is 1024 bytes
[10:30:17] <christkv1> loganBS: I think that's the only limitation
[10:33:59] <loganBS> christkv1: So could it be a good idea to use a shard key like this: {coarsely ascending key (time); username; app}? My goal is to store, for every app of a user, the hourly number of accesses.
[10:34:36] <christkv1> well if you use the time as the first part you will get a hot shard
[10:34:42] <christkv1> as it's an increasing counter
[10:35:18] <christkv1> what would your user document look like ?
[10:35:33] <christkv1> and you would probably want to use the user name as the first part
[10:35:45] <christkv1> if you are going to be querying the information by the user
[10:36:27] <loganBS> but ideally I have a large number of users, not just one (it's analytics)
[10:36:54] <christkv1> ok then you might want to presplit the time ranges
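(Pre-splitting can be done with the split command against a mongos; a hedged sketch, with the "analytics.stats" namespace and the boundary values invented:)

    db.adminCommand({split: "analytics.stats", middle: {time: ISODate("2012-10-01")}})
    db.adminCommand({split: "analytics.stats", middle: {time: ISODate("2012-11-01")}})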
[10:37:26] <loganBS> user document is like: {_id, user, app, year, month, day, {hour_1, hour_2,...hour_24}}
[10:38:28] <christkv1> ah ok each document is a single day instance of the user
[10:38:41] <christkv1> so for a single user in a year you might have 365 docs right ?
[10:41:48] <christkv1> then you want the docs to stay together
[10:42:06] <christkv1> this way you don't incur a scatter-gather operation
[10:42:22] <christkv1> if you shard by user the queries will go straight to the right shard
[10:42:31] <loganBS> ok, so using user as shard key could be good in this case
[10:42:52] <christkv1> yes BUT no amount of guessing is a good replacement for testing the shard key with real data :)
[10:43:21] <loganBS> but what happens when a collection is like this: {_id, user, app, PAGE, year, month, day, {hour_1, hour_2,...hour_24}}?
[10:44:19] <loganBS> where the number of PAGEs is different for each user (i.e. one user can have only 2 pages, but another user can have a million pages)?
[10:45:11] <christkv1> so what's the biggest possible single day dataset a user can generate ?
[10:57:37] <loganBS> but using only user as the shard key, isn't there a risk of unbalancing the cluster (a user with more pages generates a greater number of writes)?
[12:13:31] <coalado> sounds like it will replace M&R sooner or later
[12:13:51] <christkv1> well I think there might still be cases where M&R makes sense
[12:57:36] <loganBS> Another question about sharding: is multi-range sharding (i.e. a shard node is responsible for different ranges, for example {-Inf - 100} and {2000-2100}) what happens by default?
[12:59:58] <NodeX> I think the default shard key is _id which splits evenly across shards
[13:04:27] <tpae> man, more i work with mongodb, more i like it
[13:04:53] <remonvv> Indeed. I have similar sentiments towards sleeping.
[13:05:05] <remonvv> ron, I endorsed your epic skills on LI. You owe me.
[13:05:28] <loganBS> NodeX: my question is different: can a shard node be responsible for more than one range? For example, when using a monotonically increasing shard key, all the writes are directed to the current shard; but when the last shard is reached, does it continue from the first shard?
[13:07:35] <tpae> according to this article: http://viktorpetersson.com/2012/02/13/comparing-mongodb-write-performance-on-centos-freebsd-and-ubuntu/
[13:33:26] <remonvv> tpae, that's an incredibly questionable benchmark. The main performance differences come from chosen filesystems (the defaults may be different for those OSs), memory manager, etc.
[13:34:15] <remonvv> None of those things is specified in the test.
[14:00:06] <sag> ohhh.. sorry for the confusion.. I mean, what if the lookup is on different values?
[14:00:24] <ianblenke> like name in ppetermann's example?
[14:00:53] <ppetermann> sag: are you looking for $or?
[14:01:19] <sag> like something=foobar,something=barfoo,something=foo
[14:01:44] <sag> yep, $or could be an option.. ohh yeah..
[14:02:11] <sag> what would the impact on performance be of using $or vs $in?
[14:22:32] <sgtpepper> hello to all! Consider this situation: in a sharded environment one node reaches its memory limit. What should happen at this point: 1) add a more capable disk, or 2) add a new shard node? In case 2, does the full node get rebalanced?
[14:24:37] <kali> sgtpepper: 1 is a recipe for disaster
[14:25:24] <kali> sgtpepper: a more capable disk will be what... at most 20% better than whatever you have now, whereas memory is orders of magnitude faster than disk.
[14:26:03] <kali> so if you add a node, the whole cluster will be rebalancing, stealing about 1/n of the content from the existing nodes
[14:27:07] <sgtpepper> kali: assuming the shard key was correctly chosen?
[14:28:08] <kali> sgtpepper: you can look at the chunks list and check they are evenly sized
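(A minimal sketch of inspecting chunk distribution from a mongos; "mydb.mycoll" is a placeholder namespace:)

    sh.status()                                          // per-shard chunk summary
    use config
    db.chunks.find({ns: "mydb.mycoll"}).sort({min: 1})   // list the chunk ranges for one collection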
[14:29:24] <remonvv> sgtpepper, I guess the assumption that the shard key is actually correct doesn't fly or it wouldn't be just one node with memory issues.
[14:29:35] <remonvv> The easiest solution is adding nodes, regardless.
[14:30:14] <sgtpepper> Ok, and if one has chosen a wrong shard key, is there a way to hot change it?
[14:30:27] <remonvv> sag, your question is a little vague but i'm pretty sure you are indeed looking for $in rather than $or. In terms of performance $in is faster if your use case allows for $in.
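(For equality matches on a single field the two forms are interchangeable; a sketch using sag's values, with "things" as a hypothetical collection:)

    // these match the same documents; $in is the idiomatic form for this case
    db.things.find({something: {$in: ["foobar", "barfoo", "foo"]}})
    db.things.find({$or: [{something: "foobar"}, {something: "barfoo"}, {something: "foo"}]})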
[14:35:10] <remonvv> Depends a bit on your data whether or not you should. Smaller shard keys tend to be "better"
[14:37:33] <sgtpepper> Because my plan is to use "month - username - category - type - page", where a page is contained in a type, which is contained in a category
[14:38:08] <sgtpepper> or would something like "month - page" be better, bypassing username, category and type?
[14:44:07] <remonvv> month shouldn't be in there, that's an awful first key. it means all your data for a certain month is likely to go to the same (couple of) shards.
[14:46:48] <remonvv> sgtpepper, the goal is to make sure your shard key uses high cardinality fields that ideally are not based on time (at least not as the first field in the key). Things like UUIDs, usernames or other highly distributed IDs are good.
[14:46:59] <sgtpepper> remonvv: but I have a potentially very large number of usernames (and pages)
[14:47:02] <remonvv> eka, specifics kind sir. Which log?
[14:47:20] <remonvv> sgtpepper, that's the point. You want it to be as large as possible. It means there are more unique values for that field.
[14:48:02] <remonvv> sgtpepper, imagine you have {month:1} as your shardkey and you have 100 shards and you'll see the problem.
[14:48:34] <remonvv> Actually that's not clear at all.
[14:49:43] <remonvv> eka, --quiet helps a little and you can write the log output to the void. Why do you want it disabled?
[14:50:13] <eka> remonvv: added quiet to my .conf... but still logging connections and some queries
[14:50:15] <sgtpepper> remonvv: ok, but using {month:1, username:1}, isn't the cluster balanced based on username?
[14:50:28] <remonvv> sgtpepper, it's a bit hard to explain in a few sentences.
[14:52:26] <sgtpepper> remonvv: I've read Scaling MongoDB by Kristina Chodorow... what I understand is that a shard key of the form {coarsely ascending key + search key} can be good. Isn't that true?
[14:53:20] <remonvv> sgtpepper, let's just say that will probably be worse than just {username:1}. Especially if "month" is something like the current month which means all new data will have the same value for month and will lean towards the same shard initially. MongoDB will split based on data hotspots so that'll eventually sort itself out but it just never actually helps the situation. Then if the next month arrives you'll be hitting the worst case scenario
[14:54:26] <remonvv> sgtpepper, kchodorow works for 10gen so I assume she knows what she's on about. I'm assuming the coarsely ascending key wasn't time based though. In my experience that's a particularly bad idea.
[14:54:37] <remonvv> And I have quite a bit of experience with that specifically.
[14:58:41] <sgtpepper> ...but what about when a certain username receives many more write/update operations compared to others (for example, has a greater number of pages, and for each one an hourly counter must be stored)?
[14:59:07] <remonvv> sgtpepper, that's not relevant. If a single user receives a lot of writes those writes will always have to happen on the same shard anyway.
[14:59:33] <remonvv> And if that single user has a lot of associated data it will only arrive on the same shard if you use the same field(s) as shard key for that data.
[14:59:38] <remonvv> Which is actually a good idea.
[14:59:49] <remonvv> I usually shard on our UUID for user related data for example.
[14:59:57] <remonvv> That guarantees all data for user X lives on the same shard.
[15:00:18] <sgtpepper> Isn't there a risk of having unbalanced shards (for example, the one with this username has more data)?
[15:00:22] <remonvv> UUID is better than username due to length variability and usernames generally being larger.
[15:00:43] <remonvv> sgtpepper, well, unless you have incredible powerusers that can fill a single shard, no.
[15:01:23] <remonvv> sgtpepper, note that mongodb splits chunks. If a single shard is dealing with a lot of data from a few users it'll automatically split related chunks and rebalance them across the shards.
[15:01:46] <remonvv> sgtpepper, and it's not very realistic. You usually won't have a system where 0.1% of your users generate 99% of your data.
[15:03:30] <remonvv> sgtpepper, how many users do you expect? roughly
[15:06:34] <NodeX> how come you're sharding only 3k documents?
[15:06:41] <remonvv> sgtpepper, distributed writes is exactly what you get when using username.
[15:06:52] <remonvv> Oh, I was assuming there are multiple docs per user.
[15:07:38] <NodeX> that has a 48GB max size across any of your shards at mongo's current doc size cap
[15:07:39] <remonvv> sgtpepper, you can add more fields to the shard key if you're worried some users might produce the bulk of your data. E.g. {username:1, productId:1}
[15:08:13] <remonvv> username sounds good but if you have other fields that have higher cardinality then by all means, use that.
[15:08:15] <sgtpepper> remonvv: ok this sounds better
[15:08:27] <remonvv> I shard on UUID because I have millions of those.
[15:08:56] <remonvv> if I have 100k users on my system they tend to be roughly evenly distributed amongst the shards
[15:09:52] <remonvv> if you end up with 3k documents sharding is a bit useless by the way
[15:10:39] <sgtpepper> but in my case some users can generate more traffic than others, and what I don't want is for all the writes belonging to those users to be directed to the same shard... that's the problem
[15:11:29] <sgtpepper> 3000 users for about 20 different collections (different kinds of data collected)
[15:12:00] <NodeX> how many docs/rows do you intend to shard lol
[15:12:23] <NodeX> you implied you would be sharding 3k docs (your users)
[15:13:00] <sgtpepper> ...and 3000 users per day (statistics are hourly)
[15:20:58] <remonvv> sgtpepper, and each user generates their own statistics? If so the scenario of having power users becomes a bit more realistic which makes username a bit less ideal.
[15:21:11] <kali> month-first sharding key... i think there is some value here if your app hammers recent activity... chunks with old data will slip out of the cache
[15:21:20] <remonvv> If you have a big customer that pumps GBs of data per day to you, splitting on username won't help you much.
[15:22:06] <remonvv> and if you have month as your first shard key all data will be written to a single or a specific few shards with the rest idle.
[15:22:34] <sgtpepper> yes, each user generates their statistics
[15:22:46] <remonvv> now if it's {second:1} it's somewhat different but then you might as well use _id
[15:23:01] <kali> remonvv: i don't know how the chunk storage is segregated... data from one chunk can be interleaved with another ?
[15:23:03] <remonvv> sgtpepper, what other fields do you have? category?
[15:23:33] <kali> remonvv: as for the rest, i agree, at the beginning of the month, all writes will go to one single shard until the second key starts to play
[15:24:12] <remonvv> kali, no, but caching happens based on memory pages. What chunks those pages belong to isn't that relevant. You are right when it comes to indexes though since that might help right-balancing the index. But for shard distribution it's not a good idea.
[15:24:21] <sgtpepper> the most complex collection is: username - application - category - type - page - year - month - day
[15:24:22] <remonvv> shard distribution ~= chunk distribution
[15:24:48] <remonvv> and what is the cardinality of category and type? and what's "page"?
[15:25:59] <sgtpepper> category and type cardinality is low, page (a visited page) could be high for certain users
[15:31:32] <remonvv> application cardinality is low too i assume
[15:33:32] <remonvv> {username:1, day:1, page:1} would probably be reasonable then. Username ensures chunk distribution; once there, the day+page part will help make sure often-accessed data (assuming access goes per page) ends up in cache, as kali suggests.
[15:33:51] <remonvv> Alternatively, write a realistic test ;)
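(Declaring that key would look something like this, assuming a hypothetical "analytics.stats" namespace:)

    sh.shardCollection("analytics.stats", {username: 1, day: 1, page: 1})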
[15:36:43] <sgtpepper> Sure, a realistic test will be run... but I wanted to have a clearer idea beforehand :)
[15:38:19] <sgtpepper> Instead, what is the disadvantage of using {day:1, username:1, page:1}?
[15:42:09] <kali> sgtpepper: similar to month: at the beginning of each day, one single chunk will handle the write
[15:44:09] <sgtpepper> kali: so, for example, I reach balance only by the middle of the day?
[15:49:47] <gheegh> Hi all, I've outgrown my single Mongo server, and I'm about to upgrade. I've got 2 servers now to migrate to.. I'm wondering if I should shard them or run them master/slave? Everything we do is very bursty..and we have periods of super sustained writes. Once those writes are over though, the "write" load gets very low. What else should I be looking at to figure out how to most effectively use 2 servers?
[15:50:18] <kali> sgtpepper: when the chunk has outgrown the wanted chunk size... and at that point you get an expensive split and rebalance
[15:58:32] <gdoteof> i need to take all those values and add an {end:$ISODATENOW}
[15:59:20] <ianblenke> yeah, ignore what I said. shard key. missed the scrollback. caught up now.
[16:02:08] <gdoteof> basically i just need to know how to get "now" for an ISO date
[16:03:23] <ianblenke> like new Date()? something like this perhaps: db.test.update({end:null}, {$set: {end: new Date()}}, false, true);
[16:07:09] <playcraft> What would be the most efficient, multi-user and shard friendly way to calculate the difference between two values and then calculate the average of the differences?
[16:09:19] <playcraft> MapReduce can get the job done but, from what I read, it is not good to do it this way because it locks the JavaScript context.
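(Assuming both values live in fields of each document, the 2.2 aggregation framework can compute this without touching the JavaScript engine; the "samples" collection and "start"/"end" fields are invented:)

    db.samples.aggregate([
        {$project: {diff: {$subtract: ["$end", "$start"]}}},  // per-document difference
        {$group: {_id: null, avgDiff: {$avg: "$diff"}}}       // average across all documents
    ])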
[16:10:51] <SergeyUkolov> Hi all! is there an analog in mongo for USE INDEX (index_list) (from MySQL)?
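(The closest analog is cursor.hint(), which forces the query to use a particular index; a minimal sketch with assumed names:)

    db.users.find({email: "foo@foo.com"}).hint({email: 1})  // by key pattern
    db.users.find({email: "foo@foo.com"}).hint("email_1")   // or by index name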
[16:11:30] <eydaimon> are the same 4mb limitations still in place per document now in 2.2?
[16:16:39] <DrShoggoth> can i store and index 2d shapes and query them by lat/lon within?
[16:17:03] <DrShoggoth> i find a lot of examples on doing "near" queries, but I only want to find shapes my point falls within
[18:46:27] <kali> krispyjala: nope, it's actually the right one. you just need to make your app talk to the mongos instead of the mongod before enabling sharding
[18:47:12] <krispyjala> kali: so it'll magically shard by key on my old data?
[18:47:29] <kali> you need to run enableSharding on one or several databases
[18:47:44] <kali> then enable sharding on one collection, and specify a sharding key
[18:48:00] <kali> at that point, the cluster will start to split data and move it around
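(The sequence kali describes, sketched against a mongos; the database, collection, and key are placeholders:)

    sh.enableSharding("mydb")                            // 1. enable sharding on the database
    sh.shardCollection("mydb.bigcollection", {user: 1})  // 2. shard the collection on a chosen key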
[18:48:02] <krispyjala> there's just one huge collection that we're trying to shard cuz it's getting too big
[18:48:14] <krispyjala> ok, but in the meantime it will still be serviceable?
[18:48:22] <krispyjala> or would this be the downtime?
[18:48:34] <krispyjala> that collection size is about 300GB I think
[18:48:49] <kali> yes. the only downtime is when you stop your app, change its config to make it speak to mongos instead of mongod
[18:50:41] <krispyjala> kali: one other question: we're not even on a replica set. We're still on master-slave, so would we need to convert to a replica set first? Or can we do all of it in one shot with the new sharding?
[18:51:14] <kali> krispyjala: i don't know if sharding and old master/slave are compatible
[18:51:42] <kali> i would certainly start by moving to replica set
[19:05:53] <jgornick> Hey guys, are there any plans to have Mongo remove the limit of 16MB documents and use something similar to how GridFS partitions documents?
[19:06:16] <jrdn> so using the aggregation framework, i want to count all the items that exist in a document?.. just a simple count
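(Assuming "items" is an array field, a sketch of counting its elements per document with the aggregation framework; names are hypothetical:)

    db.orders.aggregate([
        {$unwind: "$items"},                        // one doc per array element
        {$group: {_id: "$_id", count: {$sum: 1}}}   // count the elements back per document
    ])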
[19:52:43] <_m> A document with structure: { username, email, remember_me }
[19:53:42] <_m> Assume some records exist with data like this: {username: "foo", email: "foo@foo.com", remember_me: 'asdasdfasdf' } and { username: 'bar', email: 'bar@bar.com', remember_me: 'asdasdfasdf' }
[20:22:35] <kewark> hi guys, is 'rs.initiate()' idempotent?
[20:23:29] <kanzie> I have a collection of movies and I want to save which users like which movies by adding the userId to an array stored in the movies collection. Every time a user presses like on a movie I want to add their id to the array and increment the "likes" field by one if the user has not liked the movie before; if already liked, then skip. I can do this brute force with php but think there are nicer ways of doing this in the insert-statement for mo
[20:24:40] <kanzie> so my question is: should the "likes" property be a js function that counts items in the likedBy array, or should I have an integer value that I increment?
[20:27:37] <kanzie> and should I do something like if ( db.movies.find({'_id': $myid, 'likedBy': $userId}) )
[20:29:51] <jrdn> i'd also increment / decrement a likes field
[20:30:01] <crudson> 1) the latter, 2) yes, as part of an update in which you can $inc the count and $push the userId
[20:30:50] <jrdn> depending on the traffic / amount of likes you get, it may be bad to push the userid
[20:31:23] <kanzie> jrdn crudson: thanks guys… since I was planning to build a scoring function too that takes a bunch of properties from the movie item and does some magic on it, I planned on doing this as a Score field with a js function that would calculate the score. But you don't recommend storing js functions?
[20:32:08] <kanzie> jrdn: why is that? I was thinking of perhaps just storing the userids as strings instead of actual userIds, but I have no idea which is best from a mongo point of view
[20:32:48] <kanzie> crudson: Is there any way to see if $addToSet actually added a value or not?
[20:32:52] <jrdn> i'm sure you can get a lot of user ids in there before hitting the 16MB document limit :P
[20:34:15] <jrdn> also, documents move on disk and you'll see high CPU / IO if they continuously grow in size.. i've noticed this with our current schema.. but i'm not a complete genius :X
[20:38:18] <kanzie> thanks, I'm reading up on addToSet and push now, but it seems a little wonky given what I need, because I only want to increment if I successfully add
[20:39:35] <kanzie> maybe I should count the items in the array after addToSet and update the count accordingly
[20:39:59] <kanzie> nah, better to check if the id is in the set already; if not, then perform the update and increment, else skip
[20:40:52] <crudson> kanzie: that's why the query part of the update should have {users:{$ne:'123'}} so you are matching a movie that a user hasn't liked already
[20:42:03] <crudson> kanzie: "find movie x that user y hasn't liked, and increment the count and add the user to the list" can happen in one go
[20:42:12] <kanzie> but I already know the movieId and the userId; however, if I do an update I would need to consider the return value, and there doesn't seem to be any simple and foolproof way to do that
[20:42:46] <kanzie> crudson: well, if so I really need some help constructing that query cause I can't get my thick fingers around that
[20:46:10] <crudson> db.movies.update({_id:'titanic', userlikes:{$ne:'user123'}}, {$inc:{likes:1}, $push:{userlikes:'user123'}}, false, false) could do it
[20:46:39] <kanzie> ok, lets see if I can translate this to php