[08:47:50] <kali> k610: mmm not really. the syntax is quite different, but it is query-based, whereas the redis interface is closer to "hashmap and list on server", with an interface similar to what the standard api provides
[09:10:29] <christkv1> kali: I consider redis a swiss army knife of data structures and more often than not people use both mongodb and redis
[09:13:29] <kali> christkv1: redis was not ready for HA operation at that time, i don't know where it stands now
[09:15:25] <tpae> quick question.. i've been reading the docs, but can't seem to find good practice for adding indices.. when should i add an index, and when should i re-index?
[09:25:35] <kali> ppetermann: sure. from a development pov, it's a nice theory. but from the deployment and operations side, it's nice to have flexible tools and limit the number of infrastructure services
[09:26:08] <NodeX> "increments field by the number value if field is present in the object, otherwise sets field to the number value. This can also be used to decrement by using a negative value."
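(A minimal shell sketch of the $inc behavior NodeX quotes; the "stats" collection and "views" field are invented for illustration:)

    db.stats.update({_id: 1}, {$inc: {views: 1}})   // sets views to 1 if absent, otherwise adds 1
    db.stats.update({_id: 1}, {$inc: {views: -5}})  // a negative value decrements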
[09:26:54] <NodeX> tpae : you should never need to re-index
[09:27:16] <NodeX> and you should add indexes at the start really as they're blocking operations that stop reads/writes to your data while they're building
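(A sketch of adding an index up front, with collection and field names assumed; in MongoDB of this era, ensureIndex also accepted {background: true} to trade a slower build for not blocking reads/writes on the primary:)

    db.users.ensureIndex({email: 1})                      // foreground build: fast but blocking
    db.users.ensureIndex({email: 1}, {background: true})  // background build: non-blocking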
[09:27:21] <ppetermann> kali: right tool for the right job.
[10:20:06] <loganBS> Hi all. I'm puzzled about compound shard key. Is there a limit on number of fields which form a shard key? For example, can I use a compound shard key with 6 fields?
[10:30:06] <christkv1> loganBS: post 2.0 the max key size for an index is 1024 bytes
[10:30:17] <christkv1> loganBS: I think that's the only limitation
[10:33:59] <loganBS> christkv1: So could it be a good idea to use a shard key like this: {coarsely ascending key (time); username; app}? My goal is to store, for every app of a user, the hourly number of accesses.
[10:34:36] <christkv1> well if you use the time as the first part you will get a hot shard
[10:34:42] <christkv1> as it's an increasing counter
[10:35:18] <christkv1> what would your user document look like ?
[10:35:33] <christkv1> and you would probably want to use the user name as the first part
[10:35:45] <christkv1> if you are going to be querying the information by the user
[10:36:27] <loganBS> but ideally I have a large number of users, not just one (it's analytics)
[10:36:54] <christkv1> ok then you might want to presplit the time ranges
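(Pre-splitting can be done with the split command against a mongos; a hedged sketch, with the "analytics.stats" namespace and the boundary values invented:)

    db.adminCommand({split: "analytics.stats", middle: {time: ISODate("2012-10-01")}})
    db.adminCommand({split: "analytics.stats", middle: {time: ISODate("2012-11-01")}})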
[10:37:26] <loganBS> user document is like: {_id, user, app, year, month, day, {hour_1, hour_2,...hour_24}}
[10:38:28] <christkv1> ah ok each document is a single day instance of the user
[10:38:41] <christkv1> so for a single user in a year you might have 365 docs right ?
[10:41:48] <christkv1> then you want the docs to stay together
[10:42:06] <christkv1> this way you don't incur a scatter-gather operation
[10:42:22] <christkv1> if you shard by user the queries will go straight to the right shard
[10:42:31] <loganBS> ok, so using user as shard key could be good in this case
[10:42:52] <christkv1> yes BUT no amount of guessing is a good replacement for testing the shard key with real data :)
[10:43:21] <loganBS> but what happens when a collection is like this: {_id, user, app, PAGE, year, month, day, {hour_1, hour_2,...hour_24}}?
[10:44:19] <loganBS> where the number of PAGEs is different for each user (i.e. one user can have only 2 pages, but another user can have a million pages)?
[10:45:11] <christkv1> so what's the biggest possible single day dataset a user can generate ?
[10:57:37] <loganBS> but using only user as the shard key, isn't there a risk of unbalancing the cluster (a user with more pages generates a greater number of writes)?
[12:13:31] <coalado> sounds like it will replace M&R sooner or later
[12:13:51] <christkv1> well I think there might still be cases where M&R makes sense
[12:57:36] <loganBS> Another question about sharding: is multi-range sharding (i.e. a shard node is responsible for different ranges, for example {-Inf - 100} and {2000-2100}) what happens by default?
[12:59:58] <NodeX> I think the default shard key is _id which splits evenly across shards
[13:04:27] <tpae> man, more i work with mongodb, more i like it
[13:04:53] <remonvv> Indeed. I have similar sentiments towards sleeping.
[13:05:05] <remonvv> ron, I endorsed your epic skills on LI. You owe me.
[13:05:28] <loganBS> NodeX: my question is different: can a shard node be responsible for more than one range? For example, when using a monotonically increasing shard key, all the writes are directed to the current shard; but when the last shard is reached, does it continue from the first shard?
[13:07:35] <tpae> according to this article: http://viktorpetersson.com/2012/02/13/comparing-mongodb-write-performance-on-centos-freebsd-and-ubuntu/
[13:33:26] <remonvv> tpae, that's an incredibly questionable benchmark. The main performance differences come from chosen filesystems (the defaults may be different for those OSs), memory manager, etc.
[13:34:15] <remonvv> None of those things is specified in the test.
[14:00:06] <sag> ohhh.. sorry for the confusion.. I mean, what if the lookup is on different values?
[14:00:24] <ianblenke> like name in ppetermann's example?
[14:00:53] <ppetermann> sag: are you looking for $or?
[14:01:19] <sag> like something=foobar,something=barfoo,something=foo
[14:01:44] <sag> yep, $or could be an option.. ohh yeah..
[14:02:11] <sag> what would the impact on performance be of using $or vs $in?
[14:22:32] <sgtpepper> hello to all! Consider this situation: in a sharded environment one node reaches its memory limit. What should happen at this point: 1) add a more capable disk, or 2) add a new shard node? In case 2, does the full node get rebalanced?
[14:24:37] <kali> sgtpepper: 1 is a recipe for disaster
[14:25:24] <kali> sgtpepper: a more capable disk will be what... at most 20% better than whatever you have now, whereas memory is orders of magnitude faster than disk.
[14:26:03] <kali> so if you add a node, the whole cluster will be rebalancing, stealing about 1/n of the content from the existing nodes
[14:27:07] <sgtpepper> kali: assuming the shard key was correctly chosen?
[14:28:08] <kali> sgtpepper: you can look at the chunks list and check they are evenly sized
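(A minimal sketch of inspecting chunk distribution from a mongos; "mydb.mycoll" is a placeholder namespace:)

    sh.status()                                          // per-shard chunk summary
    use config
    db.chunks.find({ns: "mydb.mycoll"}).sort({min: 1})   // list the chunk ranges for one collection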
[14:29:24] <remonvv> sgtpepper, I guess the assumption that the shard key is actually correct doesn't fly or it wouldn't be just one node with memory issues.
[14:29:35] <remonvv> The easiest solution is adding nodes, regardless.
[14:30:14] <sgtpepper> Ok, and if one has chosen a wrong shard key, is there a way to hot change it?
[14:30:27] <remonvv> sag, your question is a little vague but i'm pretty sure you are indeed looking for $in rather than $or. In terms of performance $in is faster if your use case allows for $in.
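(For equality matches on a single field the two forms are interchangeable; a sketch using sag's values, with "things" as a hypothetical collection:)

    // these match the same documents; $in is the idiomatic form for this case
    db.things.find({something: {$in: ["foobar", "barfoo", "foo"]}})
    db.things.find({$or: [{something: "foobar"}, {something: "barfoo"}, {something: "foo"}]})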
[14:35:10] <remonvv> Depends a bit on your data whether or not you should. Smaller shard keys tend to be "better"
[14:37:33] <sgtpepper> Because my plan is to use "month - username - category - type - page", where a page is contained in a type, which is contained in a category
[14:38:08] <sgtpepper> or would something like "month - page" be better, bypassing username, category and type?
[14:44:07] <remonvv> month shouldn't be in there, that's an awful first key. it means all your data for a certain month is likely to go to the same (couple of) shards.
[14:46:48] <remonvv> sgtpepper, the goal is to make sure your shard key uses high cardinality fields that ideally are not based on time (at least not as the first field in the key). Things like UUIDs, usernames or other highly distributed IDs are good.
[14:46:59] <sgtpepper> remonvv: but I have a potentially very large number of usernames (and pages)
[14:47:02] <remonvv> eka, specifics kind sir. Which log?
[14:47:20] <remonvv> sgtpepper, that's the point. You want it to be as large as possible. It means there are more unique values for that field.
[14:48:02] <remonvv> sgtpepper, imagine you have {month:1} as your shardkey and you have 100 shards and you'll see the problem.
[14:48:34] <remonvv> Actually that's not clear at all.
[14:49:43] <remonvv> eka, --quiet helps a little and you can write the log output to the void. Why do you want it disabled?
[14:50:13] <eka> remonvv: added quiet to my .conf... but still logging connections and some queries
[14:50:15] <sgtpepper> remonvv: ok, but using {month:1, username:1}, isn't the cluster balanced based on username?
[14:50:28] <remonvv> sgtpepper, it's a bit hard to explain in a few sentences.
[14:52:26] <sgtpepper> remonvv: I've read Scaling MongoDB by Kristina Chodorow... what I understand is that a shard key of the form {coarsely ascending key + search key} can be good. Isn't that true?
[14:53:20] <remonvv> sgtpepper, let's just say that will probably be worse than just {username:1}. Especially if "month" is something like the current month which means all new data will have the same value for month and will lean towards the same shard initially. MongoDB will split based on data hotspots so that'll eventually sort itself out but it just never actually helps the situation. Then if the next month arrives you'll be hitting the worst case scenario
[14:54:26] <remonvv> sgtpepper, kchodorow works for 10gen so I assume she knows what she's on about. I'm assuming the coarsely ascending key wasn't time based though. In my experience that's a particularly bad idea.
[14:54:37] <remonvv> And I have quite a bit of experience with that specifically.
[14:58:41] <sgtpepper> ...but what about when a certain username receives many more write/update operations compared to others (for example, has a greater number of pages, and for each one an hourly counter must be stored)?
[14:59:07] <remonvv> sgtpepper, that's not relevant. If a single user receives a lot of writes those writes will always have to happen on the same shard anyway.
[14:59:33] <remonvv> And if that single user has a lot of associated data it will only arrive on the same shard if you use the same field(s) as shard key for that data.
[14:59:38] <remonvv> Which is actually a good idea.
[14:59:49] <remonvv> I usually shard on our UUID for user related data for example.
[14:59:57] <remonvv> That guarantees all data for user X lives on the same shard.
[15:00:18] <sgtpepper> Isn't there a risk of having unbalanced shards (for example, the one with this username has more data)?
[15:00:22] <remonvv> UUID is better than username due to length variability and usernames generally being larger.
[15:00:43] <remonvv> sgtpepper, well, unless you have incredible powerusers that can fill a single shard, no.
[15:01:23] <remonvv> sgtpepper, note that mongodb splits chunks. If a single shard is dealing with a lot of data from a few users it'll automatically split related chunks and rebalance them across the shards.
[15:01:46] <remonvv> sgtpepper, and it's not very realistic. You usually won't have a system where 0.1% of your users generate 99% of your data.
[15:03:30] <remonvv> sgtpepper, how many users do you expect? roughly
[15:06:34] <NodeX> how come you're sharding only 3k documents?
[15:06:41] <remonvv> sgtpepper, distributed writes is exactly what you get when using username.
[15:06:52] <remonvv> Oh, I was assuming there are multiple docs per user.
[15:07:38] <NodeX> that has a 48GB max size across any of your shards at mongo's current doc size cap
[15:07:39] <remonvv> sgtpepper, you can add more fields to the shard key if you're worried some users might produce the bulk of your data. E.g. {username:1, productId:1}
[15:08:13] <remonvv> username sounds good but if you have other fields that have higher cardinality then by all means, use that.
[15:08:15] <sgtpepper> remonvv: ok this sounds better
[15:08:27] <remonvv> I shard on UUID because I have millions of those.
[15:08:56] <remonvv> if I have 100k users on my system they tend to be roughly evenly distributed amongst the shards
[15:09:52] <remonvv> if you end up with 3k documents sharding is a bit useless by the way
[15:10:39] <sgtpepper> but in my case some users can generate more traffic than others, and what I don't want is for all the writes belonging to those users to be directed to the same shard... that's the problem
[15:11:29] <sgtpepper> 3000 users for about 20 different collections (different kinds of data collected)
[15:12:00] <NodeX> how many docs/rows do you intend to shard lol
[15:12:23] <NodeX> you implied you would be sharding 3k docs (your users)
[15:13:00] <sgtpepper> ...and 3000 users per day (statistics are hourly)
[15:20:58] <remonvv> sgtpepper, and each user generates their own statistics? If so the scenario of having power users becomes a bit more realistic which makes username a bit less ideal.
[15:21:11] <kali> month-first sharding key... i think there is some value here if your app hammers recent activity... chunks with old data will slip out of the cache
[15:21:20] <remonvv> If you have a big customer that pumps GBs of data per day to you, splitting on username won't help you much.
[15:22:06] <remonvv> and if you have month as your first shard key all data will be written to a single or a specific few shards with the rest idle.
[15:22:34] <sgtpepper> yes, each user generates their statistics
[15:22:46] <remonvv> now if it's {second:1} it's somewhat different but then you might as well use _id
[15:23:01] <kali> remonvv: i don't know how the chunk storage is segregated... data from one chunk can be interleaved with another ?
[15:23:03] <remonvv> sgtpepper, what other fields do you have? category?
[15:23:33] <kali> remonvv: as for the rest, i agree, at the beginning of the month, all writes will go to one single shard until the second key starts to play
[15:24:12] <remonvv> kali, no, but caching happens based on memory pages. What chunks those pages belong to isn't that relevant. You are right when it comes to indexes though since that might help right-balancing the index. But for shard distribution it's not a good idea.
[15:24:21] <sgtpepper> the most complex collection is: username - application - category - type - page - year - month - day
[15:24:22] <remonvv> shard distribution ~= chunk distribution
[15:24:48] <remonvv> and what is the cardinality of category and type? and what's "page"?
[15:25:59] <sgtpepper> category and type cardinality is low, page (a visited page) could be high for certain users
[15:31:32] <remonvv> application cardinality is low too i assume
[15:33:32] <remonvv> {username:1, day:1, page:1} would probably be reasonable then. Username ensures chunk distribution; once there, the day+page part will help make sure often-accessed data (assuming access goes per page) ends up in cache, as kali suggests.
[15:33:51] <remonvv> Alternatively, write a realistic test ;)
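(Declaring that key would look something like this, assuming a hypothetical "analytics.stats" namespace:)

    sh.shardCollection("analytics.stats", {username: 1, day: 1, page: 1})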
[15:36:43] <sgtpepper> Sure, a realistic test will be run... but I wanted to have a clearer idea beforehand :)
[15:38:19] <sgtpepper> Instead, what is the disadvantage of using {day:1, username:1, page:1}?
[15:42:09] <kali> sgtpepper: similar to month: at the beginning of each day, one single chunk will handle the write
[15:44:09] <sgtpepper> kali: so, for example, I reach balance only by the middle of the day?
[15:49:47] <gheegh> Hi all, I've outgrown my single Mongo server, and I'm about to upgrade. I've got 2 servers now to migrate to.. I'm wondering if I should shard them or run them master/slave? Everything we do is very bursty..and we have periods of super sustained writes. Once those writes are over though, the "write" load gets very low. What else should I be looking at to figure out how to most effectively use 2 servers?
[15:50:18] <kali> sgtpepper: when the chunk has outgrown the wanted chunk size... and at that point you get an expensive split and rebalance
[15:58:32] <gdoteof> i need to take all those values and add an {end:$ISODATENOW}
[15:59:20] <ianblenke> yeah, ignore what I said. shard key. missed the scrollback. caught up now.
[16:02:08] <gdoteof> basically i just need to know how to get "now" for an ISO date
[16:03:23] <ianblenke> like new Date()? something like this perhaps: db.test.update({end:null}, {$set: {end: new Date()}}, false, true);
[16:07:09] <playcraft> What would be the most efficient, multi-user and shard friendly way to calculate the difference between two values and then calculate the average of the differences?
[16:09:19] <playcraft> MapReduce can get the job done but, from what I read, it is not good to do it this way because it locks the JavaScript context.
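(Assuming both values live in fields of each document, the 2.2 aggregation framework can compute this without touching the JavaScript engine; the "samples" collection and "start"/"end" fields are invented:)

    db.samples.aggregate([
        {$project: {diff: {$subtract: ["$end", "$start"]}}},  // per-document difference
        {$group: {_id: null, avgDiff: {$avg: "$diff"}}}       // average across all documents
    ])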
[16:10:51] <SergeyUkolov> Hi all! is there an analog in mongo for USE INDEX (index_list) (from MySQL)?
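(The closest analog is cursor.hint(), which forces the query to use a particular index; a minimal sketch with assumed names:)

    db.users.find({email: "foo@foo.com"}).hint({email: 1})  // by key pattern
    db.users.find({email: "foo@foo.com"}).hint("email_1")   // or by index name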
[16:11:30] <eydaimon> are the same 4mb limitations still in place per document now in 2.2?
[16:16:39] <DrShoggoth> can i store and index 2d shapes and query them by lat/lon within?
[16:17:03] <DrShoggoth> i find a lot of examples on doing "near" queries, but I only want to find shapes my point falls within
[18:46:27] <kali> krispyjala: nope, it's actually the right one. you just need to make your app talk to the mongos instead of the mongod before enabling sharding
[18:47:12] <krispyjala> kali: so it'll magically shard by key on my old data?
[18:47:29] <kali> you need to run enableSharding on one or several databases
[18:47:44] <kali> then enable sharding on one collection, and specify a sharding key
[18:48:00] <kali> at that point, the cluster will start to split data and move it around
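(The sequence kali describes, sketched against a mongos; the database, collection, and key are placeholders:)

    sh.enableSharding("mydb")                            // 1. enable sharding on the database
    sh.shardCollection("mydb.bigcollection", {user: 1})  // 2. shard the collection on a chosen key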
[18:48:02] <krispyjala> there's just one huge collection that we're trying to shard cuz it's getting too big
[18:48:14] <krispyjala> ok, but in the meantime it will still be serviceable?
[18:48:22] <krispyjala> or would this be the downtime?
[18:48:34] <krispyjala> that collection size is about 300GB I think
[18:48:49] <kali> yes. the only downtime is when you stop your app, change its config to make it speak to mongos instead of mongod
[18:50:41] <krispyjala> kali: one other question: we're not even on a replica set. We're still on master-slave, so would we need to convert to a replica set first? Or can we do all of it in one shot with the new sharding?
[18:51:14] <kali> krispyjala: i don't know if sharding and old master/slave are compatible
[18:51:42] <kali> i would certainly start by moving to replica set
[19:05:53] <jgornick> Hey guys, are there any plans to have Mongo remove the limit of 16MB documents and use something similar to how GridFS partitions documents?
[19:06:16] <jrdn> so using the aggregation framework, i want to count all the items that exist in a document?.. just a simple count
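(Assuming "items" is an array field, a sketch of counting its elements per document with the aggregation framework; names are hypothetical:)

    db.orders.aggregate([
        {$unwind: "$items"},                        // one doc per array element
        {$group: {_id: "$_id", count: {$sum: 1}}}   // count the elements back per document
    ])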
[19:52:43] <_m> A document with structure: { username, email, remember_me }
[19:53:42] <_m> Assume some records exist with data like this: {username: "foo", email: "foo@foo.com", remember_me: 'asdasdfasdf' } and { username: 'bar', email: 'bar@bar.com', remember_me: 'asdasdfasdf' }
[20:22:35] <kewark> hi guys, is 'rs.initiate()' idempotent?
[20:23:29] <kanzie> I have a collection of movies and I want to save which users like which movies by adding the userId to an array stored in the movies collection. Every time a user presses like on a movie I want to add their id to the array and increment the "likes" field by one if the user has not liked the movie before; if already liked, then skip. I can do this brute force with php but think there are nicer ways of doing this in the insert-statement for mo
[20:24:40] <kanzie> so my question is: should the "likes" property be a js function that counts items in the likedBy array, or should I have an integer value that I increment?
[20:27:37] <kanzie> and should I do something like if ( db.movies.find({'_id': $myid, 'likedBy': $userId}) )
[20:29:51] <jrdn> i'd also increment / decrement a likes field
[20:30:01] <crudson> 1) the latter, 2) yes, as part of an update in which you can $inc the count and $push the userId
[20:30:50] <jrdn> depending on the traffic / amount of likes you get, it may be bad to push the userid
[20:31:23] <kanzie> jrdn crudson: thanks guys… since I was planning to build a scoring function too that takes a bunch of properties from the movie item and does some magic on it, I planned on doing this as a Score field with a js function that would calculate the score. But you don't recommend storing js functions?
[20:32:08] <kanzie> jrdn: why is that? I was thinking of perhaps just storing the userids as strings instead of actual userIds, but I have no idea which is best from a mongo point of view
[20:32:48] <kanzie> crudson: Is there any way to see if $addToSet actually added a value or not?
[20:32:52] <jrdn> i'm sure you can get a lot of user ids in there before hitting the 16MB document limit :P
[20:34:15] <jrdn> also, documents move on disk and you'll see high CPU / IO if they continuously grow in size.. i've noticed this with our current schema.. but i'm not a complete genius :X
[20:38:18] <kanzie> thanks, I'm reading up on addToSet and push now, but it seems a little wonky given what I need, because I only want to increment if I successfully add
[20:39:35] <kanzie> maybe I should count the items in the array after addToSet and update the count accordingly
[20:39:59] <kanzie> nah, better to check if the id is in the set already; if not, then perform the update and increment, else skip
[20:40:52] <crudson> kanzie: that's why the query part of the update should have {users:{$ne:'123'}} so you are matching a movie that a user hasn't liked already
[20:42:03] <crudson> kanzie: "find movie x that user y hasn't liked, and increment the count and add the user to the list" can happen in one go
[20:42:12] <kanzie> but I already know the movieId and the userId; however, if I do an update I would need to consider the return value, and there doesn't seem to be any simple and foolproof way to do that
[20:42:46] <kanzie> crudson: well, if so I really need some help constructing that query cause I can't get my thick fingers around that
[20:46:10] <crudson> db.movies.update({_id:'titanic', userlikes:{$ne:'user123'}}, {$inc:{likes:1}, $push:{userlikes:'user123'}}, false, false) could do it
[20:46:39] <kanzie> ok, lets see if I can translate this to php