[02:23:45] <Frozenlock> Nice, I'm happy to see this channel exists :)
[02:24:10] <Frozenlock> Is there a way to make a "variable-capped collection"? By that I mean that I want a maximum size for a given collection (say 1 GB), but I don't want to occupy a whole gigabyte on my hard disk if I'm really only using 10k.
[02:33:41] <nemothekid> Frozenlock: I don't think capped collections allocate all the given space
[02:36:49] <Frozenlock> nemothekid: If they don't it must be quite near... it just created multiple GB of files :(
[02:40:12] <Frozenlock> From the doc: Unlike a standard collection, you must explicitly create a capped collection, specifying a collection size in bytes. The collection's data space is then preallocated.
[02:41:39] <TkTech> Why do you /not/ want to pre-allocate it?
[02:43:10] <Frozenlock> Because the capped size was a worst case scenario. However if I use a capped collection, then the worst case scenario becomes immediate.
[02:44:34] <TkTech> I don't think you're using it the way you should be using it.
[02:45:04] <TkTech> A capped collection is fully designed to accept more documents than it can reasonably fit, in which case the oldest are replaced.
[02:45:52] <Frozenlock> Yes, which is exactly the functionality I was looking for -except for the preallocation.
[02:46:30] <TkTech> So, you want it to fold over when it gets filled, but you don't want it to be large enough to be filled?
[02:46:31] <Frozenlock> Say I let 100 users upload log data. I want to give them a maximum of a gigabyte each. If I use capped collections, then I _immediately_ lock down 100 GB, even if most of them might be using only 100 KB
[02:47:14] <TkTech> That is an amazingly bad usage of capped collections.
[02:48:13] <Frozenlock> Ok... is there a function to use a normal collection and give it a maximum size?
[02:48:31] <Frozenlock> And have it discard the oldest values if it reaches this limit?
[02:58:07] <nemothekid> Frozenlock: Your best bet is doing it yourself
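A minimal sketch (not from the channel) of the do-it-yourself approach nemothekid suggests, in mongo shell syntax; the collection name "logs" and the byte budget are hypothetical. It trims the oldest documents whenever the data size exceeds the cap:

    var maxBytes = 1024 * 1024 * 1024;   // hypothetical 1 GB budget per user
    function insertLogTrimmed(doc) {
        db.logs.insert(doc);
        // stats().size is the current data size in bytes; ObjectId _ids sort roughly by insertion time
        while (db.logs.stats().size > maxBytes) {
            var oldest = db.logs.find().sort({_id: 1}).limit(1).next();
            db.logs.remove({_id: oldest._id});
        }
    }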
[02:59:28] <clone1018> I'm having quite a weird issue with MongoDB and its PHP driver. Basically I've enabled auth=true in the mongodb config file, and verified my script can log in using the user information I selected, but you're still able to make a connection and read data without using a user on localhost
[03:10:16] <TkTech> clone1018: You've done the obvious? Tried running mongod with --auth directly? Made sure you had at least one user? (Mongo will continue to accept connections until you add at least one)
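A quick sketch of the setup steps TkTech is describing, in mongo shell syntax; the credentials are placeholders:

    // start the server with auth enabled: mongod --auth (or auth=true in the config file)
    use admin
    db.addUser("admin", "secret")   // 2.x-era helper; until a user exists, connections are let through
    db.auth("admin", "secret")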
[03:10:52] <TkTech> clone1018: Also, dubbed Gundam? To shame.
[03:15:10] <clone1018> TkTech: yes, and yes, not for every database, but one user for admin and one user for the db in question
[03:15:21] <clone1018> TkTech: Seen it before, background content
[06:59:12] <tanghaihao> hi all, I have some questions, but I can not find the answers in the mongodb DOCS. Does the oplog sync to disk on every operation?
[07:00:00] <tanghaihao> Or, like the journal, sync to disk every 100ms?
[07:07:54] <kali> tanghaihao: i think it syncs at the same time the main data does, so iirc it's up to mongodb to decide, and that might be as low as once a minute
[07:11:07] <tanghaihao> I think that is not a good idea. The client is likely to lose data.
[07:12:34] <kali> this is debatable. the main idea behind mongodb durability is to rely on replication
[07:13:16] <tanghaihao> So in replica set replication, does recovery rely on journal files or oplog files?
[07:13:50] <kali> tanghaihao: you can adjust the durability depending on the importance of the write you're performing. you can require fsync, but also wait for the write to be copied to a majority of replicas
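A sketch of what kali means by adjusting durability per write, using the getLastError command available in this era; the collection and field names are made up:

    db.payments.insert({user: "alice", amount: 42});
    // block until the write is journaled locally and replicated to a majority of the replica set
    db.runCommand({getLastError: 1, j: true, w: "majority", wtimeout: 5000});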
[07:14:32] <kali> the oplog is actually the way replication is handled
[07:14:52] <kali> and the journal is there to ensure a faster and safer standalone node recovery
[07:15:58] <tanghaihao> journal files do not play any role in replication recovery?
[07:17:24] <kali> not directly, but they allow each node to restart faster. think of it as the journaling of a file system: you just get a faster reboot in case of a power failure
[07:19:45] <tanghaihao> Can I understand it as: the node will recover from journal files first, and then from the replicas?
[07:21:24] <kali> yes, the node will recover its last clean state from the journal, and then start fetching the writes it missed from another replica
[07:25:11] <tanghaihao> When the Primary receives an INSERT operation, is the execution order: apply the INSERT, write the INSERT operation into the journal files, write the INSERT operation into the oplog, write a WRITE OPLOG operation into the journal files, 4 steps in total?
[07:30:02] <tanghaihao> Can you describe the execution order of OPERATION, JOURNAL, OPLOG in the PRIMARY role and the SECONDARY role? Many thanks!
[07:45:38] <kali> tanghaihao: i don't know that level of detail
[07:46:22] <kali> tanghaihao: but let's make this clear: you won't easily get the same level of durability with mongodb as you get by default from a SQL server
[08:29:51] <amitprakash> For collection.update(filter, newobj), how can I ask update to set one key to the value of another key in the object?
[08:30:06] <amitprakash> i.e. $set { field: field2 }
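As far as I know the update operators in this era can't reference another field of the same document, so a common workaround is a client-side loop; a sketch in mongo shell syntax with hypothetical collection and field names:

    db.things.find({/* your filter */}).forEach(function (doc) {
        db.things.update({_id: doc._id}, {$set: {field: doc.field2}});
    });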
[08:58:38] <purple_cricket> so I'm attempting to mimic a multikey shard key with a compound key, i.e. replacing ["foo", "bar", "baz"] with {foo:true, bar:true, baz:true}. I've checked this out with explain (hits the index and routes to just one shard) but I don't have a proper cluster available to do real performance testing on this
[08:59:02] <purple_cricket> can anyone think of any obvious performance pitfalls with this technique, i.e. is my index going to become ludicrously massive?
[09:02:07] <purple_cricket> er, to clarify, the key is on the entire embedded document, which contains arbitrary elements, but an uncommonly large one would only have about a dozen sub-keys, and two dozen would be particularly rare. Many documents will share subkeys in common, and the shard key is also compounded on a nice random
[09:24:09] <purple_cricket> hmm, no, the key names are long so that'll consistently assault the shard key length limit. Providing query isolation may not even be possible with this application... back to the drawing board
[09:30:28] <NodeX> I think if you use bigger words in your explanation it will definitely help your app to look cool :P
[09:54:57] <SisterArrow> Hiya! I'm trying to understand why my mongo instance never uses more than ~4gb resident memory on my 16gig machine. free -m reports 13gb of free/cached memory. For a specific collection that often has queries taking over 100ms I now have 3.5gig resident memory, 950gb virtual and 480 mapped.
[09:56:40] <NodeX> have you indexed your docs properly ?
[09:58:46] <SisterArrow> NodeX: In this specific collection I have one index that is used for basically every query, product_identifyer, which is a hash of some fields in the document. I *always* use the indexes when querying, according to explain().
[10:11:44] <SisterArrow> NodeX: It's 81 million docs with a size of ~310gb.
[10:12:30] <NodeX> the index probably doesn't fit into memory then
[10:13:11] <SisterArrow> The index that I'm using to query is 6.5 gb so there should be plenty of space, I would think?
[10:13:59] <SisterArrow> Although in the same mongo instance I have another database of around 60gb with around 20gb worth of indexes, which is also queried quite a lot. I dunno if that has anything to do with it?
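A small mongo shell sketch (not from the channel) for checking how the combined index footprint compares to RAM; the collection name is hypothetical:

    db.mycollection.stats().totalIndexSize   // bytes of index for this collection
    db.mycollection.stats().indexSizes       // per-index breakdown
    db.stats().indexSize                     // total index size for the current database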
[10:15:14] <NodeX> maybe the indexes are colliding between the 2 instances
[10:15:18] <SisterArrow> Well, I have a process that sends data to an LED sign that does a .count() on the collection every minute, now that I think about it.
[10:15:42] <SisterArrow> How do you mean colliding?
[10:15:43] <NodeX> count() is very expensive on queries with parameters
[10:15:50] <NodeX> [11:14:00] <SisterArrow> Although in the same mongo instance I have another database of around 60gb with around 20gb worth of indexes,
[10:16:35] <SisterArrow> Ah, "instance" made me think of two separate mongo instances.. Both databases live in the same mongo instance.
[10:16:59] <NodeX> not sure why free -m isn't reporting properly though
[10:17:19] <NodeX> I would hazard a guess that the 81million document index is spilling to disk and it's thrashing
[10:17:32] <SisterArrow> Is it really expensive? I tried it out in a shell and it returned right away.
[10:17:43] <SisterArrow> Yeah, I'm seeing around ~200 page faults per second.
[10:22:08] <SisterArrow> Well, that's because it only finds 20 documents, and I have a read-ahead that reflects that on average there should be ~50 documents in a series. But sometimes there are 300-400 documents.
[10:22:33] <SisterArrow> And also, that does not explain why the memory usage is not higher :)
[10:23:07] <NodeX> can you pastebin an explain() with a ~100ms response time
[10:25:57] <NodeX> the memory usage can be explained by your other instance using the memory and this index spilling onto disk
[10:26:09] <NodeX> I thought that was implicit sorry
[10:26:36] <SisterArrow> NodeX: I'm trying different queries from the mongo log which have been reported as slow, but all of them seem to be cached somewhere and return fast.
[10:27:28] <NodeX> always the same when trying to test!
[10:28:01] <SisterArrow> That would explain it I guess. Every time I look at this db/collection I also look in a couple of other collections in the other database, and those collections do have indexes which are too big to fit in memory (if you sum them up)
[10:28:52] <SisterArrow> I should probably request beefier machines. :)
[10:29:17] <NodeX> I work with around double the RAM to index size personally
[10:33:35] <SisterArrow> Two machines with 16GB RAM and 512+256 RAID-0 SSD where one is slower for some reason, and a replica node in AWS that is useless except as a safety net.
[10:33:41] <NodeX> also remember that every write takes a lock, and if the write is index bound it incurs overhead when writing, thus not releasing the lock as fast
[10:33:54] <NodeX> as your dataset grows this lock time will only ever increase
[10:34:36] <SisterArrow> Yep. I'm keeping an eye on the lock % and flush time. As it is now it's around 2-3% and a flush time of ~500ms.
[10:34:40] <NodeX> if you sharded then you can halve the lock time with 2 machines
[10:35:16] <NodeX> A lot of people don't like AWS performance
[10:35:37] <SisterArrow> If the SSD-backed instance wasn't so darn expensive I would be a happy ops.
[10:36:13] <SisterArrow> (I'm not an ops person, I just happen to be the only guy in the office with Linux experience, so I kinda look after things while trying to learn all the bits and pieces about it)
[10:36:29] <SisterArrow> I never knew about the devil we call "go to disk" before.
[10:36:44] <NodeX> I operate as much in ram as I can
[10:37:10] <SisterArrow> I was amazed how tweaking the read-ahead yielded a database that was able to perform twice the queries per second, for instance.
[10:37:33] <NodeX> on the plus side... even on a bad performance day Mongo still outperforms -most- *SQL solutions
[10:38:07] <NodeX> and its scalability is its biggest strength
[10:38:30] <NodeX> if your app was in Mysql you would be throwing RAM at the system and more SSDs about now !!
[10:39:30] <SisterArrow> I'm impressed by how hard it is to bring a node to a halt. Also, I have many times managed to kill the instances, but they always (well, most of the time) come back alive without having to resync; I guess I can thank journaling for that.
[11:07:36] <greeeps> i have two nodes. before sharding, a find() query gives 39 results, but after sharding one node gives 17 and the other gives 22 results. why can't one node give the total results?
[11:09:46] <NodeX> doesn't the mongos do that for you?
[11:17:04] <greeeps> NodeX: Thanks. That works. I tried to connect to the nodes directly, so I was getting missing results.
[11:17:21] <greeeps> NodeX: can we run more than one mongos ?
[11:18:17] <NodeX> I am not sure, I don't think so
[11:18:45] <NodeX> it's the router that scatters / gathers for your cluster ... You should double check that though
[11:22:14] <solars> quick question, what does a limit(-1) do?
[11:22:39] <NodeX> quick answer - do it and find out !!
[11:24:04] <solars> I saw that the ruby driver or mongoid uses it
[13:06:16] <wiherek> hi, I am looking into querying for geospatial results and I wonder whether I will get a performance gain by storing data in geohash format instead of using mongodb's geospatial indexing
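For comparison, a minimal sketch of mongodb's built-in 2d geospatial index in mongo shell syntax; the collection and field names are hypothetical:

    db.places.ensureIndex({loc: "2d"});
    db.places.insert({name: "cafe", loc: [-73.97, 40.77]});    // [longitude, latitude]
    db.places.find({loc: {$near: [-73.98, 40.76]}}).limit(10); // nearest 10 places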
[13:17:07] <Honeyman> Hello. Seems I am missing something trivial, but... how can I make a spec for querying a subdocument with multiple conditions, so that both of them must be satisfied on THE SAME subdocument?
[13:20:14] <Honeyman> Like, db.users.findOne({'msgs.time': {$gt: sometime}, 'msgs.tag': 'cats'}) which returns a user only if it finds some message which both has a time greater than sometime and is tagged as 'cats'.
[13:21:29] <Honeyman> But doesn't return a user if it has a message with a time greater than sometime, and has a different message tagged as 'cats', but has no message which satisfies both conditions simultaneously.
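What Honeyman describes sounds like $elemMatch, which applies all of its conditions to the same array element; a sketch reusing the names from the question (sometime is a placeholder):

    db.users.findOne({msgs: {$elemMatch: {time: {$gt: sometime}, tag: 'cats'}}})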
[13:28:12] <clone1018> I'm having quite a weird issue with MongoDB and its PHP driver. Basically I've enabled auth=true in the mongodb config file, and verified my script can log in using the user information I selected, but you're still able to make a connection and read data without using a user on localhost
[13:39:16] <madrigal> clone1018: this might help - https://jira.mongodb.org/browse/SERVER-4933
[13:40:50] <agonist> Is there a limit to the number of bytes for PyMongo's .read()? I can read all files below 100 kb perfectly, but when I try to print out files (I've confirmed they are in the database and intact) with a .get().read(), they are missing about 2000 bytes. Do I need to set up a read function that loops over the number of bytes?
[14:02:35] <remonvv> clone1018, localhost can always access the server directly
[14:03:25] <clone1018> remonvv: how do I disable that.
[14:03:40] <remonvv> You don't. There's no possible reason why you'd auth from localhost.
[14:03:57] <remonvv> you can use the machine IP rather than localhost
[14:04:26] <clone1018> There is a possible reason, code being eval'd on the machine in question
[14:04:59] <remonvv> I meant from MongoDB's perspective. If you've already been confirmed to have access to the entire machine, doing authentication at that point is a bit useless. As I mentioned, you can use the machine's IP address instead.
[14:05:16] <clone1018> But that doesn't stop the code being eval'd from using localhost
[14:05:40] <remonvv> Hm, it should I think. I think it special cases localhost/127.0.0.1 iirc
[14:05:46] <remonvv> That may have changed since I last tried.
[14:06:14] <remonvv> By the way, you can auth, you just don't have to. So you can test if it works anyway.
[14:06:23] <remonvv> Wrong auth should still be rejected iirc.
[14:06:50] <remonvv> Out of curiosity, why are you planning to use auth?
[14:07:25] <clone1018> So you can't access the database from localhost
[14:08:00] <remonvv> Ha, nice, our servers just peaked at 192,567 reads/sec and 83,038 writes/sec
[14:08:14] <remonvv> clone1018, you can always access the db from localhost
[14:08:40] <remonvv> the purpose of auth is to provide an authentication mechanism for clients that are non-local to the database node
[14:09:10] <clone1018> So what are my choices here?
[14:09:23] <remonvv> I'm not aware of many good reasons to use MongoDB auth though. Your database should be inside a DMZ anyway and not be accessible to anyone but trusted machines.
[14:09:38] <remonvv> Well, what are you trying to accomplish? That auth is required when you connect from localhost?
[14:11:35] <clone1018> Sorry, was busy there, can talk now. Okay, basically, I'm using MongoDB for a small site that allows you to run php on the fly. Due to budgetary reasons there's only one machine; compile, run, save, and database all on the same machine. At the moment you can access the mongodb without any restrictions via the eval'd code, meaning you can see users, dump stuff, and do whatever you want. I want a way of allowing username/password connections only, so the root script
[14:11:35] <clone1018> can make the connection but the eval'd code can't.
[14:12:14] <remonvv> And there are admin users in the db already?
[14:12:25] <remonvv> As soon as you've entered an admin user it should force auth regardless.
[14:12:35] <remonvv> That's why you need to add the first through localhost typically
[14:13:12] <remonvv> So, with auth enabled you can only access the db from localhost (without auth) until you add an admin user.
[14:13:33] <clone1018> There is an admin user and it is NOT forcing auth
[14:14:06] <remonvv> And when you say "eval'd" code you mean what?
[14:14:26] <clone1018> External code written by other users that is run on the local machine
[14:15:13] <remonvv> Let me quickly test if auth is forced for connections coming from local host, 1 sec
[14:17:53] <remonvv> okay, auth works from any host now
[14:18:18] <clone1018> As of which version? Or which version are you on
[17:20:26] <timoxley> I have customers. customers can own objects. customers can see objects other customers own. how would you implement this relationship in mongo?
[17:21:20] <timoxley> object gets customer id? that sounds very relational
[17:24:12] <timoxley> TkTech ok what if, every time I look up one of these objects, I also need the owner details
[17:25:06] <TkTech> Then you do another query and look up the owner :)
[17:25:28] <TkTech> Efficient caching and delayed aggregation are key here if you intend to serve volume.
[17:25:51] <timoxley> TkTech can you elaborate on that
[17:27:29] <TkTech> Say you have 16 million objects, and you want to count all the objects that belong to not just one user, but intersect between 3 of them.
[17:27:53] <TkTech> Rather than do it on demand, you either add it to a task queue and notify the user when it is complete, or you generate it every so often and return cached results.
[17:28:17] <TkTech> For simple things like how many objects a user owns, simply keep a counter on the customer object and $inc it as that count changes, so you don't need to count each time.
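A sketch of that counter idea in mongo shell syntax; the collection and field names are made up:

    // whenever a customer gains or loses an object, adjust the cached count atomically
    db.customers.update({_id: customerId}, {$inc: {objectCount: 1}});
    db.customers.update({_id: customerId}, {$inc: {objectCount: -1}});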
[17:36:43] <TkTech> timoxley: Also, a tip that most people seem to miss: key sizes matter more than you think, and if you plan on EVER supporting search, add a normalized (stripped and lower-cased) field in addition to your other fields.
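A sketch of that normalized-field idea, assuming hypothetical collection and field names:

    var name = "  Acme Widgets ";
    db.objects.insert({
        name: name,                                   // display form
        name_normalized: name.trim().toLowerCase()    // stripped, lower-cased copy used for search
    });
    db.objects.ensureIndex({name_normalized: 1});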
[17:37:03] <timoxley> TkTech I push search out to redis
[17:37:36] <TkTech> Hm, I use redis for a bunch of things, never thought to use it for the actual search.
[19:24:17] <rnickb> any reason why safe mode isn't enabled by default?
[19:24:46] <Lawnchair_> if you don't want exceptions; it might also be slower
[19:27:04] <rnickb> what's the c++ function that enables safe mode?
[19:28:32] <TkTech> rnickb: Because it is vastly slower, especially for things like inserts.
[19:29:12] <TkTech> rnickb: Instead of just sending off the query and immediately continuing, you sit there and wait to ensure minimum consistency (including optionally asking to make sure it is replicated to at least N nodes before returning)
[21:10:29] <frozenlock> What would be the best function to use if I want to get a sample of a collection? For example, how could I get a value every 1/10th of the entire collection?
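There's no built-in sampling operator in this era as far as I know; one single-pass sketch in mongo shell syntax that keeps roughly every Nth document (collection name hypothetical):

    var total = db.mycollection.count();
    var step = Math.max(1, Math.floor(total / 10));   // aim for ~10 samples
    var i = 0;
    db.mycollection.find().forEach(function (doc) {
        if (i % step === 0) printjson(doc);   // keep every step-th document
        i++;
    });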
[21:12:44] <slava_> Any Perl users around? I was wondering if there is a recommendation/comparison on MongoDB vs MongoDBx::Class.
[22:27:30] <devdazed> Hi all, I downloaded the 2.2rc0 to try out the new aggregation framework and i get this "no such cmd: aggregate"
[22:27:31] <devdazed> on all queries (even the examples)
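A quick sanity check that the shell is actually talking to a 2.2 server, plus a minimal aggregate call; the collection and field names are hypothetical:

    db.version()   // should report 2.2.x; "no such cmd: aggregate" usually means an older mongod answered
    db.orders.aggregate({$group: {_id: "$status", n: {$sum: 1}}})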