pmxbot IRC Log Viewer

[00:00:49] <GothAlice> Apparently. :/ min/max is darned useful. (Note: I don't actually use MongoDB's geo capabilities.)

[00:02:58] <Streemo> min/max as in efficiency?

[00:03:14] <GothAlice> As in that style of querying, just without the sorting by distance.

[00:03:27] <GothAlice> $minDistance/$maxDistance

[00:04:01] <Streemo> yeah, like a nice annular query

[00:04:49] <Streemo> but couldnt one jusy query "in big circle & not in small one"?

[00:05:00] <Streemo> to get the desired ring effect

[00:09:38] <JokesOnYou77> Hi all, I have documents of the from {from: [list of string], topics: [list of string]} I'd like to get an aggregation of {topicName: {address: count}} where address is a unique "from" value count is the sum of how many times that from occurs in that topic. A document may have many topics

[00:30:10] <JokesOnYou77> Hi all, I have documents of the from {from: [list of string], topics: [list of string]} I'd like to get an aggregation of {topicName: {address: count}} where address is a unique "from" value count is the sum of how many times that from occurs in that topic. A document may have many topics. What is the best way to do this in mongo?

[00:30:45] <joannac> aggregation framework

[00:31:37] <JokesOnYou77> joannac, Yes, I'm trying to figure out the best way to do that with the framework and I'm having trouble using a nested $group, and I'm not even sure that's how I should be doing it

[00:32:06] <joannac> pastebin a sample document, and what you have so far

[00:39:26] <JokesOnYou77> joannac, http://paste.ubuntu.com/10987513/

[00:41:23] <JokesOnYou77> srry, there's a typo in the post, should reat "$from" instead of entities, but you get the idea i think

[00:56:15] <joannac> JokesOnYou77: seems right to me. you just need a counter as well

[00:59:29] <JokesOnYou77> joannac, That's what I'm trying to figure out. I need to get a count of EACH unique from in that topic

[01:02:23] <joannac> yup

[01:02:45] <JokesOnYou77> That's the part I need help with...any thoughts?

[01:02:52] <joannac> yeah... go read the docs

[01:02:55] <joannac> specifically, $sum

[01:03:01] <JokesOnYou77> Thanks.

[05:14:27] <Streemo> is there a simple way to convert an array to a cursor so i can query on the JSON in it?

[05:15:18] <Streemo> for exmple i want to use the power behind mongodb onmy arrya on the fly instead of intitalizing a new collection

[05:23:38] <joannac> Streemo: no

[05:24:26] <Streemo> joannac: ok

[05:24:27] <Streemo> thx

[05:25:00] <Streemo> i think creating a collectio nfor the sake of rujnning a mongo query is not good, might as well just do the math explicitly on the array

[08:33:00] <sweb> i have data like this : http://paste.ubuntu.com/10988868/

[08:33:39] <sweb> how can i get aggregation ... group by tags match /\~R/ and count by /\~C/ ?

[11:17:58] <jdo_dk> How to insert a record from pymongo like the mongo shell: created: new Date() ? I have tried: "created":datetime.datetime.now(). But there is 2 hours different on the inserted time.

[11:22:21] <jdo_dk> utcnow() looks like the on i was missing. Thanks. :D

[15:30:13] <salty-horse> can data stored in a collection affect the query plan?

[15:31:00] <StephenLynx> what do you mean?

[15:31:32] <GothAlice> salty-horse: It can. Indexes being considered for use are weighed against how effective they would be at reducing the work later in processing—highly selective indexes are preferred. Some times an index won't be used when one might think it should because of this.

[15:32:16] <GothAlice> salty-horse: I.e. you want indexes that are highly selective: if 90% of your data has the same value for an indexed value, that index isn't very useful, and may only be used in scenarios where you're hitting that remaining 10%.

[15:32:36] <salty-horse> I'll be more specific then. (I want to hint() something, but I'm worried if it's the right approach)

[15:32:54] <GothAlice> salty-horse: When in doubt, measure. (Esp. as any form of optimization is by definition premature without measurement.)

[15:33:17] <deathanchor> .explain() is best to figure things like that out.

[15:33:20] <GothAlice> Indeed.

[15:33:44] <GothAlice> Hinting can be extremely useful, as MongoDB only allocates so much time to trying to figure out the best approach. Hinting lets it skip that heuristic process.

[15:33:55] <deathanchor> of course explain() is only a snapshot in time, you should check your logs for slow queries over time to try to optimize the slow things

[15:34:23] <salty-horse> 1) I have a sharded collection with a shard index {a: "hashed"} and an index on {a:1, b:1, c:1}. this is the type of query I'm interested in: find({a: 1, b: {$in: [1, 2]}}).sort({c: 1}).limit(100)

[15:34:53] <salty-horse> 2) According to SERVER-12783, a sensible merge sort should happen for each of the '1' and '2' sorted results

[15:35:19] <salty-horse> 3) If I create this dummy collection, shard it, and explain() an empty find, it chooses the a, b, c index.

[15:35:39] <salty-horse> 4) If I try it on a collection I filled with some data, it chooses the { a: hashed} index

[15:36:00] <salty-horse> 5) If I try it on one of the shards locally, it chooses the a, b, c index.

[15:37:48] <salty-horse> I don't know what the database actually does if I hint it. explain() seems to scan more documents than it needs to, given the limit I set.

[15:38:24] <salty-horse> GothAlice: what kind of measuring methods are you suggesting? timing it, or something more involved?

[15:38:42] <GothAlice> salty-horse: If you're ordering (not using natural sort), then it needs to process the whole candidate set as part of that sorting operation, prior to limiting the count. (The sort itself is smart enough to only hold on to as many items as the limit specifies, though.)

[15:38:53] <GothAlice> salty-horse: Timing it is the general approach, yes.

[15:40:23] <salty-horse> GothAlice: the sort is part of the the index. Isn't that "natural"?

[15:40:45] <GothAlice> No, $natural is natural sort. If you're sorting on something in the index, direction can matter.

[15:41:07] <GothAlice> (Sorting on {foo: -1} might not be able to make use of a {foo: 1, bar: 1} index.)

[15:45:02] <salty-horse> GothAlice: oh, I'm planning not to access the actual data :) (my actual index has the fetched field as an extra last value, and I'm upgrading to 3.0 so the shards would only access the index. I hope)

[15:45:23] <GothAlice> Index-covered queries are, indeed, awesome.

[15:49:30] <salty-horse> I'm suspicious something isn't working optimally. if I'm searching for b:{$in[1]} with limit(20) and the "proper" index, mongodb says it scanned 20 records. If I'm searching instead for b:{$in:[1,2]} (and hinting the proper index) mongodb suddenly scans 45 records. No matter the data, it would only need to scan at most 20 + 20 = 40, since I set a limit.

[15:51:43] <deathanchor> limit is cursor operator, not a query operator

[15:52:49] <cheeser> you *have* to scan all the results to apply the query filters. then limit() just constrains the number you get back to your application.

[15:52:59] <cheeser> though limit is largely a driver operation, iirc.

[15:56:34] <pamp> Hi

[15:58:03] <pamp> windows users, how to add permissions to key_file? I cant start mongod, error open file, permission denied

[15:58:17] <pamp> windows server 2012

[16:03:45] <salty-horse> cheeser: If I understand correctly, according to this bug, queries can be limit-aware - https://jira.mongodb.org/browse/SERVER-12783

[16:12:24] <GothAlice> salty-horse: That is correct; the sorting system is smart enough now to only track N items, where N is the limit you have provided.

[16:12:53] <GothAlice> (It still needs to evaluate all candidates, though.)

[16:13:22] <GothAlice> (Unless you have a direction-matched index. Then it's a bit simpler.)

[16:14:06] <salty-horse> GothAlice: what does it have to evaluate, for example? all of my fields are in the index, in the sort order I request.

[16:19:26] <GothAlice> salty-horse: In that scenario it'd walk the btree and just gather the number of elements you want.

[16:21:11] <salty-horse> GothAlice: exactly, but "nscanned" is too high :) - is it safe to ignore this anomaly?

[16:21:31] <GothAlice> salty-horse: Gist/pastebin your .explain() results?

[16:21:49] <GothAlice> (And query. ;)

[16:23:00] <salty-horse> just a moment

[16:23:05] <salty-horse> I'll have to sanitize

[16:35:09] <salty-horse> GothAlice -- thanks a lot for trying to help :D - http://pastebin.com/WPcjNLG0 -- two things I don't understand: Why the second query (the one with ['123', '456']) has two similar "clause" sections, and why one of them scans 45 documents, when it should at most be 40, given the limit

[16:37:25] <matthavard> Let's say I have a Mommy collection where a Mommy can have zero or many Children from the Children collection. To represent this in mongo I might have a child-ids array field in a Mommy document that is just an array of ObjectIds pointing to various Children records. But is there a way that I could tell monog that these object ids refer to documents in the Children collection so I can operate on entries in the Mommy's child

[16:37:26] <matthavard> as if they were the full Children document for each and not just the ObjectId. So I could do `db.Mommy.findOne({_id:ObjectId(...)}).children[0].name`, to get the name of the first child of the Mommy document, instead of doing `var child_id = db.Mommy.findOne({_id:ObjectId(...)}).children[0]; db.Children.findOne({_id:ObjectId(child_id)}).name`

[16:38:41] <GothAlice> salty-horse: My initial impression would be that $in requires the planner to divide its plans based on the number of elements you are matching against. (Combinatorial problem.)

[16:39:07] <GothAlice> The second likely comes down to how the btrees in MongoDB work. They're buckets, each holding references to more than one actual record.

[16:39:12] <GothAlice> You can't partially load a btree leaf.

[16:39:37] <GothAlice> However, these are "gut impressions", not gospel, salty-horse. ;)

[16:39:38] <salty-horse> GothAlice: so it scans as much as there is in the leaf, but not too much more?

[16:39:44] <GothAlice> That should be correct.

[16:40:03] <GothAlice> I'd have to dig through the code again to determine the bucket size.

[16:40:08] <GothAlice> ^_^

[16:40:13] <salty-horse> GothAlice: ok, I'll take the plunge and load some more data. (I guess I could test it with dummy data first)

[16:40:32] <GothAlice> Try it out with different limits, i.e. +/-1 1, and see if the nscanned changes. ;)

[16:40:49] <salty-horse> GothAlice: oh, thanks for the offer, but don't bother. I'll do more tests and if things don't make sense, I'll bug the channel again :)

[16:40:55] <salty-horse> good idea

[16:41:05] <GothAlice> ±1 — apparently I turned automatic unicode transforms off. XD

[16:41:34] <salty-horse> GothAlice: fancy. irssi?

[16:41:43] <GothAlice> Mac. ;)

[16:41:52] <GothAlice> (Built in feature; no need to buy textexpander.)

[16:42:00] <salty-horse> nope. still not going to use it :P

[16:42:16] <matthavard> GothAlice: Help me! when you get a chance...

[16:42:30] <salty-horse> matthavard: sorry to keep you waiting :D

[16:42:34] <salty-horse> thanks GothAlice!

[16:42:41] <matthavard> haha no problem

[16:44:01] <GothAlice> matthavard: A child list is one approach, but it's tuned for very specific use cases. And one of those use cases is actually better served by other approaches. (Preserving the order of the children.) A list of child IDs is actually not very useful as you can't efficiently get the child records back in that explicit order.

[16:45:00] <GothAlice> https://gist.github.com/amcgregor/901c6d5031ed4727dd2f#file-taxonomy-py-L23-L35 illustrates the fields needed for several other approaches.

[16:45:20] <GothAlice> (And the code needed to manage an arbitrarily deep nesting; it's more than you seem to need, though.)

[16:48:48] <matthavard> I guess I don't really understand what that code is doing. I really just want to know how to store a reference to one document (Children) in a field in another document (Mommy) so that you can do `db.Mommy.findOne().child.name` instead of storing ObjectIds and looking up the document in the referred to collection (Children) by its ObjectId

[16:49:41] <matthavard> Is that what that code is doing GothAlice ?

[16:50:03] <matthavard> I know Python, just not Mongo ha

[16:50:59] <matthavard> Is CachedReferenceField a Mongo thing or a mongoengine.py thing?

[16:51:56] <Gevox> Do i have to write javascript syntex while making a mapreduce methods in java?

[16:52:11] <Gevox> syntax*

[16:53:08] <GothAlice> It's a mongoengine thing, matthavard.

[16:53:48] <GothAlice> It's basically creating a sub-document like {id: ObjectId(…), name: "…", acl: […]} for each "parents" list entry, and the same without "acl" for the direct parent reference. Saves needing to perform additional queries to load "related" data.

[16:54:21] <GothAlice> But, what you need to have the most use out of your heirarchy (preserved order, efficient querying) is a parent reference (ObjectId), and a "sort order" integer.

[16:54:51] <GothAlice> http://docs.mongodb.org/manual/tutorial/model-tree-structures-with-parent-references/

[16:55:14] <GothAlice> My own code does the parent reference, array of ancestors, and materialized paths. (All of them. Each optimizes for a different querying approach.)

[16:56:35] <matthavard> Hmm thanks GothAlice I will check that out

[17:17:12] <Gevox> Hello i need to acheive something almost similar to this http://stackoverflow.com/questions/13732735/query-in-a-mongodb-map-reduce-function

[17:17:25] <StephenLynx> what would be the differences?

[17:17:54] <Gevox> But in my case i have a key "title" stored into an event document. I need to split each word stored in that title field and count how many words each one repeated

[17:19:04] <Gevox> Its 1st time ever i hear about this mapreduce, so i'm bit confused. How do i get a reference to the "title" field which is stored inside the "Event" document stored in the "Events" collection and tell the map function to split each word in each title of these and return the key value to reduce function?

[17:20:36] <GothAlice> Gevox: When I was first learning map/reduce good (simple) examples were hard to find. Have mine: https://gist.github.com/amcgregor/1623352

[17:20:38] <StephenLynx> you can reference fields by using a string '$fieldname'

[17:22:12] <GothAlice> In map/reduce, you write JavaScript functions. The map function gets run once for each document from the initial query, referenced as "this". The map function emits values which are then fed into one or more runs of the "reduce" function. The output of the reduce function should match the input to the reduce function—something that tangled me up for a while—as it may be run over its own output to further reduce.

[17:23:16] <Gevox> Ok I quite understood this part by now. Get ready for some dumb questions and bear me. That's how i get to understand things

[17:23:31] <GothAlice> There are no dumb questions: only bad answers.

[17:23:43] <Gevox> This http://pastebin.com/tqHXuihG

[17:24:08] <Gevox> i want to tell this map to look at the "title" of each "Event" document. How can i explain this ?

[17:24:47] <GothAlice> Gevox: Instead of str.split(), you need to reference the field: this.title.split()

[17:25:03] <GothAlice> ("this" being the "current document being processed from the query" within the map functionl.)

[17:25:06] <GothAlice> *function

[17:25:18] <Gevox> this.title directly? not this.get("title")?

[17:25:27] <GothAlice> Then you'd emit(word), not emit('word') (which would emit only one word: "word". ;)

[17:25:32] <GothAlice> this.title directly.

[17:25:34] <Gevox> i mean in regular i use the .get method to get things from a certain document ( i use java )

[17:25:39] <Gevox> ok i can adapt

[17:25:48] <GothAlice> JavaScript is a fully dynamic language; no need for setters/getters in most situations.

[17:26:39] <Gevox> Ok, so this looks logical for you? emit(word, 1);

[17:26:50] <Gevox> i mean to return each word and a "1" as a value for it

[17:27:14] <Gevox> and then i make sum of these "1's" of each word to get how many times each word mentioned

[17:27:20] <GothAlice> Exactly.

[17:27:20] <Gevox> damn i feel this is so wrong

[17:27:27] <GothAlice> So you want a count of 1 for each emitted word.

[17:27:35] <GothAlice> Then during reduce, add identical words up together.

[17:27:49] <GothAlice> From my example: https://gist.github.com/amcgregor/1623352 you can see that this is effectively what I'm doing, just grouped by month.

[17:28:20] <GothAlice> (line 2 gets the creation time field from the record, then I emit a set of month counts for the given year for that record. (There'll be a 1 in one of the months.)

[17:28:36] <GothAlice> Then in the reduce function, I combine the values for the given year together.

[17:28:48] <Gevox> line 18, 23

[17:28:54] <Gevox> why do you have this document there

[17:29:02] <Gevox> from 18 to 22

[17:29:23] <GothAlice> You generally want to start counting from zero. So I create a record that is all zeros, then populate it from the passed-in values.

[17:29:36] <GothAlice> It has to match the structure on lines 5-9 for things to work properly.

[17:29:56] <Gevox> What does the reduce function actually takes as arguments? The key is what and the value is what?

[17:30:08] <GothAlice> values is an array of emitted values.

[17:30:14] <GothAlice> So in your case, they'd be integers and all 1.

[17:30:32] <Gevox> hold on

[17:30:32] <GothAlice> (You could cheat and simply return values.length ;)

[17:30:49] <GothAlice> But that cheat would bite you if you shard.

[17:31:01] <Gevox> so the exact thing i emit from map is sent as argument to reduce? in my case (word, 1) is what is taken for the reduce ?

[17:31:10] <GothAlice> Sorta.

[17:31:14] <Gevox> key is "word" value is "1"

[17:31:15] <GothAlice> The reduce gets the result of several mappings.

[17:31:31] <GothAlice> k == "word", values = [1,1,1,1,1] if that word appeared 5 times.

[17:31:48] <GothAlice> You'd then forEach across values and sum the values.

[17:31:50] <Gevox> aha, so i do a loop

[17:31:52] <louie_louiie> hey guys. sorry to interrupt- i got a quick question: what is the best way to ensure a search index on url's in mongodb?

[17:31:58] <GothAlice> Aye, like lines 24 to 29 of mine.

[17:34:15] <Gevox> GothAlice: like this? http://pastebin.com/9GLaFiSV

[17:35:02] <Gevox> I think i need to use sort of HashMap (key is the world, value is the total sum) and return that from the reduce. Correct thinking?

[17:38:21] <GothAlice> Gevox: In the reduce function, the key is preserved. All you need to do is return the sum.

[17:38:53] <GothAlice> louie_louiie: http://docs.mongodb.org/manual/tutorial/create-an-index/ and http://docs.mongodb.org/manual/tutorial/create-indexes-to-support-queries/ should be useful resources for you.

[17:38:58] <Gevox> Does even my reduce function look fine?

[17:39:14] <GothAlice> Gevox: You aren't returning the sum.

[17:39:28] <Gevox> is this what is only missing? I feel its ALL wrong

[17:39:32] <GothAlice> (Or setting the initial value to zero.)

[17:39:59] <Gevox> http://pastebin.com/e5QhnSkL

[17:40:02] <Gevox> correct?

[17:40:15] <GothAlice> … the return needs to be inside the function.

[17:40:16] <sijis> is this connectionstring correct "mongodb://mdb-11,mdb-12,mdb-13/?replicaSet=rs0-seyren" based on my configs? http://paste.fedoraproject.org/218708/14308474/

[17:40:31] <GothAlice> (At least that doesn't differ between Java and JavaScript. ;)

[17:41:11] <GothAlice> sijis: I typically always include the port number explicitly against the hosts in my URIs when using a replica set. (Explicit is better than implicit.)

[17:41:43] <GothAlice> sijis: "mongodb://mdb-11:27017,mdb-12:27017,mdb-13:27017/?replicaSet=rs0-seyren" — otherwise, yes, looks fine, as long as the servers can DNS resolve "mdb-11" and friends. :)

[17:42:04] <Gevox> http://pastebin.com/18ny9E2a

[17:42:08] <sijis> GothAlice: yeah, i validated they can resolv

[17:42:30] <GothAlice> I name mine "s1r1", "s1r2", etc. as in "shard 1, replica 2".

[17:43:06] <Gevox> Final version -> http://pastebin.com/p31uhSSb

[17:44:33] <GothAlice> Gevox: No points for style, but that should work. I haven't used map/reduce in a long time, so I'd test that and see if it requires objects to be returned. (I.e. return {sum: result}, not just return result.)

[17:45:03] <GothAlice> If it does, you'll also need to change the emit: emit(word, {sum: 1})

[17:45:09] <GothAlice> (And the return.)

[17:45:22] <Gevox> ok let me test it

[17:46:09] <louie_louiie> hey @GothAlice can you help me out?

[17:46:34] <louie_louiie> do you know the best way to index urls? as an object or as text?

[17:46:46] <GothAlice> Define: "object"

[17:48:31] <louie_louiie> there was this example of indexing urls as an BasicDBObject to remove duplicates

[17:48:41] <louie_louiie> but i just want to be able to search for them

[17:49:08] <GothAlice> I don't even know what BasicDBObject is, so this sounds like over-engineering.

[17:49:31] <sijis> is there a way to 'delete' all the contents in the replica set?

[17:49:41] <cheeser> it's java driver class

[17:49:49] <GothAlice> If you need to perform exact searches, a string is sufficient. If you need to perform prefix searches (i.e. https://example.com) strings can work for that, too, and still make use of an index. Note that "www" being there or not might not matter for many URLs, but will create edge cases for prefix searches.

[17:50:42] <louie_louiie> cool, ill just text index it

[17:50:44] <louie_louiie> thanks!

[17:51:00] <louie_louiie> @sijis you can remove duplicates with mongodb

[17:51:28] <GothAlice> louie_louiie: If you _really_ want to store a URL in the ultimate queryable way, store an embedded document like: {scheme: 'http', user: null, password: null, host: 'example.com', port: 80, path: '/', params: {}, query: {}, fragment: null}

[17:52:15] <GothAlice> From that you can rebuild the full URL, or query components, including query string argument names and values. (Need to use an array of embedded documents if you want params and query to benefit from indexes.)

[17:52:44] <cheeser> sijis: say what?

[17:53:27] <louie_louiie> thats a really cool way of thinking about it! Thanks @GothAlice

[17:53:49] <sijis> cheeser: i basically want to 'start over' with it

[17:53:51] <GothAlice> louie_louiie: https://github.com/marrow/util/blob/develop/marrow/util/url.py#L188-L252 is my Python URL object. :)

[17:54:09] <GothAlice> sijis: If you want to nuke it and start over, you can shut down the node and delete the contents of your data directory, typically /var/lib/mongodb/*

[17:54:24] <GothAlice> When you start it back up, it'll be all fresh and new.

[17:54:46] <Gevox> GothAlice: very bad exception happens, i tried both methods http://pastebin.com/eQfZE8JF

[17:55:09] <GothAlice> Gevox: Your loss of points to style is obscuring the issue. You forgot a semicolon.

[17:55:29] <cheeser> sijis: http://docs.mongodb.org/manual/tutorial/resync-replica-set-member/

[17:55:50] <Gevox> GothAlice: where exactly?

[17:56:01] <Gevox> sum+=value

[17:56:03] <GothAlice> Oh, not just a semicolon, also a closing parenthesis. Line 11 of your last paste.

[17:56:15] <GothAlice> (Those really should be 4 lines.)

[17:57:02] <GothAlice> You're passing a function to forEach, but you don't close the argument list for forEach (missing ")").

[17:57:16] <GothAlice> (Well, missing ");")

[17:58:12] <louie_louiie> GothAlice: props

[17:59:13] <Gevox> http://pastebin.com/Ji3kg6m6

[17:59:28] <Gevox> (Well, missing ");")? i don't see

[18:00:31] <GothAlice> Gevox: Let me link you my list of Laws. #28 to #34 apply to code readability: https://gist.github.com/amcgregor/9f5c0e7b30b7fc042d81 ;) Step 1: take those JS functions out of those blasted strings and format it properly. ;)

[18:01:25] <kakashiAL> hey guys, why does mongoDB allow with save() to overwrite an document with the same _id?

[18:01:47] <kakashiAL> isn it more secure to use the update functionality?

[18:01:48] <GothAlice> Gevox: https://gist.github.com/amcgregor/b395d6136db7cfcae071 < line 6, not line 8 in the actual function is where the ); should have gone.

[18:03:16] <Gevox> GothAlice: if i remove the string quotes i will get errors

[18:03:26] <Gevox> because i'm in Java

[18:04:31] <GothAlice> kakashiAL: MongoDB performs no security checks (other than read/write access to the db/collection) when performing any of the CRUD operations, excluding some optional fancy redaction stuff on reads.

[18:04:35] <cheeser> the end of line 9 seems to have an extraneous }

[18:04:56] <GothAlice> kakashiAL: It's up to your application to handle your data's security, really.

[18:05:02] <GothAlice> cheeser: That's the close of line 1.

[18:05:38] <cheeser> hrm?

[18:05:39] <Gevox> line 6 is having an extra ")"

[18:06:34] <GothAlice> Gevox: The point is to write the code as code first, with something akin to a proper style guide; then syntax errors and whatnot can be easily caught. You can even copy/paste these functions into the MongoDB shell to try them out first. *Then* you can wrap it in strings. Wrapped in strings to begin with is maddeningly difficult to read, and encourages one to not make use of newlines.

[18:07:08] <kakashiAL> GothAlice: so it takes my object, look if there is the same ID, and overwrite it with no mercy?

[18:08:12] <GothAlice> https://gist.github.com/amcgregor/b395d6136db7cfcae071 < updated

[18:08:32] <GothAlice> kakashiAL: Yup. Your application told MongoDB to save the record, so that's what it does.

[18:08:45] <GothAlice> If you do not like that behaviour, avoid .save(). (I avoid it like the plague.)

[18:11:00] <sijis> GothAlice: thank you. that worked. i think the app i'm playing with doens't support it or something

[18:11:06] <kakashiAL> GothAlice: I dont know how the mongoose update function works, that only updates one value, but I guess it uses save() somehow, since mongoDB doesnt care from the beginning

[18:11:35] <sijis> if i jus used mongo:://mdb-11/seyren it works ;/

[18:11:58] <sijis> sorry. mongodb://mdb-11/seyren

[18:14:13] <Gevox> GothAlice: it runs, but it does not seem to split each word in the title. It takes the entire title . Look the output http://pastebin.com/mCzbGgAZ

[18:15:40] <GothAlice> Gevox: https://gist.github.com/amcgregor/b395d6136db7cfcae071 and http://showterm.io/8b877b27715956e03581a (my testing it out)

[18:16:47] <GothAlice> You forgot to loop over the individual words, and were instead emitting all of the words that were split as a single key.

[18:17:34] <GothAlice> kakashiAL: I long ago stopped expecting Mongoose to do things in sane ways.

[18:24:15] <Gevox> GothAlice: it works now! Well, appearently i got punch of DBObjects as return, if i want to show only words which got total of more than 10 for example. I shall make another regular java method that iterates over these DbObjects and prints out which has a value of greater than 10?

[18:25:01] <GothAlice> Gevox: That would work. (You can see from the end of my second link the structure being returned by the mapReduce call.)

[18:25:02] <Gevox> in paractice the reduce function already returns the value as "total" : 2.0". So i will parse this DBObject.get("value") to get only the number out of the string, and print it out based on my specified condition.

[18:25:09] <GothAlice> You could also write a "finalize" function which does that filtering.

[18:25:12] <Gevox> Is this how things work or i hardcoded everything that way?

[18:26:07] <GothAlice> The _id of a record in the "results" array will be the key you emit in the map, and the "value" will be the value you emit and subsequently return from reduce.

[18:26:56] <GothAlice> In my example I emit and return basic integers. Thus result.results[0].value == 1 in my example. How you get that in Java… others may be better suited to answer.

[18:27:13] <GothAlice> ("result" being the value mapReduce returns.)

[18:29:05] <Gevox> ok i will see what is properly to do about this.

[18:31:04] <Gevox> One last thing, what to run mapreduce on Events that happened between 5 and 6 PM for example? I store the time in a key "hour" as part of the Event document. Shall i make a function that gets all of the documents which has a "hour" of 6:00 PM for example and store it in a new temp collection, then run my mapreduce function on that collection?

[18:31:20] <Gevox> i mean "how" not what sorry xD

[18:31:31] <hfp_work> Hi all, when you do db.collection in NodeJS, is collection a method of MongoClient?

[18:32:04] <GothAlice> Gevox: http://docs.mongodb.org/manual/reference/command/mapReduce/#calculate-order-and-total-quantity-with-average-quantity-per-item < the "query" option you pass in the options to the mapReduce call. These examples (down a bit from the anchor I linked) demonstrate this.

[18:33:02] <GothAlice> hfp_work: It'd be a Collection object.

[18:33:24] <GothAlice> hfp_work: Ref: https://mongodb.github.io/node-mongodb-native/api-generated/collection.html

[18:33:32] <hfp_work> GothAlice: Thanks

[18:34:01] <GothAlice> THANK THE GODS

[18:34:02] <GothAlice> ;)

[18:34:59] <hfp_work> GothAlice: So to test inserting and removing documents with sinon in nodejs, I should `var Collection = require('mongodb').Collection; sinon.stub(Collection, 'insert');`?

[18:35:09] <GothAlice> No.

[18:35:14] <hfp_work> :) I love testing

[18:35:23] <GothAlice> One should never attempt to directly initialize driver-internal prototypes.

[18:35:43] <GothAlice> "db" (a Database object) is a factory for them, as MongoClient is a factory for Database objects.

[18:36:29] <hfp_work> Right, but how do I get a Database object for a test? I don't want to test against an actual MongoDB, I just want to make sure my function inserts or deletes depending on what it is given

[18:37:12] <GothAlice> Oh, it's a test mocking/monkeypatching tool?

[18:37:18] <hfp_work> This is the function I want to test: https://gist.github.com/anonymous/db196cf7e1144e2a27f0. I want to make sure it inserts and deletes when it should. But I dont want to ahve to run a mongodb to run my test suite.

[18:37:22] <hfp_work> GothAlice: Yes

[18:37:47] <GothAlice> hfp_work: The typical approach is to not wrestle the driver, as monkeypatching would, nor to replace the API as mocking would, but to simply test the database state after execution of your function.

[18:38:00] <StephenLynx> you could just run a virtual machine

[18:38:15] <StephenLynx> if you don't want to install mongo on your work environment.

[18:38:27] <sijis> is there a simple way to validate the replication set it working?

[18:38:32] <GothAlice> I.e. clear the collection, insert a value, run your function that inserts or updates, check if it added (count() > 1) or modified (old value is now new, expected value).

[18:38:34] <StephenLynx> I have about half dozen VM's.

[18:38:42] <StephenLynx> mongo, io, PHP, mariadb,mysql

[18:39:25] <StephenLynx> and a couple more for some small issues

[18:39:27] <hfp_work> StephenLynx: It's not about that, it's because our circleci test suite doesnt have a mongo setup and I dont want to go and change all that for one test out of 1200+

[18:39:38] <StephenLynx> clrcleci?

[18:39:41] <StephenLynx> circleci*?

[18:39:44] <hfp_work> yes

[18:39:45] <Gevox> GothAlice: I don't know how to thank you. Thank you very much man

[18:39:47] <StephenLynx> what is that?

[18:39:55] <hfp_work> StephenLynx: continuous integration tool

[18:40:01] <GothAlice> Continuous integration testing service, StephenLynx. Like Travis.

[18:40:04] <StephenLynx> but you do have mongo on your servers?

[18:40:19] <hfp_work> StephenLynx: yes

[18:40:23] <StephenLynx> uh

[18:40:24] <GothAlice> Gevox: It never hurts to help. When I was learning map/reduce the biggest thing I learned was to avoid it wherever possible. ;)

[18:40:47] <StephenLynx> I never told you to touch your complex thingie for enterprise thingies.

[18:41:21] <StephenLynx> just to create a VM, check if your operation is working and move along

[18:41:41] <GothAlice> …

[18:42:24] <GothAlice> hfp_work: Are you able to explicitly query the DB in your test, to see the result of any operations made in the test block?

[18:42:50] <hfp_work> GothAlice: No, that's the thing

[18:43:36] <GothAlice> … so you can run code that performs operations on the DB. But can't run code that performs operations on the DB. I'm really not grasping something, here. ;)

[18:43:36] <deathanchor> yeah map/reduce, good for never using or running on secondaries for one-offs.

[18:44:00] <GothAlice> deathanchor: Sometimes it's the only thing that'll work, but I always make those one-offs to correct the "issue". ;)

[18:44:12] <deathanchor> exactly :)

[18:44:21] <GothAlice> hfp_work: The former being the code you're testing, the latter begin the code testing your code.

[18:45:24] <sijis> are these equivalent? mongodb://mdb-11:27017,mdb-12:27017,mdb-13:27017/?replicaSet=seyren OR mongodb://mdb-11:27017,mdb-12:27017,mdb-13:27017/seyren?replicaSet=seyren

[18:45:40] <hfp_work> GothAlice: I managed to get away with stubbing all the calls to a MongoDB so I monkeypatched whatever uses mongo so it returns what I would expect from a successful mongo operation.

[18:45:51] <GothAlice> hfp_work: I'm apparently too used to testing that lets you do things like this: https://github.com/marrow/task/blob/develop/test/test_queue.py#L23-L25 and https://github.com/marrow/task/blob/develop/test/test_queue.py#L86-L99

[18:46:16] <GothAlice> hfp_work: That's an utterly terrifying hybrid of mocking and monkeypatching.

[18:46:27] <hfp_work> GothAlice: I know... I miss Rails so much!!

[18:46:30] <GothAlice> … with all the overhead in maintaining parity with teh wrapped lib.

[18:46:34] <GothAlice> *the

[18:46:53] <GothAlice> I mean, it's a terrible approach unless the point is to make more work. :'(

[18:47:13] <hfp_work> GothAlice: So far, it took me 3 times longer to write the tests than the actual code. I hate it.

[18:47:16] <GothAlice> This was also discussed yesterday: http://irclogger.com/.mongodb/2015-05-04#1430765075

[18:48:11] <GothAlice> My argument from yesterday stands: you're not testing against Mongo at that point, you're testing against your mock.

[18:49:39] <GothAlice> (With a passing test meaning nothing about the ability for that code to actually run in a production environment.) This has bit me in the past, notably on the map/reduce call spec, which changed.

[18:50:36] <hfp_work> GothAlice: Thanks for the insight, I'll bang my head on the wall a little more, see if any better solution than mocking comes up

[18:51:30] <GothAlice> The chat history from yesterday should include a link to my "start a cluster on a single node" shell script, which I use heavily for making reproducible tests of sharding behaviour, not to mention a clean way to start and stop a mongod for testing purposes, separate from one running on the normal port, too.

[18:51:31] <StephenLynx> <hfp_work> GothAlice: So far, it took me 3 times longer to write the tests than the actual code. I hate it.

[18:51:38] <StephenLynx> I have a very efficient way to deal with that issue.

[18:51:42] <StephenLynx> no tests! :v

[18:57:52] <GothAlice> hfp_work: Testing should be fun, it should be enjoyable, and it should be easy. There's a trend in my own packages to have more tests than there are executable statements of code: https://travis-ci.org/marrow/schema/jobs/41019976#L317-L319 (578 statements, 629 tests)

[19:13:03] <deathanchor> StephenLynx: guess you don't believe in coding the tests before coding the functions?

[19:13:14] <StephenLynx> I do believe in tests.

[19:13:54] <StephenLynx> I just rather make a clear documentation that any issue is easily spotted than writing more code.

[19:14:20] <StephenLynx> well, if I don't write tests, I guess I wouldn't write them before the actual code because I am not writing them at all.

[19:14:50] <StephenLynx> I don't say they are bad or pointless. I just have experimented with both and stuck to the later.

[19:15:12] <StephenLynx> I can understand tests for some cases though.

[19:15:24] <StephenLynx> where you don't have many possible outputs.

[19:17:15] <StephenLynx> I rather spend my time manually testing than writing code that will test and have then to test it.

[19:17:22] <StephenLynx> because when you think about it

[19:17:25] <StephenLynx> who tests the test?

[19:17:30] <StephenLynx> what if the test is wrong?

[19:17:40] <StephenLynx> you don't write tests to test the test.

[19:17:49] <StephenLynx> you manually test the test.

[19:20:40] <GothAlice> StephenLynx: Actually, I do sometimes write tests that test tests. I also write tests that test everything down to the informational logging messages.

[19:20:52] <GothAlice> More often I write test generators: factories for known-good tests.

[19:20:53] <StephenLynx> :v

[19:21:20] <StephenLynx> I can understand that approach. But I rather keep my projects lean.

[19:21:40] <GothAlice> One of the reasons why marrow.schema has so many tests is that many of them are simply testing input data against an expected output; bam, feed it a list of input/output pairs for good, bad, and error conditions, and you instantly have hundreds of tests… having only actually written three.

[19:21:41] <StephenLynx> so I can manually work on them with more ease.

[19:22:05] <GothAlice> Anything worth doing twice is worth automating.

[19:22:34] <StephenLynx> thats why I don't do it.

[19:22:35] <StephenLynx> :v

[19:23:13] <GothAlice> https://github.com/marrow/schema/blob/develop/marrow/schema/testing.py#L11-L36 < is the factory, provided by the library being tested for all users of that library. https://github.com/marrow/schema/blob/develop/test/validate/test_geo.py#L19-L32 is an example of using it.

[19:23:34] <GothAlice> Face it: if I'm too lazy to write that style of resulting test, I have no business being a library author. ;)

[19:24:19] <StephenLynx> wat

[19:24:48] <GothAlice> https://github.com/marrow/schema/blob/develop/test/validate/test_compound.py#L87-L99 < it also lets you factor out things common to multiple tests more easily

[19:24:52] <hfp_work> GothAlice: Haha I know I keep reading Mocha's headline that testing should be "simple an fun". I wish I was better at testing so it actually becomes easier and funnier because right now it's a drag. And I don't want to not write tests.

[19:28:13] <fxmulder_> [rsSync] replSet initial sync drop all databases

[19:28:15] <fxmulder_> again

[19:28:31] <GothAlice> Anything else in the logs to indicate the reason?

[19:28:32] <StephenLynx> anything that you wouldn't voluntarily do is not worth doing. a counter part of alice's automation moto.

[19:29:10] <GothAlice> StephenLynx: Garbage collection; just because you wouldn't drive a garbage truck doesn't make the task any less worthwhile. -_-;

[19:29:32] <fxmulder_> [rsSync] Socket flush send() errno:9 Bad file descriptor 10.251.188.80:27018

[19:30:00] <GothAlice> fxmulder_: Did you have network connectivity issues long enough in duration to burn through the oplog duration?

[19:30:35] <fxmulder_> we haven't had any network connectivity issues

[19:31:50] <GothAlice> Hmm. That particular socket error indicates that the connection disappeared unexpectedly, either due to intermittent connectivity or the remote side closing the connection without sending a FIN, with MongoDB figuring that out only after it tried sending data there.

[19:32:20] <GothAlice> Possibly the enabling of firewall rules would exhibit the same error.

[19:32:51] <fxmulder_> there's no firewalls between these

[19:33:00] <fxmulder_> it looks like it reconnected right away

[19:33:09] <fxmulder_> is there a way to make it not start over if this happens?

[19:33:47] <deathanchor> was it during an inital sync?

[19:33:52] <fxmulder_> it is

[19:33:52] <GothAlice> MongoDB does try to be quick about those things, and usually Does the Right Thing™. It'll "start over" if it was unable to complete the first sync, or if the node is unable to catch up with the primary before exhausting the oplog.

[19:34:18] <deathanchor> fxmulder_: I had issues with socket timeouts for initial syncs between DCs

[19:34:24] <fxmulder_> I have everything shut off so nothing is reading or writing to mongodb, the only thing going on is this sync

[19:34:34] <deathanchor> i had to change the tcp_ timeouts on the machine

[19:34:35] <GothAlice> In the event where you have enough data that takes longer to transfer the initial sync than your oplog size, you have two choices: increase the oplog size, or pre-populate the secondary and pray *that* is fast enough.

[19:35:29] <fxmulder_> its not uncommon for us to have to rebuild one of these machines, I'd like to be able to get this syncing process working without having to rsync

[19:35:44] <GothAlice> Ref: http://docs.mongodb.org/manual/tutorial/expand-replica-set/#data-files

[19:36:16] <deathanchor> fxmulder_: what are your tcp_keepalive settings?

[19:36:25] <GothAlice> Then increasing the socket timeouts and increasing the oplog size will likely be the solution for you.

[19:36:42] <deathanchor> GothAlice: actually I turned down the time out

[19:36:52] <GothAlice> Also ref: http://docs.mongodb.org/manual/tutorial/change-oplog-size/

[19:37:16] <fxmulder_> deathanchor: cat /proc/sys/net/ipv4/tcp_keepalive_time says 7200

[19:37:21] <GothAlice> deathanchor: That's weird. You're saying making the timeouts more strict alleviated the problem of them being hit?

[19:37:43] <deathanchor> my tcp_keepalive_intvl was 75, and tcp_keepalive_time was 7200, I set them both to 5 and goodbye socket timeouts

[19:37:45] <GothAlice> Ah, keepalive, not timeout.

[19:37:51] <GothAlice> Uhm.

[19:37:52] <GothAlice> Wat.

[19:37:54] <GothAlice> 5?!

[19:38:13] <deathanchor> yeah, it was the only way to get the initial sync completed.

[19:38:16] <deathanchor> between DCs

[19:38:34] <deathanchor> after that I could change it, but one of the DCs has horrible networking

[19:38:54] <GothAlice> http://www.tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

[19:39:45] <fxmulder_> that is in seconds

[19:40:33] <fxmulder_> I suppose the logs don't really show network transfer information

[19:40:56] <fxmulder_> [rsSync] clone mail.message_data 329061375 is just indicating the insert process

[19:42:15] <fxmulder_> but if something happened and it didn't notice for 7200 seconds maybe decreasing that time helps

[19:42:24] <fxmulder_> worth I shot

[19:44:43] <deathanchor> GothAlice: fxmulder_: here is where I found the suggestion to use 5: https://jira.mongodb.org/browse/SERVER-9199

[19:46:54] <GothAlice> deathanchor: Indeed, that's a diagnostic strategy, not a production one. Lets just hope you aren't billed for server-to-server traffic—a value of five means send a packet every five seconds, and that can add up over time. ;)

[19:47:24] <GothAlice> Also, those comments are referring to MongoDB 2.4.

[19:47:31] <deathanchor> it's only until I cutover to the new DC

[19:47:47] <GothAlice> O_o

[19:47:51] <deathanchor> and it applied to 2.6 also

[19:48:13] <deathanchor> I got all kinds of flavors of mongo

[19:48:46] <deathanchor> running tokumx also

[19:48:54] <GothAlice> Hrk; one of my boxen (a testing VM) has MongoDB 2.0.7 on it. Wau.

[19:49:10] <GothAlice> TokuMX is good people.

[19:49:33] <deathanchor> yeah, missing some features, but good for when you have tons of data.

[19:49:44] <GothAlice> Or absolutely _must_ have transactions.

[19:49:57] <GothAlice> (But love MongoDB too much to give up document storage. ;)

[19:53:19] <fxmulder_> so lets say I have 1 primary replica member and I'm cloning to a new secondary, I restart that primary member, does that restart the cloning process on the new secondary?

[19:53:48] <cheeser> the secondary reads from the oplog from a particular time and will simply pick up where it left off.

[19:53:59] <Gevox> GothAlice: I would like to configure my db to allow shards and replica sets. Is there any good reference you would recommend for me to help get the db configured?

[19:54:58] <Gevox> currently i got this one http://www.severalnines.com/blog/turning-mongodb-replica-set-sharded-cluster

[19:55:43] <fxmulder_> cheeser: I just gave it a shot, it looks like the secondary did perform a dropAllDatabasesExceptLocal

[19:55:56] <fxmulder_> this is 2.4, not sure if that changed in later versions

[19:56:12] <GothAlice> Gevox: The construction of a sharded cluster is simply the application of several replica sets with a few management daemons (configuration services, routers) to glue them together.

[19:56:58] <GothAlice> Gevox: In my "local testing" script, you can see the process in these lines: https://gist.github.com/amcgregor/c33da0d76350f7018875#file-cluster-sh-L130-L158

[19:56:58] <Gevox> Then i must be looking at the wrong reference

[19:57:04] <drags> I'm putting together a system wide /etc/mongorc.js file to update my prompts but having some trouble. If I put the file at /etc/mongorc.js it is seemingly ignored. If I copy that file to ~/.mongorc.js then it is used

[19:57:27] <drags> interestingly: if I delete the ~/.mongorc.js file it is recreated as an empty file at the start of each shell invocation

[19:57:55] <GothAlice> Gevox: The process is: construct two, three-node replica sets, spin up some config servers (that track metadata about what data is where), optionally handle authentication by adding the first user, then stat a router and configure it with two shards.

[19:57:59] <drags> from the docs it seems like the client should be merging the effects of the global and user level mongorc files, but I'm wondering if the 0 length file is winning

[19:58:09] <drags> anyone use mongorc files for this sort of thing?

[19:59:05] <GothAlice> drags: http://docs.mongodb.org/manual/reference/program/mongo/ < makes no mention of a "system wide" rc file.

[19:59:09] <fxmulder_> maybe once I get this secondary going I'll upgrade, 2.6 at least

[19:59:09] <deathanchor> fxmulder_: yeah the initial sync is precarious on these older mongo versions

[19:59:15] <GothAlice> AFIK the feature you are trying to use doesn't exist, drags.

[19:59:22] <Gevox> GothAlice: The lines you sent, are all mongodb commands?

[19:59:40] <GothAlice> drags: Ugh, ignore me. My in-browser find was searching the wrong term.

[20:00:09] <GothAlice> Gevox: That's a shell script, so those would be things you could type into an interactive terminal.

[20:00:15] <deathanchor> fxmulder_: you trying the new keepalive settings?

[20:00:31] <fxmulder_> deathanchor: I am, crossing my fingers

[20:00:45] <Gevox> GothAlice: mongo shell can handle these operations?

[20:00:47] <GothAlice> Gevox: I use functions in that script to isolate the common commands, so you may need to jump up a bit to reference those functions and see what they're doing. It's pretty straight-forward.

[20:00:49] <deathanchor> fxmulder_: I banged my head on that problem for a week trying to sync to another DC.

[20:01:09] <GothAlice> Gevox: BASH shell, not mongo shell. The first line of a script typically tells you what it runs under, if it starts with a "sh-bang" (#!).

[20:01:10] <fxmulder_> it generally gets about halfway or so before something happens

[20:01:10] <drags> GothAlice: yar: http://docs.mongodb.org/manual/reference/program/mongo/#files

[20:01:16] <deathanchor> fxmulder_: I think after the sync is done, you can change the timeouts back

[20:01:24] <GothAlice> drags: Yeah, I realized my mistake quickly; sorry!

[20:01:30] <deathanchor> fxmulder_: exactly what happened to me

[20:01:36] <drags> GothAlice: but switching the docs to be for my version (2.4) show that the global level one didn't exist :P

[20:01:44] <GothAlice> drags: Are you sure your users have read permission on the file in /etc?

[20:01:46] <drags> thank god we're rolling to 2.6 this month

[20:01:48] <deathanchor> fxmulder_: would get about 90% done, then network drop on other DC would kill it off

[20:01:51] <drags> (finalllyyyyy)

[20:01:58] <GothAlice> drags: E gods, why not upgrade to the _current_ version?

[20:02:24] <drags> 3.0? that's the path we're on, but it's 2.4 -> 2.6, sharding, then 2.6 -> 3.0

[20:03:30] <fxmulder_> deathanchor: well I should know in 3-4 days if there is progress

[20:03:58] <deathanchor> fxmulder_: wow, really that long for initial sync?

[20:05:17] <deathanchor> fxmulder_: 100mbps 90GB of data between DCs took about 11 hours.

[20:05:23] <fxmulder_> deathanchor: yeah, it would be longer but we just got the second shard set in place, probably looking at about 15TB of data to sync

[20:05:37] <deathanchor> fxmulder_: godspeed

[20:05:49] <fxmulder_> indeed

[20:09:23] <deathanchor> fxmulder_: when it happened for me, the socket timeout occurred right after building some giant indexes on a large DB when it tried to grab the next thing to sync.

[20:09:46] <deathanchor> of course it had to build 20 indexes for that db

[20:10:02] <Gevox> GothAlice: Can i private you?

[20:10:22] <GothAlice> Gevox: Yup.

[20:23:41] <deathanchor> hey, what do you upgrade first? shardsets or configsvrs?

[20:59:18] <deathanchor> crickets when you ask the wrong question I guess.

[21:01:13] <cheeser> http://docs.mongodb.org/manual/release-notes/3.0-upgrade/

[21:24:51] <androidbruce> hey all, what is the official mongodb chef cookbook?

[21:41:19] <tubbo> is there any guarantee that two BSON::ObjectIds representing different documents are unique?

[21:41:42] <tubbo> say i have the SHA '5547bc60a7a206c7dd000580', would it be possible for another model in the mongodb database (not the collection) to have that ID?

[21:52:49] <Gevox> tubbo: depends on the hashing algorithm you are using, some hasing algorithms has a collision chance. You need to google bit about which hashing function you shall use to avoid collision

[21:53:22] <ToeSnacks> what do I need to put in the mongos /etc/mongodb.conf file if I only want to connect to a stand alone db? I have been using a sharded system so far and I specify configdb= for that. Is it the same or is there a different thing I need to define?

[23:37:03] <GothAlice> But, one more brave soul now knows how to set up 2x3 sharded replica sets on Windows, of all things. ^_^

[23:42:15] <GothAlice> tubbo: As a late reply (sorry; t'was distracted) all hashing algorithms offer the chance of collision (same hash, different data). This is because you're trying to create a compressed summary of larger data; depending on the algorithm you may only have 16 bytes of hash (a la MD5). Larger hashes reduce the statistical likelihood of collision at a rapidly growing cost in terms of space. (SHA512… wow.)

[23:42:54] <GothAlice> As an example that isn't quite a hashing algorithm (though it can be used that way): UUIDs have enough bits of data in them that you could generate one a second until the heat death of the universe and only have a 50% chance of collision.

[23:43:01] <GothAlice> (Alas, collisions are measured statistically like this.)

[23:43:45] <GothAlice> MD5, on the other hand, is well known for its collisions.

[23:44:54] <GothAlice> ObjectIds store 96 bits of data. (12 bytes.) Any hashing algorithm you apply would need to algorithmically "expand" the data to at least fit the hash size. This should give you the distinct impression that hashing ObjectIds is… probably not a good solution for anything.

Log file Viewer

Help | Karma | Search:

#mongodb logs for Tuesday the 5th of May, 2015