[00:02:39] <appledash> This is the test data, BTW: https://gist.githubusercontent.com/AppleDash/32518e17866c7bbf0ed7/raw/19ddb617b31849fefe74f180a91ab5aa0416bb73/gistfile1.txt As it is, the query is intended to retrieve all the documents that match any of the attributes queried for, and sort them in the order of the ones with the most matches first. (It does so fine)
[00:26:13] <kYuZz> I have a MongoDB install on the server of one of my clients.
[00:26:13] <kYuZz> An application I have no control over is using one single Mongo database without authentication.
[00:26:13] <kYuZz> Basically I would like every new database to require authentication, but the previous one needs to stay accessible without credentials.
[00:26:15] <kYuZz> I managed to enable authentication successfully, but now the previous database requires authentication.
[00:34:10] <GothAlice> appledash: Right, so you have several changes you need to make here. In your $group, add: {…, val: {$push: "$value"}}, after the $group add a new pipeline stage: {$project: {vallen: {$size: "$val"}}} then you can sort on "vallen". This, if I've interpreted your request correctly.
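A minimal sketch of the pipeline GothAlice describes, as one aggregate call. The collection and field names (things, attributes, name, value) are assumptions based on the surrounding discussion, not taken from the actual gist:

    db.things.aggregate([
        {$unwind: "$attributes"},
        {$match: {"attributes.name": {$in: ["hairColor", "gender"]}}},   // whatever criteria apply
        {$group: {_id: "$_id", val: {$push: "$attributes.value"}}},
        {$project: {vallen: {$size: "$val"}}},   // $size requires MongoDB 2.6+
        {$sort: {vallen: -1}}
    ])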
[00:35:17] <appledash> Not entirely sure if you have
[00:35:18] <SubCreative> Anyone familiar with working with geospacial queries and mongodb?
[00:35:42] <Boomtime> just ask your question and maybe somebody can help you
[00:35:48] <SubCreative> I'm working with a $near query and things work; next I'd like to calculate and display the distance it's being sorted by.
[00:35:49] <GothAlice> appledash: The above will give you the count of the "matches". The search criteria part I'm still pondering.
[00:35:50] <appledash> GothAlice: It works fine as it is, but I just want to make the val in the name, val pairs be an array instead, and have it so I can check if something is in that array
[00:36:18] <GothAlice> appledash: Gist me what you want as your output from that sample data?
[00:36:36] <GothAlice> appledash: An example of the desired output may be easier for me to grok.
[00:36:38] <appledash> It already provides the right output
[00:36:55] <appledash> Running the query as it is on that test data provides the results I want
[00:37:19] <appledash> It works 100% fine, but I want to make it do more things because that's what I need
[00:37:54] <GothAlice> appledash: Except that you want to change it to produce something else and thus the current output is _not_ your desired output.
[00:38:53] <appledash> What I'm trying to say is that it isn't broken
[00:39:25] <appledash> The desired output will still be exactly the same
[00:39:38] <appledash> But the query used to get it needs to change because I want to change how the database is laid out
[00:40:29] <GothAlice> Your sample data and aggregate return nothing on my local machine, interestingly. :/
[00:40:40] <appledash> Instead of the attributes being {name: "NameHere", value: "ValueHere"}, I want them to become {name: "NameHere", value: ["Value1Here", "Value2Here"]}
[00:40:52] <appledash> And instead of checking if the value is equal, I want to check if it's in the array
[00:42:32] <SubCreative> Boomtime: I'm aware of $geoNear, however it's not available in meteor.js
[00:43:03] <SubCreative> Boomtime: My question was related to mongo/meteor (i'm in #meteor also), just wanted to see if anyone had experience here working with maps/locations/distance in meteor
[00:43:12] <GothAlice> appledash: Then your current query should actually work unmodified.
[00:43:33] <Boomtime> SubCreative: you only mentioned $near before, I'm not sure it's possible with $near
[00:43:54] <GothAlice> appledash: {… val: "female"} if "val" is an array will compare and actually ask it 'any val element is "female"' instead.
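A tiny illustration of the array-matching behaviour GothAlice mentions; the collection and field names here are hypothetical:

    // A plain equality condition matches an array field if ANY element is equal,
    // so the same query works whether "val" is a scalar or an array.
    db.characters.find({val: "female"})
    // matches {val: "female"} as well as {val: ["female", "pegasus"]}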
[00:44:40] <Boomtime> SubCreative: MongoDB can do what you want but your choice of framework doesn't allow it - there is little I can do about that
[00:45:20] <GothAlice> appledash: Your current data model is giving me a serious sinking feeling in the pit of my stomach, though. One assumes your arbitrary key/value storage is, in fact, limited to a certain set of valid keys. If this is the case, why have this convoluted list-of-abstract-properties model?
[00:47:06] <appledash> Will have about 100-200 attributes per thing
[00:47:17] <appledash> with up to 20-30 being queried against at once
[00:47:26] <GothAlice> appledash: I use sparse fields with potentially hundreds of fields to describe metadata in my metadata filesystem currently holding 25 TiB of data…
[00:47:40] <GothAlice> (With indexes on many of those fields to speed up querying.)
[00:48:04] <appledash> Well, how would I even modify my query if my data wasn't how it is?
[00:48:14] <GothAlice> appledash: Records can differ nearly 100% in the attributes they have. (Only _id, path, parent, and parents fields are common.)
[00:49:57] <Boomtime> it is not the same without it
[00:50:05] <Boomtime> the difference is very subtle
[00:50:06] <appledash> I don't think that does what you think it does GothAlice
[00:50:10] <GothAlice> Boomtime: Currently, due to the weird nesting here, but if those "virtual" attributes moved up into the document proper, things become infinitely simpler.
[00:50:29] <appledash> I suppose they could be moved to the root document
[00:57:34] <GothAlice> appledash: And I can't reproduce your query to test modifications locally. No output using https://gist.githubusercontent.com/AppleDash/32518e17866c7bbf0ed7/raw/19ddb617b31849fefe74f180a91ab5aa0416bb73/gistfile1.txt as sample data against the https://gist.githubusercontent.com/AppleDash/48e4d803e6fe870f7915/raw/85146717ad71839822b70451eba45319d51f1fe0/gistfile1.txt query.
[01:03:45] <GothAlice> appledash: If you change "purple|pink" to ["purple", "pink"] in your second test record, then change "blue" references in the query to "pink" ones, it just works.
[01:04:04] <appledash> Now, that brings me into another question
[01:04:06] <Boomtime> appledash: if I understand correctly, you want a doubly nested array
[01:04:09] <appledash> How the heck do I update() that?
[01:05:08] <appledash> If I wanted to change that with update(), what would I do?
[01:05:15] <GothAlice> appledash: $set to replace the value, or $push to append to the list…
[01:05:21] <GothAlice> $pull to remove an element…
[01:10:28] <appledash> I'm not 100% sure I understand how that works, but OK.
[01:10:33] <GothAlice> So, db.characters.update({_id: ObjectId("…"), "attributes.name": "hairColor"}, {$push: {"attributes.$.val": "plaid"}}) — this would add it to the list
[01:11:15] <GothAlice> So, attributes.name in the query part will search for a particular array element. Once it finds it, it remembers it as $. It's then used in the update portion of that to represent the array element previously found, thus instead of attributes.27.val, you have attributes.$.val since MongoDB figured out the index for you.
[01:11:26] <Boomtime> you probably mean $addToSet (otherwise you could have the same hairColor twice)
[01:12:37] <Boomtime> however, you have a problem removing attributes - you'd need to update the whole array since you're already using the positional operator
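A hedged sketch of the positional update with Boomtime's $addToSet suggestion applied; the collection and field names follow GothAlice's example above, not the actual data:

    db.characters.update(
        {_id: ObjectId("…"), "attributes.name": "hairColor"},   // "$" remembers the matched array element
        {$addToSet: {"attributes.$.val": "plaid"}}               // $addToSet skips values already present
    )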
[01:13:13] <appledash> It does not seem to be working.
[01:13:15] <GothAlice> Or adding to a list if you aren't sure if the record *has* that named property or not… or aren't sure it was an array to begin with…
[01:13:46] <appledash> Updating Gyro to have a hairColor of ["purple", "pink"] and then running the query with gender: female and hairColor: pink returns Gyro followed by AppleDash
[01:20:22] <appledash> Updating Gyro to have a hairColor of ["purple", "pink"] and then running the query with gender: male and hairColor: pink returns Gyro followed by AppleDash
[01:34:17] <appledash> The most useful thing ever would be like... A Sublime Text plugin that lets me run the contents of the current file as a query and see the results
[01:34:29] <appledash> Because that entire query has been written in the line editor of `mongo`
[01:34:51] <Boomtime> an easy test is to use the shell, make a variable of each stage, then add them into aggregation one at a time and ensure the results at each point are what you expect
[01:35:29] <Boomtime> eg. your "unwind" stage would be -> u = { $unwind: "$attributes" }
[01:35:46] <Boomtime> do the same for each stage.. m = { $match... } etc
[01:36:10] <Boomtime> now, in the shell you can do this: db.col.aggregate( m, u )
[01:36:34] <GothAlice> Also: .js files can be executed in the interactive shell. And such an editor plugin would be relatively trivial. (a la echo "$SELECTION" | mongo) And the shell does support continuation, e.g. http://cl.ly/image/3Y003w100U0h
[01:36:46] <Boomtime> then add your second match: db.col.aggregate( m, u, m2 )
[01:37:03] <GothAlice> Boomtime: You're missing brackets? Or is the shell super smart about that?
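Boomtime's stage-by-stage approach, sketched in the shell. Older shells accept the stages as separate arguments as shown above; the bracketed array form below works everywhere. The match criteria are made up for illustration:

    u  = { $unwind: "$attributes" }
    m  = { $match: { "attributes.name": "gender" } }
    m2 = { $match: { "attributes.value": "female" } }
    db.col.aggregate([ m, u ])        // check the intermediate results
    db.col.aggregate([ m, u, m2 ])    // then add the next stage and check again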
[01:54:11] <Boomtime> you need to express those rules, as defined here: http://docs.mongodb.org/manual/reference/operator/aggregation/group/#accumulator-operator
[01:54:45] <appledash> Basically I'm just looking for a way to pull out the name and attributes on each document in that query
[01:54:54] <appledash> without doing a separate query to find that out about each thing
[01:55:05] <appledash> If it's possible I'll figure it out
[01:55:09] <appledash> But if it isn't I want to know now :P
[01:55:17] <Boomtime> you probably just want a single $addToSet
[01:55:48] <Boomtime> but maybe you want some more complicated rules depending on the source documents
[01:57:48] <appledash> I seem to have figured it out
[02:01:56] <appledash> Well, I figured it out but it's not doing what I want. With the way this query is working, I do not think that doing what I want is very possible.
[02:04:01] <Boomtime> construct a sample output that you want and pastebin/gist
[02:04:42] <appledash> Well, it is easier just to explain... Instead of that query returning just _id and score, I want it to return everything, _id, score, name, and attributes. All the attributes.
[02:05:12] <appledash> Basically the same as .find() would, but with the addition of score, and sorted in the proper order and stuff, and only matching the $match of course
[02:05:16] <Boomtime> nope, insufficient, you need to say how the attributes will go together, apparently $addToSet didn't do it
[02:06:43] <appledash> $addToSet only returned those attributes that match the query
[02:26:26] <GothAlice> http://www.javaworld.com/article/2088406/enterprise-java/how-to-screw-up-your-mongodb-schema-design.html This is the primary article I use as a reference for moving from the relational world to MongoDB one. It's a reasonable guide, though doesn't stress the importance of "data locality" and the "amplification effect" as much as I'd like.
[02:27:00] <GothAlice> (I.e. embedding 1:∞ relational data can make sense, but mostly if you always want all of that "related" data when you fetch the record. The latter relates to the question "do you store user IDs against the groups they belong to, or the group IDs against the user?" — usually go for the choice resulting in a smaller set, etc.)
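A small illustration of the "store the smaller set" choice GothAlice describes; the document shapes and field names are assumptions, not from any real schema:

    // Option A: group IDs stored on each user (usually the smaller list)
    { _id: ObjectId("…"), name: "alice", groups: [ObjectId("…"), ObjectId("…")] }
    // Option B: user IDs stored on each group (can grow without bound)
    { _id: ObjectId("…"), name: "staff", members: [ /* potentially thousands of user IDs */ ] }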
[02:27:39] <appledash> I feel like there are things MongoDB is good for and things it is not.
[02:28:18] <GothAlice> appledash: Of course. I would never recommend one stores graph data in MongoDB unless you only ever need, and will only ever need single-hop lookups.
[02:28:49] <appledash> If I want to say make a twitter clone, and store users on it (along with passwords, emails, etc) and then store the tweets those users have made, as well as people they are following, etc
[02:28:54] <appledash> no way in hell I'm using mongo
[02:29:21] <GothAlice> Really? MongoDB solved some of the problems Twitter encountered quite well.
[02:29:30] <GothAlice> (Twitter was originally a full Java CMS, BTW.)
[02:30:27] <GothAlice> For example: Twitter used auto-increment integer IDs. Now they do not, because that can not scale. (They went so far as to write a completely separate service whose only purpose is to generate new IDs, to let them scale that one issue separately from the rest of the app.)
[02:30:28] <appledash> It definitely does not seem like the right choice by any means
[02:30:54] <GothAlice> appledash: I have 25 TiB of data in my personal dataset, and this includes a complete copy of my incoming Twitter feed. :)
[02:31:05] <GothAlice> (And a whole lot more, obv.)
[02:32:38] <GothAlice> Same for any data where transactional safety or referential integrity are mission-critical, such as financials. These I would recommend a truly transaction-safe relational database.
[02:34:57] <GothAlice> However forums, short messaging services, instant messengers, etc. are all quite modelable in MongoDB, including some interesting MongoDB-specific optimizations. (My forums embed thread replies within the thread document itself. I use "caching" references that bundle certain often-used fields with their reference, saving millions of extra lookups to generate most listings. &c.)
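A rough sketch of the forum layout GothAlice describes: replies embedded in the thread document, plus a "caching" reference that bundles often-used fields alongside the ObjectId. All names here are illustrative, not her actual schema:

    {
        _id: ObjectId("…"),
        title: "Example thread",
        replies: [
            {
                author: { _id: ObjectId("…"), display: "GothAlice" },   // cached reference
                message: "First reply.",
                posted: ISODate("2014-12-02T00:00:00Z")
            }
        ]
    }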
[02:35:47] <GothAlice> MongoDB really forces you to think about data differently. :) Odd, certainly!
[02:36:35] <GothAlice> Most people with relational backgrounds I describe "replies stored within the thread record" to, blink a few times then call me crazy. ;)
[02:37:48] <GothAlice> Boomtime: Is there any way to get "progress" information from a long-running update?
[02:38:31] <appledash> Well, I think you are crazy already
[02:39:14] <appledash> I think my use works alright because I'm storing a bunch of one thing and that one thing will be the only thing I'm ever storing and I will only ever store it one way
[02:40:42] <Boomtime> GothAlice: not exactly "progress" information (how do you know ahead of time what will match?)
[02:40:52] <Boomtime> but you can get current information
[02:46:42] <GothAlice> Well, I have a copy of the query and update SON objects passed to the original update()…
[02:47:01] <Boomtime> if you are in full control of the query/command such that you can make every one of them uniquely fingerprintable, then you are OK
[02:49:04] <Boomtime> you would need to know ahead of time how many documents matched
[02:49:21] <Boomtime> which means it would have to run the query twice just to get an _estimate_
[02:49:46] <GothAlice> Run the query, count it. Re-run in the update. The difference caused by the race condition there will be statistically insignificant compared to the size of the full collection.
[02:50:04] <Boomtime> so run the query twice... guess how popular that is?
[02:50:48] <Boomtime> but you care about performance
[02:51:04] <Boomtime> no, you can run the query twice
[02:52:21] <Boomtime> dunno what stats that reports
[02:52:29] <Boomtime> surprised there is no numUpdated
[02:52:46] <Boomtime> suggest an improvement if it's not there
[02:52:56] <GothAlice> This first query of mine updates every record in the collection. Counting first is immaterial. I would also hope that db.collection.count() would have a short-circuit path to a recorded statistic against the collection, rather than needing to iterate an index or documents.
[02:54:03] <Boomtime> .count() short circuits in the best way it can - how many index buckets are assigned for _id index, that is what you want
[02:57:54] <GothAlice> (I'm effectively wanting to be able to report progress information when updating the cached references of mine. I.e. a user changes their display name on the forums, I need to update all comment "author" references to have the new cached value—it's okay if this is wrong for an indeterminate period of time—and this can take some time if the user is very active. A "spinner" while the operation completes is the fallback I'll have to use,
[02:59:40] <GothAlice> Boomtime: All other "progress" tickets are currently marked "closed" and "works as designed". Hope is not strong.
[02:59:41] <Boomtime> a very nasty way to do it, is to poll (every second or so?) using a count() query for the thing that is going away.. since those matches should be depleting
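The polling approach Boomtime mentions, sketched: count the documents that still match the "going away" condition and report how many remain. Collection and field names are placeholders:

    var total = db.comments.count({"author.display": "OldName"});
    var remaining = total;
    while (remaining > 0) {
        remaining = db.comments.count({"author.display": "OldName"});
        print("progress: " + (total - remaining) + "/" + total);
        sleep(1000);   // poll roughly once a second
    }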
[03:05:10] <GothAlice> JIRA goes down for maintenance the moment I start really digging through tickets and try to watch a few of them.
[03:06:18] <GothAlice> 'Cause a read-only mode for maintenance windows is too useful. T_T
[03:09:04] <GothAlice> With a two hour maintenance window, I'll have to continue this tomorrow. Wasn't much hope, though. As mentioned, all the tickets I found so far had been closed "works as intended". (There seems to be a "we can't do this perfectly (give actual percentage progress) so we aren't going to try" stance at play.)
[03:09:26] <cheeser> GothAlice: if they used mongodb that wouldn't be a problem. ;)
[03:11:15] <GothAlice> cheeser: If they used Cloudfront, this wouldn't be a problem. >:|
[03:11:35] <GothAlice> (Read-only cache mode is perfectly acceptable for browsing tickets.)
[08:04:46] <scruz> how do i change this $project step to embedded syntax: {'$project': {'location': '$location_path.City'}}? i've tried {'location': {'$location_name_path': {'City': 1}}}, but i get an error
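One possible reading of what scruz is after (a guess from the question alone, untested against their data): $project can emit a nested document by nesting the expression, e.g.

    db.col.aggregate([
        { $project: { location: { City: "$location_path.City" } } }
    ])
    // yields documents shaped like { _id: …, location: { City: "…" } }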
[10:03:50] <shoshy> hey, i have a collection with documents holding a "created" date. Each such document might have dups with the same group_id (different _id of course though). How can i update all the LATEST documents with distinct group_id to the current date? The distinct command just returns an array...
[10:05:43] <joannac> shoshy: aggregation framework can get you the documents; you can then go through and update them
[10:07:06] <shoshy> joannac: ok, thanks! so it's the same as just saying .distinct() and then iterating over them
[10:07:50] <shoshy> right? there's no single command i can make to update the ones who have the latest "created" property and distinct "group_id" property
[10:08:06] <joannac> shoshy: I'm not sure how distinct would get you only the latest ones?
[10:09:01] <shoshy> joannac: it won't, but it'll get me all the documents distinct by "group_id" and then i'll run sort(-1)
[10:09:24] <shoshy> but maybe it's the wrong approach... :/
[10:09:54] <joannac> you can, you're just going to have to do a lot of queries
[10:10:07] <joannac> one for the distinct, then one for each group_id
[10:25:13] <joannac> I would sort by created:-1, and then group on group_id, and then use $first to grab the first _id field (which will be the _id of the document with the latest date for "created")
[10:27:43] <shoshy> joannac: could you please help me a bit with it? i'm new to it... thanks! i'll try your suggestion on my own, but still
[10:29:28] <shoshy> i can't put sort before / after aggregate
[10:29:28] <joannac> shoshy: this is what I was coding with my docs. change field names appropriately
[10:31:13] <shoshy> joannac: thanks a lot! i changed to : "db.groups.aggregate([{$sort: {created: -1}}, {$group:{_id: "$group_id", "id2": {$first: "$created"}}}])" and how do i add the real mongo db _id to each group? as adding a field must add an accumulator
[10:31:43] <shoshy> and adding $push will add all the _ids per group
[10:32:40] <joannac> the $first: "$_id" is the part that makes it keep the _id of the document
[10:34:22] <shoshy> ahhh... so i only need to change it to db.groups.aggregate([{$sort: {created: -1}}, {$group:{_id: "$group_id", "id2": {$first: "$_id"}}}])
[10:34:45] <shoshy> so you sort it by 'created', then group it by 'group_id' and take the 1st document per group
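One way to finish what shoshy describes: take the _id kept by $first for each group and update those documents to the current date. A minimal sketch; the field names follow the discussion above:

    db.groups.aggregate([
        { $sort: { created: -1 } },
        { $group: { _id: "$group_id", latest: { $first: "$_id" } } }
    ]).forEach(function (g) {
        db.groups.update({ _id: g.latest }, { $set: { created: new Date() } });
    });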
[11:06:19] <geo> Hi. Recently I switched to mongodb version 2.8 rc1. I want to take advantage of collection-level locking. I'm using a standalone server (no replica set). The server itself is quite powerful: 40 cores, 250 GB RAM. But I encounter some issues after it's been running for a few days.. The load on the server increases too much.
[11:06:52] <geo> I have a few updates, mostly only read operations
[11:07:42] <geo> The DB is quite big, over 40 M docs
[11:31:35] <remonvv> Hi all. I'm seeing some strange behaviour for fire-and-forget (w=0) writes. It looks like once some buffer hits a limit (I assume it's not writing the data as fast as it is being delivered) the mongod seems to freeze for a while before continuing normal operation rather than slowing down/throttling. Does this sound familiar to anyone?
[11:53:14] <joannac> geo: coming from mongodb? what do the logs say?
[11:58:11] <geo> joannac: Well, logs just say that some requests take a lot of time to complete. It can take more than 30 sec, so there can be lots of timeouts
[13:30:44] <drecute> when I get the error "exception:CSV file ends while inside quoted field" during mongoimport, how do I know the exact row that's failing?
[13:31:50] <geo> joannac: I temporarily removed count operations, and now all queries run fast. Why is count so slow?
[14:27:14] <GothAlice> geo: Just catching up on the backlog, but generally counting is slow for the same reason .skip() with very large values is slow: if there's an index involved in your query, it'll have to iterate the entire index contents before giving you a count (or to skip ahead it'll have to iterate the number you are skipping). It can't simply say: there's 12,000 records, it has to work it out.
[14:27:50] <GothAlice> Whereas normal querying will start streaming records to you as soon as it's found a match (basically), and will continue to "buffer" additional matches while your application chews on the ones found so far.
[14:30:04] <GothAlice> (And if there isn't an index involved, it'll actually have to walk through the records on-disk—all of them—before returning a count. That is understandably crazy slow.)
[15:16:54] <remonvv> GothAlice: You wouldn't believe the amount of count performance related discussions this channel has seen ;)
[15:17:25] <GothAlice> Most databases implement index b-trees similarly, thus will have similar penalties applied to count() and skip() operations…
[15:17:33] <GothAlice> MySQL exhibits the same O(n) behaviour.
[15:18:37] <remonvv> Well, although that's true there were a couple of issues with MongoDB's implementation.
[15:19:09] <nimomo> Hi, I want to query my collection. How can I get only one item from the document? the item is in inner array. what should I write instead of {item:1} in the projection part?
[15:19:21] <remonvv> And some databases allow configuration of "counted" b-trees that have leaf counts on each node for quick counts.
[15:19:34] <GothAlice> nimomo: Are you looking for a _particular_ array element?
[15:20:06] <GothAlice> nimomo: Then I'll need to know the actual question you are trying to ask that data before I can assist in formulating a query to match that question.
[15:20:07] <nimomo> I have the fields data->bla->item
[15:20:15] <remonvv> pagination through skip/limit is not a great implementation for larger sets anyway. If it aint O(1)-O(log n) it aint scalin'!
[15:20:55] <nimomo> I want to retrieve only the item
[15:21:02] <remonvv> nimomo: You cannot do what I think you're asking. A query always returns the root document (albeit potentially filtered).
[15:21:22] <GothAlice> remonvv: (My MongoDB CMS can rank and present every Asset in the entire CMS on the search page if you search for nothing, and it takes ~200ms to generate. 10 years of City Hall data. ;)
[15:21:32] <nimomo> but I can return by projection, not?
[15:21:35] <remonvv> You can remove fields from the resultset documents but that is the extent of your options. If you need what you need you should look at schema refactoring.
[15:22:21] <GothAlice> remonvv: $elemMatch and $ projection, $slice, or a variety of other means allow you to return subsets of nested arrays.
[15:22:33] <remonvv> He's trying to get an element from an embedded array as the pure result of a query.
[15:22:34] <nimomo> GothAlice: I have an email field within data, which is a child of the root. I want to return only the email field... (without the other fields)
[15:22:43] <remonvv> Ignoring AF for the moment, no you can't
[15:23:03] <remonvv> Even $slice will not change the result document structure
[15:23:21] <nimomo> GothAlice: something like - db.mycollection.find({"data.email": 1})
[15:23:23] <remonvv> It will simply remove data from the result documents
[15:23:53] <remonvv> nimomo: You need to phrase your question more clearly, preferably by putting a document and your intended result in a pastie or something,.
[15:23:55] <GothAlice> nimomo: db.example.find({}, {"data.email": 1}) — should actually work. It'll return the e-mail field from all array members, though, which didn't sound like what you want.
[15:24:14] <remonvv> But what (I think) you want isn't possible.
[15:24:42] <remonvv> given {a:[1,2,3]} you can return {a:[1]} but not 1
[15:25:08] <nimomo> GothAlice: it's exactly what I want.. is there any way to show only email fieldname (without data fieldname)
[15:25:15] <remonvv> And the {a:[1]} only for specific schemas
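The projection operators GothAlice refers to, sketched against a toy document {a: [1, 2, 3]}; as remonvv notes, they trim the array but never change the shape of the returned document:

    db.example.find({ a: 2 }, { "a.$": 1 })        // only the first element matching the query
    db.example.find({}, { a: { $slice: 2 } })      // only the first two elements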
[15:25:37] <GothAlice> nimomo: I can't parse your question.
[15:25:52] <GothAlice> nimomo: Ah, I think I might. No. That's a minor detail that can be automatically skipped over by appropriate abstraction. (A la the .scalar() thing I wrote for MongoEngine which will "unpack" single fields into tuples.)
[15:25:53] <remonvv> nimomo: What you want is not possible with the native query language.
[15:27:14] <GothAlice> Using MongoEngine if you write: Collection.objects.scalar('data.email') you'll get back an iterable of those literal values. (Because behind-the-scenes the "cursor" wrapper tracks which fields you are asking for and then unpacks the deeply nested structure into a flat one before returning the results on each iteration.)
[15:29:02] <remonvv> Yeah it's a problem best solved in higher level ORM/ODM. Mine has similar functionality although it comes with warnings.
[15:30:40] <GothAlice> remonvv: I wrote mine to ensure all normal cursor methods continued to work. (I.e. you could continue to .skip(), .limit(), .all() to get the results as a concrete list, .explain(), etc.)
[15:31:27] <remonvv> Same, it's a cursor extension on mine as well but I'm not sure if it's a very good idea.
[15:32:21] <GothAlice> Do you ever want just a list of _ids matching a query? If you've _ever_ needed to do that, anything worth doing once is worth automating. ;)
[15:34:25] <remonvv> Oh there's a ton of use cases for it. I'm just not sure if my/our implementation is necessarily more clear than c = find({..}, {_id:1}).cursor() -> while(c) { myId = c.next().id }
[15:34:27] <GothAlice> Removing boilerplate like: Collection.objects(creator__in=[i._id for i in User.objects(role='admin').only('_id')]) (list comprehension… everywhere you would need to do that) vs. Collection.objects(creator__in=User.objects(role='admin').scalar('_id')) — much nicer. (Runs .only() first to limit the returned fields, ofc.)
[15:35:15] <GothAlice> That inline comprehension is gibberish to all but the most expert eye in the language. I, for example, can't recognize that from modem line noise. :| (What language?)
[15:35:16] <remonvv> We've had a number of discussions about it and there are some valid arguments against making heavy lifting look like lightweight stuff.
[15:35:44] <GothAlice> You project only the fields you want, you then iterate the list of desired fields and unpack them into an array on each .next() iteration.
[15:40:22] <remonvv> I'm not sure a cursor (based) convenience method should affect the query. But yes, so that was basically the discussion.
[15:41:51] <remonvv> In our case anyway; either do "black magic" and affect the query being executed or scope it on cursor completely (so you basically get anyCursor.scalar(yourfield) which abstracts it away from query/projection
[15:42:27] <remonvv> But you know, sometimes pragmatic > clean
[15:42:35] <GothAlice> Used in the next() iterator here: https://github.com/MongoEngine/mongoengine/blob/master/mongoengine/queryset/base.py#L1330-L1345 (also used in .bulk() and __getitem__ skip/limit slicing). The "heavy lifting" is this: https://github.com/MongoEngine/mongoengine/blob/master/mongoengine/queryset/base.py#L1616-L1628
[15:43:00] <GothAlice> I like pragmatism. And a dozen or so lines of code to support it isn't terrible at all. ;)
[15:45:16] <remonvv> Well it's a cursor scoped method that affects the underlying query isn't it? Clean in terms of correct isolation and such. Something being clean is a highly subjective discussion anyway ;)
[15:46:17] <remonvv> I'd prefer db.col.find({...}, {field:1}).scalar("field") over db.col.find({...}).scalar("field") with implied field projection is what I'm saying
[15:47:38] <GothAlice> https://github.com/MongoEngine/mongoengine/blob/master/mongoengine/queryset/base.py#L942 — pretty explicitly documented that that is exactly what it does. ;)
[15:48:39] <GothAlice> And writing the field twice is, IMHO, utterly hideous. The point of abstractions is to reduce work, not give you more.
[15:49:47] <remonvv> It's two different things. That's what I meant with the obfuscation argument. One tells the database what to do, the other is convenience to remove boiler plate. Making the latter decide on the former is something that people could have a discussion about ;)
[15:50:22] <remonvv> We did, we landed on the former. It's "cleaner" in that it allows for better isolation so we can do cool things like :
[15:51:19] <remonvv> loads the entire resultset to warm up the cache and returns the ids that were used to warm it up, in one go.
[15:51:38] <remonvv> But anyway, opinionated libraries are good.
[15:52:54] <remonvv> cache() returning a cursor wrapper that warms entity caches with whatever results are pulled through the encapsulated cursor by the way
[15:53:09] <GothAlice> ¬_¬ MongoEngine features caching by default.
[15:53:49] <GothAlice> Why "ouch", if you have full control over it?
[15:54:07] <remonvv> How do you invalidate that cache over multiple instances?
[15:54:09] <GothAlice> Same with auto-dereferencing up to X levels deep. (By default, 1.)
[15:55:25] <GothAlice> https://github.com/MongoEngine/mongoengine/blob/master/mongoengine/queryset/queryset.py — all of the caching (or not caching) "magic" is here.
[15:55:43] <GothAlice> Similar .cache() with an opposing .no_cache() method.
[15:56:36] <GothAlice> The cache is local to the query, used to substantially improve performance of certain standard tools. list(query) runs len(query) frequently, so that's cached rather than running count() each time, etc.
[15:57:33] <remonvv> Yes but then you'd have to always cache your query results
[15:57:50] <remonvv> so len(query) == count(query)
[15:58:14] <GothAlice> Uhm… it's an either / or situation. If you enable caching, you get caching. If you disable it, it's disabled.
[15:59:21] <GothAlice> (I keep it disabled most of the time. No point caching throwaway queries, only ones that will be repeatedly iterated.)
[16:05:33] <remonvv> Right, but do you feel it's immediately obvious to developers that when they use cache() they're implicitly storing an entire result set in memory?
[16:06:37] <remonvv> The alternative being, if you want to do that just grab the resultset and iterate over it explicitly, it's one or two lines of extra code for an edge case.
[16:06:51] <GothAlice> This is typical behaviour for Python ODMs. (All of the ORMs and ODMs I have ever used behave this way, and it's not quite "all of it in memory at once", it's buffered in chunks to allow for slicing previously iterated records and such.)
[16:07:20] <cancancu> Hey guys, I have a quick question. What kind of data, MongoDB is good for storing ?
[16:07:37] <GothAlice> cancancu: All data except graphs and data where relational integrity or transactional safety are utter requirements.
[16:07:59] <cancancu> For example, Cassandra is good for write-heavy workload. Is there any usecase which is optimized using MongoDB than Cassandra
[16:08:26] <GothAlice> cancancu: Optimization without measurement is by definition premature. One would have to evaluate that on a use case by use case basis.
[16:08:31] <remonvv> cancancu: You're oversimplifying it a bit ;)
[16:09:16] <cancancu> GothAlice: Would it be fair to say that MongoDB is good for storing blobs than Cassandra ? If so, what is the main reason ?
[16:10:10] <GothAlice> cancancu: For example, in a rather write-heavy (and write-amplifying, i.e. one write can trigger a dozen more writes) use case, I run distributed RPC using MongoDB as a message bus / queue. I last benchmarked 1.9 million dRPC round-trips (submit task, pull, execute, submit response, all with locking) per second per host.
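Not GothAlice's actual code, just a common pattern for the "MongoDB as a message queue" idea she describes: a worker atomically claims one pending task with findAndModify, so two workers can't grab the same document:

    var task = db.tasks.findAndModify({
        query:  { state: "pending" },
        sort:   { _id: 1 },
        update: { $set: { state: "claimed", worker: "host-1" } },
        new:    true
    });
    // execute the task, then write the response back and mark it complete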
[16:10:52] <GothAlice> cancancu: For the BLOB aspect, I currently have 25 TiB of BLOB data in MongoDB using GridFS. It's designed for it.
[16:11:07] <cancancu> Well, I was looking on the web and saw that if you need heavy writes, choose Cassandra because its storage engine is optimized for write performance. I was wondering if there is any specific case which motivates using MongoDB
[16:11:25] <remonvv> cancancu: Good in absolute terms, or better than similar NoSQL storage solutions? You can't say MongoDB is above-average at certain workloads if you don't specify what you're comparing it against.
[16:11:38] <remonvv> cancancu: Did you read on WHY Cassandra is good for write heavy loads?
[16:12:06] <cancancu> remonvv: Yup, I know why Cassandra is optimized for write performance
[16:13:29] <remonvv> cancancu: Well it's optimized for that because that was a requirement of the company that built it. The question to ask yourself is how did they manage near linear scalability for writes and what are the consequences (nothing comes for free).
[16:13:44] <cancancu> Someone also wrote that Cassandra is good for storing structured data and MongoDB is good for storing blob data, without reasoning why. Does it even make sense?
[16:15:24] <GothAlice> remonvv: http://irclogger.com/.mongodb/2014-11-19#1416454077-1416452929 — "what is MongoDB good for" is a common question, so one time I answered it thoroughly and now can link to it. (Need to bookmark so I can find it back faster in the future!)
[16:16:59] <cheeser> there's a social networking reference architecture on the website. ;)
[16:17:11] <GothAlice> Indeed. All things are possible.
[16:17:39] <GothAlice> And if you only need single-hop lookups, it's a-ok. As soon as you need "11th degree" relationships and things like that, you're borked unless you're using a real graph DB.
[16:17:43] <remonvv> Is that true? Any use case that needs availability is pretty much one where you shouldn't use MongoDB
[16:19:32] <GothAlice> Across the entire cluster, though, it's 100%. In that near 1000 day period the DB was at no point inaccessible.
[16:19:50] <GothAlice> (Only node downtime is for process upgrades.)
[16:19:56] <remonvv> I'm not sure what you're saying. You're claiming MongoDB is full CAP?
[16:20:13] <cancancu> GothAlice: Thanks for the link
[16:20:50] <cancancu> remonvv: MongoDB is master/slave. Though there is the concept of a secondary master, it's still a single point of failure
[16:20:56] <remonvv> If you pull out certain instance MongoDB stops being fully available (as per its design). I don't get what else there is to it.
[16:20:59] <GothAlice> remonvv: I'm arguing that the proof is in the pudding: there is no reason why MongoDB would be any more unstable than any other long-running daemon.
[16:21:55] <GothAlice> Stops being fully available? Not my cluster. You pull a primary, the cluster elects a new one (<10ms in testing), all connections are reset, the clients with in-progress queries will reconnect and retry them, and you're done.
[16:22:08] <remonvv> cancancu: It has replication, yes. Its scaling strategy is based on sharding (splitting your data into smaller chunks, distributing those chunks across multiple instances and making the client aware of where each chunk lives)
[16:22:22] <GothAlice> "Effectively zero downtime" even though yeah, connections had to be reconnected and queries re-run.
[16:22:27] <cheeser> i wouldn't call mongodb SPOF at all.
[16:22:54] <cancancu> I am rather trying to find a specific usecase which will be more optimized with MongoDB
[16:22:56] <remonvv> You are confusing uptime with availability (again, as per CAP)
[16:23:40] <GothAlice> remonvv: If at no point is it "unavailable" while also being "up", uptime and availability _are_ the same.
[16:24:15] <remonvv> Uhm, that's like saying "if my car has never broken down before, it's indestructible"
[16:24:34] <GothAlice> remonvv: I can't predict the future, only give you stats based on my last 1000 days of experience on this particular cluster. ;)
[16:25:03] <cancancu> cheeser: you wouldn't call MongoDB a single point of failure, but you are also not discussing the pain in case of master failure :)
[16:25:15] <GothAlice> cancancu: That's the thing; there is zero pain.
[16:25:31] <GothAlice> A primary goes away, the secondaries figure out who had the latest data, elect that member, and the cluster just keeps going.
[16:25:40] <cheeser> mongodb hasn't used "master" for several years. mongo uses primary/secondary setups now.
[16:28:51] <remonvv> In CAP you get to pick two out of three letters. Mongo is CP. It isn't A. A stands for availability. Split brain, elections and I'm sure a few other situations make it so you cannot read or write certain data.
[16:29:35] <GothAlice> remonvv: Sure. In 8+ years of using MongoDB none of these situations has ever occurred.
[16:29:42] <Derick> remonvv: I've always found that the borders between P and A can be rather vague ;-)
[16:29:46] <cancancu> remonvv: MongoDB is CP :) which means in case of a partition it always returns consistent data and sacrifices A
[16:30:52] <remonvv> Derick: I would call MongoDB comfortably C, in theory P and not A ;)
[16:31:53] <cancancu> remonvv: It's not the case with MongoDB. All other databases have Master/Slave or Primary/Secondary architecture
[16:32:30] <cheeser> cancancu: mongodb has primary/secondary. i'm not sure what you're trying to say.
[16:32:49] <remonvv> cancancu: No, there are databases for each of the three possible CAP combinations. MongoDB uses replication (per shard), and sharding for scaling.
[16:33:13] <remonvv> Found it : http://i.stack.imgur.com/a9hMn.png
[16:33:30] <cancancu> Guys :( I am trying to find a usecase where MongoDB is more suited than other NoSQL databases :(
[16:33:42] <remonvv> That almost sounds like homework
[16:34:03] <cheeser> "nosql" is a pretty meaningless term.
[16:34:11] <remonvv> MongoDB is a general purpose document storage. It's good, it's easy to get started with (biggest plus in my book) and it's mature.
[16:34:29] <remonvv> But you can't compare it with Cassandra unless you add a use case.
[16:35:31] <remonvv> That wasn't a great analogy but you get my drift.
[16:35:37] <cancancu> Yes it's my homework :( Cassandra is good for write-heavy workloads. What is MongoDB good at :(
[16:35:39] <GothAlice> Right, so kill -9'ing a few processes gets me: three events at 2014-12-02T11:31:40.534-0500 relating to loss of communication with the primary, 2014-12-02T11:31:41.901-0500 marking the primary as DOWN, 2014-12-02T11:31:43.012-0500 a new primary is confirmed as elected by the rsMgr. Nowhere near 30 seconds.
[16:35:41] <remonvv> Really? I only read about the one that's likely to become the default.
[16:35:42] <cancancu> The answer I get is everything
[16:36:16] <cheeser> i've seen reports of 30s elections, too.
[16:36:23] <remonvv> Load doesn't really affect elections. Size of set, network availability/performance, etc.
[16:36:29] <cheeser> though I think that shouldn't happen (as much?) in 2.8
[16:36:53] <remonvv> We've had EC2 instances that stopped seeing each other
[16:36:59] <GothAlice> In the three real-world cases where failover has happened in my cluster, election back to read/write took < 5s each time.
[16:37:03] <remonvv> Which means no majority -> no election
[16:37:28] <remonvv> That makes you lucky and well prepared, not permanently available ;)
[16:37:45] <GothAlice> remonvv: Again, I can't predict the future, only give you information about the cluster's performance to date.
[16:37:46] <remonvv> But anyway, this is an apples and oranges discussion. Theory versus reality.
[16:38:45] <remonvv> I understand, availability in this context refers to "Can every client always write and read the data it wants" (given a proper cluster topology)
[16:39:03] <remonvv> You can remove certain minority of instances from a healthy mongodb cluster that make that not the case.
[16:39:26] <remonvv> And if you consider cluster metadata as part of the "data" being referred to here you can even argue that config servers are SPoF
[16:39:39] <GothAlice> remonvv: That's why this cluster has three.
[16:39:53] <remonvv> A cluster goes to metadata read-only if any of those three fail.
[16:40:21] <remonvv> But that's nitpicking, I wouldn't consider that a real-world issue in all but a few cases.
[16:40:53] <GothAlice> remonvv: http://docs.mongodb.org/manual/core/sharded-cluster-config-servers/#sharding-config-server — see the warning box.
[16:41:05] <remonvv> GothAlice: I'm pretty familiar with how it works ;)
[16:41:26] <GothAlice> remonvv: It seems there are multiple definitions of "single point of failure" being thrown around, even in these docs.
[16:42:16] <remonvv> Are there? there's not much room for different interpretations.
[16:43:32] <remonvv> I wonder if cancancu is still paying attention :p
[16:44:43] <cancancu> Let me read all the messages above and repeat my question: What is MongoDB good for :P
[16:44:52] <remonvv> cancancu: That being said though, you need to figure out what it is you want to know. MongoDB doesn't have a very specific sweetspot for particular use cases or data types. There are things it can do and things it cannot do or cannot do well/efficiently.
[16:45:25] <remonvv> cancancu: If you do not know what to pick as a database tech, then pick MongoDB. It's easy to get going, it's general purpose and it scales up easily.
[16:45:28] <santib> hey folks quick question, is there any way to prevent the deletion of a document that has references in other documents
[16:45:58] <GothAlice> santib: So, you'll have to take the opposite approach. Handle the case where the reference doesn't exist when you try to read it.
[16:46:17] <GothAlice> Often one does not need it.
[16:46:22] <cancancu> remonvv: Your answer is highly appreciated :) That's what I was looking for. However, I would really like tto know that what are the things it can't do efficiently ?
[16:46:24] <remonvv> Let us all hope and pray it stays that way
[16:46:46] <GothAlice> http://www.javaworld.com/article/2088406/enterprise-java/how-to-screw-up-your-mongodb-schema-design.html is a rough overview of how to convert your thinking from relational to document.
[16:47:04] <geo_> GothAlice: thx for reply, yes seems that there is no magic method to make count faster. For now I will try to avoid using it directly.
[16:47:35] <GothAlice> I.e. I store all replies to a discussion thread within that thread, rather than having a separate collection and then needing to "join" them. MongoDB lets me query (and update) exact comments, "paginate" them, get first/last, etc., all embedded inside the parent document.
[16:47:44] <remonvv> cancancu: It's hard (and sometimes impossible) to maintain relational integrity, it's not very good at graph-like schemas, currently its storage engine is mmap based which can cause various performance problems, and it's not very efficient with disk space
[16:48:00] <GothAlice> Fragmentation of the on-disk stripes…
[16:48:23] <GothAlice> And whatever you do, don't try to run mongod on btrfs unless you are a wizzard, 'arry.
[16:49:45] <GothAlice> geo_: There's a reason MongoEngine likes to cache the count() of a result so heavily. ;)
[16:59:53] <remonvv> geo_: You can maintain your own counters in some cases. Why do you need to count a large set of results?
[17:02:55] <remonvv> geo_: Also, not all count()s are created equal. There are some optimizations in the b-tree walk that make certain counts (e.g. low cardinality) faster iirc.
[17:03:42] <GothAlice> Also full-collection counts which cheat and use the _id index bucket count, apparently.
[17:04:22] <Derick> cheat? :-) It's an optimisation! :P
[17:07:13] <remonvv> GothAlice: Atomic counter for what?
[17:07:24] <GothAlice> "Number of living documents in the collection."
[17:08:11] <remonvv> Like maintain an in-memory counter for each collection?
[17:08:30] <GothAlice> .insert would $inc for each record inserted, .remove would $inc by a negative amount for each record removed. In both cases the operation could simply keep a counter and when done perform a single atomic update of the counter. In-memory, on-disk, it'd likely just be a field in a metadata system collection somewhere.
[17:08:47] <GothAlice> s/keep a counter/keep a local counter
[17:08:50] <remonvv> If it has to go to disk you just sliced insert/remove performance in two though.
[17:10:01] <remonvv> Derick: Does it need to? No option to add counts to b-tree nodes or something?
[17:10:19] <GothAlice> Derick: Indeed; this is even quite rollable client-side, albeit with a race condition. (nInserted, nRemoved, etc. results of the operation passed to a secondary query to update the document count metadata.)
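A minimal client-side sketch of that idea (names are illustrative): apply the write, then $inc a metadata document by the change the WriteResult reports:

    var res = db.things.remove({ expired: true });
    db.counters.update(
        { _id: "things" },                   // one counter document per collection
        { $inc: { alive: -res.nRemoved } },  // race-prone, but eventually close enough
        { upsert: true }
    );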
[17:10:27] <Derick> I am not really sure whether WT *uses* btrees
[17:10:44] <remonvv> You can only either make it slow or eventually consistent (separate counters, that is)
[17:10:52] <Derick> remonvv: it would change the on-disk format, not?
[17:11:18] <remonvv> Derick: Of the indexes, yes. But it has to be an index flag anyway so it's backwards compatible.
[17:11:24] <GothAlice> Eventually consistent would be more than adequate, and a step up from the current "pseudo-random" behaviour in larger clusters described above. ;)
[17:11:26] <remonvv> And the index data has a version field
[17:11:44] <remonvv> GothAlice: Agree that it is better than it is now, but counting your entire collection is relatively rare.
[17:11:51] <Derick> sure - I don't think it is something they change lightly though.
[17:12:17] <GothAlice> You'd even be able to estimate how off (and in what direction it's off) by looking at currentOp; sum(inserts) - sum(removals) would give you the bias.
[17:12:25] <remonvv> Derick: I like WT already, regardless of count performance being improved.
[17:17:11] <remonvv> Derick: Hm, pretty sure you could add compression between the raw read/writes but anyway I'm glad it's there now.
[17:58:20] <rfv> hey guys, one of my collections has documents in it, but when trying to fetch them, it doesn't return anything (ie. count() returns something > 0, but findOne() returns null)
[18:27:34] <PirosB3> rfv: rebuild indexes and try again
[18:31:20] <cers> I'm very new to mongodb, and apparently there's something I haven't quite grasped when it comes to the find function. I have entries that might look something like {type: "exoplanet", dimensions: {diameter: {value: 1000, unit: 'km'}}}, and I want to find any entry where type contains the word 'planet' and dimensions.diameter is between two values x and y. I should note that dimensions.diameter might not exist.
[18:31:34] <cers> How would I go about writing such a find function?
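A hedged attempt at the query cers describes, assuming the number lives at dimensions.diameter.value and the collection is called entries (both assumptions): a regex for the type plus a range on the nested field. Documents lacking the field simply won't match the range condition.

    db.entries.find({
        type: /planet/,
        "dimensions.diameter.value": { $gte: 100, $lte: 5000 }   // x and y go here
    })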
[18:31:50] <rfv> PirosB3: I just did that, but it didn't do anything ... I have a collection with only one document, count() returns 1, findOne() returns null :|
[19:07:03] <styles> https://privatepaste.com/67e205789e I'm trying to get this query working, the matching doesn't seem to work
[19:07:09] <styles> It returns 0 results every time
[21:41:45] <kajsa_a> Question: does anyone know why mongodump with --query and mongo shell with the same query would report different numbers of records?
[21:45:48] <GothAlice> kajsa_a: Are you getting this number from the output of .count()?
[21:45:54] <GothAlice> (Or the actual dump of records?)
[21:46:31] <kajsa_a> GothAlice: I'm comparing the number from the dump output to the number from db.find(<query>).count()
[21:46:43] <kajsa_a> GothAlice: count from shell is higher
[21:48:48] <GothAlice> kajsa_a: Depends on your cluster setup (replication lag, sharding aggregate estimation of counts, etc.) but it can vary even from call to call and connection to connection (i.e. even if you ask for the count at the same moment across multiple connections, each might get a different answer.)
[21:48:49] <Boomtime> tskaggs: is this on a sharded cluster?
[21:51:22] <GothAlice> "Cluster" is a non-descriptive term for any group of systems working together on something. ;)
[21:51:56] <GothAlice> tskaggs: My earlier question still applies, though. What's the issue you're encountering?
[21:54:29] <tskaggs> It takes forever to do a findAll.. then reformat... but the big issue is when I pass the id and whole object to update(id, newobject){}; it just doesn't update and has a $set error
[21:54:50] <tskaggs> GothAlice ^ and here's my controller http://hastebin.com/polelolumi.lua
[21:55:52] <GothAlice> tskaggs: Which driver abstraction are you using?
[21:57:16] <tskaggs> GothAlice sorry new to mongo what do you mean?
[21:57:39] <GothAlice> tskaggs: Which library are you using as your database connection layer? I.e. in Python that'd be something like pymongo, MongoEngine, etc.
[22:00:26] <GothAlice> sails-mongo is, well, extremely pre-1.0.
[22:00:58] <GothAlice> https://github.com/balderdashy/sails-mongo/blob/master/lib/collection.js#L196 — it does not look like the first argument can be a random ObjectId, it looks like it needs to be {_id: ObjectId(…)}.
[22:01:24] <GothAlice> It basically has no tests or integrated documentation, though, so I can't actually confirm if this is the case or not… at all. :/
[22:02:23] <tskaggs> GothAlice Yea it's not random. in the hastebin I do a find() to return all items, then an each to return each individual item, reformat it, then pass the proper ID and object to the update()
[22:03:05] <tskaggs> I wasn't sure how to update Mongo quicker than that since I assume other people have dealt with 100k+ item collections.. :/
[22:03:50] <GothAlice> tskaggs: The only way to get acceptable performance on large-scale updates is by using MongoDB's atomic update operations.
[22:04:13] <GothAlice> http://docs.mongodb.org/manual/reference/operator/update/ — these things. (I.e. $inc, $push, etc.)
[22:06:03] <GothAlice> Multi-updates are a bit "riskier", but there are ways of handling that, too. (I.e. two-phase commits.)
[22:06:16] <Boomtime> to be clear: a "multi document update" is applied as a series of atomic single document updates
[22:06:48] <GothAlice> Failure part-way through will result in partial application of the changes to the documents. (I.e. some will have been updated, others not.)
[22:07:58] <GothAlice> In your case you are $set'ing a new structure, so you can resume quite easily from where you left off by also querying on that field not existing. (I.e. only find records that haven't been updated yet.)
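A minimal sketch of that resumable multi-update, with placeholder field names: only touch documents that don't yet have the new structure, so a run that fails part-way can simply be re-run.

    db.items.update(
        { newFormat: { $exists: false } },                            // skip already-converted docs
        { $set: { newFormat: true /* plus the real new fields */ } },
        { multi: true }
    );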
[22:10:57] <kajsa_a> so true - which is why my count issue concerned me :)
[22:14:03] <GothAlice> Speaking of count(), it'd be frickin' awesome if any operation that can return an inaccurate number also includes its confidence / estimated standard deviation / accuracy.
[22:15:45] <Boomtime> count() in a cluster will always be equal or higher than the actual, it isn't really even an estimate, and no parameters can be placed on exactly how wide of the mark it is
[22:17:01] <GothAlice> Better living through entropy, I say. ;)
[22:17:29] <cheeser> entropy isn't what it used to be, though.
[22:21:52] <GothAlice> Things like that will lead to "Achievement Unlocked: Heisenbug" (a la https://exogen.github.io/nose-achievements/#builtin:heisenbug ;)
[22:35:41] <kajsa_a> GothAlice, I wrote up an answer to the StackOverflow question that someone else had posted on my same issue, to try to make it more searchable for the next person to stumble onto this nuance - http://stackoverflow.com/questions/21776666/mongodump-and-then-remove-not-exact-same-number-of-records/27260279