[00:02:39] <appledash> This is the test data, BTW: https://gist.githubusercontent.com/AppleDash/32518e17866c7bbf0ed7/raw/19ddb617b31849fefe74f180a91ab5aa0416bb73/gistfile1.txt As it is, the query is intended to retrieve all the documents that match any of the attributes queried for, and sort them in the order of the ones with the most matches first. (It does so fine)
[00:26:13] <kYuZz> I have a MongoDB install on the server of one of my clients.
[00:26:13] <kYuZz> An application I have no control over is using one single Mongo database without authentication.
[00:26:13] <kYuZz> Basically I would like every new database to require authentication, but the previous one needs to stay accessible without credentials.
[00:26:15] <kYuZz> I managed to enable authentication successfully, but now the previous database requires authentication.
[00:34:10] <GothAlice> appledash: Right, so you have several changes you need to make here. In your $group, add: {…, val: {$push: "$value"}}, after the $group add a new pipeline stage: {$project: {vallen: {$size: "$val"}}} then you can sort on "vallen". This, if I've interpreted your request correctly.
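A minimal sketch of the pipeline GothAlice describes, as one aggregate call. The collection and field names (things, attributes, name, value) are assumptions based on the surrounding discussion, not taken from the actual gist:

    db.things.aggregate([
        {$unwind: "$attributes"},
        {$match: {"attributes.name": {$in: ["hairColor", "gender"]}}},   // whatever criteria apply
        {$group: {_id: "$_id", val: {$push: "$attributes.value"}}},
        {$project: {vallen: {$size: "$val"}}},   // $size requires MongoDB 2.6+
        {$sort: {vallen: -1}}
    ])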
[00:35:17] <appledash> Not entirely sure if you have
[00:35:18] <SubCreative> Anyone familiar with working with geospacial queries and mongodb?
[00:35:42] <Boomtime> just ask your question and maybe somebody can help you
[00:35:48] <SubCreative> I'm working with a $near query and things work; next I'd like to calculate and display the distance it's being sorted by.
[00:35:49] <GothAlice> appledash: The above will give you the count of the "matches". The search criteria part I'm still pondering.
[00:35:50] <appledash> GothAlice: It works fine as it is, but I just want to make the val in the name, val pairs be an array instead, and have it so I can check if something is in that array
[00:36:18] <GothAlice> appledash: Gist me what you want as your output from that sample data?
[00:36:36] <GothAlice> appledash: An example of the desired output may be easier for me to grok.
[00:36:38] <appledash> It already provides the right output
[00:36:55] <appledash> Running the query as it is on that test data provides the results I want
[00:37:19] <appledash> It works 100% fine, but I want to make it do more things because that's what I need
[00:37:54] <GothAlice> appledash: Except that you want to change it to produce something else and thus the current output is _not_ your desired output.
[00:38:53] <appledash> What I'm trying to say is that it isn't broken
[00:39:25] <appledash> The desired output will still be exactly the same
[00:39:38] <appledash> But the query used to get it needs to change because I want to change how the database is laid out
[00:40:29] <GothAlice> Your sample data and aggregate return nothing on my local machine, interestingly. :/
[00:40:40] <appledash> Instead of the attributes being {name: "NameHere", value: "ValueHere"}, I want them to become {name: "NameHere", value: ["Value1Here", "Value2Here"]}
[00:40:52] <appledash> And instead of checking if the value is equal, I want to check if it's in the array
[00:42:32] <SubCreative> Boomtime: I'm aware of $geoNear, however it's not available in meteor.js
[00:43:03] <SubCreative> Boomtime: My question was related to mongo/meteor (i'm in #meteor also), just wanted to see if anyone had experience here working with maps/locations/distance in meteor
[00:43:12] <GothAlice> appledash: Then your current query should actually work unmodified.
[00:43:33] <Boomtime> SubCreative: you only mentioned $near before, I'm not sure it's possible with $near
[00:43:54] <GothAlice> appledash: {… val: "female"} if "val" is an array will compare and actually ask it 'any val element is "female"' instead.
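A tiny illustration of the array-matching behaviour GothAlice mentions; the collection and field names here are hypothetical:

    // A plain equality condition matches an array field if ANY element is equal,
    // so the same query works whether "val" is a scalar or an array.
    db.characters.find({val: "female"})
    // matches {val: "female"} as well as {val: ["female", "pegasus"]}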
[00:44:40] <Boomtime> SubCreative: MongoDB can do what you want but your choice of framework doesn't allow it - there is little I can do about that
[00:45:20] <GothAlice> appledash: Your current data model is giving me a serious sinking feeling in the pit of my stomach, though. One assumes your arbitrary key/value storage is, in fact, limited to a certain set of valid keys. If this is the case, why have this convoluted list-of-abstract-properties model?
[00:47:06] <appledash> Will have about 100-200 attributes per thing
[00:47:17] <appledash> with up to 20-30 being queried against at once
[00:47:26] <GothAlice> appledash: I use sparse fields with potentially hundreds of fields to describe metadata in my metadata filesystem currently holding 25 TiB of data…
[00:47:40] <GothAlice> (With indexes on many of those fields to speed up querying.)
[00:48:04] <appledash> Well, how would I even modify my query if my data wasn't how it is?
[00:48:14] <GothAlice> appledash: Records can differ nearly 100% in the attributes they have. (Only _id, path, parent, and parents fields are common.)
[00:49:57] <Boomtime> it is not the same without it
[00:50:05] <Boomtime> the difference is very subtle
[00:50:06] <appledash> I don't think that does what you think it does GothAlice
[00:50:10] <GothAlice> Boomtime: Currently, due to the weird nesting here, but if those "virtual" attributes moved up into the document proper, things become infinitely simpler.
[00:50:29] <appledash> I suppose they could be moved to the root document
[00:57:34] <GothAlice> appledash: And I can't reproduce your query to test modifications locally. No output using https://gist.githubusercontent.com/AppleDash/32518e17866c7bbf0ed7/raw/19ddb617b31849fefe74f180a91ab5aa0416bb73/gistfile1.txt as sample data against the https://gist.githubusercontent.com/AppleDash/48e4d803e6fe870f7915/raw/85146717ad71839822b70451eba45319d51f1fe0/gistfile1.txt query.
[01:03:45] <GothAlice> appledash: If you change "purple|pink" to ["purple", "pink"] in your second test record, then change "blue" references in the query to "pink" ones, it just works.
[01:04:04] <appledash> Now, that brings me into another question
[01:04:06] <Boomtime> appledash: if I understand correctly, you want a doubly nested array
[01:04:09] <appledash> How the heck do I update() that?
[01:05:08] <appledash> If I wanted to change that with update(), what would I do?
[01:05:15] <GothAlice> appledash: $set to replace the value, or $push to append to the list…
[01:05:21] <GothAlice> $pull to remove an element…
[01:10:28] <appledash> I'm not 100% sure I understand how that works, but OK.
[01:10:33] <GothAlice> So, db.characters.update({_id: ObjectId("…"), "attributes.name": "hairColor"}, {$push: {"attributes.$.val": "plaid"}}) — this would add it to the list
[01:11:15] <GothAlice> So, attributes.name in the query part will search for a particular array element. Once it finds it, it remembers it as $. It's then used in the update portion of that to represent the array element previously found, thus instead of attributes.27.val, you have attributes.$.val since MongoDB figured out the index for you.
[01:11:26] <Boomtime> you probably mean $addToSet (otherwise you could have the same hairColor twice)
[01:12:37] <Boomtime> however, you have a problem removing attributes - you'd need to update the whole array since you're already using the positional operator
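A hedged sketch of the positional update with Boomtime's $addToSet suggestion applied; the collection and field names follow GothAlice's example above, not the actual data:

    db.characters.update(
        {_id: ObjectId("…"), "attributes.name": "hairColor"},   // "$" remembers the matched array element
        {$addToSet: {"attributes.$.val": "plaid"}}               // $addToSet skips values already present
    )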
[01:13:13] <appledash> It does not seem to be working.
[01:13:15] <GothAlice> Or adding to a list if you aren't sure if the record *has* that named property or not… or aren't sure it was an array to begin with…
[01:13:46] <appledash> Updating Gyro to have a hairColor of ["purple", "pink"] and then running the query with gender: female and hairColor: pink returns Gyro followed by AppleDash
[01:20:22] <appledash> Updating Gyro to have a hairColor of ["purple", "pink"] and then running the query with gender: male and hairColor: pink returns Gyro followed by AppleDash
[01:34:17] <appledash> The most useful thing ever would be like... A Sublime Text plugin that lets me run the contents of the current file as a query and see the results
[01:34:29] <appledash> Because that entire query has been written in the line editor of `mongo`
[01:34:51] <Boomtime> an easy test is to use the shell, make a variable of each stage, then add them into aggregation one at a time and ensure the results at each point are what you expect
[01:35:29] <Boomtime> eg. your "unwind" stage would be -> u = { $unwind: "$attributes" }
[01:35:46] <Boomtime> do the same for each stage.. m = { $match... } etc
[01:36:10] <Boomtime> now, in the shell you can do this: db.col.aggregate( m, u )
[01:36:34] <GothAlice> Also: .js files can be executed in the interactive shell. And such an editor plugin would be relatively trivial. (a la echo "$SELECTION" | mongo) And the shell does support continuation, e.g. http://cl.ly/image/3Y003w100U0h
[01:36:46] <Boomtime> then add your second match: db.col.aggregate( m, u, m2 )
[01:37:03] <GothAlice> Boomtime: You're missing brackets? Or is the shell super smart about that?
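Boomtime's stage-by-stage approach, sketched in the shell. Older shells accept the stages as separate arguments as shown above; the bracketed array form below works everywhere. The match criteria are made up for illustration:

    u  = { $unwind: "$attributes" }
    m  = { $match: { "attributes.name": "gender" } }
    m2 = { $match: { "attributes.value": "female" } }
    db.col.aggregate([ m, u ])        // check the intermediate results
    db.col.aggregate([ m, u, m2 ])    // then add the next stage and check again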
[01:54:11] <Boomtime> you need to express those rules, as defined here: http://docs.mongodb.org/manual/reference/operator/aggregation/group/#accumulator-operator
[01:54:45] <appledash> Basically I'm just looking for a way to pull out the name and attributes on each document in that query
[01:54:54] <appledash> without doing a separate query to find that out about each thing
[01:55:05] <appledash> If it's possible I'll figure it out
[01:55:09] <appledash> But if it isn't I want to know now :P
[01:55:17] <Boomtime> you probably just want a single $addToSet
[01:55:48] <Boomtime> but maybe you want some more complicated rules depending on the source documents
[01:57:48] <appledash> I seem to have figured it out
[02:01:56] <appledash> Well, I figured it out but it's not doing what I want. With the way this query is working, I do not think that doing what I want is very possible.
[02:04:01] <Boomtime> construct a sample output that you want and pastebin/gist
[02:04:42] <appledash> Well, it is easier just to explain... Instead of that query returning just _id and score, I want it to return everything, _id, score, name, and attributes. All the attributes.
[02:05:12] <appledash> Basically the same as .find() would, but with the addition of score, and sorted in the proper order and stuff, and only matching the $match of course
[02:05:16] <Boomtime> nope, insufficient, you need to say how the attributes will go together, apparently $addToSet didn't do it
[02:06:43] <appledash> $addToSet only returned those attributes that match the query
[02:26:26] <GothAlice> http://www.javaworld.com/article/2088406/enterprise-java/how-to-screw-up-your-mongodb-schema-design.html This is the primary article I use as a reference for moving from the relational world to MongoDB one. It's a reasonable guide, though doesn't stress the importance of "data locality" and the "amplification effect" as much as I'd like.
[02:27:00] <GothAlice> (I.e. embedding 1:∞ relational data can make sense, but mostly if you always want all of that "related" data when you fetch the record. The latter relates to the question "do you store user IDs against the groups they belong to, or the group IDs against the user?" — usually go for the choice resulting in a smaller set, etc.)
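A small illustration of the "store the smaller set" choice GothAlice describes; the document shapes and field names are assumptions, not from any real schema:

    // Option A: group IDs stored on each user (usually the smaller list)
    { _id: ObjectId("…"), name: "alice", groups: [ObjectId("…"), ObjectId("…")] }
    // Option B: user IDs stored on each group (can grow without bound)
    { _id: ObjectId("…"), name: "staff", members: [ /* potentially thousands of user IDs */ ] }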
[02:27:39] <appledash> I feel like there are things MongoDB is good for and things it is not.
[02:28:18] <GothAlice> appledash: Of course. I would never recommend one stores graph data in MongoDB unless you only ever need, and will only ever need single-hop lookups.
[02:28:49] <appledash> If I want to say make a twitter clone, and store users on it (along with passwords, emails, etc) and then store the tweets those users have made, as well as people they are following, etc
[02:28:54] <appledash> no way in hell I'm using mongo
[02:29:21] <GothAlice> Really? MongoDB solved some of the problems Twitter encountered quite well.
[02:29:30] <GothAlice> (Twitter was originally a full Java CMS, BTW.)
[02:30:27] <GothAlice> For example: Twitter used auto-increment integer IDs. Now they do not, because that can not scale. (They went so far as to write a completely separate service whose only purpose is to generate new IDs, to let them scale that one issue separately from the rest of the app.)
[02:30:28] <appledash> It definitely does not seem like the right choice by any means
[02:30:54] <GothAlice> appledash: I have 25 TiB of data in my personal dataset, and this includes a complete copy of my incoming Twitter feed. :)
[02:31:05] <GothAlice> (And a whole lot more, obv.)
[02:32:38] <GothAlice> Same for any data where transactional safety or referential integrity are mission-critical, such as financials. These I would recommend a truly transaction-safe relational database.
[02:34:57] <GothAlice> However forums, short messaging services, instant messengers, etc. are all quite modelable in MongoDB, including some interesting MongoDB-specific optimizations. (My forums embed thread replies within the thread document itself. I use "caching" references that bundle certain often-used fields with their reference, saving millions of extra lookups to generate most listings. &c.)
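A rough sketch of the forum layout GothAlice describes: replies embedded in the thread document, plus a "caching" reference that bundles often-used fields alongside the ObjectId. All names here are illustrative, not her actual schema:

    {
        _id: ObjectId("…"),
        title: "Example thread",
        replies: [
            {
                author: { _id: ObjectId("…"), display: "GothAlice" },   // cached reference
                message: "First reply.",
                posted: ISODate("2014-12-02T00:00:00Z")
            }
        ]
    }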
[02:35:47] <GothAlice> MongoDB really forces you to think about data differently. :) Odd, certainly!
[02:36:35] <GothAlice> Most people with relational backgrounds I describe "replies stored within the thread record" to, blink a few times then call me crazy. ;)
[02:37:48] <GothAlice> Boomtime: Is there any way to get "progress" information from a long-running update?
[02:38:31] <appledash> Well, I think you are crazy already
[02:39:14] <appledash> I think my use works alright because I'm storing a bunch of one thing and that one thing will be the only thing I'm ever storing and I will only ever store it one way
[02:40:42] <Boomtime> GothAlice: not exactly "progress" information (how do you know ahead of time what will match?)
[02:40:52] <Boomtime> but you can get current information
[02:46:42] <GothAlice> Well, I have a copy of the query and update SON objects passed to the original update()…
[02:47:01] <Boomtime> if you are in full control of the query/command such that you can make every one of them uniquely fingerprintable, then you are OK
[02:49:04] <Boomtime> you would need to know ahead of time how many documents matched
[02:49:21] <Boomtime> which means it would have to run the query twice just to get an _estimate_
[02:49:46] <GothAlice> Run the query, count it. Re-run in the update. The difference caused by the race condition there will be statistically insignificant compared to the size of the full collection.
[02:50:04] <Boomtime> so run the query twice... guess how popular that is?
[02:50:48] <Boomtime> but you care about performance
[02:51:04] <Boomtime> no, you can run the query twice
[02:52:21] <Boomtime> dunno what stats that reports
[02:52:29] <Boomtime> surprised there is no numUpdated
[02:52:46] <Boomtime> suggest an improvement if it's not there
[02:52:56] <GothAlice> This first query of mine updates every record in the collection. Counting first is immaterial. I would also hope that db.collection.count() would have a short-circuit path to a recorded statistic against the collection, rather than needing to iterate an index or documents.
[02:54:03] <Boomtime> .count() short circuits in the best way it can - how many index buckets are assigned for _id index, that is what you want
[02:57:54] <GothAlice> (I'm effectively wanting to be able to report progress information when updating the cached references of mine. I.e. a user changes their display name on the forums, I need to update all comment "author" references to have the new cached value—it's okay if this is wrong for an indeterminate period of time—and this can take some time if the user is very active. A "spinner" while the operation completes is the fallback I'll have to use,
[02:59:40] <GothAlice> Boomtime: All other "progress" tickets are currently marked "closed" and "works as designed". Hope is not strong.
[02:59:41] <Boomtime> a very nasty way to do it, is to poll (every second or so?) using a count() query for the thing that is going away.. since those matches should be depleting
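The polling approach Boomtime mentions, sketched: count the documents that still match the "going away" condition and report how many remain. Collection and field names are placeholders:

    var total = db.comments.count({"author.display": "OldName"});
    var remaining = total;
    while (remaining > 0) {
        remaining = db.comments.count({"author.display": "OldName"});
        print("progress: " + (total - remaining) + "/" + total);
        sleep(1000);   // poll roughly once a second
    }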
[03:05:10] <GothAlice> JIRA goes down for maintenance the moment I start really digging through tickets and try to watch a few of them.
[03:06:18] <GothAlice> 'Cause a read-only mode for maintenance windows is too useful. T_T
[03:09:04] <GothAlice> With a two hour maintenance window, I'll have to continue this tomorrow. Wasn't much hope, though. As mentioned, all the tickets I found so far had been closed "works as intended". (There seems to be a "we can't do this perfectly (give actual percentage progress) so we aren't going to try" stance at play.)
[03:09:26] <cheeser> GothAlice: if they used mongodb that wouldn't be a problem. ;)
[03:11:15] <GothAlice> cheeser: If they used Cloudfront, this wouldn't be a problem. >:|
[03:11:35] <GothAlice> (Read-only cache mode is perfectly acceptable for browsing tickets.)
[08:04:46] <scruz> how do i change this $project step to embedded syntax: {'$project': {'location': '$location_path.City'}}? i've tried {'location': {'$location_name_path': {'City': 1}}}, but i get an error
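One possible reading of what scruz is after (a guess from the question alone, untested against their data): $project can emit a nested document by nesting the expression, e.g.

    db.col.aggregate([
        { $project: { location: { City: "$location_path.City" } } }
    ])
    // yields documents shaped like { _id: …, location: { City: "…" } }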
[10:03:50] <shoshy> hey, i have a collection with documents holding a "created" date. Each such document might have dups with the same group_id (different _id of course though). How can i update all the LATEST documents with distinct group_id to the current date? The distinct command just returns an array...
[10:05:43] <joannac> shoshy: aggregation framework can get you the documents; you can then go through and update them
[10:07:06] <shoshy> joannac: ok, thanks! so it's the same as just saying .distinct() and then iterating over them
[10:07:50] <shoshy> right? there's no single command i can make to update the ones who have the latest "created" property and distinct "group_id" property
[10:08:06] <joannac> shoshy: I'm not sure how distinct would get you only the latest ones?
[10:09:01] <shoshy> joannac: it won't, but it'll get me all the documents distinct by "group_id" and then i'll run sort(-1)
[10:09:24] <shoshy> but maybe it's the wrong approach... :/
[10:09:54] <joannac> you can, you're just going to have to do a lot of queries
[10:10:07] <joannac> one for the distinct, then one for each group_id
[10:25:13] <joannac> I would sort by created:-1, and then group on group_id, and then use $first to grab the first _id field (which will be the _id of the document with the latest date for "created")
[10:27:43] <shoshy> joannac: could you please help me a bit with it? i'm new to it... thanks! i'll try your suggestion on my own, but still
[10:29:28] <shoshy> i can't put sort before / after aggregate
[10:29:28] <joannac> shoshy: this is what I was coding with my docs. change field names appropriately
[10:31:13] <shoshy> joannac: thanks a lot! i changed to : "db.groups.aggregate([{$sort: {created: -1}}, {$group:{_id: "$group_id", "id2": {$first: "$created"}}}])" and how do i add the real mongo db _id to each group? as adding a field must add an accumulator
[10:31:43] <shoshy> and adding $push will add all the _ids per group
[10:32:40] <joannac> the $first: "$_id" is the part that makes it keep the _id of the document
[10:34:22] <shoshy> ahhh... so i only need to change it to db.groups.aggregate([{$sort: {created: -1}}, {$group:{_id: "$group_id", "id2": {$first: "$_id"}}}])
[10:34:45] <shoshy> so you sort it by 'created', then group it by 'group_id' and take the 1st document per group
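One way to finish what shoshy describes: take the _id kept by $first for each group and update those documents to the current date. A minimal sketch; the field names follow the discussion above:

    db.groups.aggregate([
        { $sort: { created: -1 } },
        { $group: { _id: "$group_id", latest: { $first: "$_id" } } }
    ]).forEach(function (g) {
        db.groups.update({ _id: g.latest }, { $set: { created: new Date() } });
    });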
[11:06:19] <geo> Hi. Recently I switched to mongodb version 2.8 rc1. I want to take advantage of collection-level locking. I'm using a standalone server (no replica set). The server itself is quite powerful: 40 cores, 250 GB RAM. But I encounter some issues after it's been running for a few days.. The load on the server increases too much.
[11:06:52] <geo> I have a few updates, mostly only read operations
[11:07:42] <geo> The DB is quite big, over 40 M docs
[11:31:35] <remonvv> Hi all. I'm seeing some strange behaviour for fire-and-forget (w=0) writes. It looks like once some buffer hits a limit (I assume it's not writing the data as fast as it is being delivered) the mongod seems to freeze for a while before continuing normal operation rather than slowing down/throttling. Does this sound familiar to anyone?
[11:53:14] <joannac> geo: coming from mongodb? what do the logs say?
[11:58:11] <geo> joannac: Well, logs just say that some requests take a lot of time to complete. It can take more than 30 sec, so there can be lots of timeouts
[13:30:44] <drecute> when I get the error "exception:CSV file ends while inside quoted field" during mongoimport, how do I know the exact row that's failing?
[13:31:50] <geo> joannac: I temporarily removed count operations, and now all queries run fast. Why is count so slow?
[14:27:14] <GothAlice> geo: Just catching up on the backlog, but generally counting is slow for the same reason .skip() with very large values is slow: if there's an index involved in your query, it'll have to iterate the entire index contents before giving you a count (or to skip ahead it'll have to iterate the number you are skipping). It can't simply say: there's 12,000 records, it has to work it out.
[14:27:50] <GothAlice> Whereas normal querying will start streaming records to you as soon as it's found a match (basically), and will continue to "buffer" additional matches while your application chews on the ones found so far.
[14:30:04] <GothAlice> (And if there isn't an index involved, it'll actually have to walk through the records on-disk—all of them—before returning a count. That is understandably crazy slow.)
[15:16:54] <remonvv> GothAlice: You wouldn't believe the amount of count performance related discussions this channel has seen ;)
[15:17:25] <GothAlice> Most databases implement index b-trees similarly, thus will have similar penalties applied to count() and skip() operations…
[15:17:33] <GothAlice> MySQL exhibits the same O(n) behaviour.
[15:18:37] <remonvv> Well, although that's true there were a couple of issues with MongoDB's implementation.
[15:19:09] <nimomo> Hi, I want to query my collection. How can I get only one item from the document? the item is in inner array. what should I write instead of {item:1} in the projection part?
[15:19:21] <remonvv> And some databases allow configuration of "counted" b-trees that have leaf counts on each node for quick counts.
[15:19:34] <GothAlice> nimomo: Are you looking for a _particular_ array element?
[15:20:06] <GothAlice> nimomo: Then I'll need to know the actual question you are trying to ask that data before I can assist in formulating a query to match that question.
[15:20:07] <nimomo> I have the fields data->bla->item
[15:20:15] <remonvv> pagination through skip/limit is not a great implementation for larger sets anyway. If it aint O(1)-O(log n) it aint scalin'!
[15:20:55] <nimomo> I want to retrieve only the item
[15:21:02] <remonvv> nimomo: You cannot do what I think you're asking. A query always returns the root document (albeit potentially filtered).
[15:21:22] <GothAlice> remonvv: (My MongoDB CMS can rank and present every Asset in the entire CMS on the search page if you search for nothing, and it takes ~200ms to generate. 10 years of City Hall data. ;)
[15:21:32] <nimomo> but I can return by projection, not?
[15:21:35] <remonvv> You can remove fields from the resultset documents but that is the extent of your options. If you need what you need you should look at schema refactoring.
[15:22:21] <GothAlice> remonvv: $elemMatch and $ projection, $slice, or a variety of other means allow you to return subsets of nested arrays.
[15:22:33] <remonvv> He's trying to get an element from an embedded array as the pure result of a query.
[15:22:34] <nimomo> GothAlice: I have an email field within data, which is a child of the root. I want to return only the email field... (without the other fields)
[15:22:43] <remonvv> Ignoring AF for the moment, no you can't
[15:23:03] <remonvv> Even $slice will not change the result document structure
[15:23:21] <nimomo> GothAlice: something like - db.mycollection.find({"data.email": 1})
[15:23:23] <remonvv> It will simply remove data from the result documents
[15:23:53] <remonvv> nimomo: You need to phrase your question more clearly, preferably by putting a document and your intended result in a pastie or something,.
[15:23:55] <GothAlice> nimomo: db.example.find({}, {"data.email": 1}) — should actually work. It'll return the e-mail field from all array members, though, which didn't sound like what you want.
[15:24:14] <remonvv> But what (I think) you want isn't possible.
[15:24:42] <remonvv> given {a:[1,2,3]} you can return {a:[1]} but not 1
[15:25:08] <nimomo> GothAlice: it's exactly what I want.. is there any way to show only email fieldname (without data fieldname)
[15:25:15] <remonvv> And the {a:[1]} only for specific schemas
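The projection operators GothAlice refers to, sketched against a toy document {a: [1, 2, 3]}; as remonvv notes, they trim the array but never change the shape of the returned document:

    db.example.find({ a: 2 }, { "a.$": 1 })        // only the first element matching the query
    db.example.find({}, { a: { $slice: 2 } })      // only the first two elements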
[15:25:37] <GothAlice> nimomo: I can't parse your question.
[15:25:52] <GothAlice> nimomo: Ah, I think I might. No. That's a minor detail that can be automatically skipped over by appropriate abstraction. (A la the .scalar() thing I wrote for MongoEngine which will "unpack" single fields into tuples.)
[15:25:53] <remonvv> nimomo: What you want is not possible with the native query language.
[15:27:14] <GothAlice> Using MongoEngine if you write: Collection.objects.scalar('data.email') you'll get back an iterable of those literal values. (Because behind-the-scenes the "cursor" wrapper tracks which fields you are asking for and then unpacks the deeply nested structure into a flat one before returning the results on each iteration.)
[15:29:02] <remonvv> Yeah it's a problem best solved in higher level ORM/ODM. Mine has similar functionality although it comes with warnings.
[15:30:40] <GothAlice> remonvv: I wrote mine to ensure all normal cursor methods continued to work. (I.e. you could continue to .skip(), .limit(), .all() to get the results as a concrete list, .explain(), etc.)
[15:31:27] <remonvv> Same, it's a cursor extension on mine as well but I'm not sure if it's a very good idea.
[15:32:21] <GothAlice> Do you ever want just a list of _ids matching a query? If you've _ever_ needed to do that, anything worth doing once is worth automating. ;)
[15:34:25] <remonvv> Oh there's a ton of use cases for it. I'm just not sure if my/our implementation is necessarily more clear than c = find({..}, {_id:1}).cursor() -> while(c) { myId = c.next().id }
[15:34:27] <GothAlice> Removing boilerplate like: Collection.objects(creator__in=[i._id for i in User.objects(role='admin').only('_id')]) (list comprehension… everywhere you would need to do that) vs. Collection.objects(creator__in=User.objects(role='admin').scalar('_id')) — much nicer. (Runs .only() first to limit the returned fields, ofc.)
[15:35:15] <GothAlice> That inline comprehension is gibberish to all but the most expert eye in the language. I, for example, can't recognize that from modem line noise. :| (What language?)
[15:35:16] <remonvv> We've had a number of discussions about it and there are some valid arguments against making heavy lifting look like lightweight stuff.
[15:35:44] <GothAlice> You project only the fields you want, you then iterate the list of desired fields and unpack them into an array on each .next() iteration.
[15:40:22] <remonvv> I'm not sure a cursor (based) convenience method should affect the query. But yes, so that was basically the discussion.
[15:41:51] <remonvv> In our case anyway; either do "black magic" and affect the query being executed or scope it on cursor completely (so you basically get anyCursor.scalar(yourfield) which abstracts it away from query/projection
[15:42:27] <remonvv> But you know, sometimes pragmatic > clean
[15:42:35] <GothAlice> Used in the next() iterator here: https://github.com/MongoEngine/mongoengine/blob/master/mongoengine/queryset/base.py#L1330-L1345 (also used in .bulk() and __getitem__ skip/limit slicing). The "heavy lifting" is this: https://github.com/MongoEngine/mongoengine/blob/master/mongoengine/queryset/base.py#L1616-L1628
[15:43:00] <GothAlice> I like pragmatism. And a dozen or so lines of code to support it isn't terrible at all. ;)
[15:45:16] <remonvv> Well it's a cursor scoped method that affects the underlying query isn't it? Clean in terms of correct isolation and such. Something being clean is a highly subjective discussion anyway ;)
[15:46:17] <remonvv> I'd prefer db.col.find({...}, {field:1}).scalar("field") over db.col.find({...}).scalar("field") with implied field projection is what I'm saying
[15:47:38] <GothAlice> https://github.com/MongoEngine/mongoengine/blob/master/mongoengine/queryset/base.py#L942 — pretty explicitly documented that that is exactly what it does. ;)
[15:48:39] <GothAlice> And writing the field twice is, IMHO, utterly hideous. The point of abstractions is to reduce work, not give you more.
[15:49:47] <remonvv> It's two different things. That's what I meant with the obfuscation argument. One tells the database what to do, the other is convenience to remove boiler plate. Making the latter decide on the former is something that people could have a discussion about ;)
[15:50:22] <remonvv> We did, we landed on the former. It's "cleaner" in that it allows for better isolation so we can do cool things like :
[15:51:19] <remonvv> loads the entire resultset to warm up the cache and returns the ids that were used to warm it up, in one go.
[15:51:38] <remonvv> But anyway, opinionated libraries are good.
[15:52:54] <remonvv> cache() returning a cursor wrapper that warms entity caches with whatever results are pulled through the encapsulated cursor by the way
[15:53:09] <GothAlice> ¬_¬ MongoEngine features caching by default.
[15:53:49] <GothAlice> Why "ouch", if you have full control over it?
[15:54:07] <remonvv> How do you invalidate that cache over multiple instances?
[15:54:09] <GothAlice> Same with auto-dereferencing up to X levels deep. (By default, 1.)
[15:55:25] <GothAlice> https://github.com/MongoEngine/mongoengine/blob/master/mongoengine/queryset/queryset.py — all of the caching (or not caching) "magic" is here.
[15:55:43] <GothAlice> Similar .cache() with an opposing .no_cache() method.
[15:56:36] <GothAlice> The cache is local to the query, used to substantially improve performance of certain standard tools. list(query) runs len(query) frequently, so that's cached rather than running count() each time, etc.
[15:57:33] <remonvv> Yes but then you'd have to always cache your query results
[15:57:50] <remonvv> so len(query) == count(query)
[15:58:14] <GothAlice> Uhm… it's an either / or situation. If you enable caching, you get caching. If you disable it, it's disabled.
[15:59:21] <GothAlice> (I keep it disabled most of the time. No point caching throwaway queries, only ones that will be repeatedly iterated.)
[16:05:33] <remonvv> Right, but do you feel it's immediately obvious to developers that when they use cache() they're implicitly storing an entire result set in memory?
[16:06:37] <remonvv> The alternative being, if you want to do that just grab the resultset and iterate over it explicitly, it's one or two lines of extra code for an edge case.
[16:06:51] <GothAlice> This is typical behaviour for Python ODMs. (All of the ORMs and ODMs I have ever used behave this way, and it's not quite "all of it in memory at once", it's buffered in chunks to allow for slicing previously iterated records and such.)
[16:07:20] <cancancu> Hey guys, I have a quick question. What kind of data, MongoDB is good for storing ?
[16:07:37] <GothAlice> cancancu: All data except graphs and data where relational integrity or transactional safety are utter requirements.
[16:07:59] <cancancu> For example, Cassandra is good for write-heavy workload. Is there any usecase which is optimized using MongoDB than Cassandra
[16:08:26] <GothAlice> cancancu: Optimization without measurement is by definition premature. One would have to evaluate that on a use case by use case basis.
[16:08:31] <remonvv> cancancu: You're oversimplifying it a bit ;)
[16:09:16] <cancancu> GothAlice: Would it be fair to say that MongoDB is good for storing blobs than Cassandra ? If so, what is the main reason ?
[16:10:10] <GothAlice> cancancu: For example, in a rather write-heavy (and write-amplifying, i.e. one write can trigger a dozen more writes) use case, I run distributed RPC using MongoDB as a message bus / queue. I last benchmarked 1.9 million dRPC round-trips (submit task, pull, execute, submit response, all with locking) per second per host.
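Not GothAlice's actual code, just a common pattern for the "MongoDB as a message queue" idea she describes: a worker atomically claims one pending task with findAndModify, so two workers can't grab the same document:

    var task = db.tasks.findAndModify({
        query:  { state: "pending" },
        sort:   { _id: 1 },
        update: { $set: { state: "claimed", worker: "host-1" } },
        new:    true
    });
    // execute the task, then write the response back and mark it complete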
[16:10:52] <GothAlice> cancancu: For the BLOB aspect, I currently have 25 TiB of BLOB data in MongoDB using GridFS. It's designed for it.
[16:11:07] <cancancu> Well, I was looking on the web and saw that if you need heavy writes, choose Cassandra because its storage engine is optimized for write performance. I was wondering if there is any specific case which motivates using MongoDB
[16:11:25] <remonvv> cancancu: Good in absolute terms, or better than similar NoSQL storage solutions? You can't say MongoDB is above-average at certain workloads if you don't specify what you're comparing it against.
[16:11:38] <remonvv> cancancu: Did you read on WHY Cassandra is good for write heavy loads?
[16:12:06] <cancancu> remonvv: Yup, I know why Cassandra is optimized for write performance
[16:13:29] <remonvv> cancancu: Well it's optimized for that because that was a requirement of the company that built it. The question to ask yourself is how did they manage near linear scalability for writes and what are the consequences (nothing comes for free).
[16:13:44] <cancancu> Someone also wrote that Cassandra is good for storing structured data and MongoDB is good for storing blob data, without reasoning why. Does it even make sense?
[16:15:24] <GothAlice> remonvv: http://irclogger.com/.mongodb/2014-11-19#1416454077-1416452929 — "what is MongoDB good for" is a common question, so one time I answered it thoroughly and now can link to it. (Need to bookmark so I can find it back faster in the future!)
[16:16:59] <cheeser> there's a social networking reference architecture on the website. ;)
[16:17:11] <GothAlice> Indeed. All things are possible.
[16:17:39] <GothAlice> And if you only need single-hop lookups, it's a-ok. As soon as you need "11th degree" relationships and things like that, you're borked unless you're using a real graph DB.
[16:17:43] <remonvv> Is that true? Any use case that needs availability is pretty much one where you shouldn't use MongoDB
[16:19:32] <GothAlice> Across the entire cluster, though, it's 100%. In that near 1000 day period the DB was at no point inaccessible.
[16:19:50] <GothAlice> (Only node downtime is for process upgrades.)
[16:19:56] <remonvv> I'm not sure what you're saying. You're claiming MongoDB is full CAP?
[16:20:13] <cancancu> GothAlice: Thanks for the link
[16:20:50] <cancancu> remonvv: MongoDB is master/slave. Though there is the concept of a secondary master, it's still a single point of failure
[16:20:56] <remonvv> If you pull out certain instance MongoDB stops being fully available (as per its design). I don't get what else there is to it.
[16:20:59] <GothAlice> remonvv: I'm arguing that the proof is in the pudding: there is no reason why MongoDB would be any more unstable than any other long-running daemon.
[16:21:55] <GothAlice> Stops being fully available? Not my cluster. You pull a primary, the cluster elects a new one (<10ms in testing), all connections are reset, the clients with in-progress queries will reconnect and retry them, and you're done.
[16:22:08] <remonvv> cancancu: It has replication, yes. Its scaling strategy is based on sharding (splitting your data into smaller chunks, distributing those chunks across multiple instances and making the client aware of where each chunk lives)
[16:22:22] <GothAlice> "Effectively zero downtime" even though yeah, connections had to be reconnected and queries re-run.
[16:22:27] <cheeser> i wouldn't call mongodb SPOF at all.
[16:22:54] <cancancu> I am rather trying to find a specific usecase which will be more optimized with MongoDB
[16:22:56] <remonvv> You are confusing uptime with availability (again, as per CAP)
[16:23:40] <GothAlice> remonvv: If at no point is it "unavailable" while also being "up", uptime and availability _are_ the same.
[16:24:15] <remonvv> Uhm, that's like saying "if my car has never broken down before, it's indestructible"
[16:24:34] <GothAlice> remonvv: I can't predict the future, only give you stats based on my last 1000 days of experience on this particular cluster. ;)
[16:25:03] <cancancu> cheeser: you wouldn't call MongoDB a single point of failure, but you are also not discussing the pain in case of master failure :)
[16:25:15] <GothAlice> cancancu: That's the thing; there is zero pain.
[16:25:31] <GothAlice> A primary goes away, the secondaries figure out who had the latest data, elect that member, and the cluster just keeps going.
[16:25:40] <cheeser> mongodb hasn't used "master" for several years. mongo uses primary/secondary setups now.
[16:28:51] <remonvv> In CAP you get to pick two out of three letters. Mongo is CP. It isn't A. A stands for availability. Split brain, elections and I'm sure a few other situations make it so you cannot read or write certain data.
[16:29:35] <GothAlice> remonvv: Sure. In 8+ years of using MongoDB none of these situations has ever occurred.
[16:29:42] <Derick> remonvv: I've always found that the borders between P and A can be rather vague ;-)
[16:29:46] <cancancu> remonvv: MongoDB is CP :) which means in case of a partition it always returns consistent data and sacrifices A
[16:30:52] <remonvv> Derick: I would call MongoDB comfortably C, in theory P and not A ;)
[16:31:53] <cancancu> remonvv: It's not the case with MongoDB. All other databases have Master/Slave or Primary/Secondary architecture
[16:32:30] <cheeser> cancancu: mongodb has primary/secondary. i'm not sure what you're trying to say.
[16:32:49] <remonvv> cancancu: No, there are databases for each of the three possible CAP combinations. MongoDB uses replication (per shard), and sharding for scaling.
[16:33:13] <remonvv> Found it : http://i.stack.imgur.com/a9hMn.png
[16:33:30] <cancancu> Guys :( I am trying to find a usecase where MongoDB is more suited than other NoSQL databases :(
[16:33:42] <remonvv> That almost sounds like homework
[16:34:03] <cheeser> "nosql" is a pretty meaningless term.
[16:34:11] <remonvv> MongoDB is a general purpose document storage. It's good, it's easy to get started with (biggest plus in my book) and it's mature.
[16:34:29] <remonvv> But you can't compare it with Cassandra unless you add a use case.
[16:35:31] <remonvv> That wasn't a great analogy but you get my drift.
[16:35:37] <cancancu> Yes it's my homework :( Cassandra is good for write-heavy workloads. What is MongoDB good at :(
[16:35:39] <GothAlice> Right, so kill -9'ing a few processes gets me: three events at 2014-12-02T11:31:40.534-0500 relating to loss of communication with the primary, 2014-12-02T11:31:41.901-0500 marking the primary as DOWN, 2014-12-02T11:31:43.012-0500 a new primary is confirmed as elected by the rsMgr. Nowhere near 30 seconds.
[16:35:41] <remonvv> Really? I only read about the one that's likely to become the default.
[16:35:42] <cancancu> The answer I get is everything
[16:36:16] <cheeser> i've seen reports of 30s elections, too.
[16:36:23] <remonvv> Load doesn't really affect elections. Size of set, network availability/performance, etc.
[16:36:29] <cheeser> though I think that shouldn't happen (as much?) in 2.8
[16:36:53] <remonvv> We've had EC2 instances that stopped seeing each other
[16:36:59] <GothAlice> In the three real-world cases where failover has happened in my cluster, election back to read/write took < 5s each time.
[16:37:03] <remonvv> Which means no majority -> no election
[16:37:28] <remonvv> That makes you lucky and well prepared, not permanently available ;)
[16:37:45] <GothAlice> remonvv: Again, I can't predict the future, only give you information about the cluster's performance to date.
[16:37:46] <remonvv> But anyway, this is an apples and oranges discussion. Theory versus reality.
[16:38:45] <remonvv> I understand, availability in this context refers to "Can every client always write and read the data it wants" (given a proper cluster topology)
[16:39:03] <remonvv> You can remove certain minority of instances from a healthy mongodb cluster that make that not the case.
[16:39:26] <remonvv> And if you consider cluster metadata as part of the "data" being referred to here you can even argue that config servers are SPoF
[16:39:39] <GothAlice> remonvv: That's why this cluster has three.
[16:39:53] <remonvv> A cluster goes to metadata read-only if any of those three fail.
[16:40:21] <remonvv> But that's nitpicking, I wouldn't consider that a real-world issue in all but a few cases.
[16:40:53] <GothAlice> remonvv: http://docs.mongodb.org/manual/core/sharded-cluster-config-servers/#sharding-config-server — see the warning box.
[16:41:05] <remonvv> GothAlice: I'm pretty familiar with how it works ;)
[16:41:26] <GothAlice> remonvv: It seems there are multiple definitions of "single point of failure" being thrown around, even in these docs.
[16:42:16] <remonvv> Are there? there's not much room for different interpretations.
[16:43:32] <remonvv> I wonder if cancancu is still paying attention :p
[16:44:43] <cancancu> Let me read all the messages above and repeat my question: What is MongoDB good for :P
[16:44:52] <remonvv> cancancu: That being said though, you need to figure out what it is you want to know. MongoDB doesn't have a very specific sweetspot for particular use cases or data types. There are things it can do and things it cannot do or cannot do well/efficiently.
[16:45:25] <remonvv> cancancu: If you do not know what to pick as a database tech, then pick MongoDB. It's easy to get going, it's general purpose and it scales up easily.
[16:45:28] <santib> hey folks quick question, is there any way to prevent the deletion of a document that has references in other documents
[16:45:58] <GothAlice> santib: So, you'll have to take the opposite approach. Handle the case where the reference doesn't exist when you try to read it.
[16:46:17] <GothAlice> Often one does not need it.
[16:46:22] <cancancu> remonvv: Your answer is highly appreciated :) That's what I was looking for. However, I would really like tto know that what are the things it can't do efficiently ?
[16:46:24] <remonvv> Let us all hope and pray it stays that way
[16:46:46] <GothAlice> http://www.javaworld.com/article/2088406/enterprise-java/how-to-screw-up-your-mongodb-schema-design.html is a rough overview of how to convert your thinking from relational to document.
[16:47:04] <geo_> GothAlice: thx for reply, yes seems that there is no magic method to make count faster. For now I will try to avoid using it directly.
[16:47:35] <GothAlice> I.e. I store all replies to a discussion thread within that thread, rather than having a separate collection and then needing to "join" them. MongoDB lets me query (and update) exact comments, "paginate" them, get first/last, etc., all embedded inside the parent document.
[16:47:44] <remonvv> cancancu: It's hard (and sometimes impossible) to maintain relational integrity, it's not very good at graph-like schemas, currently its storage engine is mmap based which can cause various performance problems, and it's not very efficient with disk space
[16:48:00] <GothAlice> Fragmentation of the on-disk stripes…
[16:48:23] <GothAlice> And whatever you do, don't try to run mongod on btrfs unless you are a wizzard, 'arry.
[16:49:45] <GothAlice> geo_: There's a reason MongoEngine likes to cache the count() of a result so heavily. ;)
[16:59:53] <remonvv> geo_: You can maintain your own counters in some cases. Why do you need to count a large set of results?
[17:02:55] <remonvv> geo_: Also, not all count()s are created equal. There are some optimizations in the b-tree walk that make certain counts (e.g. low cardinality) faster iirc.
[17:03:42] <GothAlice> Also full-collection counts which cheat and use the _id index bucket count, apparently.
[17:04:22] <Derick> cheat? :-) It's an optimisation! :P
[17:07:13] <remonvv> GothAlice: Atomic counter for what?
[17:07:24] <GothAlice> "Number of living documents in the collection."
[17:08:11] <remonvv> Like maintain an in-memory counter for each collection?
[17:08:30] <GothAlice> .insert would $inc for each record inserted, .remove would $inc by a negative amount for each record removed. In both cases the operation could simply keep a counter and when done perform a single atomic update of the counter. In-memory, on-disk, it'd likely just be a field in a metadata system collection somewhere.
[17:08:47] <GothAlice> s/keep a counter/keep a local counter
[17:08:50] <remonvv> If it has to go to disk you just sliced insert/remove performance in two though.
[17:10:01] <remonvv> Derick: Does it need to? No option to add counts to b-tree nodes or something?
[17:10:19] <GothAlice> Derick: Indeed; this is even quite rollable client-side, albeit with a race condition. (nInserted, nRemoved, etc. results of the operation passed to a secondary query to update the document count metadata.)
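A minimal client-side sketch of that idea (names are illustrative): apply the write, then $inc a metadata document by the change the WriteResult reports:

    var res = db.things.remove({ expired: true });
    db.counters.update(
        { _id: "things" },                   // one counter document per collection
        { $inc: { alive: -res.nRemoved } },  // race-prone, but eventually close enough
        { upsert: true }
    );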
[17:10:27] <Derick> I am not really sure whether WT *uses* btrees
[17:10:44] <remonvv> You can only either make it slow or eventually consistent (separate counters, that is)
[17:10:52] <Derick> remonvv: it would change the on-disk format, not?
[17:11:18] <remonvv> Derick: Of the indexes, yes. But it has to be an index flag anyway so it's backwards compatible.
[17:11:24] <GothAlice> Eventually consistent would be more than adequate, and a step up from the current "pseudo-random" behaviour in larger clusters described above. ;)
[17:11:26] <remonvv> And the index data has a version field
[17:11:44] <remonvv> GothAlice: Agree that it is better than it is now, but counting your entire collection is relatively rare.
[17:11:51] <Derick> sure - I don't think it is something they change lightly though.
[17:12:17] <GothAlice> You'd even be able to estimate how off (and in what direction it's off) by looking at currentOp; sum(inserts) - sum(removals) would give you the bias.
[17:12:25] <remonvv> Derick: I like WT already, regardless of count performance being improved.
[17:17:11] <remonvv> Derick: Hm, pretty sure you could add compression between the raw read/writes but anyway I'm glad it's there now.
[17:58:20] <rfv> hey guys, one of my collections has documents in it, but when trying to fetch them, it doesn't return anything (ie. count() returns something > 0, but findOne() returns null)
[18:27:34] <PirosB3> rfv: rebuild indexes and try again
[18:31:20] <cers> I'm very new to mongodb, and apparently there's something I haven't quite grasped when it comes to the find function. I have entries that might look something like {type: "exoplanet", dimensions: {diameter: {value: 1000, unit: 'km'}}}, and I want to find any entry where type contains the word 'planet' and dimensions.diameter is between two values x and y. I should note that dimensions.diameter might not exist.
[18:31:34] <cers> How would I go about writing such a find function?
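A hedged attempt at the query cers describes, assuming the number lives at dimensions.diameter.value and the collection is called entries (both assumptions): a regex for the type plus a range on the nested field. Documents lacking the field simply won't match the range condition.

    db.entries.find({
        type: /planet/,
        "dimensions.diameter.value": { $gte: 100, $lte: 5000 }   // x and y go here
    })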
[18:31:50] <rfv> PirosB3: I just did that, but it didn't do anything ... I have a collection with only one document, count() returns 1, findOne() returns null :|
[19:07:03] <styles> https://privatepaste.com/67e205789e I'm trying to get this query working, the matching doesn't seem to work
[19:07:09] <styles> It returns 0 results every time
[21:41:45] <kajsa_a> Question: does anyone know why mongodump with --query and mongo shell with the same query would report different numbers of records?
[21:45:48] <GothAlice> kajsa_a: Are you getting this number from the output of .count()?
[21:45:54] <GothAlice> (Or the actual dump of records?)
[21:46:31] <kajsa_a> GothAlice: I'm comparing the number from the dump output to the number from db.find(<query>).count()
[21:46:43] <kajsa_a> GothAlice: count from shell is higher
[21:48:48] <GothAlice> kajsa_a: Depends on your cluster setup (replication lag, sharding aggregate estimation of counts, etc.) but it can vary even from call to call and connection to connection (i.e. even if you ask for the count at the same moment across multiple connections, each might get a different answer.)
[21:48:49] <Boomtime> tskaggs: is this on a sharded cluster?
[21:51:22] <GothAlice> "Cluster" is a non-descriptive term for any group of systems working together on something. ;)
[21:51:56] <GothAlice> tskaggs: My earlier question still applies, though. What's the issue you're encountering?
[21:54:29] <tskaggs> It takes forever to do a findAll.. then reformat... but the big issue is when I pass the id and whole object to update(id, newobject){}; it just doesn't update and has a $set error
[21:54:50] <tskaggs> GothAlice ^ and here's my controller http://hastebin.com/polelolumi.lua
[21:55:52] <GothAlice> tskaggs: Which driver abstraction are you using?
[21:57:16] <tskaggs> GothAlice sorry new to mongo what do you mean?
[21:57:39] <GothAlice> tskaggs: Which library are you using as your database connection layer? I.e. in Python that'd be something like pymongo, MongoEngine, etc.
[22:00:26] <GothAlice> sails-mongo is, well, extremely pre-1.0.
[22:00:58] <GothAlice> https://github.com/balderdashy/sails-mongo/blob/master/lib/collection.js#L196 — it does not look like the first argument can be a random ObjectId, it looks like it needs to be {_id: ObjectId(…)}.
[22:01:24] <GothAlice> It basically has no tests or integrated documentation, though, so I can't actually confirm if this is the case or not… at all. :/
[22:02:23] <tskaggs> GothAlice Yea it's not random. in the hastebin I do a find() to return all items, then an each to return each individual item, reformat it, then pass the proper ID and object to the update()
[22:03:05] <tskaggs> I wasn't sure how to update Mongo quicker than that since I assume other people have dealt with 100k+ item collections.. :/
[22:03:50] <GothAlice> tskaggs: The only way to get acceptable performance on large-scale updates is by using MongoDB's atomic update operations.
[22:04:13] <GothAlice> http://docs.mongodb.org/manual/reference/operator/update/ — these things. (I.e. $inc, $push, etc.)
[22:06:03] <GothAlice> Multi-updates are a bit "riskier", but there are ways of handling that, too. (I.e. two-phase commits.)
[22:06:16] <Boomtime> to be clear: a "multi document update" is applied as a series of atomic single document updates
[22:06:48] <GothAlice> Failure part-way through will result in partial application of the changes to the documents. (I.e. some will have been updated, others not.)
[22:07:58] <GothAlice> In your case you are $set'ing a new structure, so you can resume quite easily from where you left off by also querying on that field not existing. (I.e. only find records that haven't been updated yet.)
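A minimal sketch of that resumable multi-update, with placeholder field names: only touch documents that don't yet have the new structure, so a run that fails part-way can simply be re-run.

    db.items.update(
        { newFormat: { $exists: false } },                            // skip already-converted docs
        { $set: { newFormat: true /* plus the real new fields */ } },
        { multi: true }
    );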
[22:10:57] <kajsa_a> so true - which is why my count issue concerned me :)
[22:14:03] <GothAlice> Speaking of count(), it'd be frickin' awesome if any operation that can return an inaccurate number also includes its confidence / estimated standard deviation / accuracy.
[22:15:45] <Boomtime> count() in a cluster will always be equal or higher than the actual, it isn't really even an estimate, and no parameters can be placed on exactly how wide of the mark it is
[22:17:01] <GothAlice> Better living through entropy, I say. ;)
[22:17:29] <cheeser> entropy isn't what it used to be, though.
[22:21:52] <GothAlice> Things like that will lead to "Achievement Unlocked: Heisenbug" (a la https://exogen.github.io/nose-achievements/#builtin:heisenbug ;)
[22:35:41] <kajsa_a> GothAlice, I wrote up an answer to the StackOverflow question that someone else had posted on my same issue, to try to make it more searchable for the next person to stumble onto this nuance - http://stackoverflow.com/questions/21776666/mongodump-and-then-remove-not-exact-same-number-of-records/27260279