[01:55:29] <japhar81> im trying to figure out how to store this to quickly get 'all unanswered' for a user, as well as 'all answers' for a user
[01:55:33] <tim_t> how about have a source collection of question objects prepped with the question field, an answer field, and a user field. whenever a user answers one, dupe it to an answers collection with the user and answer fields filled. compare the collections filtering by user to see which ones they did not answer?
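A minimal shell sketch of the two-collection idea tim_t describes (collection and field names are illustrative, not from the channel; userId/questionId are placeholders):

    // one document per (user, question) pair once answered
    db.answers.insert({userId: userId, questionId: questionId, answer: "42"})

    // 'all answers' for a user
    db.answers.find({userId: userId})

    // 'all unanswered' for a user: everything not in their answered set
    var answered = db.answers.find({userId: userId}, {questionId: 1})
                             .map(function (a) { return a.questionId; });
    db.questions.find({_id: {$nin: answered}})   // this is the unbounded list japhar81 worries about below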
[01:55:39] <japhar81> i went down the road of question IDs saved per-user
[01:55:43] <japhar81> but that's an unbounded list too
[02:04:30] <Boomtime> and you need to find which of those questions a particular user has not answered?
[02:05:04] <japhar81> i need all Question records that have not been answered
[02:05:11] <japhar81> and all answered questions with answers
[02:06:14] <tim_t> but you want to look at a user, pick their 100 questions out of a pool of possibly zillions in the future, and see which of those 100 are un/answered?
[02:06:29] <tim_t> which 100 questions out of the pool
[02:12:26] <Boomtime> it all sounds pointless, are you holding something to ransom? what makes you think somebody is going to answer 100 questions let alone 50,000?
[02:13:01] <Boomtime> (this is academic of course, we can just talk about how you'd do this exercise only if you like)
[02:13:11] <tim_t> they might if it is incentivised
[02:13:42] <tim_t> like a game-show type thing… gambling perhaps
[02:13:44] <japhar81> its a dumbed-down example, every one you answer in a group pre-populates hundreds of others with a best-effort answer
[02:13:55] <japhar81> and you can go tweak at-will
[02:14:10] <japhar81> but no one wants to talk about multivariant matching i assume, so i simplified it :)
[02:15:24] <tim_t> why delete instead of say deactivate and keep using a reference?
[02:21:20] <tim_t> do you still want to be able to later on see what question the answer answers?
[02:34:00] <chetandhembre> What is the best practice for inserting 100 documents (objects) in bulk? I am using BulkWriteOperation (in Java), is that right? Or should I use the insert(docs) function?
[02:35:21] <cheeser> are you having any problems with it?
[02:36:38] <chetandhembre> yeah .. so many connections open on the server ?
[02:36:55] <chetandhembre> and eventually making mongodb slow
[02:37:49] <Boomtime> so you have a problem: mongodb slows down (or something like that)
[02:38:01] <Boomtime> and you have evidence: lots of connections
[02:38:10] <chetandhembre> I am not concerned with write guarantees .. but it should not make my mongodb server slow
[02:38:40] <Boomtime> are you reading from it at the same time as you write?
[02:38:43] <chetandhembre> Boomtime: yeah .. I am writing on primary and reading from secondary
[02:39:06] <chetandhembre> sorry *yeah* was for your earlier message :p
[02:39:34] <Boomtime> if you are reading from a secondary then your insert op is irrelevant - the secondary always sees them as distinct write operations
[02:40:09] <Boomtime> the secondary has at least the same write load as the primary
[02:40:26] <Boomtime> for what reason are you reading from the secondary?
[02:41:12] <chetandhembre> i am not concerned with availability or freshness of data so i am happy with eventual consistency
[02:41:45] <chetandhembre> *or maybe i am concerned with availability but not freshness
[02:42:13] <Boomtime> how does reading from a secondary improve availability?
[02:44:41] <chetandhembre> I am reading from the secondary because that way i am reducing the load on my primary .. and eventually the secondary will sync with the primary ..
[02:45:56] <chetandhembre> what is best way to bulk insert into mongodb ?
[02:46:37] <Boomtime> but that same load is still present, you have not achieved any reduction in overall load - the same machines still need to service the same load
[02:47:25] <Boomtime> so why are you trying to save a particular machine?
[02:47:47] <chetandhembre> Boomtime: yes . but now it only handles writes, whereas previously it handled both reads + writes .. i think that way i manage to put less load on the primary
[02:48:24] <Boomtime> that isn't an answer - why does it matter that the secondary be the one that is most loaded?
[02:48:42] <Boomtime> you have the same odds of failure on the secondary as on the primary
[02:48:51] <chetandhembre> Because that machine is the primary node of my replica set ? I have a very simple replica set: one primary + one secondary
[02:51:56] <chetandhembre> but back to the bulk insert problem ?
[02:52:00] <Boomtime> with only 2 members, when either of them is missing the remaining member cannot form a majority
[02:52:08] <Boomtime> bulk insert doesn't matter to you
[02:52:28] <Boomtime> you only write to the primary, those writes are sent as individual ops when the secondary replicates them
[02:53:01] <Boomtime> hilariously, you have probably made your system worse by reading from the secondary - it might be under a greater write-load than the primary
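For reference, the 2.6-era shell Bulk API looks roughly like this (collection name hypothetical); the Java BulkWriteOperation is the equivalent, and as Boomtime notes, the secondary replays the result as individual oplog entries either way:

    var bulk = db.items.initializeUnorderedBulkOp();   // hypothetical collection
    for (var i = 0; i < 100; i++) {
        bulk.insert({n: i, createdAt: new Date()});
    }
    bulk.execute();   // one round trip for the whole batch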
[04:50:53] <xxtjaxx> Hi! I'm using the native mongodb driver and I was wondering whats the best way to use it? use the Db or MongoClient?
[05:02:01] <chetandhembre> anyone want to help me with this ?
[05:04:43] <nap> Hi Guys, Is it true that in order to use mongo 2.6.4, I need a paid service with Mongo? Can't I use http://repo.mongodb.com/apt/ubuntu trusty/mongodb-enterprise/stable multiverse -> under GPL? like how I do for 2.4.
[05:12:17] <joannac> nap: The enterprise build was never free to use in production without a support agreement
[06:20:51] <vineetdaniel> can i run db.currentOp from ruby/python
[06:25:06] <chetandhembre__> does your mongodb driver provide it ? if yes then you can
[07:35:14] <appledash> Can someone help me out with this, please: http://dba.stackexchange.com/questions/83969/how-can-i-make-parts-of-query-criteria-optional-with-mongodb
[07:35:18] <appledash> No idea why it is tagged SQL
[07:35:27] <appledash> I put "query" and it decided that SQL was the closest match
[07:49:16] <kali> appledash: for the optional part, $in is what you want
[07:49:38] <appledash> Alright... What about the order?
[07:50:10] <kali> appledash: for the sort part, there is no arbitrary sort in mongodb, but you're in luck: your sort order is reverse alphabetical, so just use name:-1
[07:50:40] <appledash> You realize that is not my real data, right? :P
[07:52:05] <appledash> If I'm working with only around 10-15MB of data, would it perhaps be better just to select everything and then work with it in code?
[07:53:21] <kali> appledash: mmmm that or add a sort key to your documents
[07:54:00] <appledash> What do you mean by that / how would I go about doing so?
[07:54:25] <kali> i should also mention that you may want to consider storing the attributes in an array rather than a hash: [{ name: "a1", value:"blah"}, {name:"a2", value:"foo"}, ...]
[07:54:45] <appledash> Why would I want to do that?
[07:55:33] <kali> appledash: it makes many things easier: mongodb is designed around schema where document keys are keywords, not arbitrary values
[07:56:20] <kali> for instance, your previous selection would become { "attributes.value": "Foo" } regardless of the number of attributes
[07:56:39] <kali> and you can optimize all the cases with only one index
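A sketch of the attribute-array layout kali is suggesting (collection and field names assumed):

    // one document per entity, attributes as an array of {name, value} pairs
    db.things.insert({title: "example", attributes: [{name: "a1", value: "blah"}, {name: "a2", value: "foo"}]})

    // a single compound index covers every attribute lookup
    db.things.ensureIndex({"attributes.name": 1, "attributes.value": 1})

    // match on a specific attribute name/value pair
    db.things.find({attributes: {$elemMatch: {name: "a1", value: "blah"}}})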
[07:56:51] <appledash> But usually the query would be like 10 or so attributes all with different values
[07:57:56] <kali> can you show us a more real life document ?
[07:58:36] <appledash> Sure, give me a few moments.
[07:59:12] <kali> you can only go so far with foos and bars :)
[08:05:23] <appledash> kali: https://gist.githubusercontent.com/AppleDash/32518e17866c7bbf0ed7/raw/19ddb617b31849fefe74f180a91ab5aa0416bb73/gistfile1.txt If my query is like {gender: "male", hairColor: "blue"} then I want it to return AppleDash followed by Gyro and Inky, the last 2 in either order.
[08:25:54] <kali> you could have said so earlier :)
[08:26:05] <kali> anyway. the facet schema will help with the indexes
[08:27:14] <kali> for the rest... it's a bit tricky. fetching everything and sorting in code is a valid option as long as all criteria are selective. but if you have gender in the criteria, then you'll end up fetching half your database into the code
[08:27:46] <kali> you may manage it with a generated aggregation query
[08:28:06] <appledash> It will most likely be a bit more specific than just gender
[08:28:42] <kali> yes, but you need to fetch all docs that match at least one criteria
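One way to read kali's "generated aggregation query" hint, sticking with appledash's flat fields for brevity (collection name and fields are assumptions, not from the channel): build the pipeline from the criteria map, score each document by how many criteria it matches, and sort by that score.

    var criteria = {gender: "male", hairColor: "blue"};   // the query appledash gave

    var or = [], score = [];
    Object.keys(criteria).forEach(function (k) {
        var clause = {}; clause[k] = criteria[k];
        or.push(clause);
        score.push({$cond: [{$eq: ["$" + k, criteria[k]]}, 1, 0]});
    });

    db.people.aggregate([
        {$match: {$or: or}},                            // keep docs matching at least one criterion
        {$project: {name: 1, matched: {$add: score}}},  // count how many criteria each doc matches
        {$sort: {matched: -1}}                          // best matches first
    ]);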
[14:08:49] <Zelest> I got a replicaset of 3 nodes.. 2 dedicated servers and 1 virtual machine.. now we plan on adding a third dedicated server and I wonder how to do this without breaking anything? As I've understood, you should have an odd number of nodes due to the election and such?
[14:09:13] <Zelest> Can I reconfigure the virtual node to never vote or something? or what is the proper way?
[14:15:01] <cheeser> you could add an arbiter, too.
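Either approach is a small reconfig from the shell; a sketch with hypothetical hostnames:

    // option 1: add the new dedicated box plus an arbiter so the member count stays odd
    rs.add("dedicated3.example.com:27017")
    rs.addArb("arbiter.example.com:27017")

    // option 2: keep the VM but strip its vote (and priority) so elections ignore it
    cfg = rs.conf()
    cfg.members[2].votes = 0       // index of the virtual machine, whichever it is
    cfg.members[2].priority = 0    // non-voting members must also be priority 0
    rs.reconfig(cfg)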
[14:58:32] <deviantony> I've set up the MMS automation agent on a mongodb node hosted in our datacenter
[14:58:50] <deviantony> but after the setup of the monitoring agent, I keep getting the same log line
[14:58:59] <deviantony> Nothing to do. Either the server detected the possibility of another monitoring agent running, or no Hosts are configured on the MMS group.
[16:30:13] <GothAlice> drecute: Then your input CSV is invalid. (I.e. didn't escape or quote extra comma usage, etc.) This is an unfortunate state for your data. If mongoimport can't do it, then I'd write a light-weight Python script to slurp the data in using a more flexible CSV parser, and one where you can catch abnormal data prior to insert.
[16:30:40] <GothAlice> drecute: https://docs.python.org/2/library/csv.html#examples — or the equivalent in your preferred language.
[16:31:12] <GothAlice> Mmike: I don't think it'll go through the secondary state before becoming the primary in the situation where no other hosts are known (i.e. right after rs.initiate())
[16:31:33] <GothAlice> Mmike: Sounds like something relatively simple to test, though. :)
[16:32:23] <tim_t> I have a program that enqueues email messages in mongo and i have a separate thread dedicated to picking up messages and sending them. Currently I am polling every 10 seconds (just picked a number) and grabbing a list of messages, sending them and clearing the queue. I do not like this polling as it seems to be bad form. Is there some sort of callback or some other asynchronous method mongo has I can use instead?
[16:32:32] <Mmike> it's clear from the log file what happens :)
[16:33:02] <GothAlice> Includes link to working library and runner. :)
[16:34:08] <GothAlice> Mmike: So, what you have is a queue. You've got something pushing data into the queue, and something pulling data out of the queue to work on it. Capped collections in MongoDB offer pretty much the exact solution you are looking for: insert-order, high-performance, and the ability to have a pending cursor "tail -f" the collection to get push notifications of new data.
[16:34:41] <Mmike> tim_t, I guess she's talking to you :)
[16:36:36] <Mmike> (that's what my kid does when he discovers something new that he likes :) )
[16:36:44] <GothAlice> Futures are awesome. That presentation and sample library I linked I'm turning into https://github.com/marrow/marrow.task — but I'm stuck waiting on https://jira.mongodb.org/browse/SERVER-15815 before I can fully replicate the capabilities of Futures within the context of MongoDB.
[16:37:01] <tim_t> okay, thanks for the lead GothAlice
[16:39:04] <GothAlice> Mmike: As a note, the default thread pool worker (both in the back-port and in core in 3) is a terrible abomination of a thread pool implementation. It does terrifyingly dumb things, and shoots itself in the foot performance-wise. https://github.com/marrow/marrow.util/blob/develop/marrow/util/futures.py is my replacement thread pool implementation that features auto-scaling, thread exhaustion, etc. and is ~30% faster. :)
[16:42:11] <GothAlice> tim_t: And no worries. Apologies again for the name confusion. ^_^
[16:42:11] <foofoobar> Hi. I have a collection of my finances with entries like {date: '…', amount: -210, description: ''}. I want to get the income + spending for a single month (or better said: for every month of the last year).
[16:42:28] <foofoobar> What is the best way to do this?
[16:42:50] <foofoobar> Should I get all records of the last year and add them up by month manually (nodejs) or can mongodb do this?
[16:43:09] <GothAlice> foofoobar: What you're looking for is aggregation: http://docs.mongodb.org/manual/core/aggregation-introduction/
[16:43:49] <GothAlice> The first example explained on that page should be applicable to your query. :)
[16:46:51] <foofoobar> GothAlice: so the $match would then do a match on a single month?
[16:55:08] <GothAlice> foofoobar: Or whatever other range you desire, aye.
[16:55:56] <foofoobar> GothAlice: I want to calculate it for each month
[16:56:19] <foofoobar> so the question is, am I doing this query for each month or is it possible to make a single query which does this for all months
[16:59:15] <GothAlice> I'll note that "making a single query that does this for all months" will be an ever-growing query. (I.e. there may be no upper limit on the number of records that need to be scanned in order to calculate the total.) To avoid large recalculations at work we "pre-aggregate" our data.
[17:00:04] <GothAlice> We track clicks, so when a click comes in we insert a record like this: https://gist.github.com/amcgregor/94427eb1a292ff94f35d
[17:00:16] <GothAlice> This will keep track of "per hour" statistics; for your case, you'd want per-month.
[17:01:15] <GothAlice> (So "hour" would be "month" in your case, and you'd strip off the day and hour in addition to minute, second, and microsecond from the date.) Then you'll always have the latest statistics by querying one record per month tops, instead of one record per transaction per month.
[17:07:44] <foofoobar> I’ll have a look at it. First trying to get my aggregate :>
[17:07:54] <foofoobar> So this is my current query: db.entries.aggregate([{$match: {Valutadatum: { $gt: new Date(2014, 10, 0)}}}, {$group: {total: { $sum: "Betrag" } }}])
[17:08:12] <foofoobar> It does not work because I have to input an _id I think.. "errmsg" : "exception: a group specification must include an _id",
[17:10:59] <GothAlice> Well, Betrag isn't a summable value.
[17:11:04] <GothAlice> You've got it as a string for some reason.
[17:11:22] <foofoobar> No it’s a float. "Betrag" : -52.6,
[17:11:23] <GothAlice> http://docs.mongodb.org/manual/reference/operator/aggregation/#date-operators — btw, there are quite a few values you can pull out of dates. :)
[17:11:35] <GothAlice> "FOLGELASTSCHRIFT" — your example returns a string.
[17:12:02] <foofoobar> I stripped out Betrag because it was not relevant in this example.
[17:12:24] <GothAlice> Except that it is; try not to assume what is or is not relevant about a query when attempting to work on said query. :/
[17:12:42] <GothAlice> Regardless, sum your values. Also, don't store financial values as floats, ever.
[17:12:49] <foofoobar> Sorry. This is a complete example: http://hastebin.com/fesoricuzu
[17:12:59] <GothAlice> (If USD or other "cents" system, store number of cents as an integer.)
[17:13:44] <foofoobar> Yeah I know the problem. Currently it’s not relevant because I’m just trying to understand how I can do things like this with mongodb
[17:13:51] <GothAlice> (In a JS REPL such as your browser, 0.1 + 0.2 != 0.3.)
[17:14:23] <GothAlice> It's a bad habit, though, even to do that in testing… mostly because of *how abysmally wrong* things can be.
[17:15:09] <GothAlice> Betrag, though, certainly is what you want to sum. Did you catch the _id addition in my last example?
[17:17:04] <GothAlice> "Beguenstigter/Zahlungspflichtiger" — your choice of keys worries me greatly, BTW. One should not have symbols other than _ in key names, for a variety of reasons including consistency of field access. (You can't use an attribute reference to get access to that field.)
[17:17:35] <foofoobar> GothAlice: It’s parsed from a csv, I need to clean this up.
[17:17:44] <GothAlice> foofoobar: Additionally, the full string value of each key (plus six bytes) is stored against every single document in the collection; which in your case will add up to more storage dedicated to the keys than the actual values.
[17:22:44] <GothAlice> That range pair should be $gte / $lt (Greater than or equal to the start of month X, less than the start on month X+1.)
[17:23:18] <GothAlice> (Your current range query would have the effect of ignoring any transaction at midnight on the morning of the 1st.)
[17:23:30] <GothAlice> Otherwise should be great. :)
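Putting GothAlice's corrections together (the _id in $group, the "$Betrag" field reference, and the $gte/$lt range), foofoobar's November query would end up roughly like:

    db.entries.aggregate([
        {$match: {Valutadatum: {$gte: new Date(2014, 10, 1), $lt: new Date(2014, 11, 1)}}},
        {$group: {
            _id: {year: {$year: "$Valutadatum"}, month: {$month: "$Valutadatum"}},
            total: {$sum: "$Betrag"}
        }}
    ])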
[17:26:26] <foofoobar> Wow. My spending is way too much :>
[17:28:27] <foofoobar> I’m getting a different value when doing this by hand. Something still needs some improvement
[17:29:50] <GothAlice> Also, for bonus points, it'd be interesting to have two $sum's, each $if'd to only $sum the positive and negative values separately. (Thus credit vs. debit totals on the account.)
[17:31:48] <foofoobar> I’m currently only filtering for negative values
[17:34:17] <GothAlice> https://gist.github.com/amcgregor/1ca13e5a74b2ac318017 is an example of one of my aggregates from work, this one doing a breakdown by day-of-the-week. :)
[17:34:37] <GothAlice> (I.e. you could find out that it's your Friday night partying that's causing your surprising spendings. ;)
[17:44:47] <tim_t> erm... any java example i can look at that does the tailing trick?
[17:46:34] <GothAlice> http://docs.mongodb.org/manual/tutorial/create-tailable-cursor/ includes a C++ example… see also: https://github.com/deftlabs/mongo-java-tailable-cursor
[17:48:59] <GothAlice> kali: The one I provided is an extension to the default driver to improve support for event-based tailing. It isn't really a simplified example of how to do it.
[17:49:00] <kali> but i had a quick look and from 30000 feet, it looks right
[17:49:15] <GothAlice> (So much code to do so little… T_T)
[17:56:07] <GothAlice> The main Reader loop is basically identical to my Python example: loop over the cursor as long as it's alive, when it eventually dies (times out, for example), automatically retry.
[17:56:24] <GothAlice> This Java version is hideous from the standpoint of tracking all IDs seen in a HashSet…
[17:56:33] <GothAlice> (Instead of simply the "last" one seen, for the purposes of retrying…)
[17:57:40] <GothAlice> The sleep(100) is to handle the edge case where the collection is completely empty—the cursor will always immediately return null in that case, so retrying is needed until some data, any data is added to the collection. (Thus the {nop: True} record in my Python example.)
[18:17:28] <GothAlice> tim_t: http://shtylman.com/post/the-tail-of-mongodb/ goes into depth on the mechanisms at play and how they affect performance. (That sleep(100) will introduce sawtooth lag.)
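In shell terms, the tailing loop GothAlice and the linked examples describe looks roughly like this (capped collection name assumed; the Java version is the same loop using a DBCursor with the Tailable/AwaitData options):

    var coll = db.mail_queue;   // must be a capped collection
    var last = null;

    while (true) {
        var query = last ? {_id: {$gt: last}} : {};
        var cur = coll.find(query)
                      .addOption(DBQuery.Option.tailable)
                      .addOption(DBQuery.Option.awaitData);
        while (cur.hasNext()) {
            var msg = cur.next();
            last = msg._id;
            // send the e-mail here
        }
        sleep(100);   // cursor died or collection still empty; back off briefly and retry
    }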
[18:33:17] <hexus0> is there anything I should be worried about if I set up a 3 server replica set where each server is in a geographically different data center?
[18:34:03] <GothAlice> hexus0: Other than needing to pay closer attention to where your queries are directed, not really. See: http://docs.mongodb.org/manual/data-center-awareness/
[18:34:26] <tim_t> oh nice. no need to remove entries with this technique
[18:34:51] <hexus0> I was reading the replica set docs and it said to put the majority of servers in the primary data center, but didn't touch much on having an equal number of servers in different data centers
[18:35:10] <GothAlice> tim_t: Indeed. Though you very likely will need to track where you "leave off" each time the worker quits. (Or atomically change a boolean flag on each processed entry and filter on that.)
[18:36:38] <GothAlice> hexus0: The issue would be that if the primary DC loses connection to the two foreign DCs, it'll stop functioning. (It'll become a read-only secondary until a connection can be re-established with either of the other DCs.)
[18:37:44] <hexus0> Would it be smarter, then, to have a 4 server replica set with two servers in the primary data center and an arbiter?
[18:38:10] <GothAlice> Generally you would have a minimal HA replica set in the primary DC, then secondaries in offsite DCs to speed up local querying (and to provide offsite streaming backup).
[18:38:36] <GothAlice> "Minimal HA" = high availability = two mongod replica members and an arbiter.
[18:39:24] <GothAlice> That way you don't rely on outside DC connections to maintain local write-capability.
[18:39:57] <hexus0> when you say 2 mongod replica members, you mean in addition to the primary
[18:40:07] <hexus0> so that entire set would be 5 mongo instances and 1 arbiter
[18:40:39] <GothAlice> No, one will be a primary, the other a secondary, then two more, one in each of the two foreign DCs with a lower priority than the ones in the main DC (to ensure they never become primary). And an arbiter to keep the number of hosts odd.
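A sketch of that layout as an rs.initiate() config (hostnames hypothetical): two data-bearing members plus an arbiter in the primary DC, and a priority-0 member in each remote DC so they can never become primary:

    rs.initiate({
        _id: "rs0",
        members: [
            {_id: 0, host: "dc1-a.example.com:27017"},                       // primary DC
            {_id: 1, host: "dc1-b.example.com:27017"},                       // primary DC
            {_id: 2, host: "dc1-arb.example.com:27017", arbiterOnly: true},  // keeps the voter count odd
            {_id: 3, host: "dc2-a.example.com:27017", priority: 0},          // remote DC, never primary
            {_id: 4, host: "dc3-a.example.com:27017", priority: 0}           // remote DC, never primary
        ]
    })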
[18:42:26] <GothAlice> hexus0: Definitely investigate http://docs.mongodb.org/manual/data-center-awareness/ though — you'll want to be careful about where queries get sent.
[18:42:49] <hexus0> Doing that right now :) digging through the docs
[21:07:38] <foofoobar> This is my code: http://hastebin.com/ecumaxerid.js
[21:07:44] <foofoobar> Now I want to do the following things:
[21:08:18] <foofoobar> 1) The query in the code is just calculating everything below 0 (spendings), I now need to get income, too. Do I make a separate query for this?
[21:08:32] <GothAlice> Nope, you can do both with one.
[21:08:53] <foofoobar> 2) I want to get the values I calculated with (1) also for a few others months, do I make a query for all of them?
[21:09:00] <foofoobar> GothAlice: Can you hint me how?
[21:11:35] <GothAlice> Something like that should do.
[21:12:20] <GothAlice> Basically what that says is, the "credit" sum is the $sum of "value" (in German from your sample data ;) for only positive values. (You'd do the reverse for debit.)
[21:13:30] <GothAlice> Then when running that one query, you get both answers. (You could also have a "balance" final balance that is the sum of both positive and negative.)
[21:15:17] <golya> Hi. I am new to mongodb, and have the following basic question: how do you model one-to-manys? If A has many Bs, and one B has many Cs, and I'd like to query for such Cs, which belong to a specific A, how to design collections?
[21:15:21] <GothAlice> foofoobar: {$group: {_id: {…}, credit: {$sum: {$cond: {if: {$gt: ["$value", 0]}, then: "$value", else: 0}}, debit: …} — I forgot a closing } after the $gt. :)
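Assembled with the missing brace, the $group stage GothAlice is sketching would look something like this (using Betrag, the field name from foofoobar's data, in place of her "$value"):

    {$group: {
        _id: {year: {$year: "$Valutadatum"}, month: {$month: "$Valutadatum"}},
        credit: {$sum: {$cond: {if: {$gt: ["$Betrag", 0]}, then: "$Betrag", else: 0}}},
        debit:  {$sum: {$cond: {if: {$lt: ["$Betrag", 0]}, then: "$Betrag", else: 0}}}
    }}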
[21:16:09] <GothAlice> golya: http://www.javaworld.com/article/2088406/enterprise-java/how-to-screw-up-your-mongodb-schema-design.html contains a nice short summary of how to bridge ways of modelling data relationally with how MongoDB generally models things.
[21:16:32] <foofoobar> give me some minutes, I need to understand this :)
[21:19:37] <GothAlice> golya: The example I usually fall back on is that of forums (like phpbb). In relational you have a table of replies, a table of threads, and a table of forums. You'd JOIN across these to answer questions like "all replies to a thread", or "search for all replies by user X, grouped by forum/thread".
[21:19:39] <GothAlice> In MongoDB, because you often want the replies with a thread (when looking at a thread), I embed the replies as a list of embedded sub-documents. A la thread = {title: "Awesome thread!", creator: ObjectId(GothAlice.id), contents: [{creator: …, comment: "Awesome."}, …]}
[21:20:54] <golya> Well, if you are in, I describe my mini-problem, with 3 entities, and tell about that.
[21:21:00] <GothAlice> However if I need to answer that other question, I need to perform two queries: first, look up the user ID by username (usually), then a second query to find all comments by that user. (db.thread.find({"contents.creator": GothAlice.id}))
[21:22:43] <foofoobar> So when doing this for the last 3 months, I’m executing this query 3 times with different dates, correct?
[21:22:43] <GothAlice> golya: I would need to know more about your use case and desired use of the data than just "A has Bs, B has Cs, I need Cs by A."
[21:22:58] <golya> I'd like to design a site for collectors. Admins will upload collections (finite predefined set of say cards).
[21:23:17] <GothAlice> foofoobar: You could do that, but since you're grouping by year/month you'll automatically get back one record per month regardless of the range you are requesting.
[21:23:28] <golya> Users would mark, which cards they need, and which ones they have extra to offer in exchange.
[21:23:50] <golya> So A: a collection of cards, B: one particular card type in a collection
[21:24:18] <golya> and C: storing somehow, that a particular user needs a concrete B, or has offered B to exchange
[21:24:34] <golya> I think I can embed the cards to the collections
[21:24:52] <golya> But not sure where to store user needs / offers
[21:25:18] <GothAlice> golya: One simple approach is to store the ObjectIds of the cards the user needs and offers in two lists.
[21:26:12] <GothAlice> (That'd give you a maximum number of "wanted" and "offered" cards combining to just over one million per user.)
[21:27:05] <golya> GothAlice: ok, so a collection for UserNeed, which references cards, and users (by object _ids)
[21:27:17] <GothAlice> Store the list of needs / offers directly in the user record.
[21:27:47] <GothAlice> {username: "GothAlice", needs: ["Charizard", "Pikachu"], offers: ["Squirtle", "Diglett"]} (except with the IDs of the cards rather than their names. ;)
[21:28:33] <golya> GothAlice: ok, clear. then, the next q: could I query user needs, which are in a given collection?
[21:28:57] <golya> In User document, there are just card ids
[21:29:13] <GothAlice> golya: Very much so. For example, to ask the "user" collection for all usernames that want Pikachu: db.user.find({needs: "Pikachu"})
[21:31:34] <GothAlice> That will not do what you expect. :)
[21:31:58] <GothAlice> If "needs" is a mapping/dict/object/subdocument like that, you'll have some extreme difficulty applying indexes.
[21:32:13] <GothAlice> (Since the "path" to the field you want to index will depend on how people have named their "collections".)
[21:33:08] <GothAlice> golya: Instead you could store it as a list of collections. {username: "GothAlice", collection: [{name: "collection1", needs: ["Charizard", …], offers: […]}, {name: "collection2", needs: […], offers: […]}]}
[21:33:10] <golya> GothAlice: collections are not user named groups, they are distinct world of cards, named by the admin.
[21:33:46] <GothAlice> Still, you would have to index many values instead of two values as in my example. (Index on collection.needs and collection.offers.)
[21:34:05] <GothAlice> Vs. indexing on needs.collection1, needs.collection2, … needs.collectionX
[21:34:26] <GothAlice> Then offers.collection1, … offers.collectionN.
[21:34:38] <foofoobar> GothAlice: Yeah I can extend the matched range, awesome!
[21:35:23] <golya> GothAlice: ok, I haven't though about indexing. So you said the array will be more easily indexable.
[21:35:36] <GothAlice> A list of collection sub-documents remains queryable, too. You can then ask questions like, give me all users that have marked needs or offers on collection X. (db.users.find({collection.name: "collection1"}))
[21:36:07] <GothAlice> golya: It'll be the only way you could make that data indexable without extracting it and jumping through a whole lot of hoops to pretend that MongoDB knows what a relationship is. (You'd have to do all that manually. ;)
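With the list-of-subdocuments layout, the indexes GothAlice mentions are just (2.6-era shell syntax):

    db.user.ensureIndex({"collection.needs": 1})
    db.user.ensureIndex({"collection.offers": 1})

    // and "all users with needs or offers in collection X" stays a simple indexed query
    db.user.ensureIndex({"collection.name": 1})
    db.user.find({"collection.name": "collection1"})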
[21:36:49] <GothAlice> foofoobar: The magic of $group on extracted date components. :)
[21:37:08] <golya> GothAlice: so generally [{key:"value of key", field1: value1}] fits mongodb better, than [ "value of key": { field1: value1}]
[21:37:30] <GothAlice> Yes. Variable key names will lead to insanity and possibly kicking of your dog, if applicable. ;)
[21:37:31] <golya> than { "value of key": { field1: value1}}
[21:39:51] <GothAlice> golya: There's no JS involved here. :/
[21:41:04] <GothAlice> golya: An example implementation that does this, again back to my forums example: https://github.com/bravecollective/forums/blob/develop/brave/forums/component/thread/model.py#L58-L73 (model) — This adds a new comment to a thread: https://github.com/bravecollective/forums/blob/develop/brave/forums/component/thread/model.py#L99-L110
[21:41:08] <GothAlice> https://github.com/bravecollective/forums/blob/develop/brave/forums/component/thread/model.py#L124-L127 gets a specific reply from whatever thread it's in…
[21:41:36] <GothAlice> And in this case it's handy to ask for the first or last reply to a thread: https://github.com/bravecollective/forums/blob/develop/brave/forums/component/thread/model.py#L166-L192 — that's possible, too.
[21:41:54] <GothAlice> MongoDB lets you query and manipulate nested data structures extremely easily.
[21:42:07] <golya> GothAlice: ah, so what bothers me: by grouping needs by collection, you state your card-collection relation twice
[21:42:44] <GothAlice> golya: As an example from MtG, could you not have a card in multiple collections? A la Wrath of God reprints? ;)
[21:43:23] <GothAlice> Would it not be most efficient to store the card once, and just reference in the card data: {name: "Wrath of God", collection: ["Beta", "Morridin", "8th", …]}
[21:43:44] <golya> GothAlice: no, in my model, every card belongs to a single collection
[21:44:07] <GothAlice> golya: Often what appears as data duplication in MongoDB is actually an optimization to allow you to more easily query your data.
[21:44:33] <GothAlice> While yes, the card may belong to only one collection, explicitly having the collection referenced in the user allows you to perform additional queries in the absence of "join" support.
[21:45:05] <GothAlice> (I.e. find me all users interested in collection X.)
[21:45:34] <GothAlice> To answer that question without the "extra" reference would require loading the IDs of all cards in that collection then querying $in for _all_ of them. (Which would be a rather bad idea.)
[21:46:13] <GothAlice> (16 or so extra bytes per collection per user vs. potentially several megabytes on each query.)
[21:46:39] <golya> GothAlice: but back to screwing up mongodb by 1:1 mapping database tables to documents, how would it hurt, if I store userneeds as: {card:"id of card needed", user:"user id"}?
[21:47:08] <GothAlice> What types of queries (questions) do you need answered?
[21:48:08] <GothAlice> For example, even if you have to answer in SQL, how would you express the question: "What are the user names of every user interested in selling or buying 'Wrath of God'?"
[21:51:58] <styles> Hey guys i want to calculate data every hour, but during some hours data doesn't exist. Using aggregate is there a way to fill in missing data?
[21:52:37] <golya> GothAlice: good and practical point
[21:53:09] <GothAlice> golya: When embedding the needs/offers in the user, this question is trivial: var wrath_id = db.card.findOne({name: 'Wrath of God'}, {_id: 1})._id; db.user.find({$or: [{collection.need: wrath_id}, {collection.offer: wrath_id}]})
[21:53:54] <GothAlice> (Two queries, very little data transfer, and blazingly fast if indexed. Get the ID of the card, query users' need/offer lists, regardless of collection, for this ID.)
[21:54:46] <GothAlice> styles: I deal with aggregate holes in my click tracking analytics by populating a "default value" for missing time periods in the application, rather than trying to enforce "empty record" creation inside MongoDB. (I.e. tolerating the missing data rather than populating it.)
[21:55:04] <GothAlice> styles: Python's "defaultdict" is quite useful for this purpose.
[21:55:17] <styles> so I should basically... do something like take the datetime range and enumerate every value?
[21:55:35] <styles> I'm not sure what "defaultdict" is I'm writing everything in Go
[21:56:36] <GothAlice> Presumably you're enumerating it already to emit values… somewhere. Instead, enumerate the records into a placeholder structure that can provide data for missing values, then enumerate that placeholder.
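A sketch of what GothAlice means, in shell-JS terms (styles would do the same in Go; the clicks collection and ts field are assumptions): aggregate what exists, index it by hour, then walk the full hour range and substitute a default for any missing bucket.

    // index the aggregation results by their hour bucket
    var byHour = {};
    db.clicks.aggregate([
        {$group: {_id: {$hour: "$ts"}, total: {$sum: 1}}}
    ]).forEach(function (r) { byHour[r._id] = r.total; });

    // enumerate every hour, falling back to a default for the gaps
    var series = [];
    for (var h = 0; h < 24; h++) {
        series.push({hour: h, total: byHour[h] || 0});
    }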
[21:57:08] <quattro_> is it ok to run mongodb on servers with non ecc ram when using replication?
[21:57:38] <GothAlice> golya: With the "joining" collection like that (which is flat-out not cricket in MongoDB for this type of use) you'd need to load up the card ID, then query the "joining" table to find all distinct user IDs, then query the users collection for *all* of those IDs. That could be tens of megabytes (or more) of data transfer in those ID lists alone.
[21:58:01] <GothAlice> quattro_: Aye. It's even relatively OK if you just have journalling enabled on a single host, but redundancy is always good.
[21:58:52] <styles> it's been hard to get somebody to answer this question. I assumed that's how it should be done, but I really wish Mongo had support
[21:58:57] <quattro_> anyone already tried the new beta with compression?
[21:59:08] <GothAlice> styles: No worries. :) (defaultdict is such a structure in Python; "dict" -> "dictionary", Python's term for mappings, hash tables, JS objects, or "document" in MongoDB terms.)
[22:02:31] <quattro_> ah yeah mysql did use a lot less disk space
[22:03:04] <quattro_> I built a server monitoring service just for internal usage but I'm noticing a lot of data building up, a lot more than expected
[22:03:14] <quattro_> I think the compression would really be good for my data
[22:03:45] <GothAlice> MongoDB acts like a filesystem; it pre-allocates on-disk stripes which it then populates with data; like a filesystem, this storage can become fragmented.
[22:05:24] <quattro_> yeah i have 0 bytes left, was running on a 20gb ssd vps :(
[22:10:54] <GothAlice> Hitting zero free is usually one of those "now we need to mongodump, clear everything, and mongorestore to get our indexes working again" situations.
[22:11:58] <GothAlice> quattro_: Something I find useful on space-constrained systems is "one directory per db" (lets you easily mount an extra drive and move the data around on a per-db basis if needed) and "smallfiles" (start at a much smaller stripe size.)
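Both of those are startup options; shown here as mongod flags (the same keys can go in the config file):

    mongod --dbpath /data/db --smallfiles --directoryperdb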
[23:06:23] <hahuang65> if I did a mongorestore on a collection of 959621954250, how long might that take?
[23:07:26] <hahuang65> It's been about 6 days now and it's only at 17%. Is that sort of speed normal?
[23:10:00] <hahuang65> araujo: sorry about that, mis-chat.
[23:17:05] <GothAlice> hahuang65: 959,621,954,250 records? Or bytes? — If those are records, even containing nothing other than their _id, that's 19 TiB of data.
[23:17:34] <hahuang65> GothAlice: well the progress says: Progress: 169302378663/959621954250
[23:17:42] <GothAlice> Yup, those are records, then.
[23:17:44] <hahuang65> GothAlice: is that bytes or records from mongorestore
[23:23:26] <GothAlice> Across a 100MBit network operating at 80% efficiency (typical TCP overhead and whatnot) the data transfer alone (let alone disk committing, atomic operation modelling, etc. happening server-side) will require 23.3 days of transfer. Gigabit at 80% efficiency will take 2.33 days to simply transfer the data. If you are restoring into an existing collection, indexes are being built as you go, and if there is replication, all inserts are being handed
[23:23:26] <GothAlice> down through the replica set, too.
[23:24:03] <GothAlice> hahuang65: So yeah, that will take an insanely long period of time. It may be worthwhile to do the mongorestore locally, shut down the local server, and clone the database files themselves up to their final destination.
[23:24:21] <GothAlice> (Having mongorestore write directly to the files instead of via mongod.)
[23:25:02] <GothAlice> (And all those numbers are optimistic, assuming records whose only contents is an _id ObjectId field.)
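The 2.x-era tools could write straight into the data files, which is what GothAlice means by a local direct-write restore (paths hypothetical; the --dbpath option was removed from later tool versions):

    # with the local mongod stopped
    mongorestore --dbpath /data/db /path/to/dump
    # then copy/rsync the data files up to the destination and start mongod there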
[23:26:01] <GothAlice> hahuang65: If you can pastebin an "average" sample record for me, I can perform slightly more accurate minimum time estimates.
[23:29:38] <hahuang65> araujo: arghhh sorry, I keep doing that
[23:30:02] <hahuang65> GothAlice: I can't do your suggestion because this is a dump from 2.0.4 and a restore into 2.6.x right?
[23:33:09] <GothAlice> hahuang65: That… could be an issue. You seem to be in a tight spot. I'd still attempt a local direct-write restore… it'll be much faster, eliminate network constraints, etc., etc.
[23:33:55] <GothAlice> At the same time, you may need to consider pivoting your data structure to reduce the number of individual records being stored. (We pre-aggregate our click tracking data rather than storing each click as a separate record. This gives us per-hour accuracy instead of per-click accuracy, but it also means we generate no more than 24 records per day per job being tracked.)
[23:35:50] <hahuang65> GothAlice: we'd have to think about this one. Maybe we can do it locally. Sort of short on disk space though.
[23:56:33] <GothAlice> hahuang65: Now I'd *really* love to see an example record so that I may be able to provide assistance in optimizing that dataset.
[23:56:49] <appledash> Would anyone here be able to help me with modifying a query that someone else here helped me write last night? Here's the query: https://gist.githubusercontent.com/AppleDash/48e4d803e6fe870f7915/raw/85146717ad71839822b70451eba45319d51f1fe0/gistfile1.txt I'd like to modify it so that 'val' is an array of values instead of just one value, and instead of checking if the values for a given attribute name match
[23:56:51] <appledash> the queried values, I just want to see if the queried value is in the array of values.