#mongodb logs for Monday the 1st of December, 2014

[01:48:32] <tim_t> japhar81: just use a "this question is answered" flag
[01:48:56] <japhar81> tim_t: i dont think im explaining my model right
[01:48:59] <tim_t> either check if the answer field is empty or put an "isAnswered" boolean
[01:48:59] <japhar81> i have 100 questions
[01:49:02] <japhar81> and 100 users
[01:49:07] <japhar81> each user can answer 0-100
[01:49:11] <tim_t> k
[01:49:20] <japhar81> rather all users can
[01:49:24] <japhar81> they're not inter-dependent
[01:49:36] <japhar81> so if user1 answered 10, i need the other 90 back
[01:49:42] <japhar81> if user2 answered 50, i need the other 50
[01:49:43] <japhar81> etc
[01:52:20] <tim_t> okay sure
[01:52:29] <Boomtime> i've been idling here for 5 hours but i still seem to be missing some context to the above
[01:52:52] <Boomtime> meh, 4 hours and a bit actually
[01:54:16] <Boomtime> japhar81: is there a question? (sorry, if it was explained hours ago)
[01:54:36] <japhar81> :) yeah it was from this morning
[01:54:43] <japhar81> im trying to figure out the right model for two unbounded lists
[01:54:47] <japhar81> i have a pool of questions
[01:54:52] <japhar81> they're added/removed at-will
[01:54:55] <japhar81> and a set of users
[01:55:29] <japhar81> im trying to figure out how to store this to quickly get 'all unanswered' for a user, as well as 'all answers' for a user
[01:55:33] <tim_t> how about have a source collection of question objects prepped with the question field, an answer field, and a user field. whenever a user answers one, dupe it to an answers collection with the user and answer fields filled. compare the collections filtering by user to see which ones they did not answer?
[01:55:39] <japhar81> i went down the road of question IDs saved per-user
[01:55:43] <japhar81> but that's an unbounded list too
[01:55:45] <japhar81> seems kinda ugly
[01:57:27] <tim_t> its capped to 10000 at most though, right?
[02:00:17] <tim_t> what do you mean by unbounded exactly
[02:03:28] <japhar81> its not capped
[02:03:30] <japhar81> that's really the problem
[02:03:37] <japhar81> the questions list will keep growing
[02:03:55] <japhar81> 10k is 90 days away
[02:03:58] <japhar81> 50k is probably 6 months
[02:04:30] <Boomtime> and you need to find which of those questions a particular user has not answered?
[02:05:04] <japhar81> i need all Question records that have not been answered
[02:05:11] <japhar81> and all answered questions with answers
[02:06:14] <tim_t> but you want to look at a user, see which 100 out of the pool of possibly zillions in the future and see which out of those 100 are un/answered?
[02:06:29] <tim_t> which 100 questions out of the pool
[02:06:39] <japhar81> yep
[02:07:13] <tim_t> okay makes sense
[02:07:51] <japhar81> oh, extra fun, Casbah doesn't support DBRef very well
[02:07:54] <japhar81> so i can't use that
[02:08:01] <japhar81> so i was just stashing a list of questionIds
[02:08:02] <tim_t> can the 100 candidate questions change from session to session?
[02:08:08] <japhar81> yeah they can
[02:08:16] <japhar81> a user can answer a question that's then deleted
[02:08:23] <japhar81> new questions can be added
[02:08:24] <japhar81> etc
[02:08:28] <japhar81> i can't really pre-compute it
[02:08:47] <Boomtime> where did the 100 come from?
[02:09:11] <Boomtime> are 100 simply selected at random from the pool for each user?
[02:09:55] <japhar81> 100 is the total # in the collection
[02:10:07] <japhar81> so my test setup has 10 users, and 100 questions
[02:10:12] <japhar81> right now each user has 0 answered
[02:10:15] <Boomtime> "japhar81: the questions list will keep growing"
[02:10:21] <japhar81> yep
[02:10:31] <japhar81> let me try a different explanation :)
[02:10:32] <japhar81> so im a user
[02:10:37] <japhar81> i've answered 0 questions ever
[02:10:37] <Boomtime> this sounds like a contradiction, please explain how it is not
[02:10:44] <japhar81> there are 100 in the collection right now
[02:10:53] <japhar81> when i log in, my unanswered questions list is all 100
[02:10:56] <japhar81> i log out
[02:11:02] <japhar81> 10 are deleted, but 20 others are added
[02:11:08] <japhar81> when i get in, there are 110 unanswered
[02:11:36] <japhar81> in SQL-land this would be a users table, a questions table, and a questionAnswers table
[02:11:38] <japhar81> and I'd just join
[02:12:26] <Boomtime> it all sounds pointless, are you holding something to ransom? what makes you think somebody is going to answer 100 questions let alone 50,000?
[02:13:01] <Boomtime> (this is academic of course, we can just talk about how you'd do this exercise only if you like)
[02:13:11] <tim_t> they might if it is incentivised
[02:13:42] <tim_t> like a game-show type thing… gambling perhaps
[02:13:44] <japhar81> its a dumbed-down example, every one you answer in a group pre-populates hundreds of others with a best-effort answer
[02:13:55] <japhar81> and you can go tweak at-will
[02:14:10] <japhar81> but no one wants to talk about multivariant matching i assume, so i simplified it :)
[02:15:24] <tim_t> why delete instead of say deactivate and keep using a reference?
[02:16:12] <japhar81> i could
[02:16:15] <tim_t> and when the code presumably draws new questions it simply does not offer any that are deactivated
[02:16:29] <japhar81> that would be ok, i was trying to avoid the extra filter
[02:16:31] <japhar81> but i dont mind it
[02:21:20] <tim_t> do you still want to be able to later on see what question the answer answers?
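A minimal sketch of the model the channel is circling here, in mongo shell terms — collection and field names (questions, answers, userId, questionId) are illustrative, not from the discussion: keep questions in their own collection, deactivate rather than delete (as tim_t suggests above), record each answer as one small document, and derive "unanswered" per user at query time.

    // one document per question; deactivate instead of deleting
    db.questions.insert({ text: "Question?", active: true })

    // one document per (user, question) answer
    db.answers.insert({ userId: u1, questionId: q1, answer: "42" })

    // all answered questions (with answers) for one user
    var answered = db.answers.find({ userId: u1 }).toArray()

    // all active questions that user has not answered yet
    var answeredIds = answered.map(function (a) { return a.questionId })
    db.questions.find({ active: true, _id: { $nin: answeredIds } })

The $nin list grows with what that user has actually answered, not with the total question pool, which is what the "unbounded list" concern above is about.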
[02:34:00] <chetandhembre> What are best practices for inserting 100 documents (objects) in bulk ? I am using BulkWriteOperation (in java), is it right ? or should i use the insert(docs) function ?
[02:35:21] <cheeser> are you having any problems with it?
[02:36:38] <chetandhembre> yeah .. so many connections open on the server ?
[02:36:55] <chetandhembre> and eventually making mongodb slow
[02:37:49] <Boomtime> so you have a problem: mongodb slows down (or something like that)
[02:38:01] <Boomtime> and you have evidence: lots of connections
[02:38:10] <chetandhembre> I am not concerned with write guarantees .. but it should not make my mongodb server slow
[02:38:40] <Boomtime> are you reading from it at the same time as you write?
[02:38:43] <chetandhembre> Boomtime: yeah .. I am writing on primary and reading from secondary
[02:39:06] <chetandhembre> sorry *yeah* was for your earlier message :p
[02:39:34] <Boomtime> if you are reading from a secondary then your insert op is irrelevant - the secondary always sees them as distinct write operations
[02:40:09] <Boomtime> the secondary has at least the same write load as the primary
[02:40:26] <Boomtime> for what reason are you reading from the secondary?
[02:41:12] <chetandhembre> i am not concerned with availability or freshness of data so i am happy with eventual consistency
[02:41:45] <chetandhembre> *or maybe i am concerned with availability but not freshness
[02:42:13] <Boomtime> how does reading from a secondary improve availability?
[02:44:41] <chetandhembre> I am reading from secondary because in this way i am reducing load from my primary .. and eventually secondary will sync with primary ..
[02:45:56] <chetandhembre> what is best way to bulk insert into mongodb ?
[02:46:37] <Boomtime> but that same load is still present, you have not achieved any reduction in overall load - the same machines still need to service the same load
[02:47:25] <Boomtime> so why are you trying to save a particular machine?
[02:47:47] <chetandhembre> Boomtime: yes . but now it only handles writes, previously it handled both read + write .. i think that way i manage to make the load on the primary less
[02:48:24] <Boomtime> that isn't an answer - why does it matter that the secondary be the one that is most loaded?
[02:48:42] <Boomtime> you have the same odds of failure on the secondary as on the primary
[02:48:51] <chetandhembre> Because this machine is my primary node from replica set ? I have very simple replica set one primary + one secondary
[02:49:05] <Boomtime> no arbiter?
[02:49:09] <chetandhembre> no
[02:49:30] <Boomtime> ok, ignoring that problem for the moment..
[02:49:32] <chetandhembre> as i have a single secondary .. so if the primary goes down the secondary will become primary
[02:49:41] <Boomtime> no it won't
[02:49:48] <Boomtime> you should try it
[02:50:30] <chetandhembre> ok .. I read the doc .. an arbiter only plays a role when you have an even number of secondaries
[02:50:34] <Boomtime> if you had an arbiter that would be true - given that the two remaining nodes could determine majority
[02:50:37] <chetandhembre> am i right ?
[02:50:58] <Boomtime> even number of members
[02:51:08] <Boomtime> members = primary or secondary
[02:51:16] <Boomtime> or arbiters
[02:51:37] <chetandhembre> ohh . i will look into it
[02:51:38] <Boomtime> you are better off with an odd number of _members_ in total
[02:51:39] <chetandhembre> thnks
[02:51:56] <chetandhembre> but again bulk insert problem ?
[02:52:00] <Boomtime> with 2 members only, when either of them is missing the remaining member cannot form a majority
[02:52:08] <Boomtime> bulk insert doesn't matter to you
[02:52:28] <Boomtime> you only write to the primary, those writes are sent as individual ops when the secondary replicates them
[02:53:01] <Boomtime> hilariously, you have probably made your system worse by reading from the secondary - it might be under a greater write-load than the primary
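For reference, a rough mongo shell equivalent of the two insert styles chetandhembre is choosing between (the Java BulkWriteOperation and insert(docs) map to the same server-side commands); the collection name is made up:

    // single insert command carrying an array of documents
    db.profiles.insert([ { name: "a" }, { name: "b" }, { name: "c" } ])

    // unordered bulk operation (2.6+ shell): batches writes,
    // keeps going past individual failures
    var bulk = db.profiles.initializeUnorderedBulkOp()
    for (var i = 0; i < 100; i++) {
        bulk.insert({ seq: i })
    }
    bulk.execute()

As Boomtime notes, either form is replayed as individual operations on the secondary, so batching on the client does not reduce the secondary's write load.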
[02:53:13] <chetandhembre> yeah..
[02:53:24] <chetandhembre> what should i do now ?
[02:53:45] <Boomtime> use the default read pref unless you can construct a compelling reason for doing otherwise
[02:54:16] <Boomtime> "spreading load" is not a compelling reason unless you are spreading across continents
[02:54:23] <chetandhembre> my read preference is "secondaryPreferred"
[02:55:22] <Boomtime> go back to default (primary), re-test and find the actual source of the slow down
[02:56:12] <Boomtime> you mentioned "lots of connections" before
[02:56:21] <Boomtime> can you quantify that?
[02:56:22] <chetandhembre> So not primary + secondary + arbiter setup ?
[02:56:33] <Boomtime> yes, P + S + A is a good config
[02:56:41] <chetandhembre> So I am getting about 7000 open connections at one time
[02:56:47] <Boomtime> read/write from primary (the default)
[02:56:54] <chetandhembre> P ? S ? A ?
[02:57:04] <Boomtime> Primary, Secondary, Arbiter
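A sketch of that P + S + A layout as a replica set config — the hostnames are placeholders:

    rs.initiate({
        _id: "rs0",
        members: [
            { _id: 0, host: "db1.example.net:27017" },
            { _id: 1, host: "db2.example.net:27017" },
            { _id: 2, host: "arb.example.net:27017", arbiterOnly: true }
        ]
    })
    // or, on an already-running set: rs.addArb("arb.example.net:27017")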
[02:57:38] <Boomtime> 7000 is quite high, what driver are you using? language
[02:57:47] <chetandhembre> Java
[02:57:55] <chetandhembre> Java mongodb driver
[02:58:08] <Boomtime> what sort of tasks are you doing? can you describe briefly what your app does
[02:59:11] <chetandhembre> It is social network management .. I am storing profiles of users (avg 80) in mongodb
[02:59:22] <chetandhembre> in single bulk upload
[03:00:13] <Boomtime> you have only a single application instance?
[03:00:25] <chetandhembre> no about 10
[03:00:29] <chetandhembre> min
[03:02:18] <Boomtime> ok, have you changed any connection defaults?
[03:02:56] <chetandhembre> mean ?
[03:03:20] <Boomtime> do you set connectionsPerHost or other options?
[03:03:30] <chetandhembre> yeah yeah
[03:03:40] <chetandhembre> let me paste my configuration here
[03:04:02] <chetandhembre> MongoClientOptions.Builder builder = new MongoClientOptions.Builder();
[03:04:02] <chetandhembre> builder.connectionsPerHost(100);
[03:04:03] <chetandhembre> builder.connectTimeout(1000 * 2);
[03:04:03] <chetandhembre> builder.socketTimeout(1000 * 60);
[03:04:03] <chetandhembre> builder.socketKeepAlive(true);
[03:04:03] <chetandhembre> builder.maxWaitTime(1000 * 10);
[03:04:05] <chetandhembre> builder.maxConnectionIdleTime(1000 * 10);
[03:04:07] <chetandhembre> builder.maxConnectionLifeTime(1000 * 60 * 5);
[03:04:09] <chetandhembre> builder.threadsAllowedToBlockForConnectionMultiplier(1000);
[03:04:11] <chetandhembre> builder.readPreference(ReadPreference.secondaryPreferred());
[03:04:59] <Boomtime> i must go for a bit, i'll have a look when i get back
[03:05:10] <chetandhembre> sure
[04:50:53] <xxtjaxx> Hi! I'm using the native mongodb driver and I was wondering whats the best way to use it? use the Db or MongoClient?
[05:02:01] <chetandhembre> any one want to help me with this ?
[05:04:43] <nap> Hi Guys, Is it true that in order to use mongo 2.6.4, I should be in paid service with Mongo? Can't I use http://repo.mongodb.com/apt/ubuntu trusty/mongodb-enterprise/stable multiverse -> under GPL? like how I do for 2.4.
[05:12:17] <joannac> nap: The enterprise build was never free to use in production without a support agreement
[06:20:26] <vineetdaniel> hi
[06:20:51] <vineetdaniel> can i run db.currentOp from ruby/python
[06:25:06] <chetandhembre__> does your mongodb driver provide it ? if yes then you can
[07:35:14] <appledash> Can someone help me out with this, please: http://dba.stackexchange.com/questions/83969/how-can-i-make-parts-of-query-criteria-optional-with-mongodb
[07:35:18] <appledash> No idea why it is tagged SQL
[07:35:27] <appledash> I put "query" and it decided that SQL was the closest match
[07:49:16] <kali> appledash: for the optional part, $in is what you want
[07:49:38] <appledash> Alright... What about the order?
[07:50:10] <kali> appledash: for the sort part, there is no arbitrary sort in mongodb, but you're in luck: your sort order is reverse alphabetical, so just use name:-1
[07:50:40] <appledash> You realize that is not my real data, right? :P
[07:51:11] <kali> shame :)
[07:51:32] <kali> sorry, not $in but $or
[07:52:05] <appledash> If I'm working with only around 10-15MB of data, would it perhaps be better just to select everything and then work with it in code?
[07:53:21] <kali> appledash: mmmm that or add a sort key to your documents
[07:54:00] <appledash> What do you mean by that / how would I go about doing so?
[07:54:25] <kali> i should also mention that you may want to consider storing the attributes in a table rather than a hash: [{ name: "a1", value:"blah"}, {name:"a2", value:"foo"}, ...]
[07:54:45] <appledash> Why would I want to do that?
[07:55:33] <kali> appledash: it makes many things easier: mongodb is designed around schema where document keys are keywords, not arbitrary values
[07:56:20] <kali> for instance, your previous selection would become { "attributes.value": "Foo" } regardless the number of attributes
[07:56:39] <kali> and you can optimize all the case with only one index
[07:56:51] <appledash> But usually the query would be like 10 or so attributes all with different values
[07:57:56] <kali> can you show us a more real life document ?
[07:58:36] <appledash> Sure, give me a few moments.
[07:59:12] <kali> you can only go so far with foos and bars :)
[08:00:26] <appledash> Mhm
[08:05:23] <appledash> kali: https://gist.githubusercontent.com/AppleDash/32518e17866c7bbf0ed7/raw/19ddb617b31849fefe74f180a91ab5aa0416bb73/gistfile1.txt If my query is like {gender: "male", hairColor: "blue"} then I want it to return AppleDash followed by Gyro and Inky, the last 2 it doesn't matter what order in.
[08:06:12] <appledash> This is a small example
[08:06:29] <appledash> Even closer to real life would be like 50-100 attributes and querying like 20-30 of them at once
[08:07:19] <kali> appledash: you should read this: http://blog.mongodb.org/post/59757486344/faceted-search-with-mongodb
[08:14:19] <appledash> kali: This seems like what I want to do, but how does this help me with ordering?
[08:20:01] <appledash> Perhaps I can just query using $or and then sort the results in code
[08:22:02] <kali> it's independent of ordering
[08:24:16] <kali> oh, i think i get it now. you want all documents that match at least one criterion, with the ones that have the most matches coming first
[08:25:43] <appledash> Yes
[08:25:45] <appledash> That is correct
[08:25:54] <kali> you could have said so earlier :)
[08:26:05] <kali> anyway. the facet schema will help with index
[08:27:14] <kali> for the rest... it's a bit tricky. fetching everything and sorting in code is a valid option as long as all criteria are selective. but if you have gender in the criteria, then you'll end up fetching half your database in the code
[08:27:46] <kali> you may manage it with a generated aggregation query
[08:28:06] <appledash> It will most likely be a bit more specific than just gender
[08:28:42] <kali> yes, but you need to fetch all docs that match at least one criteria
[08:29:43] <kali> $match: {facetted search}, $unwind: attributes, $match: { $facetted search }, $group: { _id: _id, score: count(1)}, $sort: { score: -1 }
[08:30:09] <appledash> Oh, you are right about that, yeah
[08:30:26] <kali> that should work, and all the work will be done in the database instead
[08:33:00] <appledash> Alright
[08:33:23] <appledash> Right now can I just replace faceted search with a basic $or query for testing?
[08:33:35] <kali> yeah
[08:33:47] <kali> mmm no.
[08:33:51] <appledash> no?
[08:33:54] <kali> you wont be able to $unwind
[08:34:15] <appledash> I see
[08:44:03] <appledash> kali: I am getting a "Can't canonicalize query: BadValue unknown top level operator: $match"
[08:44:19] <appledash> I have { } around it
[08:52:29] <appledash> OK, new one, "the group aggregate field 'score' must be defined as an expression inside an object"
[08:52:31] <appledash> wat
[09:03:18] <appledash> I made it work, but it returns stuff in the same order every time no matter what I specify as the search
[09:03:44] <appledash> If I query for hairColor blue or gender male it returns all 3
[09:03:50] <appledash> If I set gender to "blah" it only returns 2
[09:04:10] <appledash> If I set gender to "female" it returns all 3 but in the same order as the first query
[09:04:59] <appledash> And it seems that my help has died
[09:17:45] <kali> appledash: show me the query
[09:20:20] <appledash> kali: Sure
[09:20:25] <appledash> one sec
[09:20:38] <appledash> I've got it to the point where it is just returning nothing
[09:21:29] <appledash> https://gist.githubusercontent.com/AppleDash/a3b57b04f370afe0d9f9/raw/45a303b0a2fedc22b93514660a74e27258b8bef0/gistfile1.json
[09:21:34] <appledash> kali: ^
[09:21:40] <appledash> keep in mind I am working manually here
[09:25:42] <kali> appledash: the second match is wrong
[09:26:04] <kali> it should be exactly the same as the first one
[09:26:38] <appledash> I tried it exactly the same
[09:26:43] <appledash> It returns also nothing.
[09:27:15] <kali> ha ! yeah
[09:27:18] <appledash> Hm?
[09:27:48] <kali> the $elemMatch must go
[09:28:00] <kali> second match must be: {$match: {"$or": [{ "attributes": { name: "gender", val: "female" }}, {"attributes": {name: "hairColor", val: "blue"}}]}}
[09:29:18] <appledash> It is... Kind of working?
[09:29:30] <appledash> It is returning duplicate results though, is that meant to happen?
[09:30:07] <kali> show me
[09:31:17] <appledash> Err, now that I think about it, it is also returning results in the wrong order...
[09:32:01] <appledash> https://gist.githubusercontent.com/AppleDash/4fa3163f6e576b408971/raw/244dd382eb18aeced8bfdb81991e45a82de18365/gistfile1.txt
[09:32:16] <appledash> That is not really the right order
[09:32:30] <appledash> It isn't even backwards or something
[09:32:33] <appledash> it is just... wrong
[09:32:41] <kali> yeah yeah yeah
[09:32:43] <kali> hang on
[09:32:50] <kali> i'll get this thing working :)
[09:32:52] <appledash> Uhhh
[09:32:54] <appledash> I forgot
[09:32:58] <appledash> I removed something
[09:33:01] <appledash> I removed the $group
[09:33:06] <kali> ha :)
[09:33:29] <appledash> After re-adding group
[09:33:31] <appledash> it seems to work fine
[09:33:44] <kali> good
[09:33:55] <appledash> thanks for the help
[09:34:05] <kali> you're welcome
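Assembled from the stages worked out above (with the $group expression spelled out the way the "must be defined as an expression inside an object" error demands), the pipeline appledash ends up with should look roughly like this. The collection name is assumed, field names follow the earlier attributes example, and the exact embedded-document matches assume each attributes entry contains only name and val:

    db.people.aggregate([
        // keep documents matching at least one criterion
        { $match: { $or: [ { attributes: { name: "gender",    val: "male" } },
                           { attributes: { name: "hairColor", val: "blue" } } ] } },
        // one document per attribute entry
        { $unwind: "$attributes" },
        // keep only the entries that matched
        { $match: { $or: [ { attributes: { name: "gender",    val: "male" } },
                           { attributes: { name: "hairColor", val: "blue" } } ] } },
        // count matching attributes per original document
        { $group: { _id: "$_id", score: { $sum: 1 } } },
        // most matches first
        { $sort: { score: -1 } }
    ])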
[12:42:18] <vineetdaniel> hi
[12:42:41] <vineetdaniel> i can see a lot of repl write worker in db.currentOp() is it normal ?
[12:42:55] <vineetdaniel> around 16 such operation in progress
[13:12:51] <saturne_> hello
[13:37:37] <saturne> noob question. in mongodb, dbs is the DATABASE, and collection is THE "table", true?
[13:38:13] <chetandhembre__> yup
[13:40:36] <saturne> oky :)
[13:40:38] <saturne> thanks
[14:08:49] <Zelest> I got a replicaset of 3 nodes.. 2 dedicated servers and 1 virtual machine.. now we plan on adding a third dedicated server and I wonder how to do this without breaking anything? As I've understood, you should have an non-equal number of nodes due to the election and such?
[14:09:13] <Zelest> Can I reconfigure the virtual node to never vote or something? or what is the proper way?
[14:15:01] <cheeser> you could add an arbiter, too.
[14:16:07] <Zelest> hmms... true
[14:58:08] <deviantony> ahoy
[14:58:32] <deviantony> I've setup MMS automation agent on a mongodb node hosted in our datacenter
[14:58:50] <deviantony> but after the setup of the monitoring agent, I keep getting the same log line
[14:58:59] <deviantony> Nothing to do. Either the server detected the possibility of another monitoring agent running, or no Hosts are configured on the MMS group.
[14:59:04] <deviantony> Am I missing something?
[16:21:54] <drecute> Hi there
[16:22:50] <drecute> I imported data in csv into mongodb in tsv format and my collection look like this:
[16:24:20] <drecute> > db.jivecomment.find().limit(5) { "_id" : ObjectId("547591c4dd6c60beece741cc"), "commentid,parentcommentid,objecttype,objectid,parentobjecttype,parentobjectid,userid,name,email,url,ip,body,creationdate,modificationdate,status" : "1001,,1464927464,1016,2020,1153207,8432,,,,10.122.64.4,<body><span>Nothing could be truer than that...</span></body>,1309515967414,1309515967414,2" }
[16:24:40] <drecute> is there a way to transform it?
[16:26:38] <GothAlice> … re-import it as CSV instead of TSV… since those are commas, not tabs.
[16:28:53] <drecute> GothAlice: I imported as csv initially, but mongo dropped some data.
[16:29:47] <Mmike> Hola, lads. When i do rs.initialize(), mongod first gets into STARTUP, then STARTUP2, then SECONDARY and then PRIMARY?
[16:29:49] <Mmike> Right?
[16:30:13] <GothAlice> drecute: Then your input CSV is invalid. (I.e. didn't escape or quote extra comma usage, etc.) This is an unfortunate state for your data. If mongoimport can't do it, then I'd write a light-weight Python script to slurp the data in using a more flexible CSV parser, and one where you can catch abnormal data prior to insert.
[16:30:40] <GothAlice> drecute: https://docs.python.org/2/library/csv.html#examples — or the equivalent in your preferred language.
[16:31:12] <GothAlice> Mmike: I don't think it'll go through secondary before becoming the primary in the situation where no other hosts are known (i.e. right after rs.initialize())
[16:31:33] <GothAlice> Mmike: Sounds like something relatively simple to test, though. :)
[16:32:17] <Mmike> well
[16:32:20] <Mmike> <- dork
[16:32:23] <tim_t> I have a program that enqueues email messages in mongo and i have a separate thread dedicated to picking up messages and sending them. Currently I am polling every 10 seconds (just picked a number) and grabbing a list of messages, sending them and clearing the queue. I do not like this polling as it seems to be bad form. Is there some sort of callback or some other asynchronous method mongo has I can use instead?
[16:32:32] <Mmike> it's clear from the log file what happens :)
[16:32:49] <GothAlice> Mmike: https://gist.github.com/amcgregor/4207375
[16:33:02] <GothAlice> Includes link to working library and runner. :)
[16:34:08] <GothAlice> Mmike: So, what you have is a queue. You've got something pushing data into the queue, and something pulling data out of the queue to work on it. Capped collections in MongoDB offer pretty much the exact solution you are looking for: insert-order, high-performance, and the ability to have a pending cursor "tail -f" the collection to get push notifications of new data.
[16:34:41] <Mmike> tim_t, I guess she's talking to you :)
[16:34:48] <GothAlice> Aye, sorry.
[16:34:58] <GothAlice> Silly IRC client colours both of you the same. :/
[16:34:59] <Mmike> futures! Didn't know about that one!
[16:35:46] <Mmike> eh
[16:35:48] <Mmike> python3 :/
[16:35:59] <GothAlice> Nope.
[16:36:02] <GothAlice> pip install futures
[16:36:09] <GothAlice> Should work all the way back to 2.5.
[16:36:19] <GothAlice> It's just bundled in core in 3.
[16:36:19] <Mmike> python-concurrent.futures - backport of concurrent.futures package from Python 3.2
[16:36:21] <Mmike> wo wo wo
[16:36:23] <GothAlice> :)
[16:36:36] <Mmike> (that's what my kid does when he discovers something new that he likes :) )
[16:36:44] <GothAlice> Futures are awesome. That presentation and sample library I linked I'm turning into https://github.com/marrow/marrow.task — but I'm stuck waiting on https://jira.mongodb.org/browse/SERVER-15815 before I can fully replicate the capabilities of Futures within the context of MongoDB.
[16:37:01] <tim_t> okay, thanks for the lead GothAlice
[16:39:04] <GothAlice> Mmike: As a note, the default thread pool worker (both in the back-port and in core in 3) is a terrible abomination of a thread pool implementation. It does terrifyingly dumb things, and shoots itself in the foot performance-wise. https://github.com/marrow/marrow.util/blob/develop/marrow/util/futures.py is my replacement thread pool implementation that features auto-scaling, thread exhaustion, etc. and is ~30% faster. :)
[16:42:11] <GothAlice> tim_t: And no worries. Apologies again for the name confusion. ^_^
[16:42:23] <foofoobar> Hi. I have a collection of my finances with entries like {date:’…’, amount: -210, description:’’)}. I want to get the income+spendings for a single month (or better said: for every month in the last year).
[16:42:28] <foofoobar> What is the best way to do this?
[16:42:50] <foofoobar> Should I get all recors of the last year and add them to months manually (nodejs) or can mongodb do this?
[16:43:09] <GothAlice> foofoobar: What you're looking for is aggregation: http://docs.mongodb.org/manual/core/aggregation-introduction/
[16:43:49] <GothAlice> The first example explained on that page should be applicable to your query. :)
[16:46:51] <foofoobar> GothAlice: so the $match would then do a match on a single month?
[16:55:08] <GothAlice> foofoobar: Or whatever other range you desire, aye.
[16:55:56] <foofoobar> GothAlice: I want to calculate it for each month
[16:56:19] <foofoobar> so the question is, am I doing this query for each month or is it possible to make a single query which does this for all months
[16:58:33] <GothAlice> You can do both.
[16:59:15] <GothAlice> I'll note that "making a single query that does this for all months" will be an ever-growing query. (I.e. there may be no upper limit on the number of records that need to be scanned in order to calculate the total.) To avoid large recalculations at work we "pre-aggregate" our data.
[16:59:53] <foofoobar> ok
[17:00:04] <GothAlice> We track clicks, so when a click comes in we insert a record like this: https://gist.github.com/amcgregor/94427eb1a292ff94f35d
[17:00:16] <GothAlice> This will keep track of "per hour" statistics; for your case, you'd want per-month.
[17:01:15] <GothAlice> (So "hour" would be "month" in your case, and you'd strip off the day and hour in addition to minute, second, and microsecond from the date.) Then you'll always have the latest statistics by querying one record per month tops, instead of one record per transaction per month.
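Applied to the finances case, that pre-aggregation pattern might look like the following on every new transaction — the monthly_totals collection and its shape are assumptions, and the month key keeps day = 1 (never 0, see the discussion just below):

    // d = transaction date, amount = signed value of the transaction
    var month = new Date(d.getFullYear(), d.getMonth(), 1)

    db.monthly_totals.update(
        { _id: month },                        // one bucket document per month
        { $inc: {
            balance: amount,
            credit:  amount > 0 ? amount : 0,  // split computed client-side
            debit:   amount < 0 ? amount : 0
        } },
        { upsert: true }
    )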
[17:01:47] <foofoobar> GothAlice: http://hastebin.com/zidohupeye.js
[17:01:49] <GothAlice> foofoobar: See http://docs.mongodb.org/ecosystem/use-cases/pre-aggregated-reports/ for more information.
[17:01:51] <foofoobar> Why does this match?
[17:02:18] <GothAlice> Because both of the dates are greater than the one provided?
[17:02:28] <GothAlice> (30 > 0)
[17:02:41] <foofoobar> new Date(2014, 11, 0) is december
[17:03:10] <GothAlice> No, it's not.
[17:03:29] <GothAlice> Run that in a JS REPL: http://cl.ly/image/1B0m2K3e0x33
[17:03:52] <GothAlice> So in your case, 23h > 0h on the same day.
[17:04:11] <foofoobar> How bogus is that..
[17:04:19] <foofoobar> new Date(2014, 0, 0) is 2013, dec 31 ?!
[17:04:20] <GothAlice> The attempt to use a day value of zero is bogus.
[17:04:32] <GothAlice> In most sane languages, zero is not a valid option for any of those values.
[17:04:58] <foofoobar> Why is this not throwing an error?
[17:05:29] <foofoobar> Yeah..
[17:06:05] <GothAlice> foofoobar: https://www.destroyallsoftware.com/talks/wat
[17:07:44] <foofoobar> I’ll have a look at it. First trying to get my aggregate :>
[17:07:54] <foofoobar> So this is my current query: db.entries.aggregate([{$match: {Valutadatum: { $gt: new Date(2014, 10, 0)}}}, {$group: {total: { $sum: "Betrag" } }}])
[17:08:12] <foofoobar> It does not work because I have to input an _id I think.. "errmsg" : "exception: a group specification must include an _id",
[17:08:12] <GothAlice> Stop using zeros in dates.
[17:08:36] <GothAlice> Aye; you're not grouping on anything.
[17:08:45] <GothAlice> In your case, you want to group on month.
[17:09:34] <foofoobar> GothAlice: Ah yes, I had the wrong Date() thing, there should be a ‚1‘ ;)
[17:09:59] <GothAlice> No, there should never be zeros in any date element. Time, sure, date, not so much.
[17:10:13] <GothAlice> So you need: {$group: {_id: {year: {$year: "$Valutadatum"}, month: {$month: "$Valutadatum"}}, total: {$sum: "$Valutadatum"}}
[17:10:44] <foofoobar> but then the date is summed
[17:10:50] <foofoobar> I want Betrag to be summed
[17:10:59] <GothAlice> Well, Betrag isn't a summable value.
[17:11:04] <GothAlice> You've got it as a string for some reason.
[17:11:22] <foofoobar> No it’s a float. "Betrag" : -52.6,
[17:11:23] <GothAlice> http://docs.mongodb.org/manual/reference/operator/aggregation/#date-operators — btw, there are quite a few values you can pull out of dates. :)
[17:11:35] <GothAlice> "FOLGELASTSCHRIFT" — your example returns a string.
[17:11:40] <GothAlice> From: http://hastebin.com/zidohupeye.js
[17:11:50] <foofoobar> Thats „Buchungstext"
[17:12:02] <foofoobar> I stripped out Betrag because it was not relevant in this example.
[17:12:24] <GothAlice> Except that it is; try not to assume what is or is not relevant about a query when attempting to work on said query. :/
[17:12:42] <GothAlice> Regardless, sum your values. Also, don't store financial values as floats, ever.
[17:12:49] <foofoobar> Sorry. This is a complete example: http://hastebin.com/fesoricuzu
[17:12:59] <GothAlice> (If USD or other "cents" system, store number of cents as an integer.)
[17:13:44] <foofoobar> Yeah I know about that problem. Currently it’s not relevant because I’m just trying to understand how I can do things like this with mongodb
[17:13:51] <GothAlice> (In a JS REPL such as your browser, 0.1 + 0.2 != 0.3.)
[17:14:23] <GothAlice> It's a bad habit, though, even to do that in testing… mostly because of *how abysmally wrong* things can be.
[17:15:09] <GothAlice> Betrag, though, certainly is what you want to sum. Did you catch the _id addition in my last example?
[17:17:04] <GothAlice> "Beguenstigter/Zahlungspflichtiger" — your choice of keys worries me greatly, BTW. One should not have symbols other than _ in key names, for a variety of reasons including consistency of field access. (You can't use an attribute reference to get access to that field.)
[17:17:35] <foofoobar> GothAlice: It’s parsed from a csv, I need to clean this up.
[17:17:44] <GothAlice> foofoobar: Additionally, the full string value of each key (plus six bytes) is stored against every single document in the collection; which in your case will add up to more storage dedicated to the keys than the actual values.
[17:22:02] <foofoobar> GothAlice: http://hastebin.com/onetezobep.js
[17:22:36] <foofoobar> I think it works..
[17:22:44] <GothAlice> That range pair should be $gte / $lt (Greater than or equal to the start of month X, less than the start of month X+1.)
[17:23:18] <GothAlice> (Your current range query would have the effect of ignoring any transaction at midnight on the morning of the 1st.)
[17:23:30] <GothAlice> Otherwise should be great. :)
[17:26:26] <foofoobar> Wow. My spendings are a way too much :>
[17:28:27] <foofoobar> I’m getting a different value when doing this by hand. Something still needs some improvement
[17:29:50] <GothAlice> Also, for bonus points, it'd be interesting to have two $sum's, each $if'd to only $sum the positive and negative values separately. (Thus credit vs. debit totals on the account.)
[17:31:48] <foofoobar> I’m currently only filtering for negative values
[17:34:17] <GothAlice> https://gist.github.com/amcgregor/1ca13e5a74b2ac318017 is an example of one of my aggregates from work, this one doing a breakdown by day-of-the-week. :)
[17:34:37] <GothAlice> (I.e. you could find out that it's your Friday night partying that's causing your surprising spendings. ;)
[17:44:47] <tim_t> erm... any java example i can look at that does the tailing trick?
[17:46:34] <GothAlice> http://docs.mongodb.org/manual/tutorial/create-tailable-cursor/ includes a C++ example… see also: https://github.com/deftlabs/mongo-java-tailable-cursor
[17:46:37] <GothAlice> tim_t: ^
[17:47:22] <kali> https://github.com/deftlabs/mongodb-examples/blob/master/mongo-java-tailable-cursor-example/src/main/com/deftlabs/examples/mongo/TailableCursorExample.java
[17:47:29] <kali> too late
[17:47:47] <GothAlice> Bada bing. Nah; that's better than the deflabs link of mine, mine is very beta.
[17:47:54] <tim_t> woo thanks
[17:48:24] <kali> GothAlice: it's the same on :)
[17:48:27] <kali> +e
[17:48:59] <GothAlice> kali: The one I provided is an extension to the default driver to improve support for event-based tailing. It isn't really a simplified example of how to do it.
[17:49:00] <kali> but i had a quick look and from 30000 feet, it looks right
[17:49:15] <GothAlice> (So much code to do so little… T_T)
[17:49:33] <kali> GothAlice: well... java
[17:54:21] <tim_t> fascinating... it is still a polling technique. makes sense though given the context.
[17:55:04] <GothAlice> Well, no, not quite.
[17:55:12] <GothAlice> TAILABLE + AWAITDATA = blocking.
[17:56:07] <GothAlice> The main Reader loop is basically identical to my Python example: loop over the cursor as long as it's alive, when it eventually dies (times out, for example), automatically retry.
[17:56:24] <GothAlice> This Java version is hideous from the standpoint of tracking all IDs seen in a HashSet…
[17:56:33] <GothAlice> (Instead of simply the "last" one seen, for the purposes of retrying…)
[17:57:40] <GothAlice> The sleep(100) is to handle the edge case where the collection is completely empty—the cursor will always immediately return null in that case, so retrying is needed until some data, any data is added to the collection. (Thus the {nop: True} record in my Python example.)
[18:17:28] <GothAlice> tim_t: http://shtylman.com/post/the-tail-of-mongodb/ goes into depth on the mechanisms at play and how they affect performance. (That sleep(100) will introduce sawtooth lag.)
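A bare-bones shell illustration of the capped-collection tail GothAlice describes (the collection name is assumed, the {nop: true} seed comes from her description, and error handling plus the retry/sleep loop are omitted):

    // fixed-size, insert-order collection acting as the queue
    db.createCollection("emails", { capped: true, size: 16 * 1024 * 1024 })
    db.emails.insert({ nop: true })   // seed so a tailable cursor has data to attach to

    // consumer: tailable + await-data cursor blocks waiting for new inserts
    var cursor = db.emails.find({ nop: { $exists: false } })
                          .addOption(DBQuery.Option.tailable)
                          .addOption(DBQuery.Option.awaitData)
    while (cursor.hasNext()) {
        var msg = cursor.next()
        printjson(msg)   // send the email here; remember msg._id as the last one processed
    }
    // when the cursor dies or runs dry, re-create it and resume after the last _id seen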
[18:33:17] <hexus0> is there anything I should be worried about if I set up a 3 server replica set where each server is in a geographically different data center?
[18:34:03] <GothAlice> hexus0: Other than needing to pay closer attention to where your queries are directed, not really. See: http://docs.mongodb.org/manual/data-center-awareness/
[18:34:24] <hexus0> Got it, thanks!
[18:34:26] <tim_t> oh nice. no need to remove entries with this technique
[18:34:51] <hexus0> I was reading the replica set docs and it said to put the majority of servers in the primary data center, but didn't touch much on having an equal number of servers in different data centers
[18:35:10] <GothAlice> tim_t: Indeed. Though you very likely will need to track where you "leave off" each time the worker quits. (Or atomically change a boolean flag on each processed entry and filter on that.)
[18:36:38] <GothAlice> hexus0: The issue would be that if the primary DC loses connection to the two foreign DCs, it'll stop functioning. (It'll become a read-only secondary until a connection can be re-established with either of the other DCs.)
[18:37:22] <hexus0> Yea thats what I figured
[18:37:44] <hexus0> Would it be smarter, then, to have a 4 server replica set with two servers in the primary data center and an arbiter?
[18:38:10] <GothAlice> Generally you would have a minimal HA replica set in the primary DC, then secondaries in offsite DCs to speed up local querying (and to provide offsite streaming backup).
[18:38:36] <GothAlice> "Minimal HA" = high availability = two mongod replica members and an arbiter.
[18:39:24] <GothAlice> That way you don't rely on outside DC connections to maintain local write-capability.
[18:39:57] <hexus0> when you say 2 mongod replica members, you mean in addition to the primary
[18:40:07] <hexus0> so that entire set would be 5 mongo instances and 1 arbiter
[18:40:39] <GothAlice> No, one will be a primary, the other a secondary, then two more, one in each of the two foreign DCs with a lower priority than the ones in the main DC (to ensure they never become primary). And an arbiter to keep the number of hosts odd.
[18:40:54] <hexus0> oh got it!
[18:41:01] <GothAlice> :) That'd be the "minimum" three-DC setup.
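In config terms, that minimum three-DC layout would look something like this — hostnames are placeholders, and priority: 0 is one way to guarantee the remote secondaries are never elected primary:

    rs.initiate({
        _id: "rs0",
        members: [
            { _id: 0, host: "dc1-a.example.net:27017" },
            { _id: 1, host: "dc1-b.example.net:27017" },
            { _id: 2, host: "dc1-arb.example.net:27017", arbiterOnly: true },
            { _id: 3, host: "dc2.example.net:27017", priority: 0 },
            { _id: 4, host: "dc3.example.net:27017", priority: 0 }
        ]
    })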
[18:41:35] <hexus0> thanks! :)
[18:42:26] <GothAlice> hexus0: Definitely investigate http://docs.mongodb.org/manual/data-center-awareness/ though — you'll want to be careful about where queries get sent.
[18:42:49] <hexus0> Doing that right now :) digging through the docs
[21:06:54] <foofoobar> GothAlice: sooo
[21:07:12] <foofoobar> I got it working now with the query we composed.
[21:07:13] <GothAlice> foofoobar: How goes your querying?
[21:07:15] <GothAlice> \o/
[21:07:38] <foofoobar> This is my code: http://hastebin.com/ecumaxerid.js
[21:07:44] <foofoobar> Now I want to do the following things:
[21:08:18] <foofoobar> 1) The query in the code is just calculating everything below 0 (spendings), I now need to get income, too. Do I make a separate query for this?
[21:08:32] <GothAlice> Nope, you can do both with one.
[21:08:53] <foofoobar> 2) I want to get the values I calculated with (1) also for a few others months, do I make a query for all of them?
[21:09:00] <foofoobar> GothAlice: Can you hint me how?
[21:11:30] <GothAlice> {$group: {_id: {…}, credit: {$sum: {$cond: {if: {$gt: ["$value", 0], then: "$value", else: 0}}, debit: …}
[21:11:35] <GothAlice> Something like that should do.
[21:12:20] <GothAlice> Basically what that says is, the "credit" sum is the $sum of "value" (in German from your sample data ;) for only positive values. (You'd do the reverse for debit.)
[21:13:30] <GothAlice> Then when running that one query, you get both answers. (You could also have a "balance" final balance that is the sum of both positive and negative.)
[21:13:39] <GothAlice> foofoobar: ^
[21:15:17] <golya> Hi. I am new to mongodb, and have the following basic question: how do you model one-to-manys? If A has many Bs, and one B has many Cs, and I'd like to query for such Cs, which belong to a specific A, how to design collections?
[21:15:21] <GothAlice> foofoobar: {$group: {_id: {…}, credit: {$sum: {$cond: {if: {$gt: ["$value", 0]}, then: "$value", else: 0}}, debit: …} — I forgot a closing } after the $gt. :)
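Put together, the query foofoobar is building looks roughly like this — the field names Valutadatum and Betrag come from the sample documents above, while the date range is only illustrative:

    db.entries.aggregate([
        { $match: { Valutadatum: { $gte: new Date(2014, 0, 1), $lt: new Date(2015, 0, 1) } } },
        { $group: {
            _id:     { year: { $year: "$Valutadatum" }, month: { $month: "$Valutadatum" } },
            credit:  { $sum: { $cond: { if: { $gt: [ "$Betrag", 0 ] }, then: "$Betrag", else: 0 } } },
            debit:   { $sum: { $cond: { if: { $lt: [ "$Betrag", 0 ] }, then: "$Betrag", else: 0 } } },
            balance: { $sum: "$Betrag" }
        } },
        { $sort: { "_id.year": 1, "_id.month": 1 } }
    ])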
[21:16:09] <GothAlice> golya: http://www.javaworld.com/article/2088406/enterprise-java/how-to-screw-up-your-mongodb-schema-design.html contains a nice short summary of how to bridge ways of modelling data relationally with how MongoDB generally models things.
[21:16:32] <foofoobar> give me some minutes, I need to understand this :)
[21:19:37] <GothAlice> golya: The example I usually fall back on is that of forums (like phpbb). In relational you have a table of replies, a table of threads, and a table of forums. You'd JOIN across these to answer questions like "all replies to a thread", or "search for all replies by user X, grouped by forum/thread".
[21:19:39] <GothAlice> In MongoDB, because you often want the replies with a thread (when looking at a thread), I embed the replies as a list of embedded sub-documents. A la thread = {title: "Awesome thread!", creator: ObjectId(GothAlice.id), contents: [{creator: …, comment: "Awesome."}, …]}
[21:20:54] <golya> Well, if you are in, I describe my mini-problem, with 3 entities, and tell about that.
[21:21:00] <GothAlice> However if I need to answer that other question, I need to perform two queries: first, look up the user ID by username (usually), then a second query to find all comments by that user. (db.thread.find({"contents.creator": GothAlice.id}))
[21:22:25] <foofoobar> GothAlice: Works! Awesome :) http://hastebin.com/guperajoga.js
[21:22:43] <foofoobar> So when doing this for the last 3 months, I’m executing this query 3 times with different dates, correct?
[21:22:43] <GothAlice> golya: I would need to know more about your use case and desired use of the data than just "A has Bs, B has Cs, I need Cs by A."
[21:22:58] <golya> I'd like to design a site for collectors. Admins will upload collections (finite predefined set of say cards).
[21:23:17] <GothAlice> foofoobar: You could do that, but since you're grouping by year/month you'll automatically get back one record per month regardless of the range you are requesting.
[21:23:28] <golya> Users would mark, which cards they need, and which ones they have extra to offer in exchange.
[21:23:50] <golya> So A: a collection of cards, B: one particular card type in a collection
[21:24:18] <golya> and C: storing somehow, that a particular user needs a concrete B, or has offered B to exchange
[21:24:34] <golya> I think I can embed the cards to the collections
[21:24:52] <golya> But not sure where to store user needs / offers
[21:25:18] <GothAlice> golya: One simple approach is to store the ObjectIds of the cards the user needs and offers in two lists.
[21:26:12] <GothAlice> (That'd give you a maximum number of "wanted" and "offered" cards combining to just over one million per user.)
[21:27:05] <golya> GothAlice: ok, so a collection for UserNeed, which references cards, and users (by object _ids)
[21:27:08] <GothAlice> No.
[21:27:17] <GothAlice> Store the list of needs / offers directly in the user record.
[21:27:47] <GothAlice> {username: "GothAlice", needs: ["Charizard", "Pikachu"], offers: ["Squirtle", "Diglett"]} (except with the IDs of the cards rather than their names. ;)
[21:28:33] <golya> GothAlice: ok, clear. then, the next q: could I query user needs, which are in a given collection?
[21:28:57] <golya> In User document, there are just card ids
[21:29:13] <GothAlice> golya: Very much so. For example, to ask the "user" collection for all usernames that want Pikachu: db.user.find({needs: "Pikachu"})
[21:29:22] <GothAlice> Er, all users that want.
[21:29:35] <GothAlice> All usernames that want would be: db.user.find({needs: "Pikachu"}, {username: 1})
[21:30:07] <golya> Well, you missed one abstraction level, the Collection, but anyway, I figured it out myself
[21:30:18] <golya> You can store user needs by collections in the User object
[21:30:51] <golya> {username: "GothAlice", needs: {"collection1":["Charizard", "Pikachu"], "collection2":["card"]}}
[21:31:34] <GothAlice> That will not do what you expect. :)
[21:31:58] <GothAlice> If "needs" is a mapping/dict/object/subdocument like that, you'll have some extreme difficulty applying indexes.
[21:32:13] <GothAlice> (Since the "path" to the field you want to index will depend on how people have named their "collections".)
[21:33:08] <GothAlice> golya: Instead you could store it as a list of collections. {username: "GothAlice", collection: [{name: "collection1", needs: ["Charizard", …], offers: […]}, {name: "collection2", needs: […], offers: […]}]}
[21:33:10] <golya> GothAlice: collections are not user named groups, they are distinct world of cards, named by the admin.
[21:33:46] <GothAlice> Still, you would have to index many values instead of two values as in my example. (Index on collection.needs and collection.offers.)
[21:34:05] <GothAlice> Vs. indexing on needs.collection1, needs.collection2, … needs.collectionX
[21:34:26] <GothAlice> Then offers.collection1, … offers.collectionN.
[21:34:38] <foofoobar> GothAlice: Yeah I can extend the matched range, awesome!
[21:35:23] <golya> GothAlice: ok, I haven't though about indexing. So you said the array will be more easily indexable.
[21:35:36] <GothAlice> A list of collection sub-documents remains queryable, too. You can then ask questions like, give me all users that have marked needs or offers on collection X. (db.users.find({"collection.name": "collection1"}))
[21:36:07] <GothAlice> golya: It'll be the only way you could make that data indexable without extracting it and jumping through a whole lot of hoops to pretend that MongoDB knows what a relationship is. (You'd have to do all that manually. ;)
[21:36:49] <GothAlice> foofoobar: The magic of $group on extracted date components. :)
[21:37:08] <golya> GothAlice: so generally [{key:"value of key", field1: value1}] fits mongodb better, than [ "value of key": { field1: value1}]
[21:37:30] <GothAlice> Yes. Variable key names will lead to insanity and possibly kicking of your dog, if applicable. ;)
[21:37:31] <golya> than { "value of key": { field1: value1}}
[21:37:50] <golya> sure, thanks for clarifying
[21:38:16] <GothAlice> (Notably again you'd only need one index to cover all of them vs. one index per possible grouping.)
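Concretely, with the list-of-subdocuments layout a couple of multikey indexes cover every query discussed here (shell syntax of the era; field names follow the example schema above):

    db.user.ensureIndex({ "collection.needs":  1 })
    db.user.ensureIndex({ "collection.offers": 1 })
    db.user.ensureIndex({ "collection.name":   1 })   // "who is interested in collection X"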
[21:39:18] <golya> It just doesn't feel good to me to store data I need to search on in an array.
[21:39:40] <golya> I mean, in pure js
[21:39:51] <GothAlice> golya: There's no JS involved here. :/
[21:41:04] <GothAlice> golya: An example implementation that does this, again back to my forums example: https://github.com/bravecollective/forums/blob/develop/brave/forums/component/thread/model.py#L58-L73 (model) — This adds a new comment to a thread: https://github.com/bravecollective/forums/blob/develop/brave/forums/component/thread/model.py#L99-L110
[21:41:08] <GothAlice> https://github.com/bravecollective/forums/blob/develop/brave/forums/component/thread/model.py#L124-L127 gets a specific reply from whatever thread it's in…
[21:41:36] <GothAlice> And in this case it's handy to ask for the first or last reply to a thread: https://github.com/bravecollective/forums/blob/develop/brave/forums/component/thread/model.py#L166-L192 — that's possible, too.
[21:41:54] <GothAlice> MongoDB lets you query and manipulate nested data structures extremely easily.
[21:42:07] <golya> GothAlice: ah, so what bothers me: by grouping needs by collection, you state your card-collection relation twice
[21:42:44] <GothAlice> golya: As an example from MtG, could you not have a card in multiple collections? A la Wrath of God reprints? ;)
[21:43:23] <GothAlice> Would it not be most efficient to store the card once, and just reference in the card data: {name: "Wrath of God", collection: ["Beta", "Morridin", "8th", …]}
[21:43:44] <golya> GothAlice: no, in my model, every card belongs to a single collection
[21:44:07] <GothAlice> golya: Often what appears as data duplication in MongoDB is actually an optimization to allow you to more easily query your data.
[21:44:33] <GothAlice> While yes, the card may belong to only one collection, explicitly having the collection referenced in the user allows you to perform additional queries in the absence of "join" support.
[21:45:05] <GothAlice> (I.e. find me all users interested in collection X.)
[21:45:34] <GothAlice> To answer that question without the "extra" reference would require loading the IDs of all cards in that collection then querying $in for _all_ of them. (Which would be a rather bad idea.)
[21:46:13] <GothAlice> (16 or so extra bytes per collection per user vs. potentially several megabytes on each query.)
[21:46:39] <golya> GothAlice: but back to screwing up mongodb by 1:1 mapping database tables to documents, how would it hurt, if I store userneeds as: {card:"id of card needed", user:"user id"}?
[21:46:53] <golya> in a UserNeed collection?
[21:47:08] <GothAlice> What types of queries (questions) do you need answered?
[21:48:08] <GothAlice> For example, even if you have to answer in SQL, how would you express the question: "What are the user names of every user interested in selling or buying 'Wrath of God'?"
[21:51:58] <styles> Hey guys i want to calculate data every hour, but during some hours data doesn't exist. Using aggrate is there a way to fill in missing data?
[21:52:37] <golya> GothAlice: good and practical point
[21:53:09] <GothAlice> golya: When embedding the needs/offers in the user, this question is trivial: var wrath_id = db.card.findOne({name: 'Wrath of God'}, {_id: 1})._id; db.user.find({$or: [{"collection.needs": wrath_id}, {"collection.offers": wrath_id}]})
[21:53:54] <GothAlice> (Two queries, very little data transfer, and blazingly fast if indexed. Get the ID of the card, query users' need/offer lists, regardless of collection, for this ID.)
[21:54:46] <GothAlice> styles: I deal with aggregate holes in my click tracking analytics by populating a "default value" for missing time periods in the application, rather than trying to enforce "empty record" creation inside MongoDB. (I.e. tolerating the missing data rather than populating it.)
[21:55:04] <styles> GothAlice, humm
[21:55:04] <GothAlice> styles: Python's "defaultdict" is quite useful for this purpose.
[21:55:17] <styles> so I should basically... do something like date time range and enumerate every value?
[21:55:35] <styles> I'm not sure what "defaultdict" is I'm writing everything in Go
[21:56:36] <GothAlice> Presumably you're enumerating it already to emit values… somewhere. Instead, enumerate the records into a placeholder structure that can provide data for missing values, then enumerate that placeholder.
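A sketch of that placeholder idea in JavaScript terms (styles is writing Go, so this is only the shape of it; the aggregation output is assumed to be one document per hour with the bucket start in _id and a count field):

    // results: e.g. [{ _id: ISODate("2014-12-01T05:00:00Z"), count: 42 }, ...]
    function fillHourlyGaps(results, start, end) {
        var byHour = {}
        results.forEach(function (r) { byHour[r._id.getTime()] = r.count })

        var filled = []
        for (var t = start.getTime(); t < end.getTime(); t += 3600 * 1000) {
            // real value when present, default of 0 for the missing hours
            filled.push({ hour: new Date(t), count: byHour[t] || 0 })
        }
        return filled
    }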
[21:57:08] <quattro_> is it ok to run mongodb on servers with non ecc ram when using replication?
[21:57:38] <GothAlice> golya: With the "joining" collection like that (which is flat-out not cricket in MongoDB for this type of use) you'd need to load up the card ID, then query the "joining" table to find all distinct user IDs, then query the users collection for *all* of those IDs. That could be tens of megabytes (or more) of data transfer in those ID lists alone.
[21:58:01] <GothAlice> quattro_: Aye. It's even relatively OK if you just have journalling enabled on a single host, but redundancy is always good.
[21:58:32] <styles> GothAlice, i like that
[21:58:34] <styles> thanks buddy
[21:58:43] <quattro_> GothAlice: awesome thanks
[21:58:52] <styles> it's been hard to get somebody to answer this question. I assumed that's how it should be done, but I really wish Mongo had support
[21:58:57] <quattro_> anyone already tried the new beta with compression?
[21:59:08] <GothAlice> styles: No worries. :) (defaultdict is such a structure in Python; "dict" -> "dictionary", Python's term for mappings, hash tables, JS objects, or "document" in MongoDB terms.)
[21:59:15] <styles> gotcha
[21:59:40] <GothAlice> quattro_: I rolled my own lzma compression at the application layer. ¬_¬
[22:01:03] <quattro_> GothAlice: how much data are you storing?
[22:01:09] <GothAlice> quattro_: 25 TiB.
[22:02:31] <quattro_> ah yeah mysql did use a lot less disk space
[22:03:04] <quattro_> I built a server monitoring service just for internal usage but I'm noticing a lot of data building up, a lot more than expected
[22:03:14] <quattro_> I think the compression would really be good for my data
[22:03:45] <GothAlice> MongoDB acts like a filesystem; it pre-allocates on-disk stripes which it then populates with data; like a filesystem, this storage can become fragmented.
[22:05:24] <quattro_> yeah i have 0 bytes left, was running on a 20gb ssd vps :(
[22:10:16] <GothAlice> :|
[22:10:18] <GothAlice> That's Very Bad™.
[22:10:54] <GothAlice> Hitting zero free is usually one of those "now we need to mongodump, clear everything, and mongorestore to get our indexes working again" situations.
[22:11:58] <GothAlice> quattro_: Something I find useful on space-constrained systems is "one directory per db" (lets you easily mount an extra drive and move the data around on a per-db basis if needed) and "smallfiles" (start at a much smaller stripe size.)
[23:06:23] <hahuang65> if I did a mongorestore on a collection of 959621954250, how long might that take?
[23:07:26] <hahuang65> It's been about 6 days now and it's only at 17%. Is that sort of speed normal?
[23:09:37] <hahuang65> araujo: whats up
[23:10:00] <hahuang65> araujo: sorry about that, mis-chat.
[23:17:05] <GothAlice> hahuang65: 959,621,954,250 records? Or bytes? — If those are records, even containing nothing other than their _id, that's 19 TiB of data.
[23:17:34] <hahuang65> GothAlice: well the progress says: Progress: 169302378663/959621954250
[23:17:42] <GothAlice> Yup, those are records, then.
[23:17:44] <hahuang65> GothAlice: is that bytes or records from mongorestore
[23:17:46] <hahuang65> okay
[23:18:04] <hahuang65> :\ so how long should something like this take? like months?
[23:18:15] <GothAlice> Quite a long time, yes. Let me do some napkin calculations.
[23:18:27] <GothAlice> Gigabit or 100mBit network?
[23:21:16] <hahuang65> GothAlice: giga
[23:23:26] <GothAlice> Across a 100MBit network operating at 80% efficiency (typical TCP overhead and whatnot) the data transfer alone (let alone disk committing, atomic operation modelling, etc. happening server-side) will require 23.3 days of transfer. Gigabit at 80% efficiency will take 2.33 days to simply transfer the data. If you are restoring into an existing collection, indexes are being built as you go, and if there is replication, all inserts are being handed
[23:23:26] <GothAlice> down through the replica set, too.
[23:24:03] <GothAlice> hahuang65: So yeah, that will take an insanely long period of time. It may be worthwhile to do the mongorestore locally, shut down the local server, and clone the database files themselves up to their final destination.
[23:24:21] <GothAlice> (Having mongorestore write directly to the files instead of via mongod.)
[23:25:02] <GothAlice> (And all those numbers are optimistic, assuming records whose only contents is an _id ObjectId field.)
[23:26:01] <GothAlice> hahuang65: If you can pastebin an "average" sample record for me, I can perform slightly more accurate minimum time estimates.
[23:29:29] <hahuang65> araujo: huh
[23:29:38] <hahuang65> araujo: arghhh sorry, I keep doing that
[23:30:02] <hahuang65> GothAlice: I can't do your suggestion because this is a dump from 2.0.4 and a restore into 2.6.x right?
[23:33:09] <GothAlice> hahuang65: That… could be an issue. You seem to be in a right spot. I'd still attempt a local direct-write restore… it'll be much faster, eliminate network constraints, etc., etc.
[23:33:55] <GothAlice> At the same time, you may need to consider pivoting your data structure to reduce the number of individual records being stored. (We pre-aggregate our click tracking data rather than storing each click as a separate record. This gives us per-hour accuracy instead of per-click accuracy, but it also means we generate no more than 24 records per day per job being tracked.)
[23:35:50] <hahuang65> GothAlice: we'd have to think about this one. Maybe we can do it locally. Sort of short on disk space though.
[23:36:18] <hahuang65> In a tight spot for sure.
[23:37:58] <hahuang65> GothAlice: Appreciate the chat and the tips :)
[23:38:17] <GothAlice> hahuang65: It never hurts to (try to?) help. :)
[23:51:09] <hahuang65> GothAlice: this sucks lol.
[23:51:40] <GothAlice> hahuang65: Generally one monitors the growth of data and would take corrective action before things got this bad. :/
[23:52:51] <hahuang65> GothAlice: generally one has an ops team that consists of more than 2 people also.
[23:52:55] <hahuang65> :\
[23:53:54] <hahuang65> GothAlice: doesn't help that our lead has put off this mongo upgrade for 3 years.
[23:54:45] <GothAlice> May I ask what that dataset represents? ('Cause it _is_ quite impressive that you have so much of whatever it is. ;)
[23:55:32] <hahuang65> GothAlice: activities that users perform. We're a gamification company.
[23:55:38] <GothAlice> :|
[23:56:16] <hahuang65> heh why :|
[23:56:33] <GothAlice> hahuang65: Now I'd *really* love to see an example record so that I may be able to provide assistance in optimizing that dataset.
[23:56:49] <appledash> Would anyone here be able to help me with modifying a query that someone else here helped me write last night? Here's the query: https://gist.githubusercontent.com/AppleDash/48e4d803e6fe870f7915/raw/85146717ad71839822b70451eba45319d51f1fe0/gistfile1.txt I'd like to modify it so that 'val' is an array of values instead of just one value, and instead of checking if the values for a given attribute name match
[23:56:51] <appledash> the queried values, I just want to see if the queried value is in the array of values.
[23:56:54] <hahuang65> GothAlice: PM me.
[23:57:45] <appledash> (This query is normally auto-generated by my code now, but that's the example one that was created last night)