[00:00:32] <Synt4x`> reading up on it now, 1 moment
[00:05:19] <Synt4x`> so I would do something like db.mydb.aggregate( [ { $group:{ date:{'date':date} }, person{$addToSet:person} }] )? and I do this while iterating through my Person list?
[00:09:12] <Synt4x`> actually I found an alternative I think, .distinct('person',{'date':date}) should work
[00:11:04] <joannac> don't you have to run that per date?
[00:13:21] <Synt4x`> @joannac yea I did dates = .distinct('date') , for c in dates: people = .distinct('person', {'date':c}) , and then will do a for x in people , .findOne('person':x) and have the data I want
[00:13:24] <Synt4x`> is that a bad way of doing it?
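
(A shell sketch of the iterative approach Synt4x` describes; the collection name "games" and the field names are assumptions from context, not from the log:)

    var dates = db.games.distinct('date');
    dates.forEach(function (d) {
        var people = db.games.distinct('person', { date: d });
        people.forEach(function (p) {
            // one full document per (person, date) pair
            printjson(db.games.findOne({ person: p, date: d }));
        });
    });

(This works, but it issues one query per person per date; the single aggregation discussed next does it in one round trip.)
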
[00:14:09] <Boomtime> that will work, but so will the single operation I quoted for the aggregation pipeline
[00:14:46] <Synt4x`> ok, what does that return to me the 1 line aggregation?
[00:15:52] <Boomtime> it gives you a cursor, similar to a find, but each document in it looks like this: { _id: [date], person: [ 'name1', 'name2', etc ] }
[00:16:54] <Synt4x`> ah ok awesome, and if I wanted to add 'score' to that as well I would just do: $group: { _id: "$date", person: { $addToSet: "$person" }. {$addToSet: "$score"} }
[00:16:55] <Boomtime> the _id field is unique, it is the distinct list of dates that exist
[00:17:13] <Boomtime> the person array in each document is the distinct list of names that occurred on that date
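
(The single-stage pipeline Boomtime is describing would look roughly like this in the shell; collection and field names are assumed:)

    db.games.aggregate([
        // one output document per distinct date, each carrying
        // the distinct set of people seen on that date
        { $group: { _id: "$date", person: { $addToSet: "$person" } } }
    ])
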
[00:18:18] <Synt4x`> but if a person has multiple entries under one date, i.e. "Joe Smith", "2014/9/25", "50" , "Joe Smith", "2014/9/25", "50", and so on (same score, same person, multiple entries)
[00:18:20] <Boomtime> this is the first time you've mentioned 'score' so I don't know what it is, but what you've written is not valid syntax
[00:18:23] <Synt4x`> it will keep all of them in right?
[00:18:45] <Synt4x`> sorry score is a decimal number representing how they performed
[00:18:50] <Boomtime> $addToSet will keep distinct only, you said you wanted distinct
[00:24:23] <Boomtime> you probably want this to include score: $group: { _id: "$date", person: { $addToSet: { name: "$person", score: "$score" } } }
[00:24:57] <Boomtime> 'person' will be an array of documents { name: 'NameA', score: XX }
[00:25:10] <Boomtime> be aware that the set is formed on the combination of name and score
[00:25:36] <Boomtime> if you have the same name on the same date twice but with different scores then it will appear both times in the array for that date
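
(Boomtime's stage, written out as a complete shell call; again, "games" is an assumed collection name:)

    db.games.aggregate([
        { $group: {
            _id: "$date",
            // the set member is the (name, score) pair, so uniqueness
            // is on the combination of both fields
            person: { $addToSet: { name: "$person", score: "$score" } }
        } }
    ])

(Each result document looks like { _id: <date>, person: [ { name: 'NameA', score: 50 }, ... ] }.)
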
[00:27:21] <Synt4x`> thats actually perfect and exactly what I'm looking for, thanks very much for this!
[00:43:01] <Synt4x`> ok sorry having one more issue, I set dataSet = db.mydb.aggregate() , and then when I do for c in dataSet and print c it shows "ok" , "result"
[00:43:39] <Synt4x`> I was expecting to see {'_id':date, 'person' [{name, score}, {name, score}] , etc.
[00:45:26] <Boomtime> in a driver you get the raw command reply back, so you have to do in code what the shell unpacks for you automatically
[00:47:24] <Synt4x`> ah ok, so it seems it's housed in dataSet['result']
[00:55:49] <Synt4x`> still here Boomtime? 1 more question now that I have it working properly, what do I do if I want to add non-distinct elements to this, for instance opponent_score (which will be repeating almost 100% vs all records)
[00:58:25] <Boomtime> I can't really tell from your description what you want to achieve, but you would be better off learning the numerous other powerful operators available in the aggregation pipeline
[00:59:38] <Boomtime> if you can't figure it out, then you should construct a set of example docs and desired output as a gist on github along with what you've tried
[01:00:30] <Boomtime> you'll get a better response here if you know exactly what you want and can show you've put some effort in already
[01:04:29] <Synt4x`> yea ok I just assumed you would know easily, it's tough switching from my original method to this on the fly and the documentation on $addToSet doesn't really give many unique examples
[01:04:37] <Synt4x`> but i'll keep looking I guess, thanks anyways
[01:06:50] <Boomtime> i would need to understand what you want first, there are many possibilities, put some example documents in a gist along with the output you want
[01:07:46] <Boomtime> or pastebin, or [insert temporary text service here]
[01:11:15] <Synt4x`> Here might be an example set, http://pastebin.com/YP5KZwsL
[01:12:18] <Synt4x`> I want it sorted how we already have {Date, (Name, Score) --- the pairing of (name,score) is distinct}, but I also would like to have opponent score listed with each record, and that does not have to be distinct, in fact it will very often be the same repeating with each record, but has the potential to be different
[01:14:24] <Boomtime> i think you need to provide an example output.. i honestly can't tell what you are expecting
[01:14:29] <Synt4x`> does that make sense as I described it?
[01:16:26] <Boomtime> also, same name same date but with different score
[01:16:44] <Boomtime> also, those permutations plus whatever complexities the addition of 'opponent score' introduces
[01:24:42] <Synt4x`> ok, so this would be a sample input: http://pastebin.com/FGe5HTFt , this would be an expected output: http://pastebin.com/mKUTXHKr
[01:25:26] <Synt4x`> You can see Robert's score of 83.25 does not repeat (it's inconsequential which of the opponent scores vs the 83.25 it chooses), but he has 2 scores on 2014/09/25 because he has 83.25 and 99.25 so they both are included
[01:29:51] <Boomtime> ok, lets focus on 'Robert' records since those are the most interesting
[01:29:53] <Synt4x`> basically of all of the ones that get chosen from the original date, (name, score) $group, I just want to add the opponent_score, whatever that was in that instance, to each of those (name, score)
[01:35:55] <Boomtime> your unique combination is (date, name, score), and then you simply want whatever happens to be the first opponent-score seen on each combination
[01:38:51] <Boomtime> ok, you probably want to do 2 pipeline stages - the first to de-duplicate opponent-scores, i.e group on the combination of date,name,score and select op-score using $first
[01:40:47] <Boomtime> you can then $group again on $_id.date (as it will have become) and now you build your array as before, including op-score knowing it won't cause any new duplicates
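
(A sketch of the two-stage pipeline Boomtime is suggesting; field and collection names are assumptions from the discussion:)

    db.games.aggregate([
        // stage 1: de-duplicate on the (date, name, score) combination,
        // keeping whichever opponent score happens to come first
        { $group: {
            _id: { date: "$date", name: "$person", score: "$score" },
            opp_score: { $first: "$opp_score" }
        } },
        // stage 2: regroup by date; the set members are already unique,
        // so carrying opp_score along cannot create new duplicates
        { $group: {
            _id: "$_id.date",
            opponent: { $addToSet: {
                name: "$_id.name",
                score: "$_id.score",
                opp_score: "$opp_score"
            } }
        } }
    ])
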
[01:44:04] <Synt4x`> sorry just thinking for a minute to make sure I understand how it would work
[01:45:52] <Synt4x`> will the method you describe make all records on 2014/09/25 for example show the same opponent score? like is it selecting one opp_score ($first) and applying it to all of them
[01:57:28] <Synt4x`> trying to read from the $first page on mongodb docs, but again it only gives one example, so I think this should be correct but not 100%
[01:58:57] <Boomtime> i don't actually know what your given pipeline will produce - you should try it and see
[01:59:05] <Boomtime> i don't think it will be correct
[01:59:19] <Boomtime> my suggestion was to use 2 consecutive pipeline stages
[02:00:24] <Boomtime> aggregation is a pipeline, you are using a grand total of 1 stage at the moment, but you can use as many as you like
[02:00:43] <culthero> How do you prevent a mongodb shard from running out of disk space? It has a TTL collection but the volume over a short time increased
[02:00:51] <Synt4x`> ok I think I understand the grouping concept, I guess what I don't fully grasp is where to include the $first to make it non-unique/required
[02:01:24] <Synt4x`> I am assuming if I put it in the original one, it will also be required to be unique, like each of those were
[02:02:24] <Boomtime> your first pipeline stage contains the $first
[02:02:51] <Boomtime> your second pipeline stage contains something like the 'original' we came up with
[02:03:07] <Boomtime> i think you need to go and play with this a lot more, use the shell
[02:03:09] <Synt4x`> ok, I will spend a few hours toying with it now, thanks for all of your help / explanations. Tough for me to learn a new concept from scratch, it seems like the documentation on this wasn't very thorough
[02:06:14] <Boomtime> but that isn't replica-sets, nor does it prevent you from using it on a sharded-cluster, just that the collection cannot be sharded
[02:06:31] <culthero> right, the capped collection
[02:06:41] <Boomtime> capped-collections have always been available on replica-sets, it's kind of essential
[02:06:41] <culthero> because of a fulltext index on it
[02:07:14] <Boomtime> then you have requirements which are at odds with each other
[02:07:17] <culthero> it is i/o bound, once it gets to something like 2-3m documents it throttles the disk
[02:07:36] <culthero> agreed, but not really sure how to address that scenario
[02:08:23] <Boomtime> that is a tricky one - you may have to do it manually
[02:08:36] <Boomtime> or at least, by a script that figures it out
[02:08:53] <culthero> In general I want to search 30-40m documents relatively instantaneously
[02:09:06] <Boomtime> also, you will need to pay very careful attention to the balancer action
[02:09:28] <Boomtime> 40 million documents is nothing
[02:15:21] <culthero> 40m documents = each document is roughly 5kb, inserted 60-80 per second..
[02:15:41] <culthero> one field is fulltext indexed
[02:15:58] <Boomtime> ok, so that will be a little punishing
[02:16:30] <Boomtime> and you do distributed full-text queries i take it?
[02:16:41] <culthero> yeah they are sharded on hash keys
[02:17:17] <Boomtime> oh well, if you have size requirements you will probably have to track them yourself
[02:17:33] <Boomtime> mms monitoring can give you alerts on shard dbstorage sizes
[02:17:39] <culthero> the idea being that I have to know how many times something occurred.
[02:18:21] <culthero> sure but right now for my setup, 3x config servers + 4 shards + mongos instance is pretty cheap
[02:18:26] <Boomtime> the ttl index is a good idea, but it doesn't respond to size - you need a feedback loop to shorten the ttl if the size starts growing
[02:18:43] <culthero> I can't alter the index on the fly smartly right?
[02:25:02] <culthero> so I will probably run a 3rd node app on my mongos instance that reads something like df -h from each of the shards, and if free space drops below ~4gb, reduces the TTL to (oldest record - 5 hours)
[02:25:16] <culthero> for now I am dropping my TTL significantly
[02:25:47] <culthero> or I suppose I can check it every 10 minutes within the loop
[02:26:18] <Boomtime> be careful with changing TTL significantly - it can cause a sudden mass delete that causes a performance impact on other ops
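
(On culthero's earlier question at 02:18: the TTL can be changed without rebuilding the index, via collMod. A sketch, assuming a TTL index on a createdAt field of a collection named "events":)

    db.runCommand({
        collMod: "events",
        // keyPattern identifies the existing TTL index to modify
        index: { keyPattern: { createdAt: 1 }, expireAfterSeconds: 3600 }
    })
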
[03:27:03] <Synt4x`> sigh still not able to do this without a TON of iteration, and I know there is a better way now with the aggregation. Hackhands has had me waiting ~1 hour with no expert yet, so if anybody here is great with aggregation and wants to make some $ I will paypal you what I would have paid them ($1/minute)
[03:49:08] <Synt4x`> can't even pay someone to help ;(, hackhands guy couldn't help either
[03:49:14] <Synt4x`> I need an aggregate savior :-p
[05:01:08] <Synt4x`> I'm very VERY close to what I want, if anybody is here
[05:03:52] <Synt4x`> this gets me what I want except that if (name, score) is the same, but (opponent_score) is different, it will have 2 entries for (name, score), I want it to only have 1
[05:04:05] <Synt4x`> I need (name, score) to be unique, with opponent score just included as extra data (doesnt have to be unique)
[05:04:49] <Synt4x`> it's a bit confusing, "opponent_score" is actually me playing, thats my score. "Score" is the score of my opponent who I played against, but this entry is for their result
[05:09:06] <joannac> if you don't care, then why keep the score at all?
[05:10:17] <Synt4x`> because over time I think it should average out, basically I played 2 "teams" that day, he could have easily been matched vs either one of my teams, so whichever one it compares him to in the calculation doesn't matter
[05:10:31] <Synt4x`> but I still want to do a comparison as far as his %difference from my team, and what percentile he falls in for the day
[05:15:41] <joannac> group on (date, name, score) first to only get uniques
[05:15:50] <joannac> and then do the group stage you have above
[05:22:22] <Synt4x`> hrmm sigh this cycle again, I tried 2 hours earlier with Boomtime to do this, and have spent $50 on hackhands/codementor with people who didnt get it right :-p
[05:22:39] <Synt4x`> how would I actually use first in my group statement, I tried earlier and inside, outside, etc.
[05:24:08] <Synt4x`> oops gist is like pastebin huh? I thought you were asking "for the gist of it"
[05:24:16] <joannac> that's because you're doing it inside an $addToSet
[05:25:18] <Synt4x`> wowwwww im so dumb, I tried so many times to add it outside of the 'addToSet'{} brackets but inside of the opponent{} ones and it never worked
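
(The distinction joannac is pointing at, roughly: $first is an accumulator, so it must sit at the top level of the $group stage, not inside the expression passed to another accumulator such as $addToSet:)

    // invalid: an accumulator nested inside $addToSet's expression
    // person: { $addToSet: { name: "$person", opp_score: { $first: "$opp_score" } } }

    // valid: $first as its own top-level accumulator in the stage
    db.games.aggregate([
        { $group: {
            _id: { date: "$date", name: "$person", score: "$score" },
            opp_score: { $first: "$opp_score" }
        } }
    ])
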
[05:27:23] <Synt4x`> well, waste of $50 and a bunch of time, but this seems to work beautifully!
[05:29:02] <Synt4x`> so if there is no $sort at all before first, it will just grab whatever one the dictionary decides is first? (also unsorted since it's a dict I believe)
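
(Essentially yes: with no preceding $sort, $first just takes whichever document the pipeline happens to see first for each group. A sketch of making it deterministic, with assumed field names:)

    db.games.aggregate([
        // fix the document order so $first is predictable
        { $sort: { date: 1, opp_score: 1 } },
        { $group: {
            _id: { date: "$date", name: "$person", score: "$score" },
            opp_score: { $first: "$opp_score" }
        } }
    ])
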
[06:09:31] <Synt4x`> it only attaches 1 opp_score at the end of the entire list, so it's like {date, {name, score}, {name, score}..., opp_score}, and ideally I want it to have {date, {name, score, opp_score}, {name, score, opp_score}} , I'm just indifferent when there are 1 score (2x) with 2 diff opp_scores, then it can choose either one, doesnt matter
[06:09:38] <Synt4x`> but not for all of them to be tied to one, that would skew it too hard
[06:10:57] <Synt4x`> hrmm I guess I misunderstood it then, let me take a look at some two-group stage examples to see if I can see how
[06:13:26] <Synt4x`> this page: http://docs.mongodb.org/manual/reference/operator/aggregation/group/ only has examples of them using 1 group at a time, do I do them both within the same aggregate command?
[06:15:13] <Synt4x`> ok yea I read this, so I guess my question is this then, by the looks of the example, if we group on date, (name, score), it gets rid of all of the other data, so if we re-group on that group, we don't have opp_score anymore
[06:15:19] <Synt4x`> (that was the problem I was having conceptually)
[06:16:12] <joannac> see how they're grouping on (city, state) and yet they're also operating on the population
[06:16:34] <joannac> in your example, it would be $first: opponent_score
[06:16:44] <joannac> or $last or whichever operator you decided on
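
(The manual example joannac is referring to groups twice, roughly like this; note how the second stage reads the first stage's grouping key via "$_id.state" and its accumulator output via plain "$pop":)

    db.zipcodes.aggregate([
        { $group: { _id: { state: "$state", city: "$city" }, pop: { $sum: "$pop" } } },
        { $group: { _id: "$_id.state", avgCityPop: { $avg: "$pop" } } }
    ])
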
[06:19:35] <Synt4x`> sorry if I'm being obtuse here, but the first $group in both examples are: $group{ {state, city}, population }, they both have all 3 items in their first group command
[06:23:31] <joannac> but fine, group uniquely on (date, name, score), and grab a single instance of opp_score
[06:24:34] <Synt4x`> ok but that's what we're doing now, we have only 1 instance of opp_score for each date, i.e. {date, {name, score}, {name, score}, ..., opp_score}, but in reality there could be 50 different opp_scores that day, not just one
[06:26:55] <Synt4x`> yea, I guess it is, I mean it just seems like layers of queries, I should be able to understand it I feel like, but yea, obviously I don't
[06:37:02] <joannac> have you actually run that code?
[06:37:42] <Synt4x`> yes, it ran, gave me my results, but I just checked and again only 1 opp_score is at the end, so I'm assuming I made a mistake in my second query, maybe because I kept opp_score outside again the second time
[06:44:33] <Synt4x`> ok, so it's definitely my second query that is wrong, the first contains everything I would want it to with a structure of {id: {date, name, score}, opp_score}
[06:45:19] <Synt4x`> in my second I want to turn that into: {date, _id: {name, score}, oppscore:{oppscore}} **I think**
[06:55:28] <Synt4x`> I tried doing 'opp_score':'$opp_score' which is what I thought it would be, but it didn't put any opp_score anywhere in the output, so I switched it to '$our_score' and I think doing that makes it read from the original set defeating the purpose
[06:55:53] <Synt4x`> so I guess the question is, how do I access opp_score from the first $group query, in the 2nd group query
[06:59:28] <Synt4x`> hold on I may have gotten it (or I have a hypothesis to try at least)
[07:02:02] <Synt4x`> nevermind that failed ;(, I tried accessing like in the example, in the second group statement doing $_id.score, etc.
[07:05:33] <[1]zwoop> I'm using a REST API as the only means to communicate with mongodb. The problem I'm having is that I have to specify the data model for every object I want to send/retrieve from the database. How can this be done more generally, so that I can send/retrieve any object through the API?
[07:06:10] <Synt4x`> I still think I want this as expected output: {_id: date , opponent: [ {{name, score}, {opp_score}, {{name, score}, {opp_score}... ] } **or {name, score, opp_score}, {name, score, opp_score}... but I can't make either work
[07:07:11] <joannac> Synt4x`: do you understand what the first $group is doing
[07:07:37] <joannac> do you know what you are getting out of the first $group, what your documents look like?
[07:10:37] <Synt4x`> yes, the first one is doing this [ { _id: {date, score, name}, opp_score}. {_id: {date, score, name}, opp_score}...
[07:11:14] <joannac> what do you want your second group stage to do, then?
[07:11:31] <joannac> what should the unique field be?
[07:13:20] <Synt4x`> ok in the first one it's already filtered out my multiple entries, so for instance there is an entry in the db, {name:gfx_644, score: 110.24, opp_score:106.60} and then another one {name:gfx_644, score: 110.24, opp_score:94.94} (2 entries, same score, diff. opp_score)
[07:13:30] <Synt4x`> when I do the first query, I'm down to only one of those, which is what I want
[07:14:22] <Synt4x`> so now I'd like it to be {_id: date, {name, score, opp_score}} where all {name, score, opp_score} are already unique I believe because of the first query, so just to separate them by date and that's it
[07:15:09] <joannac> you're playing very loose with definitions here
[07:16:19] <joannac> you want documents that look like that?
[07:18:54] <Synt4x`> sorry you're right, I was thinking it auto-grouped them by date but it's because I was using $addToSet, so I want them to look like this: {_id : date , opponent : [ {name, score, opp_score}, {name, score, opp_score}... ] }
[07:24:27] <joannac> I suggest you play around in the shell while you figure out syntax
[07:24:43] <joannac> and put your group stage together piece by piece
[07:25:40] <Synt4x`> the thing is, it's easy to see how I access the $_id.X's , how do I access opp_score from the first grouping? that's what I don't see, the example says just $pop but that didn't seem to work as I expected
[07:43:59] <joannac> is it by chance very late where you are?
[07:44:13] <joannac> you're having braces problems :p
[07:44:28] <Synt4x`> 12:43am and I'm on 5 hours of sleep, yes ;(. I spent 12 hours today trying to fix this one query, something that seemed soooo simple, and I could do iteratively
[07:44:34] <Synt4x`> and somehow has turned into this marathon of pain
[07:45:22] <joannac> anyway, I'm glad you figured it out
[07:45:34] <joannac> and despite all the pain i think you understand aggregation now
[07:45:45] <joannac> even though i had to prod you through it
[07:55:20] <Synt4x`> seriously, can I send you some $ on paypal or something?
[07:55:24] <Synt4x`> this has really helped me out a ton
[08:06:03] <Synt4x`> Hooooooooooooooorayyyyyyyyyyyyyyyyyyyyy, got it. and MAN... what a challenging day, thanks so much to everybody here I NEED to sleep but I will pay it forward in any way possible
[09:50:43] <Forest> Hello. i am using node js with mongoDB, i am chunking an array and trying to insert the JSON objects into mongodb, but not all of them are inserted, what might be the problem? I get no error message.
[09:53:42] <Forest> This is my code https://dpaste.de/NnZ2 Only 999 elements of cesty are inserted and 2600 of vrcholy. Can anyone help me please?
[10:10:19] <shepik> hi, is this really "mogodb chat&support"?
[10:10:41] <thib> i use mongodb on a vagrant virtual machine, i've tried to update it to 2.6.5 but it seems to fail configuring mongodb-org-server, any idea why?
[12:29:38] <ssarah> Can you tell me the easiest way to ensure external connections to a mongo machine use ssl? Or point me to the manual page on it?
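
(One possible answer for ssarah, assuming the 2.6-era SSL options: start mongod with SSL required, so plain (non-SSL) client connections are refused; the path here is a placeholder:)

    mongod --sslMode requireSSL --sslPEMKeyFile /etc/ssl/mongodb.pem

(See "Configure mongod and mongos for SSL" in the manual for the config-file equivalents.)
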
[12:49:21] <beebeeep> hello folks, does anybody know what this record in config.locks means: config(obj: 237636; size: 0/0/0 Gb)> db.locks.find({state: {$ne: 0}})
[12:50:38] <beebeeep> (balancer was disabled for months)
[13:22:05] <Forest> Hello, i am using mongodb with node js and i have a problem that array of documents i want to insert is larger than 16 MB. Can you help me how can i split the array correctly? I tried to split the array into smaller arrays of 8000 elements,but that does not guarantee the size. Any help would be appreciated.
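
(A sketch of size-aware chunking for Forest's problem, assuming a bson build that exports calculateObjectSize; all names here are illustrative:)

    var calculateObjectSize = require('bson').calculateObjectSize;

    // split docs into batches whose BSON payload stays under maxBytes
    function chunkBySize(docs, maxBytes) {
        var batches = [], batch = [], size = 0;
        docs.forEach(function (doc) {
            var docSize = calculateObjectSize(doc);
            if (batch.length > 0 && size + docSize > maxBytes) {
                batches.push(batch);
                batch = [];
                size = 0;
            }
            batch.push(doc);
            size += docSize;
        });
        if (batch.length > 0) batches.push(batch);
        return batches;
    }

(Insert each batch separately, leave headroom below the 16 MB limit, e.g. maxBytes of 15 * 1024 * 1024, and check the error argument in every insert callback, since fire-and-forget inserts are one way writes go missing silently.)
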
[15:17:35] <kakashiAL> I am trying to understand this code: https://paste.xinu.at/r4CfO/
[15:17:37] <kakashiAL> it says: give the mySchema object a method that gets 2 parameters, an id and a callback (cb)
[15:17:40] <kakashiAL> because it's async it needs a callback, but what does .exec(cb) mean?
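
(Roughly what the pasted code is doing, as a sketch; the method name here is illustrative, but .exec() is the standard Mongoose way to run a lazily-built query:)

    // a static method on the schema: callable later as Model.loadById(id, cb)
    mySchema.statics.loadById = function (id, cb) {
        // this.findOne() only builds a Query object; nothing hits the DB yet.
        // .exec(cb) actually runs the query and calls cb(err, doc) when done.
        return this.findOne({ _id: id }).exec(cb);
    };
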
[16:05:02] <hydrajump> hi if I don't specify fork=true in the mongod.conf does that mean that service start mongod will not run in the background?
[17:41:32] <locojay> hi i've removed about 100GB from GridFS but when i do show dbs the db still has a size not reflecting the deleted docs
[18:38:44] <skaag> how do I enable the oplog on a replica set?
[18:39:09] <Derick> oplogs are automatically created when you start a node with --replSet=name
[18:39:14] <skaag> I ran rs.printReplicationInfo() and I see the last oplog event is from sometime in february 2014
[18:39:54] <skaag> can I put replSet=name in /etc/mongodb.conf and it will behave the same?
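
(The legacy INI-style config does accept the same option as a line; a minimal sketch of /etc/mongodb.conf, with "rs0" as a placeholder replica set name:)

    # equivalent of --replSet on the command line
    replSet = rs0
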
[19:23:26] <Derick> you either connect to a primary node, or use rs.slaveOk( true );
[19:26:57] <hydrajump> Derick: I need a quick check because I don't know if I'm doing the right thing
[19:27:35] <hydrajump> Derick: I have 3 members in a replicaset. I created a new member and just added it to the replicaset on the primary using rs.add("xxx")
[19:30:31] <hydrajump> Derick: rs.status() returns the following for the new member
[20:57:08] <HairAndBeardGuy_> yes, but sorry, i've never used mongo-hadoop.
[20:58:26] <joannac> ditto. i suggest you share your code so someone can look, though
[20:59:10] <HairAndBeardGuy_> if you're having problems doing it with mongo-hadoop, could you not use another library?
[21:11:20] <mongoexplore> I don't know any other alternative to create bson file on hadoop
[21:11:43] <mongoexplore> do you have suggestion for alternative?
[21:12:04] <mongoexplore> I am converting 500gb of data
[21:12:14] <mongoexplore> so single machine will not work
[21:21:18] <mongoexplore> anyone want the last piece of pizza?
[21:27:58] <cheeser> mongoexplore: no one can answer your question if you don't ask it.
[21:30:30] <mongoexplore> cheeser, the question i asked was: I am using mongo-hadoop to convert existing json files into bson... but when I do it everything gets assigned as _id : {full json}
[21:30:42] <mongoexplore> instead of being document
[21:32:23] <mongoexplore> the code is simple, the mappers and reducers are just a simple cat
[21:33:47] <cheeser> what does your output look like?
[21:35:41] <Bumptious> is there an easy way to find a gap in series of integers, each integer from a separate document?
[21:36:12] <Bumptious> as you may expect this is for the purposes of order/position
[21:37:02] <mongoexplore> so input is {"_id":"12", "key":"field"} and the output bson looks like {"_id": "{\"_id\":\"12\", \"key\":\"field\"}"}
[21:38:39] <joannac> Bumptious: why not just sort them?
[21:38:48] <cheeser> mongoexplore: i'm guessing it's because RecordWriters expect a <K,V> pair and you're only giving it one item so it assumed a null value.
[21:46:01] <Bumptious> hmm yeah. i'm not sure why i feel compelled to make it work without that interaction
[21:48:41] <Bumptious> i think because i'm dealing with a game where there's potentially a lot of asynchronous actions going on
[21:50:50] <Bumptious> i'm just going to try it the way you suggest and see if problems arise :), thanks, joannac
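
(A shell sketch of the straightforward approach joannac suggests, sort and scan; the collection "items" and field "seq" are assumptions:)

    var prev = null;
    db.items.find({}, { seq: 1 }).sort({ seq: 1 }).forEach(function (doc) {
        // report each place the integer sequence skips a value
        if (prev !== null && doc.seq > prev + 1) {
            print('gap after ' + prev + ' (next is ' + doc.seq + ')');
        }
        prev = doc.seq;
    });
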
[22:57:37] <mango_> Just working on a MongoDB puzzle: if there is a network partition between DC A (2 x secondaries) and DC B (1 x primary), when the network partition has been removed, is there a re-election?
[23:00:53] <cheeser> i don't think so. that 3rd machine would just reestablish communication with the other 2 (one of which would be primary)
[23:01:45] <joannac> Well, the one in DC B will step down
[23:01:54] <joannac> and one of the machines in DC A will become primary
[23:02:07] <joannac> so after the partition is gone, the machine in DC A will stay primary