[00:28:30] <d4rklit3> is insertion order of ObjectID's a viable solution for list ordering?
[00:28:47] <d4rklit3> like say a document has an array of ObjectIDs
[00:29:01] <d4rklit3> if i just reorder the array, and use that order would that work?
[00:31:52] <joannac> d4rklit3: I'm not sure I understand the usecase
[00:32:09] <d4rklit3> so i have a list of some arbitrary objects in an array attached to a document
[00:32:16] <d4rklit3> basically a list of ObjectIDs
[00:32:25] <d4rklit3> can i use the array's index order as the order of the list?
[00:32:41] <d4rklit3> like in sql, i would have to use a join table for something like this
[00:32:46] <joannac> yes, but it might change if you modify the array?
[00:33:03] <d4rklit3> well, modifying the array's order would be to that effect
[00:33:37] <d4rklit3> it would be my responsibility as the developer to ensure that array operations preserve index order unless specifically intended to change it.
[00:34:07] <joannac> so you're asking me an application question?
[00:34:46] <d4rklit3> purely, mongo will never reorder an array of ObjectID's unless the application tells it to
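A minimal mongo-shell sketch of the pattern d4rklit3 describes; the collection and field names here are made up. MongoDB preserves array order, so the array index can serve as the list order as long as the application rewrites the array deliberately:

    var a = ObjectId(), b = ObjectId(), c = ObjectId();
    db.lists.insert({ _id: "mylist", items: [a, b, c] });   // insertion order is the list order

    // Reordering is the application's job: rewrite the whole array in the new order.
    db.lists.update({ _id: "mylist" }, { $set: { items: [c, a, b] } });

    // The stored order is what comes back on read.
    db.lists.findOne({ _id: "mylist" }).items;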
[04:53:20] <morenoh149> hahuang65: I'd suspect it can't be done. No mention of it here http://docs.mongodb.org/manual/core/index-creation/
[04:53:37] <hahuang65> morenoh149: no worries, I can do it by hand
[05:10:19] <Jonno_FTW> I'm having a problem with pymongo, .limit(100) does nothing
[05:17:48] <joannac> Jonno_FTW: need more detail than that
[05:19:18] <joannac> Jonno_FTW: does .limit(1) work?
[05:21:15] <Jonno_FTW> I fixed it, the problem was I was calling .find().limit(100), then later calling .count() on the cursor
[05:24:32] <Jonno_FTW> I actually wanted .count(True)!
[05:25:23] <Jonno_FTW> couldn't possibly have a .size() method in pymongo to match the js api
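In the mongo shell (pymongo's cursor.count takes an equivalent with_limit_and_skip argument), count() ignores limit() unless told otherwise; the collection name below is illustrative:

    var cur = db.events.find().limit(100);
    cur.count();      // total number of matching documents, limit ignored
    cur.count(true);  // count with limit/skip applied
    cur.size();       // shell shorthand for count(true)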
[07:44:44] <dougb> this is going to be a dumb question, but I'm saving timestamps in mongo under the "t" field, and when I view the record in mongohub I see the following: "t": new Date("2015-04-07T00:00:00-0700") but when I try to filter by setting it to find by {"t": {"$gte": "2001-01-01"}} it's not returning any results
[07:45:44] <soupdiver> Hey guys! I'm currently getting through the official Dockerfile for MongoDB. Can anybody tell me where the directory is set where mongod does write the data to? https://github.com/docker-library/mongo/blob/master/2.6/Dockerfile
[07:46:08] <joannac> dougb: you need to compare dates with dates, not dates with strings
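A sketch of the date query joannac means, assuming the collection is called "records"; the comparison value must be a BSON date, not the string "2001-01-01":

    db.records.find({ t: { $gte: ISODate("2001-01-01T00:00:00Z") } });
    // or equivalently
    db.records.find({ t: { $gte: new Date("2001-01-01") } });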
[07:47:01] <joannac> soupdiver: looks like /data/db
[07:48:24] <soupdiver> @joannac Yeah but WHERE is that setup? I can just see that the Volume is defined and the access rights are set. But where is mongod told to actually use that dir?
[07:49:48] <jdo_dk> i have a document like: myid: 123, docs: [{read:true,name:"Test"},{read:false,name:"Test2"}], can i somehow make a query which "both" match myid:123 and docs.read:true ?
[07:55:25] <joannac> jdo_dk: screenshot or pastebin
[07:55:46] <soupdiver> @joannac Alright. Thanks :) It just didn't look "normal" to use that path. That's why I was confused.
[07:57:13] <dougb> ok, I think I got past that date issue. Now I'm getting BadValue Unsupported projection option: count: { $sum: "$c" } when I'm trying to return the result of getting the sum of a single column in the results
[07:59:27] <dougb> joannac: http://pastebin.com/rrMfB0bH that's what I'm trying to do, using mgo w/ go. I switched the "$sum": "c" from "$sum": "$c" just trying to see what works
[07:59:38] <jdo_dk> joannac: Yeah. And the query is docs.read:false, so dont understand why they are there..
[07:59:54] <joannac> jdo_dk: there's one array element with read = true. done
[08:00:03] <joannac> jdo_dk: what you want to do is not supported afaik
[08:00:14] <joannac> dougb: where did you get that code?
[08:00:43] <joannac> dougb: okay. go back to the docs.
[08:00:54] <joannac> there's no operator $sum in a find() call
[08:01:13] <joannac> you probably want aggregation
[08:02:04] <joannac> jdo_dk: try http://stackoverflow.com/questions/6038818/mongodb-query-to-match-each-element-in-doc-array-to-a-condition although it's going to be really slow
[08:03:27] <jdo_dk> joannac: Thanks. Why cant i query on subdocuments ?
[08:03:51] <joannac> jdo_dk: you can, just not the way you want
[08:04:09] <joannac> jdo_dk: there's no syntax for "I want every element in my array to satisfy this condition"
[08:05:30] <jdo_dk> joannac: So I will get the whole array whether the elements matched or not? And I'm unable to query only the matching elements in the list? Now I get it. :D
[08:06:21] <joannac> jdo_dk: ah, right. there's no operator for "return me only the elements in this array that match YYY"
[08:07:01] <jdo_dk> joannac: Thanks for clarifying. :p
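A hedged sketch of the two points above, using jdo_dk's document shape (collection name assumed). "Every docs element has read:true" can be expressed by excluding documents containing any element with read:false, and returning only the matching elements needs an aggregation:

    // Match only documents where no docs element has read:false.
    db.coll.find({ myid: 123, docs: { $not: { $elemMatch: { read: false } } } });

    // Return just the docs elements with read:true (one result document per element).
    db.coll.aggregate([
        { $match: { myid: 123 } },
        { $unwind: "$docs" },
        { $match: { "docs.read": true } }
    ]);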
[08:09:30] <dougb> joannac: thank you for your help, does this look better? http://pastebin.com/3GiPvT7f it gives me the results I'm looking for
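dougb's pastebin isn't reproduced here, but the general shape joannac is pointing at is an aggregation where $sum lives inside a $group stage rather than in a find() projection; field and collection names below are illustrative:

    db.stats.aggregate([
        { $match: { t: { $gte: ISODate("2015-01-01T00:00:00Z") } } },
        { $group: { _id: null, count: { $sum: "$c" } } }
    ]);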
[08:52:34] <mtree> Ok, so here's the thing: i want to allow my app to read from replicas and do not wait for propagation to other replicas when saving. What should I add to query string?
[09:19:07] <Derick> if you need to read from replicas, then you need to set your readPreference as well
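A possible connection string for what mtree describes (host names and database are placeholders): readPreference allows reads from secondaries, and w=1 means writes are acknowledged by the primary only, without waiting for replication:

    mongodb://host1:27017,host2:27017,host3:27017/mydb?replicaSet=rs0&readPreference=secondaryPreferred&w=1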
[09:22:30] <joe1234> HI, I would like to know if it's safe to ignore the schema design constraints chosen due to fragmentation and reallocation problems (of memory mapped files) if the engine is switched to wiredTiger (I can't find a clear yes or no answer on the net).
[11:55:58] <KekSi> i'm quite lost right now: has anyone tried to get a cluster with x509 authentication to run in docker instances?
[11:57:02] <KekSi> how do i get the server to actually use the certificate (it just says "The server certificate does not match the host name 127.0.0.1")
[12:07:42] <Derick> diegoaguilar: what was the subject? i've many articles! was it for solr or something?
[12:07:46] <joe1234> KekSi, never done that for mongo, but I guess you gotta generate a cert whose CN matches the hostname of the docker container (the result of the hostname command). You can verify using "openssl x509 -in certfile.ext -text" (maybe you'll have to specify the cert encoding, CER or DER) and check the CN there. You also gotta check that you properly defined your container's hostname
[12:10:19] <kiddorails> We have mongo db of size > 2TB. We need to daily remove the data which is 3 months old. We have been doing it by remove() in batches to prevent db lockdown, but this is very time consuming.
[12:10:46] <kiddorails> I was reading about TTL in Mongo. This seems very appealing but I'm not sure if it will work well for us.
[12:11:24] <kiddorails> First, if we index the entire db in foreground, it will lock it down for Read/Write.
[12:11:43] <kiddorails> If we do it in background, it may turn out to be very time consuming and high on RAM. Not sure. :/
[12:12:08] <kiddorails> I'm curious what would be the best way to handle this problem
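A sketch of the TTL approach kiddorails is weighing, with an assumed date field; the TTL monitor deletes expired documents in the background rather than in one big remove():

    // ~90 days; background:true avoids the foreground build's read/write lock,
    // at the cost of a slower build.
    db.events.createIndex({ createdAt: 1 }, { background: true, expireAfterSeconds: 7776000 });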
[12:59:04] <rom1504> Hi, I'm getting "Failed: error unmarshaling bytes on document #0: JSON decoder out of sync - data changing underfoot?" when running mongoimport on a (simple and small) json file, any clue what the problem is ?
[13:36:42] <jokke> is it possible to get all distinct keys for an embedded document in documents of a collection?
[13:37:45] <jokke> eg. i have { _id: 'foo', name: 'bar', embedded: { foo: 'bar', baz: 'foobar' } }
[13:38:11] <jokke> so with only that document i'd expect to get a result with values: ['foo', 'baz']
[13:54:00] <magesing> Hi everyone, I installed mongo as a backend for a sharelatex server... I started to get error 503 when I try to access sharelatex, and I have tracked the problem to my mongod misbehaving... When I have mongod started and I try to open a shell to it with "mongo" I get:
[13:54:12] <magesing> Error: couldn't connect to server 127.0.0.1:27017 (127.0.0.1), connection attempt failed at src/mongo/shell/mongo.js:146
[13:54:30] <magesing> How can I fix this error and get my database running again? Thanks.
[13:55:25] <compeman> magesing: did you try mongo --repair
[13:56:04] <magesing> compeman: yes, I just did: mongod --repair --dbpath=/srv/sharelatex-data/data
[13:56:12] <magesing> followed by: service mongod restart
[14:03:59] <magesing> compeman: aah I see, my dbpath somehow got set to /var/lib/mongo instead of /srv/sharelatex-data/data ... I think the install of gitlab may have done it.
[14:36:33] <Aartsie> GothAlice: Thank you but that is not what I mean... I have a lot of old data, but when I delete this data it will not reduce the used disk space
[14:36:44] <GothAlice> Ah, for that you need to compact your collections.
[14:36:52] <magesing> OK, I have a server which has mongod up and running and bound to 172.17.42.1, and I can connect to it from the local server using mongo --host 172.17.42.1. If I open a shell inside the docker container, I can successfully ping 172.17.42.1; however, mongo --host 172.17.42.1 won't connect... How can I fix that? Thanks.
[14:37:31] <GothAlice> magesing: Sounds like a docker NAT issue.
[14:37:40] <Aartsie> GothAlice: Thank you :) then it will release the disk space?
[14:38:18] <GothAlice> Aartsie: Sorta. It'll compact everything back down to be as efficient as possible. It likely won't delete unused stripes. A full repair would do that, but it's an offline operation and can take a long time (and needs 2.1x as much free disk space as you have data.)
[14:40:53] <Aartsie> Why isn't there a good clean function in MongoDB? This is ridiculous... You want to clean up because you've reached the max of your disk space, and before you can do this you need even more disk space
[14:44:59] <Derick> GothAlice: interestingly, the docs say that "compact" only needs at most 2Gb of free space... http://docs.mongodb.org/manual/reference/command/compact/#disk-space
[14:46:54] <Derick> GothAlice: "repairDatabase" needs 2n + 2Gb (and returns data back to the OS): http://docs.mongodb.org/manual/reference/command/repairDatabase/#dbcmd.repairDatabase
[14:49:09] <magesing> Is it possible to connect to a running docker daemon with telnet? I'm still trying to debug why the connection from my docker image isn't working.
[14:56:32] <GothAlice> Aartsie: As with anything, failure to use a tool correctly is rarely the fault of the tool. If you're hosting a database, of any kind, you better be monitoring disk and RAM utilization, replication latencies, page faulting, etc., etc., etc.
[14:57:35] <GothAlice> (And have sufficient plans in place to handle abnormal situations such as running low on disk.)
[14:58:50] <Aartsie> it is no problem to put more disk space on the server, but it is weird that there is no function for deleting this old data and reducing the used disk space
[14:59:31] <GothAlice> Aartsie: repair is that function.
[14:59:53] <Aartsie> yeah ok, but it is not possible to repair just one collection
[15:00:01] <GothAlice> Due to the way repair works, it's pretty much the only way to accomplish what you're asking, anyway. (It literally reads all of the data out and dumps it into freshly created stripes.)
[15:00:33] <GothAlice> Aartsie: That's because MongoDB data isn't organized into file-per-collection or anything like that. All collections within one database are stored in the same stripe set.
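The two operations being contrasted, for reference (collection name is a placeholder; see the docs linked above for the exact free-space requirements and locking behaviour):

    db.runCommand({ compact: "mycollection" });  // defragments one collection's extents in place
    db.repairDatabase();                         // rewrites the whole database and returns free space to the OS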
[15:05:45] <GothAlice> Welp, another thing to add to my FAQ. XD
[15:06:11] <Aartsie> GothAlice: Thank you for the explanation
[15:06:15] <Derick> GothAlice: i tweeted about your FAQ earlier today
[15:08:06] <GothAlice> Heh; whenever I get mentioned online I get notifications out the whazoo. I'll probably get re-pinged within 24h after Google indexes your tweet. XD
[15:10:09] <MacWinner> if you want to do a seamless failover of a primary node to secondary, is it better to tweak replica set priorities? I see that doing rs.stepDown() can cause something like 10-20 seconds of downtime
[15:10:46] <MacWinner> if you reconfigured the node priorities, would it be cleaner?
[15:12:58] <GothAlice> MacWinner: If you can avoid elections (i.e. by ensuring there's only ever one node with the highest—I think—priority) the failover process should be pretty quick.
[15:13:21] <GothAlice> It'll still boot out existing connections when the election is called, but a quorum would be met very quickly.
[15:15:26] <MacWinner> GothAlice, thanks.. just want to make sure I understood you.. so if I have a 3 member RS, and node-1 is currently primary and all priorities are equal, I'm thinking of doing this: set node-2 priority higher than node-1. Will the RS automatically do a failover when it detects a higher priority member that is not primary?
[15:15:44] <GothAlice> MacWinner: Reconfigurations do call elections, AFAIK.
[15:16:42] <GothAlice> Basically you want to set your current primary to be the highest priority, the secondary you want to become primary to be the next highest, and the secondary you only want to use (read-only) in the event of a chain of catastrophic failures as the lowest priority (or zero; i.e. never-primary).
[15:19:11] <MacWinner> GothAlice, got it... so then if I do a stepDown on primary, it should elect the second node faster because there is no contention?
[15:19:50] <Derick> a node with priority=0 will never be elected
[15:21:06] <MacWinner> i see.. so if node-a=2, node-b=1, and node-c=0, then would doing a stepdown on node-a theoretically make the election process very fast?
[15:21:14] <GothAlice> MacWinner: And if the current primary is still operational with the high priority, it'll simply be re-elected on the spot.
[15:21:55] <GothAlice> Telling node-a to step down will have it immediately step up again, as it's the only node in the highest available priority.
[15:21:58] <MacWinner> ahh.. that was my next question.. it seems like the priorities would make node-a become prmary automatically even if you did stepdown
[15:22:27] <GothAlice> This is to have control over how high availability failover (i.e. the primary _utterly explodes_) is handled within your cluster.
[15:22:55] <GothAlice> Typically used as a light-weight way of indicating that certain nodes may be sub-optimal, i.e. because they're in a different datacenter, have really slow disks, etc.
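A sketch of the priority scheme being discussed, using MacWinner's three-node example (member indexes are assumed to line up with node-a/b/c):

    cfg = rs.conf();
    cfg.members[0].priority = 2;   // node-a: preferred primary
    cfg.members[1].priority = 1;   // node-b: takes over if node-a is unavailable
    cfg.members[2].priority = 0;   // node-c: never elected primary
    rs.reconfig(cfg);              // note: reconfiguration can itself trigger an election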
[15:22:57] <MacWinner> alright.. thanks.. i'll play around with it now in my test environment (thanks to the script you provided me a couple weeks ago)
[15:24:19] <MacWinner> oh, have you upgraded to mongo 3.0.1 with WiredTiger? I'm particularly interested in the compression since my data is highly repetitive
[15:24:29] <GothAlice> Do you have unlimited amounts of RAM?
[15:24:59] <GothAlice> (Seems to be the largest outstanding problem with WT that prevents use on systems that are scaled within the bounds of sanity.)
[15:26:51] <Derick> GothAlice: IIRC, the stepDown is by default "60 seconds", meaning it will take 60 seconds before it is available for election as primary again. You can change that by using rs.stepDown(180), for example.
[15:27:14] <GothAlice> Derick: Ah! Good to know, and that would interact with priorities in a very interesting way.
[15:27:46] <MacWinner> GothAlice, thanks for tickets.. i'll hold off for now on WT upgrade
[15:28:56] <GothAlice> MacWinner: At the moment I can not recommend WiredTiger for production use due (in the majority of cases; there are situations where it would likely be A-OK) due to the severity of some of these outstanding tickets. I've encountered two of these directly, and can reproduce one of them with the eventual result of causing a kernel panic in the VM. (The one mentioning it getting very slow; let it go, you'll encounter OOM-killer and it has a
[15:28:56] <GothAlice> chance of nuking something important accidentally.)
[15:30:00] <MacWinner> cool.. i'm just OCD about upgrading for some stupid reason
[15:30:17] <MacWinner> so i need to resist and counteract myself
[15:30:44] <GothAlice> For the most part it improved query performance and substantially reduced our disk usage (despite us using tricks like single-character field names) but if rapidly upserting 17K records causes it to explode… nah. ;)
[15:31:47] <MacWinner> if i'm not mistaken, will WiredTiger make using single-character field names obsolete? i.e. will it compress them automatically?
[15:33:20] <GothAlice> (One would allocate shorter huffman codes for things appearing in nested arrays, etc.)
[15:34:27] <GothAlice> I managed to explain huffman trees/coding to an SEO specialist co-worker of mine, in 10 minutes, with the aid of a whiteboard. At the end he made me proud; he asked the best follow-up question ever: "So, uh, this is why zipping a JPEG or movie makes it bigger?" Ding ding ding! (And yes, it was me attempting to get him to stop doing that. ;)
[15:36:04] <GothAlice> Derick: Any plans for "speedy" wire protocol compression?
[15:36:43] <Derick> I'm not aware of any changes to the wire protocol. I don't think a decision to change that is going to be taken lightly.
[15:44:35] <MacWinner> so basically you think it's alright to upgrade to 3.0.1 in production but not upgrade storage engine just yet?
[15:46:28] <zarry> Hey All, I was interested in changing my replset name(_id) but most examples I found via stackoverflow or otherwise did not work.
[15:46:36] <zarry> Has anyone done this recently with success?
[16:01:22] <Torkable> I'm aware of ObjectId("...").getTimestamp(), but is there are way to select the date, from the objectId, and return it as part of the query?
[16:03:52] <GothAlice> Torkable: No, not currently. The date projection operators don't work on ObjectIds. Vote and watch https://jira.mongodb.org/browse/SERVER-9406 for updates.
[16:04:30] <GothAlice> However, you can query ranges by building fake ObjectIds which you pass in your query. (The projection side is just lacking at the moment.)
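A sketch of the fake-ObjectId range query GothAlice mentions; the first four bytes of an ObjectId are a UNIX timestamp, so an id padded out with zeros marks the start of that second (collection name assumed):

    var since = new Date("2015-04-01T00:00:00Z");
    var fakeId = ObjectId(Math.floor(since.getTime() / 1000).toString(16) + "0000000000000000");
    db.logs.find({ _id: { $gte: fakeId } });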
[16:06:42] <zarry> @GothAlice do you have any experience with changing replset names?
[16:06:57] <GothAlice> zarry: Alas, no, I treat set names as immutable after construction.
[16:07:04] <GothAlice> (If I need a new set, I spin up a new cluster.)
[16:07:05] <Torkable> GothAlice: yea, I need to add real dates to my docs then, thanks
[16:07:30] <GothAlice> Torkable: Indeed. Projection is the biggest reason any of my records have explicit creation time fields. (I only have that where absolutely required.)
[16:08:58] <Torkable> GothAlice: agreed, looks a bit messy, but it's a log collection and the IDs aren't gonna cut it I guess :(
[16:10:06] <GothAlice> Torkable: For my logs, I don't bother with it. Ripping the UNIX timestamp integer out of the binary ObjectId in the presentation layer is really, really fast, and the main use for a "creation time" in that use case, for me, is for querying, not projection. (I.e. range querying on _id to select time ranges.)
[16:10:59] <GothAlice> The projection thing mainly cripples statistical analysis, i.e. chunking your data into time intervals.
[16:11:12] <Torkable> the problem I ran into is, "when were these docs created"
[16:11:15] <GothAlice> (So my Hit records _do_ have creation times due to the need to perform aggregate queries on it.)
[16:11:29] <Torkable> and I don't want to check each one by hand with getTimestamp()
[16:11:57] <GothAlice> "when were these docs created" is a question perfectly answerable by getTimestamp. I'm not seeing the problem.
[16:12:22] <GothAlice> What are you actually doing with the answer to that question? Displaying it?
[16:12:51] <Torkable> the log collection is more of a temp back up
[16:13:17] <GothAlice> Waitaminute. You better be using a capped collection for that. ;)
[16:13:20] <Torkable> when users remove docs, they get placed in a deleted-docs collection for a while in case a mistake was made
[16:13:43] <GothAlice> (I.e. I certainly hope this date querying/projection stuff isn't to determine if old data needs to be cleaned up; MongoDB has several ways of doing this for you)
[16:14:09] <Torkable> so lets say a user says, "all my data of this type is gone"
[16:14:23] <Torkable> and I find 10 deleted docs that belonged to this user in my "log" collection
[16:14:43] <Torkable> I want to know when each one was deleted, aka created in the log collection
[16:14:57] <Torkable> and I don't want to run getTimestamp() over and over
[16:15:12] <GothAlice> Two questions to your last two statements: why, and why?
[16:16:01] <Torkable> when were these docs deleted
[16:16:11] <Torkable> and what if there are more than 10
[16:16:13] <GothAlice> No. I'm asking why you want to know when each one was deleted, and why getTimestamp is no-go?
[16:16:28] <Torkable> I don't want to run getTimestamp over and over
[16:16:34] <GothAlice> If you feel not wanting to call getTimestamp is an optimization, it's a premature one. (Alice's law #144: Any optimization without measurement is by definition premature.)
[16:16:36] <Torkable> what if 100 docs were deleted?
[16:17:07] <GothAlice> getTimestamp performs, in the optimal case, a binary slice (a tiny bit of pointer math) and casting to integer. It's going to be _blindingly_ fast.
[16:17:28] <Torkable> ok I think maybe I missed something
[16:17:37] <Torkable> can I run getTimestamp on all the docs
[16:17:46] <Torkable> and see the creation date for each one
[16:18:00] <GothAlice> … considering you iterate a cursor, yes, you can iteratively run getTimestamp on all of the results of your query in a single line of code.
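For example, something like this in the shell (the collection name, query field, and userId variable are assumptions) prints each document's creation time without checking them one by one by hand:

    db.deleted_docs.find({ owner: userId }).forEach(function (doc) {
        print(doc._id.getTimestamp() + "  " + doc._id);
    });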
[16:18:20] <Madmallard> I am having what seems like a corruption error with MongoDB 3.0.1
[16:22:24] <GothAlice> As long as you aren't calling db.collection.save() with the result of the processing function, you should be fine. (You don't want to accidentally save out a creation time field you'll never really use.)
[16:23:45] <Madmallard> GothAlice i'm not using WiredTiger
[16:24:15] <GothAlice> Madmallard: Hmm. mmapv1 should be quite stable in MongoDB 3. Do you have a reproducible test case?
[18:10:35] <mazzy> the weather channel or The Weather Channel
[18:10:40] <StephenLynx> since the text is in upper case.
[18:10:54] <StephenLynx> that is not upper case, that is mixed.
[18:10:55] <GothAlice> [Technical aside: in ASCII, letters have the high bit clear, and it's the 0x20 bit that distinguishes lowercase (set) from uppercase (clear); that's the bit case conversion toggles.]
[18:11:11] <GothAlice> And "title case" has some peculiar language-specific rules.
[18:12:07] <GothAlice> I.e. it doesn't exist in French. (I.e. "the" wouldn't be capitalized) Nouns are, though. Also, particle words in English aren't typically capitalized. (I.e. "Department of Justice" — lower-case "of")
[18:12:45] <GothAlice> This is effectively an un-winnable data normalization problem.
[18:13:29] <mazzy> I meant "capitalize the first letter of every word"
[18:13:51] <GothAlice> "Department Of Justice" is wrong, though. That naive title case approach isn't very good.
[18:14:43] <GothAlice> It's one of those situations where you have to rely on users entering it correctly, and potentially storing both what they wrote, and the all-lowercase version with a unique index to prevent differences in case alone.
[18:16:22] <GothAlice> A French example: there's a department of the Québec government called: "Comité de déontologie policière" — no title case.
[18:16:49] <GothAlice> (The English version of that is: "Quebec Police Ethics Committee" — title case.)
[18:21:13] <GothAlice> Natural language is hard. ^_^ (https://raw.githubusercontent.com/marrow/util/develop/marrow/util/escape.py < the "unescape" function here includes personal pronoun mapping… *shudder*)
[18:40:23] <Prometheian> I've got a shitty ERD for an app I built for school. I was wondering if I could express the data better in document format: http://i.imgur.com/xHx6x6y.jpg
[18:41:28] <GothAlice> Prometheian: Plan on having more than 3.3 million words of locations and ratings per restaurant?
[18:42:09] <GothAlice> Then you could have two collections combining all of that: Restaurants and Users.
[18:44:03] <GothAlice> Where RestaurantTypes becomes a list of strings (the types), same for Ethnicities, locations is a list of embedded documents, and ratings are likewise a list of embedded documents. (You can add manual _id's to embedded documents, so you can have "internal references" and such.) Or you could split locations from restaurants, and just embed ratings in the location.
[18:44:38] <GothAlice> Generally it's a bad idea to nest more than one level deep (thus it's ill-advised to put reviews inside locations inside restaurants).
[18:44:55] <GothAlice> http://www.javaworld.com/article/2088406/enterprise-java/how-to-screw-up-your-mongodb-schema-design.html < if you're coming from a relational background, this may be helpful
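One possible shape for the split GothAlice suggests (restaurants, plus locations with embedded reviews); every name below is illustrative, and restaurantId/userId stand in for real references:

    db.restaurants.insert({
        name: "Chez Exemple",
        types: ["bistro", "cafe"],        // RestaurantTypes flattened to a list of strings
        ethnicities: ["french"]
    });
    db.locations.insert({
        restaurant: restaurantId,         // reference back to the restaurant document
        address: { street: "123 Main St", city: "Springfield" },
        reviews: [                        // ratings embedded one level deep
            { _id: ObjectId(), user: userId, stars: 4, text: "Solid brunch." }
        ]
    });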
[18:45:24] <owen1> i want to allow connections to my mongo from my ec2s. i am on AWS. what the best way to do that?
[18:47:13] <GothAlice> owen1: Set up firewall security groups to explicitly allow only your own EC2 VMs to connect, then just connect.
[18:48:31] <Prometheian> Yeah, I'm tempted to use Mongo but I wasn't sure if it was the better option.
[18:49:03] <Prometheian> And I heard joins were slow in Mongo so you have to create your DB differently.
[18:50:28] <GothAlice> However, if you pivot your data a bit (I'd go with the three collection approach: restaurants, locations, users, reviews embedded in locations) MongoDB could work very well.
[18:52:23] <Prometheian> I need to relate restaurants to locations somehow.
[18:52:30] <Prometheian> Would I do that in my backend or frontend?
[18:54:29] <owen1> GothAlice: don't i need to modify some stuff on my ec2 where mongo is installed?
[18:54:57] <GothAlice> Likely you'd need to change the bindAddress to 0.0.0.0 to make it listen on the correct interface.
[18:55:09] <GothAlice> But definitely have the firewall configured before you do that.
[18:56:01] <Prometheian> GothAlice: Would I do my 'joins' in my app layer then?
[18:56:18] <GothAlice> Prometheian: Yup; they'd just be additional queries.
[18:56:59] <Prometheian> So effectively I'm moving the join logic up a layer and storing the data in mongo which is faster?
[18:57:00] <GothAlice> There's also "pre-aggregation" to store the averages, that way you don't need to iterate all of the reviews each time you want to show one of the stats.
[18:57:28] <Streemo> Is there a way to add a document with name "blah" to the collection only if a document with name "blah" doesn't already exist within 10 miles of it? If it does exist, then increment the already existing document
[18:57:32] <GothAlice> You're also increasing data locality (i.e. when looking at a location, you likely want the reviews, or vice-versa, when looking at a review you'd naturally need to location details anyway).
[18:59:31] <Streemo> well im glad youre still on here :)
[19:00:10] <Streemo> Btw, you mean conditional upsert, right?
[19:00:28] <owen1> GothAlice: is bindaddress something i define in my /etc/mongodb.conf ?
[19:00:30] <GothAlice> And yeah, actually, now that I read instead of skimming, an upsert is what you want. $setOnInsert will let you set any properties you aren't directly querying on.
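A minimal sketch of that upsert for Streemo's case (collection and field names assumed): $inc bumps the counter on every match, and $setOnInsert only fires when the document is first created. The 10-mile proximity check would have to be a separate geo query beforehand, since the upsert match here is just the name:

    db.places.update(
        { name: "blah" },
        {
            $inc: { count: 1 },
            $setOnInsert: { loc: { type: "Point", coordinates: [lng, lat] }, createdAt: new Date() }
        },
        { upsert: true }
    );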
[19:01:02] <GothAlice> owen1: Aye. And apologies, its current name is bindIp: http://docs.mongodb.org/manual/reference/configuration-options/#net.bindIp
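A hypothetical /etc/mongod.conf fragment (YAML format, 2.6+) for that setting; 0.0.0.0 listens on all interfaces, so only do this once the firewall/security groups are locked down:

    net:
      port: 27017
      bindIp: 0.0.0.0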
[19:12:40] <owen1> can i allow my subnet instead of individual ec2?
[19:14:01] <GothAlice> Alas, I haven't AWS'ed in many years. That documentation is your best bet for what is and is not possible using security groups.
[19:31:08] <abishek> the eval function is restricted on my production servers, so how can I execute queries on mongo db without using eval?
[19:57:26] <Prometheian> Is there anything special about an embedded sub-document or is that just a name for a nested object?
[19:59:14] <Streemo> I have documents which have a rating: 1 star, ..., 5 stars. But their count matters too, i.e. how many there are. Is there a good equation to sort by which takes into account count + rating?
[19:59:42] <Prometheian> You mean the number of ratings + the average rating?
[21:09:45] <GothAlice> It's really not, unless you like running out of memory mid-query. ;)
[21:09:49] <abishek> GothAlice: is aggregation faster than using a group function?
[21:10:17] <Torkable> GothAlice: I want the documents sans three fields
[21:10:19] <GothAlice> abishek: Depends; aggregation is easier to parallelize in certain circumstances. A "group function" isn't a thing that exists in MongoDB, so it's not like you have another choice other than full client-side processing. ;)
[21:10:47] <abishek> so aggregate is better than a group by
[21:10:57] <GothAlice> abishek: Rather, $group in an aggregate is the way to go. One could use map/reduce, but map/reduce is slower, invokes a JS engine runtime, etc., etc.
[21:24:02] <Torkable> I just want to remove a couple fields from the aggregate results
[21:29:20] <abishek> GothAlice, how do I do `SELECT COUNT(visits)` using aggregate query?
[21:30:13] <abishek> I can see COUNT(*) compared but not COUNT(columnName)
[21:37:54] <greyTEO> when reading the oplog, what is the best way to parse {user.3.status: "cool"}
[21:38:35] <greyTEO> I am using the oplog to update elasticsearch when I noticed that the oplog only returns the array item that was changed
[21:38:45] <GothAlice> abishek: What does COUNT(column) do, *specifically*?
[21:39:33] <greyTEO> would the work around be to re-fetch the entire document instead of just going by what has changed?
[21:39:43] <GothAlice> greyTEO: Not using Mongo Connector for any particular reason? http://blog.mongodb.org/post/29127828146/introducing-mongo-connector and https://github.com/10gen-labs/mongo-connector being relevant links.
[21:40:45] <abishek> GothAlice, if you look at my query here http://pastebin.com/mL8yU0y7, I am considering only values from the column that are greater than 0
[21:41:02] <GothAlice> abishek: That didn't answer the question. T_T
[21:41:37] <GothAlice> Looking at that code, I don't even.
[21:41:45] <GothAlice> I don't know what that is, but it's not an aggregate.
[21:43:15] <GothAlice> If you want to pass in variables that can be used in expressions, MongoDB-side, $let is your friend.
[21:43:29] <greyTEO> GothAlice, I have to do a bunch of stuff and not just pipe directly.
[21:43:40] <greyTEO> otherwise that would have worked perfect
[21:44:06] <fewknow> greyTEO: you can write your own doc_manager
[21:44:12] <greyTEO> currently it's oplog -> rabbitmq -> ES
[21:44:15] <fewknow> and still use mongo-connector
[21:45:03] <GothAlice> Simple is better than complex. Complex is better than complicated. Adding intermediaries, with their own logistic and maintenance requirements, enters into the realm of complicated.
[21:46:16] <fewknow> abishek: you can use $group for $gt and then pipe to the next part of the aggregation. Well, I would in the shell anyway. If I understand what you are doing... a little late to the conversation
[21:48:07] <greyTEO> GothAlice, Im assuming you mean don't write your own??
[21:49:18] <GothAlice> Well, it's mostly that any time I see multiple database systems being utilized on a single project I end up slapping the architect. ;) Also "not invented here" syndrome is a Very Bad Thing™ when you're dealing with a) cryptography or b) anything you want to be highly reliable and highly available.
[21:50:40] <GothAlice> (Unless you've got unit and integration tests, tests testing your tests, continuous integration, and some serious planning.)
[21:51:29] <greyTEO> ehh. It's certainly the not invented here syndrome.
[21:51:46] <GothAlice> The only valid excuse for anything, ever: "It seemed like a good idea at the time." ;)
[21:51:50] <greyTEO> python is far from a language I have experience with..
[21:51:59] <GothAlice> I try not to use NIH as a pejorative—I reinvent all the things! ;)
[21:52:37] <greyTEO> lol being the only developer now, I reuse all the things! ;)
[21:52:58] <GothAlice> But even then… I don't roll my own crypto. Anyone can write a crypto system they themselves can not break. (Doesn't mean your next-door neighbour's dog can't break it…)
[21:53:34] <greyTEO> I would never roll my own crypto
[21:58:41] <GothAlice> If you have multiple "workers" chewing on data being piped out of the oplog, you're effectively distributing the jobs "modulo" the number of workers.
[21:59:42] <GothAlice> https://gist.github.com/amcgregor/4207375 is part of a presentation I gave on using MongoDB for task distribution; this is possibly overkill for what you need (I needed full, generic RPC, you don't, really) but handles using MongoDB for this purpose. Includes link to working codebase in the comments.
[21:59:48] <greyTEO> you are talking about tailing cursors?
[22:00:12] <GothAlice> "rabbit would load balance the message if multiple cosumers were tailing the queue"
[22:00:51] <medmr> the monq package puts a nice awkward job queue on top of mongo
[22:01:05] <GothAlice> I'm not familiar enough with rabbit; would it be broadcasting the same message to every listener, or somehow choosing to distribute specific messages to specific workers?
[22:03:59] <greyTEO> GothAlice, you can only have as many threads as you have masters, correct?
[22:04:23] <GothAlice> Please define: "master" and "thread" here.
[22:04:51] <GothAlice> medmr: https://github.com/marrow/task/blob/develop/marrow/task/model.py#L181-L193 < my Task model ;)
[22:05:07] <greyTEO> master -> Primary and thread being threads* that can tail the cursor.
[22:06:13] <GothAlice> medmr: https://github.com/marrow/task/blob/develop/marrow/task/queryset.py#L9-L62 is the part that handles tailing cursors with timeouts; rough estimate timeouts, though, until a certain JIRA ticket gets fixed. This code lets you terminate the building of a Query and use a simple for loop iterator to get results back as they arrive.
[22:06:17] <fewknow> GothAlice: I use multiple databases on every project. But I use a DAL(Data Access Layer) that abstracts that from everyone. But different database technologies makes sense at different points in a project
[22:06:40] <GothAlice> greyTEO: I have yet to encounter a limit on the number of tailing cursors (workers/listeners) able to simultaneously query a single collection.
[22:08:23] <greyTEO> I have only seen 1 task per primary. Hence why I thought duplicate data would be distributed with more than one listener
[22:08:31] <GothAlice> fewknow: Considering I wouldn't trust a single DAL to handle relational, hierarchical/graph, and document storage sensibly from a single API (too many compromises would need to be made), in the cases where I need multiple DBs, I use dedicated ODM/ORMs. I.e. MongoEngine for MongoDB, SQLAlchemy for anything relational, etc.
[22:09:39] <fewknow> GothAlice: I use MongoEngine as well for Mongo, but I abstract it from everything else so a developer doesn't need to worry where the data is stored or how it is accessed. They just hit the API and the data is retrieved and delivered.
[22:09:52] <fewknow> I leave it up to someone (ME) that know how to store the data to handle it....
[22:10:11] <GothAlice> greyTEO: The example producers and consumers in the sample code I demonstrated on stage used two producers and five consumers. Whichever consumer locked the record first is the one that gets to work on it.
[22:10:12] <fewknow> more for abstraction for anyone above my layer
[22:10:51] <greyTEO> GothAlice, back to the original question, do you know if the mongo-connector refetches the mongo doc before updating ES?
[22:10:53] <GothAlice> greyTEO: This results in trivial load balancing; if a node is too busy servicing another task, it won't try to lock it.
[22:11:07] <GothAlice> greyTEO: I do not know; it's reasonably readable Python code, though.
[22:12:11] <greyTEO> GothAlice, thanks for all the info. I appreciate it
[22:12:54] <greyTEO> GothAlice, I didn't know the producers lock the document when reading the oplog
[22:13:18] <GothAlice> The "locking" I'm referring to is entirely the realm of this implementation, not anything to do with MongoDB itself.
[22:13:19] <greyTEO> then I totally missed your point?
[22:13:59] <greyTEO> because you are working on documents....not just reading them?
[22:14:32] <GothAlice> Ref: 4-job-handler.py from the presentation slides gist: lines 9/10 handles a failure to communicate with MongoDB, line 12 handles failure to acquire the lock we're trying to set on line 7.
[22:15:05] <GothAlice> It's a two-part implementation: it uses a capped collection for messaging (new job, job failed, job finished, etc.) but a real collection to track the jobs themselves.
[22:15:24] <GothAlice> (This allows for sharding; capped collections themselves can't be sharded.)
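A minimal sketch of the "first consumer to lock it wins" idea, not GothAlice's actual code; the collection, state values, and workerId are assumptions. findAndModify claims a job atomically, so only one worker ends up processing it:

    var job = db.jobs.findAndModify({
        query:  { state: "new" },
        update: { $set: { state: "locked", owner: workerId, lockedAt: new Date() } },
        new: true
    });
    if (job === null) {
        // another worker claimed it first, or there is no work available
    }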
[22:16:59] <greyTEO> I usually use rabbit for most things in our current system.
[22:17:51] <greyTEO> I am just getting into mongo and denormalizing some mysql tables to mongo.
[22:18:18] <greyTEO> so it seemed to fit mongo oplog -> rabbit -> whatever we currently have in our system
[22:18:39] <greyTEO> GothAlice, again thanks for clearing it up
[22:19:43] <GothAlice> greyTEO: Practicality beats purity; but this task system evolved from a project I joined two weeks in that was using, and I'm not joking: rabbitmq, zeromq, memcache/membase, redis, and postgresql.
[22:19:57] <GothAlice> (All replaced within the next two weeks with nothing but MongoDB. ;)
[22:20:49] <greyTEO> that is too many services to run at a time
[22:20:52] <GothAlice> I mentioned slapping architects that use multiple databases. I think I left a mark on this one. ;)
[22:22:53] <GothAlice> One queue had persistence, the other had routing capabilities. Memcache because we need caches, right? Redis because efficient semi-persistent sets of integers were needed. Postgres for general data. Queues = capped collections, routing = tailing cursor with a query, automatic cache expiry = TTL indexes, $addToSet/etc. for set operations in MongoDB, and, well, MongoDB documents map to Python structures quite well for general storage.
[22:22:57] <greyTEO> I think multiple databases have their place. I disagree when people use services so than can be "cutting edge"...
[22:23:42] <GothAlice> Certainly; if someone needs referential integrity or transactions, I point them at a relational database designed for that purpose. Same for storing graphs—for the love of the gods, people need to stop shoe-horning graphs on top of non-graph DBs.
[22:25:01] <greyTEO> we still use relational data heavily. Our data fits well into a relational model; what I liked more is the throughput I can get with mongo and apache spark for ETL of a 13GB XML file
[22:25:56] <greyTEO> "One queue had persistence, the other had routing capabilities. Memcache because we need caches, right? Redis because efficient semi-persistent sets of integers were needed. Postgres for general data.".....super simple
[22:26:09] <GothAlice> Yup; it seemed like a good idea to them at the time. ;)
[22:26:20] <greyTEO> that makes my nose bleed just reading that
[22:26:29] <GothAlice> I'll not get started on storing at-rest data in XML, though. ;)
[22:27:09] <GothAlice> You may be very interested in http://www.tokutek.com/tokumx-for-mongodb/ by the way.
[22:27:40] <GothAlice> (Stable MongoDB fork using fractal trees for indexes and oplog buffering, with transaction support and compression.)
[22:28:01] <greyTEO> I have also wanted to try the new wiredtiger engine
[22:28:04] <GothAlice> MVCC support was the thing that really caught my attention.
[22:28:39] <GothAlice> WT isn't… quite… production-quality yet. There's a standard list of about half a dozen critical memory management issues I typically paste. ;)
[22:36:13] <GothAlice> (While avoiding InnoDB like the plague. I had to reverse engineer its on-disk format, and it took the 36 hours I had prior to getting my wisdom teeth extracted to recover my client's data after an Amazon EC2/EBS failure across multiple zones.)
[22:36:44] <GothAlice> That was *not* a fun week. ;)
[22:39:03] <GothAlice> Yeah. We switched to Rackspace after that. AWS's SLA explicitly stated their system was architected to isolate failures to a single zone and explicitly recommended doing what we did to ensure high-availability. Well, they were wrong. ;)
[22:43:57] <GothAlice> Weirdly, for application nodes, there are many cheaper solutions. (I'm trialling clever-cloud.com right now at work; so far so good.) For DB? Not so much. (My personal MongoDB dataset would cost me just under half a million dollars per month if I hosted it on compose.io… by self-hosting it, it cost me around ~$10K once. I can afford to replace every drive every month by not cloud hosting it, and it paid for the initial investment
[22:46:09] <greyTEO> looks like it splits on the '.'; the current key that I have is 'item.3.isVerified'
[22:46:29] <GothAlice> abishek: Just add another $match stage. I.e. $match, $project, $unwind, $match, $group — the second $match is effectively HAVING-filtering the unwound list items prior to the $group.
[22:48:22] <GothAlice> abishek: Again, just use $match. Sometimes, if your query is very complex, you might need to use conditional expressions within the $group projection. Define "mention".
[22:48:57] <GothAlice> greyTEO: Read that as the isVerified field of the element at index 3 of the item array.
[22:48:58] <abishek> using the HAVING clause in `db.getCollection(collectionName).group()`
[22:51:16] <GothAlice> abishek: I have never once actually ever used .group(). The documentation pretty clearly states: "Use aggregate() for more complex data aggregation."
[22:51:38] <GothAlice> And "HAVING" is meaningless in MongoDB. It's not a thing.
[22:52:03] <GothAlice> http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/ < even clearly states the equivalent is $match
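A sketch of that mapping, with made-up field and collection names: $group replaces GROUP BY, and a second $match placed after the $group plays the role of HAVING:

    db.pages.aggregate([
        { $match: { site: "example.com" } },                // WHERE
        { $group: { _id: "$page", visits: { $sum: 1 } } },  // GROUP BY ... COUNT(*)
        { $match: { visits: { $gt: 100 } } }                // HAVING COUNT(*) > 100
    ]);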