#mongodb logs for Tuesday the 7th of April, 2015

[00:00:47] <afroradiohead> ok i think i got it from here
[00:00:49] <afroradiohead> thanks joannnac
[00:00:52] <joannac> np
[00:00:58] <afroradiohead> i think i just needed to talk it through
[00:05:38] <afroradiohead> alrighty, that... worked
[00:28:07] <d4rklit3> hi
[00:28:10] <d4rklit3> im wondering
[00:28:30] <d4rklit3> is insertion order of ObjectID's a viable solution for list ordering?
[00:28:47] <d4rklit3> like say a document has an array of ObjectIDs
[00:29:01] <d4rklit3> if i just reorder the array, and use that order would that work?
[00:31:52] <joannac> d4rklit3: I'm not sure I understand the usecase
[00:32:09] <d4rklit3> so i have a list of some arbitrary objects in an array attached to a document
[00:32:16] <d4rklit3> basically a list of ObjectIDs
[00:32:25] <d4rklit3> can i use the array's index order as the order of the list?
[00:32:41] <d4rklit3> like in sql, i would have to use a join table for something like this
[00:32:46] <joannac> yes, but it might change if you modify the array?
[00:33:03] <d4rklit3> well modifying the array's order would be to that effect
[00:33:37] <d4rklit3> it would be my responsibility as the developer to ensure that array operations preserve index order unless specifically intended to change it.
[00:34:07] <joannac> so you're asking me an application question?
[00:34:46] <d4rklit3> purely, mongo will never reorder an array of ObjectID's unless the application tells it to
[00:34:48] <d4rklit3> ?
[00:35:07] <d4rklit3> so even if i copy the document to another collection or mongodump/mongorestore it will be preserved
[00:37:42] <joannac> d4rklit3: oh. i think so
[00:39:28] <d4rklit3> cool
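
[Editor's aside: a minimal mongo shell sketch of the pattern d4rklit3 describes; the "lists" collection is hypothetical. The array's stored order is the list order, and MongoDB preserves it unless an update rewrites it.]

    // ObjectId() with no argument generates a new id; three example items
    var a = ObjectId(), b = ObjectId(), c = ObjectId();
    db.lists.insert({_id: "mylist", items: [a, b, c]});
    // reordering the list is just rewriting the array in the new order
    db.lists.update({_id: "mylist"}, {$set: {items: [c, a, b]}});
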
[04:44:44] <hahuang65> is there an easy way to export indexes to import them on another cluster?
[04:49:36] <morenoh149> hahuang65: are both clusters running the same data model?
[04:49:48] <hahuang65> yups
[04:49:58] <hahuang65> exactly
[04:50:10] <hahuang65> building a new test env
[04:53:20] <morenoh149> hahuang65: I'd suspect it can't be done. No mention of it here http://docs.mongodb.org/manual/core/index-creation/
[04:53:37] <hahuang65> morenoh149: no worries, I can do it by hand
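
[Editor's aside: a hedged sketch of scripting the "by hand" approach against the source cluster: print each collection's index keys as createIndex() calls that can be replayed on the target. getIndexes() and tojson() are standard shell helpers; note this prints only the key patterns, so options like unique or sparse would need copying separately.]

    db.getCollectionNames().forEach(function(name) {
      if (name.indexOf("system.") === 0) return;        // skip system collections
      db.getCollection(name).getIndexes().forEach(function(idx) {
        if (idx.name === "_id_") return;                // created automatically
        print('db.getCollection("' + name + '").createIndex(' + tojson(idx.key) + ');');
      });
    });
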
[05:10:19] <Jonno_FTW> I'm having a problem with pymongo, .limit(100) does nothing
[05:17:48] <joannac> Jonno_FTW: need more detail than that
[05:19:18] <joannac> Jonno_FTW: does .limit(1) work?
[05:21:15] <Jonno_FTW> I fixed it, the problem was I was calling .find().limit(100), then later calling .count() on the cursor
[05:24:32] <Jonno_FTW> I actually wanted .count(True)!
[05:25:23] <Jonno_FTW> couldn't possibly have a .size() method in pymongo to match the js api
[07:44:44] <dougb> this is going to be a dumb question, but I'm saving timestamps in mongo under the "t" field, and when I view the record in mongohub I see the following: "t": new Date("2015-04-07T00:00:00-0700") but when I try to filter by setting it to find by {"t": {"$gte": "2001-01-01"}} it's not returning any results
[07:45:44] <soupdiver> Hey guys! I'm currently getting through the official Dockerfile for MongoDB. Can anybody tell me where the directory is set where mongod does write the data to? https://github.com/docker-library/mongo/blob/master/2.6/Dockerfile
[07:46:08] <joannac> dougb: you need to compare dates with dates, not dates with strings
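
[Editor's aside: a minimal sketch of joannac's point; the "records" collection name is hypothetical. A BSON string never compares against a BSON date, so the filter must use a Date value.]

    // matches documents whose "t" Date is on or after 2001-01-01
    db.records.find({t: {$gte: ISODate("2001-01-01T00:00:00Z")}});
    // {"t": {"$gte": "2001-01-01"}} returns nothing when "t" holds dates:
    // strings and dates are different BSON types
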
[07:47:01] <joannac> soupdiver: looks like /data/db
[07:48:24] <soupdiver> @joannac Yeah but WHERE is that setup? I can just see that the Volume is defined and the access rights are set. But where is mongod told to actually use that dir?
[07:49:48] <jdo_dk> i have a document like: myid: 123, docs: [{read:true,name:"Test"},{read:false,name:"Test2"}], can i somehow make a query which "both" match myid:123 and docs.read:true ?
[07:50:02] <joannac> soupdiver: line 29?
[07:50:13] <joannac> soupdiver: alternatively, that's the default path anyway
[07:50:21] <joannac> jdo_dk: yes
[07:51:10] <jdo_dk> joannac: I try like: db.docs.find({myid:123, docs.read:true}) but im not getting any data returned.
[07:51:22] <soupdiver> @joannac But in the container "/etc/mongod.conf" says dbpath=/var/lib/mongodb
[07:51:38] <joannac> jdo_dk: that doesn't parse. you need quotes around docs.read
[07:51:50] <jdo_dk> joannac: I get: SyntaxError: Unexpected token .
[07:52:13] <soupdiver> @joannac that's why I'm confused. I don't think /data is the default location of mongodb to store its data, or is it?
[07:52:26] <joannac> soupdiver: okay? is mongod started with the conf file?
[07:52:30] <jdo_dk> joannac: db.docs.find({myid:123, "docs.read":true}) returns both read:true and false.
[07:52:51] <joannac> jdo_dk: proof please
[07:53:08] <joannac> jdo_dk: specifically, a document which has docs.read = false but is returned
[07:53:28] <soupdiver> @joannac The config file is not specifically set anywhere in the Dockerfile or entrypoint.sh it just says "mongod"
[07:53:41] <joannac> soupdiver: there's your answer then
[07:53:55] <joannac> creating a config file doesn't magically make mongod use it
[07:54:38] <soupdiver> @joannac so the built in behaviour of mongod is to use /data if not specifying something else?
[07:54:45] <joannac> soupdiver: /data/db
[07:55:00] <jdo_dk> joannac: How do i provide proofs ?
[07:55:07] <jdo_dk> screenshot ?
[07:55:19] <joannac> soupdiver: http://docs.mongodb.org/manual/tutorial/manage-mongodb-processes/#start-mongod-processes
[07:55:25] <joannac> jdo_dk: screenshot or pastebin
[07:55:46] <soupdiver> @joannac alright. Thanks :) It just didn't look "normal" to use that path. That's why I was confused.
[07:57:13] <dougb> ok, I think I got past that date issue. Now I'm getting BadValue Unsupported projection option: count: { $sum: "$c" } when I'm trying to return the result of getting the sum of a single column in the results
[07:57:31] <jdo_dk> joannac: https://dpaste.de/vKku
[07:58:05] <joannac> dougb: are you doing aggregation?
[07:58:29] <joannac> jdo_dk: the element that has name = Test3 has read = true
[07:58:35] <joannac> same with Test7 and Test8
[07:59:27] <dougb> joannac: http://pastebin.com/rrMfB0bH that's what I'm trying to do, using mgo w/ go. I switched the "$sum": "c" from "$sum": "$c" just trying to see what works
[07:59:38] <jdo_dk> joannac: Yeah. And the query is docs.read:false, so dont understand why they are there..
[07:59:54] <joannac> jdo_dk: there's one array element with read = true. done
[08:00:03] <joannac> jdo_dk: what you want to do is not supported afaik
[08:00:14] <joannac> dougb: where did you get that code?
[08:00:18] <joannac> dougb: that's not valid code
[08:00:31] <dougb> i know, I am writing it...
[08:00:43] <joannac> dougb: okay. go back to the docs.
[08:00:54] <joannac> there's no operator $sum in a find() call
[08:01:13] <joannac> you probably want aggregation
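
[Editor's aside: a hedged shell sketch of the aggregation joannac is pointing at; the collection name and the $match stage are hypothetical, the "$c" field is taken from dougb's snippet above.]

    db.events.aggregate([
      {$match: {t: {$gte: ISODate("2015-01-01T00:00:00Z")}}},   // filter first
      {$group: {_id: null, count: {$sum: "$c"}}}                // sum the "c" field across the matches
    ]);
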
[08:02:04] <joannac> jdo_dk: try http://stackoverflow.com/questions/6038818/mongodb-query-to-match-each-element-in-doc-array-to-a-condition although it's going to be really slow
[08:03:27] <jdo_dk> joannac: Thanks. Why cant i query on subdocuments ?
[08:03:51] <joannac> jdo_dk: you can, just not the way you want
[08:04:09] <joannac> jdo_dk: there's no syntax for "I want every element in my array to satisfy this condition"
[08:05:30] <jdo_dk> joannac: So i will get the array if matched or not ? And im unable to query only the elements in the list? Now i got it. :D
[08:06:21] <joannac> jdo_dk: ah, right. there's no operator for "return me only the elements in this array that matches YYY"
[08:07:01] <jdo_dk> joannac: Thanks for clarify. :p
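
[Editor's aside: a sketch of the pattern behind the Stack Overflow link above, assuming jdo_dk's document shape. There is no "every element matches" operator, but negating the opposite condition gets the same result.]

    // documents where myid is 123 and NO element of docs has read: false,
    // i.e. every element (if any) has read: true
    db.docs.find({myid: 123, docs: {$not: {$elemMatch: {read: false}}}});
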
[08:09:30] <dougb> joannac: thank you for your help, does this look better? http://pastebin.com/3GiPvT7f it gives me the results I'm looking for
[08:15:43] <mtree> hello
[08:16:33] <mtree> what should i set in connection string to NOT wait for index updates when saving queries to mongo?
[08:24:03] <nickenchuggets> how do you push an element onto a mongodb array?
[08:25:52] <nickenchuggets> ah, nvm
[08:52:34] <mtree> Ok, so here's the thing: i want to allow my app to read from replicas and do not wait for propagation to other replicas when saving. What should I add to query string?
[08:53:31] <mtree> connection string*
[09:05:56] <joannac> mtree: look up write concern and read preference
[09:13:52] <mtree> joannac: thanks :) w=1 seems like what i'm looking for
[09:18:45] <Derick> w=1 is the default
[09:19:07] <Derick> if you need to read from replicas, then you need to set your readPreference as well
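
[Editor's aside: a hedged example of the connection-string options being discussed; the hosts, database, and replica set name are placeholders. w=1 acknowledges a write once the primary has it (no wait for secondaries), and readPreference lets reads go to secondaries.]

    mongodb://host1:27017,host2:27017,host3:27017/mydb?replicaSet=rs0&w=1&readPreference=secondaryPreferred
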
[09:22:30] <joe1234> Hi, I would like to know if it's safe to ignore the schema design constraints chosen due to fragmentation and reallocation problems (of memory mapped files) if the engine is switched to wiredTiger (I can't find a clear yes or no answer on the net).
[09:39:17] <pamp> hi
[09:39:33] <Derick> hi
[09:40:11] <pamp> anyone have problems to remotely connect to a mongodb DB in version 3.0.1
[09:40:35] <pamp> I can connect directly in the server
[09:40:54] <pamp> but fails when I try to connect remotely
[09:42:00] <Derick> what does it show when you run "netstat -a -n | grep 27017" on the server?
[09:42:21] <pamp> Im in a windows server machine
[09:42:30] <Derick> oh, then I don't really know - sorry
[09:43:21] <Derick> you need to find out on which IPs the MongoDB server is bound
[09:44:02] <Derick> but I also wouldn't be surprised if the Windows firewall would be in the way
[09:52:12] <pamp> nop, I think its a 3.0 version problem
[09:52:49] <pamp> because the settings are properly made
[09:58:30] <pamp> sorry
[09:58:38] <pamp> my problem was robomongo
[09:58:56] <pamp> outdated
[09:59:01] <Derick> ah
[10:44:02] <pamp> is it possible with javascript to verify if a field is an array
[10:44:02] <pamp> ?
[10:44:30] <pamp> something like if(doc.props[i].v.$type == 4)
[10:49:07] <MatheusOl> pamp: your_item instanceof Array
[10:51:39] <pamp> thanks a lot
[10:51:59] <pamp> an if the field is an embedded document?
[10:52:04] <pamp> and*
[10:52:13] <pamp> its possible to verify that
[10:54:00] <MatheusOl> hm
[10:54:18] <MatheusOl> doc.field instanceof Object, perhaps
[10:54:36] <MatheusOl> Although I'm not sure if Date or ObjectID would return true for that
[10:57:10] <pamp> http://dpaste.com/0R6YA27 this is my script
[10:57:36] <pamp> it is matching the field I want, but it is matching other fields too, like Int64
[10:58:35] <pamp> example of a document http://dpaste.com/2BPGF7Q
[10:59:03] <pamp> I need to match only the last element of the props array, which is an embedded document
[10:59:34] <pamp> but it is matching these fields too: { "k" : "qOffset1sn", "v" : NumberLong(0) }, { "k" : "qOffset2sn", "v" : NumberLong(0) }
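
[Editor's aside: a hedged shell sketch of MatheusOl's instanceof suggestion applied to pamp's document shape; the "samples" collection name is hypothetical. In the legacy shell, wrapper values such as NumberLong and ObjectId are also objects, so they have to be excluded explicitly; the exact list of exclusions may need adjusting for your data.]

    db.samples.find().forEach(function(doc) {
      (doc.props || []).forEach(function(p) {
        if (p.v instanceof Array) {
          print(p.k + ": array");
        } else if (p.v instanceof Date || p.v instanceof ObjectId || p.v instanceof NumberLong) {
          print(p.k + ": scalar wrapper type");   // NumberLong(0) etc. would otherwise look like objects
        } else if (p.v !== null && typeof p.v === "object") {
          print(p.k + ": embedded document");
        }
      });
    });
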
[11:05:17] <betamarine> hi, i'm new to mongo. how do I make a local db for testing?
[11:05:23] <betamarine> something like db.sqlite3
[11:06:26] <pamp> I managed to solve my problem thank you
[11:07:13] <pamp> betamarine: just db.sqlite3.insert({data})
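
[Editor's aside: MongoDB has no single-file database like db.sqlite3; the closest equivalent is pointing a local mongod at a throwaway data directory. A minimal sketch with arbitrary paths and port:]

    mkdir -p ./testdata
    mongod --dbpath ./testdata --port 27018
    # in another terminal:
    mongo --port 27018
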
[11:55:58] <KekSi> i'm quite lost right now: has anyone tried to get a cluster with x509 authentication to run in docker instances?
[11:57:02] <KekSi> how do i get the server to actually use the certificate (it just says "The server certificate does not match the host name 127.0.0.1")
[12:01:01] <diegoaguilar> Hello Derick
[12:01:10] <diegoaguilar> you passed me link at ur site
[12:01:15] <diegoaguilar> with some example documents
[12:01:17] <diegoaguilar> last time
[12:01:21] <diegoaguilar> could u provide again?
[12:07:42] <Derick> diegoaguilar: what was the subject? i've many articles! was it for solr or something?
[12:07:46] <joe1234> KekSi, Never done that for mongo , but I guess you gotta generate a cert whose CN matches the hostname of the docker container (result of hostname command), you can verify using "openssl x509 -in certfile.ext -text " (maybe you 'll have to specify the cert encoding CER or DER), there you can check the CN, you also gotta check that you properly defined your container's hostname
[12:10:19] <kiddorails> We have mongo db of size > 2TB. We need to daily remove the data which is 3 months old. We have been doing it by remove() in batches to prevent db lockdown, but this is very time consuming.
[12:10:46] <kiddorails> I was reading about TTL in Mongo. This seems very appealing but I'm not sure if it will work well for us.
[12:11:24] <kiddorails> First, if we index the entire db in foreground, it will lock it down for Read/Write.
[12:11:43] <kiddorails> If we do it in background, it may turn out to be very time consuming and high on RAM. Not sure. :/
[12:12:08] <kiddorails> I'm curious what would be the best way to handle this problem
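
[Editor's aside: a hedged sketch of the TTL approach kiddorails is weighing. TTL indexes require a BSON Date field; a background thread removes expired documents roughly once a minute, so deletion is gradual rather than one bulk remove. The collection and field names are hypothetical, and background: true avoids the read/write lock of a foreground build at the cost of a slower build.]

    // expire documents ~90 days after their createdAt date
    db.events.createIndex(
      {createdAt: 1},
      {expireAfterSeconds: 60 * 60 * 24 * 90, background: true}
    );
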
[12:59:04] <rom1504> Hi, I'm getting "Failed: error unmarshaling bytes on document #0: JSON decoder out of sync - data changing underfoot?" when running mongoimport on a (simple and small) json file, any clue what the problem is ?
[13:00:33] <rom1504> oh --jsonArray fixed it..
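
[Editor's aside: for reference, the flag that fixed it; when the input file is a single JSON array rather than one document per line, mongoimport needs --jsonArray. The database, collection, and file names here are placeholders.]

    mongoimport --db test --collection items --file data.json --jsonArray
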
[13:36:02] <jokke> hello
[13:36:42] <jokke> is it possible to get all distinct keys for an embedded document in documents of a collection?
[13:37:45] <jokke> eg. i have { _id: 'foo', name: 'bar', embedded: { foo: 'bar', baz: 'foobar' } }
[13:38:11] <jokke> so with only that document i'd expect to get a result with values: ['foo', 'baz']
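
[Editor's aside: a hedged sketch of one way to answer jokke's question on a 2015-era server: a small map/reduce that emits every key of the embedded subdocument and returns the distinct set. The collection name is hypothetical; the "embedded" field follows the example above.]

    db.things.mapReduce(
      function() {
        for (var key in this.embedded) { emit(key, null); }   // one emit per key
      },
      function(key, values) { return null; },                 // only the keys matter
      {out: {inline: 1}}
    );
    // each result's _id is one distinct key, e.g. "foo", "baz"
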
[13:54:00] <magesing> Hi everyone I installed mongo as a backend for a sharelatex server... I started to get error 503 when I try and access sharelatex, and I have tracked the problem to my mongod misbehaving... When I have mongod started, and I try to open a shell to it with "mongo" I get:
[13:54:12] <magesing> Error: couldn't connect to server 127.0.0.1:27017 (127.0.0.1), connection attempt failed at src/mongo/shell/mongo.js:146
[13:54:30] <magesing> How can I fix this error and get my database running again? Thanks.
[13:55:25] <compeman> magesing: did you try mongo --repair
[13:56:04] <magesing> compeman: yes, I just did: mongod --repair --dbpath=/srv/sharelatex-data/data
[13:56:12] <magesing> followed by: service mongod restart
[13:56:19] <compeman> sudo rm /var/lib/mongodb/mongod.lock
[13:56:19] <magesing> then mongo, and got the same error
[13:56:27] <compeman> there may be a lock on your mongo
[13:56:31] <compeman> sudo rm /var/lib/mongodb/mongod.lock
[13:56:35] <compeman> then mongo --repair again
[13:57:29] <magesing> does repair need to be done with the daemon running or stopped?
[13:57:57] <compeman> by the way, are you sure that your mongo is running?
[13:58:00] <magesing> tried both, neither worked
[13:58:01] <compeman> try also mongod
[13:58:26] <magesing> mongod 34090 1 0 09:56 ? 00:00:00 /usr/bin/mongod -f /etc/mongod.conf
[13:58:30] <magesing> yes mongo is running
[13:58:37] <compeman> kill the process
[13:58:41] <compeman> start again
[13:59:03] <compeman> also check : netstat -anp | grep 27017
[13:59:38] <magesing> ok, I killed it by stopping the process, checked the process was gone, restarted, and tried mongo, same error
[13:59:56] <compeman> can you restart mongo
[14:00:00] <compeman> manually
[14:00:02] <compeman> with mongod
[14:00:04] <magesing> netstat says:tcp 0 0 172.17.42.1:27017 0.0.0.0:* LISTEN 34829/mongod
[14:00:18] <magesing> compeman: trying manually
[14:00:22] <compeman> there must be some log
[14:00:25] <compeman> when mongo starts
[14:01:35] <magesing> compeman: starting it manually now it is currently " preallocating a journal file"
[14:01:42] <magesing> now is waiting for connections
[14:01:56] <magesing> aaand mongo works
[14:02:06] <magesing> ... so is the problem in my settings or my init script?
[14:02:10] <compeman> yeap
[14:02:21] <compeman> just configure your dbpath
[14:02:26] <compeman> i think.
[14:03:59] <magesing> compeman: aah I see, my dbpath somehow got set to /var/lib/mongo instead of /srv/sharelatex-data/data ... I think the install of gitlab may have done it.
[14:34:20] <Aartsie> Hi all!
[14:34:42] <Aartsie> Is it possible to reduce the diskspace in MongoDB 3 ?
[14:35:18] <GothAlice> Aartsie: http://docs.mongodb.org/manual/reference/configuration-options/#storage.mmapv1.smallFiles
[14:36:33] <Aartsie> GothAlice: Thank you but that is not what i mean... i have a lot of old data but when i delete this data it wil not reduce the used disk space
[14:36:44] <GothAlice> Ah, for that you need to compact your collections.
[14:36:52] <magesing> OK, I have a server which has mongod up and running and bound to 172.17.42.1, I can connect to it from the local server using mongo --host 172.17.42.1 If I open a shell inside the docker container, I can successfully ping 172.17.42.1, however, mongo --host 172.17.42.1 won't connect... How can I fix that? Thanks.
[14:36:53] <GothAlice> http://docs.mongodb.org/manual/reference/command/compact/
[14:37:31] <GothAlice> magesing: Sounds like a docker NAT issue.
[14:37:40] <Aartsie> GothAlice: Thank you :) then it will be release the disk space ?
[14:38:18] <GothAlice> Aartsie: Sorta. It'll compact everything back down to be as efficient as possible. It likely won't delete unused stripes. A full repair would do that, but it's an offline operation and can take a long time (and needs 2.1x as much free disk space as you have data.)
[14:40:53] <Aartsie> Why isn't there a good clean function in MongoDB because this is ridiculous... You want to clean up because you reach the max of your diskspace and before doing this you will need more disk space
[14:44:59] <Derick> GothAlice: interestingly, the docs say that "compact" only needs at most 2Gb of free space... http://docs.mongodb.org/manual/reference/command/compact/#disk-space
[14:46:54] <Derick> GothAlice: "repairDatabase" needs 2n + 2Gb (and returns data back to the OS): http://docs.mongodb.org/manual/reference/command/repairDatabase/#dbcmd.repairDatabase
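
[Editor's aside: the two commands being compared, as run from the shell; the collection name is a placeholder. compact works per collection and blocks operations on its database while it runs; repairDatabase rebuilds the whole database's files and is what actually returns space to the OS.]

    db.runCommand({compact: "mycollection"});   // defragment one collection in place
    db.repairDatabase();                        // full rebuild; needs the extra free space discussed above
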
[14:49:09] <magesing> Is it possible to connect to a running docker daemon with telnet? I'm still trying to debug why the connection from my docker image isn't working.
[14:56:32] <GothAlice> Aartsie: As with anything, failure to use a tool correctly is rarely the fault of the tool. If you're hosting a database, of any kind, you better be monitoring disk and RAM utilization, replication latencies, page faulting, etc., etc., etc.
[14:57:35] <GothAlice> (And have sufficient plans in place to handle abnormal situations such as running low on disk.)
[14:58:50] <Aartsie> it is no problem to put more diskspace on the server but it is weird that there is no function for deleting this old data and reduce used diskspace
[14:59:31] <GothAlice> Aartsie: repair is that function.
[14:59:53] <Aartsie> yeah ok but it is not possible to repair just one collection
[15:00:01] <GothAlice> Due to the way repair works, it's pretty much the only way to accomplish what you're asking, anyway. (It literally reads all of the data out and dumps it into freshly created stripes.)
[15:00:33] <GothAlice> Aartsie: That's because MongoDB data isn't organized into file-per-collection or anything like that. All collections within one database are stored in the same stripe set.
[15:05:45] <GothAlice> Welp, another thing to add to my FAQ. XD
[15:06:11] <Aartsie> GothAlice: Thank you for the explanation
[15:06:15] <Derick> GothAlice: i tweeted about your FAQ earlier today
[15:06:18] <GothAlice> It never hurts to help.
[15:06:21] <GothAlice> Derick: I noticed! XP
[15:06:37] <GothAlice> Derick: That was a rapid set of retweets and favourites hitting my mailbox. :D
[15:06:44] <Derick> mostly bots
[15:08:06] <GothAlice> Heh; whenever I get mentioned online I get notifications out the whazoo. I'll probably get re-pinged within 24h after Google indexes your tweet. XD
[15:08:11] <GothAlice> (Speaking of bots.)
[15:10:09] <MacWinner> if you want to do a seamless failover of a primary node to secondary, is it better to tweak replica set priorities? I see that doing rs.stepDown() can cause something like 10-20 seconds of downtime
[15:10:46] <MacWinner> if you reconfigured the node priorities, would it be cleaner?
[15:12:58] <GothAlice> MacWinner: If you can avoid elections (i.e. by ensuring there's only ever one node with the highest—I think—priority) the failover process should be pretty quick.
[15:13:21] <GothAlice> It'll still boot out existing connections when the election is called, but a quorum would be met very quickly.
[15:15:26] <MacWinner> GothAlice, thanks.. just want to make sure I understood you.. so if I have a 3 member RS, and node-1 is currently primary and all priorities are equal, I'm thinking of doing this: Set node-2 priority higher than node-1. Will the RS automatically do a failover when it detects a higher priority member that is not primary?
[15:15:44] <GothAlice> MacWinner: Reconfigurations do call elections, AFAIK.
[15:16:42] <GothAlice> Basically you want to set your current primary to be the highest priority, the secondary you want to become primary to be the next highest, and the secondary you only want to use (read-only) in the event of a chain of catastrophic failures as the lowest priority (or zero; i.e. never-primary).
[15:19:11] <MacWinner> GothAlice, got it... so then if I do a stepDown on primary, it should elect the second node faster because there is no contention?
[15:19:50] <Derick> a node with priority=0 will never be elected
[15:21:06] <MacWinner> i see.. so if node-a=2, node-b=1, and node-c=0, then would doing a stepdown on node-a theoretically make the election process very fast?
[15:21:14] <GothAlice> MacWinner: And if the current primary is still operational with the high priority, it'll simply be re-elected on the spot.
[15:21:55] <GothAlice> Telling node-a to step down will have it immediately step up again, as it's the only node in the highest available priority.
[15:21:58] <MacWinner> ahh.. that was my next question.. it seems like the priorities would make node-a become prmary automatically even if you did stepdown
[15:22:27] <GothAlice> This is to have control over how high availability failover (i.e. the primary _utterly explodes_) is handled within your cluster.
[15:22:55] <GothAlice> Typically used as a light-weight way of indicating that certain nodes may be sub-optimal, i.e. because they're in a different datacenter, have really slow disks, etc.
[15:22:57] <MacWinner> alright.. thanks.. i'll play around with it now in my test environment (thanks to the script you provided me a couple weeks ago)
[15:23:09] <GothAlice> :D
[15:23:10] <GothAlice> It's handy, no?
[15:23:17] <MacWinner> extremely
[15:24:19] <MacWinner> oh, have you upgraded to mongo 3.0.1 with wiredtiger? i'm particularly interested in the compression since my data is highly repetitive
[15:24:29] <GothAlice> Do you have unlimited amounts of RAM?
[15:24:59] <GothAlice> (Seems to be the largest outstanding problem with WT that prevents use on systems that are scaled within the bounds of sanity.)
[15:24:59] <MacWinner> you mean through sharding?
[15:25:07] <GothAlice> No… I mean on a single physical host.
[15:25:46] <MacWinner> i have some servers with 16gb.. and another with 64gb.. the dataset currently is about 10gb
[15:26:19] <GothAlice> MacWinner: https://jira.mongodb.org/browse/SERVER-17421 https://jira.mongodb.org/browse/SERVER-17424 https://jira.mongodb.org/browse/SERVER-17386 https://jira.mongodb.org/browse/SERVER-17456 https://jira.mongodb.org/browse/SERVER-17542 ancillary: https://jira.mongodb.org/browse/SERVER-16311
[15:26:20] <MacWinner> sorry, i didn't understand your comment about the largest outstanding problem with WT
[15:26:27] <GothAlice> Have some JIRA tickets.
[15:26:51] <Derick> GothAlice: IIRC, the stepDown is by default "60 seconds" meaning it will take 60 seconds before it is available for election as primairy again. You can change that by using rs.stepDown(180) f.e.
[15:27:14] <GothAlice> Derick: Ah! Good to know, and that would interact with priorities in a very interesting way.
[15:27:17] <MacWinner> Derick, good to know
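
[Editor's aside: a hedged shell sketch of the reconfiguration being discussed, for a three-member set; member indexes and priority values are illustrative. Reconfiguring and stepping down both trigger an election, and rs.stepDown(N) keeps the old primary out of the running for N seconds.]

    cfg = rs.conf();
    cfg.members[0].priority = 2;   // preferred primary
    cfg.members[1].priority = 1;   // preferred failover target
    cfg.members[2].priority = 0;   // never becomes primary
    rs.reconfig(cfg);

    // later, on the current primary, to hand over deliberately:
    rs.stepDown(60);               // stay out of elections for 60 seconds
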
[15:27:46] <MacWinner> GothAlice, thanks for tickets.. i'll hold off for now on WT upgrade
[15:28:56] <GothAlice> MacWinner: At the moment I can not recommend WiredTiger for production use (in the majority of cases; there are situations where it would likely be A-OK) due to the severity of some of these outstanding tickets. I've encountered two of these directly, and can reproduce one of them with the eventual result of causing a kernel panic in the VM. (The one mentioning it getting very slow; let it go, you'll encounter OOM-killer and it has a
[15:28:56] <GothAlice> chance of nuking something important accidentally.)
[15:30:00] <MacWinner> cool.. i'm just OCD about upgrading for some stupid reason
[15:30:06] <GothAlice> Heh, so am I.
[15:30:17] <MacWinner> so i need to resist and counteract myself
[15:30:44] <GothAlice> For the most part it improved query performance and substantially reduced our disk usage (despite us using tricks like single-character field names) but if rapidly upserting 17K records causes it to explode… nah. ;)
[15:31:47] <MacWinner> if i'm not mistaken, will wiredtiger make using single-character field names obsolete? ie, will it compress it automatically?
[15:32:05] <GothAlice> Effectively yes.
[15:32:24] <Derick> the engine doesn't know about the format though. compression is per "block"
[15:32:25] <GothAlice> I mean, it still has to decompress them, so there's still overhead in the wire protocol if you use long keys.
[15:32:55] <GothAlice> Derick: Keys in a collection are desperately screaming for huffman coding.
[15:33:15] <Derick> GothAlice: yes - which incidently I had to work out by hand in uni once.
[15:33:19] <Derick> (huffman coding)
[15:33:20] <GothAlice> (One would allocate shorter huffman codes for things appearing in nested arrays, etc.)
[15:34:27] <GothAlice> I managed to explain huffman trees/coding to a SEO specialist co-worker of mine, in 10 minutes, with the aid of a whiteboard. At the end he made me proud; he asked the best follow-up question ever: "So, uh, this is why zipping a JPEG or movie makes it bigger?" Ding ding ding! (And yes, it was me attempting to get him to stop doing that. ;)
[15:34:57] <Derick> hwhw
[15:34:59] <Derick> hehe even
[15:36:04] <GothAlice> Derick: Any plans for "speedy" wire protocol compression?
[15:36:43] <Derick> I'm not aware of any changes to the wire protocol. I don't think a decision to change that is going to be taken lightly.
[15:44:35] <MacWinner> so basically you think it's alright to upgrade to 3.0.1 in production but not upgrade storage engine just yet?
[15:46:28] <zarry> Hey All, I was interested in changing my replset name(_id) but most examples I found via stackoverflow or otherwise did not work.
[15:46:36] <zarry> Has anyone done this recently with success?
[16:01:22] <Torkable> I'm aware of ObjectId("...").getTimestamp(), but is there are way to select the date, from the objectId, and return it as part of the query?
[16:03:52] <GothAlice> Torkable: No, not currently. The date projection operators don't work on ObjectIds. Vote and watch https://jira.mongodb.org/browse/SERVER-9406 for updates.
[16:04:11] <Torkable> GothAlice: thanks
[16:04:30] <GothAlice> However, you can query ranges by building fake ObjectIds which you pass in your query. (The projection side is just lacking at the moment.)
[16:06:42] <zarry> @GothAlice do you have any experience with changing replset names?
[16:06:57] <GothAlice> zarry: Alas, no, I treat set names as immutable after construction.
[16:07:04] <GothAlice> (If I need a new set, I spin up a new cluster.)
[16:07:05] <Torkable> GothAlice: yea, I need to add real dates to my docs then, thanks
[16:07:14] <zarry> :( ok, appreciated!
[16:07:30] <GothAlice> Torkable: Indeed. Projection is the biggest reason any of my records have explicit creation time fields. (I only have that where absolutely required.)
[16:08:58] <Torkable> GothAlice: agreed, looks a bit messy, but its a log collection and the Ids arn't gunna cut it I guess :(
[16:10:06] <GothAlice> Torkable: For my logs, I don't bother with it. Ripping the UNIX timestamp integer out of the binary ObjectId in the presentation layer is really, really fast, and the main use for a "creation time" in that use case, for me, is for querying, not projection. (I.e. range querying on _id to select time ranges.)
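
[Editor's aside: a sketch of the "range querying on _id" trick GothAlice mentions; the "logs" collection is hypothetical. An ObjectId starts with a 4-byte UNIX timestamp, so a synthetic ObjectId built from a date bounds the range.]

    var start = new Date("2015-04-01T00:00:00Z");
    // 8 hex chars of seconds-since-epoch, padded out to a full 24-char ObjectId
    var startId = ObjectId(Math.floor(start.getTime() / 1000).toString(16) + "0000000000000000");
    db.logs.find({_id: {$gte: startId}});
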
[16:10:44] <Torkable> hmm
[16:10:59] <GothAlice> The projection thing mainly cripples statistical analysis, i.e. chunking your data into time intervals.
[16:11:12] <Torkable> the problem I ran into is, "when were these docs created"
[16:11:15] <GothAlice> (So my Hit records _do_ have creation times due to the need to perform aggregate queries on it.)
[16:11:29] <Torkable> and I don't want to check each one by hand with getTimestamp()
[16:11:57] <GothAlice> "when were these docs created" is a question perfectly answerable by getTimestamp. I'm not seeing the problem.
[16:12:22] <GothAlice> What are you actually doing with the answer to that question? Displaying it?
[16:12:51] <Torkable> the log collection is more of a temp back up
[16:13:17] <GothAlice> Waitaminute. You better be using a capped collection for that. ;)
[16:13:20] <Torkable> users removed docs, they get placed in a deleted doc table for a while incase a mistake was made
[16:13:43] <GothAlice> (I.e. I certainly hope this date querying/projection stuff isn't to determine if old data needs to be cleaned up; MongoDB has several ways of doing this for you)
[16:14:09] <Torkable> so lets say a user says, "all my data of this type is gone"
[16:14:23] <Torkable> and I find 10 deleted docs that belonged to this user in my "log" collection
[16:14:43] <Torkable> I want to know when each one was deleted, aka created in the log collection
[16:14:57] <Torkable> and I don't want to run getTimestamp() over and over
[16:15:12] <GothAlice> Two questions to your last two statements: why, and why?
[16:15:33] <Torkable> >.<
[16:15:34] <Torkable> ok
[16:15:46] <Torkable> boss is like, did your code fuck this up?
[16:15:53] <Torkable> and I say, lets check
[16:16:01] <Torkable> when were these docs deleted
[16:16:11] <Torkable> and what if there are more than 10
[16:16:13] <GothAlice> No. I'm asking why you want to know when each one was deleted, and why getTimestamp is no-go?
[16:16:28] <Torkable> I dont' want to run getTimestamp over and over
[16:16:34] <GothAlice> If you feel not wanting to call getTimestamp is an optimization, it's a premature one. (Alice's law #144: Any optimization without measurement is by definition premature.)
[16:16:36] <Torkable> what if 100 docs were deleted?
[16:16:39] <Torkable> or 1000?
[16:17:07] <GothAlice> getTimestamp performs, in the optimal case, a binary slice (a tiny bit of pointer math) and casting to integer. It's going to be _blindingly_ fast.
[16:17:14] <Torkable> im not worried about perf
[16:17:28] <Torkable> ok I think maybe I missed somthing
[16:17:37] <Torkable> can I run getTimestamp on all the docs
[16:17:46] <Torkable> and see the creation date for each one
[16:18:00] <GothAlice> … considering you iterate a cursor, yes, you can iteratively run getTimestamp on all of the results of your query in a single line of code.
[16:18:20] <Madmallard> I am having what seems like a corruption error with MongoDB 3.0.1
[16:18:25] <Torkable> ok how do I do that
[16:18:29] <GothAlice> Madmallard: WiredTiger?
[16:19:44] <GothAlice> Torkable: db.collection.find({… query here …}, {… projection here …}).forEach(function(doc){ doc.created = doc._id.getTimestamp(); doSomethingElse(doc); })
[16:20:03] <Torkable> GothAlice: you're awesome, thank you
[16:21:00] <GothAlice> Now, assigning it to the document like this is probably a bad idea, FYI.
[16:21:12] <GothAlice> (I'd pass it as a second argument to the "processing" function I called "doSomethingElse" here.)
[16:22:00] <Torkable> mmmm I just want the output
[16:22:24] <GothAlice> As long as you aren't calling db.collection.save() with the result of the processing function, you should be fine. (You don't want to accidentally save out a creation time field you'll never really use.)
[16:23:45] <Madmallard> GothAlice i'm not using WiredTiger
[16:24:15] <GothAlice> Madmallard: Hmm. mmapv1 should be quite stable in MongoDB 3. Do you have a reproducible test case?
[16:26:32] <Madmallard> Nevermind
[16:26:44] <Madmallard> The issue was that I didn't realize the mongo viewer was unordered
[16:26:47] <Madmallard> lol
[16:27:17] <GothAlice> XD
[16:27:35] <GothAlice> I do enjoy when concerning-sounding issues turn out to be something simple.
[18:02:12] <zarry> Anyone have any luck updating replset names? Downtime would be ok, just can't seem to do it even with downtime.
[18:05:22] <mazzy> hi this could be a silly question but it might be quite interesting from some perspective
[18:05:53] <mazzy> do you store text in uppercase or lowercase?
[18:06:46] <StephenLynx> both.
[18:07:12] <mazzy> what you mean?
[18:07:42] <StephenLynx> what do YOU mean? How do you figure out what characters are supposed to be lower or upper case when you read?
[18:08:10] <mazzy> let assume to have a list of brands
[18:09:02] <mazzy> typically they are represented with the first letter of every part of the brand in uppercase
[18:09:29] <mazzy> for example
[18:09:32] <mazzy> the weater channel
[18:09:56] <mazzy> the weather channel
[18:10:19] <mazzy> how would you store it?
[18:10:28] <StephenLynx> " of every part of the brand in uppercase"
[18:10:35] <StephenLynx> upper case then.
[18:10:35] <mazzy> the weather channel or The Weather Channel
[18:10:40] <StephenLynx> since the text is in upper case.
[18:10:54] <StephenLynx> that is not upper case, that is mixed.
[18:10:55] <GothAlice> [Technical aside: High bit low, second to last bit high, third to last bit high (lowercase) or low (uppercase) typically indicates the combination of text and indicates capitalization in ASCII.]
[18:11:00] <GothAlice> Specifically, "title case".
[18:11:11] <GothAlice> And "title case" has some peculiar language-specific rules.
[18:12:07] <GothAlice> I.e. it doesn't exist in French. (I.e. "the" wouldn't be capitalized) Nouns are, though. Also, particle words in English aren't typically capitalized. (I.e. "Department of Justice" — lower-case "of")
[18:12:45] <GothAlice> This is effectively an un-winnable data normalization problem.
[18:13:01] <mazzy> sorry StephenLynx
[18:13:12] <mazzy> I have explained myself bad
[18:13:29] <mazzy> I meant "capitalize the first letter of every word"
[18:13:51] <GothAlice> "Department Of Justice" is wrong, though. That naive title case approach isn't very good.
[18:14:43] <GothAlice> It's one of those situations where you have to rely on users entering it correctly, and potentially storing both what they wrote, and the all-lowercase version with a unique index to prevent differences in case alone.
[18:16:22] <GothAlice> A French example: there's a department of the Québec government called: "Comité de déontologie policière" — no title case.
[18:16:49] <GothAlice> (The English version of that is: "Quebec Police Ethics Committee" — title case.)
[18:21:13] <GothAlice> Natural language is hard. ^_^ (https://raw.githubusercontent.com/marrow/util/develop/marrow/util/escape.py < the "unescape" function here includes personal pronoun mapping… *shudder*)
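
[Editor's aside: a minimal sketch of the "store both versions" approach GothAlice describes; the collection and field names are hypothetical.]

    // keep what the user typed, plus a lowercased copy used for uniqueness and lookups
    db.brands.createIndex({name_lower: 1}, {unique: true});
    db.brands.insert({name: "The Weather Channel", name_lower: "the weather channel"});
    db.brands.find({name_lower: "the weather channel"});
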
[18:40:23] <Prometheian> I've got a shitty ERD for an app I built for school. I was wondering if I could express the data better in document format: http://i.imgur.com/xHx6x6y.jpg
[18:41:28] <GothAlice> Prometheian: Plan on having more than 3.3 million words of locations and ratings per restaurant?
[18:41:44] <Prometheian> Probably not, no.
[18:42:09] <GothAlice> Then you could have two collections combining all of that: Restaurants and Users.
[18:44:03] <GothAlice> Where RestaurantTypes becomes a list of strings (the types), same for Ethnicities, locations is a list of embedded documents, and ratings are likewise a list of embedded documents. (You can add manual _id's to embedded documents, so you can have "internal references" and such.) Or you could split locations from restaurants, and just embed ratings in the location.
[18:44:38] <GothAlice> Generally it's a bad idea to nest more than one level deep (thus it's ill-advised to put reviews inside locations inside restaurants).
[18:44:55] <GothAlice> http://www.javaworld.com/article/2088406/enterprise-java/how-to-screw-up-your-mongodb-schema-design.html < if you're coming from a relational background, this may be helpful
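
[Editor's aside: a hedged sketch of the multi-collection layout GothAlice outlines; all names and fields are illustrative. The "join" becomes a second query in the application layer.]

    var rid = ObjectId();
    db.restaurants.insert({
      _id: rid,
      name: "Example Bistro",
      types: ["cafe", "bistro"],          // RestaurantTypes as a list of strings
      ethnicities: ["french"]
    });
    db.locations.insert({
      restaurant: rid,                    // reference back to the restaurant
      address: "123 Main St",
      ratings: [{user: "alice", stars: 4, text: "Good"}]   // reviews embedded one level deep
    });
    // the "join" in the app layer: fetch the restaurant, then its locations
    db.locations.find({restaurant: rid});
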
[18:45:24] <owen1> i want to allow connections to my mongo from my ec2s. i am on AWS. what the best way to do that?
[18:47:13] <GothAlice> owen1: Set up firewall security groups to explicitly allow only your own EC2 VMs to connect, then just connect.
[18:48:31] <Prometheian> Yeah, I'm tempted to use Mongo but I wasn't sure if it was the better option.
[18:49:03] <Prometheian> And I heard joins were slow in Mongo so you have to create your DB differently.
[18:49:09] <GothAlice> Joins don't exist.
[18:49:22] <GothAlice> Straight-up "no" on that point. ;)
[18:49:38] <Prometheian> Ah, well. :P
[18:50:28] <GothAlice> However, if you pivot your data a bit (I'd go with the three collection approach: restaurants, locations, users, reviews embedded in locations) MongoDB could work very well.
[18:52:23] <Prometheian> I need to relate restaurants to locations somehow.
[18:52:30] <Prometheian> Would I do that in my backend or frontend?
[18:54:29] <owen1> GothAlice: don't i need to modify some stuff on my ec2 where mongo is installed?
[18:54:57] <GothAlice> Likely you'd need to change the bindAddress to 0.0.0.0 to make it listen on the correct interface.
[18:55:09] <GothAlice> But definitely have the firewall configured before you do that.
[18:56:01] <Prometheian> GothAlice: Would I do my 'joins' in my app layer then?
[18:56:18] <GothAlice> Prometheian: Yup; they'd just be additional queries.
[18:56:59] <Prometheian> So effectively I'm moving the join logic up a layer and storing the data in mongo which is faster?
[18:57:00] <GothAlice> There's also "pre-aggregation" to store the averages, that way you don't need to iterate all of the reviews each time you want to show one of the stats.
[18:57:28] <Streemo> If there a way to add a document with name "blah" to the collection only if a document with name "blah" doesn't already exist within 10 miles of it. If it does eist, then increment the alredy existing document
[18:57:32] <GothAlice> You're also increasing data locality (i.e. when looking at a location, you likely want the reviews, or vice-versa, when looking at a review you'd naturally need the location details anyway).
[18:57:47] <Prometheian> True, cool.
[18:58:03] <Prometheian> It's just a bit weird to think about. I can normalize a sql db just fine but Mongo is new :D
[18:58:05] <GothAlice> Streemo: Conditional upsert.
[18:58:13] <Streemo> GothAlice: cool, thanks.
[18:58:19] <GothAlice> Or rather, conditional update.
[18:58:28] <Streemo> btw, nice you're a moderator now
[18:58:54] <GothAlice> db.foo.update({… query …}, {… updates …}) (only update matching documents; I *believe* this can include geo queries.)
[18:59:08] <Streemo> hehe i hope youre right
[18:59:14] <GothAlice> Streemo: We had a particularly persistent abusive user, I offered to assist with keeping the rabble down. ;)
[18:59:20] <Streemo> lol
[18:59:31] <Streemo> well im glad youre still on here :)
[19:00:10] <Streemo> Btw, you mean conditional upsert, right?
[19:00:28] <owen1> GothAlice: is bindaddress something i define in my /etc/mongodb.conf ?
[19:00:30] <GothAlice> And yeah, actually, now that I read instead of skimming, an upsert is what you want. $setOnInsert will let you set any properties you aren't directly querying on.
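
[Editor's aside: a hedged sketch of the upsert-with-$setOnInsert pattern, ignoring the 10-mile constraint; the collection and fields are hypothetical. The geo part would need a 2dsphere index and a geo query in the filter, which GothAlice is only fairly sure composes with an upsert, so treat that as untested.]

    db.places.update(
      {name: "blah"},                              // match on the name
      {
        $inc: {count: 1},                          // increments if it exists, becomes 1 on insert
        $setOnInsert: {createdAt: new Date()}      // only applied when the upsert inserts
      },
      {upsert: true}
    );
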
[19:01:02] <GothAlice> owen1: Aye. And apologies, its current name is bindIp: http://docs.mongodb.org/manual/reference/configuration-options/#net.bindIp
[19:01:04] <Streemo> hmmm
[19:01:05] <Streemo> ok
[19:01:08] <Streemo> ill think on it
[19:01:11] <owen1> GothAlice: thank you
[19:01:14] <Streemo> thanks GothAlice
[19:01:22] <GothAlice> It never hurts to help. ^_^
[19:06:04] <owen1> GothAlice: just so i'll understand it, what exactly bind_ip = 0.0.0.0 means?
[19:06:17] <GothAlice> "Allow connections from any network interface."
[19:06:44] <GothAlice> (You would then use firewall rules to ensure only authorized hosts can connect.)
[19:07:17] <GothAlice> Without that firewall in place, though, that bind_ip would open your MongoDB server to remote use. (Not a good thing.)
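
[Editor's aside: what that setting looks like in the two config formats in use at the time; only do this with the firewall rules already in place, as discussed above.]

    # older ini-style /etc/mongodb.conf
    bind_ip = 0.0.0.0

    # YAML config (2.6+)
    net:
      bindIp: 0.0.0.0
      port: 27017
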
[19:07:39] <owen1> GothAlice: oh. those firewall rules, is it somewhere on AWS docs?
[19:09:42] <GothAlice> owen1: It's the standard EC2 firewall system.
[19:09:50] <GothAlice> Managed via the AWS console or via API.
[19:10:00] <owen1> i'll read about it. thanks
[19:10:05] <GothAlice> See: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html
[19:10:30] <owen1> GothAlice: so i need to create a security group
[19:10:49] <GothAlice> Yup; one you design to allow your application servers access to the MongoDB port on the MongoDB VMs.
[19:10:50] <owen1> and apply it on my ec2
[19:10:54] <GothAlice> Exactly.
[19:10:55] <owen1> (the one with the mongo)
[19:11:37] <owen1> GothAlice: thanks agoin
[19:11:38] <owen1> again
[19:11:46] <GothAlice> No worries.
[19:12:40] <owen1> can i allow my subnet instead of individual ec2?
[19:14:01] <GothAlice> Alas, I haven't AWS'ed in many years. That documentation is your best bet for what is and is not possible using security groups.
[19:14:52] <owen1> cool. reading it now
[19:31:08] <abishek> the eval function is restricted on my production servers, so how can I execute queries on mongo db without using eval?
[19:57:26] <Prometheian> Is there anything special about an embedded sub-document or is that just a name for a nested object?
[19:59:14] <Streemo> I have documents which have a rating: 1 star, ... , 5 star. But also their count matters too - how ample they are. Is there a good equation to sort by which takes into account count+rating?
[19:59:42] <Prometheian> You mean the number of ratings + the average rating?
[20:00:15] <Streemo> mmm
[20:00:20] <Streemo> perhaps a bayesian estimate
[20:01:09] <unholycrab> collections with lots of indexes.... CPU heavy or RAM heavy?
[20:01:38] <Streemo> probably ram
[20:01:44] <Streemo> right?
[20:02:03] <unholycrab> yeah i see. they need to fit in ram
[20:02:41] <Streemo> ok what about
[20:03:08] <Streemo> log(N)*(sum over N_i)/(sum over 1 from i to N)
[20:03:14] <Streemo> log(N) * avg(N)
[20:03:17] <Streemo> errr
[20:03:23] <Streemo> log(N) * avg Rating
[20:03:38] <Streemo> so something rated 5 with only 1 vote is scored lower than something rated 5 but with 10 votes
[20:05:00] <Prometheian> But is something rated 4 @ 10 votes placed higher than 5 @ 2, or 5 @ 5?
[20:06:29] <Streemo> sure
[20:06:40] <Streemo> 4 @ 100 votes is better than 5 @ 2 votes
[20:06:50] <Streemo> Law of large numbers
[20:06:52] <Prometheian> kk, just double checking your logic to make sure it was what you wanted :)
[20:09:33] <Streemo> it could be tuned up
[20:09:37] <Streemo> any recommendations?
[20:12:40] <Prometheian> Not really, except to possibly add options to weight 5's heavier, or more votes heavier.
[20:13:46] <Streemo> a smoothed step function
[20:13:57] <Streemo> like the Fermi-Dirac distribution but shifted to the right towards 5
[20:14:07] <Streemo> composed with log(N)
[20:14:22] <Streemo> like this
[20:14:31] <Streemo> -------
[20:14:39] <Streemo> _____________________/
[20:14:41] <Streemo> eh
[20:14:52] <Streemo> minus one underscore
[20:14:55] <Streemo> or 2
[20:14:59] <Prometheian> I'm not that good with algorithms to help here. :P
[20:15:06] <Streemo> well jsut the picture
[20:15:14] <Streemo> im trying to find a function to match the picture heh
[20:23:26] <Prometheian> Average the ratings, multiply that number by the number of votes. Then multiply it by (avgRating*0.1)
[20:23:43] <Prometheian> Or something
[20:23:47] <Prometheian> lol...
[20:24:06] <Prometheian> 4.56*0.1 is .456 which will actually hurt the rating. Hmm
[20:24:50] <Prometheian> You could make the rating n*(avg^2)
[20:25:03] <Prometheian> So the larger the rating the greater the weight.
[20:36:30] <Streemo> no what i did was this:
[20:37:03] <Streemo> v = average rating, N = number of votes. S(v,N) = log(N)*1/(1+exp(-3*(v-3)))
[20:37:32] <Streemo> so ratings averaging 1,2,3 are never scored high regardless of the number of votes
[20:37:43] <Streemo> ratings which are 4 and heavily voted on may score higher than newly rated 5's
[20:41:17] <Streemo> will mongodb take a sort function like this
[20:41:31] <Streemo> where V and N are obtained from the document's fields
[20:44:00] <Prometheian> no idea
[20:44:03] <Prometheian> but nice!
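
[Editor's aside: a hedged sketch of using Streemo's formula for sorting. A 3.0-era server can't evaluate log/exp inside a query, so one workable approach is to precompute the score whenever the vote counts change and sort on the stored field; the "items" collection and field names are hypothetical.]

    // S(v, N) = log(N) * 1 / (1 + exp(-3 * (v - 3)))
    function score(avgRating, votes) {
      return Math.log(votes) / (1 + Math.exp(-3 * (avgRating - 3)));
    }

    db.items.find().forEach(function(doc) {
      db.items.update({_id: doc._id}, {$set: {score: score(doc.rating, doc.votes)}});
    });

    db.items.find().sort({score: -1});   // best-scored first
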
[20:56:15] <ra21vi> is there a way to add two or more ObjectIds to get a unique objectId which I can later generate same hash with same input
[20:56:32] <ra21vi> uhh, its very difficult to explain it
[21:00:39] <Torkable> so I tried to exclude several fields in an aggregate query and it said I can only exclude the _id
[21:00:44] <Torkable> anyway around this?
[21:06:29] <GothAlice> ra21vi: If it's hard to explain, it's probably a bad idea. If it's easy to explain, it *may* be a good idea. ;)
[21:06:56] <abishek> how can I convert this group by query to aggregate query http://pastebin.com/mL8yU0y7 ? I am new to MongoDB
[21:07:25] <GothAlice> (Also: Simple is better than complex; complex is better than complicated. Alice's laws 30-31 + 44-45.)
[21:07:38] <Torkable> surprise, $project: { fieldName: undefined } didn't work
[21:07:46] <GothAlice> abishek: http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/
[21:08:31] <GothAlice> Torkable: http://docs.mongodb.org/manual/tutorial/project-fields-from-query-results/
[21:08:53] <Torkable> GothAlice: aggregate projections seem to be a little different
[21:08:55] <GothAlice> Er, rather, http://docs.mongodb.org/manual/reference/operator/aggregation/project/ which is very similar.
[21:09:01] <Torkable> http://docs.mongodb.org/manual/reference/operator/aggregation/project/
[21:09:03] <GothAlice> <field>: <1 or true>
[21:09:08] <GothAlice> It's similar enough. ;)
[21:09:14] <Torkable> can't exclude...
[21:09:15] <GothAlice> (It just lets you also do more.)
[21:09:17] <Torkable> :(
[21:09:22] <GothAlice> Well, no.
[21:09:31] <GothAlice> Due to the way aggregate queries work, "excluding" is the wrong approach.
[21:09:36] <Torkable> which is what I want lol
[21:09:45] <GothAlice> It's really not, unless you like running out of memory mid-query. ;)
[21:09:49] <abishek> GothAlice is aggregation faster than using a group function?
[21:10:17] <Torkable> GothAlice: I want the documents sans three fields
[21:10:19] <GothAlice> abishek: Depends; aggregation is easier to parallelize in certain circumstances. A "group function" isn't a thing that exists in MongoDB, so it's not like you have another choice other than full client-side processing. ;)
[21:10:47] <abishek> so aggregate is better than a group by
[21:10:57] <GothAlice> abishek: Rather, $group in an aggregate is the way to go. One could use map/reduce, but map/reduce is slower, invokes a JS engine runtime, etc., etc.
[21:11:09] <abishek> ok
[21:23:51] <Torkable> grrr
[21:24:02] <Torkable> I just want to remove a couple fields from the aggregate results
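
[Editor's aside: a sketch of the workaround GothAlice implies: in a 3.0-era $project you list the fields to keep rather than the ones to drop (only _id can be excluded). Field names are placeholders.]

    db.things.aggregate([
      {$match: { /* your query */ }},
      // name everything you want back; the unwanted fields simply aren't listed
      {$project: {fieldA: 1, fieldB: 1, fieldC: 1}}
    ]);
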
[21:29:20] <abishek> GothAlice, how do I do `SELECT COUNT(visits)` using aggregate query?
[21:30:13] <abishek> I can see COUNT(*) compared but not COUNT(columnName)
[21:37:54] <greyTEO> when reading the oplog, what is the best way to parse {user.3.status: "cool"}
[21:38:35] <greyTEO> I am using the oplog to update elasticsearch when I noticed that the oplog is only return the array item that was changed
[21:38:45] <GothAlice> abishek: What does COUNT(column) do, *specifically*?
[21:39:33] <greyTEO> would the work around be to re-fetch the entire document instead of just going by what has changed?
[21:39:43] <GothAlice> greyTEO: Not using Mogno Connector for any particular reason? http://blog.mongodb.org/post/29127828146/introducing-mongo-connector and https://github.com/10gen-labs/mongo-connector being relevant links.
[21:40:23] <GothAlice> https://github.com/10gen-labs/mongo-connector/wiki/Usage%20with%20ElasticSearch
[21:40:27] <fewknow> blamo
[21:40:30] <GothAlice> :)
[21:40:45] <abishek> GothAlice, if you look at my query here http://pastebin.com/mL8yU0y7, I am considering only values from the column that are greater than 0
[21:41:02] <GothAlice> abishek: That didn't answer the question. T_T
[21:41:37] <GothAlice> Looking at that code, I don't even.
[21:41:45] <GothAlice> I don't know what that is, but it's not an aggregate.
[21:41:58] <abishek> line 14, `obj.order_id > 0`
[21:42:09] <GothAlice> Yeah. Still no idea what I'm looking at.
[21:42:12] <abishek> how can I pass conditions like this on aggregate query
[21:42:37] <fewknow> abishek: $match
[21:42:41] <GothAlice> Well, no.
[21:42:41] <fewknow> lol
[21:42:59] <GothAlice> http://docs.mongodb.org/manual/reference/operator/aggregation/let/
[21:43:15] <GothAlice> If you want to pass in variables that can be used in expressions, MongoDB-side, $let is your friend.
[21:43:29] <greyTEO> GothAlice, I have to do a bunch of stuff and not just pipe directly.
[21:43:40] <greyTEO> otherwise that would have worked perfect
[21:44:06] <fewknow> greyTEO: you can write your own doc_manager
[21:44:12] <greyTEO> currently it's oplog -> rabbitmq -> ES
[21:44:15] <fewknow> and still use mongo-connector
[21:45:03] <GothAlice> Simple is better than complex. Complex is better than complicated. Adding intermediaries, with their own logistic and maintenance requirements, enters into the realm of complicated.
[21:46:16] <fewknow> abishek: you can use $group for $gt and then pipe to next part of aggregation. Well I would in the shell anyway . If I understand what you are doing....a little late to the conversation
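
[Editor's aside: a hedged sketch of the SELECT COUNT(column) translation abishek asked about, using the order_id > 0 condition from the pastebin; the collection name and other fields are guesses. COUNT(column) counts rows where the column is not NULL, hence the extra $match clause.]

    db.orders.aggregate([
      {$match: {order_id: {$gt: 0}, visits: {$ne: null}}},   // only rows with a non-null visits
      {$group: {_id: null, visitCount: {$sum: 1}}}           // COUNT(visits) over the matched rows
    ]);
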
[21:48:07] <greyTEO> GothAlice, Im assuming you mean don't write your own??
[21:49:18] <GothAlice> Well, it's mostly that any time I see multiple database systems being utilized on a single project I end up slapping the architect. ;) Also "not invented here" syndrome is a Very Bad Thing™ when you're dealing with a) cryptography or b) anything you want to be highly reliable and highly available.
[21:50:40] <GothAlice> (Unless you've got unit and integration tests, tests testing your tests, continuous integration, and some serious planning.)
[21:51:29] <greyTEO> ehh. It's certainly the not invented here syndrome.
[21:51:46] <GothAlice> The only valid excuse for anything, ever: "It seemed like a good idea at the time." ;)
[21:51:50] <greyTEO> python is far from a language I have experience with..
[21:51:59] <GothAlice> I try not to use NIH as a pejorative—I reinvent all the things! ;)
[21:52:37] <greyTEO> lol being the only developer now, I reuse all the things! ;)
[21:52:58] <GothAlice> But even then… I don't roll my own crypto. Anyone can write a crypto system they themselves can not break. (Doesn't mean your next-door neighbour's dog can't break it…)
[21:53:34] <greyTEO> I would never roll my own crypto
[21:53:43] <greyTEO> that is looking for trouble
[21:53:55] <GothAlice> So, may I ask, why rabbit in the middle there?
[21:54:36] <greyTEO> multiple consumers, if needed, since we are updating millions of docs per day
[21:54:48] <GothAlice> Tailing cursors on the capped collection found to be unsuitable?
[21:55:05] <GothAlice> (Capped collections are ring buffers; you won't be able to get much more efficient than a tailing cursor on it.)
[21:55:06] <greyTEO> no, that is what is feeding rabbit
[21:55:21] <GothAlice> Well, no, I mean, have you tested having the listeners listening directly to the oplog?
[21:55:30] <GothAlice> (Capped collections are designed for this use; this is how internal replication works.)
[21:56:00] <greyTEO> wouldn't multiple listeners distribute the same message?
[21:56:20] <GothAlice> Your tailing cursors are returning query results; you can filter for a subset of the messages efficiently.
[21:56:27] <greyTEO> rabbit would load balance the message if multiple consumers were tailing the queue
[21:56:42] <greyTEO> it's by timestamp and inc currently
[21:56:57] <greyTEO> where timestamp inc $gte
[21:57:07] <GothAlice> That's… unimportant, really. ;)
[21:57:24] <greyTEO> that is the subset isnt it?
[21:57:57] <GothAlice> Sorta.
[21:58:41] <GothAlice> If you have multiple "workers" chewing in data being piped out of the oplog, you're effectively distributing the jobs "modulo" the number of workers.
[21:59:02] <GothAlice> (Worker 1 gets job 1, worker 2 gets job 2, worker 1 gets job 3, worker 2 gets job 4, …)
[21:59:42] <GothAlice> https://gist.github.com/amcgregor/4207375 is part of a presentation I gave on using MongoDB for task distribution; this is possibly overkill for what you need (I needed full, generic RPC, you don't, really) but handles using MongoDB for this purpose. Includes link to working codebase in the comments.
[21:59:48] <greyTEO> you are talking about tailing cursors?
[22:00:12] <GothAlice> "rabbit would load balance the message if multiple cosumers were tailing the queue"
[22:00:51] <medmr> the monq package puts a nice awkward job queue on top of mongo
[22:01:05] <GothAlice> I'm not familiar enough with rabbit; would it be broadcasting the same message to every listener, or somehow choosing to distribute specific messages to specific workers?
[22:01:12] <GothAlice> medmr: Link?
[22:01:44] <greyTEO> I think we are working in different directions.
[22:01:51] <medmr> https://www.npmjs.com/package/monq
[22:01:53] <GothAlice> medmr: Nevermind, I found it on npm. Eeeeeeeeeeew.
[22:01:57] <medmr> GothAlice: haha
[22:01:57] <GothAlice> ^_^
[22:02:16] <greyTEO> you are using Mongo as a form of rabbitmq /worker/slave config
[22:02:21] <medmr> the job document format is.. suboptimal
[22:02:43] <GothAlice> medmr: Just the API alone leaves a bad taste. ;) I'm too used to languages with powerful introspection capabilities.
[22:03:02] <medmr> oh right you're a python guy
[22:03:05] <greyTEO> GothAlice, nice link though.
[22:03:05] <medmr> i miss working in python
[22:03:59] <greyTEO> GothAlice, you can only have as many thread as you have masters correct?
[22:04:23] <GothAlice> Please define: "master" and "thread" here.
[22:04:51] <GothAlice> medmr: https://github.com/marrow/task/blob/develop/marrow/task/model.py#L181-L193 < my Task model ;)
[22:05:07] <greyTEO> master -> Primary and thread being threads* that can tail the cursor.
[22:06:13] <GothAlice> medmr: https://github.com/marrow/task/blob/develop/marrow/task/queryset.py#L9-L62 is the part that handles tailing cursors with timeouts; rough estimate timeouts, though, until a certain JIRA ticket gets fixed. This code lets you terminate the building of a Query and use a simple for loop iterator to get results back as they arrive.
[22:06:17] <fewknow> GothAlice: I use multiple databases on every project. But I use a DAL(Data Access Layer) that abstracts that from everyone. But different database technologies makes sense at different points in a project
[22:06:40] <GothAlice> greyTEO: I have yet to encounter a limit on the number of tailing cursors (workers/listeners) able to simultaneously query a single collection.
[22:08:19] <medmr> thanks, interesting
[22:08:23] <greyTEO> I have only seen 1 task per primary. Hence why I thought duplicate data would be distributed with more than one listener
[22:08:31] <GothAlice> fewknow: Considering I wouldn't trust a single DAL to handle relational, hierarchical/graph, and document storage sensibly from a single API (too many compromises would need to be made), in the cases where I need multiple DBs, I use dedicated ODM/ORMs. I.e. MongoEngine for MongoDB, SQLAlchemy for anything relational, etc.
[22:09:39] <fewknow> GothAlice: I use MongoEngine as well for Mongo, but I abstract it from everything else so a developer doesn't need to worry where the data is stored or how it is accessed. They just hit the API and the data is retrieved and delivered.
[22:09:52] <fewknow> I leave it up to someone (ME) that know how to store the data to handle it....
[22:10:11] <GothAlice> greyTEO: The example producers and consumers in the sample code I demonstrated on stage using two producers and five consumers. Whichever consumer locked the record first is the one that gets to work on it.
[22:10:12] <fewknow> more for abstraction for anyone above my layer
[22:10:51] <greyTEO> GothAlice, back to the original question, do you know if the mongo-connector refetches the mongo doc before updating ES?
[22:10:53] <GothAlice> greyTEO: This results in trivial load balancing; if a node is too busy servicing another task, it won't try to lock it.
[22:11:07] <GothAlice> greyTEO: I do not know; it's reasonably readable Python code, though.
[22:12:11] <greyTEO> GothAlice, thanks for all the info. I appreciate it
[22:12:54] <greyTEO> GothAlice, i didnt know the producers lock the document on reading in oplog
[22:13:00] <GothAlice> Nopenopenope.
[22:13:18] <GothAlice> The "locking" I'm referring to is entirely the realm of this implementation, not anything to do with MongoDB itself.
[22:13:19] <greyTEO> then I totally missed your point?
[22:13:59] <greyTEO> because you are working on documents....not just reading them?
[22:14:32] <GothAlice> Ref: 4-job-handler.py from the presentation slides gist: lines 9/10 handle a failure to communicate with MongoDB; line 12 handles failure to acquire the lock we're trying to set on line 7.
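The gist itself isn't reproduced here, but the lock-acquisition step being described looks roughly like the following sketch. The collection and field names ('tasks', 'owner', 'acquired') are assumptions, and it uses pymongo 3.x's find_one_and_update for the atomic claim:

```python
from datetime import datetime

import pymongo

client = pymongo.MongoClient()
tasks = client.test.tasks  # assumed job-tracking collection


def try_claim(job_id, worker_id):
    """Atomically claim a job; returns the job document, or None if another
    consumer won the race (the "failure to acquire the lock" case)."""
    return tasks.find_one_and_update(
        {'_id': job_id, 'owner': None},  # only claim jobs nobody owns yet
        {'$set': {'owner': worker_id, 'acquired': datetime.utcnow()}},
    )
```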
[22:15:05] <GothAlice> It's a two-part implementation: it uses a capped collection for messaging (new job, job failed, job finished, etc.) but a real collection to track the jobs themselves.
[22:15:24] <GothAlice> (This allows for sharding; capped collections themselves can't be sharded.)
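As a hedged illustration of that two-part split (collection names invented), the messaging side is a capped collection while the job records live in an ordinary, shardable collection:

```python
import pymongo

db = pymongo.MongoClient().test

# Capped collection for the lightweight "new job / job finished" announcements.
if 'task_messages' not in db.collection_names():
    db.create_collection('task_messages', capped=True, size=16 * 2 ** 20)

# Ordinary collection for the job documents themselves; this one can be sharded.
jobs = db.tasks
```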
[22:15:34] <greyTEO> ok that makes sense now
[22:16:59] <greyTEO> I usually use rabbit for most things in our current system.
[22:17:51] <greyTEO> I am just getting into mongo and denormalizing some mysql tables to mongo.
[22:18:18] <greyTEO> so it seemed to fit mongo oplog -> rabbit -> whatever we currently have in our system
[22:18:39] <greyTEO> GothAlice, again thanks for clearing it up
[22:19:43] <GothAlice> greyTEO: Practicality beats purity; but this task system evolved from a project I joined two weeks in that was using, and I'm not joking: rabbitmq, zeromq, memcache/membase, redis, and postgresql.
[22:19:57] <GothAlice> (All replaced within the next two weeks with nothing but MongoDB. ;)
[22:20:27] <greyTEO> shit
[22:20:49] <greyTEO> that is too many services to run at a time
[22:20:52] <GothAlice> I mentioned slapping architects that use multiple databases. I think I left a mark on this one. ;)
[22:22:53] <GothAlice> One queue had persistence, the other had routing capabilities. Memcache because we need caches, right? Redis because efficient semi-persistent sets of integers were needed. Postgres for general data. Queues = capped collections, routing = tailing cursor with a query, automatic cache expiry = TTL indexes, $addToSet/etc. for set operations in MongoDB, and, well, MongoDB documents map to Python structures quite well for general storage.
[22:22:54] <GothAlice> ;)
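A small sketch of two of the MongoDB features named there, with invented collection and field names: a TTL index for automatic cache expiry and $addToSet for Redis-style set operations:

```python
from datetime import datetime

import pymongo

db = pymongo.MongoClient().test

# Automatic cache expiry: documents are removed ~3600s after their 'created' time.
db.cache.create_index('created', expireAfterSeconds=3600)
db.cache.insert_one({'_id': 'expensive-result',
                     'created': datetime.utcnow(),
                     'value': 42})

# Set semantics: $addToSet only appends the value if it isn't already present.
db.sets.update_one({'_id': 'seen_user_ids'},
                   {'$addToSet': {'members': 12345}},
                   upsert=True)
```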
[22:22:57] <greyTEO> I think multiple databases have their place. I disagree when people use services just so they can be "cutting edge"...
[22:23:42] <GothAlice> Certainly; if someone needs referential integrity or transactions, I point them at a relational database designed for that purpose. Same for storing graphs—for the love of the gods, people need to stop shoe-horning graphs on top of non-graph DBs.
[22:25:01] <greyTEO> we still use relational data heavily. Our data fits well into a relational model; I mostly liked the throughput I can get with mongo and apache spark for ETL of a 13GB XML file
[22:25:56] <greyTEO> "One queue had persistence, the other had routing capabilities. Memcache because we need caches, right? Redis because efficient semi-persistent sets of integers were needed. Postgres for general data.".....super simple
[22:26:09] <GothAlice> Yup; it seemed like a good idea to them at the time. ;)
[22:26:20] <greyTEO> that makes my nose bleed just reading that
[22:26:29] <GothAlice> I'll not get started on storing at-rest data in XML, though. ;)
[22:27:09] <GothAlice> You may be very interested in http://www.tokutek.com/tokumx-for-mongodb/ by the way.
[22:27:40] <GothAlice> (Stable MongoDB fork using fractal trees for indexes and oplog buffering, with transaction support and compression.)
[22:27:40] <greyTEO> I was looking at that.
[22:27:45] <GothAlice> Yeah; it's neat. :3
[22:28:01] <greyTEO> I have also wanted to try the new wiredtiger engine
[22:28:04] <GothAlice> MVCC support was the thing that really caught my attention.
[22:28:39] <GothAlice> WT isn't… quite… production-quality yet. There's a standard list of about half a dozen critical memory management issues I typically paste. ;)
[22:29:02] <greyTEO> yea that is what I thought.
[22:29:27] <GothAlice> https://jira.mongodb.org/browse/SERVER-17421 https://jira.mongodb.org/browse/SERVER-17424 https://jira.mongodb.org/browse/SERVER-17386 https://jira.mongodb.org/browse/SERVER-17456 https://jira.mongodb.org/browse/SERVER-17542 ancillary: https://jira.mongodb.org/browse/SERVER-16311 < here they are. ^_^
[22:29:35] <greyTEO> I tried tokudb for mysql... not too much different from InnoDB
[22:30:21] <greyTEO> lol good times
[22:30:25] <GothAlice> https://jira.mongodb.org/browse/SERVER-17421 https://jira.mongodb.org/browse/SERVER-17424 https://jira.mongodb.org/browse/SERVER-17386 https://jira.mongodb.org/browse/SERVER-17456 https://jira.mongodb.org/browse/SERVER-17542 ancillary: https://jira.mongodb.org/browse/SERVER-16311
[22:30:27] <GothAlice> Er, sorry!
[22:30:31] <GothAlice> http://grimoire.ca/mysql/choose-something-else
[22:30:37] <GothAlice> Sometimes I'm too fast for my multiple clipboard manager. ^_^
[22:32:08] <greyTEO> lol say what you will about MySQL
[22:32:21] <GothAlice> …
[22:32:29] <GothAlice> before you say any more, skim the "Bad Arguments" section. ;)
[22:32:36] <greyTEO> it has stood up well for most of what I have to throw at it
[22:32:47] <greyTEO> lol that is where I went first
[22:32:58] <GothAlice> "It's good enough." and "We haven't run into these problems." are included. ;)
[22:33:32] <greyTEO> postgres is probably your db of choice in the relational world?
[22:33:35] <GothAlice> Aye.
[22:33:59] <GothAlice> To the point that I've got automation for it that baffles people. (No permanent storage on my Postgres hosts.)
[22:34:24] <GothAlice> (wal-E helps… a lot.)
[22:35:08] <greyTEO> I have just never worked with it. I know it's the next big guy in the open source space.
[22:35:26] <GothAlice> If I *absolutely* had to support MySQL-capable clients, I'd use Maria.
[22:35:35] <greyTEO> It is the closest thing to a robust db (compared to SQL server)
[22:35:56] <greyTEO> being open source that is
[22:36:13] <GothAlice> (While avoiding InnoDB like the plague. I had to reverse engineer its on-disk format, and it took the 36 hours I had prior to getting my wisdom teeth extracted to recover my client's data after an Amazon EC2/EBS failure across multiple zones.)
[22:36:44] <GothAlice> That was *not* a fun week. ;)
[22:36:59] <greyTEO> no backups?????
[22:37:16] <GothAlice> Backups were inaccessible due to the cross-zone cascade failure in the EBS volume controllers.
[22:37:36] <GothAlice> (No volumes could be mounted or unmounted; if you tried, say goodbye to that volume.)
[22:38:02] <GothAlice> (Also the EC2 VM you tried attaching it to. Instant several-thousand load average.)
[22:38:20] <greyTEO> damn
[22:39:03] <GothAlice> Yeah. We switched to Rackspace after that. AWS's SLA explicitly stated their system was architected to isolate failures to a single zone and explicitly recommended doing what we did to ensure high-availability. Well, they were wrong. ;)
[22:39:44] <greyTEO> yea that was catastrophic
[22:40:07] <greyTEO> rackspace is where it's at. Pricey but well worth it in my opinion
[22:40:29] <GothAlice> Until a recent dom0 security patch, I had servers with 3 year uptimes.
[22:40:36] <GothAlice> So yeah, I'm a pretty satisfied customer. ;)
[22:41:09] <greyTEO> yea their uptimes are insane
[22:43:57] <GothAlice> Weirdly, for application nodes, there are many cheaper solutions. (I'm trialling clever-cloud.com right now at work; so far so good.) For DB? Not so much. (My personal MongoDB dataset would cost me just under half a million dollars per month if I hosted it on compose.io… by self-hosting it, it cost me around $10K once. I can afford to replace every drive every month by not cloud hosting it, and it paid for the initial investment
[22:43:58] <GothAlice> pretty quick.) ;)
[22:45:01] <abishek> does MongoDB support the HAVING clause in aggregation?
[22:45:07] <greyTEO> https://github.com/10gen-labs/mongo-connector/blob/master/mongo_connector/doc_managers/doc_manager_base.py#L67
[22:46:09] <greyTEO> looks like it splits on the '.'; the current key that I have is 'item.3.isVerified'
[22:46:29] <GothAlice> abishek: Just add another $match stage. I.e. $match, $project, $unwind, $match, $group — the second $match is effectively HAVING-filtering the unwound list items prior to the $group.
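A sketch of that pipeline shape with made-up collection and field names ('orders', 'status', 'items', 'customer'); the second $match filters the unwound items before they reach $group:

```python
import pymongo

db = pymongo.MongoClient().test

pipeline = [
    {'$match': {'status': 'complete'}},        # initial WHERE-style filter
    {'$project': {'customer': 1, 'items': 1}},
    {'$unwind': '$items'},
    {'$match': {'items.qty': {'$gte': 2}}},    # filter unwound items pre-$group
    {'$group': {'_id': '$customer', 'total': {'$sum': '$items.qty'}}},
]

for row in db.orders.aggregate(pipeline):
    print(row)
```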
[22:46:51] <abishek> ok
[22:47:15] <greyTEO> the idea of onsite hardware is appealing but costly; it needs dedicated management. Have you looked at bare metal from rackspace?
[22:47:22] <GothAlice> greyTEO: Important note: BSON arrays are actually mappings with numeric keys.
[22:47:41] <abishek> and if I have to do it using the group by clause?
[22:48:03] <greyTEO> correct item, 3rd position, key (isVerified)
[22:48:06] <abishek> I'm sorry, I meant: if I did it using the group function, can I mention the HAVING clause?
[22:48:09] <greyTEO> update that key
[22:48:22] <GothAlice> abishek: Again, just use $match. Sometimes, if your query is very complex, you might need to use conditional expressions within the $group projection. Define "mention".
[22:48:57] <GothAlice> greyTEO: Read that as the isVerified field of the element at index 3 of the item array.
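In other words, because BSON arrays are keyed numerically (0-based, so 'item.3' addresses the element at index 3), the dotted path can be used directly in an update; the collection name here is invented:

```python
import pymongo

docs = pymongo.MongoClient().test.docs  # assumed collection

# Set the isVerified flag on the element at index 3 of the 'item' array.
docs.update_one(
    {'_id': 'some-document-id'},
    {'$set': {'item.3.isVerified': True}},
)
```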
[22:48:58] <abishek> using the HAVING clause in `db.getCollection(collectionName).group()`
[22:49:14] <greyTEO> yessir
[22:49:40] <greyTEO> I see that mongo-connector doesn't need to refetch the doc, which is what I was hoping for
[22:49:49] <GothAlice> :)
[22:50:01] <GothAlice> It's good to have problems come pre-solved. ;)
[22:50:28] <greyTEO> GothAlice, I'm taking off. Have a good night/day (wherever you are)
[22:50:37] <greyTEO> sure, if only it was that easy
[22:50:40] <greyTEO> lol
[22:50:49] <GothAlice> Have a great one, greyTEO! :)
[22:51:05] <greyTEO> thanks again
[22:51:16] <GothAlice> abishek: I have never once actually ever used .group(). The documentation pretty clearly states: "Use aggregate() for more complex data aggregation."
[22:51:38] <GothAlice> And "HAVING" is meaningless in MongoDB. It's not a thing.
[22:52:03] <GothAlice> http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/ < even clearly states the equivalent is $match
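Per that comparison chart, the HAVING-style filter is simply a $match placed after $group; a tiny illustration with invented names:

```python
import pymongo

db = pymongo.MongoClient().test

pipeline = [
    {'$group': {'_id': '$customer', 'total': {'$sum': '$amount'}}},
    {'$match': {'total': {'$gt': 100}}},   # equivalent of HAVING total > 100
]

for row in db.orders.aggregate(pipeline):
    print(row)
```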
[22:52:23] <abishek> ok