[01:21:36] <JokesOnYou77> Hi all. I have a dataset that I want to store that is highly connected (it's a graph format). I have about 5 million nodes and 50 million links. I've been looking at mongo vs a graph database. Because of the form of the data a graph DB would be natural, but because I have so much data mongo seemed like a good potential candidate. Any thoughts on performance for such a dataset?
[01:40:30] <Boomtime> @JokesOnYou77: MongoDB will have no trouble storing your data, or retrieving it very fast; the area you will want to concentrate on is the types of queries you need
[02:15:41] <JokesOnYou77> Boomtime, I'll be doing queries that ask for the node itself and all of its neighbors. I will also need some kind of fuzzy match as each node represents a word (I may be using Lucene for this, I'm still working on how the fuzzy matching will work, but anything that does edit distance will be fine)
[02:24:13] <JokesOnYou77> Boomtime, and then all of the neighbors' neighbors. At least to a depth of 2. Possibly to a depth of 3.
[02:36:59] <JokesOnYou77> Boomtime, right, so what will that mean in terms of performance, I have no real experience with mongo performance
[02:41:02] <Boomtime> you will want to pay careful attention to your indexes, make sure your queries are working as efficiently as possible - recursive queries means that small performance penalties multiply (potentially geometrically) and stack up really fast
[02:41:52] <Boomtime> you mentioned that 5 million nodes gives 50 million connections, this means 10 connections per node
[02:43:37] <Boomtime> querying those connections means 10 queries (average) per node per depth, thus 10^3 + 10^2 + 10 = 1110 queries on average -- assuming you use a flat storage system
[02:44:30] <Boomtime> properly indexed that will be fine, but given the result-set you will need to be careful about what you return as intermediate results too
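Boomtime's estimate above can be checked with a quick sketch (pure Python; it just assumes the stated average fan-out of 10 and a flat edge-list layout where each discovered node costs one query per level):

```python
# Estimate the number of queries for a breadth-first neighbourhood expansion
# when edges are stored flat: at level k we issue ~avg_degree**k queries,
# so the total is avg_degree + avg_degree**2 + ... + avg_degree**depth.
def query_count(avg_degree, depth):
    return sum(avg_degree ** k for k in range(1, depth + 1))

print(query_count(10, 3))  # 1110 (= 10 + 100 + 1000)
print(query_count(10, 2))  # 110
```

This is also why the "small penalties multiply geometrically" warning matters: each extra millisecond per query is paid ~1110 times for a depth-3 expansion.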
[05:55:05] <gaganjyot> Hello there. When I connect to mongodb I don't need to provide any user/password
[05:55:19] <gaganjyot> so reading the docs, I created a user root with root privileges
[05:55:44] <gaganjyot> but when I need to connect it via the root, I always need to provide the authentication database
[05:56:06] <gaganjyot> How can I avoid the --authenticationDatabase argument
[05:56:26] <gaganjyot> or how can I provide this argument to mongoengine while using with django
[05:57:11] <Boomtime> authentication occurs, by default, against the database you specify to start in
[06:03:50] <Boomtime> you must have done this part already: http://docs.mongodb.org/manual/tutorial/enable-authentication/#enable-client-access-control
[06:06:10] <gaganjyot> mongo 2.6 has no db.createUser?
[06:23:00] <gaganjyot> Really thanks for this Boomtime :)
[08:32:29] <avaq> Hi. My question about the $or, $text and $in operators became a little lengthy for IRC so I made an SO post: https://stackoverflow.com/questions/26116182/or-a-full-text-search-and-an-in .. Anyone who can take a look at it for me? I'm quite confused.
[09:17:56] <d0x> Hi, I implemented a queue in mongodb and thought that findAndModify() is isolated. But in a concurrent scenario many clients select the same document. Any idea how to fix that? The document looks like this: { date : xxx, inprogress: false, data : {xxxx} }
[09:18:06] <d0x> I took this pattern: http://www.mongodb.com/presentations/webinar/mongodb-schema-design-principles-and-practice
[09:21:08] <Boomtime> findAndModify is absolutely atomic, if you have "inprogress:false" as part of your query and $set inprogress:true as part of the modify then only one client will possibly win that race
[09:22:57] <d0x> Boomtime: From the docs: ... if multiple clients issue the findAndModify command and these commands complete the find phase before any one starts the modify phase, these commands could insert the same document. ...
[09:24:14] <Boomtime> query and update are absolutely atomic, the condition you are talking about is not atomic, the second "query" does not get a match therefore it inserts instead
[09:26:09] <Boomtime> oh, and it also requires the target update to not involve a unique index
[09:26:44] <Boomtime> (otherwise the uniqueness constraint would deny the document creation for the second findAndModify/update in the race)
[09:33:41] <d0x> and a unique index is not possible on this scenario
[09:33:59] <d0x> But from this doc you can see it is executed in two phases
[09:34:09] <d0x> which is written between the lines
[09:37:27] <Lope> I noticed I can't use $setOnInsert and $inc at the same time? {$set:{l:new Date()},$setOnInsert:{r:'fakeReplace',c:5,d:new Date(),s:'a',u:0},$inc:{u:1}}
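The two operators can in fact be combined; the most likely cause of Lope's error is that the field `u` appears under both `$setOnInsert` and `$inc`, and MongoDB rejects any update in which the same field path is targeted by more than one operator. A quick pure-Python check of the update document above (the diagnosis, not an official API):

```python
# No field path may appear under more than one update operator in a single
# update document. Lope's update targets 'u' twice ($setOnInsert and $inc).
def conflicting_paths(update):
    seen, dupes = set(), set()
    for fields in update.values():
        for path in fields:
            (dupes if path in seen else seen).add(path)
    return dupes

update = {
    "$set": {"l": "now"},
    "$setOnInsert": {"r": "fakeReplace", "c": 5, "d": "now", "s": "a", "u": 0},
    "$inc": {"u": 1},
}
print(conflicting_paths(update))  # {'u'}
```

Dropping `u` from `$setOnInsert` (letting `$inc` create it as 1 on upsert) is one way to resolve the conflict.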
[10:10:56] <Boomtime> "(7:33:58 PM) d0x: But from this doc you see it is exectuted in two phases
[10:10:56] <Boomtime> (7:34:08 PM) d0x: which is written between the lines"
[10:11:53] <Boomtime> i repeat findAndModify is atomic, even when using upsert - the problem with upsert is that it allows two possible ways to complete, so if you set that flag, your 'racing' command will simply complete via whichever path is still open
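The claim pattern Boomtime describes can be sketched like this (pure Python dicts; the collection name `jobs` and the `started` field are hypothetical, and the pymongo call is shown only as a comment):

```python
# Atomic job-claim pattern: the query filters on inprogress:false and the
# update flips it to true in the SAME findAndModify command, so only one
# concurrent client can match and claim a given document.
claim_query = {"inprogress": False}
claim_update = {"$set": {"inprogress": True, "started": "now"}}
sort_spec = [("date", 1)]  # oldest job first

# With pymongo 2.x this would be roughly:
#   job = db.jobs.find_and_modify(query=claim_query, update=claim_update,
#                                 sort=sort_spec)
# A racing client re-evaluates the query, sees inprogress already true,
# and gets no match instead of the same job.
print(claim_update["$set"]["inprogress"])  # True
```

The two-phase wording d0x quotes applies to the upsert case: there the "loser" of the race doesn't get the same document, it takes the insert path instead.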
[10:55:47] <bee_keeper> What's best practice for mass inserts in mongo? I need to add about 500 records at once. I'm using mongoengine for django.
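For a batch like that, a single call with a list of documents is the usual approach rather than 500 separate inserts; a minimal sketch (the collection name `records` is hypothetical, and the driver calls are shown only as comments):

```python
# Build the batch once, then hand the whole list to the driver in one call.
docs = [{"n": i} for i in range(500)]

# pymongo:      db.records.insert(docs)        # 2.x bulk form
#               db.records.insert_many(docs)   # 3.x equivalent
# mongoengine:  Record.objects.insert(record_objects)  # bulk QuerySet insert
print(len(docs))  # 500
```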
[11:30:18] <jeffwhelpley> I have a score field in a mongo collection which is used to sort the documents. I want to be able to save the rank of the document in that sort in the actual document (i.e. so if I just pull one document, I know it is ranked #5 or whatever). I could loop through the sorted results and do an update for each document to save the rank, but that seems really expensive. Is there any easier way to do this in Mongo?
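As far as I know there is no single server-side command that writes a sort rank back into each document; a common workaround is a periodic batch job that recomputes ranks in one pass over the score-sorted results. A sketch with plain dicts standing in for documents (the `$set` call in the comment is the eventual per-document write):

```python
# Recompute 1-based ranks in one pass over the documents sorted by score
# descending, instead of deriving rank at read time.
def assign_ranks(docs):
    for rank, doc in enumerate(sorted(docs, key=lambda d: -d["score"]), start=1):
        doc["rank"] = rank
    return docs

docs = [{"_id": "a", "score": 10}, {"_id": "b", "score": 30}, {"_id": "c", "score": 20}]
assign_ranks(docs)
# each rank would then be persisted with something like:
#   db.coll.update({"_id": doc["_id"]}, {"$set": {"rank": doc["rank"]}})
```

It is still one update per document, but done off the request path on a schedule, so single-document reads stay cheap.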
[12:59:55] <d0x> Hi, I implemented a queue in mongodb and thought that findAndModify() is isolated. But in a concurrent scenario many clients select the same document. Any idea how to fix that? The document looks like this: { date : xxx, inprogress: false, data : {xxxx} }. Many clients simultaneously query for inprogress = false, ordered by date, limit one, via findAndModify
[13:00:34] <d0x> I got this idea from here: https://speakerdeck.com/mongodb/schema-design-jared-rosoff-10gen?slide=55
[13:03:06] <d0x> The funny thing is that the findAndModify sets a write lock only https://github.com/mongodb/mongo/blob/master/src/mongo/db/commands/find_and_modify.cpp#L100 ??
[13:36:48] <dmitchell> I'm finally upgrading edX from pymongo 2.2 to 2.7 and my GridFS connection on our test server running against Mongo 2.4 is failing auth https://gist.github.com/dmitchell/2f6c458d5eb64f30678a
[13:37:03] <dmitchell> it works on my local dev machine (using mongo 2.6)
[13:40:03] <joannac> dmitchell: can you auth in the mongo shell with the same creds?
[13:40:47] <dmitchell> idk, it's on our continuous build server. i'll see if someone in testeng can, but i don't believe other builds' tests are failing
[13:40:49] <Petazz> Hi! I'm having a problem trying to use python with mongo: http://pastebin.com/k2Z6AvBZ
[13:42:07] <dmitchell> mongo stores dates as universal times so it can do server side date manipulation and comparison
[13:42:13] <dmitchell> it doesn't stringify the date to iso format
[13:42:27] <dmitchell> you can do so if you need to, but there shouldn't be a need to
[13:44:07] <Petazz> dmitchell: Oh I guess the problem is with the extended json being changed from 2.4 to 2.6
[14:10:49] <benjwadams> why does this geospatial query return bounds inside of the box area i want to exclude?
[14:11:36] <benjwadams> created a 2dsphere index on the geojson as well, to no avail
[14:12:16] <KekS> hi there, i have a question regarding copyDatabase() -- i'm trying to migrate from a mongodb running on a windows server to one running on ubuntu and after using copyDatabase the fields collections, objects, indexes, fileSize and nsSizeMB from stats() are the same but avgObjSize, dataSize, numExtents etc are different
[14:12:34] <KekS> anyone have an idea what's happened there? is that filesystem related?
[14:18:25] <KekS> they also claim different sizes in show dbs: 0.453125GB (original) vs 0.203GB (copy)
[14:22:14] <KekS> i know mongodb usually doubles the space allocated for a db when it runs out and never shrinks it again, but i don't think anyone has removed anything from the db that would make the copy allocate less space
[14:23:37] <talbott> does anyone know what this pymongo insert error means please?
[14:23:43] <talbott> OperationFailure: unexpected index insertion failure on capped collection0 assertion src/mongo/db/structure/btree/key.cpp:597 - collection and its index will not match
[14:25:07] <KekS> i'm guessing you're trying to put things without the field specified as the index of your collection in there
[14:29:04] <benjwadams> I guess mongo's spatial support isn't very robust. Ah well, back to postgis then.
[14:31:00] <benjwadams> Why would a simple bounding box query be returning obviously incorrect results?
[14:32:01] <KekS> benjwadams: no idea, i've stayed with pg for all that stuff
[14:41:17] <bmw0679> From PHP I am trying to update multiple elements from within a nested array. So far it looks like I need to make a single query for each update. Is there a better way to tackle this?
[15:19:45] <culthero> A shard does not need to be a replica set, right? If you are just sharding over 3 servers, do you turn replSet off on each shard?
[15:28:38] <dmitchell> (repeat) I'm finally upgrading edX from pymongo 2.4 to 2.7 and my GridFS connection on our continuous test server running against Mongo 2.4 is failing auth https://gist.github.com/dmitchell/2f6c458d5eb64f30678a Did authenticate change?
[15:44:00] <ssarah> hey guys, i'm trying to configure the mongodb river for elasticsearch
[15:44:14] <ssarah> im using this config http://pastiebin.com/542acefaefe2f
[15:44:39] <ssarah> what should i use for the db ? do the other fields seem right?
[15:49:02] <culthero> db should be whatever database you have setup as a replicaset
[15:49:17] <culthero> elasticsearch as far as I remember reads the oplog, so any transactions get propagated to it
[15:49:50] <culthero> I am doing the sharded approach and using mongo's built-in full-text search for this; I found elasticsearch difficult to set up while integrating with mongo
[15:50:14] <ssarah> var rst = new ReplSetTest( { name: 'rsTest', nodes: 3 } ) <- i started the replicateset as that
[16:19:11] <ssarah> culthero: yes, that's good now that index at least shows up here https://862110c3df91bc8b000.qbox.io/_plugin/HQ/# (click on connect)
[16:19:21] <ssarah> but, i still cant see any of the docs i inserted
[16:26:23] <culthero> did you run rs.initiate(); on your replica set?
[16:26:48] <culthero> I never got elasticsearch / mongodb river to work;
[16:48:50] <ssarah> oh, so sorry, i thought that was related to my issue. I had a feeling sharding was a bit too seamless. What do you mean, parity between shards?
[16:49:05] <ssarah> related to my issue: http://pastiebin.com/542acefaefe2f
[16:54:32] <ssarah> pastie fixed now, but i used the right ips
[16:58:01] <culthero> sorry if I'm being obtuse: you did confirm that qbox.io can access 54.194.193.234 (no firewall), and that your mongod configuration is not set for authentication or bound to specific IPs?
[16:58:37] <culthero> and that you've inserted documents since you've posted that configuration to qbox
[17:01:00] <culthero> IE nmap 54.194.193.234 -p31000-31003 should show your 3 mongo replicasets
[17:01:12] <culthero> replica set ports* listening
[17:11:09] <culthero> You can only make in-place updates of documents. If the update operation causes the document to grow beyond its original size, the update operation will fail.
[17:42:15] <wayne> hey. i'm designing a document and thinking about adding a field called _rev
[17:43:02] <wayne> similar to couchdb's rev, in that i'll assert that the _rev matches the old rev of my document before allowing a two-phase commit to go through
[17:53:43] <donCams> hi. I am creating a schema similar to a blog's, where I store tags for each article in an array. However, I also want to provide an autocomplete function for existing unique tags
[17:54:43] <donCams> should I create a separate document for existing tags? I figured there is no fast-enough query to get all the unique tags across existing articles
[17:55:55] <donCams> however, I also don't know how I can make sure that document is updated to only contain tags that exist in the articles
[17:58:31] <b1001> Hi guys. I have a query: db.tweets.find({'lang':'ar', 'translated':{$exists:false}}).count() returns 60 documents without the translated field.
[17:59:15] <b1001> When i run it in pymongo: tweetsNeedTrans = db.tweets.find({'lang':'ar', 'translated':{'$exists':'false'}}).count() - it only returns 1.
[18:24:23] <dmitchell> could it be treating the string 'false' as truthy and passing True?
[18:31:38] <b1001> Yes i thought it was something with that 'false'.. It works with capital F.. Why is that?
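The capital F works because Python's boolean literal is `False`; the string `'false'` is just a truthy string value, so pymongo sends a non-boolean to the server and `{'$exists': 'false'}` behaves like `{'$exists': True}`, matching documents that HAVE the field. A minimal illustration (field name taken from b1001's query):

```python
# In the mongo shell, false is a boolean. In Python the equivalent is False;
# the STRING 'false' is truthy and inverts the meaning of the filter.
shell_style = {"translated": {"$exists": "false"}}   # wrong in Python
python_style = {"translated": {"$exists": False}}    # correct

print(bool("false"))  # True -- any non-empty string is truthy
```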
[18:34:39] <mmars> i'm configuring a replica set. when i run rs.conf() i see mongo has used the hostname but not the fully qualified domain name. why is that?
[18:36:19] <skot> mmars: what hostname did you type in for the configuration?
[18:49:31] <jrdn> i have 3 mongod's running with config (https://gist.github.com/j/26f98eb4aaa020d9bcd4) and on my first one, i run rs.initiate()
[18:49:54] <jrdn> then rs.add('123.123.123.1'), and rs.add('123.123.123.2')
[18:50:25] <jrdn> and rs.status() shows "still initializing" and the logs of the other two mongods say "replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)"
[19:16:12] <culthero> jrdn: bind_ip = should just be commented out; also check your replSet setting -- every member of the same replica set must use the same set name (e.g. rs0)
[19:17:02] <culthero> not sure why the initial rs.initiate() is just hanging, unless the replSet names don't match
[19:33:56] <jrdn> culthero, okay that's what i ended up doing plus adding to hosts files, so i'll just see if ips work with that commented out :)
[20:11:14] <ToeSnacks> what is the best way to deal with getting 'Invalid BSONObj size: 0 (0x00000000) first element: EOO' on a config server?
[20:27:21] <dmitchell> On my earlier, unanswered Q re auth failure, could it be that my 3 different mongo connections to the same db server using the same credentials are interfering with each other?
[21:16:26] <wc-> hi all, i gave this new mms automation a try, and it is spinning up a new instance then terminating then spinning up another...
[21:16:34] <wc-> is this a known issue at this time?
[22:32:18] <acidjazz> having the oddest mongodate sort issue .. on this one server, for some reason, ordering by tstamp -> 1 gives me the most current date, whereas tstamp -> -1 should give the most current, right?
[22:33:09] <TSMluffy> what happens if i rename a collection to an already existing collection name?
[22:34:06] <joannac> acidjazz: right. can you reproduce in a shell? if so, pastebin so I can take a look?
[22:34:34] <joannac> TSMluffy: pretty sure you can't
[22:36:13] <acidjazz> joannac: tnx for confirming im not going crazy.. waiting for a co-worker to mongodump the server this issue is on.. i can't duplicate it locally yet
[22:39:01] <bmw0679> I am trying to update multiple elements from within a nested array using PHP. So far it looks like I need to make a single query to update each element. Seems like overkill though, is there a better way to tackle this?
[22:56:33] <TSMluffy> question: I have basically 2 collections in the same db, A and B. Collection B got deprecated, so i need to move all its documents to the newer collection A
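One straightforward way to do the B-to-A move is client-side: read every document from B and insert it into A, skipping `_id`s that already exist in A (assuming collisions should keep A's copy). Plain lists stand in for the collections here so the logic is visible; the shell equivalent is sketched in the comments:

```python
# Migrate collection B into A, keeping A's document on _id collision.
def migrate(a_docs, b_docs):
    existing = {d["_id"] for d in a_docs}
    for doc in b_docs:
        if doc["_id"] not in existing:
            a_docs.append(doc)
            existing.add(doc["_id"])
    return a_docs

a = [{"_id": 1, "title": "kept"}]
b = [{"_id": 1, "title": "dupe"}, {"_id": 2, "title": "moved"}]
print(len(migrate(a, b)))  # 2

# In the mongo shell the copy loop is roughly:
#   db.B.find().forEach(function (d) { db.A.insert(d); });
# followed by db.B.drop() once the copy is verified.
```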