[04:30:23] <ningu> I am using cayley, which is a graph database with different backends, one of which can be mongo
[04:30:53] <ningu> I want to load about 1 billion triples into it. with its 'bolt' backend this takes about 12 hours on a single machine on GCE (16 vcpus, 60gb ram)
[04:31:42] <ningu> but unfortunately (?) it stores almost everything in two collections -- 'nodes' and 'rels'
[04:32:06] <GothAlice> Yeah; storing graph datasets in MongoDB often comes down to a collection of actual data, and a collection storing relationships between those entities.
[04:32:27] <GothAlice> http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/ can provide some reading on optimizing bulk inserts.
[04:32:55] <GothAlice> This was a topic discussed quite a bit on http://irclogger.com/.mongodb/2014-10-29 — my own contributions can be quickly found by searching for the word "locality".
[04:33:32] <GothAlice> Graph datasets also came up earlier today (follow the irclogger.com link in the channel topic).
[04:34:13] <GothAlice> joannac: I'm an ACI. Seriously. XD
[04:34:44] <ningu> this is not exactly 'big data' but what might be considered 'medium data' i.e. I suspect it could fit nicely on 3-4 machines
[04:35:14] <GothAlice> I have billions of records totalling 24TiB, and I don't consider myself to be "big data". (I don't query my datasets the same way as typical "big data" installations.)
[04:36:57] <GothAlice> Sharding with careful key configuration (and ordering of inserted data) a la the first link I provided will speed up those inserts.
[04:38:10] <GothAlice> In the case of graph associations, (i.e. {_id: {a: ObjectId('left'), b: ObjectId('right')}}) sharding on _id.a would spread the associations around reasonably well. (And split the insertions across as many shards as you can throw at it.)
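A minimal pymongo sketch of what that suggestion could look like in practice; the `graph` database, `rels` collection, and mongos hostname are placeholders rather than anything Cayley actually creates:

```python
from bson import ObjectId
from pymongo import MongoClient

# Connect through a mongos router (hostname is a placeholder).
client = MongoClient("mongodb://mongos-host:27017")

# Enable sharding for the database, then shard the association collection
# on the left-hand node of each edge so inserts spread across shards.
client.admin.command("enableSharding", "graph")
client.admin.command("shardCollection", "graph.rels", key={"_id.a": 1})

# An association document shaped like the example above.
client.graph.rels.insert_one({"_id": {"a": ObjectId(), "b": ObjectId()}})
```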
[04:38:34] <ningu> ok, then my other question is, what's the easiest/best way to get up and running quickly with a sharded cluster? I'm only one person... it would be nice if there were some Ansible scripts or whatever to automate some of it.
[04:39:50] <ningu> thanks, I'll check both of those out
[04:39:50] <GothAlice> +9001 (over nine thousand) to using MMS.
[04:41:03] <ningu> hmm... 8 servers for free? that might be enough.
[04:41:10] <ningu> I don't know how much storage they have, though.
[04:41:22] <GothAlice> ningu: My script provides a clear illustration of the process involved in configuring a cluster like that, if you prefer to read code. It's a combination of http://docs.mongodb.org/manual/tutorial/deploy-shard-cluster/ and http://docs.mongodb.org/manual/tutorial/deploy-replica-set/
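For readers who prefer code, a rough sketch of the two steps those tutorials combine, driven from pymongo; the hostnames and the replica set name `rs0` are invented:

```python
from pymongo import MongoClient

# 1. Initiate a three-member replica set by talking directly to one member.
#    (Hostnames and the set name "rs0" are placeholders.)
member = MongoClient("rs0-node1:27017", directConnection=True)
member.admin.command("replSetInitiate", {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "rs0-node1:27017"},
        {"_id": 1, "host": "rs0-node2:27017"},
        {"_id": 2, "host": "rs0-node3:27017"},
    ],
})

# 2. Register the whole set as one shard, through a mongos router.
mongos = MongoClient("mongos-host:27017")
mongos.admin.command("addShard", "rs0/rs0-node1:27017,rs0-node2:27017,rs0-node3:27017")
```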
[04:41:40] <ningu> I have a bunch of free credits on Google Compute Engine so I was hoping to use that, just to save money, since this is very experimental.
[04:42:19] <ningu> yes, reading code is good. in theory I'd like to just press a button and have it up but I don't know if that's realistic or affordable. I'm willing to do some work to set it up, as long as I don't have to individually configure every server.
[04:42:33] <GothAlice> ningu: Then MMS is your friend.
[04:42:48] <GothAlice> Install the agent on a few machines, and it's just a few clicks in the MMS interface from there, if I understand correctly.
[04:45:33] <ningu> so here's a really basic question, then... they're not going to know the structure of my dataset, so I'll have to set up the shard keys myself, I assume. how does one do that exactly with a cluster... is there a master? do you just connect to any one of them?
[04:46:00] <GothAlice> ningu: There are two separate concepts.
[04:48:53] <ningu> yes, I was asking about the replication. I am more familiar with rdbms where there is a master/slave setup.
[04:49:01] <GothAlice> Generally you have sharding "in front" of your replication (i.e. a-f goes to a set of three replicas, g-m goes to a different set of three, etc.)
[04:49:40] <GothAlice> You have "routers" at the very front that your application connects to. These routers speak MongoDB, but parse requests and try to work out where to send the query. Sometimes they need to send the query to multiple places, then aggregate the results.
[04:49:52] <GothAlice> (mongos processes vs. mongod processes storing the data)
[04:51:23] <GothAlice> (They speak to "configuration" servers, which track the cluster's replication members; everything, generally, self-organizes around your sharding keys.)
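A tiny sketch of that topology from the application's point of view (hostnames are placeholders): the client only ever connects to the mongos routers and can ask them which shards they manage:

```python
from pymongo import MongoClient

# The app connects to the routers only; mongos consults the config servers
# and forwards (or scatters and gathers) each query to the right shards.
client = MongoClient("mongodb://mongos1:27017,mongos2:27017")

# Ask the routers which shards they know about.
print(client.admin.command("listShards"))
```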
[04:51:46] <ningu> so realistically for this graph dataset, which is max 1TB (I mean the size of /var/lib/mongodb if stored on a single machine) how many servers would I need to make the bulk loading faster? I know you need 3 config servers for example.
[04:52:02] <GothAlice> Actually, you only *need* one.
[04:52:10] <GothAlice> Same with the sharding process (mongos).
[04:52:40] <GothAlice> And the configuration servers can piggyback on the ones actually storing the data, to save space. Config servers are extremely light-weight.
[04:53:29] <GothAlice> Your basic setup would need only one machine (whole dataset), and can use as many as you want (one per shard division). Replication (high-availability, backup, etc.) is entirely optional (but a Really Good Idea™.)
[04:53:52] <GothAlice> If you have two machines, each would need space for ~1/2 the dataset.
[04:54:53] <GothAlice> It can work roughly that way.
[04:55:06] <GothAlice> How you determine your sharding key, and the order you insert your data in, will have a strong impact on insert speed.
[04:55:21] <GothAlice> (See the article I linked.)
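As a hedged sketch of the load loop itself, assuming the collection has already been sharded as above and the edges arrive as an iterable of documents (db/collection names and batch size are arbitrary):

```python
from itertools import islice
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")   # placeholder router address
rels = client.graph.rels                              # placeholder db/collection

def bulk_load(edges, batch_size=10_000):
    """Push edge documents through mongos in large, unordered batches."""
    edges = iter(edges)
    while True:
        batch = list(islice(edges, batch_size))
        if not batch:
            break
        # ordered=False lets the server keep going past individual errors
        # and lets mongos dispatch sub-batches to different shards in parallel.
        rels.insert_many(batch, ordered=False)
```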
[04:56:00] <ningu> I'm not sure if I need HA and backup right now because this is not the master copy of the dataset, and there won't be a ton of people querying it.
[04:56:13] <GothAlice> Bam, you only need sharding, then. :)
[04:56:18] <ningu> basically I want to experiment and see what kind of performance I'll get.
[04:56:20] <GothAlice> http://docs.mongodb.org/manual/tutorial/deploy-shard-cluster/ and MMS are your friends.
[04:56:29] <GothAlice> See also: http://docs.mongodb.org/manual/tutorial/choose-a-shard-key/
[04:56:52] <ningu> ok, well, I have a bunch of stuff to read up on, I guess. :)
[04:57:10] <GothAlice> Choice of key is key, if you'll pardon the lame pun.
[04:57:59] <GothAlice> http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/ is a really good article about optimizing bulk inserts on sharded clusters, though.
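A rough pymongo translation of the pre-splitting idea from that article, assuming the `_id.a` shard key from earlier; the split points and shard names here are invented and would really come from sampling your own data:

```python
from bson import ObjectId
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos-host:27017")   # placeholder router address

# Invented split points carving the ObjectId space into four ranges.
split_points = [ObjectId("%02x" % (i * 64) + "0" * 22) for i in range(1, 4)]
shards = ["shard0000", "shard0001", "shard0002", "shard0003"]  # placeholder names

# Pre-split the key range, then park each chunk on its own shard before
# loading, so the bulk insert doesn't trigger chunk migrations mid-load.
for point in split_points:
    mongos.admin.command("split", "graph.rels", middle={"_id.a": point})

lower_bounds = [ObjectId("0" * 24)] + split_points
for point, shard in zip(lower_bounds, shards):
    mongos.admin.command("moveChunk", "graph.rels",
                         find={"_id.a": point}, to=shard)
```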
[17:13:12] <giuseppes> I'm trying to set up my mongodb instance on a remote server to be reached from my app, how can I get/configure the user, password and mongo URL?
[17:14:10] <giuseppes> I'm editing my mongod.conf now; I've already started to unbind the IP.
[22:38:52] <giuseppes> I'm trying to set up my mongodb instance on a remote server to be reached from my app, how can I get/configure the user, password and mongo URL?
[23:30:56] <wsmoak> giuseppes: I haven’t done it, but how about this? http://docs.mongodb.org/manual/tutorial/enable-authentication/
[23:39:45] <giuseppes> wsmoak, I've seen it but I'm not sure if it's what I need
[23:40:52] <wsmoak> that seems to be how you enable authentication … otherwise you can only connect on localhost
[23:40:58] <wsmoak> how are you connecting to mongodb now ?
[23:59:46] <wsmoak> joannac: really? I (briefly) read “localhost exception” and thought that _without_ auth turned on, you could ONLY connect via localhost.
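For reference, a minimal pymongo sketch of the sequence behind that tutorial: create a database user, then connect with a credentialed URL. The hostnames, database name, username, and password are placeholders, and mongod.conf still needs authorization enabled and bindIp opened, as discussed above:

```python
from pymongo import MongoClient

# Run once against the server (e.g. over the localhost exception, or while
# logged in as an admin) to create an application user. Names and the
# password are placeholders.
setup = MongoClient("mongodb://localhost:27017")
setup.mydb.command("createUser", "appuser",
                   pwd="s3cret",
                   roles=[{"role": "readWrite", "db": "mydb"}])

# The app then connects remotely with a URL of this shape; mongod.conf must
# have authorization enabled and bindIp opened to the app server first.
client = MongoClient("mongodb://appuser:s3cret@db.example.com:27017/mydb")
print(client.mydb.command("ping"))
```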