[04:30:23] <ningu> I am using cayley, which is a graph database with different backends, one of which can be mongo
[04:30:53] <ningu> I want to load about 1 billion triples into it. with its 'bolt' backend this takes about 12 hours on a single machine on GCE (16 vcpus, 60gb ram)
[04:31:42] <ningu> but unfortunately (?) it stores almost everything in two collections -- 'nodes' and 'rels'
[04:32:06] <GothAlice> Yeah; storing graph datasets in MongoDB often comes down to a collection of actual data, and a collection storing relationships between those entities.
[04:32:27] <GothAlice> http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/ can provide some reading on optimizing bulk inserts.
[04:32:55] <GothAlice> This was a topic discussed quite a bit on http://irclogger.com/.mongodb/2014-10-29 — my own contributions can be quickly found by searching for the word "locality".
[04:33:32] <GothAlice> Graph datasets also came up earlier today (follow the irclogger.com link in the channel topic).
[04:34:13] <GothAlice> joannac: I'm an ACI. Seriously. XD
[04:34:44] <ningu> this is not exactly 'big data' but what might be considered 'medium data' i.e. I suspect it could fit nicely on 3-4 machines
[04:35:14] <GothAlice> I have billions of records totalling 24TiB, and I don't consider myself to be "big data". (I don't query my datasets the same way as typical "big data" installations.)
[04:36:57] <GothAlice> Sharding with careful key configuration (and ordering of inserted data) a la the first link I provided will speed up those inserts.
[04:38:10] <GothAlice> In the case of graph associations, (i.e. {_id: {a: ObjectId('left'), b: ObjectId('right')}}) sharding on _id.a would spread the associations around reasonably well. (And split the insertions across as many shards as you can throw at it.)
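A minimal pymongo sketch of what that suggestion could look like in practice; the `graph` database, `rels` collection, and mongos hostname are placeholders rather than anything Cayley actually creates:

```python
from bson import ObjectId
from pymongo import MongoClient

# Connect through a mongos router (hostname is a placeholder).
client = MongoClient("mongodb://mongos-host:27017")

# Enable sharding for the database, then shard the association collection
# on the left-hand node of each edge so inserts spread across shards.
client.admin.command("enableSharding", "graph")
client.admin.command("shardCollection", "graph.rels", key={"_id.a": 1})

# An association document shaped like the example above.
client.graph.rels.insert_one({"_id": {"a": ObjectId(), "b": ObjectId()}})
```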
[04:38:34] <ningu> ok, then my other question is, what's the easiest/best way to get up and running quickly with a sharded cluster? I'm only one person... it would be nice if there were some Ansible scripts or whatever to automate some of it.
[04:39:50] <ningu> thanks, I'll check both of those out
[04:39:50] <GothAlice> +9001 (over nine thousand) to using MMS.
[04:41:03] <ningu> hmm... 8 servers for free? that might be enough.
[04:41:10] <ningu> I don't know how much storage they have, though.
[04:41:22] <GothAlice> ningu: My script provides a clear illustration of the process involved in configuring a cluster like that, if you prefer to read code. It's a combination of http://docs.mongodb.org/manual/tutorial/deploy-shard-cluster/ and http://docs.mongodb.org/manual/tutorial/deploy-replica-set/
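For readers who prefer code, a rough sketch of the two steps those tutorials combine, driven from pymongo; the hostnames and the replica set name `rs0` are invented:

```python
from pymongo import MongoClient

# 1. Initiate a three-member replica set by talking directly to one member.
#    (Hostnames and the set name "rs0" are placeholders.)
member = MongoClient("rs0-node1:27017", directConnection=True)
member.admin.command("replSetInitiate", {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "rs0-node1:27017"},
        {"_id": 1, "host": "rs0-node2:27017"},
        {"_id": 2, "host": "rs0-node3:27017"},
    ],
})

# 2. Register the whole set as one shard, through a mongos router.
mongos = MongoClient("mongos-host:27017")
mongos.admin.command("addShard", "rs0/rs0-node1:27017,rs0-node2:27017,rs0-node3:27017")
```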
[04:41:40] <ningu> I have a bunch of free credits on Google Compute Engine so I was hoping to use that, just to save money, since this is very experimental.
[04:42:19] <ningu> yes, reading code is good. in theory I'd like to just press a button and have it up but I don't know if that's realistic or affordable. I'm willing to do some work to set it up, as long as I don't have to individually configure every server.
[04:42:33] <GothAlice> ningu: Then MMS is your friend.
[04:42:48] <GothAlice> Install the agent on a few machines, and it's just a few clicks in the MMS interface from there, if I understand correctly.
[04:45:33] <ningu> so here's a really basic question, then... they're not going to know the structure of my dataset, so I'll have to set up the shard keys myself, I assume. how does one do that exactly with a cluster... is there a master? do you just connect to any one of them?
[04:46:00] <GothAlice> ningu: There are two separate concepts.
[04:48:53] <ningu> yes, I was asking about the replication. I am more familiar with rdbms where there is a master/slave setup.
[04:49:01] <GothAlice> Generally you have sharding "in front" of your replication (i.e. a-f goes to a set of three replicas, g-m goes to a different set of three, etc.)
[04:49:40] <GothAlice> You have "routers" at the very front that your application connects to. These routers speak MongoDB, but parse requests and try to work out where to send the query. Sometimes they need to send the query to multiple places, then aggregate the results.
[04:49:52] <GothAlice> (mongos processes vs. mongod processes storing the data)
[04:51:23] <GothAlice> (They speak to "configuration" servers, which track the cluster's replication members; everything, generally, self-organizes around your sharding keys.)
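A tiny sketch of that topology from the application's point of view (hostnames are placeholders): the client only ever connects to the mongos routers and can ask them which shards they manage:

```python
from pymongo import MongoClient

# The app connects to the routers only; mongos consults the config servers
# and forwards (or scatters and gathers) each query to the right shards.
client = MongoClient("mongodb://mongos1:27017,mongos2:27017")

# Ask the routers which shards they know about.
print(client.admin.command("listShards"))
```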
[04:51:46] <ningu> so realistically for this graph dataset, which is max 1TB (I mean the size of /var/lib/mongodb if stored on a single machine) how many servers would I need to make the bulk loading faster? I know you need 3 config servers for example.
[04:52:02] <GothAlice> Actually, you only *need* one.
[04:52:10] <GothAlice> Same with the sharding process (mongos).
[04:52:40] <GothAlice> And the configuration servers can piggyback on the ones actually storing the data, to save space. Config servers are extremely light-weight.
[04:53:29] <GothAlice> Your basic setup would need only one machine (whole dataset), and can use as many as you want (one per shard division). Replication (high-availability, backup, etc.) is entirely optional (but a Really Good Idea™.)
[04:53:52] <GothAlice> If you have two machines, each would need space for ~1/2 the dataset.
[04:54:53] <GothAlice> It can work roughly that way.
[04:55:06] <GothAlice> How you determine your sharding key, and the order you insert your data in, will have a strong impact on insert speed.
[04:55:21] <GothAlice> (See the article I linked.)
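As a hedged sketch of the load loop itself, assuming the collection has already been sharded as above and the edges arrive as an iterable of documents (db/collection names and batch size are arbitrary):

```python
from itertools import islice
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")   # placeholder router address
rels = client.graph.rels                              # placeholder db/collection

def bulk_load(edges, batch_size=10_000):
    """Push edge documents through mongos in large, unordered batches."""
    edges = iter(edges)
    while True:
        batch = list(islice(edges, batch_size))
        if not batch:
            break
        # ordered=False lets the server keep going past individual errors
        # and lets mongos dispatch sub-batches to different shards in parallel.
        rels.insert_many(batch, ordered=False)
```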
[04:56:00] <ningu> I'm not sure if I need HA and backup right now because this is not the master copy of the dataset, and there won't be a ton of people querying it.
[04:56:13] <GothAlice> Bam, you only need sharding, then. :)
[04:56:18] <ningu> basically I want to experiment and see what kind of performance I'll get.
[04:56:20] <GothAlice> http://docs.mongodb.org/manual/tutorial/deploy-shard-cluster/ and MMS are your friends.
[04:56:29] <GothAlice> See also: http://docs.mongodb.org/manual/tutorial/choose-a-shard-key/
[04:56:52] <ningu> ok, well, I have a bunch of stuff to read up on, I guess. :)
[04:57:10] <GothAlice> Choice of key is key, if you'll pardon the lame pun.
[04:57:59] <GothAlice> http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/ is a really good article about optimizing bulk inserts on sharded clusters, though.
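A rough pymongo translation of the pre-splitting idea from that article, assuming the `_id.a` shard key from earlier; the split points and shard names here are invented and would really come from sampling your own data:

```python
from bson import ObjectId
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos-host:27017")   # placeholder router address

# Invented split points carving the ObjectId space into four ranges.
split_points = [ObjectId("%02x" % (i * 64) + "0" * 22) for i in range(1, 4)]
shards = ["shard0000", "shard0001", "shard0002", "shard0003"]  # placeholder names

# Pre-split the key range, then park each chunk on its own shard before
# loading, so the bulk insert doesn't trigger chunk migrations mid-load.
for point in split_points:
    mongos.admin.command("split", "graph.rels", middle={"_id.a": point})

lower_bounds = [ObjectId("0" * 24)] + split_points
for point, shard in zip(lower_bounds, shards):
    mongos.admin.command("moveChunk", "graph.rels",
                         find={"_id.a": point}, to=shard)
```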
[17:13:12] <giuseppes> I'm trying to set up my mongodb instance on a remote server to be reached from my app, how can I get/configure the user, password and mongo URL?
[17:14:10] <giuseppes> I'm editing my mongod.conf now; I've already started to unbind the IP.
[22:38:52] <giuseppes> I'm trying to set up my mongodb instance on a remote server to be reached from my app, how can I get/configure the user, password and mongo URL?
[23:30:56] <wsmoak> giuseppes: I haven’t done it, but how about this? http://docs.mongodb.org/manual/tutorial/enable-authentication/
[23:39:45] <giuseppes> wsmoak, I've seen it but I'm not sure if it's what I need
[23:40:52] <wsmoak> that seems to be how you enable authentication … otherwise you can only connect on localhost
[23:40:58] <wsmoak> how are you connecting to mongodb now ?
[23:59:46] <wsmoak> joannac: really? I (briefly) read “localhost exception” and thought that _without_ auth turned on, you could ONLY connect via localhost.
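For reference, a minimal pymongo sketch of the sequence behind that tutorial: create a database user, then connect with a credentialed URL. The hostnames, database name, username, and password are placeholders, and mongod.conf still needs authorization enabled and bindIp opened, as discussed above:

```python
from pymongo import MongoClient

# Run once against the server (e.g. over the localhost exception, or while
# logged in as an admin) to create an application user. Names and the
# password are placeholders.
setup = MongoClient("mongodb://localhost:27017")
setup.mydb.command("createUser", "appuser",
                   pwd="s3cret",
                   roles=[{"role": "readWrite", "db": "mydb"}])

# The app then connects remotely with a URL of this shape; mongod.conf must
# have authorization enabled and bindIp opened to the app server first.
client = MongoClient("mongodb://appuser:s3cret@db.example.com:27017/mydb")
print(client.mydb.command("ping"))
```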