[06:32:55] <duncan_donuts> Pymongo v3… is it just me or is anyone else getting annoyed because the ObjectId is no longer manipulated back into the dict?
[06:53:12] <Nikesh> Can anyone help me install MongoDB on Ubuntu 15.04? The apt-get install seems to want to use 'upstart', but apparently Ubuntu now uses systemd
[06:53:24] <Nikesh> initctl: Unable to connect to Upstart: Failed to connect to socket /com/ubuntu/upstart: Connection refused
[06:58:23] <Nikesh> I am leaving for now but will check the logs if anyone has some insight.
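(For anyone hitting the same upstart error on a systemd-based Ubuntu: a minimal unit file along these lines is usually enough. The paths and service user below are assumptions based on the standard mongodb-org package layout; adjust to your install.)

    # /etc/systemd/system/mongod.service  (hypothetical)
    [Unit]
    Description=MongoDB database server
    After=network.target

    [Service]
    User=mongodb
    ExecStart=/usr/bin/mongod --quiet --config /etc/mongod.conf
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

    # then:
    sudo systemctl enable mongod
    sudo systemctl start mongod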
[07:54:15] <devdvd> I am working on learning about mongo performance tuning and scaling and was wondering what the effect is of my mappedWithJournal value being higher than my available memory. Does this mean it's swapping out to the hard drive?
[07:54:50] <devdvd> the mapped value is less than physical memory
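(Context: with MMAPv1, mappedWithJournal is virtual address space, not resident RAM; journaling maps the data files a second time, so it is roughly double the mapped value and can exceed physical memory without by itself implying swapping. The figures can be inspected from the shell:)

    // mongo shell: MMAPv1 memory figures
    db.serverStatus().mem
    // { bits: 64, resident: ..., virtual: ..., mapped: ..., mappedWithJournal: ... }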
[08:50:44] <watmm> I have a mongodump cron job set up and recently it's started returning "warning: Failed to connect to x.x.x.x:27017, reason: errno:111 Connection refused", but i can't see anything in mongo's logs and (to my knowledge) the conf wasn't changed. Any idea where to look?
[08:54:23] <devdvd> watmm: is the mongodump being run on the same box as the mongo instance?
[09:34:28] <devdvd> watmm: then the first thing i suggest is trying to establish a connection from the dump box to the mongo instance using something like telnet or nc
[09:35:30] <devdvd> connection refused usually means the target server (in this case mongo) is telling the source (in this case the box running the dump) to piss off
[09:36:19] <devdvd> or even better, try installing the mongo client on the box doing the dump and try to log in to your mongo instance with it
[09:37:05] <devdvd> while doing this i'd suggest doing a tcpdump on the mongo box filtering by source IP of the dump server
[09:37:15] <devdvd> this will tell you if traffic is even getting to the server
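(A rough sketch of those checks; x.x.x.x is the mongo host from the error above, and <dump-box-ip> is a placeholder for the machine running mongodump:)

    # from the box running the dump: can we reach the port at all?
    nc -vz x.x.x.x 27017          # or: telnet x.x.x.x 27017
    mongo --host x.x.x.x          # mongo shell, if installed there

    # on the mongo box: is the traffic even arriving?
    tcpdump -nn -i any 'tcp port 27017 and src host <dump-box-ip>'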
[10:09:42] <bigsky> what is the difference between mongo and mongo localhost ?
[10:10:26] <Derick> nothing? I might not fully understand the question
[10:12:46] <joannac> "mongo localhost" connects to the mongod on 127.0.0.1, port 27017, to the database "localhost"
[10:12:49] <bigsky> Derick: when i connect to the mongodb service running on my local machine using mongo localhost, i cannot find the collections that show up when i connect with just mongo
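(Illustrating joannac's point: the shell's positional argument is a database address, so the same collections are only visible if you land in the same database.)

    mongo                  # 127.0.0.1:27017, database "test"
    mongo localhost        # 127.0.0.1:27017, database named "localhost"
    mongo localhost/test   # host "localhost", database "test" -- same as plain "mongo"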
[10:23:15] <Derick> sorry - don't know anything about that Garito
[10:23:30] <Garito> @Derick perhaps it's not related to docker
[10:23:40] <Garito> the question is how to automatically start a replica set
[10:23:44] <joannac> Derick: do you have a replset handy? apparently you can connect to a replset now. http://docs.mongodb.org/manual/reference/program/mongo/#cmdoption--host
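(Per the page joannac linked, the shell can target a replica set directly; the set name and hostnames below are placeholders:)

    mongo --host rs0/host1.example.com:27017,host2.example.com:27017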
[14:12:29] <fxmulder> joannac: I rsynced one replica to another
[14:13:01] <fxmulder> bringing both up seems to have worked, was just unsure since the primary wasn't coming up
[14:17:38] <pamp> is it normal that when we separate index and data onto different drives, read performance decreases a lot?
[14:19:22] <pamp> I separated the data, index, journal and logs onto different drives.. the write performance increased a lot, but the read performance decreased badly
[14:20:27] <pamp> should I not separate index and data?
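(For reference, that kind of split is typically done with a mongod.conf fragment along these lines, assuming the WiredTiger engine; the resulting per-database and index directories are then symlinked or mounted onto separate drives. Paths are made up.)

    storage:
      dbPath: /data/db                 # data files; mount on its own drive
      directoryPerDB: true
      wiredTiger:
        engineConfig:
          directoryForIndexes: true    # indexes land in an "index" subdirectory
    systemLog:
      destination: file
      path: /logs/mongod.log           # logs on yet another drive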
[15:10:09] <mrmccrac> snappy compression in mongo 3.0 — is the compression of one document independent of compression in a different doc? or will docs in the same collection benefit from other docs?
[15:18:15] <GothAlice> mrmccrac: AFAIK the compression is performed against chunks of the on-disk stripes; it's not document-aware.
[15:18:28] <GothAlice> (And also not collection aware.)
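(The default block compressor is snappy, set engine-wide via storage.wiredTiger.collectionConfig.blockCompressor; it can also be overridden per collection at creation time. A hypothetical shell example, with a made-up collection name:)

    // create a collection using zlib instead of the default snappy
    db.createCollection("events", {
      storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } }
    })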
[15:19:19] <GothAlice> However, as a word of warning, WiredTiger has some serious outstanding issues that affect performance, reliability, and data safety.
[15:20:23] <mrmccrac> ive seen a few peculiar things i guess :)
[15:20:36] <GothAlice> https://jira.mongodb.org/browse/SERVER-17424 https://jira.mongodb.org/browse/SERVER-17456 https://jira.mongodb.org/browse/SERVER-16311 (See also: https://jira.mongodb.org/browse/SERVER-17386)
[15:20:44] <GothAlice> These are just the ones that affect me. ;)
[15:21:11] <GothAlice> Under nominal load I can crash a primary every 15-30 seconds. :D
[15:22:00] <mrmccrac> we were crashing it until we added --wiredTigerEngineConfigString=hazard_max=10000
[15:26:34] <GothAlice> The OOM-killer, at one point, even chose to kill a kernel process that panicked the entire VM. That was fun to see. (I enjoy watching things explode and burn too much, I think. ;)
[15:35:53] <boutell> Hi. I am puzzling over the authSource option, which specifies “the source of the authentication credentials.” Apparently this allows the client to say where to verify the password it’s providing… does that make any sense from a security standpoint? Why would the server allow me to say “hey, verify my password against this database of *my* choosing”? That’s why I am pretty sure I have misunderstood what it does.
[15:40:13] <GothAlice> boutell: Users are tied to databases.
[15:40:31] <GothAlice> boutell: This has the effect that a user can be granted permission on a different database than the one storing the user credentials.
[15:41:13] <GothAlice> boutell: For example, a global "admin" user with permission to read anything. To connect to a random database you'll need to specify that the account credentials actually come from the "admin" database, or it won't find the user.
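(A sketch of the setup GothAlice describes; the user name, password, and databases are made up. The user document lives in "admin", but its roles point at other databases:)

    // in the mongo shell
    use admin
    db.createUser({
      user: "reporting",
      pwd: "s3cret",
      roles: [
        { role: "read", db: "production" },
        { role: "read", db: "analytics" }
      ]
    })

    // connecting then means naming the database that stores the credentials:
    //   mongo production -u reporting -p s3cret --authenticationDatabase admin
    //   mongodb://reporting:s3cret@db.example.com/production?authSource=admin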
[15:47:13] <boutell> GothAlice: okay, but apparently I can specify any database, such as one I have ReadWrite access to, and store credentials saying I can access some other database?
[15:50:52] <GothAlice> If you have UserAdmin on that database, then you can manage users there.
[16:26:07] <boutell> GothAlice: but isn’t this crazypants? It means there’s this permission I can grant on database A which is in reality a ticket to blanket permission on databases B-Z.
[16:26:22] <boutell> it does sound like ReadWrite does not imply UserAdmin, which is good of course.
[16:26:37] <GothAlice> Well, thus you should really carefully control who gets UserAdmin. ;)
[16:27:09] <boutell> yes… I’m wondering why there isn’t simply a single admin database, period. Maybe if I understood when this would ever be a good idea beyond the admin database...
[16:27:45] <GothAlice> For the most part my database users are isolated to their databases, i.e. the only account in the "admin" database is "admin".
[16:28:38] <boutell> is there some kind of perf win with that or something? As opposed to all users being in the admin database?
[16:28:48] <GothAlice> For example, I have a "rita" user that has permissions on "production" and "staging" databases. The "rita" user is in the production database.
[16:29:20] <GothAlice> Well, for one, if the user is local to the database you don't need to specify the authenticationDatabase when connecting.
[16:30:11] <boutell> I do recall encountering that annoyance.
[16:30:47] <boutell> although, if mongo were designed to only use the admin database for users, you wouldn’t have to specify. The point about nuking users and the database with a single command makes sense I guess.
[16:31:25] <GothAlice> Or you can functionally group your users.
[16:31:39] <GothAlice> Create a database called "shared" for your "shared hosting" clients, etc.
[16:31:49] <GothAlice> (Store the users in "shared", with permissions on their own DBs.)
[16:32:02] <GothAlice> The flexibility means that even use cases I can't imagine right now can be covered.
[16:32:47] <GothAlice> (It also means there are no special cases. Users are users, users live in a DB, users have permissions on DBs. No magic about the "admin" database, etc.)
[16:34:54] <pamp> is it possible to have a cluster where the shards (replica sets) are in linux environments and the router / config servers are in windows environments?
[16:38:16] <GothAlice> pamp: Yes, mixing platforms shouldn't matter in that scenario. It shouldn't really matter in any scenario, but having differing IO performance on data-carrying replicas can exhibit some really strange issues.
[16:39:19] <GothAlice> (Like replica secondaries falling far enough behind to re-sync.)
[16:39:25] <GothAlice> In your scenario, though, that won't be an issue.
[16:42:17] <pamp> all shard servers will be in a linux environment
[16:42:50] <pamp> only the configs and router will be in a windows environment
[16:46:14] <boutell> is anybody keeping a terabyte of data that doesn’t shard easily in MongoDB? Do you feel it makes sense at that size or would you use something else in that scenario?
[16:47:17] <StephenLynx> why wouldn't it be easily sharded?
[16:51:19] <GothAlice> boutell: All data is shard-able, even if it just gets allocated to a given shard by hash (random, even distribution) instead of something more "intelligent".
[16:51:30] <GothAlice> With the exception to the above being capped collections.
[16:51:37] <GothAlice> Those can't be sharded for performance reasons.
[16:52:17] <GothAlice> boutell: Sharding makes sense any time the amount of data you have exceeds the amount of RAM you have. Sharding is the only real way to efficiently reduce the per-host data size… by adding more servers.
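(What hash-based allocation looks like in practice; the database and collection names are hypothetical:)

    // mongo shell, connected via mongos
    sh.enableSharding("mydb")
    use mydb
    db.events.createIndex({ _id: "hashed" })
    sh.shardCollection("mydb.events", { _id: "hashed" })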
[16:52:27] <boutell> StephenLynx: that’s an entirely fair point, it’s a pretty arbitrary insistence on my part.
[16:53:02] <StephenLynx> usually arbitrary decisions lead to bad practical consequences.
[16:53:26] <boutell> GothAlice: OK. So on a per-machine basis though, it really isn’t a good idea to go beyond RAM size, while MySQL has been optimized somewhat for that case. (I’m not trying to start a flamewar here, I’m looking at pros and cons and use cases)
[16:53:35] <boutell> (I use MongoDB pretty much exclusively right now & love it)
[16:54:54] <boutell> (I assume that swap partitions don’t work around that especially well?)
[16:54:54] <StephenLynx> cases where you want to use mongo: lots of reads and writes, constant horizontal scaling, no need for joins on data.
[16:55:45] <boutell> that makes sense and it’s a pretty conservative and sane way of defining it StephenLynx. Although in practice, I do oodles of joins with mongo, because even when I have to implement my own joins, it’s still vastly better for the kind of data structures I’ve got.
[16:55:49] <StephenLynx> where you don't want to use mongo: need for relational integrity and joins, graph relations
[16:56:28] <boutell> interestingly, in our CMS work, both graph relationships and joins are huge, although we don’t need strict integrity - it’s not hard in practice to just not fetch things that ain’t there no mo
[16:57:10] <StephenLynx> it's not that it's hard to do, it just isn't optimized at all.
[16:57:31] <StephenLynx> a relational db is built from the ground up to perform the joins before sending the data to your application.
[16:58:14] <StephenLynx> with a graph db, the same goes for traversing recursive relations.
[16:58:28] <boutell> StephenLynx: yes, I understand. In practice, for a CMS a single VPS can typically serve even a large and substantially popular site, like a major college for instance. And that means the latency between you and mongo is really, really small, which cuts down on the pain of making multiple queries to handle the joins and relations.
[16:59:28] <boutell> mongo *could* support joins though. It’s not as if mongo is conservative about adding support for features that aren’t traditionally considered “lean and mean and NoSQL”
[16:59:28] <StephenLynx> if you are just going to justify it by comparing the hardware you got and the load you have to handle, then any problem could be justified by just throwing more resources until the problem is small enough to be handled.
[16:59:57] <StephenLynx> I am not saying it can't be done, I am saying it could be done better.
[17:00:25] <StephenLynx> the whole point of picking a tech for a project is handling increasing requirements.
[17:01:05] <StephenLynx> so today you have to deal with a single campus. then you need to integrate an international network of campuses and your server bites the dust.
[17:01:06] <boutell> mmm, not every project is intended for “webscale” or will ever be “webscale” or has to be based on that open-ended worry
[17:01:08] <GothAlice> boutell: In the CMS example, there are data design considerations that allow you to not need joins.
[17:01:29] <StephenLynx> then just pick a name from a hat and roll with it.
[17:01:37] <boutell> I would actually say that the vast majority of projects will never be webscale and shouldn’t be twisted into that direction
[17:02:10] <StephenLynx> don't ask for pros and cons if you are just going to say the project is too small for said pros and cons to matter.
[17:02:50] <boutell> StephenLynx: I should clarify. I asked the original question out of curiosity. I am not saying it is in any way relevant to our CMS work. I brought that up later in the conversation because we were talking about joins.
[17:02:51] <GothAlice> https://gist.github.com/amcgregor/901c6d5031ed4727dd2f#file-taxonomy-py-L23-L35 < my CMS stores a parent reference, parents list of references, name, coalesced path, and sort order for each "Asset" in the CMS tree. Querying for /foo/bar/baz/diz will perform a single query looking for a path $in ['/foo', '/foo/bar', '/foo/bar/baz', '/foo/bar/baz/diz'] sorted by -path, taking the first result. This gives the "deepest" matching document from the tree.
[17:03:15] <boutell> GothAlice: yeah, we do pretty much the same thing.
[17:03:34] <boutell> my 1TB question <> anything to do with the CMS
[17:03:47] <GothAlice> boutell: Extra special trick: the ACLs of all parents are cached in the children, allowing us to avoid a second lookup to get the ACLs of the parents for security validation.
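(Roughly what that query looks like in the shell, against a hypothetical "assets" collection:)

    var path = "/foo/bar/baz/diz";
    var prefixes = [];
    var acc = "";
    path.split("/").slice(1).forEach(function (part) {
      acc += "/" + part;
      prefixes.push(acc);   // ["/foo", "/foo/bar", "/foo/bar/baz", "/foo/bar/baz/diz"]
    });
    // deepest existing asset along the requested path, in a single query
    db.assets.find({ path: { $in: prefixes } }).sort({ path: -1 }).limit(1)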
[17:04:04] <StephenLynx> which you just said isn't easily sharded because you don't want to shard.
[17:04:46] <boutell> StephenLynx: no, I said I was curious about what would happen if it wasn’t easily sharded, it was a hypothetical. I don’t actually have a webscale problem to solve. I just wanted to further my understanding of what Mongo does and doesn’t claim to do, for some future project, maybe. <— likes learning things
[17:05:23] <boutell> I then got onto the subject of joins in a way that generated confusion.
[17:09:44] <boutell> . o O I don’t work for a startup. If you work for a startup, considerations are probably very different. If you don’t have millions of users, you’ve failed, straight up
[17:10:22] <boutell> . o O so yes, you have to pick tools entirely on scalability first, convenience second
[17:24:28] <RWOverdijk> Hello :) I've been able to get mongo to retrieve records near a specific location (geospatial). Now, I'd like to first sort records by a date field, and then sort that result based on distance... In SQL terms, `order by start_date, distance`. I don't know where to get started.
[17:27:09] <StephenLynx> are you using find to get a cursor?
[17:28:54] <RWOverdijk> StephenLynx, Forgive me my ignorance but, what's a cursor? I'm currently using collection.db.command({geoNear : "event", near: [] ... to find the records.
[17:29:26] <StephenLynx> command? I have seen that before.
[17:30:41] <RWOverdijk> I really don't like being a noob again.
[17:31:00] <StephenLynx> let me check geo queries, never used them
[17:31:21] <RWOverdijk> All I can find is near, within etc.
[17:32:29] <StephenLynx> http://mongodb.github.io/node-mongodb-native/2.0/api/Collection.html#geoNear here
[17:32:46] <StephenLynx> you can run the command on the variable representing the collection
[17:34:12] <StephenLynx> so you run geoNear first; according to the driver documentation it will yield the result you are already obtaining in the second parameter of the callback.
[17:34:48] <StephenLynx> either the result will be an array or a cursor.
[17:35:01] <RWOverdijk> And by "the command" you mean, sort?
[17:35:29] <RWOverdijk> Because I'm looking for a way to do it the other way around
[17:35:36] <StephenLynx> db.collection('collectionName') returns a variable that represents the collection. on this variable you can run functions like find, aggregate
[17:36:10] <StephenLynx> so you don't need to call command and then specify which command you wish to run, you just run the function that represents the command.
[17:36:11] <RWOverdijk> Does that hold all data, or will it not hold anything until I execute it?
[17:36:21] <RWOverdijk> Because it's a large database :p
[17:36:31] <StephenLynx> that's what cursors are for.
[17:36:40] <StephenLynx> they don't hold the documents, but references to them.
[17:39:58] <RWOverdijk> geoNear sorts documents by distance. If you also include a sort() for the query, sort() re-orders the matching documents, effectively overriding the sort operation already performed by geoNear.
[17:40:10] <StephenLynx> you can also use the $geoNear aggregation pipeline
[17:40:21] <StephenLynx> and sort in the aggregation
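(A sketch of that pipeline against the "event" collection mentioned earlier; it assumes a 2dsphere index on a "location" field plus a "start_date" field, and the coordinates are placeholders:)

    db.event.aggregate([
      { $geoNear: {
          near: { type: "Point", coordinates: [ -73.99, 40.73 ] },
          distanceField: "distance",     // added to each matching document
          maxDistance: 5000,             // metres
          spherical: true,
          num: 1000                      // $geoNear caps results (default 100)
      } },
      { $sort: { start_date: 1, distance: 1 } },   // date first, then distance
      { $limit: 50 }
    ])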
[17:40:28] <RWOverdijk> I should really read up on the terminology
[17:40:53] <RWOverdijk> An aggregate is a collection of x.
[17:41:07] <RWOverdijk> So what's x in this sense?
[17:41:52] <StephenLynx> an aggregation is a series of operations performed at once in the database before returning to your application; from what I understand it will return an array.
[17:42:06] <StephenLynx> it includes some complex operations you can't do with plain commands
[17:43:13] <StephenLynx> there is nothing else that will perform those operations.
[17:43:19] <GothAlice> Exactly. If you can get away with using standard queries, it's generally recommended to do so.
[17:43:38] <RWOverdijk> I wish I could apply distance sort as a secondary sort.
[17:43:40] <GothAlice> However, sometimes a proper data processing pipeline, with the extra overhead, is either acceptable or required to answer a given query, as StephenLynx says.
[17:50:59] <RWOverdijk> I'm using mongo for media indexing
[17:51:10] <StephenLynx> yeah, you might be better off using maria for the data you will join.
[17:51:19] <RWOverdijk> So mongo's being used anyway, but I really would love it if the geospatial stuff would work :p
[17:52:00] <RWOverdijk> Oh so, I could reverse this possibly
[17:52:14] <RWOverdijk> I could first fetch records within a certain range, and sort those by date
[17:52:24] <RWOverdijk> But that would mess up the distance sorting
[17:54:21] <RWOverdijk> Final question... aggregates and cursors run over already filtered results, right?
[17:54:43] <StephenLynx> if you filter the results, yes.
[17:54:44] <RWOverdijk> So out of the three mil I have, it'll first filter based on the conditions I gave it, and then allow me to fiddle with that data?
[18:06:15] <StephenLynx> when you get a cursor you have to perform operations to get the data. if you just want them all at once, there is a function toArray that is synchronous if I am not mistaken.
[18:06:20] <RWOverdijk> I shouldn't drink whiskey and ask questions at the same time, I make bad jokes. Sorry :p
[18:08:02] <RWOverdijk> I want something that's soon, and close to show up first. Then, something that's soon but a bit further away to come second, and nicely balanced with other records.
[18:08:06] <StephenLynx> so if I needed a series of documents I just used aggregate so I wouldn't have to perform multiple operations.
[18:08:56] <RWOverdijk> But it's good. I can look up the actual performance loss
[18:09:02] <RWOverdijk> Figure out if it's worth it
[18:09:12] <RWOverdijk> Because aggregates also work on already sorted data
[18:09:31] <RWOverdijk> So if I use near, I can limit, and sort the matching records
[18:09:37] <RWOverdijk> Which I obviously limit to "x"
[18:10:01] <RWOverdijk> And my logic will only apply to a subset of the 3 mil records
[18:11:06] <StephenLynx> yeah, but again, most stuff can be done with cursors.
[18:11:24] <StephenLynx> keep that in mind if you use aggregate and get a bottleneck
[20:46:30] <teen> hi guys i am a total noob here... i have a Channel model that has many Broadcasts - I'm trying to query the most recent Broadcast for each Channel. The problem is the Channel table doesn't have a ref to the Broadcast- but the Broadcast does have a ref to the Channel. Can I still query the Channel document and populate its Broadcasts? pls halp
[20:47:46] <teen> this would be really simple in sql : (
[20:48:08] <GothAlice> teen: It's simple in MongoDB, too, and pretending you couldn't use joins in SQL, the solution is the same, too.
[20:48:33] <StephenLynx> what if you use aggregation: first you sort by time of broadcast, then you group using the channel as the _id and take the first broadcast in each group?
[20:48:47] <StephenLynx> so you will get the last broadcast of each channel
[20:48:59] <GothAlice> Indeed, that would be the optimum approach, StephenLynx.
[20:49:32] <teen> yea but don't i need a for loop then for each channel ?
[20:50:40] <teen> i want the result to look like [{name: 'channel A', last_broadcast: {name: 'broadcast x'}}, {name: 'channel B', last_broadcast: {name: 'broadcast Y'}}]
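(One way to express that as an aggregation; the collection and field names are guesses at the schema:)

    db.broadcasts.aggregate([
      { $sort: { created_at: -1 } },                // newest broadcasts first
      { $group: {
          _id: "$channel_id",                       // one group per channel
          last_broadcast: { $first: "$$ROOT" }      // first doc in each group = newest
      } }
    ])

The channel names would then come from the channel documents themselves (or from fields copied onto the broadcasts), since no server-side join happens here.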
[22:52:49] <unholycrab> trying to decide if i need to provision my mongodb instance with enough memory to store ALL indexes in memory ALL at the same time
[22:52:56] <unholycrab> or if i should provision something close to that
[22:53:15] <cheeser> it's best if you can keep all the indexes in memory
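(A rough way to see how much memory that actually means, from the shell; the collection name is a placeholder:)

    db.stats().indexSize               // total index size for the current database, in bytes
    db.mycollection.totalIndexSize()   // per collection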