[00:21:00] <GothAlice> hardwire: It's not new, though. SSL was an enterprise feature.
[00:21:36] <GothAlice> As was/is deeper authentication (LDAP-style) integration, which is a typical enterprise need. (And a hideous PITA to support.)
[00:27:53] <GothAlice> In my experience, also fragile. ;^P
[00:29:56] <GothAlice> "You know that pre-signed host key you were using? Yeah… it's not so good any more." "But, but why?" "Reasons."
[00:41:36] <nicolas_FR> hi there. Maybe not the right place, but I can't find help. I think it's a Mongoose problem with subdocuments: http://pastebin.com/GLJhSHyd
[00:42:45] <joannac> nicolas_FR: have you tried the various mongoose support forums?
[00:46:27] <nicolas_FR> joannac: like stackoverflow ?
[00:47:10] <joannac> SO, github, google group, gitter
[00:48:09] <joannac> (i don't know anything about mongoose, so can't help further than that)
[00:48:30] <nicolas_FR> joannac: thanks, will post on stackoverflow
[07:45:58] <zerOnepal> Hi there, I have a server running with a collection weighing 1TB and still growing; my application was designed that way, but I'm trying to trim down the unnecessary entries
[07:48:31] <Boomtime> hi zerOnepal, sounds like you have a use case problem - is there something specific you want help with?
[07:48:36] <zerOnepal> any recommendation on this? how do I safely delete old records without hampering live traffic? I don't have much time to figure this out and I can't afford to add an additional disk as a temporary fix... My only choice is to trim the old data; I am thinking of keeping only the latest 3 months of data
[07:50:36] <Boomtime> zerOnepal: each deletion will add load, these are write operations, albeit small ones, but a large batch will have an effect - if you want to reduce this then delete in small batches (a few hundred at a time perhaps?) with a writeConcern of majority and have a small sleep between batches (say, 1 second?)
[07:51:04] <Boomtime> notably, this is your use case, these are just general recommendations
[07:51:13] <Boomtime> the details need to be filled in by you
[07:53:20] <zerOnepal> batch deletions, with certain grace time delay... hmmm
[07:54:18] <kurushiyama> zerOnepal: Do you have a date field on the data?
[07:54:23] <zerOnepal> any mongo tools you want to recommend ?
[07:55:50] <Boomtime> a TTL will issue a single delete for everything past that date
[07:55:55] <zerOnepal> shell thing Boomtime, :D.. I wish there existed something like the Percona tools for such alterations...
[07:56:19] <Boomtime> you must not use a TTL to 'catch up' on the fact that you don't have a TTL
[07:56:26] <kurushiyama> Boomtime: Erm... no? There is a expireAfterSeconds param...?
[07:56:44] <Boomtime> and what would you set that to?
[07:57:28] <Boomtime> TTL is extremely useful so long as not many documents are _initially_ to be deleted
[07:57:58] <Boomtime> if all the documents in the collection are more recent than 3 months, then by all means, create a TTL index for an expiry of 3 months
[07:58:23] <kurushiyama> Boomtime: Say the date field is set to the insert date, and you want to have the doc deleted after 3 months, so you'd set it to 3*30*24*60*60? As for the initial deletion: It is a background process, and we are talking of a fairly recent MongoDB ;)
[07:58:41] <Boomtime> if you have a large collection, with documents spread across a year, do not create a TTL with an expiry of 3 months, because that just means the first run needs to delete 3/4 of the collection in the first pass
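For reference, a TTL index along the lines kurushiyama describes; created_at is a placeholder for a BSON date field, and as Boomtime warns, the TTL monitor's first pass will delete the entire existing backlog in one go:

    // 2.4-compatible; on newer shells db.mycoll.createIndex(...) is equivalent
    db.mycoll.ensureIndex(
        { created_at: 1 },
        { expireAfterSeconds: 90 * 24 * 60 * 60 }  // roughly 3 months
    );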
[07:59:42] <kurushiyama> zerOnepal: Which version and (if applicable) which storage engine do you use?
[08:00:25] <zerOnepal> I am stuck with the old mongo release 2.4.x
[08:00:42] <zerOnepal> I am thinking of upgrading it to 3.x in the days to come
[08:00:57] <zerOnepal> Storage engine is the default one on legacy 2.4.x
[08:03:47] <zerOnepal> xhagrg anything you want to add ?
[08:05:58] <zerOnepal> so you guys recommend/use the built-in mongo shell to interface with/alter/trim mongo records?
[08:15:26] <Boomtime> zerOnepal: you have provided very little information about your actual use case - collection size, but not document counts, or how many you expect to be deleted, or anything about the application - we'd use the shell for most simple one-off scripts because it's easy
[08:16:25] <Boomtime> a simple loop that deletes a certain number of matching docs out of a progressive cursor, with a sleep between is about 4 lines of javascript - and it's efficient - what do you want?
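The "progressive cursor" loop Boomtime is describing might look roughly like this on the 2.4 shell zerOnepal is stuck on (collection and date field names are placeholders):

    var cutoff = new Date(Date.now() - 90 * 24 * 3600 * 1000);
    var batch = [];
    db.mycoll.find({ created_at: { $lt: cutoff } }, { _id: 1 }).forEach(function (doc) {
        batch.push(doc._id);
        if (batch.length >= 200) {
            db.mycoll.remove({ _id: { $in: batch } });
            db.getLastError("majority");  // 2.4-style write concern: wait for a majority
            batch = [];
            sleep(1000);                  // pause between batches
        }
    });
    if (batch.length > 0) db.mycoll.remove({ _id: { $in: batch } });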
[08:16:35] <zerOnepal> it's about a 1TB collection (I get that from the mongo-hacker awesome prompt), and I can't afford to calculate the record count :(
[08:18:18] <Boomtime> mongo-hacker is a shell extension - with or without it won't change the 4 lines of javascript - but the impact of those 4 lines will change radically depending on your circumstances
[08:18:53] <Boomtime> what criteria will you use to delete appropriate records?
[08:20:15] <zerOnepal> my only available low hanging fruit is the "created_at" field on that collection and
[08:20:34] <kurushiyama> zerOnepal: Whut? You can not do a db.coll.count()?
[08:20:46] <Boomtime> awesome - is there an index on that field?
[08:21:12] <Boomtime> maybe just post the .stats() of that collection to a pastebin
[08:21:29] <zerOnepal> that's an expensive query kurushiyama... let me check on slave
[08:21:31] <Boomtime> redact the namespace if it's important
[08:28:23] <kurushiyama> zerOnepal: how critical is the disk space?
[08:33:35] <zerOnepal> way too critical, I will be doomed in a few days :(
[08:34:34] <zerOnepal> Hey, isn't there a way to delete in the background? Like when we build indexes?
[09:09:19] <kurushiyama> zerOnepal: Well, we have to find out a few things. I am not too sure whether you say "We can not have any impact" because of the paranoia DBAs usually have, or whether your system is really at its limits and a delete operation would have a noticeable or even critical impact on UX.
[13:44:41] <bros> Why does this query return 1 matching store and not 4: { '$elemMatch': { 'stores': { '_id': { '$in': [ObjectId(...), ObjectId(...), ObjectId(...), ObjectId(...)] } } } }, { 'stores.$': 1 }
[14:44:03] <zylo4747> I posted a question to https://groups.google.com/forum/#!forum/mongodb-user but it isn't showing up. Is there something I need to do for a post to become visible? I can't even find it now, it's just gone
[15:32:29] <zylo4747> I posted the issue on serverfault if any of you want to take a look and see if you can help me out http://serverfault.com/questions/774567/intermittent-crashes-of-a-mongodb-replica-set
[15:32:30] <Justin_T> I'm using mongodb 3.0 :( I can't use lookup
[15:55:38] <zylo4747> Do you know if db.repairDatabase() will tell me if corruption was found and fixed?
[16:27:38] <jokke> i'm measuring write performance and read performance for different schemas on a dockerized mongodb sharded cluster. i've noticed that at some point the performance drops radically and the cpu usage of the replica set members goes to a constant 100%. Any leads on what might be causing this?
[16:27:54] <jokke> this drop happens when testing writes
[16:28:09] <jokke> after a few minutes into the benchmark
[16:29:12] <jokke> i've capped the max memory of each replication set container to 2g
[16:29:31] <jokke> and assigned individual cpu cores
[16:31:47] <bros> Does elemMatch only return the first element and I need to use aggregate to get them all?
[16:32:55] <jokke> elemMatch doesn't return anything. It just matches documents where at least one element of an array field matches the quer(y|ies)
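On bros's question: both the positional $ projection and an $elemMatch projection return at most one array element per document, so getting all four matching stores needs the aggregation framework. A hedged sketch, with a hypothetical collection name and the ObjectIds left as placeholders:

    var wanted = [ObjectId("..."), ObjectId("..."), ObjectId("..."), ObjectId("...")];
    db.accounts.aggregate([
        { $match:   { "stores._id": { $in: wanted } } },  // docs containing any wanted store
        { $unwind:  "$stores" },                          // one document per array element
        { $match:   { "stores._id": { $in: wanted } } },  // drop the non-matching elements
        { $project: { stores: 1 } }
    ]);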
[18:50:21] <hardwire> Just have to treat them as different things. MongoDB CE and MongoDB Enterprise. One reason I'm not using InfluxDB right now is that it recently became crippleware.
[18:50:42] <hardwire> The most important use is now removed from the open source software.
[18:52:54] <cheeser> the philosophy with such enterprise features in mongodb is to only do that to features needed by large enterprises. the more "run of the mill" features will be open and available. (*not an official statement. details subject to change without notice or consultation)
[20:46:01] <jokke> but the shards aren't distributed evenly at all
[20:46:25] <cheeser> the shard key shouldn't be mutable, i believe
[20:46:34] <cheeser> i could be wrong about that one, though.
[20:58:07] <quadHelix> How do I pull back all records using mongoexport? Here is my command: mongoexport --type=csv --host <ip>:<port> --db myorders --collection 57290da5-b67c-4b6b-b9bd-3fa1608e448c -f 'created,job_id,name,' -o myfile.csv
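Without a --query, mongoexport already exports every document in the collection; the likely problem is the trailing comma in the field list, which declares an empty field name (and with CSV output, only listed fields appear at all). A cleaned-up invocation, with the placeholders kept as in the original; pre-3.0 tools take --csv instead of --type=csv:

    mongoexport --host <ip>:<port> --db myorders \
        --collection 57290da5-b67c-4b6b-b9bd-3fa1608e448c \
        --type=csv --fields 'created,job_id,name' -o myfile.csv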
[21:20:04] <kurushiyama> cheeser: _id's are immutable, for starters ;)
[21:20:04] <hardwire> jfhbrook: one big consideration is how you plan to partition data for sharding, if ever.
[21:20:05] <jfhbrook> yeah, we do that in one place but my team never trusted the contractor that wrote that code
[21:20:18] <kurushiyama> jfhbrook: _id's are unique. So you can not have more than one doc per string.
[21:20:30] <jfhbrook> kurushiyama, that's a goal here
[21:20:38] <kurushiyama> jfhbrook: That is one impact you have to carefully consider.
[21:20:48] <kurushiyama> jfhbrook: Furthermore, it makes a bad shard key.
[21:20:55] <hardwire> if you want to determine how information is stored against a partitioned cluster, you'd use your own key regardless.
[21:21:09] <jokke> cheeser: ah yeah sorry, i meant that the timestamp fields of the docs are 2 secs apart
[21:21:31] <jfhbrook> so the times I wouldn't want to do this are if I could get into a scenario where I want to update the doc by changing this key (not going to happen ever)
[21:21:44] <jfhbrook> or if I want to do sharding against that property
[21:27:18] <kurushiyama> jokke: a) you can not change the _id, which you are trying to do, as far as I understood. b) You need to use an {_id:"hashed"} index, since most likely you have a monotonically increasing shard key with the use of a date field.
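A sketch of the {_id: "hashed"} approach kurushiyama refers to, with placeholder database and collection names. Hashing the shard key spreads otherwise monotonically increasing _ids across chunks (hashed indexes need MongoDB 2.4+):

    sh.enableSharding("mydb");
    db.mycoll.ensureIndex({ _id: "hashed" });              // hashed index backs the shard key
    sh.shardCollection("mydb.mycoll", { _id: "hashed" });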
[21:38:13] <kurushiyama> We could use _any_ combination, and it would be monotonically increasing from a lexicographic pov.
[21:39:32] <kurushiyama> So, now say we have b1000c
[21:40:11] <jokke> but i thought that choosing compound indexes allows the sharding algorithm to distribute for example docs with the first string being a into one shard and chunk it via timestamp
[21:40:20] <kurushiyama> It would still be "bigger" than a1000b, and b-docs would monotonically increase in their own right.
[21:41:28] <kurushiyama> jokke: Well, even if you do assign the key ranges manually, you still would be adding to the plus infinity shard only.
[21:41:56] <kurushiyama> jokke: "plus infinity" being a rather abstract thing here, but you get the picture.
[21:45:12] <kurushiyama> Ok, your example is a bit complicated. Let us assume you assign a*-b* to shard1 and c*-d* to shard2. At some point, you have an imbalance due to simple variance. Clear so far?
[21:46:43] <jokke> no :/ sorry. why would i have an imbalance?
[21:47:14] <jokke> because for some reason i get more docs in the a*-b* range for example?
[21:50:05] <kurushiyama> And it goes beyond that. A: you can not assign this way. You can assign from -infinity to a certain value, and from that value to +infinity
[21:50:26] <kurushiyama> As defined by $minKey and $maxKey
[22:19:31] <kurushiyama> Which is one of the reasons for an early split. Furthermore, there can be only one, and exactly one, chunk migration in a cluster at any given point in time.
[22:20:32] <kurushiyama> jokke: Which can become part of the problem when choosing a bad shard key, for obvious reasons.
[22:21:19] <kurushiyama> So, now you want to access the data you saved. How does mongos know where to get the doc?
[22:21:36] <kurushiyama> Assuming you query by shard key.
[22:22:44] <kurushiyama> It gets the key ranges assigned to the individual shards and does a rather simple lexicographic comparison (single or compound does not matter much, as we have seen) to identify the shard the document lives on.
[22:24:06] <kurushiyama> Then mongos contacts that shard, fetches the data pretty much like you'd do in a standalone instance (iirc, _exactly_ the same way) and gets you the data.
[22:24:46] <kurushiyama> That last part "gets you the data" can become pretty tricky, but for now that is sufficient.
[22:30:26] <jokke> the most common query would be "get me all data from panel X in the time range from Y to Z"
[22:31:10] <jokke> which is why i thought using _id would've made sense
[22:31:47] <kurushiyama> jokke: Well, the alternative is a constantly overloaded shard. That does not do much for performance. In the worst case I have seen, there were 8 shards (rather big ones), 7 of them idling around, 1 of them totally overloaded up to a point where the application came to a standstill.
[22:33:38] <kurushiyama> Compared to inserting data, how often do you do this query?
[22:34:33] <kurushiyama> jokke: I know that is a stupid question. But you really have to ask yourself.
[22:35:37] <kurushiyama> jokke: The actual writes? Sure. Taking balancing into account? I doubt that.
[22:36:04] <jokke> mmh i guess you're right about that
[22:36:30] <jokke> from the docs: Generally, choose shard keys that have both high cardinality and will distribute write operations across the entire cluster.
[22:37:03] <kurushiyama> jokke: Well, there are several approaches we could use.
[22:37:53] <kurushiyama> jokke: First, we could use a redundant key.
[22:38:35] <kurushiyama> jokke: I am not too convinced. Let me check
[22:39:35] <kurushiyama> I need a smoke. be back in 5
[22:40:44] <jokke> kurushiyama: i need to get some sleep.. i'll see you tomorrow. Thanks for being so patient and helping me out like that. I really appreciate it, knowing you're usually getting paid for this stuff.
[22:44:01] <kurushiyama> jokke: You might want to dig into "tag based sharding". Just an idea. Haven't elaborated yet
[22:46:37] <kurushiyama> nah, we still are monotonically increasing. Can you pastebin a sample doc quickly? So maybe I can come up with something.
[22:50:25] <jokke> ok, so this will be the schema i'm most likely going with based on the performance and the way the data comes in: https://p.jreinert.com/nzKb/
[22:51:09] <jokke> but this seems to distribute pretty well: https://p.jreinert.com/mPK0/
[22:53:19] <jokke> i won't go with a flat data model :) i know we had this discussion a few times already. The data comes in packets per panel, so it's way faster to store them inside a "panel-document" (since they're most likely being queried together too)
[22:54:37] <jokke> the doc i just pasted is basically the closest representation of how the data comes in
[22:56:22] <kurushiyama> jokke: Well, it would have the advantage of distributed load (no problems with monotonically increasing shard key). If panels do not have overlapping datasources, we could eliminate scatter/gather by tag based sharding.
[22:59:59] <kurushiyama> jokke: even better: if the panel ID distinguishes them, we might use the panel ID as a shard key, assuming a halfway even distribution. Scatter/gather eliminated altogether without tag based sharding.
[23:01:09] <kurushiyama> And if the distribution is uneven, we still could resort to tag based sharding.
[23:01:10] <jokke> hm but wouldn't the chunks become huge?
[23:05:45] <kurushiyama> Let's check. Distributed writes, no scatter/gather, an index we need anyway. At the expense of 8 bytes/record? Great deal in my book.
[23:20:25] <kurushiyama> jokke: So, my solution eliminates scatter/gather and distributes writes evenly while giving good performance, at the expense of about 20 bytes (12 for the ObjectId + 8 for the additional TS) per record.
[23:21:56] <kurushiyama> jokke: Yeah, you'd need a TS for each of the flat docs to correlate for the panel. If a panel itself is time based, even that TS gets eliminated.
[23:22:14] <kurushiyama> Oh, but the panelId would be redundant, ofc.
[23:22:31] <kurushiyama> so 12 bytes oid + panelId.length
[23:23:15] <kurushiyama> jokke: either way, we could correlate, either by time or panelId. As said: your turn ;)
[23:25:16] <kurushiyama> jokke: So { _id: ObjectId(), panelId:yourPanelId, datasource:ds, r:1,i:1.5}, sharded by panelId
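A sketch of that proposal with placeholder names and values: flat per-measurement documents, sharded by panelId, so writes for different panels spread across shards and the common "all data for panel X between Y and Z" query is targeted at a single shard. The ts field is the additional timestamp discussed above:

    sh.enableSharding("mydb");
    db.measurements.ensureIndex({ panelId: 1, ts: 1 });      // supports the range query below
    sh.shardCollection("mydb.measurements", { panelId: 1 });
    db.measurements.insert({ panelId: "panel-42", datasource: "ds-1",
                             ts: new Date(), r: 1, i: 1.5 });
    // the most common query: all data for one panel in a time range
    db.measurements.find({ panelId: "panel-42",
                           ts: { $gte: ISODate("2016-04-01"), $lt: ISODate("2016-05-01") } });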
[23:25:49] <jokke> i'm still not using flat docs though :D
[23:26:00] <kurushiyama> jokke: Have you read the above?
[23:26:30] <kurushiyama> Well, I don't get it. It solves your problems. All of them.
[23:26:53] <jokke> it doesn't solve the huge-index one or the slower-writes one
[23:28:19] <jokke> as i said. the data arrives packaged together. so for a panel with 50 datasources i'd have 50 inserts with the flat data model as opposed to 1 insert with my "flat_panel" model
[23:28:22] <kurushiyama> Well, you can't have your cake and eat it too. You have to make tradeoffs. And as proven, updates are slower. Your call. But if you accept a foul compromise when it comes to sharding, you are doomed.
[23:30:03] <kurushiyama> Well enough. But still, you have the sharding problem. You may be able to use something to get the distribution done, but that would be an index either way. Oh, and you might want to check bulk inserts ;)
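On the bulk-insert hint: even with the flat model, a 50-datasource packet can go to the server as a single insert call. A rough sketch, where packet and its fields are hypothetical:

    var docs = packet.datasources.map(function (ds) {
        return { panelId: packet.panelId, datasource: ds.name,
                 ts: packet.ts, r: ds.r, i: ds.i };
    });
    db.measurements.insert(docs);  // one batched insert instead of 50 round trips
    // on 2.6+ shells there is also the Bulk API, e.g. db.measurements.initializeUnorderedBulkOp()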
[23:37:59] <kurushiyama> jokke: I _think_ it happens when a chunk is supposed to be migrated, but changes after the chunk started to be migrated. Or sth like that. It is a symptom I have regularly seen at... You can guess three times ;)
[23:40:58] <jokke> wow.. i just killed and set up the cluster again and ran _just_ the insert benchmark for the flat_panel model
[23:45:28] <jokke> hm but the read performance is pretty messed up
[23:45:35] <kurushiyama> There is something strange. But let us have a look then. You know what we need. If you can, please insert like a shitload of documents.
[23:46:02] <jokke> on the other hand i'm doing stupid queries... i'm getting the _latest_ data sample