[08:36:05] <chujozaur> Hi guys, I've made a host join an existing replica set and after 3 days it's still in "RECOVERING" state. I can also see that there is barely anything happening on that host (very low resource usage)
[09:45:22] <Zelest> Lovely, if I can help in any way, don't hesitate to point me in the right direction. :)
[13:15:53] <VectorX> i have not worked with mongodb, working on a web spider and would like to store html pages and other page binary objects in the db, is there some sort of compression that is enforced?
[13:16:09] <VectorX> and is mongodb a good solution for such a task
[13:17:05] <VectorX> files should be less than 10mb uncompressed
[13:17:21] <kurushiyama> VectorX Well, a few things you need to understand.
[13:17:39] <kurushiyama> VectorX A MongoDB BSON document can never exceed 16MB in size.
[13:18:48] <kurushiyama> The thing is that the datafiles, as per the default config, are compressed anyway. Yes, you can compress your payload to circumvent the 16MB size limit (to a certain extent), but then you will have your payload compressed twice.
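For context, a rough way to check from the mongo shell whether a collection's data files really are block-compressed; "pages" is just a placeholder collection name, and snappy is the WiredTiger default compressor:

    // which storage engine the server is running
    db.serverStatus().storageEngine.name          // e.g. "wiredTiger"
    // the block compressor a given collection was created with ("pages" is hypothetical)
    db.pages.stats().wiredTiger.creationString    // contains "block_compressor=snappy" by default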
[13:19:44] <kurushiyama> VectorX Second, as for a web spider: It depends on what you want to do. Usually, you do not want to load the data and store it away. What you want to do is to index it.
[13:20:23] <VectorX> kurushiyama something like this https://archive.org/web/
[13:20:32] <kurushiyama> VectorX which raises the question "What are your search needs?"
[13:20:36] <VectorX> so the files are saved more as historical data
[13:21:17] <kurushiyama> VectorX I see several problems here, and I guess you make some rather dangerous assumptions.
[13:22:02] <kurushiyama> VectorX First: For archiving, I would not be too sure that a single URL's content never exceeds 16MB in size, compressed or uncompressed.
[13:22:38] <jayjo_> Is there a way to catch errors more manually when using mongoimport? I have files that are about 2MB, but one of the documents (of about 5000) is giving an error. This happens to about 10% of the files. I've tried catching the file in a log and attempting it again, but it will insert the documents prior to the line it is struggling with. Can I skip the document and write the actual document to a log somehow?
[13:22:44] <kurushiyama> VectorX Say you want to archive a website of which the main purpose is to distribute software.
[13:23:12] <kurushiyama> jayjo_ I guess you should use a proper ETL tool for such tasks.
[13:23:25] <VectorX> kurushiyama well one other thing, for the moment i'm only storing the home page (html) and the favicon.ico
[13:24:06] <kurushiyama> VectorX You should make the decision early, since basically it gives you two different paths to follow.
[13:25:05] <VectorX> kurushiyama i apologize, i have to run to a quick meeting, i'll be back soon, but you can give your thoughts and another solution if you have one. i was going with mysql, but relational doesn't seem to be right for this either
[13:25:43] <kurushiyama> VectorX I guess MongoDB will be fine, but it is the question of whether to use GridFS or standard docs. We can discuss that later.
[13:26:09] <kurushiyama> jayjo_ Personally, I use scriptella (or write a custom ETL tool, depending on my needs).
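A rough sketch of catching the failing documents with mongoimport alone: as far as I recall, without --stopOnError it keeps inserting past documents that error and prints each failure to stderr, so redirecting stderr gives a crude log (db, collection and file names are placeholders):

    mongoimport --db mydb --collection docs --file part-0001.json 2>> import-errors.log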
[13:37:56] <dino82> I take it mongorestore doesn't play well with adding a new member into a replica set
[13:43:17] <dino82> And the logs reported that it dropped all databases for an initial sync
[13:43:32] <kurushiyama> dino82 Because it was missing what? ;)
[13:44:26] <dino82> Doesn't mention anything in the logs. Just created replication oplog, initial sync pending, initial sync drop all databases
[13:45:09] <kurushiyama> dino82 https://docs.mongodb.com/manual/reference/program/mongodump/#cmdoption--oplog and https://docs.mongodb.com/manual/reference/program/mongorestore/#cmdoption--oplogReplay
[13:46:45] <kurushiyama> dino82 I guess you used neither?
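For reference, the two flags together look roughly like this; host names and the output path are placeholders:

    # dump, capturing oplog entries written while the dump runs
    mongodump --host source.example.net --oplog --out /data/dump
    # restore, then replay the captured oplog so the target catches up to the end of the dump
    mongorestore --host target.example.net --oplogReplay /data/dump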
[13:47:23] <dino82> This would be a lot easier if the sync would work out-of-the-box rather than requiring a dump/restore ... sigh
[13:47:42] <kurushiyama> dino82 I do a resync once or twice a week. Never failed me.
[13:48:56] <kurushiyama> dino82 I do not see the point of a dump anyway – I found that the sync mechanism is _much_ faster than dump and restore.
[13:49:09] <dino82> Replicating is filling 400GB disks with 110GB of data. If an rs.add() would just replicate and create a secondary, that would be awesome
[13:50:25] <kurushiyama> dino82 Uhm... It does? simply add the new member, lean back and have a coffee. With any halfway decent network, this should be done in less than an hour.
[13:51:14] <kurushiyama> But definitely faster than a dump/restore cycle.
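For reference, the add-and-wait approach from the primary's shell looks roughly like this; the hostname is a placeholder:

    rs.add("mongo4.example.net:27017")   // new member goes STARTUP2 and runs an initial sync
    rs.status()                          // watch its stateStr until it reaches SECONDARY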
[13:51:35] <dino82> Yep, it does. Primary/Secondary - 110.44GB, rs.add() a new member and it never stops replicating, not sure if it's the same data over and over again but eventually the disks fill and mongod dies
[13:52:11] <dino82> Regardless of engine, tried WiredTiger and same result
[13:52:42] <kurushiyama> dino82 Well, I assume your replication oplog window is too small. Which you cannot circumvent, regardless of what means you use for transferring the data.
[13:54:19] <kurushiyama> dino82 If I am right, here is what happens: While you sync or dump/restore, your data changes so much that the oplog is rotated in full.
[13:55:30] <kurushiyama> dino82 Which causes the problem that MongoDB cannot exactly reproduce the data on the secondary from the data set it has plus the current oplog, and hence it restarts the replication.
[13:55:40] <kurushiyama> dino82 But that is just a theory matching the facts.
[13:58:20] <kurushiyama> dino82 There _are_ certain ways to overcome that problem, though.
[14:01:58] <dino82> I'm not sure if it's relevant, the mongods are all in different US regions, nothing local. This replication is an attempt to get everything on the same local network
[14:03:37] <dino82> So there is 20ms+ latency at minimum
[14:05:56] <dino82> This was my first attempt at a mongodump/mongorestore to expedite the sync process since a full sync from scratch wasn't working
[14:06:14] <dino82> According to the documentation, all I need to do is rs.add() and go do something for several hours
[14:07:26] <kurushiyama> dino82 Well, as written above: if your oplog window is too small (aka your oplog collection is too small to hold all the changes made during the initial sync or dump/restore), you need to increase its size.
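A quick, hedged way to check the oplog size and window from the primary's shell (resizing it, on versions of that era, still meant the manual procedure from the docs):

    rs.printReplicationInfo()         // configured oplog size and the time span it currently covers
    rs.printSlaveReplicationInfo()    // how far each member lags behind the primary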
[14:10:47] <dino82> Hmm, oplog size is 19GB, ~848 hours
[14:16:44] <TheDist> Does anyone know of a good guide on indexing and best practices? I'm new to mongodb and I've just started adding indexes to speed up some queries (and it's working great), but I don't know when too many indexes becomes a problem, or if there's a better way than how I've done it
[14:17:17] <saml> there's limit on number of indexes
[14:17:40] <kurushiyama> saml Yeah, some <insertIncrediblyHighNumberHere>.
[14:17:53] <saml> A single collection can have no more than 64 indexes.
[14:18:25] <saml> people push to elasticsearch from mongo and query there
[14:18:28] <TheDist> hmm, I must be doing something wrong then because I expect to be able to hit that limit at some point
[14:18:54] <kurushiyama> TheDist Other than the docs? I found indexing to be more an art than hard science, and it _heavily_ depends on your use cases (like _everything_ related to modelling).
[14:22:02] <TheDist> My database, among other things, logs twitch (pretty much IRC) for streamers. On the users collection (about 1.2m users) I have an index on how long each user has watched each streamer, and another on how long they've subscribed for
[14:22:45] <TheDist> so right now, 18 indexes (including the index on the username, and '_id' which started off indexed)
[14:25:35] <TheDist> without an index, aggregating data about either of those things took about 3500ms, with the index ~10ms, so a dramatic speed increase. I'm not sure how else I could index it that would also allow for growth without having to have 2 indices for each streamer who uses my database
[14:26:37] <jayjo_> I'm trying to write a script to process data imports a little more cleanly... Can I pass an actual document (string) to mongoimport? Or does it have to be written to disk?
[14:29:11] <TheDist> The values I'm indexing are in subdocuments btw (i.e. {subscription.streamer1: 1} etc...). I've tried indexing just {subscription: 1}, hoping that would work and allow for growth without more indexes, but that index had no impact at all on performance (so I assume it wasn't used?)
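A hedged way to check whether a query is actually using the dotted-path index; collection and field names follow TheDist's description and are otherwise assumptions:

    db.users.createIndex({ "timeViewed.streamer1": 1 })
    db.users.find({ "timeViewed.streamer1": { $gt: 0, $lte: 5 } }).explain("executionStats")
    // winningPlan with IXSCAN means the index was used; COLLSCAN means it was not.
    // An index on the whole subdocument ({ subscription: 1 }) is not used for dotted-field predicates like this one.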
[14:34:56] <kurushiyama> TheDist Sounds like overembedding.
[14:39:56] <saml> what are you aggregating? how often? can aggregation result be stored as separate collection or db?
[14:46:03] <TheDist> I'm aggregating how long users have watched a streamer. For example, one stream has had ~220,000 unique viewers out of the 1.2m in the users collection, and this is aggregated into the number who have watched <5 minutes, 5m to 1hr, etc.. 5 different time ranges
[14:48:33] <TheDist> It's not something frequently checked, but when it is I'd rather it take the few ms that it does with my seemingly bad index practices than 4s without an index. (and that's just for time viewed, doing the same process for subscriptions to a streamer takes just as long)
[14:50:39] <saml> i'd have a periodic job calculating the results and storing them to a collection or db. querying the events collection directly off mongo (or any db) to do date range aggregation is problematic
[14:51:17] <saml> aggregation works at small scale.. but eventually you'll need dedicated event analytics
[14:52:30] <kurushiyama> saml Huh? I have aggregations over rather large data sets and they work pretty well...
[14:53:18] <saml> how many docs are you aggregating? size of the docs?
[14:53:34] <saml> over how many replicasets? how are things sharded?
[14:54:50] <saml> how many aggreations are you running per sec?
[14:56:20] <TheDist> My current system works well, it just requires too many indices for it to grow. I originally hoped I could just do something as simple as indexing the 'timeWatched' object, and the 'subscriptions' object. Rather than 'timeWatched.streamer1', 'timeWatched.streamer2' etc...
[14:56:33] <kurushiyama> saml aggregations /s? o.O Dunno, I guess 1/300 +- ;) I run them every 5 minutes.
[14:56:58] <saml> yeah that's periodic job or heavily cached
[14:57:26] <kurushiyama> TheDist Can you share your data model? It sounds... ...less than optimal.
[15:00:52] <kurushiyama> saml Yeah, quickshot aggregations can be painful. Might well make sense to do preaggregations, if you do not really need real time data (which is overestimated, imho).
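A rough sketch of such a pre-aggregation, assuming TheDist's timeViewed fields: bucket the watch time for one streamer and write the counts to their own collection on a schedule ($out overwrites it each run); all names are illustrative:

    db.users.aggregate([
      { $match: { "timeViewed.streamer1": { $gt: 0 } } },
      { $project: { bucket: { $cond: [ { $lte: [ "$timeViewed.streamer1", 5 ] }, "under5m",
                      { $cond: [ { $lte: [ "$timeViewed.streamer1", 60 ] }, "5m-1h", "over1h" ] } ] } } },
      { $group: { _id: "$bucket", viewers: { $sum: 1 } } },
      { $out: "streamer1_viewer_buckets" }
    ])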
[15:03:44] <TheDist> This is the model I use for my users collection http://pastebin.com/T9tYGew9
[15:05:16] <TheDist> the 'timeViewed' and 'subscriptions' objects will contain: { streamer1: 0, streamer2: 0, streamer3: 0 } etc..., which is just incremented every 60 seconds if the stream is live and they are viewing it
[15:06:06] <TheDist> yeah, my app is all done in node.js, using mongoose for accessing mongo
[15:06:33] <TheDist> streamer1 would be the username of the streamer
[15:06:57] <saml> user1.timeViewed.streamer1 == 0 means user1 viewed streamer1's video 0 times?
[15:08:43] <TheDist> It's for a live stream, so 0 would be viewed 0 minutes. Every minute the app gets an array of all the users watching, and increments their timeViewed.streamer value by 1
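The per-minute increment TheDist describes would look roughly like this in the shell; watcherIds (the array of ids of users currently watching) and the streamer field name are placeholders:

    db.users.update(
      { _id: { $in: watcherIds } },
      { $inc: { "timeViewed.streamer1": 1 } },
      { multi: true }
    )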
[15:08:46] <saml> modeling seems off. i'd put "view" event as separate collection: db.views.insert({streamID: .., userID: .., ts: ..., action: 'started-watching'}) ts is timestamp
[15:10:00] <saml> a user can watch more than one stream at a time?
[15:10:27] <saml> the user has multiple monitor/audio device?
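Expanding saml's suggestion into a hedged sketch: one document per view event, aggregated into per-user watch time when needed; all names are illustrative:

    // one event per user per minute watched (or per start/stop transition)
    db.views.insert({ streamId: "streamer1", userId: userId, ts: new Date() })

    // derive minutes watched per user for that stream
    db.views.aggregate([
      { $match: { streamId: "streamer1" } },
      { $group: { _id: "$userId", minutesWatched: { $sum: 1 } } }
    ])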
[15:10:43] <Mattx> Hey guys. I've just installed MongoDB, and I can use it through the client mongo (command line), but when I try to connect through the official ruby lib I get this error: MONGODB | Invalid argument - connect(2) for 0.0.105.137:27017
[15:11:03] <Mattx> I'm not sure why it's using that IP, I'm trying to connect to 127.0.0.1:27017
[15:11:44] <saml> TheDist, so aggregation you're running is to get the most popular stream?
[15:12:07] <saml> give me 5 streams that have most number of watchers (over N minutes)?
[15:12:23] <TheDist> yes, it's possible. My app currently monitors 8 streamers, some of which are online at the same time and have users watching multiple of them, and on average about 5000 concurrent viewers over all streams
[15:12:59] <Mattx> ah, I see. the problem is the official doc on GitHub is incredibly outdated!
[15:13:30] <TheDist> saml, no, the aggregation is just for each specific streamer, so that the streamer can easily see how many people are staying and watching their stream, if there is a big drop off after a 1hr mark for example
[15:13:39] <Mattx> it says this: MongoClient.new("localhost", 27017).db("mydb")
[15:13:46] <Mattx> and the right way is this: Mongo::Client.new ["127.0.0.1:27017"], database: "mydb"
[15:16:30] <saml> db.watchers.find({streamID: 'streamer1', watchDuration: N}) // 5 different N
[15:17:49] <saml> what if a user starts to watch stream1 after 5 minute or something?
[15:18:09] <saml> this feels like time series db's job
[15:20:37] <TheDist> i've tried a .find({ 'timeWatched.streamer1': { $gt: 0, $lte: 5 }}) to get the less than 5 min watched range, and the same sort of thing for every other range to be displayed, but it takes 4s to get all of the ranges, unless I index timeViewed.streamer1 which makes it take 10ms
[15:24:51] <saml> but you only have 8 streams. so 8 indices?
[15:25:09] <saml> will there be 100 streams and more?
[15:25:47] <saml> whoever is updating User.timeWatched every minute can just calculate all the ranges and save them somewhere
[15:26:25] <TheDist> 16, as each streamer has 1 index on timeViewed, and another on subscriptions.
[15:27:55] <TheDist> I'll have to do that then. I wasn't sure if there was some sort of better way to index that wouldn't need more indices as the number of streamers increases
[15:28:20] <TheDist> if pre-calculating the ranges and saving it at regular intervals is the best option, then that's what I'll have to do.
[15:37:23] <TheDist> Thanks for the advice. I've been learning all this myself, and my app/database is growing faster than I expected so it's a bit of a learning curve, as knowing what works well with 100k users is different from what's best with 1m users
[16:33:09] <jayjo_> I wrote a script to parse json documents as a crude ETL tool... so now I'm passing a json document to mongoimport in a bash script, but it's taking an extremely long time, because each json document is ~100 bytes and there are about 150 million of them. Can I keep the connection persistent so I don't have to reconnect for each insert? Should I just incorporate a dbapi at this point, or what is the recommended way of handling a situation like this?
[16:59:56] <kurushiyama> jayjo_ Well, I'd write a Go program ;) a 3-way handshake each and every time does not seem sensible to me.
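One hedged alternative that avoids a connection per document: concatenate everything into newline-delimited JSON and run mongoimport once over the whole batch (it reads from stdin when --file is omitted); paths and names are placeholders:

    cat part-*.json | mongoimport --db crawl --collection docs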
[18:30:43] <dino82> So running mongodump with --oplog, it finished and says 'writing captured oplog to /data/oplog.bson' but the file is 0 bytes, strace says nothing is going on, iostat says nothing is being written.. wtf
[19:22:42] <kurushiyama> dino82 Strange. Very, very strange.
[19:26:35] <dino82> I'm about to just blame azure for everything
[19:49:52] <kurushiyama> dino82 Well, that seems like an educated guess.
[21:11:29] <saml> in the mongo shell, can I get the bson as a string?
[21:11:49] <saml> var cursor = db.docs.find(); cursor.next();// gives me deserialized object. i want to get serialized bson
[21:20:59] <saml> i'll use driver that gives me bson directly
[21:33:52] <ss23> Hi. I've got a collection with 55 million rows in it, and I want to be able to do "db.collection.distinct(foo)", but that's slow. In this case, there's only a single value for "foo" throughout the entire collection, so I'd expect it to be fast. I've added an index (foo: "hashed"), but it still takes 30 seconds, even though there's a single value
[21:34:07] <ss23> Do I have to use a non-hashed index to get decent speed on this, or am I missing some other aspect?
[21:44:00] <poz2k4444> hey guys, I created a replica cluster but I'm not sure how I can test that it works. I want to know when a replica is down and when a request is sent to the primary or the secondaries, can somebody help me with ideas on how to test this?
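A rough way to exercise failover by hand, run against the current primary in the mongo shell:

    rs.status()        // stateStr per member: PRIMARY / SECONDARY / RECOVERING ...
    rs.stepDown(60)    // ask the primary to step down; an election picks a new primary
    rs.status()        // run again (reconnect if needed) to confirm a secondary took over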
[21:46:16] <saml> ss23, what does db.runCommand({distinct: 'Your collection name', key: 'your distinct field'}) say?
[21:46:39] <saml> how many distinct values are there?
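A hedged sketch for ss23's case: a plain ascending index is the kind distinct can walk, whereas the hashed one will generally not help it here; "foo" is the field from the question:

    db.collection.createIndex({ foo: 1 })                   // non-hashed
    db.runCommand({ distinct: "collection", key: "foo" })   // with the plain index this should no longer scan every document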
[21:52:48] <ss23> kurushiyama: Also known as "rely on package maintainers to do what I don't get paid enough to do" :P
[21:53:11] <kurushiyama> ss23 2.4 is close to EOL, you should think about an upgrade anyway. Use a supported OS, and you will have pretty recent packages ;)
[22:23:23] <cheeser> ss23: if you're in a replica set it's pretty straightforward.
[22:34:36] <toutou> @cheeser can you help me with a bson import? I'm trying to restore the data to my db but nothing I do seems to work
[23:07:50] <toutou> can any1 help with a bson import
[23:13:31] <poz2k4444> toutou: what do you need? can you be more specific?
[23:14:45] <toutou> @poz2k4444 I'm trying to import my settings.bson file to my site but I can't seem to understand how to do it. I'm doing everything right
[23:14:52] <toutou> my website is www.probemas.site
[23:16:12] <poz2k4444> toutou: first, you need to do a dump of the collection or the database, then you do a simple restore of it. you may be doing one of those steps wrong, but I can't help you if you don't give me more details
[23:16:42] <toutou> @poz2k4444 I can host a join.me and you can watch me do it?
[23:17:11] <poz2k4444> toutou: I prefer to help you from here, I'm at work right now
[23:25:15] <poz2k4444> toutou: what you are asking is more confusing, man, because you don't show any of the commands you used, or how you moved the backup or anything. it's kind of difficult to help you that way; the more you explain what you did, the better one can help you
[23:48:33] <poz2k4444> I can't even, haha. look, I gotta go, but what you need to do is a simple restore from where the dump directory is. so maybe you can try to delete the database in mongo, then go to the dump directory and do the mongorestore -d Probmass, or a simple mongorestore should do the trick
[23:52:21] <poz2k4444> toutou: start from scratch man, just make sure both mongods are the same version, the one you took the dump from and the one you are restoring to, I really gotta go
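What poz2k4444 is describing, as concrete commands; the database name comes from the conversation and the paths are placeholders:

    # on the source: dump the database
    mongodump --db Probmass --out /data/dump
    # on the target: restore it from the dump directory
    mongorestore --db Probmass /data/dump/Probmass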
[23:52:46] <toutou> I did but it's ok, thanks for the help tho
[23:52:58] <toutou> i should have paid a guy to make my site with mysql lol