[01:42:07] <alexi5> i was able to reduce a schema of 12 tables down to 3 collections
[01:42:40] <alexi5> and was surprised how the document structure actually made sense.
[01:46:07] <Boomtime> yeah, given a rich data format most relational links are simply unnecessary .. some will always remain, but at least you get to see where effort is best spent
[01:47:16] <alexi5> surprisingly, documents are very small. so fetching a document and its aggregates and manipulating it in memory is pretty sweet
[01:50:06] <alexi5> largest document size is around 1.8kb
[01:53:47] <Sygin> now that i understand it, i am liking the idea they have going on there
[01:54:17] <Sygin> that you can have failsafe replica sets etc
[01:54:27] <Sygin> and the sharding that evenly distributes stuff
[01:55:01] <GothAlice> Working with a number of other systems, none quite compare to MongoDB in terms of setting up "high availability".
[01:56:23] <Sygin> i'm curious to have like a 500 gb hdd full of data in mongo. and then add a mongo instance as a shard to it. and watch it move over evenly xD
[01:56:45] <Sygin> if i add a replica set, will it replicate automatically?
[01:57:33] <GothAlice> https://docs.mongodb.com/manual/tutorial/convert-standalone-to-replica-set/ < there's a tutorial for that. ;)
[02:00:36] <Sygin> ok sure. but what if that standalone server had some data in it. would that data be removed in favor of the replicated data of the mongos instance?
[02:00:41] <Boomtime> @Sygin: not sure if you are aware 'shards' and 'replicas' are orthogonal concepts -- a single shard is usually a replica-set, you have multiple sets of these to form a sharded cluster
[02:01:09] <GothAlice> Sygin: If you convert using that tutorial, your data is preserved.
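A rough sketch of that conversion flow, with a placeholder dbpath and hostname (the existing data files are reused as-is):

    mongod --dbpath /srv/mongodb/db0 --replSet rs0    # restart the existing standalone with a replica set name
    # then, from a mongo shell connected to that node:
    rs.initiate()                            // turns it into a one-member replica set
    rs.add("mongodb1.example.net:27017")     // add an empty secondary; it performs an initial sync automatically
    rs.status()                              // the existing data is preserved and now replicated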
[02:09:53] <alexi5> so basically if you use one replica, i will need an arbiter to be the tie-breaker in voting
[02:12:09] <GothAlice> Arbiters make the number of voting members an odd number, breaking ties, though, yes.
[02:13:45] <Boomtime> @alexi5: when you say "use one replica", do you mean to add a second data bearer? i.e having a total of 2 data bearers in the replica-set?
[02:14:09] <GothAlice> Boomtime: "Bearers"? Members, you mean?
[02:14:16] <Boomtime> i would suggest not thinking of these as 'original' and 'replica' or anything that suggests they are not equals
[02:14:48] <Boomtime> the states of primary/secondary are transient, the idea is that the secondary is ready to step up and become primary, should the current primary fail for any reason
[02:15:29] <Boomtime> i say data bearer to distinguish from 'arbiter' which is a member, but does not have data
[02:16:04] <alexi5> if i understand correctly, if you have two members and one goes down, the remaining member's state is secondary and it is read-only
[02:16:14] <GothAlice> Boomtime: I would recommend sticking to the official terminology; replica is the official term. Replica implies replication, replication is a copy, "equals" as you put it.
[02:16:15] <alexi5> in such cases an arbiter will be handy
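A minimal sketch of adding an arbiter for that two-member case (hostname is a placeholder):

    rs.addArb("arbiter.example.net:27017")   // joins as a voting member that stores no data
    rs.status()                              // the member list should now show one ARBITER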
[02:18:33] <alexi5> we will stick with the official terminology so all can understand
[02:19:16] <GothAlice> Boomtime: https://docs.mongodb.com/manual/core/replica-set-members/ < replica set "members" is the general term for each part of a replica set, primary and secondary members store data, arbiter members don't store data. No need to add more terms. ;)
[02:19:16] <cheeser> which terminology is unofficial?
[02:19:22] <alexi5> I will do some tests with a copy of my database to test creating a replica set
[02:20:24] <cheeser> i don't know about "original" but we use "data bearing node" all the time.
[02:20:47] <alexi5> sounds like we need a glossary in the manual :)
[02:21:19] <Boomtime> there is one, but it has no term for "data bearer", nor "replica" incidentally
[02:21:28] <GothAlice> cheeser: Not in the official documentation; not in the replica set members section or its subsections, nor in the glossary. It is used a few times on the very top-level replication page.
[02:23:51] <Boomtime> then i'd have to clarify that too right?
[02:24:11] <Boomtime> but "data bearer" is clear, but of course, only if we all agree on the term -- so a ticket would be good
[02:24:30] <GothAlice> Similar situation; in the sharding case, all mongod nodes store data, mongos is the single special case.
[02:24:41] <GothAlice> Which data they store is part of why we have distinct terms for them.
[02:26:09] <GothAlice> Over-Hierarchical structuring doesn't improve anything, i.e. adding more layers and terms for little to no gain in clarity. And in the documentation case, more little + icons to hit to find things! ;^P
[02:27:37] <GothAlice> Please, think of the little + icons.
[02:29:27] <alexi5> one question. if i want to store an image (2MB) and its thumbnail (max 180kb), which is the best way to store both? 1) store the image in gridfs and store the thumbnail in its metadata, or 2) store the thumbnail and image in separate documents in a normal collection
[02:29:44] <alexi5> any other suggestions are welcome
[02:31:00] <GothAlice> alexi5: GridFS is optimized for storage of larger files. BSON, standard documents, could work quite well for storage of an image with its thumbnail, using projection to select which you want, and using the Binary storage type.
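A sketch of that plain-document approach (collection and field names are made up; thumbBase64 and imageBase64 stand in for the actual binary payloads):

    db.images.insertOne({
        name:  "photo.jpg",
        thumb: new BinData(0, thumbBase64),   // ~180 KB thumbnail
        image: new BinData(0, imageBase64)    // ~2 MB original, well under the 16 MB document limit
    })
    db.images.findOne({ name: "photo.jpg" }, { thumb: 1 })   // projection: fetch only the thumbnail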
[02:31:44] <GothAlice> Most drivers abstract away the GridFS protocol for you, making use very much like a virtual filesystem. Even Nginx can serve things directly from GridFS. :)
[02:32:30] <cheeser> if you use java, that exposes GridFS as a FileSystem
[02:32:48] <cheeser> well, it does that regardless, but if you use java, that might be useful :)
[02:33:10] <GothAlice> https://github.com/marrow/mongo/blob/develop/web/db/mongo/grid.py#L9 < my framework lets you return GridOut objects from your controllers directly as another example
[02:33:17] <GothAlice> Many examples available. :)
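As one more example, the mongofiles tool bundled with MongoDB gives command-line, filesystem-like access to GridFS (db and file names below are placeholders):

    mongofiles --db mediadb put photo.jpg    # store a local file into GridFS
    mongofiles --db mediadb list             # list stored files
    mongofiles --db mediadb get photo.jpg    # write a stored file back out to disk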
[02:34:20] <alexi5> i am using c#. I will do some reading on this from the c# driver docs
[02:38:41] <GothAlice> cheeser: Actually, I'd appreciate your eyeballs to see if I've forgotten anything obvious on that view handler. Pretty sure I'm forwarding on all of the metadata possible. ¬_¬
[02:39:47] <cheeser> ooh. i should add that to my guy as well: something to decorate an HTTP response with right headers...
[02:40:46] <alexi5> so in my case, where it is highly unlikely to store an image file larger than 10MB, much less 16MB, a normal document would be best for me?
[02:41:59] <GothAlice> alexi5: If you only need to be able retrieve the whole image at once, the document approach can work fine. GridFS gets you other features, like file-like access with seeking, etc. You can read about it here: https://docs.mongodb.com/manual/core/gridfs/
[02:43:05] <GothAlice> (That document has a "when to use GridFS" section.)
[02:44:29] <GothAlice> ^_^ Only you can really know which solution will be right for your needs.
[02:46:52] <alexi5> ok. i'll create an app doing both and see how they compare
[02:53:07] <Absolome> Is there a good free admin UI for mongo that’s widely used?
[02:53:36] <Absolome> Just realized that I’ve been working on this project for too long to still be doing everything in the shell
[02:55:51] <GothAlice> I use a REPL shell with full syntax highlighting and tab completion… find it quite excellent, in fact. (Pymongo in a ptipython shell is very similar to the mongo shell.)
[02:56:36] <GothAlice> There are a number of REST interfaces available, and some interesting desktop software like Paw to utilize them: https://docs.mongodb.com/ecosystem/tools/http-interfaces/
[02:57:55] <GothAlice> Many of the "GUI" tools don't keep pace with MongoDB updates, some of the more popular ones like MongoHub almost seem abandoned. :( https://docs.mongodb.com/ecosystem/tools/administration-interfaces/
[02:58:14] <GothAlice> This last link seems to be pretty up-to-date, though. Lots of additions since the last time I looked.
[03:43:04] <Sygin> is there a way to force a balancer to run?
[03:44:01] <Sygin> because i added the first shard to the system. then pushed 500 docs to it. then i added the second shard. but the second shard did not get 250 docs immediately. it seems to me the balancer only balances things when docs are added into the system
[03:45:22] <GothAlice> You didn't mention if you created a sharding key for that collection.
[03:45:32] <GothAlice> If you didn't, it has no way to know you want the data distributed.
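A minimal sketch of declaring that intent, reusing the db/collection names Sygin mentions later in this log:

    // from a mongo shell connected to the mongos, with `use test_db` selected
    sh.enableSharding("test_db")                                      // sharding must be enabled per database
    db.test_collection.createIndex({ _id: "hashed" })                 // hashed index to back the key
    sh.shardCollection("test_db.test_collection", { _id: "hashed" })  // declare the shard key
    sh.status()                                                       // shows chunks per shard as they balance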
[03:59:11] <Absolome> GothAlice: thanks! I’ll have a look. It would be super helpful to be able to browse through the data I’m working with, and the collection names make it awkward to do with a shell
[03:59:41] <GothAlice> Absolome: Pro tip: you can assign collections to local variables to make them easier to work with. :)
[04:00:00] <GothAlice> var c = db.my_insanely_long_collection_name_of_doom;
[04:00:45] <Absolome> OH nice, do those get collected once you close the shell, or is there a way to set them permanently?
[04:01:29] <GothAlice> You can run scripts by passing the filename to the 'mongo' shell on the command line; to keep the shell running after the script finishes, include the --shell option.
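For example (db and script name are placeholders):

    mongo --shell localhost/mydb setup.js    # run setup.js, then stay in an interactive shell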
[04:01:51] <Absolome> that’s an immense help, I knew you could set variables but forgot that everything is javascript so it’d be fairly simple to just assign collections like that
[04:02:11] <GothAlice> If you're always working with the same data, you can write a .mongorc.js file (the dot at the front means hidden) and store that in your home folder, it'll be automatically run.
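A hypothetical ~/.mongorc.js carrying that kind of shortcut (db and collection names are made up):

    // ~/.mongorc.js: evaluated automatically each time the mongo shell starts
    var mydb = db.getSiblingDB("mydb");
    var c = mydb.my_insanely_long_collection_name_of_doom;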
[04:03:01] <Absolome> oh gosh yeah I think I actually already have one on my server I just never thought to put a js script in there
[04:03:18] <Absolome> I’m very new to how easy most of this is
[04:04:23] <Absolome> this is my first project after a long 32bit PPC Assembly project I was working on, so I keep bracing myself for things to be EXTRA obtuse and then they aren’t
[04:04:57] <GothAlice> I'm a fan of ARM, but high-level can be a breath of fresh air.
[04:05:27] <Absolome> node.js and assembly are a funny mix of languages to hop between
[04:05:52] <GothAlice> (PPC vector extensions and such seemed like… such a hack. Not as bad as SSE/2/3, but… ;)
[04:19:46] <Sygin> GothAlice, hey so i have a shard set like sh.shardCollection("test_db.test_collection", { "_id": "hashed" } ). but the doc ratio is still like 750-250
[04:20:57] <GothAlice> To shards, documents are not the unit of measure. Chunks are. Ref: https://docs.mongodb.com/manual/sharding/#chunks
[04:21:56] <GothAlice> If you have > 64MB of data, but < 128MB, you'll get two chunks. That's the most "fine-grained" it can come up with, given a default chunk size of 64MB.
[04:24:12] <GothAlice> (Follow the "Data Partitioning with Chunks" link for full details.)
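Two quick ways to see how those chunks actually landed, from a mongo shell connected to the mongos:

    db.test_collection.getShardDistribution()   // data size, document count, and chunk count per shard
    sh.status()                                  // chunk ranges and which shard owns each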
[04:24:31] <Sygin> GothAlice, so how does that solve the problem of having too many documents on one shard?
[04:24:42] <GothAlice> You have 500GB of data or so, right?
[04:25:13] <Sygin> yeah cause each chunk is 64MB ?
[04:26:04] <GothAlice> Yes. With a hashed sharding key, half those chunks will migrate to the second shard. That it isn't a perfectly even distribution document-wise doesn't matter; storage-wise the two will be at most 63.999MB apart.
[04:26:42] <Sygin> oh so space wise it will always be equal
[04:26:54] <Sygin> even tho doc wise it looks like one of them is higher than the other
[04:26:56] <GothAlice> Within one byte less than the chunk size. ;)
[04:31:20] <Sygin> now i have 300 MB on one side and 302 MB on the other
[04:31:24] <Sygin> now how can i trigger a refresh?
[04:32:00] <GothAlice> If you create a disk image, mount it, add a file then delete a file from it, does the disk image change in size? No, it does not. You're trying to micro-manage it again. ;P
[04:34:59] <Sygin> GothAlice, oh i see. so does this mean, that if i add more docs into this system right now it will auto prefer the smaller one until that smaller one reaches an even size?
[04:36:27] <GothAlice> No. It tries to distribute records according to your sharding key.
[04:38:36] <GothAlice> That covers the things to keep in mind when choosing a shard key, based on how the shards divvy up based on ranges in that key.
[04:38:59] <Sygin> and it's showing sh.shardCollection()
[04:39:13] <Sygin> which, as i already said, i have already done: sh.shardCollection("testdb.users", {"_id":"hashed"})
[04:39:44] <GothAlice> Then please define "implement" in your original question.
[04:44:51] <GothAlice> "how i can implement the shard key?" < this bit — I do not understand. You've implemented one. :( The documentation does cover how chunks are created (by splitting up ones that become too large as data is added to them) — the docs don't just cover the "what", but also the "how" and usually the "why".
[04:49:16] <Sygin> GothAlice, what i am not understanding is this. so i made 2 mongod, and i joined the two as shards. and then i created an index on the collection on the hashed _id, and then i used that to shard the collection. so now i added like 100000 docs into the system. all of them fell onto one shard
[04:49:22] <Sygin> there was no redistributing going on
[05:00:09] <GothAlice> Sygin: The typos… are not making things easy on you. testdb.test_collectio2n vs. testdb.test_collection, and your example of "sh.shardCollection("testdb.users", {"_id":"hashed"})" isn't using the right collection name.
[05:02:05] <GothAlice> Note there's only one chunk for that collection.
[05:05:31] <GothAlice> That looks like an empty collection, given the min and max key are both 1.
[05:06:36] <Sygin> i pumped it full of data, like 100000 docs, or i don't even know anymore
[05:06:41] <Sygin> i put that command down several times
[05:06:52] <GothAlice> Double check that you actually did. db.test_collection.count()
[05:07:09] <GothAlice> Given the previous typos, there's a non-zero chance that you inserted those records into a different collection.
[05:07:38] <GothAlice> ("show collections" to get a listing of all collections in the db, from the mongo shell, to see if there are any unexpected ones ;)
[05:11:25] <GothAlice> Using my cluster testing script, then following the "enable sharding on DB", "insert a large number of records into collection in that DB", "add hashed index", "mark hashed index as sharding key" steps, I get multiple chunks which immediately start balancing. :/
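Roughly, those steps in shell form (db/collection names borrowed from Sygin's earlier example; the padding field is only there to give the documents some size):

    // from a mongo shell connected to the mongos, with `use testdb` selected
    sh.enableSharding("testdb")                               // 1. enable sharding on the DB
    for (var i = 0; i < 100000; i++) {                        // 2. insert a large number of records
        db.users.insert({ n: i, padding: new Array(512).join("x") })
    }
    db.users.createIndex({ _id: "hashed" })                   // 3. add the hashed index
    sh.shardCollection("testdb.users", { _id: "hashed" })     // 4. mark it as the sharding key
    sh.status()                                               //    chunks should begin migrating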
[05:17:09] <Sygin> im gonna redo the whole thing and delete the old shit
[05:22:03] <Sygin> ok so this time i added the two shards beforehand
[05:26:10] <GothAlice> Hmm. Try having only one shard, populating your data, enabling sharding on the collection, _then_ adding the second shard. Does the behaviour change?
[05:27:43] <GothAlice> A deviation I'm noticing from the tutorial is that you're populating the collection in a shard set already containing two members, and marking the collection for sharding after. The tutorial has the data being populated _first_, then the second shard added.
[05:28:23] <Sygin> but why should it matter when the shard is added?
[05:28:40] <Sygin> the end result should be the same no?
[05:29:18] <GothAlice> When you create a new collection and populate it with data (that isn't marked as sharded _first_) in a full sharded set, the collection is bound to the primary shard.
[05:29:18] <Sygin> because if i add shard 2 at the end, it should populate evenly. and if i add shard 2 at the start and fill up with data as i go, it should populate evenly as well. isn't that the end goal?
[05:34:22] <GothAlice> Seems to make a world of difference.
[05:35:06] <Sygin> this system makes even less sense now
[05:35:28] <GothAlice> A problem with following tutorials and deviating from them. ;)
[05:37:09] <Sygin> my end goal was to have hdd1 and hdd2. and the moment i add hdd2's folder as a shard, it should populate evenly. and it shouldn't matter if hdd2's folder was added before all the users signed up for our system or not. in the end it should have been distributed evenly regardless of when it was added :p
[05:37:16] <Sygin> but this is not how the system works
[05:37:59] <GothAlice> Your process will need to be: configure shard 1 containing your existing data. Mark that data as sharded. Add shard 2. It'll balance.
[05:38:40] <GothAlice> The problem is that you're currently testing by: configure shard 1 and shard 2, adding data (which gets pinned to one of the shards), marking it for balancing. Because it's pinned, you'd need the additional step of forcing it to split the chunk.
[05:39:12] <GothAlice> But don't do that to test, because that's not what you'll end up needing to do with your real data.
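In shell terms, that recommended order looks roughly like this (shard and host names are placeholders):

    // connected to the mongos, with `use testdb` selected; shard 1 already holds the existing data
    sh.enableSharding("testdb")
    db.users.createIndex({ _id: "hashed" })
    sh.shardCollection("testdb.users", { _id: "hashed" })   // existing data is split into chunks
    sh.addShard("rs1/shard2.example.net:27018")             // only now add the second shard
    sh.status()                                             // the balancer migrates chunks across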
[05:39:31] <Sygin> well i'd like to test both cases but
[05:39:43] <Sygin> ok so for the first case: '<@GothAlice> Your process will need to be: configure shard 1 containing your existing data. Mark that data as sharded. Add shard 2. It'll balance.' when i add shard 2, is it going to split at sh.addShard() command?
[05:39:59] <Sygin> and for the second case, what is the command to force a split?
[05:40:28] <GothAlice> T_T I've already linked to the documentation for that. https://docs.mongodb.com/manual/tutorial/split-chunks-in-sharded-cluster/
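For reference, the manual split helpers from that page look like this (shown with a range-style key as in the docs; with a hashed key the split points are hashed values rather than document values):

    sh.splitFind("records.people", { zipcode: "63109" })   // split the chunk containing this document at its median
    sh.splitAt("records.people", { zipcode: "63109" })     // split exactly at this shard-key value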
[05:44:46] <GothAlice> ^ this may be important in your case.
[05:45:33] <GothAlice> Hm, maybe not if you only have half a TB so far.
[05:48:11] <GothAlice> https://docs.mongodb.com/manual/core/sharded-cluster-shards/#primary-shard is the issue you're hitting—the collection created and populated with data after both shards are added but before marking the collection as sharded is represented by "Collection2" in that diagram.
[05:49:09] <GothAlice> (It's getting pinned to the "primary shard" and balancing is skipped on it.)
[05:53:10] <Sygin> how can i remove this primary shard
[05:53:15] <Sygin> since its creating lots of problems
[12:11:22] <jamieshepherd> It just seems to happen way too commonly to be stable
[12:11:38] <jamieshepherd> I'll have some stuff working on it, then it'll just die.. Sometimes I'll clear a connection, it'll just die
[12:11:54] <jamieshepherd> the server has 1gb ram, should be plenty
[12:12:07] <jamieshepherd> we do handle a lot of records (80k+) but I thought mongo would be capable of dealing with these with ease
[12:12:58] <alexi5> it seems mongodb has problems allocating memory
[12:13:00] <alexi5> Invariant failure: ret resulted in status UnknownError: 12: Cannot allocate memory at src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp 576
[12:14:32] <alexi5> are other memory intensive apps running on the server with mongodb ?
[12:21:16] <jamieshepherd> Nah it's just the mongo and mysql
[12:23:24] <jamieshepherd> Maybe 1gb just isn't enough though
[12:52:12] <alexi5> 1GB of ram for mysql and mongodb. most likely you need more memory
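One quick way to see what mongod itself is using, from the mongo shell:

    db.serverStatus().mem                 // resident and virtual memory for this mongod, in MB
    db.serverStatus().wiredTiger.cache    // WiredTiger cache statistics, in bytes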
[18:18:15] <Sygin> cheeser, last night GothAlice and i were discussing this and she said if you add the shard first then populate the data, the distribution of data across shards would be different than if you distributed the data first then added the shard....
[18:18:21] <Sygin> well at least that was my understanding of it
[18:18:46] <GothAlice> cheeser: There was some confusion relating to how pre-existing data is spread amongst shards. I.e. http://s.webcore.io/3s0u2B0T012d
[18:19:17] <Sygin> GothAlice, are you a dev of mongo ?
[18:19:22] <GothAlice> cheeser: pre_test there was populated with data, second shard added, then sharding configured on the collection.
[18:19:36] <GothAlice> post_test was populated, configured for sharding, then a second shard added.
[18:19:42] <GothAlice> Sygin: I am not, just a rabid user. ;)
[23:30:19] <cheeser> it's just that if you didn't want to recompile for every platform... :D
[23:32:02] <Test12322> cheeser: you're right :-) i just found it hard to compile the c driver first, on top of the c++. a plain deb file would do the job.. one for amd, one for intel would cover 99% i guess
[23:32:33] <cheeser> i'm a little surprised we *don't* have packages for the c++ driver. but i don't know how that stuff is dist'd