PMXBOT Log file Viewer

#mongodb logs for Friday the 19th of August, 2016

[03:04:06] <ironpig> is uname one of those strings that can confuse mongodb if used in a query?
[03:51:30] <GothAlice> It's an interesting challenge, abstracting localizable fields such that the FTI can handle them.
[03:51:45] <GothAlice> Also frustrating. XD
[03:52:16] <GothAlice> Looks like I'm going to need a post-processing pass on model definition. Le sigh, was hoping to avoid that.
[03:57:31] <GothAlice> (Basically, trying to have document.title map transparently to the appropriately translated sub-field of the document-specific list element… complicated by virtue of also wanting sub-document fields translated…)
[09:27:08] <sifat> Hi all. I want to find a string like "DC - (RD)" with regex, but my query doesn't seem to escape the brackets. What's the way to escape them?
[09:28:45] <GothAlice> sifat: In Python, import re; re.escape(string)
[09:28:58] <GothAlice> Most languages will feature, somewhere, a builtin function or method to perform such escaping.
[09:29:14] <sifat> I'm building a Meteor app
[09:29:32] <sifat> But my query is a mongodb query
[09:29:41] <GothAlice> Right, JS is one that… probably won't include such a thing out of the box.
[09:29:58] <sifat> I see - should I look for an NPM module?
[09:29:59] <GothAlice> https://github.com/nylen/node-regexp-escape
[09:30:02] <GothAlice> There's one. :)
[10:18:38] <sifat> GothAlice: thanks!
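A minimal sketch of the escaping being discussed, in plain JavaScript; the node-regexp-escape package linked above does the same job. The collection and field names are placeholders:

```javascript
// Backslash-escape every character with special meaning in a RegExp,
// so a literal string like "DC - (RD)" can be used safely in a query.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const pattern = new RegExp(escapeRegExp('DC - (RD)'));
// e.g. with the Node.js driver (placeholder collection/field names):
// db.collection('items').find({ name: { $regex: pattern } });
```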
[14:06:09] <durre> I have a collection with 1 million documents. I wish to read 1000 at a time and then bulkWrite those into another collection using node. not sure how to use the cursor to achieve this
[14:09:13] <StephenLynx> I wouldn't bulkwrite it at once.
[14:09:24] <StephenLynx> at the same time
[14:09:35] <StephenLynx> if you just wish to clone the collection, I think there's a command for that.
[14:11:56] <durre> StephenLynx: but it's between two different databases. does such a function still exist?
[14:12:25] <durre> this will be a one time operation. not functionality we will keep
[14:48:22] <StephenLynx> hm
[14:48:27] <StephenLynx> I also think you can clone a db.
[14:50:18] <Derick> StephenLynx: only to another server I believe
[14:50:35] <Derick> durre: I think you need to write a script
[14:50:59] <Derick> https://docs.mongodb.com/manual/reference/method/db.collection.copyTo/
[14:51:30] <Derick> it says it's deprecated though...
[14:52:06] <Derick> StephenLynx: cloneCollection is only between MongoDB instances: https://docs.mongodb.com/manual/reference/method/db.cloneCollection/
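A minimal sketch of the batching durre describes, assuming a modern Node.js MongoDB driver; the database names, collection name, and batch size are placeholders rather than anything from durre's setup:

```javascript
const { MongoClient } = require('mongodb');

// Read the source collection 1000 documents at a time via a cursor and
// bulkWrite each batch into a collection in another database.
async function copyCollection() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const source = client.db('sourceDb').collection('items');
  const target = client.db('targetDb').collection('items');

  const cursor = source.find({});
  let batch = [];

  while (await cursor.hasNext()) {
    const doc = await cursor.next();
    batch.push({ insertOne: { document: doc } });
    if (batch.length === 1000) {
      await target.bulkWrite(batch, { ordered: false });
      batch = [];
    }
  }
  if (batch.length) await target.bulkWrite(batch, { ordered: false });

  await client.close();
}
```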
[14:52:09] <Zelest> Derick, what OS did you run again? :)
[14:52:14] <Zelest> (btw)
[14:52:58] <Derick> Debian unstable
[14:53:35] <Zelest> ah
[14:53:41] <StephenLynx> kek
[14:53:57] <Zelest> kek?
[15:13:54] <StephenLynx> kek
[15:14:13] <nedbat> My installation has 3 GB of files in mongo/mongodb/journal. Can I make mongo use less disk space for journals? This is for a developer, not production, system.
[15:15:08] <nedbat> i read a number of pages, and didn't see a clear way to just use less space.
[15:15:26] <DammitJim> the role userAdminAnyDatabase
[15:15:35] <DammitJim> should be used as the only role for a user, correct?
[17:38:42] <Karbowiak> hi guys, i'm running head first into a problem i cannot seem to google my way out of - i'm currently doing an aggregation query with mongodb, to gather up the various counts of instances where an ID has been seen - should be fairly trivial, except it lists multiple IDs under the various counts (i guess if a count for two entities is the same, it groups them) this is the result (you can see
[17:38:43] <Karbowiak> under corporationID it has between 1 and 3 IDs) view-source:https://neweden.xyz/api/stats/top10corporations/ - and this is the code (with the aggregation) https://github.com/new-eden/Thessia/blob/master/src/Controller/API/StatsAPIController.php#L88-L115
[17:38:53] <Karbowiak> i simply cannot figure out what i am doing wrong
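For reference, the general shape of a count-per-ID aggregation like the one described above, in mongo shell syntax; the collection name is a placeholder and this is not Karbowiak's actual pipeline. Grouping on the ID itself keeps each ID in its own bucket, so two IDs with equal counts are never merged:

```javascript
// Count how often each corporationID appears and keep the ten largest counts.
db.kills.aggregate([
  { $group: { _id: "$corporationID", count: { $sum: 1 } } },
  { $sort: { count: -1 } },
  { $limit: 10 }
]);
```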
[17:55:27] <eindoofus> i'm learning how to use Morphia and a bit confused by the DBRef keys i'm seeing in an array within the doc. is it typically a good idea to leave those when inserting? if not, how should such a record be added?
[17:56:07] <eindoofus> this is what seems to be creating DBRef keys "elmer.getDirectReports().add(daffy);"
[18:27:41] <cjhowe> is it okay if monitor connections call isMaster to authenticate, then call isMaster again to start the topology check? should I wait minHeartbeatFrequencyMS or something?
[18:39:01] <GothAlice> cjhowe: Many drivers offer events or callbacks to notify your own code of topology changes.
[18:39:13] <cjhowe> i'm writing the driver
[18:39:25] <GothAlice> Ah. Then you might want to model it after something like: http://api.mongodb.com/python/current/api/pymongo/monitoring.html
[18:40:06] <cjhowe> i mean, i've got those events emitting just fine, the problem is that the current code uses isMaster to authenticate
[18:40:18] <cjhowe> and i don't want to remove auth for monitors if i don't have to
[18:41:06] <GothAlice> https://github.com/mongodb/mongo-python-driver/blob/master/pymongo/monitor.py < this is the underlying monitor for the monitoring.py abstractions. https://github.com/mongodb/mongo-python-driver/blob/master/pymongo/monitor.py#L147-L166 < it uses IsMaster.
[18:41:29] <cjhowe> yeah
[18:42:26] <GothAlice> Technically "used" in a rather raw way in the _check_with_socket method, below the highlighted section.
[18:43:07] <GothAlice> However, authentication is a post-connection feature. AFAIK it isn't required for topology negotiation/monitoring.
[18:44:50] <GothAlice> https://github.com/mongodb/mongo-python-driver/blob/master/pymongo/topology.py mentions nothing of authentication or credentials, nor is it involved in the topology configuration: https://github.com/mongodb/mongo-python-driver/blob/master/pymongo/mongo_client.py#L387
[18:47:54] <cjhowe> yeah, it's using the socket directly
[18:48:12] <cjhowe> so i guess i should take out auth for monitors since it isn't in the python client
[18:48:29] <GothAlice> ^_^ Also apologies if Python is nowhere near applicable, it's just what I know best.
[18:48:30] <cjhowe> (i'm reusing the connection code for the monitor, that's why it's there)
[18:48:37] <GothAlice> Aaah.
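For context, the heartbeat being discussed is just the isMaster command, which a monitor can send on an unauthenticated connection; a minimal mongo-shell sketch:

```javascript
// isMaster works before (and without) authentication and returns the
// topology information a driver's monitor needs:
// e.g. { ismaster: true, setName: "rs0", hosts: [...], primary: "...", ok: 1 }
db.adminCommand({ isMaster: 1 });
```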
[18:58:54] <nedbat> GothAlice: howdy
[18:59:01] <GothAlice> Hey-o.
[18:59:08] <nedbat> My installation has 3 GB of files in mongo/mongodb/journal. Can I make mongo use less disk space for journals? This is for a developer, not production, system.
[19:00:45] <GothAlice> mmapv1 or wiredtiger storage engine?
[19:01:03] <nedbat> @GothAlice: not sure, i'm a mongo novice, working on some installation scripts
[19:01:13] <GothAlice> MongoDB 3.2 or newer, or earlier?
[19:01:34] <GothAlice> (That's when the default changed.)
[19:02:25] <nedbat> @GothAlice: 2.6.12
[19:03:24] <GothAlice> Cool. For way smaller sizes, use the smallFiles storage option. (--smallFiles CLI arg or https://docs.mongodb.com/v2.6/reference/configuration-options/#storage.smallFiles )
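A sketch of where that option lives in a 2.6-era config file (YAML form; the equivalent CLI flag is shown in the comment). This is illustrative only, not nedbat's actual configuration:

```yaml
# mongod.conf sketch for MongoDB 2.6 with mmapv1: smallFiles caps data
# files at 512 MB and journal files at 128 MB each.
# Equivalent CLI form: mongod --smallfiles --dbpath /var/lib/mongodb
storage:
  dbPath: /var/lib/mongodb
  smallFiles: true
```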
[19:04:30] <GothAlice> nedbat: Also, was the screenshot I provided via Twitter sufficient in demonstrating the i18n pattern? (And yes, I'm aware of the minor space typo on the French side. ;)
[19:04:55] <nedbat> GothAlice: i only saw a few places with titlecase in the english
[19:05:26] <GothAlice> Notably in the "checkout progress" bar at the top, aye. But it affects all button labels and things, too.
[19:05:44] <GothAlice> Also the job titles, as I quoted in a second tweet.
[19:06:37] <nedbat> GothAlice: maybe the screenshot is cropped on Twitter? https://twitter.com/GothAlice/status/766593224857128960
[19:06:41] <Zelest> *yawns*
[19:08:11] <GothAlice> nedbat: http://s.webcore.io/303g3t0R1Z0R/Screen%20Shot%202016-08-19%20at%2014.56.48.png vs. http://s.webcore.io/2t262g2m0e3F/Screen%20Shot%202016-08-19%20at%2014.57.04.png (headings there)
[19:08:56] <nedbat> GothAlice: yes, i see what you mean. I never noticed that about French before, maybe haven't seen enough.
[19:09:38] <nedbat> GothAlice: my doc writers won't let me write, "Use this code to delete a file: <block>rm -rf the_file</block>". They want: "Use this code to delete a file. <block>...."
[19:11:00] <GothAlice> I try in my documentation to avoid inline "type this" code blocks, but do use trailing colons. Hmm. At-work documentation, after translation and integration into the standard document format, I never really review. I'll ask for a copy of the latest manual in French and see what they do.
[19:18:55] <GothAlice> nedbat: Apparently the choice to use colon or not is context dependent, based on the meaning of the instruction. (Yay!) In the case of text the user must literally type, our style guide says to preserve colons.
[19:19:01] <nedbat> GothAlice: my doc writers' point is that we cannot use colons like that *in English* because it will be difficult for the translators, which I cannot understand.
[19:19:09] <GothAlice> Yeah, that's nonsense.
[19:19:24] <nedbat> it drives me nuts
[19:20:35] <GothAlice> nedbat: http://s.webcore.io/1s332w160g3M/Screen%20Shot%202016-08-19%20at%2015.10.42.png
[19:20:59] <nedbat> GothAlice: the accents in that exchange are adorable
[19:21:05] <GothAlice> Yeah, IME hilarity. XD
[19:22:51] <GothAlice> This might be a difference between Québecois French and France-French, though. Unsure of your target region.
[19:24:14] <GothAlice> http://www.btb.termiumplus.gc.ca/tcdnstyl-chap?lang=eng&lettr=chapsect17&info0=17 may be interesting. http://french.about.com/library/writing/bl-punctuation.htm mentions _increased_ colon usage vs. English.
[19:32:23] <nedbat> GothAlice: i wonder if the prose rules there carry over into more technical contexts.
[19:38:13] <dicho1> I found an interesting feature in Hive and I wasn't sure if it was possible in MongoDB based on the docs: In Hive you can choose the columns you include in your SELECT clause by regex. I don't see similar functionality in the MongoDB docs, but I wanted to ask here just to be sure I wasn't missing something.
[19:39:49] <GothAlice> dicho1: Keys are effectively static, and selected explicitly during projection. There are reasons for this, for example, if you have dynamic keys you lose the ability to index them sensibly. In MongoDB the pattern is to rotate {"dyn_key_1": 27, "dyn_key_2": 42} into a list of pairs, a la [{"key": "dyn_key_1", "value": 27}, …] at which point yes, you can effectively query them in the way you describe.
[19:40:06] <GothAlice> But you still can't "project" them like that. (You can project the first matching one, but not multiple.)
[19:41:18] <dicho1> @GothAlice: OK, that's an interesting workaround that I'll have to keep in mind. Thank you!
[19:41:35] <GothAlice> dicho1: It's less of a workaround than a critical pattern to utilize, given the mentioned lack of indexing.
[19:41:41] <GothAlice> Dynamic keys are the death of many models. ;P
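A minimal mongo-shell sketch of the rotation GothAlice describes; the collection and field names ("records", "attributes") are placeholders:

```javascript
// Store dynamic keys as a list of key/value pairs instead of as top-level keys.
db.records.insertOne({
  attributes: [
    { key: "dyn_key_1", value: 27 },
    { key: "dyn_key_2", value: 42 }
  ]
});

// A compound index over the pairs makes attribute lookups cheap.
db.records.createIndex({ "attributes.key": 1, "attributes.value": 1 });

// Find documents where dyn_key_1 equals 27.
db.records.find({ attributes: { $elemMatch: { key: "dyn_key_1", value: 27 } } });
```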
[19:42:45] <dicho1> This is true. My use case is more for historical data though. It would be a pretty huge undertaking to normalize all the field names and their types for the thousands of CSV's I have.
[19:43:00] <dicho1> They're largely similar, but have some slight differences.
[19:43:41] <GothAlice> That's reasonable; optional keys are a-ok, truly dynamic ones cause problems.
[19:44:48] <dicho1> GothAlice: Many thanks for your help!
[19:44:55] <GothAlice> It never hurts to help. :)
[19:50:34] <Sygin> hello there. i have a question. does sharding look at disk space?
[19:51:34] <Sygin> reason i ask is because if say i have 2 mongo instances running, and one of them is on a disk drive say 98% full and i add a new one with a new instance. i would like mongo to write to the fresh one. but since its a db i would still need the first one to read from for application logic purposes
[19:51:42] <Sygin> who would you say i should tackle this problem?
[19:51:45] <Sygin> how*
[19:52:12] <GothAlice> Sygin: AFAIK a shard with insufficient free space will not accept new chunk migrations, but running low on disk is always a "bad situation". You would need to adjust your sharding key to more evenly distribute the records, then migrate chunks off the full node.
[19:52:52] <GothAlice> Ref: https://docs.mongodb.com/manual/core/sharding-data-partitioning/
[19:54:45] <Sygin> GothAlice, well i have one instance of mongo running and i'm just trying to future proof this situation. so that when i need to add more space i can do it easily
[19:57:03] <GothAlice> With reasonable sharding keys, and sensible chunk size (the default often works well), simply adding a new shard to the set would allow the balancer to automatically migrate chunks. There are a few situations where automatic chunk migration might not happen.
[19:57:13] <GothAlice> Ref: https://docs.mongodb.com/manual/tutorial/migrate-chunks-in-sharded-cluster/
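A minimal mongo-shell sketch (run against mongos) of the add-a-shard scenario; the shard, database, collection, and key names are placeholders:

```javascript
// Add a new shard to an existing cluster; the balancer then migrates
// chunks to it automatically.
sh.addShard("rs1/mongodb3.example.net:27017");

// For a collection that is (or will be) sharded:
sh.enableSharding("mydb");
sh.shardCollection("mydb.records", { userId: 1 });

sh.status();   // shows the current chunk distribution across shards
```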
[20:04:12] <GothAlice> My own question of the day: which is better, to use coalesced path prefix searches to find all descendants (not just immediate children), or parents._id indexed array? I.e. {'parents._id': {$in: parents}} vs. {path: {$in: path_regexen}}. Testing time, I guess! :D
[20:07:35] <GothAlice> (Only reason I'm not using a graph to store this tree is that it's directed and acyclic. Or I would. ¬_¬)
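For reference, the two query shapes being compared above, in mongo shell syntax with placeholder names; both find all descendants of a node, not just immediate children:

```javascript
// 1. Ancestor-array approach: each document stores the _ids of all its
//    ancestors, and the query is a straight indexed $in.
db.pages.find({ "parents._id": { $in: [someAncestorId] } });

// 2. Materialized-path approach: each document stores a path string such
//    as "/root/section/page"; descendants match an anchored prefix regex.
db.pages.find({ path: { $in: [/^\/root\/section\//] } });
```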
[20:13:03] <Sygin> hey GothAlice so say i already filled up a full hdd with some data and now i add another hdd and want to evenly distribute between the two. would it be possible then to start up another mongo instance, join them, and then evenly distribute them together?
[20:13:27] <Sygin> evenly distribute the existing data
[20:15:13] <GothAlice> Yes, though it's a touch more difficult if you're going from non-replicaset standalone to sharded after already having data.
[20:16:15] <Sygin> oh interesting. meaning that it's best to add all the hdds in the beginning and fill them up like that, rather than add one later and try to redistribute it then
[20:16:38] <GothAlice> It's not about HDDs, it's about configuring multiple mongod's in a way that they can communicate and scale later.
[20:17:02] <GothAlice> Standalone doesn't use an oplog, making the migration to a replica set or sharded set require additional disk space to create the oplog, for one.
[20:17:22] <GothAlice> (So when your disk is already basically full, it's… too late.)
[20:17:43] <Sygin> oh i see
[20:18:22] <GothAlice> Without replication, your data is vulnerable to hardware failure, i.e. it's not "high availability" or redundant. In HDD terms you can think of replication as mirroring RAID, and sharding as striping RAID. The combination of the two is RAID 10.
[20:20:04] <GothAlice> This, however, refers to the arrangement of mongod processes, not literal disks. (Though one should not run two mongod data services on the same machine/disk! That way lies madness. ;)
[20:20:46] <GothAlice> Having real RAID is also advantageous for disk IO performance and hardware reliability, separate from whole-machine failure protection of mongod-level replication.
[20:21:08] <Sygin> GothAlice, haha why is it bad to run 2 mongod instances? i was thinking of running another instance once i add another hdd to the system
[20:21:16] <Sygin> and connecting my node to that port
[20:21:30] <GothAlice> You… technically can, but database services like to assume they can consume all available RAM, so if you have two on the same hardware, they'll fight for resources.
[20:22:00] <GothAlice> Notably, "2 mongod instances" on the same physical hardware is… bad. No additional reliability or redundancy against machine failure, there.
[20:22:16] <GothAlice> (You might as well properly RAID the disks and run just one mongod.)
[20:22:36] <Sygin> so GothAlice if i have say 2 hdds how can i run/edit one mongod instance but utilize 2 different disks
[20:22:38] <Sygin> ?
[20:23:03] <GothAlice> Best to do that using OS-level RAID, to merge the physical drives into one logical volume.
[20:23:06] <Sygin> because one will be mapped in a mount of /dev/sda for example, another in /dev/sdb
[20:23:10] <Sygin> ooooh
[20:23:36] <GothAlice> Then mongod need not even care. :)
[20:23:42] <Sygin> but thats something that can be done after reformatting ?
[20:24:15] <Sygin> is there a way to virtualize already created partitions ? i have two partitions that are mapped by LUKS
[20:24:59] <Sygin> well this is gonna be a new headache
[20:25:25] <GothAlice> Always have backups. ;)
[20:26:01] <Sygin> well this is future proofing. i dont already have the data i'm just planning for the long term
[20:26:17] <Sygin> as in if i apply settings set A then turns out in the future when i do have data that i needed to be set B. the damage control would be more frustration
[20:26:23] <Sygin> frustrating*
[20:27:32] <GothAlice> https://docs.mongodb.com/manual/sharding/ covers the general architecture of running a sharded cluster. Additionally, in really "high available" systems with redundant storage, each "shard" can (and probably should) be its own replica set. There are tutorials for converting a replica set into a sharded one, but you really first need a replica set. Never not replica set, unless it's just a local development mongod.
[20:27:38] <GothAlice> (And even then, I do, for testing of shard keys.)
[20:27:52] <GothAlice> *I do shard locally
[20:28:28] <Sygin> GothAlice, reason why that RAID thing you said is going to be frustrating is also because the LUKS system i made/planned for works but i'm not sure if i can add RAID into the mix and have an encrypted partition
[20:29:00] <GothAlice> If the encryption is AES-XTS and new volume space is only ever appended to the set, theoretically it could work.
[20:29:11] <GothAlice> But that's deep voodoo. ;)
[20:29:17] <Sygin> yep LOL
[20:30:19] <Sygin> GothAlice, is there no way that one mongod instance could read from 2 different locations on disk?
[20:30:51] <GothAlice> You can use symlink shenanigans in the mongod data directory to point certain files or sub-paths at different locations, or use additional mount points within that data directory, sure.
[20:31:52] <GothAlice> I've done so to isolate per-user databases within their own home folders, symlinked into the mongod data directory. This causes… some interesting problems, but nothing too severe as long as stability of the links is maintained.
[20:32:05] <GothAlice> (Useful for quota management, though.)
[20:32:21] <Sygin> GothAlice, i was also thinking of looking at it from another perspective. is there a way do you know in linux where you can virtualize say /dev/mapper/db1 and /dev/mapper/db2 into /dev/mapper/dbmixed so much that it looks like its another partition when in reality its a virtualized thing?
[20:33:04] <GothAlice> I do not know. The mapper is already an abstraction potentially combining multiple physical devices; I would hesitate to add additional layers on top of that.
[20:33:48] <GothAlice> I.e. that sounds like unnecessary, and potentially fragile complexity.
[20:33:55] <Sygin> yeah same
[20:34:04] <Sygin> but imagine tho :D
[20:34:23] <GothAlice> Sygin: I have enough data that that thought will keep me up at night.
[20:34:32] <Sygin> haha
[20:35:35] <Sygin> GothAlice, ok so for the symlink method that you said how would that even work? as in how would mongod know to put certain files in one dir and certain files in a symlink ?
[20:37:12] <GothAlice> The mongod process would be generally unaware of the true location of the data, it would continue to access paths under its data directory and generally not care. Like any process, it'd simply "follow" any symlink it encounters.
[20:38:10] <Sygin> yeah i know that part
[20:38:21] <Sygin> i googled and it says something about directoryperdb
[20:38:26] <Sygin> so imma have to look into that i guess
[20:38:48] <GothAlice> Data is generally organized into a few top-level files (under, say, /var/lib/mongod) holding things like the oplog, journal, etc. Databases are then constructed as folders under there, containing stripes or further subdirectories for specific types of stripes (mmapv1 vs. WT).
[20:39:39] <GothAlice> You can create a database, then move the database directory somewhere else (in my example, into the user's home folder who is responsible for that data), and create a symlink in the mongod data directory pointing at the new location. E.g. /var/lib/mongod/illico → /home/illico/var/mongo
[20:40:33] <GothAlice> Pro tip: don't move a database like this while mongod is running. It'd probably get very, very unhappy with you.
[20:40:39] <Sygin> of course yes
[20:42:58] <GothAlice> And yeah, directoryperdb is too useful to not enable, in any situation where you do have multiple databases. Not all applications do, though.
[20:43:16] <Sygin> GothAlice, but how does your example combine two hard drives though? because now all that has happened is that the data is moved elsewhere? so its reading and writing from elsewhere (other disk) but i want both hdds to be used though
[20:43:57] <Sygin> basically, how can i trick the mongo system to use both hdds unknowingly lol
[20:43:59] <GothAlice> In my scenario, each user's home folder is an independently mounted RAID volume. (Effectively.) Think: Amazon EBS volumes. (Though I'm not using AWS. ;)
[20:47:02] <Sygin> GothAlice, http://stackoverflow.com/questions/15033169/second-hard-drive-for-mongodb-using-directoryperdb
[20:47:12] <Sygin> this encourages me :)
[20:47:39] <Sygin> but before i deploy im going to have to test this
[20:47:41] <GothAlice> Quite so; symlink use is pretty simple.
[20:47:45] <GothAlice> And yeah, always test.
[20:47:54] <GothAlice> My case is actually way more complex than simple RAID, being a distributed network filesystem, sharing the ephemeral space allocated to a large number of VMs, but you get the idea. ;)
[20:48:06] <Sygin> what do you even do? LOL
[20:48:18] <GothAlice> Vertical market application hosting, amongst other goodies.
[20:48:38] <Sygin> oh. nice. do you do this alone or do you have an IT team?
[20:50:16] <Sygin> yeah this seems a much better alternative than sharding with 2 instances lol
[20:50:33] <Sygin> but i think when i do "vertical scaling" i will need to get into sharding
[20:51:11] <GothAlice> Sharding lets you turn a vertical scaling problem (running out of disk space) into a horizontal one (just add more servers, you have more disk space).
[20:51:44] <Sygin> well at first i doubt we'd vertically scale
[20:53:46] <GothAlice> (Very small team: myself and one other. The system is, at this point, kinda silly. Almost five years uninterrupted uptime, automatic disaster recovery, etc., etc., barring Rackspace occasionally screwing things up with network hardware updates. Moving a client from a VM on failing hardware to another dom0 takes about 20 seconds, and with a 99.995% SLA, that means we can survive 6.48 failure transitions like that per month without penalty.)
[20:53:46] <GothAlice> ;P
[20:54:23] <Sygin> 5 years uptime? how do you guys do kernel/security updates? ??
[20:54:38] <GothAlice> Also: make -j192 is just silly to watch. Linux kernel compile from depclean in ~45 seconds.
[20:55:20] <GothAlice> (distcc is awesome on a properly sized cluster)
[20:56:38] <GothAlice> Combination of live hot-swapping of kernels (a few milliseconds of process suspension to re-link things, doesn't even interrupt network connections) or migration to updated hosts. It's a homogeneous cluster, so every server can serve any service or client; big kernel updates (i.e. ones which alter the kernel memory layout) just require spinning up new nodes and migrating people, then shutdown of the old nodes.
[20:56:59] <Sygin> oh cool!
[20:57:08] <GothAlice> Up to 6.25 times per month. ;)
[20:57:20] <GothAlice> Er, 6.48.
[20:58:29] <Sygin> sounds nice yo
[20:59:57] <GothAlice> It can be. Sometimes it can be a huge pain. XD
[21:00:48] <GothAlice> Back when we _were_ on AWS, a multi-zone cascading failure of EBS controllers forced me to spend 36 hours reverse-engineering the MySQL InnoDB on-disk structure… immediately prior to getting four wisdom teeth extracted. Back when we still supported MySQL. (We don't any more.)
[21:01:01] <GothAlice> Data recovery is… un-fun.
[21:01:50] <GothAlice> Doubly so under threat of general anaesthetics. XP
[21:02:24] <Sygin> yeah im guessing you didnt like mysql as much as you do mongo ?
[21:03:30] <GothAlice> Switched to MongoDB around 1.2 or so, yeah. I never liked the SQL "all your data is nothing but a spreadsheet" approach. Very limiting.
[21:03:51] <GothAlice> (And once you start getting into models like EAV—entity attribute value—you might as well be using a document store anyway!)
[21:04:35] <Sygin> GothAlice, i have a question
[21:04:40] <GothAlice> Hit me.
[21:04:43] <Sygin> sorry didnt meant to cut you off
[21:04:46] <Sygin> er
[21:05:22] <GothAlice> :P No worries. I ramble.
[21:05:41] <Sygin> so if i do the directory per db thing right. and say i separated userlist into now userlist and userlist2. in my nodejs application dont i have to then edit all reads (find) to go thru both of the dbs?
[21:06:36] <GothAlice> Yes, you would. That's generally why it's a sub-optimal solution, excepting the "these actually are separate databases" thing. Sharding is the correct way to have multiple storage locations which mongod itself can select between when answering queries, presenting you a single unified database to query.
[21:07:50] <Sygin> yes but the sharding route is for vertical scaling atm. and the other method is to use RAID and i dont know yet how to do luks. so i guess my only option now is to use this symlink method with 2 reads
[21:08:07] <Sygin> how to do RAID with LUKS*
[21:10:11] <GothAlice> Sharding is horizontal. Adding more RAM and larger disks is vertical.
[21:10:40] <GothAlice> Unless my dyslexia is really doing me over, here. ;)
[21:11:13] <Sygin> oh lol
[21:11:13] <Sygin> sorry
[21:11:13] <Sygin> yeah i got it the other way around
[21:11:21] <Sygin> basically, we are not adding new computers into this. just hard drives at first
[21:11:26] <Sygin> so i have to accomodate around that
[21:12:25] <Sygin> yes so my mind is set i think i'm going to use symlinks
[21:12:30] <Sygin> btw GothAlice what do you think of Mongoose ?
[21:12:45] <Sygin> some people have said that it is bad. I have time and chance to change to another one
[21:12:53] <Sygin> another mongo nodejs lib
[21:13:48] <GothAlice> Mongoose is arguably evil. ;P
[21:14:03] <Sygin> oh i see
[21:14:19] <GothAlice> And Node/JS an anti-language. (I want… one CSS minifier, I get 412 npm packages which fail to install three out of four times.)
[21:16:05] <GothAlice> Also various bits of JS insanity, like typeof "string" !== typeof new String("also a string"), or the ever-amusing [] + [] == "" (an empty string), [] + {} == "[object Object]", {} + [] == 0, and {} + {} == NaN. Wat.
[21:16:31] <GothAlice> Array(16).join("wat" - 1) + " Batman!"
[21:16:56] <GothAlice> http://s.webcore.io/382k0Y3k3X3N/Screen%20Shot%202016-08-19%20at%2017.07.07.png
[21:17:53] <GothAlice> https://www.destroyallsoftware.com/talks/wat for a highly amusing run through of wats.
[21:18:40] <Sygin> oh i see.
[21:18:56] <GothAlice> The reason why I mention Mongoose as being arguably evil is that it's a large source of problems reported in this channel. For example, I've never seen anyone using anything else end up with a collection literally named "[object Object]". But it's happened at least twice that I've assisted with, for Mongoose users. Also things like Mongoose storing ObjectIds as strings, not ObjectIds, which causes all _sorts_ of problems.
[21:18:58] <Sygin> GothAlice, so i did the dir per db thing right. and it works. puts it in separate folders. that means i can abuse this to put symlinks
[21:19:18] <GothAlice> You can, yes.
[21:19:21] <Sygin> yeee
[21:27:07] <Sygin> oh i see. GothAlice so say i have hdd 1, hdd 2 on server 1, and server 2 with another hdd. i can manipulate shard key ratios so that mongodb can write the least amount to one hdd. and most amount to another hdd. is that its true purpose ?
[21:27:43] <GothAlice> Yes; choice of sharding key allows you to control how records are distributed.
[21:28:46] <Sygin> so now that i have one mongod instance. but two databases and i used a trick to make the second db into another hdd. and i add another server into the mix. the hdd1 will be most full. so at point of adding new hdds i can just change parameter values so hdd1 gets read fine but when writing mongo just sorta prefers the other db
[21:28:56] <GothAlice> There are many possible reasons to want to do that; an imbalance in available disk space between servers should generally be avoided, but is one reason to tweak the balance. Others include grouping often-queried-together data together, to avoid needing to merge records across servers excessively. Also datacenter awareness, i.e. having some of your data in Seattle, and other data in New York, to help move the data used by certain users
[21:28:56] <GothAlice> closer to those users.
[21:29:33] <GothAlice> When it comes to adding a new shard to an existing cluster, generally, if you didn't force the sharding too much, it'll automatically attempt to move data around to achieve a new balance.
[21:29:38] <Sygin> dope as
[21:29:46] <Sygin> really?
[21:29:49] <GothAlice> (That's why that process is called "balancing", and the server process doing it the "balancer".)
[21:29:57] <Sygin> its made to evenly distribute automagically ?
[21:30:01] <GothAlice> Generally, yes.
[21:30:11] <GothAlice> Poor choice of sharding key can get in the way of that, though.
[21:30:28] <GothAlice> I.e. it's possible to come up with a sharding key that tries to put all data on one shard. That's bad news. ;)
[21:31:08] <Sygin> well if 1 hdd is about full and other 2 are nearly empty. i can choose a sharding key to choose those 2 for write
[21:31:51] <Sygin> if shard keys can be changed about at run time maybe we could even write a program that calculates the shard key ratio at each write by hdd space calculated
[21:31:53] <Sygin> idk tho lol
[21:33:21] <GothAlice> You may be trying to exert too much direct control over the process. Data will be read and written to the shard determined by the sharding key, yes, but generally don't try to micro-manage it to the level of only reading on one shard and writing to others. That's needlessly complicated. Natural rebalancing will migrate data around and find a new happy medium, given the chance. This is a good thing.
[21:34:15] <GothAlice> Just carefully think about your sharding keys up-front, and yes, you can change them later, but really large scale migrations are something one wants to generally avoid. I.e. if you change your key too much, suddenly every chunk might find itself wanting to be on a different shard…
[21:35:00] <GothAlice> Basically, if you have one node, and add a second, you want the data spread roughly evenly between them. Add a third node, and the nodes should want to each store only a third of the available data. Etc.
[21:35:17] <GothAlice> (Excluding, again, datacenter-aware issues where you really want your data spread geographically.)
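A sketch of that datacenter-awareness idea using the 3.2-era tag-aware sharding helpers; the shard names, tags, shard key, and ranges are placeholders:

```javascript
// Assumes a shard key of { country: 1, userId: 1 } on mydb.users.
// Tag each shard with the datacenter it lives in.
sh.addShardTag("shardNYC", "NYC");
sh.addShardTag("shardSEA", "SEA");

// Pin key ranges to a tag so the balancer keeps those chunks on the
// matching shards (i.e. near the users who query them).
sh.addTagRange("mydb.users",
  { country: "US-east", userId: MinKey },
  { country: "US-east", userId: MaxKey }, "NYC");
sh.addTagRange("mydb.users",
  { country: "US-west", userId: MinKey },
  { country: "US-west", userId: MaxKey }, "SEA");
```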
[21:36:56] <Sygin> this is fascinating stuff
[21:37:04] <GothAlice> But really: don't overcomplicate it. ;)
[22:44:01] <Sygin> well GothAlice thank you. you were a superb help :)
[22:44:11] <GothAlice> It never hurts to help. :)
[22:51:24] <rendar> GothAlice: -j192?!
[22:51:30] <GothAlice> -j192
[22:51:31] <rendar> GothAlice: what kind of system is that?
[22:51:41] <GothAlice> Around 48 4-core VMs.
[22:52:12] <GothAlice> Using distcc distributed compilation.
[22:52:30] <rendar> 48 VMs, but what is the real hardware? did you have bought your own hardware or those are 48 VMs in the cloud?
[22:52:40] <GothAlice> "In the cloud."
[22:52:46] <rendar> i see
[22:52:52] <rendar> oh, rackspace, iirc
[22:52:56] <GothAlice> Yeah. ^_^
[22:53:02] <rendar> got it
[22:53:03] <rendar> :)
[22:53:50] <rendar> need to sleep, let's talk later, good night
[22:53:55] <GothAlice> Have a great one!
[22:55:10] <rendar> thanks! :) bye
[23:06:39] <Sygin> hey GothAlice you still around?
[23:06:48] <GothAlice> Yup. What's up?
[23:07:47] <Sygin> is it possible to shard to the same instance of mongod (same port) but different db and make the two dbs ?
[23:07:51] <Sygin> like i want to connect like
[23:08:12] <GothAlice> Uh… no.
[23:08:12] <Sygin> mongodb://localhost:30000/db1,localhost:30000/db2
[23:08:18] <GothAlice> Yeah, no.
[23:08:24] <Sygin> ooo
[23:08:34] <Sygin> that would make things sooo much easier lol
[23:08:43] <GothAlice> Actually, it wouldn't.
[23:09:05] <GothAlice> I.e. splitting up the DBs in this way is a sub-optimal solution that actually makes things harder, for this use case.
[23:09:18] <Sygin> why
[23:09:40] <GothAlice> You are explicitly avoiding the true sharding functionality to self-monkeypatch a problem instantly solved by sharding.
[23:10:26] <GothAlice> True sharding requires multiple processes and a balancer (+ config server to track which nodes know of which records). Manual splitting like this… is you recreating sharding poorly. ;)
[23:11:19] <Sygin> so GothAlice how would i make like the db software work so that one query would look thru both dbs ?
[23:11:42] <GothAlice> You can't. If you have one collection split between two databases, you're going to have to query them separately. This is bad.
[23:11:53] <Sygin> right. ok
[23:11:58] <Sygin> i guess i was trying to find the easy way
[23:12:02] <Sygin> i will do them separately
[23:12:05] <GothAlice> Splitting like this is only possibly suitable if the collections contained within are distinct. I.e. you query the "foo" collection in db1, and the "bar" collection in db2.
[23:12:29] <GothAlice> Faking merging of collections yourself? Just use sharding.
[23:12:53] <Sygin> i thought i couldnt shard like that
[23:13:13] <Sygin> mongodb://localhost:30000/db1,localhost:30000/db2
[23:13:34] <GothAlice> In sharding you have one database with the data spread across multiple mongod instances. The mongos router is what your application connects to, and *it* deals with asking the right mongod servers for the data and merging the results.
[23:14:12] <GothAlice> What you just described with that pseudo URI is one server, one process, two different databases. That's not good, nor does MongoDB supply any support for such use.
[23:14:23] <Sygin> ok
[23:14:51] <Sygin> so the best way to do this method (if i keep going with the symlink method) is, when i need to do a read, to program it so that it reads both the dbs
[23:15:11] <Sygin> and when i write it writes to one db
[23:15:59] <GothAlice> The example I gave earlier was isolating different users' databases. No application is querying across different databases in that situation. (This is a reasonable use for the symlink approach.) I'm not sure why you're so preoccupied with this read one / write other situation. :/
[23:16:56] <GothAlice> You seem to be so focused on that one tree that you're missing the forest. Otherwise known as XY problem solving. ;)
[23:17:20] <GothAlice> Really: just use sharding.
[23:17:30] <Sygin> GothAlice, to shard i need to set up another server
[23:17:38] <Sygin> and i cant do that right now
[23:18:01] <Sygin> i'm trying to go with the add hdd approach
[23:19:01] <GothAlice> You don't technically need to, no. It's preferred. You can run two mongod's on one host. It's terribly un-redundant, and will run into performance issues due to RAM limitations, but it can work as a temporary solution. Notably, no-one runs MongoDB on a single server unless they a) have an incredibly good backup and disaster recovery plan, or b) don't care about the data, because the risk of data loss is so great.
[23:19:59] <Sygin> i see
[23:20:09] <Sygin> hmm
[23:21:06] <GothAlice> Standalone is only really viable for development purposes, i.e. on your laptop while writing your app. In production one is strongly encouraged to at least have replication across multiple servers, to avoid data loss, and sufficient replication (or replication + arbitration) that the loss of a server won't impact the application; this is called "high availability". Sharding then lets you get additional space (horizontal scaling) by
[23:21:06] <GothAlice> spreading the data and queries across multiple servers.
[23:21:47] <GothAlice> Technically you could run many mongod nodes, the mongos router, and configuration server, all on one machine. I do this for testing. But it's not a production setup.
[23:22:12] <Sygin> is this specific to mongo?
[23:22:27] <Sygin> or is this true for all database systems ?
[23:22:31] <GothAlice> No. All database systems.
[23:22:44] <Sygin> so GothAlice how would the data be lost anyway/
[23:22:46] <Sygin> ?*
[23:23:33] <GothAlice> If you have a single database server and single application server, if the database server suffers any fault whatsoever your application dies.
[23:23:59] <GothAlice> If you have a replica set of at least two data nodes and an arbiter (that just votes, but doesn't store data), then the application can survive the loss of one of the data servers.
[23:24:27] <GothAlice> In the single database server case, if your HDD fails, you've lost your data. In a replica set, losing one HDD (or having one entire machine light on fire) won't matter.
[23:25:52] <GothAlice> Additionally, you can spread out the nodes in a replica set across multiple racks (protecting against the failure of power in a rack, for example) or across data centres, i.e. if you have a replica in New York and a replica in Seattle, your data would survive the nuking of New York.
[23:26:27] <Sygin> nuking
[23:26:28] <Sygin> damn
[23:26:36] <GothAlice> It's an extreme example to illustrate the point. ;)
[23:26:54] <GothAlice> Things like flooding or other regional natural disasters can have their impact mitigated if you spread your data around.
[23:27:12] <GothAlice> This is true for any database system.
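A minimal mongo-shell sketch of the two-data-nodes-plus-arbiter replica set described above; host names are placeholders:

```javascript
// Run once against one of the data-bearing mongod processes.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "db1.example.net:27017" },
    { _id: 1, host: "db2.example.net:27017" },
    { _id: 2, host: "arb.example.net:27017", arbiterOnly: true }
  ]
});
```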
[23:28:10] <Sygin> ok well you have given me a lot to think about. i will tell my boss about all this but i doubt he will care enough to add more servers
[23:28:23] <Sygin> so i think imma hve to go with the symlink shit when the time comes
[23:28:25] <GothAlice> If you run only one database server, some day, you _will_ regret it.
[23:28:35] <Sygin> well its just for now
[23:28:35] <GothAlice> And potentially lose your data.
[23:28:46] <Sygin> what if we had like, weekly backups ?
[23:28:57] <GothAlice> If the loss of up to a week of changes is acceptable, that can work.
[23:29:07] <GothAlice> But note: "live" backups are accomplished through replication.
[23:29:39] <Sygin> well i get it ok
[23:29:43] <GothAlice> I.e. we run our servers in a datacenter outside the city, but we also have a replica in our office, constantly getting updates. If the datacenter goes dark (loss of power or network connectivity) we have a backup ready to go in the office.
[23:30:47] <Sygin> so for the symlink method (no sharding just dirperdb) i guess for each write and read i have to make sure my application reads and takes into account both the databases right ?
[23:31:04] <GothAlice> Correct.
[23:31:12] <GothAlice> It's a not insignificant amount of additional work.
[23:31:51] <GothAlice> For example, if you have a collection in both, and query both, how do you sort the results? You'd need to combine them, then sort them, and can't make use of MongoDB's own sorting.
[23:32:08] <Sygin> yes
[23:32:10] <Sygin> you are right about that
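A sketch of what that glue code would look like with the Node.js driver, using the userlist/userlist2 split mentioned earlier; the collection name and sort field are placeholders:

```javascript
// Query the same collection in both databases, then merge and sort in
// application memory, because MongoDB can't sort across the two for you.
async function findAcrossBoth(client, filter) {
  const a = await client.db('userlist').collection('users').find(filter).toArray();
  const b = await client.db('userlist2').collection('users').find(filter).toArray();
  return a.concat(b).sort((x, y) => x.createdAt - y.createdAt);
}
```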
[23:33:04] <Sygin> i wish there was someone who dealt with this problem before. like someone could abstract away mongodb's databases and combine the two so that all queries act like they came from one db when really it's just glue code joining the two. GothAlice
[23:33:27] <GothAlice> Again, using the correct solution of sharding, mongos is that existing solution.
[23:34:35] <Sygin> haha im wary about running two instances of mongod now
[23:34:40] <Sygin> i think i'll write the glue code
[23:34:58] <GothAlice> In your case, I would recommend just setting up a local shard set, running two mongod processes, mongos, and mongod configuration server on the same node. It's… awful… but it'd work. Point the first mongod at the first data drive, the second mongod at the second data drive, the configuration mongod at the server's local drive, and mongos will manage routing, result merging, etc.
[23:35:37] <Sygin> it's okay i was going to rebuild this project from scratch anyway. i wrote some code in mongoose but i will scrap that and use mongodb instead
[23:36:28] <GothAlice> That'll get you familiar with the sharding process for when you do scale up to multiple servers, and it won't require any code change in your application, even if on a single server it's sub-optimal.
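Concretely, the application-side difference is just the connection string; a Node.js sketch with placeholder hosts and ports:

```javascript
const { MongoClient } = require('mongodb');

const standaloneUri = 'mongodb://localhost:27017/mydb'; // direct mongod
const shardedUri = 'mongodb://localhost:27100/mydb';    // mongos query router

// Queries look exactly the same either way; mongos handles the routing to
// the right shards and the merging of results.
MongoClient.connect(shardedUri).then(client =>
  client.db('mydb').collection('users').find({}).toArray()
);
```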
[23:38:10] <Sygin> GothAlice, do you think using mongos to combine shards is better or writing glue code for 1 mongod instance is better, performance wise ?
[23:39:07] <Sygin> im sorta leaning towards writing my own glue code for the 1 mongod instance with symlinks and stuff
[23:39:13] <GothAlice> Performance in this situation will be irrelevant. It'll not be good either way, since if one of your HDDs is full you already have more data than will fit in RAM. (Another key use of sharding across multiple servers: to keep the amount of data on each server within reasonable RAM limits.)
[23:39:47] <GothAlice> Far more impactful is the cost to integrate each solution. Sharding is a supported technique, requiring no change to your application.
[23:40:15] <Sygin> yeah sharding means if i have a 2nd server i can just add that to a config file
[23:40:19] <Sygin> and let it run
[23:40:33] <Sygin> with the glue code method i would have to go through more trouble then
[23:41:26] <GothAlice> Quite so.
[23:41:52] <Sygin> alright i'll go the mongos method
[23:41:58] <Sygin> thanks for all your help GothAlice :)
[23:42:52] <GothAlice> Sygin: One quick thing, I have a little bash script I use for testing sharding.
[23:43:11] <GothAlice> It demonstrates some of the configuration options needed to run multiple mongod and the mongos all on one machine.
[23:43:55] <GothAlice> https://gist.github.com/amcgregor/c33da0d76350f7018875 might assist you.
[23:44:24] <GothAlice> (Don't just use the script though; it's not… meant for long-term use, only short-term testing, but the commands to run and what order to run them is covered.)
[23:46:29] <Sygin> GothAlice, is it basically like, you have mongod process 1 and mongod process 2. and then you use mongos to connect the two by specifying their ports, and then mongos listens on a new port?
[23:47:23] <GothAlice> Please fully read the documentation on sharding: https://docs.mongodb.com/manual/sharding/
[23:48:28] <GothAlice> There are three main components: shards containing data, a query router, and configuration server; each needs to be running for the whole to work, effectively. Do not cut corners in following the tutorials.
[23:49:06] <GothAlice> https://docs.mongodb.com/manual/tutorial/deploy-shard-cluster/ being the tutorial.
[23:55:55] <GothAlice> After reading through the tutorial, it should be much easier to follow what the script I linked does and how it applies the multi-server sharding tutorial to a single machine, which is your use case.
[23:56:01] <GothAlice> Sygin: ^
[23:59:46] <Sygin> i see GothAlice
[23:59:51] <Sygin> well i have a lot of learning to do i guess