[03:06:25] <deimos> I currently have ~720gb of data in a single mysql db, ~128gb of indexes, and I see about 30% of the ops are writes, 60% reads, and the rest system stuff. I'm trying to find a long term solution as the growth rate has really started to pick up, and I wanted to look into using mongo. I assume I'd have to use sharding due to the index sizes and the memory constraints … anyone with similar data sizes who has had a lot of success using it?
[03:11:18] <mr_smith> deimos: your indexes are 18% the size of your data? that seems rather high.
[03:11:19] <deimos> resting: thanks, reading now; trying to decide on a long term solution has been rocky as the data approaches the TB range
[03:12:53] <deimos> mr_smith: yeah, and in reality the indexes in mongo would probably get a once-over. we're using mysql like a doc store as it is, and that's the data we want to move out into something long term that can scale. we can keep a reference in mysql and, in mongo, cut the number of indexes down to 1-2 to help reduce those sizes
[03:13:54] <mr_smith> deimos: what are the pain points? read, write or query performance?
[03:15:12] <deimos> in the initial tests we did, writes were a bottleneck for us. we set up 3 servers and dumped about half the data on them, and then hammered the hell out of it trying to push the other half, and it was having a hard time. granted, i think we made some mistakes with the shard key and there were some configuration issues we missed
[03:15:59] <mr_smith> deimos: this is on the mysql side, right?
[03:16:00] <deimos> there was a lot of disk contention, i think because we had to keep the indexes in memory due to the random nature of our access patterns
[03:16:30] <deimos> this was the mongo test we had setup, dumping the data from mysql into mongo
[03:17:13] <mr_smith> deimos: well, mongodb is going to keep indexes in memory. that's why it's fast.
[03:17:49] <mr_smith> however, 128gb of indexes may not be ideal. just keeping those indexes up to date in a bulk load probably buried it.
[03:18:26] <mr_smith> even with a TB of data, mysql should be pretty fast.
[03:18:32] <deimos> yeah, correct me if i'm wrong: the shards themselves will only need to load the index for the data they contain into memory? it seems pretty self evident that would be the case but i want to be sure heh
[03:19:05] <deimos> well, i can't say the data is structured very well, we're fighting a political battle on application architecture too :/
[03:19:15] <mr_smith> deimos: yeah, but it depends on your shard key. choose a bad one and you're going to have a bad time.
[03:21:15] <deimos> yeah i think we need to take some time and figure out the best key to use for sure
[03:24:21] <deimos> we grab data from a gaming company, most of it is like games/matches fought between 2 teams of folks, and then we have a web based analytics/scoring tool to surface the data/trends/patterns in fun interesting ways, etc..
[03:24:28] <deimos> so we were using the player's ids
[03:25:46] <deimos> most of the data is accessed when someone goes to that player's profile: they see those games, and who they played against, etc… but then we also link to those players and there's a number of aggregations we do, and for building trends we aggregate historical data for like who uses what items the most, or whatever
[03:26:20] <deimos> sorry, typing in the dark, forgive my typos
[03:27:39] <deimos> we figured that way, when new games were added, the queries would be single-node, since hopefully the games would land on the shard where that player was inserted initially, realizing though that could create hot spots
[03:31:38] <mr_smith> deimos: using IDs is questionable. index locality is good, but query isolation sucks and you're vulnerable to reliability problems. something better would be like player and time, just to spitball.
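As a rough sketch of mr_smith's "player and time" suggestion, sharding the collection on a compound key might look like the pymongo snippet below. The database/collection names (gamedb.games), field names (player_id, played_at), and the mongos host are assumptions for illustration, not taken from the chat.

    from pymongo import MongoClient

    # Connect to a mongos router (host is an assumption).
    client = MongoClient("mongodb://mongos.example.com:27017")

    # If the collection already has data, the shard key needs a supporting index.
    client["gamedb"]["games"].create_index([("player_id", 1), ("played_at", 1)])

    # Enable sharding for the database, then shard the collection on a compound
    # key: player first for locality, time second so one hot player's games can
    # still be split across chunks.
    client.admin.command("enableSharding", "gamedb")
    client.admin.command(
        "shardCollection", "gamedb.games",
        key={"player_id": 1, "played_at": 1},
    )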
[03:34:02] <deimos> question, when you have say, 3 shards, and you need to add another one, is the idea that when you add it, they start rebalancing right away? and is that something that is throttled well based on the number of reads/writes coming in?
[03:34:26] <deimos> i think that's one thing we ran into: when we added a shard it was going to take days to rebalance due to the writes
[03:55:13] <mr_smith> deimos: i'd have to dig into that one, but you can manually split and move chunks between shards.
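For reference, manual splitting and moving is done with the split and moveChunk admin commands through mongos. A minimal pymongo sketch follows; the namespace, shard-key values, and shard name are made up for illustration.

    import datetime
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos.example.com:27017")

    # Split the chunk that contains this shard-key value at that point...
    point = {"player_id": 500000, "played_at": datetime.datetime(2013, 1, 1)}
    client.admin.command("split", "gamedb.games", middle=point)

    # ...then move the resulting chunk onto the newly added shard by name.
    client.admin.command("moveChunk", "gamedb.games", find=point, to="shard0003")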
[04:25:11] <jwilliams_> if i already have a sharded mongo cluster, how can i add a replica set to an existing shard server?
[04:25:31] <jwilliams_> i came across this doc https://groups.google.com/forum/#!msg/mongodb-user/F3_XA9RPHVM/I_kyN7HlVLcJ
[04:25:38] <jwilliams_> but that seems to be different from my case.
[04:26:48] <jwilliams_> and other examples e.g. http://docs.mongodb.org/manual/tutorial/deploy-replica-set/ seem to start from a fresh one.
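Very roughly, the commonly described route is: restart the existing shard's mongod with a --replSet name, initiate the set around the existing data, add members, and then make sure the cluster's shard entry references the replica set (e.g. re-adding the shard as rsShardA/host:port). A hedged pymongo sketch of just the replica-set part, on a reasonably modern server, with all names/hosts/ports invented:

    from pymongo import MongoClient

    # Talk directly to the shard's mongod after it has been restarted
    # with --replSet rsShardA (host and port are assumptions).
    shard = MongoClient("mongodb://shard-a.example.com:27018",
                        directConnection=True)

    # Initiate a one-member set around the existing data...
    shard.admin.command("replSetInitiate", {
        "_id": "rsShardA",
        "members": [{"_id": 0, "host": "shard-a.example.com:27018"}],
    })

    # ...then add the new member by bumping and re-applying the config.
    cfg = shard.admin.command("replSetGetConfig")["config"]
    cfg["members"].append({"_id": 1, "host": "shard-b.example.com:27018"})
    cfg["version"] += 1
    shard.admin.command("replSetReconfig", cfg)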
[07:03:04] <ragsagar> I am using the aggregation framework to group and count based on date. I achieved this http://dpaste.com/886426/ . But I want to group by dayOfWeek and month, not just year. How can I achieve that?
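Since the dpaste link is gone, here is a rough pymongo sketch of grouping on both $month and $dayOfWeek rather than just the year; the collection name and the created_at date field are assumptions.

    from pymongo import MongoClient

    coll = MongoClient()["mydb"]["events"]

    pipeline = [
        # Group on (month, dayOfWeek) extracted from the date field and
        # count documents per bucket.
        {"$group": {
            "_id": {
                "month": {"$month": "$created_at"},
                "dayOfWeek": {"$dayOfWeek": "$created_at"},
            },
            "count": {"$sum": 1},
        }},
        {"$sort": {"_id.month": 1, "_id.dayOfWeek": 1}},
    ]

    for doc in coll.aggregate(pipeline):
        print(doc)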
[08:42:07] <oskie> some time after a PRIMARY change, the secondaries start to lag a bit (~10 min), exactly the same lag. Yet there is no high CPU or high I/O on the slaves. What's wrong?
[08:43:39] <oskie> what's usually wrong when the two secondaries lag with exactly the same optimeDate?
[08:45:50] <chrisq> how come mongodb still uses huge amounts of memory even after i drop the only collection i had there?
[08:46:53] <chrisq> also, /var/lib/mongodb/, retains <collection>.<int> files that are GBs in size
[08:49:32] <NodeX> mongo doesn't shrink its data files, and the memory is mmapped
[08:49:46] <NodeX> ergo the operating system will evict it from its (LRU) cache when it needs the memory
[08:49:50] <chrisq> dropping the database solved it. i'm still curious why the database would use so much memory after the data in the db is actually removed
[08:50:15] <chrisq> ah, right, so how long would i normally wait for that to happen?
[08:54:46] <NodeX> if you want to force it then restart your mongo
[08:58:10] <oskie> if a single secondary is overloaded and lagging, could that cause other secondaries in the same replica set (not overloaded) to be lagging as well?
[09:15:16] <oskie> But I don't get it. The primary has very little load, little I/O. Still, the secondaries keep lagging more and more. There are no connectivity issues.
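For digging into which member is behind and by how much, comparing each member's optimeDate via replSetGetStatus (what rs.status() wraps in the shell) is a quick start; a small pymongo sketch, with the host being an assumption:

    from pymongo import MongoClient

    client = MongoClient("mongodb://primary.example.com:27017")

    # replSetGetStatus reports state and optimeDate per member, which is
    # where identical lag on both secondaries would show up.
    status = client.admin.command("replSetGetStatus")
    for m in status["members"]:
        print(m["name"], m["stateStr"], m.get("optimeDate"))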
[09:21:49] <yarco> what's the opposite operation to $addToSet ?
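For what it's worth, the usual counterpart to $addToSet is $pull, which removes matching values from an array. A tiny pymongo illustration, with the collection and field names made up:

    from pymongo import MongoClient

    coll = MongoClient()["mydb"]["players"]

    # $addToSet adds a value only if it is not already in the array...
    coll.update_one({"_id": 1}, {"$addToSet": {"tags": "mvp"}})

    # ...and $pull removes every occurrence of a matching value.
    coll.update_one({"_id": 1}, {"$pull": {"tags": "mvp"}})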
[10:00:14] <simenbrekken> I've got a bunch of documents with different statuses (pending, approved, rejected) which I need to count each time I search the collection. I've made an index on the 'status' field but it's still pretty slow. Are there other ways of optimizing counts?
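One common approach is to get all the per-status counts in a single aggregation pass instead of issuing one count per status; a sketch, with the collection name assumed (very large collections may still warrant pre-computed counters maintained on write):

    from pymongo import MongoClient

    coll = MongoClient()["mydb"]["documents"]

    # One $group over the indexed "status" field instead of N separate counts.
    counts = {
        doc["_id"]: doc["n"]
        for doc in coll.aggregate([
            {"$group": {"_id": "$status", "n": {"$sum": 1}}},
        ])
    }
    print(counts)  # e.g. {"pending": 12, "approved": 40, "rejected": 3}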
[10:58:10] <chrisq> NodeX: i've seen that when used on sub-documents in the docs, but they were all dicts in dicts, not dicts in arrays. would i just ignore the fact that they are in an array?
[12:43:35] <tom0815> hello, is it possible to configure the timeouts for replicasets heartbeat?
[13:12:54] <jeremy-> Is there any way to create a collection without any data and ensureIndex on it (awaiting data)? As I understand it, collections are created on the first data input. I'm just working out the best way to ensureIndex as part of the initial data input process on new collections that are created dynamically (daily date based collections).
[13:13:21] <Derick> ensureIndex will also create the collection
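In driver terms that means the next day's collection and its indexes can be created up front, before any data arrives; a pymongo sketch, with the database name, naming scheme and index fields all assumed:

    import datetime
    from pymongo import ASCENDING, MongoClient

    db = MongoClient()["metrics"]

    # Name of the next daily bucket (naming scheme is an assumption).
    name = "events_" + datetime.date.today().strftime("%Y%m%d")

    # create_index (ensureIndex in older drivers and the shell) implicitly
    # creates the empty collection, so the index exists before the first insert.
    db[name].create_index([("ts", ASCENDING), ("player_id", ASCENDING)])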
[13:25:35] <jeremy-> Eg, what if you wanted spelling mistake searching match suggestions as well
[13:25:41] <kali> if the search is a critical path in your app, i recommend you design with an external search engine from the beginning
[13:26:03] <kali> if the search is just a "nice to have" feature, you can try DIY with mongodb
[13:26:05] <NodeX> you can build a spelling engine in anything
[13:26:16] <NodeX> don't forget in mongo 2.4 FTS is arriving
[13:26:31] <jeremy-> I haven't been watching the roadmap, sounds interesting
[13:26:35] <NodeX> it will be somewhere in the middle of a full blown indexer and what's currently available in mongo
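For context, the 2.4 full-text search NodeX mentions ended up as text indexes. A sketch of what that looks like through pymongo on a current server (the collection and field names are assumptions; in 2.4 itself the feature was experimental and queried through the text command rather than $text):

    from pymongo import MongoClient, TEXT

    coll = MongoClient()["mydb"]["articles"]

    # A text index over the searchable field.
    coll.create_index([("body", TEXT)])

    # Stemmed keyword search; it is not a spelling corrector, so typo
    # handling would still live in the application.
    for doc in coll.find({"$text": {"$search": "mongodb sharding"}}):
        print(doc["_id"])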
[13:26:59] <jeremy-> I was just thinking, if you were going to use python or whatever natural language toolkit to get typos and then search mongodb, i thought one of the out-of-the-box search engines might do it more efficiently
[13:27:07] <jeremy-> than sending like 20-50 possible searches per query
[13:29:16] <jeremy-> I think mongo (or even a sql based database) would be a pretty nice solution for what you are saying personally, but i'm pretty inexperienced. I just can't see a strong case for a full text search engine
[13:30:28] <jeremy-> I have a mongodb with like 220 million search terms and metrics like volume/cpc, and mongodb returns results in a tiny fraction of a second. haven't measured it but it's like 0.1s or less i assume
[13:32:38] <jeremy-> I just use it to attach metadata to dynamic lists that are much smaller, but i can run a batch job to update like 100k records in minutes which is querying from my 220million records
[13:34:09] <jeremy-> while i'm talking to people who are obviously much better/more experienced than me
[13:36:15] <jeremy-> I've been trying to benchtest some python to determine potential improvements for my application. Is there any known issue where, if you hammer a mongo server with like 100k writes in a short period, the speed at which the writes occur actually slows down over time (just slightly)?
[13:36:46] <jeremy-> I've checked all sorts of things using python's debugging framework and i can't see a memory based reason for the slowdown. I was thinking it might just be a small bottleneck as mongo gets a thrashing
[13:38:40] <jeremy-> Or i guess, i wonder if i can just log the speed at which mongo gets requests and processes them on the database end so i can measure it there (for some reason only thought of that now..doh)
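On the server side, mongostat and the query profiler are the usual tools for this; for a quick cross-check from Python, comparing the client-side elapsed time against the server's cumulative insert opcounter is one option. A minimal sketch, with the bench database/collection and payload shape assumed:

    import time
    from pymongo import MongoClient

    client = MongoClient()
    coll = client["bench"]["writes"]

    def inserts_seen():
        # serverStatus exposes cumulative op counters on the server side.
        return client.admin.command("serverStatus")["opcounters"]["insert"]

    before, start = inserts_seen(), time.time()
    coll.insert_many([{"i": i, "payload": "x" * 100} for i in range(100000)])
    elapsed = time.time() - start
    done = inserts_seen() - before

    print("%d inserts in %.1fs (%.0f/s)" % (done, elapsed, done / elapsed))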
[14:11:37] <krawek> firefux: the new aggregation framework works really well for reporting
[14:12:54] <kali> krawek: the AF has limitations that could be problematic for generating big reports
[14:13:10] <kali> or even small reports on big dataset, if you're unlucky
[14:17:05] <balboah> anyone running mongodb in amazon ec2 production? Is the recommendation about raid10 ebs still valid now that they also have a provisioned IOPS option?
[14:20:07] <oskie> balboah: sure do.. for us, provisioned IOPS gives less performance
[14:20:25] <oskie> balboah: unless you go >= 1200 or maybe 1500 provisioned IOPS