[16:47:33] <alsuren> Imagine that I have a collection (`content`) with 1_000_000_000 documents in it, and it has an index on `site, author_id` (we have 200_000_000 authors). Is there any way to efficiently iterate over the `site, author_id` index, and give me the list of distinct `site, author_id` pairs so that I can update another collection elsewhere?
[16:49:09] <alsuren> I've looked at using $group in the aggregation framework, but there doesn't seem to be a way to avoid it bringing the whole world into memory in one big hash
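For context, a minimal sketch of the kind of `$group` pipeline being described, assuming pymongo and the collection/field names from the discussion (database name is a placeholder). The grouping stage has to build one hash entry per distinct `(site, author_id)` pair, which is what makes it memory-hungry at this scale; `allowDiskUse` lets it spill to disk instead of hitting the in-memory limit, but it is still a full materialisation pass:

```python
from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod; adjust the URI as needed
content = client["mydb"]["content"]  # "mydb" is a placeholder database name

# One group key per distinct (site, author_id) pair -- ~200M entries here.
pipeline = [
    {"$group": {"_id": {"site": "$site", "author_id": "$author_id"}}},
]

for doc in content.aggregate(pipeline, allowDiskUse=True):
    pair = (doc["_id"]["site"], doc["_id"]["author_id"])
    # ... do something with each distinct pair ...
```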
[16:50:17] <Derick> I don't think it's possible to query which unique elements there are in an index, no
[16:52:59] <alsuren> ... actually, looking at those numbers, it looks like I only have 5 pieces of content per author on average (I expect we have a massively long tail of fly-by contributions), so it probably won't be that terrible to iterate over them with pymongo, sorted on the `site, author_id` index so each author's documents arrive together, keep track of some kind of `current_author_id` variable, and flush a write to the other collection whenever it changes
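A minimal sketch of that iteration approach, again assuming pymongo; the target collection name (`authors`) and the written payload are placeholders. Sorting on `(site, author_id)` walks the compound index in order, so a change in the key marks the start of a new distinct pair, and projecting only the indexed fields keeps the query covered by the index:

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient()  # assumes a local mongod; adjust the URI as needed
db = client["mydb"]  # placeholder database name
content = db["content"]
authors = db["authors"]  # hypothetical target collection

# Walk the compound index in order so all documents for one
# (site, author_id) pair arrive consecutively.
cursor = content.find(
    {},
    {"site": 1, "author_id": 1, "_id": 0},  # covered by the (site, author_id) index
    sort=[("site", ASCENDING), ("author_id", ASCENDING)],
)

current_pair = None
for doc in cursor:
    pair = (doc["site"], doc["author_id"])
    if pair != current_pair:
        # Key changed: this is a new distinct pair, flush a write for it.
        authors.update_one(
            {"site": pair[0], "author_id": pair[1]},
            {"$set": {"seen": True}},  # placeholder payload
            upsert=True,
        )
        current_pair = pair
```

With ~5 documents per author on average, this does roughly 1B cursor reads but only ~200M writes, and it never needs more than one key's worth of state in memory at a time.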