[08:54:25] <synthmeat> wha, broadcasting entire collection every time? no.
[08:56:46] <kirillkh> synthmeat: yes. I do need this.
[09:04:58] <kirillkh> I'm going to use the collection as a message queue.
[09:19:01] <synthmeat> well, nothing's preventing you from fetching the entire collection each time you get any of the changestream events
[09:20:36] <synthmeat> in a few places in my code, i fetch everything, then patch it up with fullDocuments from the changestream. here, i don't really care if it drops anything, though. and i still can't wait to reimplement that properly via redis streams or smth else.
[09:21:26] <synthmeat> if it's something super-small and relatively infrequent, fetching everything on changestream event could cover your bases.
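A minimal pymongo sketch of that approach, assuming a small collection named settings (all names here are illustrative): on every change event, simply re-fetch the whole collection instead of patching local state from the event.

```python
from pymongo import MongoClient

db = MongoClient().app  # hypothetical database

# watch() requires a replica set; each change stream holds a pool connection.
with db.settings.watch() as stream:
    for _change in stream:
        # Small collection: refetch everything rather than patching state,
        # so a dropped or reordered event can never leave us inconsistent.
        snapshot = list(db.settings.find())
        print(len(snapshot), 'documents in sync')
```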
[09:30:06] <kirillkh> synthmeat: good idea, thank you
[09:30:53] <synthmeat> (seriously though, redis streams are super easy and perfect for 90% of these types of jobs. for whenever you want to "do it right")
[09:33:07] <kirillkh> synthmeat: yeah, but what if the redis node goes down?
[09:33:37] <kirillkh> synthmeat: besides, I want to filter the events by key. Is it possible with redis streams?
[09:35:54] <synthmeat> i don't think it is (in a reasonably easy way), though i never needed it so i don't know for sure
[09:37:03] <synthmeat> (note that a changestream fully hogs one of the connections from your pool, so don't open a lot of them)
[09:38:54] <kirillkh> synthmeat: I'm going to work with python asyncio, not sure if the connection pool is relevant there
[12:57:49] <GothAlice> kirillkh: Don't use watch for this. Use the right tool for the job: capped collections. In which case, yes, you issue a find marked "tailable" and "await data"; you'll iterate all existing documents that match, then wait, woken up when new matches are added.
[12:59:03] <GothAlice> I combine it with some logic like this, in a handy little utility: https://github.com/marrow/mongo/blob/develop/marrow/mongo/util/capped.py?ts=4#L11-L46 (MIT licensed, so feel free to use under the requirements listed in LICENSE.txt)
[13:00:01] <GothAlice> Watch is to be notified of changes to an existing collection.
[13:03:15] <GothAlice> kirillkh: "I'm going to use the collection as a message queue." → no ordinary collection will ever satisfy the criteria of a queue. That's explicitly and exactly what capped collections are for: fixed-size, (usually) append-only ring buffers with push notification capability. Highly efficient pointer chasing.
[13:04:04] <GothAlice> Literally how MongoDB itself implements multi-server replication. The oplog? A capped collection.
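A minimal pymongo sketch of the capped-collection queue being described (the collection name and size are illustrative):

```python
import pymongo
from pymongo import MongoClient

db = MongoClient().app  # hypothetical database

# A capped collection is a fixed-size, append-only ring buffer.
if 'queue' not in db.list_collection_names():
    db.create_collection('queue', capped=True, size=8 * 1024 * 1024)

# A tailable "await" cursor iterates every existing match, then blocks
# server-side and wakes as new documents are appended: queue semantics.
cursor = db.queue.find(cursor_type=pymongo.CursorType.TAILABLE_AWAIT)
while cursor.alive:
    for message in cursor:
        print('got', message)
```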
[13:05:32] <kirillkh> GothAlice: I don't need to preserve order. How else are capped collections better?
[13:07:39] <GothAlice> Fixed size, literal ring buffers, directly "tailable" using trivial "running pointer catch-up". Literally a queue. With .watch() you're still relying on an order, you just might not be recognizing it: oplog order. Watch works by tailing the oplog, which will only ever expose _new_ operations made to the collection.
[13:07:56] <GothAlice> Being the right tool, vs. the wrong tool for the purpose of implementing a queue. ;P
[13:09:53] <GothAlice> kirillkh: I built a distributed RPC background task worker system using MongoDB. Capable of distributing and executing 1.9 million RPC tasks per second… 8 years ago. https://gist.github.com/amcgregor/4207375 see the comments at the bottom for the rest of the slides from the presentation, a working MCVE implementation, and a link to a more formal project encapsulating this.
[13:10:50] <kirillkh> GothAlice: except I don't want to cap the size of the buffer. I want it to scale horizontally.
[13:10:54] <GothAlice> You can literally run a function which consumes a generator, where the generator is running on another physical machine, consuming resources from a generator running on yet another machine, … fully distributed async generator chains. For realsies. ;P
[13:11:16] <GothAlice> It explains all, and resolves that concern.
[13:12:26] <GothAlice> For the project at work this was originally developed for (the Facebook game for The Bachelor, really! ;) we determined that to support 24h of history for 1 million simultaneous game instances would require an 8GB capped collection. You measure, test, and allocate to your needs.
[13:12:54] <synthmeat> why don't you write a book? in cookbook-ish form?
[13:13:56] <GothAlice> synthmeat: Because generally I hate people and really don't mind them writing code that sucks. That's their problem. Not mine. 'Cause I'll never use their crap code. ;P </Egotist> XP
[13:14:29] <synthmeat> [proceeds to help everyone on irc chan]
[13:15:36] <synthmeat> any advice on keeping synced copy of collection in memory across many instances?
[13:16:23] <GothAlice> synthmeat: I used to go to conferences, give presentations, talks, and sit on round-table discussions. The "community" for this language can, generally, suck it. I frequently get myself banned from #python, thus founding ##python-friendly (ref: https://gist.github.com/amcgregor/1135902 for the time I got banned shortly after creating that channel) and participating on the Web-SIG (special interest group) within CPython itself was a
[13:16:26] <GothAlice> demonstration of why design by committee nets you absolutely zero progress of any kind. (PEP444 era.)
[13:16:56] <GothAlice> "Agreeing on async processes" was literally impossible.
[13:21:16] <GothAlice> Now I follow the Zen rule: "Practicality beats purity." I don't need to define specifications others agree to. Others suck. I'll just implement the ideas and specifications, let my clients profit from the benefits of those ideas and specifications, and if nobody else on the planet uses the FOSS library I've created for it, that's fine.
[13:22:49] <synthmeat> i'm just glad people like GothAlice and StephenLynx exist
[13:23:24] <kirillkh> GothAlice: can it be used with Motor?
[13:25:30] <GothAlice> kirillkh: I have no idea. It uses standard pymongo APIs, so if you can get a reference to a pymongo Collection from Motor, then yes. Otherwise it should be relatively trivial to adapt: write a variant of Collection and Queryable for AsyncCollection and AsyncQueryable which invokes the correct APIs.
[13:25:47] <GothAlice> All behavior like that is extremely isolated.
[13:29:41] <GothAlice> kirillkh: https://github.com/marrow/mongo/pull/11/commits/a7a02106ec4181d07d1d772a216b1a5d6d47e1ef#diff-4745b5c62d4d0d557c8ccdcde5af16e5R153 (synthmeat, you'll get a kick out of this, I think)
[13:29:58] <GothAlice> Where "S" is a "document class" in the example.
[13:32:54] <GothAlice> (My DAO says, "yes, IDs contain dates. You can freely use the ID field as if it were a date field, including use of relative specifications (vs. now), it'll just 'do the right thing' and work it out for you.")
[13:33:12] <GothAlice> Thus the Thread.id >= -timedelta(days=7) bit.
[13:33:35] <GothAlice> ^ is literally equivalent to: ObjectId.from_datetime(datetime.utcnow() - timedelta(days=7))
[13:33:53] <GothAlice> Well, actually… {'_id': {'$gte': ObjectId.from_datetime(datetime.utcnow() - timedelta(days=7))}} to be properly specific.
[13:35:05] <GothAlice> That's a "basic queryable field", though. Just one. So, to reduce the number of separate times that value would need to be calculated when comparing multiple ID fields against a given date (see the "this is short for" from the linked diff)… QueryableQueryable it: (Thread.id | Thread.reply.id) >= -timedelta(days=7)
[13:35:15] <GothAlice> -timedelta(days=7) only calculated once. Used twice.
[13:36:29] <GothAlice> Combine fields prior to comparison, vs. combining the result of multiple comparisons.
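The underlying trick, as a plain pymongo sketch (the collection and field names are illustrative):

```python
from datetime import datetime, timedelta
from bson import ObjectId
from pymongo import MongoClient

db = MongoClient().forum  # hypothetical database

# Every ObjectId embeds its creation time, so a synthetic ObjectId built
# from a date works as a range boundary on _id -- no extra date field.
cutoff = ObjectId.from_datetime(datetime.utcnow() - timedelta(days=7))

# Threads created in the last week, matched purely on _id.
recent = db.threads.find({'_id': {'$gte': cutoff}})
```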
[13:38:09] <synthmeat> i'm not sure any collection i have would have meaningful date in objectid, tbh. maybe analytics one
[13:38:23] <GothAlice> synthmeat: All ObjectId values contain a date.
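That embedded date is directly accessible; for example, in Python:

```python
from bson import ObjectId

oid = ObjectId()
print(oid.generation_time)  # timezone-aware UTC creation time of the id
```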
[13:39:58] <synthmeat> felt need to respond with something so you don't feel like throwing pearls in front of swines :D
[13:40:12] <GothAlice> Fun fact, my Marrow Mongo DAO layer has a field type explicitly for the storage of "rounded periods": https://github.com/marrow/mongo/blob/develop/marrow/mongo/core/field/period.py?ts=4 :D
[13:40:33] <GothAlice> Since analytics and reporting is a pretty heavy use for it at work.
[13:40:50] <synthmeat> yeah, for that sure sounds useful. date field would be such a waste there
[13:41:17] <GothAlice> Technically a date field. Just… rounded to the edge of an acceptable window.
[13:42:53] <synthmeat> how does one do regular period sampling from a collection, outside of dropping an increasing integer as a field in there and then $mod?
[13:44:19] <GothAlice> https://gist.github.com/amcgregor/1ca13e5a74b2ac318017 ← pre-aggregation pivots our reporting to constant-time O(1) operations. Literally no record can exist that does not represent an accurate period, if missing when "updating", it is created. (Upserts.) Reporting for a given week always iterates the exact same number of records: 168. Etc.
[13:46:35] <GothAlice> That "sample update" (L30-47) represents "hourly buckets", with the most terrible idea of using dynamic property names for counters. (Failure: can't index any of the actual data. Upside: infinitely flexible.)
[13:47:08] <synthmeat> yeah, this $mod stuff i do is rigid af
[13:48:33] <GothAlice> Thus I calculate the "bucket time ranges" application-side; I don't rely on MongoDB $mod operations. I store the actual date and time of the start of the period. Always. And pre-aggregate (pivoting from per-event to per-time-period) at the moment the hit comes in: record the Hit insert, record the Analytic upsert.
[13:48:45] <GothAlice> synthmeat: a 1.9-million-click report generates in 38ms.
[13:49:20] <GothAlice> Because it's only actually evaluating 8,544 aggregate Analytic records to cover the entire year. Always. No matter how much activity there is during the year.
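A rough sketch of that pre-aggregation pattern in pymongo (the schema and names are illustrative, not the gist's exact layout):

```python
from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient().analytics  # hypothetical database

def record_hit(event: dict) -> None:
    now = datetime.now(timezone.utc)
    bucket = now.replace(minute=0, second=0, microsecond=0)  # hourly bucket

    # Per-event record, kept for drill-down.
    db.hits.insert_one({'at': now, **event})

    # Pre-aggregated hourly counter: the upsert creates the bucket if it is
    # missing, so a week's report always scans exactly 24 * 7 = 168 records,
    # no matter how many individual hits arrived.
    db.periods.update_one(
        {'period': bucket},
        {'$inc': {'total': 1}},
        upsert=True,
    )
```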
[13:55:09] <GothAlice> Applicant activity is progressively increasing over time (bottom histogram). Orders are generally well distributed, obviously not typically placed on week-ends, with a notable hole for X-mas and new years (middle histogram). Job offers themselves, however, are often added on Friday under the "end-of-week crunch" (upper histogram), with the same x-mas/new years hole.
[13:55:47] <GothAlice> Also: only 10% of job offers added to the system are ever sent anywhere. That just makes me so sad.
[13:58:46] <GothAlice> Ah, lastly, if people are looking for work, they're looking for work. "Holidays" have no impact whatsoever in the overall shape of the histogram, neither do week-ends have much impact. (People offering jobs work business hours. People looking… are always looking.)
[14:12:02] <GothAlice> The two little bits of red at the start of January are people's failed "new years resolutions". First week, four days of solid effort. Friday crushes you. Monday second week of Jan? Oh crap, better get back to looking… third week: screw it.
[14:12:31] <GothAlice> Bizarre to be able to see that in the shading of a chart. XD
[14:13:28] <GothAlice> "I decided to get in shape for new-years. A pear is a shape."
[14:33:09] <synthmeat> ok. ignoring the nonsense that this code is, this shouldn't hit a race condition, right? https://gist.github.com/synthmeat/138f6476549916b326a229bebd5adf78
[14:37:10] <synthmeat> full mongos cfg https://gist.github.com/synthmeat/2222aa4f98b2a824b151722134850538
[14:43:36] <GothAlice> synthmeat: That pair of operations aren't racy. Even if you didn't apply concerns to it, if you read from the primary you're writing to, those operations must always complete in that order. (Before servicing the findOne, the insertOne will have been "completed" or at least "committed" to the primary, where it will be returned for any queries after that point. The entire MongoDB operation log is linear.)
[14:44:23] <GothAlice> Noting that the immediate read is absolutely unnecessary.
[14:45:05] <synthmeat> GothAlice: ok. sorry, forgot to mention the connection is established with secondaryPreferred
[14:45:09] <GothAlice> userData['_id'] = resultOfInsertAwait.insertedId — and the userData will be materially identical to the record returned by findOne.
[14:45:26] <synthmeat> no, i know that second bit, sure.
[14:46:00] <GothAlice> Or just assign an ID before insertion. userData['_id'] = ObjectId(); await db.collection(…).insertOne(userData, …) — also eliminates the need for a post-insert re-read.
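As a Motor sketch (the collection name is illustrative), assigning the _id client-side before insert, which makes any post-insert re-read unnecessary:

```python
import asyncio
from bson import ObjectId
from motor.motor_asyncio import AsyncIOMotorClient

async def create_user(user_data: dict) -> dict:
    db = AsyncIOMotorClient().app  # hypothetical database

    # Generate the _id locally: once insert_one succeeds, user_data is
    # already materially identical to what a findOne would return.
    user_data['_id'] = ObjectId()
    await db.users.insert_one(user_data)
    return user_data

asyncio.run(create_user({'name': 'example'}))
```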
[14:49:08] <synthmeat> (please, ignore nonsense. i forgot to mention secondaryPreferred read preference)
[14:49:31] <synthmeat> perplexed because i can actually have this race :/
[14:50:18] <synthmeat> 4.2, no compatibility, ubuntu (eh, i know.)
[14:50:55] <synthmeat> (read/write concerns not set in uri or smth)
[14:51:14] <GothAlice> Alternatively, a post-insert read is… absolutely unnecessary.
[14:51:50] <GothAlice> And if you do want lock-step operation, you must read from the primary.
[14:52:35] <synthmeat> why doesn't 'majority' on both the read and write sides work?
[14:53:02] <synthmeat> because of non-_id index query?
[14:53:23] <GothAlice> a) You're doing it wrong by doing more than you need to. b) You're doing it wrong because you are mis-targeting your query and intentionally accepting answers from nodes with stale data. Majority does not mean all.
[14:56:04] <GothAlice> Ah, right, they changed that out. Linearizable.
[14:58:02] <GothAlice> Noting the restriction to reading from the primary, with that.
[14:59:47] <GothAlice> Noting especially: https://docs.mongodb.com/manual/reference/read-concern-linearizable/#read-your-own-writes ← what you are trying to do with your sample. Changed in 3.6.
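For reference, the 3.6+ way to read your own writes from pymongo is a causally consistent session with majority read/write concerns (a sketch; the URI and names are illustrative):

```python
from pymongo import MongoClient, ReadPreference, WriteConcern
from pymongo.read_concern import ReadConcern

client = MongoClient('mongodb://localhost/?replicaSet=rs0')  # hypothetical URI
users = client.app.users

# Causal consistency only guarantees read-your-own-writes when writes use
# w='majority' and reads use readConcern 'majority'.
writer = users.with_options(write_concern=WriteConcern('majority'))
reader = users.with_options(
    read_concern=ReadConcern('majority'),
    read_preference=ReadPreference.SECONDARY_PREFERRED,
)

with client.start_session(causal_consistency=True) as session:
    writer.insert_one({'name': 'example'}, session=session)
    # The session orders this read after the write, even against a secondary.
    doc = reader.find_one({'name': 'example'}, session=session)
```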
[17:08:21] <Mech0z> I have a collection Seasons with documents with a field named EndDate that I would like to delete, but when I run db.yourCollectionName.update({}, {$unset: {EndDate:1}}, false, true); I get WriteResult({ "nMatched" : 0, "nUpserted" : 0, "nModified" : 0 })
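The nMatched: 0 here most likely just means the tutorial placeholder collection name (yourCollectionName) was left in; run against the real collection, $unset does remove the field. A pymongo equivalent (the database name is illustrative):

```python
from pymongo import MongoClient

db = MongoClient().mydb  # hypothetical database holding Seasons

# An empty filter matches every document; $unset drops the EndDate field.
result = db.Seasons.update_many({}, {'$unset': {'EndDate': ''}})
print(result.matched_count, result.modified_count)
```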