[07:26:11] <databasesarehard> hello anyone around to help a database n00b ?
[07:26:29] <databasesarehard> ive got a lot of text data and im not sure how to proceed with what i need to do
[07:27:30] <databasesarehard> ive hit the document size limit and i guess i need to be using gridfs - but im not sure if im going to be able to manipulate the data as needed
[12:36:45] <GothAlice> databasesarehard: "Text data" … "hit the document size limit" … you're telling me you have more than 200 novel chapters (5,000 words per chapter average) within a single document? That's a million words of text.
[12:38:30] <GothAlice> Alternatively, that'd only take an array containing two million nulls to hit. (The most compact representation that can hit the limit.)
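The null-array arithmetic is easy to verify with the bson package that ships with PyMongo (a rough sketch; the field name and exact byte count are incidental):

    # Each array element in BSON carries its decimal index as a string key,
    # so even a bare null costs several bytes; two million of them clear 16 MiB.
    import bson  # bundled with PyMongo; bson.encode() exists in PyMongo 3.9+

    doc = {"a": [None] * 2_000_000}
    size = len(bson.encode(doc))
    print(size, size > 16 * 1024 * 1024)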
[17:17:20] <reaperkre> hello can someone please take a look at this gist and help me troubleshoot a connection problem? https://gist.github.com/reaperkrew/7c826d20f389f2a5d171d51e29ebe29d
[21:38:44] <reaperkre> hi everyone, anyone have difficulty connecting to mongo+srv on ubuntu 18?
[22:05:06] <databasesarehard> GothAlice: i want to say no but - yes(?) - i know im only dealing with text data and after nesting a bunch of objects the insert is getting rejected
[22:05:18] <databasesarehard> so im working on rearchitecting my data
[22:15:40] <databasesarehard> moving some nested data out into its own db & document seems like its handling itself better
[22:23:33] <databasesarehard> GothAlice: also after re-reading your reply - yes in some cases there may be over 1m words per document - lots of data to scrape out there
[22:24:30] <reaperkre> nevermind, it was laravel homestead that had a problem with dns
[22:24:41] <reaperkre> used a vanilla vagrant bionic box and it was fine
[22:45:35] <databasesarehard> rearchitected but still some documents are too big >_> i guess i need to use gridfs
[22:49:15] <GothAlice> databasesarehard: Are you trying to use documents as if they were collections? With large arrays of many objects? Possibly even multiple arrays, or nested arrays?
[22:51:04] <GothAlice> There are strong disadvantages to trying to get too fancy: excessive complexity like that leads to an inability to sensibly query or update (only one array value at a time can be matched…). Other anti-patterns in the same vein: storing localized values with languages as keys, dynamic (i.e. arbitrary) keys, …
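A small illustration of the "only one array value at a time" point, using made-up collection and field names: the positional $ operator only touches the first element matched by the query.

    # Sketch only: "pages" and "chunks" are hypothetical names.
    from pymongo import MongoClient

    db = MongoClient()["demo"]
    db.pages.update_one(
        {"_id": 1, "chunks.name": "x1"},      # the query picks out one array element
        {"$set": {"chunks.$.value": 99}},     # "$" updates only that first matched element
    )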
[22:51:18] <databasesarehard> GothAlice: just an array of objects
[22:51:23] <databasesarehard> just a lot of objects
[22:51:37] <databasesarehard> each obj has a few k:v pairs
[22:51:47] <GothAlice> Then that sounds like a collection, not a field within another document.
[22:52:06] <databasesarehard> yeah so i made it its own collection
[22:52:09] <databasesarehard> and its still too big
[22:52:27] <databasesarehard> in some instances the collection has over 1m elements
[22:52:46] <databasesarehard> consisting of 6 k:v pairs (incl. _id)
[22:53:01] <GothAlice> Are… you trying to store the ObjectIds of the records from the other collection as an array within the first document, just like storing the individual documents before, only… the reference instead?
[22:53:11] <databasesarehard> pymongo.errors.DocumentTooLarge: BSON document too large (29446984 bytes) - the connected server supports BSON document sizes up to 16793598 bytes.
[22:53:19] <databasesarehard> just bits of meta data about them
[22:53:40] <databasesarehard> sometimes there are just over 1m records
[22:56:09] <GothAlice> I think you need to stop and completely rethink your data model. Back up, try to find an existing white paper, how-to, tutorial, blog post, or article about what you are trying to do.
[22:56:20] <databasesarehard> im scraping tons of data
[22:58:02] <GothAlice> Are you storing files? GridFS, period. Storing large numbers of completely arbitrary key-value chunks of data? Entity-Attribute-Value separation (EAV). Small numbers, you can get away with documents like {props: [{name: "x1", value: 27}, {name: "y1", value: 42}, …]} — just don't embed these within other documents, and don't mix with other arrays.
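A minimal sketch of that EAV-style separation in PyMongo, assuming made-up collection names (entities / attributes):

    # One small document per attribute, in its own collection, instead of a
    # huge embedded props array. Names here are placeholders, not a prescription.
    from pymongo import MongoClient

    db = MongoClient()["scraper"]

    entity_id = db.entities.insert_one({"source": "example.com"}).inserted_id
    db.attributes.insert_many([
        {"entity": entity_id, "name": "x1", "value": 27},
        {"entity": entity_id, "name": "y1", "value": 42},
    ])
    db.attributes.create_index([("entity", 1), ("name", 1)])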
[22:59:10] <databasesarehard> some collections have >1m elements
[22:59:19] <databasesarehard> well - they would if they could store
[22:59:21] <GothAlice> No, collections have documents.
[22:59:55] <GothAlice> And while there is an upper bound on the number of records a collection can store, one will not hit it. You'll run out of disk space first.
[23:00:10] <databasesarehard> then i mean, what am i doing wrong here?
[23:00:49] <databasesarehard> i have a list/array of items - each with the structure above - in some cases the list contains over X elements which causes it to be oversized
[23:00:57] <GothAlice> "BSON document too large (29446984 bytes)" — show me the document you were trying to insert when that happened. Showing me the success cases (data already present) does little too diagnose.
[23:01:25] <databasesarehard> the failure cases have X elements and it doesnt insert
[23:01:29] <databasesarehard> im not sure what you want me to provide you
[23:02:24] <databasesarehard> im writing a scraper and putting lots of data into a database, im not sure what you expect me to have on hand
[23:02:36] <databasesarehard> if the data would store properly i would happily show it to you :)
[23:03:08] <GothAlice> The record you were trying to insert when the error occurred. If one has anything approaching sensible logging, it would be included, but we can't all have good logging practices. If you do not have one presently handy, I'm afraid I can't help you.
[23:03:51] <databasesarehard> oh please - i just told you im writing code from scratch, there isnt much logging as of yet
[23:04:04] <databasesarehard> give me a moment and I can get you the data of one of the failed records
[23:04:09] <GothAlice> Other than to suggest what I already have: that array belongs in its own collection, separate from the document you are trying to insert that contains it. Insert the document without those, note the ID, insert the array elements into a separate collection referencing that inserted document ID.
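That pattern looks roughly like this in PyMongo; the collection names and sample items are made up for illustration:

    # Parent document without the big array; each former array element becomes
    # its own small document that references the parent's _id.
    from pymongo import MongoClient

    db = MongoClient()["scraper"]

    scraped_items = [{"name": "x1", "value": 27}, {"name": "y1", "value": 42}]  # stand-in data
    parent_id = db.scrapes.insert_one({"url": "https://example.com"}).inserted_id
    db.items.insert_many(
        [{"parent": parent_id, **item} for item in scraped_items],
        ordered=False,
    )

    # Read the pieces back without any per-document size ceiling:
    for item in db.items.find({"parent": parent_id}):
        print(item)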
[23:04:20] <databasesarehard> i have moved it to its own collection
[23:04:25] <databasesarehard> and there are too many elements
[23:07:39] <GothAlice> That tells me instead of storing M = {a: [{…}, {…}, {…}]} (monolithic, array embeds documents), you're storing P = [{_id: ObjectId(1), …}, {_id: ObjectId(2), …}, {_id: ObjectId(3), …}]; M = {a: [ObjectId(1), ObjectId(2), ObjectId(3)]} — i.e., storing the IDs of the documents in that "foreign" collection within an array of the original document. "Filling up" a collection would not get you a BSON document size error.
[23:08:22] <GothAlice> Inserting a single over-large record, like the original one embedding all the documents, would. Or inserting one that's still over-large despite having moved the actual "data" out, if containing an array of references like that.
[23:11:08] <databasesarehard> kk getting the data for you -i verified one of the places it fails
[23:11:13] <databasesarehard> ill get you the output momentarily
[23:13:05] <databasesarehard> okay, so my text object is
[23:13:05] <databasesarehard> root@ubuntu-c-16-sfo2-01:~# ls -lh gothalice_dump
[23:15:10] <GothAlice> That's a "you must be joking" situation. Refactor your data model, man. Reconsider life choices. I have more than 40 TiB of data in MongoDB, been using it for 10 years, never encountered the document size limit. Even when it was 4MiB.
[23:16:35] <GothAlice> And data ingress (scraping, feed consumption, XML-RPC, you name it) is a major component of my work. Built quite the MongoDB-based data-derived CSS selector (or mapping key path) content extraction engine.
[23:20:01] <databasesarehard> my data model is a list not much to refactor there
[23:20:15] <databasesarehard> quite literally a file list
[23:23:49] <databasesarehard> so if i have a huge AF list of data what should i do? again - i guess move to gridfs
[23:24:23] <databasesarehard> the output of that file was pure text - i should have done a character count before destroying the instance it was on
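If the thing being stored really is one big blob of text per scrape, GridFS in PyMongo looks roughly like this (the filename and data are placeholders):

    # GridFS splits the payload into chunk documents plus one metadata document,
    # so the 16 MB BSON limit no longer constrains the payload itself.
    import gridfs
    from pymongo import MongoClient

    db = MongoClient()["scraper"]
    fs = gridfs.GridFS(db)

    file_id = fs.put(b"...raw scraped text...", filename="gothalice_dump")
    text = fs.get(file_id).read()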