PMXBOT Log file Viewer

#mongodb logs for Friday the 21st of November, 2014

[00:52:28] <josaliba> Hey guys I'm new to MongoDb
[00:53:06] <josaliba> I Installed MongoDB PHP Driver but whenever I do an insert it adds the document twice
[00:53:09] <josaliba> any idea why?
[00:53:25] <joannac> how are you inserting?
[00:53:41] <josaliba> $conn = new MongoClient();
[00:53:41] <josaliba> $conn->selectDB("Test")->selectCollection("Names")->insert(["name" => "josaliba"]);
[00:53:47] <josaliba> that's all I do
[00:54:18] <joannac> and through the shell, you can see 2 documents with {name: "josaliba"} ?
[00:54:26] <josaliba> yes
[00:54:35] <josaliba> If I run the script once, it creates two documents
[00:55:47] <Boomtime> that will insert a new document each time it runs, so if you've checked this twice, even accidentally, you'll have two (or more) documents
[00:56:01] <josaliba> I know
[00:56:07] <josaliba> But If I drop the collection
[00:56:09] <Boomtime> try dropping the database between runs
[00:56:17] <josaliba> and run the script, then 2 documents are created
[00:56:36] <Boomtime> can you provide the entire script in a pastebin?
[00:56:45] <josaliba> this is the entire script
[00:57:23] <Boomtime> you are running this in an interactive PHP shell?
[00:57:37] <josaliba> no by the browser
[00:58:43] <Boomtime> in the shell, can you paste here the objectIDs of the two documents
[00:58:58] <josaliba> { "_id" : ObjectId("546e8e03fa4634eb3c8b4594"), "name" : "josaliba" }
[00:58:58] <josaliba> { "_id" : ObjectId("546e8e03fa4634ec3c8b4595"), "name" : "josaliba" }
[01:00:37] <Boomtime> interesting, you have more connections than you think
[01:01:13] <josaliba> that's what i presumed
[01:01:17] <josaliba> how can I check that?
[01:01:38] <Boomtime> the process ID changed
[01:01:39] <josaliba> or fix it ...
[01:01:57] <Boomtime> process ID that inserted the first one: eb3c
[01:02:10] <Boomtime> process ID that inserted the second one: ec3c
[01:02:34] <Boomtime> however.. their counter is the same, or sourced too closely to likely be coincidence
[01:03:00] <josaliba> I see
[01:03:00] <Boomtime> i.e. these documents were probably inserted by the same web-server, but your script was spawned twice
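What Boomtime is reading out of those two ObjectIds follows from the ObjectId layout (4-byte timestamp, 3-byte machine id, 2-byte process id, 3-byte counter). The timestamp part can be confirmed from the shell using the two IDs pasted above:

    // 4-byte timestamp | 3-byte machine id | 2-byte process id | 3-byte counter
    ObjectId("546e8e03fa4634eb3c8b4594").getTimestamp()   // ISODate("2014-11-21T00:57:39Z")
    ObjectId("546e8e03fa4634ec3c8b4595").getTimestamp()   // same second, different process id bytes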
[01:03:16] <Boomtime> is your browser doing a double connection?
[01:03:25] <josaliba> no
[01:03:29] <josaliba> but i just run htop
[01:03:41] <josaliba> and I see I have 2 mongodb processes running
[01:04:10] <Boomtime> uh..
[01:06:29] <josaliba> weird
[01:06:45] <josaliba> whenever I run service mongos start, 2 processes are created
[01:06:52] <josaliba> but when I run stop both are stopped
[01:08:22] <Boomtime> "mongos"?
[01:08:51] <Boomtime> emphasis on S, this is a real process but means a very different thing from mongoD
[01:10:43] <josaliba> i'm reinstalling everything to see
[01:14:25] <josaliba> didn't help, still have the same problem :(
[01:15:16] <Boomtime> you seem to have the same problem with everything - two mongodb instances will not cause two documents to be inserted
[01:16:00] <Boomtime> meanwhile, two instances of the app would not cause two documents to be inserted either - somehow the same script is being called twice
[01:16:20] <josaliba> if I echo something
[01:16:27] <josaliba> it appears only once
[01:17:02] <Boomtime> php echo goes back to the client that is rendering that output, it means nothing if the client only renders one stream - where would you look to see the other stream?
[01:17:36] <Boomtime> mongodb is seeing two different processes connect and each one inserts a document
[01:17:40] <josaliba> I don't know
[01:17:42] <Boomtime> that is what is happening
[01:18:29] <Boomtime> given that the process IDs appear to be sharing the same internal counter, i would suggest it is likely these are threads of the web-server
[01:18:47] <Boomtime> the timestamps are identical
[01:20:05] <Boomtime> all the evidence points at two threads on the web-server, either the script is being invoked twice by a client, or it is effectively running twice for some other reason
[01:20:55] <Boomtime> you can enable deeper logging on the mongodb to show if it confirms there are two connections
[01:21:25] <josaliba> I see
[03:44:19] <zzing> I am wondering if mongo would be a good fit for data that contains tables of tabular data — but each table has metadata such as a citation, caption, name — also where it might be desired to be able to search both the metadata and the tabular data.
[03:45:13] <zzing> (and the tabular data is almost certainly unique (down to the column names and types) for each table)
[03:54:44] <Boomtime> sounds perfect
[03:56:34] <zzing> I have been considering some models, from a relational to various 'nosql' types. I know exactly what to expect for the relational, but this data is not very relational.
[04:20:20] <zzing> Is there any way of recording/tracking changes to documents? I am thinking it is a useful idea, but not essential.
[04:22:57] <Boomtime> the oplog tracks changes to all documents by necessity though it is of a fixed size (circular buffer) and I suspect you mean more in a "historical versions" type of way
[04:23:34] <Boomtime> it would be up to you to build a versioning system for your data if that is what you want
[04:29:52] <zzing> Boomtime, sounds like an interesting challenge :-) I will probably figure something.
[04:33:51] <GothAlice> zzing: Versioning is indeed fun; at work we parse the oplog stream for updates of interest and record them permanently in a separate audit log collection.
[04:34:22] <GothAlice> So, sorta-versioning. ;)
[04:34:50] <zzing> Realistically, most of the updates will be small things, so it wouldn't be too bad to do a kind of diff to roll back changes
[04:35:05] <zzing> But luckily most of the data is enter once and query only
[04:35:44] <Boomtime> the oplog is a record of the operations that you perform, which might be close to a diff if you structure your updates properly
[04:36:12] <Boomtime> but the oplog collection itself will rollover after a while, so you'd need to record it permanently the way GothAlice describes
[04:36:30] <Boomtime> or you can roll your own some other way
[04:37:26] <zzing> I could do it more easily by recording changes when I make them through the REST interface
[04:37:42] <GothAlice> It lets me "seek back" through changes to individual fields and apply them in reverse (pushes/pops) or simply lets me jump to explicitly $set values from any point in time.
[04:38:15] <GothAlice> zzing: That was my initial approach, but I found parsing the oplog to be easier than updating waaaay too many queries. (I wanted auditing of *everything*, not just certain fields or certain collections.)
[04:39:03] <Boomtime> GothAlice: how do you undo a $set ?
[04:39:14] <GothAlice> By jumping to the $set prior to that one.
[04:39:28] <Boomtime> assuming you have everything, fair enough
[04:39:36] <GothAlice> (or to the original insert if we've run out of updates to scan ;)
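A minimal shell sketch of the oplog-reading idea GothAlice describes (her system does this in application code against the live oplog stream; the database, collection, and audit-collection names below are placeholders):

    // The oplog lives in the "local" database on replica set members.
    var oplog = db.getSiblingDB("local").oplog.rs;
    // Copy the most recent inserts/updates for one namespace into a permanent audit collection.
    oplog.find({ ns: "mydb.mycollection", op: { $in: ["i", "u"] } })
         .sort({ $natural: -1 })
         .limit(100)
         .forEach(function (entry) {
             // entry.o is the inserted document or the update modifier ($set, $push, ...);
             // for updates, entry.o2 identifies the document the modifier applied to.
             db.getSiblingDB("mydb").audit_log.insert(entry);
         });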
[04:41:33] <zzing> GothAlice, shall be interesting when I get there. I am at the paper stage planning everything that I will be doing first :P
[04:42:22] <GothAlice> There's also a nifty hack involving use of $comment to pass data (like, say, the "effective user" ObjectId of the browser session initiating the query) around. :3
[04:42:35] <GothAlice> $comment is awesome.
[04:43:17] <GothAlice> I resisted the ironic temptation to encode JSON data into it. ;)
[08:21:16] <foofoobar> Hi. I’m trying to filter in a collection for every row where „archived“ == false or „archived“ is not set, what is the idiomatic way to do this?
[08:22:12] <foofoobar> (archived should be a boolean)
[08:23:50] <Boomtime> { archived: { $not: true } }
[08:24:43] <Boomtime> note that this is a collection scan, so if the collection is large performance will probably suck
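Worth noting: the server rejects $not with a bare value (it expects an operator expression or regex), so this is more commonly written with $ne, which also matches documents where the field is missing. A sketch, with "items" standing in for foofoobar's collection:

    db.items.find({ archived: { $ne: true } })
    // or, spelling out the two cases explicitly:
    db.items.find({ $or: [ { archived: false }, { archived: { $exists: false } } ] })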
[11:01:55] <kexmex> i upgraded mongo, now i am getting this error:
[11:02:04] <kexmex> QueryFailure flag was getMore runner error: Overflow sort stage buffered data usage of 33556304 bytes exceeds internal limit of 33554432 bytes (response was { "$err" : "getMore runner error: Overflow sort stage buffered data usage of 33556304 bytes exceeds internal limit of 33554432 bytes", "code" : 17406 }).
[14:36:14] <kexmex> upgraded to MongoDB 2.6.5 :
[14:36:16] <kexmex> QueryFailure flag was getMore runner error: Overflow sort stage buffered data usage of 33556304 bytes exceeds internal limit of 33554432 bytes (response was { "$err" : "getMore runner error: Overflow sort stage buffered data usage of 33556304 bytes exceeds internal limit of 33554432 bytes", "code" : 17406 }).
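That error is the 32 MB in-memory sort limit (33,554,432 bytes) being hit by a sort that has no index to use. The usual ways around it are an index that supports the sort or, for aggregation pipelines, allowDiskUse; a sketch with placeholder collection/field names:

    // 1. Give the sorted query an index to walk, so no in-memory sort is needed:
    db.events.ensureIndex({ createdAt: -1 })
    db.events.find().sort({ createdAt: -1 })

    // 2. For aggregation pipelines, 2.6 can spill the sort to disk instead:
    db.events.aggregate([ { $sort: { createdAt: -1 } } ], { allowDiskUse: true })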
[15:25:14] <winem_> hi, just a quick Q. the TTL is part of the definition of the index and expiresAt can be set individually per document, right?
[15:26:39] <cheeser> come again?
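What winem_ seems to be describing is MongoDB's per-document expiry: the TTL is declared on the index, and with expireAfterSeconds set to 0 each document expires at the date stored in its own field. A sketch with placeholder names:

    db.sessions.ensureIndex({ expireAt: 1 }, { expireAfterSeconds: 0 })
    // this document is removed by the TTL monitor shortly after the given date
    db.sessions.insert({ user: "winem_", expireAt: new Date("2014-12-01T00:00:00Z") })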
[15:37:07] <SahanH> what's the ideal document, http://d.pr/n/1b78j+ or http://d.pr/n/ydpC+
[15:42:48] <Mmike> So, I'm trying to do rs.isMaster() in python, and then parse the output, but I get this ISODate(...)... is there a way to get rid of it?
[15:43:20] <brianseeders> anyone have experience with https://github.com/dropbox/hydra and MongoDB 2.6?
[15:43:43] <brianseeders> Trying to figure out a way to migrate a large dataset with minimal downtime
[15:44:20] <brianseeders> That doesn't use replication
[15:48:38] <Mmike> brianseeders: why not? Use replicasets
[15:52:57] <brianseeders> because our existing dataset doesn't work with replicasets
[15:53:41] <brianseeders> I've tried it every major version release, and it's never able to do the initial data sync
[15:55:43] <brianseeders> For example, when I tried it on 2.4.x, the collections would get duplicate key errors on the initial sync and just stop
[15:56:00] <brianseeders> But if you query the collection for the stated key, there's only one document
[16:44:31] <theRoUS> Derick: i have a db with a lot of records concerning nagios alerts. i'd like to do some data reduction/grouping by type, but the only field with the relevant info has '<hostname>/<check> <status>'. i'd like to select and group on just the '<check> <status>' portion.
[16:44:57] <theRoUS> is there a way to do that? that i can experiment with the in the mongo tool before going to code?
[16:52:04] <Derick> theRoUS: is it stored as a string or as multiple fields?
[17:21:15] <theRoUS> Derick: string
[17:21:38] <Derick> theRoUS: I'm afraid you'll have to do it in code then
[17:21:52] <theRoUS> Derick: meh. thanks!
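"In code" can at least be prototyped directly in the mongo shell, since the shell is just JavaScript. A rough sketch of the grouping, assuming the string lives in a field called "event" in a collection called "alerts" (both names are made up):

    var counts = {};
    db.alerts.find({}, { event: 1 }).forEach(function (doc) {
        // drop everything up to and including the first "/", keeping "<check> <status>"
        var key = doc.event.replace(/^[^\/]*\//, "");
        counts[key] = (counts[key] || 0) + 1;
    });
    printjson(counts);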
[17:24:22] <drorh> Hi. I come from a relational background and have a strong urge to fall back to mysql, although the project I'm doing dictates mongodb... I was wondering if there is a "when not to use mongo" sort of article to read...?
[17:26:38] <kali> drorh: why don't you tell us what you're struggling with instead ?
[17:30:24] <drorh> kali, In the project, which deals with flight trips, I consume 3 Apis, each of which has their own structure and identifiers to things which are potentially the same, other than pricing, potentially.
[17:31:54] <drorh> kali, the front end is to consume this mongo dB regardless of which api it came from
[17:33:02] <kali> so far so good, i guess :)
[17:33:34] <sub_pop> that sounds even more like a valid case for mongodb over a schema-strict database
[17:33:58] <kali> +1
[17:34:11] <drorh> Btw sorry for slow responses... I'm on an iPad currently......
[17:35:52] <drorh> sub_pop: that's probably why the spec says mongodb... :)
[17:37:20] <drorh> Although it keeps getting pseudo relational whenever I build it in my head!!!!
[17:41:23] <GothAlice> drorh: http://irclogger.com/.mongodb/2014-11-19#1416454077-1416452929 :)
[17:41:50] <GothAlice> (Note the Java article linked there; it does a good job summarizing how to think about MongoDB structuring from a relational background.)
[17:42:07] <drorh> Im thinking: each document represents a unique trip, and embeds 3 arrays of documents, each of which is a list of the potential configurations per api. Am I on the right track?
[17:42:33] <GothAlice> There are some constraints when you have multiple lists in a document; you can effectively only deeply query one at a time, if that's an issue.
[17:42:43] <GothAlice> ($elemMatch)
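For reference, the $elemMatch form GothAlice is alluding to, deep-matching against one of the per-API arrays (the collection and field names are guesses at drorh's schema):

    // match trips where at least one option in the "api_a" array is non-stop and under 500
    db.trips.find({ api_a: { $elemMatch: { stops: 0, price: { $lt: 500 } } } })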
[17:43:52] <drorh> Hey GothAlice :)
[17:44:34] <GothAlice> Hey-o. :)
[17:46:14] <drorh> Im not sure I'm going to need to deep query.
[17:48:05] <GothAlice> Perfect, then such an arrangement can certainly work out, though I'd need to see schema ("arrays of documents" sounds scary ;) to see if this is the best fit for your use case. (I.e., what questions do you need to ask of the data?)
[17:49:54] <GothAlice> Oh, your old pastebin links still work.
[17:49:55] <GothAlice> Durr.
[17:50:06] <GothAlice> (You forgot to set them to expire. ;)
[17:50:50] <GothAlice> drorh: http://pastebin.com/HL9Ker4A — still a representative sample?
[17:52:29] <drorh> Client froze on me...
[17:52:35] <GothAlice> drorh: http://pastebin.com/HL9Ker4A — still a representative sample?
[17:53:06] <drorh> I'm crippled on this iPad !!!
[17:53:25] <drorh> Sec Alice
[17:54:04] <GothAlice> I used an iPad for three months as my primary development machine… just needs the right apps. ;) (TextExpander, Colloquy, Textastic programmers' editor, Cathode SSH client, Transmit for general purpose connectivity, etc.)
[17:54:07] <GothAlice> ^_^
[18:01:25] <ni291187> sigh
[18:01:43] <ni291187> lol
[18:07:31] <drorh> that's a bit better
[18:08:04] <GothAlice> drorh: One really important thing when using devices with intermittent connectivity: run an IRC bouncer somewhere. Then you won't miss messages if you lose connection. :)
[18:08:14] <GothAlice> (It's why I basically never go offline.)
[18:08:54] <drorh> GothAlice i see. not now :)
[18:10:00] <drorh> GothAlice the pastebin u linked earlier is not so relevant because it has the structure of only 1 of the Apis
[18:14:12] <drorh> GothAlice can I have that article u linked earlier again? :)
[18:14:30] <GothAlice> drorh: http://www.javaworld.com/article/2088406/enterprise-java/how-to-screw-up-your-mongodb-schema-design.html
[18:14:32] <GothAlice> :)
[18:14:41] <drorh> lol
[18:19:35] <brianseeders> anybody seen this on initial data sync for a new replica?
[18:19:43] <brianseeders> 2014-11-21T18:06:03.929+0000 [rsSync] build index on: forms.forms properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "forms.forms" }
[18:19:43] <brianseeders> 2014-11-21T18:06:11.514+0000 [rsSync] build index done. scanned 2187301 total records. 7.584 secs
[18:19:43] <brianseeders> 2014-11-21T18:06:11.642+0000 [rsSync] replSet initial sync: error while cloning forms. failed to create collection "forms.forms": collection already exists. sleeping 5 minutes
[18:19:51] <drorh> GothAlice the simplest question I'd ask the data is give me the cheapest trip information for a specified destination (origin is fixed) and date interval
[18:24:27] <drorh> I have some ideas
[18:28:21] <GothAlice> brianseeders : Did you start your RS from a clean state when first setting it up, or did you migrate from standalone to RS?
[18:28:36] <brianseeders> migrate from standalone
[18:29:50] <GothAlice> brianseeders: What steps, exactly, did you follow to migrate that?
[18:31:42] <brianseeders> Well, I went by this: http://docs.mongodb.org/manual/tutorial/convert-standalone-to-replica-set/
[18:32:37] <GothAlice> brianseeders: The error messages you had were on a fresh secondary you were spinning up (as part of the "Expand the Replica Set" step?)
[18:33:12] <brianseeders> I stopped mongod on the standalone, edited my conf file to include replset=rs0, started it back up, and did rs.initiate()
[18:33:56] <brianseeders> then i created a new instance, edited /etc/hosts to ensure that the host for the first replica would resolve, added replset=rs0 to the config, and started it up
[18:34:06] <GothAlice> So far so good.
[18:34:19] <GothAlice> (Good catch on the name resolving; it's a common source of problems.)
[18:34:30] <brianseeders> then I did rs.add("ip_of_second_instance:port") from the first instance
[18:34:40] <GothAlice> Still good.
[18:35:14] <brianseeders> the new replica fetched several databases/collections fine, but when it got to that one, that error appeared
[18:35:20] <GothAlice> (Though shouldn't you be using a hostname instead of IP there? ;)
[18:35:26] <brianseeders> and then it started the sync again from scratch
[18:35:58] <drorh> GothAlice ur silence is perfectly understandable:). rest assured I'll be back more co concerte
[18:36:44] <drorh> wow I managed to stutter in a text message!
[18:36:45] <GothAlice> drorh: The "final destination" ordered by price is a relatively simple query. What types of queries would you need to make against the "api" data?
[18:36:54] <GothAlice> ^_^
[18:37:37] <drorh> GothAlice this is where the details step in
[18:39:04] <brianseeders> interesting:
[18:39:16] <brianseeders> rs0:PRIMARY> db.getCollectionNames()
[18:39:17] <brianseeders> [ "forms", "forms", "system.indexes" ]
[18:39:22] <GothAlice> Wau.
[18:39:27] <GothAlice> )o(
[18:40:05] <GothAlice> Correcting that is beyond my ability to assist; like having the same key twice in a document (which is apparently possible) it's something I've never seen before. :/
[18:40:47] <brianseeders> yeah, it's weird
[18:41:00] <drorh> GothAlice the admin, as an expert in the field, will have a panel through which he configures the frequency (or complete lack of) of destination+api+timeframe calls to the remote service
[18:41:03] <GothAlice> Darn, Boomtime isn't around.
[18:41:15] <drorh> GothAlice was this clear...?
[18:41:32] <GothAlice> Roughly; it sounds like we do something similar at work to schedule our web scraping jobs.
[18:41:53] <drorh> ok good
[18:42:17] <drorh> GothAlice this is the easy part the way I see it
[18:43:38] <GothAlice> The key question for me is: when accessing a stored route, do you need access to all of the bound APIs? Conversely, when accessing the information for a *single* API, do you need access to the rest? The route itself? (Likely "no, no, yes")
[18:44:32] <brianseeders> So I did a repair on that database, and getCollectionNames() now only returns one instance of "forms", but db.stats() still reports too many collections
[18:44:42] <brianseeders> Wonder what will happen now
[18:45:09] <GothAlice> brianseeders: If I encountered this, the first thing I'd do is a very thorough mongodump, then a filesystem snapshot with mongod offline. Just in case.
[18:45:30] <drorh> GothAlice I'm not sure yet if the arch spec says that the agent (the ongoing script that fills mongo) is the data source to the front, or there is an rdb in between
[18:45:30] <GothAlice> (There's no such thing as too many backups.)
[18:45:47] <brianseeders> I'm working off of a cloned snapshot :)
[18:45:56] <GothAlice> brianseeders: :) That makes me extremely happy.
[18:47:34] <drorh> GothAlice the front is api agnostic, up until billing comes in
[18:48:03] <GothAlice> drorh: I'm not entirely clear on the impact of your last two statements.
[18:48:03] <drorh> makes sense?
[18:48:42] <GothAlice> Using one database engine to populate another is generally excessive.
[18:48:51] <drorh> i Agrze
[18:48:55] <drorh> damn
[18:48:59] <drorh> I agree
[18:49:04] <GothAlice> (Excluding caches, but caches can lead to madness if not implemented carefully.)
[18:49:51] <GothAlice> Is the billing note due to accounting processes generally needing transactional safety?
[18:50:32] <drorh> GothAlice lets omit that statement
[18:50:44] <GothAlice> Ok. :)
[18:51:23] <drorh> GothAlice the 2nd statement is still problematic?
[18:52:47] <GothAlice> From what I can tell you have several components: A front-end doing costing analysis for trip planning (neat, btw!), a back-end runner that continually refreshes the data from external APIs, and a management interface to control the behaviour of the runner, correct?
[18:52:49] <drorh> GothAlice the billing note should have been named booking note
[18:52:56] <GothAlice> Ah. :)
[18:54:28] <GothAlice> As an interesting note, if you *do* need to integrate a relational database for (one would hope would be an excellent reason like transactional safety ;) I would recommend Postgres; if you use that you can connect Postgres to MongoDB and include MongoDB data in your Postgres queries without having to duplicate the data.
[18:54:32] <drorh> GothAlice correct.
[18:54:57] <drorh> sorry for the lag I was ask for a min
[18:55:00] <GothAlice> (They work quite well together; though you *will* have to map your nice rich documents to flat tables.)
[18:55:10] <drorh> ask
[18:55:14] <drorh> afk
[18:55:19] <GothAlice> k
[18:55:51] <drorh> GothAlice awesome note
[18:59:35] <drorh> GothAlice the challenge/question remains: front would be api agnostic? or would it be api adaptive?
[19:00:04] <drorh> question is clear?
[19:00:14] <GothAlice> I'd have to ask: which "side" of the API does the question relate to? Back-end database interface? Front-end public API published by your app?
[19:01:57] <GothAlice> One generally writes code against a database layer (ORM, ODM, etc.) and some of these layers are truly agnostic. (For an example from Python, SQLAlchemy supports basically all forms of relational back-end, so you can freely migrate if need be without modifying your code.)
[19:02:13] <GothAlice> Unfortunately because of the fundamental differences in approach between document and relational databases, being able to be agnostic to *that* is much harder.
[19:03:00] <GothAlice> (You could bind MongoDB collections entirely into Postgres and use them completely through Postgres… but you're completely defeating the point of having MongoDB at all in that setup, you might as well drop it and just use Postgres.)
[19:03:28] <drorh> GothAlice by api/s I only mean the remote services consumed by the agent. by front-end I mean queries by an end user, roughly and succinctly
[19:04:31] <GothAlice> Right. The front-end is completely separate from the agent pulling in the data; the agent's job would be to transform foreign data structures returned by API calls into the format your front-end would expect.
[19:05:03] <GothAlice> (You likely would need specialized conversion routines, you write, for each external API you want your agent to consume.)
[19:05:24] <GothAlice> https://gist.github.com/amcgregor/7be2ec27adc80c9fafa1#file-sync-cvm-py-L21-L54 is an example of one of our API adapters. :)
[19:05:45] <GothAlice> (We're pulling in job data instead of travel data, but the idea is basically the same.)
[19:07:08] <ianp> Spring-data is like that for java based platforms. So grails (groovy-based rails thing on java) uses spring data and you can use it with mongo or SQL db's
[19:07:24] <ianp> It's quite nice considering the complexity of the problem
[19:08:01] <ianp> anyway, that's why the 'DAO' pattern exists
[19:08:08] <drorh> ianp thnx!
[19:08:13] <ianp> you put all your db specific stuff in one layer
[19:08:39] <ianp> not sure what you're thanking me for .. but yw :>
[19:08:49] <GothAlice> ianp: Heh, the last project we used DAO on had most of our unit tests testing the abstraction layer instead of our own code. ¬_¬
[19:08:54] <ianp> relevant : http://en.wikipedia.org/wiki/Law_of_Demeter
[19:09:17] <drorh> ianp the info...
[19:09:19] <ianp> hehe, "we have high unit test coverage, jsut look at these reports!" *proud face*
[19:10:13] <ianp> actually it'd still be low in that case
[19:10:17] <GothAlice> ianp: Yup.
[19:10:25] <ianp> but still an example of people not thinking about what they're actually doing
[19:11:15] <GothAlice> "Look, I just wrote this neat test that makes sure we can store some of our more complicated structures and read them back out!" "Uhm… if we can't insert and fetch we've already lost. Good job."
[19:13:46] <drorh> I'm not sure this was referenced to me, but if it was then ouch
[19:14:18] <GothAlice> Heh, no, general commentary on using abstraction and spending too much time fretting that the abstraction even works. (The abstraction layer should have its own tests…)
[19:15:03] <drorh> oh i see
[19:25:50] <drorh> GothAlice I had no doubt about having conversion routines, but when u said so explicitly I realized I'm going to implement 2 collections. 1 maintains isomorphism with the Apis, and the other fills up (if not existent for the query from the front) with converted data on demad and sends it to the front
[19:26:20] <drorh> demand*
[19:27:48] <drorh> GothAlice sounds good?
[19:28:05] <GothAlice> Why not simply process all data as it arrives and include whatever tracking is needed (update timestamp, tagged identifier, etc.) for the agent to track state?
[19:28:40] <GothAlice> (For example, our job data pulls all jobs in, but only inserts "new" ones, checking modification times to "update" existing ones if needed based on a unique combination of (company, job_reference).)
[19:30:10] <GothAlice> That happens at https://gist.github.com/amcgregor/7be2ec27adc80c9fafa1#file-source-syndicated-py-L65-L69 in our codebase (for the duplicate check; this code is a snapshot from before we added updating.)
[19:33:00] <drorh> i think that would require the Apis to be mutually isomorphic, else a non essential abstraction is needed
[19:34:03] <drorh> the conversions would get rid of data and structure not essential for the front
[19:34:47] <GothAlice> Hmm. Surely there is a unique key buried somewhere in that data. ;) As long as the number of automatic translations needed on-demand is bounded (i.e. I can't accidentally trigger 1,000 on-demand translations with a single request) such an approach should be fine.
[19:35:16] <GothAlice> However this does add data processing to the front-end.
[19:36:36] <drorh> GothAlice too many unique keys at different places for different Apis. that's exactly the issue :)
[19:38:14] <drorh> GothAlice what new processings does this introduce to the front?
[19:38:47] <GothAlice> But MongoDB doesn't have to care. :) {foreign_key: {foo: 27, bar: "green"}, …} — unique index on foreign_key and you can have a per-API subdocument schema there.
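A sketch of the layout GothAlice describes ("trips" is a placeholder collection name): the unique index covers the whole foreign_key subdocument, and each API can store a differently shaped subdocument there:

    db.trips.ensureIndex({ foreign_key: 1 }, { unique: true })
    db.trips.insert({ foreign_key: { api: "a", id: 27 },       price: 480 })
    db.trips.insert({ foreign_key: { api: "b", ref: "green" }, price: 512 })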
[19:40:38] <drorh> GothAlice i see
[19:40:42] <GothAlice> The "on-demand" aspect of conversion would mean conversions are triggered by the front-end.
[19:43:32] <drorh> the front would just ask for converted data. most of the time some other instance of front already implicitly made sure the converted data already exists
[19:44:22] <laurensvanpoucke> hello
[19:44:24] <laurensvanpoucke> I have a MEAN stack app. So a mongodb database. What is a good / best option to host images?
[19:44:38] <drorh> GothAlice it's a dilemma. but having a dilemma in existence is progress right now lol
[19:49:05] <GothAlice> laurensvanpoucke: We store images and PDFs in GridFS.
[19:49:11] <GothAlice> (A component of MongoDB.)
[19:49:36] <GothAlice> laurensvanpoucke: See: http://docs.mongodb.org/manual/core/gridfs/
[19:50:23] <laurensvanpoucke> @GothAlice so you can store a whole image in mongodb without problem ?
[19:50:30] <drorh> GothAlice thanks a million for all the help. I'd probably bbl soon
[19:50:37] <GothAlice> laurensvanpoucke: I personally store 25TiB of data in MongoDB. :)
[19:50:38] <laurensvanpoucke> or is this gist also a good idea ? https://gist.github.com/aheckmann/2408370
[19:50:57] <GothAlice> (That 25TiB includes books, images, movies, music, a copy of Wikipedia, …)
[19:51:04] <drorh> lol
[19:51:16] <GothAlice> drorh: And I'm not joking. ;)
[19:51:18] <GothAlice> lol
[19:51:27] <drorh> I know .
[19:53:06] <GothAlice> laurensvanpoucke: You certainly can use a BLOB-like field stored with your main document. The limit used to be 4MB per document, so separating it was more of a concern in the past. (Now it's 16MB per document.) GridFS handles splitting large files up for you. It's also often a good idea to separate it out to speed up queries (i.e. if MongoDB needs to scan the table to answer a query, having the BLOBs in there will slow it down.)
[19:53:15] <GothAlice> Generally one stores metadata separate from the data. :)
[19:53:41] <GothAlice> s/table/collection/
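GridFS itself is just two collections; once a driver (or the mongofiles tool) has stored a file, the pieces are visible from the shell. A sketch, with "logo.png" as a made-up filename:

    db.fs.files.findOne({ filename: "logo.png" })   // per-file metadata (length, chunkSize, md5, ...)
    // number of fixed-size chunks the file was split into:
    db.fs.chunks.count({ files_id: db.fs.files.findOne({ filename: "logo.png" })._id })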
[19:54:20] <laurensvanpoucke> so not store the acutal image for example on amazon s3 and just a link to that file in mongodb?
[19:54:56] <GothAlice> That's another approach, yes. Has the benefit of providing URL-based access. (I.e. no need to stream data out of MongoDB.)
[19:55:35] <laurensvanpoucke> I see
[19:55:36] <GothAlice> My dataset is too large to do that with; the cost would be insane.
[19:55:43] <laurensvanpoucke> cost?
[19:55:49] <GothAlice> S3 costs per gigabyte and for transfer.
[19:55:53] <laurensvanpoucke> ow ok
[19:56:21] <laurensvanpoucke> and the gist I showed you, what do you think about that ?
[19:56:25] <GothAlice> (I priced it out: by not storing my data there I can afford to use Drobo disk arrays and replace every drive in each of the three arrays every month!)
[19:57:02] <laurensvanpoucke> wow
[19:57:19] <laurensvanpoucke> Now I'm storing data on compose.io
[19:57:30] <GothAlice> With the initial cost of buying the arrays paid off in four months.
[19:58:12] <GothAlice> $18/gb/mo. That'd cost me just under half a million dollars per month. XD
[19:58:28] <laurensvanpoucke> how much gb on storage you have ? :p
[19:58:43] <GothAlice> 25 TiB = 25,600 GiB
[19:58:55] <laurensvanpoucke> wow :p insane... you work for big company ? :p
[19:59:11] <GothAlice> … that's my personal dataset running in a rack at my apartment.
[19:59:45] <laurensvanpoucke> oh not even a web project :p
[20:01:19] <GothAlice> http://cl.ly/image/0t1x2Q2L1u0E was a somewhat older comparison chart.
[20:02:29] <laurensvanpoucke> big amounts
[20:03:37] <brianseeders> 2014-11-21T20:01:16.828+0000 [rsSync] replSet initial sync done
[20:03:37] <brianseeders> 2014-11-21T20:01:17.831+0000 [rsSync] replSet SECONDARY
[20:03:44] <brianseeders> well, that's cool
[20:04:06] <GothAlice> After a certain point, using SaaS/PaaS/*aaS makes no financial sense.
[20:04:19] <GothAlice> brianseeders: I'm glad you got that fixed. T'was a crazy situation to be in.
[20:07:14] <brianseeders> Yes, the repair fixed it and the data seems to be okay
[20:09:14] <brianseeders> I'm glad to see that the erroneous duplicate key errors that I encountered when I originally tried to make this a replica set (2 years ago?) are gone
[20:10:00] <brianseeders> I tried to migrate away from a standalone instance back then with the same dataset and just gave up
[20:10:33] <brianseeders> That was when 2.4 first came out
[20:10:48] <GothAlice> Things have come a long way since 2.4.0. :)
[20:16:25] <brianseeders> Now I just need to figure out what the new setup should be and an actual migration plan
[20:17:01] <GothAlice> brianseeders: https://gist.github.com/amcgregor/c33da0d76350f7018875 :)
[20:17:29] <GothAlice> A handy script that sets up (a local) sharded replica set with 3x3 members and authentication enabled. Spread that across multiple hosts and you have an instant setup. :)
[20:18:15] <GothAlice> Also, MMS is great for setting up new clusters. (But you need to set them up initially with MMS, I believe.)
[20:18:54] <cheeser> depends on what you want to do with them
[20:19:34] <cheeser> monitoring/backup can be done on preexisting clusters. provisioning not so much.
[20:19:49] <GothAlice> That's the ticket. I keep forgetting the word "provision". XD
[20:21:15] <brianseeders> What I ultimately will want to do is set up a new cluster, replicate to it from the standalone instance (which is the only instance receiving reads/writes), then cut over to the new cluster
[20:21:22] <drorh> GothAlice the arch spec pretty much contradicts what the client said 2 days ago. now I have a tactical opportunity to be rid of the rdb in the middle :)
[20:21:37] <GothAlice> drorh: \o/
[20:21:57] <GothAlice> drorh: One of the reasons I love design documents so much. Clients "say" many things… ;)
[20:22:10] <brianseeders> I'm guessing Priority 0 members with some re-configuration at cutover time
[20:22:55] <GothAlice> brianseeders: You could do it pretty much as you describe. Turn the current solo server into a RS primary, add the other hosts in the new cluster, then just offline the original primary during a period of idle insert/update activity.
[20:23:03] <drorh> hehe it's great
[20:23:12] <drorh> imh
[20:23:14] <drorh> o
[20:23:16] <GothAlice> brianseeders: The new "cluster" will take over.
[20:24:02] <brianseeders> I would just need to re-configure them to be priority: 1, I assume
[20:25:07] <GothAlice> brianseeders: I don't think you'd even need to worry about having priorities in the first place.
[20:25:13] <drorh> GothAlice I wanna show u the arch sketch.. it appears like nonsense and I want to confirm this. k?
[20:25:27] <GothAlice> brianseeders: The solo will become primary, and will remain so until it's brought offline.
[20:25:36] <GothAlice> drorh: Sure! I once joined a project part-way through the specification process; they had opted to use zeromq, rabbitmq, redis, membase, and postgres. After I cleaned the coffee off my keyboard I offered to replace *all* of that with MongoDB. ;)
[20:25:48] <drorh> hahah
[20:26:01] <drorh> sec
[20:26:46] <GothAlice> (Why worry about scaling five different services?! Blew my mind: and they even had a reason for the apparent duplication of queues: one had persistence, the other was faster. ¬_¬)
[20:26:49] <brianseeders> Well, I'm trying to avoid a situation where the solo becomes inaccessible because of problems on the new instances or something
[20:27:06] <brianseeders> During testing, I brought down that second instance and the original one demoted itself
[20:27:15] <GothAlice> Ah, that's because you didn't have an arbiter.
[20:27:44] <GothAlice> Remember: a primary can only exist if some node, somewhere can reach a "majority" of the cluster. With two hosts, if one goes down there is no way to get 50.1% of the vote.
[20:28:13] <GothAlice> An arbiter lets the hosts figure out if they lost *their* connection, or if the other host did.
[20:28:55] <brianseeders> Right, that's why I wanted to make the new ones priority 0
[20:29:19] <brianseeders> Just to try to not introduce more places for problems in the old environment before the cutover
[20:29:42] <GothAlice> The correct solution is to have an arbiter; you can even co-host it on the same machine as the primary if you wish, for the purposes of avoiding read-only mode if the secondaries go away.
[20:30:04] <GothAlice> The end goal is to migrate to a clustered setup, though, isn't it?
[20:30:29] <GothAlice> (Don't co-host if you want that original primary to stick around for extended periods, though.)
[20:30:42] <brianseeders> Yes
[20:30:52] <brianseeders> As part of a migration of our entire infrastructure somewhere else
[20:32:37] <GothAlice> I'd perform the following process for the migration: 1. Promote solo to RS primary. 2. Add one secondary in the new site and an arbiter on the same host as the primary. 3. Let initial replication complete. If the secondary goes away at this point, the primary will keep chugging along as if nothing really matters. ;)
[20:33:43] <GothAlice> 4. Shut down the secondary in the new location. Perform a clone of the /var/lib/mongodb to two new hosts. (This will eliminate initial seeding of the new secondaries and save some time. ;) 5. Bring all three new hosts back online. 6. When satisfied, remove the old primary and arbiter, new cluster will take over.
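The shell side of steps 1 and 2 is roughly this (hostnames are placeholders; step 1 also means restarting the old standalone mongod with a replSet name in its config first):

    rs.initiate()                                // on the old standalone, which becomes the primary
    rs.add("newsite-db1.example.com:27017")      // first secondary at the new site
    rs.addArb("oldsite-db0.example.com:27018")   // arbiter co-hosted with the primary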
[20:37:27] <GothAlice> brianseeders: Was that… clear or informative?
[20:41:52] <brianseeders> Yeah, sorry, multi-tasking
[20:41:58] <brianseeders> That is clear and probably what I will do
[20:45:41] <GothAlice> The trick with that process is if the secondaries become inaccessible during the migration, the primary will simply keep functioning. And, after step 2, before step 5, if the primary goes away, that first secondary will go into read-only mode (it won't become primary).
[20:46:34] <GothAlice> On step 5, though, the new cluster is ready and waiting to take over.
[20:50:22] <brianseeders> what prevents the secondary from being promoted?
[20:51:28] <GothAlice> That if the box hosting the primary and arbiter can't be contacted, the secondary can't reach > 50% of the hosts and isn't confident enough to take over.
[20:51:42] <GothAlice> (I.e. it can't verify if *it* lost its connection, or if everyone else did.)
[20:52:01] <brianseeders> Oh I see, I was thinking if the main instance went offline, but not the whole machine
[20:52:13] <brianseeders> in which case the secondary would promote
[20:52:27] <GothAlice> True; process-level failures would throw a wrench into this.
[20:53:00] <GothAlice> However the window for failure is only during the initial sync of the first secondary—how long does it take to sync your data the first time?
[20:53:27] <brianseeders> It will probably take a few hours
[20:53:47] <GothAlice> Have you ever had the primary mongod process just "go away" (without the host machine itself having connectivity issues)?
[20:54:10] <brianseeders> A few times a week
[20:54:13] <GothAlice> wat
[20:54:21] <brianseeders> Goes away with no error messages
[20:54:26] <brianseeders> We just have to deal with it
[20:55:03] <GothAlice> brianseeders: I have hosts with 900+ days of uptime and service-level uptime reaching six nines. Things spontaneously disappearing is a Bad Situation™.
[20:58:20] <brianseeders> Yeah... It's not getting OOM-killed or anything
[20:58:35] <brianseeders> and the log just shows normal operation, followed by start-up messages
[21:00:01] <GothAlice> You'll still need the arbiter, but adding that first secondary as priority zero (then changing back after) during the initial replication will prevent it from ever becoming primary in the event the primary mongod process dies but the arbiter doesn't.
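One way to do the priority-zero step GothAlice describes (the hostname and member index are placeholders):

    rs.add({ _id: 1, host: "newsite-db1.example.com:27017", priority: 0 })

    // later, once the new site is ready to take over, raise the priority back up:
    var cfg = rs.conf()
    cfg.members[1].priority = 1
    rs.reconfig(cfg)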
[21:00:11] <GothAlice> I'm glad you're replacing that box, though!
[21:00:23] <brianseeders> I'm sure it doesn't help that the data is ancient
[21:00:51] <brianseeders> Migrations/Upgrades without re-importing or replicating the data
[21:01:50] <brianseeders> The data probably started on 1.8 or maybe even 1.6
[21:12:58] <GothAlice> ^_^ My crazy home dataset started on 1.6. (The data's been gathering since 2000/2001, originally in a relational database.)
[21:19:07] <brianseeders> I'm out for today, good talking and thanks for the help
[21:20:59] <rgenito> is there a way to have a mongo client without the mongo server?
[21:21:09] <rgenito> just checking ... i'm on ubuntu and i'm thinking to apt-get install mongo-clients
[21:21:27] <rgenito> ...just don't wanna have to do extra work @.@
[21:22:11] <GothAlice> rgenito: If you want to be particularly "conservative of time" you can just download the precompiled binaries from mongodb.org, extract them somewhere, and pick out the tools you want.
[21:22:40] <GothAlice> rgenito: Generally it's advisable to use your system's package manager, of course, which would include the daemon program… which you just don't have to run if you don't want it.
[21:22:49] <rgenito> ah ty
[21:23:12] <Vile> rgenito: http://docs.mongodb.org/manual/tutorial/install-mongodb-on-ubuntu/
[21:23:33] <Vile> You'll need shell & tools
[21:24:07] <GothAlice> ^_^ I always forget about metapackages.
[21:28:25] <rgenito> ty!
[21:28:28] <rgenito> ty Vile !
[21:28:36] <rgenito> damn you mongo people have morbid names :)
[21:28:48] <GothAlice> XP
[21:29:38] <drorh> should I "int codify" all my indexes?
[21:30:41] <GothAlice> drorh: … what exactly do you mean by "int codify"?
[21:30:48] <Vile> Why do you want to do so? Is there any benefit to that in your case? Are there drawbacks?
[21:31:39] <drorh> well computers prefer ints over strings of ints to search by
[21:32:22] <Vile> Computers have no preferences. After all, strings are just long integers
[21:32:43] <Vile> But architects and developers and users do
[21:33:41] <drorh> for example if I have a destination city index, I can use the 3 letter code i.e LON, but instead I can define application wide that 1 means LON etc
[21:34:26] <GothAlice> drorh: If you use hashed indexes, strings of any length are turned into a number internally. ;)
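The hashed index mentioned there, sketched with placeholder names; equality lookups can use it, range queries cannot:

    db.trips.ensureIndex({ destination: "hashed" })
    db.trips.find({ destination: "LON" })           // can use the hashed index
    db.trips.find({ destination: { $gt: "L" } })    // cannot; a range needs a regular index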
[21:34:31] <Vile> What are you trying to achieve?
[21:34:41] <drorh> performance
[21:34:45] <GothAlice> I.e. you're only making it harder for you to interpret your own data by obscuring those codes with custom integers.
[21:35:17] <drorh> GothAlice that is fantastic Ty!!!!
[21:35:45] <Vile> Then start with as is and afterwards change the parts that are called/accessed most often
[21:35:58] <GothAlice> Optimization without measurement is by definition premature.
[21:36:21] <Vile> It might end up being something completely different from what you expect
[21:36:30] <GothAlice> I.e. write your app in the most readable way possible, benchmark the crap out of it, find the slow bits and optimize *those*.
[21:37:08] <drorh> go ha
[21:37:13] <drorh> goch
[21:37:17] <drorh> a
[21:37:18] <GothAlice> Words.
[21:37:20] <drorh> era
[21:37:20] <GothAlice> XP
[21:37:23] <drorh> erm
[21:37:27] <GothAlice> Stop that. :P
[21:37:50] <drorh> yeah.
[21:39:53] <drorh> is it possible to hash a compound index into a single int internally?
[21:40:48] <GothAlice> drorh: No. http://docs.mongodb.org/manual/core/index-compound/#index-type-compound — see the yellow warning box.
[21:41:02] <drorh> k
[21:42:57] <GothAlice> Generally, don't worry about the internal mechanics of MongoDB… use it in the most obvious way first and go from there.
[21:49:32] <drorh> GothAlice ever had a peek at the source code?
[21:52:24] <drorh> I see
[21:56:33] <drorh> I remember source diving MySQL once. never did that again lol. I'm guessing mongodb would not scare me away so bad. we'll see :)
[21:57:48] <GothAlice> drorh: I actually *had* to source dive MySQL in order to reverse engineer the on-disk InnoDB format after a multi-zone AWS failure corrupted things. Took me 36 hours—straight—to recover that data.
[21:58:03] <GothAlice> (And that was immediately prior to dental surgery to extract my wisdom teeth.)
[21:58:04] <GothAlice> XD
[21:58:19] <drorh> lol
[22:08:11] <drorh> "You may not specify a unique constraint on a hashed index."
[22:08:24] <rgenito> is there a way to load a mongo db from the mongo shell?
[22:08:36] <rgenito> i was just looking at "load()" and tried loading this file... "transactions.bson"
[22:12:34] <rgenito> =[
[22:18:00] <drorh> GothAlice do u consider the online official docs and reference sufficient for my needs as you understand them, or is there maybe a really good book I can read?
[22:19:17] <GothAlice> The official documentation is quite good, though it's important to sometimes hunt around as the best description for a given task may be in a tutorial instead of reference. I also use search engines quite heavily, seeking out "real-world" use cases on blogs and things that fit my needs.
[22:20:22] <drorh> I see
[22:22:28] <GothAlice> "If it's a good idea, someone else has probably already written it." ;)
[22:23:11] <drorh> yep
[23:17:01] <mlg9000> Hi all, anyone have any experience troubleshooting performance in a globally distributed replica set?
[23:17:56] <mlg9000> looking for tips that might improve performance
[23:18:26] <GothAlice> mlg9000: http://docs.mongodb.org/manual/data-center-awareness/
[23:18:32] <GothAlice> :)
[23:19:46] <mlg9000> I've seen most of that
[23:20:17] <mlg9000> I don't shard, my dataset is tiny
[23:20:54] <GothAlice> http://docs.mongodb.org/manual/tutorial/configure-replica-set-tag-sets/#replica-set-configuration-tag-sets specifically
[23:21:00] <mlg9000> and I have priorities set for specific datacenters
[23:21:11] <GothAlice> Priorities affect elections, not queries.
[23:21:17] <GothAlice> Tag sets affect queries.
[23:21:57] <mlg9000> how does write concern come into play?
[23:22:51] <mlg9000> I was having sudden performance issues with my app, I disabled my furthest replica and boom.. right back up
[23:24:26] <GothAlice> Writes obviously only happen to the primary, but when requesting confirmation of a minimum level of replication you can use tags to identify which groups of replicas you care about replication to. I.e. if you have a very, very slow offsite replica for backup purposes only, don't include it in the tag set to ensure confirmed writes never wait for *it* to respond.
[23:25:45] <GothAlice> s/happen/get sent to/
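A sketch of the tag-set arrangement GothAlice describes (member indexes, tag values, and the mode name are all placeholders): tag only the members near the application with distinct values, leave the slow offsite replica untagged, and define a write-concern mode over that tag.

    var cfg = rs.conf()
    cfg.members[0].tags = { nearby: "dc1-a" }
    cfg.members[1].tags = { nearby: "dc1-b" }
    // members[2], the slow offsite replica, gets no "nearby" tag at all
    cfg.settings = cfg.settings || {}
    // "twoNearby" is satisfied once members covering two distinct "nearby" values have the write
    cfg.settings.getLastErrorModes = { twoNearby: { nearby: 2 } }
    rs.reconfig(cfg)

    // a write that waits for two tagged members but never for the untagged offsite one:
    db.inventory.insert({ sku: "A-1", qty: 3 }, { writeConcern: { w: "twoNearby" } })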
[23:26:11] <mlg9000> ok, default is Acknowledged but I'm not clear from reading that documentation what exactly that means
[23:26:58] <mlg9000> I really don't care if replica writes take awhile and I don't want them bogging performance down
[23:27:35] <mlg9000> it's an inventory system, replicas are there in case we need to access data locally in the event the primary site is offline
[23:28:25] <GothAlice> To ensure that, make sure your primary is as close to your application servers as possible (in terms of latency). Acknowledged means when you issue a write the primary will receive the data, check that it looks sane, and tell you all is well. (You know it got it, but you have no guarantee that the change has propagated to the rest of the cluster.)
[23:30:32] <mlg9000> hum... my application server which does all the writes is on the same host as my mongodb primary. Not sure how a remote replica would slow things down then but it definitely is