[00:07:40] <javeln> hey all. i'm working on getting mongo into our data pipeline with pig, and it's like 99% perfect for us. but i'm hitting a snag with MongoUpdateStorage when using the _id as the upsert key. any suggestions, thoughts, etc?
[00:20:39] <federated_life> javeln: what's the snag?
[00:20:49] <federated_life> and why use mongo if you're on hadoop anyway?
[00:21:39] <javeln> a lot of it's just legacy, and i'm trying to make incremental improvements without rebuilding everything from the ground up
[00:22:13] <javeln> we don't have the luxury of a persistent hadoop cluster, so mongo's a nice queryable place to keep records in between data runs
[00:23:12] <javeln> basically there's a rather large unique key that i could use to upsert records, but it seemed like it'd be a lot nicer just to use the _id
[00:23:44] <federated_life> javeln: perhaps you're not sharded on _id ?
[00:24:36] <javeln> in a word, this: could not instantiate 'com.mongodb.hadoop.pig.MongoUpdateStorage' with arguments '[{_id:ObjectId("test")}, {$set:{cited:"$cited"}}, cited:long,mongo]'
[00:25:04] <javeln> i can get the string version of the _id in the load function, i want to update in the store function using that
[00:25:23] <javeln> where "test" would be substituted with the string _id from the load func
[00:27:12] <javeln> it works fine if i do '{_id:"ObjectId(\$mongoid)"}' . . . but then that literal string appears as the value instead of the actual ObjectId of $mongoid
[00:30:48] <federated_life> your quoting might be messed up
[00:31:37] <federated_life> and some docs , https://github.com/mongodb/mongo-hadoop/blob/master/pig/README.md#updating-a-mongodb-collection
[00:32:59] <javeln> yeah, i've looked at that, it would work fine if i did like '{_id:"\$mongoid"}' but then it doesn't perform the ObjectId function
[00:33:23] <federated_life> nah, use { _id : ObjectId("") }
[00:33:33] <federated_life> from the example, your quotes are out of scope
[00:34:49] <javeln> could not instantiate 'com.mongodb.hadoop.pig.MongoUpdateStorage' with arguments '[{_id:ObjectId("")} . . . :(
[00:35:14] <javeln> is there a way to make the _id field just a plain string when it's autogenerated?
[00:36:11] <javeln> just some easy uuid that i can use to upsert docs without loading all the extra unique key fields?
[00:38:31] <federated_life> javeln: it's generated as an ObjectId, and an ObjectId is a roughly monotonically increasing value (it starts with a timestamp)
[00:38:41] <federated_life> so, you will have to specify the UUID you want when docs are generated
[00:38:55] <federated_life> just keep in mind that there are different UUIDs and depending on your concurrency, you might have dupes
[00:39:08] <federated_life> I forget the different kinds right now, but some are more unique than others
[00:41:13] <javeln> i see, yeah that might work, we've got the luxury of not really having "BIG" data by modern standards, and it doesn't have to be highly available or anything like that. just need to be able to selectively update docs as painlessly as possible
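
A minimal sketch of the idea discussed above, assuming the application assigns its own string _id when a document is first created instead of letting MongoDB generate an ObjectId. The helper name `new_doc` and the sample `cited` value are hypothetical; documents are plain dicts and no driver or pig call is shown.

```python
import uuid


def new_doc(fields):
    """Return a copy of `fields` with a client-generated string _id.

    Hypothetical helper, not part of any driver. uuid4 is random-based,
    so duplicates are extremely unlikely even with concurrent writers.
    """
    doc = dict(fields)
    # Keep an _id the caller already supplied; otherwise assign a new one.
    doc.setdefault("_id", str(uuid.uuid4()))
    return doc


doc = new_doc({"cited": 42})
# The _id is now a plain 36-character string, so it can be handled as an
# ordinary string key when matching documents for an upsert.
print(doc["_id"], doc["cited"])
```

Because the _id is just a string here, an update query can match on it directly without wrapping the value in ObjectId(...).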
[03:11:20] <dman777_alter> how does mongodb do text based searches? does it take words in stored documents and make vectors of them and store the vectors somewhere?
[05:02:40] <dman777_alter> ugh...can't tell what happens with next() in mongoose. don't see it in the api docs.
[13:09:09] <daslicht> hello, lets say i save some data like this: {"foo":"bar"} , is it possible to add an image as binary data as well? using gridstore?
[13:09:40] <daslicht> is it possible in the same document or do i have to link it using an objectid ?
[13:10:20] <daslicht> eg: {"foo":"bar", "image":????} or {"foo":"bar", "image_id":4hj4hj2klfdjhh} ?
[14:13:42] <kali> daslicht: you can store a binary in the document. there is a binary type in bson, look for it in your driver documentation
[14:13:52] <kali> daslicht: be aware of the 16MB limit
[14:14:03] <kali> daslicht: gridfs is an alternative
[14:14:32] <daslicht> yeah i am aware of that binary type and its limitation, that's why I ask about gridfs
[14:15:18] <daslicht> the main question is if i need to store 2 documents, one with the normal data and another with the image, and link them via id, or if i can store everything in one doc
[14:16:09] <redsand_> try 1 doc and use compression on the binary
[14:16:25] <kali> daslicht: well, in that case, you store one "file" in gridfs (it will be mapped to one file document and one or more chunk documents) and store the filename in your image document
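
A rough in-memory sketch of the layout kali describes, assuming the default GridFS chunk size of 255 KiB. It does not talk to MongoDB; a real driver's GridFS support does the splitting for you and writes to the fs.files and fs.chunks collections. The function `to_gridfs_docs`, the string id, and the sample file name are hypothetical stand-ins.

```python
import uuid

CHUNK_SIZE = 255 * 1024  # GridFS default chunk size in bytes


def to_gridfs_docs(filename, data, chunk_size=CHUNK_SIZE):
    """Split `data` into one file document plus its chunk documents."""
    # Stand-in for the ObjectId a real driver would assign to the file.
    file_id = str(uuid.uuid4())
    file_doc = {
        "_id": file_id,
        "filename": filename,
        "length": len(data),
        "chunkSize": chunk_size,
    }
    # Each chunk points back at its file and records its position `n`.
    chunks = [
        {"files_id": file_id, "n": n, "data": data[off:off + chunk_size]}
        for n, off in enumerate(range(0, len(data), chunk_size))
    ]
    return file_doc, chunks


image = bytes(600 * 1024)  # 600 KiB of placeholder bytes
file_doc, chunks = to_gridfs_docs("photo.png", image)

# The regular document stays small and only references the stored file.
record = {"foo": "bar", "image_id": file_doc["_id"]}
print(len(chunks), record["image_id"] == file_doc["_id"])
```

Referencing the file by id, as in daslicht's second example, or by filename, as kali suggests, both work; the point is that the image bytes live outside the regular document, so the 16MB document limit no longer applies to them.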