PMXBOT Log file Viewer

#pypa-dev logs for Thursday the 23rd of June, 2016

[13:19:10] <jaraco> I tried adding /pypa/setuptools to Gitter, but Gitter seems to think /pypa is a private repository or that I don’t have access to it. I requested access, but the request has been pending for days.
[13:19:20] <jaraco> Did someone see that request come through?
[14:04:38] <dstufft> jaraco: hmm
[14:04:49] <dstufft> I think we require confirmation for new third party things
[14:04:57] <dstufft> I might have missed an email
[14:05:02] <dstufft> give me a few and I'll take a look
[14:15:36] <jaraco> Thanks dstufft. Unfortunately, I don’t even know how to reach that page anymore. It came up when I requested to add private repositories (which apparently just means repositories that aren’t visible unless I’m logged in).
[14:15:53] <jaraco> Most of my organizations showed up as already authorized, but pypa did not and had a button to request access.
[14:16:06] <jaraco> This was maybe late last week.
[14:16:21] <jaraco> But I’ve since granted access to my github account and so I no longer get that page.
[14:16:45] <jaraco> *granted Gitter access to my Github account.
[14:45:11] <dstufft> jaraco: I granted Gitter access to the PyPA org
[14:46:01] <jaraco> Works.
[14:46:04] <jaraco> !em dstufft
[14:46:05] <pmxbot> You're doing a good job, dstufft!
[14:46:25] <dstufft> In unrelated news, I suck at writing a bio for myself :[
[14:57:51] <willingc_away> dstufft Happy to review bio if desired. Short version: You are awesome-sauce.
[14:59:30] <dstufft> willingc_away: thanks :) I just sent it a bit ago. I did an interview for a podcast and they wanted one. Mostly I don't like talking about myself though. I'm much more comfortable with the technical side of things D:
[15:00:25] <willingc_away> No problem. I think most of us have the same issue when making a bio. Enjoy your day :-)
[15:15:30] <mbacchi> dstufft: which podcast? I enjoyed the podcast.__init__ one from a couple months ago. It's the catalyst for why I'm trying to get involved and help out with warehouse (only 1 PR so far but learning...)
[15:15:44] <dstufft> mbacchi: Talk Python to Me
[15:15:51] <dstufft> mbacchi: and I'm glad to hear that :D
[15:16:19] <dstufft> I haven't actually listened to the Podcast.__init__ one; it's painful to hear my own voice :P
[15:19:04] <mbacchi> cool, I'll look out for that. The init one was denser on technical content than most; typically we get lots of marketspeak thrown in
[15:22:25] <mbacchi> dstufft: so I'm getting myself in a place where I can work on issue #787, no pyramid experience so I'm bootstrapping some of that first. My analysis of the issue, though, is that there is currently nothing that puts stats into redis, correct? How does legacy PyPI 'count' a download, and where are the stats incremented?
[15:26:03] <dstufft> mbacchi: Soooo, this one is a bit fun! Legacy PyPI has rsyslog running, and it pipes all of its data into a python script that parses the log file and then increments counters in redis, bucketed by certain values with expiration on them. Then it also has a script that runs once an hour that tallies up those buckets and adjusts the absolute counts in the database.
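
A minimal sketch of the bucket-and-tally scheme dstufft describes, assuming hypothetical key names, hourly buckets, and a simplified packages table; the actual legacy script differs:

    import time

    import redis

    r = redis.StrictRedis()

    def record_download(project):
        # Bucket counts by hour; expire buckets after two days so the
        # hourly tally job has plenty of time to pick them up first.
        bucket = "downloads:%s:%d" % (project, int(time.time() // 3600))
        pipe = r.pipeline()
        pipe.incr(bucket)
        pipe.expire(bucket, 2 * 24 * 3600)
        pipe.execute()

    def tally(db):
        # Run once an hour: fold each closed bucket into the absolute
        # count in the relational database, then delete it so it is only
        # counted once. (Real code would need the read-then-delete to be
        # atomic.)
        current_hour = int(time.time() // 3600)
        for key in r.scan_iter("downloads:*"):
            _, project, hour = key.decode().split(":")
            if int(hour) < current_hour:
                count = int(r.get(key) or 0)
                r.delete(key)
                db.execute(
                    "UPDATE packages SET downloads = downloads + %s WHERE name = %s",
                    (count, project))
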
[15:26:14] <dstufft> This is sort of completely broken at the moment
[15:26:16] <dstufft> However!
[15:26:39] <dstufft> We currently have a project called linehaul, which is itself a syslog daemon; it accepts log lines from Fastly, parses them, and sticks them in a BigQuery database
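
Roughly, the linehaul flow looks like the following sketch; the log-line format and table name here are simplified placeholders, not linehaul's real schema:

    from google.cloud import bigquery

    client = bigquery.Client()

    def handle_log_line(line):
        # e.g. "2016-06-23T15:26:39Z|setuptools|23.0.0|pip/8.1.2"
        timestamp, project, version, user_agent = line.strip().split("|")
        errors = client.insert_rows_json(
            "my-project.pypi.downloads",  # hypothetical table name
            [{"timestamp": timestamp, "project": project,
              "version": version, "user_agent": user_agent}])
        if errors:
            raise RuntimeError("BigQuery insert failed: %s" % errors)
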
[15:27:12] <mbacchi> oh...I'll look that up...
[15:27:19] <dstufft> so this information is publicly queryable now, but queries can take tens of seconds, sometimes hundreds of seconds, to complete depending on how much data is being queried; suffice it to say that's too long to happen inside of your normal request/response flow
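
For a sense of what querying the public data looks like, a sketch using the google-cloud-bigquery client; the table and field names follow the 2016 announcement and may not match the current public dataset:

    from google.cloud import bigquery

    client = bigquery.Client()

    query = """
        SELECT file.project AS project, COUNT(*) AS downloads
        FROM `the-psf.pypi.downloads20160623`
        GROUP BY project
        ORDER BY downloads DESC
        LIMIT 10
    """
    # .result() blocks until the job finishes, which can take tens of
    # seconds; hence the need for an async process in the web app.
    for row in client.query(query).result():
        print(row.project, row.downloads)
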
[15:27:55] <dstufft> so the open question becomes how do we handle the stats that we want in the normal flow
[15:27:56] <mbacchi> yep, so as of now still going to require an async process
[15:29:20] <dstufft> Do we A) have a second database of storage that linehaul sends data to that does the rolling counts, B) have a daily process that updates the counts for everything, C) when someone requests the page, have it kick off a bit of JS that polls an endpoint and waits for stats to show up, and have that endpoint kick off an async job that fetches and queries the stats (and probably caches them), D) something else, or E) some combination
[15:29:54] <dstufft> I started on a branch that implements C, but it's not super well written I think, and I don't know if it's actually the best way to do this or not
[15:31:06] <dstufft> (to be clear, I haven't worked on it recently; I just started it, then quit to work on other things)
[15:31:20] <dstufft> https://github.com/pypa/warehouse/compare/master...dstufft:stats I just pushed what I had done so far, in case it's useful
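
A rough sketch of what option C might look like as a Pyramid view; the in-process cache, thread-based job, and query_bigquery_downloads helper are stand-ins, not the code in dstufft's branch:

    import threading

    from pyramid.view import view_config

    _stats = {}  # in-process cache; real code would use redis or similar

    @view_config(route_name="project.stats", renderer="json")
    def project_stats(request):
        project = request.matchdict["name"]
        if project in _stats:
            return {"status": "ready", "downloads": _stats[project]}
        # Kick off the slow BigQuery query outside the request/response
        # cycle; a real implementation would use a proper task queue and
        # deduplicate concurrent requests.
        threading.Thread(target=fetch_stats, args=(project,), daemon=True).start()
        request.response.status_code = 202  # tells the page's JS to keep polling
        return {"status": "pending"}

    def fetch_stats(project):
        # May take tens of seconds; caches the result for later polls.
        _stats[project] = query_bigquery_downloads(project)  # hypothetical helper
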
[15:31:53] <mbacchi> how would I integrate linehaul into my own test warehouse environment? or would I want to start with the public url?
[15:33:54] <mbacchi> ah, that compare link helps...thanks I'll look more at that
[15:34:09] <dstufft> mbacchi: so, if your work is just going to query the BigQuery database, you can just use the publicly available data at https://mail.python.org/pipermail/distutils-sig/2016-May/028986.html
[15:34:28] <dstufft> (that PR predates the public data, so uses the old name)
[15:34:52] <mbacchi> k
[15:35:42] <mbacchi> thanks for the pointers...I'm sure I'll be back to ask questions
[15:36:37] <dstufft> If you want to fork the data into a separate database that is more suited to real-time querying (BigQuery is nice, especially for archival data, since you can efficiently query massive amounts of data, but that means it's not entirely suitable for real-time data unless you add an async process), then you'd want to work on linehaul directly, which is at https://github.com/pypa/linehaul but which is not nearly as nice to work on as Warehouse itself :( (but it is much smaller in scope, so that makes it a bit better)
[15:37:36] <dstufft> Personally I have a preference for just using BigQuery as the source of data if it can be done in a reasonable way; fewer things to maintain and fewer sources of truth
[15:37:45] <dstufft> but I'm not opposed to other ideas :)
[15:37:54] <dstufft> mbacchi: and no problem! feel free to ask whatever you need
[15:39:09] <mbacchi> ya, I am not familiar with BigQuery to be honest; I wonder if it would make sense to shove the stats we expect warehouse to use most into a separate table to improve response time
[15:40:02] <dstufft> mbacchi: that is something that can be done, I think, yes. I believe BigQuery even natively supports something like that. To be quite honest I'm fairly new to BigQuery myself, so I'm still figuring out patterns for making it work
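
One way to do that with BigQuery itself is to periodically materialize the hot numbers into a small destination table that is cheap and fast to read; a sketch, with a hypothetical destination table name:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Rewrite the summary table on each run (e.g. from a daily cron job).
    job_config = bigquery.QueryJobConfig(
        destination="my-project.pypi.download_summaries",  # hypothetical table
        write_disposition="WRITE_TRUNCATE",
    )
    client.query("""
        SELECT file.project AS project, COUNT(*) AS downloads
        FROM `the-psf.pypi.downloads20160623`
        GROUP BY project
    """, job_config=job_config).result()
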
[15:40:55] <dstufft> which is partially why I abandoned my in-progress work for the time being, because there were lower-hanging, more important things for me to focus on :) But I'm totally on board with someone else picking that up if that's what interests them!
[15:42:22] <mbacchi> ya, I just browsed the list of issues trying to find something that didn't require an overabundance of pyramid background, so that's what led me to that item...but I think with this new info it actually looks more manageable than I had originally thought...thanks