PMXBOT Log file Viewer


#pypa-dev logs for Tuesday the 15th of September, 2020

[00:07:26] <travis-ci> pypa/pip#17963 (master - a13f201 : Pradyun Gedam): The build passed.
[00:07:26] <travis-ci> Change view : https://github.com/pypa/pip/compare/7fdf1634e083befe6334f10ddbe4b9ce3e7826bb...a13f2014f99ed3ca8fb2b3fa25d3631a50a0cace
[00:07:27] <travis-ci> Build details : https://travis-ci.com/pypa/pip/builds/184299591
[18:54:15] <daniel137> hi i need help to install a package on python
[21:17:13] <ianw> dstufft: fyi narrowing in on the pypi/fastly cdn issues which appear to be related to stale index files being served; currently working on details for bug reports @ https://etherpad.opendev.org/p/debugging-pypi-index-problems
[21:17:49] <ianw> somewhat in reference to https://discuss.python.org/t/any-chance-to-an-issue-tracker-for-pypi-org-operational-problems/5219
[21:18:06] <ianw> at least, the problems that inspired that thread :)
[21:18:40] <ianw> one question for anyone; we are seeing old index files served up, without a serial number
[21:19:21] <ianw> (the <!--SERIAL 8050067--> at the bottom of the page)
[21:19:52] <ianw> that may be a clue? does anyone know how recent an addition that was to the index.html files? was it a flag day when they were all regenerated? something else?
[21:34:44] <dstufft> ianw: there was no flag day when we regenerated, because our infra doesn't "generate" like that, our origin servers render the templates as part of the request / response cycle, so we deployed that and just as stuff fell out of the cache it hit the origin servers again and got an updated value
[21:35:19] <dstufft> ianw: FWIW not having that tag might not mean that it was a stale CDN cache, we also have our own bandersnatch mirror, and if fetching from PyPI fails, it'll fall back to fetching from that mirror
[21:35:41] <dstufft> and I don't think that mirror has the serial comment
[21:37:05] <dstufft> ianw: are you all using nginx? or is that server header from us
[21:37:22] <cooperlees> fwiw - I've wanted to update pypi's bandersnatch, but wanted a test box first. It's a large jump in versions
[21:37:53] <cooperlees> And I was going to ansible it all
[21:38:03] <cooperlees> + possibly use docker
[21:38:43] <dstufft> I could very much believe that our old and largely ignored bandersnatch install is doing something funky :)
[21:39:00] <cooperlees> and the ultimate goal of moving the mirror to s3 from *insert aws posix network mount thingy*
[21:39:29] <cooperlees> techalchemy's been promising me s3 support to bandersnatch for a long time now :P
[21:39:40] <ianw> dstufft: ... interesting ok. we're putting together a bug report as we speak at that etherpad
[21:40:44] <dstufft> there's an argument that soon we're going to hit diminishing returns with the mirror TBH
[21:41:10] <dstufft> As part of the TUF work, I think we're moving to pregenerating the /simple/ pages and storing them in an object store
[21:41:16] <cooperlees> If that's the case then I'll do nothing :)
[21:41:40] <dstufft> which means both the /simple/ pages AND the files can be served directly from the object store
[21:41:49] <dstufft> instead of going through our servers
[21:42:43] <dstufft> which I guess raises the question if it's worthwhile to have a mirror at all? I dunno it didn't really occur to me until right now
[21:42:58] <dstufft> I should probably write an issue so we think about it :p
[21:43:49] <dstufft> (this also means if we do have a mirror AND that mirror is going to be backed by an object store, we probably want it to be a *different* object store than we're currently using for actual pypi)
[21:45:14] <clarkb> dstufft: our cache is apache so you're generating those headers (pretty sure anyway)
[21:45:28] <clarkb> what we've got pasted there is the contents of our apache cache
[21:45:34] <dstufft> Ok
[21:46:14] <clarkb> I'm guessing most people don't notice because they either don't lock things, or when they do lock things the list ends up stale, so they aren't trying to use the latest things frequently
[21:48:13] <dstufft> Does the problem resolve itself? If so after how long?
[21:48:55] <clarkb> the vast majority of requests seem to work. For some of these failures I've managed to re-request the packages less than 5 minutes later and the indexes are fine
[21:49:09] <clarkb> in the past when we've seen the behavior it will be a day or two or three of pain then it goes back to normal
[21:49:50] <dstufft> a day or two of pain sounds like the original cached item fell out of the fastly cache
[21:50:16] <dstufft> I think our fastly cache will live for 24h
[21:50:24] <clarkb> well it's a day or two of flip flopping
[21:50:39] <clarkb> some requests work, some don't. Then after a bit the flip flopping stops and it just works reliably
[21:51:43] <dstufft> Yea, my main thing was if it was like.. 10 minutes of pain on a new release, that might just be distributed systems, but our purging should ensure that things are available within a few minutes
[21:52:04] <clarkb> dstufft: oh ya these packages were released weeks ago in some cases
[21:52:16] <dstufft> gotcha
[21:53:09] <dstufft> it's possible there's a purging bug in either warehouse or fastly, weeks ago is weird though, I wouldn't expect that from a purge bug
[21:53:10] <clarkb> in one of our examples on the etherpad the release for the version we want happened August 3, 2020 and the index we get served is marked Last-Modified as April 9, 2020
[21:54:01] <clarkb> and the requests for it were all made today. Some with correct up to date indexes and some without.
[21:54:02] <dstufft> hmm
[21:54:05] <dstufft> last-modified
[21:55:13] <dstufft> I think that means it came from our internal bandersnatch mirror
[21:55:28] <dstufft> we should probably make sure we don't cache those if we aren't already
[21:56:47] <dstufft> unless I'm blind I don't see a last-modified atm in https://bpa.st/QUDA
[21:57:55] <dstufft> I don't see any CSP headers in your responses either
[21:57:59] <clarkb> ya double checking in my local browser there is no last-modified header when we get what we expect
[21:58:43] <dstufft> the CSP headers are added by Warehouse itself, not Fastly
[21:58:44] <dstufft> so no CSP headers strongly suggests it's not hitting Warehouse
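
(A rough sketch of the check being done here, assuming Python's requests library; octavia-lib, a project that comes up later in this log, is used purely as an example.)

    # Fetch a /simple/ page and look for the signals discussed above:
    # Warehouse's CSP headers, the <!--SERIAL ...--> comment, and Last-Modified.
    import requests

    resp = requests.get("https://pypi.org/simple/octavia-lib/")  # example project

    # Warehouse adds the CSP headers itself; their absence suggests the response
    # never hit Warehouse (e.g. it came from the fallback bandersnatch mirror).
    print("CSP header present:    ", "Content-Security-Policy" in resp.headers)

    # Warehouse-rendered indexes carry a serial comment; the mirror's pages don't.
    print("SERIAL comment present:", "<!--SERIAL" in resp.text)

    # A Last-Modified header only showed up on the stale (mirror-served) responses.
    print("Last-Modified:", resp.headers.get("Last-Modified"))
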
[21:59:50] <clarkb> is the bandersnatch mirror stale?
[22:00:00] <dstufft> I think I technically have access to the internal mirror... but I don't even remember where it lives
[22:00:12] <clarkb> (that may be another pointer to it if it is)
[22:00:47] <cooperlees> dstufft: Yeah, please open an issue on that. I'd love to not have to generate the html
[22:01:01] <cooperlees> Should make the code super dumb
[22:01:44] <clarkb> fwiw I'm working on an issue for the stale indexes now too
[22:03:46] <dstufft> cooperlees: do you know what host serves our mirror yet?
[22:04:12] <dstufft> oh I guess I can reverse engineer it from fastly
[22:04:57] <dstufft> womp womp
[22:05:03] <dstufft> permission denied
[22:05:05] <cooperlees> Yeah I do - let me open my doc
[22:05:25] <dstufft> cooperlees: do you have an account on the box? can you see if there's a dstufft user lol
[22:05:25] <cooperlees> https://mirror.dub1.pypi.io
[22:05:53] <dstufft> In any event: This serves a full mirror, current as of:
[22:05:54] <dstufft> 2020-08-02 08:38:01 UTC.
[22:05:59] <dstufft> seems bad
[22:11:09] <fungi> it's taken a couple of days for us to track this down, but yeah if fastly is occasionally serving indices from a stale second server then that would absolutely explain everything we've seen (and also why we had a devil of a time coming up with any active reproductions)
[22:11:42] <fungi> dstufft: also, yes, all the releases we're having occasional trouble finding were after that date
[22:11:49] <fungi> so extremely likely
[22:13:07] <dstufft> sudo -u pypi-mirror /usr/local/bin/pypi-mirror
[22:13:07] <dstufft> Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
[22:13:07] <dstufft> OSError: [Errno 28] No space left on device
[22:13:07] <dstufft> cat: /data/pypi-mirror/status: No such file or directory
[22:13:13] <dstufft> womp womp
[22:13:27] <fungi> that'll do it!
[22:13:29] <dstufft> 12TB disk full
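
(As an aside, a quick way to confirm that kind of disk exhaustion from Python; shutil.disk_usage is standard library, and /data/pypi-mirror is taken from the paste above.)

    # Report free space on the mirror's data mount; the path comes from the paste above.
    import shutil

    usage = shutil.disk_usage("/data/pypi-mirror")
    print(f"free: {usage.free / 2**30:.1f} GiB of {usage.total / 2**40:.2f} TiB total")
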
[22:13:45] <cooperlees> I thought EWDurbin had monitoring on that
[22:13:53] <fungi> now's probably a terrible time to say that's why we stopped maintaining our own mirror of pypi :/
[22:13:56] <cooperlees> That and it running were the 2 alarms, I thought
[22:14:38] <cooperlees> fungi: That's why we made bandersnatch blacklists
[22:14:50] <cooperlees> You can cut it to 1/3 by blocking the top 100 packages, last I checked
[22:15:03] <fungi> yep, we ended up dropping the idea of mirroring pypi before bandersnatch merged the blacklisting feature
[22:15:13] <cooperlees> That's quite a while ago
[22:15:13] <dstufft> hmm
[22:15:22] <dstufft> this bandersnatch is too old to have verify
[22:15:29] <dstufft> so I guess it just isn't deleting anything
[22:15:36] <cooperlees> Yup - it's very old
[22:15:41] <fungi> and also would have needed to completely rebuild our mirror to retroactively apply blacklists to get it to a reasonable size again
[22:15:43] <cooperlees> Yeah, delete would be broken
[22:16:15] <dstufft> does anyone remember if you can live resize an EBS disk in AWS
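
(For the record, attached EBS volumes can be resized live via Elastic Volumes; a hedged boto3 sketch, where the volume ID and target size are placeholders and the partition/filesystem still has to be grown on the instance afterwards.)

    # Request a live size increase for an attached EBS volume.
    # The volume ID and size below are placeholders, not the real mirror volume.
    import boto3

    ec2 = boto3.client("ec2")
    ec2.modify_volume(VolumeId="vol-0123456789abcdef0", Size=16384)  # size in GiB

    # The OS still has to grow the partition and filesystem afterwards
    # (e.g. growpart and resize2fs / xfs_growfs on the instance itself).
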
[22:16:16] <cooperlees> fungi: Yeah, it would be nice to add filtering support to verify
[22:16:19] <fungi> (we dumped it around the time pypi crossed the 1tb threshold, if memory serves)
[22:17:37] <ianw> fungi: i think it was 2tb, because it was getting to openafs limits
[22:17:44] <fungi> oh, right you are
[22:18:19] <ianw> a tb here, a tb there :)
[22:19:11] <clarkb> ya the daily package uploads for cuda and AI stuff caused disk use to explode
[22:19:20] <dstufft> yea
[22:19:24] <dstufft> we wanna do ~something~
[22:19:26] <clarkb> and we could list things to exclude but that list seemed to be growing too
[22:19:28] <dstufft> not sure what that is yet
[22:19:32] <clarkb> so we switched to a caching proxy instead
[22:19:34] <dstufft> anyways
[22:19:40] <dstufft> this is just broken currently
[22:19:44] <clarkb> I almost have all the headers and things ready to attach
[22:19:55] <clarkb> then I'll file a bug and we can use that to track the fixing. Thank you for the help!
[22:20:34] <fungi> yes, this has been super helpful. at least we've got confirmation it's not something we're doing wrong ;)
[22:20:44] <dstufft> cooperlees: this version of bandersnatch will recover if I randomly delete mirrored files right?
[22:21:05] <dstufft> really tempted to just delete all the files and let it resync
[22:21:41] <dstufft> since almost 50% of these files shouldn't exist anymore
[22:25:08] <clarkb> https://github.com/pypa/warehouse/issues/8568 is the issue
[22:28:56] <cooperlees> dstufft: Yes it will - can you just mark fastly not to use it while you do it?
[22:29:49] <dstufft> yea I removed it from fastly
[22:29:55] <clarkb> pip will retry too if it just gets an error iirc
[22:30:01] <clarkb> so that may improve behavior for our situation at least
[22:30:16] <cooperlees> dstufft: While you're around - Can we clean up https://github.com/pypa/bandersnatch/issues/56 + https://github.com/pypa/warehouse/issues/4892
[22:30:37] <cooperlees> tl;dr where bandersnatch manually tried to clear fastly cache
[22:30:58] <cooperlees> self.session.request("PURGE", path, timeout=self.timeout)
[22:35:15] <dstufft> cooperlees: uhh, I don't think I have anything new to add right now, I still think if we're getting stale stuff cached in Fastly that I would prefer it if we didn't have a bunch of bandersnatch issuing purge requests. I think probably the correct answer is to stop trusting the CDN to purge 100% of the time, and make our purging smarter so it will verify the new content
[22:35:30] <dstufft> or reduce the amount of time we cache things for
[22:35:41] <dstufft> so that a stale cache doesn't stick around as long
[22:36:22] <cooperlees> Do we have metrics on how often this is happening? My goal is to just simplify the bandersnatch code ultimately
[22:36:26] <cooperlees> SUPER low pri
[22:36:43] <dstufft> no metrics currently
[22:37:30] <dstufft> this reminds me that we need to figure out a solution to bandersnatch deleting files in the general case too :/
[22:37:37] <dstufft> SO MUCH STUFF TO DO
[22:37:54] <cooperlees> Always ...
[22:38:05] <cooperlees> To be fair, delete has not worked in years ...
[22:38:21] <cooperlees> Since I started maintaining it, it didn't work, though it was thought it did
[22:38:27] <cooperlees> I noticed it when starting FB's mirrors
[22:38:44] <dstufft> pretty much once we moved PyPI's file URLs it broke I think
[22:40:08] <cooperlees> Yeah, I recall you saying when you moved to AWS from Gluster
[22:42:20] <cooperlees> dstufft: If you're killing the mirror - we might as well docker pull pypa/bandersnatch ?
[22:42:50] <cooperlees> Oh, happy to work on it another day
[22:43:16] <dstufft> I'm just deleting the synced files and then letting it sync again, I'd prefer not to drastically alter the deployment in an adhoc fashion
[22:43:19] <clarkb> I'm going to pop out now. Thanks again. I'll continue to lurk here and feel free to ping if there is more we can do to help debug it
[22:45:29] <cooperlees> dstufft: Yeah agree
[22:46:08] <cooperlees> Can you create instances? Can you make me one with a small (100gb) mount point to test a docker etc. install of latest bandersnatch?
[22:46:22] <cooperlees> And I'll ansible it all
[22:49:47] <ianw> dstufft: would you expect that as of about 5 minutes ago, we would likely not be seeing the stale indexes?
[22:50:09] <dstufft> ianw: if they got cached in Fastly they might stick around longer
[22:50:43] <dstufft> curl -XPURGE the URL if you find one, and if it happens again it's still broken
[22:50:44] <dstufft> but otherwise, yea
[22:51:03] <ianw> oh, cool, that's a good one to know
[22:51:11] <dstufft> cooperlees: uh, I think so, but hold on
[22:51:26] <cooperlees> dstufft: No hurry
[22:51:40] <dstufft> ianw: FWIW we generally ask people not to use that in any automated fashion, and if they find themselves needing to use it with any sort of regularity, they should file an issue
[22:52:04] <dstufft> like the ideal state is you never have to use that, because PyPI just does the right thing
[22:52:12] <dstufft> but computers etc
[22:52:18] <ianw> dstufft: well, yeah, i mean it's all just wrt to this current issue :)
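
(A minimal sketch of the one-off purge described above, assuming the requests library; it is equivalent to curl -XPURGE and, per dstufft's note, meant for hand debugging rather than anything automated.)

    # One-off purge of a possibly stale simple index, the same as
    # `curl -XPURGE <url>`. Not intended for automated use (see above).
    import requests

    url = "https://pypi.org/simple/octavia-lib/"  # example project from this log
    resp = requests.request("PURGE", url)
    print(resp.status_code, resp.text[:200])
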
[23:26:01] <ianw> dstufft: i feel like i issued a purge for https://pypi.org/simple/octavia-lib/ but we had a report it still got a mismatch
[23:26:58] <ianw> 2020-09-15 23:09:01.886227 | ubuntu-bionic | ERROR: No matching distribution found for octavia-lib===2.2.0
[23:32:00] <dstufft> ianw: is that timestamp UTC
[23:32:54] <clarkb> yes
[23:35:36] <dstufft> hm
[23:36:05] <dstufft> last entry in the mirror's access log is 22:30 UTC, so it couldn't have hit the mirror
[23:36:14] <dstufft> oh uh
[23:36:44] <dstufft> we'll serve stale content if we get an error too
[23:36:45] <dstufft> and I think purge doesn't actually remove it, it just marks it stale
[23:37:18] <dstufft> we can serve stale content for up to a day
[23:37:42] <dstufft> though that's concerning if it's getting errors that often tbh
[23:38:21] <ianw> yeah it seems pretty regular for CI jobs
[23:38:46] <ianw> but it's a heisenbug, every time i try "by hand" i get the right thing
[23:39:15] <dstufft> OH
[23:39:23] <dstufft> we will also serve stale when we revalidate
[23:39:28] <dstufft> for 5 minutes
[23:39:48] <dstufft> so if you purged then immediately ran the job, it would have served the stale copy while it updated the cache
[23:40:26] <ianw> hrm, i couldn't say if it was within 5 minutes
[23:40:46] <ianw> we probably just need to monitor at this point
[23:40:50] <dstufft> I'd say keep monitoring it and see
[23:40:51] <dstufft> yea
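
(Since the plan is to keep monitoring, a small sketch of what that could look like, assuming the requests library: periodically re-fetch a simple index and log its SERIAL comment so stale responses stand out.)

    # Minimal monitoring loop: re-fetch a /simple/ page every few minutes and
    # log the serial from the <!--SERIAL ...--> comment (or note its absence).
    import re
    import time
    import requests

    URL = "https://pypi.org/simple/octavia-lib/"  # example project from this log

    while True:
        resp = requests.get(URL)
        match = re.search(r"<!--SERIAL (\d+)-->", resp.text)
        serial = match.group(1) if match else "missing"
        print(time.strftime("%Y-%m-%d %H:%M:%S"), resp.status_code, "serial:", serial)
        time.sleep(300)  # every 5 minutes
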
[23:41:15] <dstufft> there are enough edge cases where we will serve stale that it's tricky
[23:41:37] <dstufft> all that work to make it harder for pypi to actually go down, coming back to haunt us :p
[23:42:03] <dstufft> they sort of worked!
[23:42:06] <dstufft> it didn't go down!
[23:42:15] <dstufft> for some definition of not going down
[23:43:56] <ianw> haha yes, zombie death squads never get credit for their tenacious can-do attitude :)