[18:54:15] <daniel137> hi i need help to install a package in python
[21:17:13] <ianw> dstufft: fyi narrowing in on the pypi/fastly cdn issues which appear to be related to stale index files being served; currently working on details for bug reports @ https://etherpad.opendev.org/p/debugging-pypi-index-problems
[21:17:49] <ianw> somewhat in reference to https://discuss.python.org/t/any-chance-to-an-issue-tracker-for-pypi-org-operational-problems/5219
[21:18:06] <ianw> at least, the problems that inspired that thread :)
[21:18:40] <ianw> one question for anyone; we are seeing old index files served up, without a serial number
[21:19:21] <ianw> (the <!--SERIAL 8050067--> at the bottom of the page)
[21:19:52] <ianw> that may be a clue? does anyone know how recent an addition that was to the index.html files? was it a flag day when they were all regenerated? something else?
[21:34:44] <dstufft> ianw: there was no flag day when we regenerated, because our infra doesn't "generate" like that; our origin servers render the templates as part of the request/response cycle, so we deployed that and, as stuff fell out of the cache, it hit the origin servers again and got an updated value
[21:35:19] <dstufft> ianw: FWIW not having that tag might not mean that it was a stale CDN cache, we also have our own bandersnatch mirror, and if fetching from PyPI fails, it'll fall back to fetching from that mirror
[21:35:41] <dstufft> and I don't think that mirror has the serial comment
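(A minimal sketch of the check being discussed: fetch a project's /simple/ page and look for the <!--SERIAL ...--> marker. The use of the requests library and the exact comment format are assumptions based on the example above; a page without the marker may have come from the bandersnatch fallback mirror or an older cached render.)

```python
# Sketch: fetch a /simple/ index and report whether it carries the
# <!--SERIAL ...--> marker discussed above. A page without it may have come
# from the fallback bandersnatch mirror (or an older cached render).
# The comment format and use of "requests" are assumptions for illustration.
import re
import sys

import requests


def check_serial(project: str) -> None:
    url = f"https://pypi.org/simple/{project}/"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    match = re.search(r"<!--SERIAL (\d+)-->", resp.text)
    if match:
        print(f"{project}: serial {match.group(1)} (warehouse-rendered page)")
    else:
        print(f"{project}: no serial comment -- possibly a mirror/stale copy")


if __name__ == "__main__":
    check_serial(sys.argv[1] if len(sys.argv) > 1 else "pip")
```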
[21:37:05] <dstufft> ianw: are you all using nginx? or is that server header from us
[21:37:22] <cooperlees> fwiw - I've wanted to update pypi's bandersnatch, but wanted a test box first. It's a large jump in versions
[21:37:53] <cooperlees> And I was going to ansible it all
[21:38:43] <dstufft> I could very much believe that our old and largely ignored bandersnatch install is doing something funky :)
[21:39:00] <cooperlees> and the ultimate goal of moving the mirror to s3 from *insert aws posix network mount thingy*
[21:39:29] <cooperlees> techalchemy's been promising me s3 support to bandersnatch for a long time now :P
[21:39:40] <ianw> dstufft: ... interesting ok. we're putting together a bug report as we speak at that etherpad
[21:40:44] <dstufft> there's an argument that soon we're going to hit diminishing returns with the mirror TBH
[21:41:10] <dstufft> As part of the TUF work, I think we're moving to pregenerating the /simple/ pages and storing them in an object store
[21:41:16] <cooperlees> If that's the case then I'll do nothing :)
[21:41:40] <dstufft> which means both the /simple/ pages AND the files can be served directly from the object store
[21:41:49] <dstufft> instead of going through our servers
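(Purely as an illustration of what pregenerating /simple/ pages into an object store could look like; the bucket name, key layout, boto3 client, and render_simple_index() helper are all hypothetical, not how Warehouse actually does or will do this.)

```python
# Illustrative sketch only: render a project's simple index once and push it
# to an object store so the CDN can serve it without hitting origin servers.
# Bucket name, key layout, and render_simple_index() are hypothetical.
import boto3

s3 = boto3.client("s3")
BUCKET = "pypi-simple-pages"  # hypothetical bucket


def render_simple_index(project: str, serial: int) -> str:
    # Stand-in for Warehouse's real template rendering.
    return (
        f"<html><body><h1>Links for {project}</h1></body></html>\n"
        f"<!--SERIAL {serial}-->"
    )


def publish_simple_page(project: str, serial: int) -> None:
    body = render_simple_index(project, serial)
    s3.put_object(
        Bucket=BUCKET,
        Key=f"simple/{project}/index.html",
        Body=body.encode("utf-8"),
        ContentType="text/html",
        CacheControl="max-age=600",
    )


publish_simple_page("requests", 8050067)
```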
[21:42:43] <dstufft> which I guess raises the question of whether it's worthwhile to have a mirror at all? I dunno, it didn't really occur to me until right now
[21:42:58] <dstufft> I should probably write an issue so we think about it :p
[21:43:49] <dstufft> (this also means if we do have a mirror AND that mirror is going to be backed by an object store, we probably want it to be a *different* object store than we're currently using for actual pypi)
[21:45:14] <clarkb> dstufft: our cache is apache so you're generating those headers (pretty sure anyway)
[21:45:28] <clarkb> what we've got pasted there is the contents of our apache cache
[21:46:14] <clarkb> I'm guessing most people don't notice because they either don't lock things, or when they do lock things the list ends up stale, so they aren't trying to use the latest things frequently
[21:48:13] <dstufft> Does the problem resolve itself? If so after how long?
[21:48:55] <clarkb> the vast majority of requests seem to work. For some of these failures I've managed to re-request the packages less than 5 minutes later and the indexes were fine
[21:49:09] <clarkb> in the past when we've seen the behavior it will be a day or two or three of pain then it goes back to normal
[21:49:50] <dstufft> a day or two of pain sounds like the original cached item fell out of the fastly cache
[21:50:16] <dstufft> I think our fastly cache will live for 24h
[21:50:24] <clarkb> well it's a day or two of flip flopping
[21:50:39] <clarkb> some requests work, some don't. Then after a bit the flip flopping stops and it just works reliably
[21:51:43] <dstufft> Yea, my main thing was if it was like.. 10 minutes of pain on a new release, that might just be distributed systems, but our purging should ensure that things are available within a few minutes
[21:52:04] <clarkb> dstufft: oh ya these packages were released weeks ago in some cases
[21:53:09] <dstufft> it's possible there's a purging bug in either warehouse or fastly; weeks ago is weird though, I wouldn't expect that from a purge bug
[21:53:10] <clarkb> in one of our examples on the etherpad the release for the version we want happened August 3, 2020 and the index we get served is marked Last-Modified as April 9, 2020
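(For anyone collecting similar evidence, a quick sketch for dumping the cache-related headers on a /simple/ page; Last-Modified is standard, while the other header names are what Fastly-fronted sites typically expose and may differ here.)

```python
# Sketch: dump cache-related headers for a /simple/ page so stale responses
# can be correlated with Last-Modified, Age, and which CDN node served them.
# Header names beyond Last-Modified are typical of Fastly deployments and
# may not all be present.
import requests


def dump_headers(project: str) -> None:
    url = f"https://pypi.org/simple/{project}/"
    resp = requests.get(url, timeout=30)
    for name in ("Last-Modified", "Age", "X-Served-By", "X-Cache", "X-Cache-Hits"):
        print(f"{name}: {resp.headers.get(name, '<absent>')}")


dump_headers("cryptography")
```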
[22:11:09] <fungi> it's taken a couple of days for us to track this down, but yeah if fastly is occasionally serving indices from a stale second server then that would absolutely explain everything we've seen (and also why we had a devil of a time coming up with any active reproductions)
[22:11:42] <fungi> dstufft: also, yes, all the releases we're having occasional trouble finding were after that date
[22:29:55] <clarkb> pip will retry too if it just gets an error iirc
[22:30:01] <clarkb> so that may improve behavior for our situation at least
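(A rough sketch of leaning on retries from a wrapper script: pip's own --retries flag handles transient network errors, and the outer loop below assumes a stale-index failure may simply succeed on a later attempt; the package name and retry counts are arbitrary.)

```python
# Sketch: call pip with its built-in retry option and add a coarse outer
# retry, on the assumption that a stale index shows up as a failed install
# that may succeed later. Package name and counts are arbitrary.
import subprocess
import sys
import time


def install_with_retries(package: str, attempts: int = 3) -> None:
    for attempt in range(1, attempts + 1):
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "--retries", "5", package]
        )
        if result.returncode == 0:
            return
        print(f"attempt {attempt} failed; sleeping before retrying")
        time.sleep(30)
    raise RuntimeError(f"could not install {package} after {attempts} attempts")


install_with_retries("requests==2.24.0")
```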
[22:30:16] <cooperlees> dstufft: While you're around - Can we clean up https://github.com/pypa/bandersnatch/issues/56 + https://github.com/pypa/warehouse/issues/4892
[22:30:37] <cooperlees> tl;dr where bandersnatch manually tried to clear fastly cache
[22:35:15] <dstufft> cooperlees: uhh, I don't think I have anything new to add right now, I still think if we're getting stale stuff cached in Fastly that I would prefer it if we didn't have a bunch of bandersnatch issuing purge requests. I think probably the correct answer is to stop trusting the CDN to purge 100% of the time, and make our purging smarter so it will verify the new content
[22:35:30] <dstufft> or reduce the amount of time we cache things for
[22:35:41] <dstufft> so that a stale cache doesn't stick around as long
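(A sketch of the purge-then-verify idea: issue a per-URL purge, then re-fetch the page and confirm the SERIAL comment is at least the serial just written. The HTTP PURGE method is Fastly's per-URL purge mechanism; the rest, including the expected-serial plumbing, is illustrative.)

```python
# Sketch of purge-then-verify: issue a per-URL purge, then re-fetch the page
# and confirm the SERIAL comment is at least the serial we expect to see.
# The PURGE method is Fastly's per-URL purge; everything else is illustrative.
import re
import time

import requests


def purge_and_verify(project: str, expected_serial: int, attempts: int = 5) -> bool:
    url = f"https://pypi.org/simple/{project}/"
    requests.request("PURGE", url, timeout=30)
    for _ in range(attempts):
        resp = requests.get(url, timeout=30)
        match = re.search(r"<!--SERIAL (\d+)-->", resp.text)
        if match and int(match.group(1)) >= expected_serial:
            return True
        time.sleep(10)  # give the CDN a moment and try again
    return False  # still stale; re-purge or alert


print(purge_and_verify("pip", 8050067))
```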
[22:36:22] <cooperlees> Do we have metrics on how often this is happening? My goal is to just simplify the bandersnatch code ultimately
[22:42:50] <cooperlees> Oh, happy to work on it another day
[22:43:16] <dstufft> I'm just deleting the synced files and then letting it sync again, I'd prefer not to drastically alter the deployment in an ad hoc fashion
[22:43:19] <clarkb> I'm going to pop out now. Thanks again. I'll continue to lurk here and feel free to ping if there is more we can do to help debug it
[22:46:08] <cooperlees> Can you create instances? Can you make me one with a small (100gb) mount point to test a docker etc. install of latest bandersnatch?
[22:51:40] <dstufft> ianw: FWIW we generally ask people not to use that in any automated fashion, and if they find themselves needing to use it with any sort of regularity, they should file an issue
[22:52:04] <dstufft> like the ideal state is you never have to use that, because PyPI just does the right thing