#pypa-dev logs for Friday the 15th of May, 2015

[18:31:20] <lifeless> dstufft: 2786 should be a no-brainer
[18:32:41] <dstufft> lifeless: merged
[18:48:02] <lifeless> dstufft: top of head
[18:48:14] <lifeless> dstufft: how many distributions are on pypi, vs packages
[18:48:24] <lifeless> dstufft: e.g. for youtube-dl it's 500 for 1
[18:48:43] <sigmavirus24> lifeless: distributions == unique versions of a package?
[18:48:46] <dstufft> lifeless: 391834 total files
[18:48:50] <dstufft> if that's what you mean
[18:49:04] <dstufft> or do you want an average number of files per package
[18:49:10] <dstufft> or version numbers?
[18:49:12] <lifeless> I want avg versions per package
[18:49:17] <dstufft> ok
[18:49:19] <dstufft> sec
[18:49:21] <lifeless> if it's handy
[18:51:59] <dstufft> I have the database open already
[18:53:10] <dstufft> lifeless:
[18:53:11] <dstufft> pypi=> SELECT avg(cnt) FROM (SELECT COUNT(*) as cnt FROM releases GROUP BY name) as j;
[18:53:12] <dstufft> avg
[18:53:13] <dstufft> --------------------
[18:53:14] <dstufft> 5.7268683725413925
[18:53:15] <dstufft> (1 row)
[18:53:36] <dstufft> releases == unique versions on PyPI
[18:53:41] <dstufft> one row per release
[18:54:19] <dstufft> lifeless: the max is 529
[18:54:32] <dstufft> min is 1 of course
[18:54:43] <lifeless> thanks
[18:57:47] <dstufft> lifeless: the mode is 1 and the median is 3
[18:57:54] <dstufft> just in case you were wondering
[18:58:09] <dstufft> the long tail is going to heavily influence that though
[18:58:23] <dstufft> people who only ever put up one or two releases and then never did anything else, and nobody uses them
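
(For reference, the same summary statistics are easy to reproduce in Python once the per-package release counts are out of the database. A minimal sketch; the counts list here is illustrative, not the real PyPI data:)

    import statistics

    # Illustrative per-package release counts, i.e. the output of
    # SELECT COUNT(*) AS cnt FROM releases GROUP BY name
    counts = [1, 1, 1, 2, 3, 3, 4, 7, 529]

    print(statistics.mean(counts))    # the real data averaged ~5.73
    print(statistics.median(counts))  # 3 on the real data
    print(statistics.mode(counts))    # 1 on the real data
    print(max(counts), min(counts))   # 529 and 1 on the real data
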
[18:58:53] <lifeless> dstufft: yeah, see my mail just now to distutils-sig
[18:59:55] <lifeless> we're going to have to work hard on some heuristics for the resolver branch, or perhaps include a flag to choose between full / first constraint / just error on conflicting constraints
[18:59:59] <lifeless> or both
[19:00:39] <lifeless> None of my work is invalidated, it's just that we can't solve the whole problem in one swoop; we need the refactoring in place to support the core loop
[19:00:52] <lifeless> and then the core loop needs to have people beat up on it
[19:00:57] <dstufft> lifeless: ok
[19:01:03] <dstufft> distutils-sig email didn't show up yet
[19:01:09] <dstufft> will watch for it
[19:01:22] <lifeless> oh, so openstack's deps have ~2.4T unique combinations
[19:01:30] <dstufft> lol
[19:01:32] <lifeless> 300^5.7
[19:01:55] <lifeless> it's probably worse for openstack because there are lots of high-version-count packages in that set
[19:01:57] <dstufft> most probably won't be that high, but millions is likely
[19:03:26] <dstufft> in more positive news, PyPI's S3 migration seems to be done other than picking up the odd piece here and there
[19:03:56] <lifeless> does pip have a units heuristic somewhere?
[19:04:03] <lifeless> e.g. number -> fooK / fooM etc
[19:04:28] <dstufft> um
[19:04:32] <dstufft> it has one for file size
[19:04:37] <dstufft> I don't think one for raw units
[19:04:42] <lifeless> file size will do
[19:05:50] <dstufft> from pip.utils import format_size
[19:06:39] <lifeless> hah, tops out at MB
[19:06:46] <lifeless> I'll just let it go scientific
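
(For anyone following along: pip.utils.format_size was pip's internal helper at the time; internal APIs move around between releases. It takes a size in bytes and renders it human-readably, with MB as its largest unit, which is what's being joked about here. A quick illustrative use, with outputs approximate:)

    from pip.utils import format_size  # internal helper, pip circa 2015

    print(format_size(512))        # something like '512bytes'
    print(format_size(50 * 1000))  # something like '50kB'
    print(format_size(3 * 10**9))  # something like '3000.0MB' -- MB is the ceiling
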
[19:08:42] <lifeless> please wait while my machine does 1M combinations and uses my CPU all up :)
[19:10:50] <lifeless> here we go
[19:10:55] <lifeless> Hit step limit during resolving, 22493640689038530013767184665222125808455708963348534886974974630893524036813561125576881299950281714638872640331745747555743820280235291929928862660035516365300612827387994788286647556890876840654454905860390366740480.000000
[19:11:00] <lifeless> ^ openstack's dep count
[19:11:09] <dstufft> that's a big number
[19:11:12] <lifeless> yes
[19:11:20] <lifeless> def guess_scale(self):
[19:11:20] <lifeless> """Guess at how big a problem we're facing."""
[19:11:21] <lifeless> scale = 1.0
[19:11:21] <lifeless> for versions, _ in self._versions.values():
[19:11:21] <lifeless> scale *= len(versions)
[19:11:23] <lifeless> return scale
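
(A self-contained version of that pasted snippet, for readers without the surrounding class; versions_by_name here is a hypothetical stand-in for the resolver's self._versions mapping:)

    def guess_scale(versions_by_name):
        """Guess at how big a problem we're facing: the search space is
        the product of the candidate-version counts of every package."""
        scale = 1.0
        for versions in versions_by_name.values():
            scale *= len(versions)
        return scale

    # e.g. 300 packages with ~6 candidate versions each blows up fast:
    print(guess_scale({'pkg%d' % i: range(6) for i in range(300)}))
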
[19:12:31] <dstufft> this is why folks use SAT solvers, I think, yeah? because they can be smart(er) about solving the problem, but fundamentally I think it's just a hard problem
[19:13:10] <lifeless> it's NP-complete, yes
[19:13:28] <lifeless> SAT-solver mapping is a good strategy because they have lots of heuristics built in
[19:14:32] <dstufft> I think the biggest problem we'll have with a SAT solver is we can't build the full boolean problem up front
[19:14:49] <dstufft> because we have to iteratively discover the constraints
[19:14:52] <dstufft> we can't just query for them all
[19:16:44] <lifeless> well
[19:16:52] <lifeless> thats the point of the dict-like interface
[19:17:14] <lifeless> there's at least one SAT solver with an incremental API anyhow
[19:17:28] <lifeless> but you can always just throw it away, rebuild, and retry when something new is seen
[19:17:31] <dstufft> lifeless: yea
[19:17:48] <dstufft> I haven't had time to play with it yet; all of the SAT solvers I could find were C++ or some other language I didn't know
[19:17:53] <lifeless> what's another *versions in there when you're already NP
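
(A rough sketch of that throw-away-and-retry pattern: solve() stands in for a hypothetical SAT/constraint solver and discover_deps() for the metadata fetch that reveals constraints lazily; both names are invented for illustration:)

    def resolve(root_requirements, solve, discover_deps):
        """Re-run the solver from scratch whenever new constraints appear."""
        constraints = set(root_requirements)
        while True:
            solution = solve(constraints)  # assumed to raise if unsatisfiable
            discovered = set()
            for candidate in solution:
                # fetching a chosen candidate's metadata can surface
                # requirements we have never seen before
                discovered.update(discover_deps(candidate))
            if discovered <= constraints:
                return solution            # fixed point: nothing new to learn
            constraints |= discovered      # rebuild the problem and retry
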
[19:19:49] <lifeless> one of the nasty things is that any heuristic you can think of (like try most-recent first) will step up to the heat death of the universe as soon as someone says 'oh yeah, I'm working with old-release-of-X because'
[19:22:56] <dstufft> lifeless: oh, one other thing - while pip prefers (and in fact requires) that any hard dependency be pure Python, we're also totally fine with optional C speed-ups. So if this is something where we can get it fast enough for the common scenario in pure Python (and I don't think ~300 dependencies is common, nor is having many things, if any, that cap the top end), but which is slow for uncommon cases unless we use C or something as an optional dependency
[19:22:56] <dstufft> to make it faster, that's OK
[19:23:29] <dstufft> that's what we're going to do for package signing, we'll have a pure python implementation with "good enough" performance, and the option to install a C library for "yeah, I'm going to need to verify a ton of packages" performance
[19:25:03] <dstufft> (I know that doesn't change the fact that 2.4T options is just a ton of options, but if it's the difference between unbearably slow and fast enough in uncommon cases, it's an option)
[19:30:31] <lifeless> dstufft: no, it's bad I
[19:30:32] <lifeless> bad O
[19:30:38] <lifeless> like we could be 1000000 times faster
[19:30:47] <lifeless> and it's still like 100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 not enough
[19:34:12] <dstufft> lifeless: well yea, but only in really bad cases. aiui the problem here is that, in theory, we could have a set of constraints which cause us to have to cycle through all 2.4 trillion combinations, but we don't actually need to figure out all 2.4 trillion combinations, we just need to figure out one.
[19:34:31] <dstufft> the question is whether that one will be the first one we try or the 2.4 trillion-th
[19:35:17] <dstufft> I'm pretty sure that in the common case it's going to be one of the first ones - if it weren't, the very simple resolver pip has now would break in a lot more scenarios
[19:39:57] <tomprince> Also, it seems like it should be perfectly reasonable to say "that's too complicated, help me out"
[19:41:06] <dstufft> yea, if we spin our wheels for too long we can just bomb out and ask people to give us another top level specifier to work with
[19:41:16] <dstufft> Python does something similar with the recursion limit
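
(In that spirit, a toy step-limited backtracking search; all names here are invented and the real resolver work is considerably more involved. It returns the first working combination it finds, and bails out rather than grinding through trillions of them:)

    class StepLimitExceeded(Exception):
        """Too expensive to continue; ask the user for tighter specifiers."""

    def backtrack(pending, chosen, candidates_for, conflicts, budget):
        # budget is a one-element list so the countdown is shared
        # across the whole recursion
        if not pending:
            return chosen                      # one satisfying combination found
        if budget[0] <= 0:
            raise StepLimitExceeded("hit step limit during resolving")
        name, rest = pending[0], pending[1:]
        for version in candidates_for(name):   # e.g. a newest-first heuristic
            budget[0] -= 1
            attempt = dict(chosen, **{name: version})
            if not conflicts(attempt):
                result = backtrack(rest, attempt, candidates_for, conflicts, budget)
                if result is not None:
                    return result
        return None                            # dead end; caller tries the next version
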
[19:42:05] <lifeless> dstufft: so the common case resolver we have today never breaks
[19:42:10] <lifeless> dstufft: that's the point of it :)
[19:42:26] <lifeless> dstufft: it just assumes everything will work together, and folk get surprised when it doesn't
[19:42:28] <dstufft> lifeless: well it breaks, it just does it at runtime instead of install time and in confusing ways
[19:42:43] <lifeless> my point is that you can't reason from one to the other
[19:43:41] <dstufft> lifeless: some people do; there's a small number of people who raise issues or tell me about it, typically it's couched in the terms that pip is ignoring one of their specifiers
[19:44:38] <dstufft> in my experience people don't generally constrain the top end of their requirements
[19:44:39] <lifeless> dstufft: yeah - about 1/week on my count over the last couple months
[19:45:09] <dstufft> like, maybe they'll have one requirement with a constrained top end in their entire dependency set
[19:47:28] <dstufft> lifeless: I mean, openstack has some of the most complex set of requirements specifiers I've seen in the wild
[19:47:41] <dstufft> lifeless: and your PR is managing to resolve that in a reasonable time right?
[19:51:20] <lifeless> not atm
[19:51:32] <lifeless> something got released or bumped and it's hitting the bad O
[19:51:50] <lifeless> it's what prompted me to write the mail I did
[20:46:32] <dstufft> lifeless: fwiw bundler does a similar thing
[20:46:50] <dstufft> it has a resolver, but if it detects it's hitting bad O it just dies and says there might be a solution but it would take too long
[20:47:01] <dstufft> also http://www.gecode.org/ is a thing I just found, C++ again :(
[20:49:32] <dstufft> https://github.com/CocoaPods/Molinillo is used by bundler and cocoapods
[21:07:47] <lifeless> yeah
[21:07:53] <lifeless> so I have a bunch of optimisations I can do
[21:08:03] <lifeless> it's just that I know they won't be sufficient
[21:08:08] <lifeless> and a chunk of the problem is foot-gun
[21:08:18] <lifeless> so I'd like us to think about how we can undo that
[23:02:59] <qwcode> lifeless, why not start with pip at least being a "simple" fail-on-conflict resolver... one which only "backtracks" for the sake of re-walking when new constraints are found. I know you're motivated to solve OpenStack build issues, but again, most of the issues I've seen would be solved with what I mention, I think