[15:05:15] <McSinyx[m]> pradyunsg: after some (not as much as I wish) reading, I figured that handling dependency resolution is probably too much for me at the moment
[15:05:45] <McSinyx[m]> however, I still want to work on parallel downloading, and possibly building and installation
[15:08:14] <McSinyx[m]> my plan is to modularize the current collect/download codebase, then plug in aiohttp for python3, while keeping it the same for python2
[15:09:41] <McSinyx[m]> that'll only handle downloading (and maybe installation); for building we'll probably need to use threads
[15:10:04] <McSinyx[m]> I'll try to draft a proposal to make the plan clearer; my main concern is still whether I can separate the networking part from dependency resolution to make it (the networking) async
[15:11:29] <McSinyx[m]> moreover, whether parallel downloading is an actual enhancement will depend on the dependency resolution algorithm
[15:12:13] <McSinyx[m]> nevertheless, it's still something I'd consider really interesting and wish to (carefully) play with in the next few months
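(A rough sketch of the aiohttp plan above; fetch, download_all, and the example URL are hypothetical and not pip's actual download code:)

    import asyncio
    import aiohttp

    async def fetch(session, url, path):
        # Stream one file to disk in chunks so memory stays bounded.
        async with session.get(url) as resp:
            resp.raise_for_status()
            with open(path, "wb") as f:
                async for chunk in resp.content.iter_chunked(1 << 16):
                    f.write(chunk)

    async def download_all(urls_to_paths):
        # One shared session, many concurrent GETs -- the parallelism in question.
        async with aiohttp.ClientSession() as session:
            await asyncio.gather(*(fetch(session, url, path)
                                   for url, path in urls_to_paths.items()))

    # asyncio.run(download_all({"https://example.com/pkg.whl": "pkg.whl"}))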
[15:15:11] <pradyunsg> McSinyx[m]: hi there! So, I spent some time thinking about this yesterday. And...
[15:17:16] <pradyunsg> I think it would make sense to have a different, smaller-in-scope goal for the GSoC project.
[15:18:37] <pradyunsg> Like, we want to parallelize downloads in pip install, but that's the eventual goal. For now, we're going to do this other thing "X", and we'll make progress toward the eventual goal with it.
[15:19:21] <pradyunsg> In other words, don't make "pip install will do things in parallel where possible" a goal of the project proposal.
[15:20:03] <McSinyx[m]> how can I know what X is? is it by analysing the final goal for the requirements?
[15:20:47] <pradyunsg> That's the part I'm not sure about.
[15:21:20] <McSinyx[m]> I get what you're saying, but I don't get what I should do
[15:22:16] <McSinyx[m]> *pip install will do things in parallel where possible* is probably too much, but how about *pip download things in parallel where possible*?
[15:23:20] <McSinyx[m]> in your opinion, how big (approximately) should the scope be?
[15:25:07] <pradyunsg> I've not looked at pip's download code in a while... But extrapolating from my GSoC experience and picking from what you've said above, making the code modular in itself might end up being enough as a GSoC project? And, have stretch goals be the next tasks, that could be done if, like, the refactoring does in fact end up being easy.
[15:26:11] <pradyunsg> Basically, have a "core" project, that's ~8 weeks of work in your opinion, and then have stretch goals that'd probably also take ~8 weeks.
[15:26:48] <McSinyx[m]> to be honest, I'm unsure about the time I may spend working on a codebase as large as pip, so I'm not sure I can estimate that well
[15:26:54] <pradyunsg> Btw, installation might end up being much easier to parallelize than downloads.
[15:27:07] <McSinyx[m]> but refactoring the IO code seems like a good start
[15:27:31] <pradyunsg> Use your best guess for starters, we can iterate on the estimates before you submit the proposal. :)
[15:28:56] <McSinyx[m]> so I think I'll begin with a proposal on it; I'm not familiar enough with pip to be certain about thread-safety, so installation might need to go with the same async approach
[15:29:08] <McSinyx[m]> > we can iterate on the estimates
[15:30:57] <pradyunsg> IMO the, like, point of having core + stretch goals is that if the core takes longer than expected, we'd still have the time. And if it's easier than expected, there are stretch goals and we don't have to scramble to figure out what to do with the rest of the time.
[15:31:11] <pradyunsg> +1 to all that techalchemy said.
[15:31:24] <pradyunsg> BRB - need to make+eat dinner.
[15:31:43] <McSinyx[m]> either we use threads/multiprocessing and pray that the gain from parallelism will be more than what's lost to flow creation, or we can use the callback++ aka async and be more certain
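(For contrast with the async route, the same idea with a thread pool; download_one is a hypothetical stand-in for pip's per-file download routine:)

    from concurrent.futures import ThreadPoolExecutor

    def download_one(url):
        # Hypothetical per-file download; it's I/O-bound, so threads
        # release the GIL while waiting on the network.
        raise NotImplementedError

    def download_parallel(urls, workers=4):
        # The pool is created once, so thread-creation cost is paid per
        # batch rather than per download.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(download_one, urls))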
[15:31:57] <McSinyx[m]> techalchemy: but I'll try to handle the refactor first
[15:35:18] <techalchemy> but I'm guessing he won't want to maintain two separate implementations, as it could be a source of bugs
[15:36:19] <McSinyx[m]> that'll be more maintenance, you're right
[15:36:46] <techalchemy> as far as I understand it, pradyun will have to decide if he would want to have python3-only enhancements in the code; basically it'd be conditional logic that allows certain optimized implementations to only run in Python 3. I'm guessing if there's a strong case for something like that, it's more likely he and the rest of the maintainers would want to just drop Python 2 support
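(The conditional logic techalchemy describes might look something like this; the module names are made up purely for illustration:)

    import sys

    # Hypothetical dispatch: Python 3 gets the asyncio-based code path,
    # Python 2 keeps the existing sequential implementation.
    if sys.version_info >= (3, 5):
        from mypkg.download_async import download_many
    else:
        from mypkg.download_sync import download_many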
[15:37:35] <McSinyx[m]> I'll keep my near-future goal as modularization of parts of the installation process
[15:38:04] <techalchemy> It's worth looking at parallelizing it, downloading is slow for sure especially now that there is a resolver
[15:38:58] <techalchemy> pradyunsg, what's your plan for hashing? are you going to locally cache hashes of cached artifacts?
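(One way the hash caching techalchemy asks about could work, assuming a sidecar-file layout invented here for illustration:)

    import hashlib

    def file_sha256(path, bufsize=1 << 16):
        # Hash a cached artifact in fixed-size chunks instead of
        # reading the whole file into memory.
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(bufsize), b""):
                digest.update(block)
        return digest.hexdigest()

    # Hypothetical layout, so each hash is computed only once:
    #   .../cache/foo.whl        -> the artifact
    #   .../cache/foo.whl.sha256 -> output of file_sha256()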
[15:39:04] <McSinyx[m]> since the refactor still needs to be done, I believe we can have more discussion on it (py3-only enhancements) in the future
[15:39:21] <techalchemy> again parallelization doesn't need to be python 3 only
[15:40:10] <McSinyx[m]> techalchemy: as I took a peek at the (a few years old) discussion, there's a problem with thread safety for downloads
[15:40:44] <techalchemy> is there? that is not something i was aware of
[15:41:11] <McSinyx[m]> techalchemy: also I can recall a conversation about having package metadata like *nix distros; why was that (if it was) turned down?
[15:41:50] <techalchemy> McSinyx[m], erm you'll have to be a lot more specific, there have been hundreds of similar discussions resolved that way for many reasons
[15:42:32] <McSinyx[m]> I mean having the dependency chain separate from the warehouse
[15:43:02] <McSinyx[m]> I'm trying to find the download-threading-issue
[15:46:16] <techalchemy> installing and downloading are tightly coupled now
[15:47:26] <techalchemy> pradyunsg can speak a bit more about whether this is true, but I think decoupling them will be needed anyway to implement the resolver -- and maybe he already has thoughts on that topic or progress has already been made
[15:47:31] <McSinyx[m]> what do you mean by tightly coupled?
[15:48:09] <McSinyx[m]> I thought we only install after resolving?
[15:48:19] <techalchemy> well there is no resolver in pip right now
[15:48:43] <McSinyx[m]> wait there is no resolver in pip *right now*?
[15:49:00] <techalchemy> so historically when you installed a package you also downloaded and installed all of its dependencies
[15:49:08] <McSinyx[m]> I was under a really wrong assumption
[15:49:40] <techalchemy> there is something called 'Resolver' if you read the code, but it doesn't perform any backtracking (maybe it does now on the master branch, not sure how far along pradyun and TP are on that)
[15:49:41] <McSinyx[m]> (so what do you suggest for me to do to support such a resolver)
[15:50:06] <techalchemy> well they are already implementing the resolver, is what i'm saying
[15:50:32] <techalchemy> and as a part of that process I believe they will need to decouple download/build/install steps
[15:50:42] <McSinyx[m]> I don't get it, does that mean, effectively, the resolver is not used?
[15:50:43] <techalchemy> (but I could be wrong, so we will need pradyun to clarify)
[15:51:03] <techalchemy> McSinyx[m], it means that it wasn't written yet the last time a version of pip was released
[15:51:30] <McSinyx[m]> or, in other words, refactoring the other processes is a must for that goal
[15:51:54] <McSinyx[m]> I'll come back tomorrow, but do I have your recommendation on the refactor part?
[15:52:39] <McSinyx[m]> i.e. modularizing (not literally) download, build, install
[15:53:44] <McSinyx[m]> I still have a lot to read to understand the current situation, which would really help; and a direction can guide me to the more urgent parts that need to be touched
[15:56:41] <techalchemy> for that pradyun can provide the most information :p he has the best idea of what is needed
[15:57:44] <McSinyx[m]> I'll try to use the input from this discussion then
[21:12:16] <pradyunsg> I'm not sure the installation and download are coupled. I mean, yea, they're both using InstallRequirement but that's just the object storing state. The installation step is basically completely separate from the dependency resolution step.
[21:13:13] <pradyunsg> Also, I wouldn't say pip doesn't have a resolver rn, it just has a *very* optimistic one, that just assumes that the first thing it picks is correct. 🙃
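(To make the contrast concrete, a toy version of both strategies; candidates_for and conflicts are hypothetical helpers, and recursing into each candidate's own dependencies is omitted:)

    def resolve_optimistic(requirements, candidates_for):
        # Roughly pip's current behaviour: pin the first matching
        # candidate for each requirement and never revisit the choice.
        return {req: candidates_for(req)[0] for req in requirements}

    def resolve_backtracking(requirements, candidates_for, conflicts, pinned=None):
        # Try candidates in order; on conflict, drop the pin and try the next.
        pinned = dict(pinned or {})
        if not requirements:
            return pinned
        req, rest = requirements[0], requirements[1:]
        for candidate in candidates_for(req):
            if not conflicts(candidate, pinned):
                new_pinned = dict(pinned)
                new_pinned[req] = candidate
                result = resolve_backtracking(rest, candidates_for,
                                              conflicts, new_pinned)
                if result is not None:
                    return result
        return None  # no consistent set of pins exists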
[21:32:49] <cooperlees> I should pay more attention here
[21:33:00] <cooperlees> Hello all - How are we? Bummed I don't get to see a lot of you this year!
[21:48:52] <cooperlees> Maybe that's PEP worthy - Implementing custom retention policies for PyPI ... e.g. a user could choose to have X months of packages - e.g. cooper-nightly only has 6 months and then cooper has the real versions we keep in perpetuity
[21:54:16] <techalchemy> well, it would be kind of against the purpose of a package index, but how much of what they put in their package is actually required to run their code?
[22:00:37] <cooperlees> I find using PyPI for your nightlies a bit of an abuse ...
[22:39:14] <toad_polo> On the other hand, I've got more than 6TB of hard drive space in my living room.
[22:39:25] <toad_polo> That's maybe $600 worth of hard drives if you want redundancy and whatnot.
[22:42:28] <toad_polo> I imagine the bandwidth costs would be more significant (if it weren't donated), assuming these are popular packages, which I guess isn't that affected by nightlies.
[22:46:58] <pradyunsg> Yea, I don't think the concern is the storage really. It's the bandwidth.
[22:47:53] <pradyunsg> We really need to chart out a path to that magical metadata 2.0. ;)
[22:50:54] <techalchemy> that'd be nice... just so much is involved in anything substantial
[22:55:02] <toad_polo> I don't think expiring nightlies would help with bandwidth.
[22:56:27] <toad_polo> I think a lot of their bandwidth is that they stuff the wheels with a bunch of gpu/tpu specific builds, and a more granular metadata spec for wheels might help make each package smaller.
[22:57:12] <toad_polo> Though maybe they are shipping huge models, too, and the only solution is to tell them to break it up or distribute the models separately.
[22:57:51] <pradyunsg> Do what nltk does? Have your data outside the package. 🤷🏻‍♂️
[22:58:41] <techalchemy> from above > "but how much of what they put in their package is actually required to run their code"
[22:59:42] <pradyunsg> Hardware-specific specs would be useful, yea.
[23:00:41] <pradyunsg> Like, hey, get me the stuff that's optimized for my CPU arch without needing me to type anything other than "pip install numpy" would be nice.
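(For a sense of what standardizing this would touch: wheel selection today is driven purely by the interpreter/ABI/platform tags that the packaging library exposes, and none of them encodes CPU features:)

    from packaging.tags import sys_tags

    # The tags pip matches wheel filenames against on this interpreter.
    for tag in list(sys_tags())[:5]:
        print(tag)  # e.g. cp38-cp38-manylinux2014_x86_64

    # There is no slot for CPU features (AVX2, GPU variants, ...), so
    # "optimized for my arch" would mean standardizing a new tag axis.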
[23:00:58] <pradyunsg> It's just, also a lot to standardize. :(
[23:01:35] <pradyunsg> And I personally don't know enough to even make a judgment of whether that is something we can standardise.
[23:02:14] <techalchemy> it also pushes a lot of build complexity onto the repository or the distributor, which isn't a great developer experience either
[23:02:45] <techalchemy> potentially it's not even really possible
[23:03:07] <techalchemy> but like you, I don't really know