PyPI, pip, and external repositories

By Jake Edge
May 29, 2014

A debate about Python modules—and where and how they are hosted—raged in early May on two separate Python mailing lists. There are a number of interrelated issues that make up the debate, but the core question seems to be: should the now-default pip package manager treat the "official" module repository differently than other repositories? Some see "external modules"—those not hosted at the Python Package Index (PyPI)—as a potential reliability problem, while others don't see much difference between external and PyPI-hosted modules.

Stefan Krah fired the opening salvo with a rather cryptic message to the python-dev mailing list that referred to a new pip error message that came from trying to install his cdecimal package. He explained further in another message, complaining that externally hosted packages "are being singled out unfairly". There is no curation of the PyPI packages, so malicious packages could certainly end up there. It is just as risky, from a security standpoint, to install packages from PyPI as it is from an external location, he said, so why warn only for the latter?

But Donald Stufft, who works on pip, said that the warning is not about security, but is, instead, about reliability. Each external host required for packages that are needed to deploy a Python application, for example, adds a single point of installation failure. So, when a user tries to install an external package, they get a warning, rather than getting the package installed. The warning is meant to tell them that the server could be down. However, the message ("cdecimal an externally hosted file and may be unreliable") may be worded poorly, Stufft admitted. In fact, he changed the message to say: "%s is an externally hosted file and access to it may be unreliable".

That didn't end the discussion, though. For one thing, pip doesn't even check to see if the file is available before issuing the warning and exiting. That may leave users a bit unhappy, as R. David Murray noted:

And once you're at that point, as a user I'm going to grumble, "Well, why the heck didn't you just try?", as I figure out how to re-execute the command so that it does try.

Using either --allow-external pkgname or --allow-all-external will install the package even if it is hosted externally (assuming the server is up), but only if there is some way to verify the package contents (e.g. an MD5 or SHA-2 family hash associated with the file). Overriding that check is possible too, of course, using the --allow-unverified option. The hash values can be directly associated with the download URL in the PyPI entry for the package, which allows pip to verify the contents before installing the package. But that is just what Krah had set up for cdecimal when he got the error. So pip could reasonably download, verify, and install the package, and instead it warns the user and exits.
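As a rough illustration of the verification step described above, the sketch below pulls a hash out of a download URL's fragment and checks the downloaded bytes against it. It assumes a fragment of the form `#<algorithm>=<hexdigest>`; the function names, host, and file name are hypothetical, and this is not pip's actual code.

```python
import hashlib
from urllib.parse import urlparse


def expected_hash(url):
    """Return (algorithm, hexdigest) from a fragment like '#sha256=ab12...', or None."""
    fragment = urlparse(url).fragment
    if "=" not in fragment:
        return None
    algorithm, _, digest = fragment.partition("=")
    if algorithm not in hashlib.algorithms_available:
        return None  # unknown algorithm: treat the link as unverifiable
    return algorithm, digest.lower()


def verify_download(url, content):
    """Return True only if the URL carries a hash and the content matches it."""
    expected = expected_hash(url)
    if expected is None:
        # No usable hash: roughly the situation that an "allow unverified"
        # style override would be needed for.
        return False
    algorithm, digest = expected
    return hashlib.new(algorithm, content).hexdigest() == digest


# Hypothetical usage: the URL and payload are made up for illustration.
payload = b"example package bytes"
url = "https://example.invalid/pkg-1.0.tar.gz#sha256=" + hashlib.sha256(payload).hexdigest()
print(verify_download(url, payload))
```

Because the hash lives in the index entry rather than on the external host, a file that was altered on that host (or in transit) would fail the comparison.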

Some wondered why developers choose not to host their packages at PyPI. Originally, PyPI had no hosting, and was, as its name indicates, simply an index of packages. There was also a time when PyPI itself suffered from reliability problems. These days, though, 92% of projects that are listed at PyPI are also hosted there. Stufft, who gathered those statistics on PyPI packages, would like to see that rise to 100%, though that seems unlikely. His secondary goal of 100% verifiable packages (i.e. all having associated hash values) should be more plausible.

There are a number of reasons that a project might not want to host its package on PyPI. PEP 438, which outlines the plan to make pip installs more robust by using hashes (among other things), lists some of those reasons, as does a post from Marc-Andre Lemburg. They range from companies not allowing uploads to foreign servers to concern about export-controlled packages (e.g. cryptographic packages from PyPI's US-based servers) to wanting more detailed download statistics. There have also been concerns about the terms for "Third-Party Content" that govern packages uploaded to PyPI.

Lemburg is concerned that there is a push to have all optional Python packages hosted at PyPI. There are good reasons for some not to do so: "Accordingly, we should respect those reasons [and] make it possible for Python packages to live elsewhere, without having our tools put those packages into a bad light or making it harder for Python users to install such packages than needed." But Stufft is unconvinced that any of the reasons listed are truly sensible. There is a small minority of people who want to host externally, primarily Krah and Lemburg, Stufft said, while the majority are "glad that pip and PyPI are moving towards a more reliable model".

The conversation moved, at least to some extent, to the distutils-sig mailing list after several requests from Nick Coghlan. That mailing list reaches more of the developers who would be making decisions about pip and PyPI. So Paul Moore summarized the debate in a post there. In a reply to Stufft later in that thread, he explained his worry about pip and external packages:

I'm genuinely concerned here that I'm missing a glaringly obvious reason why off-PyPI safe files are such a bad thing. You (and Nick, and the authors of PEP 438) seem to be willing to accept a lot of negative feeling and user unhappiness to defend making pip a PyPI-only-by-default tool. I'd much rather that PyPI stand on its own merits (which are many and compelling) rather than need a "use us or pip will make your life inconvenient" crutch, which is what the current behaviour feels like.

In a lengthy reply, Stufft listed a half-dozen reasons why "safe" externally hosted files are problematic. In addition, he showed that the number of packages that are safely hosted externally is tiny, far less than 1% of packages listed on PyPI. His arguments were compelling, at least to Moore. But there is a non-technical problem too, as noted by Coghlan:

We currently have two core developers (Stefan Krah & Marc-Andre Lemburg) that are *very* unhappy with the way pip is evolving, because they favour the use of external hosting over uploading their packages to PyPI. While that is a minority opinion in the Python community at large, it still represents a significant proportion of the core developers that actually pay much attention to packaging issues.

Stufft's efforts are making things worse for Krah and Lemburg (and any others who are trying to externally host their packages), Coghlan continued, which may make it more difficult for other Python Packaging Authority (PyPA) initiatives down the road. PyPA has been delegated the authority to handle packaging issues for the language; Coghlan, Stufft, and Moore are all PyPA participants.

That led to a discussion about cleaning up how pip and PyPI handle external files. It also led Stufft to propose PEP 470, which explicitly handles indexes that can be added to PyPI and read by pip to do verifiable installations from other hosts. In Coghlan's words, the proposed mechanism "is more general, more explicit, easier to implement, much easier to explain and has much cleaner failure modes" than the existing scheme that involves "link spidering" from the URLs provided in the PyPI entry. It also appears to largely meet with Lemburg's approval.

PEP 470 makes PyPI look more like the Linux distribution repository model. Additional indexes are added like repositories are added to Yum or Apt and pip will consult those indexes when it is looking for Python packages to install. It also makes the use of external (i.e. non-PyPI) resources more explicit. When a package is found in PyPI, but its files are not hosted there, the command to add the index can be presented to the user. Administrators can also set up pip configurations that are pre-loaded with all of the required indexes so that externally hosted packages are simply picked up as if they were available at PyPI.
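The multi-index model can be pictured with a small simulation. This is only a sketch under stated assumptions: the index URLs, the data layout, and the function names are all hypothetical, and it does not reproduce pip's real resolution logic or its messages. It shows the two behaviors described above: files are collected from every configured index, and a project whose files live elsewhere produces a hint naming the index to add.

```python
def candidates(name, indexes):
    """Collect (index_url, filename) pairs for `name` from every configured index.

    `indexes` maps an index URL to a dict of project name -> list of file names.
    """
    found = []
    for index_url, listing in indexes.items():
        for filename in listing.get(name, []):
            found.append((index_url, filename))
    return found


def install_hint(name, indexes, advertised):
    """Return None if some configured index has files, else a message for the user.

    `advertised` maps a project name to the external index its entry points at
    (a hypothetical stand-in for the information stored with the project).
    """
    if candidates(name, indexes):
        return None
    external = advertised.get(name)
    if external is None:
        return f"{name}: no files found"
    return f"{name}: files are hosted at {external}; add that index to install"


# Hypothetical configuration: one primary index, one project hosted elsewhere.
indexes = {"https://primary.invalid/simple": {"requests": ["requests-1.0.tar.gz"]}}
advertised = {"cdecimal": "https://external.invalid/simple"}
print(install_hint("cdecimal", indexes, advertised))
```

Once an administrator adds the external index to `indexes`, the same lookup succeeds with no hint, which mirrors the pre-loaded configuration case.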

Seemingly, in the end, it is a nice compromise solution. There are likely to always be reasons that some packages need to be hosted outside of PyPI (the US export-control laws if no other), so finding a way to include them that doesn't turn those packages into second-class Python citizens is important. PEP 470 also cleanly puts the handling of these external repositories in the hands of users, with an easy-to-explain model that may already be familiar from using Linux. It took a bit to get there (several hundred messages), but the outcome would seem to be worth it.




PyPI, pip, and external repositories

Posted May 29, 2014 22:42 UTC (Thu) by sjj (guest, #2020) [Link] (9 responses)

Sounds like a good plan. The Linux distribution model has been refined pretty well (although I wish there was a way to force limit external repos to some part of the directory tree, i.e. the Chrome repository could only write to /opt/chrome etc).

Having started taking baby steps in the Ruby world recently, I wince every time something tells me to install simply with curl http://a.random.web.site/script.sh | sudo bash. It just feels wrong. And "just install this gem from this site here".

And after declaring .emacs bankruptcy and trying to get up to speed with the latest hotness, I find a tool like el-get offers to download and install packages from emacswiki, which anyone can edit. Err, really?

Even those are somewhat saner than wanting to try some alternate Android builds. Download from a link on xda-developers or some guy's flaky little webserver? Yeah, fills me with confidence.

So many wheels, so many reinventions.

PyPI, pip, and external repositories

Posted May 30, 2014 4:36 UTC (Fri) by raven667 (guest, #5198) [Link] (7 responses)

Those are good points but I will also add that just because you got something signed from the official distro doesn't mean that the code has been audited or is any less buggy than some random thing you found on a forum. Certainly it is harder to sneak malware through the distro process but we should have a realistic idea of how much this process is really buying us and not overstate it.

PyPI, pip, and external repositories

Posted May 30, 2014 8:42 UTC (Fri) by tzafrir (subscriber, #11501) [Link] (6 responses)

No. But it means it is the exact code that came from the distribution and wasn't tampered with along the way.

Different repositories may have different policies as to what they include. A repository may be Joe Random Developer's little repo. You can trust it if you trust Joe.

PyPI, pip, and external repositories

Posted May 30, 2014 13:48 UTC (Fri) by cabrilo (guest, #72372) [Link] (5 responses)

To me, the main difference between distribution and third party repositories is that:

1) I can count on distribution repositories being well tested and working well together.

2) I can rely on the package being supported for some time. A third party packager may just give up at any time.

Honestly, security and potential of malware is not that high on the list - I always do some basic sanity checks on any new repo I add to my apt sources list - I don't just install the first thing I find on Google.

The post mentions the same concerns: that it's not only about security, it's about reliability as well.

PyPI, pip, and external repositories

Posted May 30, 2014 16:17 UTC (Fri) by southey (guest, #9466) [Link] (4 responses)

If what you mean here are the Linux distribution repositories, then that is actually inaccurate. Since most users get their Python from Linux distributions or from the Python site, there is no guarantee that PyPI packages will work with it. It may actually mess up the standard install as only the Linux distribution or PyPI can manage the many packages provided by the Linux installation. Python versions do break packages for various reasons, such as using new functions, and for incompatibility reasons due to how your Python was compiled (like incompatible libraries). Third party repositories (i.e., the developer sites) usually react faster to those breakages as well as to bug fixes. This is a real reliability problem that this PyPI approach has yet to address - in fact it seems that it is being totally ignored.

To me this security and reliability is just pure illusion. The only thing that PyPI can say is that you downloaded that particular package from PyPI and you have zero guarantee that this package is secure and reliable!

PyPI, pip, and external repositories

Posted May 30, 2014 19:41 UTC (Fri) by dstufft (guest, #93456) [Link] (3 responses)

I'm not entirely sure what you're trying to state here but I suspect the problem is that "secure" and "reliable" are not binary properties.

Security wise PyPI and pip both (currently) operate under a trust model where the author of a project you're attempting to install is assumed to be trusted. This is fairly reasonable as in all likelihood the thing you're about to download you're planning on executing so you need to at least trust them enough to execute arbitrary Python on your machines. Thus when you see me, or anyone who actually groks the threat model that applies, use the word "secure", it refers to protecting against things like Man in the Middle attacks, compromised PyPI accounts, or even PyPI itself being compromised. At times changes are added which also help protect against a malicious author, but those are generally not a primary focus at this point in time, and likely never will be given that PyPI is not, and will never be, a curated index.

As far as reliability goes, when I speak to that it's strictly about "when I attempt to download this file, how likely is it that the file will be available for me to download". It's basic math/logic that, given that files not hosted on PyPI still rely on PyPI to tell the installer where to locate them, no matter what PyPI is going to be your SPOF. So attempting to host files off of PyPI can, at best, have no impact on availability and will generally have at least some negative impact since it's unlikely that a server other than PyPI will have 100% uptime. Additionally PyPI is slowly being architected to remove as many SPOFs in its own system as possible so that its own downtime is minimized.
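The "basic math" here can be sketched in a few lines. The assumptions are that an install needs every server in the chain to be up and that the servers fail independently; the uptime figures are invented for illustration and are not measurements of PyPI or of any real host.

```python
def combined_availability(*availabilities):
    """Probability that every required server is up, assuming independent failures.

    Each argument is a fraction between 0.0 and 1.0 (e.g. 0.999 for 99.9% uptime).
    """
    result = 1.0
    for availability in availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0.0 and 1.0")
        result *= availability
    return result


# Hypothetical figures: the index alone versus the index plus one external host.
index_only = combined_availability(0.999)
index_plus_external = combined_availability(0.999, 0.99)
print(index_only, index_plus_external)
```

With these made-up numbers the chain drops from 99.9% to roughly 98.9%. The product can never exceed the index's own figure, and it only equals it when the external host is up 100% of the time.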

PyPI, pip, and external repositories

Posted May 31, 2014 15:15 UTC (Sat) by drag (guest, #31333) [Link] (2 responses)

> Security wise PyPI and pip both (currently) operate under a trust model where the author of a project you're attempting to install is assumed to be trusted.

Yes. And it largely works the same way with Linux distributions.

Sure the packages are signed (for many distribution packaging schemes), but that's only useful at making sure that from the time the package was uploaded to the online repository and then downloaded to your computer that it has not been modified.

That is just one of thousands of different problems that may occur, unfortunately. The package maintainer's own desktop could be compromised. His keys could be compromised. Some other random key your system is set up to trust implicitly may be compromised. When they downloaded the tarball from the upstream developer they could have been subjected to a 'MITM' attack. The git repo they pull from may have been compromised. The upstream developer may be untrustworthy. The upstream developer's system may have been compromised. etc etc etc.

Package signing really only provides the slightly equivalent 'security assurance' as going to the upstream developer's website and making sure that the HTTPS certificate is good before downloading and compiling code from a tarball yourself. If the upstream developer provides PGP signatures of checksums for tarballs uploaded to his website (or something equivalent) then it can be argued that that approach is significantly more secure than depending on distribution repositories. It just has less chance of something going wrong.

PyPI, pip, and external repositories

Posted May 31, 2014 15:24 UTC (Sat) by dstufft (guest, #93456) [Link] (1 responses)

> Yes. And it largely works the same way with Linux distributions.

Sort of, the key difference is that the Debian project (for instance) vets the people who are able to sign packages for you. So you can transfer the responsibility of deciding "do I trust this person" to someone else vs PyPI where you have to make that distinction for yourself. I mostly bring that up because a lot of people will see a change to "secure" something related to PyPI/pip and then will decry that it's impossible to secure it because anyone can upload to PyPI.

> Package signing really only provides the slightly equivalent 'security assurance' as going to the upstream developer's website and making sure that the HTTPS certificate is good before downloading and compiling code from a tarball yourself.

Not exactly. The difference with package signing itself is that it's not transport level. That means that I can take a package that I have gotten from anywhere, PyPI, a mirror, from some random person at a conference, and verify its integrity. TLS relies on you trusting the source of the file, whereas package signing requires you trust the *author* of the package. The other thing with TLS is that it doesn't protect you against malicious actors with bogus certificates. IOW it forces you to also trust other random governments, nation states, and corporations that you may or may not actually want to trust (the fact that it's unavoidable to do so and interact with the web is unfortunate).

PyPI, pip, and external repositories

Posted May 31, 2014 17:32 UTC (Sat) by raven667 (guest, #5198) [Link]

The kind of security this approach brings is accountability, where you can track where each bit came from and who is responsible for it. Accountability is often the strongest kind of security, but it is personal and political and not technical, so it doesn't guarantee the lack of security exploits but it does, over time, allow for selective pressure.

So crypto and signing can help with this, especially if the code changes hands or for high-value code repositories, but fundamentally even if I download some random code in the clear from Jane's Website then it is Jane Programmer who is still held accountable for the quality of that code and is responsible for their website not getting taken over, for example.

So reputation is the strong security protection, crypto is just a technology to help back that up, but crypto alone is not the source of security.

PyPI, pip, and external repositories

Posted May 31, 2014 15:00 UTC (Sat) by drag (guest, #31333) [Link]

> The Linux distribution model has been refined pretty well

It also has become somewhat useless for a huge number of use cases. It's fantastic for 'core' OS stuff and managing updates and fixes for issues that affect it, but nowadays I rely mostly on custom (which means pain) and third party solutions for much of the software I use (like pip).

The most important and biggest benefit to things like 'pip' is that they are distribution agnostic.

> Having started taking baby steps in the Ruby world recently, I wince every time something tells me to install simply with curl http://a.random.web.site/script.sh | sudo bash. It just feels wrong.

Yes, this is absolutely vile.

> And "just install this gem from this site here".

This isn't really so bad as long as you can easily uninstall it. Although I am not a ruby developer so I may be missing something.

> And after declaring .emacs bankruptcy and trying to get up to speed with the latest hotness, I find a tool like el-get offers to download and install packages from emacswiki, which anyone can edit. Err, really?

The newer built-in emacs package management stuff is actually pretty good. The 'ELPA, MELPA, Marmalade' type repositories work out well even though you can't really automate upgrades. Emacs requires a lot of hand-holding as weirdness and incompatibilities pop up from time to time, but that's the nature of Emacs.

It's a very significant improvement over trying to do a combination of distro-provided emacs packages and then tarballs or whatever that I tried using in the past.

Now I still depend on Fedora to provide me the base Emacs package, but I go out of my way to make sure that no other emacs packages get installed. (same thing with perl/cpanm and python/pip). What works out well for the distro packagers has only a random chance of working out well for what I want.

This approach has made life so much saner and easier for me on Linux/perl/python/emacs.

Nowadays along with the git repository I use to manage my 'dot' home configurations on multiple systems I have a basic set of shell scripts that bootstrap the 'virtenv' type setups I use for managing the software I use in my home directory.

> So many wheels, so many reinventions.

That's because the wheels we have from distributions are triangles.

PyPI, pip, and external repositories

Posted Jun 1, 2014 12:08 UTC (Sun) by kleptog (subscriber, #1183) [Link] (1 responses)

The only reason why I host any Python packages myself is because authors have a habit of deleting old versions from PyPI, which is terrible if you're trying to make a reproducible build process. So you run a local PyPI cache essentially so once you've downloaded it once you don't need to worry about it vanishing. I'll upgrade when I want, not when the author decides.

Unfortunately I haven't found a really easy to use tool here (like apt-proxy for example) but it certainly keeps my buildfarm green.

PyPI, pip, and external repositories

Posted Jun 2, 2014 2:19 UTC (Mon) by ras (subscriber, #33059) [Link]

> In a lengthy reply, Stufft listed a half-dozen reasons why "safe" externally hosted files are problematic.

None of those looked like good reasons to me. People can make their own evaluation of whether a repository is fast or reliable enough. They don't need someone else to spam them with their opinion every time they use pip.

Debian doesn't see any need to warn you that you are using an unknown repository. It *does* warn you that you are installing a package from an author or distribution you haven't explicitly told it you trust - which is IMO far more important. And something the current python infrastructure doesn't do. SSL is not an effective substitute. It moves trust from keys I explicitly install to the X509 PKI infrastructure. The X509 PKI infrastructure is broken - probably at least once per second, by organisations doing MITM attacks.

Debian gets other benefits from using a distribution system defined by a protocol, as opposed to the social authority ("the official distribution shall be pypi, there shall be no others") being championed here. It is easily forked. Maybe making its infrastructure easy to replace both technically and practically does weaken Debian in the short term - it certainly felt threatened by Ubuntu for a while there. But in the long term having Debian derived distributions has made Debian stronger. Invariably what has happened is the derived distributions have done high risk experiments whose outcome is hard to predict, and which, if they had gone wrong, could have damaged Debian. But once seen to work they can be safely incorporated, which means being easy to replicate allows Debian to evolve faster over the long term.

Open source's encouragement of forking and experimentation is a core advantage it has over other development models. Python has certainly benefited from it, with offshoots like Pypy and ipython-notebook making it stronger. It seems odd that Python is considering abandoning such an advantage for its distribution architecture.


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds