PyPI, pip, and external repositories
A debate about Python modules—and where and how they are hosted—raged in early May on two separate Python mailing lists. There are a number of interrelated issues that make up the debate, but the core question seems to be: should the now-default pip package manager treat the "official" module repository differently than other repositories? Some see "external modules"—those not hosted at the Python Package Index (PyPI)—as a potential reliability problem, while others don't see much difference between external and PyPI-hosted modules.
Stefan Krah fired the opening salvo with a rather cryptic message to the python-dev mailing list that referred to a new pip error message that came from trying to install his cdecimal package. He explained further in another message, complaining that externally hosted packages "are being singled out unfairly". There is no curation of the PyPI packages, so malicious packages could certainly end up there. It is just as risky, from a security standpoint, to install packages from PyPI as it is from an external location, he said, so why warn only for the latter?
But Donald Stufft, who works on pip, said that the warning is not about security, but is, instead, about reliability. Each external host required for packages that are needed to deploy a Python application, for example, adds a single point of installation failure. So, when a user tries to install an external package, they get a warning, rather than getting the package installed. The warning is meant to tell them that the server could be down. However, the message ("cdecimal an externally hosted file and may be unreliable") may be worded poorly, Stufft admitted. In fact, he changed the message to say: "%s is an externally hosted file and access to it may be unreliable".
That didn't end the discussion, though. For one thing, pip doesn't even check to see if the file is available before issuing the warning and exiting. That may leave users a bit unhappy, as R. David Murray noted:
Using either --allow-external pkgname or --allow-all-external will install the package even if it is hosted externally (assuming the server is up), but only if there is some way to verify the package contents (e.g. an MD5 or SHA-2 family hash associated with the file). Overriding that check is possible too, of course, using the --allow-unverified option. The hash values can be directly associated with the download URL in the PyPI entry for the package, which allows pip to verify the contents before installing the package. But that is just what Krah had set up for cdecimal when he got the error. So pip could reasonably download, verify, and install the package, and instead it warns the user and exits.
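As a rough illustration of the verification step described above, the sketch below checks downloaded bytes against a hash carried in the fragment of the download URL (for example, a link ending in #sha256=...). The function names and the example URL are hypothetical and this is not pip's actual code; it only assumes that the hash is attached to the link in that form.

```python
import hashlib
from urllib.parse import urlparse

# Hash names this sketch accepts; an illustrative allow-list, not pip's.
SUPPORTED = {"md5", "sha1", "sha224", "sha256", "sha384", "sha512"}


def expected_hash(url):
    """Return (algorithm, hex_digest) from a fragment like '#sha256=ab12...'.

    Returns None when the URL carries no usable hash, which is the
    "unverifiable" case discussed in the text.
    """
    fragment = urlparse(url).fragment
    algorithm, sep, digest = fragment.partition("=")
    if not sep or algorithm not in SUPPORTED or not digest:
        return None
    return algorithm, digest.lower()


def is_verified(url, data):
    """True only if the URL names a hash and the data matches it."""
    expected = expected_hash(url)
    if expected is None:
        return False
    algorithm, digest = expected
    return hashlib.new(algorithm, data).hexdigest() == digest


if __name__ == "__main__":
    payload = b"pretend this is a source tarball"
    # Hypothetical external download link with an attached hash.
    link = ("https://example.invalid/pkg-1.0.tar.gz#sha256="
            + hashlib.sha256(payload).hexdigest())
    print(is_verified(link, payload))        # contents match the hash
    print(is_verified(link, b"tampered"))    # contents do not match
```

A link without a hash fragment is treated as unverifiable here, which corresponds to the case where a user would have to opt in with --allow-unverified.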
Some wondered why developers choose not to host their packages at PyPI. Originally, PyPI had no hosting, and was, as its name indicates, simply an index of packages. There was also a time when PyPI itself suffered from reliability problems. These days, though, 92% of projects that are listed at PyPI are also hosted there. Stufft, who gathered those statistics on PyPI packages, would like to see that rise to 100%, though that seems unlikely. His secondary goal of 100% verifiable packages (i.e. all having associated hash values) should be more plausible.
There are a number of reasons that a project might not want to host its package on PyPI. PEP 438, which outlines the plan to make pip installs more robust by using hashes (among other things), lists some of those reasons, as does a post from Marc-Andre Lemburg. They range from companies not allowing uploads to foreign servers to concern about export-controlled packages (e.g. cryptographic packages from PyPI's US-based servers) to wanting more detailed download statistics. There have also been concerns about the terms for "Third-Party Content" that govern packages uploaded to PyPI.
Lemburg is concerned that there is a push to have all optional Python packages hosted at PyPI. There are good reasons for some not to do so: "Accordingly, we should respect those reasons [and] make it possible for Python packages to live elsewhere, without having our tools put those packages into a bad light or making it harder for Python users to install such packages than needed." But Stufft is unconvinced that any of the reasons listed are truly sensible. There is a small minority of people who want to host externally, primarily Krah and Lemburg, Stufft said, while the majority are "glad that pip and PyPI are moving towards a more reliable model".
The conversation moved, at least to some extent, to the distutils-sig mailing list after several requests from Nick Coghlan. That mailing list reaches more of the developers who would be making decisions about pip and PyPI. So Paul Moore summarized the debate in a post there. In a reply to Stufft later in that thread, he explained his worry about pip and external packages:
In a lengthy reply, Stufft listed a half-dozen reasons why "safe" externally hosted files are problematic. In addition, he showed that the number of packages that are safely hosted externally is tiny, far less than 1% of packages listed on PyPI. His arguments were compelling, at least to Moore. But there is a non-technical problem too, as noted by Coghlan:
Stufft's efforts are making things worse for Krah and Lemburg (and any others who are trying to externally host their packages), Coghlan continued, which may make it more difficult for other Python Packaging Authority (PyPA) initiatives down the road. PyPA has been delegated the authority to handle packaging issues for the language; Coghlan, Stufft, and Moore are all PyPA participants.
That led to a discussion about cleaning up how pip and PyPI handle external files. It also led Stufft to propose PEP 470, which explicitly handles indexes that can be added to PyPI and read by pip to do verifiable installations from other hosts. In Coghlan's words, the proposed mechanism "is more general, more explicit, easier to implement, much easier to explain and has much cleaner failure modes" than the existing scheme that involves "link spidering" from the URLs provided in the PyPI entry. It also appears to largely meet with Lemburg's approval.
PEP 470 makes PyPI look more like the Linux distribution repository model. Additional indexes are added like repositories are added to Yum or Apt and pip will consult those indexes when it is looking for Python packages to install. It also makes the use of external (i.e. non-PyPI) resources more explicit. When a package is found in PyPI, but its files are not hosted there, the command to add the index can be presented to the user. Administrators can also set up pip configurations that are pre-loaded with all of the required indexes so that externally hosted packages are simply picked up as if they were available at PyPI.
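The repository-style lookup described above can be modeled with a small toy sketch. Everything in it is an assumption made for illustration: the index URLs, the in-memory tables standing in for real index servers, and the function name are all hypothetical, and the hint text simply reuses pip's existing --extra-index-url option rather than reproducing whatever message pip would actually print.

```python
# Toy model of multi-index lookup; no network access is performed.

def resolve(package, configured_indexes, index_contents, external_index_for):
    """Decide what an installer could do for `package`.

    configured_indexes: index URLs the user has enabled, in search order.
    index_contents:     maps index URL -> set of package names hosted there.
    external_index_for: maps package name -> external index URL advertised
                        in the main index entry (hypothetical data).
    """
    # First, look for the package in every index the user has configured.
    for index in configured_indexes:
        if package in index_contents.get(index, set()):
            return ("install", index)

    # Known to the main index but hosted elsewhere: suggest adding the index.
    external = external_index_for.get(package)
    if external is not None:
        hint = "pip install --extra-index-url {} {}".format(external, package)
        return ("hint", hint)

    return ("not-found", None)


if __name__ == "__main__":
    main_index = "https://main-index.invalid/simple/"       # hypothetical
    ext_index = "https://packages.example.invalid/simple/"  # hypothetical
    contents = {main_index: {"requests"}, ext_index: {"cdecimal"}}
    advertised = {"cdecimal": ext_index}

    # Only the main index is configured: the user gets a hint.
    print(resolve("cdecimal", [main_index], contents, advertised))
    # An administrator has pre-configured the external index as well.
    print(resolve("cdecimal", [main_index, ext_index], contents, advertised))
```

In the second call the externally hosted package is found as if it were on the main index, which is the pre-loaded configuration case mentioned in the text.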
Seemingly, in the end, it is a nice compromise solution. There are likely to always be reasons that some packages need to be hosted outside of PyPI (the US export-control laws if no other), so finding a way to include them that doesn't turn those packages into second-class Python citizens is important. PEP 470 also cleanly puts the handling of these external repositories in the hands of users, with an easy-to-explain model that may already be familiar from using Linux. It took a bit to get there (several hundred messages), but the outcome would seem to be worth it.
