NumPy Security roadmap proposal #29178

Open
@rgommers

We haven't made many major changes to how we deal with security for the NumPy project in a few years. The last significant changes were (a) requiring 2FA for everyone with repo admin privileges or PyPI access, (b) using hashes on all GitHub Actions, and (c) adding OpenSSF scorecards and addressing some of the issues they turned up. The security section of our roadmap is short, and currently says:

NumPy is quite secure - we get only a limited number of reports about potential vulnerabilities, and most of those are incorrect. We have made strides with a documented security policy, a private disclosure method, and maintaining an OpenSSF scorecard (with a high score). However, we have not changed much in how we approach supply chain security in quite a while. We aim to make improvements here, for example achieving fully reproducible builds for all the build artifacts we publish - and providing full provenance information for them.

The supply chain part is critical there, and the world has changed in a few significant ways:

  • The threat levels for supply chain security have gone up continuously. Examples:
    • This blog post (Jan 2024) documents an extensive supply chain attack on PyTorch.
    • This post (Mar 2025) documents how tj-actions/changed-files, used in >23,000 repositories, was compromised.
    • Tooling to automatically exploit configuration issues for GitHub Actions is commonly available, e.g. gato-x.
  • Government regulation is starting to appear. Examples:
    • A 2021 US Executive Order is mandating the use of SBOMs and documenting of processes in certain contexts.
    • The EU's Cyber Resilience Act, adopted in Oct 2024 and coming into effect in 2027, has a lot of requirements. Open source is in large part, but not fully, exempt - however numpy is distributed in lots of commercial products so a part of our user base is affected.
  • Best practices as well as Python packaging standards are evolving. Examples:

Some of the issues discussed in that blog post on PyTorch are also relevant for NumPy. For example, our GITHUB_TOKEN is still set to the default write permissions on everything, and the tokens for uploading to our nightly bucket as well as the staging bucket for releases to PyPI are accessible by anyone with write permissions on this repository. Note that it's effectively impossible to store repository secrets safely for a subset of people with commit rights, so currently our more stringent requirements for direct PyPI access are only partially effective.

Here is a useful way to think about types of security threats from Supply-chain Levels for Software Artifacts:

[Image: diagram of supply chain threats from Supply-chain Levels for Software Artifacts (SLSA)]

For NumPy, we're in decent shape for "source threats" probably - commits on main and maintenance/* are quite visible so the risk of new commits being slipped in unnoticed is low, and we've never had a CVE that was actually concerning (less concerning CVEs are a pain and need mitigating, but they won't cause large-scale damage). On "build threats" I think we are not in great shape. And as one of the highest-profile Python packages and the second-most-downloaded package with compiled extensions on PyPI, we are a pretty attractive target.

The aim of this issue is to serve as a tracking issue and discuss at a high level what we want to do. Sub-issues can be created for individual actionable steps.

Proposed improvements

Related to source-level access and repository permissions:

  1. Further tighten 2FA requirements for everyone with any permissions beyond read-only on the repository.
  2. Move GITHUB_TOKEN to read-only default permissions. Any actions that need it will have to explicitly, and granularly, enable write permissions for labels, pull-requests, etc.
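
As a rough illustration of item 2, here is a minimal sketch of a check that flags workflow files which do not restrict GITHUB_TOKEN at the workflow level. It is a plain-text heuristic rather than a real YAML parser, and the function names (`top_level_permissions`, `token_needs_attention`) are hypothetical, not existing NumPy tooling. It assumes that a workflow without its own `permissions:` key falls back to the repository default, which is currently permissive.

```python
import re


def top_level_permissions(workflow_text):
    """Return the workflow-level `permissions` setting, or None if absent.

    Heuristic text scan, not a YAML parser: only a `permissions:` key that
    starts in column 0 is treated as workflow-level.
    """
    lines = workflow_text.splitlines()
    for i, line in enumerate(lines):
        match = re.match(r"permissions:\s*(.*)$", line)
        if not match:
            continue
        inline = match.group(1).split("#")[0].strip()
        if inline:
            return inline  # e.g. "read-all", "write-all" or "{}"
        # Block mapping: collect the indented `scope: level` pairs below it.
        scopes = {}
        for sub in lines[i + 1:]:
            if sub.strip() and not sub.startswith((" ", "\t")):
                break  # next top-level key reached
            pair = re.match(r"\s+([\w-]+):\s*([\w-]+)", sub)
            if pair:
                scopes[pair.group(1)] = pair.group(2)
        return scopes
    return None


def token_needs_attention(workflow_text):
    """True if the workflow relies on the repo default or grants write access
    at the workflow level (job-level grants are not inspected here)."""
    perms = top_level_permissions(workflow_text)
    if perms is None:
        return True  # assumption: the repository default is still permissive
    if isinstance(perms, str):
        return "write" in perms
    return "write" in perms.values()
```

With a check like this, a workflow that sets `contents: read` at the top and grants `pull-requests: write` only inside the one job that needs it would pass, which is the granular pattern described above.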

Related to building and distributing of release artifacts:

  1. Move building release artifacts that get uploaded to PyPI and anaconda.org to a new repository. That new repository should have a stricter approach to security:
    • Write access only for org admins and the release team (O(5) rather than O(30) people with access)
    • Enable trusted publishing to PyPI, with a staging environment with manual approval by the release manager before uploading starts.
      • So no more anaconda.org token for releases.
    • Require linear history, so commit history is easy to inspect.
    • Also branch protection etc. - apply best practices there.
    • Only wheels.yml runs, and maybe a security-related linter action
    • All release artifacts are required to be built inside this repository on GitHub Actions runners
      • No self-hosted runners allowed; cross-compiling is allowed (whether we actually do it depends on maintenance cost as always) - verification of the test suite passing after cross compilation is done either on the main repo in a self-hosted runner or under QEMU.

Related to helping downstream distributors and end users with how they approach supply chain security:

  1. Start shipping SBOMs, in the way outlined by PEP 770
  2. Start verifying that our builds are fully reproducible (this will take quite a while to implement and requires new machinery)
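
To make item 2 a bit more concrete: one simple form of reproducibility check is to build the same wheel twice, independently, and compare the results. The sketch below is only an assumption about what such a check could look like, not the machinery the proposal refers to, and `compare_builds` and its helpers are hypothetical names. It compares the whole-file SHA-256 digests and, since a wheel is a zip archive, also the digest of each archive member, so that content differences can be told apart from metadata-only differences such as timestamps.

```python
import hashlib
import zipfile


def sha256_of_file(path):
    """SHA-256 of a file on disk, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def member_digests(wheel_path):
    """Map each file inside the wheel (a zip archive) to its SHA-256."""
    with zipfile.ZipFile(wheel_path) as zf:
        return {
            info.filename: hashlib.sha256(zf.read(info)).hexdigest()
            for info in zf.infolist()
            if not info.is_dir()
        }


def compare_builds(wheel_a, wheel_b):
    """Compare two independently built wheels.

    `bit_identical` is the strict test; `differing_members` lists files whose
    contents differ or that exist in only one of the two archives.
    """
    a, b = member_digests(wheel_a), member_digests(wheel_b)
    differing = sorted(
        name for name in a.keys() | b.keys() if a.get(name) != b.get(name)
    )
    return {
        "bit_identical": sha256_of_file(wheel_a) == sha256_of_file(wheel_b),
        "differing_members": differing,
    }
```

If `bit_identical` is false while `differing_members` is empty, the file contents match and the difference lies in archive metadata (for example timestamps or member ordering), which is usually the easier class of problem to fix.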

Envisioned benefits

The primary benefit is significantly enhanced supply chain security, which is beneficial for both end users and for NumPy as a project (an event with malicious content injected would be pretty stressful for maintainers, and bad for the project's reputation).

Also importantly, it allows the main numpy/numpy repository to continue its relatively loose approach to development, e.g.:

  • lots of people with commit rights,
  • using a host of different actions, including from sometimes unknown individual devs (this is otherwise questionable even with pinning, since no one actually rechecks the action's source code when the pin gets bumped), and from vendors with a questionable reputation like Codecov (we were lucky to not be affected by a previous Codecov breach),
  • continuing to use multiple CI services (Circle CI, Cirrus CI, Azure DevOps, maybe others in the future),
  • allowing self-hosted runners, as long as they are set up securely (i.e., ephemeral - no cross-talk between jobs or actions running on the same self-hosted runner). SciPy already uses Cirrus Runners, and in #29125 ("CI: Enable GitHub Actions CI for ppc64le (Power architecture) support") there is a proposal for IBM-hosted runners,
  • allowing fairly extensive usage of caching without having to worry too much about cache poisoning, because we can remove all secrets from the main repo.
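
For context on the pinning mentioned in the list above, here is a minimal sketch of what a pin check can and cannot tell us. It lists `uses:` references in a workflow file that are not pinned to a full 40-character commit SHA. The function name `unpinned_actions` is hypothetical, the scan is a text heuristic rather than a YAML parser, and local actions (`./...`) and `docker://` references are deliberately skipped.

```python
import re

# Matches `uses: owner/repo@ref` with or without a leading list dash.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)")
FULL_SHA_RE = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text):
    """Return action references that are not pinned to a full commit SHA."""
    unpinned = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        ref = match.group(1).strip("'\"")
        if ref.startswith(("./", "docker://")):
            continue  # local actions and container images are out of scope
        _, _, version = ref.partition("@")
        if not FULL_SHA_RE.fullmatch(version):
            unpinned.append(ref)
    return unpinned
```

A check like this only confirms that a pin exists. It says nothing about whether anyone reviewed the action's source when the pin was last bumped, which is the concern raised above and part of the reason for moving release builds out of the main repository.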

The release process will also be easier to manage once trusted publishing is set up (no more anaconda-client and manually downloading/uploading wheels by the release manager).

As yet another benefit: other projects in the scientific Python ecosystem tend to follow what we do, so the effort that the proposed changes will require will be well spent - it has a multiplicative effect when maintainers of other projects can learn from what we do and copy aspects of it.


I'll also note that what this doesn't do is make the use of PyPI very secure. It'll be a very significant improvement to how we safeguard our release artifacts (I'd estimate >20x fewer people having access to our release secrets, plus more ability to verify the binaries). However, while PyPI is great for development, if one really cares about security (e.g., use of Python/NumPy inside large corporations or government entities), one should have a coherent strategy like building wheels from source and hosting them on private index servers, or using a commercial vendor of Python packages (there are many options, from commercial Linux distros to Anaconda, ActiveState, Chainguard - and I know more offerings are in the making). We just aim to do the best we can here with limited means.
