Download pipelines with authenticated GH API calls #3607
base: dev
Oops, wrong base branch. Fixed.
Done. The test failures are most likely unrelated, or at least I have no idea what they mean.
There is a problem mapping your CLI arguments to the parameters of the  Apart from that, please mind that there is a major refactor of Downloads ongoing (#3634). Preferably, all new contributions would already be based on and point to this new structure.
How does this fit with the pipeline downloads refactoring @MatthiasZepper @jpfeuffer?
Good question. Who knows more about the refactor?
Yes, I checked, and the mechanism for downloading the repo data/files is still the same.
And, well, the CLI apparently needs a small fix. Note that I only added the CLI option because I still wanted to allow downloads from the zip URLs, since they will never be rate-limited, whereas non-authenticated downloads from the API can be.
c8d134c to 592d621
I quickly let Copilot rebase the changes (and fixed the CLI bug).
Julian has kindly taken over the maintainer role for Downloads, so I leave the ultimate decision to him, but I am leaning towards a few changes still.
    default=4,
    help="Number of allowed parallel tasks",
)
@click.option(
I have some reservations regarding the --api-download option from a user experience perspective. The current solution exposes implementation details that users shouldn't need to think about. Instead, we could focus on what the user actually wants to achieve: authenticated vs. anonymous downloads.
Consider something like --authenticated with a help text such as "Enable authenticated download (with better rate limits, access to private repos, etc.)" instead.
Potentially even --auth-method <method> to future-proof it, in case we want to support multiple authentication methods. For example, the option to download pipelines not hosted on GitHub has also been a long-standing request, and similar features could then be added successively without renaming or replacing too many CLI arguments.
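To make the naming suggestion concrete, here is a minimal sketch of a flag named after user intent. nf-core declares its options with click; this sketch swaps in the standard library's argparse so that it runs without dependencies. The `build_parser` helper, the program name, and the positional argument are hypothetical and are not the project's actual CLI.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: illustrates flag naming only, not the real nf-core CLI."""
    parser = argparse.ArgumentParser(prog="download-sketch")
    parser.add_argument("pipeline", help="Pipeline name in 'org/repo' form")
    parser.add_argument(
        "--authenticated",
        action="store_true",  # boolean switch, off unless the user asks for it
        help="Enable authenticated download (better rate limits, access to private repos)",
    )
    return parser


# Example invocation with a placeholder pipeline name.
args = build_parser().parse_args(["org/repo", "--authenticated"])
print(args.authenticated)
```

The help text describes the outcome for the user and does not mention which endpoint is used underneath.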
@MatthiasZepper while I agree with the switch to authenticated, I am not sure if auth_method is helpful here, even in the future.
I think there is usually only one method for each SCM/git provider and even if there are more, you would require extensive logic to make sure that people do not use "github authentication" when actually wanting to download from gitlab.
If we just use gh_api.get instead of requests.get in nf_core/pipelines/download/download.py, no new flag should be needed at all. Or am I missing something?
I vaguely recall that the problem was the need to authenticate at the API even for public repositories. For us developers, who have some key-based authentication or token set up for GitHub anyway, this is essentially unnoticeable.
But for ordinary users who are just getting started with nf-core and would like to download their first pipeline, this is a significant obstacle, since they may barely understand the error message or might not even have a GitHub account.
Yes, kind of. I think it is possible to use the API unauthenticated too, but you would be rate-limited more easily (which is not the case for the public zip download URL, afaik). Since I don't know how many requests the CI or some power user makes in a short timeframe, I decided to put it behind a flag.
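The rate-limit concern can be checked on a response without any flag. The sketch below assumes that an exhausted primary rate limit shows up as a 403 or 429 status together with an `x-ratelimit-remaining` header of `0`, which is how GitHub's REST API documentation describes it. The helper name `is_rate_limited` is hypothetical and not part of nf-core.

```python
from typing import Mapping


def is_rate_limited(status_code: int, headers: Mapping[str, str]) -> bool:
    """Return True if a GitHub API response looks like a rate-limit rejection.

    Assumption: 403 or 429 plus 'x-ratelimit-remaining: 0' means the
    primary rate limit is exhausted.
    """
    # Normalise header names, because a plain dict lookup is case-sensitive.
    lowered = {key.lower(): value for key, value in headers.items()}
    return status_code in (403, 429) and lowered.get("x-ratelimit-remaining") == "0"
```

A caller could use such a check to print a hint about authenticating, or to fall back to the public zip URL, rather than surfacing a bare HTTP error to a first-time user.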
url = requests.get(download_url)
with ZipFile(io.BytesIO(url.content)) as zipfile:
    zipfile.extractall(self.outdir)
if not self.api_download:
I think this function could have a cleaner flow for better maintainability.
For example, the download_url for the anonymous download is assembled in line 467ff:
if not self.platform:
    for revision, wf_sha in self.wf_sha.items():
        # Set the download URL and return - only applicable for classic downloads
        self.wf_download_url = {
            **self.wf_download_url,
            revision: f"https://github.com/{self.pipeline}/archive/{wf_sha}.zip",
        }
That is logically where the api_url should be created as well.
If you refactor the logic of the function, you can also reduce some code duplication in the ZipFile part and group the topdir and requests.get(...).content / gh_api.get(...).content calls closer together for clearer logic.
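A minimal sketch of assembling both URL forms in one place, as suggested here. The archive URL follows the f-string quoted above. The API form assumes GitHub's REST zipball endpoint (`/repos/{owner}/{repo}/zipball/{ref}`). The function names and the `authenticated` parameter are hypothetical and not taken from the PR.

```python
from typing import Dict


def archive_url(pipeline: str, ref: str) -> str:
    """Anonymous zip archive URL, in the form used by the existing code."""
    return f"https://github.com/{pipeline}/archive/{ref}.zip"


def api_zipball_url(pipeline: str, ref: str) -> str:
    """GitHub REST API zipball URL for the same repository and ref (assumed form)."""
    return f"https://api.github.com/repos/{pipeline}/zipball/{ref}"


def build_download_urls(pipeline: str, wf_sha: Dict[str, str], authenticated: bool) -> Dict[str, str]:
    """Map each revision to its download URL, choosing the URL form once."""
    build = api_zipball_url if authenticated else archive_url
    return {revision: build(pipeline, sha) for revision, sha in wf_sha.items()}


# Placeholder pipeline name and commit hash, for illustration only.
print(build_download_urls("org/repo", {"1.0": "abc1234"}, authenticated=True))
```

With the choice made where the URLs are built, the download function only has to fetch whatever URL it is handed.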
Fair enough. I'll see what I can do.
I think there should be no condition here at all. From what it looks like, gh_api can be used in both cases.
As Matthias points out, instead of creating a new URL, we should use download_url.
- Rename --api-download to --authenticated for better UX
- Replace os.rename with pathlib operations
- Refactor download_wf_files method to reduce code duplication
- Rename compress to compress_type consistently across codebase
- Update all references and tests accordingly
Sorry for taking my time with this review. The repo download could probably be simplified further and I think we should get away without adding a new flag, if I am not missing something.
Is it also a target of this PR to add downloads from different sources (i.e. private repos)?
Yep, private/internal (GitHub) repos should be supported by this. This is basically my use case. The download_url unification should be addressed by my latest changes.
# Fetch content and determine top-level directory based on authentication method
if self.authenticated:
    # GitHub API download: fetch via API and get topdir from zip contents
    content = gh_api.get(download_url).content
    with ZipFile(io.BytesIO(content)) as zipfile:
        topdir = zipfile.namelist()[0]  # API zipballs have a generated directory name
        zipfile.extractall(self.outdir)
else:
    # Direct URL download: fetch and construct expected topdir name
    content = requests.get(download_url).content
    topdir = f"{self.pipeline}-{wf_sha if bool(wf_sha) else ''}".split("/")[-1]
    with ZipFile(io.BytesIO(content)) as zipfile:
        zipfile.extractall(self.outdir)
# GitHub API download: fetch via API and get topdir from zip contents
content = gh_api.get(download_url).content
with ZipFile(io.BytesIO(content)) as zipfile:
    topdir = zipfile.namelist()[0]  # API zipballs have a generated directory name
    zipfile.extractall(self.outdir)
I am talking about replacing the old way of downloading with your GitHub API URLs. This also works with unauthenticated requests (within a quota), and that way we can reduce complexity and remove the new parameter.
What are your thoughts, @MatthiasZepper?
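A minimal sketch of what a single code path could look like: the top-level directory is read from the archive itself, so the same function works no matter which URL produced the zip. The `fetch` callable stands in for `gh_api.get(url).content` and is injected so that the sketch runs offline. `download_and_extract` and `fake_fetch` are hypothetical names, not the PR's code.

```python
import io
import tempfile
from pathlib import Path
from typing import Callable
from zipfile import ZipFile


def download_and_extract(fetch: Callable[[str], bytes], url: str, outdir: Path) -> Path:
    """Fetch a zip archive, extract it, and return the extracted top-level directory."""
    content = fetch(url)
    with ZipFile(io.BytesIO(content)) as archive:
        # Take the first path component of the first entry as the top-level directory.
        topdir = archive.namelist()[0].split("/")[0]
        archive.extractall(outdir)
    return outdir / topdir


def fake_fetch(url: str) -> bytes:
    """Stand-in for a network call: returns a tiny in-memory zip archive."""
    buffer = io.BytesIO()
    with ZipFile(buffer, "w") as archive:
        archive.writestr("org-repo-abc1234/main.nf", "// workflow")
    return buffer.getvalue()


with tempfile.TemporaryDirectory() as tmp:
    extracted = download_and_extract(fake_fetch, "https://example.invalid/zipball", Path(tmp))
    print(extracted.name, (extracted / "main.nf").exists())
```

Because the directory name comes from the archive, no branch is needed to predict it from the pipeline name and commit hash.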
# allows to specify a container library / registry or a respective mirror to download images from
self.parallel = parallel
self.hide_progress = hide_progress
self.authenticated = authenticated
We should still be able to read the authenticated status from gh_api.auth instead of having the new flag.
if not gh_api.has_init:
    gh_api.lazy_init()
self.authenticated = gh_api.auth is not None
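To illustrate the idea of deriving the status from the session instead of a flag, here is a minimal sketch with a stand-in class. Only the attribute names `has_init`, `lazy_init`, and `auth` come from the suggestion above. `FakeGitHubSession`, its token handling, and the shape of the `auth` value are assumptions for illustration and do not reflect how nf-core's `gh_api` object is implemented.

```python
from typing import Optional, Tuple


class FakeGitHubSession:
    """Hypothetical stand-in mirroring only the attributes named in the suggestion."""

    def __init__(self, token: Optional[str] = None) -> None:
        self._token = token
        self.has_init = False
        self.auth: Optional[Tuple[str, str]] = None

    def lazy_init(self) -> None:
        # Assumed behaviour: populate auth only when a credential is available.
        if self._token:
            self.auth = ("x-access-token", self._token)
        self.has_init = True


def is_authenticated(session: FakeGitHubSession) -> bool:
    """Initialise the session on first use, then report whether credentials exist."""
    if not session.has_init:
        session.lazy_init()
    return session.auth is not None


print(is_authenticated(FakeGitHubSession("dummy-token")), is_authenticated(FakeGitHubSession()))
```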
Ok, my revised suggestion:
What do you think? Thanks for this feature @jpfeuffer and sorry for being a bit picky on these details, I am just trying to keep the complexity as low as can be.
For repos that follow the nf-core template but need authentication.
PR checklist
- CHANGELOG.md is updated
- docs is updated