extremely slow scan for "copyright" files #278

Open
mralusw opened this issue Apr 5, 2024 · 4 comments

Comments

@mralusw

mralusw commented Apr 5, 2024

First of all, thank you for DISABLE_COPYRIGHT_FILES_DEPLOYMENT, it really saves the day!

Right now each library / executable is looked up by running a separate dpkg -S, after which linuxdeploy checks for /usr/share/doc/PKG/copyright. This is extremely slow. dpkg -S accepts multiple query paths in a single call, in case you want to batch the calls.
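A minimal sketch of what a batched lookup could look like, assuming the usual `dpkg -S` output format (`pkg[:arch][, pkg2...]: /path`). The function names `copyright_candidates` and `batched_copyright_lookup` are made up for illustration and are not part of linuxdeploy.

```shell
# Turn `dpkg -S` output lines ("pkg[:arch][, pkg2...]: /path") read from stdin
# into candidate copyright file paths, one per owning package, deduplicated.
copyright_candidates() {
    sed '/^diversion /d; s/: .*$//' |   # drop diversion notices, keep the package list
    tr ',' '\n' |                       # one package per line if several own the file
    sed 's/^ *//; s/:[^:]*$//' |        # trim leading spaces, drop the ":arch" qualifier
    sort -u |
    sed 's|.*|/usr/share/doc/&/copyright|'
}

# Hypothetical wrapper: one dpkg process for all files instead of one per file.
# Files that no package owns are silently skipped.
batched_copyright_lookup() {
    dpkg -S "$@" 2>/dev/null | copyright_candidates
}
```

Something like `batched_copyright_lookup /usr/lib/x86_64-linux-gnu/libz.so.1 /bin/bash` would then print the candidate paths, which still need an existence check before copying.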

Another solution would be to add an --output-deployed-files FNAME to output the list of files for which the copyright files need to be found. After all, linuxdeploy has valuable logic to include / exclude deployed files from lookup.

I currently extract the AppImage created by linuxdeploy, list all .so libraries, look them up all at once in a single dpkg -S operation, and copy the results as appropriate. But mine is a simple case.

I've noticed that there already is a framework for deferred operations, so maybe that can be co-opted.

mralusw changed the title from 'extremely slow scan for "copyright" file' to 'extremely slow scan for "copyright" files' on Apr 5, 2024
@TheAssassin
Member

This can easily happen on Debian when the dpkg cache is on a slow drive and is very large. For instance, in (temporary) CI environments, Docker containers etc., the scan is really fast. Ever since I replaced my SSHD (HDD + SSD cache) with a real SSD, I haven't had any issues with this any more; before, linuxdeploy took 2-3 seconds per item.

@mralusw
Author

mralusw commented Apr 11, 2024

I've had the cache in /zram for... ages. I've tried with /tmp and it makes no difference.

time dpkg-query -S '*/xx1' '*/xx2' '*/xx3' 2>/dev/null
## Executed in  958.89 millis

Doesn't matter if the patterns are found or not found. Debian 12, Ubuntu 22.04, Ubuntu 20.04 — some slower than others.

I have no idea what it's doing, since strace dpkg -S '*/zzz' '*/xxx' '*/yyy' '*/uuu' shows delays just before it prints each message, not in-between syscalls. Probably some very inefficient grepping.

Anyway, the first improvement was to use xargs to run multiple dpkg-query processes in parallel, and batch 10 queries for each:

find "$dst"/usr/lib -type f -name '*.so*'  | sed ... |
  xargs -d'\n' -t -n10 -P"$(getconf _NPROCESSORS_ONLN)" dpkg-query -S | sed ...
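A self-contained variant of the same idea, as a rough sketch. The `sed ...` steps above are elided in the original, so `to_dpkg_patterns` is only a guess at the first one, based on the `*/xx1` patterns in the timing test. `run_batched` is a hypothetical helper, and both `-d` and `-r` assume GNU xargs.

```shell
# Turn file paths (one per line on stdin) into dpkg-query glob patterns of the
# form "*/basename". This is an assumption about what the elided sed does.
to_dpkg_patterns() {
    sed 's|^.*/||; s|^|*/|'
}

# Run a command over stdin lines in batches, several batches in parallel.
# Usage: run_batched BATCH_SIZE JOBS COMMAND [ARGS...]
run_batched() {
    batch=$1 jobs=$2
    shift 2
    # -r: do nothing on empty input; -d '\n': one argument per input line
    xargs -r -d '\n' -n "$batch" -P "$jobs" "$@"
}

# Possible usage, mirroring the pipeline above ($dst is the extracted AppDir):
# find "$dst"/usr/lib -type f -name '*.so*' | to_dpkg_patterns | sort -u |
#     run_batched 10 "$(getconf _NPROCESSORS_ONLN)" dpkg-query -S
```

With more than one job the output order is not deterministic, so anything consuming it should sort or deduplicate afterwards.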

But the real improvement (blazing fast) came by ditching dpkg and searching directly in /var/lib/dpkg/info/*.list:

find "$dst"/usr/lib -type f | sed ... |
grep -f - /var/lib/dpkg/info/*.list | sed ...

(grep -f - reads its patterns from stdin, one regex per line).
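For illustration, here is a sketch of the direct `.list` search with exact basename matching. It swaps grep for awk, because `grep -f` treats each line as a regex (so `.` matches any character) and also matches substrings such as `libfoo.so.1.2` for `libfoo.so.1`. The function name `owners_from_lists` and its `INFO_DIR` parameter are hypothetical. It assumes the usual layout of `/var/lib/dpkg/info/PKG[:ARCH].list` with one absolute path per line.

```shell
# Print "pkg: /path" for every path in the .list files whose basename equals
# one of the given names. Usage: owners_from_lists INFO_DIR NAME...
owners_from_lists() {
    info_dir=$1
    shift
    [ "$#" -gt 0 ] || return 0
    printf '%s\n' "$@" |
    awk '
        NR == FNR { wanted[$0] = 1; next }   # first input (stdin): wanted names
        {
            n = split($0, parts, "/")
            if (parts[n] in wanted) {
                pkg = FILENAME
                sub(/.*\//, "", pkg)         # strip the directory
                sub(/\.list$/, "", pkg)      # strip the .list suffix
                sub(/:[^:]*$/, "", pkg)      # strip a ":arch" qualifier
                print pkg ": " $0
            }
        }
    ' - "$info_dir"/*.list
}
```

Called as `owners_from_lists /var/lib/dpkg/info libz.so.1 libpng16.so.16`, this reads each list file once. Unlike `dpkg -S` it ignores diversions, and matching by basename can return several packages that ship a file of the same name.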

I might turn this into a plugin. Problem is, I don't really know what LD deploys. I'm just searching for identical library names, and the setup is specific to my app. I'd still like to obtain a map of which files were automagically deployed to which location. Maybe I (shudder) could parse the ldlog.

@TheAssassin
Member

I'd still like to obtain a map of which files were automagically deployed to which location.

Nobody's ever suggested machine readable output from this application. I'm open to such features, but please open another issue.

searching directly in /var/lib/dpkg/info/*.list

I'm not sure I want to implement my own parser for those files.

batch 10 queries

linuxdeploy runs strictly sequentially. It primarily tries to speed things up by trying to avoid running the same operation twice on a file. Adding any kind of parallel operations would be a lot of work given the lack of proper C++ libraries for such a purpose (Python would make it quite easy, though).

I don't really know what LD deploys

Well, the code is free/open-source, so you can have a look for yourself. While it may be C++, you should be able to get a basic idea of what it does from there. You can grep for dpkg to find the exact commands linuxdeploy uses. Maybe there's a better way?

@mralusw
Author

mralusw commented May 16, 2024

Nobody's ever suggested machine readable output from this application. I'm open to such features, but please open another issue.

Cool, I think that makes sense given linuxdeploy's strengths.

batch 10 queries

linuxdeploy runs strictly sequentially. It primarily tries to speed things up by trying to avoid running the same operation twice on a file. Adding any kind of parallel operations would be a lot of work

Yeah, not a good idea and I'm not proposing that. I'm thinking more along the lines of making more operations pluggable by external scripts.

I'm not sure I want to implement my own parser for those files.

Yeah, me neither, and it's ultimately a dpkg issue (unlikely to be solved). With pluggable scripts though it can be addressed.

In the case at hand, FWIW, I think dpkg might be searching for files created by package install scripts (i.e. not in the list file), and probably also by dpkg-divert. These make sense for admin usage of dpkg on a given machine, but not for package specification.

I don't really know what LD deploys

Well, the code is free/open-source, so you can have a look for yourself. While it may be C++, you should be able to get a basic idea of what it does from there. You can grep for dpkg to find the exact commands linuxdeploy uses. Maybe there's a better way?

I did, obviously :) (and I've had a long stint with C++, though years pass and C++ mutates). "I don't know" meant there is no machine-readable output, which goes back to point 0. Thanks for clarifying that there is no current design.
