
@rogu-beta
Contributor

To get an accurate mapping from a DejaCode package to PurlDB entries, the patched query prioritizes hashes. This is needed because the same PURL (without query parameters) can have multiple download URLs, as is the case for Python packages and for binaries built for different hardware architectures or interpreter versions.

Lookups for SHA-256 and MD5 are added because SHA-1 may not be populated under all circumstances. Hashes from SBOM imports generated by tools such as cdxgen commonly no longer include SHA-1, since it is a mostly deprecated hashing algorithm due to the risk of hash collisions. SHA-512 could not be added yet because PurlDB does not support a lookup for it.

The order of prioritization is as follows: hashes come first, since they identify the content of the package most accurately; the download URL comes next, since it at least points to the download location and still allows differentiating between target architectures; the PURL itself comes last, in case no fully accurate match could be found otherwise.

The results are then filtered by checking that the PURLs match. This filter is modified to also strip the query parameters from the PurlDB PURL, since PurlDB PURLs may contain them too, which previously caused matches to be missed.

For reference, see the following issues: aboutcode-org#307 aboutcode-org#383
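The prioritized lookup described above could be sketched roughly as follows. This is not the code from the patch: `find_purldb_entries`, `strip_query`, `LOOKUP_FIELDS`, and the `search` callable are hypothetical names, the relative order of the three hash lookups is an assumption (the description only says that hashes come before the download URL and the PURL), and falling through to the next field when no result survives the PURL filter is also an assumption.

```python
from typing import Callable

# Assumed priority: hashes first, then download URL, then the PURL itself.
# The order among the hash types is a guess, not taken from the patch.
LOOKUP_FIELDS = ("sha1", "sha256", "md5", "download_url", "purl")


def strip_query(purl: str) -> str:
    """Return the PURL without its "?..." part (anything after it is dropped too)."""
    return purl.split("?", 1)[0]


def find_purldb_entries(package: dict, search: Callable[[str, str], list]) -> list:
    """Try each lookup field in priority order and return the first PURL-consistent hits.

    `search(field, value)` stands in for a PurlDB lookup and is hypothetical.
    """
    package_purl = strip_query(package.get("purl", ""))
    for field in LOOKUP_FIELDS:
        value = package_purl if field == "purl" else package.get(field)
        if not value:
            continue  # field not populated on this package, try the next one
        results = search(field, value)
        # Compare PURLs with query parameters stripped on both sides.
        matches = [
            entry for entry in results
            if strip_query(entry.get("purl", "")) == package_purl
        ]
        if matches:
            return matches
    return []


# In-memory stand-in for PurlDB, with made-up entries for illustration only.
ENTRIES = [
    {"purl": "pkg:pypi/demo@1.0?file_name=demo-linux.whl", "sha256": "aaa",
     "download_url": "https://example.org/demo-linux.whl"},
    {"purl": "pkg:pypi/demo@1.0?file_name=demo-win.whl", "sha256": "bbb",
     "download_url": "https://example.org/demo-win.whl"},
]


def fake_search(field: str, value: str) -> list:
    if field == "purl":
        return [e for e in ENTRIES if strip_query(e["purl"]) == value]
    return [e for e in ENTRIES if e.get(field) == value]
```

With these made-up entries, a package that carries the SHA-256 of the Windows wheel resolves to that single entry, while a package with only the bare PURL falls back to both entries.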

Checks for existing packages are also extended to compare against the other hash types.

For reference see the following issues: aboutcode-org#307 aboutcode-org#383

Signed-off-by: Robert Guetzkow <[email protected]>
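As an illustration of the extended check for existing packages, the sketch below compares a candidate against existing packages on any populated hash in a single pass. It is a plain-Python stand-in, not the DejaCode implementation: `find_existing_by_hash` and `HASH_FIELDS` are hypothetical names, and the set of hash fields compared is an assumption. In a Django codebase the same idea would presumably be expressed as one queryset filter built from OR-ed `Q` objects.

```python
# Assumed set of hash fields; the actual fields compared in the patch may differ.
HASH_FIELDS = ("md5", "sha1", "sha256")


def find_existing_by_hash(candidate: dict, existing_packages: list) -> list:
    """Return existing packages that share at least one non-empty hash with the candidate."""
    # Only compare hashes that are actually populated on the candidate,
    # so that two empty values never count as a match.
    wanted = {
        field: candidate[field]
        for field in HASH_FIELDS
        if candidate.get(field)
    }
    if not wanted:
        return []
    return [
        package for package in existing_packages
        if any(package.get(field) == value for field, value in wanted.items())
    ]
```

Skipping empty hash values matters here: without that guard, every package with an unpopulated hash field would be reported as already existing.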
@tdruez
Contributor

tdruez commented Nov 21, 2025

@rogu-beta Thanks for starting on this. The new logic makes sense; see my suggestion for refining the "PackageAlreadyExistsWarning" check.

@rogu-beta
Contributor Author

@tdruez I'll try to work in the suggestions next week. Unfortunately, I don't have time to do this today.

Remove code duplication and reduce database queries to a single one

Signed-off-by: tdruez <[email protected]>
@rogu-beta
Contributor Author

@tdruez Thank you for including the suggested changes!

@tdruez tdruez merged commit fd1b980 into aboutcode-org:main Nov 24, 2025
4 checks passed
@tdruez
Contributor

tdruez commented Nov 24, 2025

@rogu-beta No worries, thanks for kickstarting this!
