Update package identifier to be constant across multiple syft run #595

spiffcs · 2021-10-27T14:40:13Z

Stabilize Syft Package ID

To explore the changes this branch introduces you can run the following commands.

syft version == main

syft -o json alpine:latest > /tmp/json1.json 
syft -o json alpine:latest > /tmp/json2.json  

diff /tmp/json1.json /tmp/json2.json

In the above diff, you should see many ID fields, each with a different UUID in the two runs.

syft version == 363-package-identifier

syft -o json alpine:latest > /tmp/json1.json 
syft -o json alpine:latest > /tmp/json2.json  

diff /tmp/json1.json /tmp/json2.json

There should be no diff in the above.

syft version == 363-package-identifier

syft -o json alpine:latest > /tmp/json1.json 
syft -o json alpine:3.12 > /tmp/json2.json  

diff /tmp/json1.json /tmp/json2.json

The above diff should be easier to explore with the missing ID field.
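The ID comparison that the diff commands perform by eye can also be sketched programmatically. This is a minimal sketch, not part of the PR: it assumes the syft JSON output has a top-level `artifacts` array whose entries carry `id` and `name` keys, and `changedIDs` is a hypothetical helper. It keys packages by name only, so it is too coarse for images that contain several versions of the same package.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// document mirrors only the part of the syft JSON output this sketch needs.
// The "artifacts", "id" and "name" keys are assumptions about that format.
type document struct {
	Artifacts []struct {
		ID   string `json:"id"`
		Name string `json:"name"`
	} `json:"artifacts"`
}

// changedIDs (hypothetical helper) returns the names of packages that appear
// in both documents but with a different ID.
func changedIDs(first, second []byte) ([]string, error) {
	var a, b document
	if err := json.Unmarshal(first, &a); err != nil {
		return nil, err
	}
	if err := json.Unmarshal(second, &b); err != nil {
		return nil, err
	}
	ids := make(map[string]string, len(a.Artifacts))
	for _, p := range a.Artifacts {
		ids[p.Name] = p.ID
	}
	var changed []string
	for _, p := range b.Artifacts {
		if prev, ok := ids[p.Name]; ok && prev != p.ID {
			changed = append(changed, p.Name)
		}
	}
	sort.Strings(changed)
	return changed, nil
}

func main() {
	// Inline stand-ins for /tmp/json1.json and /tmp/json2.json.
	run1 := []byte(`{"artifacts":[{"id":"aaa","name":"musl"},{"id":"bbb","name":"zlib"}]}`)
	run2 := []byte(`{"artifacts":[{"id":"aaa","name":"musl"},{"id":"ccc","name":"zlib"}]}`)
	changed, err := changedIDs(run1, run2)
	if err != nil {
		panic(err)
	}
	fmt.Println(changed) // prints [zlib]
}
```

With stable IDs, two runs against the same image should report no changed names.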

TODO

I left a couple of comments on the PR for discussion
Are there fields we're not ignoring that we should ignore to generate the hash?
How do we feel about the error case where a hash could not be generated?
How do we track packages that can't be hashed?

Signed-off-by: Christopher Angelo Phillips <[email protected]>

syft/pkg/catalog.go

syft/pkg/id.go

syft/pkg/package.go

github-actions · 2021-10-27T14:43:03Z

Benchmark Test Results

Benchmark results from the latest changes vs base branch

name                                                   old time/op    new time/op    delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2          1.20ms ± 3%    1.38ms ± 6%  +14.75%  (p=0.008 n=5+5)
ImagePackageCatalogers/python-package-cataloger-2        2.41ms ± 2%    3.41ms ±14%  +41.45%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2     730µs ± 1%     899µs ±15%  +23.25%  (p=0.008 n=5+5)
ImagePackageCatalogers/dpkgdb-cataloger-2                 747µs ± 1%     865µs ± 6%  +15.84%  (p=0.008 n=5+5)
ImagePackageCatalogers/rpmdb-cataloger-2                  769µs ± 1%     854µs ± 4%  +11.06%  (p=0.008 n=5+5)
ImagePackageCatalogers/java-cataloger-2                  10.7ms ± 2%    13.6ms ±14%  +27.24%  (p=0.008 n=5+5)
ImagePackageCatalogers/apkdb-cataloger-2                 1.02ms ± 1%    1.39ms ±11%  +36.14%  (p=0.008 n=5+5)
ImagePackageCatalogers/go-module-binary-cataloger-2       610ns ± 1%     665ns ±17%   +8.90%  (p=0.032 n=5+5)

name                                                   old alloc/op   new alloc/op   delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2           228kB ± 0%     248kB ± 0%   +9.06%  (p=0.008 n=5+5)
ImagePackageCatalogers/python-package-cataloger-2         980kB ± 0%    1109kB ± 0%  +13.19%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2     192kB ± 0%     199kB ± 0%   +3.70%  (p=0.008 n=5+5)
ImagePackageCatalogers/dpkgdb-cataloger-2                 211kB ± 0%     228kB ± 0%   +8.40%  (p=0.008 n=5+5)
ImagePackageCatalogers/rpmdb-cataloger-2                  217kB ± 0%     222kB ± 0%   +2.31%  (p=0.008 n=5+5)
ImagePackageCatalogers/java-cataloger-2                  3.12MB ± 0%    3.24MB ± 0%   +3.82%  (p=0.008 n=5+5)
ImagePackageCatalogers/apkdb-cataloger-2                 1.26MB ± 0%    1.29MB ± 0%   +2.78%  (p=0.008 n=5+5)
ImagePackageCatalogers/go-module-binary-cataloger-2        336B ± 0%      336B ± 0%     ~     (all equal)

name                                                   old allocs/op  new allocs/op  delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2           5.46k ± 0%     6.82k ± 0%  +24.91%  (p=0.016 n=5+4)
ImagePackageCatalogers/python-package-cataloger-2         18.1k ± 0%     26.3k ± 0%  +45.06%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2     4.80k ± 0%     5.19k ± 0%   +8.19%  (p=0.008 n=5+5)
ImagePackageCatalogers/dpkgdb-cataloger-2                 5.53k ± 0%     6.67k ± 0%  +20.65%  (p=0.008 n=5+5)
ImagePackageCatalogers/rpmdb-cataloger-2                  6.22k ± 0%     6.56k ± 0%   +5.38%  (p=0.008 n=5+5)
ImagePackageCatalogers/java-cataloger-2                   51.5k ± 0%     59.0k ± 0%  +14.55%  (p=0.008 n=5+5)
ImagePackageCatalogers/apkdb-cataloger-2                  5.49k ± 0%     7.74k ± 0%  +40.88%  (p=0.016 n=4+5)
ImagePackageCatalogers/go-module-binary-cataloger-2        9.00 ± 0%      9.00 ± 0%     ~     (all equal)

spiffcs · 2021-10-27T14:43:56Z

syft/pkg/package.go

+	if err != nil {
+		// if there is an error generating the hash
+		// we still want a unique identifier
+		// TODO: track packages that don't get hashed for report?


On the fence here as well.

I don't like "" as the return.
I don't think we should reserve 0 and keep overwriting it.

I do think we should track these packages that are unable to be hashed for some reason.

Struggling to find edge cases where that could happen.

Do we understand the circumstances in which we'd get an error back from hashstructure.Hash? I'd be curious to know what those are before deciding how we handle this error (or not handle it, for example)

The Package struct is what we're hashing here.

From the hashstructure package, here is the case statement where the struct is parsed:
https://github.com/mitchellh/hashstructure/blob/191a1e82bbeb456444f1a9bb9db2e4a20989cd98/hashstructure.go#L179-L389

Specific to strings:
https://github.com/mitchellh/hashstructure/blob/191a1e82bbeb456444f1a9bb9db2e4a20989cd98/hashstructure.go#L383-L386

It uses the default stdlib hash interface:
https://pkg.go.dev/hash

// Write (via the embedded io.Writer interface) adds more data to the running hash.
// It never returns an error.

Our package struct doesn't have a variety of complex types (almost all of them boil down to string), which is why I said I was struggling to find an edge case where the error could happen so that I could fuzz/test this part of the code.

I gotcha, that makes sense. I think this is tricky. One impact of using the hashstructure approach — specifically where we start with the entire struct and exclude select fields, as opposed to starting with nothing and then including select fields — is we have less explicit control over emerging edge cases as the struct's inner data evolves. For example, if we modify a specific package metadata struct that's used for a certain package type, we could be impacting how successfully this fingerprinting operation executes without realizing this. This could have obvious effects (e.g. an error bubbles up) or non-obvious effects (fingerprints are no longer unique) depending on how we handle edge cases here.
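To make the "start with nothing and include select fields" alternative concrete, here is a minimal sketch. It does not use hashstructure and is not the code in this PR: `Package` is a hypothetical, trimmed-down stand-in for the real struct, and FNV-1a from the standard library is swapped in as the hash purely for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Package is a hypothetical, trimmed-down stand-in for syft's pkg.Package.
type Package struct {
	Name      string
	Version   string
	Type      string
	Locations []string // deliberately left out of the fingerprint
}

// Fingerprint hashes only the fields listed here, so adding a field to
// Package cannot change the ID unless this function is edited as well.
func (p Package) Fingerprint() string {
	h := fnv.New64a()
	for _, field := range []string{p.Name, p.Version, p.Type} {
		// Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
		fmt.Fprintf(h, "%d:%s;", len(field), field)
	}
	return fmt.Sprintf("%016x", h.Sum64())
}

func main() {
	a := Package{Name: "musl", Version: "1.2.2", Type: "apk", Locations: []string{"layer-1"}}
	b := Package{Name: "musl", Version: "1.2.2", Type: "apk", Locations: []string{"layer-2"}}
	fmt.Println(a.Fingerprint() == b.Fingerprint()) // true: locations are ignored
}
```

The trade-off is the mirror image of the exclude approach: a new field that should affect identity is silently ignored until someone adds it to the list.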

I think we should be able to rely on our fingerprint implementation to be deterministic and reliable. The notion of having a backup approach (particularly, generating a UUID), to me, is a sign that we might want to pause and think about this. Just my two cents!

We could go the extremely strict path and do a Must implementation where we panic or fail the process completely if we cannot hash a package.
https://github.com/google/uuid/blob/44b5fee7c49cf3bcdf723f106b36d56ef13ccc88/uuid.go#L176
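For reference, the Must pattern linked above takes a (value, error) pair and panics when the error is non-nil. A minimal sketch of what that could look like for fingerprints follows; `fingerprint` is a hypothetical stand-in for the hashing step and is not syft code.

```go
package main

import (
	"errors"
	"fmt"
)

// fingerprint is a hypothetical stand-in for a hashing step that can fail.
func fingerprint(name string) (string, error) {
	if name == "" {
		return "", errors.New("cannot fingerprint a package without a name")
	}
	return "id-" + name, nil
}

// Must panics when err is non-nil, in the style of uuid.Must.
func Must(id string, err error) string {
	if err != nil {
		panic(err)
	}
	return id
}

func main() {
	fmt.Println(Must(fingerprint("musl")))

	// The strict path: a failed fingerprint aborts unless the caller recovers.
	defer func() {
		fmt.Println("recovered:", recover())
	}()
	Must(fingerprint(""))
}
```

This makes a hashing failure impossible to ignore, at the cost of taking down the whole run for a single bad package.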

UUID is a placeholder since it was the previous method and I could not think of a base case better than it without modifying the code to track packages that could not be hashed.

Is there room in the output to include packages that could not be successfully hashed?

My two cents: We shouldn't allow for any situation where we can't compute a deterministic package fingerprint... and I don't think it's a necessity for such a situation to be possible. If it is possible because of our implementation decision, I think that's worth discussing. As you've mentioned, we're going to be relying on this fingerprint quite a bit.

luhring

This is very exciting work... 🙌

I have some questions around design choices — feel free to point me to an existing thread if there are answers noted already somewhere

luhring · 2021-10-27T17:03:55Z

syft/pkg/catalog.go

-	if exists {
-		log.Errorf("package ID already exists in the catalog : id=%+v %+v", p.ID, p)
-		return
+	if p.ID == "" {


Question: Do we still need the Package struct to have an ID field that's distinct from the Fingerprint() function's return value?

I'm wondering if we could make it clearer whose responsibility it is to determine a package's ID. With this if check, it looks like it might be the consumer of Catalog.Add or it might be this implementation. Could we make it clearer and consistent where a package's ID originates?

I discussed this with @wagoodman and I think that we should just remove this check and have fingerprint() within Add be the sole operator of generating/assigning an ID.

Keeping the ID field is ideal for now since it decouples the fact that a package's fingerprint just happens to be the current form of its unique identification. We're still pre 1.0.0 so there might be more we discover along the way where this approach turns out to need more nuance. ID gives us a canvas to capture that nuance and is a good foothold for that kind of expansion.

A package should definitely know how to produce its fingerprint (see the fingerprint method).

I discussed this with @wagoodman and I think that we should just remove this check and have fingerprint() within Add be the sole operator of generating/assigning an ID.

I think this makes sense.

Keeping the ID field is ideal for now since it decouples the fact that a package's fingerprint just happens to be the current form of its unique identification.

I don't quite follow this part. What's not clicking for me is: what purpose does the package's ID field serve? Who should be setting its value? Who should be reading its value? Removing the field resolves these 3 ambiguities (for me). Leveraging the new (awesome) Fingerprint method forces this to be a readonly, deterministic value. 👍

I think the consideration here is that during catalog.Add() fingerprint is called a single time, and then the package takes on the responsibility of persisting that value as the ID for the rest of the program's run. I don't think we should use this PR to refactor how we're using ID, considering @wagoodman is working on #556.

I do agree there should be some discussion in the future on how the ID field is used, but that should come after #556 is complete and we have a better foundation on IF we are going to continue using the catalog into v1.0.0.

I think the sign here is that when we decide to stop using the catalog we will be asked hard questions about the ID field, but with concurrent work going on over in #556 I believe it's better to get the win of consistent ID rather than totally refactoring or removing the field.

That's reasonable! Thanks for walking me through this. 🙏

How do we plan to track this work? ^ (cleaning up the code with regard to the ID field)

@wagoodman correct me if I'm wrong, but we have this tracked in an issue regarding the removal of the cataloger. I just can't find it at the moment.

I think this #554 might be where we start to address this work?

This work (removal of the catalog) isn't tracked explicitly. The best place for this is under #556. The reason for this is that #556 is the first opportunity to see how the nature of IDs will work differently... I was going to write an issue to the effect of "remove the pkg.Catalog" but I felt that the issue would be prescribing a solution without highlighting the reasons why it makes sense to remove it, or how removing it helps out the cataloging process.

If there is a need to remove the pkg.Catalog then the signal will probably arise within #556 or shortly thereafter in an issue/PR that attempts to add relationships within a cataloger.

(#554 is affected by pkg.Catalog removal for sure, but doesn't appear to be a primary motivator for the change)

luhring · 2021-10-27T17:05:04Z

syft/pkg/catalog.go

+	if other, exists := c.byID[p.ID]; exists {
+		// there is already a package with this fingerprint
+		// merge the existing record with the new one
+		other.Merge(p)
+		return


Are we sure we want to do this? What's the motivation for merging two packages? And is it correct to expect that this operation might actually end up changing the package's ID?

Yes on wanting to do the merge.

The current implementation of syft will generate two packages if the locations and layer information are different. By ignoring the locations field, we allow layer information for similar packages to be merged under a single entry.

I do not think it is correct to expect that this operation would change the package ID. This merge only happens if a hash is identical to one already entered into the catalog. We then merge the location data so that the single package entry shows all layers of the image in which it was captured.

We then merge the location data

Got it — I think this bit was the implicit part that I didn't pick up on at first, i.e. that the purpose of merging packages is to merge location entries.

It's probably a good idea to make this clear in the naming of the package Merge function, or at minimum, provide a clear doc comment for the benefit of the function's consumers.
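One way the naming suggestion could look is sketched below, under the assumption that only location entries are merged. `Package` and `MergeLocations` are hypothetical names for illustration, locations are simplified to strings, and this is not the Merge implementation from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// Package is a hypothetical, trimmed-down stand-in for syft's pkg.Package.
type Package struct {
	ID        string
	Name      string
	Locations []string
}

// MergeLocations adds any locations from other that p does not already have.
// It never touches p.ID, so merging cannot change the package's identity.
func (p *Package) MergeLocations(other Package) {
	seen := make(map[string]bool, len(p.Locations))
	for _, l := range p.Locations {
		seen[l] = true
	}
	for _, l := range other.Locations {
		if !seen[l] {
			seen[l] = true
			p.Locations = append(p.Locations, l)
		}
	}
	sort.Strings(p.Locations) // keep the order stable between runs
}

func main() {
	p := Package{ID: "abc", Name: "musl", Locations: []string{"layer-2:/lib/apk/db/installed"}}
	p.MergeLocations(Package{ID: "abc", Name: "musl", Locations: []string{"layer-1:/lib/apk/db/installed"}})
	fmt.Println(p.ID, p.Locations)
}
```

Naming the method after what it merges tells the caller up front that everything else on the package, the ID included, is left alone.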

Yes on wanting to do the merge [...]

See https://anchorecommunity.slack.com/archives/C4PJFNEEM/p1635425134053900?thread_ts=1635346512.050700&cid=C4PJFNEEM

luhring · 2021-10-27T17:06:20Z

syft/pkg/package.go

+	if err != nil {
+		// if there is an error generating the hash
+		// we still want a unique identifier
+		// TODO: track packages that don't get hashed for report?


Do we understand the circumstances in which we'd get an error back from hashstructure.Hash? I'd be curious to know what those are before deciding how we handle this error (or not handle it, for example)

wagoodman · 2021-10-28T17:30:48Z

Do we understand the circumstances in which we'd get an error back from hashstructure.Hash?

@luhring there doesn't appear to be any input that would directly cause hashstructure.Hash to return an error. The exceptions are circumstances like modifying the data structure concurrently while calling hashstructure.Hash, or reflect being unable to extract a value from interface{} that can be written to a hash.Hash64 (I can't think of a circumstance where that would happen [yet?]).

Though to answer the higher-level question "are there any kinds of inputs that could be a problem for hashstructure.Hash?" I think the answer is "yes": elements that have cyclic references can cause a panic (the issue is still open as of now).

spiffcs · 2021-10-28T19:19:11Z

Result of our conversation

Remove merge
Remove UUID and allow us to fail to add package

Signed-off-by: Christopher Angelo Phillips <[email protected]>

luhring · 2021-10-29T14:45:01Z

syft/pkg/catalog.go

-		p.ID = newID()
+		fingerprint, err := p.Fingerprint()
+		if err != nil {
+			log.Warnf("failed to add package to catalog: %w", err)


Should we return right after this line? This seems like the point in execution where we know the requested Add operation is not going to happen.

return added - apologies for the oversight here
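The shape of Add after this fix can be sketched as follows. This is a simplified illustration with hypothetical `Package` and `Catalog` types, not the merged code; the standard `log` package stands in for syft's logger. The sketch logs with `%v` because, in the standard library, the `%w` verb is only given meaning by `fmt.Errorf`.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Package and Catalog are hypothetical, trimmed-down stand-ins for the syft types.
type Package struct {
	ID   string
	Name string
}

// Fingerprint is a placeholder that fails for packages without a name.
func (p Package) Fingerprint() (string, error) {
	if p.Name == "" {
		return "", errors.New("package has no name")
	}
	return "id-" + p.Name, nil
}

type Catalog struct {
	byID map[string]Package
}

func NewCatalog() *Catalog {
	return &Catalog{byID: make(map[string]Package)}
}

// Add fingerprints the package once, stores the result as its ID, and
// returns early when no fingerprint can be computed.
func (c *Catalog) Add(p Package) {
	fingerprint, err := p.Fingerprint()
	if err != nil {
		log.Printf("failed to add package to catalog: %v", err)
		return // nothing is added without a deterministic ID
	}
	p.ID = fingerprint
	c.byID[p.ID] = p
}

func main() {
	c := NewCatalog()
	c.Add(Package{Name: "musl"})
	c.Add(Package{}) // logged and skipped
	fmt.Println(len(c.byID)) // prints 1
}
```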

Signed-off-by: Christopher Angelo Phillips <[email protected]>


spiffcs added 8 commits October 26, 2021 11:58

basic implementation

2572a53

Signed-off-by: Christopher Angelo Phillips <[email protected]>

remove panic; generate id then check

49eb308

Signed-off-by: Christopher Angelo Phillips <[email protected]>

update branch with functional code

5dbacac

Signed-off-by: Christopher Angelo Phillips <[email protected]>

add merge test

6cdefa9

Signed-off-by: Christopher Angelo Phillips <[email protected]>

update linter checks

aae4ff0

Signed-off-by: Christopher Angelo Phillips <[email protected]>

add and update package test

92db179

Signed-off-by: Christopher Angelo Phillips <[email protected]>

add back location_set_test

1e49818

Signed-off-by: Christopher Angelo Phillips <[email protected]>

add sort test for locations

287af3d

Signed-off-by: Christopher Angelo Phillips <[email protected]>


spiffcs marked this pull request as ready for review October 27, 2021 14:53

spiffcs linked an issue Oct 27, 2021 that may be closed by this pull request

Stabilize package identifier based on contents #363

Closed

spiffcs requested a review from a team October 27, 2021 16:28

luhring reviewed Oct 27, 2021


spiffcs mentioned this pull request Oct 28, 2021

Add a release for linux/arm64 #597

Closed

spiffcs force-pushed the 363-package-identifier branch 2 times, most recently from b18f6f4 to 287af3d Compare October 28, 2021 18:56

spiffcs mentioned this pull request Oct 28, 2021

Report known unknowns directly in the output SBOM #518

Closed

spiffcs added 2 commits October 29, 2021 10:24

update to a simpler fingerprint scheme

1d126e2

Signed-off-by: Christopher Angelo Phillips <[email protected]>

remove unneeded code

e470302

Signed-off-by: Christopher Angelo Phillips <[email protected]>

luhring reviewed Oct 29, 2021


return on log warning

92eacfa

Signed-off-by: Christopher Angelo Phillips <[email protected]>

wagoodman approved these changes Oct 29, 2021


spiffcs enabled auto-merge (squash) October 29, 2021 15:56

spiffcs merged commit a2882ee into main Oct 29, 2021

spiffcs deleted the 363-package-identifier branch October 29, 2021 16:00

spiffcs mentioned this pull request Nov 8, 2021

Support to stack multiple Syft SBOM files into a single one #617

Open

wagoodman mentioned this pull request Mar 7, 2022

SBOM from all-layers scope showing duplicate packages #32

Closed

GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024

Update package identifier to be constant across multiple syft run (an…

1205ed0

…chore#595) Signed-off-by: Christopher Angelo Phillips <[email protected]>


4 participants
