@spiffcs spiffcs commented Oct 27, 2021

Stabilize Syft Package ID

To explore the changes this branch introduces you can run the following commands.

syft version == main

syft -o json alpine:latest > /tmp/json1.json 
syft -o json alpine:latest > /tmp/json2.json  

diff /tmp/json1.json /tmp/json2.json  

In the diff above, you should see many ID fields that differ between the two runs, each holding a different UUID.

syft version == 363-package-identifier

syft -o json alpine:latest > /tmp/json1.json 
syft -o json alpine:latest > /tmp/json2.json  

diff /tmp/json1.json /tmp/json2.json  

There should be no diff in the above.

syft version == 363-package-identifier

syft -o json alpine:latest > /tmp/json1.json 
syft -o json alpine:3.12 > /tmp/json2.json  

diff /tmp/json1.json /tmp/json2.json  

The diff above should be easier to explore now that differing ID values no longer add noise to it.

TODO

  • I left a couple of comments on the PR for discussion
  • Are there fields we're not ignoring that we should ignore to generate the hash?
  • How do we feel about the error case where a hash could not be generated?
  • How do we track packages that can't be hashed?

Signed-off-by: Christopher Angelo Phillips <[email protected]>

github-actions bot commented Oct 27, 2021

Benchmark Test Results

Benchmark results from the latest changes vs base branch
name                                                   old time/op    new time/op    delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2          1.20ms ± 3%    1.38ms ± 6%  +14.75%  (p=0.008 n=5+5)
ImagePackageCatalogers/python-package-cataloger-2        2.41ms ± 2%    3.41ms ±14%  +41.45%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2     730µs ± 1%     899µs ±15%  +23.25%  (p=0.008 n=5+5)
ImagePackageCatalogers/dpkgdb-cataloger-2                 747µs ± 1%     865µs ± 6%  +15.84%  (p=0.008 n=5+5)
ImagePackageCatalogers/rpmdb-cataloger-2                  769µs ± 1%     854µs ± 4%  +11.06%  (p=0.008 n=5+5)
ImagePackageCatalogers/java-cataloger-2                  10.7ms ± 2%    13.6ms ±14%  +27.24%  (p=0.008 n=5+5)
ImagePackageCatalogers/apkdb-cataloger-2                 1.02ms ± 1%    1.39ms ±11%  +36.14%  (p=0.008 n=5+5)
ImagePackageCatalogers/go-module-binary-cataloger-2       610ns ± 1%     665ns ±17%   +8.90%  (p=0.032 n=5+5)

name                                                   old alloc/op   new alloc/op   delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2           228kB ± 0%     248kB ± 0%   +9.06%  (p=0.008 n=5+5)
ImagePackageCatalogers/python-package-cataloger-2         980kB ± 0%    1109kB ± 0%  +13.19%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2     192kB ± 0%     199kB ± 0%   +3.70%  (p=0.008 n=5+5)
ImagePackageCatalogers/dpkgdb-cataloger-2                 211kB ± 0%     228kB ± 0%   +8.40%  (p=0.008 n=5+5)
ImagePackageCatalogers/rpmdb-cataloger-2                  217kB ± 0%     222kB ± 0%   +2.31%  (p=0.008 n=5+5)
ImagePackageCatalogers/java-cataloger-2                  3.12MB ± 0%    3.24MB ± 0%   +3.82%  (p=0.008 n=5+5)
ImagePackageCatalogers/apkdb-cataloger-2                 1.26MB ± 0%    1.29MB ± 0%   +2.78%  (p=0.008 n=5+5)
ImagePackageCatalogers/go-module-binary-cataloger-2        336B ± 0%      336B ± 0%     ~     (all equal)

name                                                   old allocs/op  new allocs/op  delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2           5.46k ± 0%     6.82k ± 0%  +24.91%  (p=0.016 n=5+4)
ImagePackageCatalogers/python-package-cataloger-2         18.1k ± 0%     26.3k ± 0%  +45.06%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2     4.80k ± 0%     5.19k ± 0%   +8.19%  (p=0.008 n=5+5)
ImagePackageCatalogers/dpkgdb-cataloger-2                 5.53k ± 0%     6.67k ± 0%  +20.65%  (p=0.008 n=5+5)
ImagePackageCatalogers/rpmdb-cataloger-2                  6.22k ± 0%     6.56k ± 0%   +5.38%  (p=0.008 n=5+5)
ImagePackageCatalogers/java-cataloger-2                   51.5k ± 0%     59.0k ± 0%  +14.55%  (p=0.008 n=5+5)
ImagePackageCatalogers/apkdb-cataloger-2                  5.49k ± 0%     7.74k ± 0%  +40.88%  (p=0.016 n=4+5)
ImagePackageCatalogers/go-module-binary-cataloger-2        9.00 ± 0%      9.00 ± 0%     ~     (all equal)

if err != nil {
// if there is an error generating the hash
// we still want a unique identifier
// TODO: track packages that don't get hashed for report?
Contributor Author

On the fence here as well.

I don't like "" as the return.
I don't think we should reserve 0 and keep overwriting it.

I do think we should track these packages that are unable to be hashed for some reason.

Struggling to find edge cases where that could happen.

Contributor

Do we understand the circumstances in which we'd get an error back from hashstructure.Hash? I'd be curious to know what those are before deciding how we handle this error (or not handle it, for example)

Contributor Author

The Package struct is what we're hashing here.

From the hashstructure package, here is the case statement where the struct is parsed:
https://github.com/mitchellh/hashstructure/blob/191a1e82bbeb456444f1a9bb9db2e4a20989cd98/hashstructure.go#L179-L389

Specific to strings:
https://github.com/mitchellh/hashstructure/blob/191a1e82bbeb456444f1a9bb9db2e4a20989cd98/hashstructure.go#L383-L386

It uses the default stdlib hash interface:
https://pkg.go.dev/hash

// Write (via the embedded io.Writer interface) adds more data to the running hash.
// It never returns an error.

Our Package struct doesn't contain a variety of complex types (almost everything boils down to strings), which is why I said I was struggling to find an edge case where the error could occur so that I could fuzz/test this part of the code.
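To illustrate the point about string fields and hash.Hash, here is a minimal sketch using only the standard library. It is not the hashstructure implementation: the Package type, its fields, and the fingerprint helper are hypothetical stand-ins, and FNV-1a is swapped in for whatever hashstructure uses internally.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Package is a hypothetical, stripped-down stand-in for syft's Package struct.
type Package struct {
	Name    string
	Version string
	Type    string
}

// fingerprint hashes the string fields with FNV-1a from the standard library.
// The hash.Hash contract says Write never returns an error, so this function
// has no failure path of its own.
func fingerprint(p Package) string {
	h := fnv.New64a()
	for _, field := range []string{p.Name, p.Version, p.Type} {
		h.Write([]byte(field))
		h.Write([]byte{0}) // separator so ("ab", "c") and ("a", "bc") hash differently
	}
	return fmt.Sprintf("%x", h.Sum64())
}

func main() {
	p := Package{Name: "musl", Version: "1.2.2-r3", Type: "apk"}
	// The same input always yields the same fingerprint.
	fmt.Println(fingerprint(p) == fingerprint(p))
}
```

With only string fields and a writer that cannot fail, it is hard to construct an input that produces an error, which matches the difficulty described above.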

Contributor

I gotcha, that makes sense. I think this is tricky. One impact of using the hashstructure approach — specifically where we start with the entire struct and exclude select fields, as opposed to starting with nothing and then including select fields — is we have less explicit control over emerging edge cases as the struct's inner data evolves. For example, if we modify a specific package metadata struct that's used for a certain package type, we could be impacting how successfully this fingerprinting operation executes without realizing this. This could have obvious effects (e.g. an error bubbles up) or non-obvious effects (fingerprints are no longer unique) depending on how we handle edge cases here.

I think we should be able to rely on our fingerprint implementation to be deterministic and reliable. The notion of having a backup approach (particularly, generating a UUID), to me, is a sign that we might want to pause and think about this. Just my two cents!
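For contrast with the exclude-based approach, the "start with nothing and include select fields" alternative mentioned above might look roughly like the sketch below. This is not what the PR implements; the Package type, its field names, and includeFingerprint are hypothetical, and SHA-256 is chosen here only for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Package is a hypothetical stand-in; the field names are illustrative only.
type Package struct {
	Name      string
	Version   string
	Type      string
	Locations []string // deliberately NOT part of the fingerprint
}

// includeFingerprint starts from nothing and opts fields in explicitly, so a
// field added to Package later cannot change fingerprints by accident.
func includeFingerprint(p Package) string {
	included := []string{p.Name, p.Version, p.Type}
	sum := sha256.Sum256([]byte(strings.Join(included, "\x00")))
	return hex.EncodeToString(sum[:8]) // first 8 bytes, hex-encoded
}

func main() {
	a := Package{Name: "zlib", Version: "1.2.11-r3", Type: "apk", Locations: []string{"layer-1"}}
	b := a
	b.Locations = []string{"layer-2"}
	// Locations differ, but the fingerprints match because Locations is not included.
	fmt.Println(includeFingerprint(a) == includeFingerprint(b))
}
```

The trade-off is that every field relevant to identity has to be listed by hand, whereas the exclude-based approach picks up new fields automatically.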

Contributor Author

We could go the extremely strict path and do a Must implementation where we panic or fail the process completely if we cannot hash a package.
https://github.com/google/uuid/blob/44b5fee7c49cf3bcdf723f106b36d56ef13ccc88/uuid.go#L176

UUID is a placeholder since it was the previous method and I could not think of a base case better than it without modifying the code to track packages that could not be hashed.

Is there room in the output to include packages that could not be successfully hashed?
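As a rough sketch of the strict option, a Must-style helper in the shape of uuid.Must (value and error in, value out, panic on error) could look like this. The fingerprint function here is a hypothetical placeholder and not syft code.

```go
package main

import (
	"errors"
	"fmt"
)

// fingerprint is a hypothetical function that may fail; it stands in for the
// real hashing logic.
func fingerprint(name string) (string, error) {
	if name == "" {
		return "", errors.New("cannot fingerprint a package with no name")
	}
	return "fp-" + name, nil
}

// mustFingerprint mirrors the shape of uuid.Must: it returns the value
// unchanged, or panics if err is non-nil.
func mustFingerprint(fp string, err error) string {
	if err != nil {
		panic(err)
	}
	return fp
}

func main() {
	// A multi-value return can be passed directly to a function with matching parameters.
	fmt.Println(mustFingerprint(fingerprint("busybox")))
}
```

A helper like this makes a hashing failure impossible to ignore, at the cost of aborting the whole run for a single bad package.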

Contributor

My two cents: We shouldn't allow for any situation where we can't compute a deterministic package fingerprint... and I don't think it's a necessity for such a situation to be possible. If it is possible because of our implementation decision, I think that's worth discussing. As you've mentioned, we're going to be relying on this fingerprint quite a bit.

@spiffcs spiffcs marked this pull request as ready for review October 27, 2021 14:53
@spiffcs spiffcs linked an issue Oct 27, 2021 that may be closed by this pull request
@spiffcs spiffcs requested a review from a team October 27, 2021 16:28
Contributor

@luhring luhring left a comment

This is very exciting work... 🙌

I have some questions around design choices — feel free to point me to an existing thread if there are answers noted already somewhere

if exists {
log.Errorf("package ID already exists in the catalog : id=%+v %+v", p.ID, p)
return
if p.ID == "" {
Contributor

Question: Do we still need the Package struct to have an ID field that's distinct from the Fingerprint() function's return value?

I'm wondering if we could make it clearer whose responsibility it is to determine a package's ID. With this if check, it looks like it might be the consumer of Catalog.Add or it might be this implementation. Could we make it clear and consistent where a package's ID originates?

Contributor Author

I discussed this with @wagoodman and I think that we should just remove this check and have fingerprint() within Add be the sole operator of generating/assigning an ID.

Keeping the ID field is ideal for now since it decouples the fact that a package's fingerprint just happens to be the current form of its unique identification. We're still pre 1.0.0 so there might be more we discover along the way where this approach turns out to need more nuance. ID gives us a canvas to capture that nuance and is a good foothold for that kind of expansion.

A package should definitely know how to produce its fingerprint (see the fingerprint method).

Contributor

I discussed this with @wagoodman and I think that we should just remove this check and have fingerprint() within Add be the sole operator of generating/assigning an ID.

I think this makes sense.

Keeping the ID field is ideal for now since it decouples the fact that a package's fingerprint just happens to be the current form of its unique identification.

I don't quite follow this part. What's not clicking for me is: what purpose does the package's ID field serve? Who should be setting its value? Who should be reading its value? Removing the field resolves these 3 ambiguities (for me). Leveraging the new (awesome) Fingerprint method forces this to be a readonly, deterministic value. 👍

Contributor Author

I think the consideration here is that during catalog.Add() the fingerprint is computed a single time, and the package then takes on the responsibility of persisting that value as its ID for the rest of the program's run. I don't think we should use this PR to refactor how we're using ID, considering @wagoodman is working on #556.

I do agree there should be some discussion in the future on how the ID field is used, but that should come after #556 is complete and we have a better foundation on IF we are going to continue using the catalog into v1.0.0.

I think the sign here is that when we decide to stop using the catalog we will be asked hard questions about the ID field, but with concurrent work going on over in #556 I believe it's better to get the win of consistent ID rather than totally refactoring or removing the field.

Contributor

That's reasonable! Thanks for walking me through this. 🙏

Contributor

How do we plan to track this work? ^ (cleaning up the code with regard to the ID field)

Contributor Author

@wagoodman correct me if I'm wrong, but we have this tracked in an issue regarding the removal of the cataloger. I just can't find it at the moment.

I think this #554 might be where we start to address this work?

Contributor

This work (removal of the catalog) isn't tracked explicitly. The best place for this is under #556. The reason for this is that #556 is the first opportunity to see how the nature of IDs will work differently... I was going to write an issue to the effect of "remove the pkg.Catalog" but I felt that the issue would be prescribing a solution without highlighting the reasons why it makes sense to remove it, or how removing it helps out the cataloging process.

If there is a need to remove the pkg.Catalog, then that signal will probably arise within #556 or shortly thereafter in an issue/PR that attempts to add relationships within a cataloger.

Contributor

(#554 is affected by pkg.Catalog removal for sure, but doesn't appear to be a primary motivator for the change)

Comment on lines 74 to 78
if other, exists := c.byID[p.ID]; exists {
// there is already a package with this fingerprint
// merge the existing record with the new one
other.Merge(p)
return
Contributor

Are we sure we want to do this? What's the motivation for merging two packages? And is it correct to expect that this operation might actually end up changing the package's ID?

Contributor Author

Yes on wanting to do the merge.

The current implementation of syft will generate two packages if the location and layer information are different. By ignoring the locations field, we allow layer information for similar packages to be merged under a single entry.

I do not think it is correct to expect that this operation would change the package ID. This merge only happens if a hash is identical to one already entered into the catalog. We then merge the location data so that the single package entry shows all layers of the image in which it was captured.
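A minimal sketch of the merge behavior described in this comment, assuming simplified stand-in types (Package, Catalog, and NewCatalog here are hypothetical and not syft's actual definitions). Note that a later comment in this thread records the merge being removed from the PR.

```go
package main

import "fmt"

// Package and Catalog are hypothetical, simplified stand-ins.
type Package struct {
	ID        string
	Name      string
	Locations []string
}

type Catalog struct {
	byID map[string]*Package
}

func NewCatalog() *Catalog {
	return &Catalog{byID: make(map[string]*Package)}
}

// Add stores p, or folds its locations into an existing entry that has the
// same ID. The ID of the existing entry is never modified.
func (c *Catalog) Add(p Package) {
	if other, exists := c.byID[p.ID]; exists {
		other.Locations = append(other.Locations, p.Locations...)
		return
	}
	c.byID[p.ID] = &p
}

func main() {
	c := NewCatalog()
	c.Add(Package{ID: "abc", Name: "musl", Locations: []string{"layer-1:/lib/apk/db/installed"}})
	c.Add(Package{ID: "abc", Name: "musl", Locations: []string{"layer-2:/lib/apk/db/installed"}})
	// One catalog entry, carrying the locations from both layers.
	fmt.Println(len(c.byID), c.byID["abc"].Locations)
}
```

Because the locations are excluded from the ID in this sketch, merging them cannot change the ID of the stored entry.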

Contributor

We then merge the location data

Got it — I think this bit was the implicit part that I didn't pick up on at first, i.e. that the purpose of merging packages is to merge location entries.

It's probably a good idea to make this clear in the naming of the package Merge function, or at minimum, provide a clear doc comment for the benefit of the function's consumers.

Yes on wanting to do the merge [...]

See https://anchorecommunity.slack.com/archives/C4PJFNEEM/p1635425134053900?thread_ts=1635346512.050700&cid=C4PJFNEEM

@wagoodman
Contributor

Do we understand the circumstances in which we'd get an error back from hashstructure.Hash?

@luhring there doesn't appear to be any input that would directly cause hashstructure.Hash to return an error, except for circumstances like modifying the data structure concurrently while calling hashstructure.Hash, or reflect being unable to extract a value from interface{} that can be written to a hash.Hash64 (I can't think of such a circumstance [yet?]).

Though to answer the higher question "are there any kinds of inputs that could be a problem for hashstructure.Hash?" I think the answer is "yes": Elements that have cyclic references can cause a panic (the issue is still open as of now).

@spiffcs spiffcs force-pushed the 363-package-identifier branch 2 times, most recently from b18f6f4 to 287af3d Compare October 28, 2021 18:56
@spiffcs
Contributor Author

spiffcs commented Oct 28, 2021

Result of our conversation

  • Remove merge
  • Remove UUID and allow us to fail to add package

Signed-off-by: Christopher Angelo Phillips <[email protected]>
p.ID = newID()
fingerprint, err := p.Fingerprint()
if err != nil {
log.Warnf("failed to add package to catalog: %v", err)
Contributor

Should we return right after this line? This seems like the point in execution where we know the requested Add operation is not going to happen.

Contributor Author

return added - apologies for the oversight here
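Putting the outcome of the thread together, the Add flow (fingerprint once, assign it as the ID, log and return early on failure) might be sketched as follows. The types and the Fingerprint body are hypothetical placeholders, and the standard library logger stands in for syft's own.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Package and Catalog are hypothetical, simplified stand-ins.
type Package struct {
	ID   string
	Name string
}

// Fingerprint is a placeholder for the real hashing logic.
func (p Package) Fingerprint() (string, error) {
	if p.Name == "" {
		return "", errors.New("package has no name")
	}
	return "fp-" + p.Name, nil
}

type Catalog struct {
	byID map[string]Package
}

// Add computes the fingerprint once and stores it as the package ID. If the
// fingerprint cannot be computed, the package is logged and not added.
func (c *Catalog) Add(p Package) {
	fingerprint, err := p.Fingerprint()
	if err != nil {
		log.Printf("failed to add package to catalog: %v", err)
		return
	}
	p.ID = fingerprint
	c.byID[p.ID] = p
}

func main() {
	c := &Catalog{byID: map[string]Package{}}
	c.Add(Package{Name: "busybox"})
	c.Add(Package{}) // logged and skipped
	fmt.Println(len(c.byID))
}
```

Without the early return, execution would continue past the warning and store a package under an empty ID.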

Signed-off-by: Christopher Angelo Phillips <[email protected]>
@spiffcs spiffcs enabled auto-merge (squash) October 29, 2021 15:56
@spiffcs spiffcs merged commit a2882ee into main Oct 29, 2021
@spiffcs spiffcs deleted the 363-package-identifier branch October 29, 2021 16:00
GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024
Development

Successfully merging this pull request may close these issues.

Stabilize package identifier based on contents

4 participants