Update package identifier to be constant across multiple syft run #595
Signed-off-by: Christopher Angelo Phillips <[email protected]>
syft/pkg/package.go
Outdated
if err != nil {
    // if there is an error generating the hash
    // we still want a unique identifier
    // TODO: track packages that don't get hashed for report?
On the fence here as well.
I don't like "" as the return.
I don't think we should reserve 0 and keep overwriting it.
I do think we should track these packages that are unable to be hashed for some reason.
Struggling to find edge cases where that could happen.
Do we understand the circumstances in which we'd get an error back from hashstructure.Hash? I'd be curious to know what those are before deciding how we handle this error (or not handle it, for example)
The Package struct is what we're hashing here.
From the hashstructure package, here is the case statement where the struct is parsed:
https://github.com/mitchellh/hashstructure/blob/191a1e82bbeb456444f1a9bb9db2e4a20989cd98/hashstructure.go#L179-L389
Specific to strings:
https://github.com/mitchellh/hashstructure/blob/191a1e82bbeb456444f1a9bb9db2e4a20989cd98/hashstructure.go#L383-L386
It uses the default stdlib hash interface:
https://pkg.go.dev/hash
// Write (via the embedded io.Writer interface) adds more data to the running hash.
// It never returns an error.
Our Package struct doesn't have a variety of complex types (almost everything boils down to string), which is why I said I was struggling to find an edge case where the error could happen so that I could fuzz/test this part of the code.
I gotcha, that makes sense. I think this is tricky. One impact of using the hashstructure approach — specifically where we start with the entire struct and exclude select fields, as opposed to starting with nothing and then including select fields — is we have less explicit control over emerging edge cases as the struct's inner data evolves. For example, if we modify a specific package metadata struct that's used for a certain package type, we could be impacting how successfully this fingerprinting operation executes without realizing this. This could have obvious effects (e.g. an error bubbles up) or non-obvious effects (fingerprints are no longer unique) depending on how we handle edge cases here.
I think we should be able to rely on our fingerprint implementation to be deterministic and reliable. The notion of having a backup approach (particularly, generating a UUID), to me, is a sign that we might want to pause and think about this. Just my two cents!
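To make the "start with nothing and include select fields" alternative concrete, here is a minimal sketch. It is not syft's implementation: the `Package` struct, its fields, and the `fingerprint` helper are hypothetical stand-ins, and it uses the standard library's `crypto/sha256` rather than hashstructure. The point is only that an explicit allowlist keeps the fingerprint deterministic and unaffected by fields added to the struct later.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Package is a hypothetical, trimmed-down stand-in for syft's pkg.Package.
type Package struct {
	Name      string
	Version   string
	Type      string
	Locations []string // deliberately left out of the fingerprint
}

// fingerprint hashes an explicit allowlist of fields. Adding a new field to
// Package cannot change the result unless the field is also listed here.
func fingerprint(p Package) string {
	h := sha256.New()
	for _, field := range []string{p.Name, p.Version, p.Type} {
		// Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
		// Writes to a hash.Hash never return an error, so the result is ignored.
		fmt.Fprintf(h, "%d:%s;", len(field), field)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := Package{Name: "openssl", Version: "1.1.1k", Type: "apk", Locations: []string{"layer-1"}}
	b := Package{Name: "openssl", Version: "1.1.1k", Type: "apk", Locations: []string{"layer-2"}}
	fmt.Println(fingerprint(a) == fingerprint(b)) // prints "true"
}
```

The trade-off is the mirror image of the one described above: a new field that should influence identity has to be added to the allowlist by hand.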
We could go the extremely strict path and do a Must implementation where we panic or fail the process completely if we cannot hash a package.
https://github.com/google/uuid/blob/44b5fee7c49cf3bcdf723f106b36d56ef13ccc88/uuid.go#L176
UUID is a placeholder since it was the previous method and I could not think of a base case better than it without modifying the code to track packages that could not be hashed.
Is there room in the output to include packages that could not be successfully hashed?
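The strict Must path mentioned above could look roughly like the following. This is only a sketch of the convention: `fingerprint` and `mustFingerprint` are hypothetical names, and the failure condition (an empty name) is invented purely so the error branch can be exercised.

```go
package main

import (
	"errors"
	"fmt"
)

// fingerprint is a hypothetical stand-in for a hashing routine that can fail.
func fingerprint(name, version string) (string, error) {
	if name == "" {
		return "", errors.New("cannot fingerprint a package without a name")
	}
	return name + "@" + version, nil
}

// mustFingerprint follows the Must convention: it panics instead of
// returning an error, so callers never see a package without an ID.
func mustFingerprint(name, version string) string {
	fp, err := fingerprint(name, version)
	if err != nil {
		panic(fmt.Sprintf("unable to fingerprint package: %v", err))
	}
	return fp
}

func main() {
	fmt.Println(mustFingerprint("openssl", "1.1.1k")) // prints "openssl@1.1.1k"
}
```

A panic stops the whole run, so this only makes sense if a hashing failure is considered a programming error and not something a user's input can trigger.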
My two cents: We shouldn't allow for any situation where we can't compute a deterministic package fingerprint... and I don't think it's a necessity for such a situation to be possible. If it is possible because of our implementation decision, I think that's worth discussing. As you've mentioned, we're going to be relying on this fingerprint quite a bit.
This is very exciting work... 🙌
I have some questions around design choices — feel free to point me to an existing thread if there are answers noted already somewhere
if exists {
    log.Errorf("package ID already exists in the catalog : id=%+v %+v", p.ID, p)
    return
if p.ID == "" {
Question: Do we still need the Package struct to have an ID field that's distinct from the Fingerprint() function's return value?
I'm wondering if we could make it clearer whose responsibility it is to determine a package's ID. With this if check, it looks like it might be the consumer of Catalog.Add or it might be this implementation. Could we make it clearer and consistent where a package's ID originates?
I discussed this with @wagoodman and I think that we should just remove this check and have fingerprint() within Add be the sole operator of generating/assigning an ID.
Keeping the ID field is ideal for now since it decouples the fact that a package's fingerprint just happens to be the current form of its unique identification. We're still pre 1.0.0 so there might be more we discover along the way where this approach turns out to need more nuance. ID gives us a canvas to capture that nuance and is a good foothold for that kind of expansion.
A package should definitely know how to produce its fingerprint (see the fingerprint method).
I discussed this with @wagoodman and I think that we should just remove this check and have fingerprint() within Add be the sole operator of generating/assigning an ID.
I think this makes sense.
Keeping the ID field is ideal for now since it decouples the fact that a package's fingerprint just happens to be the current form of its unique identification.
I don't quite follow this part. What's not clicking for me is: what purpose does the package's ID field serve? Who should be setting its value? Who should be reading its value? Removing the field resolves these 3 ambiguities (for me). Leveraging the new (awesome) Fingerprint method forces this to be a readonly, deterministic value. 👍
I think the consideration here is that during catalog.Add() fingerprint is called a single time and then the package takes on the responsibility of persisting that value as the ID for the rest of the program's run. I don't think we should use this PR to refactor how we're using ID, considering @wagoodman is working on #556.
I do agree there should be some discussion in the future on how the ID field is used, but that should come after #556 is complete and we have a better foundation on IF we are going to continue using the catalog into v1.0.0.
I think the sign here is that when we decide to stop using the catalog we will be asked hard questions about the ID field, but with concurrent work going on over in #556 I believe it's better to get the win of consistent ID rather than totally refactoring or removing the field.
That's reasonable! Thanks for walking me through this. 🙏
How do we plan to track this work? ^ (cleaning up the code with regard to the ID field)
@wagoodman correct me if I'm wrong, but we have this tracked in an issue regarding the removal of the cataloger. I just can't find it at the moment.
I think this #554 might be where we start to address this work?
This work (removal of the catalog) isn't tracked explicitly. The best place for this is under #556. The reason for this is that #556 is the first opportunity to see how the nature of IDs will work differently... I was going to write an issue to the effect of "remove the pkg.Catalog" but I felt that the issue would be prescribing a solution without highlighting the reasons why it makes sense to remove it, or how removing it helps out the cataloging process.
If there is a need to remove the pkg.Catalog then a signal will probably arise within #556 or shortly thereafter in an issue/PR that attempts to add relationships within a cataloger.
(#554 is affected by pkg.Catalog removal for sure, but doesn't appear to be a primary motivator for the change)
syft/pkg/catalog.go
Outdated
if other, exists := c.byID[p.ID]; exists {
    // there is already a package with this fingerprint
    // merge the existing record with the new one
    other.Merge(p)
    return
Are we sure we want to do this? What's the motivation for merging two packages? And is it correct to expect that this operation might actually end up changing the package's ID?
Yes on wanting to do the merge.
The current implementation of syft will generate two packages if the locations and layer information are different. By ignoring the locations field, we allow layer information for similar packages to be merged under a single entry.
I do not think it is correct to expect that this operation would change the package ID. This merge only happens if a hash is identical to one already entered into the catalog. We then merge the location data so that the single package information shows all layers of the image that it was captured at.
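As a rough illustration of the behavior described above (not the code in this PR), the sketch below uses hypothetical `Package`, `Catalog`, `fingerprint`, and `mergeLocations` definitions. The ID is derived only from fields other than `Locations`, so merging locations into an existing entry cannot change that entry's ID.

```go
package main

import "fmt"

// Package is a hypothetical, simplified stand-in for syft's pkg.Package.
type Package struct {
	ID        string
	Name      string
	Version   string
	Locations []string
}

// fingerprint ignores Locations, so the same package found in two layers
// produces the same value. (Hypothetical scheme, for illustration only.)
func (p Package) fingerprint() string {
	return p.Name + "@" + p.Version
}

// mergeLocations appends any of other's locations that p does not already have.
func (p *Package) mergeLocations(other Package) {
	seen := make(map[string]bool, len(p.Locations))
	for _, l := range p.Locations {
		seen[l] = true
	}
	for _, l := range other.Locations {
		if !seen[l] {
			p.Locations = append(p.Locations, l)
			seen[l] = true
		}
	}
}

// Catalog indexes packages by ID.
type Catalog struct {
	byID map[string]*Package
}

func NewCatalog() *Catalog {
	return &Catalog{byID: make(map[string]*Package)}
}

// Add assigns the ID from the fingerprint and merges location data when a
// package with the same fingerprint is already present.
func (c *Catalog) Add(p Package) {
	p.ID = p.fingerprint()
	if existing, ok := c.byID[p.ID]; ok {
		existing.mergeLocations(p)
		return
	}
	c.byID[p.ID] = &p
}

func main() {
	c := NewCatalog()
	c.Add(Package{Name: "openssl", Version: "1.1.1k", Locations: []string{"layer-1"}})
	c.Add(Package{Name: "openssl", Version: "1.1.1k", Locations: []string{"layer-2"}})
	fmt.Println(len(c.byID), c.byID["openssl@1.1.1k"].Locations) // prints "1 [layer-1 layer-2]"
}
```

Naming the method after what it merges (locations) is one way to make the intent explicit to consumers.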
We then merge the location data
Got it — I think this bit was the implicit part that I didn't pick up on at first, i.e. that the purpose of merging packages is to merge location entries.
It's probably a good idea to make this clear in the naming of the package Merge function, or at minimum, provide a clear doc comment for the benefit of the function's consumers.
Yes on wanting to do the merge [...]
syft/pkg/package.go
Outdated
if err != nil {
    // if there is an error generating the hash
    // we still want a unique identifier
    // TODO: track packages that don't get hashed for report?
Do we understand the circumstances in which we'd get an error back from hashstructure.Hash? I'd be curious to know what those are before deciding how we handle this error (or not handle it, for example)
@luhring there doesn't appear to be any input that would directly influence an error being returned from Though to answer the higher question "are there any kinds of inputs that could be a problem for |
b18f6f4 to 287af3d
Result of our conversation
p.ID = newID()
fingerprint, err := p.Fingerprint()
if err != nil {
    log.Warnf("failed to add package to catalog: %w", err)
Should we return right after this line? This seems like the point in execution where we know the requested Add operation is not going to happen.
return added - apologies for the oversight here
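For reference, the early-return shape discussed in this thread can be sketched as follows. The `Package`, `Catalog`, and `Fingerprint` definitions here are hypothetical, and the standard library's `log.Printf` stands in for the project's logger. The sketch formats the error with `%v`, since `%w` is only given special meaning by `fmt.Errorf`.

```go
package main

import (
	"errors"
	"log"
)

// Package is a hypothetical, simplified stand-in for syft's pkg.Package.
type Package struct {
	ID   string
	Name string
}

// Fingerprint is a hypothetical implementation that fails on an empty name
// so that the error branch in Add can be exercised.
func (p Package) Fingerprint() (string, error) {
	if p.Name == "" {
		return "", errors.New("package has no name")
	}
	return "fp-" + p.Name, nil
}

// Catalog indexes packages by ID.
type Catalog struct {
	byID map[string]Package
}

// Add assigns the fingerprint as the ID. If fingerprinting fails, it logs a
// warning and returns without touching the catalog.
func (c *Catalog) Add(p Package) {
	fingerprint, err := p.Fingerprint()
	if err != nil {
		log.Printf("failed to add package to catalog: %v", err)
		return
	}
	p.ID = fingerprint
	c.byID[p.ID] = p
}

func main() {
	c := &Catalog{byID: map[string]Package{}}
	c.Add(Package{Name: "openssl"})
	c.Add(Package{}) // logged and skipped
	log.Printf("catalog size: %d", len(c.byID))
}
```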
…chore#595) Signed-off-by: Christopher Angelo Phillips <[email protected]>
Stabilize Syft Package ID
To explore the changes this branch introduces you can run the following commands.
In the above, you should see a number of IDs printed out, each with a different UUID.
There should be no diff in the above.
The above diff should be easier to explore with the missing ID field.
TODO