Add package relationships and add ownership-by-file-overlap relationship #329
Conversation
Signed-off-by: Alex Goodman <[email protected]>
syft/pkg/npm_metadata.go
Outdated
| @@ -1,5 +1,7 @@ | |||
| package pkg | |||
|
|
|||
| var _ fileOwner = (*NpmPackageJSONMetadata)(nil) | |||
not sure I follow why this is needed, and I see the pattern repeated in the other packages. Could you help me understand what is going on?
indeed; this is a quick assertion that the type in question (NpmPackageJSONMetadata) positively implements the interface (fileOwner). Just a note, having this line here does not aid in functionality at all, but is a nice to have in case you make a change to the code and accidentally break the interface implementation (in this case .ownedFiles())
The specifics of the line are:
- var _ fileOwner = ...: make a variable with no name (_, a blank identifier) of type fileOwner (our interface)
- = (*t)(nil) ...: the value is the result of a type conversion, specifically interpreting a nil value as a pointer of our type (*NpmPackageJSONMetadata).
The upside of this approach is that you get a compile-time check that your type truly implements the interface.
Here's an article on type assertion vs type conversion: https://www.sohamkamani.com/golang/type-assertions-vs-type-conversions/
A nearly equivalent line would be:
var _ fileOwner = &NpmPackageJSONMetadata{}
The only practical difference is that this line allocates a struct, whereas the line in the review does not allocate a new instance of the type being checked.
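For readers unfamiliar with the idiom, here is a minimal, self-contained sketch of the compile-time assertion described above. The interface name fileOwner and the method ownedFiles come from the discussion; exampleMetadata and its Files field are hypothetical stand-ins, not the actual syft types.

```go
package main

import "fmt"

// fileOwner mirrors the interface discussed above: a type that can
// report the file paths it owns.
type fileOwner interface {
	ownedFiles() []string
}

// exampleMetadata is a hypothetical stand-in for a metadata type such as
// NpmPackageJSONMetadata; it is not the real syft struct.
type exampleMetadata struct {
	Files []string
}

// ownedFiles satisfies fileOwner. If this method were renamed or its
// signature changed, the assertion below would stop compiling.
func (m exampleMetadata) ownedFiles() []string {
	return m.Files
}

// Compile-time check: convert nil to *exampleMetadata and assign it to a
// discarded variable of the interface type. No instance is allocated.
var _ fileOwner = (*exampleMetadata)(nil)

func main() {
	var owner fileOwner = exampleMetadata{Files: []string{"/usr/lib/example/package.json"}}
	fmt.Println(owner.ownedFiles())
}
```

The assertion has no runtime effect; it only moves the failure from "somewhere a caller uses the interface" to the file that defines the type.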
syft/pkg/relations.go
Outdated
| package pkg | ||
|
|
||
| type Relations struct { | ||
| // ParentsByFileOwnership lists all parent packages that claim ownership of this package |
Reading this it seems that Parents is needed because there can be more than one parent for a package, but that doesn't seem right for the type of relationship this is trying to establish (many-to-one). This is arguable, but I tend to think about a Parent as a one way relationship to a child or children (package or packages in this case).
Is it possible to rethink these variables to map the same concept?
In a pure sense, what you are saying is true: there should really be a single parent and a possibility of multiple children. However, it is allowable for multiple packages to claim ownership of the same files, which would register here as multiple parents. Whether this is valid for any particular ecosystem doesn't necessarily matter: if one ecosystem supports it, we have to. Also, you could have a clash across ecosystems, where two packages from two ecosystems put a file in the same location and we found evidence of a package from that file.
I thought about making this something like:
type Relations struct {
ChildrenByFileOwnership []ID
}
But this has the downside that consumers have to do a bit of work to determine which packages are "root" packages (not owned by another package). By tracking parent relationships a consumer only needs to check that the ParentsByFileOwnership is empty to ensure it is a "root" package.
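To illustrate the consumer-side argument, here is a small sketch of the "root package" check. Relations and ParentsByFileOwnership are taken from the diff under review; the Package struct layout and the helper names isRoot and rootPackages are assumptions made for the example only.

```go
package main

import "fmt"

// ID identifies a package within a catalog.
type ID string

// Relations follows the shape shown in the diff under review.
type Relations struct {
	// ParentsByFileOwnership lists all parent packages that claim
	// ownership of this package.
	ParentsByFileOwnership []ID
}

// Package is a simplified, hypothetical package record.
type Package struct {
	ID        ID
	Name      string
	Relations Relations
}

// isRoot reports whether no other package claims ownership of p.
func isRoot(p Package) bool {
	return len(p.Relations.ParentsByFileOwnership) == 0
}

// rootPackages filters pkgs down to the packages without parents.
func rootPackages(pkgs []Package) []Package {
	var roots []Package
	for _, p := range pkgs {
		if isRoot(p) {
			roots = append(roots, p)
		}
	}
	return roots
}

func main() {
	pkgs := []Package{
		{ID: "1", Name: "python3-requests (deb)"},
		{ID: "2", Name: "requests (python)", Relations: Relations{ParentsByFileOwnership: []ID{"1"}}},
	}
	for _, p := range rootPackages(pkgs) {
		fmt.Println("root:", p.Name)
	}
}
```

With a children-only representation, the same check would require scanning every package's child list to learn whether a given ID appears anywhere.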
|
|
||
| func (m DpkgMetadata) ownedFiles() (result []string) { | ||
| for _, f := range m.Files { | ||
| if f.Path != "" { |
is it not desirable to prevent duplicates here? Or does it want to intentionally capture everything?
I could make this a set implementation and return the resulting slice, think it's worth it?
Good question - I'm not sure! But was wondering if it is a problem to have several duplicates here. It does seem that the intent is to report all owned files, not every single file reported from the package owner(s) - in which case it would make sense to prevent duplicates.
Either a set implementation returning a slice or preventing adding a new item if it already exists in the current slice sounds OK to me
Re set implementation — I'm not sure it's worth it.
I do have a question — are we allowing f.Path to be an empty string upstream? It'd be nice to be able to rely on a usable value here. I could be misunderstanding how this value gets used
Looking at the consumer of this, the idea is to use the owned files to get packages by path. Not de-duplicating here means extra (unnecessary) work for the consumer.
re: sets / duplicates: It's not clear if deduplicating signals anything different to the consumer (the given slice, deduplicated or not still indicates ownership), but it does clarify the intent of the interface ("provide a set of file paths that are owned"). I think I'll update to use a set in each, but still return a slice as the return type.
re: empty path: I felt that it was important to ensure that any implementation could not signal "ownership" of "nothing". I didn't check the underlying implementations to see if it was possible, but instead added the defense here. The coincidental implementation of underlying cataloger parsers shouldn't really drive the behavior of "fileOwner" IMHO.
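A rough sketch of what "use a set internally, return a slice" could look like, combined with the empty-path guard. This is not the code from the PR: it uses a plain map instead of strset to stay dependency-free, the types exampleFileRecord and exampleDpkgMetadata are hypothetical, and the final sort is an extra added here only to make the output order stable.

```go
package main

import (
	"fmt"
	"sort"
)

// exampleFileRecord is a hypothetical stand-in for a per-file metadata entry.
type exampleFileRecord struct {
	Path string
}

// exampleDpkgMetadata is a hypothetical stand-in for DpkgMetadata.
type exampleDpkgMetadata struct {
	Files []exampleFileRecord
}

// ownedFiles returns the unique, non-empty paths owned by the package.
func (m exampleDpkgMetadata) ownedFiles() []string {
	seen := make(map[string]struct{})
	for _, f := range m.Files {
		// An empty path would signal ownership of "nothing", so skip it.
		if f.Path == "" {
			continue
		}
		seen[f.Path] = struct{}{}
	}

	result := make([]string, 0, len(seen))
	for p := range seen {
		result = append(result, p)
	}
	// Map iteration order is random; sort so callers get stable output.
	sort.Strings(result)
	return result
}

func main() {
	m := exampleDpkgMetadata{Files: []exampleFileRecord{
		{Path: "/bin/b"}, {Path: ""}, {Path: "/bin/a"}, {Path: "/bin/b"},
	}}
	fmt.Println(m.ownedFiles())
}
```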
syft/pkg/catalog.go
Outdated
| // markPackageOwnership find overlaps in file ownership with a file that defines another package. Specifically, a .Location.Path of | ||
| // a package is found to be owned by another (from the owner's .Metadata.Files[]). This relationship is captured on the | ||
| // child package. | ||
| func (c *Catalog) markPackageOwnership() { |
This method is a bit on the long and nested side, it'd be nice to find ways to decompose this into more focused functions
the "splits" I considered seemed contrived, so I instead added comments to help.
The comments are a good start, because you've taken time to think through what each section of the method is doing. It's clear that different sections of this method are doing specific things. These comments can now provide a roadmap toward what makes sense to factor out. This will help others wrap their head around what's going on in this function more readily
I guess what I'm saying is that I don't think this function is that long (35-ish lines including whitespace & comments).
I see where it could get broken up into other functions, however in all cases you will be passing what could stay as local state around to other functions (pkgParents), which means a reader needs to understand what the separate function is doing relative to the already-prepared state from another function. In these cases I tend to prefer a slightly longer function (which is still within the linter rules) with healthy comments.
If there were multiple rules for what determines "ownership" I would probably change my mind.
👍 We can move forward with this PR. We'll chat offline and follow up in another PR as necessary.
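For context on what the function under discussion computes, the following is a simplified sketch of ownership by file overlap as described in the doc comment: a package found at a path that another package lists among its owned files becomes a child of that package. The Package fields, the function name parentsByFileOverlap, and the self-ownership skip are assumptions for illustration; the real markPackageOwnership works on the catalog and uses strset.

```go
package main

import (
	"fmt"
	"sort"
)

type ID string

// Package is a hypothetical, flattened view of a cataloged package.
type Package struct {
	ID            ID
	LocationPaths []string // paths where evidence of the package was found
	OwnedFiles    []string // paths the package's metadata claims to own
}

// parentsByFileOverlap maps each child package ID to the sorted IDs of
// packages that claim a file at one of the child's locations.
func parentsByFileOverlap(pkgs []Package) map[ID][]ID {
	// Step 1: index packages by the paths where they were discovered.
	byPath := make(map[string][]ID)
	for _, p := range pkgs {
		for _, loc := range p.LocationPaths {
			byPath[loc] = append(byPath[loc], p.ID)
		}
	}

	// Step 2: for every owned file, record the owner as a parent of any
	// package discovered at that path.
	parents := make(map[ID]map[ID]struct{})
	for _, owner := range pkgs {
		for _, f := range owner.OwnedFiles {
			for _, child := range byPath[f] {
				if child == owner.ID {
					continue // assumption: a package is not its own parent
				}
				if _, exists := parents[child]; !exists {
					parents[child] = make(map[ID]struct{})
				}
				parents[child][owner.ID] = struct{}{}
			}
		}
	}

	// Step 3: flatten each parent set into a sorted slice.
	result := make(map[ID][]ID, len(parents))
	for child, set := range parents {
		ids := make([]ID, 0, len(set))
		for id := range set {
			ids = append(ids, id)
		}
		sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
		result[child] = ids
	}
	return result
}

func main() {
	pkgs := []Package{
		{ID: "deb", LocationPaths: []string{"/var/lib/dpkg/status"}, OwnedFiles: []string{"/usr/lib/python3/requests/METADATA"}},
		{ID: "py", LocationPaths: []string{"/usr/lib/python3/requests/METADATA"}},
	}
	fmt.Println(parentsByFileOverlap(pkgs))
}
```

The three commented steps share the intermediate maps as local state, which is the trade-off raised in the thread: splitting them into separate functions means passing that state around.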
syft/pkg/catalog.go
Outdated
| // a package is found to be owned by another (from the owner's .Metadata.Files[]). This relationship is captured on the | ||
| // child package. | ||
| func (c *Catalog) markPackageOwnership() { | ||
| var pkgParents = make(map[ID]*strset.Set) |
I think this function would get a bit simpler and safer if the map's value type wasn't a pointer. This would avoid the need for nil checks further on
As written in the context of this function, the use of a pointer value in a map is safe (the use of a pointer in and of itself is not "unsafe").
Also, the constructor for strset provides a pointer to a set. Which nil check are you referring to? The closest I found was:
if _, exists := pkgParents[subPackage.ID]; !exists {
pkgParents[subPackage.ID] = strset.New()
}
which would be required either way.
the use of a pointer in and of itself is not "unsafe"
Agreed.
which nil check are you referring to?
An imaginary one that's not in the code yet.
As I understand this code, there's no point where a nil value can be assigned as a value for a map key in pkgParents, so in that sense we're good! The danger is in maintainability and future problems.
Making this a pointer means the system is more complex because there are more possible states in the system, which means it's now less difficult to introduce problems down the road. In this code, there are two accesses of a "potentially" nil value (to your point, not today, but there's no guarantee of that remaining true as this code is maintained):
- The call to .Add on line 154
- The call to .List on line 164
If I were a developer in the future coming to work on this function, I would try to look at this code and quickly get a sense for what's going on. I would see that the map's value type was a pointer to a strset.Set. I'd assume that this was done for a reason, and I'd further assume that the intention was so that the value could possibly be nil, since that's the primary impact of writing this as a pointer in the first place. I wouldn't assume that a pointer was added arbitrarily. Given this, I might alter the code in such a way that a nil value does become possible. In those cases, the calls to the methods Add and List would result in nil pointer dereferences. Hopefully, there are tests to catch this. But it would have been a problem that could be avoided altogether.
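The hazard being described can be shown in isolation. In this sketch, stringSet is a minimal stand-in for strset.Set (it is not the library type), and addPanics and setFor are hypothetical helpers. Looking up a missing key in a map of pointers yields nil, and calling a method that dereferences the receiver then panics; routing every access through one helper is one way to keep that state unreachable.

```go
package main

import "fmt"

type ID string

// stringSet is a minimal stand-in for strset.Set, not the real type.
type stringSet struct {
	items map[string]struct{}
}

func newStringSet() *stringSet {
	return &stringSet{items: make(map[string]struct{})}
}

// Add dereferences the receiver, so it panics when s is nil.
func (s *stringSet) Add(v string) { s.items[v] = struct{}{} }

// Len also dereferences the receiver.
func (s *stringSet) Len() int { return len(s.items) }

// addPanics reports whether calling Add on the set stored under id panics.
func addPanics(m map[ID]*stringSet, id ID) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	m[id].Add("parent") // m[id] is nil when the key is missing
	return false
}

// setFor returns the set for id, creating it on first use, so callers
// never observe a nil set.
func setFor(m map[ID]*stringSet, id ID) *stringSet {
	s, ok := m[id]
	if !ok || s == nil {
		s = newStringSet()
		m[id] = s
	}
	return s
}

func main() {
	pkgParents := make(map[ID]*stringSet)
	fmt.Println("direct access panics:", addPanics(pkgParents, "child"))

	setFor(pkgParents, "child").Add("parent")
	fmt.Println("entries after guarded access:", setFor(pkgParents, "child").Len())
}
```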
👍 We can move forward with this PR. We'll chat offline and follow up in another PR as necessary.
| } | ||
|
|
||
| if p.ID == "" { | ||
| p.ID = newID() |
This is a bit concerning — the nondeterminism — especially since this gets exposed outside of the implementation. But the same could be said of any persistence/store-oriented object like a database table. No change needed right now — just noting a non-ideal trait. Open to further discussion
IDs by nature should be nondeterministic. We shouldn't depend on a particular value and claim it is an ID, so the change from an incrementing number is an improvement (the incrementing number is still not deterministic, since package order may vary on addition depending on the cataloger implementation, for instance if we add concurrency).
As you noted, this is an impactful change though, since this is being exposed (where previously it was not). You couldn't reference these IDs across syft runs, since the IDs are random. However, we could in the future provide a stable/deterministic package "fingerprint" field derived from the package property values.
I think this issue is relevant with respect to IDs: #166. It's definitely something that has come up before.
Ah yeah, I remember that one. I don't know about "should be nondeterministic" — I think the important attribute of IDs is uniqueness. Something consumers can use to positively find a single item.
I'm okay with this code going in as is, I was more thinking of what @alfredodeza mentioned — this has come up before, and it warrants caution.
@alfredodeza good reference, adding to the PR to close it; this PR takes your suggestion to allow IDs to be an exported field on the Package struct and loosens assumptions about ID with regard to catalog assignment.
@luhring Agreed that IDs should be primarily unique, that's not in question (and also why UUID is used as the implementation).
I want to clarify why I think non-determinism is a good quality to have in an ID system as a rule of thumb (note, not as a strict "always-do-this" rule).
In #166, the fact that the ID was being derived from a global value is bad, but what is also bad is that tests were depending on predictable ID values for correctness, and the value was not being controlled by the test (which is what @alfredodeza was looking to do). If the ID had been random from the start, such accidental dependence on ID values would not have been possible: the tests would not have passed to begin with.
If the existing IDs were exposed with the current approach (incrementing number), an end user may assume that there is an implied order to packages. They may assume that the order is stable between runs, which is true, but only by coincidence of the current implementation. The truth is, the ID should not convey any intrinsic information, by design or accidentally, other than the fact that you're guaranteed to not have duplicates.
Anywho, I know you weren't asking for a change here @luhring but wanted to clarify my statements on IDs.
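The distinction drawn in this thread between a random ID and a possible future "fingerprint" can be sketched as follows. Everything here is illustrative: the PR uses a UUID, whereas this sketch uses 16 random bytes from crypto/rand to avoid a third-party dependency, and the fingerprint function (SHA-256 over name, version, and type) is only one hypothetical way such a field might be derived.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Package is a hypothetical, reduced set of package properties.
type Package struct {
	Name    string
	Version string
	Type    string
}

// newRandomID returns a random identifier. It conveys nothing about the
// package and differs between runs; it is a stand-in for a UUID.
func newRandomID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// fingerprint derives a stable value from package properties, so equal
// properties give equal fingerprints across runs.
func fingerprint(p Package) string {
	h := sha256.New()
	for _, field := range []string{p.Name, p.Version, p.Type} {
		h.Write([]byte(field))
		h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") differ
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	p := Package{Name: "requests", Version: "2.25.1", Type: "python"}

	id, err := newRandomID()
	if err != nil {
		panic(err)
	}
	fmt.Println("id (changes every run):  ", id)
	fmt.Println("fingerprint (stable):    ", fingerprint(p))
}
```

A test that asserted on the random ID would fail immediately, which is the accidental-dependence protection argued for above; a test that needs a stable reference would use the fingerprint instead.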
…test pkg versions Signed-off-by: Alex Goodman <[email protected]>
Signed-off-by: Alex Goodman <[email protected]>
Signed-off-by: Alex Goodman <[email protected]>
Signed-off-by: Alex Goodman <[email protected]>
This was a very nice improvement to see come to Syft. 🎉 This looks good to go — pending a few minor things we chatted about earlier (like finding a more distinct and descriptive value to replace "ownership-by-files".)
Jotting down a few notes from our conversation for future improvements to this code:
- Moving relationship calculation into a package for business logic instead of presentation logic.
- Types like map[ID]map[ID]*strset.Set can be difficult to reason about and prone to misuse. If storing data in this way is deemed necessary, it would be nice to hide this data structure inside an object that provides a clear API to consumers that illustrates the intended way to store and retrieve information.
- We're probably approaching a point soon where we want to re-examine the project's package layout. We touched briefly on how some locations for vars/consts/funcs/structs are starting to feel slightly out of place as we try to avoid import cycles and other design smells. This dissonance might be indicating the need for a refactor.
Signed-off-by: Alex Goodman <[email protected]>
* add marking package relations by file ownership
  Signed-off-by: Alex Goodman <[email protected]>
* correct json schema version; ensure fileOwners dont return dups; pin test pkg versions
  Signed-off-by: Alex Goodman <[email protected]>
* extract package relationships into separate section
  Signed-off-by: Alex Goodman <[email protected]>
* pull in client-go features for import of PackageRelationships
  Signed-off-by: Alex Goodman <[email protected]>
* move unit test for ownership by files relationship further down
  Signed-off-by: Alex Goodman <[email protected]>
* rename relationship to "ownership-by-file-overlap"
  Signed-off-by: Alex Goodman <[email protected]>
This PR adds a new ArtifactRelationships onto the JSON document to track one or more Relationships (edges between packages). The first relationship added is ownership-by-file-overlap, which relates packages whose .Location[].RealPath or .Location[].VirtualPath overlaps with another package's .Metadata.Files[].Path (or similar field).
TODO: artifactRelationships field
Relates to: anchore/anchore-engine#315, anchore/anchore-engine#445, and anchore/anchore-engine#460
Closes #326
Closes #166
Closes #198
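As a rough illustration of the description above, the sketch below models relationships as edges in a top-level artifactRelationships section and renders them as JSON. Only the section name and the relationship type string appear in this PR's text; the per-edge field names (parent, child, type) and the Go type names are assumptions, and the real schema may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Relationship is one edge between two packages. The JSON field names
// here are assumed for illustration, not taken from the syft schema.
type Relationship struct {
	Parent string `json:"parent"`
	Child  string `json:"child"`
	Type   string `json:"type"`
}

// Document is a reduced, hypothetical view of the JSON output that holds
// only the relationships section.
type Document struct {
	ArtifactRelationships []Relationship `json:"artifactRelationships"`
}

// render serializes the document as compact JSON.
func render(doc Document) (string, error) {
	b, err := json.Marshal(doc)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	doc := Document{ArtifactRelationships: []Relationship{
		{Parent: "pkg-a", Child: "pkg-b", Type: "ownership-by-file-overlap"},
	}}
	out, err := render(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Keeping edges in their own section, rather than on each package, matches the "extract package relationships into separate section" commit listed above.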