@spiffcs spiffcs commented Sep 30, 2021

Attempts to Solve #497

To reproduce the current issue, run:

syft -o json alpine:latest > /tmp/alpine.json:latest
syft -o json alpine:latest > /tmp/alpine.json:2
diff -u /tmp/alpine.json:2 /tmp/alpine.json:latest

This will produce a diff where the cpes field for some packages is sorted differently from run to run.

When countFieldLength is equal for two adjacent CPEs in a given list, our sort function can return them in a nondeterministic order, as in the cases below:

cpe:2.3:a:alpine-keys:alpine_keys:2.3-r1:*:*:*:*:*:*:*
cpe:2.3:a:alpine-keys:alpine-keys:2.3-r1:*:*:*:*:*:*:*
cpe:2.3:a:alpine_keys:alpine_keys:2.3-r1:*:*:*:*:*:*:*
cpe:2.3:a:alpine_keys:alpine-keys:2.3-r1:*:*:*:*:*:*:*
cpe:2.3:a:alpine:alpine_keys:2.3-r1:*:*:*:*:*:*:*
cpe:2.3:a:alpine:alpine-keys:2.3-r1:*:*:*:*:*:*:*
// If both Less(i, j) and Less(j, i) are false,
// then the elements at index i and j are considered equal.
// Sort may place equal elements in any order in the final result,
// while Stable preserves the original input order of equal elements.

If the below change is too specific or fragile I can explore using the Stable interface to see if that at least keeps us consistent across runs.

The array order coming into the sort is always consistent; the diff between runs appears when both Less(i, j) and Less(j, i) are false.

This PR assigns a priority based on the - character in the Vendor and Product fields.
Given the above example, the ideal sorted output would be:

xxx-xxx-xxx
xxx-xxx_xxx
xxx_xxx-xxx
xxx_xxx_xxx
xx-xx
xx_xx

Currently I'm trying to find images which break the assumptions found in this code.

Signed-off-by: Christopher Angelo Phillips <[email protected]>

github-actions bot commented Sep 30, 2021

Benchmark Test Results

Benchmark results from the latest changes vs base branch
name                                                   old time/op    new time/op    delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2          1.05ms ± 5%    1.05ms ± 1%     ~     (p=0.690 n=5+5)
ImagePackageCatalogers/python-package-cataloger-2        1.61ms ± 5%    1.85ms ± 6%  +14.98%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2     497µs ± 4%     512µs ± 2%     ~     (p=0.095 n=5+5)
ImagePackageCatalogers/dpkgdb-cataloger-2                 489µs ± 2%     499µs ± 2%     ~     (p=0.151 n=5+5)
ImagePackageCatalogers/rpmdb-cataloger-2                  498µs ± 2%     515µs ± 4%     ~     (p=0.056 n=5+5)
ImagePackageCatalogers/java-cataloger-2                  10.2ms ± 3%    11.0ms ± 2%   +8.25%  (p=0.008 n=5+5)
ImagePackageCatalogers/apkdb-cataloger-2                  785µs ± 4%     839µs ± 4%   +6.81%  (p=0.016 n=5+5)
ImagePackageCatalogers/go-cataloger-2                     254µs ± 3%     262µs ± 2%     ~     (p=0.095 n=5+5)
ImagePackageCatalogers/rust-cataloger-2                   435µs ± 2%     477µs ± 3%   +9.59%  (p=0.008 n=5+5)

name                                                   old alloc/op   new alloc/op   delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2           145kB ± 0%     146kB ± 0%   +0.78%  (p=0.008 n=5+5)
ImagePackageCatalogers/python-package-cataloger-2         718kB ± 0%     755kB ± 0%   +5.16%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2     118kB ± 0%     118kB ± 0%   -0.13%  (p=0.032 n=5+5)
ImagePackageCatalogers/dpkgdb-cataloger-2                 132kB ± 0%     132kB ± 0%   -0.09%  (p=0.008 n=5+5)
ImagePackageCatalogers/rpmdb-cataloger-2                  140kB ± 0%     140kB ± 0%     ~     (p=0.159 n=5+5)
ImagePackageCatalogers/java-cataloger-2                  2.70MB ± 0%    2.74MB ± 0%   +1.57%  (p=0.008 n=5+5)
ImagePackageCatalogers/apkdb-cataloger-2                 1.17MB ± 0%    1.18MB ± 0%   +0.37%  (p=0.008 n=5+5)
ImagePackageCatalogers/go-cataloger-2                    55.0kB ± 0%    55.0kB ± 0%     ~     (p=0.690 n=5+5)
ImagePackageCatalogers/rust-cataloger-2                   121kB ± 0%     123kB ± 0%   +2.13%  (p=0.016 n=4+5)

name                                                   old allocs/op  new allocs/op  delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2           2.34k ± 0%     2.41k ± 0%   +2.94%  (p=0.008 n=5+5)
ImagePackageCatalogers/python-package-cataloger-2         8.14k ± 0%     9.58k ± 0%  +17.70%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2     1.99k ± 0%     1.99k ± 0%     ~     (all equal)
ImagePackageCatalogers/dpkgdb-cataloger-2                 2.54k ± 0%     2.54k ± 0%     ~     (all equal)
ImagePackageCatalogers/rpmdb-cataloger-2                  3.25k ± 0%     3.25k ± 0%     ~     (all equal)
ImagePackageCatalogers/java-cataloger-2                   36.1k ± 0%     37.5k ± 0%   +3.91%  (p=0.016 n=4+5)
ImagePackageCatalogers/apkdb-cataloger-2                  2.28k ± 0%     2.48k ± 0%   +8.78%  (p=0.016 n=4+5)
ImagePackageCatalogers/go-cataloger-2                     1.46k ± 0%     1.46k ± 0%     ~     (all equal)
ImagePackageCatalogers/rust-cataloger-2                   3.10k ± 0%     3.21k ± 0%   +3.75%  (p=0.016 n=5+4)

@spiffcs spiffcs linked an issue Sep 30, 2021 that may be closed by this pull request
@spiffcs spiffcs added the bug Something isn't working label Sep 30, 2021

@luhring luhring left a comment

Question: If the objective of this PR is to ensure that SPDX output is consistently sorted on the whole, is there a test we can introduce to make that assertion, beyond CPEs?


func dashIndex(cpe wfn.Attributes) int {
	count := 0
	dash := "-"
Contributor

nit: This could be a const since we never need to assign a new value to it.

Contributor Author

I need to rename the PR. The bug was about sorted SPDX output, but that was not actually the issue. The issue was that the diffs found when running the same image multiple times were located in the cpes array.

Contributor Author

Updated to: stable sorted CPE array (JSON and SPDX)

@spiffcs spiffcs changed the title 497 sorted spdx output 497 sorted CPE array (JSON and SPDX) Sep 30, 2021
@spiffcs spiffcs changed the title 497 sorted CPE array (JSON and SPDX) 497 stable sorted CPE array (JSON and SPDX) Sep 30, 2021

spiffcs commented Sep 30, 2021

Question: If the objective of this PR is to ensure that SPDX output is consistently sorted on the whole, is there a test we can introduce to make that assertion, beyond CPEs?

Yea let me also see if I can add an integration level test that asserts there are no regressions on this front as far as producing a consistent document across multiple runs.

return iScore > jScore
}

func dashIndex(cpe wfn.Attributes) int {
Contributor

A comment here as to why this is here would go a long way --It isn't obvious that this is here purely to stabilize the sort

Contributor Author

@spiffcs spiffcs Sep 30, 2021

comment added here. I'm also open to talk about this being a pretty fragile assumption. It was a patch that was very targeted at the alpine example I listed in the PR. I'm still running some other images to see where/how this might break and if I can find a larger and more general solution.

Contributor

agreed --it's always worth a little extra time up front to attempt and avoid solving the same problem twice!

Contributor

I'm just mostly confused why are we not just text sorting? We could sort based on each field split based on : or somesuch

Contributor Author

yea there was some conversation on using

return c[i].BindToFmtString() > c[j].BindToFmtString()

I'll check that rather than sorting based on random -

Signed-off-by: Christopher Angelo Phillips <[email protected]>
@spiffcs spiffcs force-pushed the 497-sorted-spdx-output branch from 3ce3b28 to 8ae8026 Compare September 30, 2021 20:36

spiffcs commented Sep 30, 2021

@kzantow I added a text sort as the fallback for when both the iScore and the fieldLength are equal. The priority order is:

weightedScore > (prioritize higher weight to front)
length > (prioritize longer length to front)
text < (reverse here so that - shows up in front of _)

spiffcs commented Sep 30, 2021

With the new stable sort, here is a screenshot that shows spdx-json is now stable for alpine:latest:
[Screenshot: Screen Shot 2021-09-30 at 5 21 40 PM]

spiffcs commented Sep 30, 2021

@anchore/tools There are two places I found as candidates for regression test additions for this PR.

The first would be in the snapshot testing we have under the presenter package. Here we can add a more complex combination of CPEs to the catalog so that every time the presenter generates output we can compare it to the old snapshot and confirm 👍 the array did not change.

func TestJSONImagePresenter(t *testing.T) {
	testImage := "image-simple"
	catalog, metadata, dist := presenterImageInput(t, testImage)
	assertPresenterAgainstGoldenImageSnapshot(t,
		NewJSONPresenter(catalog, metadata, dist, source.SquashedScope),
		testImage,
		*updateJSONGoldenFiles,
	)
}

CPEs: []pkg.CPE{
	must(pkg.NewCPE("cpe:2.3:*:some:package:1:*:*:*:*:*:*:*")),
},

The other option is to write a new integration test under test/integration that generates two SBOMs using alpine:latest, filters out the values that are unique to each run (Timestamp, ID, etc.), and then compares them.

I'm happy to add either, just wanted your thoughts going into Friday on if one was good enough or if we want both levels checked.

luhring commented Oct 1, 2021

There are two places I found as candidates for regression test additions for this PR. [...]

Great ideas. I'm liking the second route a bit more at the moment, but I could be convinced of either direction. My initial thinking is if we can avoid the fragility that sometimes comes with snapshot testing, these tests would be more reliable and thus more trusted by the team.

luhring commented Oct 1, 2021

Re: the PR's purpose...

I need to rename the PR. The bug was about sorted SPDX output, but that was not actually the issue. The issue was that the diffs found when running the same image multiple times were located in the cpes array.

My thinking is this: Right now, merging this PR will cause #497 to be closed. 497 is expressing that the SPDX output's sorting should be stable in general (which I agree with). So if we want to consider 497 done, I think we should have test(s) that ensure 497 is fully addressed. If we don't want to tackle all of 497 in this particular PR, I'm cool with that, too! In that case, 497 would still be unfinished (IMHO) upon merging this PR.

Curious for your latest thinking!

spiffcs commented Oct 1, 2021

. So if we want to consider 497 done, I think we should have test(s) that ensure 497 is fully addressed. If we don't want to tackle all of 497 in this particular PR, I'm cool with that, too! In that case, 497 would still be unfinished (IMHO) upon merging this PR.

Curious for your latest thinking!

If we can get an integration test added that shows two runs of -o spdx-json are consistently sorted, then I think we can close #497. I've run this fix against a couple of images and have not seen diffs that show unstable sorts, barring the CPE issue.

wagoodman commented Oct 1, 2021

re: snapshot testing: I agree with @luhring on this one --snapshot testing is great for change detection and seeing what changed, but is not great at communicating why something changed. Leaning more on unit tests for depth in cases and specific business logic verifications is the way to go here.

If we have a concern about how things are wired up, maybe an integration or CLI level test would be useful here --generate an SBOM for the same input a few times and ensure the output is stable as a whole (like you suggested @spiffcs ). Though it doesn't need to be limited to spdx-json, but testing all SBOM outputs (for each output format ... for 1-5... run syft then compare all results within each format). This isn't required for this PR, but could also be helpful to make certain we aren't regressing in SBOM reproducibility.

spiffcs commented Oct 1, 2021

If we have a concern about how things are wired up, maybe an integration or CLI level test would be useful here --generate an SBOM for the same input a few times and ensure the output is stable as a whole (like you suggested @spiffcs ). Though it doesn't need to be limited to spdx-json, but testing all SBOM outputs (for each output format ... for 1-5... run syft then compare all results within each format). This isn't required for this PR, but could also be helpful to make certain we aren't regressing in SBOM reproducibility.

I think if we add this integration test in this PR we can call #497 closed, since we addressed the root of the problem and added a regression test. I'll write it up today.

Comment on lines 86 to 91
mustCPE("cpe:2.3:a:alpine-keys:alpine_keys:2.3-r1:*:*:*:*:*:*:*"),
mustCPE("cpe:2.3:a:alpine-keys:alpine-keys:2.3-r1:*:*:*:*:*:*:*"),
mustCPE("cpe:2.3:a:alpine_keys:alpine_keys:2.3-r1:*:*:*:*:*:*:*"),
mustCPE("cpe:2.3:a:alpine_keys:alpine-keys:2.3-r1:*:*:*:*:*:*:*"),
mustCPE("cpe:2.3:a:alpine:alpine_keys:2.3-r1:*:*:*:*:*:*:*"),
mustCPE("cpe:2.3:a:alpine:alpine-keys:2.3-r1:*:*:*:*:*:*:*"),
Contributor

maybe make this obviously more out of order?

Signed-off-by: Christopher Angelo Phillips <[email protected]>

@kzantow kzantow left a comment

To me, this looks good -- super simple and I think it should work for solving the original issue 👍

@spiffcs spiffcs merged commit 5e4b668 into main Oct 1, 2021
@spiffcs spiffcs deleted the 497-sorted-spdx-output branch October 1, 2021 19:31
GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024
* add small sorting change to our specificity

Signed-off-by: Christopher Angelo Phillips <[email protected]>

Development

Successfully merging this pull request may close these issues.

SPDX output is not consistently sorted

4 participants