497 stable sorted CPE array (JSON and SPDX) #522

spiffcs · 2021-09-30T17:15:01Z

Attempts to Solve #497

To reproduce the current issue, run:

syft -o json alpine:latest > /tmp/alpine.json:latest
syft -o json alpine:latest > /tmp/alpine.json:latest:2
diff -u /tmp/alpine.json:latest:2 /tmp/alpine.json:latest

This will produce a diff where the cpes field for some packages is sorted differently from run to run.

When countFieldLength is equal for two adjacent CPEs in a given list, our sort function does not define an order between them, so cases like the ones below can come out in a non-deterministic order:

cpe:2.3:a:alpine-keys:alpine_keys:2.3-r1:*:*:*:*:*:*:*
cpe:2.3:a:alpine-keys:alpine-keys:2.3-r1:*:*:*:*:*:*:*
cpe:2.3:a:alpine_keys:alpine_keys:2.3-r1:*:*:*:*:*:*:*
cpe:2.3:a:alpine_keys:alpine-keys:2.3-r1:*:*:*:*:*:*:*
cpe:2.3:a:alpine:alpine_keys:2.3-r1:*:*:*:*:*:*:*
cpe:2.3:a:alpine:alpine-keys:2.3-r1:*:*:*:*:*:*:*

// If both Less(i, j) and Less(j, i) are false,
// then the elements at index i and j are considered equal.
// Sort may place equal elements in any order in the final result,
// while Stable preserves the original input order of equal elements.

If the change below is too specific or fragile, I can explore using sort.Stable to see if that at least keeps us consistent across runs.

The array order coming into the sort is always consistent, but the diff between runs appears when Less(i, j) and Less(j, i) are both false.

This PR assigns a priority based on the "-" character in the Vendor and Product fields.
Given the above example, the ideal sorted output would be:

xxx-xxx-xxx
xxx-xxx_xxx
xxx_xxx-xxx
xxx_xxx_xxx
xx-xx
xx_xx

Currently I'm trying to find images which break the assumptions found in this code.

Signed-off-by: Christopher Angelo Phillips <[email protected]>

github-actions · 2021-09-30T17:18:10Z

Benchmark Test Results

Benchmark results from the latest changes vs base branch

name                                                   old time/op    new time/op    delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2          1.05ms ± 5%    1.05ms ± 1%     ~     (p=0.690 n=5+5)
ImagePackageCatalogers/python-package-cataloger-2        1.61ms ± 5%    1.85ms ± 6%  +14.98%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2     497µs ± 4%     512µs ± 2%     ~     (p=0.095 n=5+5)
ImagePackageCatalogers/dpkgdb-cataloger-2                 489µs ± 2%     499µs ± 2%     ~     (p=0.151 n=5+5)
ImagePackageCatalogers/rpmdb-cataloger-2                  498µs ± 2%     515µs ± 4%     ~     (p=0.056 n=5+5)
ImagePackageCatalogers/java-cataloger-2                  10.2ms ± 3%    11.0ms ± 2%   +8.25%  (p=0.008 n=5+5)
ImagePackageCatalogers/apkdb-cataloger-2                  785µs ± 4%     839µs ± 4%   +6.81%  (p=0.016 n=5+5)
ImagePackageCatalogers/go-cataloger-2                     254µs ± 3%     262µs ± 2%     ~     (p=0.095 n=5+5)
ImagePackageCatalogers/rust-cataloger-2                   435µs ± 2%     477µs ± 3%   +9.59%  (p=0.008 n=5+5)

name                                                   old alloc/op   new alloc/op   delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2           145kB ± 0%     146kB ± 0%   +0.78%  (p=0.008 n=5+5)
ImagePackageCatalogers/python-package-cataloger-2         718kB ± 0%     755kB ± 0%   +5.16%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2     118kB ± 0%     118kB ± 0%   -0.13%  (p=0.032 n=5+5)
ImagePackageCatalogers/dpkgdb-cataloger-2                 132kB ± 0%     132kB ± 0%   -0.09%  (p=0.008 n=5+5)
ImagePackageCatalogers/rpmdb-cataloger-2                  140kB ± 0%     140kB ± 0%     ~     (p=0.159 n=5+5)
ImagePackageCatalogers/java-cataloger-2                  2.70MB ± 0%    2.74MB ± 0%   +1.57%  (p=0.008 n=5+5)
ImagePackageCatalogers/apkdb-cataloger-2                 1.17MB ± 0%    1.18MB ± 0%   +0.37%  (p=0.008 n=5+5)
ImagePackageCatalogers/go-cataloger-2                    55.0kB ± 0%    55.0kB ± 0%     ~     (p=0.690 n=5+5)
ImagePackageCatalogers/rust-cataloger-2                   121kB ± 0%     123kB ± 0%   +2.13%  (p=0.016 n=4+5)

name                                                   old allocs/op  new allocs/op  delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2           2.34k ± 0%     2.41k ± 0%   +2.94%  (p=0.008 n=5+5)
ImagePackageCatalogers/python-package-cataloger-2         8.14k ± 0%     9.58k ± 0%  +17.70%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2     1.99k ± 0%     1.99k ± 0%     ~     (all equal)
ImagePackageCatalogers/dpkgdb-cataloger-2                 2.54k ± 0%     2.54k ± 0%     ~     (all equal)
ImagePackageCatalogers/rpmdb-cataloger-2                  3.25k ± 0%     3.25k ± 0%     ~     (all equal)
ImagePackageCatalogers/java-cataloger-2                   36.1k ± 0%     37.5k ± 0%   +3.91%  (p=0.016 n=4+5)
ImagePackageCatalogers/apkdb-cataloger-2                  2.28k ± 0%     2.48k ± 0%   +8.78%  (p=0.016 n=4+5)
ImagePackageCatalogers/go-cataloger-2                     1.46k ± 0%     1.46k ± 0%     ~     (all equal)
ImagePackageCatalogers/rust-cataloger-2                   3.10k ± 0%     3.21k ± 0%   +3.75%  (p=0.016 n=5+4)

luhring

Question: If the objective of this PR is to ensure that SPDX output is consistently sorted on the whole, is there a test we can introduce to make that assertion, beyond CPEs?

luhring · 2021-09-30T18:31:39Z

syft/pkg/cataloger/common/cpe/sort_by_specificity.go


+func dashIndex(cpe wfn.Attributes) int {
+	count := 0
+	dash := "-"


nit: This could be a const since we never need to assign a new value to it.

I need to rename the PR. The bug was sorted spdx output, but that was not actually the issue. The issue was that the diffs found when running multiple of the same image were located in the cpes array.

Updated to: stable sorted CPE array (JSON and SPDX)

spiffcs · 2021-09-30T18:41:23Z

Question: If the objective of this PR is to ensure that SPDX output is consistently sorted on the whole, is there a test we can introduce to make that assertion, beyond CPEs?

Yea let me also see if I can add an integration level test that asserts there are no regressions on this front as far as producing a consistent document across multiple runs.

wagoodman · 2021-09-30T18:58:45Z

syft/pkg/cataloger/common/cpe/sort_by_specificity.go

 	return iScore > jScore
 }

+func dashIndex(cpe wfn.Attributes) int {


A comment here as to why this is here would go a long way --It isn't obvious that this is here purely to stabilize the sort

comment added here. I'm also open to talk about this being a pretty fragile assumption. It was a patch that was very targeted at the alpine example I listed in the PR. I'm still running some other images to see where/how this might break and if I can find a larger and more general solution.

agreed --it's always worth a little extra time up front to attempt and avoid solving the same problem twice!

I'm mostly just confused about why we are not text sorting. We could sort on each field, splitting on : or some such.

yea there was some conversation on using

return c[i].BindToFmtString() > c[j].BindToFmtString()

I'll check that rather than sorting based on an arbitrary "-" character.
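For reference, a rough sketch of the "split on :" idea, comparing two formatted CPE strings field by field. lessByFields is a hypothetical helper, not syft code, and it assumes no field contains an escaped colon. It sorts ascending, whereas the one-liner above uses > on the whole formatted string.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// lessByFields compares two CPE 2.3 formatted strings one field at a time.
// Assumption: fields never contain an escaped colon (`\:`); real CPE
// handling would have to account for that.
func lessByFields(a, b string) bool {
	af, bf := strings.Split(a, ":"), strings.Split(b, ":")
	for k := 0; k < len(af) && k < len(bf); k++ {
		if af[k] != bf[k] {
			return af[k] < bf[k]
		}
	}
	// All shared fields are equal: the string with fewer fields goes first.
	return len(af) < len(bf)
}

func main() {
	cpes := []string{
		"cpe:2.3:a:alpine_keys:alpine-keys:2.3-r1:*:*:*:*:*:*:*",
		"cpe:2.3:a:alpine:alpine_keys:2.3-r1:*:*:*:*:*:*:*",
		"cpe:2.3:a:alpine-keys:alpine_keys:2.3-r1:*:*:*:*:*:*:*",
		"cpe:2.3:a:alpine:alpine-keys:2.3-r1:*:*:*:*:*:*:*",
	}
	sort.Slice(cpes, func(i, j int) bool { return lessByFields(cpes[i], cpes[j]) })
	fmt.Println(strings.Join(cpes, "\n"))
}
```

One difference worth knowing: comparing field by field puts vendor "alpine" ahead of "alpine-keys", while comparing the whole string does not, because ":" (0x3A) sorts after "-" (0x2D) and before "_" (0x5F).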

Signed-off-by: Christopher Angelo Phillips <[email protected]>

spiffcs · 2021-09-30T21:18:20Z

@kzantow I added a text sort as the fallback for when iScore and fieldLength are both equal. The comparisons are applied in this priority order:

weightedScore > (prioritize higher weight to front)
length > (prioritize longer length to front)
text < (reverse here so that - shows up in front of _)

spiffcs · 2021-09-30T21:22:27Z

With the new stable sort, here is a screenshot that shows spdx-json is now stable for alpine:latest

spiffcs · 2021-09-30T21:59:21Z

@anchore/tools There are two places I found as candidates for regression test additions for this PR.

The first would be in the snapshot testing we have under the presenter package. Here we can add a more complex combination of CPEs to the catalog so that every time the presenter generates output we can compare it to the old snapshot and say 👍 the array did not change.

syft/internal/presenter/packages/json_presenter_test.go

Lines 29 to 37 in 6480f06

    
func TestJSONImagePresenter(t *testing.T) {
	testImage := "image-simple"
	catalog, metadata, dist := presenterImageInput(t, testImage)
	assertPresenterAgainstGoldenImageSnapshot(t,
		NewJSONPresenter(catalog, metadata, dist, source.SquashedScope),
		testImage,
		*updateJSONGoldenFiles,
	)
}

syft/internal/presenter/packages/utils_test.go

Lines 106 to 108 in 6480f06

    
CPEs: []pkg.CPE{
	must(pkg.NewCPE("cpe:2.3:*:some:package:1:*:*:*:*:*:*:*")),
},

The other is to write a new integration test under test/integration that generates two SBOMs from alpine:latest, filters out the values that are unique to each run (Timestamp, ID, etc.), and then compares what is left.
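A minimal sketch of the filter-and-compare step, assuming the SBOMs are JSON documents. The helper names and the "timestamp" key are hypothetical; the real documents may name their per-run fields differently, and the sketch does not invoke syft.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// stripKeys removes the named keys at every nesting level of a decoded
// JSON value and returns the same (modified) value.
func stripKeys(v interface{}, volatile map[string]bool) interface{} {
	switch t := v.(type) {
	case map[string]interface{}:
		for k, child := range t {
			if volatile[k] {
				delete(t, k)
				continue
			}
			t[k] = stripKeys(child, volatile)
		}
	case []interface{}:
		for i, child := range t {
			t[i] = stripKeys(child, volatile)
		}
	}
	return v
}

// equalIgnoring reports whether two JSON documents are equal once the
// given per-run keys are removed. Array order still matters, object key
// order does not.
func equalIgnoring(a, b []byte, keys ...string) (bool, error) {
	volatile := make(map[string]bool, len(keys))
	for _, k := range keys {
		volatile[k] = true
	}
	var av, bv interface{}
	if err := json.Unmarshal(a, &av); err != nil {
		return false, err
	}
	if err := json.Unmarshal(b, &bv); err != nil {
		return false, err
	}
	return reflect.DeepEqual(stripKeys(av, volatile), stripKeys(bv, volatile)), nil
}

func main() {
	// Hand-written stand-ins for two runs; not real syft output.
	run1 := []byte(`{"timestamp":"t1","cpes":["a","b"]}`)
	run2 := []byte(`{"timestamp":"t2","cpes":["a","b"]}`)
	run3 := []byte(`{"timestamp":"t3","cpes":["b","a"]}`)

	same, _ := equalIgnoring(run1, run2, "timestamp")
	fmt.Println(same) // true: only the timestamp differs
	same, _ = equalIgnoring(run1, run3, "timestamp")
	fmt.Println(same) // false: the cpes array order differs
}
```

Comparing decoded values rather than raw bytes keeps the check sensitive to array order, which is the regression of interest here, while ignoring whitespace and object key order.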

I'm happy to add either, just wanted your thoughts going into Friday on if one was good enough or if we want both levels checked.

luhring · 2021-10-01T11:55:12Z

There are two places I found as candidates for regression test additions for this PR. [...]

Great ideas. I'm liking the second route a bit more at the moment, but I could be convinced of either direction. My initial thinking is if we can avoid the fragility that sometimes comes with snapshot testing, these tests would be more reliable and thus more trusted by the team.

luhring · 2021-10-01T11:58:33Z

Re: the PR's purpose...

I need to rename the PR. The bug was sorted spdx output, but that was not actually the issue. The issue was that the diffs found when running multiple of the same image were located in the cpes array.

My thinking is this: Right now, merging this PR will cause #497 to be closed. 497 is expressing that the SPDX output's sorting should be stable in general (which I agree with). So if we want to consider 497 done, I think we should have test(s) that ensure 497 is fully addressed. If we don't want to tackle all of 497 in this particular PR, I'm cool with that, too! In that case, 497 would still be unfinished (IMHO) upon merging this PR.

Curious for your latest thinking!

spiffcs · 2021-10-01T13:28:16Z

So if we want to consider 497 done, I think we should have test(s) that ensure 497 is fully addressed. If we don't want to tackle all of 497 in this particular PR, I'm cool with that, too! In that case, 497 would still be unfinished (IMHO) upon merging this PR.

Curious for your latest thinking!

If we can get an integration test added that shows two runs of -o spdx-json are consistently sorted, then I think we can close #497. I've run this fix against a couple of images and have not seen diffs that show unstable sorts, barring the CPE issue.

wagoodman · 2021-10-01T14:07:56Z

re: snapshot testing: I agree with @luhring on this one --snapshot testing is great for change detection and seeing what changed, but is not great at communicating why something changed. Leaning more on unit tests for depth in cases and specific business logic verifications is the way to go here.

If we have a concern about how things are wired up, maybe an integration or CLI level test would be useful here --generate an SBOM for the same input a few times and ensure the output is stable as a whole (like you suggested @spiffcs ). Though it doesn't need to be limited to spdx-json, but testing all SBOM outputs (for each output format ... for 1-5... run syft then compare all results within each format). This isn't required for this PR, but could also be helpful to make certain we aren't regressing in SBOM reproducibility.

spiffcs · 2021-10-01T14:18:34Z

If we have a concern about how things are wired up, maybe an integration or CLI level test would be useful here --generate an SBOM for the same input a few times and ensure the output is stable as a whole (like you suggested @spiffcs ). Though it doesn't need to be limited to spdx-json, but testing all SBOM outputs (for each output format ... for 1-5... run syft then compare all results within each format). This isn't required for this PR, but could also be helpful to make certain we aren't regressing in SBOM reproducibility.

I think if we add this integration test in this PR we can call #497 closed, since we addressed the root of the problem and added a regression test. I'll write it up today.

kzantow · 2021-10-01T15:12:00Z

syft/pkg/cataloger/common/cpe/sort_by_specificity_test.go

+				mustCPE("cpe:2.3:a:alpine-keys:alpine_keys:2.3-r1:*:*:*:*:*:*:*"),
+				mustCPE("cpe:2.3:a:alpine-keys:alpine-keys:2.3-r1:*:*:*:*:*:*:*"),
+				mustCPE("cpe:2.3:a:alpine_keys:alpine_keys:2.3-r1:*:*:*:*:*:*:*"),
+				mustCPE("cpe:2.3:a:alpine_keys:alpine-keys:2.3-r1:*:*:*:*:*:*:*"),
+				mustCPE("cpe:2.3:a:alpine:alpine_keys:2.3-r1:*:*:*:*:*:*:*"),
+				mustCPE("cpe:2.3:a:alpine:alpine-keys:2.3-r1:*:*:*:*:*:*:*"),


maybe make this obviously more out of order?

Signed-off-by: Christopher Angelo Phillips <[email protected]>

kzantow

To me, this looks good -- super simple and I think it should work for solving the original issue 👍

* add small sorting change to our specificity Signed-off-by: Christopher Angelo Phillips <[email protected]>

spiffcs added 3 commits September 30, 2021 13:02

add small sorting change to our specificity

b8d61a8

Signed-off-by: Christopher Angelo Phillips <[email protected]>

lint fix

df74202

Signed-off-by: Christopher Angelo Phillips <[email protected]>

update count

3d0d1e7

Signed-off-by: Christopher Angelo Phillips <[email protected]>

spiffcs linked an issue Sep 30, 2021 that may be closed by this pull request

SPDX output is not consistently sorted #497

Closed

spiffcs added the bug Something isn't working label Sep 30, 2021

luhring reviewed Sep 30, 2021

spiffcs changed the title ~~497 sorted spdx output~~ 497 sorted CPE array (JSON and SPDX) Sep 30, 2021

spiffcs changed the title ~~497 sorted CPE array (JSON and SPDX)~~ 497 stable sorted CPE array (JSON and SPDX) Sep 30, 2021

wagoodman reviewed Sep 30, 2021

update dashIndex with comment

8ae8026

Signed-off-by: Christopher Angelo Phillips <[email protected]>

spiffcs force-pushed the 497-sorted-spdx-output branch from 3ce3b28 to 8ae8026 Compare September 30, 2021 20:36

spiffcs added 2 commits September 30, 2021 17:11

text sort is better

b68bdd8

Signed-off-by: Christopher Angelo Phillips <[email protected]>

update nested logic

8a7b8cf

Signed-off-by: Christopher Angelo Phillips <[email protected]>

kzantow reviewed Oct 1, 2021

make out of order obvious

b93c9f3

Signed-off-by: Christopher Angelo Phillips <[email protected]>

kzantow approved these changes Oct 1, 2021

spiffcs merged commit 5e4b668 into main Oct 1, 2021

spiffcs deleted the 497-sorted-spdx-output branch October 1, 2021 19:31

GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024

497 stable sorted CPE array (JSON and SPDX) (anchore#522)

c134c95

* add small sorting change to our specificity Signed-off-by: Christopher Angelo Phillips <[email protected]>


4 participants
