Track Java "shaded"/uberjarred/jarjared hidden deps vulnerabilities #1266

Open
pombredanne opened this issue Aug 11, 2023 · 3 comments

@pombredanne
Member

See:

@jensdietrich I read "Those projects were detected with a research tool our team has developed" ... is this open source?

See also https://github.com/ctcpip/java-shaded-example as an example by @ctcpip

@jensdietrich

@pombredanne thanks for reaching out -- the software is not yet open source but will be once our paper has been published. There is a preview here: https://arxiv.org/abs/2306.05534 . Some of the vulnerabilities found have led to changes in the GHSA, such as github/advisory-database#2258 (has references to a few more).

I am pretty open-minded about making the tool available (before officially open-sourcing it, or perhaps even fast-tracking this) and working with your project. It would be useful for us (there are two more team members) if you could describe how you would integrate and use this.

@pombredanne
Member Author

@jensdietrich And thank you for taking the time to reply! At a high level, we are providing open source SCA solutions backed by open data.

There are two sides to the problem of shaded JARs: detecting that a JAR shades other packages and which ones (and, more generally, detecting when a package embeds other packages, beyond JARs), and reporting the corresponding vulnerabilities, if any.

  1. For the detection side, we want to be able to detect shaded JARs in ScanCode, using the PurlDB index as needed. One approach is to map the binaries of a JAR to their corresponding source code, as we do in this pipeline: https://github.com/nexB/scancode.io/blob/main/scanpipe/pipelines/deploy_to_develop.py . The parts that are not mapped to the corresponding sources are therefore from another origin that can be matched in the PurlDB. Another approach is matching to find embedded code and, when this involves rewritten bytecode, using various "features" that we could extract from it, such as symbols, code graphs, or a decompilation, to abstract away the class paths. Or your approach!

  2. For the vulnerability side, this would likely be resolved with a VulnerableCode improver that would use the results of 1. above, matching or looking up by PURL in PurlDB from VulnerableCode, and would update the VulnerableCode DB so that the embedding package carries the vulnerabilities affecting its embedded package(s).

So 1. would be integrated in ScanCode, ScanCode.io and PurlDB, and 2. in VulnerableCode.
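To make side 2 concrete, here is a minimal sketch of the propagation step in plain Python. It does not use the actual VulnerableCode improver API or the PurlDB client; the function name, the data shapes, the PURLs and the `VULN-0001` identifier are all hypothetical placeholders. It assumes step 1 has already produced a mapping from each embedding package to the packages it embeds, and it only handles direct embedding (no transitive closure).

```python
from collections import defaultdict


def propagate_vulnerabilities(embeds, known_vulns):
    """Return {embedding PURL: set of vulnerability ids inherited from embedded PURLs}.

    embeds:      {embedding PURL: iterable of embedded PURLs}  (hypothetical output of step 1)
    known_vulns: {PURL: iterable of vulnerability ids}         (hypothetical lookup result)
    """
    inherited = defaultdict(set)
    for embedding_purl, embedded_purls in embeds.items():
        for embedded_purl in embedded_purls:
            # An embedded package with no known vulnerabilities contributes nothing.
            inherited[embedding_purl].update(known_vulns.get(embedded_purl, ()))
    # Only report embedding packages that actually inherit something.
    return {purl: vulns for purl, vulns in inherited.items() if vulns}


# Hypothetical data: an uberjar that shades one vulnerable and one clean library.
embeds = {
    "pkg:maven/org.example/app-uber@1.0.0": [
        "pkg:maven/org.example/vulnerable-lib@2.3.0",
        "pkg:maven/org.example/clean-lib@1.1.0",
    ],
    "pkg:maven/org.example/other-app@0.9.0": [
        "pkg:maven/org.example/clean-lib@1.1.0",
    ],
}
known_vulns = {"pkg:maven/org.example/vulnerable-lib@2.3.0": ["VULN-0001"]}

result = propagate_vulnerabilities(embeds, known_vulns)
```

A real improver would also need to record why the embedding package is affected (which embedded PURL, detected how), so that the entry can be reviewed or withdrawn if the detection turns out to be a false positive.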

@jensdietrich

@pombredanne Thanks for the clarification. In our approach, we associate bytecode with sources using the Maven REST API. This enables us to perform matching on the source level, using a custom AST analysis (that ignores package names, as they usually change during shading). But to precisely detect vulnerabilities (i.e. avoid false positives) we rely on a proof-of-vulnerability (POV) project for each CVE that makes it testable. Those projects then get instantiated for clones, and the respective tests are executed to check whether the vulnerability is still present. There are a few POVs we collected here for evaluation purposes: https://github.com/jensdietrich/xshady/. POVs can often be created from patches, but the process cannot be completely automated IMO and may therefore not scale sufficiently for you. Also, our approach does miss some clones; for instance, we limit the number of REST queries for performance reasons.
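As a rough illustration of why ignoring package names helps when matching shaded code, here is a minimal Python sketch. It is not the AST analysis described above: it is a much cruder textual stand-in that drops `package` and `import` declarations and collapses whitespace before hashing a Java source file. The function name and the sample sources are hypothetical, and fully qualified names used inline in the code body would still defeat it.

```python
import hashlib
import re

# Matches whole-line `package ...;` and `import ...;` declarations,
# which are the parts of a source file that shading typically rewrites.
_PACKAGE_OR_IMPORT = re.compile(r"^\s*(package|import)\s+[^;]+;\s*$", re.MULTILINE)


def fingerprint(java_source: str) -> str:
    """Hash a Java source file while ignoring its package and import declarations."""
    stripped = _PACKAGE_OR_IMPORT.sub("", java_source)
    # Collapse all whitespace runs so formatting differences do not matter.
    normalized = " ".join(stripped.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


BODY = "public class Util {\n    static int twice(int x) { return 2 * x; }\n}\n"

# Same class, before and after a hypothetical relocation of its package.
original = "package org.example.lib;\n\nimport java.util.List;\n\n" + BODY
shaded = "package com.acme.shaded.org.example.lib;\n\nimport java.util.List;\n\n" + BODY
patched = original.replace("2 * x", "x + x")

same_after_shading = fingerprint(original) == fingerprint(shaded)
same_after_patch = fingerprint(original) == fingerprint(patched)
```

Here `same_after_shading` is true and `same_after_patch` is false: the relocation alone does not change the fingerprint, while an edit to the method body does. A match of this kind only suggests a clone; confirming that the clone is still vulnerable is what the POV tests are for.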

We are also working on matching binaries -- a dual setup with an old-fashioned engineered solution, and a dataset that might be suitable to train a classifier to match binaries. Focus atm is on the dataset. Work is in progress, we hope to have something by the end of the year. Being an academic and teaching makes things slow. Our main use case here is not vulnerability detection, but reproducible builds.

I read that NLnet sponsors some projects in this space, perhaps we could apply for this to hire students to help with this.

Re making our tool available -- I will discuss this with my collaborators. It will take a few days though.
