SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
UIUC, Purdue University
Paper | Leaderboard | Dataset
SEC-bench is a comprehensive benchmarking framework designed to evaluate Large Language Model (LLM) agents on real-world software security tasks. It provides automated tools for collecting vulnerability data, building reproducible vulnerability instances, and evaluating agent performance on security-related tasks.
- Automated Benchmark Generation: Builds benchmarks automatically from the OSV database and CVE records using a multi-agent system
- Containerized Environments: Docker-based reproducible vulnerability instances
- Agent-oriented Evaluation: Evaluates agents on critical software security tasks (SWE-agent, OpenHands, and Aider are supported)
- Comprehensive Security Assessment: Assesses both PoC generation and vulnerability patching, with extensibility to other tasks (e.g., fuzzing, static analysis)
- Rich Reporting: Detailed progress tracking and result visualization with rich terminal output
- Python: 3.12 or higher
- Docker: Latest version with sufficient disk space (>200GB recommended)
- Git: For repository cloning and submodule management
- Conda: For environment management (recommended)
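Before installing, a quick way to confirm the toolchain is in place (a convenience check, not part of SEC-bench itself):
# Verify the prerequisites listed above
python --version   # expect 3.12 or higher
docker --version
git --version
conda --version
df -h /var/lib/docker   # free space at Docker's default data directory on Linux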
git clone --recurse-submodules https://github.com/SEC-bench/SEC-bench.git
cd SEC-bench
conda create -n secb python=3.12
conda activate secb
pip install -r requirements.txt
Configure the following environment variables in your shell profile or .env file:
# Required API tokens
export GITHUB_TOKEN=<your_github_token>
export GITLAB_TOKEN=<your_gitlab_token>
export OPENAI_API_KEY=<your_openai_api_key>
export ANTHROPIC_API_KEY=<your_anthropic_api_key>
# Hugging Face configuration
export HF_TOKEN_PATH=$HOME/.cache/hf_hub_token
export HF_HOME=<path/to/huggingface>
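HF_TOKEN_PATH follows the huggingface_hub convention of pointing at a file that contains your access token; one way to create that file (path and permissions here are only a suggestion):
# Store the Hugging Face token where HF_TOKEN_PATH points
echo "<your_hf_token>" > $HOME/.cache/hf_hub_token
chmod 600 $HOME/.cache/hf_hub_token   # keep the token private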
The data collection process involves three main steps: seed generation, report extraction, and project configuration.
Extract metadata from OSV database files:
python -m secb.preprocessor.seed \
--input-dir [OSV_DIR] \
--output-file [SEED_OUTPUT_FILE_PATH] \
--verbose
Extract bug reports from reference URLs:
python -m secb.preprocessor.report \
--input-file [SEED_OUTPUT_FILE_PATH] \
--output-file [REPORT_OUTPUT_FILE_PATH] \
--reports-dir [REPORTS_DIR] \
--lang [LANGUAGE] \
--type [TYPE] \
--whitelist [WHITELIST_PROJECTS] \
--blacklist [BLACKLIST_PROJECTS] \
--oss-fuzz
Generate project configurations for vulnerability reproduction:
python -m secb.preprocessor.project \
--input-file [REPORT_OUTPUT_FILE_PATH] \
--output-file [PROJECT_OUTPUT_FILE_PATH] \
--tracking-file [TRACKING_FILE_PATH] \
--verbose
Use the provided script for streamlined processing:
./run_preprocessor.sh <mode> [options]
Available modes:
- seed: Parse CVE/OSV files and extract relevant information
- report: Extract bug descriptions from reference URLs
- project: Generate project configurations for reproducing vulnerabilities
Example workflows:
Note
Download the OSV database and place it in the output/osv directory; one way to fetch it is sketched below.
The following example workflows are for C/C++ vulnerabilities. For other languages, specify the language and type explicitly.
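A minimal fetch, assuming you use OSV's public all.zip export (any mirror works as long as the JSON files end up under output/osv):
# Download and unpack the OSV data dump
mkdir -p ./output/osv
curl -L -O https://osv-vulnerabilities.storage.googleapis.com/all.zip
unzip -q all.zip -d ./output/osv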
# Basic seed generation
./run_preprocessor.sh seed --input-dir ./output/osv --output-file ./output/seed.jsonl
# Filter for C/C++ CVEs in OSS-Fuzz projects
./run_preprocessor.sh report \
--input-file ./output/seed.jsonl \
--type CVE \
--oss-fuzz \
--lang C,C++
# Generate minimal project configurations
./run_preprocessor.sh project \
--input-file ./output/report-cve-oss-c-cpp.jsonl \
--sanitizer-only \
--minimal
Create foundational Docker images:
python -m secb.preprocessor.build_base_images
Create specific vulnerability instances:
# Build specific instance
python -m secb.preprocessor.build_instance_images \
--input-file [PROJECT_OUTPUT_FILE] \
--ids [INSTANCE_IDS]
# Example: Build OpenJPEG CVE instance
python -m secb.preprocessor.build_instance_images \
--input-file ./output/project-cve-oss-c-cpp-sanitizer-minimal.jsonl \
--ids openjpeg.cve-2024-56827
# Example: Build all GPAC CVE instances
python -m secb.preprocessor.build_instance_images \
--input-file ./output/project-cve-oss-c-cpp-sanitizer-minimal.jsonl \
--filter gpac.cve
Verify built instances using the SecVerifier repository. This step ensures that vulnerability instances are correctly configured and reproducible.
Access verified evaluation images from Docker Hub:
docker pull hwiwonlee/secb.eval.x86_64.[instance_name]
Or build evaluation instances from locally verified instances:
python -m secb.evaluator.build_eval_instances \
--input-dir [VERIFIED_INSTANCE_DIR]
Run the evaluation on agent outputs:
python -m secb.evaluator.eval_instances \
--input-dir [AGENT_OUTPUT_DIR] \
--type [TYPE] \
--split [SPLIT] \
--agent [AGENT] \
--num-workers [NUM_WORKERS] \
--output-dir [OUTPUT_DIR]
Parameters:
- type: Evaluation type (patch or poc)
- split: Dataset split to evaluate
- agent: Agent type (swea, oh, aider)
- num-workers: Number of parallel workers
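For instance, a patch evaluation over SWE-agent outputs might look like this (paths, split name, and worker count are illustrative, not fixed values):
# Example: evaluate SWE-agent patch attempts with 4 parallel workers
python -m secb.evaluator.eval_instances \
--input-dir ./output/agent-runs \
--type patch \
--split test \
--agent swea \
--num-workers 4 \
--output-dir ./output/eval-results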
View patch evaluation results:
python -m secb.evaluator.view_patch_results \
--agent [AGENT] \
--input-dir [EVALUATION_OUTPUT_DIR]
View PoC evaluation results:
python -m secb.evaluator.view_poc_results \
--agent [AGENT] \
--input-dir [EVALUATION_OUTPUT_DIR]
Note
In the Docker evaluation images, a secb harness is available with subcommands such as build, repro, and patch.
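Inside a container, a session might look like the following sketch (the subcommand names come from the note above; exact flags and output are not guaranteed):
# Open a shell in an evaluation image, then exercise the harness
docker run -it --rm hwiwonlee/secb.eval.x86_64.mruby.cve-2022-0240 bash
secb build   # build the vulnerable target
secb repro   # attempt to reproduce the vulnerability
secb patch   # apply and test a candidate patch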
SEC-bench uses a hierarchical Docker image structure:
- Base Images: hwiwonlee/secb.base:* - Foundation images with build tools
- Instance Images: hwiwonlee/secb.x86_64.* - Vulnerability-specific environments
- Evaluation Images: hwiwonlee/secb.eval.x86_64.* - Verified evaluation instances
Evaluation images follow the naming convention:
hwiwonlee/secb.eval.x86_64.[project].[vulnerability_id]
Example: hwiwonlee/secb.eval.x86_64.mruby.cve-2022-0240
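As a quick sanity check that an image is usable locally (using the example instance above; the wildcard listing assumes a Docker CLI that supports repository globs):
# Pull the example instance and list all local SEC-bench images
docker pull hwiwonlee/secb.eval.x86_64.mruby.cve-2022-0240
docker images "hwiwonlee/secb*"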
If you use SEC-bench in your research, please cite our paper:
@article{lee2025sec,
title={SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks},
author={Lee, Hwiwon and Zhang, Ziqi and Lu, Hanxiao and Zhang, Lingming},
journal={arXiv preprint arXiv:2506.11791},
year={2025}
}