
microsoft/SecRL

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

Paper | Hugging Face

🎉 News

  • [2025/10/14]: Check out our latest blog post!
  • [2025/10/05]: We updated the evaluation chart with Qwen-235B and Grok-4 (GPT-5 family is also updated)!

We present the first benchmark to test LLM-based agents on threat hunting in the form of security question-answering pairs.

The environment consists of two main components:

  1. A MySQL database that an agent can interact with to retrieve information (see the sketch after this list).
  2. A set of generated question-answer pairs for testing, available in the secgym/questions/tests folder or from Hugging Face.
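
For the first component, here is a minimal sketch of how an agent might query one of the incident databases with pymysql. The host, port, credentials, and database name are placeholders rather than the repo's actual configuration; substitute the values used by the Docker setup described below.

    # Minimal sketch: query one of the incident MySQL databases.
    # Host, port, credentials, and database name are placeholders --
    # substitute the values configured by scripts/setup_docker.sh.
    import pymysql

    conn = pymysql.connect(
        host="127.0.0.1",
        port=3306,                       # placeholder: each incident container exposes its own port
        user="root",
        password="<your_password>",
        database="<incident_db>",
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW TABLES;")  # discover the available log tables
            for (table,) in cur.fetchall():
                print(table)
    finally:
        conn.close()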


🛠️ Environment Setup

  1. Download the database from Hugging Face. Please download data_anonymized.tar.gz from this link and put the extracted data_anonymized folder under secgym/database/.

  2. We use MySQL Docker containers for the database. Please first install Docker Desktop and docker-compose, then pull the MySQL image:

    docker pull mysql:9.0
  3. Make sure Docker Desktop is running, then run the following command to set up MySQL containers for the 8 databases:

    bash scripts/setup_docker.sh

    It runs the command python secgym/database/setup_database.py --csv <path_to_csv_folder> --port <port> --sql_file <path_to_sql_file> --container_name <container_name> once for each of the 8 incidents.

    This script creates 8 containers. Note that these containers are bound to the CSV files in the data_anonymized folder and take up about 10GB of disk space. You can inspect the volumes with docker system df -v.

    To set up a single database that contains all the data (all 8 attacks), uncomment the first command in setup_docker.sh. Note that this takes up 33GB of disk space.

  4. Set up an environment using conda or venv with Python 3.11 and install the requirements with pip install -e . --use-pep517. The following is an example using conda:

    conda create -n excytin python=3.11
    conda activate excytin
    pip install -e . --use-pep517

    If you run into persistent installation errors (possibly caused by updated versions of some packages), you can instead install the frozen requirements with pip install -r requirements_freeze.txt.

  5. LLM setup.

    We use AG2 for API calling. Set up your API key in the secgym/myconfig.py file; you can follow the instructions here.
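
    Following the AG2 convention, a config in secgym/myconfig.py might look like the sketch below; the exact variable name and fields the repo expects are assumptions, so treat the linked instructions as authoritative.

        # secgym/myconfig.py -- sketch of an AG2-style model configuration.
        # The variable name `config_list` and its fields are assumptions;
        # follow the repo's linked instructions for the exact format.
        config_list = [
            {
                "model": "gpt-4.1",          # model used by the agent
                "api_key": "<your_api_key>",
            },
        ]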

🏃‍♂️ Runs

  1. Run the baseline. --trial_run runs only 2 questions from 1 incident for testing purposes. The results are saved in the experiments/final_results folder (see the sketch below for listing them).
    python experiments/run_exp.py --trial_run
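
    To see what a run produced, you can list the saved files as in the sketch below; the directory layout and file naming under experiments/final_results are assumptions here.

        # Sketch: list the result files written by a run.
        # The directory layout and file naming are assumptions.
        from pathlib import Path

        for path in sorted(Path("experiments/final_results").rglob("*.json")):
            print(path)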

🤖 Question Generation Process

All the questions are generated from graphs constructed from the database. The generation process is as follows:

  1. The SecurityIncident and SecurityAlert logs are used to construct a graph for each incident; check out this notebook for more details (a minimal sketch of the idea appears after this list).

  2. We run a train-test split on the constructed graph. Run the question_split.ipynb notebook to get the split (saved to experiements/split_files). Train and test are split based on a proposed path relevance score.

  3. We use an LLM to generate questions based on the constructed graph. The questions for the 8 incidents are already generated in the secgym/questions/tests folder using OpenAI o1. If you want to rerun the question generation process, use the following command:

    python experiments/run_qa_gen.py --model gpt-4.1 --solution_model gpt-4.1 --relevant_type low_split --qa_path secgym/qagen/graph_files

    Note that this command uses gpt-4.1 for both question and solution generation.
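
To illustrate step 1 above, here is a minimal sketch of the general graph-construction idea, not the repo's exact code: entities that co-occur in the same alert become linked nodes. The CSV path, column names, and entity format below are all assumptions.

    # Sketch of the alert-graph idea: nodes are entities extracted from
    # SecurityAlert rows; entities co-occurring in the same alert are linked.
    # The CSV path, column names, and entity format are assumptions.
    import networkx as nx
    import pandas as pd

    alerts = pd.read_csv("SecurityAlert.csv")  # hypothetical export of the alert log

    G = nx.Graph()
    for _, row in alerts.iterrows():
        entities = [e.strip() for e in str(row["Entities"]).split(";")]  # assumed format
        for i, a in enumerate(entities):
            for b in entities[i + 1:]:
                G.add_edge(a, b, alert=row["AlertName"])  # edge labeled with its alert

    print(G.number_of_nodes(), "entities;", G.number_of_edges(), "relations")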

After all the questions are generated, you should see new files in the secgym/questions folder, such as incident_<i>_qa.json, where i is the incident number.
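
As a quick sanity check, each generated file can be loaded as plain JSON; the per-entry schema is an assumption here, so inspect the real files for the actual keys.

    # Sketch: load a generated QA file and inspect one entry.
    # The per-entry schema is an assumption; check the real files for keys.
    import json

    with open("secgym/questions/incident_1_qa.json") as f:  # hypothetical incident number
        qa_pairs = json.load(f)

    print(len(qa_pairs), "question-answer pairs")
    print(qa_pairs[0])  # inspect the fields of the first entry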

Note: All results from the paper use the questions in the secgym/questions/tests folder. The train questions under secgym/questions/train are only partial and are used by ExpeL to collect new rules.

📊 Results

Below are the evaluation results of the LLM agents on the test questions. We set temperature = 0 and max_step = 25, and GPT-4o is used as the evaluator. The full evaluation logs with the latest models can be found under the latest_experiments folder. The full evaluation logs for older models can be downloaded from this link; they can also be found on this branch under the final_results folder (along with the original code).

[Evaluation results chart]

📝 Citation

If you find this work useful, please cite our paper:

@article{wu2025excytin,
  title={ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation},
  author={Wu, Yiran and Velazco, Mauricio and Zhao, Andrew and Luj{\'a}n, Manuel Ra{\'u}l Mel{\'e}ndez and Movva, Srisuma and Roy, Yogesh K and Nguyen, Quang and Rodriguez, Roberto and Wu, Qingyun and Albada, Michael and others},
  journal={arXiv preprint arXiv:2507.14201},
  year={2025}
}
