A Visual Question Answering Dataset for Agent-Level Risk Assessment in Autonomous Driving (ICRA 2026)
This repository contains the end-to-end pipeline that produces the NuRisk VQA dataset — from raw CommonRoad scenarios to a parquet-backed visual question answering dataset ready for VLM fine-tuning.
```mermaid
flowchart LR
    A[1083 CommonRoad<br/>XML scenarios] --> B[Stage 1<br/>Frenetix<br/>Motion Planner]
    B --> C[Stage 2<br/>Trajectory<br/>extraction]
    C --> D[Stage 3<br/>Safety<br/>metrics]
    D --> E[Stage 4<br/>Close-agent ID<br/>+ risk scoring]
    E --> F[Stage 5<br/>VQA generation<br/>+ HF upload]
    F --> G[Stage 6<br/>VLM<br/>fine-tuning]
```
| Stage | Folder / external repo | Output |
|---|---|---|
| 1 | TUM-AVS/Frenetix-Motion-Planner (external) | Per-scenario trajectory logs + BEV PNGs |
| 2 | Trajectory_collection/ | Ego + dynamic-obstacle CSVs with lanelet info |
| 3 | Safety_metrics_collection/ | relative_metrics.csv (TTC, relative pos/vel per agent) |
| 4 | CloesdID_identification/ + Riskscore_calculation/ | risk_scores_output_enhanced.json (0–5 risk scale per agent) |
| 5 | finetune_preprocess/ | Parquet VQA dataset pushed to Yuan-avs/Nurisk |
| 6 | training/ (launcher + DeepSpeed configs) wrapping 2U1/Qwen-VL-Series-Finetune | LoRA-fine-tuned Qwen2.5-VL checkpoint (NuRisk-VLM-Agent) |
```bash
git clone https://github.com/TUM-AVS/NuRisk.git
cd NuRisk
pip install -r requirements.txt
```

Then edit config.py to point the path constants (XML_SCENARIOS_DIR, LOGS_DIR, VLM_DATASET_DIR, …) at your local data layout. A few individual scripts still carry hard-coded /home/yuan/... paths; search and replace them before running those scripts, or override them via the scripts' CLI flags.
Or run everything in a container — there's a pre-configured CUDA + PyTorch + DeepSpeed image:
```bash
docker compose up -d nurisk-train
docker compose exec nurisk-train bash
```

See DOCKER_SETUP.md for GPU driver + container details.
Fetch the published VQA dataset (parquet + images, ~250 GB once decoded):
```bash
bash download_dataset.sh ./Nurisk_dataset
# or, in Python:
# from datasets import load_dataset; load_dataset("Yuan-avs/Nurisk")
```

The full data-preparation pipeline (stages 1–4) is run on a curated, collision-rich set of 1083 CommonRoad XML scenarios (the Frenetix dataset/scenarios_collision/ pool). After downstream cleaning in stage 5, 888 scenarios appear in the published Yuan-avs/Nurisk fine-tuning dataset. All 1083 IDs are listed in scenarios.txt; the 888 that survived into the final dataset are marked with *.
The XMLs themselves are not redistributed here. All scenarios used by this pipeline are publicly available in the official CommonRoad scenario database at https://commonroad.in.tum.de/scenarios/ — search by the IDs in scenarios.txt, download the XMLs, and point XML_SCENARIOS_DIR in config.py at the directory you place them in.
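To decide which XMLs to download, it can help to split scenarios.txt into the full ID list and the subset marked with *. The sketch below is only an illustration: it assumes one ID per line with the * placed directly before or after the ID, which this README does not specify, and read_scenario_ids is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def read_scenario_ids(path):
    """Return (all_ids, kept_ids) parsed from a scenarios.txt-style file.

    Assumption (not confirmed by the repository): one scenario ID per line,
    with a '*' at the start or end of the line marking scenarios that
    survived into the final dataset.
    """
    all_ids, kept_ids = [], []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        scenario_id = line.strip("*").strip()
        all_ids.append(scenario_id)
        if line.startswith("*") or line.endswith("*"):
            kept_ids.append(scenario_id)
    return all_ids, kept_ids
```

If the marker turns out to be placed differently in the actual file, only the startswith/endswith check needs adjusting.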
Run the Frenetix Motion Planner on every scenario.
Frenetix produces, per scenario:
- `logs.csv`: ego state across time
- BEV PNG snapshots per timestep (used in stage 5 as VQA image input)
Refer to the Frenetix README for installation and execution. The remainder of this pipeline assumes the per-scenario Frenetix log directories live under one root (configurable via LOGS_DIR in config.py).
```bash
cd Trajectory_collection
python main_multi.py
```

Input: Frenetix logs + raw CommonRoad XML scenario files

Output (per scenario):

- `ego_trajectory_positions_with_lanelets.csv`
- `dynamic_obstacles_with_lanelets.csv`
- `dynamic_obstacles.csv`
- `ego_trajectory.csv`
- `<scenario_name>.json` (CommonRoad metadata)
Parallelism is set via the num_cpus constant in the script.
```bash
cd Safety_metrics_collection
python safety_multi.py
```

For every (ego, agent, timestep) tuple this computes:
- Longitudinal / lateral relative distance
- Relative velocity (front / rear / left / right)
- Time-to-collision (TTC)
Output (per scenario): relative_metrics.csv
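For readers unfamiliar with TTC, the following is a minimal sketch of the common constant-velocity definition along a single axis (gap divided by closing speed). It is not taken from safety_multi.py, whose exact formula may differ; time_to_collision and its parameter names are hypothetical.

```python
import math


def time_to_collision(gap_m, closing_speed_mps):
    """Constant-velocity TTC along one axis, in seconds.

    gap_m: free distance between the two vehicles in metres.
    closing_speed_mps: rate at which the gap shrinks in m/s
        (positive means the vehicles are approaching each other).
    Returns math.inf when the gap is not closing.
    """
    if gap_m <= 0.0:
        return 0.0  # already touching or overlapping
    if closing_speed_mps <= 0.0:
        return math.inf  # gap is constant or growing, so no collision is predicted
    return gap_m / closing_speed_mps


# Example: a 10 m gap closing at 5 m/s gives a TTC of 2 s.
print(time_to_collision(10.0, 5.0))
```

Returning infinity for a non-closing gap keeps later threshold comparisons simple, since an infinite TTC falls into the least severe band.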
```bash
cd CloesdID_identification
python extract_close_metrics_multi.py
```

Output (per scenario): close_relative_metrics.csv

```bash
cd Riskscore_calculation
python risk_score_calculator_multi_enhanced.py
```

Output (per scenario):

- `risk_scores_output_enhanced.json`: full per-agent reasoning trace
- `risk_scores_close_relative_metrics_enhanced.csv`
- `risk_score_summary_enhanced.csv`
Risk scale (set in config.py):
| Score | Label | Min. distance | TTC |
|---|---|---|---|
| 0 | Collision | < 0.3 m | < 0.15 s |
| 1 | Extreme | < 0.8 m | < 0.65 s |
| 2 | High | < 1.3 m | < 1.15 s |
| 3 | Medium | < 3.0 m | < 3.0 s |
| 4 | Low | < 5.0 m | < 5.0 s |
| 5 | Negligible | ≥ 5.0 m | ≥ 5.0 s |
Lateral vs. longitudinal weighting and ego dimensions (EGO_LENGTH = 4.508 m, EGO_WIDTH = 1.610 m) are also defined in config.py.
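As an illustration of how the thresholds in the table could be applied, here is a minimal sketch that scores distance and TTC separately and then keeps the more severe (lower) score. That combination rule is an assumption, and the sketch ignores the lateral/longitudinal weighting mentioned above; the actual logic lives in Riskscore_calculation/ and config.py. All names below are hypothetical.

```python
# (score, upper bound): a value strictly below the bound earns that score.
# Bounds are copied from the risk-scale table in this README.
DISTANCE_BOUNDS_M = [(0, 0.3), (1, 0.8), (2, 1.3), (3, 3.0), (4, 5.0)]
TTC_BOUNDS_S = [(0, 0.15), (1, 0.65), (2, 1.15), (3, 3.0), (4, 5.0)]
LABELS = {0: "Collision", 1: "Extreme", 2: "High",
          3: "Medium", 4: "Low", 5: "Negligible"}


def score_from_bounds(value, bounds):
    """Return the first score whose upper bound exceeds `value`, else 5."""
    for score, upper in bounds:
        if value < upper:
            return score
    return 5  # at or beyond the last bound: negligible risk


def risk_score(min_distance_m, ttc_s):
    """Combine the two per-metric scores into (score, label).

    Assumption: the overall score is the more severe (lower) of the
    distance-based and TTC-based scores.
    """
    distance_score = score_from_bounds(min_distance_m, DISTANCE_BOUNDS_M)
    ttc_score = score_from_bounds(ttc_s, TTC_BOUNDS_S)
    overall = min(distance_score, ttc_score)
    return overall, LABELS[overall]
```

For example, risk_score(2.0, 1.0) returns (2, "High") under this rule: the 2.0 m distance alone would be Medium, but the 1.0 s TTC falls in the High band.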
```bash
cd finetune_preprocess
python create_groundtruth_dataset.py               # 1. link BEV images ↔ risk scores
python prepare_sequential_dataset.py               # 2. stack 5 consecutive BEV frames per sample
python create_sequential_groundtruth_dataset.py    # 3. attach risk labels to the 5-frame stacks
python create_qwen_finetune_dataset.py             # 4. emit LLaVA-style conversation pairs
python prepare_dataset_splits.py                   # 5. train / validation split
python convert_to_vqa_parquet_fixed.py             # 6. → parquet (image / question / answer)
python upload_vqa_to_hf.py --repo-name <user>/<repo>  # 7. push to the Hugging Face Hub
```

The final artifact is the public dataset Yuan-avs/Nurisk, which consists of 60 train + 7 validation parquet shards with three columns per row: image (5-frame BEV stack), question (driving-risk prompt referencing a specific agent), and answer (structured JSON with risk score, level, direction, reasoning).
See finetune_preprocess/README.md for a more detailed view of each sub-step.
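The idea behind stacking 5 consecutive BEV frames per sample can be sketched as a sliding window over a scenario's time-ordered frame list. This is only an illustration, not the code in prepare_sequential_dataset.py: five_frame_windows is a hypothetical helper, and the stride of 1 (overlapping windows) is an assumption, since this README does not say whether samples overlap.

```python
def five_frame_windows(frame_paths, window=5, stride=1):
    """Group a time-ordered list of frames into runs of `window` consecutive frames.

    Assumption: windows advance by `stride` frames; stride=1 yields
    overlapping windows. Scenarios shorter than `window` yield no samples.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(frame_paths) - window
    return [frame_paths[i:i + window] for i in range(0, last_start + 1, stride)]


# Example with seven hypothetical frame names t0.png ... t6.png.
frames = [f"t{i}.png" for i in range(7)]
for sample in five_frame_windows(frames):
    print(sample)
```

With seven frames and a stride of 1, this produces three samples, the first starting at t0.png and the last at t2.png.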
We fine-tune Qwen2.5-VL on the NuRisk dataset using 2U1/Qwen-VL-Series-Finetune. The launcher and DeepSpeed configs are bundled in training/; the upstream training code is referenced as a sibling clone (see training/README.md).
```bash
# 1. Fetch the dataset
bash download_dataset.sh ./Nurisk_dataset
# 2. Clone the upstream training code next to this repo
git clone https://github.com/2U1/Qwen-VL-Series-Finetune.git ../Qwen-VL-Series-Finetune
# 3. Register the local dataset in the upstream data_config.json, then launch:
bash training/train_risk_analysis.sh
```

Typical LoRA configuration used in the paper:
| Hyper-parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-VL-7B-Instruct |
| LoRA rank | 64 (best) / 256 (high-capacity variant) |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj |
| Precision | bf16 |
| Sequence length | 2048 prompt / 1024 completion |
| DeepSpeed | ZeRO-3 (training/scripts/zero3.json) |
The resulting checkpoint is referred to in the paper as NuRisk-VLM-Agent.
After editing config.py:
```bash
python run_pipeline.py --stage all
```

Or run a single stage: --stage {trajectory, safety, close, risk, vml}.
The same pipeline (stages 2–5) generalises to other sources of ego/agent trajectories + BEV imagery. To target a real-world driving dataset such as nuScenes or Waymo Open, simply swap stage 1 for that dataset's trajectory + BEV producer and keep stages 2–5 as-is — the risk scoring and VQA generation logic are dataset-agnostic.
```bibtex
@article{gao2025nurisk,
  title   = {NuRisk: A Visual Question Answering Dataset for Agent-Level Risk Assessment in Autonomous Driving},
  author  = {Gao, Yuan and Piccinini, Mattia and Brusnicki, Roberto and Zhang, Yuchen and Betz, Johannes},
  journal = {arXiv preprint arXiv:2509.25944},
  year    = {2025},
  url     = {https://arxiv.org/abs/2509.25944}
}
```

Released under the MIT License.
