Songyan Zhang1*, Wenhui Huang2*, Zhan Chen1, Chua Jiahao Collister1, Qihang Huang1, Chen Lv1†
Nanyang Technological University1, Harvard University2
*Equal Contributions, †Corresponding Author
An overview of the capabilities of our proposed OpenREAD, a vision-language model tailored for autonomous driving via reinforcement learning with GRPO. Besides trajectory planning, OpenREAD can also provide reasoning-enhanced responses for open-ended scenario understanding, action analysis, and more.
Our OpenREAD is built upon Qwen3-VL-8B and fine-tuned on a mixture of datasets including LingoQA, OmniDrive, and NuScenes. OpenREAD is now available on Hugging Face. Enjoy playing with it!
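Below is a minimal inference sketch (not an official script from this repository) that assumes the standard Qwen3-VL chat interface in `transformers` together with `qwen_vl_utils`. The model ID, image path, and prompt are placeholders; the exact prompt format OpenREAD expects may differ, so please refer to the evaluation scripts for the formats used in our experiments.

```python
# Minimal inference sketch; the checkpoint name below is a placeholder,
# replace it with the actual OpenREAD model ID from our Hugging Face page.
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "path/to/OpenREAD-checkpoint"  # placeholder
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# A single front-camera frame with an open-ended driving question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "demo/cam_front.jpg"},  # placeholder image path
        {"type": "text", "text": "Describe the scene and the safest next action."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```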
To facilitate the learning of reasoning capability at the cold-start stage, we construct large-scale CoT annotations on the LingoQA and NuScenes datasets, as shown above. We further extend the number of annotations for LingoQA from 7K to 11K. All the CoT annotations are available here.
- Clone this repository and navigate to the OpenREAD folder:

```bash
git clone https://github.com/wyddmw/OpenREAD
cd OpenREAD
```

- Install the ms-swift package:
```bash
conda create -n openread python=3.10 -y
conda activate openread
pip install -e .
```

- Install Flash-Attention:
```bash
pip install flash_attn==2.8.3 --no-build-isolation
```

If this installation is not compatible with your device and environment, please refer to the source code and install a suitable version.
- Install Qwen3-VL dependencies:

```bash
pip install "transformers==4.57" "qwen_vl_utils==0.0.14"
```

The datasets used to train OpenREAD are LingoQA, OmniDrive, and NuScenes.
Please download our pre-processed LiDAR-BEV images for the NuScenes dataset. For trajectory evaluation, we use the GT cache introduced in GPT-Driver; please download the GT cache from Google Drive. The datasets are organized in the following structure:
```
data
├── LingoQA
│   ├── action
│   │   └── images
│   ├── evaluation
│   │   ├── images
│   │   └── val.parquet
│   ├── scenery
│   │   └── images
│   ├── training_data.json
│   └── evaluation_data.json
├── nuscenes
│   ├── samples
│   │   ├── CAM_FRONT
│   │   └── LIDAR_BEV
│   ├── gt
│   │   ├── vad_gt_seg.pkl
│   │   └── gt_traj_mask.pkl
│   └── traj_val_bev_ego_status.json
```
It is recommended to symlink your dataset root to `data` (e.g. `ln -s /path/to/your/datasets ./data`).
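As a quick, optional sanity check that the layout above is in place, a short script along these lines can be used (a minimal sketch; the paths simply mirror the directory tree shown above):

```python
# Check that the expected dataset files and folders exist under data/.
from pathlib import Path

expected = [
    "data/LingoQA/training_data.json",
    "data/LingoQA/evaluation_data.json",
    "data/LingoQA/evaluation/val.parquet",
    "data/nuscenes/samples/CAM_FRONT",
    "data/nuscenes/samples/LIDAR_BEV",
    "data/nuscenes/gt/vad_gt_seg.pkl",
    "data/nuscenes/gt/gt_traj_mask.pkl",
    "data/nuscenes/traj_val_bev_ego_status.json",
]
for p in expected:
    print(f"{'OK     ' if Path(p).exists() else 'MISSING'} {p}")
```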
Before running the evaluation script, please first download the pretrained Lingo-Judge, and check the paths of the LingoQA dataset and the Lingo-Judge pretrained model in `eval/LingoQA/eval_lingo.sh`.
```bash
sh eval/LingoQA/eval_lingo.sh
```

The predictions and the Lingo-Judge, CIDEr, METEOR, and BLEU metrics will be saved to `eval/LingoQA/lingoqa_results_OpenREAD.json`.
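The exact schema of this results file is produced by the evaluation script; a snippet like the following simply pretty-prints whatever it contains for a quick look:

```python
# Quick inspection of the saved LingoQA results (schema defined by eval_lingo.sh).
import json

with open("eval/LingoQA/lingoqa_results_OpenREAD.json") as f:
    results = json.load(f)

print(json.dumps(results, indent=2)[:2000])  # truncate long per-sample predictions
```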
We also provide scripts to evaluate trajectory prediction quality on the NuScenes validation set using both STP-3 and UniAD metrics. Update the trained model path, the eval_file path, the training mode, and the inference output path in `eval/Trajectory/infer_trajs_dist.sh`, then run trajectory inference:

```bash
bash eval/Trajectory/infer_trajs_dist.sh
```

This script generates trajectory prediction JSON files under the directory specified by the inference output path. Next, update the trajectory inference output path in `eval/Trajectory/eval_trajs.py`, then compute both STP-3 and UniAD metrics by running:
```bash
python eval/Trajectory/eval_trajs.py
```
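For reference, the two protocols differ mainly in how the per-timestep L2 error is averaged: STP-3-style numbers are commonly averaged over all waypoints up to each horizon, while UniAD-style numbers are reported at the horizon timestep only. The snippet below is a minimal sketch of that distinction, not the implementation in `eval_trajs.py`:

```python
# L2 error between a predicted and a ground-truth 3-second trajectory
# sampled at 2 Hz (6 waypoints), under the two common averaging conventions.
import numpy as np

pred = np.random.rand(6, 2)  # 6 future (x, y) waypoints over 3 s
gt = np.random.rand(6, 2)

l2_per_step = np.linalg.norm(pred - gt, axis=1)  # per-waypoint L2 error

for horizon_s, idx in [(1, 2), (2, 4), (3, 6)]:
    stp3_style = l2_per_step[:idx].mean()  # average over all steps up to the horizon
    uniad_style = l2_per_step[idx - 1]     # error at the horizon timestep only
    print(f"{horizon_s}s  STP-3-style L2: {stp3_style:.3f}  UniAD-style L2: {uniad_style:.3f}")
```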
- [✓] Release Hugging Face model, inference, and eval scripts.
- [✓] Release CoT data.
- [ ] Release training code.
We appreciate the awesome open-source projects ms-swift, OmniDrive, and GPT-Driver.
Coming soon.