This is the official repository of the paper:
SAR-TEXT: A Large-Scale SAR Image-Text Dataset Built with SAR-Narrator and A Progressive Learning Strategy for Downstream Tasks (https://arxiv.org/pdf/2507.18743)
It includes:
- 📁 A large-scale SAR image–text paired dataset (SAR-TEXT)
- 🤖 Multiple vision-language foundation models (VLFMs), including:
  - SAR-CLIP for retrieval
  - SAR-CoCa for captioning
  - SAR-GPT for generation
- 🧠 An automatic captioning pipeline based on our SAR-Narrator framework (coming soon)
The goal of this project is to bridge the gap between synthetic aperture radar (SAR) imagery and semantic understanding via vision-language modeling. Everything — code, models, and data — will be open-sourced to support the community.
The complete image and caption data for the SAR-TEXT image-text matching dataset is available via Baidu Netdisk:
- 🖼️ SAR Image–Text Matching Dataset (SAR-TEXT)
  `SAR-TEXT-data.zip` (shared via Baidu NetDisk)
  🔗 Download Link
  🔑 Extraction Code: fw5a
This is the SAR image–text dialogue dataset introduced in our paper. This release includes:
- 🛰 Optical Remote Sensing (RS) Dialogue Dataset (RS-VQA)
  `RS-VQA_conv.json`
  Based on the RS-VQA dataset, providing multi-turn visual question answering (VQA) dialogue annotations for optical remote sensing images.
- 📡 SAR Image–Text Dialogue Dataset (SAR-VQA)
  `SAR-VQA_conv.json` (shared via Baidu NetDisk)
  🔗 Download Link
  🔑 Extraction Code: 1qqj
- 🧠 SAR-RS-CLIP
  `SAR-RS-CLIP.pt` (shared via Baidu NetDisk)
  🔗 Download Link
  🔑 Extraction Code: 1472
- 🧠 SAR-RS-CoCa
  `SAR-RS-CoCa.pt` (shared via Baidu NetDisk)
  🔗 Download Link
  🔑 Extraction Code: g4x3
- 🧠 SAR-GPT
  `SAR-GPT.pth` (shared via Baidu NetDisk)
  🔗 Download Link
  🔑 Extraction Code: aqjy
This repository integrates multiple models from different codebases. Please make sure to follow the correct environment setup for each component:
- ✅ CLIP and CoCa models are implemented using the OpenCLIP framework. All related model loading, training, and inference scripts are based on OpenCLIP (a minimal loading sketch follows below).
- ✅ SAR-GPT is based on the TinyGPT-V repository. Any generation tasks involving SAR-GPT should be executed in the TinyGPT-V environment.
Ensure dependencies are installed accordingly before running any module.
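For quick orientation, here is a minimal sketch of loading the SAR-RS-CLIP checkpoint with OpenCLIP. The base architecture (ViT-L-14), the checkpoint path, and the state-dict layout are assumptions, not taken from the repository scripts, so treat the provided scripts as the reference.

```python
# Minimal sketch: loading SAR-RS-CLIP with OpenCLIP.
# The architecture name and checkpoint path below are assumptions.
import torch
import open_clip

# Create the base model and its preprocessing pipeline.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

# Load the fine-tuned weights.
state_dict = torch.load("./checkpoints/SAR-RS-CLIP.pt", map_location="cpu")
# Some checkpoints wrap the weights in a "state_dict" key; unwrap if present.
if isinstance(state_dict, dict) and "state_dict" in state_dict:
    state_dict = state_dict["state_dict"]
model.load_state_dict(state_dict, strict=False)
model.eval()
```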
The script `SAR-CLIP-retrieval.py` evaluates image–text retrieval performance using SAR-CLIP fine-tuned on SAR-TEXT.
- CSV File:
  `HRSID_test_caption.csv`
  Contains image paths and their corresponding captions.
- Images:
  📦 Download from Baidu NetDisk:
  `HRSID_JPG.rar`
  🔑 Extraction Code: i4xf
```
python evaluate_retrieval.py \
  --model-name ViT-L-14 \
  --retrieval-csv-path ./HRSID_test_caption.csv \
  --sarclip-path ./checkpoints/sarclip_weights.pt \
  --batch-size 64 \
  --workers 8
```

The script will print standard retrieval metrics:
- retrieval-image2text-R@1, R@5, R@10
- retrieval-text2image-R@1, R@5, R@10
- retrieval-mean-recall
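For reference, the sketch below shows how image-to-text Recall@K can be computed from paired embeddings. The function and tensor layout are illustrative and not taken from the evaluation script; text-to-image recall is the same computation with the arguments swapped, and mean recall averages all six values.

```python
# Illustrative Recall@K computation for paired image/text embeddings.
import torch

def recall_at_k(image_feats: torch.Tensor, text_feats: torch.Tensor, k: int) -> float:
    """Image-to-text Recall@K for (N, D) L2-normalized features, where pair i <-> i is the match."""
    sims = image_feats @ text_feats.T                  # (N, N) cosine similarity matrix
    topk = sims.topk(k, dim=1).indices                 # indices of the k most similar texts per image
    targets = torch.arange(sims.size(0)).unsqueeze(1)  # ground-truth caption index for each image
    hits = (topk == targets).any(dim=1).float()        # 1 if the true caption is within the top k
    return hits.mean().item()
```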
The script SAR-CoCa-generate-caption.py is used to generate captions for SAR images using the CoCa model.
⚠️ Please ensure that this script is run in the OpenCLIP environment.
- Set the `folder_path` variable in the script to point to the directory containing your SAR images.
- Run the script. It will generate a CSV file named `SAR-CoCa-caption.csv`, containing:
  - File path for each image
  - Corresponding caption generated by the CoCa model
```
filepath,caption
./test_images/img001.jpg,A ship appears in open water.
./test_images/img002.jpg,A satellite view of a bridge across a river.
```
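For orientation, here is a minimal captioning sketch using the OpenCLIP CoCa API. It assumes the SAR-RS-CoCa checkpoint is compatible with the `coca_ViT-L-14` architecture and that the paths shown are placeholders; the repository script remains the reference implementation.

```python
# Minimal sketch: generating SAR image captions with an OpenCLIP CoCa model.
# The architecture name, checkpoint path, and image folder are assumptions.
import csv
import os

import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("coca_ViT-L-14")
ckpt = torch.load("./checkpoints/SAR-RS-CoCa.pt", map_location="cpu")
model.load_state_dict(ckpt.get("state_dict", ckpt), strict=False)
model.eval()

folder_path = "./test_images"  # directory containing your SAR images
rows = []
for name in sorted(os.listdir(folder_path)):
    path = os.path.join(folder_path, name)
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        tokens = model.generate(image)                 # autoregressive caption tokens
    caption = (
        open_clip.decode(tokens[0])
        .split("<end_of_text>")[0]
        .replace("<start_of_text>", "")
        .strip()
    )
    rows.append({"filepath": path, "caption": caption})

# Write results in the same filepath,caption layout shown above.
with open("SAR-CoCa-caption.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["filepath", "caption"])
    writer.writeheader()
    writer.writerows(rows)
```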