Official Paper: https://link.springer.com/chapter/10.1007/978-981-95-1746-6_18
This repository contains the code and resources for the paper "An Automated Pipeline for Constructing a Vietnamese VQA-NLE Dataset".
This project introduces ViVQA-X, the first Vietnamese dataset for Visual Question Answering with Natural Language Explanations (VQA-NLE). Developed using a novel automated pipeline, our work provides a crucial resource to advance research in multimodal AI and explainability for the Vietnamese language. ViVQA-X features:
- 32,886 question-answer pairs with detailed explanations
- 41,817 high-quality natural language explanations
- Multi-stage automated pipeline for translation and quality control
- Comprehensive evaluation using multiple state-of-the-art models
This project facilitates research in Vietnamese visual question answering and supports the development of explainable AI systems for Vietnamese language understanding.
| Resource | Description | Link |
|---|---|---|
| Dataset | ViVQA-X Dataset on Hugging Face | |
| Model Weights | Pre-trained LSTM-Generative Model | |
| Demo | Interactive Demo Space | |
- QA Pairs: 32,886 pairs across Train/Validation/Test splits
- Explanations: 41,817 high-quality explanations
- Average Words: 10 words per explanation
- Vocabulary Size: 4,232 unique words in explanations
- Images: COCO dataset images with Vietnamese annotations
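
The word-level figures above can be approximated from raw explanation strings. The following is a minimal sketch, assuming lowercase whitespace tokenization; the tokenization behind the reported numbers is not stated here, so results may differ. `explanation_stats` is a hypothetical helper, not part of the repository, and the sample sentences are toy inputs rather than dataset entries.

```python
from typing import Iterable, Tuple


def explanation_stats(explanations: Iterable[str]) -> Tuple[float, int]:
    """Return (average tokens per explanation, vocabulary size).

    Assumption: tokens are lowercase, whitespace-separated strings.
    """
    total_tokens = 0
    count = 0
    vocabulary = set()
    for text in explanations:
        tokens = text.lower().split()
        total_tokens += len(tokens)
        count += 1
        vocabulary.update(tokens)
    average = total_tokens / count if count else 0.0
    return average, len(vocabulary)


# Toy input for illustration only (not taken from the dataset).
sample = ["con mèo đang ngủ", "con chó đang chạy"]
average_words, vocab_size = explanation_stats(sample)
```

Because Vietnamese words often span several space-separated syllables, a whitespace split counts syllables rather than segmented words.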
The dataset is organized into JSON files located in the data/final directory, containing questions, answers, and explanations associated with images from the COCO dataset.
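
As a rough illustration of reading one of these files, the sketch below loads a split with the standard library and summarizes the top-level container. The record schema is not described here, so the code deliberately avoids assuming field names; `load_split` and `describe` are hypothetical helpers.

```python
import json
from pathlib import Path
from typing import Any


def load_split(path: str) -> Any:
    """Parse one split file and return the decoded JSON object."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def describe(records: Any) -> str:
    """Summarize the top-level container without assuming a schema."""
    if isinstance(records, (list, dict)):
        return f"{type(records).__name__} with {len(records)} top-level entries"
    return type(records).__name__
```

For example, `describe(load_split("data/final/ViVQA-X_train.json"))` reports how many top-level entries the training split contains once the dataset is in place.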
# Clone the repository
git clone https://github.com/duongtruongbinh/ViVQA-X.git
cd ViVQA-X
# Install dependencies
pip install -r requirements.txt
# Download the dataset
bash scripts/download_vqax.sh
# Run the complete pipeline
bash scripts/pipeline.sh
- Python 3.8+
- CUDA 11.2+ (for GPU support)
- 8GB+ RAM recommended
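
The Python requirement can be checked programmatically before installing anything. This is a small sketch using only the standard library; `python_ok` is a hypothetical helper, and the CUDA and RAM requirements are not checked here.

```python
import sys
from typing import Sequence, Tuple

MIN_PYTHON: Tuple[int, int] = (3, 8)  # from the prerequisites list


def python_ok(version_info: Sequence[int] = sys.version_info,
              minimum: Tuple[int, int] = MIN_PYTHON) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum
```

Calling `python_ok()` with no arguments checks the interpreter that is currently running.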
- Clone the repository

      git clone https://github.com/duongtruongbinh/ViVQA-X.git
      cd ViVQA-X

- Create and activate virtual environment

      # Using conda (recommended)
      conda create -n vivqa-x python=3.8
      conda activate vivqa-x

      # Or using venv
      python -m venv vivqa-x
      source vivqa-x/bin/activate  # On Windows: vivqa-x\Scripts\activate
- Install dependencies

      pip install -r requirements.txt
- Set up environment variables (for pipeline)

      # Copy example environment file
      cp .env.example .env

      # Edit .env file and add your API keys:
      # OPENAI_API_KEY=your_openai_api_key_here
      # GEMINI_APIKEYS=your_gemini_api_key_1,your_gemini_api_key_2
- Download the original VQA-X dataset

      bash scripts/download_vqax.sh
- Download the COCO dataset

  The ViVQA-X dataset uses images from the COCO 2014 dataset. You need to download the `train2014` and `val2014` image sets.

      # Create directory for COCO data
      mkdir -p data/coco

      # Download and unzip Train 2014 images (~13GB)
      wget http://images.cocodataset.org/zips/train2014.zip -P data/coco/
      unzip data/coco/train2014.zip -d data/coco/
      rm data/coco/train2014.zip

      # Download and unzip Validation 2014 images (~6GB)
      wget http://images.cocodataset.org/zips/val2014.zip -P data/coco/
      unzip data/coco/val2014.zip -d data/coco/
      rm data/coco/val2014.zip

  After this step, you should have the following directory structure: `data/coco/train2014` and `data/coco/val2014`.
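
To confirm the COCO download landed where the later steps expect it, a quick check like the one below may help. It is a sketch under the assumption that the images sit directly inside each folder as `.jpg` files; `count_coco_images` is a hypothetical helper, not a repository script.

```python
from pathlib import Path
from typing import Dict

EXPECTED_DIRS = ("train2014", "val2014")


def count_coco_images(root: str = "data/coco") -> Dict[str, int]:
    """Count .jpg files in each expected COCO folder; -1 marks a missing folder."""
    counts = {}
    for name in EXPECTED_DIRS:
        folder = Path(root) / name
        counts[name] = len(list(folder.glob("*.jpg"))) if folder.is_dir() else -1
    return counts
```

A value of `-1` or `0` for either folder suggests the corresponding archive was not unzipped into `data/coco/`.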
Run the complete translation and processing pipeline:
bash scripts/pipeline.sh
This will:
- Translate English VQA-X to Vietnamese
- Apply quality selection mechanisms
- Post-process the results
- Generate the final ViVQA-X dataset
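
Since the pipeline relies on the API keys listed in `.env`, it can be useful to verify them before a long run. The sketch below assumes the variables have already been exported into the process environment; how the repository itself loads `.env`, and whether both providers are required, is not shown here. `read_api_keys` and `missing_providers` are hypothetical helpers.

```python
import os
from typing import Dict, List, Mapping


def read_api_keys(env: Mapping[str, str] = os.environ) -> Dict[str, List[str]]:
    """Collect the keys named in .env; GEMINI_APIKEYS is split on commas."""
    openai_key = env.get("OPENAI_API_KEY", "").strip()
    gemini_keys = [
        key.strip()
        for key in env.get("GEMINI_APIKEYS", "").split(",")
        if key.strip()
    ]
    return {
        "openai": [openai_key] if openai_key else [],
        "gemini": gemini_keys,
    }


def missing_providers(keys: Dict[str, List[str]]) -> List[str]:
    """Return the provider names for which no key was found."""
    return sorted(name for name, values in keys.items() if not values)
```

An empty list from `missing_providers(read_api_keys())` means at least one key was found for each provider.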
We provide comprehensive benchmarks using multiple state-of-the-art models:
| Model | Repository |
|---|---|
| Heuristic Model | Included |
| LSTM-Generative | Included |
| NLX-GPT | GitHub |
| OFA-X | GitHub |
| ReRe | GitHub |
The Heuristic Model is a rule-based approach that requires no training:
- Configure the model

      # src/models/heuristic_model/config/config.yaml
      data:
        train_path: "data/final/ViVQA-X_train.json"
        val_path: "data/final/ViVQA-X_val.json"
        test_path: "data/final/ViVQA-X_test.json"
        train_image_dir: "data/coco/train2014"
        val_image_dir: "data/coco/val2014"
        test_image_dir: "data/coco/val2014"

- Run evaluation

      python src/models/heuristic_model/run_heuristic.py
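
The paths in the `data` section are relative, so a missing file usually surfaces only at run time. The sketch below checks them up front. It assumes commands are run from the repository root and hard-codes a dict that mirrors the config values; in practice the dict would come from parsing the YAML file. `missing_paths` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, List

# Mirrors the `data` section of the heuristic model config.
DATA_CONFIG: Dict[str, str] = {
    "train_path": "data/final/ViVQA-X_train.json",
    "val_path": "data/final/ViVQA-X_val.json",
    "test_path": "data/final/ViVQA-X_test.json",
    "train_image_dir": "data/coco/train2014",
    "val_image_dir": "data/coco/val2014",
    "test_image_dir": "data/coco/val2014",
}


def missing_paths(data_config: Dict[str, str], root: str = ".") -> List[str]:
    """Return the config keys whose path does not exist under `root`."""
    return [
        key
        for key, relative in data_config.items()
        if not (Path(root) / relative).exists()
    ]
```

Running `missing_paths(DATA_CONFIG)` from the repository root lists any entries that still need to be downloaded or generated.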
The LSTM-Generative model uses an attention mechanism:
- Configure the model

      # src/models/baseline_model/config/config.yaml
      data:
        train_path: "data/final/ViVQA-X_train.json"
        val_path: "data/final/ViVQA-X_val.json"
        test_path: "data/final/ViVQA-X_test.json"
        train_image_dir: 'data/coco/train2014'
        val_image_dir: 'data/coco/val2014'
        test_image_dir: 'data/coco/val2014'
      model:
        device: "cuda:0"  # Adjust based on GPU availability
        embed_size: 400
        hidden_size: 2048
        num_layers: 2
        max_explanation_length: 15
      training:
        learning_rate: 0.0001
        num_epochs: 50
        batch_size: 128
        num_workers: 4
        save_dir: "weights/baseline"
- Train the model

      # Using script (recommended)
      bash scripts/train.sh

      # Or direct command
      python src/models/baseline_model/train.py --config src/models/baseline_model/config/config.yaml
- Evaluate the model

      # Using script
      bash scripts/evaluate.sh

      # Or direct command
      python src/models/baseline_model/evaluate.py --model_path weights/baseline/best_model.pth
- Use pre-trained weights

      # The model weights are available on Hugging Face
      # Follow the repository instructions to download and use
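
The `device` entry in the config is meant to be adjusted to the available hardware. One way to pick a safe value is sketched below, assuming the model is built on PyTorch (as the `.pth` checkpoint name suggests). `resolve_device` is a hypothetical helper; the training script may handle this differently.

```python
def resolve_device(requested: str = "cuda:0") -> str:
    """Return `requested` if CUDA is usable, otherwise fall back to "cpu"."""
    if not requested.startswith("cuda"):
        return requested
    try:
        import torch  # optional here: the sketch still runs without PyTorch
    except ImportError:
        return "cpu"
    return requested if torch.cuda.is_available() else "cpu"
```

The returned string can then be written to the `device` field before training or evaluation.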
Both models are evaluated with the following metrics:
| Metric | Description |
|---|---|
| Answer Accuracy | Exact match accuracy for answers |
| BLEU-1/2/3/4 | N-gram precision for explanations |
| BERTScore | Contextual similarity score |
| METEOR | Semantic similarity with WordNet |
| ROUGE-L | Longest common subsequence |
| CIDEr | Consensus-based evaluation |
| SPICE | Semantic propositional evaluation |
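
To make three of these metrics concrete, the sketch below implements simplified versions: exact-match answer accuracy, the clipped n-gram precision at the core of BLEU, and the longest-common-subsequence length that ROUGE-L builds on. These are illustrative only and are not the repository's evaluation code. They assume pre-tokenized input, a single reference, and case-insensitive answer matching, and they omit BLEU's brevity penalty and ROUGE-L's F-score weighting.

```python
from collections import Counter
from typing import List, Sequence


def answer_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of answers that match exactly after trimming and lowercasing."""
    if not references:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


def ngram_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Clipped n-gram precision against one reference (core term of BLEU-n)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]
```

Clipping is what stops a candidate from scoring well by repeating one correct word: for the candidate `the the the` against the reference `the cat`, only one of the three unigrams is credited.
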
ViVQA-X/
├── data/                              # Dataset files
│   ├── vqax/                          # Original VQA-X dataset
│   ├── translation/                   # Translation intermediate files
│   ├── selection/                     # Quality selection files
│   └── final/                         # Final ViVQA-X dataset
│       ├── ViVQA-X_train.json
│       ├── ViVQA-X_val.json
│       └── ViVQA-X_test.json
├── notebooks/                         # Jupyter notebooks for analysis
├── scripts/                           # Utility scripts
│   ├── download_vqax.sh               # Download original dataset
│   ├── pipeline.sh                    # Run complete pipeline
│   ├── train.sh                       # Train baseline model
│   └── evaluate.sh                    # Evaluate models
├── src/                               # Source code
│   ├── models/                        # Model implementations
│   │   ├── baseline_model/            # LSTM-Generative model
│   │   │   ├── config/                # Configuration files
│   │   │   ├── dataloaders/           # Data loading utilities
│   │   │   ├── metrics/               # Evaluation metrics
│   │   │   ├── utils/                 # Helper utilities
│   │   │   ├── weights/               # Model checkpoints
│   │   │   ├── train.py               # Training script
│   │   │   ├── evaluate.py            # Evaluation script
│   │   │   └── vivqax_model.py        # Model architecture
│   │   └── heuristic_model/           # Rule-based baseline
│   │       ├── config/                # Configuration files
│   │       ├── dataloaders/           # Data loading utilities
│   │       ├── metrics/               # Evaluation metrics
│   │       ├── utils/                 # Helper utilities
│   │       ├── run_heuristic.py       # Main evaluation script
│   │       └── heuristic_baseline.py  # Model implementation
│   └── pipeline/                      # Data processing pipeline
│       ├── translation/               # Translation modules
│       │   ├── translators/           # Various translator implementations
│       │   └── translation.py         # Translation pipeline
│       ├── selection/                 # Quality selection modules
│       │   ├── evaluators/            # LLM evaluators
│       │   └── selection.py           # Selection pipeline
│       ├── post_processing/           # Post-processing modules
│       └── pipeline.py                # Main pipeline script
├── requirements.txt                   # Python dependencies
├── LICENSE
└── README.md                          # This file
If you use this dataset or code in your research, please cite our paper:
@InProceedings{duong2026vivqax,
author = {Truong-Binh Duong and Hoang-Minh Tran and Binh-Nam Le-Nguyen and Dinh-Thang Duong},
title = {An Automated Pipeline for Constructing a Vietnamese VQA-NLE Dataset},
booktitle = {Proceedings of the Fifth International Conference on Intelligent Systems and Networks},
series = {Lecture Notes in Networks and Systems},
year = {2026},
publisher = {Springer Nature Singapore},
pages = {164--173},
isbn = {978-981-95-1746-6},
doi = {10.1007/978-981-95-1746-6_18}
}
Dataset • Model • Demo • Contact