Authors: Maxence Boels, Harry Robertshaw, Thomas C Booth, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin
Affiliation: Surgical and Interventional Engineering, King's College London
This repository contains the official implementation of DARIL (Dual-task Autoregressive Imitation Learning), presenting the first comprehensive comparison of Imitation Learning (IL) versus Reinforcement Learning (RL) for surgical action planning. Our work challenges conventional assumptions about RL superiority in sequential decision-making tasks.
- First systematic IL vs RL comparison for surgical action planning on CholecT50 dataset
- Novel DARIL architecture combining dual-task learning with autoregressive prediction
- Surprising findings: IL consistently outperforms sophisticated RL approaches (world models, direct video RL, inverse RL)
- Critical insights on evaluation bias and distribution matching in expert domains
Surgical action planning predicts future instrument-verb-target (IVT) triplets from laparoscopic video feeds for real-time surgical assistance. Unlike recognition tasks, planning requires multi-horizon prediction under safety-critical constraints with sparse annotations (100 distinct triplet classes, 0-3 simultaneous actions per frame).
DARIL combines three key components:
- MHA Encoder - Temporal processing for current action recognition
- GPT-2 Decoder - Causal autoregressive generation for future action prediction (20-frame context window)
- Dual-task Optimization - Joint training on recognition + prediction with auxiliary losses
Loss: L = L_current + L_next + L_embed + L_phase
Input: 1024-dim Swin Transformer features
Output: Multi-horizon IVT triplet predictions (1s, 2s, 3s, 5s, 10s, 20s)
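The actual implementation lives in `src/training/autoregressive_il_trainer.py` and `src/models/`; the snippet below is only a minimal sketch of the dual-task idea, assuming a PyTorch / Hugging Face setup. Class and head names (`DARILSketch`, `current_head`, `next_head`) are illustrative, not the repository's API.

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2Model

class DARILSketch(nn.Module):
    """Illustrative dual-task model: recognize current IVT triplets and
    autoregressively predict future ones from Swin frame features."""

    def __init__(self, feat_dim=1024, hidden=768, num_triplets=100, context=20):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)                       # 1024-d Swin features -> hidden
        self.encoder = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.decoder = GPT2Model(GPT2Config(                          # causal decoder, 20-frame context
            n_embd=hidden, n_layer=6, n_head=8, n_positions=context, vocab_size=1))
        self.current_head = nn.Linear(hidden, num_triplets)           # recognition (multi-label)
        self.next_head = nn.Linear(hidden, num_triplets)              # next-action prediction

    def forward(self, feats):                       # feats: (B, T<=20, 1024)
        x = self.proj(feats)
        enc, _ = self.encoder(x, x, x)              # temporal MHA encoding
        dec = self.decoder(inputs_embeds=enc).last_hidden_state       # causal autoregressive decoding
        return self.current_head(enc), self.next_head(dec)

# Dual-task loss, here reduced to the two main terms
# (the auxiliary embedding and phase losses are omitted for brevity).
bce = nn.BCEWithLogitsLoss()
def daril_loss(cur_logits, next_logits, cur_labels, next_labels):
    return bce(cur_logits, cur_labels) + bce(next_logits, next_labels)
```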
Multi-horizon action planning results (mAP, %):

| Method | Current | 1s | 5s | 10s |
|---|---|---|---|---|
| DARIL (Ours) | 34.6 | 33.6 | 31.2 | 29.2 |
| DARIL + IRL | 33.1 | 32.1 | 29.6 | 28.1 |
| Direct Video RL | 33.2 | 22.6 | 19.3 | 15.9 |
| World Model RL | 33.1 | 14.0 | 9.1 | 3.1 |
DARIL component-level results (mAP, %), current recognition vs. next-action prediction:

| Component | Current | Next |
|---|---|---|
| Instrument (I) | 91.4 | 88.2 |
| Verb (V) | 69.4 | 68.1 |
| Target (T) | 52.7 | 52.5 |
| IVT | 34.6 | 33.6 |
Key Insight: DARIL maintains robust temporal consistency with only 13.1% relative performance decrease from 1s to 10s planning horizons, while world model RL catastrophically degrades to 3.1% mAP.
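These percentages follow directly from the table values above; a throwaway Python check (not part of the repository) reproduces them:

```python
# Relative mAP drop from the 1s to the 10s planning horizon, using the table values above.
daril_1s, daril_10s = 33.6, 29.2
wm_1s, wm_10s = 14.0, 3.1

print(f"DARIL:          {(daril_1s - daril_10s) / daril_1s:.1%}")   # -> 13.1%
print(f"World Model RL: {(wm_1s - wm_10s) / wm_1s:.1%}")            # -> 77.9%
```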
Our analysis identifies critical factors:
- Expert-Optimal Demonstrations - CholecT50 contains near-optimal expert data; RL explores valid alternatives penalized by expert-similarity metrics
- Evaluation Metric Alignment - Test metrics directly reward expert-like behavior, systematically favoring IL
- State-Action Representation Challenges - Frame embeddings + discrete action triplets + sparse rewards limit RL learning
- Distribution Mismatch - RL policies optimized for different objectives produce behaviors misaligned with test distributions
- Limited Exploration Benefits - Safety constraints and expert optimality reduce advantages from exploration
- Method Selection: Well-optimized IL may outperform sophisticated RL in expert domains with high-quality demonstrations
- Hybrid Approaches: Bootstrap RL with IL-learned skills, explore safely in simulation/world models
- Safety Advantages: IL inherently stays closer to expert behavior for clinical deployment
- Evaluation Frameworks: Alternative metrics focusing on patient outcomes (beyond expert similarity) may favor RL
The repository is organized for easy navigation and reproducibility. See STRUCTURE.md for detailed documentation.
DARIL/
├── README.md                      # This file
├── STRUCTURE.md                   # Detailed structure guide
├── .gitignore                     # Git ignore rules
│
├── configs/                       # Configuration files
│   ├── config_dgx_all_v8.yaml     # Main experiment config
│   └── config_dgx_all.yaml        # Alternative configs
│
├── scripts/                       # Executable scripts
│   ├── run_experiment_v8.py       # Main experiment runner
│   ├── run_paper_generation.py    # Paper figure generator
│   ├── runai.sh                   # GPU cluster scripts
│   └── *.sh                       # Shell scripts
│
├── src/                           # Core source code
│   ├── training/                  # Training modules
│   │   ├── autoregressive_il_trainer.py   # DARIL trainer
│   │   ├── world_model_trainer.py         # World model RL
│   │   ├── world_model_rl_trainer.py      # RL in world models
│   │   ├── irl_direct_trainer.py          # Inverse RL
│   │   └── irl_next_action_trainer.py     # IRL variants
│   ├── evaluation/                # Evaluation framework
│   ├── models/                    # Model architectures
│   ├── environment/               # RL environments
│   ├── utils/                     # Utility functions
│   └── debugging/                 # Debug tools
│
├── notebooks/                     # Interactive visualizations
│   ├── visualization/             # Visualization modules
│   └── *.html                     # Interactive HTML demos
│
├── docs/                          # Documentation
│   ├── paper_manuscript/          # LaTeX paper source
│   ├── paper_notes/               # Research notes
│   └── paper_generation/          # Figure generation
│
├── outputs/                       # Experiment outputs (gitignored)
│   ├── results/                   # Evaluation results
│   ├── models_saved/              # Trained model checkpoints
│   ├── logs/                      # Training logs
│   ├── figures/                   # Generated figures
│   └── data/                      # Processed datasets
│
├── data/                          # Raw dataset (user-provided)
│   └── cholect50/                 # CholecT50 video features
│
├── docker/                        # Docker configurations
│
└── archive/                       # Historical code versions
- Main Scripts:
  - `scripts/run_experiment_v8.py` - Primary training/evaluation pipeline
  - `scripts/run_paper_generation.py` - Generate paper figures
- Core Implementations:
  - `src/training/autoregressive_il_trainer.py` - DARIL model
  - `src/models/` - Model architectures (MHA encoder, GPT-2 decoder)
  - `src/evaluation/` - Evaluation metrics and pipelines
- Configuration:
  - `configs/config_dgx_all_v8.yaml` - Main experimental setup
git clone https://github.com/yourusername/DARIL.git
cd DARIL
pip install -r requirements.txt

Dependencies:
- Python 3.8+
- PyTorch 1.12+
- Transformers (GPT-2)
- timm (Swin Transformer)
- Standard ML libraries (numpy, pandas, scikit-learn)
CholecT50: 50 laparoscopic cholecystectomy videos with frame-level annotations
- Training: 40 videos (78,968 frames)
- Testing: 10 videos (21,895 frames)
- Sampling: 1 FPS
- Classes: 100 distinct IVT triplets
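Annotations are multi-label per frame (0-3 simultaneous triplets out of 100 classes). A minimal sketch of turning one frame's annotation into a multi-hot target vector, assuming integer triplet IDs; the helper name and example IDs are hypothetical, not the dataset's API:

```python
import numpy as np

NUM_TRIPLETS = 100  # distinct IVT triplet classes in CholecT50

def make_frame_target(active_triplet_ids):
    """Multi-hot target for one frame: 0-3 simultaneous IVT triplets."""
    target = np.zeros(NUM_TRIPLETS, dtype=np.float32)
    target[list(active_triplet_ids)] = 1.0
    return target

# Example: a frame annotated with two simultaneous actions (IDs are illustrative).
y = make_frame_target([12, 57])
```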
# Train DARIL model
python scripts/run_experiment_v8.py --config configs/config_dgx_all_v8.yaml --data_path /path/to/cholect50
# Evaluate on test set
python scripts/evaluate.py --checkpoint outputs/models_saved/daril_best.pth --horizon 10
# Run multi-horizon planning evaluation
python scripts/evaluate_planning.py --checkpoint outputs/models_saved/daril_best.pth

# DARIL baseline (Imitation Learning)
python scripts/run_experiment_v8.py \
--method daril \
--config configs/config_dgx_all_v8.yaml \
--epochs 100 \
--lr 1e-4
# Direct Video RL
python scripts/run_experiment_v8.py \
--method direct_video_rl \
--config configs/config_rl.yaml
# World Model RL (Dreamer-based)
python scripts/run_experiment_v8.py \
--method world_model_rl \
--config configs/config_world_model.yaml
# Inverse RL
python scripts/run_experiment_v8.py \
--method inverse_rl \
    --config configs/config_irl.yaml

# Generate all figures for the paper
python scripts/run_paper_generation.py \
--results_dir outputs/results \
    --output_dir outputs/figures

If you find this work useful, please cite:
@article{boels2025daril,
title={DARIL: When Imitation Learning outperforms Reinforcement Learning in Surgical Action Planning},
author={Boels, Maxence and Robertshaw, Harry and Booth, Thomas C and Dasgupta, Prokar and Granados, Alejandro and Ourselin, Sebastien},
journal={arXiv preprint arXiv:2507.05011},
year={2025},
note={Accepted at MICCAI 2025 COLAS Workshop}
}

- Paper: arXiv:2507.05011
- Conference: MICCAI 2025 COLAS Workshop
- Dataset: CholecT50
- Lab: Surgical & Interventional Engineering, KCL
Contributions are welcome! Please open an issue or submit a pull request for:
- Bug fixes
- Feature enhancements
- Improved RL implementations
- Extensions to other surgical datasets
This project is licensed under the MIT License - see the LICENSE file for details.
- CholecT50 dataset by CAMMA, University of Strasbourg
- Pre-trained Swin Transformer models
- OpenAI GPT-2 architecture
- Dreamer world model implementation
- Single dataset evaluation (CholecT50) - generalization to other procedures needed
- Expert test data may favor IL - sub-expert scenarios unexplored
- Evaluation metrics reward expert-like behavior - outcome-focused metrics needed
- RL state-action representations require further optimization
- Limited dataset size may cause overfitting - larger datasets and simulators needed
Future Directions: Cross-dataset evaluation, outcome-based metrics, physics simulators, comprehensive state-action-reward modeling
For questions or collaboration:
- Maxence Boels: [email protected]
- GitHub Issues: Open an issue