🔥🔥🔥 Multimodal Large Language Models for Remote Sensing: A Survey
School of Artificial Intelligence, OPtics, and ElectroNics (iOPEN), Northwestern Polytechnical University
✨✨✨ Behold our meticulously curated trove of RS-MLLM resources!
🎉🚀💡 The website will be updated in real time to track the latest state of RS-MLLMs!
📑📚🔍 Feast your eyes on an assortment of model architectures, training pipelines, datasets, comprehensive evaluation benchmarks, intelligent agents for remote sensing, instruction-tuning techniques, and much more.
🌟🔥📢 A collection of remote sensing multimodal large language model papers focusing on the vision-language domain.
In this repository, we collect and document researchers and their outstanding work on remote sensing multimodal large language models (vision-language).
- The list will be continuously updated 🔥🔥
- 📦 coming soon! 🚀
- May-22-2024: The first RS-MLLMs review manuscript has been submitted for review. 🔥🔥
Table of Contents
- Awesome Papers
- Awesome Datasets
- Latest Evaluation Benchmarks for Remote Sensing Vision-Language Tasks
Title | Venue | Date | Code | Note |
---|---|---|---|---|
A semantic-enhanced multi-modal remote sensing foundation model for Earth observation Wu, K., Zhang, Y., Ru, L., Dang, B., Lao, J., Yu, L., ... & Li, Y. | Nature Machine Intelligence | 2025-08-04 | - | - |
Remote Sensing Large Vision-Language Model: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling Sungjune Park, Yeongyun Kim, Se Yeon Kim, Yong Man Ro | arXiv | 2025-06-27 | - | - |
EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begum Demir, Nicu Sebe, Paolo Rota | arXiv | 2025-06-02 | Github | - |
GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, Hongzhen Wang, Wenjing Yang, Bo Du, Jing Zhang | arXiv | 2025-05-27 | Github | - |
TinyRS-R1: Compact Multimodal Language Model for Remote Sensing Aybora Koksal, A. Aydin Alatan | arXiv | 2025-05-17 | - | - |
EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery Wei Zhang, Miaoxin Cai, Yaqian Ning, Tong Zhang, Yin Zhuang, He Chen, Jun Li, Xuerui Mao | arXiv | 2025-04-17 | - | - |
SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, Xiangyong Cao | arXiv | 2025-04-13 | Github | - |
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shahbaz Khan, Salman Khan | CVPR-25 | 2025-04-07 | Github | - |
EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing H. Jiang, J. Yin, Q. Wang, J. Feng, G. Chen | arXiv | 2025-03-30 | Github | - |
OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence Yuan, L., Mo, F., Huang, K., Wang, W., Zhai, W., Zhu, X., ... & Nie, J. Y. | arXiv | 2025-03-20 | - | - |
Falcon: A Remote Sensing Vision-Language Foundation Model K. Yao, N. Xu, R. Yang, Y. Xu, Z. Gao, T. Kitrungrotsakul, Y. Ren, P. Zhang, J. Wang, N. Wei, C. Li | arXiv | 2025-03-14 | Github | - |
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning J. Luo, Y. Zhang, X. Yang, K. Wu, Q. Zhu, L. Liang, J. Chen, Y. Li | arXiv | 2025-03-10 | Github | - |
Co-LLaVA: Efficient Remote Sensing Visual Question Answering via Model Collaboration Liu, F., Dai, W., Zhang, C., Zhu, J., Yao, L., & Li, X. | arXiv | 2025-01-29 | Github | - |
GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing A. Shabbir, M. Zumri, M. Bennamoun, F. S. Khan, S. Khan | arXiv | 2025-01-23 | Github | - |
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing R. Ou, Y. Hu, F. Zhang, J. Chen, Y. Liu | arXiv | 2025-01-12 | Github | accepted by GRSM-25 |
RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts Xu Liu, Zhouhui Lian | arXiv | 2024-12-10 | Github | - |
RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and Grounded Tasks P. Wang, H. Hu, B. Tong, Z. Zhang, F. Yao, Y. Feng, Z. Zhu, H. Chang, W. Diao, Q. Ye, and X. Sun | TGRS | 2024-12-04 | - | - |
GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding Y. Zhou, M. Lan, X. Li, Y. Ke, X. Jiang, L. Feng, and W. Zhang | arXiv | 2024-11-16 | Github | - |
Large Vision-Language Models for Remote Sensing Visual Question Answering S. Siripong, A. Chaiyapan, and T. Phonchai | arXiv | 2024-11-16 | - | - |
RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering Lin, H., Hong, D., Ge, S., Luo, C., Jiang, K., Jin, H., and Wen, C. | arXiv | 2024-11-03 | - | accepted by TGRS-25 |
GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing Elgendy, H., Sharshar, A., Aboeitta, A., Ashraf, Y., and Guizani, M. | arXiv | 2024-10-25 | Github | - |
TEOChat: Large Language and Vision Assistant for Temporal Earth Observation Data J. A. Irvin et al. | arXiv | 2024-10-08 | Github | - |
CDChat: A Large Multimodal Model for Remote Sensing Change Description Noman, M., Ahsan, N., Naseer, M., Cholakkal, H., Anwer, R. M., Khan, S., & Khan, F. S. | arXiv | 2024-09-24 | - | - |
EarthMarker: A Visual Prompting MLLM for Region-level and Point-level Remote Sensing Imagery Comprehension Zhang, W., Cai, M., Zhang, T., Zhuang, Y., and Mao, X. | arXiv | 2024-07-18 | Github | accepted by TGRS-24 |
SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding J. Luo et al. | arXiv | 2024-06-14 | Github | - |
RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery Y. Bazi, L. Bashmal, M. M. Al Rahhal, R. Ricci, and F. Melgani. | Remote Sensing | 2024-04-23 | Github | - |
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis [H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model] C. Pang, W. Jiang, L. Jiayu, L. Yi, S. Jiaxing, L. Weijia, W. Xingxing, W. Shuai, F. Litong, X. Guisong, H. Conghui. | arXiv | 2024-03-29 | Github | accepted by AAAI-25 |
Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery W. Zhang, M. Cai, T. Zhang, G. Lei, Y. Zhuang, and X. Mao. | arXiv | 2024-03-06 | - | accepted by JSTARS-24 |
Large Language Models for Captioning and Retrieving Remote Sensing Images J. D. Silva, J. Magalhaes, and D. Tuia. | arXiv | 2024-02-09 | - | - |
LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation Z. Li, D. Muhtar, F. Gu, X. Zhang, P. Xiao, G. He, and X. Zhu | arXiv | 2024-02-04 | Github | accepted by ISPRS-25 |
LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model D. Muhtar, Z. Li, F. Gu, X. Zhang, and P. Xiao. | arXiv | 2024-02-04 | Github | accepted by ECCV-24 |
EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain W. Zhang, M. Cai, T. Zhang, Y. Zhuang, and X. Mao. | arXiv | 2024-01-30 | Github | accepted by IEEE-TGRS |
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model Y. Zhan, Z. Xiong, and Y. Yuan. | arXiv | 2024-01-18 | Github | accepted by ISPRS-25 |
GeoChat: Grounded Large Vision-Language Model for Remote Sensing K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. Khan, and F. S. Khan. | arXiv | 2023-11-24 | Github | accepted by CVPR-24 |
RSGPT: A Remote Sensing Vision Language Model and Benchmark Y. Hu, J. Yuan, and C. Wen. | arXiv | 2023-07-28 | Github | accepted by ISPRS-25 |
Title | Venue | Date | Code | Note |
---|---|---|---|---|
RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents W. Xu, Z. Yu, Y. Wang, J. Wang, and M. Peng. | arXiv | 2024-06-11 | - | - |
GeoLLM-Engine: A Realistic Environment for Building Geospatial Copilots S. Singh, M. Fore, D. Stamoulis, and D. Group. | arXiv | 2024-04-23 | - | - |
Evaluating Tool-Augmented Agents in Remote Sensing Platforms S. Singh, M. Fore, and D. Stamoulis. | arXiv | 2024-04-23 | - | - |
Change-Agent: Towards Interactive Comprehensive Remote Sensing Change Interpretation and Analysis C. Liu, K. Chen, H. Zhang, Z. Qi, Z. Zou, and Z. Shi. | arXiv | 2024-04-01 | Github | - |
Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models H. Guo, X. Su, C. Wu, B. Du, L. Zhang, and D. Li. | arXiv | 2024-01-17 | Github | - |
Tree-GPT: Modular Large Language Model Expert System for Forest Remote Sensing Image Understanding and Interactive Analysis S. Du, S. Tang, W. Wang, X. Li, and R. Guo. | arXiv | 2023-10-07 | - | - |
Title | Venue | Date | Code | Note |
---|---|---|---|---|
Vision-Language Modeling Meets Remote Sensing: Models, datasets, and perspectives Xingxing Weng; Chao Pang; Gui-Song Xia. | GRSM | 2025-06-09 | - | - |
A Survey on Remote Sensing Foundation Models: From Vision to Multimodality Z. Huang, H. Yan, Q. Zhan, S. Yang, M. Zhang, C. Zhang, Y. Lei, Z. Liu, Q. Liu, Y. Wang | arXiv | 2025-03-28 | Github | arXiv |
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing Z. Zhang, H. Shen, T. Zhao, B. Chen, Z. Guan, Y. Wang, X. Jia, Y. Cai, Y. Shang, J. Yin. | arXiv | 2025-03-16 | - | - |
When Remote Sensing Meets Foundation Model: A Survey and Beyond Huo, Chunlei; Chen, Keming; Zhang, Shuaihao; Wang, Zeyu; Yan, Heyu; Shen, Jing; Hong, Yuyang; Qi, Geqi; Fang, Hongmei; Wang, Zihan. | Remote Sensing | 2025-01-07 | - | - |
Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey C. Liu, J. Zhang, K. Chen, M. Wang, Z. Zou, and Z. Shi | arXiv | 2024-12-03 | Github | arXiv |
From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing X. Sun, B. Peng, C. Zhang, F. Jin, Q. Niu, J. Liu, K. Chen, M. Li, P. Feng, Z. Bi, M. Liu, and Y. Zhang. | arXiv | 2024-11-05 | - | - |
Foundation Models for Remote Sensing and Earth Observation: A Survey A. Xiao, W. Xuan, J. Wang, J. Huang, D. Tao, S. Lu, and N. Yokoya. | arXiv | 2024-10-22 | Github | accepted by GRSM-25 |
Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques L. Tao, H. Zhang, H. Jing, Y. Liu, K. Yao, C. Li, and X. Xue. | arXiv | 2024-10-15 | Github | accepted by Remote Sensing-25 |
Towards Vision-Language Geo-Foundation Model: A Survey Y. Zhou, L. Feng, Y. Ke, X. Jiang, J. Yan, and W. Zhang. | arXiv | 2024-06-13 | Github | arXiv |
Vision-Language Models in Remote Sensing: Current progress and future trends X. Li, C. Wen, Y. Hu, Z. Yuan, and X. X. Zhu. | MGRS | 2024-04-22 | - | - |
Language Integration in Remote Sensing: Tasks, datasets, and future directions L. Bashmal, Y. Bazi, F. Melgani, M. M. Al Rahhal, and M. A. Al Zuair. | MGRS | 2023-10-11 | - | - |
Brain-Inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey L. Jiao et al. | JSTARS | 2023-09-18 | - | - |
Title | Venue | Date | Code | Note |
---|---|---|---|---|
On the Foundations of Earth and Climate Foundation Models X. X. Zhu et al. | arXiv | 2024-05-07 | Github | - |
On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications C. Tan et al. | arXiv | 2023-12-23 | - | - |
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs J. Roberts, T. Lüddecke, R. Sheikh, K. Han, and S. Albanie. | arXiv | 2023-11-24 | Github | - |
The Potential of Visual ChatGPT for Remote Sensing L. P. Osco, E. L. de Lemos, W. N. Gonçalves, A. P. M. Ramos, and J. Marcato Junior. | Remote Sensing | 2023-06-22 | - | - |
Title | Venue | Date | Code | Note |
---|---|---|---|---|
J. Ge, Y. Zheng, K. Guo, and J. Liang. | arXiv | 2024-08-27 | Github | Link |
Z. Yuan, Z. Xiong, L. Mou, and X. X. Zhu. | arXiv | 2024-02-17 | Github | Link |
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing Z. Zhang, T. Zhao, Y. Guo, and J. Yin. | arXiv | 2024-01-02 | Github | - |
SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing Z. Wang, R. Prabha, T. Huang, J. Wu, and R. Rajagopal. | AAAI | 2024-03-24 | Github | arXiv |
If you have any questions about this project, please feel free to contact [email protected].