Awesome-Remote-Sensing-Multimodal-Large-Language-Models

🔥🔥🔥 Multimodal Large Language Models for Remote Sensing: A Survey

School of Artificial Intelligence, OPtics, and ElectroNics (iOPEN), Northwestern Polytechnical University

✨ The first survey of Multimodal Large Language Models for Remote Sensing (RS-MLLMs).

✨✨✨ Behold our meticulously curated trove of RS-MLLMs resources!!!

🎉🚀💡 The website will be updated in real-time to track the latest state of RS-MLLMs!!!

📑📚🔍 Feast your eyes on an assortment of model architectures, training pipelines, datasets, comprehensive evaluation benchmarks, intelligent agents for remote sensing, techniques for instruction tuning, and much more.

🌟🔥📢 A collection of remote sensing multimodal large language model papers focusing on the vision-language domain.

🍎 Multimodal Large Language Models for Remote Sensing

🍎 Intelligent Agents for Remote Sensing

Please give this project a STAR ⭐ if it helps you

📢 Latest Updates

In this repository, we collect and document researchers and their outstanding work on remote sensing multimodal large language models (vision-language).

The list will be continuously updated 🔥🔥
📦 coming soon! 🚀
May-22-2024: The first RS-MLLMs review manuscript has been submitted for review. 🔥🔥

Table of Contents

Awesome Papers

Multimodal Large Language Models for Remote Sensing

Title	Venue	Date	Code	Note
A semantic-enhanced multi-modal remote sensing foundation model for Earth observation Wu, K., Zhang, Y., Ru, L., Dang, B., Lao, J., Yu, L., ... & Li, Y.	Nature Machine Intelligence	2025-08-04	-	-
Remote Sensing Large Vision-Language Model: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling Sungjune Park, Yeongyun Kim, Se Yeon Kim, Yong Man Ro	arXiv	2025-06-27	-	-
EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begum Demir, Nicu Sebe, Paolo Rota	arXiv	2025-06-02	Github	-
GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, Hongzhen Wang, Wenjing Yang, Bo Du, Jing Zhang	arXiv	2025-05-27	Github	-
TinyRS-R1: Compact Multimodal Language Model for Remote Sensing Aybora Koksal, A. Aydin Alatan	arXiv	2025-05-17	-	-
EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery Wei Zhang, Miaoxin Cai, Yaqian Ning, Tong Zhang, Yin Zhuang, He Chen, Jun Li, Xuerui Mao	arXiv	2025-04-17	-	-
SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, Xiangyong Cao	arXiv	2025-04-13	Github	-
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shahbaz Khan, Salman Khan	CVPR-25	2025-04-07	Github	-
EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing H. Jiang, J. Yin, Q. Wang, J. Feng, G. Chen	arXiv	2025-03-30	Github	-
OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence Yuan, L., Mo, F., Huang, K., Wang, W., Zhai, W., Zhu, X., ... & Nie, J. Y.	arXiv	2025-03-20	-	-
Falcon: A Remote Sensing Vision-Language Foundation Model K. Yao, N. Xu, R. Yang, Y. Xu, Z. Gao, T. Kitrungrotsakul, Y. Ren, P. Zhang, J. Wang, N. Wei, C. Li	arXiv	2025-03-14	Github	-
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning J. Luo, Y. Zhang, X. Yang, K. Wu, Q. Zhu, L. Liang, J. Chen, Y. Li	arXiv	2025-03-10	Github	-
Co-LLaVA: Efficient Remote Sensing Visual Question Answering via Model Collaboration Liu, F., Dai, W., Zhang, C., Zhu, J., Yao, L., & Li, X.	arXiv	2025-01-29	Github	-
GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing A. Shabbir, M. Zumri, M. Bennamoun, F. S. Khan, S. Khan	arXiv	2025-01-23	Github	-
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing R. Ou, Y. Hu, F. Zhang, J. Chen, Y. Liu	arXiv	2025-01-12	Github	accepted by GRSM-25
RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts Xu Liu, Zhouhui Lian	arXiv	2024-12-10	Github	-
RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and grounded tasks P. Wang, H. Hu, B. Tong, Z. Zhang, F. Yao, Y. Feng, Z. Zhu, H. Chang, W. Diao, Q. Ye, and X. Sun	T-GRS	2024-12-04	-	-
GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding Y. Zhou, M. Lan, X. Li, Y. Ke, X. Jiang, L. Feng, and W. Zhang	arXiv	2024-11-16	Github	-
Large Vision-Language Models for Remote Sensing Visual Question Answering S. Siripong, A. Chaiyapan, and T. Phonchai	arXiv	2024-11-16	-	-
RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering Lin, H., Hong, D., Ge, S., Luo, C., Jiang, K., Jin, H., and Wen, C	arXiv	2024-11-03	-	accepted by TGRS-25
GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing Elgendy, H., Sharshar, A., Aboeitta, A., Ashraf, Y., and Guizani, M.	arXiv	2024-10-25	Github	-
TEOChat: Large Language and Vision Assistant for Temporal Earth Observation Data J. A. Irvin et al.	arXiv	2024-10-08	Github	-
CDChat: A Large Multimodal Model for Remote Sensing Change Description Noman, M., Ahsan, N., Naseer, M., Cholakkal, H., Anwer, R. M., Khan, S., & Khan, F. S.	arXiv	2024-09-24	-	-
EarthMarker: A Visual Prompting MLLM for Region-level and Point-level Remote Sensing Imagery Comprehension Zhang, W., Cai, M., Zhang, T., Zhuang, Y., and Mao, X.	arXiv	2024-07-18	Github	accepted by TGRS-24
SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding J. Luo et al.	arXiv	2024-06-14	Github	-
RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery Y. Bazi, L. Bashmal, M. M. Al Rahhal, R. Ricci, and F. Melgani.	Remote Sensing	2024-04-23	Github	-
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis [H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model] C. Pang, W. Jiang, L. Jiayu, L. Yi, S. Jiaxing, L. Weijia, W. Xingxing, W. Shuai, F. Litong, X. Guisong, H. Conghui.	arXiv	2024-03-29	Github	accepted by AAAI-25
Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery W. Zhang, M. Cai, T. Zhang, G. Lei, Y. Zhuang, and X. Mao.	arXiv	2024-03-06	-	accepted by JSTARS-24
Large Language Models for Captioning and Retrieving Remote Sensing Images J. D. Silva, J. Magalhaes, and D. Tuia.	arXiv	2024-02-09	-	-
LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation Z. Li, D. Muhtar, F. Gu, X. Zhang, P. Xiao, G. He, and X. Zhu	arXiv	2024-02-04	Github	accepted by ISPRS-25
LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model D. Muhtar, Z. Li, F. Gu, X. Zhang, and P. Xiao.	arXiv	2024-02-04	Github	accepted by ECCV-24
EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain W. Zhang, M. Cai, T. Zhang, Y. Zhuang, and X. Mao.	arXiv	2024-01-30	Github	accepted by IEEE-TGRS
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model Y. Zhan, Z. Xiong, and Y. Yuan.	arXiv	2024-01-18	Github	accepted by ISPRS-25
GeoChat: Grounded Large Vision-Language Model for Remote Sensing K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. Khan, and F. S. Khan.	arXiv	2023-11-24	Github	accepted by CVPR-24
RSGPT: A Remote Sensing Vision Language Model and Benchmark Y. Hu, J. Yuan, and C. Wen.	arXiv	2023-07-28	Github	accepted by ISPRS-25

Intelligent Agents for Remote Sensing

Title	Venue	Date	Code	Note
RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents W. Xu, Z. Yu, Y. Wang, J. Wang, and M. Peng.	arXiv	2024-06-11	-	-
GeoLLM-Engine: A Realistic Environment for Building Geospatial Copilots S. Singh, M. Fore, D. Stamoulis, and D. Group.	arXiv	2024-04-23	-	-
Evaluating Tool-Augmented Agents in Remote Sensing Platforms S. Singh, M. Fore, and D. Stamoulis.	arXiv	2024-04-23	-	-
Change-Agent: Towards Interactive Comprehensive Remote Sensing Change Interpretation and Analysis C. Liu, K. Chen, H. Zhang, Z. Qi, Z. Zou, and Z. Shi.	arXiv	2024-04-01	Github	-
Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models H. Guo, X. Su, C. Wu, B. Du, L. Zhang, and D. Li.	arXiv	2024-01-17	Github	-
Tree-GPT: Modular Large Language Model Expert System for Forest Remote Sensing Image Understanding and Interactive Analysis S. Du, S. Tang, W. Wang, X. Li, and R. Guo.	arXiv	2023-10-07	-	-

Vision-Language Pre-training Models for Remote Sensing

Title	Venue	Date	Code	Note
LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text W. Chen, J. Chen, Y. Deng, J. Chen, Y. Feng, Z. Xi, D. Liu, K. Li, Y. Meng.	arXiv	2025-03-25	Github	arXiv
DOFA-CLIP: Multimodal Vision-Language Foundation Models for Earth Observation Zhitong Xiong, Yi Wang, Weikang Yu, Adam J Stewart, Jie Zhao, Nils Lehmann, Thomas Dujardin, Zhenghang Yuan, Pedram Ghamisi, Xiao Xiang Zhu.	arXiv	2025-03-08	Github	-
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing Z. Zhang, T. Zhao, Y. Guo, and J. Yin.	arXiv	2024-01-02	Github	accepted by IEEE-TGRS
RemoteCLIP: A Vision Language Foundation Model for Remote Sensing F. Liu, D. Chen, Z. Guan, X. Zhou, J. Zhu, and J. Zhou.	T-GRS	2024-04-18	Github	arXiv
Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment U. Mall, C. P. Phoo, M. K. Liu, C. Vondrick, B. Hariharan, and K. Bala.	ICLR	2024-01-16	Project	arXiv
RS-CLIP: Zero Shot Remote Sensing Scene Classification via Contrastive Vision-Language Supervision X. Li, C. Wen, Y. Hu, and N. Zhou.	JAG	2023-09-18	Github	-
Parameter-Efficient Transfer Learning for Remote Sensing Image–Text Retrieval Y. Yuan, Y. Zhan, and Z. Xiong.	T-GRS	2023-08-28	Github	arXiv

Survey Papers for Remote Sensing Vision-Language Tasks

Title	Venue	Date	Code	Note
Vision-Language Modeling Meets Remote Sensing: Models, datasets, and perspectives Xingxing Weng; Chao Pang; Gui-Song Xia.	GRSM	2025-06-09	-	-
A Survey on Remote Sensing Foundation Models: From Vision to Multimodality Z. Huang, H. Yan, Q. Zhan, S. Yang, M. Zhang, C. Zhang, Y. Lei, Z. Liu, Q. Liu, Y. Wang	arXiv	2025-03-28	Github	arXiv
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing Z. Zhang, H. Shen, T. Zhao, B. Chen, Z. Guan, Y. Wang, X. Jia, Y. Cai, Y. Shang, J. Yin.	arXiv	2025-03-16	-	-
When Remote Sensing Meets Foundation Model: A Survey and Beyond Huo, Chunlei; Chen, Keming; Zhang, Shuaihao; Wang, Zeyu; Yan, Heyu; Shen, Jing; Hong, Yuyang; Qi, Geqi; Fang, Hongmei; Wang, Zihan.	Remote Sensing	2025-01-07	-	-
Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey C. Liu, J. Zhang, K. Chen, M. Wang, Z. Zou, and Z. Shi	arXiv	2024-12-03	Github	arXiv
From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing X. Sun, B. Peng, C. Zhang, F. Jin, Q. Niu, J. Liu, K. Chen, M. Li, P. Feng, Z. Bi, M. Liu, and Y. Zhang.	arXiv	2024-11-05	-	-
Foundation Models for Remote Sensing and Earth Observation: A Survey A. Xiao, W. Xuan, J. Wang, J. Huang, D. Tao, S. Lu, and N. Yokoya.	arXiv	2024-10-22	Github	accepted by GRSM-25
Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques L. Tao, H. Zhang, H. Jing, Y. Liu, K. Yao, C. Li, and X. Xue.	arXiv	2024-10-15	Github	accepted by Remote Sensing-25
Towards Vision-Language Geo-Foundation Model: A Survey Y. Zhou, L. Feng, Y. Ke, X. Jiang, J. Yan, and W. Zhang.	arXiv	2024-06-13	Github	arXiv
Vision-Language Models in Remote Sensing: Current progress and future trends X. Li, C. Wen, Y. Hu, Z. Yuan, and X. X. Zhu.	MGRS	2024-04-22	-	-
Language Integration in Remote Sensing: Tasks, datasets, and future directions L. Bashmal, Y. Bazi, F. Melgani, M. M. Al Rahhal, and M. A. Al Zuair.	MGRS	2023-10-11	-	-
Brain-Inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey L. Jiao et al.	JSTARS	2023-09-18	-	-

Others

Title	Venue	Date	Code	Note
On the Foundations of Earth and Climate Foundation Models X. X. Zhu et al.	arXiv	2024-05-07	Github	-
On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications C. Tan et al.	arXiv	2023-12-23	-	-
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs J. Roberts, T. Lüddecke, R. Sheikh, K. Han, and S. Albanie.	arXiv	2023-11-24	Github	-
The Potential of Visual ChatGPT for Remote Sensing L. P. Osco, E. L. de Lemos, W. N. Gonçalves, A. P. M. Ramos, and J. Marcato Junior.	Remote Sensing	2023-06-22	-	-

Awesome Datasets

Datasets of Pre-Training for Alignment

Title	Venue	Date	Code	Note
RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models J. Ge, Y. Zheng, K. Guo, and J. Liang.	arXiv	2024-08-27	Github	Link
ChatEarthNet: A Global-Scale, High-Quality Image-Text Dataset for Remote Sensing Z. Yuan, Z. Xiong, L. Mou, and X. X. Zhu.	arXiv	2024-02-17	Github	Link
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing Z. Zhang, T. Zhao, Y. Guo, and J. Yin.	arXiv	2024-01-02	Github	-
SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing Z. Wang, R. Prabha, T. Huang, J. Wu, and R. Rajagopal.	AAAI	2024-03-24	Github	arXiv

Datasets of Multimodal Instruction Tuning

Name	Paper	Link	Note
DDFAV	DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark	Link	27.7k
VRSBench	VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding	Link	29.6k
FIT-RS	SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding	Link	1800.8k
RS-GPT4V	RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding	Link	991k
RS-instructions	RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery	Link	7,058
SkyEye-968k	SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model	Link	968k
Multi-task Instruction	LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model	Link	42,322
MMRS-1M	EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain	Link	>1M
RS-ClsQaGrd-Instruct	H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model	Link	78k
MMShip	Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery	Link	81k
RS-Specialized-Instruct	H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model	Link	29.8k
RS multimodal instruction	GeoChat: Grounded Large Vision-Language Model for Remote Sensing	Link	318k
LHRS-Instruct	LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model	Link	39.8k
HqDC-Instruct	H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model	Link	30k

Latest Evaluation Benchmarks for Remote Sensing Vision-Language Tasks

Remote Sensing Image Captioning and Aerial Video Captioning

Remote Sensing Visual Question Answering and Remote Sensing Visual Grounding

Remote Sensing Image-Text Retrieval

Remote Sensing Scene Classification

🤖 Contact

If you have any questions about this project, please feel free to contact [email protected].

ZhanYang-nwpu/Awesome-Remote-Sensing-Multimodal-Large-Language-Model

Folders and files

Latest commit

History

Repository files navigation

Awesome-Remote-Sensing-Multimodal-Large-Language-Models

Please share a STAR ⭐ if this project does help

📢 Latest Updates

Awesome Papers

Multimodal Large Language Models for Remote Sensing

Intelligent Agents for Remote Sensing

Vision-Language Pre-training Models for Remote Sensing

Survey Papers for Remote Sensing Vision-Language Tasks

Others

Awesome Datasets

Datasets of Pre-Training for Alignment

Datasets of Multimodal Instruction Tuning

Latest Evaluation Benchmarks for Remote Sensing Vision-Language Tasks

Remote Sensing Image Captioning and Aerial Video Captioning

Remote Sensing Visual Question Answering and Remote Sensing Visual Grounding

Remote Sensing Image-Text Retrieval

Remote Sensing Scene Classification

🤖 Contact

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages