ZhanYang-nwpu/Awesome-Remote-Sensing-Multimodal-Large-Language-Model

Awesome-Remote-Sensing-Multimodal-Large-Language-Models

🔥🔥🔥 Multimodal Large Language Models for Remote Sensing: A Survey
[Project Page] This Page

School of Artificial Intelligence, OPtics, and ElectroNics (iOPEN), Northwestern Polytechnical University

✨ The first survey for Multimodal Large Language Models for Remote Sensing (RS-MLLMs).

✨✨✨ Behold our meticulously curated trove of RS-MLLMs resources!!!

🎉🚀💡 The website will be updated in real-time to track the latest state of RS-MLLMs!!!

📑📚🔍 Feast your eyes on an assortment of model architectures, training pipelines, datasets, comprehensive evaluation benchmarks, intelligent agents for remote sensing, techniques for instruction tuning, and much more.

🌟🔥📢 A collection of remote sensing multimodal large language model papers focusing on the vision-language domain.

🍎 Multimodal Large Language Models for Remote Sensing

🍎 Intelligent Agents for Remote Sensing

Please give us a STAR ⭐ if this project helps you.

📢 Latest Updates

In this repository, we will collect and document researchers and their outstanding work related to remote sensing multimodal large language models (vision-language).

  • The list will be continuously updated 🔥🔥
  • 📦 coming soon! 🚀
  • May-22-2024: The first RS-MLLMs review manuscript has been submitted for review. 🔥🔥

Awesome Papers

Multimodal Large Language Models for Remote Sensing

Title Venue Date Code Note
A semantic-enhanced multi-modal remote sensing foundation model for Earth observation
Wu, K., Zhang, Y., Ru, L., Dang, B., Lao, J., Yu, L., ... & Li, Y.
Nature Machine Intelligence 2025-08-04 - -
Remote Sensing Large Vision-Language Model: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling
Sungjune Park, Yeongyun Kim, Se Yeon Kim, Yong Man Ro
arXiv 2025-06-27 - -
EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models
Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begum Demir, Nicu Sebe, Paolo Rota
arXiv 2025-06-02 Github -
GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution
Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, Hongzhen Wang, Wenjing Yang, Bo Du, Jing Zhang
arXiv 2025-05-27 Github -
TinyRS-R1: Compact Multimodal Language Model for Remote Sensing
Aybora Koksal, A. Aydin Alatan
arXiv 2025-05-17 - -
EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery
Wei Zhang, Miaoxin Cai, Yaqian Ning, Tong Zhang, Yin Zhuang, He Chen, Jun Li, Xuerui Mao
arXiv 2025-04-17 - -
SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model
Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, Xiangyong Cao
arXiv 2025-04-13 Github -
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues
Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shahbaz Khan, Salman Khan
CVPR-25 2025-04-07 Github -
EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing
H. Jiang, J. Yin, Q. Wang, J. Feng, G. Chen
arXiv 2025-03-30 Github -
OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence
Yuan, L., Mo, F., Huang, K., Wang, W., Zhai, W., Zhu, X., ... & Nie, J. Y.
arXiv 2025-03-20 - -
Falcon: A Remote Sensing Vision-Language Foundation Model
K. Yao, N. Xu, R. Yang, Y. Xu, Z. Gao, T. Kitrungrotsakul, Y. Ren, P. Zhang, J. Wang, N. Wei, C. Li
arXiv 2025-03-14 Github -
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
J. Luo, Y. Zhang, X. Yang, K. Wu, Q. Zhu, L. Liang, J. Chen, Y. Li
arXiv 2025-03-10 Github -
Co-LLaVA: Efficient Remote Sensing Visual Question Answering via Model Collaboration
Liu, F., Dai, W., Zhang, C., Zhu, J., Yao, L., & Li, X.
arXiv 2025-01-29 Github -
GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing
A. Shabbir, M. Zumri, M. Bennamoun, F. S. Khan, S. Khan
arXiv 2025-01-23 Github -
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing
R. Ou, Y. Hu, F. Zhang, J. Chen, Y. Liu
arXiv 2025-01-12 Github accepted by GRSM-25
RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts
Xu Liu, Zhouhui Lian
arXiv 2024-12-10 Github -
RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and Grounded Tasks
P. Wang, H. Hu, B. Tong, Z. Zhang, F. Yao, Y. Feng, Z. Zhu, H. Chang, W. Diao, Q. Ye, and X. Sun
T-GRS 2024-12-04 - -
GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding
Y. Zhou, M. Lan, X. Li, Y. Ke, X. Jiang, L. Feng, and W. Zhang
arXiv 2024-11-16 Github -
Large Vision-Language Models for Remote Sensing Visual Question Answering
S. Siripong, A. Chaiyapan, and T. Phonchai
arXiv 2024-11-16 - -
RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering
Lin, H., Hong, D., Ge, S., Luo, C., Jiang, K., Jin, H., and Wen, C.
arXiv 2024-11-03 - accepted by TGRS-25
GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing
Elgendy, H., Sharshar, A., Aboeitta, A., Ashraf, Y., and Guizani, M.
arXiv 2024-10-25 Github -
TEOChat: Large Language and Vision Assistant for Temporal Earth Observation Data
J. A. Irvin et al.
arXiv 2024-10-08 Github -
CDChat: A Large Multimodal Model for Remote Sensing Change Description
Noman, M., Ahsan, N., Naseer, M., Cholakkal, H., Anwer, R. M., Khan, S., & Khan, F. S.
arXiv 2024-09-24 - -
EarthMarker: A Visual Prompting MLLM for Region-level and Point-level Remote Sensing Imagery Comprehension
Zhang, W., Cai, M., Zhang, T., Zhuang, Y., and Mao, X.
arXiv 2024-07-18 Github accepted by TGRS-24
SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding
J. Luo et al.
arXiv 2024-06-14 Github -
RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery
Y. Bazi, L. Bashmal, M. M. Al Rahhal, R. Ricci, and F. Melgani.
Remote Sensing 2024-04-23 Github -
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis [H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model]
C. Pang, W. Jiang, L. Jiayu, L. Yi, S. Jiaxing, L. Weijia, W. Xingxing, W. Shuai, F. Litong, X. Guisong, H. Conghui.
arXiv 2024-03-29 Github accepted by AAAI-25
Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery
W. Zhang, M. Cai, T. Zhang, G. Lei, Y. Zhuang, and X. Mao.
arXiv 2024-03-06 - accepted by JSTARS-24
Large Language Models for Captioning and Retrieving Remote Sensing Images
J. D. Silva, J. Magalhaes, and D. Tuia.
arXiv 2024-02-09 - -
LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation
Z. Li, D. Muhtar, F. Gu, X. Zhang, P. Xiao, G. He, and X. Zhu
arXiv 2024-02-04 Github accepted by ISPRS-25
LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model
D. Muhtar, Z. Li, F. Gu, X. Zhang, and P. Xiao.
arXiv 2024-02-04 Github accepted by ECCV-24
EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain
W. Zhang, M. Cai, T. Zhang, Y. Zhuang, and X. Mao.
arXiv 2024-01-30 Github accepted by IEEE-TGRS
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model
Y. Zhan, Z. Xiong, and Y. Yuan.
arXiv 2024-01-18 Github accepted by ISPRS-25
GeoChat: Grounded Large Vision-Language Model for Remote Sensing
K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. Khan, and F. S. Khan.
arXiv 2023-11-24 Github accepted by CVPR-24
RSGPT: A Remote Sensing Vision Language Model and Benchmark
Y. Hu, J. Yuan, and C. Wen.
arXiv 2023-07-28 Github accepted by ISPRS-25

Intelligent Agents for Remote Sensing

Title Venue Date Code Note
RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents
W. Xu, Z. Yu, Y. Wang, J. Wang, and M. Peng.
arXiv 2024-06-11 - -
GeoLLM-Engine: A Realistic Environment for Building Geospatial Copilots
S. Singh, M. Fore, D. Stamoulis, and D. Group.
arXiv 2024-04-23 - -
Evaluating Tool-Augmented Agents in Remote Sensing Platforms
S. Singh, M. Fore, and D. Stamoulis.
arXiv 2024-04-23 - -
Change-Agent: Towards Interactive Comprehensive Remote Sensing Change Interpretation and Analysis
C. Liu, K. Chen, H. Zhang, Z. Qi, Z. Zou, and Z. Shi.
arXiv 2024-04-01 Github -
Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models
H. Guo, X. Su, C. Wu, B. Du, L. Zhang, and D. Li.
arXiv 2024-01-17 Github -
Tree-GPT: Modular Large Language Model Expert System for Forest Remote Sensing Image Understanding and Interactive Analysis
S. Du, S. Tang, W. Wang, X. Li, and R. Guo.
arXiv 2023-10-07 - -

Vision-Language Pre-training Models for Remote Sensing

Title Venue Date Code Note
LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text
W. Chen, J. Chen, Y. Deng, J. Chen, Y. Feng, Z. Xi, D. Liu, K. Li, Y. Meng.
arXiv 2025-03-25 Github arXiv
DOFA-CLIP: Multimodal Vision-Language Foundation Models for Earth Observation
Zhitong Xiong, Yi Wang, Weikang Yu, Adam J Stewart, Jie Zhao, Nils Lehmann, Thomas Dujardin, Zhenghang Yuan, Pedram Ghamisi, Xiao Xiang Zhu.
arXiv 2025-03-08 Github -
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing
Z. Zhang, T. Zhao, Y. Guo, and J. Yin.
arXiv 2024-01-02 Github accepted by IEEE-TGRS
RemoteCLIP: A Vision Language Foundation Model for Remote Sensing
F. Liu, D. Chen, Z. Guan, X. Zhou, J. Zhu, and J. Zhou.
T-GRS 2024-04-18 Github arXiv
Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment
U. Mall, C. P. Phoo, M. K. Liu, C. Vondrick, B. Hariharan, and K. Bala.
ICLR 2024-01-16 Project arXiv
RS-CLIP: Zero Shot Remote Sensing Scene Classification via Contrastive Vision-Language Supervision
X. Li, C. Wen, Y. Hu, and N. Zhou.
JAG 2023-09-18 Github -
Parameter-Efficient Transfer Learning for Remote Sensing Image–Text Retrieval
Y. Yuan, Y. Zhan, and Z. Xiong.
T-GRS 2023-08-28 Github arXiv

Survey Papers for Remote Sensing Vision-Language Tasks

Title Venue Date Code Note
Vision-Language Modeling Meets Remote Sensing: Models, datasets, and perspectives
Xingxing Weng; Chao Pang; Gui-Song Xia.
GRSM 2025-06-09 - -
A Survey on Remote Sensing Foundation Models: From Vision to Multimodality
Z. Huang, H. Yan, Q. Zhan, S. Yang, M. Zhang, C. Zhang, Y. Lei, Z. Liu, Q. Liu, Y. Wang
arXiv 2025-03-28 Github arXiv
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing
Z. Zhang, H. Shen, T. Zhao, B. Chen, Z. Guan, Y. Wang, X. Jia, Y. Cai, Y. Shang, J. Yin.
arXiv 2025-03-16 - -
When Remote Sensing Meets Foundation Model: A Survey and Beyond
Huo, Chunlei; Chen, Keming; Zhang, Shuaihao; Wang, Zeyu; Yan, Heyu; Shen, Jing; Hong, Yuyang; Qi, Geqi; Fang, Hongmei; Wang, Zihan.
Remote Sensing 2025-01-07 - -
Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey
C. Liu, J. Zhang, K. Chen, M. Wang, Z. Zou, and Z. Shi
arXiv 2024-12-03 Github arXiv
From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing
X. Sun, B. Peng, C. Zhang, F. Jin, Q. Niu, J. Liu, K. Chen, M. Li, P. Feng, Z. Bi, M. Liu, and Y. Zhang.
arXiv 2024-11-05 - -
Foundation Models for Remote Sensing and Earth Observation: A Survey
A. Xiao, W. Xuan, J. Wang, J. Huang, D. Tao, S. Lu, and N. Yokoya.
arXiv 2024-10-22 Github accepted by GRSM-25
Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques
L. Tao, H. Zhang, H. Jing, Y. Liu, K. Yao, C. Li, and X. Xue.
arXiv 2024-10-15 Github accepted by Remote Sensing-25
Towards Vision-Language Geo-Foundation Model: A Survey
Y. Zhou, L. Feng, Y. Ke, X. Jiang, J. Yan, and W. Zhang.
arXiv 2024-06-13 Github arXiv
Vision-Language Models in Remote Sensing: Current progress and future trends
X. Li, C. Wen, Y. Hu, Z. Yuan, and X. X. Zhu.
MGRS 2024-04-22 - -
Language Integration in Remote Sensing: Tasks, datasets, and future directions
L. Bashmal, Y. Bazi, F. Melgani, M. M. Al Rahhal, and M. A. Al Zuair.
MGRS 2023-10-11 - -
Brain-Inspired Remote Sensing Foundation Models and Open Problems: A Comprehensive Survey
L. Jiao et al.
JSTARS 2023-09-18 - -

Others

Title Venue Date Code Note
On the Foundations of Earth and Climate Foundation Models
X. X. Zhu et al.
arXiv 2024-05-07 Github -
On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications
C. Tan et al.
arXiv 2023-12-23 - -
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
J. Roberts, T. Lüddecke, R. Sheikh, K. Han, and S. Albanie.
arXiv 2023-11-24 Github -
The Potential of Visual ChatGPT for Remote Sensing
L. P. Osco, E. L. de Lemos, W. N. Gonçalves, A. P. M. Ramos, and J. Marcato Junior.
Remote Sensing 2023-06-22 - -

Awesome Datasets

Datasets of Pre-Training for Alignment

Title Venue Date Code Note
RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models
J. Ge, Y. Zheng, K. Guo, and J. Liang.
arXiv 2024-08-27 Github Link
ChatEarthNet: A Global-Scale, High-Quality Image-Text Dataset for Remote Sensing
Z. Yuan, Z. Xiong, L. Mou, and X. X. Zhu.
arXiv 2024-02-17 Github Link
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing
Z. Zhang, T. Zhao, Y. Guo, and J. Yin.
arXiv 2024-01-02 Github -
SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing
Z. Wang, R. Prabha, T. Huang, J. Wu, and R. Rajagopal.
AAAI 2024-03-24 Github arXiv

Datasets of Multimodal Instruction Tuning

Name Paper Link Note
DDFAV DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark Link 27.7k
VRSBench VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding Link 29.6k
FIT-RS SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding Link 1800.8k
RS-GPT4V RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding Link 991k
RS-instructions RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery Link 7,058
SkyEye-968k SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model Link 968k
Multi-task Instruction LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model Link 42,322
MMRS-1M EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain Link >1M
RS-ClsQaGrd-Instruct H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model Link 78k
MMShip Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery Link 81k
RS-Specialized-Instruct H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model Link 29.8k
RS multimodal instruction GeoChat: Grounded Large Vision-Language Model for Remote Sensing Link 318k
LHRS-Instruct LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model Link 39.8k
HqDC-Instruct H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model Link 30k

Latest Evaluation Benchmarks for Remote Sensing Vision-Language Tasks

Remote Sensing Image Captioning and Aerial Video Captioning

Remote Sensing Visual Question Answering and Remote Sensing Visual Grounding

Remote Sensing Image-Text Retrieval

Remote Sensing Scene Classification

🤖 Contact

If you have any questions about this project, please feel free to contact [email protected].
