Ruizhe Ou1 · Yuan Hu2,* · Fan Zhang2 · Jiaxin Chen1 · Yu Liu2,3
1Beijing University of Posts and Telecommunications · 2Peking University · 3Peking University Ordos Research Institute of Energy *corresponding authors
GeoPix is a new state-of-the-art pixel-level multi-modal large language model for the remote sensing domain, supporting referring image segmentation and other tasks.
- [2025.04.11] We release the annotations of GeoPixInstruct. HuggingFace🤗
- [2025.04.10] GeoPix has been accepted by GRSM (IEEE Geoscience and Remote Sensing Magazine).
- [2025.02.20] We release the pre-trained checkpoints, inference code and gradio demo!
- [2025.01.12] We release the paper.
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing [Arxiv]
In this work, we propose GeoPix, a remote sensing (RS) MLLM that extends image understanding capabilities to the pixel level. This is achieved by equipping the MLLM with a mask predictor, which transforms visual features from the vision encoder into masks conditioned on the LLM’s segmentation token embeddings. For more details, please refer to the paper.
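To make the idea of a mask "conditioned on a segmentation token embedding" concrete, here is a minimal sketch. It is not GeoPix's actual mask predictor (see the paper for that); it assumes the simplest possible conditioning, a dot product between each pixel's feature vector and the segmentation token embedding, followed by a sigmoid and a threshold. The function name `predict_mask` and the toy values are hypothetical.

```python
import math


def predict_mask(pixel_features, seg_embedding, threshold=0.5):
    """Toy mask predictor (illustrative assumption, not the GeoPix module).

    pixel_features: H x W grid of length-C feature vectors (nested lists).
    seg_embedding:  length-C embedding of the segmentation token.
    Returns an H x W binary mask.
    """
    mask = []
    for row in pixel_features:
        mask_row = []
        for feat in row:
            # Similarity between this pixel's features and the token embedding.
            logit = sum(f * e for f, e in zip(feat, seg_embedding))
            # Squash to a probability and binarize.
            prob = 1.0 / (1.0 + math.exp(-logit))
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask


if __name__ == "__main__":
    # 2 x 2 feature map with 2 channels (made-up numbers).
    features = [
        [[2.0, 0.0], [0.0, 2.0]],
        [[1.0, 1.0], [3.0, 1.0]],
    ]
    seg_token = [1.0, -1.0]
    print(predict_mask(features, seg_token))
```

Pixels whose features align with the token embedding get a positive logit and end up inside the mask; changing the embedding changes which pixels are selected, which is the sense in which the mask is conditioned on the LLM output.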
conda create -n geopix python=3.10 -y
conda activate geopix
pip install -r requirements.txt
mkdir pretrained_models
You can download the model directly from Huggingface, ModelScope or OpenXLab. You can also download it with a Python script:
# Huggingface
from huggingface_hub import snapshot_download
snapshot_download(repo_id="Norman-ou/GeoPix-ft-sior_rsicap", local_dir="./pretrained_models")
# ModelScope
from modelscope import snapshot_download
model_dir = snapshot_download("NormanOU/GeoPix-ft-sior_rsicap", local_dir="./pretrained_models")
Once you have prepared all models, the folder tree should look like this:
.
├── ...
├── model
├── pretrained_models
├── app.py
├── engine.py
├── ...
└── README.md
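As a quick sanity check before launching anything, the layout above can be verified with a short script. This is a minimal sketch: it only checks the entries named explicitly in the tree, and the helper name `missing_entries` is hypothetical rather than part of the repository.

```python
from pathlib import Path

# Entries named explicitly in the folder tree above.
EXPECTED_DIRS = ["model", "pretrained_models"]
EXPECTED_FILES = ["app.py", "engine.py", "README.md"]


def missing_entries(root="."):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    absent = missing_entries(".")
    if absent:
        print("Missing:", ", ".join(absent))
    else:
        print("Folder layout looks complete.")
```

Run it from the repository root; an empty result means every listed entry is in place.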
Run the following command to launch the Gradio demo:
python app.py
The instructions are well written. Enjoy our work.
Run the following command to run inference:
python inference.py
@ARTICLE{10994415,
author={Ou, Ruizhe and Hu, Yuan and Zhang, Fan and Chen, Jiaxin and Liu, Yu},
journal={IEEE Geoscience and Remote Sensing Magazine},
title={GeoPix: A multimodal large language model for pixel-level image understanding in remote sensing},
year={2025},
volume={},
number={},
pages={2-16},
keywords={Visualization;Image segmentation;Training;Integrated circuit modeling;Grounding;Feature extraction;Accuracy;Remote sensing;Prototypes;Predictive models},
doi={10.1109/MGRS.2025.3560293}
}