# NEO Series: Native Vision-Language Models

## 📜 News

- [2025/10] The paper, weights, and test code of NEO are released!
- [2025/09] 💥💥💥 NEO has been completed!

## 💡 Motivation

- What constraints set native VLMs apart from modular ones, and to what extent can they be overcome?

- How can native VLMs be made more accessible and democratized, thereby accelerating their progress?

## 💡 Highlights

- 🔥 **Native Architecture:** NEO introduces a native VLM primitive that unifies pixel-word encoding, alignment, and reasoning within a dense, monolithic model architecture (see the toy sketch after this list).

- 🔥 **Superior Efficiency:** With merely 390M image-text examples, NEO develops strong visual perception from scratch, rivaling top-tier modular VLMs and outperforming native ones.

- 🔥 **Promising Roadmap:** NEO pioneers a promising route toward scalable and powerful native VLMs, paired with diverse reusable components that foster a cost-effective and extensible ecosystem.
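
For intuition only, here is a minimal PyTorch sketch of that monolithic idea: flattened image patches and text tokens are embedded into one shared space and processed by a single dense transformer stack. Every name and size below is an illustrative assumption, not NEO's actual implementation; in particular, NEO's Native-RoPE positional scheme and causal masking are omitted for brevity.

```python
# Toy sketch of a "native" VLM: one dense transformer over pixels and words.
# All sizes, names, and the plain encoder stack are illustrative assumptions;
# NEO's real architecture (e.g. its Native-RoPE) is described in the paper.
import torch
import torch.nn as nn


class NativeVLMSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=4,
                 n_heads=8, patch_dim=3 * 16 * 16):
        super().__init__()
        # Pixels and words are projected into the same embedding space.
        self.patch_embed = nn.Linear(patch_dim, d_model)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # A single monolithic stack handles encoding, alignment, and reasoning.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, token_ids):
        # patches: (B, P, patch_dim) flattened image patches
        # token_ids: (B, T) text token ids
        x = torch.cat([self.patch_embed(patches),
                       self.token_embed(token_ids)], dim=1)
        return self.lm_head(self.backbone(x))  # logits over the joint sequence


if __name__ == "__main__":
    model = NativeVLMSketch()
    patches = torch.randn(1, 64, 3 * 16 * 16)    # a 128x128 image as 16x16 patches
    token_ids = torch.randint(0, 32000, (1, 8))  # a short text prompt
    print(model(patches, token_ids).shape)       # torch.Size([1, 72, 32000])
```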

## 🤖 Model Zoo

We release 2B and 9B NEO models at the Pre-Training (PT), Mid-Training (MT), and Supervised Fine-Tuning (SFT) stages; a hypothetical loading sketch follows the table.

| Model Name | Model Weight |
| --- | --- |
| NEO-2B-PT | 🤗 NEO-2B-PT HF link |
| NEO-2B-MT | 🤗 NEO-2B-MT HF link |
| NEO-2B-SFT | 🤗 NEO-2B-SFT HF link |
| NEO-9B-PT | 🤗 NEO-9B-PT HF link |
| NEO-9B-MT | 🤗 NEO-9B-MT HF link |
| NEO-9B-SFT | 🤗 NEO-9B-SFT HF link |
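
The scrape above dropped the URLs behind the 🤗 links, so the snippet below is a guessed usage sketch rather than an official recipe: the repo id is a placeholder, and the exact Auto classes and the need for `trust_remote_code` are assumptions about how the checkpoints are packaged.

```python
# Hypothetical loading sketch. "NEO-VLM/NEO-2B-SFT" is a placeholder repo id;
# substitute the real Hugging Face link from the table above. The Auto classes
# and trust_remote_code are assumptions, not confirmed by this README.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "NEO-VLM/NEO-2B-SFT"  # placeholder, not a confirmed repo id
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

image = Image.open("example.jpg")
inputs = processor(images=image, text="Describe this image.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```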

## 📊 Benchmark Results

Notes:

- “# Data” denotes the data scale for pre-training / mid-training / supervised fine-tuning.
- † indicates models using Reinforcement Learning (RL).
- “Any Res.” = any resolution; “Tile-wise” = image split into tiles; “Any Rat.” = any aspect ratio; “Fix Res.” = fixed resolution.
- MoE = Mixture-of-Experts; DaC = Divide-and-Conquer.
- Bold = best score in each column.
| Model | Base LLM | # Data (PT·MT·SFT) | Input Type | RoPE Type | MMMU | MMB | MMVet | MMStar | SEED-I | POPE | HallB | AI2D | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **🔻 Modular VLMs (2B)** | | | | | | | | | | | | | | | | | |
| Qwen2-VL | Qwen2-1.5B | --·--·-- | Any Res. | M-RoPE | 41.1 | 74.9 | 49.5 | 48.0 | -- | -- | 41.7 | 74.7 | 90.1 | 73.5 | 65.5 | 79.7 | 80.9 |
| InternVL2.5 | InternLM2.5-1.8B | >6B·100M·16M | Tile-wise | 1D-RoPE | 43.6 | 74.7 | 60.8 | 53.7 | -- | 90.6 | 42.6 | 74.9 | 88.7 | 79.2 | 60.9 | 74.3 | 80.4 |
| Qwen2.5-VL† | Qwen2.5-1.5B | --·--·-- | Any Res. | M-RoPE | 51.2 | 79.1 | 61.8 | 55.9 | -- | -- | 46.3 | 81.6 | 93.9 | 84.0 | 77.1 | 79.3 | 79.7 |
| InternVL3† | Qwen2.5-1.5B | >6B·100M·22M | Tile-wise | 1D-RoPE | 48.6 | 81.1 | 62.2 | 60.7 | -- | 89.6 | 42.5 | 78.7 | 88.3 | 80.2 | 66.1 | 77.0 | 83.5 |
| Encoder-Based | Qwen3-1.7B | >6B·40M·4M | Tile-wise | 1D-RoPE | 47.1 | 75.8 | 37.4 | 52.7 | 73.6 | 87.0 | 44.4 | 77.4 | 89.9 | 78.4 | 65.9 | 73.3 | 83.5 |
| **🔻 Native VLMs (2B)** | | | | | | | | | | | | | | | | | |
| Mono-InternVL | InternLM2-1.8B | 1.2B·143M·7M | Tile-wise | 1D-RoPE | 33.7 | 65.5 | 40.1 | -- | 67.4 | -- | 34.8 | 68.6 | 80.0 | 73.7 | 43.0 | 72.6 | 76.7 |
| Mono-InternVL-1.5 | InternLM2-1.8B | 400M·150M·7M | Tile-wise | 1D-RoPE | 39.1 | 64.0 | 54.0 | -- | 66.9 | -- | 32.5 | 67.4 | 81.7 | 72.2 | 47.9 | 73.7 | 80.1 |
| HoVLE | InternLM2-1.8B | 550M·50M·7M | Tile-wise | 1D-RoPE | 32.2 | 73.3 | 43.8 | -- | 70.9 | 87.4 | 38.4 | 73.0 | 86.1 | 78.6 | 55.7 | 70.9 | 74.0 |
| OneCAT | Qwen2.5-1.5B | 436M·70M·13M | Any Res. | M-RoPE | 39.0 | 72.4 | 42.4 | -- | 70.9 | -- | -- | 72.4 | 87.1 | 76.2 | 56.3 | 67.0 | -- |
| NEO | Qwen3-1.7B | 345M·40M·4M | Any Res. | Native-RoPE | 48.6 | 76.0 | 49.6 | 54.2 | 74.2 | 87.5 | 43.1 | 80.1 | 89.9 | 81.2 | 63.2 | 74.0 | 77.1 |
| **🔻 Modular VLMs (8B)** | | | | | | | | | | | | | | | | | |
| Qwen2-VL | Qwen2-7B | --·--·-- | Any Res. | M-RoPE | 54.1 | 83.0 | 62.0 | 60.7 | -- | 88.1 | 50.6 | 83.0 | 94.5 | 83.0 | 76.5 | 84.3 | 86.6 |
| InternVL2.5 | InternLM2.5-7B | >6B·50M·4M | Tile-wise | 1D-RoPE | 56.0 | 84.6 | 62.8 | 64.4 | -- | 90.6 | 50.1 | 84.5 | 93.0 | 84.8 | 77.6 | 79.1 | 82.2 |
| Qwen2.5-VL† | Qwen2.5-7B | --·--·-- | Any Res. | M-RoPE | 55.0 | 83.5 | 67.1 | 63.9 | -- | 86.4 | 52.9 | 83.9 | 95.7 | 87.3 | 82.6 | 84.9 | 86.4 |
| InternVL3† | Qwen2.5-7B | >6B·100M·22M | Tile-wise | 1D-RoPE | 62.7 | 83.4 | 81.3 | 68.2 | -- | 91.1 | 49.9 | 85.2 | 92.7 | 86.6 | 76.8 | 80.2 | 88.0 |
| Encoder-Based | Qwen3-8B | >6B·40M·4M | Tile-wise | 1D-RoPE | 54.1 | 84.0 | 60.0 | 63.5 | 76.2 | 87.8 | 51.4 | 82.9 | 92.1 | 83.5 | 75.0 | 77.1 | 85.3 |
| **🔻 Native VLMs (8B)** | | | | | | | | | | | | | | | | | |
| Fuyu | Persimmon-8B | --·--·-- | Any Res. | 1D-RoPE | 27.9 | 10.7 | 21.4 | -- | 59.3 | 84.0 | -- | 64.5 | -- | -- | -- | -- | 36.6 |
| Chameleon | from scratch | 1.4B·0M·1.8M | Fix Res. | 1D-RoPE | 25.4 | 31.1 | 8.3 | -- | 30.6 | 19.4 | 17.1 | 46.0 | 1.5 | 2.9 | 5.0 | 4.8 | 0.7 |
| EVE | Vicuna-7B | 33M·0M·1.8M | Any Rat. | 1D-RoPE | 32.6 | 52.3 | 25.7 | -- | 64.6 | 85.0 | 26.4 | 61.0 | 53.0 | 59.1 | 25.0 | 56.8 | 39.8 |
| SOLO | Mistral-7B | 44M·0M·2M | Any Res. | 1D-RoPE | -- | 67.7 | 30.4 | -- | 64.4 | 78.6 | -- | 61.4 | -- | -- | -- | -- | 12.6 |
| Emu3 | from scratch | --·--·-- | Fix Res. | 1D-RoPE | 31.6 | 58.5 | 37.2 | -- | 68.2 | 85.2 | -- | 70.0 | 76.3 | 68.6 | 43.8 | 64.7 | 68.7 |
| EVEv2 | Qwen2.5-7B | 77M·15M·7M | Any Rat. | 1D-RoPE | 39.3 | 66.3 | 45.0 | -- | 71.4 | 87.6 | -- | 74.8 | -- | 73.9 | -- | 71.1 | 70.2 |
| BREEN | Qwen2.5-7B | 13M·0M·4M | Any Res. | 1D-RoPE | 42.7 | 71.4 | 38.9 | 51.2 | -- | -- | 37.0 | 76.4 | -- | -- | -- | 65.7 | -- |
| VoRA | Qwen2.5-7B | 30M·0M·0.6M | Any Res. | 1D-RoPE | 32.0 | 61.3 | 33.7 | -- | 68.9 | 85.5 | -- | 61.1 | -- | -- | -- | 58.7 | -- |
| SAIL | Mistral-7B | 512M·86M·6M | Any Res. | M-RoPE | -- | 70.1 | 46.3 | 53.1 | 72.9 | 85.8 | 54.2 | 76.7 | -- | -- | -- | 77.1 | 78.3 |
| NEO | Qwen3-8B | 345M·40M·4M | Any Res. | Native-RoPE | 54.6 | 82.1 | 53.6 | 62.4 | 76.3 | 88.4 | 46.4 | 83.1 | 88.6 | 82.1 | 60.9 | 75.0 | 77.7 |

## 📋 Todo List

## ✒️ Citation

If the NEO series is helpful for your research, please consider giving a star ⭐ and a citation 📝:

@article{Diao2025NEO,
  title        = {From Pixels to Words — Towards Native Vision-Language Primitives at Scale},
  author       = {Diao, Haiwen and Li, Mingxuan and Wu, Silei and Dai, Linjun and Wang, Xiaohua and Deng, Hanming and Lu, Lewei and Lin, Dahua and Liu, Ziwei},
  journal      = {arXiv preprint arXiv:2510.14979},
  year         = {2025}
}

## 📄 License

This project is licensed under the terms of the LICENSE file in this repository.
