# NEO Series: Native Vision-Language Models

## 📜 News

- [2025/10] The paper, weights, and test code of NEO are released!
- [2025/09] 💥💥💥 NEO has been completed!

## 💡 Motivation

- What constraints set native VLMs apart from modular ones, and to what extent can they be overcome?

- How can native VLMs be made more accessible and democratized, thereby accelerating their progress?

## 💡 Highlights

- 🔥 **Native Architecture:** NEO introduces a native VLM primitive that unifies pixel-word encoding, alignment, and reasoning within a dense, monolithic model architecture (see the toy sketch after this list).

- 🔥 **Superior Efficiency:** With merely 390M image-text examples, NEO develops strong visual perception from scratch, rivaling top-tier modular VLMs and outperforming native ones.

- 🔥 **Promising Roadmap:** NEO pioneers a promising route toward scalable and powerful native VLMs, paired with diverse reusable components that foster a cost-effective and extensible ecosystem.
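
For intuition only, here is a minimal PyTorch sketch of that monolithic idea: flattened image patches and text tokens are embedded into one shared space and processed by a single dense transformer stack. Every name and size below is an illustrative assumption, not NEO's actual implementation; in particular, NEO's Native-RoPE positional scheme and causal masking are omitted for brevity.

```python
# Toy sketch of a "native" VLM: one dense transformer over pixels and words.
# All sizes, names, and the plain encoder stack are illustrative assumptions;
# NEO's real architecture (e.g. its Native-RoPE) is described in the paper.
import torch
import torch.nn as nn


class NativeVLMSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=4,
                 n_heads=8, patch_dim=3 * 16 * 16):
        super().__init__()
        # Pixels and words are projected into the same embedding space.
        self.patch_embed = nn.Linear(patch_dim, d_model)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # A single monolithic stack handles encoding, alignment, and reasoning.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, token_ids):
        # patches: (B, P, patch_dim) flattened image patches
        # token_ids: (B, T) text token ids
        x = torch.cat([self.patch_embed(patches),
                       self.token_embed(token_ids)], dim=1)
        return self.lm_head(self.backbone(x))  # logits over the joint sequence


if __name__ == "__main__":
    model = NativeVLMSketch()
    patches = torch.randn(1, 64, 3 * 16 * 16)    # a 128x128 image as 16x16 patches
    token_ids = torch.randint(0, 32000, (1, 8))  # a short text prompt
    print(model(patches, token_ids).shape)       # torch.Size([1, 72, 32000])
```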

## 🤖 Model Zoo

We release 2B and 9B NEO models at the Pre-Training (PT), Mid-Training (MT), and Supervised Fine-Tuning (SFT) stages; a hypothetical loading sketch follows the table.

| Model Name | Model Weight |
| --- | --- |
| NEO-2B-PT | 🤗 NEO-2B-PT HF link |
| NEO-2B-MT | 🤗 NEO-2B-MT HF link |
| NEO-2B-SFT | 🤗 NEO-2B-SFT HF link |
| NEO-9B-PT | 🤗 NEO-9B-PT HF link |
| NEO-9B-MT | 🤗 NEO-9B-MT HF link |
| NEO-9B-SFT | 🤗 NEO-9B-SFT HF link |
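
The scrape above dropped the URLs behind the 🤗 links, so the snippet below is a guessed usage sketch rather than an official recipe: the repo id is a placeholder, and the exact Auto classes and the need for `trust_remote_code` are assumptions about how the checkpoints are packaged.

```python
# Hypothetical loading sketch. "NEO-VLM/NEO-2B-SFT" is a placeholder repo id;
# substitute the real Hugging Face link from the table above. The Auto classes
# and trust_remote_code are assumptions, not confirmed by this README.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "NEO-VLM/NEO-2B-SFT"  # placeholder, not a confirmed repo id
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

image = Image.open("example.jpg")
inputs = processor(images=image, text="Describe this image.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```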

## 📊 Benchmark Results

Notes:

- “# Data” denotes the data scale for pre-training / mid-training / supervised fine-tuning.
- † indicates models using Reinforcement Learning (RL).
- “Any Res.” = any resolution; “Tile-wise” = image split into tiles; “Any Rat.” = any aspect ratio; “Fix Res.” = fixed resolution.
- MoE = Mixture-of-Experts; DaC = Divide-and-Conquer.
- Bold = best score in each column.
| Model | Base LLM | # Data (PT·MT·SFT) | Input Type | RoPE Type | MMMU | MMB | MMVet | MMStar | SEED-I | POPE | HallB | AI2D | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **🔻 Modular VLMs (2B)** | | | | | | | | | | | | | | | | | |
| Qwen2-VL | Qwen2-1.5B | --·--·-- | Any Res. | M-RoPE | 41.1 | 74.9 | 49.5 | 48.0 | -- | -- | 41.7 | 74.7 | 90.1 | 73.5 | 65.5 | 79.7 | 80.9 |
| InternVL2.5 | InternLM2.5-1.8B | >6B·100M·16M | Tile-wise | 1D-RoPE | 43.6 | 74.7 | 60.8 | 53.7 | -- | 90.6 | 42.6 | 74.9 | 88.7 | 79.2 | 60.9 | 74.3 | 80.4 |
| Qwen2.5-VL† | Qwen2.5-1.5B | --·--·-- | Any Res. | M-RoPE | 51.2 | 79.1 | 61.8 | 55.9 | -- | -- | 46.3 | 81.6 | 93.9 | 84.0 | 77.1 | 79.3 | 79.7 |
| InternVL3† | Qwen2.5-1.5B | >6B·100M·22M | Tile-wise | 1D-RoPE | 48.6 | 81.1 | 62.2 | 60.7 | -- | 89.6 | 42.5 | 78.7 | 88.3 | 80.2 | 66.1 | 77.0 | 83.5 |
| Encoder-Based | Qwen3-1.7B | >6B·40M·4M | Tile-wise | 1D-RoPE | 47.1 | 75.8 | 37.4 | 52.7 | 73.6 | 87.0 | 44.4 | 77.4 | 89.9 | 78.4 | 65.9 | 73.3 | 83.5 |
| **🔻 Native VLMs (2B)** | | | | | | | | | | | | | | | | | |
| Mono-InternVL | InternLM2-1.8B | 1.2B·143M·7M | Tile-wise | 1D-RoPE | 33.7 | 65.5 | 40.1 | -- | 67.4 | -- | 34.8 | 68.6 | 80.0 | 73.7 | 43.0 | 72.6 | 76.7 |
| Mono-InternVL-1.5 | InternLM2-1.8B | 400M·150M·7M | Tile-wise | 1D-RoPE | 39.1 | 64.0 | 54.0 | -- | 66.9 | -- | 32.5 | 67.4 | 81.7 | 72.2 | 47.9 | 73.7 | 80.1 |
| HoVLE | InternLM2-1.8B | 550M·50M·7M | Tile-wise | 1D-RoPE | 32.2 | 73.3 | 43.8 | -- | 70.9 | 87.4 | 38.4 | 73.0 | 86.1 | 78.6 | 55.7 | 70.9 | 74.0 |
| OneCAT | Qwen2.5-1.5B | 436M·70M·13M | Any Res. | M-RoPE | 39.0 | 72.4 | 42.4 | -- | 70.9 | -- | -- | 72.4 | 87.1 | 76.2 | 56.3 | 67.0 | -- |
| NEO | Qwen3-1.7B | 345M·40M·4M | Any Res. | Native-RoPE | 48.6 | 76.0 | 49.6 | 54.2 | 74.2 | 87.5 | 43.1 | 80.1 | 89.9 | 81.2 | 63.2 | 74.0 | 77.1 |
| **🔻 Modular VLMs (8B)** | | | | | | | | | | | | | | | | | |
| Qwen2-VL | Qwen2-7B | --·--·-- | Any Res. | M-RoPE | 54.1 | 83.0 | 62.0 | 60.7 | -- | 88.1 | 50.6 | 83.0 | 94.5 | 83.0 | 76.5 | 84.3 | 86.6 |
| InternVL2.5 | InternLM2.5-7B | >6B·50M·4M | Tile-wise | 1D-RoPE | 56.0 | 84.6 | 62.8 | 64.4 | -- | 90.6 | 50.1 | 84.5 | 93.0 | 84.8 | 77.6 | 79.1 | 82.2 |
| Qwen2.5-VL† | Qwen2.5-7B | --·--·-- | Any Res. | M-RoPE | 55.0 | 83.5 | 67.1 | 63.9 | -- | 86.4 | 52.9 | 83.9 | 95.7 | 87.3 | 82.6 | 84.9 | 86.4 |
| InternVL3† | Qwen2.5-7B | >6B·100M·22M | Tile-wise | 1D-RoPE | 62.7 | 83.4 | 81.3 | 68.2 | -- | 91.1 | 49.9 | 85.2 | 92.7 | 86.6 | 76.8 | 80.2 | 88.0 |
| Encoder-Based | Qwen3-8B | >6B·40M·4M | Tile-wise | 1D-RoPE | 54.1 | 84.0 | 60.0 | 63.5 | 76.2 | 87.8 | 51.4 | 82.9 | 92.1 | 83.5 | 75.0 | 77.1 | 85.3 |
| **🔻 Native VLMs (8B)** | | | | | | | | | | | | | | | | | |
| Fuyu | Persimmon-8B | --·--·-- | Any Res. | 1D-RoPE | 27.9 | 10.7 | 21.4 | -- | 59.3 | 84.0 | -- | 64.5 | -- | -- | -- | -- | 36.6 |
| Chameleon | from scratch | 1.4B·0M·1.8M | Fix Res. | 1D-RoPE | 25.4 | 31.1 | 8.3 | -- | 30.6 | 19.4 | 17.1 | 46.0 | 1.5 | 2.9 | 5.0 | 4.8 | 0.7 |
| EVE | Vicuna-7B | 33M·0M·1.8M | Any Rat. | 1D-RoPE | 32.6 | 52.3 | 25.7 | -- | 64.6 | 85.0 | 26.4 | 61.0 | 53.0 | 59.1 | 25.0 | 56.8 | 39.8 |
| SOLO | Mistral-7B | 44M·0M·2M | Any Res. | 1D-RoPE | -- | 67.7 | 30.4 | -- | 64.4 | 78.6 | -- | 61.4 | -- | -- | -- | -- | 12.6 |
| Emu3 | from scratch | --·--·-- | Fix Res. | 1D-RoPE | 31.6 | 58.5 | 37.2 | -- | 68.2 | 85.2 | -- | 70.0 | 76.3 | 68.6 | 43.8 | 64.7 | 68.7 |
| EVEv2 | Qwen2.5-7B | 77M·15M·7M | Any Rat. | 1D-RoPE | 39.3 | 66.3 | 45.0 | -- | 71.4 | 87.6 | -- | 74.8 | -- | 73.9 | -- | 71.1 | 70.2 |
| BREEN | Qwen2.5-7B | 13M·0M·4M | Any Res. | 1D-RoPE | 42.7 | 71.4 | 38.9 | 51.2 | -- | -- | 37.0 | 76.4 | -- | -- | -- | 65.7 | -- |
| VoRA | Qwen2.5-7B | 30M·0M·0.6M | Any Res. | 1D-RoPE | 32.0 | 61.3 | 33.7 | -- | 68.9 | 85.5 | -- | 61.1 | -- | -- | -- | 58.7 | -- |
| SAIL | Mistral-7B | 512M·86M·6M | Any Res. | M-RoPE | -- | 70.1 | 46.3 | 53.1 | 72.9 | 85.8 | 54.2 | 76.7 | -- | -- | -- | 77.1 | 78.3 |
| NEO | Qwen3-8B | 345M·40M·4M | Any Res. | Native-RoPE | 54.6 | 82.1 | 53.6 | 62.4 | 76.3 | 88.4 | 46.4 | 83.1 | 88.6 | 82.1 | 60.9 | 75.0 | 77.7 |

## 📋 Todo List

## ✒️ Citation

If the NEO series is helpful for your research, please consider giving a star ⭐ and a citation 📝:

@article{Diao2025NEO,
  title        = {From Pixels to Words — Towards Native Vision-Language Primitives at Scale},
  author       = {Diao, Haiwen and Li, Mingxuan and Wu, Silei and Dai, Linjun and Wang, Xiaohua and Deng, Hanming and Lu, Lewei and Lin, Dahua and Liu, Ziwei},
  journal      = {arXiv preprint arXiv:2510.14979},
  year         = {2025}
}

## 📄 License

This project is licensed under the terms of the LICENSE file in this repository.
