NeoBERT

Important

This is a fork of the original chandar-lab/NeoBERT, refactored to support experimentation. ⚠️ WIP / under active development ⚠️



Description

NeoBERT is a next-generation encoder model for English text representation, pre-trained from scratch on the RefinedWeb dataset. NeoBERT integrates state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. It is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it is the most efficient model of its kind and achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions.

Get started

Install

git clone https://github.com/pszemraj/NeoBERT.git
cd NeoBERT
pip install -e ".[dev]"  # drop [dev] if you only need runtime deps

Tip

For faster training on supported GPUs, add flash-attn (and optionally xformers) with pip install flash-attn --no-build-isolation.
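If you are unsure whether the optional kernels are installed, a quick import check can save a failed training launch. This is a minimal sketch; flash_attn is the import name of the package installed by pip install flash-attn.

# Quick check that the optional flash-attn kernels are importable
try:
    import flash_attn  # installed via: pip install flash-attn --no-build-isolation
    print("flash-attn available:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; install it to enable the faster attention kernels")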

Verify your setup

# 5-minute smoke test (tiny model, CPU-friendly)
python scripts/pretraining/pretrain.py \
    --config tests/configs/pretraining/test_tiny_pretrain.yaml

# Optional: run the full regression suite
python tests/run_tests.py

Quick commands

Task           | Command
Pretrain       | python scripts/pretraining/pretrain.py --config configs/pretrain_neobert.yaml
GLUE eval      | python scripts/evaluation/run_glue.py --config configs/glue/{task}.yaml
Summarize GLUE | python scripts/evaluation/summarize_glue.py {results_path}
Run tests      | python tests/run_tests.py


How to use

Load the official model using Hugging Face Transformers:

For Text Embeddings

from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Tokenize input text
text = "NeoBERT is the most efficient model of its kind!"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings
outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]  # CLS token embedding
print(embedding.shape)
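The CLS embedding above can be used directly for similarity search. Below is a minimal sketch; the sentences are illustrative, and CLS pooling plus cosine similarity is one reasonable choice rather than an official recipe.

import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

sentences = [
    "NeoBERT is a compact encoder for English text.",
    "The weather in Montreal is cold in January.",
]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
outputs = model(**inputs)

# CLS pooling as above, then unit-normalize so the dot product is cosine similarity
embeddings = F.normalize(outputs.last_hidden_state[:, 0, :], dim=-1)
print(embeddings @ embeddings.T)  # 2x2 matrix of pairwise cosine similarities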

For Masked Language Modeling

from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)

# Fill in masked tokens
text = "The quick brown [MASK] jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
outputs = model(**inputs)
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, mask_token_index].argmax(dim=-1)
print(tokenizer.decode(predicted_token_id))
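To see more than the single best candidate, you can inspect the top-k logits at the masked position. This is a small extension of the snippet above; the choice of k=5 is arbitrary.

# Continuing from the variables above: show the 5 most likely fill-ins
top_k = outputs.logits[0, mask_token_index].topk(5, dim=-1).indices[0].tolist()
print([tokenizer.decode([token_id]) for token_id in top_k])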

Documentation

For detailed guides, see the documentation included in the repository; the Training and Development section below outlines the layout.

Features

Feature               | NeoBERT
Depth-to-width        | 28 × 768
Parameter count       | 250M
Activation            | SwiGLU
Positional embeddings | RoPE
Normalization         | Pre-RMSNorm
Data Source           | RefinedWeb
Data Size             | 2.8 TB
Tokenizer             | google/bert
Context length        | 4,096
MLM Masking Rate      | 20%
Optimizer             | AdamW
Scheduler             | CosineDecay
Training Tokens       | 2.1 T
Efficiency            | FlashAttention
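The 4,096-token context length means many long documents can be embedded without chunking. The sketch below is illustrative only; the repeated sentence stands in for a real long document, and the 768-dimensional hidden size follows from the table above.

from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

long_text = " ".join(["NeoBERT supports sequences of up to 4,096 tokens."] * 400)
inputs = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")
outputs = model(**inputs)
print(inputs["input_ids"].shape)        # at most (1, 4096) after truncation
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768) hidden states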

License

The model weights and this code repository are licensed under the permissive MIT license.

Citation

If you use this model in your research, please cite:

@misc{breton2025neobertnextgenerationbert,
      title={NeoBERT: A Next-Generation BERT},
      author={Lola Le Breton and Quentin Fournier and Mariam El Mezouar and Sarath Chandar},
      year={2025},
      eprint={2502.19587},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.19587},
}

Training and Development

This repository includes the complete training and evaluation codebase for NeoBERT.

Repository Structure

  • configs/ - YAML configuration files for training, evaluation, and contrastive learning
  • scripts/ - CLI entry points for pretraining, evaluation, contrastive learning, and exporting
  • jobs/ - Example shell launchers for clusters or batch systems
  • tests/ - Automated regression suite and tiny configs
  • src/neobert/ - Core model, trainer, and utilities

