> [!IMPORTANT]
> This is a fork of the original [chandar-lab/NeoBERT](https://github.com/chandar-lab/NeoBERT), refactored to support experimentation.

NeoBERT is a next-generation encoder model for English text representation, pre-trained from scratch on the RefinedWeb dataset. NeoBERT integrates state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. It is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it is the most efficient model of its kind and achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions.
- Paper: [NeoBERT: A Next-Generation BERT (arXiv:2502.19587)](https://arxiv.org/abs/2502.19587)
- Model: [chandar-lab/NeoBERT on Hugging Face](https://huggingface.co/chandar-lab/NeoBERT)
- Documentation: `docs/`

```bash
git clone https://github.com/pszemraj/NeoBERT.git
cd NeoBERT
pip install -e .[dev]  # drop [dev] if you only need runtime deps
```

> [!TIP]
> For faster training on supported GPUs, add flash-attn (and optionally xformers) with `pip install flash-attn --no-build-isolation`.
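
To confirm that the optional acceleration packages are actually visible to your environment before launching a run, a quick import check is enough. This is a convenience sketch, not part of the project's tooling:

```python
# Sanity-check the optional acceleration stack (convenience sketch, not project tooling).
import importlib.util

import torch

print("CUDA available:", torch.cuda.is_available())
for pkg in ("flash_attn", "xformers"):
    status = "installed" if importlib.util.find_spec(pkg) else "not installed"
    print(f"{pkg}: {status}")
```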
```bash
# 5-minute smoke test (tiny model, CPU-friendly)
python scripts/pretraining/pretrain.py \
    --config tests/configs/pretraining/test_tiny_pretrain.yaml

# Optional: run the full regression suite
python tests/run_tests.py
```

| Task | Command |
|---|---|
| Pretrain | `python scripts/pretraining/pretrain.py --config configs/pretrain_neobert.yaml` |
| GLUE eval | `python scripts/evaluation/run_glue.py --config configs/glue/{task}.yaml` |
| Summarize GLUE | `python scripts/evaluation/summarize_glue.py {results_path}` |
| Run tests | `python tests/run_tests.py` |
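
Each of these entry points is driven by a YAML config (see `docs/configuration.md` for the full hierarchy and CLI overrides). As a rough sketch of how you might inspect or tweak a config programmatically before a long run, assuming PyYAML is available and using placeholder key names that may not match the real schema:

```python
# Sketch: inspect a training config and write a short-run variant.
# Key names below ("trainer", "max_steps") are placeholders; check the YAML for the real schema.
from pathlib import Path

import yaml

cfg = yaml.safe_load(Path("configs/pretrain_neobert.yaml").read_text())
print(sorted(cfg))  # top-level sections defined by the config

cfg.setdefault("trainer", {})["max_steps"] = 100  # hypothetical override
Path("configs/pretrain_debug.yaml").write_text(yaml.safe_dump(cfg))
```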
- Train or fine-tune: see `docs/training.md`
- Evaluate on GLUE or MTEB: see `docs/evaluation.md`
- Tune configs and overrides: see `docs/configuration.md`
- Export checkpoints to Hugging Face: see `docs/export.md`
- Troubleshoot common issues: see `docs/troubleshooting.md`
Load the official model using Hugging Face Transformers:
```python
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Tokenize input text
text = "NeoBERT is the most efficient model of its kind!"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings
outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]  # CLS token embedding
print(embedding.shape)
```
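
The CLS embedding can be used directly for quick similarity checks. The following is a minimal sketch (not from the upstream model card) that scores two sentences with cosine similarity; pooling strategy and fine-tuning matter for serious retrieval use:

```python
# Sketch: compare two sentences via cosine similarity of their CLS embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

sentences = ["NeoBERT is an efficient encoder.", "NeoBERT encodes English text efficiently."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    cls_embeddings = model(**inputs).last_hidden_state[:, 0, :]  # (2, hidden_size)

similarity = torch.nn.functional.cosine_similarity(cls_embeddings[0], cls_embeddings[1], dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```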
To predict masked tokens, load the model with `AutoModelForMaskedLM` instead:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)

# Fill in masked tokens
text = "The quick brown [MASK] jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
outputs = model(**inputs)
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, mask_token_index].argmax(axis=-1)
print(tokenizer.decode(predicted_token_id))
```
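
Continuing from the snippet above, you can rank several candidate fills for the masked position with `torch.topk` rather than taking only the argmax; this is a small illustrative addition, not part of the upstream model card:

```python
# Sketch: list the top-5 candidate tokens for the [MASK] position (continues the snippet above).
import torch

top_k = torch.topk(outputs.logits[0, mask_token_index], k=5, dim=-1)
for token_id, score in zip(top_k.indices[0], top_k.values[0]):
    print(f"{tokenizer.decode(token_id.item()):>10}  logit={score.item():.2f}")
```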
For detailed guides and documentation, see `docs/`:

- Training Guide - Pretraining, contrastive learning, and monitoring runs
- Evaluation Guide - GLUE, MTEB, and result analysis
- Configuration System - YAML hierarchy and CLI overrides
- Export Guide - Convert checkpoints to Hugging Face format
- Architecture Details - Model internals
- Testing Guide - Regression suite and coverage
- Troubleshooting - Common failure modes and fixes
| Feature | NeoBERT |
|---|---|
| Depth-to-width | 28 × 768 |
| Parameter count | 250M |
| Activation | SwiGLU |
| Positional embeddings | RoPE |
| Normalization | Pre-RMSNorm |
| Data source | RefinedWeb |
| Data size | 2.8 TB |
| Tokenizer | google/bert |
| Context length | 4,096 |
| MLM masking rate | 20% |
| Optimizer | AdamW |
| Scheduler | CosineDecay |
| Training tokens | 2.1 T |
| Efficiency | FlashAttention |
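
One practical consequence of the 4,096-token context window is that moderately long documents can be embedded in a single forward pass instead of being chunked at 512 tokens. A minimal sketch, assuming the Hugging Face checkpoint above:

```python
# Sketch: embed a long document within the 4,096-token context window.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

long_text = " ".join(["NeoBERT supports long inputs."] * 800)  # far beyond 512 tokens
inputs = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")
print(inputs["input_ids"].shape)  # capped at (1, 4096)

with torch.no_grad():
    document_embedding = model(**inputs).last_hidden_state[:, 0, :]
print(document_embedding.shape)
```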
The model weights and this code repository are licensed under the permissive MIT license.
If you use this model in your research, please cite:
```bibtex
@misc{breton2025neobertnextgenerationbert,
  title={NeoBERT: A Next-Generation BERT},
  author={Lola Le Breton and Quentin Fournier and Mariam El Mezouar and Sarath Chandar},
  year={2025},
  eprint={2502.19587},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.19587},
}
```

This repository includes the complete training and evaluation codebase for NeoBERT, featuring:
- `configs/` - YAML configuration files for training, evaluation, and contrastive learning
- `scripts/` - CLI entry points for pretraining, evaluation, contrastive learning, and exporting
- `jobs/` - Example shell launchers for clusters or batch systems
- `tests/` - Automated regression suite and tiny configs
- `src/neobert/` - Core model, trainer, and utilities
Additional guidance lives in:
- `docs/training.md` for full training workflows
- `docs/evaluation.md` for benchmark recipes
- `docs/testing.md` for extending the test suite
- `docs/export.md` for Hugging Face conversion
- `docs/troubleshooting.md` for debugging tips