NeoBERT

Important

This is a fork of the original chandar-lab/NeoBERT, refactored to support experimentation. ⚠️ WIP / under active development ⚠️



Description

NeoBERT is a next-generation encoder model for English text representation, pre-trained from scratch on the RefinedWeb dataset. NeoBERT integrates state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. It is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it is the most efficient model of its kind and achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions.

Get started

Install

git clone https://github.com/pszemraj/NeoBERT.git
cd NeoBERT
pip install -e ".[dev]"  # drop [dev] if you only need runtime deps

Tip

For faster training on supported GPUs, add flash-attn (and optionally xformers) with pip install flash-attn --no-build-isolation.
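If you are unsure whether the optional kernels are installed, a quick import check can save a failed training launch. This is a minimal sketch; flash_attn is the import name of the package installed by pip install flash-attn.

# Quick check that the optional flash-attn kernels are importable
try:
    import flash_attn  # installed via: pip install flash-attn --no-build-isolation
    print("flash-attn available:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; install it to enable the faster attention kernels")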

Verify your setup

# 5-minute smoke test (tiny model, CPU-friendly)
python scripts/pretraining/pretrain.py \
    --config tests/configs/pretraining/test_tiny_pretrain.yaml

# Optional: run the full regression suite
python tests/run_tests.py

Quick commands

Task           | Command
Pretrain       | python scripts/pretraining/pretrain.py --config configs/pretrain_neobert.yaml
GLUE eval      | python scripts/evaluation/run_glue.py --config configs/glue/{task}.yaml
Summarize GLUE | python scripts/evaluation/summarize_glue.py {results_path}
Run tests      | python tests/run_tests.py


How to use

Load the official model using Hugging Face Transformers:

For Text Embeddings

from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Tokenize input text
text = "NeoBERT is the most efficient model of its kind!"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings
outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]  # CLS token embedding
print(embedding.shape)
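The CLS embedding above can be used directly for similarity search. Below is a minimal sketch; the sentences are illustrative, and CLS pooling plus cosine similarity is one reasonable choice rather than an official recipe.

import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

sentences = [
    "NeoBERT is a compact encoder for English text.",
    "The weather in Montreal is cold in January.",
]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
outputs = model(**inputs)

# CLS pooling as above, then unit-normalize so the dot product is cosine similarity
embeddings = F.normalize(outputs.last_hidden_state[:, 0, :], dim=-1)
print(embeddings @ embeddings.T)  # 2x2 matrix of pairwise cosine similarities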

For Masked Language Modeling

from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)

# Fill in masked tokens
text = "The quick brown [MASK] jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
outputs = model(**inputs)
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, mask_token_index].argmax(dim=-1)
print(tokenizer.decode(predicted_token_id))
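To see more than the single best candidate, you can inspect the top-k logits at the masked position. This is a small extension of the snippet above; the choice of k=5 is arbitrary.

# Continuing from the variables above: show the 5 most likely fill-ins
top_k = outputs.logits[0, mask_token_index].topk(5, dim=-1).indices[0].tolist()
print([tokenizer.decode([token_id]) for token_id in top_k])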

Documentation

For detailed guides, see the documentation included in the repository; the Training and Development section below outlines the layout.

Features

Feature               | NeoBERT
Depth-to-width        | 28 × 768
Parameter count       | 250M
Activation            | SwiGLU
Positional embeddings | RoPE
Normalization         | Pre-RMSNorm
Data Source           | RefinedWeb
Data Size             | 2.8 TB
Tokenizer             | google/bert
Context length        | 4,096
MLM Masking Rate      | 20%
Optimizer             | AdamW
Scheduler             | CosineDecay
Training Tokens       | 2.1 T
Efficiency            | FlashAttention
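The 4,096-token context length means many long documents can be embedded without chunking. The sketch below is illustrative only; the repeated sentence stands in for a real long document, and the 768-dimensional hidden size follows from the table above.

from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

long_text = " ".join(["NeoBERT supports sequences of up to 4,096 tokens."] * 400)
inputs = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")
outputs = model(**inputs)
print(inputs["input_ids"].shape)        # at most (1, 4096) after truncation
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768) hidden states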

License

The model weights and this code repository are licensed under the permissive MIT license.

Citation

If you use this model in your research, please cite:

@misc{breton2025neobertnextgenerationbert,
      title={NeoBERT: A Next-Generation BERT},
      author={Lola Le Breton and Quentin Fournier and Mariam El Mezouar and Sarath Chandar},
      year={2025},
      eprint={2502.19587},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.19587},
}

Training and Development

This repository includes the complete training and evaluation codebase for NeoBERT.

Repository Structure

  • configs/ - YAML configuration files for training, evaluation, and contrastive learning
  • scripts/ - CLI entry points for pretraining, evaluation, contrastive learning, and exporting
  • jobs/ - Example shell launchers for clusters or batch systems
  • tests/ - Automated regression suite and tiny configs
  • src/neobert/ - Core model, trainer, and utilities

