EN | 简体中文
Sponsored by Mixedbread
For more detailed usage, please read the 📘 documentation: https://angle.readthedocs.io/en/latest/index.html
📢 Train/Infer Powerful Sentence Embeddings with AnglE. This library is from the paper: AnglE: Angle-optimized Text Embeddings. It allows training state-of-the-art BERT/LLM-based sentence embeddings with just a few lines of code. AnglE is also a general sentence embedding inference framework, allowing inference with a variety of transformer-based sentence embedding models.
Loss:
- 📐 AnglE loss (ACL24)
- ⚖ Contrastive loss
- 📏 CoSENT loss
- ☕️ Espresso loss (ICLR 2025, a.k.a 2DMSE, detail: README_ESE)
Backbones:
- BERT-based models (BERT, RoBERTa, ModernBERT, etc.)
- LLM-based models (LLaMA, Mistral, Qwen, etc.)
- Bi-directional LLM-based models (LLaMA, Mistral, Qwen, OpenELMo, etc.; refer to: https://github.com/WhereIsAI/BiLLM)
Training:
- Single-GPU training
- Multi-GPU training
📅 May 16, 2024 | Paper "AnglE: Angle-optimized Text Embeddings" is accepted by ACL 2024 Main Conference.
📅 Mar 13, 2024 | Paper "BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings" is accepted by NAACL 2024 Main Conference.
📅 Mar 8, 2024 | 🍞 mixedbread's embedding (mixedbread-ai/mxbai-embed-large-v1) achieves SOTA on the MTEB Leaderboard with an average score of 64.68! The model is trained using AnglE. Congrats mixedbread!
📅 Dec 4, 2023 | Our universal sentence embedding WhereIsAI/UAE-Large-V1 achieves SOTA on the MTEB Leaderboard with an average score of 64.64! The model is trained using AnglE.
📅 Dec, 2023 | AnglE achieves SOTA performance on the STS Benchmark (Semantic Textual Similarity)!
BERT-based models:
| 🤗 HF | Max Tokens | Pooling Strategy | Scenario |
|---|---|---|---|
| WhereIsAI/UAE-Large-V1 | 512 | cls | English, General-purpose |
| WhereIsAI/UAE-Code-Large-V1 | 512 | cls | Code Similarity |
| WhereIsAI/pubmed-angle-base-en | 512 | cls | Medical Similarity |
| WhereIsAI/pubmed-angle-large-en | 512 | cls | Medical Similarity |
LLM-based models:
| 🤗 HF (lora weight) | Backbone | Max Tokens | Prompts | Pooling Strategy | Scenario |
|---|---|---|---|---|---|
| SeanLee97/angle-llama-13b-nli | NousResearch/Llama-2-13b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement |
| SeanLee97/angle-llama-7b-nli-v2 | NousResearch/Llama-2-7b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement |
💡 You can find more third-party embeddings trained with AnglE in the HuggingFace Collection
Install with uv:
uv pip install -U angle-emb
or with pip:
pip install -U angle-emb
Option A: With Prompts (for Retrieval Tasks)
Use prompts with {text} as placeholder. Check available prompts via Prompts.list_prompts().
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity
# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
# Encode query with prompt, documents without prompt
qv = angle.encode(['what is the weather?'], to_numpy=True, prompt=Prompts.C)
doc_vecs = angle.encode([
'The weather is great!',
'it is rainy today.',
'i am going to bed'
], to_numpy=True)
# Calculate similarity
for dv in doc_vecs:
    print(cosine_similarity(qv[0], dv))
Option B: Without Prompts (for Similarity Tasks)
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity
# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
# Encode documents
doc_vecs = angle.encode([
'The weather is great!',
'The weather is very good!',
'i am going to bed'
])
# Calculate pairwise similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
For LoRA-based models, specify both the backbone model and LoRA weights. Always set is_llm=True for LLM models.
import torch
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity
# Load LLM with LoRA weights
angle = AnglE.from_pretrained(
'NousResearch/Llama-2-7b-hf',
pretrained_lora_path='SeanLee97/angle-llama-7b-nli-v2',
pooling_strategy='last',
is_llm=True,
torch_dtype=torch.float16
).cuda()
# Encode with prompt
doc_vecs = angle.encode([
'The weather is great!',
'The weather is very good!',
'i am going to bed'
], prompt=Prompts.A)
# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
Enable bidirectional LLMs with apply_billm=True and specify the model class.
import os
import torch
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity
# Set BiLLM environment variable
os.environ['BiLLM_START_INDEX'] = '31'
# Load BiLLM model
angle = AnglE.from_pretrained(
'NousResearch/Llama-2-7b-hf',
pretrained_lora_path='SeanLee97/bellm-llama-7b-nli',
pooling_strategy='last',
is_llm=True,
apply_billm=True,
billm_model_class='LlamaForCausalLM',
torch_dtype=torch.float16
).cuda()
# Encode with custom prompt
doc_vecs = angle.encode([
'The weather is great!',
'The weather is very good!',
'i am going to bed'
], prompt='The representative word for sentence {text} is:"')
# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
Truncate layers and embedding dimensions for flexible model compression.
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity
# Load model
angle = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-2d-large-v1', pooling_strategy='cls').cuda()
# Truncate to specific layer
angle = angle.truncate_layer(layer_index=22)
# Encode with truncated embedding size
doc_vecs = angle.encode([
'The weather is great!',
'The weather is very good!',
'i am going to bed'
], embedding_size=768)
# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))
Load any transformer-based model (e.g., sentence-transformers, BAAI/bge, etc.) using AnglE.
from angle_emb import AnglE
# Load third-party model
model = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-large-v1', pooling_strategy='cls').cuda()
# Encode text
vec = model.encode('hello world', to_numpy=True)
print(vec)
Speed up inference with the batched library (recommended for large-scale processing).
uv pip install batched
import batched
from angle_emb import AnglE
# Load model
model = AnglE.from_pretrained("WhereIsAI/UAE-Large-V1", pooling_strategy='cls').cuda()
# Enable dynamic batching
model.encode = batched.dynamically(model.encode, batch_size=64)
# Encode large batch
vecs = model.encode([
'The weather is great!',
'The weather is very good!',
'i am going to bed'
] * 50)
💡 For complete details, see the official training documentation.
AnglE supports three dataset formats. Choose based on your task:
| Format | Columns | Description | Use Case |
|---|---|---|---|
| Format A | text1, text2, label | Paired texts with similarity scores (0-1) | Similarity scoring |
| Format B | query, positive | Query-document pairs | Retrieval without hard negatives |
| Format C | query, positive, negative | Query with positive and negative samples | Contrastive learning |
Notes:
- All formats use HuggingFace datasets.Dataset
- text1, text2, query, positive, and negative can be str or List[str] (random sampling for lists)
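💡 A minimal sketch of building each format with the HuggingFace datasets library (the sentences, labels, and variable names below are illustrative only):
from datasets import Dataset
# Format A: paired texts with a similarity score in [0, 1]
format_a = Dataset.from_list([
    {"text1": "The weather is great!", "text2": "The weather is very good!", "label": 0.9},
])
# Format B: query-document pairs, no hard negatives
format_b = Dataset.from_list([
    {"query": "what is the weather?", "positive": "The weather is great!"},
])
# Format C: query with positive and negative samples;
# list values are also allowed, in which case one element is sampled at random
format_c = Dataset.from_list([
    {"query": "what is the weather?",
     "positive": ["The weather is great!"],
     "negative": ["i am going to bed"]},
])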
Single GPU:
CUDA_VISIBLE_DEVICES=0 angle-trainer --help
Multi-GPU with FSDP:
CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
--multi_gpu \
--num_processes 4 \
--main_process_port 2345 \
--config_file examples/FSDP/fsdp_config.yaml \
-m angle_emb.angle_trainer \
--gradient_checkpointing 1 \
--use_reentrant 0 \
...
Multi-GPU (Standard):
CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
--multi_gpu \
--num_processes 4 \
--main_process_port 2345 \
-m angle_emb.angle_trainer \
--model_name_or_path YOUR_MODEL \
--train_name_or_path YOUR_DATASET \
...
📁 More examples: examples/Training
from datasets import load_dataset
from angle_emb import AnglE
# Step 1: Load pretrained model
angle = AnglE.from_pretrained(
'SeanLee97/angle-bert-base-uncased-nli-en-v1',
max_length=128,
pooling_strategy='cls'
).cuda()
# Step 2: Prepare dataset (Format A example)
ds = load_dataset('mteb/stsbenchmark-sts')
ds = ds.map(lambda obj: {
"text1": str(obj["sentence1"]),
"text2": str(obj['sentence2']),
"label": obj['score']
})
ds = ds.select_columns(["text1", "text2", "label"])
# Step 3: Train the model
angle.fit(
train_ds=ds['train'].shuffle(),
valid_ds=ds['validation'],
output_dir='ckpts/sts-b',
batch_size=32,
epochs=5,
learning_rate=2e-5,
save_steps=100,
eval_steps=1000,
warmup_steps=0,
gradient_accumulation_steps=1,
loss_kwargs={
'cosine_w': 1.0,
'ibn_w': 1.0,
'angle_w': 0.02,
'cosine_tau': 20,
'ibn_tau': 20,
'angle_tau': 20
},
fp16=True,
logging_steps=100
)
# Step 4: Evaluate
corrcoef = angle.evaluate(ds['test'])
print('Spearman\'s corrcoef:', corrcoef)
| Model Type | CLI Flags | Description |
|---|---|---|
| LLM | --is_llm 1 + LoRA params | Must manually enable LLM mode |
| BiLLM | --apply_billm 1 --billm_model_class LlamaForCausalLM | Bidirectional LLMs (guide) |
| Espresso (ESE) | --apply_ese 1 --ese_kl_temperature 1.0 --ese_compression_size 256 | Matryoshka-style embeddings |
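The following is an illustrative sketch only, combining the LLM flags from the table above with the placeholder arguments (YOUR_DATASET) used in the earlier commands; LoRA-specific parameters are elided:
# Enable LLM mode explicitly; for BiLLM, additionally pass:
#   --apply_billm 1 --billm_model_class LlamaForCausalLM
CUDA_VISIBLE_DEVICES=0 angle-trainer \
    --model_name_or_path NousResearch/Llama-2-7b-hf \
    --train_name_or_path YOUR_DATASET \
    --is_llm 1 \
    ...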
| Format | Flag | Applies To |
|---|---|---|
| Format A | --text_prompt "text: {text}" | Both text1 and text2 |
| Format B/C | --query_prompt "query: {text}" | query field |
| Format B/C | --doc_prompt "document: {text}" | positive and negative fields |
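An illustrative sketch of attaching Format B/C prompts to a training run, reusing the placeholders (YOUR_MODEL, YOUR_DATASET) from the commands above:
# Prompts wrap the corresponding fields; {text} is replaced with the field value
CUDA_VISIBLE_DEVICES=0 angle-trainer \
    --model_name_or_path YOUR_MODEL \
    --train_name_or_path YOUR_DATASET \
    --query_prompt "query: {text}" \
    --doc_prompt "document: {text}" \
    ...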
Adapt old datasets without modification:
# CLI
--column_rename_mapping "text:query"
# Python
column_rename_mapping={"text": "query"}
Convert trained models to sentence-transformers format:
python scripts/convert_to_sentence_transformers.py --help
| Format | Recommendation |
|---|---|
| Format A | Increase cosine_w or decrease ibn_w |
| Format B | Only tune ibn_w and ibn_tau |
| Format C | Set cosine_w=0, angle_w=0.02, and configure cln_w + ibn_w |
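For Format C, for example, a minimal sketch of the corresponding loss weights (the values are illustrative starting points, format_c_train_ds is a hypothetical Format C dataset, and it is assumed that cln_w is passed through loss_kwargs like the other weights), reusing the fit() call shown earlier:
loss_kwargs = {
    'cosine_w': 0.0,   # disable cosine loss for Format C
    'angle_w': 0.02,   # small angle loss weight, per the recommendation above
    'ibn_w': 1.0,      # in-batch negative loss weight (tune per task)
    'cln_w': 1.0,      # contrastive loss weight for Format C (tune per task)
}
angle.fit(
    train_ds=format_c_train_ds,
    output_dir='ckpts/format-c',
    batch_size=32,
    epochs=1,
    loss_kwargs=loss_kwargs,
)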
Prevent Catastrophic Forgetting:
- Set teacher_name_or_path for knowledge distillation (see the sketch below)
- Use the same model path for self-distillation
⚠️ Ensure teacher and student use the same tokenizer
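A sketch under the assumption that teacher_name_or_path is exposed as an angle-trainer flag (model and dataset names are placeholders); passing the student's own path turns this into self-distillation:
CUDA_VISIBLE_DEVICES=0 angle-trainer \
    --model_name_or_path YOUR_MODEL \
    --teacher_name_or_path YOUR_TEACHER_MODEL \
    --train_name_or_path YOUR_DATASET \
    ...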
| Task | Status | Notes |
|---|---|---|
| Training | | SentenceTransformers has AnglE loss, but use official angle_emb for best results |
| Inference | ✅ Full | Convert trained models: examples/convert_to_sentence_transformers.py |
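For inference, a minimal sketch using the sentence-transformers API directly, assuming the model has been converted with the script above (WhereIsAI/UAE-Large-V1 is used here only as an example checkpoint that is already available in sentence-transformers format):
from sentence_transformers import SentenceTransformer, util
# Load the (converted) model, encode two sentences, and compare them
model = SentenceTransformer('WhereIsAI/UAE-Large-V1')
vecs = model.encode(['The weather is great!', 'it is rainy today.'])
print(util.cos_sim(vecs[0], vecs[1]))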
If you use our code and pre-trained models, please support us by citing our work as follows:
@article{li2023angle,
title={AnglE-optimized Text Embeddings},
author={Li, Xianming and Li, Jing},
journal={arXiv preprint arXiv:2309.12871},
year={2023}
}
| 📅 | Description |
|---|---|
| 2025 Jan | v0.6.0 - Major refactoring 🎉: <br>• Removed AngleDataTokenizer - no need to pre-tokenize datasets! <br>• Removed DatasetFormats class - use string literals ('A', 'B', 'C') <br>• Removed auto-detection of LLM models - set is_llm manually <br>• Renamed --prompt_template to --text_prompt (Format A only) <br>• Added --query_prompt and --doc_prompt for Format B/C <br>• Added --column_rename_mapping to adapt old datasets without modification <br>• Updated data formats: Format B/C now use query, positive, negative fields <br>• Support list-based sampling in Format B/C <br>• Updated examples to use accelerate launch <br>• See MIGRATION_GUIDE.md for upgrade instructions |
| 2024 May 21 | support Espresso Sentence Embeddings |
| 2024 Feb 7 | support training with only positive pairs (Format C: query, positive) |
| 2023 Dec 4 | Release a universal English sentence embedding model: WhereIsAI/UAE-Large-V1 |
| 2023 Nov 2 | Release an English pretrained model: SeanLee97/angle-llama-13b-nli |
| 2023 Oct 28 | Release two Chinese pretrained models: SeanLee97/angle-roberta-wwm-base-zhnli-v1 and SeanLee97/angle-llama-7b-zhnli-v1; add Chinese README.md |
If you have any questions or suggestions, please feel free to contact us via email: [email protected]
This project is licensed under the MIT License. For the pretrained models, please refer to the corresponding license of the models.