A template repository for fine-tuning sentence embedding models using the sentence-transformers package. This project supports fine-tuning on various types of datasets and loss functions, including synthetic dataset generation from PDF files using Azure OpenAI.
- Fine-tune any SentenceTransformer-compatible model
- Supports multiple dataset formats:
  - `positive_pair`
  - `triplets`
  - `pair_with_score`
- Generates synthetic training data from PDFs using Azure OpenAI
- Compatible with various loss functions:
  - `matryoshka`
  - `triplet`
  - `contrastive`
  - `cosine_similarity`
- Evaluates the fine-tuned model using metrics such as NDCG@k
- CLI and config file-based execution
- Outputs fine-tuned models and evaluation results
.
├── config/ # Configuration files for fine-tuning and evaluation
├── data/ # Place your PDF files here for data generation
├── output/ # Fine-tuned models and evaluation metrics
├── src/ # Core source code
├── main.py # Entry point for training/evaluation
├── .env # Azure OpenAI secrets
└── requirements.txt # Dependencies
git clone https://github.com/ritesh-modi/fine-tuning-embeddings-template.git
cd fine-tuning-embeddings-template

Using conda (recommended):

conda create -n fine-tune-env python=3.10
conda activate fine-tune-env

python -m pip install -e .
python -m pip install -r requirements.txt

Create a .env file in the root directory:
AZURE_OPENAI_ENDPOINT=
AZURE_OPENAI_API_KEY=
API_VERSION=
AZURE_DEPLOYMENT=
MODEL_NAME=
TEMPERATURE=0.0

Place your PDF documents inside the data/ directory. The repo generates synthetic training data from them using Azure OpenAI.
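Before launching a run, it can help to confirm that the Azure OpenAI settings are actually present. The following is a minimal sketch using only the standard library. The helper name `missing_azure_settings` is hypothetical and not part of this repo, and the sketch assumes the `.env` values have already been loaded into the process environment (for example by a dotenv loader). Treating `TEMPERATURE` as optional is also an assumption.

```python
import os

# Keys taken from the .env listing; TEMPERATURE is treated as optional (assumption).
REQUIRED_KEYS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "API_VERSION",
    "AZURE_DEPLOYMENT",
    "MODEL_NAME",
]


def missing_azure_settings(env=None):
    """Return the required keys that are unset or blank in `env`.

    `env` defaults to os.environ; pass a plain dict to check other sources.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_azure_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All Azure OpenAI settings are present.")
```

Running the check first gives a readable message instead of an authentication error partway through data generation.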
Modify the configuration files inside the config/ folder to specify:
- Dataset type (`positive_pair`, `triplets`, `pair_with_score`)
- Model to fine-tune
- Loss function
- Output paths
Refer to the sentence-transformers documentation for compatibility.
Example: `triplets` dataset → `TripletLoss`, `positive_pair` → `CosineSimilarityLoss`, etc.
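To make the three dataset types concrete, here is a small sketch of what a single row of each might look like. The column names (`anchor`, `positive`, `negative`, `sentence1`, `sentence2`, `score`) follow common sentence-transformers conventions and are assumptions; the files this repo generates may use different names. The example sentences are invented for illustration.

```python
# Hypothetical example rows; column names are assumptions, not taken from this repo.
EXAMPLE_ROWS = {
    "positive_pair": {
        "anchor": "What is the refund window?",
        "positive": "Refunds are accepted within 30 days of purchase.",
    },
    "triplets": {
        "anchor": "What is the refund window?",
        "positive": "Refunds are accepted within 30 days of purchase.",
        "negative": "Shipping usually takes five business days.",
    },
    "pair_with_score": {
        "sentence1": "What is the refund window?",
        "sentence2": "Refunds are accepted within 30 days of purchase.",
        "score": 0.9,  # similarity label, typically in [0, 1]
    },
}

# Columns each dataset type is assumed to carry.
EXPECTED_COLUMNS = {
    "positive_pair": {"anchor", "positive"},
    "triplets": {"anchor", "positive", "negative"},
    "pair_with_score": {"sentence1", "sentence2", "score"},
}


def has_expected_columns(dataset_type, row):
    """Check that `row` has exactly the columns assumed for `dataset_type`."""
    return set(row) == EXPECTED_COLUMNS[dataset_type]


if __name__ == "__main__":
    for name, row in EXAMPLE_ROWS.items():
        print(name, has_expected_columns(name, row))
```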
python main.py --config_path config/train_config.yaml

This will:
- Generate synthetic training data (if configured)
- Load and preprocess the data
- Train the model with the selected loss function
- Evaluate the fine-tuned model and store results in the output/ directory
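The "train the model with the selected loss function" step could be wired up roughly as sketched below. This is not the repo's actual implementation: `build_loss` and `SUPPORTED_LOSSES` are hypothetical names, and the config strings are assumed to match the loss names in the feature list. The loss classes themselves come from `sentence_transformers.losses`. Wrapping `CosineSimilarityLoss` inside `MatryoshkaLoss` and the default dimensions are illustrative choices.

```python
# Loss names as they appear in the feature list (assumed to be the config values).
SUPPORTED_LOSSES = ("matryoshka", "triplet", "contrastive", "cosine_similarity")


def build_loss(loss_name, model, matryoshka_dims=(256, 128, 64)):
    """Return a sentence-transformers loss object for `loss_name`.

    `matryoshka_dims` is illustrative; each value must not exceed the
    embedding dimension of `model`.
    """
    if loss_name not in SUPPORTED_LOSSES:
        raise ValueError(
            f"Unknown loss {loss_name!r}; expected one of {SUPPORTED_LOSSES}"
        )

    # Imported lazily so the name check above works without the library installed.
    from sentence_transformers import losses

    if loss_name == "triplet":
        return losses.TripletLoss(model)
    if loss_name == "contrastive":
        return losses.ContrastiveLoss(model)
    if loss_name == "cosine_similarity":
        return losses.CosineSimilarityLoss(model)

    # MatryoshkaLoss wraps a base loss; the base used here is an assumption.
    base_loss = losses.CosineSimilarityLoss(model)
    return losses.MatryoshkaLoss(
        model, base_loss, matryoshka_dims=list(matryoshka_dims)
    )
```

Validating the name before importing the library keeps configuration typos cheap to catch, since no model needs to be loaded first.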
Evaluation metrics such as NDCG@k, MAP, and Recall are calculated automatically after training. The fine-tuned model is saved in the output/ directory.
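For readers unfamiliar with NDCG@k, the metric can be computed by hand as in the sketch below. This is a plain-Python illustration using the standard log2 discount, not the evaluator this repo calls. The function names are illustrative, and it assumes `relevances` holds the relevance label of every candidate document in ranked order, so that the ideal ordering can be derived from the same list.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # The relevant document is ranked second out of three candidates.
    print(round(ndcg_at_k([0, 1, 0], k=3), 4))
```

A score of 1.0 means the ranking already places the most relevant documents first; lower values indicate that relevant documents appear further down the list.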
Ensure the dataset format is compatible with the selected loss function. Here are some guidelines:
| Dataset Type | Supported Loss Functions |
|---|---|
| `triplets` | `TripletLoss` |
| `positive_pair` | `CosineSimilarityLoss`, `ContrastiveLoss` |
| `pair_with_score` | `MatryoshkaLoss`, `CosineSimilarityLoss` |
📖 Visit the SentenceTransformers Loss Functions Docs for detailed compatibility.
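The compatibility guidelines can also be encoded as a small lookup so that a mismatched configuration fails early. The sketch below mirrors the dataset and loss pairings listed in this README. `COMPATIBLE_LOSSES` and `check_compatibility` are hypothetical names and not part of this repo.

```python
# Dataset type -> loss functions listed as compatible in this README.
COMPATIBLE_LOSSES = {
    "triplets": {"TripletLoss"},
    "positive_pair": {"CosineSimilarityLoss", "ContrastiveLoss"},
    "pair_with_score": {"MatryoshkaLoss", "CosineSimilarityLoss"},
}


def check_compatibility(dataset_type, loss_name):
    """Raise ValueError if the pairing is not listed as compatible."""
    if dataset_type not in COMPATIBLE_LOSSES:
        raise ValueError(f"Unknown dataset type: {dataset_type!r}")
    allowed = COMPATIBLE_LOSSES[dataset_type]
    if loss_name not in allowed:
        raise ValueError(
            f"{loss_name!r} is not listed for {dataset_type!r}; "
            f"expected one of {sorted(allowed)}"
        )


if __name__ == "__main__":
    check_compatibility("triplets", "TripletLoss")  # passes silently
    try:
        check_compatibility("triplets", "ContrastiveLoss")
    except ValueError as err:
        print(err)
```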
- Python 3.10+
- Conda (optional but recommended)
- Azure OpenAI account with deployment
Install all dependencies using:
python -m pip install -e .
python -m pip install -r requirements.txt

Contributions, issues, and feature requests are welcome! Feel free to open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.