ritesh-modi/fine-tuning-embeddings-template

Fine-Tuning Embeddings Template

A template repository for fine-tuning sentence embedding models using the sentence-transformers package. This project supports fine-tuning on various types of datasets and loss functions, including synthetic dataset generation from PDF files using Azure OpenAI.

🚀 Features

  • Fine-tune any SentenceTransformer-compatible model
  • Supports multiple dataset formats:
    • positive_pair
    • triplets
    • pair_with_score
  • Generates synthetic training data from PDFs using Azure OpenAI
  • Compatible with various loss functions:
    • matryoshka
    • triplet
    • contrastive
    • cosine_similarity
  • Evaluates the fine-tuned model using metrics such as NDCG@k
  • CLI and config file-based execution
  • Outputs fine-tuned models and evaluation results

📁 Folder Structure

.
├── config/               # Configuration files for fine-tuning and evaluation
├── data/                 # Place your PDF files here for data generation
├── output/               # Fine-tuned models and evaluation metrics
├── src/                  # Core source code
├── main.py               # Entry point for training/evaluation
├── .env                  # Azure OpenAI secrets
└── requirements.txt      # Dependencies

🔧 Getting Started

1. Clone the Repository

git clone https://github.com/ritesh-modi/fine-tuning-embeddings-template.git
cd fine-tuning-embeddings-template

2. Create a Python Environment

Using conda (recommended):

conda create -n fine-tune-env python=3.10
conda activate fine-tune-env

3. Install Dependencies in Editable Mode

python -m pip install -e .
python -m pip install -r requirements.txt

4. Set Azure OpenAI Secrets

Create a .env file in the root directory:

AZURE_OPENAI_ENDPOINT=
AZURE_OPENAI_API_KEY=
API_VERSION=
AZURE_DEPLOYMENT=
MODEL_NAME=
TEMPERATURE=0.0
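
Before running anything that calls Azure OpenAI, it can help to confirm that the .env file is complete. The following is a minimal sketch using only the standard library; `parse_env_file` and `missing_vars` are hypothetical helpers, not part of this repository, and treating TEMPERATURE as optional is an assumption based on the default shown above.

```python
# Variable names are taken from the .env template above.
# TEMPERATURE is assumed optional because the template gives it a default.
REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "API_VERSION",
    "AZURE_DEPLOYMENT",
    "MODEL_NAME",
]


def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_vars(values):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]
```

For example, `missing_vars(parse_env_file(open(".env").read()))` returns an empty list once every required value has been filled in.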

📘 Usage Instructions

Step 1: Add Your PDF Files

Place your PDF documents inside the data/ directory. The repo will generate synthetic training data from these using Azure OpenAI.

Step 2: Update Configuration

Modify the configuration files inside the config/ folder to specify:

  • Dataset type (positive_pair, triplets, pair_with_score)
  • Model to fine-tune
  • Loss function
  • Output paths

⚠️ Note: Not all dataset types are compatible with every loss function. Refer to the official sentence-transformers documentation for compatibility.

Example:

  • triplets dataset → TripletLoss
  • positive_pair dataset → CosineSimilarityLoss, etc.
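
To make the three dataset types more concrete, here is a rough sketch of what one row of each might look like. The field names (`anchor`, `positive`, `negative`, `sentence1`, `sentence2`, `score`) and the sample sentences are assumptions for illustration only; the columns this repository actually generates may be named differently.

```python
from typing import NamedTuple


class PositivePair(NamedTuple):
    """Two texts that should end up close together in embedding space."""
    anchor: str
    positive: str


class Triplet(NamedTuple):
    """An anchor with one matching and one non-matching text."""
    anchor: str
    positive: str
    negative: str


class PairWithScore(NamedTuple):
    """Two texts with a numeric similarity label."""
    sentence1: str
    sentence2: str
    score: float


# Made-up rows, keyed by the dataset type names used in this README.
EXAMPLES = {
    "positive_pair": PositivePair(
        "How long is the refund window?",
        "Refunds are accepted within 30 days of purchase.",
    ),
    "triplets": Triplet(
        "How long is the refund window?",
        "Refunds are accepted within 30 days of purchase.",
        "Shipping usually takes three to five business days.",
    ),
    "pair_with_score": PairWithScore(
        "How long is the refund window?",
        "What is the return policy period?",
        0.9,
    ),
}
```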

Step 3: Run Fine-Tuning

python main.py --config_path config/train_config.yaml

This will:

  • Generate synthetic training data (if configured)
  • Load and preprocess the data
  • Train the model with the selected loss function
  • Evaluate the fine-tuned model and store results in the output/ directory
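
For readers curious how an entry point can accept the `--config_path` flag shown above, the following is a minimal sketch using `argparse` from the standard library. It is not the repository's actual main.py; `build_parser` and `main` are hypothetical names, and the sketch stops at reading the path rather than loading the config or training.

```python
import argparse


def build_parser():
    """Build a parser that accepts the --config_path flag used above."""
    parser = argparse.ArgumentParser(
        description="Fine-tune an embedding model from a config file."
    )
    parser.add_argument(
        "--config_path",
        required=True,
        help="Path to the configuration file, e.g. config/train_config.yaml",
    )
    return parser


def main(argv=None):
    """Parse arguments and return the config path (training is omitted)."""
    args = build_parser().parse_args(argv)
    return args.config_path
```

Calling `main(["--config_path", "config/train_config.yaml"])` returns the path string; passing `argv=None` makes `argparse` read the real command line instead.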

📊 Evaluation

Evaluation metrics such as NDCG@k, MAP, and Recall are calculated automatically after training. The fine-tuned model is saved in the output/ directory.
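
As a reference for interpreting NDCG@k, here is a small standalone sketch of the standard formula: the discounted cumulative gain of the ranking divided by that of the ideal ordering. It assumes the input list holds the relevance labels of the retrieved documents in ranked order and that all relevant documents appear in that list; the repository's own evaluation code may compute the metric differently.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results.

    The result at 0-based position i is discounted by log2(i + 2),
    so the first result has a discount of log2(2) = 1.
    """
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG@k normalized by the DCG@k of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents, so the score is defined as 0
    return dcg_at_k(relevances, k) / ideal
```

A ranking that places the only relevant document first scores 1.0, while placing it second scores 1 / log2(3), roughly 0.63.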


🧠 Dataset & Loss Function Compatibility

Ensure the dataset format is compatible with the selected loss function. Here are some guidelines:

| Dataset Type    | Supported Loss Functions             |
|-----------------|--------------------------------------|
| triplets        | TripletLoss                          |
| positive_pair   | CosineSimilarityLoss, ContrastiveLoss |
| pair_with_score | MatryoshkaLoss, CosineSimilarityLoss |
📖 Visit the SentenceTransformers Loss Functions Docs for detailed compatibility.


📌 Requirements

  • Python 3.10+
  • Conda (optional but recommended)
  • Azure OpenAI account with deployment

Install all dependencies using:

python -m pip install -e .
python -m pip install -r requirements.txt

🙌 Contributing

Contributions, issues, and feature requests are welcome! Feel free to open an issue or submit a pull request.


📄 License

This project is licensed under the MIT License. See the LICENSE file for details.


✨ Acknowledgments
