Tool for generating high-quality synthetic datasets to fine-tune LLMs.
Generate Reasoning Traces, QA Pairs, save them to a fine-tuning format with a simple CLI.
Checkout our guide on using the tool to unlock task-specific reasoning in Llama-3 family
Fine-Tuning Large Language Models is easy. There are many mature tools that you can use to fine-tune Llama model family using various post-training techniques.
Multiple tools support standardised formats. However, most of the times your dataset is not structured in "user", "assistant" threads or in a certain format that plays well with a fine-tuning packages.
This toolkit simplifies the journey of:
- Using local LLM (via vLLM) to generate examples
- Modular 4 command flow
- Converting your existing files to fine-tuning friendly formats
- Creating synthetic datasets
- Supporting various formats of post-training fine-tuning
The tool is designed to follow a simple CLI structure with 4 commands:
ingest
various file formatscreate
your fine-tuning format:QA
pairs,QA
pairs with CoT,summary
formatcurate
: Using Llama as a judge to curate high quality examples.save-as
: After that you can simply save these to a format that your fine-tuning workflow requires.
You can override any parameter or detail by either using the CLI or overiding the default YAML config.
# Create a new environment
conda create -n synthetic-data python=3.10
conda activate synthetic-data
pip install synthetic-data-kit
git clone https://github.com/meta-llama/synthetic-data-kit.git
cd synthetic-data-kit
pip install -e .
To get an overview of commands type:
synthetic-data-kit --help
- The tool expects respective files to be put in named folders.
- We also require a vLLM server running the LLM that we will utilise for generating our dataset.
# Create directory structure
mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}
# Start VLLM server
# Note you will need to grab your HF Authentication from: https://huggingface.co/settings/tokens
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000
The flow follows 4 simple steps: ingest
, create
, curate
, save-as
, please paste your file into the respective folder:
# Check if VLLM server is running
synthetic-data-kit system-check
# Parse a document to text
synthetic-data-kit ingest docs/report.pdf
# This will save file to data/output/report.txt
# Generate QA pairs (default)
synthetic-data-kit create data/output/report.txt --type qa
OR
# Generate Chain of Thought (CoT) reasoning examples
synthetic-data-kit create data/output/report.txt --type cot
# Both of these will save file to data/generated/report_qa_pairs.json
# Filter content based on quality
synthetic-data-kit curate data/generated/report_qa_pairs.json
# Convert to alpaca fine-tuning format and save as HF arrow file
synthetic-data-kit save-as data/cleaned/report_cleaned.json --format alpaca --storage hf
The toolkit uses a YAML configuration file (default: configs/config.yaml
).
Note, this can be overriden via either CLI arguments OR passing a custom YAML file
# Example configuration
vllm:
api_base: "http://localhost:8000/v1"
model: "meta-llama/Llama-3.3-70B-Instruct"
generation:
temperature: 0.7
chunk_size: 4000
num_pairs: 25
curate:
threshold: 7.0
batch_size: 8
Create a overiding configuration file and use it with the -c
flag:
synthetic-data-kit -c my_config.yaml ingest docs/paper.pdf
# Ingest PDF
synthetic-data-kit ingest research_paper.pdf
# Generate QA pairs
synthetic-data-kit create data/output/research_paper.txt -n 30 --threshold 8.0
# Curate data
synthetic-data-kit curate data/generated/research_paper_qa_pairs.json -t 8.5
# Save in OpenAI fine-tuning format (JSON)
synthetic-data-kit save-as data/cleaned/research_paper_cleaned.json -f ft
# Save in OpenAI fine-tuning format (HF dataset)
synthetic-data-kit save-as data/cleaned/research_paper_cleaned.json -f ft --storage hf
# Extract transcript
synthetic-data-kit ingest "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
# Generate QA pairs with specific model
synthetic-data-kit create data/output/youtube_dQw4w9WgXcQ.txt
# Bash script to process multiple files
for file in data/pdf/*.pdf; do
filename=$(basename "$file" .pdf)
synthetic-data-kit ingest "$file"
synthetic-data-kit create "data/output/${filename}.txt" -n 20
synthetic-data-kit curate "data/generated/${filename}_qa_pairs.json" -t 7.5
synthetic-data-kit save-as "data/cleaned/${filename}_cleaned.json" -f chatml
done
Edit the prompts
section in your configuration file to customize generation behavior:
prompts:
qa_generation: |
You are creating question-answer pairs for fine-tuning a legal assistant.
Focus on technical legal concepts, precedents, and statutory interpretation.
Below is a chunk of text about: {summary}...
Create {num_pairs} high-quality question-answer pairs based ONLY on this text.
Return ONLY valid JSON formatted as:
[
{
"question": "Detailed legal question?",
"answer": "Precise legal answer."
},
...
]
Text:
---
{text}
---
graph LR
SDK --> SystemCheck[system-check]
SDK[synthetic-data-kit] --> Ingest[ingest]
SDK --> Create[create]
SDK --> Curate[curate]
SDK --> SaveAs[save-as]
Ingest --> PDFFile[PDF File]
Ingest --> HTMLFile[HTML File]
Ingest --> YouTubeURL[File Format]
Create --> CoT[CoT]
Create --> QA[QA Pairs]
Create --> Summary[Summary]
Curate --> Filter[Filter by Quality]
SaveAs --> JSONL[JSONL Format]
SaveAs --> Alpaca[Alpaca Format]
SaveAs --> FT[Fine-Tuning Format]
SaveAs --> ChatML[ChatML Format]
- Ensure VLLM is installed:
pip install vllm
- Start server with:
vllm serve <model_name> --port 8000
- Check connection:
synthetic-data-kit system-check
If you encounter CUDA out of memory errors:
- Use a smaller model
- Reduce batch size in config
- Start VLLM with
--gpu-memory-utilization 0.85
If you encounter issues with the curate
command:
- Use the
-v
flag to enable verbose output - Set smaller batch sizes in your config.yaml
- Ensure the LLM model supports proper JSON output
- Install json5 for enhanced JSON parsing:
pip install json5
- Ensure required dependencies are installed for specific parsers:
- PDF:
pip install pdfminer.six
- HTML:
pip install beautifulsoup4
- YouTube:
pip install pytube youtube-transcript-api
- DOCX:
pip install python-docx
- PPTX:
pip install python-pptx
- PDF:
Read more about the License
Contributions are welcome! Read our contributing guide