Synthetic Data Kit

Tool for generating high-quality synthetic datasets to fine-tune LLMs.

Generate reasoning traces and QA pairs, then save them to a fine-tuning format, all with a simple CLI.

Check out our guide on using the tool to unlock task-specific reasoning in the Llama-3 family.

What do we offer?

Fine-tuning large language models is easy. There are many mature tools you can use to fine-tune the Llama model family with various post-training techniques.

Why target data preparation?

Multiple tools support standardised formats. However, most of the time your dataset is not structured as "user"/"assistant" threads or in a format that plays well with fine-tuning packages.

This toolkit simplifies the journey of:

  • Using a local LLM (via vLLM) to generate examples
  • A modular 4-command flow
  • Converting your existing files to fine-tuning-friendly formats
  • Creating synthetic datasets
  • Supporting various post-training fine-tuning formats

How do we offer it?

The tool is designed to follow a simple CLI structure with 4 commands:

  • ingest: convert various file formats to plain text
  • create: generate your fine-tuning data (QA pairs, QA pairs with CoT, or a summary format)
  • curate: use Llama as a judge to keep only high-quality examples
  • save-as: save the curated examples in the format your fine-tuning workflow requires

You can override any parameter or detail either via the CLI or by overriding the default YAML config, as sketched below.
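
For example (a minimal sketch; both flags appear in the examples later in this README):

# Override a single parameter on the command line
synthetic-data-kit create data/output/report.txt -n 40

# Or point the whole run at a custom YAML config
synthetic-data-kit -c my_config.yaml create data/output/report.txt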

Installation

From PyPI

# Create a new environment
conda create -n synthetic-data python=3.10
conda activate synthetic-data

pip install synthetic-data-kit

(Alternatively) From Source

git clone https://github.com/meta-llama/synthetic-data-kit.git
cd synthetic-data-kit
pip install -e .

To get an overview of the commands, type:

synthetic-data-kit --help

1. Tool Setup

  • The tool expects input files to be placed in the named folders created below.
  • We also require a running vLLM server hosting the LLM that we will use to generate the dataset.
# Create directory structure
mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}

# Start the vLLM server
# Note: you will need a Hugging Face token from https://huggingface.co/settings/tokens
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000
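
vLLM exposes an OpenAI-compatible API, so once the server is up you can also sanity-check it directly (assuming the default api_base from the configuration section below):

# Quick sanity check of the vLLM endpoint
curl http://localhost:8000/v1/models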

2. Usage

The flow follows 4 simple steps: ingest, create, curate, save-as. Place your file in the respective folder, then run:

# Check if the vLLM server is running
synthetic-data-kit system-check

# Parse a document to text
synthetic-data-kit ingest docs/report.pdf
# This saves the extracted text to data/output/report.txt

# Generate QA pairs (default)
synthetic-data-kit create data/output/report.txt --type qa

# OR generate Chain of Thought (CoT) reasoning examples
synthetic-data-kit create data/output/report.txt --type cot

# Both of these save their output to data/generated/report_qa_pairs.json

# Filter content based on quality
synthetic-data-kit curate data/generated/report_qa_pairs.json

# Convert to alpaca fine-tuning format and save as HF arrow file
synthetic-data-kit save-as data/cleaned/report_cleaned.json --format alpaca --storage hf
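
If you want to inspect intermediate results, the generated pairs are plain JSON. A minimal peek (the per-pair fields follow the QA prompt template shown under Custom Prompt Templates; the exact surrounding structure is an assumption):

# Peek at the generated QA pairs
head -c 400 data/generated/report_qa_pairs.json
# Illustrative shape only:
# [
#   {"question": "What does the report conclude?", "answer": "..."},
#   ...
# ]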

Configuration

The toolkit uses a YAML configuration file (default: configs/config.yaml).

Note: this can be overridden via CLI arguments or by passing a custom YAML file.

# Example configuration
vllm:
  api_base: "http://localhost:8000/v1"
  model: "meta-llama/Llama-3.3-70B-Instruct"

generation:
  temperature: 0.7
  chunk_size: 4000
  num_pairs: 25

curate:
  threshold: 7.0
  batch_size: 8

Customizing Configuration

Create an overriding configuration file and use it with the -c flag:

synthetic-data-kit -c my_config.yaml ingest docs/paper.pdf
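
A minimal sketch of such a file, assuming you only want to change a few keys (the values mirror the example configuration above; treating unspecified keys as falling back to the defaults is an assumption):

# Write a small override config and use it for a run
cat > my_config.yaml <<'EOF'
vllm:
  model: "meta-llama/Llama-3.3-70B-Instruct"

generation:
  num_pairs: 50

curate:
  threshold: 8.0
EOF

synthetic-data-kit -c my_config.yaml create data/output/report.txt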

Examples

Processing a PDF Document

# Ingest PDF
synthetic-data-kit ingest research_paper.pdf

# Generate QA pairs
synthetic-data-kit create data/output/research_paper.txt -n 30 --threshold 8.0

# Curate data
synthetic-data-kit curate data/generated/research_paper_qa_pairs.json -t 8.5

# Save in OpenAI fine-tuning format (JSON)
synthetic-data-kit save-as data/cleaned/research_paper_cleaned.json -f ft

# Save in OpenAI fine-tuning format (HF dataset)
synthetic-data-kit save-as data/cleaned/research_paper_cleaned.json -f ft --storage hf

Processing a YouTube Video

# Extract transcript
synthetic-data-kit ingest "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Generate QA pairs from the transcript
synthetic-data-kit create data/output/youtube_dQw4w9WgXcQ.txt

Processing Multiple Files

# Bash script to process multiple files
for file in data/pdf/*.pdf; do
  filename=$(basename "$file" .pdf)
  
  synthetic-data-kit ingest "$file"
  synthetic-data-kit create "data/output/${filename}.txt" -n 20
  synthetic-data-kit curate "data/generated/${filename}_qa_pairs.json" -t 7.5
  synthetic-data-kit save-as "data/cleaned/${filename}_cleaned.json" -f chatml
done

Advanced Usage

Custom Prompt Templates

Edit the prompts section in your configuration file to customize generation behavior:

prompts:
  qa_generation: |
    You are creating question-answer pairs for fine-tuning a legal assistant.
    Focus on technical legal concepts, precedents, and statutory interpretation.
    
    Below is a chunk of text about: {summary}...
    
    Create {num_pairs} high-quality question-answer pairs based ONLY on this text.
    
    Return ONLY valid JSON formatted as:
    [
      {
        "question": "Detailed legal question?",
        "answer": "Precise legal answer."
      },
      ...
    ]
    
    Text:
    ---
    {text}
    ---
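
Here {summary}, {num_pairs}, and {text} are placeholders that the toolkit fills in at generation time; the rest of the template is your instruction text.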

Mental Model:

graph LR
    SDK --> SystemCheck[system-check]
    SDK[synthetic-data-kit] --> Ingest[ingest]
    SDK --> Create[create]
    SDK --> Curate[curate]
    SDK --> SaveAs[save-as]
    
    Ingest --> PDFFile[PDF File]
    Ingest --> HTMLFile[HTML File]
    Ingest --> YouTubeURL[YouTube URL]

    
    Create --> CoT[CoT]
    Create --> QA[QA Pairs]
    Create --> Summary[Summary]
    
    Curate --> Filter[Filter by Quality]
    
    SaveAs --> JSONL[JSONL Format]
    SaveAs --> Alpaca[Alpaca Format]
    SaveAs --> FT[Fine-Tuning Format]
    SaveAs --> ChatML[ChatML Format]

Troubleshooting FAQs:

vLLM Server Issues

  • Ensure vLLM is installed: pip install vllm
  • Start the server with: vllm serve <model_name> --port 8000
  • Check connection: synthetic-data-kit system-check

Memory Issues

If you encounter CUDA out of memory errors:

  • Use a smaller model
  • Reduce batch size in config
  • Start vLLM with --gpu-memory-utilization 0.85, as sketched below
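
For example (a sketch; the 8B model named here is just one possible smaller substitute):

# Serve a smaller model with capped GPU memory usage
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 --gpu-memory-utilization 0.85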

JSON Parsing Issues

If you encounter issues with the curate command:

  • Use the -v flag to enable verbose output
  • Set smaller batch sizes in your config.yaml
  • Ensure the LLM model supports proper JSON output
  • Install json5 for enhanced JSON parsing: pip install json5

Parser Errors

  • Ensure the required dependencies are installed for the specific parsers (or install them all at once, as shown after this list):
    • PDF: pip install pdfminer.six
    • HTML: pip install beautifulsoup4
    • YouTube: pip install pytube youtube-transcript-api
    • DOCX: pip install python-docx
    • PPTX: pip install python-pptx
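
To have every parser available, the same packages can be installed in one go:

pip install pdfminer.six beautifulsoup4 pytube youtube-transcript-api python-docx python-pptx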

License

Read more about the License

Contributing

Contributions are welcome! Read our contributing guide