Configurable Workflows for Synthetic Data Generation.
Config Synth Flow provides a configurable, flexible pipeline system that allows you to:
- Build modular pipelines, from data processing to data generation, through YAML configuration
- Synthesize data asynchronously with high concurrency
- Extend the system easily with custom pipeline components
This project uses Poetry for dependency management. Install Poetry first:

```bash
curl -sSL https://install.python-poetry.org | python3 -
```

Install dependencies:

```bash
poetry install
```

Activate the virtual environment:

```bash
poetry shell
```

To run a simple pipeline, use the `run-seq` command:

```bash
run-seq configs/examples/magpie.yml
```

ConfigSynthFlow uses YAML configuration files to define the pipeline structure. Here's a simple example:
```yaml
import_path: SequentialExecutor
init_kwargs:
  reader:
    import_path: HfDatasetReader
    init_kwargs:
      debug: true    # debug mode
      resume: false  # resume from the last processed data
      dataset_kwargs:
        path: json
        num_proc: 10
        data_files:
          - <your_data_path>
  writer:
    import_path: HfWriter  # will output in jsonl format
    init_kwargs:
      chunk_size: 5000          # number of records to save in one file
      output_path: result/test  # output path
  pipes:
    - import_path: ChatGenerator  # chat generator
      async_cfg:
        concurrency: 100
        batch_size: 10000
      init_kwargs:
        prompt_type_col: role  # prompt type column
        litellm_kwargs:
          model: gpt-4.1-nano  # llm model
        system_template:
          - name: zhtw       # template name
            weight: 1        # sample weight
            template_str: |  # template string
              You are an AI assistant skilled at rewriting long texts. Please rewrite the following article without adding any extra prefix or suffix. Your rewrite must cover as many of the original article's key points as possible and must not omit any important details.
        user_template:
          - name: zhtw
            template_str: |  # jinja template to format the prompt
              # Task
              - Rewrite the article into wiki-style paragraphs and remove the unimportant parts.
              # Article
              {{ text }}
        output_col: rephrase  # output key to save
    - import_path: RemoveColumns  # remove columns
```

Configuration files support:
- Pipeline instantiation via `import_path` and `init_kwargs`
- Recursive pipeline definition (pipelines can contain other pipelines)
- Fuzzy matching of pipeline names (`import_path`)
- Environment variable interpolation for sensitive values
See `configs/examples/` for more examples.
BasePipeline is the foundational class for all pipelines, providing:
- Configuration loading and parsing
- Logging capabilities
```python
from config_synth_flow.base import BasePipeline, PipelineConfig

# Create from a YAML file
pipeline = BasePipeline.from_yaml("config.yml")

# Create from a config object
config = PipelineConfig(import_path="path.to.Class", init_kwargs={...})
pipeline = BasePipeline(config)
```

The asynchronous pipeline base extends BasePipeline with asynchronous processing capabilities (see the sketch after this list), allowing for:
- Concurrent data processing
- Batch processing with configurable batch sizes
- Asynchronous API calls
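For intuition only, and not the library's internal implementation, bounded-concurrency batch processing with asyncio typically looks like the sketch below; `process_one`, the item shape, and the numbers are illustrative stand-ins for the `concurrency` and `batch_size` fields shown in `async_cfg` above.

```python
import asyncio

async def process_one(item: dict) -> dict:
    # Stand-in for an asynchronous API call (e.g., an LLM request).
    await asyncio.sleep(0)
    return item

async def process_batch(batch: list[dict], concurrency: int = 100) -> list[dict]:
    # A semaphore caps the number of in-flight requests, mirroring the
    # `concurrency` setting; the caller slices data into batches beforehand,
    # mirroring `batch_size`.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(item: dict) -> dict:
        async with sem:
            return await process_one(item)

    return await asyncio.gather(*(bounded(item) for item in batch))

# Example: results = asyncio.run(process_batch(items, concurrency=100))
```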
The multiprocessing pipeline base provides parallel processing capabilities using Python's multiprocessing module (see the sketch after this list):
- Process data across multiple CPU cores
- Handle CPU-bound tasks efficiently
- Scale processing across available hardware
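Conceptually, and again not the library's own code, distributing CPU-bound work across cores follows the standard multiprocessing pattern; `transform` is a hypothetical stand-in for a pipeline step.

```python
from multiprocessing import Pool

def transform(item: dict) -> dict:
    # Stand-in for a CPU-bound transformation step.
    return {**item, "processed": True}

def run_parallel(items: list[dict], workers: int = 4) -> list[dict]:
    # Fan the items out across `workers` processes and collect the results.
    with Pool(processes=workers) as pool:
        return pool.map(transform, items)

if __name__ == "__main__":
    print(run_parallel([{"id": 1}, {"id": 2}]))
```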
BaseReader is the base input handler for pipelines (a sketch of the resume mechanism follows the list):
- Reads data from various sources (files, databases, etc.)
- Provides resumable processing support
- Handles data ID generation and tracking
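As a rough sketch of that idea (the function names and the `id` field are illustrative, not the library's actual API), resumable reading amounts to hashing each record into a stable ID and skipping IDs that have already been written:

```python
import hashlib
import json
from typing import Iterable, Iterator

def hash_id(record: dict) -> str:
    # Stable content hash used as a unique ID for a record.
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def resumable(records: Iterable[dict], seen_ids: set[str]) -> Iterator[dict]:
    # Yield only records whose IDs have not been processed yet.
    for record in records:
        rid = hash_id(record)
        if rid in seen_ids:
            continue
        record["id"] = rid
        yield record
```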
BaseWriter is the base output handler for pipelines (a sketch of chunked output follows the list):
- Writes processed data to various destinations
- Supports different output formats
- Handles output file management
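For intuition (not HfWriter's actual implementation), chunk-based output amounts to buffering records and flushing a new file every `chunk_size` items; the class and file-naming scheme below are hypothetical.

```python
import json
from pathlib import Path

class ChunkedJsonlWriter:
    def __init__(self, output_path: str, chunk_size: int = 1000):
        self.output_dir = Path(output_path)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.chunk_size = chunk_size
        self.buffer: list[dict] = []
        self.chunk_idx = 0

    def write(self, record: dict) -> None:
        # Buffer records and flush whenever the chunk is full.
        self.buffer.append(record)
        if len(self.buffer) >= self.chunk_size:
            self.flush()

    def flush(self) -> None:
        # Write the buffered records to a new jsonl chunk file.
        if not self.buffer:
            return
        path = self.output_dir / f"chunk_{self.chunk_idx:05d}.jsonl"
        with path.open("w", encoding="utf-8") as f:
            for rec in self.buffer:
                f.write(json.dumps(rec, ensure_ascii=False) + "\n")
        self.buffer.clear()
        self.chunk_idx += 1
```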
The Executor coordinates the execution flow between readers and writers:
- Connects readers to writers
- Manages the execution lifecycle
- Handles configuration serialization
Readers are responsible for loading data into the pipeline system. Config Synth Flow provides several built-in reader implementations:
- `BaseReader`: Abstract base class for all readers with support for:
  - Resumable processing through unique ID tracking
  - Automatic generation of hash IDs for data objects
  - Integration with writers for saving progress
- `HfDatasetReader`: Loads data from Hugging Face datasets with features like:
  - Support for both Dataset and IterableDataset types
  - Debug mode for quick testing with limited samples
  - Shuffling functionality for randomized data processing
  - Customizable loading through `dataset_kwargs`
Example reader configuration:
```yaml
reader:
  import_path: config_synth_flow.reader.HfDatasetReader
  init_kwargs:
    dataset_kwargs:
      path: "your_dataset_name"
      split: "train"
    resume: true
    shuffle: true
```

Writers handle the output of processed data in various formats. Available writers include:
- `BaseWriter`: Foundational writer class with common output management functions:
  - Configurable output paths
  - Basic output handling
- `HfWriter`: Specialized writer for saving data in Hugging Face dataset formats:
  - Supports multiple output formats (jsonl, json, csv, parquet)
  - Chunk-based saving for large datasets
  - Automatic chunk naming and management
Example writer configuration:
```yaml
writer:
  import_path: config_synth_flow.writer.HfWriter
  init_kwargs:
    output_path: "path/to/output"
    chunk_size: 1000
    output_format: "jsonl"
```

Judge pipelines evaluate and score content generated within the system:
- `SglangRmJudgePipeline`: Evaluates conversations using reward models
  - Integration with SGLang-served reward models
  - Support for per-round or full-conversation judging
- `OpenaiLmPplPipeline`: Calculates perplexity scores using OpenAI models
- `InfinitySlidingWindowEduClassifier`: Specialized classifier for educational content
The Papers pipelines implement algorithms and techniques from academic papers:
- `Magpie`: Implementation of the Magpie approach for instruction-tuning data generation
  - Built on `AsyncChatBasePipeline` for efficient processing
- `ContextualMagpie`: Implementation of the ContextualMagpie approach for instruction-tuning data generation
The pipeline system follows a hierarchical structure:
- Core Mixins: Provide shared functionality like logging, async support, and serialization
- Base Classes: Build on mixins to define core behaviors
- Specialized Pipelines: Implement specific functionality for different use cases
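Schematically, the layering looks like this; the class names below are illustrative, not the actual names in the package.

```python
class LoggerMixin:
    """Core mixin: shared logging helpers."""

class SerializableMixin:
    """Core mixin: shared configuration serialization helpers."""

class BasePipelineSketch(LoggerMixin, SerializableMixin):
    """Base class: combines the mixins and defines core pipeline behavior."""

class EduScoreJudgeSketch(BasePipelineSketch):
    """Specialized pipeline: implements one concrete use case on top of the base."""
```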
The main execution flow, controlled by the Executor, follows the pattern:
Reader → [*Pipes] → Writer
Key components communicate through well-defined interfaces, making the system modular and extensible.
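A minimal sketch of that flow (simplified; the real Executor also handles lifecycle management and configuration serialization):

```python
def run(reader, pipes, writer) -> None:
    # The reader yields dicts, each pipe transforms the stream lazily,
    # and the writer persists the final records.
    data = reader.read()
    for pipe in pipes:
        data = pipe(data)
    for record in data:
        writer.write(record)
```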
To add a custom reader, subclass BaseReader and implement `read()`:

```python
from config_synth_flow.base.io import BaseReader
from config_synth_flow.base.pipeline import DictsGenerator


class MyCustomReader(BaseReader):
    def post_init(self, input_path: str, resume: bool = False):
        super().post_init(resume=resume)
        self.input_path = input_path

    def read(self) -> DictsGenerator:
        # Implement your reading logic here
        for item in my_data_source:
            yield {"data": item}
```

To add a custom writer, subclass BaseWriter and implement `write()`:

```python
from config_synth_flow.base.io import BaseWriter


class MyCustomWriter(BaseWriter):
    def post_init(self, output_path: str, save_format: str = "jsonl"):
        self.output_path = output_path
        self.save_format = save_format

    def write(self, data):
        # Implement your writing logic here
        with open(self.output_path, "a") as f:
            f.write(f"{data}\n")
```

To add a custom pipeline, subclass BasePipeline and implement `__call__()` or `run_each()`:

```python
from config_synth_flow.base.pipeline import BasePipeline, DictsGenerator


class MyCustomPipeline(BasePipeline):
    def post_init(self, param1: str, param2: int = 0):
        self.param1 = param1
        self.param2 = param2

    def __call__(self, data: DictsGenerator) -> DictsGenerator:
        # Implement your processing logic here
        for item in data:
            yield self.run_each(item)

    def run_each(self, data: dict) -> dict:
        # Implement your processing logic here
        return {"processed": data, "param1": self.param1}
```

To add an asynchronous chat pipeline, subclass AsyncChatBasePipeline and implement `run_each()` as a coroutine:

```python
from config_synth_flow.base import AsyncChatBasePipeline, PromptTemplate


class MyAsyncChatPipeline(AsyncChatBasePipeline):
    def post_init(self, litellm_kwargs: dict, prompt_template: PromptTemplate):
        super().post_init(litellm_kwargs)
        self.prompt_template = prompt_template

    async def run_each(self, data: dict) -> dict:
        messages = [{"role": "user", "content": self.prompt_template.render(**data)}]
        response = await self.chat(messages=messages)
        data["response"] = response.choices[0].message.content
        return data
```
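Once defined, a custom component can be wired in the same way as the built-in ones, either from YAML or programmatically; the module path below is hypothetical.

```python
from config_synth_flow.base import BasePipeline, PipelineConfig

# Hypothetical import path; point it at wherever MyCustomPipeline is defined.
config = PipelineConfig(
    import_path="my_package.pipelines.MyCustomPipeline",
    init_kwargs={"param1": "hello", "param2": 3},
)
pipeline = BasePipeline(config)
```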
If you use ConfigSynthFlow, please cite it as:

```bibtex
@software{config_synth_flow,
  author    = {aqweteddy},
  title     = {ConfigSynthFlow: Configurable Workflows for Synthetic Data Generation},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/aqweteddy/ConfigSynthFlow}
}
```