A configurable system for generating multiple parquet files with custom schemas, data distributions, and file parameters. Perfect for testing, development, and creating sample datasets.
- Configurable Schema: Define custom column types, names, and data generation rules
- Multiple Data Types: Support for integers, floats, strings, booleans, timestamps, and UUIDs
- Realistic Data Generation: Various distribution types (normal, uniform, choice-based) with realistic patterns
- Flexible File Parameters: Control compression, row group size, file count, and size variations
- Multiple File Configurations: Generate different file sets with different parameters in one run
- Advanced Data Generators: Support for time series, correlated data, and hierarchical categories
- CLI Interface: Easy-to-use command line interface with validation and file inspection tools
- Clone or download this repository
- Install the required dependencies:
```bash
pip install -r requirements.txt
```
- Create a configuration file (or use one of the examples):
```yaml
# simple_config.yaml
global:
  output_directory: "./output"
  file_prefix: "data"
  random_seed: 42

schema:
  columns:
    - name: "id"
      type: "int64"
      nullable: false
      generator:
        type: "sequence"
        start: 1
    - name: "name"
      type: "string"
      nullable: false
      generator:
        type: "choice"
        choices: ["Alice", "Bob", "Charlie"]

parquet_options:
  compression: "snappy"

files:
  count: 3
  rows_per_file: 10000
```
- Generate parquet files:
```bash
python cli.py generate --config simple_config.yaml
```
- Inspect the generated files:
```bash
python cli.py list --output-dir ./output
python cli.py info --file ./output/data_001.parquet
```
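You can also sanity-check the output programmatically. The snippet below is a minimal sketch, assuming pandas (with pyarrow or fastparquet as its parquet engine) is installed and that the quick-start run produced `./output/data_001.parquet`:

```python
# Quick sanity check of a generated file (assumes pandas with a parquet engine)
import pandas as pd

df = pd.read_parquet("./output/data_001.parquet")

print(df.head())            # first few generated rows
print(df.dtypes)            # column types should match the configured schema
print(f"rows: {len(df)}")   # should match rows_per_file from the config
```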
Configuration files are built from the sections below.

Global settings:
```yaml
global:
  output_directory: "./output"   # Where to save files
  file_prefix: "data"            # Prefix for generated files
  random_seed: 42                # For reproducible generation (optional)
```
Schema definition:
```yaml
schema:
  columns:
    - name: "column_name"        # Column name
      type: "data_type"          # int32, int64, float32, float64, string, boolean, timestamp
      nullable: true/false       # Whether the column can contain nulls
      generator:                 # How to generate data for this column
        type: "generator_type"   # See generator types below
        # ... generator-specific parameters
```
Generator types:
Sequence:
```yaml
generator:
  type: "sequence"
  start: 1              # Starting number
```
UUID:
```yaml
generator:
  type: "uuid"          # Generates UUID strings
```
Datetime range:
```yaml
generator:
  type: "datetime_range"
  start: "2023-01-01"   # Start date (YYYY-MM-DD)
  end: "2024-12-31"     # End date (YYYY-MM-DD)
```
Normal distribution:
```yaml
generator:
  type: "normal"
  mean: 100.0           # Mean value
  std: 25.0             # Standard deviation
  min: 0.0              # Minimum value (optional)
  max: 1000.0           # Maximum value (optional)
```
Choice:
```yaml
generator:
  type: "choice"
  choices: ["A", "B", "C"]    # List of possible values
  weights: [0.5, 0.3, 0.2]    # Probability weights (optional)
```
Boolean:
```yaml
generator:
  type: "boolean"
  probability: 0.7      # Probability of True
```
Uniform integer:
```yaml
generator:
  type: "uniform_int"
  min: 1                # Minimum value
  max: 100              # Maximum value
```
Parquet options:
```yaml
parquet_options:
  compression: "snappy"     # snappy, gzip, lz4, brotli, none
  row_group_size: 50000     # Rows per row group
  page_size: 8192           # Page size in bytes
  use_dictionary: true      # Enable dictionary encoding
  write_statistics: true    # Write column statistics
```
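To confirm these options actually took effect, you can read the file metadata back with pyarrow. This is a sketch rather than part of the tool itself; it assumes pyarrow is installed and uses an example output path:

```python
# Inspect row groups, compression, and statistics of a generated file (assumes pyarrow)
import pyarrow.parquet as pq

pf = pq.ParquetFile("./output/data_001.parquet")   # example path
meta = pf.metadata

print(f"rows: {meta.num_rows}, row groups: {meta.num_row_groups}")

rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.compression, "stats" if col.statistics else "no stats")
```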
File settings:
```yaml
files:
  count: 5                      # Number of files to generate
  rows_per_file: 100000         # Rows in each file
  size_variation: 0.1           # Random size variation (0.0 to 1.0)

  # Optional: Multiple file configurations
  file_configs:
    - file_suffix: "_small"     # Suffix for file names
      count: 3                  # Number of files
      rows_per_file: 10000      # Rows per file
      parquet_options:          # Override parquet options
        compression: "gzip"
```
CLI commands:
```bash
python cli.py generate --config config.yaml [--verbose]
python cli.py validate --config config.yaml
python cli.py list [--output-dir ./output]
python cli.py info --file path/to/file.parquet
```
The `examples/` directory contains configuration files for common scenarios:
- `simple_config.yaml`: Basic example with minimal configuration
- `ecommerce_config.yaml`: E-commerce transaction data with realistic distributions
- `iot_sensor_config.yaml`: IoT sensor time-series data with multiple sensor types
- `financial_config.yaml`: Financial trading data with market-realistic patterns
Add `null_probability` to any generator to introduce null values:
```yaml
generator:
  type: "normal"
  mean: 100.0
  std: 25.0
  null_probability: 0.05   # 5% null values
```
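To verify the effect, you can measure the per-column null fraction of a generated file. A small sketch, assuming pandas with a parquet engine is installed and using an example path:

```python
# Check the observed null rate per column (assumes pandas with a parquet engine)
import pandas as pd

df = pd.read_parquet("./output/data_001.parquet")   # example path
print(df.isna().mean())   # fraction of nulls per column, should be near null_probability
```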
Control variation in file sizes:
```yaml
files:
  size_variation: 0.2   # ±20% variation in file sizes
```
Generate files with different compression for testing:
```yaml
file_configs:
  - file_suffix: "_snappy"
    parquet_options:
      compression: "snappy"
  - file_suffix: "_gzip"
    parquet_options:
      compression: "gzip"
```
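A quick way to compare the variants is to list the resulting file sizes on disk. This sketch assumes the default `./output` directory and the `_snappy`/`_gzip` suffixes configured above:

```python
# Compare on-disk sizes of the generated compression variants
from pathlib import Path

for path in sorted(Path("./output").glob("*.parquet")):
    size_kib = path.stat().st_size / 1024
    print(f"{path.name}: {size_kib:.1f} KiB")
```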
Performance tips:
- Use `snappy` compression for the fastest write/read performance
- Use `gzip` for the smallest file sizes
- Adjust `row_group_size` based on your query patterns
- Use appropriate data types (int32 vs int64, float32 vs float64)
- Enable dictionary encoding for categorical data
Troubleshooting:
- Import Error: Make sure all dependencies are installed with `pip install -r requirements.txt`
- Permission Error: Ensure the output directory is writable
- Memory Error: Reduce `rows_per_file` for large datasets
- Invalid Configuration: Use `python cli.py validate --config your_config.yaml` to check syntax
Run any command with `--help` for detailed usage information:
```bash
python cli.py --help
python cli.py generate --help
```
This project is open source and available under the MIT License.