Parquet File Generator

A configurable system for generating multiple Parquet files with custom schemas, data distributions, and file parameters. Useful for testing, development, and creating sample datasets.

Features

  • Configurable Schema: Define custom column types, names, and data generation rules
  • Multiple Data Types: Support for integers, floats, strings, booleans, timestamps, and UUIDs
  • Realistic Data Generation: Various distribution types (normal, uniform, choice-based) with realistic patterns
  • Flexible File Parameters: Control compression, row group size, file count, and size variations
  • Multiple File Configurations: Generate different file sets with different parameters in one run
  • Advanced Data Generators: Support for time series, correlated data, and hierarchical categories
  • CLI Interface: Easy-to-use command line interface with validation and file inspection tools

Installation

  1. Clone or download this repository
  2. Install the required dependencies:
pip install -r requirements.txt

Quick Start

  1. Create a configuration file (or use one of the examples):
# simple_config.yaml
global:
  output_directory: "./output"
  file_prefix: "data"
  random_seed: 42

schema:
  columns:
    - name: "id"
      type: "int64"
      nullable: false
      generator:
        type: "sequence"
        start: 1
    - name: "name"
      type: "string"
      nullable: false
      generator:
        type: "choice"
        choices: ["Alice", "Bob", "Charlie"]

parquet_options:
  compression: "snappy"

files:
  count: 3
  rows_per_file: 10000
  2. Generate parquet files:
python cli.py generate --config simple_config.yaml
  3. Inspect the generated files:
python cli.py list --output-dir ./output
python cli.py info --file ./output/data_001.parquet
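
You can also sanity-check a generated file directly from Python with pyarrow (assumed to be among the dependencies in requirements.txt, since it is needed to write Parquet). A minimal sketch:

# inspect_output.py - quick check of one generated file
import pyarrow.parquet as pq

pf = pq.ParquetFile("./output/data_001.parquet")
print(pf.schema_arrow)                    # column names and types from the schema section
print("rows:", pf.metadata.num_rows)      # should match rows_per_file (10000 above)
print("row groups:", pf.metadata.num_row_groups)
print(pf.read().slice(0, 5))              # preview the first few rows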

Configuration Reference

Global Settings

global:
  output_directory: "./output"    # Where to save files
  file_prefix: "data"            # Prefix for generated files
  random_seed: 42                # For reproducible generation (optional)
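
Setting random_seed makes repeated runs deterministic. The sketch below only illustrates the underlying idea with NumPy; how the project applies the seed internally is an implementation detail:

import numpy as np

# Two generators seeded with the same value produce identical sequences,
# which is what makes data generation reproducible from run to run.
a = np.random.default_rng(42).normal(size=5)
b = np.random.default_rng(42).normal(size=5)
assert (a == b).all()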

Schema Definition

schema:
  columns:
    - name: "column_name"         # Column name
      type: "data_type"           # int32, int64, float32, float64, string, boolean, timestamp
      nullable: true/false        # Whether column can contain nulls
      generator:                  # How to generate data for this column
        type: "generator_type"    # See generator types below
        # ... generator-specific parameters
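
The type names above are the usual Arrow/Parquet primitive types. The hypothetical mapping below shows how a column entry could translate into a pyarrow field; the project's actual mapping (including the timestamp unit) may differ:

import pyarrow as pa

# Illustrative mapping from config "type" strings to Arrow types (not the project's code)
TYPE_MAP = {
    "int32": pa.int32(),
    "int64": pa.int64(),
    "float32": pa.float32(),
    "float64": pa.float64(),
    "string": pa.string(),
    "boolean": pa.bool_(),
    "timestamp": pa.timestamp("ms"),   # unit is an assumption
}

# A column entry such as {name: "id", type: "int64", nullable: false} becomes:
field = pa.field("id", TYPE_MAP["int64"], nullable=False)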

Data Generator Types

Sequence Generator

generator:
  type: "sequence"
  start: 1                      # Starting number

UUID Generator

generator:
  type: "uuid"                  # Generates UUID strings

DateTime Range Generator

generator:
  type: "datetime_range"
  start: "2023-01-01"          # Start date (YYYY-MM-DD)
  end: "2024-12-31"            # End date (YYYY-MM-DD)

Normal Distribution Generator

generator:
  type: "normal"
  mean: 100.0                  # Mean value
  std: 25.0                    # Standard deviation
  min: 0.0                     # Minimum value (optional)
  max: 1000.0                  # Maximum value (optional)
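
The optional min and max bounds cap values drawn from the distribution; roughly equivalent sampling logic looks like this (a sketch, not the project's exact code):

import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(loc=100.0, scale=25.0, size=1_000)  # mean / std from the config
values = np.clip(values, 0.0, 1000.0)                   # apply the optional min / max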

Choice Generator

generator:
  type: "choice"
  choices: ["A", "B", "C"]     # List of possible values
  weights: [0.5, 0.3, 0.2]     # Probability weights (optional)
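
Weights control how often each value appears; the equivalent NumPy call (weights should sum to 1):

import numpy as np

rng = np.random.default_rng(42)
values = rng.choice(["A", "B", "C"], size=10, p=[0.5, 0.3, 0.2])  # ~50% A, 30% B, 20% C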

Boolean Generator

generator:
  type: "boolean"
  probability: 0.7             # Probability of True

Uniform Integer Generator

generator:
  type: "uniform_int"
  min: 1                       # Minimum value
  max: 100                     # Maximum value

Parquet Options

parquet_options:
  compression: "snappy"         # snappy, gzip, lz4, brotli, none
  row_group_size: 50000        # Rows per row group
  page_size: 8192              # Page size in bytes
  use_dictionary: true         # Enable dictionary encoding
  write_statistics: true       # Write column statistics
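
These options correspond to standard PyArrow writer arguments. For reference, writing a table with the same settings directly through pyarrow looks roughly like this (the table contents are made up; pyarrow's argument for the page size is data_page_size, which is assumed to map to page_size above):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(100)), "name": ["x"] * 100})   # toy data
pq.write_table(
    table,
    "example.parquet",
    compression="snappy",     # snappy, gzip, lz4, brotli, or none
    row_group_size=50000,     # rows per row group
    data_page_size=8192,      # assumed equivalent of page_size
    use_dictionary=True,      # dictionary encoding
    write_statistics=True,    # per-column statistics
)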

File Generation Settings

files:
  count: 5                     # Number of files to generate
  rows_per_file: 100000        # Rows in each file
  size_variation: 0.1          # Random size variation (0.0 to 1.0)

# Optional: Multiple file configurations
file_configs:
  - file_suffix: "_small"      # Suffix for file names
    count: 3                   # Number of files
    rows_per_file: 10000       # Rows per file
    parquet_options:           # Override parquet options
      compression: "gzip"

CLI Commands

Generate Files

python cli.py generate --config config.yaml [--verbose]

Validate Configuration

python cli.py validate --config config.yaml

List Generated Files

python cli.py list [--output-dir ./output]

Inspect File Details

python cli.py info --file path/to/file.parquet

Example Use Cases

The examples/ directory contains configuration files for common scenarios:

  • simple_config.yaml: Basic example with minimal configuration
  • ecommerce_config.yaml: E-commerce transaction data with realistic distributions
  • iot_sensor_config.yaml: IoT sensor time-series data with multiple sensor types
  • financial_config.yaml: Financial trading data with market-realistic patterns

Advanced Features

Nullable Columns

Add null_probability to any generator to introduce null values:

generator:
  type: "normal"
  mean: 100.0
  std: 25.0
  null_probability: 0.05      # 5% null values
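
The effect is that roughly that fraction of the generated values is replaced with nulls; equivalent logic as a sketch:

import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(100.0, 25.0, size=1_000).astype(object)
mask = rng.random(1_000) < 0.05   # ~5% of positions
values[mask] = None               # become null in the written column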

File Size Variation

Control variation in file sizes:

files:
  size_variation: 0.2         # ±20% variation in file sizes

Multiple Compression Types

Generate files with different compression for testing:

file_configs:
  - file_suffix: "_snappy"
    parquet_options:
      compression: "snappy"
  - file_suffix: "_gzip"
    parquet_options:
      compression: "gzip"

Performance Tips

  • Use snappy compression for fastest write/read performance
  • Use gzip for smallest file sizes
  • Adjust row_group_size based on your query patterns
  • Use appropriate data types (int32 vs int64, float32 vs float64); see the size-comparison sketch after this list
  • Enable dictionary encoding for categorical data
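
To gauge how much a narrower type saves, you can write the same values as int32 and int64 and compare file sizes with pyarrow (actual savings depend on the data and encoding):

import os
import pyarrow as pa
import pyarrow.parquet as pq

values = list(range(1_000_000))
for typ in (pa.int32(), pa.int64()):
    table = pa.table({"v": pa.array(values, type=typ)})
    path = f"type_test_{typ}.parquet"
    pq.write_table(table, path, compression="snappy")
    print(typ, os.path.getsize(path), "bytes")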

Troubleshooting

Common Issues

  1. Import Error: Make sure all dependencies are installed with pip install -r requirements.txt
  2. Permission Error: Ensure the output directory is writable
  3. Memory Error: Reduce rows_per_file for large datasets
  4. Invalid Configuration: Use python cli.py validate --config your_config.yaml to check syntax

Getting Help

Run any command with --help for detailed usage information:

python cli.py --help
python cli.py generate --help

License

This project is open source and available under the MIT License.
