Blazingly fast fake data generator powered by Rust
Fakelake is actively developed and maintained by SOMA in Paris.
Any feedback is welcome! Feel free to open an issue or start a discussion.
Fakelake is a lightning-fast command-line tool that generates realistic fake data from simple YAML configurations. Perfect for load testing, database seeding, development, and data pipeline testing.
- Fast: Generate millions of rows in seconds - up to 53x faster than Python alternatives
- Lightweight: Small binary size with minimal memory footprint
- Simple: Define your data schema in YAML, get results instantly
- Reliable: Written in pure Rust with zero unsafe code
- Cross-platform: Works seamlessly on Linux, macOS, and Windows
- Multiple Formats: Export to Parquet, CSV, or JSON
- Reproducible: Optional seed for deterministic data generation
flowchart LR
A[YAML Schema] --> B[Fakelake]
B --> C[Parquet]
B --> D[CSV]
B --> E[JSON]
style B fill:#f96,stroke:#333,stroke-width:4px
- Load Testing: Generate millions of realistic rows for database stress testing
- Database Seeding: Populate development/staging databases with realistic data
- Data Pipeline Testing: Test ETL processes with configurable data volumes
- Analytics Development: Create sample datasets for BI tool development
- Learning & Training: Generate datasets for SQL practice or data science tutorials
- Data Quality Testing: Use the corrupted and presence options to test validation logic
Generate 1 million rows with random strings (10 characters):
| Tool | Time | Speed vs Fakelake |
|---|---|---|
| Fakelake | 253 ms | 1.00x |
| Mimesis (Python) | 3,375 ms | 13.35x slower |
| Faker (Python) | 13,553 ms | 53.62x slower |
Benchmark Details
- Environment: AMD Ryzen 5 7530U, 8GB RAM, SSD
- OS: Windows
- Test: Generate 1M rows, single column with 10-character random strings
- Command: Run scripts/benchmark.sh to reproduce
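The timings in the table can also be read as throughput. The sketch below is plain arithmetic on the rounded millisecond values from the table, so its figures are approximate; the names `timings_ms` and `rows_per_second` are illustrative and not part of Fakelake.

```python
# Timings (ms) for generating 1,000,000 rows, copied from the table above.
ROWS = 1_000_000
timings_ms = {
    "Fakelake": 253,
    "Mimesis (Python)": 3_375,
    "Faker (Python)": 13_553,
}


def rows_per_second(rows: int, elapsed_ms: float) -> float:
    """Convert a row count and an elapsed time in milliseconds to rows/s."""
    return rows / (elapsed_ms / 1000)


throughput = {tool: rows_per_second(ROWS, ms) for tool, ms in timings_ms.items()}

for tool, rps in throughput.items():
    print(f"{tool}: ~{rps:,.0f} rows/s")
```

On the benchmark machine this works out to roughly 4 million rows per second for Fakelake; results on other hardware will differ.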
Binaries are available from the Releases page, or can be downloaded directly from the command line:
# Download the latest release
wget https://github.com/soma-smart/Fakelake/releases/latest/download/Fakelake_<version>_<target>.tar.gz
# Extract and run
tar -xvf Fakelake_<version>_<target>.tar.gz
./fakelake --help

Alternatively, build from source:
git clone https://github.com/soma-smart/Fakelake.git
cd Fakelake
cargo build --release
./target/release/fakelake --help

Create a YAML file describing your data schema:
columns:
  - name: user_id
    provider: Increment.integer
    start: 1
  - name: email
    provider: Person.email
    domain: example.com
  - name: signup_date
    provider: Random.Date.date
    format: "%Y-%m-%d"
    after: 2020-01-01
    before: 2024-12-31
info:
  output_name: users
  output_format: parquet
  rows: 1_000_000

Generate the data:
fakelake generate users.yaml

That's it! You'll get a users.parquet file with 1 million rows in seconds.
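For pipeline tests that need several data volumes, the schema file can be written from a script. The following is a minimal sketch and not part of Fakelake: `build_config` and `run_fakelake` are hypothetical helpers, the YAML mirrors the users example, and the sketch assumes the `fakelake` binary is on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_config(output_name: str, rows: int) -> str:
    """Return a schema mirroring the users example, with a custom row count."""
    return f"""\
columns:
  - name: user_id
    provider: Increment.integer
    start: 1
  - name: email
    provider: Person.email
    domain: example.com
info:
  output_name: {output_name}
  output_format: parquet
  rows: {rows}
"""


def run_fakelake(config_path: Path) -> bool:
    """Run `fakelake generate <file>`; return False if the binary is not found."""
    binary = shutil.which("fakelake")
    if binary is None:
        return False
    subprocess.run([binary, "generate", str(config_path)], check=True)
    return True


def generate(rows: int, directory: Path = Path(".")) -> Path:
    """Write users_<rows>.yaml into `directory` and try to generate data from it."""
    path = directory / f"users_{rows}.yaml"
    path.write_text(build_config(f"users_{rows}", rows), encoding="utf-8")
    if not run_fakelake(path):
        print(f"fakelake not found on PATH; wrote {path} only")
    return path
```

Calling `generate(1_000)` and then `generate(100_000)` would produce two schema files and, if the binary is available, one output file per schema.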
To generate data from several schema files in one command:
fakelake generate schema1.yaml schema2.yaml schema3.yaml

Fakelake comes with rich built-in providers for generating realistic data:
All Providers are listed in the documentation
- name: id
  provider: Increment.integer
  start: 100  # Starting value (default: 0)
  step: 2     # Increment step (default: 1)

- name: first_name
  provider: Person.fname  # French first names (top 1000)
- name: last_name
  provider: Person.lname  # French last names (top 1000)
- name: email
  provider: Person.email
  domain: company.com  # Custom domain (default: example.com)

- name: score
  provider: Random.Number.i32
  min: 0
  max: 100
- name: percentage
  provider: Random.Number.f64
  min: 0.0
  max: 100.0

- name: code
  provider: Random.String.alphanumeric
  length: 10  # Fixed length
- name: dynamic_code
  provider: Random.String.alphanumeric
  length: 5..15  # Variable length range

- name: created_at
  provider: Random.Date.date
  format: "%Y-%m-%d"
  after: 2020-01-01
  before: 2024-12-31
- name: last_login
  provider: Random.Date.datetime
  format: "%Y-%m-%d %H:%M:%S"
  after: 2024-01-01 00:00:00
  before: 2024-12-31 23:59:59

- name: is_active
  provider: Random.bool

# Single value
- name: country
  provider: Constant.string
  data: France
# List (random selection)
- name: status
  provider: Constant.string
  data: [active, inactive, pending]
# Weighted list (for data skewing)
- name: priority
  provider: Constant.string
  data:
    - value: low
      weight: 5
    - value: medium
      weight: 3
    - value: high
      weight: 1

- name: product_name
  provider: Constant.external
  path: data/products.txt  # One value per line

Column Options are listed in the documentation.
- name: optional_field
  provider: Person.email
  presence: 0.8  # 80% filled, 20% missing (null)

- name: email
  provider: Person.email
  corrupted: 0.01  # 1% of emails will be intentionally invalid

Useful for testing data validation and error handling!
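To check that downstream validation sees the proportions set by presence and corrupted, the generated column can be summarized after loading. The sketch below is an illustration under stated assumptions: `column_quality` is a hypothetical helper, missing values are assumed to arrive as `None` or an empty string (this depends on the output format and on how you load it), and the regular expression is a deliberately loose notion of a valid email, not Fakelake's own definition.

```python
import re
from typing import Dict, Iterable, Optional

# Loose pattern: something@something.tld. An assumption about what counts
# as "valid", not the rule Fakelake uses when corrupting values.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def column_quality(values: Iterable[Optional[str]]) -> Dict[str, float]:
    """Return the share of present values and the share of present values failing EMAIL_RE."""
    total = present = invalid = 0
    for value in values:
        total += 1
        if value is None or value == "":
            continue  # counted as missing
        present += 1
        if not EMAIL_RE.match(value):
            invalid += 1
    return {
        "presence": present / total if total else 0.0,
        "invalid": invalid / present if present else 0.0,
    }


# Tiny hand-written sample: 3 of 5 values present, 1 of those 3 invalid.
sample = ["a@example.com", None, "b@example.com", "not-an-email", ""]
print(column_quality(sample))
```

On a large generated column, the two ratios should land close to the configured presence and corrupted values, up to sampling noise. Whether the corrupted rate is applied relative to all rows or only to present rows is not stated here, so treat the "invalid" figure as indicative.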
Output Parameters are listed in the documentation.
info:
  output_format: parquet

info:
  output_format: csv
  delimiter: ','  # Customizable delimiter

info:
  output_format: json
  wrap_up: false  # false: JSONL (one object per line)
                  # true: Valid JSON array

Use a seed for deterministic output:
info:
  seed: 42  # Same seed = same data every time
  rows: 1_000_000

Perfect for testing, debugging, and consistent datasets!
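One way to confirm that two runs with the same seed produced the same data is to load both JSON outputs and compare them. The sketch below is an assumption-laden illustration rather than a Fakelake feature: `parse_records`, `load_records`, and `same_data` are hypothetical helpers, and the parser accepts either JSON layout (one object per line, or a single array) by looking at the first character.

```python
import json
from pathlib import Path
from typing import Any, List


def parse_records(text: str) -> List[Any]:
    """Parse JSON output given as a single array or as one object per line."""
    text = text.strip()
    if not text:
        return []
    if text.startswith("["):
        # Whole file is one JSON array (wrap_up: true).
        return json.loads(text)
    # One JSON object per non-empty line (wrap_up: false).
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def load_records(path: str) -> List[Any]:
    """Read a file from disk and parse it with parse_records."""
    return parse_records(Path(path).read_text(encoding="utf-8"))


def same_data(path_a: str, path_b: str) -> bool:
    """True if both files contain the same records in the same order."""
    return load_records(path_a) == load_records(path_b)
```

A typical use would be to run the same schema twice with the same seed but different output_name values, then call `same_data` on the two resulting files. The file names depend on your output_name setting, so none are assumed here.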
Here's a comprehensive example showcasing most features:
columns:
  - name: id
    provider: Increment.integer
    start: 42
    step: 2
    presence: 0.8  # 80% present, 20% null
  - name: first_name
    provider: Person.fname
  - name: last_name
    provider: Person.lname
  - name: company_email
    provider: Person.email
    domain: soma-smart.com
    corrupted: 0.0001  # 0.01% corrupted emails
  - name: created
    provider: Random.Date.date
    format: "%Y-%m-%d"
    after: 2000-02-15
    before: 2020-07-17
  - name: connection
    provider: Random.Date.datetime
    format: "%Y-%m-%d %H:%M:%S"
    after: 2000-02-15 12:15:00
    before: 2020-07-17 23:11:57
  - name: code
    provider: Random.String.alphanumeric
    length: 20
  - name: code_variable
    provider: Random.String.alphanumeric
    length: 5..15  # Variable length
  - name: is_subscribed
    provider: Random.bool
  - name: score
    provider: Random.Number.i32
    min: -100
    max: 100
  - name: percentage
    provider: Random.Number.f64
    min: -1000
    max: 1000
  - name: constant_string
    provider: Constant.string
    data: my_constant
  - name: category
    provider: Constant.string
    data: [electronics, clothing, books]
  - name: priority
    provider: Constant.string
    data:
      - value: low
        weight: 5
      - value: high
        weight: 1
  - name: product_name
    provider: Constant.external
    path: tests/example.txt
info:
  output_name: target/comprehensive_example
  output_format: parquet
  rows: 174_957
  seed: 12345  # Reproducible data

Key Dependencies:
- arrow & parquet - High-performance columnar data
- fastrand - Fast random number generation
- rayon - Parallel processing
- chrono - Date and time handling
- clap - Command-line interface
Full documentation is available at soma-smart.github.io/Fakelake
Contributions are welcome! Whether it's bug reports, feature requests, or code contributions.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.