Codestin Search App

⚡ Blazingly fast fake data generator powered by Rust

🏢 About

Fakelake is actively developed and maintained by SOMA in Paris.

Any feedback is welcome! Feel free to open an issue or start a discussion.

🚀 Why Fakelake?

Fakelake is a lightning-fast command-line tool that generates realistic fake data from simple YAML configurations. Perfect for load testing, database seeding, development, and data pipeline testing.

✨ Key Highlights

Fast: Generate millions of rows in seconds - up to 53x faster than Python alternatives
Lightweight: Small binary size with minimal memory footprint
Simple: Define your data schema in YAML, get results instantly
Reliable: Written in pure Rust with zero unsafe code
Cross-platform: Works seamlessly on Linux, macOS, and Windows
Multiple Formats: Export to Parquet, CSV, or JSON
Reproducible: Optional seed for deterministic data generation

flowchart LR
    A[YAML Schema] --> B[Fakelake]
    B --> C[Parquet]
    B --> D[CSV]
    B --> E[JSON]
    style B fill:#f96,stroke:#333,stroke-width:4px

🏗️ Use Cases

Load Testing: Generate millions of realistic rows for database stress testing
Database Seeding: Populate development/staging databases with realistic data
Data Pipeline Testing: Test ETL processes with configurable data volumes
Analytics Development: Create sample datasets for BI tool development
Learning & Training: Generate datasets for SQL practice or data science tutorials
Data Quality Testing: Use corrupted and presence options to test validation logic

📊 Performance Benchmark

Generate 1 million rows with random strings (10 characters):

Tool	Time	Speed vs Fakelake
Fakelake	253 ms	1.00x
Mimesis (Python)	3,375 ms	13.35x slower
Faker (Python)	13,553 ms	53.62x slower

Benchmark Details

Environment: AMD Ryzen 5 7530U, 8GB RAM, SSD
OS: Windows
Test: Generate 1M rows, single column with 10-character random strings
Command: Run scripts/benchmark.sh to reproduce

Quick Start

Installation

Option 1: Download Precompiled Binary (Recommended)

Binaries are accessible from the Releases page. Or can be downloaded directly via command line:

# Download the latest release
wget https://github.com/soma-smart/Fakelake/releases/latest/download/Fakelake_<version>_<target>.tar.gz

# Extract and run
tar -xvf Fakelake_<version>_<target>.tar.gz
./fakelake --help

Option 2: Build from Source

git clone https://github.com/soma-smart/Fakelake.git
cd Fakelake
cargo build --release
./target/release/fakelake --help

💡 Usage

Basic Example

Create a YAML file describing your data schema:

columns:
  - name: user_id
    provider: Increment.integer
    start: 1

  - name: email
    provider: Person.email
    domain: example.com

  - name: signup_date
    provider: Random.Date.date
    format: "%Y-%m-%d"
    after: 2020-01-01
    before: 2024-12-31

info:
  output_name: users
  output_format: parquet
  rows: 1_000_000

Generate the data:

fakelake generate users.yaml

That's it! You'll get a users.parquet file with 1 million rows in seconds.

Generate Multiple Files

fakelake generate schema1.yaml schema2.yaml schema3.yaml

🎯 Features & Capabilities

📦 Data Providers

Fakelake comes with rich built-in providers for generating realistic data:

All Providers are listed in the documentation

Increment

- name: id
  provider: Increment.integer
  start: 100      # Starting value (default: 0)
  step: 2         # Increment step (default: 1)

Person

- name: first_name
  provider: Person.fname    # French first names (top 1000)

- name: last_name
  provider: Person.lname    # French last names (top 1000)

- name: email
  provider: Person.email
  domain: company.com       # Custom domain (default: example.com)

Random Numbers

- name: score
  provider: Random.Number.i32
  min: 0
  max: 100

- name: percentage
  provider: Random.Number.f64
  min: 0.0
  max: 100.0

Random Strings

- name: code
  provider: Random.String.alphanumeric
  length: 10         # Fixed length

- name: dynamic_code
  provider: Random.String.alphanumeric
  length: 5..15      # Variable length range

Random Dates

- name: created_at
  provider: Random.Date.date
  format: "%Y-%m-%d"
  after: 2020-01-01
  before: 2024-12-31

- name: last_login
  provider: Random.Date.datetime
  format: "%Y-%m-%d %H:%M:%S"
  after: 2024-01-01 00:00:00
  before: 2024-12-31 23:59:59

Random Boolean

- name: is_active
  provider: Random.bool

Constant Values

# Single value
- name: country
  provider: Constant.string
  data: France

# List (random selection)
- name: status
  provider: Constant.string
  data: [active, inactive, pending]

# Weighted list (for data skewing)
- name: priority
  provider: Constant.string
  data:
    - value: low
      weight: 5
    - value: medium
      weight: 3
    - value: high
      weight: 1

External Data

- name: product_name
  provider: Constant.external
  path: data/products.txt    # One value per line

🎛️ Column Options

Column Options are listed in the documentation.

Presence (Missing Values)

- name: optional_field
  provider: Person.email
  presence: 0.8    # 80% filled, 20% missing (null)

Corrupted (Invalid Data)

- name: email
  provider: Person.email
  corrupted: 0.01  # 1% of emails will be intentionally invalid

Useful for testing data validation and error handling!

📤 Output Formats

Output Parameters are listed in the documentation.

Parquet (Default)

info:
  output_format: parquet

CSV

info:
  output_format: csv
  delimiter: ','    # Customizable delimiter

JSON

info:
  output_format: json
  wrap_up: false    # false: JSONL (one object per line)
                    # true: Valid JSON array

🎲 Reproducible Data Generation

Use a seed for deterministic output:

info:
  seed: 42          # Same seed = same data every time
  rows: 1_000_000

Perfect for testing, debugging, and consistent datasets!

📚 Complete Example

Here's a comprehensive example showcasing most features:

columns:
  - name: id
    provider: Increment.integer
    start: 42
    step: 2
    presence: 0.8       # 80% present, 20% null

  - name: first_name
    provider: Person.fname

  - name: last_name
    provider: Person.lname

  - name: company_email
    provider: Person.email
    domain: soma-smart.com
    corrupted: 0.0001   # 0.01% corrupted emails

  - name: created
    provider: Random.Date.date
    format: "%Y-%m-%d"
    after: 2000-02-15
    before: 2020-07-17

  - name: connection
    provider: Random.Date.datetime
    format: "%Y-%m-%d %H:%M:%S"
    after: 2000-02-15 12:15:00
    before: 2020-07-17 23:11:57

  - name: code
    provider: Random.String.alphanumeric
    length: 20

  - name: code_variable
    provider: Random.String.alphanumeric
    length: 5..15       # Variable length

  - name: is_subscribed
    provider: Random.bool

  - name: score
    provider: Random.Number.i32
    min: -100
    max: 100

  - name: percentage
    provider: Random.Number.f64
    min: -1000
    max: 1000

  - name: constant_string
    provider: Constant.string
    data: my_constant

  - name: category
    provider: Constant.string
    data: [electronics, clothing, books]

  - name: priority
    provider: Constant.string
    data:
      - value: low
        weight: 5
      - value: high
        weight: 1

  - name: product_name
    provider: Constant.external
    path: tests/example.txt

info:
  output_name: target/comprehensive_example
  output_format: parquet
  rows: 174_957
  seed: 12345          # Reproducible data

🛠️ Built With

Key Dependencies:

arrow & parquet - High-performance columnar data
fastrand - Fast random number generation
rayon - Parallel processing
chrono - Date and time handling
clap - Command-line interface

📖 Documentation

Full documentation is available at soma-smart.github.io/Fakelake

🤝 Contributing

Contributions are welcome! Whether it's bug reports, feature requests, or code contributions.

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

📄 License

Distributed under the MIT License. See LICENSE.txt for more information.

Name		Name	Last commit message	Last commit date
Latest commit History 91 Commits
.github		.github
bench		bench
docs		docs
images		images
scripts		scripts
src		src
static		static
tests		tests
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE.txt		LICENSE.txt
README.md		README.md
mkdocs.yml		mkdocs.yml

Uh oh!

License

Uh oh!

soma-smart/Fakelake

Folders and files

Latest commit

History

Repository files navigation

🏢 About

🚀 Why Fakelake?

✨ Key Highlights

🏗️ Use Cases

📊 Performance Benchmark

Quick Start

Installation

Option 1: Download Precompiled Binary (Recommended)

Option 2: Build from Source

💡 Usage

Basic Example

Generate Multiple Files

🎯 Features & Capabilities

📦 Data Providers

Increment

Person

Random Numbers

Random Strings

Random Dates

Random Boolean

Constant Values

External Data

🎛️ Column Options

Presence (Missing Values)

Corrupted (Invalid Data)

📤 Output Formats

Parquet (Default)

CSV

JSON

🎲 Reproducible Data Generation

📚 Complete Example

🛠️ Built With

📖 Documentation

🤝 Contributing

📄 License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 8

Packages 0

Uh oh!

Contributors 4

Uh oh!

Languages

Packages