Realistic data simulator for ML system testing with time-compressed scenarios and controlled drift
SIMTOM is an extensible data generation platform that creates realistic streaming data for machine learning model training and testing. Features include configurable arrival patterns, noise injection, drift simulation, and time compression for accelerated development cycles.
The Problem: Your ML model works in dev but fails in production. Unit tests use toy data. Load testing (Locust, wrk) only tests performance, not model behavior. Real production data is risky, regulated, or unavailable.
The Solution: simtom generates statistically realistic synthetic data with controlled patterns, drift, and edge cases. Test your ML models with production-like scenarios without production risks.
Different from load testing: While Locust tests "can your API handle 1000 requests?", simtom tests "does your fraud model still work when spending patterns change seasonally?"
Production Endpoint: https://simtom-production.up.railway.app
# Quick test
curl https://simtom-production.up.railway.app/generators
# Stream sample data
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{"rate_per_second": 2.0, "max_records": 3}'
- Realistic Traffic Patterns: Uniform, Poisson, NHPP, and Burst arrival patterns
- Rich Data Generation: BNPL transactions with risk scoring and customer profiles
- Historical Data Generation: Generate years of data with realistic temporal patterns
- Holiday & Seasonal Effects: Black Friday +60%, Christmas -88%, weekend reductions
- Day-per-Second Delivery: Historical data streams at 1 day per second (365 days in ~6 minutes)
- Time Compression: Simulate days/weeks of data in minutes
- Plugin Architecture: Easy extension with custom generators
- Real-time Streaming: Server-sent events with configurable rates
- ML-Ready: Built-in noise, drift, and deterministic seeding
# Check health and available generators
curl https://simtom-production.up.railway.app/
# Stream live BNPL data (current timestamps)
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{"rate_per_second": 2.0, "max_records": 5, "seed": 42}'
# Generate 3 months of historical BNPL data with realistic volumes
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{
"start_date": "2024-06-01",
"end_date": "2024-09-01",
"base_daily_volume": 1000,
"seed": 42
}' > historical_bnpl_data.jsonl
# Fast generation of full year dataset (delivered in ~6 minutes)
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{
"start_date": "2024-01-01",
"end_date": "2024-12-31",
"base_daily_volume": 1000,
"include_holiday_patterns": true,
"seed": 42
}' > bnpl_full_year.jsonl
The streaming endpoints return Server-Sent Events (SSE): each record arrives on its own line, prefixed with `data: `:
data: {"transaction_id": "txn_00000000", "timestamp": "2025-09-15T11:10:21.307911", "customer_id": "cust_000001", "amount": 143.02, ...}
data: {"transaction_id": "txn_00000001", "timestamp": "2025-09-15T11:10:21.318045", "customer_id": "cust_000002", "amount": 67.89, ...}
Python Example:
import requests
import json
response = requests.post(
'https://simtom-production.up.railway.app/stream/bnpl',
json={"rate_per_second": 10, "max_records": 5},
stream=True
)
for line in response.iter_lines(decode_unicode=True):
if line.startswith('data: '):
json_data = line[6:] # Remove 'data: ' prefix
record = json.loads(json_data)
print(record['transaction_id'], record['amount'])
JavaScript Example:
fetch('/stream/bnpl', {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({rate_per_second: 10, max_records: 5})
})
.then(response => {
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  function read() {
    return reader.read().then(({done, value}) => {
      if (done) return;
      buffer += decoder.decode(value, {stream: true});
      const lines = buffer.split('\n');
      buffer = lines.pop();  // keep any partial line until the next chunk arrives
      lines.forEach(line => {
        if (line.startsWith('data: ')) {
          const record = JSON.parse(line.substring(6));
          console.log(record.transaction_id, record.amount);
        }
      });
      return read();
    });
  }
  return read();
});
Important Notes:
- Standard JSON parsers will fail unless you strip the `data: ` prefix first
- Use streaming HTTP clients for large datasets to avoid memory issues
- Each line contains a complete JSON record (no multi-line JSON)
git clone https://github.com/whitehackr/simtom.git
cd simtom
poetry install
poetry run python scripts/run_server.py
curl http://localhost:8000/generators
from simtom.generators.ecommerce.bnpl import BNPLGenerator, BNPLConfig
from datetime import date
# Real-time streaming (current timestamps)
config = BNPLConfig(
rate_per_second=10.0,
max_records=1000,
seed=42
)
generator = BNPLGenerator(config)
async for record in generator.stream():
print(record) # Process each synthetic transaction
# Historical data generation (specific date range)
historical_config = BNPLConfig(
start_date=date(2024, 1, 1),
end_date=date(2024, 12, 31),
base_daily_volume=1000, # Realistic daily volumes
include_holiday_patterns=True,
seed=42
)
historical_generator = BNPLGenerator(historical_config)
async for record in historical_generator.stream():
print(record) # Historical transactions delivered day-per-second
Generate realistic historical datasets with proper temporal patterns for ML training and backtesting.
- Date Range Support: Generate data for any period up to 1 year
- Statistical Volume Distribution: 4-factor model (day-of-week, week-of-month, seasonal, events)
- Business Hour Patterns: 70% during 9am-6pm, 20% evenings, 10% nights
- Weekend Adjustments: 15% reduction on weekends (realistic e-commerce patterns)
- Holiday Effects: Black Friday +60%, Christmas -88%, configurable patterns
- Day-per-Second Delivery: Historical data streams rapidly (1 day per second)
- Chronological Ordering: All timestamps properly sorted for time-series analysis
{
// Special Events
"black_friday": 1.6, // +60% traffic (biggest shopping day)
"cyber_monday": 1.4, // +40% traffic
"christmas_day": 0.12, // -88% (most stores closed)
"new_years_day": 0.3, // -70% traffic
"valentines_day": 1.15, // +15% traffic
"mothers_day": 1.15, // +15% traffic
// Day of Week
"friday": 1.25, // +25% (weekend prep)
"saturday": 0.85, // -15% weekend reduction
"sunday": 0.70, // -30% weekend reduction
// Seasonal
"january": 0.75, // -25% post-holiday low
"november": 1.10 // +10% pre-holiday buildup
}
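As a rough illustration of how multipliers like those above combine with the volume model, here is a hedged Python sketch. The tables and the `expected_daily_volume` helper are hypothetical (the week-of-month factor is omitted for brevity); they only mirror the documented day-of-week, seasonal, and event effects, not simtom's actual code.

```python
import datetime

# Hypothetical multiplier tables, mirroring the documented patterns.
DAY_OF_WEEK = {4: 1.25, 5: 0.85, 6: 0.70}        # Fri +25%, Sat -15%, Sun -30%
MONTHLY = {1: 0.75, 11: 1.10}                    # January low, November buildup
EVENTS = {datetime.date(2024, 11, 29): 1.6,      # Black Friday 2024
          datetime.date(2024, 12, 25): 0.12}     # Christmas Day

def expected_daily_volume(day: datetime.date, base: int = 1000) -> int:
    """Multiply the base volume by every factor that applies to this date."""
    m = DAY_OF_WEEK.get(day.weekday(), 1.0)
    m *= MONTHLY.get(day.month, 1.0)
    m *= EVENTS.get(day, 1.0)
    return round(base * m)

print(expected_daily_volume(datetime.date(2024, 11, 29)))  # → 2200 (1000 × 1.25 × 1.10 × 1.6)
```

Note that in this simple sketch the Friday and Black Friday multipliers stack; whether simtom compounds or overrides factors on event days is not specified here.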
- Historical Mode: Day-per-second delivery (365 days in ~6 minutes)
- Real-time Mode: Configurable rates 0.1-1000 records/second
- Memory Efficient: O(1) streaming regardless of dataset size
- Network Optimized: Batched delivery prevents overwhelming clients
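The day-per-second delivery model above can be sketched as an async generator. This is an assumed model, not simtom's implementation; the `seconds_per_day` parameter is invented here to make the pacing explicit.

```python
import asyncio
from typing import Any, AsyncIterator, Iterable, List, Tuple

async def stream_day_per_second(
    days: Iterable[Tuple[str, List[Any]]],   # (day label, that day's records)
    seconds_per_day: float = 1.0,
) -> AsyncIterator[Any]:
    """Emit all of one simulated day's records, then pause one (scaled) second."""
    for _day, records in days:
        for record in records:
            yield record                      # burst out the whole day's records
        await asyncio.sleep(seconds_per_day)  # pace: one simulated day per tick
```

At the default pacing, a 365-day dataset finishes in about 365 seconds, which is where the "~6 minutes" figure comes from.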
Fixed intervals - predictable for testing
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{
"rate_per_second": 2.0,
"arrival_pattern": "uniform"
}'
Random intervals with realistic variability
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{
"rate_per_second": 2.0,
"arrival_pattern": "poisson"
}'
Daily traffic patterns with peak hours
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{
"rate_per_second": 1.0,
"arrival_pattern": "nhpp",
"peak_hours": [12, 19],
"time_compression": 24.0
}'
Flash sale and event-driven spikes
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{
"rate_per_second": 2.0,
"arrival_pattern": "burst",
"burst_intensity": 3.0,
"burst_probability": 0.6
}'
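For intuition, the patterns above can be reduced to sampling inter-arrival times. The sketch below is a hedged illustration, not simtom's actual implementation: `burst_intensity` and `burst_probability` mirror the request parameters, and the NHPP case is omitted because it requires a time-varying rate function.

```python
import random

def inter_arrival(pattern: str, rate: float, rng: random.Random,
                  burst_intensity: float = 3.0, burst_probability: float = 0.6) -> float:
    """Return the wait in seconds before the next record for a given pattern."""
    if pattern == "uniform":
        return 1.0 / rate                # fixed spacing: predictable for testing
    if pattern == "poisson":
        return rng.expovariate(rate)     # exponential gaps: random but rate-preserving
    if pattern == "burst":
        # During a burst, records arrive burst_intensity times faster.
        effective = rate * (burst_intensity if rng.random() < burst_probability else 1.0)
        return rng.expovariate(effective)
    raise ValueError(f"unknown pattern: {pattern}")

rng = random.Random(42)
gaps = [inter_arrival("poisson", 2.0, rng) for _ in range(10_000)]
print(sum(gaps) / len(gaps))  # ≈ 0.5 s mean gap at 2 records/second
```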
- Plugin Architecture: Auto-discovery of data generators via decorators
- Async Streaming: Memory-efficient generation of large datasets
- Type Safety: Pydantic models for configuration and validation
- Extensibility: Add new generators without touching core code
- Plugin System: Auto-discovery of generators
- Memory Efficient: O(1) streaming regardless of dataset size
- Entity Consistency: LRU registries maintain referential integrity
- FastAPI: Modern async web framework
- Pydantic: Type-safe configuration validation
simtom/
├── core/                # Stable abstractions
│   ├── generator.py     # BaseGenerator + GeneratorConfig
│   ├── registry.py      # Plugin auto-discovery
│   └── entities.py      # Core data models
├── generators/          # Pluggable data generators
│   └── ecommerce/
│       └── bnpl.py      # BNPL risk data generator
├── api/                 # FastAPI web layer
│   ├── main.py          # Application factory
│   ├── routes.py        # Streaming endpoints
│   └── models.py        # Request/response schemas
└── scenarios/           # Time-based scenario modeling
New generators are automatically registered:
@register_generator("my_generator")
class MyGenerator(BaseGenerator):
async def generate_record(self) -> Dict[str, Any]:
return {"id": uuid4(), "value": random.random()}
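The snippet above relies on simtom's auto-discovery. As a rough sketch of how a decorator-based registry works in general (the internals of `simtom.core.registry` may differ), consider:

```python
from typing import Any, Dict

GENERATORS: Dict[str, type] = {}  # name -> generator class

def register_generator(name: str):
    """Class decorator: record the class under `name` when its module is imported."""
    def wrap(cls: type) -> type:
        GENERATORS[name] = cls
        return cls
    return wrap

@register_generator("my_generator")
class MyGenerator:
    def generate_record(self) -> Dict[str, Any]:
        return {"value": 1}

# Lookup by name, as an API route handler might do:
print(GENERATORS["my_generator"]().generate_record())  # → {'value': 1}
```

Because registration happens at import time, simply placing a module under `simtom/generators/` and importing the package is enough to make the generator discoverable.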
BNPL transactions include 40+ fields:
{
"transaction_id": "txn_00000001",
"customer_id": "cust_000001",
"amount": 485.61,
"risk_score": 0.85,
"risk_level": "high",
"installment_count": 4,
"customer_age_bracket": "25-34",
"product_category": "electronics",
"device_type": "mobile",
"payment_provider": "afterpay"
}
| Generator | Description | Use Case |
|---|---|---|
| `bnpl` | Buy-Now-Pay-Later transactions with risk scoring | Credit risk, fraud detection |
from simtom.core.generator import GeneratorConfig
config = GeneratorConfig(
rate_per_second=1.0, # Records per second (1-1000)
total_records=None, # Infinite if None
seed=42, # Reproducible randomness
time_compression=1.0 # Real-time = 1.0, faster = > 1.0
)
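`time_compression` trades wall-clock time for simulated time. A small illustration of the arithmetic follows; the assumed behavior (real delays divided by the compression factor) is a sketch, not a statement about simtom's internals.

```python
def wall_clock_seconds(simulated_seconds: float, time_compression: float) -> float:
    """How long a simulated span takes to stream at a given compression factor."""
    return simulated_seconds / time_compression

# A 24-hour day of data at time_compression=24.0 streams in one real hour:
print(wall_clock_seconds(24 * 3600, 24.0) / 3600)  # → 1.0 hour
```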
| Parameter | Description | Default | Mode |
|---|---|---|---|
| `rate_per_second` | Arrival rate (0.1-1000) | 1.0 | Current-date |
| `base_daily_volume` | Average daily transactions | 1000 | Historical |
| `start_date` | Historical start date | null | Historical |
| `end_date` | Historical end date | null | Historical |
| `include_holiday_patterns` | Enable seasonal effects | true | Both |
| `arrival_pattern` | Traffic pattern | "uniform" | Current-date |
| `peak_hours` | NHPP peak hours | [12, 19] | Current-date |
| `max_records` | Maximum records to generate | null | Both |
| `seed` | Deterministic output | null | Both |
# API Configuration
SIMTOM_HOST=0.0.0.0
SIMTOM_PORT=8000
SIMTOM_LOG_LEVEL=info
# Redis (optional, for caching)
REDIS_URL=redis://localhost:6379
- ML Model Training: Realistic arrival patterns for better model performance
- Load Testing: Simulate traffic spikes and patterns
- Feature Engineering: Rich, consistent data for pipeline development
- System Testing: Controlled drift and noise injection
- Research: Reproducible datasets with deterministic seeding
import asyncio
from simtom.core.generator import GeneratorConfig
from simtom.generators.ecommerce.bnpl import BNPLGenerator

async def test_fraud_model():
    # Generate baseline data
    baseline_config = GeneratorConfig(seed=42, total_records=1000)
    baseline_gen = BNPLGenerator(baseline_config)

    # Train model on baseline (train_fraud_model is your own training code)
    baseline_data = [record async for record in baseline_gen.stream()]
    model = train_fraud_model(baseline_data)

    # Test with drift scenario
    drift_config = GeneratorConfig(
        seed=123,  # Different seed = different patterns
        total_records=200
    )
    drift_gen = BNPLGenerator(drift_config)

    # Evaluate model performance
    async for record in drift_gen.stream():
        prediction = model.predict(record)
        actual = record['default_risk']
        # Track accuracy degradation here

asyncio.run(test_fraud_model())
docker build -t simtom .
docker run -p 8000:8000 simtom
# Connect to Railway
railway login
railway link
# Deploy
railway up
SIMTOM is designed for community extension. Add new generators by:
- Inheriting from `BaseGenerator`
- Implementing `async def generate_record()`
- Adding the `@register_generator("name")` decorator
- Placing the module in `simtom/generators/` (it will be auto-discovered!)
1. Create Generator Class

# simtom/generators/finance/credit_cards.py
from typing import Any, Dict

from simtom.core.generator import BaseGenerator, register_generator

@register_generator("credit_cards")
class CreditCardGenerator(BaseGenerator):
    async def generate_record(self) -> Dict[str, Any]:
        return {
            "card_number": self.faker.credit_card_number(),
            "amount": self.faker.pyfloat(min_value=1, max_value=1000),
            "merchant": self.faker.company()
        }

2. Add Tests

# tests/generators/test_credit_cards.py
async def test_credit_card_generation():
    config = GeneratorConfig(total_records=10)
    generator = CreditCardGenerator(config)
    records = [r async for r in generator.stream()]
    assert len(records) == 10
    assert all("card_number" in r for r in records)

3. Update Documentation: add the new generator to the table above
# Install development dependencies
poetry install --with dev
# Run tests
pytest
# Code formatting
black .
ruff check .
# Type checking
mypy simtom/
- Type Hints: All public APIs must have type annotations
- Async First: Use `async`/`await` for I/O operations
- Testing: >90% test coverage required
- Documentation: Docstrings for all public methods
| Records/sec | Memory Usage | CPU Usage |
|---|---|---|
| 10 | ~50MB | ~5% |
| 100 | ~75MB | ~15% |
| 1000 | ~150MB | ~40% |
- Use an appropriate `rate_per_second` for your use case
- Set `total_records` to avoid infinite streams
- Consider Redis caching for repeated scenarios
- Use Docker limits in production
Generator Not Found
# Error: Generator 'my_gen' not found
# Solution: Ensure @register_generator decorator is used
High Memory Usage
# Issue: Memory grows over time
# Solution: Set total_records limit or use streaming processing
async for record in generator.stream():
process_record(record) # Process immediately, don't accumulate
Slow Generation
# Issue: Generation too slow
# Solution: Increase rate_per_second or check async usage
config = GeneratorConfig(rate_per_second=100) # Faster
# Simulate Black Friday traffic spike
config = GeneratorConfig(
time_compression=24.0, # 1 hour = 24 hours of data
rate_per_second=50.0 # Higher transaction volume
)
# Gradual drift over time
configs = [
GeneratorConfig(seed=42), # Baseline
GeneratorConfig(seed=43), # Month 1
GeneratorConfig(seed=44), # Month 2
]
for config in configs:
generator = BNPLGenerator(config)
# Test model performance degradation
MIT License - see LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Live Demo: https://simtom-production.up.railway.app
Built for ML Engineers, by ML Engineers