ML-focused synthetic data platform with realistic traffic patterns, seasonal effects, and temporal drift. BNPL transaction generator with risk scoring, configurable arrival patterns (Poisson, NHPP, Burst). Live API: simtom-production.up.railway.app | Day-per-second historical replay.

simtom


Realistic data simulator for ML system testing with time-compressed scenarios and controlled drift

SIMTOM is an extensible data generation platform that creates realistic streaming data for machine learning model training and testing. Features include configurable arrival patterns, noise injection, drift simulation, and time compression for accelerated development cycles.

Why simtom?

The Problem: Your ML model works in dev but fails in production. Unit tests use toy data. Load testing (Locust, wrk) only tests performance, not model behavior. Real production data is risky, regulated, or unavailable.

The Solution: simtom generates statistically realistic synthetic data with controlled patterns, drift, and edge cases. Test your ML models with production-like scenarios without production risks.

Different from load testing: While Locust tests "can your API handle 1000 requests?", simtom tests "does your fraud model still work when spending patterns change seasonally?"

🚀 Live API

Production Endpoint: https://simtom-production.up.railway.app

# Quick test
curl https://simtom-production.up.railway.app/generators

# Stream sample data
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
  -H "Content-Type: application/json" \
  -d '{"rate_per_second": 2.0, "max_records": 3}'

⚑ Key Features

  • 🎯 Realistic Traffic Patterns: Uniform, Poisson, NHPP, and Burst arrival patterns
  • πŸ“Š Rich Data Generation: BNPL transactions with risk scoring and customer profiles
  • πŸ“… Historical Data Generation: Generate years of data with realistic temporal patterns
  • πŸŽ„ Holiday & Seasonal Effects: Black Friday +60%, Christmas -88%, weekend reductions
  • ⚑ Day-per-Second Delivery: Historical data streams at 1 day per second (365 days in 6 minutes)
  • ⏱️ Time Compression: Simulate days/weeks of data in minutes
  • πŸ”§ Plugin Architecture: Easy extension with custom generators
  • πŸ“‘ Real-time Streaming: Server-sent events with configurable rates
  • πŸ§ͺ ML-Ready: Built-in noise, drift, and deterministic seeding

📋 Quick Start

Try the Live API

Real-time Data Streaming

# Check health and available generators
curl https://simtom-production.up.railway.app/

# Stream live BNPL data (current timestamps)
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
  -H "Content-Type: application/json" \
  -d '{"rate_per_second": 2.0, "max_records": 5, "seed": 42}'

Historical Data for ML Training

# Generate 3 months of historical BNPL data with realistic volumes
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
  -H "Content-Type: application/json" \
  -d '{
    "start_date": "2024-06-01",
    "end_date": "2024-09-01",
    "base_daily_volume": 1000,
    "seed": 42
  }' > historical_bnpl_data.jsonl

# Fast generation of full year dataset (delivered in ~6 minutes)
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
  -H "Content-Type: application/json" \
  -d '{
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "base_daily_volume": 1000,
    "include_holiday_patterns": true,
    "seed": 42
  }' > bnpl_full_year.jsonl

Response Format

The streaming endpoints return Server-Sent Events (SSE). Each record is sent on its own line, prefixed with data: (the word data, a colon, and a space):

data: {"transaction_id": "txn_00000000", "timestamp": "2025-09-15T11:10:21.307911", "customer_id": "cust_000001", "amount": 143.02, ...}
data: {"transaction_id": "txn_00000001", "timestamp": "2025-09-15T11:10:21.318045", "customer_id": "cust_000002", "amount": 67.89, ...}

Parsing SSE Responses

Python Example:

import requests
import json

response = requests.post(
    'https://simtom-production.up.railway.app/stream/bnpl',
    json={"rate_per_second": 10, "max_records": 5},
    stream=True
)
response.raise_for_status()  # Fail early on HTTP errors
response.encoding = 'utf-8'  # Ensure iter_lines yields str rather than bytes

for line in response.iter_lines(decode_unicode=True):
    if line.startswith('data: '):
        json_data = line[6:]  # Remove 'data: ' prefix
        record = json.loads(json_data)
        print(record['transaction_id'], record['amount'])

JavaScript Example:

fetch('https://simtom-production.up.railway.app/stream/bnpl', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({rate_per_second: 10, max_records: 5})
})
.then(response => response.body.getReader())
.then(reader => {
    const decoder = new TextDecoder();
    let buffer = '';  // Holds a partial line until the rest of it arrives
    function read() {
        return reader.read().then(({done, value}) => {
            if (done) return;
            // stream: true keeps multi-byte characters intact across chunks
            buffer += decoder.decode(value, {stream: true});
            const lines = buffer.split('\n');
            buffer = lines.pop();  // The last element may be an incomplete line
            lines.forEach(line => {
                if (line.startsWith('data: ')) {
                    const record = JSON.parse(line.substring(6));
                    console.log(record.transaction_id, record.amount);
                }
            });
            return read();
        });
    }
    return read();
});

Important Notes:

  • Strip the data: prefix before parsing; passing a raw SSE line to a JSON parser fails
  • Use streaming HTTP clients for large datasets to avoid memory issues
  • Each line contains a complete JSON record (no multi-line JSON)
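
The notes above can be folded into one small helper. This is a minimal sketch and not part of simtom; parse_sse_lines is a hypothetical name. It takes any iterable of decoded lines (for example the output of response.iter_lines(decode_unicode=True), or an open file saved by curl), ignores everything that is not a data: line, and yields one dict per record.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def parse_sse_lines(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per SSE `data:` line (hypothetical helper)."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # Skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload:
            yield json.loads(payload)


# Two records separated by the blank line that SSE uses between events
sample = [
    'data: {"transaction_id": "txn_00000000", "amount": 143.02}',
    "",
    'data: {"transaction_id": "txn_00000001", "amount": 67.89}',
]
records = list(parse_sse_lines(sample))
print([r["transaction_id"] for r in records])
```

Because the helper is a generator, it processes one line at a time and does not hold the whole stream in memory.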

Local Installation

git clone https://github.com/whitehackr/simtom.git
cd simtom
poetry install

Run Locally

poetry run python scripts/run_server.py
curl http://localhost:8000/generators

Basic Usage

import asyncio
from datetime import date

from simtom.generators.ecommerce.bnpl import BNPLGenerator, BNPLConfig


async def main():
    # Real-time streaming (current timestamps)
    config = BNPLConfig(
        rate_per_second=10.0,
        max_records=1000,
        seed=42
    )

    generator = BNPLGenerator(config)
    async for record in generator.stream():
        print(record)  # Process each synthetic transaction

    # Historical data generation (specific date range)
    historical_config = BNPLConfig(
        start_date=date(2024, 1, 1),
        end_date=date(2024, 12, 31),
        base_daily_volume=1000,  # Realistic daily volumes
        include_holiday_patterns=True,
        seed=42
    )

    historical_generator = BNPLGenerator(historical_config)
    async for record in historical_generator.stream():
        print(record)  # Historical transactions delivered day-per-second


# async for is only valid inside a coroutine, so run everything via asyncio
asyncio.run(main())

📅 Historical Data Generation

Generate realistic historical datasets with proper temporal patterns for ML training and backtesting.

Key Features

  • Date Range Support: Generate data for any period up to 1 year
  • Statistical Volume Distribution: 4-factor model (day-of-week, week-of-month, seasonal, events)
  • Business Hour Patterns: 70% during 9am-6pm, 20% evenings, 10% nights
  • Weekend Adjustments: 15% reduction on Saturdays and 30% on Sundays (realistic e-commerce patterns)
  • Holiday Effects: Black Friday +60%, Christmas -88%, configurable patterns
  • Day-per-Second Delivery: Historical data streams rapidly (1 day per second)
  • Chronological Ordering: All timestamps properly sorted for time-series analysis

Statistical Volume Multipliers

{
  // Special Events
  "black_friday": 1.6,        // +60% traffic (biggest shopping day)
  "cyber_monday": 1.4,        // +40% traffic
  "christmas_day": 0.12,      // -88% (most stores closed)
  "new_years_day": 0.3,       // -70% traffic
  "valentines_day": 1.15,     // +15% traffic
  "mothers_day": 1.15,        // +15% traffic

  // Day of Week
  "friday": 1.25,             // +25% (weekend prep)
  "saturday": 0.85,           // -15% weekend reduction
  "sunday": 0.70,             // -30% weekend reduction

  // Seasonal
  "january": 0.75,            // -25% post-holiday low
  "november": 1.10            // +10% pre-holiday buildup
}
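
As a worked example of how these multipliers might combine, the sketch below multiplies the event, day-of-week, and month factors together. This README does not say how simtom combines them, so the multiplicative rule is an assumption, as is treating every unlisted day or month as 1.0. The week-of-month factor is omitted because no values are listed for it, and expected_daily_volume is a hypothetical name.

```python
from datetime import date
from typing import Optional

# Values copied from the table above; anything not listed counts as 1.0 (assumption)
EVENT = {"black_friday": 1.6, "cyber_monday": 1.4, "christmas_day": 0.12,
         "new_years_day": 0.3, "valentines_day": 1.15, "mothers_day": 1.15}
DAY_OF_WEEK = {4: 1.25, 5: 0.85, 6: 0.70}  # date.weekday(): Fri=4, Sat=5, Sun=6
MONTH = {1: 0.75, 11: 1.10}                # January, November


def expected_daily_volume(base: float, day: date, event: Optional[str] = None) -> float:
    """Scale the base volume by event, weekday, and month factors (assumed to multiply)."""
    factor = EVENT.get(event, 1.0) if event else 1.0
    factor *= DAY_OF_WEEK.get(day.weekday(), 1.0)
    factor *= MONTH.get(day.month, 1.0)
    return base * factor


# Black Friday 2024 falls on Friday, 29 November: 1000 * 1.6 * 1.25 * 1.10
black_friday = expected_daily_volume(1000, date(2024, 11, 29), "black_friday")
# Christmas Day 2024 is a Wednesday in December: 1000 * 0.12
christmas = expected_daily_volume(1000, date(2024, 12, 25), "christmas_day")
print(round(black_friday), round(christmas))
```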

Performance

  • Historical Mode: Day-per-second delivery (365 days in ~6 minutes)
  • Real-time Mode: Configurable rates 0.1-1000 records/second
  • Memory Efficient: O(1) streaming regardless of dataset size
  • Network Optimized: Batched delivery prevents overwhelming clients
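
The day-per-second figure translates directly into wall-clock time. The sketch below is plain arithmetic; replay_seconds is a hypothetical helper, and counting both the start and end date is an assumption.

```python
from datetime import date


def replay_seconds(start: date, end: date, days_per_second: float = 1.0) -> float:
    """Approximate wall-clock seconds needed to stream a date range."""
    days = (end - start).days + 1  # Count both endpoints (assumption)
    return days / days_per_second


full_year = replay_seconds(date(2024, 1, 1), date(2024, 12, 31))
print(f"{full_year:.0f} s, about {full_year / 60:.1f} min")
```

2024 is a leap year, so the full-year request from the Quick Start covers 366 days and should take slightly over 6 minutes.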

🚦 Arrival Patterns

Uniform (Default)

Fixed intervals - predictable for testing

curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
  -H "Content-Type: application/json" \
  -d '{
  "rate_per_second": 2.0,
  "arrival_pattern": "uniform"
}'

Poisson

Random intervals with realistic variability

curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
  -H "Content-Type: application/json" \
  -d '{
  "rate_per_second": 2.0,
  "arrival_pattern": "poisson"
}'
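
To see what separates the first two patterns, the sketch below compares the gaps between records. With a uniform pattern every gap is exactly 1/rate seconds; in a Poisson process the gaps are exponentially distributed with mean 1/rate. This illustrates the statistics only and is not simtom's implementation; both function names are hypothetical.

```python
import random
from typing import List


def uniform_gaps(rate_per_second: float, n: int) -> List[float]:
    """Fixed spacing: every gap is exactly 1/rate seconds."""
    return [1.0 / rate_per_second] * n


def poisson_gaps(rate_per_second: float, n: int, seed: int = 42) -> List[float]:
    """Poisson process: gaps are exponential with mean 1/rate."""
    rng = random.Random(seed)  # Seeded so the sequence is reproducible
    return [rng.expovariate(rate_per_second) for _ in range(n)]


gaps = poisson_gaps(rate_per_second=2.0, n=10_000)
mean_gap = sum(gaps) / len(gaps)
print(f"mean gap: {mean_gap:.3f} s (theory: 0.500 s)")
```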

NHPP (Non-Homogeneous Poisson)

Daily traffic patterns with peak hours

curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
  -H "Content-Type: application/json" \
  -d '{
  "rate_per_second": 1.0,
  "arrival_pattern": "nhpp",
  "peak_hours": [12, 19],
  "time_compression": 24.0
}'
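
This README does not describe how simtom samples NHPP arrivals. A standard technique is thinning: draw candidates from a faster constant-rate Poisson process and keep each one with probability lambda(t) / lambda_max. The sketch below applies it to a hypothetical intensity with a Gaussian bump around each peak hour. BOOST, WIDTH, and both function names are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence

BOOST, WIDTH = 2.0, 1.5  # Hypothetical bump height and width (in hours)


def intensity(hour: float, base: float, peak_hours: Sequence[int]) -> float:
    """Arrivals per hour: base rate plus a Gaussian bump around each peak."""
    bumps = sum(math.exp(-((hour - p) ** 2) / (2 * WIDTH ** 2)) for p in peak_hours)
    return base * (1.0 + BOOST * bumps)


def nhpp_arrivals(base: float, peak_hours: Sequence[int],
                  horizon: float = 24.0, seed: int = 42) -> List[float]:
    """Sample arrival times (in hours) on [0, horizon) by thinning."""
    rng = random.Random(seed)
    lam_max = base * (1.0 + BOOST * len(peak_hours))  # Each bump is at most 1
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(lam_max)  # Next candidate from the bounding process
        if t >= horizon:
            return arrivals
        if rng.random() < intensity(t, base, peak_hours) / lam_max:
            arrivals.append(t)  # Accept with probability lambda(t) / lam_max


arrivals = nhpp_arrivals(base=100.0, peak_hours=[12, 19])
near_peak = sum(1 for t in arrivals if 11 <= t < 13)
off_peak = sum(1 for t in arrivals if 3 <= t < 5)
print(len(arrivals), near_peak, off_peak)
```

The two-hour window around noon should contain noticeably more arrivals than the same-sized window in the early morning.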

Burst

Flash sale and event-driven spikes

curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
  -H "Content-Type: application/json" \
  -d '{
  "rate_per_second": 2.0,
  "arrival_pattern": "burst",
  "burst_intensity": 3.0,
  "burst_probability": 0.6
}'
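
The exact semantics of burst_intensity and burst_probability are not documented here. One plausible reading, assumed for this sketch, is that each time window independently enters a burst with probability burst_probability, and the rate is multiplied by burst_intensity during a burst. Under that assumption the long-run average rate is rate * (1 + p * (intensity - 1)). Both function names are hypothetical.

```python
import random
from typing import List


def effective_rate(rate: float, burst_intensity: float, burst_probability: float) -> float:
    """Long-run average rate if windows burst independently (assumed semantics)."""
    return rate * (1.0 + burst_probability * (burst_intensity - 1.0))


def window_rates(rate: float, burst_intensity: float, burst_probability: float,
                 n: int, seed: int = 42) -> List[float]:
    """Per-window rates: the boosted rate during a burst, the base rate otherwise."""
    rng = random.Random(seed)
    return [rate * burst_intensity if rng.random() < burst_probability else rate
            for _ in range(n)]


rates = window_rates(2.0, 3.0, 0.6, n=10_000)
observed = sum(rates) / len(rates)
print(f"expected {effective_rate(2.0, 3.0, 0.6):.2f}/s, simulated {observed:.2f}/s")
```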

🏗️ Architecture

Core Principles

  • Plugin Architecture: Auto-discovery of data generators via decorators
  • Async Streaming: Memory-efficient generation of large datasets
  • Type Safety: Pydantic models for configuration and validation
  • Extensibility: Add new generators without touching core code

Architecture Highlights

  • Plugin System: Auto-discovery of generators
  • Memory Efficient: O(1) streaming regardless of dataset size
  • Entity Consistency: LRU registries maintain referential integrity
  • FastAPI: Modern async web framework
  • Pydantic: Type-safe configuration validation
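
The README only states that LRU registries maintain referential integrity. The sketch below shows the general idea, not simtom's code: a bounded map from entity ID to attributes, so a customer_id that appears in several transactions keeps the same profile while memory stays capped. LRURegistry and get_or_create are hypothetical names.

```python
from collections import OrderedDict
from typing import Any, Callable, Dict


class LRURegistry:
    """Bounded id -> entity map that evicts the least recently used entry."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._items: "OrderedDict[str, Dict[str, Any]]" = OrderedDict()

    def get_or_create(self, key: str, factory: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
        entity = self._items.get(key)
        if entity is not None:
            self._items.move_to_end(key)  # Mark as most recently used
            return entity
        entity = factory()
        self._items[key] = entity
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # Evict the oldest entry
        return entity

    def __len__(self) -> int:
        return len(self._items)


customers = LRURegistry(max_size=2)
first = customers.get_or_create("cust_000001", lambda: {"age_bracket": "25-34"})
again = customers.get_or_create("cust_000001", lambda: {"age_bracket": "55-64"})
print(first is again)  # The second factory is never called for a known ID
```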

Component Overview

simtom/
├── core/           # Stable abstractions
│   ├── generator.py    # BaseGenerator + GeneratorConfig
│   ├── registry.py     # Plugin auto-discovery
│   └── entities.py     # Core data models
├── generators/     # Pluggable data generators
│   └── ecommerce/
│       └── bnpl.py     # BNPL risk data generator
├── api/            # FastAPI web layer
│   ├── main.py         # Application factory
│   ├── routes.py       # Streaming endpoints
│   └── models.py       # Request/response schemas
└── scenarios/      # Time-based scenario modeling

Plugin System

New generators are automatically registered:

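import random
from typing import Any, Dict
from uuid import uuid4
from simtom.core.generator import BaseGenerator, register_generator

@register_generator("my_generator")
class MyGenerator(BaseGenerator):
    async def generate_record(self) -> Dict[str, Any]:
        return {"id": str(uuid4()), "value": random.random()}  # str() keeps the ID JSON-serializable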
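
For readers curious how decorator-based registration works in general, here is a minimal standalone sketch. It is not simtom's registry: the module-level dict, the duplicate-name check, and get_generator are assumptions made for illustration.

```python
from typing import Callable, Dict

_REGISTRY: Dict[str, type] = {}  # Hypothetical module-level registry


def register_generator(name: str) -> Callable[[type], type]:
    """Class decorator that records the class under `name` when it is defined."""
    def decorator(cls: type) -> type:
        if name in _REGISTRY:
            raise ValueError(f"Generator {name!r} is already registered")
        _REGISTRY[name] = cls
        return cls  # Return the class unchanged so it can still be used directly
    return decorator


def get_generator(name: str) -> type:
    """Look up a registered class, with a readable error for unknown names."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise KeyError(f"Generator {name!r} not found") from None


@register_generator("demo")
class DemoGenerator:
    pass


print(get_generator("demo").__name__)
```

Registration happens when the module defining the class is imported, which is why auto-discovery typically imports every module under the generators package at startup.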

📊 Sample Data

BNPL transactions include 40+ fields. A representative subset:

{
  "transaction_id": "txn_00000001",
  "customer_id": "cust_000001",
  "amount": 485.61,
  "risk_score": 0.85,
  "risk_level": "high",
  "installment_count": 4,
  "customer_age_bracket": "25-34",
  "product_category": "electronics",
  "device_type": "mobile",
  "payment_provider": "afterpay"
}

📊 Available Generators

| Generator | Description | Use Case |
|-----------|-------------|----------|
| bnpl | Buy-Now-Pay-Later transactions with risk scoring | Credit risk, fraud detection |

🔧 Configuration

Generator Configuration

from simtom.core.generator import GeneratorConfig

config = GeneratorConfig(
    rate_per_second=1.0,     # Records per second (0.1-1000)
    total_records=None,      # Infinite if None
    seed=42,                 # Reproducible randomness
    time_compression=1.0     # Real-time = 1.0, faster = > 1.0
)

Configuration Options

| Parameter | Description | Default | Mode |
|-----------|-------------|---------|------|
| rate_per_second | Arrival rate (0.1-1000) | 1.0 | Current-date |
| base_daily_volume | Average daily transactions | 1000 | Historical |
| start_date | Historical start date | null | Historical |
| end_date | Historical end date | null | Historical |
| include_holiday_patterns | Enable seasonal effects | true | Both |
| arrival_pattern | Traffic pattern | "uniform" | Current-date |
| peak_hours | NHPP peak hours | [12, 19] | Current-date |
| max_records | Maximum records to generate | null | Both |
| seed | Deterministic output | null | Both |
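
The Mode column suggests a request is either historical (start_date and end_date set) or current-date (rate-driven). The sketch below builds a request body along those lines. The validation rules are inferred from the table rather than taken from the API, and build_request is a hypothetical helper.

```python
import json
from datetime import date
from typing import Any, Dict, Optional


def build_request(rate_per_second: float = 1.0,
                  start_date: Optional[date] = None,
                  end_date: Optional[date] = None,
                  base_daily_volume: int = 1000,
                  max_records: Optional[int] = None,
                  seed: Optional[int] = None) -> Dict[str, Any]:
    """Build a /stream/bnpl JSON body; the checks are assumptions based on the table."""
    if (start_date is None) != (end_date is None):
        raise ValueError("start_date and end_date must be given together")
    body: Dict[str, Any] = {}
    if start_date is not None and end_date is not None:
        if end_date < start_date:
            raise ValueError("end_date must not precede start_date")
        # Historical mode: volume is driven by base_daily_volume
        body["start_date"] = start_date.isoformat()
        body["end_date"] = end_date.isoformat()
        body["base_daily_volume"] = base_daily_volume
    else:
        if not 0.1 <= rate_per_second <= 1000:
            raise ValueError("rate_per_second must be between 0.1 and 1000")
        body["rate_per_second"] = rate_per_second  # Current-date mode
    if max_records is not None:
        body["max_records"] = max_records
    if seed is not None:
        body["seed"] = seed
    return body


print(json.dumps(build_request(start_date=date(2024, 6, 1),
                               end_date=date(2024, 9, 1), seed=42)))
```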

Environment Variables

# API Configuration
SIMTOM_HOST=0.0.0.0
SIMTOM_PORT=8000
SIMTOM_LOG_LEVEL=info

# Redis (optional, for caching)
REDIS_URL=redis://localhost:6379
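
A server process could read these variables as sketched below. This is not simtom's settings code: Settings and load_settings are hypothetical names, and using the values shown above as defaults is an assumption.

```python
import os
from typing import Mapping, NamedTuple, Optional


class Settings(NamedTuple):
    host: str
    port: int
    log_level: str
    redis_url: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read SIMTOM_* variables, falling back to the values shown above (assumed defaults)."""
    return Settings(
        host=env.get("SIMTOM_HOST", "0.0.0.0"),
        port=int(env.get("SIMTOM_PORT", "8000")),
        log_level=env.get("SIMTOM_LOG_LEVEL", "info"),
        redis_url=env.get("REDIS_URL"),  # Optional, None when unset
    )


settings = load_settings({"SIMTOM_PORT": "9000"})
print(settings.host, settings.port, settings.redis_url)
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.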

🧪 Use Cases

  • ML Model Training: Realistic arrival patterns for better model performance
  • Load Testing: Simulate traffic spikes and patterns
  • Feature Engineering: Rich, consistent data for pipeline development
  • System Testing: Controlled drift and noise injection
  • Research: Reproducible datasets with deterministic seeding

Scenario: BNPL Fraud Detection

import asyncio

from simtom.core.generator import GeneratorConfig
from simtom.generators.ecommerce.bnpl import BNPLGenerator

async def test_fraud_model():
    # Generate baseline data
    baseline_config = GeneratorConfig(seed=42, total_records=1000)
    baseline_gen = BNPLGenerator(baseline_config)

    # Train model on baseline
    # (train_fraud_model is a placeholder for your own training function)
    baseline_data = [record async for record in baseline_gen.stream()]
    model = train_fraud_model(baseline_data)

    # Test with drift scenario
    drift_config = GeneratorConfig(
        seed=123,  # Different seed = different patterns
        total_records=200
    )
    drift_gen = BNPLGenerator(drift_config)

    # Evaluate model performance
    async for record in drift_gen.stream():
        prediction = model.predict(record)
        actual = record['default_risk']
        # Track accuracy degradation

asyncio.run(test_fraud_model())
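
The "Track accuracy degradation" step is left open in the scenario. One simple option is a rolling accuracy over the most recent predictions, sketched below. RollingAccuracy is a hypothetical helper, and it assumes predictions and labels can be compared with ==.

```python
from collections import deque
from typing import Any, Deque


class RollingAccuracy:
    """Fraction of correct predictions over the last `window` records."""

    def __init__(self, window: int = 100) -> None:
        self._hits: Deque[bool] = deque(maxlen=window)  # Old entries drop off automatically

    def update(self, prediction: Any, actual: Any) -> float:
        self._hits.append(prediction == actual)
        return sum(self._hits) / len(self._hits)


tracker = RollingAccuracy(window=3)
history = [tracker.update(p, a) for p, a in [(1, 1), (0, 1), (1, 1), (0, 0)]]
print([round(x, 2) for x in history])
```

Inside the evaluation loop this would be called as tracker.update(prediction, actual) for each record; a sustained drop in the returned value points to degradation.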

🚀 Deployment

Docker

docker build -t simtom .
docker run -p 8000:8000 simtom

Railway

# Connect to Railway
railway login
railway link

# Deploy
railway up

🀝 Contributing

SIMTOM is designed for community extension. Add new generators by:

  1. Inherit from BaseGenerator
  2. Implement async def generate_record()
  3. Add @register_generator("name") decorator
  4. Place in simtom/generators/ - auto-discovered!

Adding New Generators

  1. Create Generator Class

    # simtom/generators/finance/credit_cards.py
    from typing import Any, Dict

    from simtom.core.generator import BaseGenerator, register_generator

    @register_generator("credit_cards")
    class CreditCardGenerator(BaseGenerator):
        async def generate_record(self) -> Dict[str, Any]:
            return {
                "card_number": self.faker.credit_card_number(),
                "amount": self.faker.pyfloat(min_value=1, max_value=1000),
                "merchant": self.faker.company()
            }
  2. Add Tests

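    # tests/generators/test_credit_cards.py
    import pytest

    from simtom.core.generator import GeneratorConfig
    from simtom.generators.finance.credit_cards import CreditCardGenerator

    @pytest.mark.asyncio  # Assumes the pytest-asyncio plugin is installed
    async def test_credit_card_generation():
        config = GeneratorConfig(total_records=10)
        generator = CreditCardGenerator(config)
        records = [r async for r in generator.stream()]
        assert len(records) == 10
        assert all("card_number" in r for r in records)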
  3. Update Documentation: Add to generator table above

Development Setup

# Install development dependencies
poetry install --with dev

# Run tests
pytest

# Code formatting
black .
ruff check .

# Type checking
mypy simtom/

Code Quality Standards

  • Type Hints: All public APIs must have type annotations
  • Async First: Use async/await for I/O operations
  • Testing: >90% test coverage required
  • Documentation: Docstrings for all public methods

📈 Performance

Benchmarks

| Records/sec | Memory Usage | CPU Usage |
|-------------|--------------|-----------|
| 10 | ~50MB | ~5% |
| 100 | ~75MB | ~15% |
| 1000 | ~150MB | ~40% |

Optimization Tips

  • Use appropriate rate_per_second for your use case
  • Set total_records to avoid infinite streams
  • Consider Redis caching for repeated scenarios
  • Use Docker limits in production

🐛 Troubleshooting

Common Issues

Generator Not Found

# Error: Generator 'my_gen' not found
# Solution: Ensure @register_generator decorator is used

High Memory Usage

# Issue: Memory grows over time
# Solution: Set total_records limit or use streaming processing
async for record in generator.stream():
    process_record(record)  # Process immediately, don't accumulate

Slow Generation

# Issue: Generation too slow
# Solution: Increase rate_per_second or check async usage
config = GeneratorConfig(rate_per_second=100)  # Faster

📚 Advanced Usage

Custom Time Scenarios

# Simulate Black Friday traffic spike
config = GeneratorConfig(
    time_compression=24.0,  # 1 hour = 24 hours of data
    rate_per_second=50.0    # Higher transaction volume
)
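
The arithmetic behind that configuration can be checked in a few lines. The sketch below assumes that time_compression divides simulated time to give wall-clock time, and that rate_per_second is measured in wall-clock seconds; neither is confirmed here. wall_clock_seconds is a hypothetical helper.

```python
def wall_clock_seconds(simulated_seconds: float, time_compression: float) -> float:
    """Real seconds needed to play back a span of simulated time (assumed semantics)."""
    if time_compression <= 0:
        raise ValueError("time_compression must be positive")
    return simulated_seconds / time_compression


one_day = 24 * 3600                          # Simulated seconds in one day
real = wall_clock_seconds(one_day, 24.0)     # 3600 s, i.e. one real hour
records = 50.0 * real                        # Records emitted at 50 per wall-clock second
print(f"{real / 3600:.1f} h wall-clock, {records:,.0f} records")
```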

Data Drift Simulation

# Gradual drift over time
configs = [
    GeneratorConfig(seed=42),    # Baseline
    GeneratorConfig(seed=43),    # Month 1
    GeneratorConfig(seed=44),    # Month 2
]

for config in configs:
    generator = BNPLGenerator(config)
    # Test model performance degradation

📄 License

MIT License - see LICENSE file for details.

🙋‍♂️ Support


Built for ML Engineers, by ML Engineers 🤖
