Realistic data simulator for ML system testing with time-compressed scenarios and controlled drift
SIMTOM is an extensible data generation platform that creates realistic streaming data for machine learning model training and testing. Features include configurable arrival patterns, noise injection, drift simulation, and time compression for accelerated development cycles.
The Problem: Your ML model works in dev but fails in production. Unit tests use toy data. Load testing (Locust, wrk) only tests performance, not model behavior. Real production data is risky, regulated, or unavailable.
The Solution: simtom generates statistically realistic synthetic data with controlled patterns, drift, and edge cases. Test your ML models with production-like scenarios without production risks.
Different from load testing: While Locust tests "can your API handle 1000 requests?", simtom tests "does your fraud model still work when spending patterns change seasonally?"
Production Endpoint: https://simtom-production.up.railway.app
# Quick test
curl https://simtom-production.up.railway.app/generators
# Stream sample data
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{"rate_per_second": 2.0, "max_records": 3}'
- Realistic Traffic Patterns: Uniform, Poisson, NHPP, and Burst arrival patterns
- Rich Data Generation: BNPL transactions with risk scoring and customer profiles
- Historical Data Generation: Generate years of data with realistic temporal patterns
- Holiday & Seasonal Effects: Black Friday +60%, Christmas -88%, weekend reductions
- Day-per-Second Delivery: Historical data streams at 1 day per second (365 days in ~6 minutes)
- Time Compression: Simulate days/weeks of data in minutes
- Plugin Architecture: Easy extension with custom generators
- Real-time Streaming: Server-sent events with configurable rates
- ML-Ready: Built-in noise, drift, and deterministic seeding
# Check health and available generators
curl https://simtom-production.up.railway.app/
# Stream live BNPL data (current timestamps)
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{"rate_per_second": 2.0, "max_records": 5, "seed": 42}'
# Generate 3 months of historical BNPL data with realistic volumes
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{
"start_date": "2024-06-01",
"end_date": "2024-09-01",
"base_daily_volume": 1000,
"seed": 42
}' > historical_bnpl_data.jsonl
# Fast generation of full year dataset (delivered in ~6 minutes)
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{
"start_date": "2024-01-01",
"end_date": "2024-12-31",
"base_daily_volume": 1000,
"include_holiday_patterns": true,
"seed": 42
}' > bnpl_full_year.jsonl
The streaming endpoints return Server-Sent Events (SSE): each record arrives on its own line, prefixed with `data: `:
data: {"transaction_id": "txn_00000000", "timestamp": "2025-09-15T11:10:21.307911", "customer_id": "cust_000001", "amount": 143.02, ...}
data: {"transaction_id": "txn_00000001", "timestamp": "2025-09-15T11:10:21.318045", "customer_id": "cust_000002", "amount": 67.89, ...}
Python Example:
import requests
import json
response = requests.post(
'https://simtom-production.up.railway.app/stream/bnpl',
json={"rate_per_second": 10, "max_records": 5},
stream=True
)
for line in response.iter_lines(decode_unicode=True):
if line.startswith('data: '):
json_data = line[6:] # Remove 'data: ' prefix
record = json.loads(json_data)
print(record['transaction_id'], record['amount'])
JavaScript Example:
fetch('/stream/bnpl', {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({rate_per_second: 10, max_records: 5})
})
.then(response => {
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  function read() {
    return reader.read().then(({done, value}) => {
      if (done) return;
      buffer += decoder.decode(value, {stream: true});
      const lines = buffer.split('\n');
      buffer = lines.pop();  // keep any partial line until the next chunk arrives
      lines.forEach(line => {
        if (line.startsWith('data: ')) {
          const record = JSON.parse(line.substring(6));
          console.log(record.transaction_id, record.amount);
        }
      });
      return read();
    });
  }
  return read();
});
Important Notes:
- Standard JSON parsers will fail unless you strip the `data: ` prefix first
- Use streaming HTTP clients for large datasets to avoid memory issues
- Each line contains a complete JSON record (no multi-line JSON)
git clone https://github.com/whitehackr/simtom.git
cd simtom
poetry install
poetry run python scripts/run_server.py
curl http://localhost:8000/generators
from simtom.generators.ecommerce.bnpl import BNPLGenerator, BNPLConfig
from datetime import date
# Real-time streaming (current timestamps)
config = BNPLConfig(
rate_per_second=10.0,
max_records=1000,
seed=42
)
generator = BNPLGenerator(config)
async for record in generator.stream():
print(record) # Process each synthetic transaction
# Historical data generation (specific date range)
historical_config = BNPLConfig(
start_date=date(2024, 1, 1),
end_date=date(2024, 12, 31),
base_daily_volume=1000, # Realistic daily volumes
include_holiday_patterns=True,
seed=42
)
historical_generator = BNPLGenerator(historical_config)
async for record in historical_generator.stream():
print(record) # Historical transactions delivered day-per-second
Generate realistic historical datasets with proper temporal patterns for ML training and backtesting.
- Date Range Support: Generate data for any period up to 1 year
- Statistical Volume Distribution: 4-factor model (day-of-week, week-of-month, seasonal, events)
- Business Hour Patterns: 70% during 9am-6pm, 20% evenings, 10% nights
- Weekend Adjustments: 15% reduction on weekends (realistic e-commerce patterns)
- Holiday Effects: Black Friday +60%, Christmas -88%, configurable patterns
- Day-per-Second Delivery: Historical data streams rapidly (1 day per second)
- Chronological Ordering: All timestamps properly sorted for time-series analysis
{
// Special Events
"black_friday": 1.6, // +60% traffic (biggest shopping day)
"cyber_monday": 1.4, // +40% traffic
"christmas_day": 0.12, // -88% (most stores closed)
"new_years_day": 0.3, // -70% traffic
"valentines_day": 1.15, // +15% traffic
"mothers_day": 1.15, // +15% traffic
// Day of Week
"friday": 1.25, // +25% (weekend prep)
"saturday": 0.85, // -15% weekend reduction
"sunday": 0.70, // -30% weekend reduction
// Seasonal
"january": 0.75, // -25% post-holiday low
"november": 1.10 // +10% pre-holiday buildup
}
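As a rough illustration of how multipliers like those above combine with the volume model, here is a hedged Python sketch. The tables and the `expected_daily_volume` helper are hypothetical (the week-of-month factor is omitted for brevity); they only mirror the documented day-of-week, seasonal, and event effects, not simtom's actual code.

```python
import datetime

# Hypothetical multiplier tables, mirroring the documented patterns.
DAY_OF_WEEK = {4: 1.25, 5: 0.85, 6: 0.70}        # Fri +25%, Sat -15%, Sun -30%
MONTHLY = {1: 0.75, 11: 1.10}                    # January low, November buildup
EVENTS = {datetime.date(2024, 11, 29): 1.6,      # Black Friday 2024
          datetime.date(2024, 12, 25): 0.12}     # Christmas Day

def expected_daily_volume(day: datetime.date, base: int = 1000) -> int:
    """Multiply the base volume by every factor that applies to this date."""
    m = DAY_OF_WEEK.get(day.weekday(), 1.0)
    m *= MONTHLY.get(day.month, 1.0)
    m *= EVENTS.get(day, 1.0)
    return round(base * m)

print(expected_daily_volume(datetime.date(2024, 11, 29)))  # → 2200 (1000 × 1.25 × 1.10 × 1.6)
```

Note that in this simple sketch the Friday and Black Friday multipliers stack; whether simtom compounds or overrides factors on event days is not specified here.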
- Historical Mode: Day-per-second delivery (365 days in ~6 minutes)
- Real-time Mode: Configurable rates 0.1-1000 records/second
- Memory Efficient: O(1) streaming regardless of dataset size
- Network Optimized: Batched delivery prevents overwhelming clients
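The day-per-second delivery model above can be sketched as an async generator. This is an assumed model, not simtom's implementation; the `seconds_per_day` parameter is invented here to make the pacing explicit.

```python
import asyncio
from typing import Any, AsyncIterator, Iterable, List, Tuple

async def stream_day_per_second(
    days: Iterable[Tuple[str, List[Any]]],   # (day label, that day's records)
    seconds_per_day: float = 1.0,
) -> AsyncIterator[Any]:
    """Emit all of one simulated day's records, then pause one (scaled) second."""
    for _day, records in days:
        for record in records:
            yield record                      # burst out the whole day's records
        await asyncio.sleep(seconds_per_day)  # pace: one simulated day per tick
```

At the default pacing, a 365-day dataset finishes in about 365 seconds, which is where the "~6 minutes" figure comes from.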
Fixed intervals - predictable for testing
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{
"rate_per_second": 2.0,
"arrival_pattern": "uniform"
}'
Random intervals with realistic variability
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{
"rate_per_second": 2.0,
"arrival_pattern": "poisson"
}'
Daily traffic patterns with peak hours
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{
"rate_per_second": 1.0,
"arrival_pattern": "nhpp",
"peak_hours": [12, 19],
"time_compression": 24.0
}'
Flash sale and event-driven spikes
curl -X POST https://simtom-production.up.railway.app/stream/bnpl \
-H "Content-Type: application/json" \
-d '{
"rate_per_second": 2.0,
"arrival_pattern": "burst",
"burst_intensity": 3.0,
"burst_probability": 0.6
}'
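For intuition, the patterns above can be reduced to sampling inter-arrival times. The sketch below is a hedged illustration, not simtom's actual implementation: `burst_intensity` and `burst_probability` mirror the request parameters, and the NHPP case is omitted because it requires a time-varying rate function.

```python
import random

def inter_arrival(pattern: str, rate: float, rng: random.Random,
                  burst_intensity: float = 3.0, burst_probability: float = 0.6) -> float:
    """Return the wait in seconds before the next record for a given pattern."""
    if pattern == "uniform":
        return 1.0 / rate                # fixed spacing: predictable for testing
    if pattern == "poisson":
        return rng.expovariate(rate)     # exponential gaps: random but rate-preserving
    if pattern == "burst":
        # During a burst, records arrive burst_intensity times faster.
        effective = rate * (burst_intensity if rng.random() < burst_probability else 1.0)
        return rng.expovariate(effective)
    raise ValueError(f"unknown pattern: {pattern}")

rng = random.Random(42)
gaps = [inter_arrival("poisson", 2.0, rng) for _ in range(10_000)]
print(sum(gaps) / len(gaps))  # ≈ 0.5 s mean gap at 2 records/second
```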
- Plugin Architecture: Auto-discovery of data generators via decorators
- Async Streaming: Memory-efficient generation of large datasets
- Type Safety: Pydantic models for configuration and validation
- Extensibility: Add new generators without touching core code
- Plugin System: Auto-discovery of generators
- Memory Efficient: O(1) streaming regardless of dataset size
- Entity Consistency: LRU registries maintain referential integrity
- FastAPI: Modern async web framework
- Pydantic: Type-safe configuration validation
simtom/
├── core/                # Stable abstractions
│   ├── generator.py     # BaseGenerator + GeneratorConfig
│   ├── registry.py      # Plugin auto-discovery
│   └── entities.py      # Core data models
├── generators/          # Pluggable data generators
│   └── ecommerce/
│       └── bnpl.py      # BNPL risk data generator
├── api/                 # FastAPI web layer
│   ├── main.py          # Application factory
│   ├── routes.py        # Streaming endpoints
│   └── models.py        # Request/response schemas
└── scenarios/           # Time-based scenario modeling
New generators are automatically registered:
@register_generator("my_generator")
class MyGenerator(BaseGenerator):
async def generate_record(self) -> Dict[str, Any]:
return {"id": uuid4(), "value": random.random()}
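The snippet above relies on simtom's auto-discovery. As a rough sketch of how a decorator-based registry works in general (the internals of `simtom.core.registry` may differ), consider:

```python
from typing import Any, Dict

GENERATORS: Dict[str, type] = {}  # name -> generator class

def register_generator(name: str):
    """Class decorator: record the class under `name` when its module is imported."""
    def wrap(cls: type) -> type:
        GENERATORS[name] = cls
        return cls
    return wrap

@register_generator("my_generator")
class MyGenerator:
    def generate_record(self) -> Dict[str, Any]:
        return {"value": 1}

# Lookup by name, as an API route handler might do:
print(GENERATORS["my_generator"]().generate_record())  # → {'value': 1}
```

Because registration happens at import time, simply placing a module under `simtom/generators/` and importing the package is enough to make the generator discoverable.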
BNPL transactions include 40+ fields:
{
"transaction_id": "txn_00000001",
"customer_id": "cust_000001",
"amount": 485.61,
"risk_score": 0.85,
"risk_level": "high",
"installment_count": 4,
"customer_age_bracket": "25-34",
"product_category": "electronics",
"device_type": "mobile",
"payment_provider": "afterpay"
}
| Generator | Description | Use Case |
|---|---|---|
| `bnpl` | Buy-Now-Pay-Later transactions with risk scoring | Credit risk, fraud detection |
from simtom.core.generator import GeneratorConfig
config = GeneratorConfig(
rate_per_second=1.0, # Records per second (1-1000)
total_records=None, # Infinite if None
seed=42, # Reproducible randomness
time_compression=1.0 # Real-time = 1.0, faster = > 1.0
)
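`time_compression` trades wall-clock time for simulated time. A small illustration of the arithmetic follows; the assumed behavior (real delays divided by the compression factor) is a sketch, not a statement about simtom's internals.

```python
def wall_clock_seconds(simulated_seconds: float, time_compression: float) -> float:
    """How long a simulated span takes to stream at a given compression factor."""
    return simulated_seconds / time_compression

# A 24-hour day of data at time_compression=24.0 streams in one real hour:
print(wall_clock_seconds(24 * 3600, 24.0) / 3600)  # → 1.0 hour
```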
| Parameter | Description | Default | Mode |
|---|---|---|---|
| `rate_per_second` | Arrival rate (0.1-1000) | 1.0 | Current-date |
| `base_daily_volume` | Average daily transactions | 1000 | Historical |
| `start_date` | Historical start date | null | Historical |
| `end_date` | Historical end date | null | Historical |
| `include_holiday_patterns` | Enable seasonal effects | true | Both |
| `arrival_pattern` | Traffic pattern | "uniform" | Current-date |
| `peak_hours` | NHPP peak hours | [12, 19] | Current-date |
| `max_records` | Maximum records to generate | null | Both |
| `seed` | Deterministic output | null | Both |
# API Configuration
SIMTOM_HOST=0.0.0.0
SIMTOM_PORT=8000
SIMTOM_LOG_LEVEL=info
# Redis (optional, for caching)
REDIS_URL=redis://localhost:6379
- ML Model Training: Realistic arrival patterns for better model performance
- Load Testing: Simulate traffic spikes and patterns
- Feature Engineering: Rich, consistent data for pipeline development
- System Testing: Controlled drift and noise injection
- Research: Reproducible datasets with deterministic seeding
import asyncio
from simtom.core.generator import GeneratorConfig
from simtom.generators.ecommerce.bnpl import BNPLGenerator

async def test_fraud_model():
    # Generate baseline data
    baseline_config = GeneratorConfig(seed=42, total_records=1000)
    baseline_gen = BNPLGenerator(baseline_config)

    # Train model on baseline (train_fraud_model is your own training code)
    baseline_data = [record async for record in baseline_gen.stream()]
    model = train_fraud_model(baseline_data)

    # Test with drift scenario
    drift_config = GeneratorConfig(
        seed=123,  # Different seed = different patterns
        total_records=200
    )
    drift_gen = BNPLGenerator(drift_config)

    # Evaluate model performance
    async for record in drift_gen.stream():
        prediction = model.predict(record)
        actual = record['default_risk']
        # Track accuracy degradation here

asyncio.run(test_fraud_model())
docker build -t simtom .
docker run -p 8000:8000 simtom
# Connect to Railway
railway login
railway link
# Deploy
railway up
SIMTOM is designed for community extension. Add new generators by:
- Inheriting from `BaseGenerator`
- Implementing `async def generate_record()`
- Adding the `@register_generator("name")` decorator
- Placing the module in `simtom/generators/` (it will be auto-discovered!)
1. Create Generator Class

# simtom/generators/finance/credit_cards.py
from typing import Any, Dict

from simtom.core.generator import BaseGenerator, register_generator

@register_generator("credit_cards")
class CreditCardGenerator(BaseGenerator):
    async def generate_record(self) -> Dict[str, Any]:
        return {
            "card_number": self.faker.credit_card_number(),
            "amount": self.faker.pyfloat(min_value=1, max_value=1000),
            "merchant": self.faker.company()
        }

2. Add Tests

# tests/generators/test_credit_cards.py
async def test_credit_card_generation():
    config = GeneratorConfig(total_records=10)
    generator = CreditCardGenerator(config)
    records = [r async for r in generator.stream()]
    assert len(records) == 10
    assert all("card_number" in r for r in records)

3. Update Documentation: add the new generator to the table above
# Install development dependencies
poetry install --with dev
# Run tests
pytest
# Code formatting
black .
ruff check .
# Type checking
mypy simtom/
- Type Hints: All public APIs must have type annotations
- Async First: Use `async`/`await` for I/O operations
- Testing: >90% test coverage required
- Documentation: Docstrings for all public methods
| Records/sec | Memory Usage | CPU Usage |
|---|---|---|
| 10 | ~50MB | ~5% |
| 100 | ~75MB | ~15% |
| 1000 | ~150MB | ~40% |
- Use an appropriate `rate_per_second` for your use case
- Set `total_records` to avoid infinite streams
- Consider Redis caching for repeated scenarios
- Use Docker limits in production
Generator Not Found
# Error: Generator 'my_gen' not found
# Solution: Ensure @register_generator decorator is used
High Memory Usage
# Issue: Memory grows over time
# Solution: Set total_records limit or use streaming processing
async for record in generator.stream():
process_record(record) # Process immediately, don't accumulate
Slow Generation
# Issue: Generation too slow
# Solution: Increase rate_per_second or check async usage
config = GeneratorConfig(rate_per_second=100) # Faster
# Simulate Black Friday traffic spike
config = GeneratorConfig(
time_compression=24.0, # 1 hour = 24 hours of data
rate_per_second=50.0 # Higher transaction volume
)
# Gradual drift over time
configs = [
GeneratorConfig(seed=42), # Baseline
GeneratorConfig(seed=43), # Month 1
GeneratorConfig(seed=44), # Month 2
]
for config in configs:
generator = BNPLGenerator(config)
# Test model performance degradation
MIT License - see LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Live Demo: https://simtom-production.up.railway.app
Built for ML Engineers, by ML Engineers