Thanks to visit codestin.com
Credit goes to github.com

Skip to content

SmallML is a three-layer Bayesian framework that enables small and medium-sized enterprises (SMEs) to build production-grade machine learning models despite having limited customer data.

License

Notifications You must be signed in to change notification settings

seemyon/smallml

Repository files navigation

SmallML: Bayesian Transfer Learning for Small Data

PyPI version Downloads_1 Downloads_2 License: MIT Python 3.9+

Build production-grade machine learning models with just 50-200 observations per business entity.

SmallML combines transfer learning, hierarchical Bayesian inference, and conformal prediction to enable SMEs to achieve reliable predictive analytics despite limited data.

🎯 Key Features

  • Works with tiny datasets: 50-200 observations per entity, 3-10 entities total
  • Transfer learning: Extracts knowledge from 100K+ public observations (pre-trained priors included)
  • Hierarchical pooling: Shares statistical strength across multiple business entities
  • Uncertainty guarantees: Bayesian credible intervals + distribution-free prediction sets
  • Production-ready: <30 minutes training, <100ms inference, automatic convergence validation

πŸš€ Quick Start

Installation

pip install smallml

Development Version

To install the latest development version from GitHub:

pip install git+https://github.com/seemyon/smallml@main

Basic Usage (5 lines of code!)

from smallml import Pipeline
import pandas as pd

# Your data: dict of {entity_name: dataframe}
sme_data = {
    'store_1': pd.read_csv('store_1.csv'),  # 80 customers
    'store_2': pd.read_csv('store_2.csv'),  # 120 customers
    'store_3': pd.read_csv('store_3.csv'),  # 95 customers
    # ... 3-10 stores total
}

# Create and fit pipeline (automatically validates convergence)
pipeline = Pipeline()
pipeline.fit(sme_data, target_col='churned')

# Make predictions with uncertainty
predictions = pipeline.predict(new_customers, sme_id='store_1')
print(predictions)
#    prediction  bayesian_std  bayesian_lower_90  bayesian_upper_90  conformal_set  conformal_set_size
# 0       0.23          0.12                0.04               0.42            {0}                   1
# 1       0.78          0.15                0.51               0.95            {1}                   1
# 2       0.51          0.21                0.18               0.84          {0,1}                   2  # Uncertain!

See full tutorial β†’

πŸ“š How It Works

SmallML uses a two-layer architecture:

  1. Layer 2 (Hierarchical Bayesian): Pools information across J entities using PyMC NUTS sampler

    • Uses pre-trained priors from 100K+ public observations
    • Returns full posterior distributions, not just point estimates
    • Automatic convergence validation (RΜ‚ < 1.01, ESS > 400)
  2. Layer 3 (Conformal Prediction): Provides distribution-free uncertainty

    • Split-conformal calibration for coverage guarantees
    • Returns prediction sets: {0} (certain), {1} (certain), or {0,1} (uncertain)
    • Empirical coverage typically 87-93% for 90% target

πŸ“Š Performance Expectations

  • Prediction Accuracy: 75-85% AUC on churn with 100 customers per entity
  • Conformal Coverage: 87-93% empirical for 90% target intervals
  • Training Time: 15-30 minutes for J=5 entities with 100 customers each
  • Inference: <100ms per prediction
  • Convergence: RΜ‚ < 1.01, ESS > 400 (automatically validated)

πŸ§ͺ Requirements

Data Requirements

  • Minimum: 3 entities, 30 observations per entity
  • Recommended: 5+ entities, 50+ observations per entity
  • Use Case: Binary classification (churn, conversion, etc.)
  • Features: Numerical + categorical (automatically handled)

Input Format

sme_data = {
    'entity_1': pd.DataFrame({
        'feature_1': [...],      # Numerical or categorical
        'feature_2': [...],
        'feature_3': [...],
        'churned': [0, 1, 0, ...]  # Binary target (0/1)
    }),
    'entity_2': pd.DataFrame({...}),
    # ... 3-10 entities
}

Python Requirements

  • Python: 3.9 or higher
  • Dependencies: PyMC β‰₯5.0, ArviZ β‰₯0.22.0, pandas β‰₯2.3, numpy β‰₯2.3, scikit-learn β‰₯1.7, scipy β‰₯1.16

πŸ“– Documentation

  • Installation Guide: See above for basic installation
  • Quickstart Tutorial: examples/quickstart.py
  • API Reference: Check docstrings in smallml.pipeline.Pipeline
  • Research Paper: See docs/ for technical details

πŸ”¬ Research & Reproducibility

This package is the production-ready version of the SmallML research framework. For research code, paper reproduction, and detailed technical documentation, see:

  • Research Code: src/ directory
  • Reproduction Scripts: scripts/ directory
  • Technical Docs: docs/ directory
  • Original README: See existing README.md for research details

πŸŽ“ Citation

If you use SmallML in your research, please cite:

@software{smallml2025,
  title = {SmallML: Bayesian Transfer Learning for Small-Data Predictive Analytics},
  author = {Leontev, Semen},
  year = {2025},
  url = {https://github.com/seemyon/smallml},
}

🀝 Contributing

Contributions welcome! Please open an issue or pull request.

πŸ“ License

This project is licensed under the MIT License - see LICENSE file.

πŸ”— Links


SmallML: Empowering small businesses with reliable ML despite limited data.

βš™οΈ Advanced Usage

Evaluating Model Performance

# Evaluate on test data
X_test, y_test = load_test_data()
metrics = pipeline.evaluate(X_test, y_test, sme_id='store_1')

print(f"AUC: {metrics['auc']:.3f}")
print(f"Accuracy: {metrics['accuracy']:.3f}")
print(f"F1 Score: {metrics['f1_score']:.3f}")
print(f"Conformal Coverage: {metrics['conformal_coverage']:.3f}")  # Should be ~0.90
print(f"Mean Set Size: {metrics['mean_set_size']:.2f}")  # 1.0 = certain, 2.0 = uncertain

Checking MCMC Convergence

# Get convergence diagnostics
diagnostics = pipeline.get_convergence_diagnostics()
print(diagnostics)
#        parameter  r_hat    ess_bulk  ess_tail
# 0       mu[0]     1.003    1845      2103
# 1       mu[1]     1.002    1923      2247
# ...

# All RΜ‚ should be < 1.01, ESS should be > 400

Saving and Loading Pipelines

# Save fitted pipeline
pipeline.save('models/my_pipeline.pkl')

# Load later
from smallml import Pipeline
pipeline = Pipeline.load('models/my_pipeline.pkl')
predictions = pipeline.predict(new_data)

Quick Mode for Prototyping

# Faster MCMC (fewer iterations) for testing
pipeline = Pipeline(quick_mode=True)
pipeline.fit(sme_data, target_col='churned')  # Takes ~5-10 min instead of 15-30

# For production, use default settings:
pipeline = Pipeline(quick_mode=False)  # More reliable convergence

❓ FAQ

Q: What if I don't have pre-trained priors? A: You'll need to add your own priors to smallml/data/priors_churn.pkl. The package structure is ready, and you can copy your existing priors there. The file should contain {'beta_0': np.ndarray, 'Sigma_0': np.ndarray}.

Q: Can I use this for regression instead of classification? A: Currently SmallML focuses on binary classification. Regression support is planned for future versions.

Q: What if MCMC doesn't converge? A: The pipeline automatically validates convergence. If it fails, try:

  • Use quick_mode=False for more MCMC iterations
  • Ensure you have at least 50 observations per entity
  • Check that features are properly normalized

Q: How do I interpret conformal sets? A:

  • {0} = Certain prediction: will NOT churn
  • {1} = Certain prediction: WILL churn
  • {0,1} = Uncertain prediction: could go either way

Q: Can I use this with just 2 entities? A: The package will warn but still work. However, hierarchical pooling works best with 3+ entities (5+ recommended).

About

SmallML is a three-layer Bayesian framework that enables small and medium-sized enterprises (SMEs) to build production-grade machine learning models despite having limited customer data.

Resources

License

Stars

Watchers

Forks

Packages

No packages published