3 releases
| Version | Released |
|---|---|
| 0.1.0-beta.1 | Jan 1, 2026 |
| 0.1.0-alpha.2 | Dec 23, 2025 |
| 0.1.0-alpha.1 | Oct 13, 2025 |
#16 in #data-generation
Used in sklears
5MB, 106K SLoC
sklears-datasets
Latest release: 0.1.0-beta.1 (January 1, 2026). See the workspace release notes for highlights and upgrade guidance.
Overview
sklears-datasets centralizes the dataset loaders, synthetic generators, and data utilities used throughout the sklears ecosystem. It mirrors scikit-learn's sklearn.datasets module while adding Rust-first performance and IO enhancements.
Key Features
- Classic Loaders: Diabetes, Iris, Digits, Wine, Breast Cancer, 20 Newsgroups, and more.
- Synthetic Generators: make_blobs, make_moons, make_circles, Gaussian quantiles, regression surfaces, and streaming generators.
- File IO: CSV, Parquet, Arrow IPC, and memory-mapped dataset support with Polars integration.
- Benchmark Utilities: Deterministic dataset splits and sampling strategies for reproducible experiments.
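To illustrate what a deterministic split means in practice, here is a minimal, dependency-free sketch. It is not the crate's API: `train_test_indices` and the `SplitMix64` helper are hypothetical names introduced only for this example. The idea is that a seeded shuffle followed by a fixed cut yields the same train/test partition on every run.

```rust
/// Small deterministic pseudo-random generator (SplitMix64).
/// Used here only so the sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical helper (not part of sklears-datasets): returns
/// `(train_indices, test_indices)` for a dataset with `n_samples` rows.
fn train_test_indices(
    n_samples: usize,
    test_fraction: f64,
    seed: u64,
) -> (Vec<usize>, Vec<usize>) {
    let mut indices: Vec<usize> = (0..n_samples).collect();
    let mut rng = SplitMix64(seed);

    // Fisher-Yates shuffle driven by the seeded generator.
    // The modulo introduces a negligible bias, acceptable for a sketch.
    for i in (1..n_samples).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        indices.swap(i, j);
    }

    // The first `n_test` shuffled indices form the test set, the rest the train set.
    let n_test = ((n_samples as f64) * test_fraction).round() as usize;
    let n_test = n_test.min(n_samples);
    let train = indices.split_off(n_test);
    (train, indices)
}
```

Calling `train_test_indices(150, 0.2, 42)` twice returns identical index vectors, which is the property reproducible experiments rely on.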
Quick Start
use sklears_datasets::{load_iris, make_blobs};

// Built-in dataset
let iris = load_iris()?;
println!("{} samples, {} features", iris.data.nrows(), iris.data.ncols());

// Synthetic data
let blobs = make_blobs(1000)
    .n_features(10)
    .centers(4)
    .cluster_std(2.5)
    .random_state(Some(42))
    .build()?;
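For readers unfamiliar with blob generators, the following standalone sketch shows the general technique: isotropic Gaussian noise added around fixed cluster centers. It is an assumption-laden illustration, not the crate's implementation. `sketch_blobs`, the `SplitMix64` helper, and the round-robin label assignment are all hypothetical choices made for this example.

```rust
use std::f64::consts::PI;

/// Small deterministic generator so the sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform sample in (0, 1], so it is safe to pass to `ln()`.
    fn next_unit(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }

    /// Standard normal sample via the Box-Muller transform.
    fn next_normal(&mut self) -> f64 {
        let u1 = self.next_unit();
        let u2 = self.next_unit();
        (-2.0 * u1.ln()).sqrt() * (2.0 * PI * u2).cos()
    }
}

/// Hypothetical helper: draws `n_samples` points around `centers`,
/// assigning samples to centers in round-robin order.
fn sketch_blobs(
    n_samples: usize,
    centers: &[Vec<f64>],
    cluster_std: f64,
    seed: u64,
) -> (Vec<Vec<f64>>, Vec<usize>) {
    let mut rng = SplitMix64(seed);
    let mut data = Vec::with_capacity(n_samples);
    let mut labels = Vec::with_capacity(n_samples);
    if centers.is_empty() {
        return (data, labels);
    }
    for i in 0..n_samples {
        let label = i % centers.len();
        // Each coordinate is the center coordinate plus scaled Gaussian noise.
        let point: Vec<f64> = centers[label]
            .iter()
            .map(|c| c + cluster_std * rng.next_normal())
            .collect();
        data.push(point);
        labels.push(label);
    }
    (data, labels)
}
```

Because the generator is seeded, the same `seed` reproduces the same points, which is the role `random_state` typically plays in builders like the one above.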
Status
- All loaders/generators validated through the 11,292 passing workspace tests for 0.1.0-beta.1.
- Supports lazy loading and streaming for large-scale workflows.
- Future work (federated dataset shards, synthetic time series) tracked in this crate's TODO.md.
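The streaming idea mentioned above can be sketched with a plain Rust iterator that produces batches on demand, so the full dataset never has to sit in memory. This is only an illustration under stated assumptions: `LinearStream` and its fields are hypothetical and do not correspond to any type in sklears-datasets.

```rust
/// Hypothetical streaming generator: lazily yields batches of `(x, y)`
/// samples from the noiseless line `y = slope * x + intercept`.
struct LinearStream {
    next_index: usize,
    n_samples: usize,
    batch_size: usize,
    slope: f64,
    intercept: f64,
}

impl Iterator for LinearStream {
    type Item = Vec<(f64, f64)>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.next_index >= self.n_samples {
            return None;
        }
        // Guard against a zero batch size, which would never make progress.
        let step = self.batch_size.max(1);
        let end = (self.next_index + step).min(self.n_samples);

        // Only this batch is materialized; earlier batches can be dropped.
        let batch = (self.next_index..end)
            .map(|i| {
                let x = i as f64;
                (x, self.slope * x + self.intercept)
            })
            .collect();
        self.next_index = end;
        Some(batch)
    }
}
```

A consumer can iterate with `for batch in stream { ... }` and process each batch before the next one is generated. The final batch may be shorter than `batch_size`.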
Dependencies
~41–74MB
~1.5M SLoC