3 releases (breaking)
| 0.3.0 | Mar 11, 2026 |
|---|---|
| 0.2.0 | Jan 8, 2026 |
| 0.1.0 | Dec 22, 2025 |
DataSpool - Efficient Data Bundling System
DataSpool is a high-performance data bundling library. It avoids per-file filesystem overhead by concatenating multiple items (cards, images, binary blobs) into a single indexed .spool file, with metadata and vector embeddings kept in a companion SQLite database.
Features
- Efficient Bundling - Single file storage with byte-offset index
- Random Access - Direct seeks to any item without scanning
- Vector Search - SQLite-backed embeddings for semantic retrieval
- Metadata Storage - Rich metadata with full-text search (FTS5)
- Multiple Variants - Cards (compressed CML), images, binary blobs
- Compact Format - Minimal overhead, optimal for thousands of items
- Type-Safe - Rust type safety with serde serialization
Quick Start
Writing a Spool
use dataspool::{SpoolBuilder, SpoolEntry};
// Create spool builder
let mut builder = SpoolBuilder::new();
// Add entries
builder.add_entry(SpoolEntry {
    id: "item1".to_string(),
    data: b"Item 1 data".to_vec(),
});
builder.add_entry(SpoolEntry {
    id: "item2".to_string(),
    data: b"Item 2 data".to_vec(),
});
// Write to file
builder.write_to_file("data.spool")?;
Reading from a Spool
use dataspool::SpoolReader;
// Open spool
let reader = SpoolReader::open("data.spool")?;
// Read specific entry
let data = reader.read_entry(0)?; // Read first entry
println!("Item 0: {} bytes", data.len());
// Iterate entries
for (index, entry) in reader.iter_entries().enumerate() {
    let data = entry?;
    println!("Item {}: {} bytes", index, data.len());
}
Reading an Embedded Spool
Spools can be embedded within larger files (e.g., an Engram archive) and read directly without extraction. open_embedded() takes a base byte offset and adjusts all internal offsets so that read_card() seeks to the correct position within the host file:
use dataspool::SpoolReader;
// Open a spool stitched into a larger file at byte offset 4096.
let mut reader = SpoolReader::open_embedded("archive.eng", 4096)?;
// read_card() transparently seeks within the host file.
let card = reader.read_card(0)?;
println!("Card 0: {} bytes", card.len());
This enables consumers like Engram to stitch spool data inline during archive compilation, then serve card reads directly from the archive file, with no temp extraction and no filesystem overhead.
Persistent Vector Store
use dataspool::{PersistentVectorStore, DocumentRef};
// Create persistent store
let mut store = PersistentVectorStore::new("vectors.db")?;
// Add document with embedding
let doc_ref = DocumentRef {
    id: "doc1".to_string(),
    file_path: "data.spool".to_string(),
    source: "web-scrape".to_string(),
    metadata: Some(r#"{"title": "Example"}"#.to_string()),
    spool_offset: Some(0),
    spool_length: Some(1024),
};
let embedding = vec![0.1, 0.2, 0.3, 0.4]; // Example embedding vector
store.add_document_ref(&doc_ref, &embedding)?;
// Search by vector similarity
let query_vector = vec![0.15, 0.25, 0.35, 0.45];
let results = store.search(&query_vector, 10)?;
for result in results {
    println!("ID: {}, Score: {:.3}", result.id, result.score);
}
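To make the idea of "search by vector similarity" concrete, here is a minimal brute-force sketch using only the standard library. It is not the crate's implementation: the README only mentions cosine similarity in the benchmark notes, and the function names `cosine_similarity` and `top_k` are hypothetical helpers introduced for illustration.

```rust
/// Cosine similarity of two vectors.
/// Returns None for empty, mismatched-length, or zero-norm inputs.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Score every document against the query and keep the `k` best,
/// highest score first. Documents that cannot be scored are skipped.
fn top_k<'a>(query: &[f32], docs: &'a [(&'a str, Vec<f32>)], k: usize) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = docs
        .iter()
        .filter_map(|(id, v)| cosine_similarity(query, v).map(|s| (*id, s)))
        .collect();
    scored.sort_by(|x, y| y.1.total_cmp(&x.1));
    scored.truncate(k);
    scored
}
```

A linear scan like this is fine for a few thousand embeddings; a real store would typically read the vectors from SQLite rather than from an in-memory slice.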
Spool Format
File Structure
.spool file:
┌─────────────────────────────┐
│ Magic: "SP01" (4 bytes)     │
│ Version: 1 (1 byte)         │
│ Card Count (4 bytes)        │
│ Index Offset (8 bytes)      │
├─────────────────────────────┤
│ Card 0 Data                 │
│ Card 1 Data                 │
│ ...                         │
│ Card N Data                 │
├─────────────────────────────┤
│ Index:                      │
│   [offset0, len0]           │
│   [offset1, len1]           │
│   ...                       │
│   [offsetN, lenN]           │
└─────────────────────────────┘
.db file (SQLite):
┌─────────────────────────────┐
│ documents table:            │
│   - id                      │
│   - file_path               │
│   - source                  │
│   - metadata (JSON)         │
│   - spool_offset            │
│   - spool_length            │
├─────────────────────────────┤
│ embeddings table:           │
│   - doc_id                  │
│   - vector (BLOB)           │
└─────────────────────────────┘
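The diagram stores each embedding as a BLOB but does not say how the floats are packed. One common approach, sketched below, is to concatenate the little-endian bytes of each `f32`. This layout is an assumption, not the crate's documented encoding, and `encode_embedding` / `decode_embedding` are hypothetical names.

```rust
use std::convert::TryInto;

/// Pack an embedding into bytes: 4 little-endian bytes per f32 (assumed layout).
fn encode_embedding(vector: &[f32]) -> Vec<u8> {
    vector.iter().flat_map(|x| x.to_le_bytes()).collect()
}

/// Unpack a BLOB produced by `encode_embedding`.
/// Returns None if the length is not a multiple of 4 bytes.
fn decode_embedding(blob: &[u8]) -> Option<Vec<f32>> {
    if blob.len() % 4 != 0 {
        return None;
    }
    Some(
        blob.chunks_exact(4)
            // chunks_exact(4) guarantees 4-byte slices, so the conversion cannot fail.
            .map(|chunk| f32::from_le_bytes(chunk.try_into().unwrap()))
            .collect(),
    )
}
```

With this layout a 4-dimensional embedding occupies 16 bytes, and the round trip is bit-exact.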
Format Details
- Magic Number: SP01 (4 bytes) - Identifies spool format
- Version: 1 (1 byte) - Format version
- Card Count: Number of entries in spool (u32)
- Index Offset: Byte offset where index starts (u64)
- Index: Array of [offset: u64, length: u32] pairs (12 bytes each)
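The layout above can be exercised with a small standard-library sketch that writes the 17-byte header, the item data, and the trailing index, then reads one item back through the index. Byte order is not stated in this document, so little-endian is an assumption here, as is the choice to store offsets relative to the start of the spool. `build_spool` and `read_item` are hypothetical helpers, not the crate's API.

```rust
use std::convert::TryInto;

const MAGIC: &[u8; 4] = b"SP01";
const VERSION: u8 = 1;
const HEADER_LEN: usize = 4 + 1 + 4 + 8; // magic + version + card count + index offset
const INDEX_ENTRY_LEN: usize = 8 + 4; // offset (u64) + length (u32)

/// Serialize items as header, item data, then index (little-endian assumed).
fn build_spool(items: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(MAGIC);
    out.push(VERSION);
    out.extend_from_slice(&(items.len() as u32).to_le_bytes());
    out.extend_from_slice(&0u64.to_le_bytes()); // index offset, patched below

    let mut index = Vec::with_capacity(items.len());
    for item in items {
        index.push((out.len() as u64, item.len() as u32));
        out.extend_from_slice(item);
    }

    let index_offset = out.len() as u64;
    for (offset, len) in &index {
        out.extend_from_slice(&offset.to_le_bytes());
        out.extend_from_slice(&len.to_le_bytes());
    }
    // The index offset lives in header bytes 9..17.
    out[9..17].copy_from_slice(&index_offset.to_le_bytes());
    out
}

/// Fetch item `i` via its index entry, without scanning earlier items.
fn read_item(spool: &[u8], i: usize) -> Option<&[u8]> {
    if spool.len() < HEADER_LEN || !spool.starts_with(MAGIC) || spool[4] != VERSION {
        return None;
    }
    let count = u32::from_le_bytes(spool[5..9].try_into().ok()?) as usize;
    if i >= count {
        return None;
    }
    let index_offset = u64::from_le_bytes(spool[9..17].try_into().ok()?) as usize;
    let start = index_offset.checked_add(i.checked_mul(INDEX_ENTRY_LEN)?)?;
    let entry = spool.get(start..start.checked_add(INDEX_ENTRY_LEN)?)?;
    let offset = u64::from_le_bytes(entry[0..8].try_into().ok()?) as usize;
    let len = u32::from_le_bytes(entry[8..12].try_into().ok()?) as usize;
    spool.get(offset..offset.checked_add(len)?)
}
```

Because each index entry has a fixed 12-byte size, locating entry `i` is a single multiplication followed by one slice lookup, which is what makes random access constant-time.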
Architecture
┌─────────────┐
│  DataCard   │ (compressed CML)
└──────┬──────┘
       │
       v
┌─────────────┐     ┌──────────────┐
│ SpoolBuilder│────>│ .spool file  │
└─────────────┘     └──────┬───────┘
                           │
              ┌────────────┼───────────────┐
              v            v               v
       ┌────────────┐ ┌──────────┐ ┌──────────────┐
       │ Standalone │ │ Embedded │ │ .db (SQLite) │
       │ SpoolReader│ │ in .eng  │ │ - documents  │
       │ ::open()   │ │::open_   │ │ - embeddings │
       └────────────┘ │embedded()│ └───────┬──────┘
                      └──────────┘         │
                                           v
                                  ┌──────────────────┐
                                  │ PersistentVector │
                                  │ Store            │
                                  └──────────────────┘
Standalone vs. Embedded
| Mode | Constructor | Use Case |
|---|---|---|
| Standalone | SpoolReader::open(path) | Reading .spool files directly from disk |
| Embedded | SpoolReader::open_embedded(path, offset) | Reading spools stitched into a host file (e.g., Engram archives) |
Both modes share the same read_card() / read_card_at() API. The embedded constructor adjusts internal byte offsets by the base offset so all seeks target the correct position within the host file.
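The offset adjustment described above amounts to adding the base offset to each spool-relative offset before seeking. The sketch below shows that arithmetic against any `Read + Seek` source; `read_embedded` is a hypothetical helper written for illustration, not the crate's internal code.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Read `len` bytes of an item whose `offset` is relative to the start of the
/// spool, when the spool itself begins `base` bytes into the host file.
/// Passing `base = 0` corresponds to the standalone case.
fn read_embedded<R: Read + Seek>(
    host: &mut R,
    base: u64,
    offset: u64,
    len: u32,
) -> io::Result<Vec<u8>> {
    let absolute = base
        .checked_add(offset)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "offset overflow"))?;
    host.seek(SeekFrom::Start(absolute))?;
    let mut buf = vec![0u8; len as usize];
    // read_exact fails with UnexpectedEof if the item runs past the end of the host.
    host.read_exact(&mut buf)?;
    Ok(buf)
}
```

The same function works on a `std::fs::File` or an in-memory `std::io::Cursor`, which makes the embedded path easy to test without building a real archive.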
Use Cases
1. Knowledge Base Archival
Bundle thousands of documentation cards into a single file:
// Build spool from cards
let mut builder = SpoolBuilder::new();
for card in documentation_cards {
    builder.add_entry(SpoolEntry {
        id: card.id,
        data: card.compressed_data,
    });
}
builder.write_to_file("rust-stdlib.spool")?;
// Create vector index
let mut store = PersistentVectorStore::new("rust-stdlib.db")?;
for (i, embedding) in embeddings.iter().enumerate() {
    store.add_document_ref(&DocumentRef {
        id: format!("card_{}", i),
        file_path: "rust-stdlib.spool".to_string(),
        spool_offset: Some(offsets[i]),
        spool_length: Some(lengths[i]),
        ...
    }, embedding)?;
}
2. Image Dataset Storage
Store image collections with metadata:
let mut builder = SpoolBuilder::new();
for image_path in image_paths {
    let data = std::fs::read(&image_path)?;
    builder.add_entry(SpoolEntry {
        // file_stem() returns an OsStr, which has no to_string(); convert lossily.
        id: image_path.file_stem().unwrap().to_string_lossy().into_owned(),
        data,
    });
}
builder.write_to_file("images.spool")?;
3. Binary Blob Archival
Archive arbitrary binary data with fast random access:
// Write blobs
let mut builder = SpoolBuilder::new();
builder.add_entry(SpoolEntry { id: "blob1".into(), data: blob1 });
builder.add_entry(SpoolEntry { id: "blob2".into(), data: blob2 });
builder.write_to_file("blobs.spool")?;
// Random access read
let reader = SpoolReader::open("blobs.spool")?;
let blob1_data = reader.read_entry(0)?; // Direct access, no scan
Performance
Benchmark results (3,309 items, Rust stdlib documentation):
| Operation | Time | Notes |
|---|---|---|
| Build spool | ~200ms | Writing all items + index |
| Read single item | <1ms | Direct byte offset seek |
| Read all items | ~50ms | Sequential read |
| SQLite insert (1 doc) | ~0.5ms | With embedding |
| Vector search (10 results) | ~5ms | Cosine similarity + index |
Comparison to Alternatives
| Approach | Read Speed | Storage Overhead | Random Access |
|---|---|---|---|
| Individual files | Slow (3,309 inodes) | High (4KB/file) | Yes |
| tar archive | Slow (must scan) | Low | No |
| zip archive | Fast | Medium | Yes |
| DataSpool | Fast | Minimal | Yes |
DataSpool Advantages
- No compression overhead - Items pre-compressed by BytePunch
- Instant random access - Direct byte offset, no central directory scan
- Integrated vector DB - Semantic search without external tools
- Minimal format - Simple binary format, easy to parse
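The "Minimal" storage overhead can be estimated directly from the format: a 17-byte header plus 12 bytes of index per item. The sketch below computes that figure and, for comparison, the slack left when each item is stored as its own file on a filesystem that allocates whole blocks. The 4096-byte block size used in the example is an assumption based on the "4KB/file" figure in the table, and both function names are hypothetical.

```rust
const HEADER_BYTES: u64 = 4 + 1 + 4 + 8; // magic + version + card count + index offset
const INDEX_ENTRY_BYTES: u64 = 8 + 4; // offset (u64) + length (u32)

/// Bytes a spool adds on top of the raw item data.
fn spool_overhead_bytes(items: u64) -> u64 {
    HEADER_BYTES + INDEX_ENTRY_BYTES * items
}

/// Unused bytes when every item is a separate file rounded up to whole blocks.
/// `block` must be non-zero. Inode and directory-entry costs are not counted.
fn block_slack_bytes(item_sizes: &[u64], block: u64) -> u64 {
    item_sizes
        .iter()
        .map(|size| (block - size % block) % block)
        .sum()
}
```

For the 3,309-item benchmark set, `spool_overhead_bytes(3309)` comes to 39,725 bytes in total, regardless of how large the individual items are.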
Dependencies
[dependencies]
dataspool = "0.1.0"
bytepunch = "0.1.0" # For compressed item decompression
Dependency Graph
dataspool
├── bytepunch (compression)
├── rusqlite (SQLite database)
├── serde (serialization)
└── thiserror (error handling)
Feature Flags
Default
Basic spool read/write and persistent vector store.
Optional: async
Async APIs for non-blocking I/O:
[dependencies]
dataspool = { version = "0.1.0", features = ["async"] }
use dataspool::async_api::AsyncSpoolReader;
let reader = AsyncSpoolReader::open("data.spool").await?;
let data = reader.read_entry(0).await?;
Installation
Add to Cargo.toml:
[dependencies]
dataspool = "0.1.0"
Or with async support:
[dependencies]
dataspool = { version = "0.1.0", features = ["async"] }
Testing
# Run all tests
cargo test
# Run with logging
RUST_LOG=debug cargo test
# Test specific module
cargo test spool
cargo test persistent_store
Examples
See examples/ directory:
- build_spool.rs - Build a spool from files
- read_spool.rs - Read entries from a spool
- vector_search.rs - Semantic search with embeddings
Run with:
cargo run --example build_spool
cargo run --example read_spool
cargo run --example vector_search
Roadmap
- Image-based spools with EXIF metadata
- Audio/video spool variants
- Compression statistics per entry
- Incremental spool updates (append-only mode)
- Multi-threaded indexing
- Memory-mapped I/O for large spools
- Network streaming protocol
History
Extracted from the SAM (Societal Advisory Module) project, where it provides the spool bundling system for knowledge base archival.
License
MIT - See LICENSE for details.
Author
Magnus Trent
Links
- GitHub: https://github.com/Blackfall-Labs/dataspool-rs
- Docs: https://docs.rs/dataspool
- Crates.io: https://crates.io/crates/dataspool
- SAM Project: https://github.com/Blackfall-Labs/sam