# niblits
A powerful, token-aware text chunking library for processing multiple file formats with language-aware semantic splitting.
## Overview
This library provides streaming, async-first text chunking capabilities designed for ingestion pipelines and search systems. It handles diverse document types while maintaining semantic boundaries and offering configurable tokenization strategies.
## Features

### Multi-Format Support
- Plain Text: Basic text splitting with configurable overlap
- Markdown: Structure-aware chunking preserving headers and sections
- HTML: Tag-aware splitting that respects document structure
- PDF: Text extraction with intelligent chunking of document content
- DOCX: Word document parsing and content chunking
- Source Code: Semantic chunking for 50+ programming languages using tree-sitter grammars
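Format selection is presumably keyed off the file path handed to `chunk_stream` (the Quick Start below passes `"main.rs"` alongside Rust source). A purely illustrative sketch of that kind of extension dispatch; the names and routing here are not the crate's own:

```rust
use std::path::Path;

// Illustrative only: the kind of extension keying the multi-format support implies.
fn format_for(path: &str) -> &'static str {
    match Path::new(path).extension().and_then(|e| e.to_str()) {
        Some("md") => "markdown",
        Some("html") | Some("htm") => "html",
        Some("pdf") => "pdf",
        Some("docx") => "docx",
        // Source extensions fall through to tree-sitter based code chunking.
        Some("rs") | Some("py") | Some("js") | Some("ts") | Some("go") => "code",
        _ => "plain text",
    }
}
```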
### Language-Aware Code Chunking
- Grammar-aware parsing using tree-sitter
- Semantic boundary detection (functions, classes, etc.)
- Language-specific chunking strategies
- Support for Rust, Python, JavaScript, TypeScript, Go, and many more
### Flexible Tokenization
- Character-based: Simple character counting
- OpenAI tiktoken: cl100k_base, p50k_base, p50k_edit, r50k_base, o200k_base
- HuggingFace: Custom model tokenizers for specialized embeddings
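A chunk's size is measured in whatever units the configured tokenizer produces, so the same `max_chunk_size` means very different things under `Characters` and `Tiktoken`. A standalone comparison using the `tiktoken-rs` crate (listed below among the dependencies):

```rust
use tiktoken_rs::cl100k_base;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let text = "Streaming, token-aware chunking for ingestion pipelines.";

    // Character-based counting: one unit per character.
    let char_count = text.chars().count();

    // tiktoken counting: BPE tokens, typically ~4 English characters each.
    let bpe = cl100k_base()?;
    let token_count = bpe.encode_with_special_tokens(text).len();

    println!("{char_count} characters vs {token_count} cl100k_base tokens");
    Ok(())
}
```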
### Streaming Architecture
- Async-first design with Stream API
- Memory-efficient processing of large files
- Progress tracking with file size monitoring
- Graceful error handling and recovery
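The Quick Start below stops at the first error because it uses `?`; for long-running ingestion you may prefer to log and continue. A minimal sketch, assuming the stream items are `Result<ProjectChunk, ChunkError>` (the types listed under API Reference) and that `ChunkError` implements `Display`:

```rust
use futures::{Stream, StreamExt};
use niblits::{ChunkError, ProjectChunk};

// Drain a chunk stream, recovering from per-chunk errors instead of aborting.
async fn drain(mut stream: impl Stream<Item = Result<ProjectChunk, ChunkError>> + Unpin) {
    while let Some(item) = stream.next().await {
        match item {
            Ok(chunk) => println!("{}: got a chunk", chunk.file_path),
            Err(err) => eprintln!("recoverable chunking error: {err}"),
        }
    }
}
```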
## Quick Start

Add to your `Cargo.toml`:

```toml
[dependencies]
niblits = "0.3.0"
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
futures = "0.3"
```
```rust
use niblits::{chunk_stream, ChunkerConfig, Tokenizer};
use futures::StreamExt;
use std::io::Cursor;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Configure chunking
    let config = ChunkerConfig {
        max_chunk_size: 1000,
        overlap_percentage: 0.2,
        tokenizer: Tokenizer::Tiktoken("cl100k_base".to_string()),
    };

    // Process a file
    let content = r#"fn main() {
    println!("Hello, world!");
}

fn helper() {
    println!("This is a helper function");
}"#;

    let reader = Cursor::new(content.as_bytes());
    let mut stream = chunk_stream("main.rs", reader, config).await;

    while let Some(result) = stream.next().await {
        let project_chunk = result?;
        println!("File: {}", project_chunk.file_path);
        match project_chunk.chunk {
            niblits::Chunk::Semantic(chunk) => {
                println!("Semantic chunk: {} bytes", chunk.text.len());
            }
            niblits::Chunk::Text(chunk) => {
                println!("Text chunk: {} bytes", chunk.text.len());
            }
            niblits::Chunk::EndOfFile { expected_chunks, .. } => {
                println!("File complete. Expected {} chunks", expected_chunks);
            }
            _ => {}
        }
    }

    Ok(())
}
```
## Configuration

### ChunkerConfig

```rust
pub struct ChunkerConfig {
    /// Percentage of tokens to reserve for overlap (0.0 - 1.0)
    pub overlap_percentage: f32,
    /// Maximum size of each chunk (in tokens/characters)
    pub max_chunk_size: usize,
    /// Tokenizer strategy for size calculation
    pub tokenizer: Tokenizer,
}
```
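On the plain reading of these fields, the overlap carried between consecutive chunks is a fraction of `max_chunk_size`. The arithmetic below is illustrative, not lifted from the crate's internals:

```rust
let max_chunk_size: usize = 1000;
let overlap_percentage: f32 = 0.2;

// ~200 tokens shared across each chunk boundary, ~800 fresh tokens per chunk.
let overlap_tokens = (max_chunk_size as f32 * overlap_percentage) as usize;
let fresh_tokens = max_chunk_size - overlap_tokens;
assert_eq!((overlap_tokens, fresh_tokens), (200, 800));
```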
### Tokenizer Options

```rust
pub enum Tokenizer {
    /// Simple character-based tokenization
    Characters,
    /// OpenAI tiktoken with encoding name
    Tiktoken(String), // "cl100k_base", "p50k_base", etc.
    /// HuggingFace tokenizer with model ID
    HuggingFace(String), // "bert-base-uncased", etc.
    // Preloaded variants (internal use)
    PreloadedTiktoken(Arc<CoreBPE>),
    PreloadedHuggingFace(Arc<Tokenizer>),
}
```
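The Performance Considerations section recommends preloading tokenizers for batch work. Assuming the `PreloadedTiktoken` variant is constructible from user code (its comment says internal use, so treat this as a sketch only), that could look like:

```rust
use std::sync::Arc;
use niblits::{ChunkerConfig, Tokenizer};
use tiktoken_rs::cl100k_base;

// Build one config per file while decoding the BPE tables only once,
// instead of once per file as with Tokenizer::Tiktoken(String).
fn configs_for_batch(n: usize) -> Result<Vec<ChunkerConfig>, Box<dyn std::error::Error>> {
    let bpe = Arc::new(cl100k_base()?);
    Ok((0..n)
        .map(|_| ChunkerConfig {
            max_chunk_size: 1000,
            overlap_percentage: 0.2,
            tokenizer: Tokenizer::PreloadedTiktoken(Arc::clone(&bpe)),
        })
        .collect())
}
```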
## Supported Languages

Check supported programming languages:

```rust
use niblits::{supported_languages, is_language_supported};

// Get all supported languages
let languages = supported_languages();
println!("Supported languages: {:?}", languages);

// Check a specific language
assert!(is_language_supported("rust"));
assert!(is_language_supported("python"));
```
Commonly supported languages include: Rust, Python, JavaScript, TypeScript, Go, Java, C++, C#, Ruby, PHP, Swift, Kotlin, and many more.
## API Reference

### Core Functions

- `chunk_stream(path, reader, config)` - Process a file stream and yield chunks
- `walk_project(path, options)` - Recursively walk a directory and stream chunks
- `walk_files(files, project_root, options)` - Chunk a stream of file paths with ignore rules
- `walker_includes_path(project_root, path, max_file_size)` - Check if a path would be included
- `supported_languages()` - Get list of supported programming languages
- `is_language_supported(name)` - Check if a language is supported
### Types

- `Chunk` - Represents different chunk types (Semantic, Text, EndOfFile, Delete)
- `SemanticChunk` - Contains text, tokens, and byte offset information
- `ProjectChunk` - File path, chunk data, and file size
- `ChunkError` - Error types for parsing, IO, and unsupported formats
## Examples

### Processing Different File Types

```rust
// Markdown file
let config = ChunkerConfig::default();
let reader = Cursor::new("# Header\n\nSome content\n\n## Subheader".as_bytes());
let stream = chunk_stream("doc.md", reader, config).await;

// PDF file
let file = tokio::fs::File::open("document.pdf").await?;
let stream = chunk_stream("document.pdf", file, config).await;

// Code file
let code_stream = chunk_stream("script.py", python_file, config).await;
```
### Walking Projects

```rust
use niblits::{walk_project, WalkOptions};
use futures::StreamExt;

let mut stream = walk_project(
    "./my-project",
    WalkOptions {
        max_chunk_size: 1000,
        overlap_percentage: 0.2,
        ..Default::default()
    },
);

while let Some(result) = stream.next().await {
    let chunk = result?;
    println!("{} -> {:?}", chunk.file_path, chunk.chunk);
}
```
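`walker_includes_path` (listed under Core Functions) can pre-filter ad-hoc paths against the walker's ignore rules. A sketch that assumes the argument order shown in the API reference, a byte-denominated `max_file_size`, and a plain `bool` return; verify against the crate docs:

```rust
use niblits::walker_includes_path;

// Hypothetical usage: skip files the project walker would ignore anyway.
let max_file_size = 1_000_000;
if walker_includes_path("./my-project", "./my-project/src/lib.rs", max_file_size) {
    // Safe to hand the file to chunk_stream with the project's config.
}
```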
### Custom Tokenizer

```rust
// Using a HuggingFace tokenizer
let config = ChunkerConfig {
    tokenizer: Tokenizer::HuggingFace("bert-base-uncased".to_string()),
    ..Default::default()
};

// Using characters for simple cases
let config = ChunkerConfig {
    tokenizer: Tokenizer::Characters,
    max_chunk_size: 500,
    overlap_percentage: 0.1,
};
```
## Architecture

```text
src/
├── lib.rs               # Public API and main exports
├── types.rs             # Core data structures and error types
├── chunker/             # Format-specific chunkers
│   ├── code.rs          # Language-aware code chunking
│   ├── text.rs          # Plain text chunking
│   ├── markdown.rs      # Markdown-aware chunking
│   ├── html.rs          # HTML-aware chunking
│   ├── pdf.rs           # PDF processing
│   └── docx.rs          # Word document processing
├── languages.rs         # Language support utilities
├── grammars.rs          # Tree-sitter grammar management
└── grammar_loader.rs    # Dynamic grammar loading
```
## Performance Considerations
- Streaming: All processing is streaming-based to handle large files efficiently
- Memory: Minimal memory footprint with async I/O
- Tokenizers: Preload tokenizers for better performance in batch processing
- Grammars: Tree-sitter grammars are loaded on-demand and cached
## Development

### Building

```sh
mise build       # Build the workspace
mise build:rust  # Rust-only build
```

### Testing

```sh
mise test        # All tests
mise test:rust   # Crate tests only
```
## Dependencies

Key dependencies:

- `text-splitter`: Core splitting logic with tokenization support
- `tree-sitter`: Code parsing for semantic chunking
- `tiktoken-rs`: OpenAI tokenizer implementation
- `tokenizers`: HuggingFace tokenizer support
- `oxidize-pdf`: PDF text extraction
- `docx-parser`: Word document parsing
- `htmd`: HTML processing
- `palate`: Language detection