Current Version: 0.4.0 🎉
CLI tool to parse, chunk, and evaluate Markdown documents for Retrieval-Augmented Generation (RAG) pipelines with token-accurate chunking and semantic intelligence.
Available on PyPI: https://pypi.org/project/rag-chunk/
📚 Documentation | 🐙 GitHub
- 📄 Parse and clean Markdown files
- ✂️ 6 Chunking Strategies:
  - `fixed-size`: Split by fixed word/token count
  - `sliding-window`: Overlapping chunks for context preservation
  - `paragraph`: Natural paragraph boundaries
  - `recursive-character`: LangChain's semantic splitter
  - `header`: NEW - Markdown header-aware splitting
  - `semantic`: NEW - Embedding-based semantic boundaries
- 🧠 Semantic Intelligence: Embedding-based chunking and retrieval
- 🎯 Token-based chunking with tiktoken (OpenAI models: GPT-3.5, GPT-4, etc.)
- 🎨 Model selection via `--tiktoken-model` flag
- 📊 Advanced Metrics: Precision, Recall, F1-score evaluation
- 🌈 Beautiful CLI output with Rich tables
- 📈 Compare all strategies with `--strategy all`
- 💾 Export results as JSON or CSV
rag-chunk is actively developed! Here's the plan to move from a useful tool to a full-featured chunking workbench.
- Core CLI engine (`argparse`)
- Markdown (`.md`) file parsing
- Basic chunking strategies: `fixed-size`, `sliding-window`, and `paragraph` (word-based)
- Evaluation harness: calculate Recall score from a `test-file.json`
- Beautiful CLI output (`rich` tables)
- Published on PyPI: `pip install rag-chunk`
- Tiktoken Support: Added `--use-tiktoken` flag for precise token-based chunking
- Model Selection: Added `--tiktoken-model` to choose the tokenization model (default: `gpt-3.5-turbo`)
- Improved Documentation: Updated README with tiktoken usage examples and comparisons
- Enhanced Testing: Added comprehensive unit tests for token-based chunking
- Optional Dependencies: tiktoken available via `pip install rag-chunk[tiktoken]`
- Recursive Character Splitting: Add LangChain's `RecursiveCharacterTextSplitter` for semantic chunking
  - Install with: `pip install rag-chunk[langchain]`
  - Strategy: `--strategy recursive-character`
  - Works with both word-based and tiktoken modes
- More File Formats: Support `.txt` files
- Additional Metrics: Add precision, F1-score, and chunk quality metrics
- Header-Aware Chunking: Split by markdown headers while respecting size limits
  - Strategy: `--strategy header`
  - Preserves document structure with metadata
- Semantic Chunking: Use sentence embeddings to split at semantic boundaries
  - Strategy: `--strategy semantic`
  - Install with: `pip install rag-chunk[semantic]`
  - Powered by sentence-transformers
- Embedding-Based Retrieval: Semantic similarity matching with `--use-embeddings`
  - Superior to lexical matching for semantic queries
  - Uses cosine similarity on sentence embeddings
- Documentation Site: Complete GitHub Pages documentation
- Chunk Size Optimizer: Automated sweep to find optimal chunk size
- Visualization Dashboard: HTML report with interactive charts
- Context Augmentation: Add metadata (position, section, summaries) to chunks
- Export Connectors: Direct integration with vector stores (Pinecone, Weaviate, Chroma)
- Benchmarking Mode: Statistical comparison with significance testing
- MLFlow Integration: Track experiments and chunking configurations
- Performance Optimization: Parallel processing for large document sets
```bash
pip install rag-chunk
```
## Features
- Parse and clean Markdown files in a folder
- Chunk text using fixed-size, sliding-window, paragraph, recursive-character, header, or semantic strategies
- Evaluate chunk recall, precision, and F1 based on a provided test JSON file
- Output results as table, JSON, or CSV
- Store generated chunks temporarily in `.chunks`
## Installation
```bash
# Base installation
pip install rag-chunk

# With all features (recommended)
pip install rag-chunk[all]

# Or install specific features:
pip install rag-chunk[tiktoken]   # Token-based chunking
pip install rag-chunk[semantic]   # Semantic chunking & retrieval
pip install rag-chunk[langchain]  # Recursive character splitting
```

Development mode:

```bash
pip install -e .[all]
```

```bash
# Compare all strategies with semantic retrieval
rag-chunk analyze examples/ \
  --strategy all \
  --use-embeddings \
  --test-file examples/questions.json \
  --top-k 3

# Header-aware chunking for technical docs
rag-chunk analyze docs/ --strategy header --chunk-size 300

# Semantic chunking with embeddings
rag-chunk analyze examples/ \
  --strategy semantic \
  --chunk-size 200 \
  --use-embeddings \
  --test-file questions.json
```

```bash
rag-chunk analyze <folder> [options]
```

| Option | Description | Default |
|---|---|---|
| `--strategy` | Chunking strategy: `fixed-size`, `sliding-window`, `paragraph`, `recursive-character`, `header`, `semantic`, or `all` | `fixed-size` |
| `--chunk-size` | Number of words or tokens per chunk | 200 |
| `--overlap` | Number of overlapping words or tokens | 50 |
| `--use-tiktoken` | Use tiktoken for precise token-based chunking | False |
| `--tiktoken-model` | Model for tiktoken encoding | `gpt-3.5-turbo` |
| `--use-embeddings` | Use semantic embeddings for retrieval (requires sentence-transformers) | False |
| `--test-file` | Path to JSON test file with questions | None |
| `--top-k` | Number of chunks to retrieve per question | 3 |
| `--output` | Output format: `table`, `json`, or `csv` | `table` |
If --strategy all is chosen, every strategy is run with the supplied chunk-size and overlap where applicable.
Split by markdown headers while respecting chunk size limits:
```bash
rag-chunk analyze docs/ --strategy header --chunk-size 300
```

Preserves document structure with metadata about headers and hierarchy levels.
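The actual splitter lives in `src/chunker.py`; the sketch below is only an illustration of the idea, with a hypothetical `header_chunks` helper: group lines under each Markdown header, then re-split any section that exceeds the word limit.

```python
import re
from typing import Dict, List

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)")  # matches "#", "##", ... header lines

def header_chunks(text: str, chunk_size: int = 300) -> List[Dict]:
    """Illustrative only: split on Markdown headers, then enforce a word limit."""
    sections, current = [], {"header": None, "level": None, "lines": []}
    for line in text.splitlines():
        m = HEADER_RE.match(line)
        if m:
            if current["lines"]:
                sections.append(current)
            current = {"header": m.group(2), "level": len(m.group(1)), "lines": []}
        current["lines"].append(line)
    if current["lines"]:
        sections.append(current)

    chunks: List[Dict] = []
    for sec in sections:
        words = " ".join(sec["lines"]).split()
        # Oversized sections are re-split into chunk_size-word pieces,
        # carrying the header metadata along with each piece.
        for start in range(0, len(words), chunk_size):
            chunks.append({
                "id": len(chunks),
                "text": " ".join(words[start:start + chunk_size]),
                "header": sec["header"],
                "level": sec["level"],
            })
    return chunks
```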
Use sentence embeddings to split at semantic boundaries:
```bash
rag-chunk analyze examples/ --strategy semantic --chunk-size 200
```

Groups semantically similar sentences together, splitting when topic shifts occur.
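The real implementation ships with the `semantic` extra; as a minimal sketch of the general technique (assuming `sentence-transformers` is installed, with a hypothetical similarity threshold and model name), a new chunk starts whenever adjacent sentences fall below a cosine-similarity threshold:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences: list[str], threshold: float = 0.6,
                    model_name: str = "all-MiniLM-L6-v2") -> list[str]:
    """Sketch: break between sentences whose embeddings are dissimilar."""
    if not sentences:
        return []
    model = SentenceTransformer(model_name)
    emb = model.encode(sentences, normalize_embeddings=True)  # unit-length vectors
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # With normalized vectors, cosine similarity is a plain dot product.
        if float(np.dot(emb[i - 1], emb[i])) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```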
Compare lexical vs semantic retrieval:
```bash
# Lexical (keyword-based)
rag-chunk analyze examples/ --strategy all --test-file questions.json

# Semantic (embedding-based)
rag-chunk analyze examples/ --strategy all --test-file questions.json --use-embeddings
```

Semantic retrieval typically achieves higher recall for conceptual queries.
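Conceptually, `--use-embeddings` swaps keyword matching for cosine similarity over sentence embeddings. A minimal sketch of that retrieval step (not rag-chunk's exact code; the model name is an assumption):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def retrieve_top_k(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by cosine similarity to the question and return the top k."""
    q = model.encode([question], normalize_embeddings=True)[0]
    c = model.encode(chunks, normalize_embeddings=True)
    scores = c @ q  # cosine similarity (vectors are already normalized)
    best = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in best]
```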
Analyze markdown files and generate chunks without evaluation:
```bash
rag-chunk analyze examples/ --strategy paragraph
```

Output:

```
strategy  | chunks | avg_recall | saved
----------+--------+------------+----------------------------------
paragraph | 12     | 0.0        | .chunks/paragraph-20251115-020145

Total text length (chars): 3542
```
Run all chunking strategies with custom parameters:
```bash
rag-chunk analyze examples/ --strategy all --chunk-size 100 --overlap 20 --output table
```

Output:

```
strategy            | chunks | avg_recall | avg_precision | avg_f1 | saved
--------------------+--------+------------+---------------+--------+---------------------------------------
fixed-size          | 36     | 0.0        | 0.0           | 0.0    | .chunks/fixed-size-20251115-020156
sliding-window      | 45     | 0.0        | 0.0           | 0.0    | .chunks/sliding-window-20251115-020156
paragraph           | 12     | 0.0        | 0.0           | 0.0    | .chunks/paragraph-20251115-020156
recursive-character | 28     | 0.0        | 0.0           | 0.0    | .chunks/recursive-character-20251115-020156
header              | 15     | 0.0        | 0.0           | 0.0    | .chunks/header-20251115-020156
semantic            | 22     | 0.0        | 0.0           | 0.0    | .chunks/semantic-20251115-020156

Total text length (chars): 3542
```
Measure recall using a test file with questions and relevant phrases:
```bash
rag-chunk analyze examples/ --strategy all --chunk-size 150 --overlap 30 --test-file examples/questions.json --top-k 3 --use-embeddings --output table
```

Output:

```
strategy       | chunks | avg_recall | avg_precision | avg_f1 | saved
---------------+--------+------------+---------------+--------+---------------------------------------
fixed-size     | 24     | 0.7812     | 0.7812        | 0.7812 | .chunks/fixed-size-20251115-020203
sliding-window | 32     | 0.8542     | 0.8542        | 0.8542 | .chunks/sliding-window-20251115-020203
paragraph      | 12     | 0.9167     | 0.9167        | 0.9167 | .chunks/paragraph-20251115-020203
semantic       | 19     | 0.9583     | 0.9583        | 0.9583 | .chunks/semantic-20251115-020203
```
Semantic chunking with embedding-based retrieval achieves highest recall (95.83%) by preserving semantic coherence.
```bash
rag-chunk analyze examples/ --strategy sliding-window --chunk-size 120 --overlap 40 --test-file examples/questions.json --top-k 5 --output json > results.json
```

Output structure:

```json
{
"results": [
{
"strategy": "sliding-window",
"chunks": 38,
"avg_recall": 0.8958,
"avg_precision": 0.8958,
"avg_f1": 0.8958,
"saved": ".chunks/sliding-window-20251115-020210"
}
],
"detail": {
"per_questions": [
{
"question": "What are the three main stages of a RAG pipeline?",
"recall": 1.0,
"precision": 1.0,
"f1": 1.0
},
{
"question": "What is the main advantage of RAG over pure generative models?",
"recall": 0.6667,
"precision": 0.6667,
"f1": 0.6667
}
]
}
}
```

```bash
rag-chunk analyze examples/ --strategy all --test-file examples/questions.json --output csv
```

Creates `analysis_results.csv` with columns: strategy, chunks, avg_recall, avg_precision, avg_f1, saved.
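Either export is easy to post-process; for example, a small script (hypothetical, but matching the JSON structure shown above) that picks the best-scoring strategy from `results.json`:

```python
import json

with open("results.json") as f:
    data = json.load(f)

# Select the strategy with the highest average F1 from the exported results.
best = max(data["results"], key=lambda r: r["avg_f1"])
print(f"Best strategy: {best['strategy']} (F1={best['avg_f1']:.4f}, chunks={best['chunks']})")
```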
By default, rag-chunk uses word-based tokenization (whitespace splitting). For precise token-level chunking that matches LLM context limits (e.g., GPT-3.5/GPT-4), use the --use-tiktoken flag.
```bash
pip install rag-chunk[tiktoken]
```

Token-based fixed-size chunking:

```bash
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 512 --use-tiktoken --output table
```

This creates chunks of exactly 512 tokens (as counted by tiktoken for GPT models), not 512 words.
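You can reproduce those token counts directly with the `tiktoken` library to sanity-check chunk sizes against a model's context window:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = "Retrieval-Augmented Generation grounds answers in retrieved documents."
print(len(enc.encode(text)))  # tokens, as GPT-3.5/GPT-4 would count them
print(len(text.split()))      # words, for comparison (usually fewer than tokens)
```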
Compare word-based vs token-based chunking:
```bash
# Word-based (default)
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 200 --output json

# Token-based
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 200 --use-tiktoken --output json
```

Token-based with sliding window:

```bash
rag-chunk analyze examples/ --strategy sliding-window --chunk-size 1024 --overlap 128 --use-tiktoken --test-file examples/questions.json --top-k 3
```
- ✅ Use tiktoken when:
  - Preparing chunks for OpenAI models (GPT-3.5, GPT-4)
  - You need to respect strict token limits (e.g., 8k, 16k context windows)
  - Comparing chunking strategies with token-accurate measurements
  - Your documents contain special characters, emojis, or non-ASCII text
- ⚠️ Use word-based (default) when:
  - Quick prototyping and testing
  - Working with well-formatted English text
  - You don't need exact token counts
  - You want to avoid the tiktoken dependency
You can also use tiktoken in your own scripts:
```python
from src.chunker import count_tokens

text = "Your document text here..."

# Word-based count
word_count = count_tokens(text, use_tiktoken=False)
print(f"Words: {word_count}")

# Token-based count (requires tiktoken installed)
token_count = count_tokens(text, use_tiktoken=True)
print(f"Tokens: {token_count}")
```

JSON file with a `questions` array (or a direct array at the top level):

```json
{
"questions": [
{
"question": "What are the three main stages of a RAG pipeline?",
"relevant": ["indexing", "retrieval", "generation"]
},
{
"question": "What is the main advantage of RAG over pure generative models?",
"relevant": ["grounding", "retrieved documents", "hallucinate"]
}
]
}
```

- `question`: The query text used for chunk retrieval
- `relevant`: List of phrases/terms that should appear in relevant chunks
Recall calculation: For each question, the tool retrieves top-k chunks using lexical similarity and checks how many relevant phrases appear in those chunks. Recall = (found phrases) / (total relevant phrases). Average recall is computed across all questions.
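As a concrete illustration of that formula (a simplified sketch, not the exact logic in `src/scorer.py`), recall for a single question can be computed like this:

```python
def question_recall(retrieved_chunks: list[str], relevant_phrases: list[str]) -> float:
    """Recall = (relevant phrases found in the retrieved chunks) / (total relevant phrases)."""
    haystack = " ".join(retrieved_chunks).lower()
    found = sum(1 for phrase in relevant_phrases if phrase.lower() in haystack)
    return found / len(relevant_phrases) if relevant_phrases else 0.0

# Using the example test-file entry shown above:
chunks = ["A RAG pipeline has three stages: indexing, retrieval, and generation."]
print(question_recall(chunks, ["indexing", "retrieval", "generation"]))  # 1.0
```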
Number of chunks created by the strategy. More chunks = finer granularity but higher indexing cost.
- Average Recall: Percentage of relevant phrases successfully retrieved (0.0 to 1.0). Higher is better.
- Average Precision: Ratio of relevant content in retrieved chunks (0.0 to 1.0). Higher is better.
- Average F1-Score: Harmonic mean of precision and recall (0.0 to 1.0). Balanced measure of quality.
Interpreting scores:
- > 0.85: Excellent - strategy works very well for your content
- 0.70 - 0.85: Good - acceptable for most use cases
- 0.50 - 0.70: Fair - consider adjusting chunk size or strategy
- < 0.50: Poor - important information being lost or fragmented
Directory where chunks are written as individual .txt files for inspection.
| Strategy | Best For | Pros | Cons |
|---|---|---|---|
| Fixed-Size | Consistent chunk sizes, simple docs | Fast, deterministic | May break semantic boundaries |
| Sliding-Window | Preventing boundary loss | Preserves context at edges | Redundancy, more chunks |
| Paragraph | Well-structured docs | Preserves semantic coherence | Variable chunk sizes |
| Recursive-Character | General purpose | Good balance, semantic-aware | Requires LangChain |
| Header ⭐ | Technical docs, markdown | Preserves document structure | Requires header markup |
| Semantic ⭐ | Maximum retrieval quality | Best semantic coherence | Requires embeddings, slower |
- Technical documentation: Use the `header` strategy to preserve structure
- Knowledge bases: Use `semantic` for best retrieval quality
- General content: Start with `recursive-character` or `paragraph`
- Token-limited models: Enable `--use-tiktoken` for accurate counting
- Evaluation: Always use `--use-embeddings` with test files for better semantic matching
| Strategy | Best For | Chunk Size Recommendation |
|---|---|---|
| fixed-size | Uniform processing, consistent latency | 150-250 words |
| sliding-window | Preserving context at boundaries, dense text | 120-200 words, 20-30% overlap |
| paragraph | Well-structured docs with clear sections | N/A (variable) |
General guidelines:
- Start with paragraph for markdown with clear structure
- Use sliding-window if paragraphs are too long (>300 words)
- Use fixed-size as baseline for comparison
- Always test with representative questions from your domain
Add a new chunking strategy:
- Implement a function in `src/chunker.py`:

```python
from typing import Dict, List

def my_custom_chunks(text: str, chunk_size: int, overlap: int) -> List[Dict]:
    chunks = []
    # Your logic here
    chunks.append({"id": 0, "text": "chunk text"})
    return chunks
```

- Register it in `STRATEGIES`:

```python
STRATEGIES = {
    "custom": lambda text, chunk_size=200, overlap=0: my_custom_chunks(text, chunk_size, overlap),
    # ... existing strategies ...
}
```

- Use via CLI:

```bash
rag-chunk analyze docs/ --strategy custom --chunk-size 180
```

```
rag-chunk/
├── src/
│ ├── __init__.py
│ ├── parser.py # Markdown parsing and cleaning
│ ├── chunker.py # Chunking strategies
│ ├── scorer.py # Retrieval and recall evaluation
│ └── cli.py # Command-line interface
├── tests/
│ └── test_basic.py # Unit tests
├── examples/
│ ├── rag_introduction.md
│ ├── chunking_strategies.md
│ ├── evaluation_metrics.md
│ └── questions.json
├── .chunks/ # Generated chunks (gitignored)
├── pyproject.toml
├── README.md
└── .gitignore
```
MIT
By default, --chunk-size and --overlap count words (whitespace-based tokenization). This keeps the tool simple and dependency-free.
For precise token-level chunking that matches LLM token counts (e.g., OpenAI GPT models using subword tokenization), use the --use-tiktoken flag after installing the optional dependency:
```bash
pip install rag-chunk[tiktoken]
rag-chunk analyze docs/ --strategy fixed-size --chunk-size 512 --use-tiktoken
```

See the Using Tiktoken section for more details.