
A Python CLI to test, benchmark, and find the best RAG chunking strategy for your Markdown documents.


rag-chunk

Current Version: 0.4.0 🎉

CLI tool to parse, chunk, and evaluate Markdown documents for Retrieval-Augmented Generation (RAG) pipelines with token-accurate chunking and semantic intelligence.

Available on PyPI: https://pypi.org/project/rag-chunk/

📚 Documentation | 🐙 GitHub

✨ Features

  • 📄 Parse and clean Markdown files
  • ✂️ 6 Chunking Strategies:
    • fixed-size: Split by fixed word/token count
    • sliding-window: Overlapping chunks for context preservation
    • paragraph: Natural paragraph boundaries
    • recursive-character: LangChain's semantic splitter
    • header: NEW - Markdown header-aware splitting
    • semantic: NEW - Embedding-based semantic boundaries
  • 🧠 Semantic Intelligence: Embedding-based chunking and retrieval
  • 🎯 Token-based chunking with tiktoken (OpenAI models: GPT-3.5, GPT-4, etc.)
  • 🎨 Model selection via --tiktoken-model flag
  • 📊 Advanced Metrics: Precision, Recall, F1-score evaluation
  • 🌈 Beautiful CLI output with Rich tables
  • 📈 Compare all strategies with --strategy all
  • 💾 Export results as JSON or CSV

Demo

rag-chunk demo

🚀 Roadmap

rag-chunk is actively developed! Here's the plan to move from a useful tool to a full-featured chunking workbench.

✅ Version 0.1.0 – Launched

  • Core CLI engine (argparse)
  • Markdown (.md) file parsing
  • Basic chunking strategies: fixed-size, sliding-window, and paragraph (word-based)
  • Evaluation harness: calculate Recall score from a test-file.json
  • Beautiful CLI output (rich tables)
  • Published on PyPI: pip install rag-chunk

✅ Version 0.2.0 – Completed

  • Tiktoken Support: Added --use-tiktoken flag for precise token-based chunking
  • Model Selection: Added --tiktoken-model to choose tokenization model (default: gpt-3.5-turbo)
  • Improved Documentation: Updated README with tiktoken usage examples and comparisons
  • Enhanced Testing: Added comprehensive unit tests for token-based chunking
  • Optional Dependencies: tiktoken available via pip install rag-chunk[tiktoken]

✅ Version 0.3.0 – Released

  • Recursive Character Splitting: Add LangChain's RecursiveCharacterTextSplitter for semantic chunking
    • Install with: pip install rag-chunk[langchain]
    • Strategy: --strategy recursive-character
    • Works with both word-based and tiktoken modes
  • More File Formats: Support .txt files
  • Additional Metrics: Add precision, F1-score, and chunk quality metrics

✅ Version 0.4.0 – Released

  • Header-Aware Chunking: Split by markdown headers while respecting size limits
    • Strategy: --strategy header
    • Preserves document structure with metadata
  • Semantic Chunking: Use sentence embeddings to split at semantic boundaries
    • Strategy: --strategy semantic
    • Install with: pip install rag-chunk[semantic]
    • Powered by sentence-transformers
  • Embedding-Based Retrieval: Semantic similarity matching with --use-embeddings
    • Superior to lexical matching for semantic queries
    • Uses cosine similarity on sentence embeddings
  • Documentation Site: Complete GitHub Pages documentation

📈 Version 1.0.0 – Planned

  • Chunk Size Optimizer: Automated sweep to find optimal chunk size
  • Visualization Dashboard: HTML report with interactive charts
  • Context Augmentation: Add metadata (position, section, summaries) to chunks
  • Export Connectors: Direct integration with vector stores (Pinecone, Weaviate, Chroma)
  • Benchmarking Mode: Statistical comparison with significance testing
  • MLFlow Integration: Track experiments and chunking configurations
  • Performance Optimization: Parallel processing for large document sets

Installation


```bash
# Base installation
pip install rag-chunk

# With all features (recommended)
pip install rag-chunk[all]

# Or install specific features:
pip install rag-chunk[tiktoken]    # Token-based chunking
pip install rag-chunk[semantic]    # Semantic chunking & retrieval
pip install rag-chunk[langchain]   # Recursive character splitting
```

Development mode:

pip install -e .[all]

Quick Start

# Compare all strategies with semantic retrieval
rag-chunk analyze examples/ \
  --strategy all \
  --use-embeddings \
  --test-file examples/questions.json \
  --top-k 3

# Header-aware chunking for technical docs
rag-chunk analyze docs/ --strategy header --chunk-size 300

# Semantic chunking with embeddings
rag-chunk analyze examples/ \
  --strategy semantic \
  --chunk-size 200 \
  --use-embeddings \
  --test-file questions.json

CLI Usage

rag-chunk analyze <folder> [options]

Options

| Option | Description | Default |
|--------|-------------|---------|
| --strategy | Chunking strategy: fixed-size, sliding-window, paragraph, recursive-character, header, semantic, or all | fixed-size |
| --chunk-size | Number of words or tokens per chunk | 200 |
| --overlap | Number of overlapping words or tokens | 50 |
| --use-tiktoken | Use tiktoken for precise token-based chunking | False |
| --tiktoken-model | Model for tiktoken encoding | gpt-3.5-turbo |
| --use-embeddings | Use semantic embeddings for retrieval (requires sentence-transformers) | False |
| --test-file | Path to JSON test file with questions | None |
| --top-k | Number of chunks to retrieve per question | 3 |
| --output | Output format: table, json, or csv | table |

If --strategy all is chosen, every strategy is run with the supplied chunk-size and overlap where applicable.
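
To see how --chunk-size and --overlap interact in the default word-based mode, here is a minimal sliding-window sketch. It is illustrative only, not rag-chunk's internal code: a window of chunk-size words advances by chunk-size minus overlap words each step.

```python
# Minimal sketch of word-based sliding-window chunking (illustration only,
# not rag-chunk's internal implementation).
def sliding_window_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    words = text.split()                    # default mode: whitespace "tokens"
    step = max(chunk_size - overlap, 1)     # each window advances by chunk_size - overlap words
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):   # the last window reached the end of the text
            break
    return chunks

# With --chunk-size 100 --overlap 20, consecutive chunks share 20 words.
```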

Examples

Header-Aware Chunking (NEW)

Split by markdown headers while respecting chunk size limits:

rag-chunk analyze docs/ --strategy header --chunk-size 300

Preserves document structure with metadata about headers and hierarchy levels.
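
Conceptually, header-aware splitting starts a new chunk at each Markdown header and carries the header along as metadata. The sketch below only illustrates that idea; it ignores the chunk-size limit that the real strategy also enforces.

```python
import re

# Illustrative sketch of header-aware splitting (not rag-chunk's implementation):
# start a new chunk at every Markdown header and keep the header as metadata.
def split_by_headers(markdown: str) -> list[dict]:
    chunks, current = [], {"header": None, "lines": []}
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            if current["lines"]:
                chunks.append({"header": current["header"], "text": "\n".join(current["lines"])})
            current = {"header": match.group(2), "lines": [line]}
        else:
            current["lines"].append(line)
    if current["lines"]:
        chunks.append({"header": current["header"], "text": "\n".join(current["lines"])})
    return chunks
```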

Semantic Chunking (NEW)

Use sentence embeddings to split at semantic boundaries:

rag-chunk analyze examples/ --strategy semantic --chunk-size 200

Groups semantically similar sentences together, splitting when topic shifts occur.
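
The underlying idea is to embed each sentence and start a new chunk wherever the similarity between neighbouring sentences drops. The sketch below illustrates that idea under assumed choices (sentence-transformers with the all-MiniLM-L6-v2 model and a 0.5 threshold); it is not rag-chunk's exact implementation.

```python
# Conceptual sketch of embedding-based boundary detection; assumes the optional
# dependency from `pip install rag-chunk[semantic]` (sentence-transformers).
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

def semantic_groups(sentences: list[str], threshold: float = 0.5) -> list[list[str]]:
    if not sentences:
        return []
    model = SentenceTransformer("all-MiniLM-L6-v2")   # model choice is an assumption
    embeddings = model.encode(sentences)
    groups, current = [], [sentences[0]]
    for prev_emb, curr_emb, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        if cos_sim(prev_emb, curr_emb).item() < threshold:   # similarity drop = topic shift
            groups.append(current)
            current = []
        current.append(sentence)
    groups.append(current)
    return groups
```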

Embedding-Based Retrieval (NEW)

Compare lexical vs semantic retrieval:

# Lexical (keyword-based)
rag-chunk analyze examples/ --strategy all --test-file questions.json

# Semantic (embedding-based)
rag-chunk analyze examples/ --strategy all --test-file questions.json --use-embeddings

Semantic retrieval typically achieves higher recall for conceptual queries.
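
In spirit, --use-embeddings replaces keyword overlap with cosine similarity between the question embedding and each chunk embedding, keeping the top-k matches. A hedged sketch of that idea, with an assumed model:

```python
# Conceptual sketch of embedding-based retrieval (what --use-embeddings does in
# spirit; the exact model and scoring inside rag-chunk may differ).
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

def retrieve_top_k(question: str, chunks: list[str], k: int = 3) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed model
    scores = cos_sim(model.encode([question]), model.encode(chunks))[0].tolist()
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:k]]
```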

Basic Usage: Generate Chunks Only

Analyze markdown files and generate chunks without evaluation:

rag-chunk analyze examples/ --strategy paragraph

Output:

strategy  | chunks | avg_recall | saved                            
----------+--------+------------+----------------------------------
paragraph | 12     | 0.0        | .chunks/paragraph-20251115-020145
Total text length (chars): 3542

Compare All Strategies

Run all chunking strategies with custom parameters:

rag-chunk analyze examples/ --strategy all --chunk-size 100 --overlap 20 --output table

Output:

strategy           | chunks | avg_recall | avg_precision | avg_f1 | saved                                 
-------------------+--------+------------+---------------+--------+---------------------------------------
fixed-size         | 36     | 0.0        | 0.0           | 0.0    | .chunks/fixed-size-20251115-020156    
sliding-window     | 45     | 0.0        | 0.0           | 0.0    | .chunks/sliding-window-20251115-020156
paragraph          | 12     | 0.0        | 0.0           | 0.0    | .chunks/paragraph-20251115-020156
recursive-character| 28     | 0.0        | 0.0           | 0.0    | .chunks/recursive-character-20251115-020156
header             | 15     | 0.0        | 0.0           | 0.0    | .chunks/header-20251115-020156
semantic           | 22     | 0.0        | 0.0           | 0.0    | .chunks/semantic-20251115-020156
Total text length (chars): 3542

Evaluate with Test File

Measure recall using a test file with questions and relevant phrases:

rag-chunk analyze examples/ --strategy all --chunk-size 150 --overlap 30 --test-file examples/questions.json --top-k 3 --use-embeddings --output table

Output:

strategy           | chunks | avg_recall | avg_precision | avg_f1 | saved                                 
-------------------+--------+------------+---------------+--------+---------------------------------------
fixed-size         | 24     | 0.7812     | 0.7812        | 0.7812 | .chunks/fixed-size-20251115-020203    
sliding-window     | 32     | 0.8542     | 0.8542        | 0.8542 | .chunks/sliding-window-20251115-020203
paragraph          | 12     | 0.9167     | 0.9167        | 0.9167 | .chunks/paragraph-20251115-020203
semantic           | 19     | 0.9583     | 0.9583        | 0.9583 | .chunks/semantic-20251115-020203

Semantic chunking with embedding-based retrieval achieves highest recall (95.83%) by preserving semantic coherence.

Export Results as JSON

rag-chunk analyze examples/ --strategy sliding-window --chunk-size 120 --overlap 40 --test-file examples/questions.json --top-k 5 --output json > results.json

Output structure:

{
  "results": [
    {
      "strategy": "sliding-window",
      "chunks": 38,
      "avg_recall": 0.8958,
      "avg_precision": 0.8958,
      "avg_f1": 0.8958,
      "saved": ".chunks/sliding-window-20251115-020210"
    }
  ],
  "detail": {
    "per_questions": [
      {
        "question": "What are the three main stages of a RAG pipeline?",
        "recall": 1.0,
        "precision": 1.0,
        "f1": 1.0
      },
      {
        "question": "What is the main advantage of RAG over pure generative models?",
        "recall": 0.6667,
        "precision": 0.6667,
        "f1": 0.6667
      }
    ]
  }
}

Export as CSV

rag-chunk analyze examples/ --strategy all --test-file examples/questions.json --output csv

Creates analysis_results.csv with columns: strategy, chunks, avg_recall, avg_precision, avg_f1, saved.

Using Tiktoken for Precise Token-Based Chunking

By default, rag-chunk uses word-based tokenization (whitespace splitting). For precise token-level chunking that matches LLM context limits (e.g., GPT-3.5/GPT-4), use the --use-tiktoken flag.

Installation

pip install rag-chunk[tiktoken]

Usage Examples

Token-based fixed-size chunking:

rag-chunk analyze examples/ --strategy fixed-size --chunk-size 512 --use-tiktoken --output table

This creates chunks of exactly 512 tokens (as counted by tiktoken for GPT models), not 512 words.
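
For reference, tiktoken itself can be used to check how many tokens a piece of text occupies for a GPT-style model. This standalone snippet is independent of rag-chunk:

```python
import tiktoken

# Count tokens the way OpenAI models do (requires: pip install rag-chunk[tiktoken]).
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = "Chunking long Markdown documents for RAG pipelines."
print(len(encoding.encode(text)))   # number of tokens
print(len(text.split()))            # number of whitespace words, for comparison
```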

Compare word-based vs token-based chunking:

# Word-based (default)
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 200 --output json

# Token-based
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 200 --use-tiktoken --output json

Token-based with sliding window:

rag-chunk analyze examples/ --strategy sliding-window --chunk-size 1024 --overlap 128 --use-tiktoken --test-file examples/questions.json --top-k 3

When to Use Tiktoken

  • Use tiktoken when:

    • Preparing chunks for OpenAI models (GPT-3.5, GPT-4)
    • You need to respect strict token limits (e.g., 8k, 16k context windows)
    • Comparing chunking strategies with token-accurate measurements
    • Your documents contain special characters, emojis, or non-ASCII text
  • ⚠️ Use word-based (default) when:

    • Quick prototyping and testing
    • Working with well-formatted English text
    • You don't need exact token counts
    • You want to avoid the tiktoken dependency

Token Counting

You can also use tiktoken in your own scripts:

from src.chunker import count_tokens

text = "Your document text here..."

# Word-based count
word_count = count_tokens(text, use_tiktoken=False)
print(f"Words: {word_count}")

# Token-based count (requires tiktoken installed)
token_count = count_tokens(text, use_tiktoken=True)
print(f"Tokens: {token_count}")

Test File Format

JSON file with a questions array (or direct array at top level):

{
  "questions": [
    {
      "question": "What are the three main stages of a RAG pipeline?",
      "relevant": ["indexing", "retrieval", "generation"]
    },
    {
      "question": "What is the main advantage of RAG over pure generative models?",
      "relevant": ["grounding", "retrieved documents", "hallucinate"]
    }
  ]
}
  • question: The query text used for chunk retrieval
  • relevant: List of phrases/terms that should appear in relevant chunks

Recall calculation: For each question, the tool retrieves the top-k chunks using lexical similarity (or embedding similarity when --use-embeddings is set) and checks how many relevant phrases appear in those chunks. Recall = (found phrases) / (total relevant phrases). Average recall is the mean across all questions.
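
As a sanity check, the recall formula translates directly into a few lines of Python. This is an illustrative restatement; rag-chunk's scorer may normalize text differently:

```python
# Recall for one question, as described above: found phrases / total relevant phrases.
def question_recall(retrieved_chunks: list[str], relevant_phrases: list[str]) -> float:
    retrieved_text = " ".join(retrieved_chunks).lower()
    found = sum(1 for phrase in relevant_phrases if phrase.lower() in retrieved_text)
    return found / len(relevant_phrases) if relevant_phrases else 0.0

# Average recall is the mean of question_recall(...) across all questions.
```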

Understanding the Output

Chunks

Number of chunks created by the strategy. More chunks = finer granularity but higher indexing cost.

Metrics

  • Average Recall: Percentage of relevant phrases successfully retrieved (0.0 to 1.0). Higher is better.
  • Average Precision: Ratio of relevant content in retrieved chunks (0.0 to 1.0). Higher is better.
  • Average F1-Score: Harmonic mean of precision and recall (0.0 to 1.0). Balanced measure of quality.
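
The F1 column is simply the harmonic mean of the precision and recall columns:

```python
# F1 as the harmonic mean of precision and recall; 0.0 when both are 0.
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```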

Interpreting scores:

  • > 0.85: Excellent - strategy works very well for your content
  • 0.70 - 0.85: Good - acceptable for most use cases
  • 0.50 - 0.70: Fair - consider adjusting chunk size or strategy
  • < 0.50: Poor - important information being lost or fragmented

Saved Location

Directory where chunks are written as individual .txt files for inspection.

Choosing the Right Strategy

| Strategy | Best For | Pros | Cons |
|----------|----------|------|------|
| Fixed-Size | Consistent chunk sizes, simple docs | Fast, deterministic | May break semantic boundaries |
| Sliding-Window | Preventing boundary loss | Preserves context at edges | Redundancy, more chunks |
| Paragraph | Well-structured docs | Preserves semantic coherence | Variable chunk sizes |
| Recursive-Character | General purpose | Good balance, semantic-aware | Requires LangChain |
| Header | Technical docs, markdown | Preserves document structure | Requires header markup |
| Semantic | Maximum retrieval quality | Best semantic coherence | Requires embeddings, slower |

Recommendations

  • Technical documentation: Use header strategy to preserve structure
  • Knowledge bases: Use semantic for best retrieval quality
  • General content: Start with recursive-character or paragraph
  • Token-limited models: Enable --use-tiktoken for accurate counting
  • Evaluation: Always use --use-embeddings with test files for better semantic matching

| Strategy | Best For | Chunk Size Recommendation |
|----------|----------|---------------------------|
| fixed-size | Uniform processing, consistent latency | 150-250 words |
| sliding-window | Preserving context at boundaries, dense text | 120-200 words, 20-30% overlap |
| paragraph | Well-structured docs with clear sections | N/A (variable) |

General guidelines:

  1. Start with paragraph for markdown with clear structure
  2. Use sliding-window if paragraphs are too long (>300 words)
  3. Use fixed-size as baseline for comparison
  4. Always test with representative questions from your domain

Extending

Add a new chunking strategy:

  1. Implement a function in src/chunker.py:

from typing import Dict, List

def my_custom_chunks(text: str, chunk_size: int, overlap: int) -> List[Dict]:
    chunks = []
    # Your chunking logic here
    chunks.append({"id": 0, "text": "chunk text"})
    return chunks

  2. Register it in STRATEGIES:

STRATEGIES = {
    "custom": lambda text, chunk_size=200, overlap=0: my_custom_chunks(text, chunk_size, overlap),
    ...
}

  3. Use it via the CLI:

rag-chunk analyze docs/ --strategy custom --chunk-size 180

Project Structure

rag-chunk/
├── src/
│   ├── __init__.py
│   ├── parser.py       # Markdown parsing and cleaning
│   ├── chunker.py      # Chunking strategies
│   ├── scorer.py       # Retrieval and recall evaluation
│   └── cli.py          # Command-line interface
├── tests/
│   └── test_basic.py   # Unit tests
├── examples/
│   ├── rag_introduction.md
│   ├── chunking_strategies.md
│   ├── evaluation_metrics.md
│   └── questions.json
├── .chunks/            # Generated chunks (gitignored)
├── pyproject.toml
├── README.md
└── .gitignore

License

MIT

Note on Tokenization

By default, --chunk-size and --overlap count words (whitespace-based tokenization). This keeps the tool simple and dependency-free.

For precise token-level chunking that matches LLM token counts (e.g., OpenAI GPT models using subword tokenization), use the --use-tiktoken flag after installing the optional dependency:

pip install rag-chunk[tiktoken]
rag-chunk analyze docs/ --strategy fixed-size --chunk-size 512 --use-tiktoken

See the Using Tiktoken section for more details.