
A Python CLI to test, benchmark, and find the best RAG chunking strategy for your Markdown documents.


rag-chunk

Current Version: 0.4.0 🎉

CLI tool to parse, chunk, and evaluate Markdown documents for Retrieval-Augmented Generation (RAG) pipelines with token-accurate chunking and semantic intelligence.

Available on PyPI: https://pypi.org/project/rag-chunk/

📚 Documentation | 🐙 GitHub

✨ Features

  • 📄 Parse and clean Markdown files
  • ✂️ 6 Chunking Strategies:
    • fixed-size: Split by fixed word/token count
    • sliding-window: Overlapping chunks for context preservation
    • paragraph: Natural paragraph boundaries
    • recursive-character: LangChain's semantic splitter
    • header: NEW - Markdown header-aware splitting
    • semantic: NEW - Embedding-based semantic boundaries
  • 🧠 Semantic Intelligence: Embedding-based chunking and retrieval
  • 🎯 Token-based chunking with tiktoken (OpenAI models: GPT-3.5, GPT-4, etc.)
  • 🎨 Model selection via --tiktoken-model flag
  • 📊 Advanced Metrics: Precision, Recall, F1-score evaluation
  • 🌈 Beautiful CLI output with Rich tables
  • 📈 Compare all strategies with --strategy all
  • 💾 Export results as JSON or CSV

Demo

rag-chunk demo

🚀 Roadmap

rag-chunk is actively developed! Here's the plan to move from a useful tool to a full-featured chunking workbench.

✅ Version 0.1.0 – Launched

  • Core CLI engine (argparse)
  • Markdown (.md) file parsing
  • Basic chunking strategies: fixed-size, sliding-window, and paragraph (word-based)
  • Evaluation harness: calculate Recall score from a test-file.json
  • Beautiful CLI output (rich tables)
  • Published on PyPI: pip install rag-chunk

✅ Version 0.2.0 – Completed

  • Tiktoken Support: Added --use-tiktoken flag for precise token-based chunking
  • Model Selection: Added --tiktoken-model to choose tokenization model (default: gpt-3.5-turbo)
  • Improved Documentation: Updated README with tiktoken usage examples and comparisons
  • Enhanced Testing: Added comprehensive unit tests for token-based chunking
  • Optional Dependencies: tiktoken available via pip install rag-chunk[tiktoken]

✅ Version 0.3.0 – Released

  • Recursive Character Splitting: Add LangChain's RecursiveCharacterTextSplitter for semantic chunking
    • Install with: pip install rag-chunk[langchain]
    • Strategy: --strategy recursive-character
    • Works with both word-based and tiktoken modes
  • More File Formats: Support .txt files
  • Additional Metrics: Add precision, F1-score, and chunk quality metrics

✅ Version 0.4.0 – Released

  • Header-Aware Chunking: Split by markdown headers while respecting size limits
    • Strategy: --strategy header
    • Preserves document structure with metadata
  • Semantic Chunking: Use sentence embeddings to split at semantic boundaries
    • Strategy: --strategy semantic
    • Install with: pip install rag-chunk[semantic]
    • Powered by sentence-transformers
  • Embedding-Based Retrieval: Semantic similarity matching with --use-embeddings
    • Superior to lexical matching for semantic queries
    • Uses cosine similarity on sentence embeddings
  • Documentation Site: Complete GitHub Pages documentation

📈 Version 1.0.0 – Planned

  • Chunk Size Optimizer: Automated sweep to find optimal chunk size
  • Visualization Dashboard: HTML report with interactive charts
  • Context Augmentation: Add metadata (position, section, summaries) to chunks
  • Export Connectors: Direct integration with vector stores (Pinecone, Weaviate, Chroma)
  • Benchmarking Mode: Statistical comparison with significance testing
  • MLFlow Integration: Track experiments and chunking configurations
  • Performance Optimization: Parallel processing for large document sets

Installation


```bash
# Base installation
pip install rag-chunk

# With all features (recommended)
pip install rag-chunk[all]

# Or install specific features:
pip install rag-chunk[tiktoken]    # Token-based chunking
pip install rag-chunk[semantic]    # Semantic chunking & retrieval
pip install rag-chunk[langchain]   # Recursive character splitting
```

Development mode:

pip install -e .[all]

Quick Start

# Compare all strategies with semantic retrieval
rag-chunk analyze examples/ \
  --strategy all \
  --use-embeddings \
  --test-file examples/questions.json \
  --top-k 3

# Header-aware chunking for technical docs
rag-chunk analyze docs/ --strategy header --chunk-size 300

# Semantic chunking with embeddings
rag-chunk analyze examples/ \
  --strategy semantic \
  --chunk-size 200 \
  --use-embeddings \
  --test-file questions.json

CLI Usage

rag-chunk analyze <folder> [options]

Options

| Option | Description | Default |
|--------|-------------|---------|
| --strategy | Chunking strategy: fixed-size, sliding-window, paragraph, recursive-character, header, semantic, or all | fixed-size |
| --chunk-size | Number of words or tokens per chunk | 200 |
| --overlap | Number of overlapping words or tokens | 50 |
| --use-tiktoken | Use tiktoken for precise token-based chunking | False |
| --tiktoken-model | Model for tiktoken encoding | gpt-3.5-turbo |
| --use-embeddings | Use semantic embeddings for retrieval (requires sentence-transformers) | False |
| --test-file | Path to JSON test file with questions | None |
| --top-k | Number of chunks to retrieve per question | 3 |
| --output | Output format: table, json, or csv | table |

If --strategy all is chosen, every strategy is run with the supplied chunk-size and overlap where applicable.
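
To see how --chunk-size and --overlap interact in the default word-based mode, here is a minimal sliding-window sketch. It is illustrative only, not rag-chunk's internal code: a window of chunk-size words advances by chunk-size minus overlap words each step.

```python
# Minimal sketch of word-based sliding-window chunking (illustration only,
# not rag-chunk's internal implementation).
def sliding_window_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    words = text.split()                    # default mode: whitespace "tokens"
    step = max(chunk_size - overlap, 1)     # each window advances by chunk_size - overlap words
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):   # the last window reached the end of the text
            break
    return chunks

# With --chunk-size 100 --overlap 20, consecutive chunks share 20 words.
```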

Examples

Header-Aware Chunking (NEW)

Split by markdown headers while respecting chunk size limits:

rag-chunk analyze docs/ --strategy header --chunk-size 300

Preserves document structure with metadata about headers and hierarchy levels.
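
Conceptually, header-aware splitting starts a new chunk at each Markdown header and carries the header along as metadata. The sketch below only illustrates that idea; it ignores the chunk-size limit that the real strategy also enforces.

```python
import re

# Illustrative sketch of header-aware splitting (not rag-chunk's implementation):
# start a new chunk at every Markdown header and keep the header as metadata.
def split_by_headers(markdown: str) -> list[dict]:
    chunks, current = [], {"header": None, "lines": []}
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            if current["lines"]:
                chunks.append({"header": current["header"], "text": "\n".join(current["lines"])})
            current = {"header": match.group(2), "lines": [line]}
        else:
            current["lines"].append(line)
    if current["lines"]:
        chunks.append({"header": current["header"], "text": "\n".join(current["lines"])})
    return chunks
```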

Semantic Chunking (NEW)

Use sentence embeddings to split at semantic boundaries:

rag-chunk analyze examples/ --strategy semantic --chunk-size 200

Groups semantically similar sentences together, splitting when topic shifts occur.
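
The underlying idea is to embed each sentence and start a new chunk wherever the similarity between neighbouring sentences drops. The sketch below illustrates that idea under assumed choices (sentence-transformers with the all-MiniLM-L6-v2 model and a 0.5 threshold); it is not rag-chunk's exact implementation.

```python
# Conceptual sketch of embedding-based boundary detection; assumes the optional
# dependency from `pip install rag-chunk[semantic]` (sentence-transformers).
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

def semantic_groups(sentences: list[str], threshold: float = 0.5) -> list[list[str]]:
    if not sentences:
        return []
    model = SentenceTransformer("all-MiniLM-L6-v2")   # model choice is an assumption
    embeddings = model.encode(sentences)
    groups, current = [], [sentences[0]]
    for prev_emb, curr_emb, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        if cos_sim(prev_emb, curr_emb).item() < threshold:   # similarity drop = topic shift
            groups.append(current)
            current = []
        current.append(sentence)
    groups.append(current)
    return groups
```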

Embedding-Based Retrieval (NEW)

Compare lexical vs semantic retrieval:

# Lexical (keyword-based)
rag-chunk analyze examples/ --strategy all --test-file questions.json

# Semantic (embedding-based)
rag-chunk analyze examples/ --strategy all --test-file questions.json --use-embeddings

Semantic retrieval typically achieves higher recall for conceptual queries.
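
In spirit, --use-embeddings replaces keyword overlap with cosine similarity between the question embedding and each chunk embedding, keeping the top-k matches. A hedged sketch of that idea, with an assumed model:

```python
# Conceptual sketch of embedding-based retrieval (what --use-embeddings does in
# spirit; the exact model and scoring inside rag-chunk may differ).
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

def retrieve_top_k(question: str, chunks: list[str], k: int = 3) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed model
    scores = cos_sim(model.encode([question]), model.encode(chunks))[0].tolist()
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:k]]
```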

Basic Usage: Generate Chunks Only

Analyze markdown files and generate chunks without evaluation:

rag-chunk analyze examples/ --strategy paragraph

Output:

strategy  | chunks | avg_recall | saved                            
----------+--------+------------+----------------------------------
paragraph | 12     | 0.0        | .chunks/paragraph-20251115-020145
Total text length (chars): 3542

Compare All Strategies

Run all chunking strategies with custom parameters:

rag-chunk analyze examples/ --strategy all --chunk-size 100 --overlap 20 --output table

Output:

strategy           | chunks | avg_recall | avg_precision | avg_f1 | saved                                 
-------------------+--------+------------+---------------+--------+---------------------------------------
fixed-size         | 36     | 0.0        | 0.0           | 0.0    | .chunks/fixed-size-20251115-020156    
sliding-window     | 45     | 0.0        | 0.0           | 0.0    | .chunks/sliding-window-20251115-020156
paragraph          | 12     | 0.0        | 0.0           | 0.0    | .chunks/paragraph-20251115-020156
recursive-character| 28     | 0.0        | 0.0           | 0.0    | .chunks/recursive-character-20251115-020156
header             | 15     | 0.0        | 0.0           | 0.0    | .chunks/header-20251115-020156
semantic           | 22     | 0.0        | 0.0           | 0.0    | .chunks/semantic-20251115-020156
Total text length (chars): 3542

Evaluate with Test File

Measure recall using a test file with questions and relevant phrases:

rag-chunk analyze examples/ --strategy all --chunk-size 150 --overlap 30 --test-file examples/questions.json --top-k 3 --use-embeddings --output table

Output:

strategy           | chunks | avg_recall | avg_precision | avg_f1 | saved                                 
-------------------+--------+------------+---------------+--------+---------------------------------------
fixed-size         | 24     | 0.7812     | 0.7812        | 0.7812 | .chunks/fixed-size-20251115-020203    
sliding-window     | 32     | 0.8542     | 0.8542        | 0.8542 | .chunks/sliding-window-20251115-020203
paragraph          | 12     | 0.9167     | 0.9167        | 0.9167 | .chunks/paragraph-20251115-020203
semantic           | 19     | 0.9583     | 0.9583        | 0.9583 | .chunks/semantic-20251115-020203

Semantic chunking with embedding-based retrieval achieves highest recall (95.83%) by preserving semantic coherence.

Export Results as JSON

rag-chunk analyze examples/ --strategy sliding-window --chunk-size 120 --overlap 40 --test-file examples/questions.json --top-k 5 --output json > results.json

Output structure:

{
  "results": [
    {
      "strategy": "sliding-window",
      "chunks": 38,
      "avg_recall": 0.8958,
      "avg_precision": 0.8958,
      "avg_f1": 0.8958,
      "saved": ".chunks/sliding-window-20251115-020210"
    }
  ],
  "detail": {
    "per_questions": [
      {
        "question": "What are the three main stages of a RAG pipeline?",
        "recall": 1.0,
        "precision": 1.0,
        "f1": 1.0
      },
      {
        "question": "What is the main advantage of RAG over pure generative models?",
        "recall": 0.6667,
        "precision": 0.6667,
        "f1": 0.6667
      }
    ]
  }
}

Export as CSV

rag-chunk analyze examples/ --strategy all --test-file examples/questions.json --output csv

Creates analysis_results.csv with columns: strategy, chunks, avg_recall, avg_precision, avg_f1, saved.

Using Tiktoken for Precise Token-Based Chunking

By default, rag-chunk uses word-based tokenization (whitespace splitting). For precise token-level chunking that matches LLM context limits (e.g., GPT-3.5/GPT-4), use the --use-tiktoken flag.

Installation

pip install rag-chunk[tiktoken]

Usage Examples

Token-based fixed-size chunking:

rag-chunk analyze examples/ --strategy fixed-size --chunk-size 512 --use-tiktoken --output table

This creates chunks of exactly 512 tokens (as counted by tiktoken for GPT models), not 512 words.
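
For reference, tiktoken itself can be used to check how many tokens a piece of text occupies for a GPT-style model. This standalone snippet is independent of rag-chunk:

```python
import tiktoken

# Count tokens the way OpenAI models do (requires: pip install rag-chunk[tiktoken]).
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = "Chunking long Markdown documents for RAG pipelines."
print(len(encoding.encode(text)))   # number of tokens
print(len(text.split()))            # number of whitespace words, for comparison
```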

Compare word-based vs token-based chunking:

# Word-based (default)
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 200 --output json

# Token-based
rag-chunk analyze examples/ --strategy fixed-size --chunk-size 200 --use-tiktoken --output json

Token-based with sliding window:

rag-chunk analyze examples/ --strategy sliding-window --chunk-size 1024 --overlap 128 --use-tiktoken --test-file examples/questions.json --top-k 3

When to Use Tiktoken

  • Use tiktoken when:

    • Preparing chunks for OpenAI models (GPT-3.5, GPT-4)
    • You need to respect strict token limits (e.g., 8k, 16k context windows)
    • Comparing chunking strategies with token-accurate measurements
    • Your documents contain special characters, emojis, or non-ASCII text
  • ⚠️ Use word-based (default) when:

    • Quick prototyping and testing
    • Working with well-formatted English text
    • You don't need exact token counts
    • You want to avoid the tiktoken dependency

Token Counting

You can also use tiktoken in your own scripts:

from src.chunker import count_tokens

text = "Your document text here..."

# Word-based count
word_count = count_tokens(text, use_tiktoken=False)
print(f"Words: {word_count}")

# Token-based count (requires tiktoken installed)
token_count = count_tokens(text, use_tiktoken=True)
print(f"Tokens: {token_count}")

Test File Format

JSON file with a questions array (or direct array at top level):

{
  "questions": [
    {
      "question": "What are the three main stages of a RAG pipeline?",
      "relevant": ["indexing", "retrieval", "generation"]
    },
    {
      "question": "What is the main advantage of RAG over pure generative models?",
      "relevant": ["grounding", "retrieved documents", "hallucinate"]
    }
  ]
}
  • question: The query text used for chunk retrieval
  • relevant: List of phrases/terms that should appear in relevant chunks

Recall calculation: For each question, the tool retrieves the top-k chunks using lexical similarity (or embedding similarity when --use-embeddings is set) and checks how many relevant phrases appear in those chunks. Recall = (found phrases) / (total relevant phrases). Average recall is the mean across all questions.
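
As a sanity check, the recall formula translates directly into a few lines of Python. This is an illustrative restatement; rag-chunk's scorer may normalize text differently:

```python
# Recall for one question, as described above: found phrases / total relevant phrases.
def question_recall(retrieved_chunks: list[str], relevant_phrases: list[str]) -> float:
    retrieved_text = " ".join(retrieved_chunks).lower()
    found = sum(1 for phrase in relevant_phrases if phrase.lower() in retrieved_text)
    return found / len(relevant_phrases) if relevant_phrases else 0.0

# Average recall is the mean of question_recall(...) across all questions.
```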

Understanding the Output

Chunks

Number of chunks created by the strategy. More chunks = finer granularity but higher indexing cost.

Metrics

  • Average Recall: Percentage of relevant phrases successfully retrieved (0.0 to 1.0). Higher is better.
  • Average Precision: Ratio of relevant content in retrieved chunks (0.0 to 1.0). Higher is better.
  • Average F1-Score: Harmonic mean of precision and recall (0.0 to 1.0). Balanced measure of quality.
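
The F1 column is simply the harmonic mean of the precision and recall columns:

```python
# F1 as the harmonic mean of precision and recall; 0.0 when both are 0.
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```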

Interpreting scores:

  • > 0.85: Excellent - strategy works very well for your content
  • 0.70 - 0.85: Good - acceptable for most use cases
  • 0.50 - 0.70: Fair - consider adjusting chunk size or strategy
  • < 0.50: Poor - important information being lost or fragmented

Saved Location

Directory where chunks are written as individual .txt files for inspection.

Choosing the Right Strategy

| Strategy | Best For | Pros | Cons |
|----------|----------|------|------|
| Fixed-Size | Consistent chunk sizes, simple docs | Fast, deterministic | May break semantic boundaries |
| Sliding-Window | Preventing boundary loss | Preserves context at edges | Redundancy, more chunks |
| Paragraph | Well-structured docs | Preserves semantic coherence | Variable chunk sizes |
| Recursive-Character | General purpose | Good balance, semantic-aware | Requires LangChain |
| Header | Technical docs, markdown | Preserves document structure | Requires header markup |
| Semantic | Maximum retrieval quality | Best semantic coherence | Requires embeddings, slower |

Recommendations

  • Technical documentation: Use header strategy to preserve structure
  • Knowledge bases: Use semantic for best retrieval quality
  • General content: Start with recursive-character or paragraph
  • Token-limited models: Enable --use-tiktoken for accurate counting
  • Evaluation: Always use --use-embeddings with test files for better semantic matching

| Strategy | Best For | Chunk Size Recommendation |
|----------|----------|---------------------------|
| fixed-size | Uniform processing, consistent latency | 150-250 words |
| sliding-window | Preserving context at boundaries, dense text | 120-200 words, 20-30% overlap |
| paragraph | Well-structured docs with clear sections | N/A (variable) |

General guidelines:

  1. Start with paragraph for markdown with clear structure
  2. Use sliding-window if paragraphs are too long (>300 words)
  3. Use fixed-size as baseline for comparison
  4. Always test with representative questions from your domain

Extending

Add a new chunking strategy:

  1. Implement a function in src/chunker.py:

from typing import Dict, List

def my_custom_chunks(text: str, chunk_size: int, overlap: int) -> List[Dict]:
    chunks = []
    # Your chunking logic here
    chunks.append({"id": 0, "text": "chunk text"})
    return chunks

  2. Register it in STRATEGIES:

STRATEGIES = {
    "custom": lambda text, chunk_size=200, overlap=0: my_custom_chunks(text, chunk_size, overlap),
    ...
}

  3. Use it via the CLI:

rag-chunk analyze docs/ --strategy custom --chunk-size 180

Project Structure

rag-chunk/
├── src/
│   ├── __init__.py
│   ├── parser.py       # Markdown parsing and cleaning
│   ├── chunker.py      # Chunking strategies
│   ├── scorer.py       # Retrieval and recall evaluation
│   └── cli.py          # Command-line interface
├── tests/
│   └── test_basic.py   # Unit tests
├── examples/
│   ├── rag_introduction.md
│   ├── chunking_strategies.md
│   ├── evaluation_metrics.md
│   └── questions.json
├── .chunks/            # Generated chunks (gitignored)
├── pyproject.toml
├── README.md
└── .gitignore

License

MIT

Note on Tokenization

By default, --chunk-size and --overlap count words (whitespace-based tokenization). This keeps the tool simple and dependency-free.

For precise token-level chunking that matches LLM token counts (e.g., OpenAI GPT models using subword tokenization), use the --use-tiktoken flag after installing the optional dependency:

pip install rag-chunk[tiktoken]
rag-chunk analyze docs/ --strategy fixed-size --chunk-size 512 --use-tiktoken

See the Using Tiktoken section for more details.