Meet chunky, the coolest text chunking API you'll ever use.
chunky is a proof-of-concept application that demonstrates text chunking for Retrieval-Augmented Generation (RAG) systems built on vector search. Its primary objective is to provide a simple API endpoint that lets users upload a PDF file, from which the application extracts, cleans, and chunks the text into smaller, retrieval-friendly pieces.
`POST /chunk` (uploads a PDF file, extracts the text, cleans it, and chunks it into smaller pieces)

Parameters:

| name | type     | data type | description                 |
|------|----------|-----------|-----------------------------|
| file | required | file      | The PDF file to be uploaded |

Responses:

| http code | content-type     | response                                                     |
|-----------|------------------|--------------------------------------------------------------|
| 200       | application/json | `{"chunks": ["chunk1", "chunk2", ...]}`                      |
| 400       | application/json | `{"detail": "Invalid file type. Please upload a PDF file."}` |
| 500       | application/json | `{"detail": "An error occurred: {error_message}"}`           |
```bash
curl -X POST "http://localhost:8000/chunk" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@path/to/your/file.pdf"
```
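For reference, the same request from Python using the `requests` library (a usage sketch; the file path and the local server address are placeholders):

```python
import requests

# Upload a PDF to a running chunky instance and print the returned chunks.
with open("path/to/your/file.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/chunk",
        files={"file": ("file.pdf", f, "application/pdf")},
    )
resp.raise_for_status()
print(resp.json()["chunks"])
```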
Create conda environment:

```bash
conda env create --file=environment.yml
```

Activate conda environment:

```bash
conda activate chunky
```

Run the FastAPI application:

```bash
uvicorn app:app
```

Build the Docker image:

```bash
docker build -t chunky:latest .
```

Run the Docker container:
```bash
docker run -d -p 8000:8000 chunky:latest
```

Based on the articles on Medium and Spot Intelligence, I've implemented some basic text cleaning strategies: whitespace stripping, non-printable character removal, and collapsing multiple spaces into one. I've avoided more advanced techniques to keep the text consistent for chunking.
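As an illustration, the cleaning steps amount to something like the following (a minimal sketch, not necessarily the exact implementation in the app):

```python
import re

def clean_text(text: str) -> str:
    """Basic cleaning: drop non-printable characters, collapse runs of
    spaces, and strip leading/trailing whitespace. Illustrative only."""
    # Remove non-printable characters (keep newlines so paragraph breaks survive)
    text = "".join(ch for ch in text if ch.isprintable() or ch == "\n")
    # Collapse multiple spaces/tabs into a single space
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()
```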
The articles from unstructured.io and Pinecone, the Reddit comments, and the first cited paper mainly influenced my decisions about the chunking strategy and its parameters.
Fixed-size, character-based chunking would have been too simplistic for this use case, since it can split text in the middle of sentences or other semantic units, producing chunks that are not meaningful or useful for downstream tasks.
I had to skip most semantic chunking methods because of the constraint against using LLMs or any Transformer-based models. Even though I found and tested lighter-weight semantic chunking methods/libraries (e.g. NLTK, textsplit, PySBD), they didn't split the text as uniformly as recursive chunking.
Recursive chunking performed best: it gives more control over chunk sizes while keeping the chunks semantically meaningful. By recursively splitting the text on predefined rules (e.g. paragraph breaks, then sentence boundaries), it strikes a balance between chunk size and semantic coherence, which is crucial for text embedding models.
The chunk-length distribution plots in the Jupyter Notebook support this, so I stuck with recursive chunking.
The research materials almost uniformly stated that for a vector-embedding use case a chunk size of 500-1000 characters with an overlap of about 10% of the chunk size works best, depending of course on the document type and the embedding model. Therefore, I chose 1024 (a power of 2) as the chunk size and 100 as the overlap as a general-purpose setting.
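For illustration, this is how those parameters map onto LangChain's `RecursiveCharacterTextSplitter`; the exact library used isn't pinned down above, so treat this as a sketch of the approach rather than the app's actual code:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Recursive chunking: try paragraph breaks first, then line breaks, then
# sentence boundaries, then spaces, and only hard-cut characters as a last resort.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,    # characters per chunk (a power of 2)
    chunk_overlap=100,  # roughly 10% of the chunk size
    separators=["\n\n", "\n", ". ", " ", ""],
)

cleaned_text = "..."  # the cleaned text extracted from the PDF
chunks = splitter.split_text(cleaned_text)
```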
- https://unstructured.io/blog/chunking-for-rag-best-practices
- https://www.pinecone.io/learn/chunking-strategies
- https://pub.towardsai.net/how-to-optimize-chunk-sizes-for-rag-in-production-fae9019796b6
- https://towardsdatascience.com/how-to-chunk-text-data-a-comparative-analysis-3858c4a0997a
- https://www.restack.io/p/text-chunking-answer-spacy-semantic-chunking-cat-ai
- https://www.llamaindex.ai/blog/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5
- https://www.kaggle.com/discussions/general/503436
- https://medium.com/@datascientist_SheezaShabbir/text-cleaning-in-nlp-libraries-techniques-and-how-to-get-started-8c7c7e8ba7cf
- https://spotintelligence.com/2023/09/18/top-20-essential-text-cleaning-techniques-practical-how-to-guide-in-python/
- https://www.reddit.com/r/LocalLLaMA/comments/15f0b24/open_source_options_for_better_chunking/
- https://www.reddit.com/r/LangChain/comments/16bjj6w/what_is_optimal_chunk_size/
- https://www.reddit.com/r/LangChain/comments/16m73j4/comment/k17vcki/
- https://github.com/topics/chunking
- https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/prepdocslib/textsplitter.py
- Spurthi Setty, Harsh Thakkar, Alyssa Lee, Eden Chung, Natan Vidra. "Improving Retrieval for RAG based Question Answering Models on Financial Documents." arXiv preprint, 2024. arXiv:2404.07221
- Vatsal Raina and Mark Gales. "Question-Based Retrieval using Atomic Units for Enterprise RAG." arXiv preprint, 2024. arXiv:2405.12363