@Askir Askir commented May 16, 2025

No description provided.

@Askir Askir requested a review from a team as a code owner May 16, 2025 12:53
@Askir Askir force-pushed the jascha/document-chunk-overlap branch from 047bdff to dfd3610 Compare May 16, 2025 13:05

Both chunking implementations use LangChain's text splitters under the hood: [[1]](https://python.langchain.com/docs/how_to/character_text_splitter/), [[2]](https://python.langchain.com/docs/how_to/recursive_text_splitter/).


Contributor

WDYT about adding an example here?

A JSON configuration object that you can use in [ai.create_vectorizer](#create-vectorizers).

### Understanding Overlap
The recursive character text splitter creates overlaps only when splitting within words (using the empty string separator "").
Member

Is this statement correct? In the following example, the "" splitter is not used, but there is still overlapping text:

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", " ", ""], chunk_size=20, chunk_overlap=10)
text_splitter.split_text("Lorem Ipsum is simply dummy text of the printing and typesetting industry.")

Produces:

['Lorem Ipsum is', 'Ipsum is simply', 'is simply dummy', 'dummy text of the', 'of the printing and', 'and typesetting', 'industry.']

Contributor Author

Your code actually contains the "" splitter. But I tried without it and you're correct that it still produces overlap. I can't quite figure out how this works. Some chunks seem to have overlap and some don't. Maybe linking to langchain is enough.
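
For what it's worth, here is a rough pure-Python sketch of how the merge step appears to behave. It is a simplified approximation, not LangChain's actual code, and the helper names `merge_with_overlap` and `split_keeping_spaces` are made up for illustration. The assumption is that the text is first split on the separator, then the pieces are packed greedily into chunks, and only whole trailing pieces that fit within `chunk_overlap` are carried into the next chunk.

```python
def merge_with_overlap(pieces, chunk_size, chunk_overlap):
    """Greedily pack pieces into chunks of at most chunk_size characters.

    Simplified sketch (assumption, not the library's code): after a chunk is
    emitted, whole pieces are dropped from its front until what remains fits
    in chunk_overlap and leaves room for the next piece.
    """
    chunks = []
    window = []  # pieces that make up the chunk currently being built
    total = 0    # combined length of the pieces in `window`
    for piece in pieces:
        if window and total + len(piece) > chunk_size:
            chunks.append("".join(window).strip())
            # Keep only as many trailing pieces as the overlap budget allows.
            while window and (total > chunk_overlap or total + len(piece) > chunk_size):
                total -= len(window[0])
                window.pop(0)
        window.append(piece)
        total += len(piece)
    if window:
        chunks.append("".join(window).strip())
    return chunks


def split_keeping_spaces(text):
    """Split on " " and keep each space attached to the word that follows it."""
    words = text.split(" ")
    return [words[0]] + [" " + word for word in words[1:]]


text = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
chunks = merge_with_overlap(split_keeping_spaces(text), chunk_size=20, chunk_overlap=10)
print(chunks)
```

Under these assumptions the sketch reproduces the list above. It would also explain the uneven overlap: the carried-over text is always a whole number of words, so 'dummy text of the' and 'of the printing and' share 'of the', while 'and typesetting' and 'industry.' share nothing because ' typesetting' alone is longer than the 10-character overlap budget.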

@Askir Askir force-pushed the jascha/document-chunk-overlap branch from 9e23cce to 6b94ce8 Compare June 17, 2025 15:54
Update docs/vectorizer/api-reference.md

Co-authored-by: Sergio Moya <[email protected]>
Signed-off-by: Jascha Beste <[email protected]>
@Askir Askir force-pushed the jascha/document-chunk-overlap branch from 6b94ce8 to f1d967c Compare June 17, 2025 15:55
@alejandrodnm alejandrodnm force-pushed the main branch 2 times, most recently from 35e80a5 to f15011e Compare October 14, 2025 14:27