docs: document how chunk overlap works #751
Both chunking implementations use langchain's Text splitters under the hood: [[1]](https://python.langchain.com/docs/how_to/character_text_splitter/), [[2]](https://python.langchain.com/docs/how_to/recursive_text_splitter/).
WDYT about adding an example here?
docs/vectorizer/api-reference.md
Outdated
A JSON configuration object that you can use in [ai.create_vectorizer](#create-vectorizers).

### Understanding Overlap
The recursive character text splitter creates overlaps only when splitting within words (using the empty string separator "").
Is this statement correct? In the following example, the "" splitter is not used, but there is still overlapping text:
```python
# Import path assumes the langchain-text-splitters package is installed.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", " ", ""], chunk_size=20, chunk_overlap=10)
text_splitter.create_documents(["Lorem Ipsum is simply dummy text of the printing and typesetting industry."])
```
Produces chunks with this text:
```
['Lorem Ipsum is', 'Ipsum is simply', 'is simply dummy', 'dummy text of the', 'of the printing and', 'and typesetting', 'industry.']
```
Your code actually contains the "" splitter. But I tried without it, and you're correct that it still produces overlap. I can't quite figure out how this works: some chunks seem to have overlap and some don't. Maybe linking to langchain is enough.
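For what it's worth, the uneven overlap can be reproduced with a small pure-Python model of a greedy merge. This is only a sketch of how the splitter appears to behave, not langchain's actual code: `merge_pieces` is a hypothetical helper, and it assumes the text is first split on " " with the separator kept at the start of each piece, that pieces are packed until the next one would exceed `chunk_size`, and that only whole trailing pieces totalling at most `chunk_overlap` characters are carried into the next chunk.

```python
def merge_pieces(pieces, chunk_size, chunk_overlap):
    """Greedily pack pieces into chunks, carrying whole trailing pieces as overlap."""
    chunks, window, total = [], [], 0
    for piece in pieces:
        if window and total + len(piece) > chunk_size:
            # The current window is full: emit it as a chunk.
            chunks.append("".join(window).strip())
            # Drop pieces from the front until the remaining tail is no longer
            # than chunk_overlap and leaves room for the next piece.
            while window and (total > chunk_overlap or total + len(piece) > chunk_size):
                total -= len(window[0])
                window.pop(0)
        window.append(piece)
        total += len(piece)
    if window:
        chunks.append("".join(window).strip())
    return chunks


text = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
# Split on " " and keep the separator at the start of every piece but the first.
words = text.split(" ")
pieces = [words[0]] + [" " + word for word in words[1:]]

for chunk in merge_pieces(pieces, chunk_size=20, chunk_overlap=10):
    print(repr(chunk))
```

Under these assumptions the model yields the same seven chunks as the list quoted in the review comment. Overlap is measured in whole pieces, so "Ipsum is" (9 characters with its leading space) is carried over, while " typesetting" (12 characters) is longer than `chunk_overlap=10` and is dropped, which is why 'industry.' shares no text with the chunk before it.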
Update docs/vectorizer/api-reference.md

Co-authored-by: Sergio Moya <[email protected]>
Signed-off-by: Jascha Beste <[email protected]>
No description provided.