100 changes: 54 additions & 46 deletions 07-TextSplitter/04-SemanticChunker.ipynb
Expand Up @@ -18,19 +18,20 @@
"\n",
"## Overview\n",
"\n",
"This tutorial covers a Text Splitter that splits text based on semantic similarity.\n",
"This tutorial dives into a Text Splitter that uses semantic similarity to split text.\n",
"\n",
"The **Semantic Chunker** is a sophisticated tool within **LangChain** that brings an intelligent approach to document chunking. Rather than simply dividing text at fixed intervals, it analyzes the semantic meaning of content to create more meaningful divisions. \n",
"LangChain's `SemanticChunker` is a powerful tool that takes document chunking to a whole new level. Unlike traiditional methods that split text at fixed intervals, the `SemanticChunker` analyzes the meaning of the content to create more logical divisions.\n",
"\n",
"This process relies on **OpenAI's embedding model** , which evaluates how similar different pieces of text are to each other. The tool offers flexible splitting options, including percentile-based, standard deviation, and interquartile range methods. \n",
"This approach relies on **OpenAI's embedding model** , calculating how similar different pieces of text are by converting them into numerical representations. The tool offers various splitting options to suit your needs. You can choose from methods based on percentiles, standard deviation, or interquartile range.\n",
"\n",
"What sets it apart from traditional text splitters is its ability to maintain context by identifying natural break points in the text, ultimately leading to better performance when working with large language models. \n",
"What sets the `SemanticChunker` apart is its ability to preserve context by identifying natural breaks. This ultimately leads to better performance when working with large language models. \n",
"\n",
"By understanding the actual meaning of the content, it creates more coherent and useful chunks that preserve the original document's context and flow.\n",
"Since the `SemanticChunker` understands the actual content, it generates chunks that are more useful and maintain the flow and context of the original document.\n",
"\n",
" [Greg Kamradt's Notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)\n",
"See [Greg Kamradt's notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)\n",
"\n",
"The method divides the text into sentence units, then groups them into three sentences, and merges similar sentences in the embedding space.\n",
"\n",
"The method breaks down the text into individual sentences first. Then, it groups sementically similar sentences into chunks (e.g., 3 sentences), and finally merges similar sentences in the embedding space.\n",
"\n",
"### Table of Contents\n",
"\n",
Expand All @@ -42,8 +43,8 @@
"\n",
"### References\n",
"\n",
"- [Greg Kamradt's Notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)\n",
"\n",
"- [Greg Kamradt's notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)\n",
"- [Greg Kamradt's video](https://youtu.be/8OJC21T2SL4?si=PzUtNGYJ_KULq3-w&t=2580)\n",
"\n",
"----"
]
Expand Down Expand Up @@ -78,7 +79,7 @@
"outputs": [],
"source": [
"%%capture --no-stderr\n",
"!pip install langchain-opentutorial"
"%pip install langchain-opentutorial"
]
},
{
Expand Down Expand Up @@ -134,9 +135,9 @@
"id": "54038451",
"metadata": {},
"source": [
"You can alternatively set `OPENAI_API_KEY` in `.env` file and load it.\n",
"Alternatively, you can set and load `OPENAI_API_KEY` from a `.env` file.\n",
"\n",
"[Note] This is not necessary if you've already set `OPENAI_API_KEY` in previous steps."
"**[Note]** This is only necessary if you haven't already set `OPENAI_API_KEY` in previous steps."
]
},
{
Expand All @@ -153,6 +154,14 @@
"load_dotenv(override=True)"
]
},
{
"cell_type": "markdown",
"id": "b3926396",
"metadata": {},
"source": [
"Load the sample text and output its content."
]
},
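A minimal sketch of the collapsed loading cell that follows; the file path is an assumption based on the tutorial's usual data layout and may differ in your setup.

```python
# Load the sample text into a string (path is an assumption; adjust as needed).
with open("./data/appendix-keywords.txt", encoding="utf-8") as f:
    file = f.read()

print(file[:350])  # preview the first few hundred characters
```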
{
"cell_type": "code",
"execution_count": null,
Expand All @@ -176,17 +185,17 @@
"source": [
"## Creating a SemanticChunker\n",
"\n",
"`SemanticChunker` is one of LangChain's experimental features, which serves to divide text into semantically similar chunks.\n",
"The `SemanticChunker` is an experimental LangChain feature, that splits text into semantically similar chunks.\n",
"\n",
"This allows you to process and analyze text data more effectively."
"This approach allows for more effective processing and analysis of text data."
]
},
{
"cell_type": "markdown",
"id": "ab33ae70",
"id": "339f032d",
"metadata": {},
"source": [
"Use `SemanticChunker` to divide the text into semantically related chunks.\n"
"Use the `SemanticChunker` to divide the text into semantically related chunks."
]
},
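The collapsed cell below creates the splitter. A minimal sketch, assuming the `langchain-experimental` and `langchain-openai` packages are installed and `OPENAI_API_KEY` is set:

```python
# Create a SemanticChunker backed by OpenAI embeddings.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())
```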
{
Expand Down Expand Up @@ -216,7 +225,7 @@
"id": "b0c9b20b",
"metadata": {},
"source": [
"- Use `text_splitter` to divide the `file` text into document units."
"Use the `text_splitter` with your loaded file (`file`) to split the text into smallar, more manageable unit documents. This process is often referred to as chunking."
]
},
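A sketch of the splitting step, assuming `file` holds the sample text loaded earlier:

```python
# split_text() returns a list of semantically coherent strings.
chunks = text_splitter.split_text(file)
```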
{
Expand All @@ -234,7 +243,7 @@
"id": "14a777bc",
"metadata": {},
"source": [
"Check the divided chunks."
"After splitting, you can examine the resulting chunks to see how the text has been divided."
]
},
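For example:

```python
# Peek at the first chunk to see where the splitter drew a boundary.
print(chunks[0])
```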
{
Expand All @@ -253,7 +262,7 @@
"id": "8f03b26b",
"metadata": {},
"source": [
"You can convert chunks to documents using the `create_documents()` function.\n"
"The `create_documents()` function allows you to convert the individual chunks (`[file]`) into proper document objects (`docs`).\n"
]
},
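A sketch of this step:

```python
# create_documents() takes a list of raw texts and returns Document
# objects, splitting each text as it goes.
docs = text_splitter.create_documents([file])
print(docs[0].page_content[:200])
```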
{
Expand All @@ -276,18 +285,25 @@
"metadata": {},
"source": [
"## Breakpoints\n",
"This chunker works by determining when to \"split\" sentences. \n",
"\n",
"This is done by examining the embedding differences between two sentences.\n",
"\n",
"If the difference exceeds a certain threshold, the sentences are split.\n",
"This chunking process works by indentifying natural breaks between sentences.\n",
"\n",
"- Reference video: https://youtu.be/8OJC21T2SL4?si=PzUtNGYJ_KULq3-w&t=2580\n",
"Here's how it decides where to split the text:\n",
"1. It calculates the difference between these embeddings for each pair of sentences.\n",
"2. When the difference between two sentences exceeds a certain threshold (breakpoint), the `text_splitter` identifies this as a natural break and splits the text at that point.\n",
"\n",
"### Percentile\n",
"The basic splitting method is based on `Percentile`.\n",
"Check out [Greg Kamradt's video](https://youtu.be/8OJC21T2SL4?si=PzUtNGYJ_KULq3-w&t=2580) for more details.\n",
"\n"
]
},
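A small numeric sketch of the breakpoint idea — this is not the library's internal code, and the distances and threshold are made-up values:

```python
# Hypothetical cosine distances between consecutive sentence embeddings.
distances = [0.12, 0.08, 0.45, 0.10, 0.38]
threshold = 0.30  # hypothetical breakpoint value

# Split wherever the distance exceeds the threshold.
break_points = [i for i, d in enumerate(distances) if d > threshold]
print(break_points)  # -> [2, 4]
```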
{
"cell_type": "markdown",
"id": "f295b680",
"metadata": {},
"source": [
"### Percentile-Based Splitting\n",
"\n",
"In this method, all differences between sentences are calculated, then splitting is done based on the specified percentile.\n"
"This method sorts all embedding differences between sentences. Then, it splits the text at a specific percentile (e.g. 70th percentile)."
]
},
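A sketch of a percentile-based splitter; 70.0 is an illustrative value, not a recommendation (the library's default percentile may differ):

```python
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # the default strategy
    breakpoint_threshold_amount=70.0,        # split above the 70th percentile
)
docs = text_splitter.create_documents([file])
```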
{
Expand All @@ -311,7 +327,7 @@
"id": "59aa8318",
"metadata": {},
"source": [
"Check the split results.\n"
"Examine the resulting document list (`docs`).\n"
]
},
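For example:

```python
# Print each chunk with a separator to inspect the boundaries.
for i, doc in enumerate(docs):
    print(f"[Chunk {i}]\n{doc.page_content}")
    print("=" * 60)
```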
{
Expand All @@ -335,7 +351,7 @@
"id": "07e83f74",
"metadata": {},
"source": [
"Print the length of `docs`."
"Use the `len(docs)` function to get the number of chunks created."
]
},
{
Expand All @@ -353,11 +369,11 @@
"id": "21c1c9e8",
"metadata": {},
"source": [
"### Standard Deviation\n",
"### Standard Deviation Splitting\n",
"\n",
"In this method, splitting occurs when there is a difference greater than the specified `breakpoint_threshold_amount` standard deviation.\n",
"This method sets a threshold based on a specified number of standard deviations (`breakpoint_threshold_amount`).\n",
"\n",
"- Set the `breakpoint_threshold_type` parameter to \"standard_deviation\" to specify chunk splitting criteria based on standard deviation."
"To use standard deviation for your breakpoints, set the `breakpoint_threshold_type` parameter to `\"standard_deviation\"` when initializing the `text_splitter`."
]
},
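A sketch using the standard-deviation criterion; 3 is an illustrative value:

```python
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=3,  # split beyond 3 standard deviations
)
docs = text_splitter.create_documents([file])
```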
{
Expand All @@ -381,7 +397,7 @@
"id": "690db96c",
"metadata": {},
"source": [
"Check the split results."
"After splitting, check the `docs` list and print its length (`len(docs)`) to see how many chunks were created."
]
},
{
Expand Down Expand Up @@ -411,14 +427,6 @@
" print(\"===\" * 20)"
]
},
{
"cell_type": "markdown",
"id": "095170af",
"metadata": {},
"source": [
"Print the length of `docs`."
]
},
{
"cell_type": "code",
"execution_count": null,
Expand All @@ -434,17 +442,17 @@
"id": "c5b03d9b",
"metadata": {},
"source": [
"### Interquartile\n",
"### Interquartile Range Splitting\n",
"\n",
"This method uses interquartile range to split chunks."
"This method utilizes the interquartile range (IQR) of the embedding differences to consider breaks, leading to a text split."
]
},
{
"cell_type": "markdown",
"id": "fb408177",
"metadata": {},
"source": [
"- Set the `breakpoint_threshold_type` parameter to \"interquartile\" to specify chunk splitting criteria based on interquartile range.\n"
"Set the `breakpoint_threshold_type` parameter to `\"interquartile\"` when initializing the `text_splitter` to use the IQR for splitting."
]
},
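A sketch using the IQR criterion; 1.5 is the conventional IQR multiplier, shown purely as an illustration:

```python
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=1.5,  # conventional multiplier, illustrative
)
docs = text_splitter.create_documents([file])
```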
{
Expand Down Expand Up @@ -487,7 +495,7 @@
"id": "9d186bb7",
"metadata": {},
"source": [
"Print the length of `docs`.\n"
"Finally, print the length of `docs` list (`len(docs)`) to view the number of cunks created.\n"
]
},
{
Expand Down Expand Up @@ -522,4 +530,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}