Merged (changes from all commits)
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# TokenTextSplitter\n",
" # Text Splitting Methods in NLP\n",
"\n",
"- Author: [Ilgyun Jeong](https://github.com/johnny9210)\n",
"- Peer Review : [JoonHo Kim](https://github.com/jhboyo), [Sunyoung Park (architectyou)](https://github.com/Architectyou)\n",
@@ -13,10 +13,29 @@
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/03-TokenTextSplitter.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/03-TokenTextSplitter.ipynb)\n",
"\n",
"## Overview\n",
"Text splitting is a crucial preprocessing step in Natural Language Processing (NLP). This tutorial covers various text splitting methods and tools, exploring their advantages, disadvantages, and appropriate use cases.\n",
"\n",
"Language models operate within token limits, making it crucial to manage text within these constraints. \n",
"Main approaches to text splitting:\n",
"\n",
"TokenTextSplitter serves as an effective tool for segmenting text into manageable chunks based on token count, ensuring compliance with these limitations.\n",
"1. **Token-based Splitting**\n",
" - Tiktoken: OpenAI's high-performance BPE tokenizer\n",
" - Hugging Face tokenizers: Tokenizers for various pre-trained models\n",
" \n",
"2. **Sentence-based Splitting**\n",
" - SentenceTransformers: Splits text while maintaining semantic coherence\n",
" - NLTK: Natural language processing based sentence and word splitting\n",
" - spaCy: Text splitting utilizing advanced language processing capabilities\n",
"\n",
"3. **Language-specific Tools**\n",
" - KoNLPy: Specialized splitting tool for Korean text processing\n",
"\n",
"Each tool has its unique characteristics and advantages:\n",
"- Tiktoken offers fast processing speed and compatibility with OpenAI models\n",
"- SentenceTransformers provides meaning-based sentence splitting\n",
"- NLTK and spaCy implement linguistic rule-based splitting\n",
"- KoNLPy specializes in Korean morphological analysis and splitting\n",
"\n",
"Through this tutorial, you will understand the characteristics of each tool and learn to choose the most suitable text splitting method for your project.\n",
"\n",
"### Table of Contents\n",
"\n",
@@ -32,6 +51,7 @@
"\n",
"### References\n",
"\n",
"- [LangChain: How to split text by tokens](https://python.langchain.com/docs/how_to/split_by_token/)\n",
"- [Langchain TokenTextSplitter](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.TokenTextSplitter.html)\n",
"----"
]
@@ -231,7 +251,7 @@
" # Set the chunk size to 300.\n",
" chunk_size=300,\n",
" # Ensure there is no overlap between chunks.\n",
" chunk_overlap=0,\n",
" chunk_overlap=50,\n",
")\n",
"# Split the file text into chunks.\n",
"texts = text_splitter.split_text(file)"
@@ -355,7 +375,7 @@
"\n",
"text_splitter = TokenTextSplitter(\n",
" chunk_size=200, # Set the chunk size to 10.\n",
" chunk_overlap=0, # Set the overlap between chunks to 0.\n",
" chunk_overlap=50, # Set the overlap between chunks to 0.\n",
")\n",
"\n",
"# Split the state_of_the_union text into chunks.\n",
@@ -552,8 +572,8 @@
"source": [
"from langchain_text_splitters import SentenceTransformersTokenTextSplitter\n",
"\n",
"# Create a sentence splitter and set the overlap between chunks to 0.\n",
"splitter = SentenceTransformersTokenTextSplitter(chunk_size=200, chunk_overlap=0)"
"# Create a sentence splitter and set the overlap between chunks to 50.\n",
"splitter = SentenceTransformersTokenTextSplitter(chunk_size=200, chunk_overlap=50)"
]
},
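
Before splitting, it can help to see how many tokens the splitter's own SentenceTransformer model assigns to a piece of text. A minimal sketch using the splitter's `count_tokens` method; the `splitter` instance comes from the cell above, and the sample string is an assumption:

```python
# Count tokens as the splitter's underlying SentenceTransformer tokenizer sees them.
sample = "Sentence-based splitting keeps semantically related sentences together."
print(splitter.count_tokens(text=sample))
```
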
{
@@ -690,15 +710,16 @@
},
{
"cell_type": "code",
"execution_count": 26,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt_tab to /Users/teddy/nltk_data...\n",
"[nltk_data] Package punkt_tab is already up-to-date!\n"
"[nltk_data] Downloading package punkt_tab to\n",
"[nltk_data] /Users/ilgyun/nltk_data...\n",
"[nltk_data] Unzipping tokenizers/punkt_tab.zip.\n"
]
},
{
@@ -707,7 +728,7 @@
"True"
]
},
"execution_count": 26,
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
@@ -727,7 +748,7 @@
},
{
"cell_type": "code",
"execution_count": 27,
"execution_count": 3,
"metadata": {},
"outputs": [
{
@@ -736,19 +757,21 @@
"text": [
"Semantic Search\n",
"\n",
"Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\n",
"Example: Vectors of word embeddings can be stored in a database for quick access.\n",
"Related keywords: embedding, database, vectorization, vectorization\n",
"정의: 의미론적 검색은 사용자의 질의를 단순한 키워드 매칭을 넘어서 그 의미를 파악하여 관련된 결과를 반환하는 검색 방식입니다.\n",
"예시: 사용자가 \"태양계 행성\"이라고 검색하면, \"목성\", \"화성\" 등과 같이 관련된 행성에 대한 정보를 반환합니다.\n",
"연관키워드: 자연어 처리, 검색 알고리즘, 데이터 마이닝\n",
"\n",
"Embedding\n",
"\n",
"Definition: Embed\n"
"정의: 임베딩은 단어나 문장 같은 텍스트 데이터를 저차원의 연속적인 벡터로 변환하는 과정입니다. 이를 통해 컴퓨터가 텍스트를 이해하고 처리할 수 있게 합니다.\n",
"예시: \"사과\"라는 단어를 [0.65, -0.23, 0.17]과 같은 벡터로 표현합니다.\n",
"연관키워드: 자연어 처\n"
]
}
],
"source": [
"# Open the data/appendix-keywords.txt file and create a file object named f.\n",
"with open(\"./data/appendix-keywords.txt\") as f:\n",
"with open(\"./data/appendix-keywords_kr.txt\") as f:\n",
" file = (\n",
" f.read()\n",
" ) # Read the contents of the file and store them in the file variable.\n",
@@ -767,15 +790,15 @@
},
{
"cell_type": "code",
"execution_count": 28,
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from langchain_text_splitters import NLTKTextSplitter\n",
"\n",
"text_splitter = NLTKTextSplitter(\n",
" chunk_size=200, # Set the chunk size to 200.\n",
" chunk_overlap=0, # Set the overlap between chunks to 0.\n",
" chunk_overlap=50, # Set the overlap between chunks to 50.\n",
")"
]
},
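
Under the hood, `NLTKTextSplitter` merges sentences produced by NLTK's sentence tokenizer. The sketch below calls that tokenizer directly to show the sentence boundaries the splitter works with; it is a minimal sketch, and the sample text is an assumption (the `punkt_tab` resource downloaded earlier powers it):

```python
from nltk.tokenize import sent_tokenize

sample = (
    "Semantic search goes beyond keyword matching. "
    "It interprets the meaning of a query. "
    "Results are ranked by relevance."
)
# Each element is one sentence, as detected by the punkt model.
for sentence in sent_tokenize(sample):
    print(sentence)
```
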
@@ -788,43 +811,18 @@
},
{
"cell_type": "code",
"execution_count": 29,
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Created a chunk of size 215, which is longer than the specified 200\n",
"Created a chunk of size 240, which is longer than the specified 200\n",
"Created a chunk of size 225, which is longer than the specified 200\n",
"Created a chunk of size 211, which is longer than the specified 200\n",
"Created a chunk of size 231, which is longer than the specified 200\n",
"Created a chunk of size 222, which is longer than the specified 200\n",
"Created a chunk of size 203, which is longer than the specified 200\n",
"Created a chunk of size 280, which is longer than the specified 200\n",
"Created a chunk of size 230, which is longer than the specified 200\n",
"Created a chunk of size 213, which is longer than the specified 200\n",
"Created a chunk of size 219, which is longer than the specified 200\n",
"Created a chunk of size 213, which is longer than the specified 200\n",
"Created a chunk of size 214, which is longer than the specified 200\n",
"Created a chunk of size 203, which is longer than the specified 200\n",
"Created a chunk of size 211, which is longer than the specified 200\n",
"Created a chunk of size 224, which is longer than the specified 200\n",
"Created a chunk of size 218, which is longer than the specified 200\n",
"Created a chunk of size 230, which is longer than the specified 200\n",
"Created a chunk of size 219, which is longer than the specified 200\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Semantic Search\n",
"\n",
"Definition: A vector store is a system that stores data converted to vector format.\n",
"정의: 의미론적 검색은 사용자의 질의를 단순한 키워드 매칭을 넘어서 그 의미를 파악하여 관련된 결과를 반환하는 검색 방식입니다.\n",
"\n",
"It is used for search, classification, and other data analysis tasks.\n"
"예시: 사용자가 \"태양계 행성\"이라고 검색하면, \"목성\", \"화성\" 등과 같이 관련된 행성에 대한 정보를 반환합니다.\n"
]
}
],
@@ -1093,9 +1091,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 2
}