From 0ec8c192ce3038fb65a4645c390f0f6e9726cb56 Mon Sep 17 00:00:00 2001
From: johnny9210
Date: Mon, 20 Jan 2025 00:20:08 +0900
Subject: [PATCH] =?UTF-8?q?=EA=B2=80=EC=88=98=20=ED=94=BC=EB=93=9C?=
 =?UTF-8?q?=EB=B0=B1=20=EB=B0=98=EC=98=81?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 ...r.ipynb => 03-TextSplittingMethods_.ipynb} | 96 +++++++++----------
 1 file changed, 47 insertions(+), 49 deletions(-)
 rename 07-TextSplitter/{03-TokenTextSplitter.ipynb => 03-TextSplittingMethods_.ipynb} (93%)

diff --git a/07-TextSplitter/03-TokenTextSplitter.ipynb b/07-TextSplitter/03-TextSplittingMethods_.ipynb
similarity index 93%
rename from 07-TextSplitter/03-TokenTextSplitter.ipynb
rename to 07-TextSplitter/03-TextSplittingMethods_.ipynb
index e502bba66..40dae81a2 100644
--- a/07-TextSplitter/03-TokenTextSplitter.ipynb
+++ b/07-TextSplitter/03-TextSplittingMethods_.ipynb
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "# TokenTextSplitter\n",
+ "# Text Splitting Methods in NLP\n",
 "\n",
 "- Author: [Ilgyun Jeong](https://github.com/johnny9210)\n",
 "- Peer Review : [JoonHo Kim](https://github.com/jhboyo), [Sunyoung Park (architectyou)](https://github.com/Architectyou)\n",
@@ -13,10 +13,29 @@
 "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/03-TokenTextSplitter.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/03-TokenTextSplitter.ipynb)\n",
 "\n",
 "## Overview\n",
+ "Text splitting is a crucial preprocessing step in Natural Language Processing (NLP). This tutorial covers various text splitting methods and tools, exploring their advantages, disadvantages, and appropriate use cases.\n",
 "\n",
- "Language models operate within token limits, making it crucial to manage text within these constraints. \n",
+ "Main approaches to text splitting:\n",
 "\n",
- "TokenTextSplitter serves as an effective tool for segmenting text into manageable chunks based on token count, ensuring compliance with these limitations.\n",
+ "1. **Token-based Splitting**\n",
+ " - Tiktoken: OpenAI's high-performance BPE tokenizer\n",
+ " - Hugging Face tokenizers: Tokenizers for various pre-trained models\n",
+ " \n",
+ "2. **Sentence-based Splitting**\n",
+ " - SentenceTransformers: Splits text while maintaining semantic coherence\n",
+ " - NLTK: Natural language processing-based sentence and word splitting\n",
+ " - spaCy: Text splitting utilizing advanced language processing capabilities\n",
 "\n",
+ "3. **Language-specific Tools**\n",
+ " - KoNLPy: Specialized splitting tool for Korean text processing\n",
+ "\n",
+ "Each tool has its unique characteristics and advantages:\n",
+ "- Tiktoken offers fast processing speed and compatibility with OpenAI models\n",
+ "- SentenceTransformers provides meaning-based sentence splitting\n",
+ "- NLTK and spaCy implement linguistic rule-based splitting\n",
+ "- KoNLPy specializes in Korean morphological analysis and splitting\n",
+ "\n",
+ "Through this tutorial, you will understand the characteristics of each tool and learn to choose the most suitable text splitting method for your project.\n",
+ "\n",
 "### Table of Contents\n",
 "\n",
@@ -32,6 +51,7 @@
 "\n",
 "### References\n",
 "\n",
+ "- [LangChain: How to split text by tokens](https://python.langchain.com/docs/how_to/split_by_token/)\n",
 "- [Langchain TokenTextSplitter](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.TokenTextSplitter.html)\n",
 "----"
 ]
@@ -231,7 +251,7 @@
 " # Set the chunk size to 300.\n",
 " chunk_size=300,\n",
- " # Ensure there is no overlap between chunks.\n",
- " chunk_overlap=0,\n",
+ " # Set the overlap between chunks to 50.\n",
+ " chunk_overlap=50,\n",
 ")\n",
 "# Split the file text into chunks.\n",
 "texts = text_splitter.split_text(file)"
@@ -355,7 +375,7 @@
 "\n",
 "text_splitter = TokenTextSplitter(\n",
 " chunk_size=200, # Set the chunk size to 10.\n",
- " chunk_overlap=0, # Set the overlap between chunks to 0.\n",
+ " chunk_overlap=50, # Set the overlap between chunks to 50.\n",
 ")\n",
 "\n",
 "# Split the state_of_the_union text into chunks.\n",
@@ -552,8 +572,8 @@
 "source": [
 "from langchain_text_splitters import SentenceTransformersTokenTextSplitter\n",
 "\n",
- "# Create a sentence splitter and set the overlap between chunks to 0.\n",
- "splitter = SentenceTransformersTokenTextSplitter(chunk_size=200, chunk_overlap=0)"
+ "# Create a sentence splitter and set the overlap between chunks to 50.\n",
+ "splitter = SentenceTransformersTokenTextSplitter(chunk_size=200, chunk_overlap=50)"
 ]
 },
 {
@@ -690,15 +710,16 @@
 },
 {
 "cell_type": "code",
- "execution_count": 26,
+ "execution_count": 1,
 "metadata": {},
 "outputs": [
 {
 "name": "stderr",
 "output_type": "stream",
 "text": [
- "[nltk_data] Downloading package punkt_tab to /Users/teddy/nltk_data...\n",
- "[nltk_data] Package punkt_tab is already up-to-date!\n"
+ "[nltk_data] Downloading package punkt_tab to\n",
+ "[nltk_data] /Users/ilgyun/nltk_data...\n",
+ "[nltk_data] Unzipping tokenizers/punkt_tab.zip.\n"
 ]
 },
 {
@@ -707,7 +728,7 @@
 "True"
 ]
 },
- "execution_count": 26,
+ "execution_count": 1,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -727,7 +748,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 27,
+ "execution_count": 3,
 "metadata": {},
 "outputs": [
 {
@@ -736,19 +757,21 @@
 "text": [
 "Semantic Search\n",
 "\n",
- "Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\n",
- "Example: Vectors of word embeddings can be stored in a database for quick access.\n",
- "Related keywords: embedding, database, vectorization, vectorization\n",
+ "정의: 의미론적 검색은 사용자의 질의를 단순한 키워드 매칭을 넘어서 그 의미를 파악하여 관련된 결과를 반환하는 검색 방식입니다.\n",
+ "예시: 사용자가 \"태양계 행성\"이라고 검색하면, \"목성\", \"화성\" 등과 같이 관련된 행성에 대한 정보를 반환합니다.\n",
+ "연관키워드: 자연어 처리, 검색 알고리즘, 데이터 마이닝\n",
 "\n",
 "Embedding\n",
 "\n",
- "Definition: Embed\n"
+ "정의: 임베딩은 단어나 문장 같은 텍스트 데이터를 저차원의 연속적인 벡터로 변환하는 과정입니다. 이를 통해 컴퓨터가 텍스트를 이해하고 처리할 수 있게 합니다.\n",
+ "예시: \"사과\"라는 단어를 [0.65, -0.23, 0.17]과 같은 벡터로 표현합니다.\n",
+ "연관키워드: 자연어 처\n"
 ]
 }
 ],
 "source": [
 "# Open the data/appendix-keywords.txt file and create a file object named f.\n",
- "with open(\"./data/appendix-keywords.txt\") as f:\n",
+ "with open(\"./data/appendix-keywords_kr.txt\") as f:\n",
 " file = (\n",
 " f.read()\n",
 " ) # Read the contents of the file and store them in the file variable.\n",
@@ -767,7 +790,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 28,
+ "execution_count": 4,
 "metadata": {},
 "outputs": [],
 "source": [
 "from langchain_text_splitters import NLTKTextSplitter\n",
 "\n",
 "text_splitter = NLTKTextSplitter(\n",
 " chunk_size=200, # Set the chunk size to 200.\n",
- " chunk_overlap=0, # Set the overlap between chunks to 0.\n",
+ " chunk_overlap=50, # Set the overlap between chunks to 50.\n",
 ")\n"
 ]
 },
@@ -788,43 +811,18 @@
 },
 {
 "cell_type": "code",
- "execution_count": 29,
+ "execution_count": 5,
 "metadata": {},
 "outputs": [
-{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "Created a chunk of size 215, which is longer than the specified 200\n",
- "Created a chunk of size 240, which is longer than the specified 200\n",
- "Created a chunk of size 225, which is longer than the specified 200\n",
- "Created a chunk of size 211, which is longer than the specified 200\n",
- "Created a chunk of size 231, which is longer than the specified 200\n",
- "Created a chunk of size 222, which is longer than the specified 200\n",
- "Created a chunk of size 203, which is longer than the specified 200\n",
- "Created a chunk of size 280, which is longer than the specified 200\n",
- "Created a chunk of size 230, which is longer than the specified 200\n",
- "Created a chunk of size 213, which is longer than the specified 200\n",
- "Created a chunk of size 219, which is longer than the specified 200\n",
- "Created a chunk of size 213, which is longer than the specified 200\n",
- "Created a chunk of size 214, which is longer than the specified 200\n",
- "Created a chunk of size 203, which is longer than the specified 200\n",
- "Created a chunk of size 211, which is longer than the specified 200\n",
- "Created a chunk of size 224, which is longer than the specified 200\n",
- "Created a chunk of size 218, which is longer than the specified 200\n",
- "Created a chunk of size 230, which is longer than the specified 200\n",
- "Created a chunk of size 219, which is longer than the specified 200\n"
- ]
-},
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
 "Semantic Search\n",
 "\n",
- "Definition: A vector store is a system that stores data converted to vector format.\n",
+ "정의: 의미론적 검색은 사용자의 질의를 단순한 키워드 매칭을 넘어서 그 의미를 파악하여 관련된 결과를 반환하는 검색 방식입니다.\n",
 "\n",
- "It is used for search, classification, and other data analysis tasks.\n"
+ "예시: 사용자가 \"태양계 행성\"이라고 검색하면, \"목성\", \"화성\" 등과 같이 관련된 행성에 대한 정보를 반환합니다.\n"
 ]
 }
 ],
@@ -1093,9 +1091,9 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
- "version": "3.11.10"
+ "version": "3.10.15"
 }
 },
 "nbformat": 4,
 "nbformat_minor": 2
-}
\ No newline at end of file
+}