Merged (changes from all commits)
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# TokenTextSplitter\n",
" # Text Splitting Methods in NLP\n",
"\n",
"- Author: [Ilgyun Jeong](https://github.com/johnny9210)\n",
"- Peer Review : [JoonHo Kim](https://github.com/jhboyo), [Sunyoung Park (architectyou)](https://github.com/Architectyou)\n",
@@ -13,10 +13,29 @@
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/03-TokenTextSplitter.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/03-TokenTextSplitter.ipynb)\n",
"\n",
"## Overview\n",
"Text splitting is a crucial preprocessing step in Natural Language Processing (NLP). This tutorial covers various text splitting methods and tools, exploring their advantages, disadvantages, and appropriate use cases.\n",
"\n",
"Language models operate within token limits, making it crucial to manage text within these constraints. \n",
"Main approaches to text splitting:\n",
"\n",
"TokenTextSplitter serves as an effective tool for segmenting text into manageable chunks based on token count, ensuring compliance with these limitations.\n",
"1. **Token-based Splitting**\n",
" - Tiktoken: OpenAI's high-performance BPE tokenizer\n",
" - Hugging Face tokenizers: Tokenizers for various pre-trained models\n",
" \n",
"2. **Sentence-based Splitting**\n",
" - SentenceTransformers: Splits text while maintaining semantic coherence\n",
" - NLTK: Natural language processing based sentence and word splitting\n",
" - spaCy: Text splitting utilizing advanced language processing capabilities\n",
"\n",
"3. **Language-specific Tools**\n",
" - KoNLPy: Specialized splitting tool for Korean text processing\n",
"\n",
"Each tool has its unique characteristics and advantages:\n",
"- Tiktoken offers fast processing speed and compatibility with OpenAI models\n",
"- SentenceTransformers provides meaning-based sentence splitting\n",
"- NLTK and spaCy implement linguistic rule-based splitting\n",
"- KoNLPy specializes in Korean morphological analysis and splitting\n",
"\n",
"Through this tutorial, you will understand the characteristics of each tool and learn to choose the most suitable text splitting method for your project.\n",
"\n",
"### Table of Contents\n",
"\n",
@@ -32,6 +51,7 @@
"\n",
"### References\n",
"\n",
"- [LangChain: How to split text by tokens](https://python.langchain.com/docs/how_to/split_by_token/)\n",
"- [Langchain TokenTextSplitter](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.TokenTextSplitter.html)\n",
"----"
]
@@ -231,7 +251,7 @@
" # Set the chunk size to 300.\n",
" chunk_size=300,\n",
" # Ensure there is no overlap between chunks.\n",
" chunk_overlap=0,\n",
" chunk_overlap=50,\n",
")\n",
"# Split the file text into chunks.\n",
"texts = text_splitter.split_text(file)"
@@ -355,7 +375,7 @@
"\n",
"text_splitter = TokenTextSplitter(\n",
" chunk_size=200, # Set the chunk size to 10.\n",
" chunk_overlap=0, # Set the overlap between chunks to 0.\n",
" chunk_overlap=50, # Set the overlap between chunks to 0.\n",
")\n",
"\n",
"# Split the state_of_the_union text into chunks.\n",
@@ -552,8 +572,8 @@
"source": [
"from langchain_text_splitters import SentenceTransformersTokenTextSplitter\n",
"\n",
"# Create a sentence splitter and set the overlap between chunks to 0.\n",
"splitter = SentenceTransformersTokenTextSplitter(chunk_size=200, chunk_overlap=0)"
"# Create a sentence splitter and set the overlap between chunks to 50.\n",
"splitter = SentenceTransformersTokenTextSplitter(chunk_size=200, chunk_overlap=50)"
]
},
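
Before splitting, it can help to see how many tokens the splitter's own SentenceTransformer model assigns to a piece of text. A minimal sketch using the splitter's `count_tokens` method; the `splitter` instance comes from the cell above, and the sample string is an assumption:

```python
# Count tokens as the splitter's underlying SentenceTransformer tokenizer sees them.
sample = "Sentence-based splitting keeps semantically related sentences together."
print(splitter.count_tokens(text=sample))
```
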
{
@@ -690,15 +710,16 @@
},
{
"cell_type": "code",
"execution_count": 26,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt_tab to /Users/teddy/nltk_data...\n",
"[nltk_data] Package punkt_tab is already up-to-date!\n"
"[nltk_data] Downloading package punkt_tab to\n",
"[nltk_data] /Users/ilgyun/nltk_data...\n",
"[nltk_data] Unzipping tokenizers/punkt_tab.zip.\n"
]
},
{
@@ -707,7 +728,7 @@
"True"
]
},
"execution_count": 26,
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
@@ -727,7 +748,7 @@
},
{
"cell_type": "code",
"execution_count": 27,
"execution_count": 3,
"metadata": {},
"outputs": [
{
@@ -736,19 +757,21 @@
"text": [
"Semantic Search\n",
"\n",
"Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\n",
"Example: Vectors of word embeddings can be stored in a database for quick access.\n",
"Related keywords: embedding, database, vectorization, vectorization\n",
"정의: 의미론적 검색은 사용자의 질의를 단순한 키워드 매칭을 넘어서 그 의미를 파악하여 관련된 결과를 반환하는 검색 방식입니다.\n",
"예시: 사용자가 \"태양계 행성\"이라고 검색하면, \"목성\", \"화성\" 등과 같이 관련된 행성에 대한 정보를 반환합니다.\n",
"연관키워드: 자연어 처리, 검색 알고리즘, 데이터 마이닝\n",
"\n",
"Embedding\n",
"\n",
"Definition: Embed\n"
"정의: 임베딩은 단어나 문장 같은 텍스트 데이터를 저차원의 연속적인 벡터로 변환하는 과정입니다. 이를 통해 컴퓨터가 텍스트를 이해하고 처리할 수 있게 합니다.\n",
"예시: \"사과\"라는 단어를 [0.65, -0.23, 0.17]과 같은 벡터로 표현합니다.\n",
"연관키워드: 자연어 처\n"
]
}
],
"source": [
"# Open the data/appendix-keywords.txt file and create a file object named f.\n",
"with open(\"./data/appendix-keywords.txt\") as f:\n",
"with open(\"./data/appendix-keywords_kr.txt\") as f:\n",
" file = (\n",
" f.read()\n",
" ) # Read the contents of the file and store them in the file variable.\n",
@@ -767,15 +790,15 @@
},
{
"cell_type": "code",
"execution_count": 28,
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from langchain_text_splitters import NLTKTextSplitter\n",
"\n",
"text_splitter = NLTKTextSplitter(\n",
" chunk_size=200, # Set the chunk size to 200.\n",
" chunk_overlap=0, # Set the overlap between chunks to 0.\n",
" chunk_overlap=50, # Set the overlap between chunks to 50.\n",
")"
]
},
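
Under the hood, `NLTKTextSplitter` merges sentences produced by NLTK's sentence tokenizer. The sketch below calls that tokenizer directly to show the sentence boundaries the splitter works with; it is a minimal sketch, and the sample text is an assumption (the `punkt_tab` resource downloaded earlier powers it):

```python
from nltk.tokenize import sent_tokenize

sample = (
    "Semantic search goes beyond keyword matching. "
    "It interprets the meaning of a query. "
    "Results are ranked by relevance."
)
# Each element is one sentence, as detected by the punkt model.
for sentence in sent_tokenize(sample):
    print(sentence)
```
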
@@ -788,43 +811,18 @@
},
{
"cell_type": "code",
"execution_count": 29,
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Created a chunk of size 215, which is longer than the specified 200\n",
"Created a chunk of size 240, which is longer than the specified 200\n",
"Created a chunk of size 225, which is longer than the specified 200\n",
"Created a chunk of size 211, which is longer than the specified 200\n",
"Created a chunk of size 231, which is longer than the specified 200\n",
"Created a chunk of size 222, which is longer than the specified 200\n",
"Created a chunk of size 203, which is longer than the specified 200\n",
"Created a chunk of size 280, which is longer than the specified 200\n",
"Created a chunk of size 230, which is longer than the specified 200\n",
"Created a chunk of size 213, which is longer than the specified 200\n",
"Created a chunk of size 219, which is longer than the specified 200\n",
"Created a chunk of size 213, which is longer than the specified 200\n",
"Created a chunk of size 214, which is longer than the specified 200\n",
"Created a chunk of size 203, which is longer than the specified 200\n",
"Created a chunk of size 211, which is longer than the specified 200\n",
"Created a chunk of size 224, which is longer than the specified 200\n",
"Created a chunk of size 218, which is longer than the specified 200\n",
"Created a chunk of size 230, which is longer than the specified 200\n",
"Created a chunk of size 219, which is longer than the specified 200\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Semantic Search\n",
"\n",
"Definition: A vector store is a system that stores data converted to vector format.\n",
"정의: 의미론적 검색은 사용자의 질의를 단순한 키워드 매칭을 넘어서 그 의미를 파악하여 관련된 결과를 반환하는 검색 방식입니다.\n",
"\n",
"It is used for search, classification, and other data analysis tasks.\n"
"예시: 사용자가 \"태양계 행성\"이라고 검색하면, \"목성\", \"화성\" 등과 같이 관련된 행성에 대한 정보를 반환합니다.\n"
]
}
],
@@ -1093,9 +1091,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 2
}