From 9583e3a4b65f61cbc54a5019f9bcd65fd6b8f0ed Mon Sep 17 00:00:00 2001 From: greencode Date: Mon, 6 Jan 2025 20:37:08 +0900 Subject: [PATCH] [N-2] 07-Text Splitter / 05-CodeSplitter MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit [N-2] 07-Text Splitter / 05-CodeSplitter Made the fixes for ISSUE 07-05-CodeSplitter #36. --- 07-TextSplitter/05-CodeSplitter.ipynb | 318 +++++++++++++------------- 1 file changed, 158 insertions(+), 160 deletions(-) diff --git a/07-TextSplitter/05-CodeSplitter.ipynb b/07-TextSplitter/05-CodeSplitter.ipynb index 38ab8737c..ef3b461ca 100644 --- a/07-TextSplitter/05-CodeSplitter.ipynb +++ b/07-TextSplitter/05-CodeSplitter.ipynb @@ -6,19 +6,20 @@ "source": [ "# Split code with Langchain\n", "\n", - "- Author: [greencode](https://github.com/greencode-99)\n", + "- Author: [Jongcheol Kim](https://github.com/greencode-99)\n", "- Design: \n", + "- Peer Review:\n", "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n", "\n", "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [![Open in LangChain Academy](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/66e9eba12c7b7688aa3dbb5e_LCA-badge-green.svg)](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)\n", "\n", "## Overview\n", "\n", - "`RecursiveCharacterTextSplitter` includes pre-built lists of separators that are useful for splitting text in a specific programming language.\n", + "`RecursiveCharacterTextSplitter` includes pre-built separator lists optimized for splitting text in different programming languages.\n", "\n", - "You can split code written in various programming languages using `CodeTextSplitter`.\n", + "The `CodeTextSplitter` provides even more specialized functionality for splitting code.\n", 
"\n", - "To do this, import the `Language` enum and specify the corresponding programming language.\n", + "To use it, import the `Language` enum (enumeration) and specify the desired programming language.\n", "\n", "\n", "### Table of Contents\n", @@ -27,8 +28,8 @@ "- [Environment Setup](#environment-setup)\n", "- [Code Spliter Examples](#code-splitter-examples)\n", " - [Python](#python)\n", - " - [JS](#js)\n", - " - [TS](#ts)\n", + " - [JavaScript](#javascript)\n", + " - [TypeScript](#typescript)\n", " - [Markdown](#markdown)\n", " - [LaTeX](#latex)\n", " - [HTML](#html)\n", @@ -52,13 +53,13 @@ "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n", "\n", "**[Note]**\n", - "- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n", + "- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.\n", "- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details." 
] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -68,7 +69,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -86,7 +87,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 3, "metadata": {}, "outputs": [ { @@ -114,7 +115,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -123,7 +124,7 @@ "True" ] }, - "execution_count": 5, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -140,7 +141,7 @@ "source": [ "## Code Splitter Examples\n", "\n", - "Here is an example of splitting text using `RecursiveCharacterTextSplitter`.\n", + "Here is an example of splitting text using the `RecursiveCharacterTextSplitter`.\n", "\n", "- Import the `Language` and `RecursiveCharacterTextSplitter` classes from the `langchain_text_splitters` module.\n", "- `RecursiveCharacterTextSplitter` is a text splitter that recursively splits text at the character level." @@ -148,7 +149,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -164,14 +165,14 @@ "source": [ "Supported languages are stored in the langchain_text_splitters.Language enum. 
\n", "\n", - "API Reference: [Language](https://python.langchain.com/docs/api_reference/text_splitters/Language) | [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/api_reference/text_splitters/RecursiveCharacterTextSplitter)\n", + "API Reference: [Language](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.Language.html#language) | [RecursiveCharacterTextSplitter](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#recursivecharactertextsplitter)\n", "\n", - "Below is the full list of supported languages." + "See below for the full list of supported languages." ] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -201,10 +202,11 @@ " 'lua',\n", " 'perl',\n", " 'haskell',\n", - " 'elixir']" + " 'elixir',\n", + " 'powershell']" ] }, - "execution_count": 10, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -218,14 +220,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "You can use the `get_separators_for_language` method of the `RecursiveCharacterTextSplitter` class to check the separators used for a specific language.\n", + "You can use the `get_separators_for_language` method of the `RecursiveCharacterTextSplitter` class to see the separators used for a given language.\n", "\n", - "- In the example, the `Language.PYTHON` enum value is passed as an argument to check the separators used for the Python language." 
+ "- For example, passing `Language.PYTHON` retrieves the separators used for Python:" ] }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 7, "metadata": {}, "outputs": [ { @@ -234,7 +236,7 @@ "['\\nclass ', '\\ndef ', '\\n\\tdef ', '\\n\\n', '\\n', ' ', '']" ] }, - "execution_count": 11, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } @@ -250,25 +252,25 @@ "source": [ "### Python\n", "\n", - "Use `RecursiveCharacterTextSplitter` to split Python code into document units.\n", - "- Specify `Language.PYTHON` as the `language` parameter to use the Python language.\n", - "- Set `chunk_size` to 50 to limit the maximum size of each document.\n", - "- Set `chunk_overlap` to 0 to disallow overlap between documents." + "Here's how to split Python code into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.PYTHON` for the `language` parameter. It tells the splitter you're working with Python code.\n", + "- Then, set `chunk_size` to 50. This limits the size of each resulting chunk to a maximum of 50 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping." 
] }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[Document(page_content='def hello_world():\\n print(\"Hello, World!\")'),\n", - " Document(page_content='hello_world()')]" + "[Document(metadata={}, page_content='def hello_world():\\n print(\"Hello, World!\")'),\n", + " Document(metadata={}, page_content='hello_world()')]" ] }, - "execution_count": 13, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } @@ -292,7 +294,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ -318,27 +320,27 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### JS\n", + "### JavaScript\n", "\n", - "Here is an example of using the JS text splitter.\n", - "- Specify `Language.JS` as the `language` parameter to use the JavaScript language.\n", - "- Set `chunk_size` to 60 to limit the maximum size of each document.\n", - "- Set `chunk_overlap` to 0 to disallow overlap between documents.\n" + "Here's how to split JavaScript code into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.JS` for the `language` parameter. It tells the splitter you're working with JavaScript code.\n", + "- Then, set `chunk_size` to 60. This limits the size of each resulting chunk to a maximum of 60 characters.\n", + "- Finally, set `chunk_overlap` to 0. 
It prevents any of the chunks from overlapping.\n" ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[Document(page_content='function helloWorld() {\\n console.log(\"Hello, World!\");\\n}'),\n", - " Document(page_content='helloWorld();')]" + "[Document(metadata={}, page_content='function helloWorld() {\\n console.log(\"Hello, World!\");\\n}'),\n", + " Document(metadata={}, page_content='helloWorld();')]" ] }, - "execution_count": 12, + "execution_count": 10, "metadata": {}, "output_type": "execute_result" } @@ -365,28 +367,28 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### TS \n", + "### TypeScript\n", "\n", - "Here is an example of using the TS text splitter.\n", - "- Specify `Language.TS` as the `language` parameter to use the TypeScript language.\n", - "- Set `chunk_size` to 60 to limit the maximum size of each document.\n", - "- Set `chunk_overlap` to 0 to disallow overlap between documents.\n" + "Here's how to split TypeScript code into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.TS` for the `language` parameter. It tells the splitter you're working with TypeScript code.\n", + "- Then, set `chunk_size` to 60. This limits the size of each resulting chunk to a maximum of 60 characters.\n", + "- Finally, set `chunk_overlap` to 0. 
It prevents any of the chunks from overlapping.\n" ] }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[Document(page_content='function helloWorld(): void {'),\n", - " Document(page_content='console.log(\"Hello, World!\");\\n}'),\n", - " Document(page_content='helloWorld();')]" + "[Document(metadata={}, page_content='function helloWorld(): void {'),\n", + " Document(metadata={}, page_content='console.log(\"Hello, World!\");\\n}'),\n", + " Document(metadata={}, page_content='helloWorld();')]" ] }, - "execution_count": 15, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } @@ -415,31 +417,31 @@ "source": [ "### Markdown\n", "\n", - "Here is an example of using the Markdown text splitter.\n", + "Here's how to split Markdown text into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", "\n", - "- Specify `Language.MARKDOWN` as the `language` parameter to use the Markdown language.\n", - "- Set `chunk_size` to 60 to limit the maximum size of each document.\n", - "- Set `chunk_overlap` to 0 to disallow overlap between documents." + "- First, specify `Language.MARKDOWN` for the `language` parameter. It tells the splitter you're working with Markdown text.\n", + "- Then, set `chunk_size` to 60. This limits the size of each resulting chunk to a maximum of 60 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping." 
] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[Document(page_content='# 🦜️🔗 LangChain'),\n", - " Document(page_content='⚡ Building applications with LLMs through composability ⚡'),\n", - " Document(page_content='## What is LangChain?'),\n", - " Document(page_content=\"# Hopefully this code block isn't split\"),\n", - " Document(page_content='LangChain is a framework for...'),\n", - " Document(page_content='As an open-source project in a rapidly developing field, we'),\n", - " Document(page_content='are extremely open to contributions.')]" + "[Document(metadata={}, page_content='# 🦜️🔗 LangChain'),\n", + " Document(metadata={}, page_content='⚡ Building applications with LLMs through composability ⚡'),\n", + " Document(metadata={}, page_content='## What is LangChain?'),\n", + " Document(metadata={}, page_content=\"# Hopefully this code block isn't split\"),\n", + " Document(metadata={}, page_content='LangChain is a framework for...'),\n", + " Document(metadata={}, page_content='As an open-source project in a rapidly developing field, we'),\n", + " Document(metadata={}, page_content='are extremely open to contributions.')]" ] }, - "execution_count": 14, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } @@ -476,43 +478,43 @@ "\n", "LaTeX is a markup language for document creation, widely used for representing mathematical symbols and formulas.\n", "\n", - "Here is an example of LaTeX text.\n", - "- Specify `Language.LATEX` as the `language` parameter to use the LaTeX language.\n", - "- Set `chunk_size` to 60 to limit the maximum size of each document.\n", - "- Set `chunk_overlap` to 0 to disallow overlap between documents.\n" + "Here's how to split LaTeX text into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.LATEX` for the `language` parameter. 
It tells the splitter you're working with LaTeX text.\n", + "- Then, set `chunk_size` to 60. This limits the size of each resulting chunk to a maximum of 60 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping." ] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[Document(page_content='\\\\documentclass{article}\\n\\n\\x08egin{document}\\n\\n\\\\maketitle'),\n", - " Document(page_content='\\\\section{Introduction}\\nLarge language models (LLMs) are a'),\n", - " Document(page_content='type of machine learning model that can be trained on vast'),\n", - " Document(page_content='amounts of text data to generate human-like language. In'),\n", - " Document(page_content='recent years, LLMs have made significant advances in a'),\n", - " Document(page_content='variety of natural language processing tasks, including'),\n", - " Document(page_content='language translation, text generation, and sentiment'),\n", - " Document(page_content='analysis.'),\n", - " Document(page_content='\\\\subsection{History of LLMs}\\nThe earliest LLMs were'),\n", - " Document(page_content='developed in the 1980s and 1990s, but they were limited by'),\n", - " Document(page_content='the amount of data that could be processed and the'),\n", - " Document(page_content='computational power available at the time. In the past'),\n", - " Document(page_content='decade, however, advances in hardware and software have'),\n", - " Document(page_content='made it possible to train LLMs on massive datasets, leading'),\n", - " Document(page_content='to significant improvements in performance.'),\n", - " Document(page_content='\\\\subsection{Applications of LLMs}\\nLLMs have many'),\n", - " Document(page_content='applications in industry, including chatbots, content'),\n", - " Document(page_content='creation, and virtual assistants. 
They can also be used in'),\n", - " Document(page_content='academia for research in linguistics, psychology, and'),\n", - " Document(page_content='computational linguistics.\\n\\n\\\\end{document}')]" + "[Document(metadata={}, page_content='\\\\documentclass{article}\\n\\n\\x08egin{document}\\n\\n\\\\maketitle'),\n", + " Document(metadata={}, page_content='\\\\section{Introduction}\\nLarge language models (LLMs) are a'),\n", + " Document(metadata={}, page_content='type of machine learning model that can be trained on vast'),\n", + " Document(metadata={}, page_content='amounts of text data to generate human-like language. In'),\n", + " Document(metadata={}, page_content='recent years, LLMs have made significant advances in a'),\n", + " Document(metadata={}, page_content='variety of natural language processing tasks, including'),\n", + " Document(metadata={}, page_content='language translation, text generation, and sentiment'),\n", + " Document(metadata={}, page_content='analysis.'),\n", + " Document(metadata={}, page_content='\\\\subsection{History of LLMs}\\nThe earliest LLMs were'),\n", + " Document(metadata={}, page_content='developed in the 1980s and 1990s, but they were limited by'),\n", + " Document(metadata={}, page_content='the amount of data that could be processed and the'),\n", + " Document(metadata={}, page_content='computational power available at the time. In the past'),\n", + " Document(metadata={}, page_content='decade, however, advances in hardware and software have'),\n", + " Document(metadata={}, page_content='made it possible to train LLMs on massive datasets, leading'),\n", + " Document(metadata={}, page_content='to significant improvements in performance.'),\n", + " Document(metadata={}, page_content='\\\\subsection{Applications of LLMs}\\nLLMs have many'),\n", + " Document(metadata={}, page_content='applications in industry, including chatbots, content'),\n", + " Document(metadata={}, page_content='creation, and virtual assistants. 
They can also be used in'),\n", + " Document(metadata={}, page_content='academia for research in linguistics, psychology, and'),\n", + " Document(metadata={}, page_content='computational linguistics.\\n\\n\\\\end{document}')]" ] }, - "execution_count": 16, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" } @@ -553,36 +555,36 @@ "source": [ "### HTML\n", "\n", - "Here is an example of using the HTML text splitter.\n", - "- Specify `Language.HTML` as the `language` parameter to use the HTML language.\n", - "- Set `chunk_size` to 60 to limit the maximum size of each document.\n", - "- Set `chunk_overlap` to 0 to disallow overlap between documents.\n" + "Here's how to split HTML text into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.HTML` for the `language` parameter. It tells the splitter you're working with HTML.\n", + "- Then, set `chunk_size` to 60. This limits the size of each resulting chunk to a maximum of 60 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping.\n" ] }, { "cell_type": "code", - "execution_count": 51, + "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "[Document(page_content='\\n'),\n", - " Document(page_content='\\n Codestin Search App'),\n", - " Document(page_content='\\n