diff --git a/07-TextSplitter/05-CodeSplitter.ipynb b/07-TextSplitter/05-CodeSplitter.ipynb index 2f282adef..c30827584 100644 --- a/07-TextSplitter/05-CodeSplitter.ipynb +++ b/07-TextSplitter/05-CodeSplitter.ipynb @@ -1,880 +1,877 @@ { "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Split code with Langchain\n", - "\n", - "- Author: [greencode](https://github.com/greencode-99)\n", - "- Design: \n", - "- Peer Review : [Teddy Lee](https://github.com/teddylee777), [heewung song](https://github.com/kofsitho87), [Teddy Lee](https://github.com/teddylee777)\n", - "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n", - "\n", - "[](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/05-CodeSplitter.ipynb) [](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/05-CodeSplitter.ipynb)\n", - "\n", - "## Overview\n", - "\n", - "`RecursiveCharacterTextSplitter` includes pre-built lists of separators that are useful for splitting text in a specific programming language.\n", - "\n", - "You can split code written in various programming languages using `CodeTextSplitter`.\n", - "\n", - "To do this, import the `Language` enum and specify the corresponding programming language.\n", - "\n", - "\n", - "### Table of Contents\n", - "\n", - "- [Overview](#Overview)\n", - "- [Environment Setup](#environment-setup)\n", - "- [Code Spliter Examples](#code-splitter-examples)\n", - " - [Python](#python)\n", - " - [JS](#js)\n", - " - [TS](#ts)\n", - " - [Markdown](#markdown)\n", - " - [LaTeX](#latex)\n", - " - [HTML](#html)\n", - " - [Solidity](#solidity)\n", - " - [C#](#c)\n", - " - [PHP](#php)\n", - " - [Kotlin](#kotlin)\n", - "\n", - "\n", - "### References\n", - "- [How to split code](https://python.langchain.com/docs/how_to/code_splitter/)\n", - "----" - ] - }, - { - "cell_type": "markdown", - 
"metadata": {}, - "source": [ - "## Environment Setup\n", - "\n", - "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n", - "\n", - "**[Note]**\n", - "- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n", - "- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "%%capture --no-stderr\n", - "%pip install langchain-opentutorial" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "# Install required packages\n", - "from langchain_opentutorial import package\n", - "\n", - "package.install(\n", - " [\n", - " \"langchain_text_splitters\",\n", - " ],\n", - " verbose=False,\n", - " upgrade=False,\n", - ")" - ] - }, - { - "cell_type": "code", + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Split code with Langchain\n", + "\n", + "- Author: [Jongcheol Kim](https://github.com/greencode-99)\n", + "- Design: \n", + "- Peer Review: [kofsitho87](https://github.com/kofsitho87), [teddylee777](https://github.com/teddylee777)\n", + "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n", + "\n", + "[](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)\n", + "\n", + "## Overview\n", + "\n", + "`RecursiveCharacterTextSplitter` includes pre-built separator lists optimized for splitting text in different programming languages.\n", + "\n", + "The `CodeTextSplitter` provides even more specialized functionality for splitting code.\n", + "\n", + "To use it, 
import the `Language` enum and specify the desired programming language.\n", + "\n", + "\n", + "### Table of Contents\n", + "\n", + "- [Overview](#Overview)\n", + "- [Environment Setup](#environment-setup)\n", + "- [Code Splitter Examples](#code-splitter-examples)\n", + "    - [Python](#python)\n", + "    - [JavaScript](#javascript)\n", + "    - [TypeScript](#typescript)\n", + "    - [Markdown](#markdown)\n", + "    - [LaTeX](#latex)\n", + "    - [HTML](#html)\n", + "    - [Solidity](#solidity)\n", + "    - [C#](#c)\n", + "    - [PHP](#php)\n", + "    - [Kotlin](#kotlin)\n", + "\n", + "\n", + "### References\n", + "- [How to split code](https://python.langchain.com/docs/how_to/code_splitter/)\n", + "----" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Environment Setup\n", + "\n", + "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n", + "\n", + "**[Note]**\n", + "- `langchain-opentutorial` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.\n", + "- You can check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture --no-stderr\n", + "%pip install langchain-opentutorial" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# Install required packages\n", + "from langchain_opentutorial import package\n", + "\n", + "package.install(\n", + " [\n", + " \"langchain_text_splitters\",\n", + " ],\n", + " verbose=False,\n", + " upgrade=False,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Environment variables have been set successfully.\n" + ] + } + ], + "source": [ + "# Set environment variables\n", + "from langchain_opentutorial import set_env\n", + "\n", + "set_env(\n", + " {\n", + " \"OPENAI_API_KEY\": \"\",\n", + " \"LANGCHAIN_API_KEY\": \"\",\n", + " \"LANGCHAIN_TRACING_V2\": \"true\",\n", + " \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n", + " \"LANGCHAIN_PROJECT\": \"Code-Splitter\",\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, "execution_count": 4, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Environment variables have been set successfully.\n" - ] - } - ], - "source": [ - "# Set environment variables\n", - "from langchain_opentutorial import set_env\n", - "\n", - "set_env(\n", - " {\n", - " \"OPENAI_API_KEY\": \"\",\n", - " \"LANGCHAIN_API_KEY\": \"\",\n", - " \"LANGCHAIN_TRACING_V2\": \"true\",\n", - " \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n", - " \"LANGCHAIN_PROJECT\": \"Code-Splitter\",\n", - " }\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "True" - ] - }, - 
"execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from dotenv import load_dotenv\n", - "\n", - "load_dotenv()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Code Splitter Examples\n", - "\n", - "Here is an example of splitting text using `RecursiveCharacterTextSplitter`.\n", - "\n", - "- Import the `Language` and `RecursiveCharacterTextSplitter` classes from the `langchain_text_splitters` module.\n", - "- `RecursiveCharacterTextSplitter` is a text splitter that recursively splits text at the character level." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "from langchain_text_splitters import (\n", - " Language,\n", - " RecursiveCharacterTextSplitter,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Supported languages are stored in the langchain_text_splitters.Language enum. \n", - "\n", - "API Reference: [Language](https://python.langchain.com/docs/api_reference/text_splitters/Language) | [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/api_reference/text_splitters/RecursiveCharacterTextSplitter)\n", - "\n", - "Below is the full list of supported languages." - ] - }, - { - "cell_type": "code", + "output_type": "execute_result" + } + ], + "source": [ + "from dotenv import load_dotenv\n", + "\n", + "load_dotenv()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Code Splitter Examples\n", + "\n", + "Here is an example of splitting text using the `RecursiveCharacterTextSplitter`.\n", + "\n", + "- Import the `Language` and `RecursiveCharacterTextSplitter` classes from the `langchain_text_splitters` module.\n", + "- `RecursiveCharacterTextSplitter` is a text splitter that recursively splits text at the character level." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_text_splitters import (\n", + " Language,\n", + " RecursiveCharacterTextSplitter,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Supported languages are stored in the langchain_text_splitters.Language enum. \n", + "\n", + "API Reference: [Language](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.Language.html#language) | [RecursiveCharacterTextSplitter](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#recursivecharactertextsplitter)\n", + "\n", + "See below for the full list of supported languages." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['cpp',\n", + " 'go',\n", + " 'java',\n", + " 'kotlin',\n", + " 'js',\n", + " 'ts',\n", + " 'php',\n", + " 'proto',\n", + " 'python',\n", + " 'rst',\n", + " 'ruby',\n", + " 'rust',\n", + " 'scala',\n", + " 'swift',\n", + " 'markdown',\n", + " 'latex',\n", + " 'html',\n", + " 'sol',\n", + " 'csharp',\n", + " 'cobol',\n", + " 'c',\n", + " 'lua',\n", + " 'perl',\n", + " 'haskell',\n", + " 'elixir',\n", + " 'powershell']" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Get the full list of supported languages.\n", + "[e.value for e in Language]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can use the `get_separators_for_language` method of the `RecursiveCharacterTextSplitter` class to see the separators used for a given language.\n", + "\n", + "- For example, passing `Language.PYTHON` retrieves the separators used for Python:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + 
"['\\nclass ', '\\ndef ', '\\n\\tdef ', '\\n\\n', '\\n', ' ', '']" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# You can check the separators used for the given language.\n", + "RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Python\n", + "\n", + "Here's how to split Python code into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.PYTHON` for the `language` parameter. It tells the splitter you're working with Python code.\n", + "- Then, set `chunk_size` to 50. This limits the size of each resulting chunk to a maximum of 50 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={}, page_content='def hello_world():\\n print(\"Hello, World!\")'),\n", + " Document(metadata={}, page_content='hello_world()')]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "PYTHON_CODE = \"\"\"\n", + "def hello_world():\n", + " print(\"Hello, World!\")\n", + "\n", + "hello_world()\n", + "\"\"\"\n", + "\n", + "python_splitter = RecursiveCharacterTextSplitter.from_language(\n", + " language=Language.PYTHON, chunk_size=50, chunk_overlap=0\n", + ")\n", + "\n", + "# Create `Document`. 
The created `Document` is returned as a list.\n", + "python_docs = python_splitter.create_documents([PYTHON_CODE])\n", + "python_docs" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "def hello_world():\n", + " print(\"Hello, World!\")\n", + "==================\n", + "hello_world()\n", + "==================\n" + ] + } + ], + "source": [ + "# This section iterates through the list of documents created by the RecursiveCharacterTextSplitter\n", + "# and prints each document's content followed by a separator line for readability.\n", + "for doc in python_docs:\n", + " print(doc.page_content, end=\"\\n==================\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### JavaScript\n", + "\n", + "Here's how to split JavaScript code into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.JS` for the `language` parameter. It tells the splitter you're working with JavaScript code.\n", + "- Then, set `chunk_size` to 60. This limits the size of each resulting chunk to a maximum of 60 characters.\n", + "- Finally, set `chunk_overlap` to 0. 
It prevents any of the chunks from overlapping.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={}, page_content='function helloWorld() {\\n console.log(\"Hello, World!\");\\n}'),\n", + " Document(metadata={}, page_content='helloWorld();')]" + ] + }, "execution_count": 10, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['cpp',\n", - " 'go',\n", - " 'java',\n", - " 'kotlin',\n", - " 'js',\n", - " 'ts',\n", - " 'php',\n", - " 'proto',\n", - " 'python',\n", - " 'rst',\n", - " 'ruby',\n", - " 'rust',\n", - " 'scala',\n", - " 'swift',\n", - " 'markdown',\n", - " 'latex',\n", - " 'html',\n", - " 'sol',\n", - " 'csharp',\n", - " 'cobol',\n", - " 'c',\n", - " 'lua',\n", - " 'perl',\n", - " 'haskell',\n", - " 'elixir']" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Get the full list of supported languages.\n", - "[e.value for e in Language]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can use the `get_separators_for_language` method of the `RecursiveCharacterTextSplitter` class to check the separators used for a specific language.\n", - "\n", - "- In the example, the `Language.PYTHON` enum value is passed as an argument to check the separators used for the Python language." - ] - }, - { - "cell_type": "code", + "output_type": "execute_result" + } + ], + "source": [ + "JS_CODE = \"\"\"\n", + "function helloWorld() {\n", + " console.log(\"Hello, World!\");\n", + "}\n", + "\n", + "helloWorld();\n", + "\"\"\"\n", + "\n", + "js_splitter = RecursiveCharacterTextSplitter.from_language(\n", + " language=Language.JS, chunk_size=60, chunk_overlap=0\n", + ")\n", + "\n", + "# Create `Document`. 
The created `Document` is returned as a list.\n", + "js_docs = js_splitter.create_documents([JS_CODE])\n", + "js_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### TypeScript\n", + "\n", + "Here's how to split TypeScript code into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.TS` for the `language` parameter. It tells the splitter you're working with TypeScript code.\n", + "- Then, set `chunk_size` to 60. This limits the size of each resulting chunk to a maximum of 60 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={}, page_content='function helloWorld(): void {'),\n", + " Document(metadata={}, page_content='console.log(\"Hello, World!\");\\n}'),\n", + " Document(metadata={}, page_content='helloWorld();')]" + ] + }, "execution_count": 11, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['\\nclass ', '\\ndef ', '\\n\\tdef ', '\\n\\n', '\\n', ' ', '']" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# You can check the separators used for the given language.\n", - "RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)" - ] - }, - { - "cell_type": "markdown", + "output_type": "execute_result" + } + ], + "source": [ + "TS_CODE = \"\"\"\n", + "function helloWorld(): void {\n", + " console.log(\"Hello, World!\");\n", + "}\n", + "\n", + "helloWorld();\n", + "\"\"\"\n", + "\n", + "ts_splitter = RecursiveCharacterTextSplitter.from_language(\n", + " language=Language.TS, chunk_size=60, chunk_overlap=0\n", + ")\n", + "\n", + "\n", + "ts_docs = ts_splitter.create_documents([TS_CODE])\n", + "ts_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 
Markdown\n", + "\n", + "Here's how to split Markdown text into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "\n", + "- First, specify `Language.MARKDOWN` for the `language` parameter. It tells the splitter you're working with Markdown text.\n", + "- Then, set `chunk_size` to 60. This limits the size of each resulting chunk to a maximum of 60 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={}, page_content='# 🦜️🔗 LangChain'),\n", + " Document(metadata={}, page_content='⚡ Building applications with LLMs through composability ⚡'),\n", + " Document(metadata={}, page_content='## What is LangChain?'),\n", + " Document(metadata={}, page_content=\"# Hopefully this code block isn't split\"),\n", + " Document(metadata={}, page_content='LangChain is a framework for...'),\n", + " Document(metadata={}, page_content='As an open-source project in a rapidly developing field, we'),\n", + " Document(metadata={}, page_content='are extremely open to contributions.')]" + ] + }, "execution_count": 12, "metadata": {}, - "source": [ - "### Python\n", - "\n", - "Use `RecursiveCharacterTextSplitter` to split Python code into document units.\n", - "- Specify `Language.PYTHON` as the `language` parameter to use the Python language.\n", - "- Set `chunk_size` to 50 to limit the maximum size of each document.\n", - "- Set `chunk_overlap` to 0 to disallow overlap between documents."
- ] - }, - { - "cell_type": "code", + "output_type": "execute_result" + } + ], + "source": [ + "markdown_text = \"\"\"\n", + "# 🦜️🔗 LangChain\n", + "\n", + "⚡ Building applications with LLMs through composability ⚡\n", + "\n", + "## What is LangChain?\n", + "\n", + "# Hopefully this code block isn't split\n", + "LangChain is a framework for...\n", + "\n", + "As an open-source project in a rapidly developing field, we are extremely open to contributions.\n", + "\"\"\"\n", + "\n", + "md_splitter = RecursiveCharacterTextSplitter.from_language(\n", + " language=Language.MARKDOWN,\n", + " chunk_size=60,\n", + " chunk_overlap=0,\n", + ")\n", + "\n", + "md_docs = md_splitter.create_documents([markdown_text])\n", + "md_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### LaTeX\n", + "\n", + "LaTeX is a markup language for document creation, widely used for representing mathematical symbols and formulas.\n", + "\n", + "Here's how to split LaTeX text into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.LATEX` for the `language` parameter. It tells the splitter you're working with LaTeX text.\n", + "- Then, set `chunk_size` to 60. This limits the size of each resulting chunk to a maximum of 60 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={}, page_content='\\\\documentclass{article}\\n\\n\\x08egin{document}\\n\\n\\\\maketitle'),\n", + " Document(metadata={}, page_content='\\\\section{Introduction}\\nLarge language models (LLMs) are a'),\n", + " Document(metadata={}, page_content='type of machine learning model that can be trained on vast'),\n", + " Document(metadata={}, page_content='amounts of text data to generate human-like language. 
In'),\n", + " Document(metadata={}, page_content='recent years, LLMs have made significant advances in a'),\n", + " Document(metadata={}, page_content='variety of natural language processing tasks, including'),\n", + " Document(metadata={}, page_content='language translation, text generation, and sentiment'),\n", + " Document(metadata={}, page_content='analysis.'),\n", + " Document(metadata={}, page_content='\\\\subsection{History of LLMs}\\nThe earliest LLMs were'),\n", + " Document(metadata={}, page_content='developed in the 1980s and 1990s, but they were limited by'),\n", + " Document(metadata={}, page_content='the amount of data that could be processed and the'),\n", + " Document(metadata={}, page_content='computational power available at the time. In the past'),\n", + " Document(metadata={}, page_content='decade, however, advances in hardware and software have'),\n", + " Document(metadata={}, page_content='made it possible to train LLMs on massive datasets, leading'),\n", + " Document(metadata={}, page_content='to significant improvements in performance.'),\n", + " Document(metadata={}, page_content='\\\\subsection{Applications of LLMs}\\nLLMs have many'),\n", + " Document(metadata={}, page_content='applications in industry, including chatbots, content'),\n", + " Document(metadata={}, page_content='creation, and virtual assistants. 
They can also be used in'),\n", + " Document(metadata={}, page_content='academia for research in linguistics, psychology, and'),\n", + " Document(metadata={}, page_content='computational linguistics.\\n\\n\\\\end{document}')]" + ] + }, "execution_count": 13, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[Document(page_content='def hello_world():\\n    print(\"Hello, World!\")'),\n", - " Document(page_content='hello_world()')]" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "PYTHON_CODE = \"\"\"\n", - "def hello_world():\n", - "    print(\"Hello, World!\")\n", - "\n", - "hello_world()\n", - "\"\"\"\n", - "\n", - "python_splitter = RecursiveCharacterTextSplitter.from_language(\n", - "    language=Language.PYTHON, chunk_size=50, chunk_overlap=0\n", - ")\n", - "\n", - "# Create `Document`. The created `Document` is returned as a list.\n", - "python_docs = python_splitter.create_documents([PYTHON_CODE])\n", - "python_docs" - ] - }, - { - "cell_type": "code", + "output_type": "execute_result" + } + ], + "source": [ + "latex_text = r\"\"\"\n", + "\documentclass{article}\n", + "\n", + "\begin{document}\n", + "\n", + "\maketitle\n", + "\n", + "\section{Introduction}\n", + "Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.\n", + "\n", + "\subsection{History of LLMs}\n", + "The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. 
In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.\n", + "\n", + "\\subsection{Applications of LLMs}\n", + "LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.\n", + "\n", + "\\end{document}\n", + "\"\"\"\n", + "\n", + "latex_splitter = RecursiveCharacterTextSplitter.from_language(\n", + " language=Language.LATEX,\n", + " chunk_size=60,\n", + " chunk_overlap=0,\n", + ")\n", + "\n", + "latex_docs = latex_splitter.create_documents([latex_text])\n", + "latex_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### HTML\n", + "\n", + "Here's how to split HTML text into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.HTML` for the `language` parameter. It tells the splitter you're working with HTML.\n", + "- Then, set `chunk_size` to 60. This limits the size of each resulting chunk to a maximum of 60 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={}, page_content='\\n'),\n", + " Document(metadata={}, page_content='
\\n⚡ Building applications with LLMs through composability ⚡'),\n", + " Document(metadata={}, page_content='
\\n⚡ Building applications with LLMs through composability ⚡
\n", + "⚡ Building applications with LLMs through composability ⚡'),\n", - " Document(page_content='
\\n⚡ Building applications with LLMs through composability ⚡
\n", - "