diff --git a/07-TextSplitter/05-CodeSplitter.ipynb b/07-TextSplitter/05-CodeSplitter.ipynb index 2f282adef..c30827584 100644 --- a/07-TextSplitter/05-CodeSplitter.ipynb +++ b/07-TextSplitter/05-CodeSplitter.ipynb @@ -1,880 +1,877 @@ { "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Split code with Langchain\n", - "\n", - "- Author: [greencode](https://github.com/greencode-99)\n", - "- Design: \n", - "- Peer Review : [Teddy Lee](https://github.com/teddylee777), [heewung song](https://github.com/kofsitho87), [Teddy Lee](https://github.com/teddylee777)\n", - "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n", - "\n", - "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/05-CodeSplitter.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/05-CodeSplitter.ipynb)\n", - "\n", - "## Overview\n", - "\n", - "`RecursiveCharacterTextSplitter` includes pre-built lists of separators that are useful for splitting text in a specific programming language.\n", - "\n", - "You can split code written in various programming languages using `CodeTextSplitter`.\n", - "\n", - "To do this, import the `Language` enum and specify the corresponding programming language.\n", - "\n", - "\n", - "### Table of Contents\n", - "\n", - "- [Overview](#Overview)\n", - "- [Environment Setup](#environment-setup)\n", - "- [Code Spliter Examples](#code-splitter-examples)\n", - " - [Python](#python)\n", - " - [JS](#js)\n", - " - [TS](#ts)\n", - " - [Markdown](#markdown)\n", - " - [LaTeX](#latex)\n", - " - [HTML](#html)\n", - " - [Solidity](#solidity)\n", - " - [C#](#c)\n", - " - [PHP](#php)\n", - " - [Kotlin](#kotlin)\n", - "\n", - "\n", - "### References\n", - "- [How to split code](https://python.langchain.com/docs/how_to/code_splitter/)\n", - "----" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Environment Setup\n", - "\n", - "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n", - "\n", - "**[Note]**\n", - "- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n", - "- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details." 
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "%%capture --no-stderr\n",
-    "%pip install langchain-opentutorial"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Install required packages\n",
-    "from langchain_opentutorial import package\n",
-    "\n",
-    "package.install(\n",
-    "    [\n",
-    "        \"langchain_text_splitters\",\n",
-    "    ],\n",
-    "    verbose=False,\n",
-    "    upgrade=False,\n",
-    ")"
-   ]
-  },
-  {
-   "cell_type": "code",
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Split code with LangChain\n",
+    "\n",
+    "- Author: [Jongcheol Kim](https://github.com/greencode-99)\n",
+    "- Design: \n",
+    "- Peer Review: [kofsitho87](https://github.com/kofsitho87), [teddylee777](https://github.com/teddylee777)\n",
+    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
+    "\n",
+    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/05-CodeSplitter.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/05-CodeSplitter.ipynb)\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "`RecursiveCharacterTextSplitter` includes pre-built separator lists optimized for splitting text in different programming languages.\n",
+    "\n",
+    "The `CodeTextSplitter` provides even more specialized functionality for splitting code.\n",
+    "\n",
+    "To use it, import the `Language` enum (enumeration) and specify the desired programming language.\n",
+    "\n",
+    "\n",
+    "### Table of Contents\n",
+    "\n",
+    "- [Overview](#Overview)\n",
+    "- [Environment Setup](#environment-setup)\n",
+    "- [Code Splitter Examples](#code-splitter-examples)\n",
+    "  - [Python](#python)\n",
+    "  - [JavaScript](#javascript)\n",
+    "  - [TypeScript](#typescript)\n",
+    "  - [Markdown](#markdown)\n",
+    "  - [LaTeX](#latex)\n",
+    "  - [HTML](#html)\n",
+    "  - [Solidity](#solidity)\n",
+    "  - [C#](#c)\n",
+    "  - [PHP](#php)\n",
+    "  - [Kotlin](#kotlin)\n",
+    "\n",
+    "\n",
+    "### References\n",
+    "- [How to split code](https://python.langchain.com/docs/how_to/code_splitter/)\n",
+    "----"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Environment Setup\n",
+    "\n",
+    "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
+    "\n",
+    "**[Note]**\n",
+    "- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.\n",
+    "- You can check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
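If you would rather not depend on the `langchain-opentutorial` helper used in the next cells, a minimal manual setup looks roughly like the sketch below. It is not part of the original notebook, and the environment-variable values are placeholders you would replace with your own keys.

```python
# Minimal manual setup, shown only as an alternative to the helper package above.
# The key values are placeholders -- replace them with your own before running.
import os

os.environ.setdefault("OPENAI_API_KEY", "<your-openai-api-key>")
os.environ.setdefault("LANGCHAIN_TRACING_V2", "false")  # enable only if you use LangSmith

# The text splitters themselves need no API key; installing the package is enough:
#   pip install -U langchain-text-splitters
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

print(f"{len(list(Language))} languages supported")
```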
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture --no-stderr\n", + "%pip install langchain-opentutorial" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# Install required packages\n", + "from langchain_opentutorial import package\n", + "\n", + "package.install(\n", + " [\n", + " \"langchain_text_splitters\",\n", + " ],\n", + " verbose=False,\n", + " upgrade=False,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Environment variables have been set successfully.\n" + ] + } + ], + "source": [ + "# Set environment variables\n", + "from langchain_opentutorial import set_env\n", + "\n", + "set_env(\n", + " {\n", + " \"OPENAI_API_KEY\": \"\",\n", + " \"LANGCHAIN_API_KEY\": \"\",\n", + " \"LANGCHAIN_TRACING_V2\": \"true\",\n", + " \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n", + " \"LANGCHAIN_PROJECT\": \"Code-Splitter\",\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, "execution_count": 4, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Environment variables have been set successfully.\n" - ] - } - ], - "source": [ - "# Set environment variables\n", - "from langchain_opentutorial import set_env\n", - "\n", - "set_env(\n", - " {\n", - " \"OPENAI_API_KEY\": \"\",\n", - " \"LANGCHAIN_API_KEY\": \"\",\n", - " \"LANGCHAIN_TRACING_V2\": \"true\",\n", - " \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n", - " \"LANGCHAIN_PROJECT\": \"Code-Splitter\",\n", - " }\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from dotenv import load_dotenv\n", - "\n", - "load_dotenv()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Code Splitter Examples\n", - "\n", - "Here is an example of splitting text using `RecursiveCharacterTextSplitter`.\n", - "\n", - "- Import the `Language` and `RecursiveCharacterTextSplitter` classes from the `langchain_text_splitters` module.\n", - "- `RecursiveCharacterTextSplitter` is a text splitter that recursively splits text at the character level." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "from langchain_text_splitters import (\n", - " Language,\n", - " RecursiveCharacterTextSplitter,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Supported languages are stored in the langchain_text_splitters.Language enum. \n", - "\n", - "API Reference: [Language](https://python.langchain.com/docs/api_reference/text_splitters/Language) | [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/api_reference/text_splitters/RecursiveCharacterTextSplitter)\n", - "\n", - "Below is the full list of supported languages." 
- ] - }, - { - "cell_type": "code", + "output_type": "execute_result" + } + ], + "source": [ + "from dotenv import load_dotenv\n", + "\n", + "load_dotenv()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Code Splitter Examples\n", + "\n", + "Here is an example of splitting text using the `RecursiveCharacterTextSplitter`.\n", + "\n", + "- Import the `Language` and `RecursiveCharacterTextSplitter` classes from the `langchain_text_splitters` module.\n", + "- `RecursiveCharacterTextSplitter` is a text splitter that recursively splits text at the character level." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_text_splitters import (\n", + " Language,\n", + " RecursiveCharacterTextSplitter,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Supported languages are stored in the langchain_text_splitters.Language enum. \n", + "\n", + "API Reference: [Language](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.Language.html#language) | [RecursiveCharacterTextSplitter](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#recursivecharactertextsplitter)\n", + "\n", + "See below for the full list of supported languages." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['cpp',\n", + " 'go',\n", + " 'java',\n", + " 'kotlin',\n", + " 'js',\n", + " 'ts',\n", + " 'php',\n", + " 'proto',\n", + " 'python',\n", + " 'rst',\n", + " 'ruby',\n", + " 'rust',\n", + " 'scala',\n", + " 'swift',\n", + " 'markdown',\n", + " 'latex',\n", + " 'html',\n", + " 'sol',\n", + " 'csharp',\n", + " 'cobol',\n", + " 'c',\n", + " 'lua',\n", + " 'perl',\n", + " 'haskell',\n", + " 'elixir',\n", + " 'powershell']" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Get the full list of supported languages.\n", + "[e.value for e in Language]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can use the `get_separators_for_language` method of the `RecursiveCharacterTextSplitter` class to see the separators used for a given language.\n", + "\n", + "- For example, passing `Language.PYTHON` retrieves the separators used for Python:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['\\nclass ', '\\ndef ', '\\n\\tdef ', '\\n\\n', '\\n', ' ', '']" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# You can check the separators used for the given language.\n", + "RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Python\n", + "\n", + "Here's how to split Python code into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.PYTHON` for the `language` parameter. It tells the splitter you're working with Python code.\n", + "- Then, set `chunk_size` to 50. This limits the size of each resulting chunk to a maximum of 50 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping." 
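As a quick orientation before the notebook's own example below, here is a hedged, self-contained sketch of the same idea end to end: `from_language` builds a splitter that prefers Python-aware boundaries such as `class` and `def`, `split_text` returns the chunks as plain strings, and `create_documents` wraps them in `Document` objects. The `sample_code` string is only an illustrative stand-in and does not come from the notebook.

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Illustrative snippet only; the notebook's own PYTHON_CODE example follows below.
sample_code = '''
def hello_world():
    print("Hello, World!")

hello_world()
'''

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)

# split_text returns the chunks as plain strings ...
for chunk in splitter.split_text(sample_code):
    print(repr(chunk))

# ... while create_documents wraps each chunk in a Document object.
print(splitter.create_documents([sample_code]))
```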
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={}, page_content='def hello_world():\\n print(\"Hello, World!\")'),\n", + " Document(metadata={}, page_content='hello_world()')]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "PYTHON_CODE = \"\"\"\n", + "def hello_world():\n", + " print(\"Hello, World!\")\n", + "\n", + "hello_world()\n", + "\"\"\"\n", + "\n", + "python_splitter = RecursiveCharacterTextSplitter.from_language(\n", + " language=Language.PYTHON, chunk_size=50, chunk_overlap=0\n", + ")\n", + "\n", + "# Create `Document`. The created `Document` is returned as a list.\n", + "python_docs = python_splitter.create_documents([PYTHON_CODE])\n", + "python_docs" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "def hello_world():\n", + " print(\"Hello, World!\")\n", + "==================\n", + "hello_world()\n", + "==================\n" + ] + } + ], + "source": [ + "# This section iterates through the list of documents created by the RecursiveCharacterTextSplitter\n", + "# and prints each document's content followed by a separator line for readability.\n", + "for doc in python_docs:\n", + " print(doc.page_content, end=\"\\n==================\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### JavaScript\n", + "\n", + "Here's how to split JavaScript code into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.JS` for the `language` parameter. It tells the splitter you're working with JavaScript code.\n", + "- Then, set `chunk_size` to 60. This limits the size of each resulting chunk to a maximum of 60 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={}, page_content='function helloWorld() {\\n console.log(\"Hello, World!\");\\n}'),\n", + " Document(metadata={}, page_content='helloWorld();')]" + ] + }, "execution_count": 10, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['cpp',\n", - " 'go',\n", - " 'java',\n", - " 'kotlin',\n", - " 'js',\n", - " 'ts',\n", - " 'php',\n", - " 'proto',\n", - " 'python',\n", - " 'rst',\n", - " 'ruby',\n", - " 'rust',\n", - " 'scala',\n", - " 'swift',\n", - " 'markdown',\n", - " 'latex',\n", - " 'html',\n", - " 'sol',\n", - " 'csharp',\n", - " 'cobol',\n", - " 'c',\n", - " 'lua',\n", - " 'perl',\n", - " 'haskell',\n", - " 'elixir']" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Get the full list of supported languages.\n", - "[e.value for e in Language]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can use the `get_separators_for_language` method of the `RecursiveCharacterTextSplitter` class to check the separators used for a specific language.\n", - "\n", - "- In the example, the `Language.PYTHON` enum value is passed as an argument to check the separators used for the Python language." 
- ] - }, - { - "cell_type": "code", + "output_type": "execute_result" + } + ], + "source": [ + "JS_CODE = \"\"\"\n", + "function helloWorld() {\n", + " console.log(\"Hello, World!\");\n", + "}\n", + "\n", + "helloWorld();\n", + "\"\"\"\n", + "\n", + "js_splitter = RecursiveCharacterTextSplitter.from_language(\n", + " language=Language.JS, chunk_size=60, chunk_overlap=0\n", + ")\n", + "\n", + "# Create `Document`. The created `Document` is returned as a list.\n", + "js_docs = js_splitter.create_documents([JS_CODE])\n", + "js_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### TypeScript\n", + "\n", + "Here's how to split TypeScript code into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.TS` for the `language` parameter. It tells the splitter you're working with TypeScript code.\n", + "- Then, set `chunk_size` to 60. This limits the size of each resulting chunk to a maximum of 60 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={}, page_content='function helloWorld(): void {'),\n", + " Document(metadata={}, page_content='console.log(\"Hello, World!\");\\n}'),\n", + " Document(metadata={}, page_content='helloWorld();')]" + ] + }, "execution_count": 11, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['\\nclass ', '\\ndef ', '\\n\\tdef ', '\\n\\n', '\\n', ' ', '']" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# You can check the separators used for the given language.\n", - "RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)" - ] - }, - { - "cell_type": "markdown", + "output_type": "execute_result" + } + ], + "source": [ + "TS_CODE = \"\"\"\n", + "function helloWorld(): void {\n", + " console.log(\"Hello, World!\");\n", + "}\n", + "\n", + "helloWorld();\n", + "\"\"\"\n", + "\n", + "ts_splitter = RecursiveCharacterTextSplitter.from_language(\n", + " language=Language.TS, chunk_size=60, chunk_overlap=0\n", + ")\n", + "\n", + "\n", + "ts_docs = ts_splitter.create_documents([TS_CODE])\n", + "ts_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Markdown\n", + "\n", + "Here's how to split Markdown text into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "\n", + "- First, Specify `Language.MARKDOWN` for the `language` parameter. It tells the splitter you're working with Markdown text.\n", + "- Then, set `chunk_size` to 60. This limits the size of each resulting chunk to a maximum of 60 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping." 
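One detail the notebook does not show: `create_documents` also accepts a `metadatas` list, which is handy when you split several Markdown documents and want each chunk to remember where it came from. The sketch below is an illustration only; the file names in the metadata are hypothetical.

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)

readme = "# Project A\n\nA short description of project A."
changelog = "# Changelog\n\n## 1.0.0\nInitial release."

# One metadata dict per input text; it is attached to every chunk of that text.
docs = md_splitter.create_documents(
    [readme, changelog],
    metadatas=[{"source": "README.md"}, {"source": "CHANGELOG.md"}],
)

for doc in docs:
    print(doc.metadata, "|", doc.page_content)
```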
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={}, page_content='# 🦜️🔗 LangChain'),\n", + " Document(metadata={}, page_content='⚡ Building applications with LLMs through composability ⚡'),\n", + " Document(metadata={}, page_content='## What is LangChain?'),\n", + " Document(metadata={}, page_content=\"# Hopefully this code block isn't split\"),\n", + " Document(metadata={}, page_content='LangChain is a framework for...'),\n", + " Document(metadata={}, page_content='As an open-source project in a rapidly developing field, we'),\n", + " Document(metadata={}, page_content='are extremely open to contributions.')]" + ] + }, + "execution_count": 12, "metadata": {}, - "source": [ - "### Python\n", - "\n", - "Use `RecursiveCharacterTextSplitter` to split Python code into document units.\n", - "- Specify `Language.PYTHON` as the `language` parameter to use the Python language.\n", - "- Set `chunk_size` to 50 to limit the maximum size of each document.\n", - "- Set `chunk_overlap` to 0 to disallow overlap between documents." - ] - }, - { - "cell_type": "code", + "output_type": "execute_result" + } + ], + "source": [ + "markdown_text = \"\"\"\n", + "# 🦜️🔗 LangChain\n", + "\n", + "⚡ Building applications with LLMs through composability ⚡\n", + "\n", + "## What is LangChain?\n", + "\n", + "# Hopefully this code block isn't split\n", + "LangChain is a framework for...\n", + "\n", + "As an open-source project in a rapidly developing field, we are extremely open to contributions.\n", + "\"\"\"\n", + "\n", + "md_splitter = RecursiveCharacterTextSplitter.from_language(\n", + " language=Language.MARKDOWN,\n", + " chunk_size=60,\n", + " chunk_overlap=0,\n", + ")\n", + "\n", + "md_docs = md_splitter.create_documents([markdown_text])\n", + "md_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### LaTeX\n", + "\n", + "LaTeX is a markup language for document creation, widely used for representing mathematical symbols and formulas.\n", + "\n", + "Here's how to split LaTeX text into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.LATEX` for the `language` parameter. It tells the splitter you're working with LaTeX text.\n", + "- Then, set `chunk_size` to 60. This limits the size of each resulting chunk to a maximum of 60 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={}, page_content='\\\\documentclass{article}\\n\\n\\x08egin{document}\\n\\n\\\\maketitle'),\n", + " Document(metadata={}, page_content='\\\\section{Introduction}\\nLarge language models (LLMs) are a'),\n", + " Document(metadata={}, page_content='type of machine learning model that can be trained on vast'),\n", + " Document(metadata={}, page_content='amounts of text data to generate human-like language. 
In'),\n", + " Document(metadata={}, page_content='recent years, LLMs have made significant advances in a'),\n", + " Document(metadata={}, page_content='variety of natural language processing tasks, including'),\n", + " Document(metadata={}, page_content='language translation, text generation, and sentiment'),\n", + " Document(metadata={}, page_content='analysis.'),\n", + " Document(metadata={}, page_content='\\\\subsection{History of LLMs}\\nThe earliest LLMs were'),\n", + " Document(metadata={}, page_content='developed in the 1980s and 1990s, but they were limited by'),\n", + " Document(metadata={}, page_content='the amount of data that could be processed and the'),\n", + " Document(metadata={}, page_content='computational power available at the time. In the past'),\n", + " Document(metadata={}, page_content='decade, however, advances in hardware and software have'),\n", + " Document(metadata={}, page_content='made it possible to train LLMs on massive datasets, leading'),\n", + " Document(metadata={}, page_content='to significant improvements in performance.'),\n", + " Document(metadata={}, page_content='\\\\subsection{Applications of LLMs}\\nLLMs have many'),\n", + " Document(metadata={}, page_content='applications in industry, including chatbots, content'),\n", + " Document(metadata={}, page_content='creation, and virtual assistants. They can also be used in'),\n", + " Document(metadata={}, page_content='academia for research in linguistics, psychology, and'),\n", + " Document(metadata={}, page_content='computational linguistics.\\n\\n\\\\end{document}')]" + ] + }, "execution_count": 13, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[Document(page_content='def hello_world():\\n print(\"Hello, World!\")'),\n", - " Document(page_content='hello_world()')]" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "PYTHON_CODE = \"\"\"\n", - "def hello_world():\n", - " print(\"Hello, World!\")\n", - "\n", - "hello_world()\n", - "\"\"\"\n", - "\n", - "python_splitter = RecursiveCharacterTextSplitter.from_language(\n", - " language=Language.PYTHON, chunk_size=50, chunk_overlap=0\n", - ")\n", - "\n", - "# Create `Document`. The created `Document` is returned as a list.\n", - "python_docs = python_splitter.create_documents([PYTHON_CODE])\n", - "python_docs" - ] - }, - { - "cell_type": "code", + "output_type": "execute_result" + } + ], + "source": [ + "latex_text = \"\"\"\n", + "\\documentclass{article}\n", + "\n", + "\\begin{document}\n", + "\n", + "\\maketitle\n", + "\n", + "\\section{Introduction}\n", + "Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.\n", + "\n", + "\\subsection{History of LLMs}\n", + "The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.\n", + "\n", + "\\subsection{Applications of LLMs}\n", + "LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. 
They can also be used in academia for research in linguistics, psychology, and computational linguistics.\n",
+    "\n",
+    "\\end{document}\n",
+    "\"\"\"\n",
+    "\n",
+    "latex_splitter = RecursiveCharacterTextSplitter.from_language(\n",
+    "    language=Language.LATEX,\n",
+    "    chunk_size=60,\n",
+    "    chunk_overlap=0,\n",
+    ")\n",
+    "\n",
+    "latex_docs = latex_splitter.create_documents([latex_text])\n",
+    "latex_docs"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### HTML\n",
+    "\n",
+    "Here's how to split HTML text into smaller chunks using the `RecursiveCharacterTextSplitter`.\n",
+    "- First, specify `Language.HTML` for the `language` parameter. It tells the splitter you're working with HTML.\n",
+    "- Then, set `chunk_size` to 60. This limits the size of each resulting chunk to a maximum of 60 characters.\n",
+    "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[Document(metadata={}, page_content='<!DOCTYPE html>\\n<html>'),\n",
+       " Document(metadata={}, page_content='<head>\\n        <title>🦜️🔗 LangChain</title>'),\n",
+       " Document(metadata={}, page_content='<style>\\n            body {\\n                font-family: Aria'),\n",
+       " Document(metadata={}, page_content='l, sans-serif;\\n            }\\n            h1 {'),\n",
+       " Document(metadata={}, page_content='color: darkblue;\\n            }\\n        </style>\\n    </head'),\n",
+       " Document(metadata={}, page_content='>'),\n",
+       " Document(metadata={}, page_content='<body>'),\n",
+       " Document(metadata={}, page_content='<div>\\n            <h1>🦜️🔗 LangChain</h1>'),\n",
+       " Document(metadata={}, page_content='<p>⚡ Building applications with LLMs through composability ⚡'),\n",
+       " Document(metadata={}, page_content='</p>\\n        </div>'),\n",
+       " Document(metadata={}, page_content='<div>\\n            As an open-source project in a rapidly dev'),\n",
+       " Document(metadata={}, page_content='eloping field, we are extremely open to contributions.'),\n",
+       " Document(metadata={}, page_content='</div>\\n    </body>\\n</html>')]"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "html_text = \"\"\"\n",
+    "<!DOCTYPE html>\n",
+    "<html>\n",
+    "    <head>\n",
+    "        <title>🦜️🔗 LangChain</title>\n",
+    "        <style>\n",
+    "            body {\n",
+    "                font-family: Arial, sans-serif;\n",
+    "            }\n",
+    "            h1 {\n",
+    "                color: darkblue;\n",
+    "            }\n",
+    "        </style>\n",
+    "    </head>\n",
+    "    <body>\n",
+    "        <div>\n",
+    "            <h1>🦜️🔗 LangChain</h1>\n",
+    "            <p>⚡ Building applications with LLMs through composability ⚡</p>\n",
+    "        </div>\n",
+    "        <div>\n",
+    "            As an open-source project in a rapidly developing field, we are extremely open to contributions.\n",
+    "        </div>\n",
+    "    </body>\n",
+    "</html>
\n", + " \n", + "\n", + "\"\"\"\n", + "\n", + "html_splitter = RecursiveCharacterTextSplitter.from_language(\n", + " language=Language.HTML, chunk_size=60, chunk_overlap=0\n", + ")\n", + "\n", + "html_docs = html_splitter.create_documents([html_text])\n", + "html_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Solidity\n", + "\n", + "Here's how to split Solidity code (sotred as a string in the `SOL_CODE` variable) into smaller chunks by creating a `RecursiveCharacterTextSplitter` instance called `sol_splitter` to handle the splitting.\n", + "- First, specify `Language.SOL` for the `language` parameter. It tells the splitter you're working with Solidity code.\n", + "- Then, set `chunk_size` to 128. This limits the size of each resulting chunk to a maximum of 128 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping.\n", + "- The `sol_splitter.create_documents()` method splits the Solidity code(`SOL_CODE`) into chunks and stores them in the `sol_docs` variable.\n", + "- Print or display the output(`sol_docs`) to verify the split.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={}, page_content='pragma solidity ^0.8.20;'),\n", + " Document(metadata={}, page_content='contract HelloWorld { \\n function add(uint a, uint b) pure public returns(uint) {\\n return a + b;\\n }\\n}')]" + ] + }, "execution_count": 15, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[Document(page_content='function helloWorld(): void {'),\n", - " Document(page_content='console.log(\"Hello, World!\");\\n}'),\n", - " Document(page_content='helloWorld();')]" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "TS_CODE = \"\"\"\n", - "function helloWorld(): void {\n", - " console.log(\"Hello, World!\");\n", - "}\n", - "\n", - "helloWorld();\n", - "\"\"\"\n", - "\n", - "ts_splitter = RecursiveCharacterTextSplitter.from_language(\n", - " language=Language.TS, chunk_size=60, chunk_overlap=0\n", - ")\n", - "\n", - "\n", - "ts_docs = ts_splitter.create_documents([TS_CODE])\n", - "ts_docs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Markdown\n", - "\n", - "Here is an example of using the Markdown text splitter.\n", - "\n", - "- Specify `Language.MARKDOWN` as the `language` parameter to use the Markdown language.\n", - "- Set `chunk_size` to 60 to limit the maximum size of each document.\n", - "- Set `chunk_overlap` to 0 to disallow overlap between documents." 
- ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[Document(page_content='# 🦜️🔗 LangChain'),\n", - " Document(page_content='⚡ Building applications with LLMs through composability ⚡'),\n", - " Document(page_content='## What is LangChain?'),\n", - " Document(page_content=\"# Hopefully this code block isn't split\"),\n", - " Document(page_content='LangChain is a framework for...'),\n", - " Document(page_content='As an open-source project in a rapidly developing field, we'),\n", - " Document(page_content='are extremely open to contributions.')]" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "markdown_text = \"\"\"\n", - "# 🦜️🔗 LangChain\n", - "\n", - "⚡ Building applications with LLMs through composability ⚡\n", - "\n", - "## What is LangChain?\n", - "\n", - "# Hopefully this code block isn't split\n", - "LangChain is a framework for...\n", - "\n", - "As an open-source project in a rapidly developing field, we are extremely open to contributions.\n", - "\"\"\"\n", - "\n", - "md_splitter = RecursiveCharacterTextSplitter.from_language(\n", - " language=Language.MARKDOWN,\n", - " chunk_size=60,\n", - " chunk_overlap=0,\n", - ")\n", - "\n", - "md_docs = md_splitter.create_documents([markdown_text])\n", - "md_docs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### LaTeX\n", - "\n", - "LaTeX is a markup language for document creation, widely used for representing mathematical symbols and formulas.\n", - "\n", - "Here is an example of LaTeX text.\n", - "- Specify `Language.LATEX` as the `language` parameter to use the LaTeX language.\n", - "- Set `chunk_size` to 60 to limit the maximum size of each document.\n", - "- Set `chunk_overlap` to 0 to disallow overlap between documents.\n" - ] - }, - { - "cell_type": "code", + "output_type": "execute_result" + } + ], + "source": [ + "SOL_CODE = \"\"\"\n", + "pragma solidity ^0.8.20; \n", + "contract HelloWorld { \n", + " function add(uint a, uint b) pure public returns(uint) {\n", + " return a + b;\n", + " }\n", + "}\n", + "\"\"\"\n", + "\n", + "sol_splitter = RecursiveCharacterTextSplitter.from_language(\n", + " language=Language.SOL, chunk_size=128, chunk_overlap=0\n", + ")\n", + "\n", + "sol_docs = sol_splitter.create_documents([SOL_CODE])\n", + "sol_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### C#\n", + "\n", + "Here's how to split C# code into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.CSHARP` for the `language` parameter. It tells the splitter you're working with C# code.\n", + "- Then, set `chunk_size` to 128. This limits the size of each resulting chunk to a maximum of 128 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping." 
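The examples in this notebook embed source code as Python strings. In practice you would usually read files from disk first; the sketch below shows one way that might look for the C# splitter (the `examples/Program.cs` path is hypothetical, not a file shipped with the tutorial).

```python
from pathlib import Path

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

csharp_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.CSHARP, chunk_size=128, chunk_overlap=0
)

# Hypothetical path -- point it at a real .cs file in your project.
source_path = Path("examples/Program.cs")

if source_path.exists():
    code = source_path.read_text(encoding="utf-8")
    docs = csharp_splitter.create_documents(
        [code], metadatas=[{"source": str(source_path)}]
    )
    longest = max(len(d.page_content) for d in docs)
    print(f"{source_path}: {len(docs)} chunks, longest chunk {longest} characters")
```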
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={}, page_content='using System;'),\n", + " Document(metadata={}, page_content='class Program\\n{\\n static void Main()\\n {\\n Console.WriteLine(\"Enter a number (1-5):\");'),\n", + " Document(metadata={}, page_content='int input = Convert.ToInt32(Console.ReadLine());\\n for (int i = 1; i <= input; i++)\\n {'),\n", + " Document(metadata={}, page_content='if (i % 2 == 0)\\n {\\n Console.WriteLine($\"{i} is even.\");\\n }\\n else'),\n", + " Document(metadata={}, page_content='{\\n Console.WriteLine($\"{i} is odd.\");\\n }\\n }\\n Console.WriteLine(\"Goodbye!\");'),\n", + " Document(metadata={}, page_content='}\\n}')]" + ] + }, "execution_count": 16, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[Document(page_content='\\\\documentclass{article}\\n\\n\\x08egin{document}\\n\\n\\\\maketitle'),\n", - " Document(page_content='\\\\section{Introduction}\\nLarge language models (LLMs) are a'),\n", - " Document(page_content='type of machine learning model that can be trained on vast'),\n", - " Document(page_content='amounts of text data to generate human-like language. In'),\n", - " Document(page_content='recent years, LLMs have made significant advances in a'),\n", - " Document(page_content='variety of natural language processing tasks, including'),\n", - " Document(page_content='language translation, text generation, and sentiment'),\n", - " Document(page_content='analysis.'),\n", - " Document(page_content='\\\\subsection{History of LLMs}\\nThe earliest LLMs were'),\n", - " Document(page_content='developed in the 1980s and 1990s, but they were limited by'),\n", - " Document(page_content='the amount of data that could be processed and the'),\n", - " Document(page_content='computational power available at the time. In the past'),\n", - " Document(page_content='decade, however, advances in hardware and software have'),\n", - " Document(page_content='made it possible to train LLMs on massive datasets, leading'),\n", - " Document(page_content='to significant improvements in performance.'),\n", - " Document(page_content='\\\\subsection{Applications of LLMs}\\nLLMs have many'),\n", - " Document(page_content='applications in industry, including chatbots, content'),\n", - " Document(page_content='creation, and virtual assistants. They can also be used in'),\n", - " Document(page_content='academia for research in linguistics, psychology, and'),\n", - " Document(page_content='computational linguistics.\\n\\n\\\\end{document}')]" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "latex_text = \"\"\"\n", - "\\documentclass{article}\n", - "\n", - "\\begin{document}\n", - "\n", - "\\maketitle\n", - "\n", - "\\section{Introduction}\n", - "Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.\n", - "\n", - "\\subsection{History of LLMs}\n", - "The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. 
In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.\n",
-    "\n",
-    "\\subsection{Applications of LLMs}\n",
-    "LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.\n",
-    "\n",
-    "\\end{document}\n",
-    "\"\"\"\n",
-    "\n",
-    "latex_splitter = RecursiveCharacterTextSplitter.from_language(\n",
-    "    language=Language.LATEX,\n",
-    "    chunk_size=60,\n",
-    "    chunk_overlap=0,\n",
-    ")\n",
-    "\n",
-    "latex_docs = latex_splitter.create_documents([latex_text])\n",
-    "latex_docs"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### HTML\n",
-    "\n",
-    "Here is an example of using the HTML text splitter.\n",
-    "- Specify `Language.HTML` as the `language` parameter to use the HTML language.\n",
-    "- Set `chunk_size` to 60 to limit the maximum size of each document.\n",
-    "- Set `chunk_overlap` to 0 to disallow overlap between documents.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 51,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "[Document(page_content='<!DOCTYPE html>\\n<html>'),\n",
-       " Document(page_content='<head>\\n        <title>🦜️🔗 LangChain</title>'),\n",
-       " Document(page_content='<style>\\n            body {\\n                font-family: Aria'),\n",
-       " Document(page_content='l, sans-serif;\\n            }\\n            h1 {'),\n",
-       " Document(page_content='color: darkblue;\\n            }\\n        </style>\\n    </head'),\n",
-       " Document(page_content='>'),\n",
-       " Document(page_content='<body>'),\n",
-       " Document(page_content='<div>\\n            <h1>🦜️🔗 LangChain</h1>'),\n",
-       " Document(page_content='<p>⚡ Building applications with LLMs through composability ⚡'),\n",
-       " Document(page_content='</p>\\n        </div>'),\n",
-       " Document(page_content='<div>\\n            As an open-source project in a rapidly dev'),\n",
-       " Document(page_content='eloping field, we are extremely open to contributions.'),\n",
-       " Document(page_content='</div>\\n    </body>\\n</html>')]"
-      ]
-     },
-     "execution_count": 51,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "html_text = \"\"\"\n",
-    "<!DOCTYPE html>\n",
-    "<html>\n",
-    "    <head>\n",
-    "        <title>🦜️🔗 LangChain</title>\n",
-    "        <style>\n",
-    "            body {\n",
-    "                font-family: Arial, sans-serif;\n",
-    "            }\n",
-    "            h1 {\n",
-    "                color: darkblue;\n",
-    "            }\n",
-    "        </style>\n",
-    "    </head>\n",
-    "    <body>\n",
-    "        <div>\n",
-    "            <h1>🦜️🔗 LangChain</h1>\n",
-    "            <p>⚡ Building applications with LLMs through composability ⚡</p>\n",
-    "        </div>\n",
-    "        <div>\n",
-    "            As an open-source project in a rapidly developing field, we are extremely open to contributions.\n",
-    "        </div>\n",
-    "    </body>\n",
-    "</html>
\n", - " \n", - "\n", - "\"\"\"\n", - "\n", - "html_splitter = RecursiveCharacterTextSplitter.from_language(\n", - " language=Language.HTML, chunk_size=60, chunk_overlap=0\n", - ")\n", - "\n", - "html_docs = html_splitter.create_documents([html_text])\n", - "html_docs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Solidity\n", - "\n", - "Here is an example of using the Solidity text splitter:\n", - "\n", - "- The Solidity code is stored in the `SOL_CODE` variable as a string.\n", - "- The `RecursiveCharacterTextSplitter` is used to create `sol_splitter`, which splits the Solidity code into chunks.\n", - " - The `language` parameter is set to `Language.SOL` to specify the Solidity language.\n", - " - The `chunk_size` is set to 128 to specify the maximum size of each chunk.\n", - " - The `chunk_overlap` is set to 0 to prevent overlap between chunks.\n", - " \n", - "- The `sol_splitter.create_documents()` method is used to split `SOL_CODE` into chunks and store the split chunks in the `sol_docs` variable.\n", - "- The `sol_docs` are output to verify the split Solidity code chunks.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 52, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[Document(page_content='pragma solidity ^0.8.20;'),\n", - " Document(page_content='contract HelloWorld { \\n function add(uint a, uint b) pure public returns(uint) {\\n return a + b;\\n }\\n}')]" - ] - }, - "execution_count": 52, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "SOL_CODE = \"\"\"\n", - "pragma solidity ^0.8.20; \n", - "contract HelloWorld { \n", - " function add(uint a, uint b) pure public returns(uint) {\n", - " return a + b;\n", - " }\n", - "}\n", - "\"\"\"\n", - "\n", - "sol_splitter = RecursiveCharacterTextSplitter.from_language(\n", - " language=Language.SOL, chunk_size=128, chunk_overlap=0\n", - ")\n", - "\n", - "sol_docs = sol_splitter.create_documents([SOL_CODE])\n", - "sol_docs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### C#\n", - "\n", - "Here is an example of using the C# text splitter.\n", - "- Specify `Language.CSHARP` as the `language` parameter to use the C# language.\n", - "- Set `chunk_size` to 128 to limit the maximum size of each document.\n", - "- Set `chunk_overlap` to 0 to disallow overlap between documents." 
- ] - }, - { - "cell_type": "code", - "execution_count": 54, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[Document(page_content='using System;'),\n", - " Document(page_content='class Program\\n{\\n static void Main()\\n {\\n Console.WriteLine(\"Enter a number (1-5):\");'),\n", - " Document(page_content='int input = Convert.ToInt32(Console.ReadLine());\\n for (int i = 1; i <= input; i++)\\n {'),\n", - " Document(page_content='if (i % 2 == 0)\\n {\\n Console.WriteLine($\"{i} is even.\");\\n }\\n else'),\n", - " Document(page_content='{\\n Console.WriteLine($\"{i} is odd.\");\\n }\\n }\\n Console.WriteLine(\"Goodbye!\");'),\n", - " Document(page_content='}\\n}')]" - ] - }, - "execution_count": 54, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "C_CODE = \"\"\"\n", - "using System;\n", - "class Program\n", - "{\n", - " static void Main()\n", - " {\n", - " Console.WriteLine(\"Enter a number (1-5):\");\n", - " int input = Convert.ToInt32(Console.ReadLine());\n", - " for (int i = 1; i <= input; i++)\n", - " {\n", - " if (i % 2 == 0)\n", - " {\n", - " Console.WriteLine($\"{i} is even.\");\n", - " }\n", - " else\n", - " {\n", - " Console.WriteLine($\"{i} is odd.\");\n", - " }\n", - " }\n", - " Console.WriteLine(\"Goodbye!\");\n", - " }\n", - "}\n", - "\"\"\"\n", - "\n", - "c_splitter = RecursiveCharacterTextSplitter.from_language(\n", - " language=Language.CSHARP, chunk_size=128, chunk_overlap=0\n", - ")\n", - "\n", - "c_docs = c_splitter.create_documents([C_CODE])\n", - "c_docs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### PHP\n", - "\n", - "Here is an example of using the PHP text splitter.\n", - "- Specify `Language.PHP` as the `language` parameter to use the PHP language.\n", - "- Set `chunk_size` to 50 to limit the maximum size of each document.\n", - "- Set `chunk_overlap` to 0 to disallow overlap between documents." 
- ] - }, - { - "cell_type": "code", - "execution_count": 56, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[Document(page_content=''),\n", - " Document(page_content='println(\"Name: ${file.name} | Last Write Time: ${file.lastModified()}\")\\n }\\n}')]" - ] - }, - "execution_count": 65, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "KOTLIN_CODE = \"\"\"\n", - "fun main() {\n", - " val directoryPath = System.getProperty(\"user.dir\")\n", - " val files = File(directoryPath).listFiles()?.filter { !it.isDirectory }?.sortedBy { it.lastModified() } ?: emptyArray()\n", - "\n", - " files.forEach { file ->\n", - " println(\"Name: ${file.name} | Last Write Time: ${file.lastModified()}\")\n", - " }\n", - "}\n", - "\"\"\"\n", - "\n", - "kotlin_splitter = RecursiveCharacterTextSplitter.from_language(\n", - " language=Language.KOTLIN, chunk_size=100, chunk_overlap=0\n", - ")\n", - "\n", - "kotlin_docs = kotlin_splitter.create_documents([KOTLIN_CODE])\n", - "kotlin_docs" - ] - } + "output_type": "execute_result" + } + ], + "source": [ + "C_CODE = \"\"\"\n", + "using System;\n", + "class Program\n", + "{\n", + " static void Main()\n", + " {\n", + " Console.WriteLine(\"Enter a number (1-5):\");\n", + " int input = Convert.ToInt32(Console.ReadLine());\n", + " for (int i = 1; i <= input; i++)\n", + " {\n", + " if (i % 2 == 0)\n", + " {\n", + " Console.WriteLine($\"{i} is even.\");\n", + " }\n", + " else\n", + " {\n", + " Console.WriteLine($\"{i} is odd.\");\n", + " }\n", + " }\n", + " Console.WriteLine(\"Goodbye!\");\n", + " }\n", + "}\n", + "\"\"\"\n", + "\n", + "c_splitter = RecursiveCharacterTextSplitter.from_language(\n", + " language=Language.CSHARP, chunk_size=128, chunk_overlap=0\n", + ")\n", + "\n", + "c_docs = c_splitter.create_documents([C_CODE])\n", + "c_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### PHP\n", + "\n", + "Here's how to split PHP code into smaller chunks using the `RecursiveCharacterTextSplitter`.\n", + "- First, specify `Language.PHP` for the `language` parameter. It tells the splitter you're working with PHP code.\n", + "- Then, set `chunk_size` to 50. This limits the size of each resulting chunk to a maximum of 50 characters.\n", + "- Finally, set `chunk_overlap` to 0. It prevents any of the chunks from overlapping." 
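If you want to see why the PHP and Kotlin chunks below fall where they do, you can inspect the pre-built separator lists for any supported language, just as the Python separators were inspected earlier. A small sketch:

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Compare the pre-built separator lists that drive the chunk boundaries.
for lang in (Language.PHP, Language.KOTLIN, Language.SOL):
    separators = RecursiveCharacterTextSplitter.get_separators_for_language(lang)
    print(f"{lang.value:>8}: {separators}")
```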
+ ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={}, page_content=''),\n", + " Document(metadata={}, page_content='println(\"Name: ${file.name} | Last Write Time: ${file.lastModified()}\")\\n }\\n}')]" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "KOTLIN_CODE = \"\"\"\n", + "fun main() {\n", + " val directoryPath = System.getProperty(\"user.dir\")\n", + " val files = File(directoryPath).listFiles()?.filter { !it.isDirectory }?.sortedBy { it.lastModified() } ?: emptyArray()\n", + "\n", + " files.forEach { file ->\n", + " println(\"Name: ${file.name} | Last Write Time: ${file.lastModified()}\")\n", + " }\n", + "}\n", + "\"\"\"\n", + "\n", + "kotlin_splitter = RecursiveCharacterTextSplitter.from_language(\n", + " language=Language.KOTLIN, chunk_size=100, chunk_overlap=0\n", + ")\n", + "\n", + "kotlin_docs = kotlin_splitter.create_documents([KOTLIN_CODE])\n", + "kotlin_docs" + ] + } ], "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } }, "nbformat": 4, "nbformat_minor": 2 -} \ No newline at end of file + } \ No newline at end of file