LangChain-OpenTutorial · BAEM1N · Jan 2, 2025 · Dec 31, 2024
diff --git a/06-DocumentLoader/08-TXT-Loader.ipynb b/06-DocumentLoader/08-TXT-Loader.ipynb
@@ -0,0 +1,320 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# TXT Loader\n",
+    "\n",
+    "- Author: [seofield](https://github.com/seofield)\n",
+    "- Design:\n",
+    "- Peer Review: [suhyun0115](https://github.com/suhyun0115) [HarryKane11](https://github.com/HarryKane11)\n",
+    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
+    "\n",
+    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [![Open in LangChain Academy](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/66e9eba12c7b7688aa3dbb5e_LCA-badge-green.svg)](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "This tutorial focuses on using LangChain’s TextLoader to efficiently load and process individual text files. \n",
+    "\n",
+    "You’ll learn how to extract metadata and content, making it easier to prepare text data.\n",
+    "\n",
+    "\n",
+    "### Table of Contents\n",
+    "\n",
+    "- [Overview](#overview)\n",
+    "- [Environment Setup](#environment-setup)\n",
+    "- [TXT Loader](#txt-loader)\n",
+    "- [Automatic Encoding Detection with TextLoader](#automatic-encoding-detection-with-textloader)\n",
+    "\n",
+    "----"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Environment Setup\n",
+    "\n",
+    "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
+    "\n",
+    "**[Note]**\n",
+    "- `langchain-opentutorial` is a package that provides easy-to-use environment setup tools, helper functions, and utilities for tutorials.\n",
+    "- Check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture --no-stderr\n",
+    "!pip install langchain-opentutorial"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install required packages\n",
+    "from langchain_opentutorial import package\n",
+    "\n",
+    "package.install(\n",
+    "    [\n",
+    "        \"langchain\",\n",
+    "        \"langchain_community\",\n",
+    "        \"chardet\"\n",
+    "    ],\n",
+    "    verbose=False,\n",
+    "    upgrade=False,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## TXT Loader\n",
+    "\n",
+    "Let’s load a file with the `.txt` extension using `TextLoader`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Number of documents: 1\n",
+      "\n",
+      "[Metadata]\n",
+      "\n",
+      "{'source': 'data/appendix-keywords.txt'}\n",
+      "\n",
+      "========= [Preview - First 500 Characters] =========\n",
+      "\n",
+      "Semantic Search\n",
+      "\n",
+      "Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.\n",
+      "Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”\n",
+      "Related Keywords: Natural Language Processing, Search Algorithms, Data Mining\n",
+      "\n",
+      "Embedding\n",
+      "\n",
+      "Definition: Embedding is the process of converting textual data, such as words\n"
+     ]
+    }
+   ],
+   "source": [
+    "from langchain_community.document_loaders import TextLoader\n",
+    "\n",
+    "# Create a text loader\n",
+    "loader = TextLoader(\"data/appendix-keywords.txt\", encoding=\"utf-8\")\n",
+    "\n",
+    "# Load the document\n",
+    "docs = loader.load()\n",
+    "print(f\"Number of documents: {len(docs)}\\n\")\n",
+    "print(\"[Metadata]\\n\")\n",
+    "print(docs[0].metadata)\n",
+    "print(\"\\n========= [Preview - First 500 Characters] =========\\n\")\n",
+    "print(docs[0].page_content[:500])"
+   ]
+  },
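+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The loader above returns a single document whose `page_content` holds the file's text and whose `metadata` records the source path. As a rough mental model only (this is not LangChain's actual implementation), the same shape can be sketched with the standard library. `SimpleDocument`, `load_text`, and `docs_sketch` are hypothetical names introduced for illustration."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import tempfile\n",
+    "from dataclasses import dataclass, field\n",
+    "from pathlib import Path\n",
+    "\n",
+    "\n",
+    "@dataclass\n",
+    "class SimpleDocument:\n",
+    "    # Hypothetical stand-in for a loaded document\n",
+    "    page_content: str\n",
+    "    metadata: dict = field(default_factory=dict)\n",
+    "\n",
+    "\n",
+    "def load_text(path, encoding=\"utf-8\"):\n",
+    "    # Read the whole file and wrap it in a single document\n",
+    "    text = Path(path).read_text(encoding=encoding)\n",
+    "    return [SimpleDocument(page_content=text, metadata={\"source\": str(path)})]\n",
+    "\n",
+    "\n",
+    "# Write a small temporary file so the sketch runs without the tutorial data\n",
+    "with tempfile.TemporaryDirectory() as tmp:\n",
+    "    sample = Path(tmp) / \"sample.txt\"\n",
+    "    sample.write_text(\"Semantic Search\", encoding=\"utf-8\")\n",
+    "    docs_sketch = load_text(sample)\n",
+    "\n",
+    "print(f\"Number of documents: {len(docs_sketch)}\")\n",
+    "print(docs_sketch[0].page_content)"
+   ]
+  },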
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Automatic Encoding Detection with TextLoader\n",
+    "\n",
+    "This example shows two options for using `TextLoader` with `DirectoryLoader` to load a batch of files from a directory when their encodings vary.\n",
+    "\n",
+    "To illustrate, we’ll load multiple text files with differing encodings from a single directory.\n",
+    "\n",
+    "- `silent_errors`: Passing `silent_errors=True` to `DirectoryLoader` skips files that cannot be loaded, so the loading process continues without interruption.\n",
+    "- `autodetect_encoding`: Passing `autodetect_encoding=True` to the loader class (here through `loader_kwargs`) lets `TextLoader` try to detect a file's encoding before failing.\n"
+   ]
+  },
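+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a rough illustration of why encoding handling matters, the sketch below uses only the standard library. It encodes a Korean phrase as CP949, shows that decoding those bytes as UTF-8 fails, and then tries candidate encodings in order. `decode_with_fallback` and its fixed candidate list are hypothetical names introduced for illustration; they are a simplification and not how `TextLoader` detects encodings."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def decode_with_fallback(raw, candidates=(\"utf-8\", \"cp949\")):\n",
+    "    # Return (text, encoding) for the first candidate that decodes cleanly\n",
+    "    for encoding in candidates:\n",
+    "        try:\n",
+    "            return raw.decode(encoding), encoding\n",
+    "        except UnicodeDecodeError:\n",
+    "            continue\n",
+    "    raise ValueError(\"none of the candidate encodings could decode the data\")\n",
+    "\n",
+    "\n",
+    "# Korean for \"semantic search\", stored as CP949 bytes\n",
+    "raw = \"의미 검색\".encode(\"cp949\")\n",
+    "\n",
+    "# Decoding with the wrong encoding raises instead of returning text\n",
+    "try:\n",
+    "    raw.decode(\"utf-8\")\n",
+    "except UnicodeDecodeError as err:\n",
+    "    print(\"utf-8 failed:\", err.reason)\n",
+    "\n",
+    "# The fallback helper moves on to the next candidate\n",
+    "text, encoding = decode_with_fallback(raw)\n",
+    "print(encoding, text)"
+   ]
+  },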
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.document_loaders import DirectoryLoader\n",
+    "\n",
+    "path = \"data/\"\n",
+    "\n",
+    "text_loader_kwargs = {\"autodetect_encoding\": True}\n",
+    "\n",
+    "loader = DirectoryLoader(\n",
+    "    path,\n",
+    "    glob=\"**/*.txt\",\n",
+    "    loader_cls=TextLoader,\n",
+    "    silent_errors=True,\n",
+    "    loader_kwargs=text_loader_kwargs,\n",
+    ")\n",
+    "docs = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `data/appendix-keywords.txt` file and its similarly named variants are saved in different encodings.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['data/appendix-keywords-CP949.txt',\n",
+       " 'data/appendix-keywords-EUCKR.txt',\n",
+       " 'data/appendix-keywords.txt',\n",
+       " 'data/appendix-keywords-utf8.txt']"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "doc_sources = [doc.metadata[\"source\"] for doc in docs]\n",
+    "doc_sources"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[Metadata]\n",
+      "\n",
+      "{'source': 'data/appendix-keywords-CP949.txt'}\n",
+      "\n",
+      "========= [Preview - First 500 Characters] =========\n",
+      "\n",
+      "Semantic Search\n",
+      "\n",
+      "Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user¡¯s query to return relevant results.\n",
+      "Example: If a user searches for ¡°planets in the solar system,¡± the system might return information about related planets such as ¡°Jupiter¡± or ¡°Mars.¡±\n",
+      "Related Keywords: Natural Language Processing, Search Algorithms, Data Mining\n",
+      "\n",
+      "Embedding\n",
+      "\n",
+      "Definition: Embedding is the process of converting textual data, such a\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(\"[Metadata]\\n\")\n",
+    "print(docs[0].metadata)\n",
+    "print(\"\\n========= [Preview - First 500 Characters] =========\\n\")\n",
+    "print(docs[0].page_content[:500])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[Metadata]\n",
+      "\n",
+      "{'source': 'data/appendix-keywords-EUCKR.txt'}\n",
+      "\n",
+      "========= [Preview - First 500 Characters] =========\n",
+      "\n",
+      "Semantic Search\n",
+      "\n",
+      "Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user¡¯s query to return relevant results.\n",
+      "Example: If a user searches for ¡°planets in the solar system,¡± the system might return information about related planets such as ¡°Jupiter¡± or ¡°Mars.¡±\n",
+      "Related Keywords: Natural Language Processing, Search Algorithms, Data Mining\n",
+      "\n",
+      "Embedding\n",
+      "\n",
+      "Definition: Embedding is the process of converting textual data, such a\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(\"[Metadata]\\n\")\n",
+    "print(docs[1].metadata)\n",
+    "print(\"\\n========= [Preview - First 500 Characters] =========\\n\")\n",
+    "print(docs[1].page_content[:500])"
+   ]
+  },
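+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The CP949 and EUC-KR previews above render the curly quotes as `¡¯`, `¡°`, and `¡±`. That pattern is what appears when the CP949 bytes for those quotes are decoded with a single-byte Western encoding, which suggests (though this notebook does not confirm it) that detection guessed such an encoding for these mostly ASCII files. A minimal standard-library sketch that reproduces the effect, assuming Latin-1 as the wrong guess:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Curly quotes as they appear in the UTF-8 version of the file\n",
+    "quotes = \"’“”\"\n",
+    "\n",
+    "# The same characters as stored in a CP949-encoded file\n",
+    "raw_quotes = quotes.encode(\"cp949\")\n",
+    "\n",
+    "# Latin-1 (an assumption here) maps every byte to its own character,\n",
+    "# so each two-byte CP949 quote turns into two unrelated symbols\n",
+    "print(raw_quotes.decode(\"latin-1\"))  # ¡¯¡°¡±\n",
+    "\n",
+    "# Decoding with the matching encoding restores the original characters\n",
+    "print(raw_quotes.decode(\"cp949\"))\n",
+    "\n",
+    "# When a file's encoding is known, it can be passed explicitly instead, e.g.\n",
+    "# TextLoader(\"data/appendix-keywords-CP949.txt\", encoding=\"cp949\")"
+   ]
+  },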
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[Metadata]\n",
+      "\n",
+      "{'source': 'data/appendix-keywords-utf8.txt'}\n",
+      "\n",
+      "========= [Preview - First 500 Characters] =========\n",
+      "\n",
+      "Semantic Search\n",
+      "\n",
+      "Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.\n",
+      "Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”\n",
+      "Related Keywords: Natural Language Processing, Search Algorithms, Data Mining\n",
+      "\n",
+      "Embedding\n",
+      "\n",
+      "Definition: Embedding is the process of converting textual data, such as words\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(\"[Metadata]\\n\")\n",
+    "print(docs[3].metadata)\n",
+    "print(\"\\n========= [Preview - First 500 Characters] =========\\n\")\n",
+    "print(docs[3].page_content[:500])"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "langchain-opentutorial-99wpaVyw-py3.11",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}