Merged
320 changes: 320 additions & 0 deletions 06-DocumentLoader/08-TXT-Loader.ipynb
@@ -0,0 +1,320 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TXT Loader\n",
"\n",
"- Author: [seofield](https://github.com/seofield)\n",
"- Design:\n",
"- Peer Review: [suhyun0115](https://github.com/suhyun0115) [HarryKane11](https://github.com/HarryKane11)\n",
"- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
"\n",
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [![Open in LangChain Academy](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/66e9eba12c7b7688aa3dbb5e_LCA-badge-green.svg)](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)\n",
"\n",
"## Overview\n",
"\n",
"This tutorial shows how to use LangChain’s `TextLoader` to load and process individual text files.\n",
"\n",
"You’ll learn how to access each loaded document’s metadata and content, which makes it easier to prepare text data for downstream processing.\n",
"\n",
"\n",
"### Table of Contents\n",
"\n",
"- [Overview](#overview)\n",
"- [Environment Setup](#environment-setup)\n",
"- [TXT Loader](#txt-loader)\n",
"- [Automatic Encoding Detection with TextLoader](#automatic-encoding-detection-with-textloader)\n",
"\n",
"----"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Environment Setup\n",
"\n",
"Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
"\n",
"**[Note]**\n",
"- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials.\n",
"- Check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%%capture --no-stderr\n",
"!pip install langchain-opentutorial"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Install required packages\n",
"from langchain_opentutorial import package\n",
"\n",
"package.install(\n",
" [\n",
" \"langchain\",\n",
" \"langchain_community\",\n",
" \"chardet\"\n",
" ],\n",
" verbose=False,\n",
" upgrade=False,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TXT Loader\n",
"\n",
"Let’s explore how to load a file with the `.txt` extension using `TextLoader`."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of documents: 1\n",
"\n",
"[Metadata]\n",
"\n",
"{'source': 'data/appendix-keywords.txt'}\n",
"\n",
"========= [Preview - First 500 Characters] =========\n",
"\n",
"Semantic Search\n",
"\n",
"Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.\n",
"Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”\n",
"Related Keywords: Natural Language Processing, Search Algorithms, Data Mining\n",
"\n",
"Embedding\n",
"\n",
"Definition: Embedding is the process of converting textual data, such as words\n"
]
}
],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"\n",
"# Create a text loader\n",
"loader = TextLoader(\"data/appendix-keywords.txt\", encoding=\"utf-8\")\n",
"\n",
"# Load the document\n",
"docs = loader.load()\n",
"print(f\"Number of documents: {len(docs)}\\n\")\n",
"print(\"[Metadata]\\n\")\n",
"print(docs[0].metadata)\n",
"print(\"\\n========= [Preview - First 500 Characters] =========\\n\")\n",
"print(docs[0].page_content[:500])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Automatic Encoding Detection with TextLoader\n",
"\n",
"In this example, we explore strategies for using the `TextLoader` class to load batches of files from a directory when their encodings vary.\n",
"\n",
"To illustrate the problem, we’ll load multiple text files that are saved in different encodings.\n",
"\n",
"- `silent_errors`: Passing `silent_errors=True` to `DirectoryLoader` skips files that cannot be loaded, so the loading process continues without interruption.\n",
"- `autodetect_encoding`: Passing `autodetect_encoding=True` to the loader class (here via `loader_kwargs`) lets `TextLoader` try to detect a file’s encoding before failing.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import DirectoryLoader\n",
"\n",
"path = \"data/\"\n",
"\n",
"text_loader_kwargs = {\"autodetect_encoding\": True}\n",
"\n",
"loader = DirectoryLoader(\n",
" path,\n",
" glob=\"**/*.txt\",\n",
" loader_cls=TextLoader,\n",
" silent_errors=True,\n",
" loader_kwargs=text_loader_kwargs,\n",
")\n",
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `data/appendix-keywords.txt` file and its similarly named variants are each saved in a different encoding.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['data/appendix-keywords-CP949.txt',\n",
" 'data/appendix-keywords-EUCKR.txt',\n",
" 'data/appendix-keywords.txt',\n",
" 'data/appendix-keywords-utf8.txt']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"doc_sources = [doc.metadata[\"source\"] for doc in docs]\n",
"doc_sources"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Metadata]\n",
"\n",
"{'source': 'data/appendix-keywords-CP949.txt'}\n",
"\n",
"========= [Preview - First 500 Characters] =========\n",
"\n",
"Semantic Search\n",
"\n",
"Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user¡¯s query to return relevant results.\n",
"Example: If a user searches for ¡°planets in the solar system,¡± the system might return information about related planets such as ¡°Jupiter¡± or ¡°Mars.¡±\n",
"Related Keywords: Natural Language Processing, Search Algorithms, Data Mining\n",
"\n",
"Embedding\n",
"\n",
"Definition: Embedding is the process of converting textual data, such a\n"
]
}
],
"source": [
"print(\"[Metadata]\\n\")\n",
"print(docs[0].metadata)\n",
"print(\"\\n========= [Preview - First 500 Characters] =========\\n\")\n",
"print(docs[0].page_content[:500])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Metadata]\n",
"\n",
"{'source': 'data/appendix-keywords-EUCKR.txt'}\n",
"\n",
"========= [Preview - First 500 Characters] =========\n",
"\n",
"Semantic Search\n",
"\n",
"Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user¡¯s query to return relevant results.\n",
"Example: If a user searches for ¡°planets in the solar system,¡± the system might return information about related planets such as ¡°Jupiter¡± or ¡°Mars.¡±\n",
"Related Keywords: Natural Language Processing, Search Algorithms, Data Mining\n",
"\n",
"Embedding\n",
"\n",
"Definition: Embedding is the process of converting textual data, such a\n"
]
}
],
"source": [
"print(\"[Metadata]\\n\")\n",
"print(docs[1].metadata)\n",
"print(\"\\n========= [Preview - First 500 Characters] =========\\n\")\n",
"print(docs[1].page_content[:500])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Metadata]\n",
"\n",
"{'source': 'data/appendix-keywords-utf8.txt'}\n",
"\n",
"========= [Preview - First 500 Characters] =========\n",
"\n",
"Semantic Search\n",
"\n",
"Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.\n",
"Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”\n",
"Related Keywords: Natural Language Processing, Search Algorithms, Data Mining\n",
"\n",
"Embedding\n",
"\n",
"Definition: Embedding is the process of converting textual data, such as words\n"
]
}
],
"source": [
"print(\"[Metadata]\\n\")\n",
"print(docs[3].metadata)\n",
"print(\"\\n========= [Preview - First 500 Characters] =========\\n\")\n",
"print(docs[3].page_content[:500])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "langchain-opentutorial-99wpaVyw-py3.11",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
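
The idea behind the notebook's `autodetect_encoding` option, trying another encoding when the first decode fails, can be illustrated with the standard library alone. The sketch below is not `TextLoader`'s implementation: it assumes a fixed, hand-picked list of candidate encodings instead of real detection, and the names `CANDIDATE_ENCODINGS` and `read_text_with_fallback` are hypothetical helpers introduced only for this example.

```python
import tempfile
from pathlib import Path

# Hypothetical candidate list; a real detector would rank encodings per file.
CANDIDATE_ENCODINGS = ("utf-8", "cp949")


def read_text_with_fallback(path, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate does not fit the bytes; try the next one.
            continue
    raise ValueError(f"None of {encodings} could decode {path}")


# Write the same Korean sample text in two encodings, then read both back.
sample = "의미 검색 (Semantic Search)"
results = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for encoding in ("utf-8", "cp949"):
        file_path = Path(tmp_dir) / f"sample-{encoding}.txt"
        file_path.write_bytes(sample.encode(encoding))
        results[encoding] = read_text_with_fallback(file_path)

for written_as, (text, decoded_as) in results.items():
    print(f"written as {written_as}, decoded as {decoded_as}: {text}")
```

A decode that succeeds is not necessarily the right one: bytes written in one encoding can decode without error under another and yield the wrong characters, so a fallback order like this is only a heuristic.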