From 81e4d2ffd82992248687f1be3d8ecb33c1e277f7 Mon Sep 17 00:00:00 2001 From: Gwangwon Jung Date: Sun, 6 Apr 2025 02:11:58 +0900 Subject: [PATCH 1/5] [N-2] 09-Vector Store / 99-Master-Template - vectorstore interface notebook template file - unified content structure --- 09-VectorStore/99-Master-Template.ipynb | 693 ++++++++++++++++++++++++ 1 file changed, 693 insertions(+) create mode 100644 09-VectorStore/99-Master-Template.ipynb diff --git a/09-VectorStore/99-Master-Template.ipynb b/09-VectorStore/99-Master-Template.ipynb new file mode 100644 index 000000000..24003d355 --- /dev/null +++ b/09-VectorStore/99-Master-Template.ipynb @@ -0,0 +1,693 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "25733da0", + "metadata": {}, + "source": [ + "# {VectorStore Name}\n", + "\n", + "- Author: [Author Name](#Author's-Profile-Link)\n", + "- Design: [Designer](#Designer's-Profile-Link)\n", + "- Peer Review: [Reviewer Name](#Reviewer-Profile-Link)\n", + "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n", + "\n", + "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/your-notebook-file-name) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/your-notebook-file-name)\n", + "\n", + "## Overview\n", + "\n", + "This tutorial covers how to use **{Vector Store Name}** with **LangChain** .\n", + "\n", + "{A short introduction to vectordb}\n", + "\n", + "This tutorial walks you through using **CRUD** operations with **{VectorDB}** : **storing** , **updating** , and **deleting** documents, and performing **similarity-based retrieval** .\n", + "\n", + "### Table of Contents\n", + "\n", + "- [Overview](#overview)\n", + "- 
[Environment Setup](#environment-setup)\n", + "- [What is {vectordb}?](#what-is-{vectordb}?)\n", + "- [Data](#data)\n", + " - [Introduce Data](#introduce-data)\n", + " - [Preprocessing Data](#preprocessing-data)\n", + "- [Initial Setting {vectordb}](#initial-setting-{vectordb})\n", + " - [Load Embedding Model](#load-embedding-model)\n", + " - [Load {vectordb} Client](#load-{vectordb}-client)\n", + "- [Document Manager](#document-manager)\n", + " - [Create Instance](#create-instance)\n", + " - [Upsert Document](#upsert-document)\n", + " - [Upsert Parallel Document](#upsert-parallel-document)\n", + " - [Similarity Search](#similarity-search)\n", + " - [Delete Document](#delete-document)\n", + "\n", + "\n", + "### References\n", + "----" + ] + }, + { + "cell_type": "markdown", + "id": "c1fac085", + "metadata": {}, + "source": [ + "## Environment Setup\n", + "\n", + "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n", + "\n", + "**[Note]**\n", + "- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n", + "- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "98da7994", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "%%capture --no-stderr\n", + "%pip install langchain-opentutorial" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "800c732b", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "# Install required packages\n", + "from langchain_opentutorial import package\n", + "\n", + "package.install(\n", + " [\n", + " \"langsmith\",\n", + " \"langchain-core\",\n", + " \"python-dotenv\",\n", + " ],\n", + " verbose=False,\n", + " upgrade=False,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5b36bafa", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "# Set environment variables\n", + "from langchain_opentutorial import set_env\n", + "\n", + "set_env(\n", + " {\n", + " \"OPENAI_API_KEY\": \"\",\n", + " \"LANGCHAIN_API_KEY\": \"\",\n", + " \"LANGCHAIN_TRACING_V2\": \"true\",\n", + " \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n", + " \"LANGCHAIN_PROJECT\": \"{Project Name}\",\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "8011a0c7", + "metadata": {}, + "source": [ + "You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.\n", + "\n", + "[Note] This is not necessary if you've already set the required API keys in previous steps." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "70d7e764", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "from dotenv import load_dotenv\n", + "\n", + "load_dotenv(override=True)" + ] + }, + { + "cell_type": "markdown", + "id": "6890920d", + "metadata": {}, + "source": [ + "Please write down what you need to set up the Vectorstore here." 
+ ] + }, + { + "cell_type": "markdown", + "id": "6f3b5bd2", + "metadata": {}, + "source": [ + "## Data\n", + "\n", + "This part walks you through the **data preparation process** .\n", + "\n", + "This section includes the following components:\n", + "\n", + "- Introduce Data\n", + "\n", + "- Preprocessing Data\n" + ] + }, + { + "cell_type": "markdown", + "id": "508ae7f7", + "metadata": {}, + "source": [ + "### Introduce Data\n", + "\n", + "In this tutorial, we will use the fairy tale **πŸ“— The Little Prince** in PDF format as our data.\n", + "\n", + "This material complies with the **Apache 2.0 license** .\n", + "\n", + "The data is used in a text (.txt) format converted from the original PDF.\n", + "\n", + "You can view the data at the link below.\n", + "- [Data Link](https://huggingface.co/datasets/sohyunwriter/the_little_prince)" + ] + }, + { + "cell_type": "markdown", + "id": "004ea4f4", + "metadata": {}, + "source": [ + "### Preprocessing Data\n", + "\n", + "In this tutorial section, we will preprocess the text data from The Little Prince and convert it into a list of `LangChain Document` objects with metadata. \n", + "\n", + "Each document chunk will include a `title` field in the metadata, extracted from the first line of each section." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8e4cac64", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "from langchain.schema import Document\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "import re\n", + "from typing import List\n", + "\n", + "def preprocessing_data(content:str)->List[Document]:\n", + " # 1. Split the text by double newlines to separate sections\n", + " blocks = content.split(\"\\n\\n\")\n", + "\n", + " # 2. 
Initialize the text splitter\n", + " text_splitter = RecursiveCharacterTextSplitter(\n", + " chunk_size=500, # Maximum number of characters per chunk\n", + " chunk_overlap=50, # Overlap between chunks to preserve context\n", + " separators=[\"\\n\\n\", \"\\n\", \" \"] # Order of priority for splitting\n", + " )\n", + "\n", + " documents = []\n", + "\n", + " # 3. Loop through each section\n", + " for block in blocks:\n", + " lines = block.strip().splitlines()\n", + " if not lines:\n", + " continue\n", + "\n", + " # Extract title from the first line using square brackets [ ]\n", + " first_line = lines[0]\n", + " title_match = re.search(r\"\\[(.*?)\\]\", first_line)\n", + " title = title_match.group(1).strip() if title_match else None\n", + "\n", + " # Remove the title line from content\n", + " body = \"\\n\".join(lines[1:]).strip()\n", + " if not body:\n", + " continue\n", + "\n", + " # 4. Chunk the section using the text splitter\n", + " chunks = text_splitter.split_text(body)\n", + "\n", + " # 5. 
Create a LangChain Document for each chunk with the same title metadata\n", + " for chunk in chunks:\n", + " documents.append(Document(page_content=chunk, metadata={\"title\": title}))\n", + "\n", + " print(f\"Generated {len(documents)} chunked documents.\")\n", + "\n", + " return documents" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1d091a51", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "# Load the entire text file\n", + "with open(\"the_little_prince.txt\", \"r\", encoding=\"utf-8\") as f:\n", + " content = f.read()\n", + "\n", + "# Preprocessing Data\n", + "\n", + "docs = preprocessing_data(content=content)" + ] + }, + { + "cell_type": "markdown", + "id": "1977d4ff", + "metadata": {}, + "source": [ + "## Initial Setting {vectordb}\n", + "\n", + "This part walks you through the initial setup of **{vectordb}** .\n", + "\n", + "This section includes the following components:\n", + "\n", + "- Load Embedding Model\n", + "\n", + "- Load {vectordb} Client" + ] + }, + { + "cell_type": "markdown", + "id": "7eee56b2", + "metadata": {}, + "source": [ + "### Load Embedding Model\n", + "\n", + "In the **Load Embedding Model** section, you'll learn how to load an embedding model.\n", + "\n", + "This tutorial uses **OpenAI** 's **API-Key** for loading the model.\n", + "\n", + "*πŸ’‘ If you prefer to use another embedding model, see the instructions below.*\n", + "- [Embedding Models](https://python.langchain.com/docs/integrations/text_embedding/)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5bd5c3c9", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "import os\n", + "from langchain_openai import OpenAIEmbeddings\n", + "\n", + "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")" + ] + }, + { + "cell_type": "markdown", + "id": "40f65795", + "metadata": {}, + "source": [ + "### Load {vectordb} Client\n", + 
"\n", + "In the **Load {vectordb} Client** section, we cover how to load the **database client object** using the **Python SDK** for **{vectordb}** .\n", + "- [Python SDK Docs]()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eed0ebad", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "# Create Database Client Object Function\n", + "\n", + "def get_db_client():\n", + " \"\"\"\n", + " Initializes and returns a VectorStore client instance.\n", + "\n", + " This function loads configuration (e.g., API key, host) from environment\n", + " variables or default values and creates a client object to interact\n", + " with the {vectordb} Python SDK.\n", + "\n", + " Returns:\n", + " client:ClientType - An instance of the {vectordb} client.\n", + "\n", + " Raises:\n", + " ValueError: If required configuration is missing.\n", + " \"\"\"\n", + " return client" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b5f4116", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "# Get DB Client Object\n", + "\n", + "client = get_db_client()" + ] + }, + { + "cell_type": "markdown", + "id": "3a5a97a0", + "metadata": {}, + "source": [ + "## Document Manager\n", + "\n", + "To support the **Langchain-Opentutorial** , we implemented a custom set of **CRUD** functionalities for VectorDBs. 
\n", + "\n", + "The following operations are included:\n", + "\n", + "- `upsert` : Update existing documents or insert if they don’t exist\n", + "\n", + "- `upsert_parallel` : Perform upserts in parallel for large-scale data\n", + "\n", + "- `similarity_search` : Search for similar documents based on embeddings\n", + "\n", + "- `delete` : Remove documents based on filter conditions\n", + "\n", + "Each of these features is implemented as a class method specific to each VectorDB.\n", + "\n", + "In this tutorial, you can easily utilize these methods to interact with your VectorDB.\n", + "\n", + "*We plan to continuously expand the functionality by adding more common operations in the future.*" + ] }, { "cell_type": "markdown", "id": "65a40601", "metadata": {}, "source": [ + "### Create Instance\n", + "\n", + "First, we create an instance of the **{vectordb}** helper class to use its CRUD functionalities.\n", + "\n", + "This class is initialized with the **{vectordb} Python SDK client instance** and the **embedding model instance** , both of which were defined in the previous section." + ] }, { "cell_type": "code", "execution_count": null, "id": "dccab807", "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ + "# crud_manager = (client=client, embedding=embedding)" + ] }, { "cell_type": "markdown", "id": "c1c0c67f", "metadata": {}, "source": [ + "Now you can use the following **CRUD** operations with the `crud_manager` instance.\n", + "\n", + "This instance allows you to easily manage documents in your **{vectordb}** ." 
+ ] + }, + { + "cell_type": "markdown", + "id": "7c6c53c5", + "metadata": {}, + "source": [ + "### Upsert Document\n", + "\n", + "**Update** existing documents or **insert** if they don’t exist\n", + "\n", + "**βœ… Args**\n", + "\n", + "- `texts` : Iterable[str] – List of text contents to be inserted/updated.\n", + "\n", + "- `metadatas` : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).\n", + "\n", + "- `ids` : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.\n", + "\n", + "- `**kwargs` : Extra arguments for the underlying vector store.\n", + "\n", + "**πŸ”„ Return**\n", + "\n", + "- None" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f3a6c32b", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "from uuid import uuid4\n", + "\n", + "args = {\n", + " \"texts\": [doc.page_content for doc in docs[:2]],\n", + " \"metadatas\": [doc.metadata[\"title\"] for doc in docs[:2]],\n", + " \"ids\": [str(uuid4()) for _ in docs[:2]]\n", + " # if you want args, add params.\n", + "}\n", + "\n", + "# crud_manager.upsert(**args)" + ] + }, + { + "cell_type": "markdown", + "id": "278fe1ed", + "metadata": {}, + "source": [ + "### Upsert Parallel Document\n", + "\n", + "Perform **upserts** in **parallel** for large-scale data\n", + "\n", + "**βœ… Args**\n", + "\n", + "- `texts` : Iterable[str] – List of text contents to be inserted/updated.\n", + "\n", + "- `metadatas` : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).\n", + "\n", + "- `ids` : Optional[List[str]] – Custom IDs for the documents. 
If not provided, IDs will be auto-generated.\n", + "\n", + "- `batch_size` : int – Number of documents per batch (default: 32).\n", + "\n", + "- `workers` : int – Number of parallel workers (default: 10).\n", + "\n", + "- `**kwargs` : Extra arguments for the underlying vector store.\n", + "\n", + "**πŸ”„ Return**\n", + "\n", + "- None" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a89dd8e0", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "from uuid import uuid4\n", + "\n", + "args = {\n", + " \"texts\": [doc.page_content for doc in docs],\n", + " \"metadatas\": [doc.metadata[\"title\"] for doc in docs],\n", + " \"ids\": [str(uuid4()) for _ in docs]\n", + " # if you want args, add params.\n", + "}\n", + "\n", + "# crud_manager.upsert_parallel(**args)" + ] + }, + { + "cell_type": "markdown", + "id": "6beea197", + "metadata": {}, + "source": [ + "### Similarity Search\n", + "\n", + "Search for **similar documents** based on **embeddings** .\n", + "\n", + "This method uses **\"cosine similarity\"** .\n", + "\n", + "\n", + "**βœ… Args**\n", + "\n", + "- `query` : str – The text query for similarity search.\n", + "\n", + "- `k` : int – Number of top results to return (default: 10).\n", + "\n", + "`**kwargs` : Additional search options (e.g., filters).\n", + "\n", + "**πŸ”„ Return**\n", + "\n", + "- `results` : List[Document] – A list of LangChain Document objects ranked by similarity." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5859782b", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "# Search by Query\n", + "\n", + "# results = crud_manager.search(query=\"\",k=3)\n", + "# for idx,doc in enumerate(results):\n", + "# print(f\"Rank {idx} | Title : {doc.metadata['title']}\")\n", + "# print(f\"Contents : {doc.page_content}\")\n", + "# print()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2577dd4a", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "# Filter Search\n", + "\n", + "# results = crud_manager.search(query=\"\",k=3,={\"title\":\"Chapter 4\"})\n", + "# for idx,doc in enumerate(results):\n", + "# print(f\"Rank {idx} | Title : {doc.metadata['title']}\")\n", + "# print(f\"Contents : {doc.page_content}\")\n", + "# print()" + ] + }, + { + "cell_type": "markdown", + "id": "9ad0ed0c", + "metadata": {}, + "source": [ + "### Delete Document\n", + "\n", + "Remove documents based on filter conditions\n", + "\n", + "**βœ… Args**\n", + "\n", + "- `ids` : Optional[List[str]] – List of document IDs to delete. 
If None, deletion is based on filter.\n", + "\n", + "- `filters` : Optional[Dict] – Dictionary specifying filter conditions (e.g., metadata match).\n", + "\n", + "- `**kwargs` : Any additional parameters.\n", + "\n", + "**πŸ”„ Return**\n", + "\n", + "- None" + ] }, { "cell_type": "code", "execution_count": null, "id": "0e3a2c33", "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "# Delete by ids\n", "\n", "# ids = [] # The 'ids' value you want to delete\n", "# crud_manager.delete(ids=ids)" ] }, { "cell_type": "code", "execution_count": null, "id": "60bcb4cf", "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "# Delete by ids with filters\n", "\n", "# ids = [] # The `ids` value corresponding to chapter 6\n", "# crud_manager.delete(ids=ids,filters={\"title\":\"chapter 6\"}) " ] }, { "cell_type": "code", "execution_count": null, "id": "30d42d2e", "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "# Delete All\n", "\n", "# crud_manager.delete()" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 +} From fb93ff8c2291ddfe98e80a9ec8d73d0f9b4fa21e Mon Sep 17 00:00:00 2001 From: Gwangwon Jung Date: Sun, 6 Apr 2025 02:24:26 +0900 Subject: [PATCH 2/5] [N-2] 09-Vector Store / 99-Master-Template - Add a query to the `search` usage code. 
--- 09-VectorStore/99-Master-Template.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/09-VectorStore/99-Master-Template.ipynb b/09-VectorStore/99-Master-Template.ipynb index 24003d355..b12e264a2 100644 --- a/09-VectorStore/99-Master-Template.ipynb +++ b/09-VectorStore/99-Master-Template.ipynb @@ -583,7 +583,7 @@ "source": [ "# Search by Query\n", "\n", - "# results = crud_manager.search(query=\"\",k=3)\n", + "# results = crud_manager.search(query=\"What is essential is invisible to the eye.\",k=3)\n", "# for idx,doc in enumerate(results):\n", "# print(f\"Rank {idx} | Title : {doc.metadata['title']}\")\n", "# print(f\"Contents : {doc.page_content}\")\n", @@ -603,7 +603,7 @@ "source": [ "# Filter Search\n", "\n", - "# results = crud_manager.search(query=\"\",k=3,={\"title\":\"Chapter 4\"})\n", + "# results = crud_manager.search(query=\"Which asteroid did the little prince come from?\",k=3,={\"title\":\"Chapter 4\"})\n", "# for idx,doc in enumerate(results):\n", "# print(f\"Rank {idx} | Title : {doc.metadata['title']}\")\n", "# print(f\"Contents : {doc.page_content}\")\n", From e117dfd655f6ad1533f344c15cb1611dbdba686a Mon Sep 17 00:00:00 2001 From: Gwangwon Jung Date: Tue, 29 Apr 2025 22:57:50 +0900 Subject: [PATCH 3/5] [N-2] 09-Vector Store / 99-Master-Template - Remove `###` content in `Table of content`. 
- ` -> ``` changed backtick --- 09-VectorStore/99-Master-Template.ipynb | 65 +++++++++++-------------- 1 file changed, 28 insertions(+), 37 deletions(-) diff --git a/09-VectorStore/99-Master-Template.ipynb b/09-VectorStore/99-Master-Template.ipynb index b12e264a2..25fb37698 100644 --- a/09-VectorStore/99-Master-Template.ipynb +++ b/09-VectorStore/99-Master-Template.ipynb @@ -28,17 +28,8 @@ "- [Environment Setup](#environment-setup)\n", "- [What is {vectordb}?](#what-is-{vectordb}?)\n", "- [Data](#data)\n", - " - [Introduce Data](#introduce-data)\n", - " - [Preprocessing Data](#preprocessing-data)\n", "- [Initial Setting {vectordb}](#initial-setting-{vectordb})\n", - " - [Load Embedding Model](#load-embedding-model)\n", - " - [Load {vectordb} Client](#load-{vectordb}-client)\n", "- [Document Manager](#document-manager)\n", - " - [Create Instance](#create-instance)\n", - " - [Upsert Document](#upsert-document)\n", - " - [Upsert Parallel Document](#upsert-parallel-document)\n", - " - [Similarity Search](#similarity-search)\n", - " - [Delete Document](#delete-document)\n", "\n", "\n", "### References\n", @@ -55,8 +46,8 @@ "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n", "\n", "**[Note]**\n", - "- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n", - "- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details." + "- ```langchain-opentutorial``` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n", + "- You can checkout the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details." 
] }, { @@ -129,7 +120,7 @@ "id": "8011a0c7", "metadata": {}, "source": [ - "You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.\n", + "You can alternatively set API keys such as ```OPENAI_API_KEY``` in a ```.env``` file and load them.\n", "\n", "[Note] This is not necessary if you've already set the required API keys in previous steps." ] @@ -198,9 +189,9 @@ "source": [ "### Preprocessing Data\n", "\n", - "In this tutorial section, we will preprocess the text data from The Little Prince and convert it into a list of `LangChain Document` objects with metadata. \n", + "In this tutorial section, we will preprocess the text data from The Little Prince and convert it into a list of ```LangChain Document``` objects with metadata. \n", "\n", - "Each document chunk will include a `title` field in the metadata, extracted from the first line of each section." + "Each document chunk will include a ```title``` field in the metadata, extracted from the first line of each section." 
] }, { @@ -305,7 +296,7 @@ "\n", "In the **Load Embedding Model** section, you'll learn how to load an embedding model.\n", "\n", - "This tutorial uses **OpenAI** 's **API-Key** for loading the model.\n", + "This tutorial uses **OpenAI's** **API-Key** for loading the model.\n", "\n", "*πŸ’‘ If you prefer to use another embedding model, see the instructions below.*\n", "- [Embedding Models](https://python.langchain.com/docs/integrations/text_embedding/)" @@ -396,13 +387,13 @@ "\n", "The following operations are included:\n", "\n", - "- `upsert` : Update existing documents or insert if they don’t exist\n", + "- ```upsert``` : Update existing documents or insert if they don’t exist\n", "\n", - "- `upsert_parallel` : Perform upserts in parallel for large-scale data\n", + "- ```upsert_parallel``` : Perform upserts in parallel for large-scale data\n", "\n", - "- `similarity_search` : Search for similar documents based on embeddings\n", + "- ```similarity_search``` : Search for similar documents based on embeddings\n", "\n", - "- `delete` : Remove documents based on filter conditions\n", + "- ```delete``` : Remove documents based on filter conditions\n", "\n", "Each of these features is implemented as class methods specific to each VectorDB.\n", "\n", @@ -442,7 +433,7 @@ "id": "c1c0c67f", "metadata": {}, "source": [ - "Now you can use the following **CRUD** operations with the `crud_manager` instance.\n", + "Now you can use the following **CRUD** operations with the ```crud_manager``` instance.\n", "\n", "These instance allow you to easily manage documents in your **{vectordb}** ." 
] @@ -458,13 +449,13 @@ "\n", "**βœ… Args**\n", "\n", - "- `texts` : Iterable[str] – List of text contents to be inserted/updated.\n", + "- ```texts``` : Iterable[str] – List of text contents to be inserted/updated.\n", "\n", - "- `metadatas` : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).\n", + "- ```metadatas``` : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).\n", "\n", - "- `ids` : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.\n", + "- ```ids``` : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.\n", "\n", - "- `**kwargs` : Extra arguments for the underlying vector store.\n", + "- ```**kwargs``` : Extra arguments for the underlying vector store.\n", "\n", "**πŸ”„ Return**\n", "\n", @@ -505,17 +496,17 @@ "\n", "**βœ… Args**\n", "\n", - "- `texts` : Iterable[str] – List of text contents to be inserted/updated.\n", + "- ```texts``` : Iterable[str] – List of text contents to be inserted/updated.\n", "\n", - "- `metadatas` : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).\n", + "- ```metadatas``` : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).\n", "\n", - "- `ids` : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.\n", + "- ```ids``` : Optional[List[str]] – Custom IDs for the documents. 
If not provided, IDs will be auto-generated.\n", "\n", - "- `batch_size` : int – Number of documents per batch (default: 32).\n", + "- ```batch_size``` : int – Number of documents per batch (default: 32).\n", "\n", - "- `workers` : int – Number of parallel workers (default: 10).\n", + "- ```workers``` : int – Number of parallel workers (default: 10).\n", "\n", - "- `**kwargs` : Extra arguments for the underlying vector store.\n", + "- ```**kwargs``` : Extra arguments for the underlying vector store.\n", "\n", "**πŸ”„ Return**\n", "\n", @@ -559,15 +550,15 @@ "\n", "**βœ… Args**\n", "\n", - "- `query` : str – The text query for similarity search.\n", + "- ```query``` : str – The text query for similarity search.\n", "\n", - "- `k` : int – Number of top results to return (default: 10).\n", + "- ```k``` : int – Number of top results to return (default: 10).\n", "\n", - "`**kwargs` : Additional search options (e.g., filters).\n", + "```**kwargs``` : Additional search options (e.g., filters).\n", "\n", "**πŸ”„ Return**\n", "\n", - "- `results` : List[Document] – A list of LangChain Document objects ranked by similarity." + "- ```results``` : List[Document] – A list of LangChain Document objects ranked by similarity." ] }, { @@ -621,11 +612,11 @@ "\n", "**βœ… Args**\n", "\n", - "- `ids` : Optional[List[str]] – List of document IDs to delete. If None, deletion is based on filter.\n", + "- ```ids``` : Optional[List[str]] – List of document IDs to delete. 
If None, deletion is based on filter.\n", "\n", - "- `filters` : Optional[Dict] – Dictionary specifying filter conditions (e.g., metadata match).\n", + "- ```filters``` : Optional[Dict] – Dictionary specifying filter conditions (e.g., metadata match).\n", "\n", - "- `**kwargs` : Any additional parameters.\n", + "- ```**kwargs``` : Any additional parameters.\n", "\n", "**πŸ”„ Return**\n", "\n", From ea4014e7f4e6054fc996cede0ce0957a2ac72b5b Mon Sep 17 00:00:00 2001 From: Gwangwon Jung Date: Mon, 5 May 2025 10:53:37 +0900 Subject: [PATCH 4/5] [N-2] 09-Vector Store / 99-Master-Template - add `as_retriever` : LightCustomRetriever for tutorial - Fixed Master-Template - vectordbinterface.py --- 09-VectorStore/99-Master-Template.ipynb | 60 +++++++++++++++++++++-- 09-VectorStore/utils/vectordbinterface.py | 41 +++++++++++++++- 2 files changed, 97 insertions(+), 4 deletions(-) diff --git a/09-VectorStore/99-Master-Template.ipynb b/09-VectorStore/99-Master-Template.ipynb index 25fb37698..bd54df77e 100644 --- a/09-VectorStore/99-Master-Template.ipynb +++ b/09-VectorStore/99-Master-Template.ipynb @@ -232,7 +232,7 @@ " # Extract title from the first line using square brackets [ ]\n", " first_line = lines[0]\n", " title_match = re.search(r\"\\[(.*?)\\]\", first_line)\n", - " title = title_match.group(1).strip() if title_match else None\n", + " title = title_match.group(1).strip() if title_match else \"\"\n", "\n", " # Remove the title line from content\n", " body = \"\\n\".join(lines[1:]).strip()\n", @@ -477,7 +477,7 @@ "\n", "args = {\n", " \"texts\": [doc.page_content for doc in docs[:2]],\n", - " \"metadatas\": [doc.metadata[\"title\"] for doc in docs[:2]],\n", + " \"metadatas\": [doc.metadata for doc in docs[:2]],\n", " \"ids\": [str(uuid4()) for _ in docs[:2]]\n", " # if you want args, add params.\n", "}\n", @@ -528,7 +528,7 @@ "\n", "args = {\n", " \"texts\": [doc.page_content for doc in docs],\n", - " \"metadatas\": [doc.metadata[\"title\"] for doc in docs],\n", + " 
\"metadatas\": [doc.metadata for doc in docs],\n", " \"ids\": [str(uuid4()) for _ in docs]\n", " # if you want args, add params.\n", "}\n", @@ -601,6 +601,60 @@ "# print()" ] }, + { + "cell_type": "markdown", + "id": "f140c0e2", + "metadata": {}, + "source": [ + "### As Retriever\n", + "\n", + "The ```as_retriever()``` method creates a LangChain-compatible retriever wrapper.\n", + "\n", + "This function allows a ```DocumentManager``` class to return a retriever object by wrapping the internal ```search()``` method, while staying lightweight and independent from full LangChain ```VectorStore``` dependencies.\n", + "\n", + "The retriever obtained through this function can be used just like an existing LangChain retriever and is **compatible with LangChain pipelines (e.g., RetrievalQA, ConversationalRetrievalChain, Tool, ...)**.\n", + "\n", + "**βœ… Args**\n", + "\n", + "- ```search_fn``` : Callable - The function used to retrieve relevant documents. Typically this is ```self.search``` from a ```DocumentManager``` instance.\n", + "\n", + "- ```search_kwargs``` : Optional[Dict] - A dictionary of keyword arguments passed to ```search_fn```, such as ```k``` for top-K results or metadata filters.\n", + "\n", + "**πŸ”„ Return**\n", + "\n", + "- ```LightCustomRetriever``` : BaseRetriever - A lightweight LangChain-compatible retriever that internally uses the given ```search_fn``` and ```search_kwargs```." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "86de7842", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "# ret = crud_manager.as_retriever(\n", + "# search_fn=crud_manager.search, search_kwargs= # e.g. 
{\"k\": 1, \"where\": {\"title\": \"\"}}\n", + "# )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7142d29c", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "# ret.invoke(\"Which asteroid did the little prince come from?\")" + ] + }, { "cell_type": "markdown", "id": "9ad0ed0c", diff --git a/09-VectorStore/utils/vectordbinterface.py b/09-VectorStore/utils/vectordbinterface.py index bc304bde3..23c330cdc 100644 --- a/09-VectorStore/utils/vectordbinterface.py +++ b/09-VectorStore/utils/vectordbinterface.py @@ -67,7 +67,18 @@ def delete( New Interface for VectorDB CRUD """ -from typing import Optional, List, Iterable, Any, Dict +from typing import Optional, List, Iterable, Any, Dict, Callable +from langchain_core.retrievers import BaseRetriever +from langchain_core.documents import Document +from pydantic import Field + + +class LightCustomRetriever(BaseRetriever): + search_fn: Callable + search_kwargs: Dict = Field(default_factory=dict) + + def get_relevant_documents(self, query: str) -> List[Document]: + return self.search_fn(query, **self.search_kwargs) class DocumentManager(ABC): @@ -128,3 +139,31 @@ def delete( """ pass + + def as_retriever( + self, search_fn: Callable, search_kwargs: Dict = {} + ) -> LightCustomRetriever: + """ + Create a LangChain-compatible retriever using a custom search function. + + This method wraps a provided search function and its keyword arguments + into a `LightCustomRetriever` object that conforms to LangChain's `BaseRetriever` interface. + Useful for integrating lightweight, SDK-based CRUD search implementations with LangChain chains. + + Args: + search_fn (Callable): + The function that performs the similarity search and returns a list of `Document` objects. + Typically this is the `search()` method of the DocumentManager. + search_kwargs (Dict, optional): + Additional keyword arguments to pass into the `search_fn`. 
+ Example: {'k': 5} to retrieve top 5 similar documents. + + Returns: + LightCustomRetriever: + A retriever instance that can be used with LangChain chains like `RetrievalQA` + or `ConversationalRetrievalChain`. + """ + retriever = LightCustomRetriever( + search_fn=search_fn, search_kwargs=search_kwargs + ) + return retriever From ddb35c5fb71bee28a91809112d9f473e32c15095 Mon Sep 17 00:00:00 2001 From: Gwangwon Jung Date: Tue, 6 May 2025 21:52:40 +0900 Subject: [PATCH 5/5] [N-2] 09-Vector Store / 99-Master-Template - HotFix - Add `What is {vectordb}?` - Changed `with open("the_little_prince.txt"~` -> `with open("./data/the_little_prince.txt"~` --- 09-VectorStore/99-Master-Template.ipynb | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/09-VectorStore/99-Master-Template.ipynb b/09-VectorStore/99-Master-Template.ipynb index bd54df77e..1fe051821 100644 --- a/09-VectorStore/99-Master-Template.ipynb +++ b/09-VectorStore/99-Master-Template.ipynb @@ -146,6 +146,8 @@ "id": "6890920d", "metadata": {}, "source": [ + "## What is {vectordb}?\n", + "\n", "Please write down what you need to set up the Vectorstore here." ] }, @@ -263,7 +265,7 @@ "outputs": [], "source": [ "# Load the entire text file\n", - "with open(\"the_little_prince.txt\", \"r\", encoding=\"utf-8\") as f:\n", + "with open(\"./data/the_little_prince.txt\", \"r\", encoding=\"utf-8\") as f:\n", " content = f.read()\n", "\n", "# Preprocessing Data\n",
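The template above leaves every `crud_manager` call commented out, because each vector store ships its own `DocumentManager` subclass. Purely as an illustration of the contract these patches describe (`upsert`, `upsert_parallel`, `search` with filters, and `delete`), the following is a dependency-free, in-memory sketch. The `toy_embedding` hash-bucket function and the `InMemoryDocumentManager` class are invented for this example and are not part of the PR; a real implementation would wrap the {vectordb} SDK client and a real embedding model such as `OpenAIEmbeddings`.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, Iterable, List, Optional
from uuid import uuid4


def toy_embedding(text: str, dim: int = 64) -> List[float]:
    # Hash-bucket bag-of-words vector, L2-normalized.
    # A cheap stand-in for a real embedding model.
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class InMemoryDocumentManager:
    """Toy in-memory realization of the upsert/search/delete contract."""

    def __init__(self, embed=toy_embedding):
        self.embed = embed
        self._store: Dict[str, Dict[str, Any]] = {}  # id -> {text, metadata, vector}

    def upsert(self, texts: Iterable[str], metadatas: Optional[List[Dict]] = None,
               ids: Optional[List[str]] = None, **kwargs) -> None:
        texts = list(texts)
        metadatas = metadatas or [{} for _ in texts]
        ids = ids or [str(uuid4()) for _ in texts]
        for doc_id, text, meta in zip(ids, texts, metadatas):
            self._store[doc_id] = {"text": text, "metadata": meta,
                                   "vector": self.embed(text)}

    def upsert_parallel(self, texts: Iterable[str],
                        metadatas: Optional[List[Dict]] = None,
                        ids: Optional[List[str]] = None,
                        batch_size: int = 32, workers: int = 10, **kwargs) -> None:
        texts = list(texts)
        metadatas = metadatas or [{} for _ in texts]
        ids = ids or [str(uuid4()) for _ in texts]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(self.upsert, texts[i:i + batch_size],
                                   metadatas[i:i + batch_size], ids[i:i + batch_size])
                       for i in range(0, len(texts), batch_size)]
            for f in futures:
                f.result()  # propagate any worker exception

    def search(self, query: str, k: int = 10,
               filters: Optional[Dict] = None) -> List[Dict[str, Any]]:
        # Dot product of unit vectors == cosine similarity.
        qv = self.embed(query)
        hits = []
        for doc in self._store.values():
            if filters and any(doc["metadata"].get(key) != val
                               for key, val in filters.items()):
                continue  # metadata filter, applied before ranking
            score = sum(a * b for a, b in zip(qv, doc["vector"]))
            hits.append((score, doc))
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in hits[:k]]

    def delete(self, ids: Optional[List[str]] = None,
               filters: Optional[Dict] = None, **kwargs) -> None:
        if ids is None and filters is None:
            self._store.clear()  # delete all, as in the template's last cell
            return
        for doc_id in list(self._store):
            if ids is not None and doc_id not in ids:
                continue
            if filters and any(self._store[doc_id]["metadata"].get(key) != val
                               for key, val in filters.items()):
                continue
            del self._store[doc_id]


# Mirror the template's usage: upsert a few chunks, then search.
mgr = InMemoryDocumentManager()
mgr.upsert(
    texts=["The little prince came from asteroid B-612.",
           "What is essential is invisible to the eye.",
           "Grown-ups love figures."],
    metadatas=[{"title": "Chapter 4"}, {"title": "Chapter 21"}, {"title": "Chapter 4"}],
    ids=["d1", "d2", "d3"],
)
top = mgr.search("Which asteroid did the little prince come from?", k=1)
```

Against a real backend, `search` would embed the query once and push the metadata filter down into the vector store's query API rather than scanning in Python; the in-memory scan here only exists to keep the sketch self-contained.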