From 2bd415cbd30a69a1e0772df2b70a502b2f4b936a Mon Sep 17 00:00:00 2001
From: Jongho Lee
Date: Sun, 4 May 2025 01:24:15 +0900
Subject: [PATCH 1/5] pgvector revised

---
 09-VectorStore/08-PGVector.ipynb           | 990 +++++++++++----------
 09-VectorStore/utils/pgvector_interface.py | 180 ++--
 2 files changed, 572 insertions(+), 598 deletions(-)

diff --git a/09-VectorStore/08-PGVector.ipynb b/09-VectorStore/08-PGVector.ipynb
index 3608d1790..07c465595 100644
--- a/09-VectorStore/08-PGVector.ipynb
+++ b/09-VectorStore/08-PGVector.ipynb
@@ -2,6 +2,7 @@
 "cells": [
 {
 "cell_type": "markdown",
+ "id": "25733da0",
 "metadata": {},
 "source": [
 "# PGVector\n",
@@ -13,20 +14,23 @@
 "\n",
 "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/07-PGVector.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/07-PGVector.ipynb)\n",
 "\n",
- "## Overview \n",
+ "## Overview\n",
+ "\n",
+ "This tutorial covers how to use ```PGVector``` with **LangChain** .\n",
 "\n",
 "[```PGVector```](https://github.com/pgvector/pgvector) is an open-source extension for PostgreSQL that allows you to store and search vector data alongside your regular database information.\n",
 "\n",
- "This notebook shows how to use functionality related to ```PGVector```, implementing LangChain vectorstore abstraction using postgres as the backend and utilizing the pgvector extension.\n",
+ "This tutorial walks you through using **CRUD** operations with ```PGVector``` : **storing** , **updating** , and **deleting** documents, and performing **similarity-based retrieval** .\n",
 "\n",
 "### Table of Contents\n",
 "\n",
 "- [Overview](#overview)\n",
 "- [Environment Setup](#environment-setup)\n",
- "- [What is PGVector?](#what-is-pgvector)\n",
- "- [Initialization](#initialization)\n",
- "- [Manage vector store](#manage-vector-store)\n",
- "- [Similarity search](#similarity-search)\n",
+ "- [What is PGVector?](#what-is-pgvector?)\n",
+ "- [Data](#data)\n",
+ "- [Initial Setting PGVector](#initial-setting-pgvector)\n",
+ "- [Document Manager](#document-manager)\n",
+ "\n",
 "\n",
 "### References\n",
 "\n",
@@ -40,6 +44,7 @@
 },
 {
 "cell_type": "markdown",
+ "id": "c1fac085",
 "metadata": {},
 "source": [
 "## Environment Setup\n",
@@ -47,13 +52,14 @@
 "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
 "\n",
 "**[Note]**\n",
- "- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n",
- "- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
+ "- ```langchain-opentutorial``` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n",
+ "- You can check out the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
] }, { "cell_type": "code", "execution_count": 1, + "id": "98da7994", "metadata": {}, "outputs": [], "source": [ @@ -63,7 +69,8 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, + "id": "800c732b", "metadata": {}, "outputs": [], "source": [ @@ -88,6 +95,7 @@ { "cell_type": "code", "execution_count": 3, + "id": "5b36bafa", "metadata": {}, "outputs": [ { @@ -113,9 +121,20 @@ ")" ] }, + { + "cell_type": "markdown", + "id": "8011a0c7", + "metadata": {}, + "source": [ + "You can alternatively set API keys such as ```OPENAI_API_KEY``` in a ```.env``` file and load them.\n", + "\n", + "[Note] This is not necessary if you've already set the required API keys in previous steps." + ] + }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 4, + "id": "70d7e764", "metadata": {}, "outputs": [ { @@ -124,7 +143,7 @@ "True" ] }, - "execution_count": 5, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -137,39 +156,154 @@ }, { "cell_type": "markdown", + "id": "6890920d", + "metadata": {}, + "source": [ + "Please write down what you need to set up the Vectorstore here." + ] + }, + { + "cell_type": "markdown", + "id": "6f3b5bd2", + "metadata": {}, + "source": [ + "## Data\n", + "\n", + "This part walks you through the **data preparation process** .\n", + "\n", + "This section includes the following components:\n", + "\n", + "- Introduce Data\n", + "\n", + "- Preprocessing Data\n" + ] + }, + { + "cell_type": "markdown", + "id": "508ae7f7", + "metadata": {}, + "source": [ + "### Introduce Data\n", + "\n", + "In this tutorial, we will use the fairy tale **πŸ“— The Little Prince** in PDF format as our data.\n", + "\n", + "This material complies with the **Apache 2.0 license** .\n", + "\n", + "The data is used in a text (.txt) format converted from the original PDF.\n", + "\n", + "You can view the data at the link below.\n", + "- [Data Link](https://huggingface.co/datasets/sohyunwriter/the_little_prince)" + ] + }, + { + "cell_type": "markdown", + "id": "004ea4f4", + "metadata": {}, + "source": [ + "### Preprocessing Data\n", + "\n", + "In this tutorial section, we will preprocess the text data from The Little Prince and convert it into a list of ```LangChain Document``` objects with metadata. \n", + "\n", + "Each document chunk will include a ```title``` field in the metadata, extracted from the first line of each section." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "8e4cac64", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.schema import Document\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "import re\n", + "from typing import List\n", + "\n", + "\n", + "def preprocessing_data(content: str) -> List[Document]:\n", + " # 1. Split the text by double newlines to separate sections\n", + " blocks = content.split(\"\\n\\n\")\n", + "\n", + " # 2. Initialize the text splitter\n", + " text_splitter = RecursiveCharacterTextSplitter(\n", + " chunk_size=500, # Maximum number of characters per chunk\n", + " chunk_overlap=50, # Overlap between chunks to preserve context\n", + " separators=[\"\\n\\n\", \"\\n\", \" \"], # Order of priority for splitting\n", + " )\n", + "\n", + " documents = []\n", + "\n", + " # 3. 
Loop through each section\n", + " for block in blocks:\n", + " lines = block.strip().splitlines()\n", + " if not lines:\n", + " continue\n", + "\n", + " # Extract title from the first line using square brackets [ ]\n", + " first_line = lines[0]\n", + " title_match = re.search(r\"\\[(.*?)\\]\", first_line)\n", + " title = title_match.group(1).strip() if title_match else None\n", + "\n", + " # Remove the title line from content\n", + " body = \"\\n\".join(lines[1:]).strip()\n", + " if not body:\n", + " continue\n", + "\n", + " # 4. Chunk the section using the text splitter\n", + " chunks = text_splitter.split_text(body)\n", + "\n", + " # 5. Create a LangChain Document for each chunk with the same title metadata\n", + " for chunk in chunks:\n", + " documents.append(Document(page_content=chunk, metadata={\"title\": title}))\n", + "\n", + " print(f\"Generated {len(documents)} chunked documents.\")\n", + "\n", + " return documents" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "1d091a51", "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Generated 262 chunked documents.\n" + ] + } + ], "source": [ - "## What is PGVector?\n", + "# Load the entire text file\n", + "with open(\"./data/the_little_prince.txt\", \"r\", encoding=\"utf-8\") as f:\n", + " content = f.read()\n", "\n", - "`PGVector` is a ```PostgreSQL``` extension that enables vector similarity search directly within your ```PostgreSQL``` database, making it ideal for AI applications, semantic search, and recommendation systems.\n", + "# Preprocessing Data\n", "\n", - "This is particularly valuable for who already use ```PostgreSQL``` who want to add vector search capabilities without managing separate infrastructure or learning new query languages.\n", + "docs = preprocessing_data(content=content)" + ] + }, + { + "cell_type": "markdown", + "id": "1977d4ff", + "metadata": {}, + "source": [ + "## Initial Setting PGVector\n", "\n", - "**Features** :\n", - "1. Native ```PostgreSQL``` integration with standard SQL queries\n", - "2. Multiple similarity search methods including L2, Inner Product, Cosine\n", - "3. Several indexing options including HNSW and IVFFlat\n", - "4. Support for up to 2,000 dimensions per vector\n", - "5. ACID compliance inherited from ```PostgreSQL```\n", + "This part walks you through the initial setup of ```PGVector```\n", "\n", - "**Advantages** :\n", + "This section includes the following components:\n", "\n", - "1. Free and open-source\n", - "2. Easy integration with existing ```PostgreSQL``` databases\n", - "3. Full SQL functionality and transactional support\n", - "4. No additional infrastructure needed\n", - "5. Supports hybrid searches combining vector and traditional SQL queries\n", + "- Load Embedding Model\n", "\n", - "**Disadvantages** :\n", - "1. Performance limitations with very large datasets (billions of vectors)\n", - "2. Limited to single-node deployment\n", - "3. Memory-intensive for large vector dimensions\n", - "4. Requires manual optimization for best performance\n", - "5. 
Less specialized features compared to dedicated vector databases" + "- Load ```PGVector``` Client" ] }, { "cell_type": "markdown", + "id": "835e5c9e", "metadata": {}, "source": [ "### Set up PGVector\n", @@ -186,7 +320,7 @@ "\n", "For more detailed instructions, please refer to [the official documentation](https://github.com/pgvector/pgvector) \n", "\n", - "** [ NOTE ] **\n", + "**[ NOTE ]**\n", "* If you want to maintain the stored data even after container being deleted, you must mount volume like below:\n", "```bash\n", "docker run --name pgvector-container -v {/mount/path}:/var/lib/postgresql/data -e POSTGRES_USER=langchain -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain -p 6024:5432 -d pgvector/pgvector:pg16\n", @@ -195,32 +329,133 @@ }, { "cell_type": "markdown", + "id": "7eee56b2", "metadata": {}, "source": [ - "## Initialization\n", + "### Load Embedding Model\n", + "\n", + "In the **Load Embedding Model** section, you'll learn how to load an embedding model.\n", + "\n", + "This tutorial uses **OpenAI's** **API-Key** for loading the model.\n", + "\n", + "*πŸ’‘ If you prefer to use another embedding model, see the instructions below.*\n", + "- [Embedding Models](https://python.langchain.com/docs/integrations/text_embedding/)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "5bd5c3c9", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from langchain_openai import OpenAIEmbeddings\n", "\n", - "If you are successfully running the pgvector container, you can use ```pgVectorIndexManager``` from ```pgvector_interface``` in utils directory to handle collections.\n", + "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")" + ] + }, + { + "cell_type": "markdown", + "id": "40f65795", + "metadata": {}, + "source": [ + "### Load PGVector Client\n", "\n", - "To initialize ```pgVectorIndexManager``` you can pass full connection string or pass each parameter separately." 
+ "In the **Load ```PGVector``` Client** section, we cover how to load the **database client object** using the **Python SDK** for ```PGVector``` .\n", + "- [PGVector Python SDK Docs](https://github.com/pgvector/pgvector)" ] }, { "cell_type": "code", "execution_count": 8, + "id": "eed0ebad", "metadata": {}, "outputs": [], "source": [ - "from utils.pgvector_interface import pgVectorIndexManager\n", + "from sqlalchemy import create_engine\n", + "\n", + "# Create Database Client Object Function\n", + "\n", + "\n", + "def get_db_client(conn_str):\n", + " \"\"\"\n", + "\n", "\n", - "# Setup connection infomation\n", - "conn_str = \"postgresql+psycopg://langchain:langchain@localhost:6024/langchain\"\n", + " Initializes and returns a VectorStore client instance.\n", + "\n", + "\n", + "\n", + " This function loads configuration (e.g., API key, host) from environment\n", + "\n", + "\n", + " variables or default values and creates a client object to interact\n", + "\n", + "\n", + " with the {vectordb} Python SDK.\n", + "\n", + "\n", + "\n", + " Returns:\n", + "\n", + "\n", + " client:ClientType - An instance of the {vectordb} client.\n", + "\n", + "\n", + "\n", + " Raises:\n", + "\n", + "\n", + " ValueError: If required configuration is missing.\n", + "\n", + "\n", + " \"\"\"\n", + " try:\n", + " client = create_engine(url=conn_str, **({}))\n", + " except Exception as e:\n", + " raise e\n", + " else:\n", + " return client" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "2b5f4116", + "metadata": {}, + "outputs": [], + "source": [ + "# Get DB Client Object\n", + "conn_str = \"postgresql+psycopg://langchain:langchain@localhost:6088/langchain\"\n", + "client = get_db_client(conn_str)" + ] + }, + { + "cell_type": "markdown", + "id": "2e8f4075", + "metadata": {}, + "source": [ + "If you are successfully running the ```PGVector``` container and get client objecct, you can use ```pgVectorIndexManager``` from ```pgvector_interface``` in utils directory to handle collections.\n", + "\n", + "You can also initialize ```pgVectorIndexManager``` by passing full connection string or each parameter separately instead of passing client." 
+ ]
 },
 {
 "cell_type": "code",
 "execution_count": 10,
+ "id": "ba8f2308",
 "metadata": {},
 "outputs": [],
 "source": [
 "from utils.pgvector_interface import pgVectorIndexManager\n",
 "\n",
 "# Initialize pgVectorIndexManager\n",
- "index_manager = pgVectorIndexManager(connection=conn_str)"
+ "index_manager = pgVectorIndexManager(client=client)"
 ]
 },
 {
 "cell_type": "markdown",
+ "id": "734dc3da",
 "metadata": {},
 "source": [
 "When you initialize ```pgVectorIndexManager```, the procedure will automatically create two tables\n",
@@ -247,6 +482,7 @@
 },
 {
 "cell_type": "markdown",
+ "id": "f83b661d",
 "metadata": {},
 "source": [
 "## Create collection\n",
@@ -263,13 +499,13 @@
 },
 {
 "cell_type": "code",
- "execution_count": 6,
+ "execution_count": 11,
+ "id": "4742c2ff",
 "metadata": {},
 "outputs": [],
 "source": [
 "import getpass\n",
 "import os\n",
- "\n",
 "if not os.environ.get(\"OPENAI_API_KEY\"):\n",
 "    os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter API key for OpenAI: \")\n",
 "\n",
@@ -280,115 +516,46 @@
 },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 12,
+ "id": "d92c6846",
 "metadata": {},
 "outputs": [],
 "source": [
 "# create new collection\n",
- "col_manager = index_manager.create_index(\n",
- "    collection_name=\"langchain_opentutorial\", embedding=embeddings\n",
+ "_ = index_manager.create_index(\n",
+ "    collection_name=\"tutorial_collection\", embedding=embeddings\n",
 ")"
 ]
 },
 {
 "cell_type": "markdown",
+ "id": "3a5a97a0",
 "metadata": {},
 "source": [
- "### List collections\n",
+ "## Document Manager\n",
 "\n",
- "As we have created a new collection, we will call the ```list_indexes``` method to check if the collection is created."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "['langchain_opentutorial']\n"
- ]
- }
- ],
- "source": [
- "# check collections\n",
- "indexes = index_manager.list_indexes()\n",
- "print(indexes)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Delete collections\n",
+ "To support the **Langchain-Opentutorial** , we implemented a custom set of **CRUD** functionalities for VectorDBs. \n",
 "\n",
- "We can also delete collection by calling the ```delete_index``` method by pass the name of the collection to delete.\n",
+ "The following operations are included:\n",
 "\n",
- "We delete **langchain_opentutorial** collection, and then create it again."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[]\n" - ] - } - ], - "source": [ - "# delete collection\n", - "index_manager.delete_index(\"langchain_opentutorial\")\n", + "- ```upsert``` : Update existing documents or insert if they don’t exist\n", "\n", - "# check collections\n", - "indexes = index_manager.list_indexes()\n", - "print(indexes)\n", + "- ```upsert_parallel``` : Perform upserts in parallel for large-scale data\n", "\n", - "# Create again\n", - "col_manager_tmp1 = index_manager.create_index(\n", - " collection_name=\"langchain_opentutorial\", embedding=embeddings\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Get collection\n", - "As we said, when you create a new collection by calling the ```create_index``` method, this will automatically return ```pgVectorDocumentManager``` instance.\n", + "- ```similarity_search``` : Search for similar documents based on embeddings\n", "\n", - "But if you want to re-use already created collection, you can call the ```get_index``` method with name of the collection and embedding model you used to create the collection to get manager." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Get collection\n", - "col_manager_tmp2 = index_manager.get_index(\n", - " embedding=embeddings, collection_name=\"langchain_opentutorial\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Manage vector store\n", + "- ```delete``` : Remove documents based on filter conditions\n", "\n", - "Once you have created your vector store, we can interact with it by adding and deleting different items." + "Each of these features is implemented as class methods specific to each VectorDB.\n", + "\n", + "In this tutorial, you can easily utilize these methods to interact with your VectorDB.\n", + "\n", + "*We plan to continuously expand the functionality by adding more common operations in the future.*" ] }, { "cell_type": "markdown", + "id": "4e89549b-fbb4-4a9d-b01d-1898d129b1e2", "metadata": {}, "source": [ "### Filtering\n", @@ -410,7 +577,7 @@ "| \\$and | Logical (and) |\n", "| \\$or | Logical (or) |\n", "\n", - "Filter can be used with ```scroll```, ```delete```, and ```search``` methods.\n", + "Filter can be used with ```delete```, and ```search``` methods.\n", "\n", "To apply filter, we create a dictionary and pass it to ```filter``` parameter like the following\n", "```python\n", @@ -420,513 +587,382 @@ }, { "cell_type": "markdown", + "id": "65a40601", "metadata": {}, "source": [ - "### Connect to index\n", - "To add, delete, search items, we need to initialize an object which connected to the index we operate on.\n", + "### Create Instance\n", "\n", - "We will connect to **langchain_opentutorial** . Recall that we used basic ```OpenAIEmbedding``` as a embedding function, and thus we need to pass it when we initialize ```index_manager``` object.\n", + "First, we create an instance of the **{vectordb}** helper class to use its CRUD functionalities.\n", "\n", - "Remember that we also can get ```pgVectorDocumentManager``` object when we create an index with ```pgVectorIndexManager``` object or ```pgVectorIndexManager.get_index``` method, but this time we call it directly to get an ```pgVectorDocumentManager``` object." 
+ "This class is initialized with the **{vectordb} Python SDK client instance** and the **embedding model instance** , both of which were defined in the previous section." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 13, + "id": "dccab807", "metadata": {}, "outputs": [], "source": [ - "from utils.pgvector_interface import pgVectorDocumentManager\n", + "from utils.pgvector_interface import pgVectorCRUDManager\n", "\n", - "# Get document manager\n", - "col_manager = pgVectorDocumentManager(\n", - " embedding=embeddings,\n", - " connection_info=conn_str,\n", - " collection_name=\"langchain_opentutorial\",\n", + "crud_manager = pgVectorCRUDManager(\n", + " client=client, embedding=embedding, collection_name=\"tutorial_collection\"\n", ")" ] }, { "cell_type": "markdown", + "id": "c1c0c67f", "metadata": {}, "source": [ - "### Data Preprocessing\n", + "Now you can use the following **CRUD** operations with the ```crud_manager``` instance.\n", "\n", - "Below is the preprocessing process for general documents.\n", - "\n", - "- Need to extract **metadata** from documents\n", - "- Filter documents by minimum length.\n", - " \n", - "- Determine whether to use ```basename``` or not. Default is ```False```.\n", - " - ```basename``` denotes the last value of the filepath.\n", - " - For example, **document.pdf** will be the ```basename``` for the filepath **./data/document.pdf** ." + "These instance allow you to easily manage documents in your ```PGVector```." ] }, { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [], - "source": [ - "# This is a long document we can split up.\n", - "data_path = \"./data/the_little_prince.txt\"\n", - "with open(data_path, encoding=\"utf8\") as f:\n", - " raw_text = f.read()" - ] - }, - { - "cell_type": "code", - "execution_count": 13, + "cell_type": "markdown", + "id": "7c6c53c5", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "page_content='The Little Prince\n", - "Written By Antoine de Saiot-Exupery (1900γ€œ1944)'\n" - ] - } - ], "source": [ - "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", - "from uuid import uuid4\n", + "### Upsert Document\n", "\n", - "# define text splitter\n", - "text_splitter = RecursiveCharacterTextSplitter(\n", - " # Set a really small chunk size, just to show.\n", - " chunk_size=100,\n", - " chunk_overlap=20,\n", - " length_function=len,\n", - " is_separator_regex=False,\n", - ")\n", + "**Update** existing documents or **insert** if they don’t exist\n", + "\n", + "**βœ… Args**\n", + "\n", + "- ```texts``` : Iterable[str] – List of text contents to be inserted/updated.\n", + "\n", + "- ```metadatas``` : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).\n", + "\n", + "- ```ids``` : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.\n", "\n", - "# split raw text by splitter.\n", - "split_docs = text_splitter.create_documents([raw_text])\n", + "- ```**kwargs``` : Extra arguments for the underlying vector store.\n", "\n", - "# print one of documents to check its structure\n", - "print(split_docs[0])" + "**πŸ”„ Return**\n", + "\n", + "- ```ids``` : IDs of the upserted documents." 
] }, { "cell_type": "code", "execution_count": 14, + "id": "f3a6c32b", "metadata": {}, "outputs": [], "source": [ - "# define document preprocessor\n", - "def preprocess_documents(\n", - " split_docs, metadata_keys, min_length, use_basename=False, **kwargs\n", - "):\n", - " metadata = kwargs\n", + "from uuid import uuid4\n", "\n", - " if use_basename:\n", - " assert metadata.get(\"source\", None) is not None, \"source must be provided\"\n", - " metadata[\"source\"] = metadata[\"source\"].split(\"/\")[-1]\n", + "ids = [str(uuid4()) for _ in docs]\n", "\n", - " result_docs = []\n", - " for idx, doc in enumerate(split_docs):\n", - " if len(doc.page_content) < min_length:\n", - " continue\n", - " for k in metadata_keys:\n", - " doc.metadata.update({k: metadata.get(k, \"\")})\n", - " doc.metadata.update({\"page\": idx + 1, \"id\": str(uuid4())})\n", - " result_docs.append(doc)\n", "\n", - " return result_docs" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "page_content='The Little Prince\n", - "Written By Antoine de Saiot-Exupery (1900γ€œ1944)' metadata={'source': 'the_little_prince.txt', 'page': 1, 'author': 'Saiot-Exupery', 'id': 'cc23e228-2540-4e5c-8eb3-be6df7a3bf77'}\n" - ] - } - ], - "source": [ - "# preprocess raw documents\n", - "processed_docs = preprocess_documents(\n", - " split_docs=split_docs,\n", - " metadata_keys=[\"source\", \"page\", \"author\"],\n", - " min_length=5,\n", - " use_basename=True,\n", - " source=data_path,\n", - " author=\"Saiot-Exupery\",\n", - ")\n", + "args = {\n", + " \"texts\": [doc.page_content for doc in docs[:2]],\n", + " \"metadatas\": [doc.metadata for doc in docs[:2]],\n", + " \"ids\": ids[:2],\n", + "}\n", + "\n", "\n", - "# print one of preprocessed document to chekc its structure\n", - "print(processed_docs[0])" + "upsert_result = crud_manager.upsert(**args)" ] }, { "cell_type": "markdown", + "id": "278fe1ed", "metadata": {}, "source": [ - "### Add items to vector store\n", + "### Upsert Parallel Document\n", "\n", - "We can add items to our vector store by using the ```upsert``` or ```upsert_parallel``` method.\n", + "Perform **upserts** in **parallel** for large-scale data\n", "\n", - "If you pass ids along with documents, then ids will be used, but if you do not pass ids, it will be created based `page_content` using md5 hash function.\n", + "**βœ… Args**\n", "\n", - "Basically, ```upsert``` and ```upsert_parallel``` methods do upsert not insert, based on **id** of the item.\n", + "- ```texts``` : Iterable[str] – List of text contents to be inserted/updated.\n", "\n", - "So if you provided id and want to update data, you must provide the same id that you provided at first upsertion.\n", + "- ```metadatas``` : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).\n", "\n", - "We will upsert data to collection, **langchain_opentutorial** , with ```upsert``` method for the first half, and with ```upsert_parallel``` for the second half." 
- ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of documents: 1359\n" - ] - } - ], - "source": [ - "# Gather uuids, texts, metadatas\n", - "uuids = [doc.metadata[\"id\"] for doc in processed_docs]\n", - "texts = [doc.page_content for doc in processed_docs]\n", - "metadatas = [doc.metadata for doc in processed_docs]\n", + "- ```ids``` : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.\n", "\n", - "# Get total number of documents\n", - "total_number = len(processed_docs)\n", - "print(\"Number of documents:\", total_number)" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 1.57 s, sys: 140 ms, total: 1.71 s\n", - "Wall time: 5.46 s\n" - ] - } - ], - "source": [ - "%%time\n", - "# upsert documents\n", - "upsert_result = col_manager.upsert(\n", - " \n", - " texts=texts[:total_number//2], metadatas=metadatas[:total_number//2], ids=uuids[:total_number//2]\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 1.79 s, sys: 82.9 ms, total: 1.88 s\n", - "Wall time: 4.96 s\n" - ] - } - ], - "source": [ - "%%time\n", - "# upsert documents parallel\n", - "upsert_parallel_result = col_manager.upsert_parallel(\n", - " texts = texts[total_number//2 :],\n", - " metadatas = metadatas[total_number//2:],\n", - " ids = uuids[total_number//2:],\n", - " batch_size=32,\n", - " max_workers=8\n", - ")" + "- ```batch_size``` : int – Number of documents per batch (default: 32).\n", + "\n", + "- ```workers``` : int – Number of parallel workers (default: 10).\n", + "\n", + "- ```**kwargs``` : Extra arguments for the underlying vector store.\n", + "\n", + "**πŸ”„ Return**\n", + "\n", + "- ```ids``` : IDs of the upserted documents." ] }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 15, + "id": "a89dd8e0", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "1359\n", - "Manual Ids == Output Ids: True\n" - ] - } - ], + "outputs": [], "source": [ - "result = upsert_result + upsert_parallel_result\n", - "\n", - "# check number of ids upserted\n", - "print(len(result))\n", + "args = {\n", + " \"texts\": [doc.page_content for doc in docs],\n", + " \"metadatas\": [doc.metadata for doc in docs],\n", + " \"ids\": ids,\n", + " \"batch_size\": 32,\n", + " \"max_workers\": 8,\n", + "}\n", "\n", - "# check manual ids are the same as output ids\n", - "print(\"Manual Ids == Output Ids:\", sorted(result) == sorted(uuids))" + "upsert_parallel_result = crud_manager.upsert_parallel(**args)" ] }, { "cell_type": "markdown", + "id": "6beea197", "metadata": {}, "source": [ - "**[ NOTE ]**\n", + "### Similarity Search\n", "\n", - "As we have only one table, **langchain_pg_embedding** to store data, we have only one column **cmetadata** to store metadata for each document.\n", + "Search for **similar documents** based on **embeddings** .\n", "\n", - "The **cmetadata** column is jsonb type, and thus if you want to update the metadata, you should provide not only the new metadata key-value you want to update, but with all the metadata already stored." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Scroll items from vector store\n", - "As we have added some items to our first vector store, named **langchain_opentutorial** , we can scroll items from the vector store.\n", + "This method uses **\"cosine similarity\"** .\n", "\n", - "This can be done by calling ```scroll``` method.\n", "\n", - "When we scroll items from the vector store we can pass ```ids``` or ```filter``` to get items that we want, or just call ```scroll``` to get ```k```(*default 10*) items.\n", + "**βœ… Args**\n", "\n", - "We can get embedded vector values of each items by set ```include_embedding``` True." - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of items scrolled: 10\n", - "{'content': 'The Little Prince\\nWritten By Antoine de Saiot-Exupery (1900γ€œ1944)', 'metadata': {'id': 'cc23e228-2540-4e5c-8eb3-be6df7a3bf77', 'page': 1, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None}\n" - ] - } - ], - "source": [ - "# Do scroll without ids or filter\n", - "scroll_result = col_manager.scroll()\n", + "- ```query``` : str – The text query for similarity search.\n", + "\n", + "- ```k``` : int – Number of top results to return (default: 10).\n", "\n", - "# print the number of items scrolled and first item that returned.\n", - "print(f\"Number of items scrolled: {len(scroll_result)}\")\n", - "print(scroll_result[0])" + "```**kwargs``` : Additional search options (e.g., filters).\n", + "\n", + "**πŸ”„ Return**\n", + "\n", + "- ```results``` : List[Document] – A list of LangChain Document objects ranked by similarity." ] }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 16, + "id": "5859782b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Number of items scrolled: 3\n", - "{'content': 'The Little Prince\\nWritten By Antoine de Saiot-Exupery (1900γ€œ1944)', 'metadata': {'id': 'cc23e228-2540-4e5c-8eb3-be6df7a3bf77', 'page': 1, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None}\n", - "{'content': '[ Antoine de Saiot-Exupery ]', 'metadata': {'id': 'd4bf8981-2af4-4288-8aaf-6586381973c4', 'page': 2, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None}\n", - "{'content': 'Over the past century, the thrill of flying has inspired some to perform remarkable feats of', 'metadata': {'id': '31dc52cf-530b-449c-a3db-ec64d9e1a10c', 'page': 3, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None}\n" + "Rank 1\n", + "Contents : And he went back to meet the fox. \n", + "\"Goodbye,\" he said. \n", + "\"Goodbye,\" said the fox. \"And now here is my secret, a very simple secret: It is only with the heart that one can see rightly; what is essential is invisible to the eye.\" \n", + "\"What is essential is invisible to the eye,\" the little prince repeated, so that he would be sure to remember.\n", + "\"It is the time you have wasted for your rose that makes your rose so important.\"\n", + "Metadata: {'title': 'Chapter 21'}\n", + "Similarity Score: 0.5095177281477812\n", + "\n", + "Rank 2\n", + "Contents : \"Yes,\" I said to the little prince. 
\"The house, the stars, the desert-- what gives them their beauty is something that is invisible!\" \n", + "\"I am glad,\" he said, \"that you agree with my fox.\"\n", + "Metadata: {'title': 'Chapter 24'}\n", + "Similarity Score: 0.4950920951146853\n", + "\n", + "Rank 3\n", + "Contents : \"The men where you live,\" said the little prince, \"raise five thousand roses in the same garden-- and they do not find in it what they are looking for.\" \n", + "\"They do not find it,\" I replied. \n", + "\"And yet what they are looking for could be found in one single rose, or in a little water.\" \n", + "\"Yes, that is true,\" I said. \n", + "And the little prince added: \n", + "\"But the eyes are blind. One must look with the heart...\"\n", + "Metadata: {'title': 'Chapter 25'}\n", + "Similarity Score: 0.4223722219467283\n", + "\n" ] } ], "source": [ - "# Do scroll with filter\n", - "scroll_result = col_manager.scroll(filter={\"page\": {\"$in\": [1, 2, 3]}})\n", + "# Search by Query\n", "\n", - "# print the number of items scrolled and all items that returned.\n", - "print(f\"Number of items scrolled: {len(scroll_result)}\")\n", - "for r in scroll_result:\n", - " print(r)" + "results = crud_manager.search(query=\"What is essential is invisible to the eye.\", k=3)\n", + "for idx, result in enumerate(results):\n", + " print(f\"Rank {idx+1}\")\n", + " print(f\"Contents : {result['content']}\")\n", + " print(f\"Metadata: {result['metadata']}\")\n", + " print(f\"Similarity Score: {result['score']}\")\n", + " print()" ] }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 17, + "id": "2577dd4a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Number of items scrolled: 3\n", - "{'content': 'The Little Prince\\nWritten By Antoine de Saiot-Exupery (1900γ€œ1944)', 'metadata': {'id': 'cc23e228-2540-4e5c-8eb3-be6df7a3bf77', 'page': 1, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None}\n", - "{'content': '[ Antoine de Saiot-Exupery ]', 'metadata': {'id': 'd4bf8981-2af4-4288-8aaf-6586381973c4', 'page': 2, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None}\n", - "{'content': 'Over the past century, the thrill of flying has inspired some to perform remarkable feats of', 'metadata': {'id': '31dc52cf-530b-449c-a3db-ec64d9e1a10c', 'page': 3, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None}\n" + "Rank 1\n", + "Contents : \"The men where you live,\" said the little prince, \"raise five thousand roses in the same garden-- and they do not find in it what they are looking for.\" \n", + "\"They do not find it,\" I replied. \n", + "\"And yet what they are looking for could be found in one single rose, or in a little water.\" \n", + "\"Yes, that is true,\" I said. \n", + "And the little prince added: \n", + "\"But the eyes are blind. One must look with the heart...\"\n", + "Metadata: {'title': 'Chapter 25'}\n", + "Similarity Score: 0.4223722219467283\n", + "\n", + "Rank 2\n", + "Contents : \"The men where you live,\" said the little prince, \"raise five thousand roses in the same garden-- and they do not find in it what they are looking for.\" \n", + "\"They do not find it,\" I replied. \n", + "\"And yet what they are looking for could be found in one single rose, or in a little water.\" \n", + "\"Yes, that is true,\" I said. \n", + "And the little prince added: \n", + "\"But the eyes are blind. 
One must look with the heart...\"\n",
 "Metadata: {'title': 'Chapter 25'}\n",
 "Similarity Score: 0.4223722219467283\n",
 "\n",
 "Rank 3\n",
 "Contents : \"The men where you live,\" said the little prince, \"raise five thousand roses in the same garden-- and they do not find in it what they are looking for.\" \n",
 "\"They do not find it,\" I replied. \n",
 "\"And yet what they are looking for could be found in one single rose, or in a little water.\" \n",
 "\"Yes, that is true,\" I said. \n",
 "And the little prince added: \n",
 "\"But the eyes are blind. One must look with the heart...\"\n",
 "Metadata: {'title': 'Chapter 25'}\n",
 "Similarity Score: 0.4223722219467283\n",
 "\n"
 ]
 }
 ],
 "source": [
 "# Filter Search\n",
 "\n",
 "results = crud_manager.search(\n",
 "    query=\"Which asteroid did the little prince come from?\",\n",
 "    k=3,\n",
 "    filter={\"title\": \"Chapter 4\"},\n",
 ")\n",
 "for idx, doc in enumerate(results):\n",
 "    print(f\"Rank {idx+1}\")\n",
 "    print(f\"Contents : {doc['content']}\")\n",
 "    print(f\"Metadata: {doc['metadata']}\")\n",
 "    print(f\"Similarity Score: {doc['score']}\")\n",
 "    print()"
 ]
 },
 {
 "cell_type": "markdown",
+ "id": "9ad0ed0c",
 "metadata": {},
 "source": [
- "### Delete items from vector store\n",
+ "### Delete Document\n",
 "\n",
- "We can delete items by filter or ids with ```delete``` method.\n",
+ "Remove documents based on filter conditions\n",
 "\n",
+ "**βœ… Args**\n",
 "\n",
- "For example, we will delete **the first page**, that is ```page``` 1, of the little prince, and try to scroll it."
+ "- ```ids``` : Optional[List[str]] – List of document IDs to delete. If None, deletion is based on filter.\n",
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Delete done successfully\n",
- "[]\n"
- ]
- }
- ],
- "source": [
- "# delete an item\n",
- "col_manager.delete(filter={\"page\": {\"$eq\": 1}})\n",
- "\n",
- "# check if it remains in DB.\n",
- "print(col_manager.scroll(filter={\"page\": {\"$eq\": 1}}))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now we delete 5 items using ```ids```."
+ "- ```filters``` : Optional[Dict] – Dictionary specifying filter conditions (e.g., metadata match).\n", + "\n", + "- ```**kwargs``` : Any additional parameters.\n", + "\n", + "**πŸ”„ Return**\n", + "\n", + "- Boolean" ] }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 18, + "id": "0e3a2c33", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Delete done successfully\n", - "[]\n" + "Delete done successfully\n" ] + }, + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ - "# delete item by ids\n", - "ids = uuids[1:6]\n", - "\n", - "# call delete_node method\n", - "col_manager.delete(ids=ids)\n", - "\n", - "# check if it remains in DB.\n", - "print(col_manager.scroll(ids=ids))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Similarity search\n", - "\n", - "As a vector store, ```pgVector``` support similarity search with various distance metric, **l2** , **inner** (max inner product), **cosine** .\n", - "\n", - "By default, distance strategy is set to **cosine.** \n", - "\n", - "Similarity search can be done by calling the ```search``` method.\n", + "# Delete by ids\n", "\n", - "You can set the number of retrieved documents by passing ```k```(*default to 4*)." + "crud_manager.delete(ids=ids[:10])" ] }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 19, + "id": "60bcb4cf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "{'content': '\"My friend the fox--\" the little prince said to me.', 'metadata': {'id': 'b02aaaa0-9352-403a-8924-cfff4973b926', 'page': 1087, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.631413271508214}\n", - "{'content': '\"No,\" said the little prince. \"I am looking for friends. 
What does that mean-- β€˜tameβ€˜?\"', 'metadata': {'id': '48adae15-36ba-4384-8762-0ef3f0ac33a3', 'page': 958, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.6050397117589812}\n",
- "{'content': 'the little prince returns to his planet', 'metadata': {'id': '4ed37f54-5619-4fc9-912b-4a37fb5a5625', 'page': 1202, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.5846221199406966}\n",
- "{'content': 'midst of the Sahara where he meets a tiny prince from another world traveling the universe in order', 'metadata': {'id': '28b44d4b-cf4e-4cb9-983b-7fb3ec735609', 'page': 25, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.5682375512406654}\n",
- "{'content': '[ Chapter 2 ]\\n- the narrator crashes in the desert and makes the acquaintance of the little prince', 'metadata': {'id': '2a4e0184-bc2c-4558-8eaa-63a1a13da3a0', 'page': 85, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.555493427632688}\n"
+ "Delete done successfully\n"
 ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "True"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
 }
 ],
 "source": [
- "results = col_manager.search(query=\"Does the little prince have a friend?\", k=5)\n",
- "for doc in results:\n",
- "    print(doc)"
+ "# Delete by filters\n",
+ "\n",
+ "crud_manager.delete(filters={\"title\": {\"$eq\": \"Chapter 4\"}})"
 ]
 },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Similarity search with filters\n",
- "\n",
- "You can also do similarity search with filter as we have done in ```scroll``` or ```delete```."
- ]
- },
 {
 "cell_type": "code",
- "execution_count": null,
+ "execution_count": 20,
+ "id": "30d42d2e",
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
- "{'content': 'inhabited region. And yet my little man seemed neither to be straying uncertainly among the sands,', 'metadata': {'id': '1be69712-f0f4-4728-b6f2-d4cf12cddfdb', 'page': 107, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.23158187113240447}\n",
- "{'content': 'Nothing about him gave any suggestion of a child lost in the middle of the desert, a thousand miles', 'metadata': {'id': 'df4ece8c-dcb6-400e-9d8e-0eb5820a5c4e', 'page': 109, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.18018012822748797}\n",
- "{'content': 'among the sands, nor to be fainting from fatigue or hunger or thirst or fear. Nothing about him', 'metadata': {'id': '71b4297c-3b76-43cb-be6a-afca5f59388d', 'page': 108, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.17715921622781305}\n",
- "{'content': 'less charming than its model.', 'metadata': {'id': '507267bc-7076-42f7-ad7c-ed1f835663f2', 'page': 100, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.16131896837723747}\n",
- "{'content': 'a thousand miles from any human habitation. 
When at last I was able to speak, I said to him:', 'metadata': {'id': '524af6ff-1370-4c20-ad94-1b37e45fe0c5', 'page': 110, 'author': 'Saiot-Exupery', 'source': 'the_little_prince.txt'}, 'embedding': None, 'score': 0.15769872390077566}\n" + "Delete done successfully\n" ] + }, + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ - "# search with filter\n", - "result_with_filter = col_manager.search(\n", - " \"Does the little prince have a friend?\",\n", - " filter={\"page\": {\"$between\": [100, 110]}},\n", - " k=5,\n", - ")\n", + "# Delete All\n", "\n", - "for doc in result_with_filter:\n", - " print(doc)" + "crud_manager.delete()" ] } ], "metadata": { "kernelspec": { - "display_name": "testbed", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -944,5 +980,5 @@ } }, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 5 } diff --git a/09-VectorStore/utils/pgvector_interface.py b/09-VectorStore/utils/pgvector_interface.py index ed6002d9f..470f9aa27 100644 --- a/09-VectorStore/utils/pgvector_interface.py +++ b/09-VectorStore/utils/pgvector_interface.py @@ -188,6 +188,7 @@ class EmbeddingStore(Base): class pgVectorIndexManager: def __init__( self, + client=None, connection=None, host=None, port=None, @@ -198,28 +199,34 @@ def __init__( dbname=None, db=None, ): - if connection is not None: - self.connection_str = connection - + if client is not None: + self.client = client + self.connection_str = None + self._engine = client else: - assert host is not None, "host is missing" - assert port is not None, "port is missing" - assert ( - username is not None or user is not None - ), "username(or user) is missing" - assert ( - password is not None or passwd is not None - ), "password(or passwd) is missing" - assert dbname is not None or db is not None, "dbname(or db) is missing" - - self.host = host - self.port = port - self.userName = username if username is not None else user - self.passWord = password if password is not None else passwd - self.dbName = dbname if dbname is not None else db - self.connection_str = f"postgresql+psycopg://{self.userName}:{self.passWord}@{self.host}:{self.port}/{self.dbName}" - - self._engine = create_engine(url=self.connection_str, **({})) + self.client = None + if connection is not None: + self.connection_str = connection + + else: + assert host is not None, "host is missing" + assert port is not None, "port is missing" + assert ( + username is not None or user is not None + ), "username(or user) is missing" + assert ( + password is not None or passwd is not None + ), "password(or passwd) is missing" + assert dbname is not None or db is not None, "dbname(or db) is missing" + + self.host = host + self.port = port + self.userName = username if username is not None else user + self.passWord = password if password is not None else passwd + self.dbName = dbname if dbname is not None else db + self.connection_str = f"postgresql+psycopg://{self.userName}:{self.passWord}@{self.host}:{self.port}/{self.dbName}" + + self._engine = create_engine(url=self.connection_str, **({})) self.session_maker: scoped_session self.session_maker = scoped_session(sessionmaker(bind=self._engine)) self.collection_metadata = None @@ -334,29 +341,44 @@ def create_index(self, collection_name, embedding=None, dimension=None): ) return False else: - return pgVectorDocumentManager( - embedding=embedding, - connection_info=self.connection_str, - 
collection_name=collection_name, - ) + if self.client is not None: + return pgVectorCRUDManager( + embedding=embedding, + client=self.client, + collection_name=collection_name, + ) + else: + return pgVectorCRUDManager( + embedding=embedding, + connection_info=self.connection_str, + collection_name=collection_name, + ) def get_index(self, embedding, collection_name): - return pgVectorDocumentManager( + return pgVectorCRUDManager( embedding=embedding, connection_info=self.connection_str, collection_name=collection_name, ) -class pgVectorDocumentManager(DocumentManager): +class pgVectorCRUDManager(DocumentManager): def __init__( - self, embedding, connection_info=None, collection_name=None, distance="cosine" + self, + embedding, + client=None, + connection_info=None, + collection_name=None, + distance="cosine", ): - if isinstance(connection_info, str): - self.connection_info = connection_info - elif isinstance(connection_info, dict): - self.connection_info = self._make_conn_string(connection_info) - self._engine = create_engine(url=self.connection_info, **({})) + if client is not None: + self._engine = client + else: + if isinstance(connection_info, str): + self.connection_info = connection_info + elif isinstance(connection_info, dict): + self.connection_info = self._make_conn_string(connection_info) + self._engine = create_engine(url=self.connection_info, **({})) self.session_maker: scoped_session self.session_maker = scoped_session(sessionmaker(bind=self._engine)) self.collection_metadata = None @@ -791,13 +813,15 @@ def delete(self, ids=None, filter=None, **kwargs): stmt = stmt.where(self.EmbeddingStore.id.in_(ids)) session.execute(stmt) - elif filter: + elif filter is not None: filter_by = [self.EmbeddingStore.collection_id == collection.uuid] filter_clauses = self._create_filter_clause(filter) if filter_clauses is not None: filter_by.append(filter_clauses) stmt = stmt.where(filter_clauses) session.execute(stmt) + else: + session.execute(stmt) session.commit() except Exception as e: msg = f"Delete failed due to {type(e)} {str(e)}" @@ -815,10 +839,6 @@ def _get_retriever_tags(self) -> list[str]: tags.append(self.embeddings.__class__.__name__) return tags - def as_retriever(self, **kwargs): - tags = kwargs.pop("tags", None) or [] + self._get_retriever_tags() - return pgVectorRetriever(vectorstore=self, tags=tags, **kwargs) - def scroll(self, ids=None, filter=None, k=10, **kwargs): with self._make_sync_session() as session: # type: ignore[arg-type] collection = self.CollectionStore.get_by_name( @@ -857,85 +877,3 @@ def scroll(self, ids=None, filter=None, k=10, **kwargs): ] return docs - - -class pgVectorRetriever(BaseRetriever): - vectorstore: pgVectorDocumentManager - search_type: str = "similarity" - search_kwargs: dict = Field(default_factory=dict) - allowed_search_types: ClassVar[Collection[str]] = ( - "similarity", - "similarity_score_threshold", - "mmr", - ) - - model_config = ConfigDict( - arbitrary_types_allowed=True, - ) - - @model_validator(mode="before") - @classmethod - def validate_search_type(cls, values: dict) -> Any: - search_type = values.get("search_type", "similarity") - if search_type not in cls.allowed_search_types: - msg = ( - f"search_type of {search_type} not allowed. 
Valid values are: " - f"{cls.allowed_search_types}" - ) - raise ValueError(msg) - if search_type == "similarity_score_threshold": - score_threshold = values.get("search_kwargs", {}).get("score_threshold") - if (score_threshold is None) or (not isinstance(score_threshold, float)): - msg = ( - "`score_threshold` is not specified with a float value(0~1) " - "in `search_kwargs`." - ) - raise ValueError(msg) - return values - - def _get_ls_params(self, **kwargs: Any) -> LangSmithRetrieverParams: - """Get standard params for tracing.""" - - _kwargs = self.search_kwargs | kwargs - - ls_params = super()._get_ls_params(**_kwargs) - ls_params["ls_vector_store_provider"] = self.vectorstore.__class__.__name__ - - if self.vectorstore.embeddings: - ls_params["ls_embedding_provider"] = ( - self.vectorstore.embeddings.__class__.__name__ - ) - elif hasattr(self.vectorstore, "embedding") and isinstance( - self.vectorstore.embedding, Embeddings - ): - ls_params["ls_embedding_provider"] = ( - self.vectorstore.embedding.__class__.__name__ - ) - - return ls_params - - def _get_relevant_documents( - self, query: str, *, run_manager, **kwargs: Any - ) -> list[Document]: - _kwargs = self.search_kwargs | kwargs - print(f"_kwargs: {_kwargs}") - if self.search_type == "similarity": - docs = self.vectorstore.search(query, **_kwargs) - else: - msg = f"search_type of {self.search_type} not allowed." - raise ValueError(msg) - return docs - - def add_documents(self, documents: list[Document], **kwargs: Any) -> list[str]: - """Add documents to the vectorstore. - - Args: - documents: Documents to add to the vectorstore. - **kwargs: Other keyword arguments that subclasses might use. - - Returns: - List of IDs of the added texts. - """ - texts = [doc.page_content for doc in documents] - metadatas = [doc.metadata for doc in documents] - return self.vectorstore.upsert(texts, metadatas, **kwargs) From 5a1803ba5e43b613d882beea9f5da56e08672d1d Mon Sep 17 00:00:00 2001 From: XaviereKU Date: Sun, 4 May 2025 16:47:28 +0900 Subject: [PATCH 2/5] Revision --- 09-VectorStore/08-PGVector.ipynb | 55 ++++++++++++++++---------------- 1 file changed, 28 insertions(+), 27 deletions(-) diff --git a/09-VectorStore/08-PGVector.ipynb b/09-VectorStore/08-PGVector.ipynb index 07c465595..247fa85c2 100644 --- a/09-VectorStore/08-PGVector.ipynb +++ b/09-VectorStore/08-PGVector.ipynb @@ -159,7 +159,34 @@ "id": "6890920d", "metadata": {}, "source": [ - "Please write down what you need to set up the Vectorstore here." + "### Set up PGVector\n", + "\n", + "If you are using Windows and have installed postgresql for Windows, you are required to install **vector** extension for postgresql. The following may help [Install pgvector on Windows](https://dev.to/mehmetakar/install-pgvector-on-windows-6gl).\n", + "\n", + "But in this tutorial, we will use ```Docker``` container. 
If you are using Mac or Windows, check [Docker Desktop for Mac](https://docs.docker.com/desktop/setup/install/mac-install/) or [Docker Desktop for Windows](https://docs.docker.com/desktop/setup/install/windows-install).\n", + "\n", + "If you are using ```Docker``` desktop, you can easily set up `PGVector` by running the following command that spins up a ```Docker``` container:\n", + "\n", + "```bash\n", + "docker run --name pgvector-container -e POSTGRES_USER=langchain -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain -p 6024:5432 -d pgvector/pgvector:pg16\n", + "```\n", + "\n", + "For more detailed instructions, please refer to [the official documentation](https://github.com/pgvector/pgvector) \n", + "\n", + "**[ NOTE ]**\n", + "* If you want to maintain the stored data even after container being deleted, you must mount volume like below:\n", + "```bash\n", + "docker run --name pgvector-container -v {/mount/path}:/var/lib/postgresql/data -e POSTGRES_USER=langchain -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain -p 6024:5432 -d pgvector/pgvector:pg16\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "8afc0863", + "metadata": {}, + "source": [ + "## What is PGVector?\n", + "\n" ] }, { @@ -301,32 +328,6 @@ "- Load ```PGVector``` Client" ] }, - { - "cell_type": "markdown", - "id": "835e5c9e", - "metadata": {}, - "source": [ - "### Set up PGVector\n", - "\n", - "If you are using Windows and have installed postgresql for Windows, you are required to install **vector** extension for postgresql. The following may help [Install pgvector on Windows](https://dev.to/mehmetakar/install-pgvector-on-windows-6gl).\n", - "\n", - "But in this tutorial, we will use ```Docker``` container. If you are using Mac or Windows, check [Docker Desktop for Mac](https://docs.docker.com/desktop/setup/install/mac-install/) or [Docker Desktop for Windows](https://docs.docker.com/desktop/setup/install/windows-install).\n", - "\n", - "If you are using ```Docker``` desktop, you can easily set up `PGVector` by running the following command that spins up a ```Docker``` container:\n", - "\n", - "```bash\n", - "docker run --name pgvector-container -e POSTGRES_USER=langchain -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain -p 6024:5432 -d pgvector/pgvector:pg16\n", - "```\n", - "\n", - "For more detailed instructions, please refer to [the official documentation](https://github.com/pgvector/pgvector) \n", - "\n", - "**[ NOTE ]**\n", - "* If you want to maintain the stored data even after container being deleted, you must mount volume like below:\n", - "```bash\n", - "docker run --name pgvector-container -v {/mount/path}:/var/lib/postgresql/data -e POSTGRES_USER=langchain -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain -p 6024:5432 -d pgvector/pgvector:pg16\n", - "```\n" - ] - }, { "cell_type": "markdown", "id": "7eee56b2", From 74066d145f8b6765d02b024fc87b7f7cd3e6773b Mon Sep 17 00:00:00 2001 From: Jongho Lee Date: Sun, 4 May 2025 21:05:17 +0900 Subject: [PATCH 3/5] Update --- 09-VectorStore/08-PGVector.ipynb | 36 +++++++++++++++++++++++++++----- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/09-VectorStore/08-PGVector.ipynb b/09-VectorStore/08-PGVector.ipynb index 247fa85c2..8e6299ae7 100644 --- a/09-VectorStore/08-PGVector.ipynb +++ b/09-VectorStore/08-PGVector.ipynb @@ -26,7 +26,7 @@ "\n", "- [Overview](#overview)\n", "- [Environment Setup](#environment-setup)\n", - "- [What is PGVector?](#what-is-pgvector?)\n", + "- [What is PGVector?](#what-is-pgvector)\n", "- 
[Data](#data)\n",
 "- [Initial Setting PGVector](#initial-setting-pgvector)\n",
 "- [Document Manager](#document-manager)\n",
 "\n",
 "\n",
@@ -186,7 +186,32 @@
 "metadata": {},
 "source": [
 "## What is PGVector?\n",
 "\n",
+ "`PGVector` is a ```PostgreSQL``` extension that enables vector similarity search directly within your ```PostgreSQL``` database, making it ideal for AI applications, semantic search, and recommendation systems.\n",
+ "\n",
+ "This is particularly valuable for teams that already use ```PostgreSQL``` and want to add vector search capabilities without managing separate infrastructure or learning new query languages.\n",
+ "\n",
+ "**Features** :\n",
+ "1. Native ```PostgreSQL``` integration with standard SQL queries\n",
+ "2. Multiple similarity search methods including L2, Inner Product, Cosine\n",
+ "3. Several indexing options including HNSW and IVFFlat\n",
+ "4. Support for up to 2,000 dimensions per vector\n",
+ "5. ACID compliance inherited from ```PostgreSQL```\n",
+ "\n",
+ "**Advantages** :\n",
+ "\n",
+ "1. Free and open-source\n",
+ "2. Easy integration with existing ```PostgreSQL``` databases\n",
+ "3. Full SQL functionality and transactional support\n",
+ "4. No additional infrastructure needed\n",
+ "5. Supports hybrid searches combining vector and traditional SQL queries\n",
+ "\n",
+ "**Disadvantages** :\n",
+ "1. Performance limitations with very large datasets (billions of vectors)\n",
+ "2. Limited to single-node deployment\n",
+ "3. Memory-intensive for large vector dimensions\n",
+ "4. Requires manual optimization for best performance\n",
+ "5. Less specialized features compared to dedicated vector databases"
 ]
 },
 {
 "cell_type": "markdown",
 "id": "f83b661d",
 "metadata": {},
 "source": [
- "## Create collection\n",
+ "### Create collection\n",
 "Now we can create collection with ```index_manager```.\n",
 },
 {
 "cell_type": "code",
- "execution_count": 11,
+ "execution_count": null,
 "id": "4742c2ff",
 "metadata": {},
 "outputs": [],
 "source": [
 "import getpass\n",
 "import os\n",
+ "\n",
 "if not os.environ.get(\"OPENAI_API_KEY\"):\n",
 "    os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Enter API key for OpenAI: \")\n",
 ],
 "metadata": {
 "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
+ "display_name": "testbed",
 "language": "python",
 "name": "python3"
 },

From f3a8561a5595548cb6f7ab9c8ad017b93463a77a Mon Sep 17 00:00:00 2001
From: Jongho Lee
Date: Sun, 4 May 2025 21:09:41 +0900
Subject: [PATCH 4/5] update

---
 09-VectorStore/08-PGVector.ipynb | 22 ++--------------------
 1 file changed, 2 insertions(+), 20 deletions(-)

diff --git a/09-VectorStore/08-PGVector.ipynb b/09-VectorStore/08-PGVector.ipynb
index 8e6299ae7..d7039edf2 100644
--- a/09-VectorStore/08-PGVector.ipynb
+++ b/09-VectorStore/08-PGVector.ipynb
@@ -394,7 +394,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 8,
+ "execution_count": null,
 "id": "eed0ebad",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -406,36 +406,18 @@
 "\n",
 "def get_db_client(conn_str):\n",
 "    \"\"\"\n",
- "\n",
- "\n",
 "    Initializes and returns a VectorStore client instance.\n",
- "\n",
- "\n",
- "\n",
 "    This function loads configuration (e.g., API key, host) from environment\n",
- "\n",
- "\n",
 "    variables or default values and creates a client object to interact\n",
- "\n",
- "\n",
 "    with the {vectordb} Python SDK.\n",
 "\n",
- 
"\n", " client:ClientType - An instance of the {vectordb} client.\n", "\n", - "\n", - "\n", " Raises:\n", - "\n", - "\n", " ValueError: If required configuration is missing.\n", - "\n", - "\n", " \"\"\"\n", + "\n", " try:\n", " client = create_engine(url=conn_str, **({}))\n", " except Exception as e:\n", From c497b02d39f77429cad278822909793da4664697 Mon Sep 17 00:00:00 2001 From: Jongho Lee Date: Sun, 4 May 2025 22:16:44 +0900 Subject: [PATCH 5/5] Match port --- 09-VectorStore/08-PGVector.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/09-VectorStore/08-PGVector.ipynb b/09-VectorStore/08-PGVector.ipynb index d7039edf2..fd0103df6 100644 --- a/09-VectorStore/08-PGVector.ipynb +++ b/09-VectorStore/08-PGVector.ipynb @@ -168,7 +168,7 @@ "If you are using ```Docker``` desktop, you can easily set up `PGVector` by running the following command that spins up a ```Docker``` container:\n", "\n", "```bash\n", - "docker run --name pgvector-container -e POSTGRES_USER=langchain -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain -p 6024:5432 -d pgvector/pgvector:pg16\n", + "docker run --name pgvector-container -e POSTGRES_USER=langchain -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain -p 6088:5432 -d pgvector/pgvector:pg16\n", "```\n", "\n", "For more detailed instructions, please refer to [the official documentation](https://github.com/pgvector/pgvector) \n", @@ -176,7 +176,7 @@ "**[ NOTE ]**\n", "* If you want to maintain the stored data even after container being deleted, you must mount volume like below:\n", "```bash\n", - "docker run --name pgvector-container -v {/mount/path}:/var/lib/postgresql/data -e POSTGRES_USER=langchain -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain -p 6024:5432 -d pgvector/pgvector:pg16\n", + "docker run --name pgvector-container -v {/mount/path}:/var/lib/postgresql/data -e POSTGRES_USER=langchain -e POSTGRES_PASSWORD=langchain -e POSTGRES_DB=langchain -p 6088:5432 -d pgvector/pgvector:pg16\n", "```\n" ] },