From 43b4a841927ee608fb04bb1e536d0d1a17819b74 Mon Sep 17 00:00:00 2001 From: Gwangwon Jung Date: Tue, 6 May 2025 17:43:49 +0900 Subject: [PATCH 1/4] [N-2] 09-VectorStore / 05-Qdrant - Init new contents --- 09-VectorStore/05-Qdrant.ipynb | 1347 +++++------------ .../qdrant/legacy/(legacy)05-Qdrant.ipynb | 1285 ++++++++++++++++ .../utils/{ => qdrant/legacy}/qdrant.py | 0 3 files changed, 1685 insertions(+), 947 deletions(-) create mode 100644 09-VectorStore/utils/qdrant/legacy/(legacy)05-Qdrant.ipynb rename 09-VectorStore/utils/{ => qdrant/legacy}/qdrant.py (100%) diff --git a/09-VectorStore/05-Qdrant.ipynb b/09-VectorStore/05-Qdrant.ipynb index a196a7a17..bd54df77e 100644 --- a/09-VectorStore/05-Qdrant.ipynb +++ b/09-VectorStore/05-Qdrant.ipynb @@ -2,76 +2,63 @@ "cells": [ { "cell_type": "markdown", + "id": "25733da0", "metadata": {}, "source": [ - "# Qdrant\n", + "# {VectorStore Name}\n", "\n", - "- Author: [HyeonJong Moon](https://github.com/hj0302)\n", - "- Design: \n", - "- Peer Review: \n", + "- Author: [Author Name](#Author's-Profile-Link)\n", + "- Design: [Designer](#Designer's-Profile-Link)\n", + "- Peer Review: [Reviewer Name](#Reviewer-Profile-Link)\n", "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n", "\n", - "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [![Open in LangChain Academy](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/66e9eba12c7b7688aa3dbb5e_LCA-badge-green.svg)](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)\n", - "\n", + "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/your-notebook-file-name) [![Open in 
GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/your-notebook-file-name)\n", "\n", "## Overview\n", "\n", - "This notebook demonstrates how to utilize the features related to the `Qdrant` vector database.\n", - "\n", - "[`Qdrant`](https://python.langchain.com/docs/integrations/vectorstores/qdrant/) is an open-source vector similarity search engine designed to store, search, and manage high-dimensional vectors with additional payloads. It offers a production-ready service with a user-friendly API, suitable for applications such as semantic search, recommendation systems, and more.\n", + "This tutorial covers how to use **{Vector Store Name}** with **LangChain** .\n", "\n", - "**Qdrant's architecture** is optimized for efficient vector similarity searches, employing advanced indexing techniques like **Hierarchical Navigable Small World (HNSW)** graphs to enable fast and scalable retrieval of relevant data.\n", + "{A short introduction to vectordb}\n", "\n", + "This tutorial walks you through using **CRUD** operations with the **{VectorDB}** : **storing** , **updating** , **deleting** documents, and performing **similarity-based retrieval** .\n", "\n", "### Table of Contents\n", "\n", "- [Overview](#overview)\n", "- [Environment Setup](#environment-setup)\n", - "- [Credentials](#credentials)\n", - "- [Installation](#installation)\n", - "- [Initialization](#initialization)\n", - "- [Manage Vector Store](#manage-vector-store)\n", - " - [Create a Collection](#create-a-collection)\n", - " - [List Collections](#list-collections)\n", - " - [Delete a Collection](#delete-a-collection)\n", - " - [Add Items to the Vector Store](#add-items-to-the-vector-store)\n", - " - [Delete Items from the Vector Store](#delete-items-from-the-vector-store)\n", - " - [Upsert Items to Vector Store 
(Parallel)](#upsert-items-to-vector-store-parallel)\n", - "- [Query Vector Store](#query-vector-store)\n", - " - [Query Directly](#query-directly)\n", - " - [Similarity Search with Score](#similarity-search-with-score)\n", - " - [Query by Turning into Retriever](#query-by-turning-into-retriever)\n", - " - [Search with Filtering](#search-with-filtering)\n", - " - [Delete with Filtering](#delete-with-filtering)\n", - " - [Filtering and Updating Records](#filtering-and-updating-records)\n", + "- [What is {vectordb}?](#what-is-{vectordb}?)\n", + "- [Data](#data)\n", + "- [Initial Setting {vectordb}](#initial-setting-{vectordb})\n", + "- [Document Manager](#document-manager)\n", "\n", - "### References\n", "\n", - "- [LangChain Qdrant Reference](https://python.langchain.com/docs/integrations/vectorstores/qdrant/)\n", - "- [Qdrant Official Reference](https://qdrant.tech/documentation/frameworks/langchain/)\n", - "- [Qdrant Install Reference](https://qdrant.tech/documentation/guides/installation/)\n", - "- [Qdrant Cloud Reference](https://cloud.qdrant.io)\n", - "- [Qdrant Cloud Quickstart Reference](https://qdrant.tech/documentation/quickstart-cloud/)\n", + "### References\n", "----" ] }, { "cell_type": "markdown", + "id": "c1fac085", "metadata": {}, "source": [ "## Environment Setup\n", "\n", - "Set up the environment. You may refer to Environment Setup for more details.\n", + "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n", "\n", - "[Note]\n", - "- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.\n", - "- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details." + "**[Note]**\n", + "- ```langchain-opentutorial``` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. 
\n", + "- You can checkout the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details." ] }, { "cell_type": "code", - "execution_count": 1, - "metadata": {}, + "execution_count": null, + "id": "98da7994", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, "outputs": [], "source": [ "%%capture --no-stderr\n", @@ -80,19 +67,14 @@ }, { "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.0.1\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" - ] + "execution_count": null, + "id": "800c732b", + "metadata": { + "vscode": { + "languageId": "plaintext" } - ], + }, + "outputs": [], "source": [ "# Install required packages\n", "from langchain_opentutorial import package\n", @@ -100,11 +82,8 @@ "package.install(\n", " [\n", " \"langsmith\",\n", - " \"langchain_openai\",\n", - " \"langchain_qdrant\",\n", - " \"qdrant_client\",\n", - " \"langchain_core\",\n", - " \"fastembed\",\n", + " \"langchain-core\",\n", + " \"python-dotenv\",\n", " ],\n", " verbose=False,\n", " upgrade=False,\n", @@ -113,59 +92,49 @@ }, { "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Environment variables have been set successfully.\n" - ] + "execution_count": null, + "id": "5b36bafa", + "metadata": { + "vscode": { + "languageId": "plaintext" } - ], + }, + "outputs": [], "source": [ "# Set environment variables\n", "from langchain_opentutorial import set_env\n", "\n", "set_env(\n", " {\n", - " \"OPEN_API_KEY\": 
\"\",\n", - " \"QDRANT_API_KEY\": \"\",\n", - " \"QDRANT_URL\": \"\",\n", + " \"OPENAI_API_KEY\": \"\",\n", " \"LANGCHAIN_API_KEY\": \"\",\n", " \"LANGCHAIN_TRACING_V2\": \"true\",\n", " \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n", - " \"LANGCHAIN_PROJECT\": \"Qdrant\",\n", + " \"LANGCHAIN_PROJECT\": \"{Project Name}\",\n", " }\n", ")" ] }, { "cell_type": "markdown", + "id": "8011a0c7", "metadata": {}, "source": [ - "You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.\n", + "You can alternatively set API keys such as ```OPENAI_API_KEY``` in a ```.env``` file and load them.\n", "\n", - "**[Note]** If you are using a `.env` file, proceed as follows." + "[Note] This is not necessary if you've already set the required API keys in previous steps." ] }, { "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" + "execution_count": null, + "id": "70d7e764", + "metadata": { + "vscode": { + "languageId": "plaintext" } - ], + }, + "outputs": [], "source": [ "from dotenv import load_dotenv\n", "\n", @@ -174,1112 +143,596 @@ }, { "cell_type": "markdown", + "id": "6890920d", "metadata": {}, "source": [ - "## **Credentials**\n", - "\n", - "Create a new account or sign in to your existing one, and generate an API key for use in this notebook.\n", - "\n", - "1. **Log in to Qdrant Cloud** : Go to the [Qdrant Cloud](https://cloud.qdrant.io) website and log in using your email, Google account, or GitHub account.\n", - "\n", - "2. **Create a Cluster** : After logging in, navigate to the **\"Clusters\"** section and click the **\"Create\"** button. Choose your desired configurations and region, then click **\"Create\"** to start building your cluster. Once the cluster is created, an API key will be generated for you.\n", - "\n", - "3. 
**Retrieve and Store Your API Key** : When your cluster is created, you will receive an API key. Ensure you save this key in a secure location, as you will need it later. If you lose it, you will have to generate a new one.\n", - "\n", - "4. **Manage API Keys** : To create additional API keys or manage existing ones, go to the **\"Access Management\"** section in the Qdrant Cloud dashboard and select *\"Qdrant Cloud API Keys\"* Here, you can create new keys or delete existing ones.\n", - "\n", - "```\n", - "QDRANT_API_KEY=\"YOUR_QDRANT_API_KEY\"\n", - "```" + "Please write down what you need to set up the Vectorstore here." ] }, { "cell_type": "markdown", + "id": "6f3b5bd2", "metadata": {}, "source": [ - "## **Installation**\n", + "## Data\n", "\n", - "There are several main options for initializing and using the **Qdrant** vector store:\n", + "This part walks you through the **data preparation process** .\n", "\n", - "- **Local Mode** : This mode doesn't require a separate server.\n", - " - **In-memory storage** (data is not persisted)\n", - " - **On-disk storage** (data is saved to your local machine)\n", - "- **Docker Deployments** : You can run **Qdrant** using **Docker**.\n", - "- **Qdrant Cloud** : Use **Qdrant** as a managed cloud service.\n", + "This section includes the following components:\n", "\n", - "For detailed instructions, see the [installation instructions](https://qdrant.tech/documentation/guides/installation/)." + "- Introduce Data\n", + "\n", + "- Preprocessing Data\n" ] }, { "cell_type": "markdown", + "id": "508ae7f7", "metadata": {}, "source": [ - "### In-Memory\n", + "### Introduce Data\n", "\n", - "For simple tests or quick experiments, you might choose to store data directly in memory. This means the data is automatically removed when your client terminates, typically at the end of your script or notebook session." 
- ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Collection 'demo_collection' does not exist or force recreate is enabled. Creating new collection...\n", - "Collection 'demo_collection' created successfully with configuration: {'vectors_config': VectorParams(size=3072, distance=, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None)}\n" - ] - } - ], - "source": [ - "from utils.qdrant import QdrantDocumentManager\n", - "from langchain_openai import OpenAIEmbeddings\n", + "In this tutorial, we will use the fairy tale **📗 The Little Prince** in PDF format as our data.\n", "\n", - "# Define the collection name for storing documents\n", - "collection_name = \"demo_collection\"\n", + "This material complies with the **Apache 2.0 license** .\n", "\n", - "# Initialize the embedding model with a specific OpenAI model\n", - "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "The data is used in a text (.txt) format converted from the original PDF.\n", "\n", - "# Create an instance of QdrantDocumentManager with in-memory storage\n", - "db = QdrantDocumentManager(\n", - " location=\":memory:\", # Use in-memory database for temporary storage\n", - " collection_name=collection_name,\n", - " embedding=embedding,\n", - ")" + "You can view the data at the link below.\n", + "- [Data Link](https://huggingface.co/datasets/sohyunwriter/the_little_prince)" ] }, { "cell_type": "markdown", + "id": "004ea4f4", "metadata": {}, "source": [ - "### On-Disk Storage\n", + "### Preprocessing Data\n", "\n", - "With **on-disk storage**, you can store your vectors directly on your hard drive without requiring a **Qdrant server**. This ensures that your data persists even when you restart the program." 
+ "In this tutorial section, we will preprocess the text data from The Little Prince and convert it into a list of ```LangChain Document``` objects with metadata. \n", + "\n", + "Each document chunk will include a ```title``` field in the metadata, extracted from the first line of each section." ] }, { "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Collection 'demo_collection' does not exist or force recreate is enabled. Creating new collection...\n", - "Collection 'demo_collection' created successfully with configuration: {'vectors_config': VectorParams(size=3072, distance=, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None)}\n" - ] + "execution_count": null, + "id": "8e4cac64", + "metadata": { + "vscode": { + "languageId": "plaintext" } - ], + }, + "outputs": [], "source": [ - "from utils.qdrant import QdrantDocumentManager\n", - "from langchain_openai import OpenAIEmbeddings\n", + "from langchain.schema import Document\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "import re\n", + "from typing import List\n", "\n", - "# Define the path for Qdrant storage\n", - "qdrant_path = \"./qdrant_memory\"\n", + "def preprocessing_data(content:str)->List[Document]:\n", + " # 1. Split the text by double newlines to separate sections\n", + " blocks = content.split(\"\\n\\n\")\n", "\n", - "# Define the collection name for storing documents\n", - "collection_name = \"demo_collection\"\n", + " # 2. 
Initialize the text splitter\n", + " text_splitter = RecursiveCharacterTextSplitter(\n", + " chunk_size=500, # Maximum number of characters per chunk\n", + " chunk_overlap=50, # Overlap between chunks to preserve context\n", + " separators=[\"\\n\\n\", \"\\n\", \" \"] # Order of priority for splitting\n", + " )\n", "\n", - "# Initialize the embedding model with a specific OpenAI model\n", - "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + " documents = []\n", "\n", - "# Create an instance of QdrantDocumentManager with specified storage path\n", - "db = QdrantDocumentManager(\n", - " path=qdrant_path, # Specify the path for Qdrant storage\n", - " collection_name=collection_name,\n", - " embedding=embedding,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Docker Deployments\n", + " # 3. Loop through each section\n", + " for block in blocks:\n", + " lines = block.strip().splitlines()\n", + " if not lines:\n", + " continue\n", "\n", - "You can deploy `Qdrant` in a **production environment** using [`Docker`](https://qdrant.tech/documentation/guides/installation/#docker) and [`Docker Compose`](https://qdrant.tech/documentation/guides/installation/#docker-compose). Refer to the `Docker` and `Docker Compose` setup instructions in the development section for detailed information." 
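The `preprocessing_data` cell added above splits the raw text on blank lines, extracts a `[Title]` from the first line of each section, and chunks the remaining body. That split-and-title logic can be sanity-checked in isolation; the sketch below is a dependency-free mirror of it (the `split_sections` helper is hypothetical and stands in for the real cell, which uses LangChain's `RecursiveCharacterTextSplitter` and `Document` class):

```python
import re
from typing import List, Tuple

def split_sections(content: str) -> List[Tuple[str, str]]:
    """Mirror of the notebook's preprocessing step: split on blank lines,
    pull a [Title] out of the first line of each section, keep the body.
    Hypothetical helper for illustration only -- no chunking is done here."""
    sections = []
    for block in content.split("\n\n"):
        lines = block.strip().splitlines()
        if not lines:
            continue
        # Title is whatever sits inside square brackets on the first line.
        match = re.search(r"\[(.*?)\]", lines[0])
        title = match.group(1).strip() if match else ""
        # Everything after the first line is the section body.
        body = "\n".join(lines[1:]).strip()
        if body:
            sections.append((title, body))
    return sections

sample = "[ Chapter 1 ]\nOnce when I was six years old...\n\n[ Chapter 2 ]\nI lived my life alone."
print(split_sections(sample))
```

Running this on the two-section sample shows each body paired with its bracketed title, which is exactly the `title` metadata the real cell attaches to every chunk.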
- ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "from utils.qdrant import QdrantDocumentManager\n", - "from langchain_openai import OpenAIEmbeddings\n", + " # Extract title from the first line using square brackets [ ]\n", + " first_line = lines[0]\n", + " title_match = re.search(r\"\\[(.*?)\\]\", first_line)\n", + " title = title_match.group(1).strip() if title_match else \"\"\n", "\n", - "# Define the URL for Qdrant server\n", - "url = \"http://localhost:6333\"\n", + " # Remove the title line from content\n", + " body = \"\\n\".join(lines[1:]).strip()\n", + " if not body:\n", + " continue\n", "\n", - "# Define the collection name for storing documents\n", - "collection_name = \"demo_collection\"\n", + " # 4. Chunk the section using the text splitter\n", + " chunks = text_splitter.split_text(body)\n", "\n", - "# Initialize the embedding model with a specific OpenAI model\n", - "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + " # 5. Create a LangChain Document for each chunk with the same title metadata\n", + " for chunk in chunks:\n", + " documents.append(Document(page_content=chunk, metadata={\"title\": title}))\n", "\n", - "# Create an instance of QdrantDocumentManager with specified storage path\n", - "db = QdrantDocumentManager(\n", - " url=url, # Specify the path for Qdrant storage\n", - " collection_name=collection_name,\n", - " embedding=embedding,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Qdrant Cloud\n", + " print(f\"Generated {len(documents)} chunked documents.\")\n", "\n", - "For a **production environment**, you can use [**Qdrant Cloud**](https://cloud.qdrant.io/). It offers fully managed `Qdrant` databases with features such as **horizontal and vertical scaling**, **one-click setup and upgrades**, **monitoring**, **logging**, **backups**, and **disaster recovery**. 
For more information, refer to the [**Qdrant Cloud documentation**](https://qdrant.tech/documentation/cloud/).\n" + " return documents" ] }, { "cell_type": "code", - "execution_count": 8, - "metadata": {}, + "execution_count": null, + "id": "1d091a51", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, "outputs": [], "source": [ - "import getpass\n", - "import os\n", + "# Load the entire text file\n", + "with open(\"the_little_prince.txt\", \"r\", encoding=\"utf-8\") as f:\n", + " content = f.read()\n", "\n", - "# Fetch the Qdrant server URL from environment variables or prompt for input\n", - "if not os.getenv(\"QDRANT_URL\"):\n", - " os.environ[\"QDRANT_URL\"] = getpass.getpass(\"Enter your Qdrant Cloud URL key: \")\n", - "QDRANT_URL = os.environ.get(\"QDRANT_URL\")\n", + "# Preprocessing Data\n", "\n", - "# Fetch the Qdrant API key from environment variables or prompt for input\n", - "if not os.getenv(\"QDRANT_API_KEY\"):\n", - " os.environ[\"QDRANT_API_KEY\"] = getpass.getpass(\"Enter your Qdrant API key: \")\n", - "QDRANT_API_KEY = os.environ.get(\"QDRANT_API_KEY\")" + "docs = preprocessing_data(content=content)" ] }, { - "cell_type": "code", - "execution_count": 9, + "cell_type": "markdown", + "id": "1977d4ff", "metadata": {}, - "outputs": [], "source": [ - "from utils.qdrant import QdrantDocumentManager\n", - "from langchain_openai import OpenAIEmbeddings\n", + "## Initial Setting {vectordb}\n", "\n", - "# Define the collection name for storing documents\n", - "collection_name = \"demo_collection\"\n", + "This part walks you through the initial setup of **{vectordb}** .\n", "\n", - "# Initialize the embedding model with a specific OpenAI model\n", - "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "This section includes the following components:\n", "\n", - "# Create an instance of QdrantDocumentManager with specified storage path\n", - "db = QdrantDocumentManager(\n", - " url=QDRANT_URL,\n", - " 
api_key=QDRANT_API_KEY,\n", - " collection_name=collection_name,\n", - " embedding=embedding,\n", - ")" + "- Load Embedding Model\n", + "\n", + "- Load {vectordb} Client" ] }, { "cell_type": "markdown", + "id": "7eee56b2", "metadata": {}, "source": [ - "## Initialization\n", + "### Load Embedding Model\n", "\n", - "Once you've established your **vector store**, you'll likely need to manage the **collections** within it. Here are some common operations you can perform:\n", + "In the **Load Embedding Model** section, you'll learn how to load an embedding model.\n", "\n", - "- **Create a collection**\n", - "- **List collections**\n", - "- **Delete a collection**" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To proceed with the tutorial, we will use **Qdrant Cloud** for the next steps. This approach ensures that your data is securely stored in the cloud, allowing for seamless access, comprehensive testing, and experimentation across different environments." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Create a Collection\n", + "This tutorial uses **OpenAI's** **API-Key** for loading the model.\n", "\n", - "The `QdrantDocumentManager` class allows you to create a new **collection** in `Qdrant`. It can automatically create a collection if it doesn't exist or if you want to **recreate** it. You can specify configurations for **dense** and **sparse vectors** to meet different search needs. Use the `_ensure_collection_exists` method for **automatic creation** or call `create_collection` directly when needed." + "*💡 If you prefer to use another embedding model, see the instructions below.*\n", + "- [Embedding Models](https://python.langchain.com/docs/integrations/text_embedding/)" ] }, { "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Collection 'test_collection' does not exist or force recreate is enabled. 
Creating new collection...\n", - "Collection 'test_collection' created successfully with configuration: {'vectors_config': VectorParams(size=3072, distance=, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None)}\n" - ] + "execution_count": null, + "id": "5bd5c3c9", + "metadata": { + "vscode": { + "languageId": "plaintext" } - ], + }, + "outputs": [], "source": [ - "from utils.qdrant import QdrantDocumentManager\n", + "import os\n", "from langchain_openai import OpenAIEmbeddings\n", - "from qdrant_client.http.models import Distance\n", - "\n", - "# Define the collection name for storing documents\n", - "collection_name = \"test_collection\"\n", "\n", - "# Initialize the embedding model with a specific OpenAI model\n", - "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", - "\n", - "# Create an instance of QdrantDocumentManager with specified storage path\n", - "db = QdrantDocumentManager(\n", - " url=QDRANT_URL,\n", - " api_key=QDRANT_API_KEY,\n", - " collection_name=collection_name,\n", - " embedding=embedding,\n", - " metric=Distance.COSINE,\n", - ")" + "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")" ] }, { "cell_type": "markdown", + "id": "40f65795", "metadata": {}, "source": [ - "### List Collections\n", + "### Load {vectordb} Client\n", "\n", - "The `QdrantDocumentManager` class lets you list all **collections** in your `Qdrant` instance using the `get_collections` method. 
This retrieves and displays the **names** of all existing collections.\n" + "In the **Load {vectordb} Client** section, we cover how to load the **database client object** using the **Python SDK** for **{vectordb}** .\n", + "- [Python SDK Docs]()" ] }, { "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Collection Name: test_collection\n", - "Collection Name: sparse_collection\n", - "Collection Name: dense_collection\n", - "Collection Name: insta_image_search_test\n", - "Collection Name: insta_image_search\n", - "Collection Name: demo_collection\n" - ] + "execution_count": null, + "id": "eed0ebad", + "metadata": { + "vscode": { + "languageId": "plaintext" } - ], + }, + "outputs": [], "source": [ - "# Retrieve the list of collections from the Qdrant client\n", - "collections = db.client.get_collections()\n", + "# Create Database Client Object Function\n", "\n", - "# Iterate over each collection and print its details\n", - "for collection in collections.collections:\n", - " print(f\"Collection Name: {collection.name}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Delete a Collection\n", + "def get_db_client():\n", + " \"\"\"\n", + " Initializes and returns a VectorStore client instance.\n", "\n", - "The `QdrantDocumentManager` class allows you to delete a **collection** using the `delete_collection` method. This method removes the specified collection from your `Qdrant` instance." 
+ " This function loads configuration (e.g., API key, host) from environment\n", + " variables or default values and creates a client object to interact\n", + " with the {vectordb} Python SDK.\n", + "\n", + " Returns:\n", + " client:ClientType - An instance of the {vectordb} client.\n", + "\n", + " Raises:\n", + " ValueError: If required configuration is missing.\n", + " \"\"\"\n", + " return client" ] }, { "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Collection 'test_collection' has been deleted.\n" - ] + "execution_count": null, + "id": "2b5f4116", + "metadata": { + "vscode": { + "languageId": "plaintext" } - ], + }, + "outputs": [], "source": [ - "# Define collection name\n", - "collection_name = \"test_collection\"\n", + "# Get DB Client Object\n", "\n", - "# Delete the collection\n", - "if db.client.delete_collection(collection_name=collection_name):\n", - " print(f\"Collection '{collection_name}' has been deleted.\")" + "client = get_db_client()" ] }, { "cell_type": "markdown", + "id": "3a5a97a0", "metadata": {}, "source": [ - "## Manage VectorStore\n", + "## Document Manager\n", "\n", - "After you've created your **vector store**, you can interact with it by **adding** or **deleting** items. Here are some common operations:" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Add Items to the Vector Store\n", + "To support the **Langchain-Opentutorial** , we implemented a custom set of **CRUD** functionalities for VectorDBs. \n", "\n", - "The `QdrantDocumentManager` class lets you add items to your **vector store** using the `upsert` method. This method **updates** existing documents with new data if their IDs already exist." 
- ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [], - "source": [ - "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", - "from langchain.document_loaders import TextLoader\n", - "from uuid import uuid4\n", + "The following operations are included:\n", "\n", - "# Load the text file\n", - "loader = TextLoader(\"./data/the_little_prince.txt\")\n", - "documents = loader.load()\n", + "- ```upsert``` : Update existing documents or insert if they don’t exist\n", "\n", - "# Initialize the text splitter\n", - "text_splitter = RecursiveCharacterTextSplitter(\n", - " chunk_size=600, chunk_overlap=100, length_function=len\n", - ")\n", + "- ```upsert_parallel``` : Perform upserts in parallel for large-scale data\n", "\n", - "split_docs = text_splitter.split_documents(documents)\n", - "\n", - "# Generate unique IDs for documents\n", - "uuids = [str(uuid4()) for _ in split_docs[:30]]\n", - "page_contents = [doc.page_content for doc in split_docs[:30]]\n", - "metadatas = [doc.metadata for doc in split_docs[:30]]" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['22417c4f-bf11-4e92-978a-6c436dec39ca',\n", - " '28f56a01-34af-46ae-aeb4-ea6e0fcacb62',\n", - " 'c6d06501-9595-4272-80b5-f0747cb145fc',\n", - " 'b4b901bf-6e83-4658-b5e9-a1d5a80c767d',\n", - " '21b1b98d-0707-4128-a0bd-78c94db6cbf3',\n", - " 'c49b5d7c-c330-4d59-9097-25c3c52510b9',\n", - " '36ddc677-4fa9-47ee-b2e0-284bdb9062a1',\n", - " '32fde659-84c6-4679-b4df-d4b1d11e645f',\n", - " 'caf0b611-4a38-4a94-84a9-c3a98ac0b2a1',\n", - " '0e655834-9a6c-48a8-8a3b-5d5e2b1d6c2c',\n", - " '493aaa5c-b89d-429b-a425-57f20f3564ed',\n", - " '6f7f0755-d226-4aec-a714-a53d7a705e51',\n", - " '8b68a39b-f990-4ce1-9fbd-675f5103d3ff',\n", - " '73ef217b-9114-48a4-a447-0deb916b3d5a',\n", - " '63b99932-4e84-4cb2-a5ef-1d83fdbc4e6a',\n", - " '45fd3628-ca2f-439d-97ba-cc34da564f36',\n", - " 
'876f59dd-a9ae-4af7-84e8-5d8fe78cf7d3',\n", - " '5aa82f42-534f-447f-94b5-9ed4f3571091',\n", - " 'eb69cc2a-8899-4d9e-ad8f-adebea281ff0',\n", - " '1defc340-16b4-4ee0-94de-0dabc23e5d07',\n", - " '368d5f90-75d2-406c-8dd2-c7d8736b6944',\n", - " '842812f6-ee9f-43ae-8f6d-53015a5e57af',\n", - " '61031399-09ed-4c88-bc93-1018b942df71',\n", - " 'a6ac25f2-2dd5-445f-95dd-6a4d9fc4081c',\n", - " '08215031-2393-4d0c-82a2-53a6a90d169f',\n", - " 'f41de48c-1e7d-4036-a75e-a10ac579081d',\n", - " 'a2d6b6d1-5bbc-4f17-9b95-c917021614f0',\n", - " '3603a2e7-6021-46c9-8f4c-d53056849c1a',\n", - " 'e1fb95a1-7c1c-4aed-a628-b39e0907b744',\n", - " '2a42fbb6-9450-4d86-a5f8-65f333c10d4c']" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from utils.qdrant import QdrantDocumentManager\n", - "from langchain_openai import OpenAIEmbeddings\n", + "- ```similarity_search``` : Search for similar documents based on embeddings\n", "\n", - "# Define the collection name for storing documents\n", - "collection_name = \"demo_collection\"\n", + "- ```delete``` : Remove documents based on filter conditions\n", "\n", - "# Initialize the embedding model with a specific OpenAI model\n", - "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "Each of these features is implemented as class methods specific to each VectorDB.\n", "\n", - "# Create an instance of QdrantDocumentManager with specified storage path\n", - "db = QdrantDocumentManager(\n", - " url=QDRANT_URL,\n", - " api_key=QDRANT_API_KEY,\n", - " collection_name=collection_name,\n", - " embedding=embedding,\n", - ")\n", + "In this tutorial, you can easily utilize these methods to interact with your VectorDB.\n", "\n", - "db.upsert(texts=page_contents, metadatas=metadatas, ids=uuids)" + "*We plan to continuously expand the functionality by adding more common operations in the future.*" ] }, { "cell_type": "markdown", + "id": "65a40601", "metadata": {}, "source": [ - "### Delete 
Items from the Vector Store\n", - "\n", - "The `QdrantDocumentManager` class allows you to delete items from your **vector store** using the `delete` method. You can specify items to delete by providing **IDs** or **filters**.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [], - "source": [ - "delete_ids = [uuids[0]]\n", + "### Create Instance\n", "\n", - "db.delete(ids=delete_ids)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Upsert Items to Vector Store (Parallel)\n", + "First, we create an instance of the **{vectordb}** helper class to use its CRUD functionalities.\n", "\n", - "The `QdrantDocumentManager` class supports **parallel upserts** using the `upsert_parallel` method. This efficiently **adds** or **updates** multiple items with unique **IDs**, **data**, and **metadata**." + "This class is initialized with the **{vectordb} Python SDK client instance** and the **embedding model instance** , both of which were defined in the previous section." 
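The Document Manager described above exposes `upsert` , `upsert_parallel` , `similarity_search` , and `delete` as class methods, initialized with a DB client and an embedding model. As a rough illustration of those CRUD semantics, here is a hypothetical in-memory toy (a dict and word-overlap scoring stand in for the real vector DB client and embedding model; the actual per-VectorDB implementations live in each tutorial's utils package and differ from this sketch):

```python
from typing import Any, Dict, List, Optional

class ToyDocumentManager:
    """Hypothetical, in-memory stand-in for the per-VectorDB helper class.
    Illustrates upsert / search / delete semantics only."""

    def __init__(self) -> None:
        self.store: Dict[str, Dict[str, Any]] = {}

    def upsert(self, texts: List[str], metadatas: Optional[List[Dict]] = None,
               ids: Optional[List[str]] = None) -> None:
        # Update existing ids in place, insert the ones that don't exist.
        metadatas = metadatas or [{} for _ in texts]
        ids = ids or [str(i) for i in range(len(texts))]
        for id_, text, meta in zip(ids, texts, metadatas):
            self.store[id_] = {"text": text, "metadata": meta}

    def search(self, query: str, k: int = 2) -> List[str]:
        # Placeholder for embedding-based similarity: rank by term overlap.
        terms = set(query.lower().split())
        scored = sorted(
            self.store.values(),
            key=lambda d: len(terms & set(d["text"].lower().split())),
            reverse=True,
        )
        return [d["text"] for d in scored[:k]]

    def delete(self, ids: Optional[List[str]] = None,
               filters: Optional[Dict] = None) -> None:
        # Remove by explicit id and/or by exact metadata match.
        for id_ in ids or []:
            self.store.pop(id_, None)
        if filters:
            doomed = [i for i, d in self.store.items()
                      if all(d["metadata"].get(f) == v for f, v in filters.items())]
            for i in doomed:
                del self.store[i]

mgr = ToyDocumentManager()
mgr.upsert(["the fox and the prince", "a rose on a small planet"],
           metadatas=[{"title": "ch21"}, {"title": "ch8"}],
           ids=["a", "b"])
print(mgr.search("little prince", k=1))
mgr.delete(filters={"title": "ch8"})
print(sorted(mgr.store))
```

A real manager would embed `texts` at upsert time and run an approximate nearest-neighbor query in `search` ; the calling pattern, however, is the same as in the tutorials.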
] }, { "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['286d99ae-019b-41ed-962a-c1a26bf41c4a',\n", - " 'e17ce584-3576-45bb-8d82-36cfdd4c89d1',\n", - " 'aed142fa-a13a-421f-9e60-ab1af13a8b15',\n", - " '14337336-edb2-4ea1-880c-2f4613f1f999',\n", - " '91d47b16-4a1f-4f1f-ba07-78f9b2db06d8',\n", - " '6b58d2d9-1a4b-4e03-97fd-d584d502b606',\n", - " 'e7b6f4b5-27e0-4787-a74c-b8d17a7038ea',\n", - " '01579e1a-9935-443d-a7a5-b9ffdd1e07f9',\n", - " '4d516f16-09cf-4b7e-8d65-455eced738e7',\n", - " '7fd284a3-5f10-407f-a8fe-44a923263748',\n", - " '55fae9b6-046a-4f09-9cf0-08568efde43c',\n", - " 'b4386ade-1590-41fa-94e7-cc34d4f4c9da',\n", - " 'd27d8f98-349a-4c45-9f82-31e983edfa8c',\n", - " '20537c5d-80d1-4d72-8507-73fd21e3f11a',\n", - " 'ae418ede-69f6-4703-9d9d-2e31d59441b2',\n", - " '975d663d-f825-446d-9824-7997058ca24a',\n", - " 'c8086e33-6345-4403-a98c-a4cd46375cd1',\n", - " 'ec887b4f-eecf-4325-8117-293e6fd8dfd6',\n", - " 'c5fa1381-e30d-47d8-aad3-d46cc8520953',\n", - " '1b20e891-e44f-4640-ab24-03d692627265',\n", - " '0d37a3dd-329f-4901-a828-71a704f7a35e',\n", - " '170420dc-b02c-42f3-a36d-c56973784fb7',\n", - " 'f11893c3-20c5-43e4-9c0f-905d91c7a668',\n", - " '37327ff1-7f17-43b0-89ca-65ab69c14df6',\n", - " '92a4e2ec-7418-4241-a1e3-3bf2668a9fd6',\n", - " 'ea018faa-293f-4329-b8ae-92dc3fcdd909',\n", - " '09c78d94-0b4c-41cc-b530-7504f3d62dc4',\n", - " '907ad8d0-427d-4f29-b801-aea90a6a86aa',\n", - " '86508b0c-4ff7-422f-b13e-1443e47ef5d3',\n", - " 'b12e4c37-50a1-4257-80ae-de372a4a77ce']" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" + "execution_count": null, + "id": "dccab807", + "metadata": { + "vscode": { + "languageId": "plaintext" } - ], + }, + "outputs": [], "source": [ - "# Generate unique IDs for documents\n", - "uuids = [str(uuid4()) for _ in split_docs[30:60]]\n", - "page_contents = [doc.page_content for doc in split_docs[30:60]]\n", - "metadatas = [doc.metadata for doc in 
split_docs[30:60]]\n", - "\n", - "db.upsert_parallel(\n", - " texts=page_contents,\n", - " metadatas=metadatas,\n", - " ids=uuids,\n", - " batch_size=32,\n", - " workers=10,\n", - ")" + "# crud_manager = {vectordb}DocumentManager(client=client, embedding=embedding)" ] }, { "cell_type": "markdown", + "id": "c1c0c67f", "metadata": {}, "source": [ - "## Query VectorStore\n", + "Now you can use the following **CRUD** operations with the ```crud_manager``` instance.\n", "\n", - "Once your **vector store** has been created and the relevant **documents** have been added, you will most likely wish to **query** it during the running of your `chain` or `agent`." + "This instance allows you to easily manage documents in your **{vectordb}** ." ] }, { "cell_type": "markdown", + "id": "7c6c53c5", "metadata": {}, "source": [ - "### Query Directly\n", + "### Upsert Document\n", + "\n", + "**Update** existing documents or **insert** them if they don’t exist.\n", + "\n", + "**✅ Args**\n", + "\n", + "- ```texts``` : Iterable[str] – List of text contents to be inserted/updated.\n", + "\n", + "- ```metadatas``` : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).\n", "\n", - "The `QdrantDocumentManager` class allows direct **querying** using the `search` method. It performs **similarity searches** by converting queries into **vector embeddings** to find similar **documents**.\n" + "- ```ids``` : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.\n", + "\n", + "- ```**kwargs``` : Extra arguments for the underlying vector store.\n", + "\n", + "**🔄 Return**\n", + "\n", + "- None" ] }, { "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "* for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. 
In the book, the little pri\n", - " [{'source': './data/the_little_prince.txt'}]\n", - "\n", - "\n", - "* for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", - " [{'source': './data/the_little_prince.txt'}]\n", - "\n", - "\n", - "* for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", - " [{'source': './data/the_little_prince.txt'}]\n", - "\n", - "\n" - ] + "execution_count": null, + "id": "f3a6c32b", + "metadata": { + "vscode": { + "languageId": "plaintext" } - ], + }, + "outputs": [], "source": [ - "query = \"What is the significance of the rose in The Little Prince?\"\n", + "from uuid import uuid4\n", "\n", - "response = db.search(\n", - " query=query,\n", - " k=3,\n", - ")\n", + "args = {\n", + " \"texts\": [doc.page_content for doc in docs[:2]],\n", + " \"metadatas\": [doc.metadata for doc in docs[:2]],\n", + " \"ids\": [str(uuid4()) for _ in docs[:2]]\n", + " # If you need extra arguments, add them here.\n", + "}\n", "\n", - "for res in response:\n", - " payload = res[\"payload\"]\n", - " print(f\"* {payload['page_content'][:200]}\\n [{payload['metadata']}]\\n\\n\")" + "# crud_manager.upsert(**args)" ] }, { "cell_type": "markdown", + "id": "278fe1ed", "metadata": {}, "source": [ - "### Similarity Search with Score\n", + "### Upsert Parallel Document\n", + "\n", + "Perform **upserts** in **parallel** for large-scale data.\n", "\n", - "The `QdrantDocumentManager` class enables **similarity searches** with **scores** using the `search` method. 
This provides a **relevance score** for each **document** found.\n" + "**✅ Args**\n", + "\n", + "- ```texts``` : Iterable[str] – List of text contents to be inserted/updated.\n", + "\n", + "- ```metadatas``` : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).\n", + "\n", + "- ```ids``` : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.\n", + "\n", + "- ```batch_size``` : int – Number of documents per batch (default: 32).\n", + "\n", + "- ```workers``` : int – Number of parallel workers (default: 10).\n", + "\n", + "- ```**kwargs``` : Extra arguments for the underlying vector store.\n", + "\n", + "**🔄 Return**\n", + "\n", + "- None" ] }, { "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "* [SIM=0.527] for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", - " [{'source': './data/the_little_prince.txt'}]\n", - "\n", - "\n", - "* [SIM=0.527] for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", - " [{'source': './data/the_little_prince.txt'}]\n", - "\n", - "\n", - "* [SIM=0.527] for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. 
In the book, the little pri\n", - " [{'source': './data/the_little_prince.txt'}]\n", - "\n", - "\n" - ] + "execution_count": null, + "id": "a89dd8e0", + "metadata": { + "vscode": { + "languageId": "plaintext" } - ], + }, + "outputs": [], "source": [ - "# Define the query to search in the database\n", - "query = \"What is the significance of the rose in The Little Prince?\"\n", - "\n", - "# Perform the search with the specified query and number of results\n", - "response = db.search(query=query, k=3)\n", - "\n", - "for res in response:\n", - " payload = res[\"payload\"]\n", - " score = res[\"score\"]\n", - " print(\n", - " f\"* [SIM={score:.3f}] {payload['page_content'][:200]}\\n [{payload['metadata']}]\\n\\n\"\n", - " )" + "from uuid import uuid4\n", + "\n", + "args = {\n", + " \"texts\": [doc.page_content for doc in docs],\n", + " \"metadatas\": [doc.metadata for doc in docs],\n", + " \"ids\": [str(uuid4()) for _ in docs]\n", + " # If you need extra arguments, add them here.\n", + "}\n", + "\n", + "# crud_manager.upsert_parallel(**args)" ] }, { "cell_type": "markdown", + "id": "6beea197", "metadata": {}, "source": [ - "### Query by Turning into Retriever\n", + "### Similarity Search\n", "\n", - "The `QdrantDocumentManager` class can transform the **vector store** into a `retriever`. This allows for easier **integration** into **workflows** or **chains**.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "* for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", - " [{'source': './data/the_little_prince.txt', '_id': 'c49b5d7c-c330-4d59-9097-25c3c52510b9', '_collection_name': 'demo_collection'}]\n", - "\n", - "\n", - "* for decades. 
In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", - " [{'source': './data/the_little_prince.txt', '_id': '9567e6cf-2f89-4c3b-8a41-7167770fbcd3', '_collection_name': 'demo_collection'}]\n", - "\n", - "\n", - "* for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", - " [{'source': './data/the_little_prince.txt', '_id': 'e2a0d06a-9ccd-4e9e-8d4a-4e1292b6ccef', '_collection_name': 'demo_collection'}]\n", - "\n", - "\n" - ] - } - ], - "source": [ - "from langchain_qdrant import QdrantVectorStore\n", "\n", - "# Initialize QdrantVectorStore with the client, collection name, and embedding\n", - "vector_store = QdrantVectorStore(\n", - " client=db.client, collection_name=db.collection_name, embedding=db.embedding\n", - ")\n", + "Search for **similar documents** based on **embeddings** .\n", "\n", - "query = \"What is the significance of the rose in The Little Prince?\"\n", + "This method uses **\"cosine similarity\"** .\n", "\n", - "# Transform the vector store into a retriever with specific search parameters\n", - "retriever = vector_store.as_retriever(\n", - " search_type=\"similarity_score_threshold\",\n", - " search_kwargs={\"k\": 3, \"score_threshold\": 0.3},\n", - ")\n", "\n", + "**✅ Args**\n", "\n", - "results = retriever.invoke(query)\n", "\n", + "- ```query``` : str – The text query for similarity search.\n", "\n", - "for res in results:\n", - " print(f\"* {res.page_content[:200]}\\n [{res.metadata}]\\n\\n\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Search with Filtering\n", + "- ```k``` : int – Number of top results to return (default: 10).\n", + "\n", + "- ```**kwargs``` : Additional search options (e.g., filters).\n", - "The 
`QdrantDocumentManager` class allows **searching with filters** to retrieve records based on specific **metadata values**. This is done using the `scroll` method with a defined **filter query**." + "**🔄 Return**\n", + "\n", + "- ```results``` : List[Document] – A list of LangChain Document objects ranked by similarity." ] }, { "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[Record(id='09c78d94-0b4c-41cc-b530-7504f3d62dc4', payload={'page_content': '[ Chapter 7 ]\\n- the narrator learns about the secret of the little prince‘s life \\nOn the fifth day-- again, as always, it was thanks to the sheep-- the secret of the little prince‘s life was revealed to me. Abruptly, without anything to lead up to it, and as if the question had been born of long and silent meditation on his problem, he demanded: \\n\"A sheep-- if it eats little bushes, does it eat flowers, too?\"\\n\"A sheep,\" I answered, \"eats anything it finds in its reach.\"\\n\"Even flowers that have thorns?\"\\n\"Yes, even flowers that have thorns.\" \\n\"Then the thorns-- what use are they?\"', 'metadata': {'source': './data/the_little_prince.txt'}}, vector=None, shard_key=None, order_value=None),\n", - " Record(id='0e655834-9a6c-48a8-8a3b-5d5e2b1d6c2c', payload={'page_content': '[ Chapter 1 ]\\n- we are introduced to the narrator, a pilot, and his ideas about grown-ups\\nOnce when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. Here is a copy of the drawing. \\n(picture)\\nIn the book it said: \"Boa constrictors swallow their prey whole, without chewing it. 
After that they are not able to move, and they sleep through the six months that they need for digestion.\"', 'metadata': {'source': './data/the_little_prince.txt'}}, vector=None, shard_key=None, order_value=None),\n", - " Record(id='286d99ae-019b-41ed-962a-c1a26bf41c4a', payload={'page_content': '[ Chapter 4 ]\\n- the narrator speculates as to which asteroid from which the little prince came\\u3000\\u3000\\nI had thus learned a second fact of great importance: this was that the planet the little prince came from was scarcely any larger than a house!', 'metadata': {'source': './data/the_little_prince.txt'}}, vector=None, shard_key=None, order_value=None),\n", - " Record(id='45fd3628-ca2f-439d-97ba-cc34da564f36', payload={'page_content': '[ Chapter 2 ]\\n- the narrator crashes in the desert and makes the acquaintance of the little prince\\nSo I lived my life alone, without anyone that I could really talk to, until I had an accident with my plane in the Desert of Sahara, six years ago. Something was broken in my engine. And as I had with me neither a mechanic nor any passengers, I set myself to attempt the difficult repairs all alone. It was a question of life or death for me: I had scarcely enough drinking water to last a week.', 'metadata': {'source': './data/the_little_prince.txt'}}, vector=None, shard_key=None, order_value=None),\n", - " Record(id='d27d8f98-349a-4c45-9f82-31e983edfa8c', payload={'page_content': '[ Chapter 5 ]\\n- we are warned as to the dangers of the baobabs\\nAs each day passed I would learn, in our talk, something about the little prince‘s planet, his departure from it, his journey. The information would come very slowly, as it might chance to fall from his thoughts. It was in this way that I heard, on the third day, about the catastrophe of the baobabs.\\nThis time, once more, I had the sheep to thank for it. 
For the little prince asked me abruptly-- as if seized by a grave doubt-- \"It is true, isn‘t it, that sheep eat little bushes?\" \\n\"Yes, that is true.\" \\n\"Ah! I am glad!\"', 'metadata': {'source': './data/the_little_prince.txt'}}, vector=None, shard_key=None, order_value=None),\n", - " Record(id='f11893c3-20c5-43e4-9c0f-905d91c7a668', payload={'page_content': '[ Chapter 6 ]\\n- the little prince and the narrator talk about sunsets\\nOh, little prince! Bit by bit I came to understand the secrets of your sad little life... For a long time you had found your only entertainment in the quiet pleasure of looking at the sunset. I learned that new detail on the morning of the fourth day, when you said to me: \\n\"I am very fond of sunsets. Come, let us go look at a sunset now.\" \\n\"But we must wait,\" I said. \\n\"Wait? For what?\" \\n\"For the sunset. We must wait until it is time.\"', 'metadata': {'source': './data/the_little_prince.txt'}}, vector=None, shard_key=None, order_value=None),\n", - " Record(id='f41de48c-1e7d-4036-a75e-a10ac579081d', payload={'page_content': '[ Chapter 3 ]\\n- the narrator learns more about from where the little prince came\\nIt took me a long time to learn where he came from. The little prince, who asked me so many questions, never seemed to hear the ones I asked him. It was from words dropped by chance that, little by little, everything was revealed to me. \\nThe first time he saw my airplane, for instance (I shall not draw my airplane; that would be much too complicated for me), he asked me: \\n\"What is that object?\"\\n\"That is not an object. It flies. It is an airplane. 
It is my airplane.\"', 'metadata': {'source': './data/the_little_prince.txt'}}, vector=None, shard_key=None, order_value=None)]" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" + "execution_count": null, + "id": "5859782b", + "metadata": { + "vscode": { + "languageId": "plaintext" } - ], - "source": [ - "from qdrant_client import models\n", - "\n", - "# Define a filter query to match documents containing the text \"Chapter\" in the page content\n", - "filter_query = models.Filter(\n", - " must=[\n", - " models.FieldCondition(\n", - " key=\"page_content\",\n", - " match=models.MatchText(text=\"Chapter\"),\n", - " ),\n", - " ]\n", - ")\n", - "\n", - "# Retrieve records from the collection that match the filter query\n", - "db.scroll(\n", - " scroll_filter=filter_query,\n", - " k=10,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, + }, + "outputs": [], "source": [ - "### Delete with Filtering\n", + "# Search by Query\n", "\n", - "The `QdrantDocumentManager` class allows you to **delete records** using **filters** based on specific **metadata values**. This is achieved with the `delete` method and a **filter query**." 
+ "# results = crud_manager.search(query=\"What is essential is invisible to the eye.\", k=3)\n", + "# for idx, doc in enumerate(results):\n", + "# print(f\"Rank {idx} | Title : {doc.metadata['title']}\")\n", + "# print(f\"Contents : {doc.page_content}\")\n", + "# print()" ] }, { "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "UpdateResult(operation_id=31, status=)" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" + "execution_count": null, + "id": "2577dd4a", + "metadata": { + "vscode": { + "languageId": "plaintext" } - ], + }, + "outputs": [], "source": [ - "from qdrant_client.http.models import Filter, FieldCondition, MatchText\n", - "\n", - "# Define a filter query to match documents containing the text \"Chapter\" in the page content\n", - "filter_query = models.Filter(\n", - " must=[\n", - " models.FieldCondition(\n", - " key=\"page_content\",\n", - " match=models.MatchText(text=\"Chapter\"),\n", - " ),\n", - " ]\n", - ")\n", - "\n", - "# Delete records from the collection that match the filter query\n", - "db.client.delete(collection_name=db.collection_name, points_selector=filter_query)" + "# Filter Search\n", + "\n", + "# results = crud_manager.search(query=\"Which asteroid did the little prince come from?\", k=3, filters={\"title\":\"Chapter 4\"})\n", + "# for idx, doc in enumerate(results):\n", + "# print(f\"Rank {idx} | Title : {doc.metadata['title']}\")\n", + "# print(f\"Contents : {doc.page_content}\")\n", + "# print()" ] }, { "cell_type": "markdown", + "id": "f140c0e2", "metadata": {}, "source": [ - "### Filtering and Updating Records\n", + "### As Retriever\n", + "\n", + "The ```as_retriever()``` method creates a LangChain-compatible retriever wrapper.\n", + "\n", + "This function allows a ```DocumentManager``` class to return a retriever object by wrapping the internal ```search()``` method, while staying lightweight and independent from full LangChain 
```VectorStore``` dependencies.\n", + "\n", + "The retriever obtained through this function can be used in the same way as any existing LangChain retriever and is **compatible with LangChain pipelines (e.g. RetrievalQA, ConversationalRetrievalChain, Tool, ...)**.\n", + "\n", + "**✅ Args**\n", "\n", - "The `QdrantDocumentManager` class supports **filtering and updating records** based on specific **metadata values**. This is done by **retrieving records** with **filters** and **updating** them as needed.\n" + "- ```search_fn``` : Callable - The function used to retrieve relevant documents. Typically this is ```self.search``` from a ```DocumentManager``` instance.\n", + "\n", + "- ```search_kwargs``` : Optional[Dict] - A dictionary of keyword arguments passed to ```search_fn```, such as ```k``` for top-K results or metadata filters.\n", + "\n", + "**🔄 Return**\n", + "\n", + "- ```LightCustomRetriever``` : BaseRetriever - A lightweight LangChain-compatible retriever that internally uses the given ```search_fn``` and ```search_kwargs```."
] }, { "cell_type": "code", - "execution_count": 22, - "metadata": {}, + "execution_count": null, + "id": "86de7842", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, "outputs": [], "source": [ - "from qdrant_client import models\n", - "\n", - "# Define a filter query to match documents with a specific metadata source\n", - "filter_query = models.Filter(\n", - " must=[\n", - " models.FieldCondition(\n", - " key=\"metadata.source\",\n", - " match=models.MatchValue(value=\"./data/the_little_prince.txt\"),\n", - " ),\n", - " ]\n", - ")\n", - "\n", - "# Retrieve records matching the filter query, including their vectors\n", - "response = db.scroll(scroll_filter=filter_query, k=10, with_vectors=True)\n", - "new_source = \"the_little_prince.txt\"\n", - "\n", - "# Update the point IDs and set new metadata for the records\n", - "for point in response: # response[0] returns a list of points\n", - " payload = point.payload\n", - "\n", - " # Check if metadata exists in the payload\n", - " if \"metadata\" in payload:\n", - " payload[\"metadata\"][\"source\"] = new_source\n", - " else:\n", - " payload[\"metadata\"] = {\n", - " \"source\": new_source\n", - " } # Add new metadata if it doesn't exist\n", - "\n", - " # Update the point with new metadata\n", - " db.client.upsert(\n", - " collection_name=db.collection_name,\n", - " points=[\n", - " models.PointStruct(\n", - " id=point.id,\n", - " payload=payload,\n", - " vector=point.vector,\n", - " )\n", - " ],\n", - " )" + "# ret = crud_manager.as_retriever(\n", + "# search_fn=crud_manager.search, search_kwargs={\"k\": 1},  # e.g. 
{\"k\": 1, \"where\": {\"title\": \"\"}}\n", + "# )" ] }, { - "cell_type": "markdown", - "metadata": {}, + "cell_type": "code", + "execution_count": null, + "id": "7142d29c", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], "source": [ - "### Similarity Search Options\n", - "\n", - "When using `QdrantVectorStore`, you have three options for performing **similarity searches**. You can select the desired search mode using the `retrieval_mode` parameter when you set up the class. The available modes are:\n", - "\n", - "- **Dense Vector Search** (Default)\n", - "- **Sparse Vector Search**\n", - "- **Hybrid Search**" + "# ret.invoke(\"Which asteroid did the little prince come from?\")" ] }, { "cell_type": "markdown", + "id": "9ad0ed0c", "metadata": {}, "source": [ - "### Dense Vector Search\n", + "### Delete Document\n", "\n", - "To perform a search using only **dense vectors**:\n", + "Remove documents based on filter conditions\n", "\n", - "- The `retrieval_mode` parameter must be set to `RetrievalMode.DENSE`. This is also the **default setting**.\n", - "- You need to provide a [dense embeddings](https://python.langchain.com/docs/integrations/text_embedding/) value through the `embedding` parameter.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "* for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", - " [{'source': './data/the_little_prince.txt', '_id': '3cc041d5-2700-498f-8114-85f3c96e26b9', '_collection_name': 'dense_collection'}]\n", - "\n", - "\n", - "* for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. 
In the book, the little pri\n", - " [{'source': './data/the_little_prince.txt', '_id': '24d766ea-3383-40e5-bd0e-051d51de88a3', '_collection_name': 'dense_collection'}]\n", - "\n", - "\n", - "* Indeed, as I learned, there were on the planet where the little prince lived-- as on all planets-- good plants and bad plants. In consequence, there were good seeds from good plants, and bad seeds fro\n", - " [{'source': './data/the_little_prince.txt', '_id': 'd25ba992-e54d-4e8a-9572-438c78d0288b', '_collection_name': 'dense_collection'}]\n", - "\n", - "\n" - ] - } - ], - "source": [ - "from langchain_qdrant import RetrievalMode\n", - "from langchain_openai import OpenAIEmbeddings\n", + "**✅ Args**\n", "\n", - "query = \"What is the significance of the rose in The Little Prince?\"\n", - "\n", - "# Initialize the embedding model with a specific OpenAI model\n", - "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", - "\n", - "# Initialize QdrantVectorStore with documents, embeddings, and configuration\n", - "vector_store = QdrantVectorStore.from_documents(\n", - " documents=split_docs[:50],\n", - " embedding=embedding,\n", - " url=QDRANT_URL,\n", - " api_key=QDRANT_API_KEY,\n", - " collection_name=\"dense_collection\",\n", - " retrieval_mode=RetrievalMode.DENSE,\n", - " batch_size=10,\n", - ")\n", - "\n", - "# Perform similarity search in the vector store\n", - "results = vector_store.similarity_search(\n", - " query=query,\n", - " k=3,\n", - ")\n", - "\n", - "for res in results:\n", - " print(f\"* {res.page_content[:200]}\\n [{res.metadata}]\\n\\n\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Sparse Vector Search\n", + "- ```ids``` : Optional[List[str]] – List of document IDs to delete. 
If None, deletion is based on filter.\n", "\n", - "To search with only **sparse vectors**:\n", + "- ```filters``` : Optional[Dict] – Dictionary specifying filter conditions (e.g., metadata match).\n", "\n", - "- The `retrieval_mode` parameter should be set to `RetrievalMode.SPARSE`.\n", - "- An implementation of the [SparseEmbeddings](https://github.com/langchain-ai/langchain/blob/master/libs/partners/qdrant/langchain_qdrant/sparse_embeddings.py) interface using any **sparse embeddings provider** has to be provided as a value to the `sparse_embedding` parameter.\n", - "- The `langchain-qdrant` package provides a **FastEmbed** based implementation out of the box.\n", + "- ```**kwargs``` : Any additional parameters.\n", "\n", - "To use it, install the [FastEmbed](https://github.com/qdrant/fastembed) package:\n", + "**🔄 Return**\n", "\n", - "```bash\n", - "pip install fastembed\n", - "```" + "- None" ] }, { "cell_type": "code", - "execution_count": 24, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "* [ Chapter 20 ]\n", - "- the little prince discovers a garden of roses\n", - "But it happened that after walking for a long time through sand, and rocks, and snow, the little prince at last came upon a road. And all\n", - " [{'source': './data/the_little_prince.txt', '_id': '30d70339-4233-427b-b839-208c7618ae82', '_collection_name': 'sparse_collection'}]\n", - "\n", - "\n", - "* [ Chapter 20 ]\n", - "- the little prince discovers a garden of roses\n", - "But it happened that after walking for a long time through sand, and rocks, and snow, the little prince at last came upon a road. And all\n", - " [{'source': './data/the_little_prince.txt', '_id': '45ad1b0e-45cd-46f0-b6cd-d8e2b19ea8fa', '_collection_name': 'sparse_collection'}]\n", - "\n", - "\n", - "* And he went back to meet the fox. \n", - "\"Goodbye,\" he said. \n", - "\"Goodbye,\" said the fox. 
\"And now here is my secret, a very simple secret: It is only with the heart that one can see rightly; what is essential\n", - " [{'source': './data/the_little_prince.txt', '_id': 'ab098119-c45f-4e33-b105-a6c6e01a918b', '_collection_name': 'sparse_collection'}]\n", - "\n", - "\n" - ] + "execution_count": null, + "id": "0e3a2c33", + "metadata": { + "vscode": { + "languageId": "plaintext" } - ], + }, + "outputs": [], "source": [ - "from langchain_qdrant import FastEmbedSparse, RetrievalMode\n", - "from langchain_qdrant import RetrievalMode\n", - "from langchain_openai import OpenAIEmbeddings\n", + "# Delete by ids\n", "\n", - "query = \"What is the significance of the rose in The Little Prince?\"\n", - "\n", - "# Initialize the embedding model with a specific OpenAI model\n", - "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", - "# Initialize sparse embeddings using FastEmbedSparse\n", - "sparse_embeddings = FastEmbedSparse(model_name=\"Qdrant/bm25\")\n", - "\n", - "# Initialize QdrantVectorStore with documents, embeddings, and configuration\n", - "vector_store = QdrantVectorStore.from_documents(\n", - " documents=split_docs,\n", - " embedding=embedding,\n", - " sparse_embedding=sparse_embeddings,\n", - " url=QDRANT_URL,\n", - " api_key=QDRANT_API_KEY,\n", - " collection_name=\"sparse_collection\",\n", - " retrieval_mode=RetrievalMode.SPARSE,\n", - " batch_size=10,\n", - ")\n", - "\n", - "# Perform similarity search in the vector store\n", - "results = vector_store.similarity_search(\n", - " query=query,\n", - " k=3,\n", - ")\n", - "\n", - "for res in results:\n", - " print(f\"* {res.page_content[:200]}\\n [{res.metadata}]\\n\\n\")" + "# ids = [] # The 'ids' value you want to delete\n", + "# crud_manager.delete(ids=ids)" ] }, { - "cell_type": "markdown", - "metadata": {}, + "cell_type": "code", + "execution_count": null, + "id": "60bcb4cf", + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], "source": [ - "### Hybrid 
Vector Search\n", - "\n", - "To perform a **hybrid search** using **dense** and **sparse vectors** with **score fusion**:\n", + "# Delete by ids with filters\n", "\n", - "- The `retrieval_mode` parameter should be set to `RetrievalMode.HYBRID`.\n", - "- A [`dense embeddings`](https://python.langchain.com/docs/integrations/text_embedding/) value should be provided to the `embedding` parameter.\n", - "- An implementation of the [`SparseEmbeddings`](https://github.com/langchain-ai/langchain/blob/master/libs/partners/qdrant/langchain_qdrant/sparse_embeddings.py) interface using any **sparse embeddings provider** has to be provided as a value to the `sparse_embedding` parameter.\n", - "\n", - "**Note**: If you've added documents with the `HYBRID` mode, you can switch to any **retrieval mode** when searching, since both the **dense** and **sparse vectors** are available in the **collection**." + "# ids = [] # The `ids` value corresponding to chapter 6\n", + "# crud_manager.delete(ids=ids,filters={\"title\":\"chapter 6\"}) " ] }, { "cell_type": "code", - "execution_count": 25, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "* \"Go and look again at the roses. You will understand now that yours is unique in all the world. Then come back to say goodbye to me, and I will make you a present of a secret.\" \n", - "The little prince went\n", - " [{'source': './data/the_little_prince.txt', '_id': '447a916c-d8a9-46f2-b035-d0ac4c7ea901', '_collection_name': 'hybrid_collection'}]\n", - "\n", - "\n", - "* [ Chapter 20 ]\n", - "- the little prince discovers a garden of roses\n", - "But it happened that after walking for a long time through sand, and rocks, and snow, the little prince at last came upon a road. 
And all\n", - " [{'source': './data/the_little_prince.txt', '_id': '894a9222-ef0c-4e28-b736-8a334cbdc83b', '_collection_name': 'hybrid_collection'}]\n", - "\n", - "\n", - "* [ Chapter 8 ]\n", - "- the rose arrives at the little prince‘s planet\n", - " [{'source': './data/the_little_prince.txt', '_id': 'a3729fa0-b734-4316-ad18-83ea16263a2f', '_collection_name': 'hybrid_collection'}]\n", - "\n", - "\n" - ] + "execution_count": null, + "id": "30d42d2e", + "metadata": { + "vscode": { + "languageId": "plaintext" } - ], + }, + "outputs": [], "source": [ - "from langchain_qdrant import FastEmbedSparse, RetrievalMode\n", - "from langchain_qdrant import RetrievalMode\n", - "from langchain_openai import OpenAIEmbeddings\n", + "# Delete All\n", "\n", - "query = \"What is the significance of the rose in The Little Prince?\"\n", - "\n", - "# Initialize the embedding model with a specific OpenAI model\n", - "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", - "# Initialize sparse embeddings using FastEmbedSparse\n", - "sparse_embeddings = FastEmbedSparse(model_name=\"Qdrant/bm25\")\n", - "\n", - "# Initialize QdrantVectorStore with documents, embeddings, and configuration\n", - "vector_store = QdrantVectorStore.from_documents(\n", - " documents=split_docs,\n", - " embedding=embedding,\n", - " sparse_embedding=sparse_embeddings,\n", - " url=QDRANT_URL,\n", - " api_key=QDRANT_API_KEY,\n", - " collection_name=\"hybrid_collection\",\n", - " retrieval_mode=RetrievalMode.HYBRID,\n", - " batch_size=10,\n", - ")\n", - "\n", - "# Perform similarity search in the vector store\n", - "results = vector_store.similarity_search(\n", - " query=query,\n", - " k=3,\n", - ")\n", - "\n", - "for res in results:\n", - " print(f\"* {res.page_content[:200]}\\n [{res.metadata}]\\n\\n\")" + "# crud_manager.delete()" ] } ], "metadata": { - "kernelspec": { - "display_name": "langchain-opentutorial-6aJyhYW2-py3.11", - "language": "python", - "name": "python3" - }, "language_info": { - 
"codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.11" + "name": "python" } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 5 } diff --git a/09-VectorStore/utils/qdrant/legacy/(legacy)05-Qdrant.ipynb b/09-VectorStore/utils/qdrant/legacy/(legacy)05-Qdrant.ipynb new file mode 100644 index 000000000..a196a7a17 --- /dev/null +++ b/09-VectorStore/utils/qdrant/legacy/(legacy)05-Qdrant.ipynb @@ -0,0 +1,1285 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Qdrant\n", + "\n", + "- Author: [HyeonJong Moon](https://github.com/hj0302)\n", + "- Design: \n", + "- Peer Review: \n", + "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n", + "\n", + "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [![Open in LangChain Academy](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/66e9eba12c7b7688aa3dbb5e_LCA-badge-green.svg)](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)\n", + "\n", + "\n", + "## Overview\n", + "\n", + "This notebook demonstrates how to utilize the features related to the `Qdrant` vector database.\n", + "\n", + "[`Qdrant`](https://python.langchain.com/docs/integrations/vectorstores/qdrant/) is an open-source vector similarity search engine designed to store, search, and manage high-dimensional vectors with additional payloads. 
It offers a production-ready service with a user-friendly API, suitable for applications such as semantic search, recommendation systems, and more.\n", + "\n", + "**Qdrant's architecture** is optimized for efficient vector similarity searches, employing advanced indexing techniques like **Hierarchical Navigable Small World (HNSW)** graphs to enable fast and scalable retrieval of relevant data.\n", + "\n", + "\n", + "### Table of Contents\n", + "\n", + "- [Overview](#overview)\n", + "- [Environment Setup](#environment-setup)\n", + "- [Credentials](#credentials)\n", + "- [Installation](#installation)\n", + "- [Initialization](#initialization)\n", + "- [Manage Vector Store](#manage-vector-store)\n", + " - [Create a Collection](#create-a-collection)\n", + " - [List Collections](#list-collections)\n", + " - [Delete a Collection](#delete-a-collection)\n", + " - [Add Items to the Vector Store](#add-items-to-the-vector-store)\n", + " - [Delete Items from the Vector Store](#delete-items-from-the-vector-store)\n", + " - [Upsert Items to Vector Store (Parallel)](#upsert-items-to-vector-store-parallel)\n", + "- [Query Vector Store](#query-vector-store)\n", + " - [Query Directly](#query-directly)\n", + " - [Similarity Search with Score](#similarity-search-with-score)\n", + " - [Query by Turning into Retriever](#query-by-turning-into-retriever)\n", + " - [Search with Filtering](#search-with-filtering)\n", + " - [Delete with Filtering](#delete-with-filtering)\n", + " - [Filtering and Updating Records](#filtering-and-updating-records)\n", + "\n", + "### References\n", + "\n", + "- [LangChain Qdrant Reference](https://python.langchain.com/docs/integrations/vectorstores/qdrant/)\n", + "- [Qdrant Official Reference](https://qdrant.tech/documentation/frameworks/langchain/)\n", + "- [Qdrant Install Reference](https://qdrant.tech/documentation/guides/installation/)\n", + "- [Qdrant Cloud Reference](https://cloud.qdrant.io)\n", + "- [Qdrant Cloud Quickstart 
Reference](https://qdrant.tech/documentation/quickstart-cloud/)\n", + "----" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Environment Setup\n", + "\n", + "Set up the environment. You may refer to Environment Setup for more details.\n", + "\n", + "[Note]\n", + "- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.\n", + "- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture --no-stderr\n", + "%pip install langchain-opentutorial" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.0.1\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" + ] + } + ], + "source": [ + "# Install required packages\n", + "from langchain_opentutorial import package\n", + "\n", + "package.install(\n", + " [\n", + " \"langsmith\",\n", + " \"langchain_openai\",\n", + " \"langchain_qdrant\",\n", + " \"qdrant_client\",\n", + " \"langchain_core\",\n", + " \"fastembed\",\n", + " ],\n", + " verbose=False,\n", + " upgrade=False,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Environment variables have been set successfully.\n" + ] + } + ], + "source": [ + "# Set environment variables\n", + "from langchain_opentutorial import 
set_env\n", + "\n", + "set_env(\n", + " {\n", + " \"OPENAI_API_KEY\": \"\",\n", + " \"QDRANT_API_KEY\": \"\",\n", + " \"QDRANT_URL\": \"\",\n", + " \"LANGCHAIN_API_KEY\": \"\",\n", + " \"LANGCHAIN_TRACING_V2\": \"true\",\n", + " \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n", + " \"LANGCHAIN_PROJECT\": \"Qdrant\",\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.\n", + "\n", + "**[Note]** If you are using a `.env` file, proceed as follows." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from dotenv import load_dotenv\n", + "\n", + "load_dotenv(override=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## **Credentials**\n", + "\n", + "Create a new account or sign in to your existing one, and generate an API key for use in this notebook.\n", + "\n", + "1. **Log in to Qdrant Cloud** : Go to the [Qdrant Cloud](https://cloud.qdrant.io) website and log in using your email, Google account, or GitHub account.\n", + "\n", + "2. **Create a Cluster** : After logging in, navigate to the **\"Clusters\"** section and click the **\"Create\"** button. Choose your desired configurations and region, then click **\"Create\"** to start building your cluster. Once the cluster is created, an API key will be generated for you.\n", + "\n", + "3. **Retrieve and Store Your API Key** : When your cluster is created, you will receive an API key. Ensure you save this key in a secure location, as you will need it later. If you lose it, you will have to generate a new one.\n", + "\n", + "4. 
**Manage API Keys** : To create additional API keys or manage existing ones, go to the **\"Access Management\"** section in the Qdrant Cloud dashboard and select *\"Qdrant Cloud API Keys\"*. Here, you can create new keys or delete existing ones.\n", + "\n", + "```\n", + "QDRANT_API_KEY=\"YOUR_QDRANT_API_KEY\"\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## **Installation**\n", + "\n", + "There are several main options for initializing and using the **Qdrant** vector store:\n", + "\n", + "- **Local Mode** : This mode doesn't require a separate server.\n", + " - **In-memory storage** (data is not persisted)\n", + " - **On-disk storage** (data is saved to your local machine)\n", + "- **Docker Deployments** : You can run **Qdrant** using **Docker**.\n", + "- **Qdrant Cloud** : Use **Qdrant** as a managed cloud service.\n", + "\n", + "For detailed instructions, see the [installation instructions](https://qdrant.tech/documentation/guides/installation/)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### In-Memory\n", + "\n", + "For simple tests or quick experiments, you might choose to store data directly in memory. This means the data is automatically removed when your client terminates, typically at the end of your script or notebook session." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collection 'demo_collection' does not exist or force recreate is enabled. 
Creating new collection...\n", + "Collection 'demo_collection' created successfully with configuration: {'vectors_config': VectorParams(size=3072, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None)}\n" + ] + } + ], + "source": [ + "from utils.qdrant import QdrantDocumentManager\n", + "from langchain_openai import OpenAIEmbeddings\n", + "\n", + "# Define the collection name for storing documents\n", + "collection_name = \"demo_collection\"\n", + "\n", + "# Initialize the embedding model with a specific OpenAI model\n", + "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "\n", + "# Create an instance of QdrantDocumentManager with in-memory storage\n", + "db = QdrantDocumentManager(\n", + " location=\":memory:\", # Use in-memory database for temporary storage\n", + " collection_name=collection_name,\n", + " embedding=embedding,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### On-Disk Storage\n", + "\n", + "With **on-disk storage**, you can store your vectors directly on your hard drive without requiring a **Qdrant server**. This ensures that your data persists even when you restart the program." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collection 'demo_collection' does not exist or force recreate is enabled. 
Creating new collection...\n", + "Collection 'demo_collection' created successfully with configuration: {'vectors_config': VectorParams(size=3072, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None)}\n" + ] + } + ], + "source": [ + "from utils.qdrant import QdrantDocumentManager\n", + "from langchain_openai import OpenAIEmbeddings\n", + "\n", + "# Define the path for Qdrant storage\n", + "qdrant_path = \"./qdrant_memory\"\n", + "\n", + "# Define the collection name for storing documents\n", + "collection_name = \"demo_collection\"\n", + "\n", + "# Initialize the embedding model with a specific OpenAI model\n", + "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "\n", + "# Create an instance of QdrantDocumentManager with specified storage path\n", + "db = QdrantDocumentManager(\n", + " path=qdrant_path, # Specify the path for Qdrant storage\n", + " collection_name=collection_name,\n", + " embedding=embedding,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Docker Deployments\n", + "\n", + "You can deploy `Qdrant` in a **production environment** using [`Docker`](https://qdrant.tech/documentation/guides/installation/#docker) and [`Docker Compose`](https://qdrant.tech/documentation/guides/installation/#docker-compose). Refer to the `Docker` and `Docker Compose` setup instructions in the development section for detailed information."
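The Docker route described above can be started with a single command; a minimal sketch, using the `qdrant/qdrant` image and default ports (6333 REST, 6334 gRPC) from the Qdrant installation guide — the `qdrant_storage` host directory is an arbitrary choice:

```shell
# Run Qdrant locally, exposing the REST (6333) and gRPC (6334) ports,
# and persisting collection data to a local directory on the host
docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 \
    -v "$(pwd)/qdrant_storage:/qdrant/storage" \
    qdrant/qdrant
```

Once the container is running, the `url = "http://localhost:6333"` value used in the next cell points at this instance.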
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "from utils.qdrant import QdrantDocumentManager\n", + "from langchain_openai import OpenAIEmbeddings\n", + "\n", + "# Define the URL for Qdrant server\n", + "url = \"http://localhost:6333\"\n", + "\n", + "# Define the collection name for storing documents\n", + "collection_name = \"demo_collection\"\n", + "\n", + "# Initialize the embedding model with a specific OpenAI model\n", + "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "\n", + "# Create an instance of QdrantDocumentManager connected to the Qdrant server\n", + "db = QdrantDocumentManager(\n", + " url=url, # Specify the URL of the Qdrant server\n", + " collection_name=collection_name,\n", + " embedding=embedding,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Qdrant Cloud\n", + "\n", + "For a **production environment**, you can use [**Qdrant Cloud**](https://cloud.qdrant.io/). It offers fully managed `Qdrant` databases with features such as **horizontal and vertical scaling**, **one-click setup and upgrades**, **monitoring**, **logging**, **backups**, and **disaster recovery**. 
For more information, refer to the [**Qdrant Cloud documentation**](https://qdrant.tech/documentation/cloud/).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "import getpass\n", + "import os\n", + "\n", + "# Fetch the Qdrant server URL from environment variables or prompt for input\n", + "if not os.getenv(\"QDRANT_URL\"):\n", + " os.environ[\"QDRANT_URL\"] = getpass.getpass(\"Enter your Qdrant Cloud URL: \")\n", + "QDRANT_URL = os.environ.get(\"QDRANT_URL\")\n", + "\n", + "# Fetch the Qdrant API key from environment variables or prompt for input\n", + "if not os.getenv(\"QDRANT_API_KEY\"):\n", + " os.environ[\"QDRANT_API_KEY\"] = getpass.getpass(\"Enter your Qdrant API key: \")\n", + "QDRANT_API_KEY = os.environ.get(\"QDRANT_API_KEY\")" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "from utils.qdrant import QdrantDocumentManager\n", + "from langchain_openai import OpenAIEmbeddings\n", + "\n", + "# Define the collection name for storing documents\n", + "collection_name = \"demo_collection\"\n", + "\n", + "# Initialize the embedding model with a specific OpenAI model\n", + "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "\n", + "# Create an instance of QdrantDocumentManager using Qdrant Cloud credentials\n", + "db = QdrantDocumentManager(\n", + " url=QDRANT_URL,\n", + " api_key=QDRANT_API_KEY,\n", + " collection_name=collection_name,\n", + " embedding=embedding,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialization\n", + "\n", + "Once you've established your **vector store**, you'll likely need to manage the **collections** within it. 
Here are some common operations you can perform:\n", + "\n", + "- **Create a collection**\n", + "- **List collections**\n", + "- **Delete a collection**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To proceed with the tutorial, we will use **Qdrant Cloud** for the next steps. This approach ensures that your data is securely stored in the cloud, allowing for seamless access, comprehensive testing, and experimentation across different environments." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create a Collection\n", + "\n", + "The `QdrantDocumentManager` class allows you to create a new **collection** in `Qdrant`. It can automatically create a collection if it doesn't exist or if you want to **recreate** it. You can specify configurations for **dense** and **sparse vectors** to meet different search needs. Use the `_ensure_collection_exists` method for **automatic creation** or call `create_collection` directly when needed." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collection 'test_collection' does not exist or force recreate is enabled. 
Creating new collection...\n", + "Collection 'test_collection' created successfully with configuration: {'vectors_config': VectorParams(size=3072, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None)}\n" + ] + } + ], + "source": [ + "from utils.qdrant import QdrantDocumentManager\n", + "from langchain_openai import OpenAIEmbeddings\n", + "from qdrant_client.http.models import Distance\n", + "\n", + "# Define the collection name for storing documents\n", + "collection_name = \"test_collection\"\n", + "\n", + "# Initialize the embedding model with a specific OpenAI model\n", + "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "\n", + "# Create an instance of QdrantDocumentManager using Qdrant Cloud credentials\n", + "db = QdrantDocumentManager(\n", + " url=QDRANT_URL,\n", + " api_key=QDRANT_API_KEY,\n", + " collection_name=collection_name,\n", + " embedding=embedding,\n", + " metric=Distance.COSINE,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### List Collections\n", + "\n", + "The `QdrantDocumentManager` class lets you list all **collections** in your `Qdrant` instance using the `get_collections` method. 
This retrieves and displays the **names** of all existing collections.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collection Name: test_collection\n", + "Collection Name: sparse_collection\n", + "Collection Name: dense_collection\n", + "Collection Name: insta_image_search_test\n", + "Collection Name: insta_image_search\n", + "Collection Name: demo_collection\n" + ] + } + ], + "source": [ + "# Retrieve the list of collections from the Qdrant client\n", + "collections = db.client.get_collections()\n", + "\n", + "# Iterate over each collection and print its details\n", + "for collection in collections.collections:\n", + " print(f\"Collection Name: {collection.name}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Delete a Collection\n", + "\n", + "The `QdrantDocumentManager` class allows you to delete a **collection** using the `delete_collection` method. This method removes the specified collection from your `Qdrant` instance." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collection 'test_collection' has been deleted.\n" + ] + } + ], + "source": [ + "# Define collection name\n", + "collection_name = \"test_collection\"\n", + "\n", + "# Delete the collection\n", + "if db.client.delete_collection(collection_name=collection_name):\n", + " print(f\"Collection '{collection_name}' has been deleted.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Manage Vector Store\n", + "\n", + "After you've created your **vector store**, you can interact with it by **adding** or **deleting** items. 
Here are some common operations:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Add Items to the Vector Store\n", + "\n", + "The `QdrantDocumentManager` class lets you add items to your **vector store** using the `upsert` method. This method **updates** existing documents with new data if their IDs already exist." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "from langchain.document_loaders import TextLoader\n", + "from uuid import uuid4\n", + "\n", + "# Load the text file\n", + "loader = TextLoader(\"./data/the_little_prince.txt\")\n", + "documents = loader.load()\n", + "\n", + "# Initialize the text splitter\n", + "text_splitter = RecursiveCharacterTextSplitter(\n", + " chunk_size=600, chunk_overlap=100, length_function=len\n", + ")\n", + "\n", + "split_docs = text_splitter.split_documents(documents)\n", + "\n", + "# Generate unique IDs for documents\n", + "uuids = [str(uuid4()) for _ in split_docs[:30]]\n", + "page_contents = [doc.page_content for doc in split_docs[:30]]\n", + "metadatas = [doc.metadata for doc in split_docs[:30]]" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['22417c4f-bf11-4e92-978a-6c436dec39ca',\n", + " '28f56a01-34af-46ae-aeb4-ea6e0fcacb62',\n", + " 'c6d06501-9595-4272-80b5-f0747cb145fc',\n", + " 'b4b901bf-6e83-4658-b5e9-a1d5a80c767d',\n", + " '21b1b98d-0707-4128-a0bd-78c94db6cbf3',\n", + " 'c49b5d7c-c330-4d59-9097-25c3c52510b9',\n", + " '36ddc677-4fa9-47ee-b2e0-284bdb9062a1',\n", + " '32fde659-84c6-4679-b4df-d4b1d11e645f',\n", + " 'caf0b611-4a38-4a94-84a9-c3a98ac0b2a1',\n", + " '0e655834-9a6c-48a8-8a3b-5d5e2b1d6c2c',\n", + " '493aaa5c-b89d-429b-a425-57f20f3564ed',\n", + " '6f7f0755-d226-4aec-a714-a53d7a705e51',\n", + " '8b68a39b-f990-4ce1-9fbd-675f5103d3ff',\n", + " 
'73ef217b-9114-48a4-a447-0deb916b3d5a',\n", + " '63b99932-4e84-4cb2-a5ef-1d83fdbc4e6a',\n", + " '45fd3628-ca2f-439d-97ba-cc34da564f36',\n", + " '876f59dd-a9ae-4af7-84e8-5d8fe78cf7d3',\n", + " '5aa82f42-534f-447f-94b5-9ed4f3571091',\n", + " 'eb69cc2a-8899-4d9e-ad8f-adebea281ff0',\n", + " '1defc340-16b4-4ee0-94de-0dabc23e5d07',\n", + " '368d5f90-75d2-406c-8dd2-c7d8736b6944',\n", + " '842812f6-ee9f-43ae-8f6d-53015a5e57af',\n", + " '61031399-09ed-4c88-bc93-1018b942df71',\n", + " 'a6ac25f2-2dd5-445f-95dd-6a4d9fc4081c',\n", + " '08215031-2393-4d0c-82a2-53a6a90d169f',\n", + " 'f41de48c-1e7d-4036-a75e-a10ac579081d',\n", + " 'a2d6b6d1-5bbc-4f17-9b95-c917021614f0',\n", + " '3603a2e7-6021-46c9-8f4c-d53056849c1a',\n", + " 'e1fb95a1-7c1c-4aed-a628-b39e0907b744',\n", + " '2a42fbb6-9450-4d86-a5f8-65f333c10d4c']" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from utils.qdrant import QdrantDocumentManager\n", + "from langchain_openai import OpenAIEmbeddings\n", + "\n", + "# Define the collection name for storing documents\n", + "collection_name = \"demo_collection\"\n", + "\n", + "# Initialize the embedding model with a specific OpenAI model\n", + "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "\n", + "# Create an instance of QdrantDocumentManager with specified storage path\n", + "db = QdrantDocumentManager(\n", + " url=QDRANT_URL,\n", + " api_key=QDRANT_API_KEY,\n", + " collection_name=collection_name,\n", + " embedding=embedding,\n", + ")\n", + "\n", + "db.upsert(texts=page_contents, metadatas=metadatas, ids=uuids)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Delete Items from the Vector Store\n", + "\n", + "The `QdrantDocumentManager` class allows you to delete items from your **vector store** using the `delete` method. 
You can specify items to delete by providing **IDs** or **filters**.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "delete_ids = [uuids[0]]\n", + "\n", + "db.delete(ids=delete_ids)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Upsert Items to Vector Store (Parallel)\n", + "\n", + "The `QdrantDocumentManager` class supports **parallel upserts** using the `upsert_parallel` method. This efficiently **adds** or **updates** multiple items with unique **IDs**, **data**, and **metadata**." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['286d99ae-019b-41ed-962a-c1a26bf41c4a',\n", + " 'e17ce584-3576-45bb-8d82-36cfdd4c89d1',\n", + " 'aed142fa-a13a-421f-9e60-ab1af13a8b15',\n", + " '14337336-edb2-4ea1-880c-2f4613f1f999',\n", + " '91d47b16-4a1f-4f1f-ba07-78f9b2db06d8',\n", + " '6b58d2d9-1a4b-4e03-97fd-d584d502b606',\n", + " 'e7b6f4b5-27e0-4787-a74c-b8d17a7038ea',\n", + " '01579e1a-9935-443d-a7a5-b9ffdd1e07f9',\n", + " '4d516f16-09cf-4b7e-8d65-455eced738e7',\n", + " '7fd284a3-5f10-407f-a8fe-44a923263748',\n", + " '55fae9b6-046a-4f09-9cf0-08568efde43c',\n", + " 'b4386ade-1590-41fa-94e7-cc34d4f4c9da',\n", + " 'd27d8f98-349a-4c45-9f82-31e983edfa8c',\n", + " '20537c5d-80d1-4d72-8507-73fd21e3f11a',\n", + " 'ae418ede-69f6-4703-9d9d-2e31d59441b2',\n", + " '975d663d-f825-446d-9824-7997058ca24a',\n", + " 'c8086e33-6345-4403-a98c-a4cd46375cd1',\n", + " 'ec887b4f-eecf-4325-8117-293e6fd8dfd6',\n", + " 'c5fa1381-e30d-47d8-aad3-d46cc8520953',\n", + " '1b20e891-e44f-4640-ab24-03d692627265',\n", + " '0d37a3dd-329f-4901-a828-71a704f7a35e',\n", + " '170420dc-b02c-42f3-a36d-c56973784fb7',\n", + " 'f11893c3-20c5-43e4-9c0f-905d91c7a668',\n", + " '37327ff1-7f17-43b0-89ca-65ab69c14df6',\n", + " '92a4e2ec-7418-4241-a1e3-3bf2668a9fd6',\n", + " 'ea018faa-293f-4329-b8ae-92dc3fcdd909',\n", + " 
'09c78d94-0b4c-41cc-b530-7504f3d62dc4',\n", + " '907ad8d0-427d-4f29-b801-aea90a6a86aa',\n", + " '86508b0c-4ff7-422f-b13e-1443e47ef5d3',\n", + " 'b12e4c37-50a1-4257-80ae-de372a4a77ce']" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Generate unique IDs for documents\n", + "uuids = [str(uuid4()) for _ in split_docs[30:60]]\n", + "page_contents = [doc.page_content for doc in split_docs[30:60]]\n", + "metadatas = [doc.metadata for doc in split_docs[30:60]]\n", + "\n", + "db.upsert_parallel(\n", + " texts=page_contents,\n", + " metadatas=metadatas,\n", + " ids=uuids,\n", + " batch_size=32,\n", + " workers=10,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Query Vector Store\n", + "\n", + "Once your **vector store** has been created and the relevant **documents** have been added, you will most likely wish to **query** it during the running of your `chain` or `agent`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Query Directly\n", + "\n", + "The `QdrantDocumentManager` class allows direct **querying** using the `search` method. It performs **similarity searches** by converting queries into **vector embeddings** to find similar **documents**.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "* for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", + " [{'source': './data/the_little_prince.txt'}]\n", + "\n", + "\n", + "* for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. 
In the book, the little pri\n", + " [{'source': './data/the_little_prince.txt'}]\n", + "\n", + "\n", + "* for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", + " [{'source': './data/the_little_prince.txt'}]\n", + "\n", + "\n" + ] + } + ], + "source": [ + "query = \"What is the significance of the rose in The Little Prince?\"\n", + "\n", + "response = db.search(\n", + " query=query,\n", + " k=3,\n", + ")\n", + "\n", + "for res in response:\n", + " payload = res[\"payload\"]\n", + " print(f\"* {payload['page_content'][:200]}\\n [{payload['metadata']}]\\n\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Similarity Search with Score\n", + "\n", + "The `QdrantDocumentManager` class enables **similarity searches** with **scores** using the `search` method. This provides a **relevance score** for each **document** found.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "* [SIM=0.527] for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", + " [{'source': './data/the_little_prince.txt'}]\n", + "\n", + "\n", + "* [SIM=0.527] for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", + " [{'source': './data/the_little_prince.txt'}]\n", + "\n", + "\n", + "* [SIM=0.527] for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. 
In the book, the little pri\n", + " [{'source': './data/the_little_prince.txt'}]\n", + "\n", + "\n" + ] + } + ], + "source": [ + "# Define the query to search in the database\n", + "query = \"What is the significance of the rose in The Little Prince?\"\n", + "\n", + "# Perform the search with the specified query and number of results\n", + "response = db.search(query=query, k=3)\n", + "\n", + "for res in response:\n", + " payload = res[\"payload\"]\n", + " score = res[\"score\"]\n", + " print(\n", + " f\"* [SIM={score:.3f}] {payload['page_content'][:200]}\\n [{payload['metadata']}]\\n\\n\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Query by Turning into Retriever\n", + "\n", + "The `QdrantDocumentManager` class can transform the **vector store** into a `retriever`. This allows for easier **integration** into **workflows** or **chains**.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "* for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", + " [{'source': './data/the_little_prince.txt', '_id': 'c49b5d7c-c330-4d59-9097-25c3c52510b9', '_collection_name': 'demo_collection'}]\n", + "\n", + "\n", + "* for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", + " [{'source': './data/the_little_prince.txt', '_id': '9567e6cf-2f89-4c3b-8a41-7167770fbcd3', '_collection_name': 'demo_collection'}]\n", + "\n", + "\n", + "* for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. 
In the book, the little pri\n", + " [{'source': './data/the_little_prince.txt', '_id': 'e2a0d06a-9ccd-4e9e-8d4a-4e1292b6ccef', '_collection_name': 'demo_collection'}]\n", + "\n", + "\n" + ] + } + ], + "source": [ + "from langchain_qdrant import QdrantVectorStore\n", + "\n", + "# Initialize QdrantVectorStore with the client, collection name, and embedding\n", + "vector_store = QdrantVectorStore(\n", + " client=db.client, collection_name=db.collection_name, embedding=db.embedding\n", + ")\n", + "\n", + "query = \"What is the significance of the rose in The Little Prince?\"\n", + "\n", + "# Transform the vector store into a retriever with specific search parameters\n", + "retriever = vector_store.as_retriever(\n", + " search_type=\"similarity_score_threshold\",\n", + " search_kwargs={\"k\": 3, \"score_threshold\": 0.3},\n", + ")\n", + "\n", + "results = retriever.invoke(query)\n", + "\n", + "for res in results:\n", + " print(f\"* {res.page_content[:200]}\\n [{res.metadata}]\\n\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Search with Filtering\n", + "\n", + "The `QdrantDocumentManager` class allows **searching with filters** to retrieve records based on specific **metadata values**. This is done using the `scroll` method with a defined **filter query**." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Record(id='09c78d94-0b4c-41cc-b530-7504f3d62dc4', payload={'page_content': '[ Chapter 7 ]\\n- the narrator learns about the secret of the little prince‘s life \\nOn the fifth day-- again, as always, it was thanks to the sheep-- the secret of the little prince‘s life was revealed to me. 
Abruptly, without anything to lead up to it, and as if the question had been born of long and silent meditation on his problem, he demanded: \\n\"A sheep-- if it eats little bushes, does it eat flowers, too?\"\\n\"A sheep,\" I answered, \"eats anything it finds in its reach.\"\\n\"Even flowers that have thorns?\"\\n\"Yes, even flowers that have thorns.\" \\n\"Then the thorns-- what use are they?\"', 'metadata': {'source': './data/the_little_prince.txt'}}, vector=None, shard_key=None, order_value=None),\n", + " Record(id='0e655834-9a6c-48a8-8a3b-5d5e2b1d6c2c', payload={'page_content': '[ Chapter 1 ]\\n- we are introduced to the narrator, a pilot, and his ideas about grown-ups\\nOnce when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. Here is a copy of the drawing. \\n(picture)\\nIn the book it said: \"Boa constrictors swallow their prey whole, without chewing it. 
After that they are not able to move, and they sleep through the six months that they need for digestion.\"', 'metadata': {'source': './data/the_little_prince.txt'}}, vector=None, shard_key=None, order_value=None),\n", + " Record(id='286d99ae-019b-41ed-962a-c1a26bf41c4a', payload={'page_content': '[ Chapter 4 ]\\n- the narrator speculates as to which asteroid from which the little prince came\\u3000\\u3000\\nI had thus learned a second fact of great importance: this was that the planet the little prince came from was scarcely any larger than a house!', 'metadata': {'source': './data/the_little_prince.txt'}}, vector=None, shard_key=None, order_value=None),\n", + " Record(id='45fd3628-ca2f-439d-97ba-cc34da564f36', payload={'page_content': '[ Chapter 2 ]\\n- the narrator crashes in the desert and makes the acquaintance of the little prince\\nSo I lived my life alone, without anyone that I could really talk to, until I had an accident with my plane in the Desert of Sahara, six years ago. Something was broken in my engine. And as I had with me neither a mechanic nor any passengers, I set myself to attempt the difficult repairs all alone. It was a question of life or death for me: I had scarcely enough drinking water to last a week.', 'metadata': {'source': './data/the_little_prince.txt'}}, vector=None, shard_key=None, order_value=None),\n", + " Record(id='d27d8f98-349a-4c45-9f82-31e983edfa8c', payload={'page_content': '[ Chapter 5 ]\\n- we are warned as to the dangers of the baobabs\\nAs each day passed I would learn, in our talk, something about the little prince‘s planet, his departure from it, his journey. The information would come very slowly, as it might chance to fall from his thoughts. It was in this way that I heard, on the third day, about the catastrophe of the baobabs.\\nThis time, once more, I had the sheep to thank for it. 
For the little prince asked me abruptly-- as if seized by a grave doubt-- \"It is true, isn‘t it, that sheep eat little bushes?\" \\n\"Yes, that is true.\" \\n\"Ah! I am glad!\"', 'metadata': {'source': './data/the_little_prince.txt'}}, vector=None, shard_key=None, order_value=None),\n", + " Record(id='f11893c3-20c5-43e4-9c0f-905d91c7a668', payload={'page_content': '[ Chapter 6 ]\\n- the little prince and the narrator talk about sunsets\\nOh, little prince! Bit by bit I came to understand the secrets of your sad little life... For a long time you had found your only entertainment in the quiet pleasure of looking at the sunset. I learned that new detail on the morning of the fourth day, when you said to me: \\n\"I am very fond of sunsets. Come, let us go look at a sunset now.\" \\n\"But we must wait,\" I said. \\n\"Wait? For what?\" \\n\"For the sunset. We must wait until it is time.\"', 'metadata': {'source': './data/the_little_prince.txt'}}, vector=None, shard_key=None, order_value=None),\n", + " Record(id='f41de48c-1e7d-4036-a75e-a10ac579081d', payload={'page_content': '[ Chapter 3 ]\\n- the narrator learns more about from where the little prince came\\nIt took me a long time to learn where he came from. The little prince, who asked me so many questions, never seemed to hear the ones I asked him. It was from words dropped by chance that, little by little, everything was revealed to me. \\nThe first time he saw my airplane, for instance (I shall not draw my airplane; that would be much too complicated for me), he asked me: \\n\"What is that object?\"\\n\"That is not an object. It flies. It is an airplane. 
It is my airplane.\"', 'metadata': {'source': './data/the_little_prince.txt'}}, vector=None, shard_key=None, order_value=None)]" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from qdrant_client import models\n", + "\n", + "# Define a filter query to match documents containing the text \"Chapter\" in the page content\n", + "filter_query = models.Filter(\n", + " must=[\n", + " models.FieldCondition(\n", + " key=\"page_content\",\n", + " match=models.MatchText(text=\"Chapter\"),\n", + " ),\n", + " ]\n", + ")\n", + "\n", + "# Retrieve records from the collection that match the filter query\n", + "db.scroll(\n", + " scroll_filter=filter_query,\n", + " k=10,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Delete with Filtering\n", + "\n", + "The `QdrantDocumentManager` class allows you to **delete records** using **filters** based on specific **metadata values**. This is achieved with the `delete` method and a **filter query**." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "UpdateResult(operation_id=31, status=<UpdateStatus.COMPLETED: 'completed'>)" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from qdrant_client import models\n", + "\n", + "# Define a filter query to match documents containing the text \"Chapter\" in the page content\n", + "filter_query = models.Filter(\n", + " must=[\n", + " models.FieldCondition(\n", + " key=\"page_content\",\n", + " match=models.MatchText(text=\"Chapter\"),\n", + " ),\n", + " ]\n", + ")\n", + "\n", + "# Delete records from the collection that match the filter query\n", + "db.client.delete(collection_name=db.collection_name, points_selector=filter_query)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Filtering and Updating Records\n", + "\n", + "The `QdrantDocumentManager` class supports **filtering and updating records** based on specific **metadata values**. 
This is done by **retrieving records** with **filters** and **updating** them as needed.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "from qdrant_client import models\n", + "\n", + "# Define a filter query to match documents with a specific metadata source\n", + "filter_query = models.Filter(\n", + " must=[\n", + " models.FieldCondition(\n", + " key=\"metadata.source\",\n", + " match=models.MatchValue(value=\"./data/the_little_prince.txt\"),\n", + " ),\n", + " ]\n", + ")\n", + "\n", + "# Retrieve records matching the filter query, including their vectors\n", + "response = db.scroll(scroll_filter=filter_query, k=10, with_vectors=True)\n", + "new_source = \"the_little_prince.txt\"\n", + "\n", + "# Update the payload metadata of each matched point with the new source value\n", + "for point in response: # db.scroll returns a list of points\n", + " payload = point.payload\n", + "\n", + " # Check if metadata exists in the payload\n", + " if \"metadata\" in payload:\n", + " payload[\"metadata\"][\"source\"] = new_source\n", + " else:\n", + " payload[\"metadata\"] = {\n", + " \"source\": new_source\n", + " } # Add new metadata if it doesn't exist\n", + "\n", + " # Update the point with new metadata\n", + " db.client.upsert(\n", + " collection_name=db.collection_name,\n", + " points=[\n", + " models.PointStruct(\n", + " id=point.id,\n", + " payload=payload,\n", + " vector=point.vector,\n", + " )\n", + " ],\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Similarity Search Options\n", + "\n", + "When using `QdrantVectorStore`, you have three options for performing **similarity searches**. You can select the desired search mode using the `retrieval_mode` parameter when you set up the class. 
The available modes are:\n", + "\n", + "- **Dense Vector Search** (Default)\n", + "- **Sparse Vector Search**\n", + "- **Hybrid Search**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dense Vector Search\n", + "\n", + "To perform a search using only **dense vectors**:\n", + "\n", + "- The `retrieval_mode` parameter must be set to `RetrievalMode.DENSE`. This is also the **default setting**.\n", + "- You need to provide a [dense embeddings](https://python.langchain.com/docs/integrations/text_embedding/) value through the `embedding` parameter.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "* for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", + " [{'source': './data/the_little_prince.txt', '_id': '3cc041d5-2700-498f-8114-85f3c96e26b9', '_collection_name': 'dense_collection'}]\n", + "\n", + "\n", + "* for decades. In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little pri\n", + " [{'source': './data/the_little_prince.txt', '_id': '24d766ea-3383-40e5-bd0e-051d51de88a3', '_collection_name': 'dense_collection'}]\n", + "\n", + "\n", + "* Indeed, as I learned, there were on the planet where the little prince lived-- as on all planets-- good plants and bad plants. 
In consequence, there were good seeds from good plants, and bad seeds fro\n", + " [{'source': './data/the_little_prince.txt', '_id': 'd25ba992-e54d-4e8a-9572-438c78d0288b', '_collection_name': 'dense_collection'}]\n", + "\n", + "\n" + ] + } + ], + "source": [ + "from langchain_qdrant import RetrievalMode\n", + "from langchain_openai import OpenAIEmbeddings\n", + "\n", + "query = \"What is the significance of the rose in The Little Prince?\"\n", + "\n", + "# Initialize the embedding model with a specific OpenAI model\n", + "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "\n", + "# Initialize QdrantVectorStore with documents, embeddings, and configuration\n", + "vector_store = QdrantVectorStore.from_documents(\n", + " documents=split_docs[:50],\n", + " embedding=embedding,\n", + " url=QDRANT_URL,\n", + " api_key=QDRANT_API_KEY,\n", + " collection_name=\"dense_collection\",\n", + " retrieval_mode=RetrievalMode.DENSE,\n", + " batch_size=10,\n", + ")\n", + "\n", + "# Perform similarity search in the vector store\n", + "results = vector_store.similarity_search(\n", + " query=query,\n", + " k=3,\n", + ")\n", + "\n", + "for res in results:\n", + " print(f\"* {res.page_content[:200]}\\n [{res.metadata}]\\n\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Sparse Vector Search\n", + "\n", + "To search with only **sparse vectors**:\n", + "\n", + "- The `retrieval_mode` parameter should be set to `RetrievalMode.SPARSE`.\n", + "- An implementation of the [SparseEmbeddings](https://github.com/langchain-ai/langchain/blob/master/libs/partners/qdrant/langchain_qdrant/sparse_embeddings.py) interface using any **sparse embeddings provider** has to be provided as a value to the `sparse_embedding` parameter.\n", + "- The `langchain-qdrant` package provides a **FastEmbed** based implementation out of the box.\n", + "\n", + "To use it, install the [FastEmbed](https://github.com/qdrant/fastembed) package:\n", + "\n", + "```bash\n", 
+ "pip install fastembed\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "* [ Chapter 20 ]\n", + "- the little prince discovers a garden of roses\n", + "But it happened that after walking for a long time through sand, and rocks, and snow, the little prince at last came upon a road. And all\n", + " [{'source': './data/the_little_prince.txt', '_id': '30d70339-4233-427b-b839-208c7618ae82', '_collection_name': 'sparse_collection'}]\n", + "\n", + "\n", + "* [ Chapter 20 ]\n", + "- the little prince discovers a garden of roses\n", + "But it happened that after walking for a long time through sand, and rocks, and snow, the little prince at last came upon a road. And all\n", + " [{'source': './data/the_little_prince.txt', '_id': '45ad1b0e-45cd-46f0-b6cd-d8e2b19ea8fa', '_collection_name': 'sparse_collection'}]\n", + "\n", + "\n", + "* And he went back to meet the fox. \n", + "\"Goodbye,\" he said. \n", + "\"Goodbye,\" said the fox. 
\"And now here is my secret, a very simple secret: It is only with the heart that one can see rightly; what is essential\n", + " [{'source': './data/the_little_prince.txt', '_id': 'ab098119-c45f-4e33-b105-a6c6e01a918b', '_collection_name': 'sparse_collection'}]\n", + "\n", + "\n" + ] + } + ], + "source": [ + "from langchain_qdrant import FastEmbedSparse, RetrievalMode\n", + "from langchain_openai import OpenAIEmbeddings\n", + "\n", + "query = \"What is the significance of the rose in The Little Prince?\"\n", + "\n", + "# Initialize the embedding model with a specific OpenAI model\n", + "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "# Initialize sparse embeddings using FastEmbedSparse\n", + "sparse_embeddings = FastEmbedSparse(model_name=\"Qdrant/bm25\")\n", + "\n", + "# Initialize QdrantVectorStore with documents, embeddings, and configuration\n", + "vector_store = QdrantVectorStore.from_documents(\n", + " documents=split_docs,\n", + " embedding=embedding,\n", + " sparse_embedding=sparse_embeddings,\n", + " url=QDRANT_URL,\n", + " api_key=QDRANT_API_KEY,\n", + " collection_name=\"sparse_collection\",\n", + " retrieval_mode=RetrievalMode.SPARSE,\n", + " batch_size=10,\n", + ")\n", + "\n", + "# Perform similarity search in the vector store\n", + "results = vector_store.similarity_search(\n", + " query=query,\n", + " k=3,\n", + ")\n", + "\n", + "for res in results:\n", + " print(f\"* {res.page_content[:200]}\\n [{res.metadata}]\\n\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Hybrid Vector Search\n", + "\n", + "To perform a **hybrid search** using **dense** and **sparse vectors** with **score fusion**:\n", + "\n", + "- The `retrieval_mode` parameter should be set to `RetrievalMode.HYBRID`.\n", + "- A [`dense embeddings`](https://python.langchain.com/docs/integrations/text_embedding/) value should be provided to the `embedding` parameter.\n", + "- An 
implementation of the [`SparseEmbeddings`](https://github.com/langchain-ai/langchain/blob/master/libs/partners/qdrant/langchain_qdrant/sparse_embeddings.py) interface using any **sparse embeddings provider** has to be provided as a value to the `sparse_embedding` parameter.\n", + "\n", + "**Note**: If you've added documents with the `HYBRID` mode, you can switch to any **retrieval mode** when searching, since both the **dense** and **sparse vectors** are available in the **collection**." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "* \"Go and look again at the roses. You will understand now that yours is unique in all the world. Then come back to say goodbye to me, and I will make you a present of a secret.\" \n", + "The little prince went\n", + " [{'source': './data/the_little_prince.txt', '_id': '447a916c-d8a9-46f2-b035-d0ac4c7ea901', '_collection_name': 'hybrid_collection'}]\n", + "\n", + "\n", + "* [ Chapter 20 ]\n", + "- the little prince discovers a garden of roses\n", + "But it happened that after walking for a long time through sand, and rocks, and snow, the little prince at last came upon a road. 
And all\n", + " [{'source': './data/the_little_prince.txt', '_id': '894a9222-ef0c-4e28-b736-8a334cbdc83b', '_collection_name': 'hybrid_collection'}]\n", + "\n", + "\n", + "* [ Chapter 8 ]\n", + "- the rose arrives at the little prince‘s planet\n", + " [{'source': './data/the_little_prince.txt', '_id': 'a3729fa0-b734-4316-ad18-83ea16263a2f', '_collection_name': 'hybrid_collection'}]\n", + "\n", + "\n" + ] + } + ], + "source": [ + "from langchain_qdrant import FastEmbedSparse, RetrievalMode\n", + "from langchain_openai import OpenAIEmbeddings\n", + "\n", + "query = \"What is the significance of the rose in The Little Prince?\"\n", + "\n", + "# Initialize the embedding model with a specific OpenAI model\n", + "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "# Initialize sparse embeddings using FastEmbedSparse\n", + "sparse_embeddings = FastEmbedSparse(model_name=\"Qdrant/bm25\")\n", + "\n", + "# Initialize QdrantVectorStore with documents, embeddings, and configuration\n", + "vector_store = QdrantVectorStore.from_documents(\n", + " documents=split_docs,\n", + " embedding=embedding,\n", + " sparse_embedding=sparse_embeddings,\n", + " url=QDRANT_URL,\n", + " api_key=QDRANT_API_KEY,\n", + " collection_name=\"hybrid_collection\",\n", + " retrieval_mode=RetrievalMode.HYBRID,\n", + " batch_size=10,\n", + ")\n", + "\n", + "# Perform similarity search in the vector store\n", + "results = vector_store.similarity_search(\n", + " query=query,\n", + " k=3,\n", + ")\n", + "\n", + "for res in results:\n", + " print(f\"* {res.page_content[:200]}\\n [{res.metadata}]\\n\\n\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "langchain-opentutorial-6aJyhYW2-py3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": 
"python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/09-VectorStore/utils/qdrant.py b/09-VectorStore/utils/qdrant/legacy/qdrant.py similarity index 100% rename from 09-VectorStore/utils/qdrant.py rename to 09-VectorStore/utils/qdrant/legacy/qdrant.py From 2946e13c5483fc191ccef5322d80f1684819cc16 Mon Sep 17 00:00:00 2001 From: Gwangwon Jung Date: Tue, 6 May 2025 17:51:55 +0900 Subject: [PATCH 2/4] [N-2] 09-VectorStore / 05-Qdrant - Write Overview --- 09-VectorStore/05-Qdrant.ipynb | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/09-VectorStore/05-Qdrant.ipynb b/09-VectorStore/05-Qdrant.ipynb index bd54df77e..e8a3f09cd 100644 --- a/09-VectorStore/05-Qdrant.ipynb +++ b/09-VectorStore/05-Qdrant.ipynb @@ -5,30 +5,30 @@ "id": "25733da0", "metadata": {}, "source": [ - "# {VectorStore Name}\n", + "# Qdrant\n", "\n", - "- Author: [Author Name](#Author's-Profile-Link)\n", + "- Author: [Pupba](https://github.com/pupba)\n", "- Design: [Designer](#Designer's-Profile-Link)\n", "- Peer Review: [Reviewer Name](#Reviewer-Profile-Link)\n", "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n", "\n", - "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/your-notebook-file-name) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/your-notebook-file-name)\n", + "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/05-Qdrant.ipynb) [![Open in 
GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/05-Qdrant.ipynb)\n", "\n", "## Overview\n", "\n", - "This tutorial covers how to use **{Vector Store Name}** with **LangChain** .\n", + "This tutorial covers how to use **Qdrant** with **LangChain** .\n", "\n", - "{A short introduction to vectordb}\n", + "**Qdrant** is a high-performance, open-source vector database that stands out with advanced filtering, payload indexing, and native support for hybrid (vector + keyword) search.\n", "\n", - "This tutorial walks you through using **CRUD** operations with the **{VectorDB}** **storing** , **updating** , **deleting** documents, and performing **similarity-based retrieval** .\n", + "This tutorial walks you through **CRUD** operations with **Qdrant** : **storing** , **updating** , **deleting** documents, and performing **similarity-based retrieval** .\n", "\n", "### Table of Contents\n", "\n", "- [Overview](#overview)\n", "- [Environment Setup](#environment-setup)\n", - "- [What is {vectordb}?](#what-is-{vectordb}?)\n", + "- [What is Qdrant?](#what-is-qdrant)\n", "- [Data](#data)\n", - "- [Initial Setting {vectordb}](#initial-setting-{vectordb})\n", + "- [Initial Setting Qdrant](#initial-setting-qdrant)\n", "- [Document Manager](#document-manager)\n", "\n", "\n", From d629921b7879e15af4c441ce27059761b66b08f6 Mon Sep 17 00:00:00 2001 From: Gwangwon Jung Date: Tue, 6 May 2025 21:47:52 +0900 Subject: [PATCH 3/4] [N-2] 09-Vector Store / 05-Qdrant - New Qdrant Tutorial - Legacy Notbook and Python Script Move to `/utils/qdrant/legacy/` --- 09-VectorStore/05-Qdrant.ipynb | 429 ++++++++++++++--------- 09-VectorStore/assets/05-qdrant-logo.png | Bin 0 -> 2913 bytes 09-VectorStore/utils/qdrant/crud.py | 361 +++++++++++++++++++ 3 files changed, 630 insertions(+), 160 deletions(-) create mode 100644 
09-VectorStore/assets/05-qdrant-logo.png create mode 100644 09-VectorStore/utils/qdrant/crud.py diff --git a/09-VectorStore/05-Qdrant.ipynb b/09-VectorStore/05-Qdrant.ipynb index e8a3f09cd..9c13b9b14 100644 --- a/09-VectorStore/05-Qdrant.ipynb +++ b/09-VectorStore/05-Qdrant.ipynb @@ -52,14 +52,20 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "id": "98da7994", - "metadata": { - "vscode": { - "languageId": "plaintext" + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n", + "[notice] A new release of pip is available: 24.3.1 -> 25.1.1\n", + "[notice] To update, run: python.exe -m pip install --upgrade pip\n" + ] } - }, - "outputs": [], + ], "source": [ "%%capture --no-stderr\n", "%pip install langchain-opentutorial" @@ -67,13 +73,9 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "id": "800c732b", - "metadata": { - "vscode": { - "languageId": "plaintext" - } - }, + "metadata": {}, "outputs": [], "source": [ "# Install required packages\n", @@ -84,6 +86,7 @@ " \"langsmith\",\n", " \"langchain-core\",\n", " \"python-dotenv\",\n", + " \"qdrant-client\",\n", " ],\n", " verbose=False,\n", " upgrade=False,\n", @@ -92,14 +95,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "id": "5b36bafa", - "metadata": { - "vscode": { - "languageId": "plaintext" + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Environment variables have been set successfully.\n" + ] } - }, - "outputs": [], + ], "source": [ "# Set environment variables\n", "from langchain_opentutorial import set_env\n", @@ -108,9 +115,11 @@ " {\n", " \"OPENAI_API_KEY\": \"\",\n", " \"LANGCHAIN_API_KEY\": \"\",\n", - " \"LANGCHAIN_TRACING_V2\": \"true\",\n", + " \"LANGCHAIN_TRACING_V2\": \"false\",\n", " \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n", - " \"LANGCHAIN_PROJECT\": \"{Project Name}\",\n", + " 
\"LANGCHAIN_PROJECT\": \"Qdrant\",\n", + " \"QDRANT_API_KEY\": \"\",\n", + " \"QDRANT_URL\": \"\",\n", " }\n", ")" ] @@ -127,14 +136,21 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "id": "70d7e764", - "metadata": { - "vscode": { - "languageId": "plaintext" + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" } - }, - "outputs": [], + ], "source": [ "from dotenv import load_dotenv\n", "\n", @@ -146,7 +162,25 @@ "id": "6890920d", "metadata": {}, "source": [ - "Please write down what you need to set up the Vectorstore here." + "## What is Qdrant?\n", + "\n", + "![qdrant-logo](./assets/05-qdrant-logo.png)\n", + "\n", + "Qdrant is an open-source vector database and similarity search engine built in Rust, designed to handle high-dimensional vector data efficiently.\n", + "\n", + "It provides a production-ready service with a user-friendly API for storing, searching, and managing vectors along with additional payload data.\n", + "\n", + "### Key Features\n", + "\n", + "- **High Performance** : Built in Rust for speed and reliability, handling billions of vectors with low latency. \n", + "\n", + "- **Advanced Filtering** : Supports complex filtering with JSON payloads, enabling precise searches based on metadata. \n", + "\n", + "- **Hybrid Search** : Combines vector similarity with keyword-based filtering for enhanced search capabilities. \n", + "\n", + "- **Scalable Deployment** : Offers cloud-native scalability with options for on-premise, cloud, and hybrid deployments. \n", + "\n", + "- **Multi-language Support** : Provides client libraries for Python, JavaScript/TypeScript, Go, and more. 
" ] }, { @@ -196,13 +230,9 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "id": "8e4cac64", - "metadata": { - "vscode": { - "languageId": "plaintext" - } - }, + "metadata": {}, "outputs": [], "source": [ "from langchain.schema import Document\n", @@ -210,15 +240,16 @@ "import re\n", "from typing import List\n", "\n", - "def preprocessing_data(content:str)->List[Document]:\n", + "\n", + "def preprocessing_data(content: str) -> List[Document]:\n", " # 1. Split the text by double newlines to separate sections\n", " blocks = content.split(\"\\n\\n\")\n", "\n", " # 2. Initialize the text splitter\n", " text_splitter = RecursiveCharacterTextSplitter(\n", - " chunk_size=500, # Maximum number of characters per chunk\n", - " chunk_overlap=50, # Overlap between chunks to preserve context\n", - " separators=[\"\\n\\n\", \"\\n\", \" \"] # Order of priority for splitting\n", + " chunk_size=500, # Maximum number of characters per chunk\n", + " chunk_overlap=50, # Overlap between chunks to preserve context\n", + " separators=[\"\\n\\n\", \"\\n\", \" \"], # Order of priority for splitting\n", " )\n", "\n", " documents = []\n", @@ -253,17 +284,21 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "id": "1d091a51", - "metadata": { - "vscode": { - "languageId": "plaintext" + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Generated 262 chunked documents.\n" + ] } - }, - "outputs": [], + ], "source": [ "# Load the entire text file\n", - "with open(\"the_little_prince.txt\", \"r\", encoding=\"utf-8\") as f:\n", + "with open(\"./data/the_little_prince.txt\", \"r\", encoding=\"utf-8\") as f:\n", " content = f.read()\n", "\n", "# Preprocessing Data\n", @@ -276,15 +311,15 @@ "id": "1977d4ff", "metadata": {}, "source": [ - "## Initial Setting {vectordb}\n", + "## Initial Setting Qdrant\n", "\n", - "This part walks you through the initial setup of **{vectordb}** .\n", + "This part 
walks you through the initial setup of **Qdrant** .\n", "\n", "This section includes the following components:\n", "\n", "- Load Embedding Model\n", "\n", - "- Load {vectordb} Client" + "- Load Qdrant Client" ] }, { @@ -304,19 +339,15 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "id": "5bd5c3c9", - "metadata": { - "vscode": { - "languageId": "plaintext" - } - }, + "metadata": {}, "outputs": [], "source": [ "import os\n", "from langchain_openai import OpenAIEmbeddings\n", "\n", - "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")" + "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\", dimensions=1536)" ] }, { @@ -324,51 +355,60 @@ "id": "40f65795", "metadata": {}, "source": [ - "### Load {vectordb} Client\n", + "### Load Qdrant Client\n", "\n", - "In the **Load {vectordb} Client** section, we cover how to load the **database client object** using the **Python SDK** for **{vectordb}** .\n", - "- [Python SDK Docs]()" + "In the **Load Qdrant Client** section, we cover how to load the **database client object** using the **Python SDK** for **Qdrant** .\n", + "- [Python SDK Docs](https://python-client.qdrant.tech/)" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "id": "eed0ebad", - "metadata": { - "vscode": { - "languageId": "plaintext" - } - }, + "metadata": {}, "outputs": [], "source": [ "# Create Database Client Object Function\n", "\n", + "from qdrant_client import QdrantClient\n", + "import os\n", + "\n", + "\n", "def get_db_client():\n", " \"\"\"\n", " Initializes and returns a VectorStore client instance.\n", "\n", " This function loads configuration (e.g., API key, host) from environment\n", " variables or default values and creates a client object to interact\n", - " with the {vectordb} Python SDK.\n", + " with the Qdrant Python SDK.\n", "\n", " Returns:\n", - " client:ClientType - An instance of the {vectordb} client.\n", + " client:ClientType - An instance of the Qdrant 
client.\n", "\n", " Raises:\n", " ValueError: If required configuration is missing.\n", " \"\"\"\n", + "\n", + " # In this tutorial, use Qdrant Cloud.\n", + " # Get your personal Qdrant Cloud URL and API_Key on the official website.\n", + " # https://qdrant.tech/documentation/cloud-intro/\n", + " # If you want to use on-premise, please refer to -> https://qdrant.tech/documentation/quickstart/\n", + "\n", + " host = os.environ.get(\"QDRANT_URL\", None)\n", + " api_key = os.environ.get(\"QDRANT_API_KEY\", None)\n", + "\n", + " client = QdrantClient(\n", + " url=host, api_key=api_key, check_compatibility=False, timeout=30\n", + " )\n", + "\n", " return client" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 9, "id": "2b5f4116", - "metadata": { - "vscode": { - "languageId": "plaintext" - } - }, + "metadata": {}, "outputs": [], "source": [ "# Get DB Client Object\n", @@ -409,23 +449,22 @@ "source": [ "### Create Instance\n", "\n", - "First, we create an instance of the **{vectordb}** helper class to use its CRUD functionalities.\n", + "First, we create an instance of the **Qdrant** helper class to use its CRUD functionalities.\n", "\n", - "This class is initialized with the **{vectordb} Python SDK client instance** and the **embedding model instance** , both of which were defined in the previous section." + "This class is initialized with the **Qdrant Python SDK client instance** and the **embedding model instance** , both of which were defined in the previous section." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 10, "id": "dccab807", - "metadata": { - "vscode": { - "languageId": "plaintext" - } - }, + "metadata": {}, "outputs": [], "source": [ - "# crud_manager = (client=client, embedding=embedding)" + "from utils.qdrant.crud import QdrantDocumentManager\n", + "\n", + "# ❗ Qdrant vectorizes using the embedding function. 
Pass the embedding function as a parameter.\n", + "crud_manager = QdrantDocumentManager(client=client, embedding=embedding.embed_documents)" ] }, { @@ -435,7 +474,7 @@ "source": [ "Now you can use the following **CRUD** operations with the ```crud_manager``` instance.\n", "\n", - "These instance allow you to easily manage documents in your **{vectordb}** ." + "This instance allows you to easily manage documents in your **Qdrant** ." ] }, { @@ -464,25 +503,30 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 14, "id": "f3a6c32b", - "metadata": { - "vscode": { - "languageId": "plaintext" + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Operation_id : 116 | Status : completed\n" + ] } - }, - "outputs": [], + ], "source": [ "from uuid import uuid4\n", "\n", "args = {\n", " \"texts\": [doc.page_content for doc in docs[:2]],\n", " \"metadatas\": [doc.metadata for doc in docs[:2]],\n", - " \"ids\": [str(uuid4()) for _ in docs[:2]]\n", + " \"ids\": [str(uuid4()) for _ in docs[:2]],\n", + " \"result_view\": True,\n", " # if you want args, add params.\n", "}\n", "\n", - "# crud_manager.upsert(**args)" + "crud_manager.upsert(**args)" ] }, { @@ -515,13 +559,9 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 13, "id": "a89dd8e0", - "metadata": { - "vscode": { - "languageId": "plaintext" - } - }, + "metadata": {}, "outputs": [], "source": [ "from uuid import uuid4\n", "\n", "args = {\n", " \"texts\": [doc.page_content for doc in docs],\n", " \"metadatas\": [doc.metadata for doc in docs],\n", - " \"ids\": [str(uuid4()) for _ in docs]\n", + " \"ids\": [str(uuid4()) for _ in docs],\n", " # if you want args, add params.\n", "}\n", "\n", - "# crud_manager.upsert_parallel(**args)" + "crud_manager.upsert_parallel(**args)" ] }, { @@ -563,42 +603,74 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 16, "id": "5859782b", - "metadata": 
{ - "vscode": { - "languageId": "plaintext" + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Rank 0 | Title : TO LEON WERTH\n", + "Contents : TO LEON WERTH WHEN HE WAS A LITTLE BOY\n", + "\n", + "Rank 1 | Title : Chapter 21\n", + "Contents : you see the grain-fields down yonder? I do not ea t bread. Wheat is of no use to me. The wheat fields have nothing to say to me. And that is sad. But you have hair that is the colour of gold. Think how wonderful that will be when you have tamed me! The grain, which is also golden, will bring me bac k the thought of you. And I shall love to listen to the wind in the wheat...\"\n", + "\n", + "Rank 2 | Title : Chapter 27\n", + "Contents : Look up at the sky. Ask yourselves: is it yes or no? Has the sheep eaten the flower? And you will see how everything changes... \n", + "And no grown-up will ever understand that this is a matter of so much importance! \n", + "(picture)\n", + "This is, to me, the loveliest and saddest landscape in the world. It is the same as that on the preceding page, but I have drawn it again to impress it on your memory. 
It is here that the little prince appeared on Earth, and disappeared.\n", + "\n" + ] } - }, - "outputs": [], + ], "source": [ "# Search by Query\n", "\n", - "# results = crud_manager.search(query=\"What is essential is invisible to the eye.\",k=3)\n", - "# for idx,doc in enumerate(results):\n", - "# print(f\"Rank {idx} | Title : {doc.metadata['title']}\")\n", - "# print(f\"Contents : {doc.page_content}\")\n", - "# print()" + "results = crud_manager.search(query=\"What is essential is invisible to the eye.\", k=3)\n", + "for idx, doc in enumerate(results):\n", + " print(f\"Rank {idx} | Title : {doc.metadata['title']}\")\n", + " print(f\"Contents : {doc.page_content}\")\n", + " print()" ] }, { "cell_type": "code", - "execution_count": null, - "id": "2577dd4a", - "metadata": { - "vscode": { - "languageId": "plaintext" + "execution_count": 17, + "id": "529d8483", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Rank 0 | Title : Chapter 4\n", + "Contents : But that did not really surprise me much. I knew very well that in addition to the great planets-- such as the Earth, Jupiter, Mars, Venus-- to which we have given names, there are also hundreds of others, some of which are so small that one has a hard time seeing them through the telescope. When an astronomer discovers one of these he does not give it a name, but only a number. He might call it, for example, \"Asteroid 325.\"\n", + "\n", + "Rank 1 | Title : Chapter 4\n", + "Contents : - the narrator speculates as to which asteroid from which the little prince came  \n", + "I had thus learned a second fact of great importance: this was that the planet the little prince came from was scarcely any larger than a house!\n", + "\n", + "Rank 2 | Title : Chapter 4\n", + "Contents : weigh? 
How much money does his father make?\" Only from these figures do they think they have learned anything about him.\n", + "\n" + ] } - }, - "outputs": [], + ], "source": [ "# Filter Search\n", "\n", - "# results = crud_manager.search(query=\"Which asteroid did the little prince come from?\",k=3,={\"title\":\"Chapter 4\"})\n", - "# for idx,doc in enumerate(results):\n", - "# print(f\"Rank {idx} | Title : {doc.metadata['title']}\")\n", - "# print(f\"Contents : {doc.page_content}\")\n", - "# print()" + "results = crud_manager.search(\n", + " query=\"Which asteroid did the little prince come from?\",\n", + " k=3,\n", + " filters=[{\"title\": \"Chapter 4\"}],\n", + ")\n", + "for idx, doc in enumerate(results):\n", + " print(f\"Rank {idx} | Title : {doc.metadata['title']}\")\n", + " print(f\"Contents : {doc.page_content}\")\n", + " print()" ] }, { @@ -627,32 +699,40 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 18, "id": "86de7842", - "metadata": { - "vscode": { - "languageId": "plaintext" - } - }, + "metadata": {}, "outputs": [], "source": [ - "# ret = crud_manager.as_retriever(\n", - "# search_fn=crud_manager.search, search_kwargs= # e.g. {\"k\": 1, \"where\": {\"title\": \"\"}}\n", - "# )" + "ret = crud_manager.as_retriever(\n", + " search_fn=crud_manager.search,\n", + " search_kwargs={\n", + " \"k\": 2,\n", + " \"filters\": [{\"title\": \"Chapter 5\"}],\n", + " }, # e.g. {\"k\": 1, \"where\": {\"title\": \"\"}}\n", + ")" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 19, "id": "7142d29c", - "metadata": { - "vscode": { - "languageId": "plaintext" + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={'title': 'Chapter 5', 'score': 0.15439087, 'id': 'c10fec23-cf31-4167-8588-a987d1b901e9'}, page_content='My friends, like myself, have been skirting this danger for a long time, without ever knowing it; and so it is for them that I have worked so hard over this drawing. 
The lesson which I pass on by this means is worth all the trouble it has cost me. \\n(picture)\\nPerhaps you will ask me, \"Why are there no other drawing in this book as magnificent and impressive as this drawing of the baobabs?\"'),\n", + " Document(metadata={'title': 'Chapter 5', 'score': 0.14290929, 'id': 'd77253e8-45ff-4965-a35c-a25e6d4c0217'}, page_content='\"It is a question of discipline,\" the little prince said to me later on. \"When you‘ve finished your own toilet in the morning, then it is time to attend to the toilet of your planet, just so, with the greatest care. You must see to it that you pull up regularly all the baobabs, at the very first moment when they can be distinguished from the rosebushes which they resemble so closely in their earliest youth. It is very tedious work,\" the little prince added, \"but very easy.\"')]" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" } - }, - "outputs": [], + ], "source": [ - "# ret.invoke(\"Which asteroid did the little prince come from?\")" + "ret.invoke(\"Which asteroid did the little prince come from?\")" ] }, { @@ -679,58 +759,87 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 20, "id": "0e3a2c33", - "metadata": { - "vscode": { - "languageId": "plaintext" + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "3 data delete...\n" + ] } - }, - "outputs": [], + ], "source": [ "# Delete by ids\n", "\n", - "# ids = [] # The 'ids' value you want to delete\n", - "# crud_manager.delete(ids=ids)" + "ids = args[\"ids\"][:3] # The 'ids' value you want to delete\n", + "crud_manager.delete(ids=ids)" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 14, "id": "60bcb4cf", - "metadata": { - "vscode": { - "languageId": "plaintext" + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "4 data delete...\n", + "Delete All Finished\n" + ] } - }, - 
"outputs": [], + ], "source": [ "# Delete by ids with filters\n", "\n", - "# ids = [] # The `ids` value corresponding to chapter 6\n", - "# crud_manager.delete(ids=ids,filters={\"title\":\"chapter 6\"}) " + "ids = args[\"ids\"] # The `ids` value corresponding to chapter 6\n", + "crud_manager.delete(ids=ids, filters=[{\"title\": \"Chapter 6\"}])" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 15, "id": "30d42d2e", - "metadata": { - "vscode": { - "languageId": "plaintext" + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "256 data delete...\n", + "2 data delete...\n", + "Delete All Finished\n" + ] } - }, - "outputs": [], + ], "source": [ "# Delete All\n", "\n", - "# crud_manager.delete()" + "crud_manager.delete()" ] } ], "metadata": { + "kernelspec": { + "display_name": "langchain-opentutorial-B290FrwJ-py3.11", + "language": "python", + "name": "python3" + }, "language_info": { - "name": "python" + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" } }, "nbformat": 4, diff --git a/09-VectorStore/assets/05-qdrant-logo.png b/09-VectorStore/assets/05-qdrant-logo.png new file mode 100644 index 0000000000000000000000000000000000000000..aee4e9bd68326e68c5b259b607add03968770eb6 GIT binary patch literal 2913 zcmai0c{r5q8lR+uB$RA@IgMp(Lo-E4mh3ae7-M7^TSgfpvaiX~*pp$hXD4Rt#MmZT z%0v`K5#x00A>o zLlnnvat7+uaSq#6hlX(6F@Kba0igE1)HhDTeOKQ?9{^}bx3^rSXSkPYhV@FAc)eom{Kqd zHHoOQeGGRY&q*V{0FRzkjZ3dlckRC&-7klnmVv#)K7Os^*kH z#8i8P*50%tYF)d_+4Hlkc~4x8Z!8Uie_q?k3QFp}|EP{}_o>qrw`m}psd=ZZ?7f;g zK1IcTR1PtH89oEF7{xpaP<5IEBIad~@9yWIpziG;)PgLMC2!5T=JGbG%J`~Bi4tZ- z!FE&GVaf2ptNejoRp$eU8ygnl{jjz#yz~(2z6o}ql`tms@l#+7<6b)bNh?#sxskrq z(lR@IJGzekwTaH|$m^TbaBZY7a~d*8AkTvB#~dElDxmA=%RQv_1tY&V#QHs`+pI(U 
zA6J}m!445AG)`q5eU8H%K-KY|fp8fO0#Kd6VHpiCX*J&_`Vyzl(%0{CS}J=D|3=O- zJeda858wna9;p@1IiLoh5Y_;_xv09s5tp-R>(Yg4bl=1OnO@om{qPVeUAO+QqV-6g zb16J>94q{rIdm2obrk-+?9b01cR2B!+L7}AZ1ukdoV98rhleh6M~d`+V<)r!Nv3iwQ3ROW`AUytjL&Ypx7XLO?)gR)iiiea)xDYgY0cROs6J# zj(KDYXNY6NesX`IT-Y}D%K*$B>LP}O#`>00z>&gh%;RFJ09Hn8eheO3p+Pk|q4QCW zYHc~4)PWo8+veBX-%`7=2uEWVgWULiUTdwa`oG+^y+6PCxLAs~T(Xn&jTa409<%^e zeR??i(pxjrt1YA$6u;~Yh^8V4P}laY9sb9@zp)g-ST%-gn{47*U{1lOrOP(<16yoG z3fpPf5^k5kyFyL=M`t~`G^p}&!UkC>eqL77Ew}AzqGvxB=Kje+|1q*E4C2kSg!}ZF zi1|#GrQ4LC4tanV~W`}E<1Um6u)2?VmlbC|6I$!j3kV=DD!Hs5zqN8`9}cT zss-FA3tAXumlM9%2hD+k;hJP1SosvA+s}4Vz9h|Qmy5>88OS_EpIq89)Ltj4Im{B3+mjiZ&&RdveHX+7wXqT& z201(E<%Rj9K_!r$Khm>=2Sxf)zf$0D>MW2n-*1}(@X zLAmXEamtZqB`VlJg#TH4(jiiS^2NBe2#=q^#V9ard?wJ<{)?!%7&nxK3qwonc5LJJ z+o<9pqHMRY8xTc$mFY#bfs7`#Wh-cG^)Iy`_xMj*b)Wa-s_5jQkpgf@UR`p?fb{GN zf(k{sB&>CYFz|pdE2fBGY@G|EC>wB!zwc;2HD`pccXAF5nxzFXI&7YPf6x+>NimDL zHJ<9AjJd|is_jt5W`?ECK??Gv_7!&NPzlU!K-Drw zaFvu2i8c_#{nPpowLi7|Jj!+CjEwzh(N7fy&l|NB#_VB234!9*encqI6C8OJ=vxqy z)RNa;Njr#Lo+z%3d8dI1>n8WwIvM>MWi(5lT4h4XYn5=ad!^Q&MevD+Oru+LqVO$6 zYuu_w1ujSz2i3N*BbrG$<;)hAN)hYiWr;Q(I_aq}g2|Mncop-hrB~}b+4K5#X0Poy zNSz81m?Cg1WjxTu*?0FK46Z;rAqVuFIw_)0r)FnkpaJrG^NN*b#WLStpfMM{;nz>L zPTF|rX9ZksFoIb>yR%!*SQCmiS$oCQZg?>~yHLbn7SF*;F#;0h2Tub+R=%Tg-RvQx zM_wD+PtdvKPG&!X;p8n?y{hr(-s;rx`hbvStYzq%ZYdspW?wEBray-fp|&B8p7H^l zpIh$|gvATqY|dQ&+^R7$mDL(yuEq~K4K&WNvH#%p_`A<ACmUZUrq;ZaF@RuyUdkD*-L>F~nyY3H1(7SBR zSML_F3gnuXJr~-xY`EUy>z34&&v*Kj_0F50Ar=UCAbK&AgP~8J^ho z4Mw@o3>X`M<9tgbGK~VFc(8*XiX9ar1)Yys|C%@1@D*^M~?c>hM<8X^wEg_%;cC+dOn_!^d>IgJ`(_cHLB&5C59Pcxi_P7nN@^ zZ<#6xH}S-6=HF#{TO3%GAqY)}BlM3$jHpY+$V(T^%cB61O|m(x_-K31?*zci2w_-j H;1csM%Y(en literal 0 HcmV?d00001 diff --git a/09-VectorStore/utils/qdrant/crud.py b/09-VectorStore/utils/qdrant/crud.py new file mode 100644 index 000000000..24ced059e --- /dev/null +++ b/09-VectorStore/utils/qdrant/crud.py @@ -0,0 +1,361 @@ +from utils.vectordbinterface import DocumentManager +from qdrant_client import QdrantClient +from 
qdrant_client.models import ( + VectorParams, + Distance, + PointStruct, + VectorStruct, + UpdateResult, + Filter, + FieldCondition, + MatchValue, + FilterSelector, + PointIdsList, +) +from typing import List, Dict, Any, Optional, Iterable, Callable +from uuid import uuid4 +from concurrent.futures import ThreadPoolExecutor, as_completed +from langchain_core.documents import Document + + +class QdrantDocumentManager(DocumentManager): + def __init__(self, client: QdrantClient, embedding: Callable, **kwargs) -> None: + self.qdrant_client = client # Qdrant Python SDK + try: + if "collection_name" not in kwargs: + kwargs["collection_name"] = "qdrant_test" + + self.collection_name = kwargs["collection_name"] + + if "vectors_config" not in kwargs: + # https://qdrant.tech/documentation/embeddings/openai/?utm_source=chatgpt.com + kwargs["vectors_config"] = VectorParams( + size=1536, # text-embedding-3-large support 1536 + distance=Distance.COSINE, # General Used Cosine Similarity + ) + # Create Collection + if not self.qdrant_client.collection_exists( + "qdrant_test" + ) and not self.qdrant_client.collection_exists(kwargs["collection_name"]): + if self.qdrant_client.create_collection(**kwargs): + print(f"Success Create {kwargs.get('collection_name')} Collection") + else: + raise Exception("Failed Create Collection") + except Exception as e: + print(e) + # Embedding + self.embedding = embedding + + def __ensure_indexed_fields(self, fields: List[str]): + """ + Ensure that each specified payload field is indexed in the current Qdrant collection. + + This is required for enabling metadata filtering (e.g. via 'filters' argument in search). + If an index already exists for a field, it will be skipped silently. + + Args: + fields (List[str]): List of payload field names to index. + + Raises: + Exception: If index creation fails for reasons other than 'already exists'. 
+        """
+
+        for field in fields:
+            try:
+                self.qdrant_client.create_payload_index(
+                    collection_name=self.collection_name,
+                    field_name=field,
+                    field_schema="keyword",
+                )
+            except Exception as e:
+                if "already exists" not in str(e):
+                    raise
+
+    def upsert(
+        self,
+        texts: Iterable[str],
+        metadatas: Optional[List[Dict]] = None,
+        ids: Optional[List[str]] = None,
+        **kwargs: Any,
+    ) -> None:
+        """
+        Insert or update vectors in Qdrant using the provided embedding model.
+
+        Each text is embedded into a vector and stored in the Qdrant collection
+        along with its metadata (payload). If no ID is provided, UUIDs are generated.
+
+        Args:
+            texts (Iterable[str]): List or iterable of input texts to be embedded and stored.
+            metadatas (Optional[List[Dict]]): Optional list of metadata dictionaries for each text.
+            ids (Optional[List[str]]): Optional list of string IDs for the vectors.
+            **kwargs: Optional arguments such as:
+                - result_view (bool): If True, prints operation_id and status for each result.
+
+        Returns:
+            None
+        """
+        texts = list(texts)  # materialize the iterable so len() is safe
+
+        if ids is None:
+            ids = [str(uuid4()) for _ in range(len(texts))]
+
+        vectors: VectorStruct = self.embedding(texts)  # one vector per text
+
+        metadatas = metadatas or [{}] * len(texts)
+
+        payloads = [
+            {"text": text} | metadata for text, metadata in zip(texts, metadatas)
+        ]
+
+        # Create payload indexes for every metadata field so filtering works
+        index_fields = set()
+        for payload in payloads:
+            index_fields.update(k for k in payload.keys() if k != "text")
+
+        self.__ensure_indexed_fields(list(index_fields) + ["text"])
+
+        # Build the points
+        points = [
+            PointStruct(id=id, vector=vector, payload=payload)
+            for id, vector, payload in zip(ids, vectors, payloads)
+        ]
+
+        results: UpdateResult = self.qdrant_client.upsert(
+            collection_name=self.collection_name, points=points
+        )
+
+        if kwargs.get("result_view", False):
+            print(f"Operation_id : {results.operation_id} | Status : {results.status}")
+
+    def upsert_parallel(
+        self,
+        texts: Iterable[str],
+        metadatas: Optional[List[Dict]] = None,
+        ids: Optional[List[str]] = None,
+        batch_size: int = 32,
+        workers: int = 10,
+        **kwargs: Any,
+    ) -> None:
+        """
+        Perform parallel upsert of vectors into Qdrant using ThreadPoolExecutor.
+
+        This method embeds input texts into vectors and inserts them into the collection
+        in parallel batches, which improves performance for large datasets.
+
+        Args:
+            texts (Iterable[str]): List or iterable of texts to be embedded and inserted.
+            metadatas (Optional[List[Dict]]): Optional list of metadata dictionaries.
+            ids (Optional[List[str]]): Optional list of string IDs. If None, UUIDs are auto-generated.
+            batch_size (int): Number of points to upsert in each batch.
+            workers (int): Number of threads to use for parallel execution.
+            **kwargs: Optional arguments such as:
+                - result_view (bool): If True, prints operation_id and status for each batch result.
+
+        Returns:
+            None
+        """
+        texts = list(texts)
+        metadatas = metadatas or [{}] * len(texts)
+
+        if ids is None:
+            ids = [str(uuid4()) for _ in range(len(texts))]
+
+        vectors: VectorStruct = self.embedding(texts)
+
+        payloads = [
+            {"text": text} | metadata for text, metadata in zip(texts, metadatas)
+        ]
+
+        # Create payload indexes for every metadata field so filtering works
+        index_fields = set()
+        for payload in payloads:
+            index_fields.update(k for k in payload.keys() if k != "text")
+
+        self.__ensure_indexed_fields(list(index_fields) + ["text"])
+
+        # Prepare all points
+        all_points = [
+            PointStruct(id=id, vector=vector, payload=payload)
+            for id, vector, payload in zip(ids, vectors, payloads)
+        ]
+
+        # Split the points into fixed-size batches
+        def batch_iterable(data, batch_size):
+            for i in range(0, len(data), batch_size):
+                yield data[i : i + batch_size]
+
+        # Upsert a single batch
+        def upsert_batch(batch: List[PointStruct]) -> UpdateResult:
+            return self.qdrant_client.upsert(
+                collection_name=self.collection_name, points=batch
+            )
+
+        with ThreadPoolExecutor(max_workers=workers) as executor:
+            futures = {
+                executor.submit(upsert_batch, batch): batch
+                for batch in batch_iterable(all_points, batch_size)
+            }
+
+            for future in as_completed(futures):
+                result = future.result()
+                if kwargs.get("result_view", False):
+                    print(
+                        f"Operation_id: {result.operation_id} | Status: {result.status}"
+                    )
+
+    def search(self, query: str, k: int = 10, **kwargs: Any) -> List[Document]:
+        """
+        Perform a vector similarity search with optional metadata filtering.
+
+        Args:
+            query (str): The input query string to embed and search.
+            k (int): The number of top results to return.
+            **kwargs:
+                filters (List[Dict[str, Any]]): Optional metadata filters.
+                    Example: [{"category": "news"}, {"lang": "en"}]
+
+        Returns:
+            List[Document]: List of matching documents with metadata.
+        """
+
+        # Embed the query as a single-element batch and take its vector
+        query_vector: VectorStruct = self.embedding([query])[0]
+
+        # Build the metadata filter, if any
+        query_filter = None
+        if "filters" in kwargs:
+            condition = []
+            for f in kwargs["filters"]:
+                for key, value in f.items():
+                    condition.append(
+                        FieldCondition(key=key, match=MatchValue(value=value))
+                    )
+            query_filter = Filter(must=condition)
+
+        results = (
+            self.qdrant_client.query_points(
+                collection_name=self.collection_name,
+                query=query_vector,
+                limit=k,
+                query_filter=query_filter,
+            )
+            .model_dump()
+            .get("points", [])
+        )
+        return [
+            Document(
+                page_content=hit["payload"].get("text", ""),
+                metadata={
+                    **{k: v for k, v in hit["payload"].items() if k != "text"},
+                    "score": hit["score"],
+                    "id": hit["id"],
+                },
+            )
+            for hit in results
+        ]
+
+    def delete(
+        self,
+        ids: Optional[List[str]] = None,
+        filters: Optional[List[Dict]] = None,
+        **kwargs: Any,
+    ) -> None:
+        """
+        Delete points from the Qdrant collection by ID, filter, or both.
+
+        - If only `ids` are given: delete those points.
+        - If only `filters` are given: delete all points matching the filter.
+        - If both `ids` and `filters` are given: delete only points that match both conditions.
+        - If neither is given: delete all points in the collection.
+
+        Args:
+            ids (Optional[List[str]]): List of point IDs to delete.
+            filters (Optional[List[Dict[str, Any]]]): Metadata filter conditions.
+            **kwargs: Reserved for future use.
+
+        Returns:
+            None
+        """
+        if ids and filters:
+            # Delete by ids and filters: scroll filter matches, then intersect with ids
+            conditions = []
+            for f in filters:
+                for key, value in f.items():
+                    conditions.append(
+                        FieldCondition(key=key, match=MatchValue(value=value))
+                    )
+
+            query_filter = Filter(must=conditions)
+
+            id_set = set(ids)
+            scroll_offset = None
+            while True:
+                scroll_result = self.qdrant_client.scroll(
+                    collection_name=self.collection_name,
+                    scroll_filter=query_filter,
+                    with_payload=False,
+                    limit=30,
+                    offset=scroll_offset,
+                )
+                points_batch, scroll_offset = scroll_result
+
+                if not points_batch:
+                    print("Delete All Finished")
+                    break
+
+                # Keep only points that both match the filter and appear in `ids`
+                ids_to_delete = [
+                    point.id for point in points_batch if point.id in id_set
+                ]
+                if ids_to_delete:
+                    self.qdrant_client.delete(
+                        collection_name=self.collection_name,
+                        points_selector=PointIdsList(points=ids_to_delete),
+                    )
+                    print(f"{len(ids_to_delete)} data delete...")
+
+        elif ids:
+            # Delete by ids
+            self.qdrant_client.delete(
+                collection_name=self.collection_name,
+                points_selector=PointIdsList(points=ids),
+            )
+            print(f"{len(ids)} data delete...")
+
+        elif filters:
+            # Delete by filters
+            conditions = []
+            for f in filters:
+                for key, value in f.items():
+                    conditions.append(
+                        FieldCondition(key=key, match=MatchValue(value=value))
+                    )
+
+            query_filter = Filter(must=conditions)
+
+            self.qdrant_client.delete(
+                collection_name=self.collection_name,
+                points_selector=FilterSelector(filter=query_filter),
+            )
+            print(f"Filters: {query_filter}")
+            print("Delete All Finished")
+
+        else:
+            # Delete everything, batch by batch
+            scroll_offset = None
+            while True:
+                scroll_result = self.qdrant_client.scroll(
+                    collection_name=self.collection_name,
+                    with_payload=False,
+                    limit=256,
+                    offset=scroll_offset,
+                )
+                points_batch, scroll_offset = scroll_result
+
+                if not points_batch:
+                    print("Delete All Finished")
+                    break
+
+                ids_to_delete = [point.id for point in points_batch]
+
+                self.qdrant_client.delete(
+                    collection_name=self.collection_name,
+                    points_selector=PointIdsList(points=ids_to_delete),
+                )
+                print(f"{len(ids_to_delete)} data delete...")

From c572bdf2d8e94a0c09d22e09a590ae3909920190 Mon Sep 17 00:00:00 2001
From: Gwangwon Jung
Date: Wed, 7 May 2025 18:22:28 +0900
Subject: [PATCH 4/4] [N-2] 09-Vector Store / 05-Qdrant - Fix QdrantClient
 creation error: do not use `check_compatibility` due to version differences
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 09-VectorStore/05-Qdrant.ipynb | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/09-VectorStore/05-Qdrant.ipynb b/09-VectorStore/05-Qdrant.ipynb
index 9c13b9b14..04ff1e44f 100644
--- a/09-VectorStore/05-Qdrant.ipynb
+++ b/09-VectorStore/05-Qdrant.ipynb
@@ -363,7 +363,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": null,
   "id": "eed0ebad",
   "metadata": {},
   "outputs": [],
@@ -396,11 +396,12 @@
    "\n",
    "    host = os.environ.get(\"QDRANT_URL\", None)\n",
    "    api_key = os.environ.get(\"QDRANT_API_KEY\", None)\n",
-    "\n",
-    "    client = QdrantClient(\n",
-    "        url=host, api_key=api_key, check_compatibility=False, timeout=30\n",
-    "    )\n",
-    "\n",
+    "    try:\n",
+    "        client = QdrantClient(url=host, api_key=api_key, timeout=30)\n",
+    "    except Exception as e:\n",
+    "        print(\"Error\")\n",
+    "        print(f\"{e}\")\n",
+    "        return None\n",
    "    return client"
   ]
  },
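Both `search` and `delete` in `crud.py` translate a list of single-key dicts such as `[{"title": "Chapter 4"}]` into ANDed `must` conditions on a Qdrant `Filter`. The translation itself needs no running Qdrant server; the sketch below mirrors it with plain dicts standing in for `qdrant_client.models.FieldCondition`/`MatchValue` (the helper name `build_must_conditions` is illustrative, not part of the patch):

```python
from typing import Any, Dict, List


def build_must_conditions(filters: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten [{"title": "Chapter 4"}, ...] into Qdrant-style match conditions.

    Each key/value pair becomes one condition; Qdrant ANDs all conditions
    placed under a Filter's `must` clause.
    """
    conditions = []
    for f in filters:
        for key, value in f.items():
            # Mirrors FieldCondition(key=key, match=MatchValue(value=value))
            conditions.append({"key": key, "match": {"value": value}})
    return conditions


if __name__ == "__main__":
    # Two dicts -> two ANDed conditions
    print(build_must_conditions([{"title": "Chapter 4"}, {"lang": "en"}]))
```

Note that this scheme only expresses equality matches; range or full-text conditions would need other `FieldCondition` match types, and the payload fields involved must be indexed first (which is why `__ensure_indexed_fields` runs before every upsert).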