diff --git a/09-VectorStore/10-Weaviate.ipynb b/09-VectorStore/10-Weaviate.ipynb new file mode 100644 index 000000000..9830cacfc --- /dev/null +++ b/09-VectorStore/10-Weaviate.ipynb @@ -0,0 +1,2598 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Weaviate\n", + "\n", + "- Author: [Haseom Shin](https://github.com/IHAGI-c)\n", + "- Design: []()\n", + "- Peer Review: []()\n", + "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n", + "\n", + "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/10-Weaviate.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/10-Weaviate.ipynb)\n", + "\n", + "## Overview\n", + "\n", + "This comprehensive tutorial explores Weaviate, a powerful open-source vector database that enables efficient similarity search and semantic operations. Through hands-on examples, you'll learn:\n", + "\n", + "- How to set up and configure Weaviate for production use\n", + "- Essential operations including document indexing, querying, and deletion\n", + "- Advanced features such as hybrid search, multi-tenancy, and batch processing\n", + "- Integration with LangChain for sophisticated applications like RAG and QA systems\n", + "- Best practices for managing and scaling your vector database\n", + "\n", + "Whether you're building a semantic search engine, implementing RAG systems, or developing AI-powered applications, this tutorial provides the foundational knowledge and practical examples you need to leverage Weaviate effectively.\n", + "\n", + "> [Weaviate](https://weaviate.io/) is an open-source vector database. 
It allows you to store data objects and vector embeddings from your favorite ML-models, and scale seamlessly into billions of data objects.\n", + "\n", + "To use this integration, you need to have a running Weaviate database instance.\n", + "\n", + "### Table of Contents\n", + "\n", + "- [Overview](#overview)\n", + "- [Environment Setup](#environment-setup)\n", + "- [Credentials](#credentials)\n", + " - [Setting up Weaviate Cloud Services](#setting-up-weaviate-cloud-services)\n", + "- [What is Weaviate?](#what-is-weaviate)\n", + "- [Why Use Weaviate?](#why-use-weaviate)\n", + "- [Initialization](#initialization)\n", + " - [Creating Collections in Weaviate](#creating-collections-in-weaviate)\n", + " - [Delete Collection](#delete-collection)\n", + " - [List Collections](#list-collections)\n", + " - [Data Preprocessing](#data-preprocessing)\n", + " - [Document Preprocessing Function](#document-preprocessing-function)\n", + "- [Manage vector store](#manage-vector-store)\n", + " - [Add items to vector store](#add-items-to-vector-store)\n", + " - [Delete items from vector store](#delete-items-from-vector-store)\n", + "- [Finding Objects by Similarity](#finding-objects-by-similarity)\n", + " - [Step 1: Preparing Your Data](#step-1-preparing-your-data)\n", + " - [Step 2: Perform the search](#step-2-perform-the-search)\n", + " - [Quantify Result Similarity](#quantify-result-similarity)\n", + "- [Search mechanism](#search-mechanism)\n", + "- [Persistence](#persistence)\n", + "- [Multi-tenancy](#multi-tenancy)\n", + "- [Retriever options](#retriever-options)\n", + "- [Use with LangChain](#use-with-langchain)\n", + " - [Question Answering with Sources](#question-answering-with-sources)\n", + " - [Retrieval-Augmented Generation](#retrieval-augmented-generation)\n", + "\n", + "\n", + "### References\n", + "- [Langchain-Weaviate](https://python.langchain.com/docs/integrations/providers/weaviate/)\n", + "- [Weaviate Documentation](https://weaviate.io/developers/weaviate)\n", + "- 
[Weaviate Introduction](https://weaviate.io/developers/weaviate/introduction)\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Environment Setup\n", + "\n", + "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n", + "\n", + "**[Note]**\n", + "- `langchain-opentutorial` is a package that provides easy-to-use environment setup tools, useful functions, and utilities for tutorials. \n", + "- You can check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture --no-stderr\n", + "%pip install langchain-opentutorial" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.3.1\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" + ] + } + ], + "source": [ + "# Install required packages\n", + "from langchain_opentutorial import package\n", + "\n", + "package.install(\n", + " [\n", + " \"openai\",\n", + " \"langsmith\",\n", + " \"langchain\",\n", + " \"tiktoken\",\n", + " \"langchain-weaviate\",\n", + " \"langchain-openai\",\n", + " ],\n", + " verbose=False,\n", + " upgrade=False,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Environment variables have been set successfully.\n" + ] + } + ], + "source": [ + "# Set environment variables\n", + 
"from langchain_opentutorial import set_env\n", + "\n", + "set_env(\n", + " {\n", + " \"OPENAI_API_KEY\": \"\",\n", + " \"WEAVIATE_API_KEY\": \"\",\n", + " \"WEAVIATE_URL\": \"\",\n", + " \"LANGCHAIN_API_KEY\": \"\",\n", + " \"LANGCHAIN_TRACING_V2\": \"true\",\n", + " \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n", + " \"LANGCHAIN_PROJECT\": \"Weaviate\",\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can alternatively set `OPENAI_API_KEY` in `.env` file and load it. \n", + "\n", + "[Note] This is not necessary if you've already set `OPENAI_API_KEY` in previous steps." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from dotenv import load_dotenv\n", + "\n", + "load_dotenv(override=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Credentials\n", + "\n", + "There are three main ways to connect to Weaviate:\n", + "\n", + "1. **Local Connection**: Connect to a Weaviate instance running locally through Docker\n", + "2. **Weaviate Cloud(WCD)**: Use Weaviate's managed cloud service\n", + "3. **Custom Deployment**: Deploy Weaviate on Kubernetes or other custom configurations\n", + "\n", + "For this notebook, we'll use Weaviate Cloud (WCD) as it provides the easiest way to get started without any local setup.\n", + "\n", + "### Setting up Weaviate Cloud Services\n", + "\n", + "1. First, sign up for a free account at [Weaviate Cloud Console](https://console.weaviate.cloud)\n", + "2. Create a new cluster\n", + "3. Get your API key\n", + "4. Set API key\n", + "5. Connect to your WCD cluster\n", + "\n", + "#### 1. Weaviate Signup\n", + "![Weaviate Cloud Console](./assets/10-weaviate-credentials-01.png)\n", + "\n", + "#### 2. 
Create Cluster\n", + "![Weaviate Cloud Console](./assets/10-weaviate-credentials-02.png)\n", + "![Weaviate Cloud Console](./assets/10-weaviate-credentials-03.png)\n", + "\n", + "#### 3. Get API Key\n", + "**If you are using gRPC, copy the gRPC URL as well.**\n", + "\n", + "![Weaviate Cloud Console](./assets/10-weaviate-credentials-04.png)\n", + "\n", + "#### 4. Set API Key\n", + "```\n", + "WEAVIATE_API_KEY=\"YOUR_WEAVIATE_API_KEY\"\n", + "WEAVIATE_URL=\"YOUR_WEAVIATE_CLUSTER_URL\"\n", + "```\n", + "\n", + "#### 5. Connect to your WCD cluster" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "True\n" + ] + } + ], + "source": [ + "import os\n", + "import weaviate\n", + "from weaviate.classes.init import Auth\n", + "\n", + "weaviate_url = os.environ.get(\"WEAVIATE_URL\")\n", + "weaviate_api_key = os.environ.get(\"WEAVIATE_API_KEY\")\n", + "\n", + "client = weaviate.connect_to_weaviate_cloud(\n", + " cluster_url=weaviate_url,\n", + " auth_credentials=Auth.api_key(weaviate_api_key),\n", + " headers={\"X-Openai-Api-Key\": os.environ.get(\"OPENAI_API_KEY\")},\n", + ")\n", + "\n", + "print(client.is_ready())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# API key lookup\n", + "def get_api_key():\n", + " return weaviate_api_key\n", + "\n", + "\n", + "print(get_api_key())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What is Weaviate?\n", + "\n", + "Weaviate is a powerful open-source vector database that revolutionizes how we store and search data. 
It combines traditional database capabilities with machine learning features. In short:\n", + "\n", + "- Weaviate is an open source [vector database](https://weaviate.io/blog/what-is-a-vector-database).\n", + "- Weaviate allows you to store and retrieve data objects based on their semantic properties by indexing them with [vectors](https://weaviate.io/developers/weaviate/concepts/vector-index).\n", + "- Weaviate can be used stand-alone (aka _bring your vectors_) or with a variety of [modules](https://weaviate.io/developers/weaviate/modules) that can do the vectorization for you and extend the core capabilities.\n", + "- Weaviate has a [GraphQL-API](https://weaviate.io/developers/weaviate/api/graphql) to access your data easily.\n", + "- Weaviate is fast (see the [open source benchmarks](https://weaviate.io/developers/weaviate/benchmarks)).\n", + "\n", + "> 💡 **Key Feature**: Weaviate achieves millisecond-level query performance, making it suitable for production environments.\n", + "\n", + "## Why Use Weaviate?\n", + "\n", + "Weaviate stands out for several reasons:\n", + "\n", + "1. **Versatility**: Supports multiple media types (text, images, etc.)\n", + "2. **Advanced Features**:\n", + " - Semantic Search\n", + " - Question-Answer Extraction\n", + " - Classification\n", + " - Custom ML Model Integration\n", + "3. **Production-Ready**: Built in Go for high performance and scalability\n", + "4. **Developer-Friendly**: Multiple access methods through GraphQL, REST, and various client libraries\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Initialization\n", + "Before initializing our vector store, let's connect to a Weaviate collection. If the collection named `collection_name` doesn't exist yet, we create it in the next step." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creating Collections in Weaviate\n", + "\n", + "The `create_collection` function establishes a new collection in Weaviate, configuring it with specified properties and vector settings. 
The function takes six parameters; only `metric` has a default value:\n", + "\n", + "**Parameters:**\n", + "- `client`: Weaviate client instance for database connection\n", + "- `collection_name`: Unique identifier for your collection\n", + "- `description`: Detailed description of the collection's purpose\n", + "- `properties`: List of property definitions for data schema\n", + "- `vectorizer`: Configuration for vector embedding generation\n", + "- `metric`: Distance metric for similarity calculations (optional, defaults to `cosine`)\n", + "\n", + "**Advanced Configuration Options:**\n", + "- For custom distance metrics: Use a member name of the `VectorDistances` class\n", + "- For alternative vectorization: Use the `Configure.Vectorizer` class\n", + "\n", + "**Example Usage:**\n", + "```python\n", + "properties = [\n", + " Property(name=\"text\", data_type=DataType.TEXT),\n", + " Property(name=\"title\", data_type=DataType.TEXT)\n", + "]\n", + "vectorizer = Configure.Vectorizer.text2vec_openai()\n", + "create_collection(client, \"Documents\", \"Document storage\", properties, vectorizer)\n", + "```\n", + "\n", + "> **Note:** Choose your distance metric and vectorizer carefully as they significantly impact search performance and accuracy." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "from weaviate.classes.config import Property, DataType, Configure, VectorDistances\n", + "from typing import List\n", + "\n", + "\n", + "def create_collection(\n", + " client: weaviate.WeaviateClient,\n", + " collection_name: str,\n", + " description: str,\n", + " properties: List[Property],\n", + " vectorizer: Configure.Vectorizer,\n", + " metric: str = \"cosine\",\n", + ") -> None:\n", + " \"\"\"\n", + " Creates a new index (collection) in Weaviate with the specified properties.\n", + "\n", + " :param client: Weaviate client instance\n", + " :param collection_name: Name of the index (collection) (e.g., \"BookChunk\")\n", + " :param description: Description of the index (e.g., \"A collection for storing book chunks\")\n", + " :param properties: List of Property objects, where each property defines:\n", + " - name (str): Name of the property\n", + " - data_type (DataType): Data type of the property (e.g., DataType.TEXT, DataType.INT)\n", + " - description (str): Description of the property\n", + " :param vectorizer: Vectorizer configuration (e.g., Configure.Vectorizer.text2vec_openai())\n", + " :param metric: Name of a VectorDistances member (e.g., \"cosine\", \"dot\"), case-insensitive\n", + " :return: None\n", + " \"\"\"\n", + " distance_metric = getattr(VectorDistances, metric.upper(), None)\n", + "\n", + " # Use an HNSW vector index with the chosen distance metric\n", + " vector_index_config = Configure.VectorIndex.hnsw(distance_metric=distance_metric)\n", + "\n", + " # Create the collection in Weaviate\n", + " try:\n", + " client.collections.create(\n", + " name=collection_name,\n", + " description=description,\n", + " properties=properties,\n", + " vectorizer_config=vectorizer,\n", + " vector_index_config=vector_index_config,\n", + " )\n", + " print(f\"Collection '{collection_name}' created successfully.\")\n", + " except Exception as e:\n", + " 
print(f\"Failed to create collection '{collection_name}': {e}\")" + ] + }, + { + 
"cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's use the `create_collection` function to create the collection we'll use in this tutorial." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collection 'BookChunk' created successfully.\n" + ] + } + ], + "source": [ + "collection_name = \"BookChunk\" # change if desired\n", + "description = \"A chunk of a book's content\"\n", + "vectorizer = Configure.Vectorizer.text2vec_openai(\n", + " model=\"text-embedding-3-large\"\n", + ") # You can select other vectorizer\n", + "metric = \"dot\" # You can select other distance metric\n", + "properties = [\n", + " Property(\n", + " name=\"text\", data_type=DataType.TEXT, description=\"The content of the text\"\n", + " ),\n", + " Property(\n", + " name=\"order\",\n", + " data_type=DataType.INT,\n", + " description=\"The order of the chunk in the book\",\n", + " ),\n", + " Property(\n", + " name=\"title\", data_type=DataType.TEXT, description=\"The title of the book\"\n", + " ),\n", + " Property(\n", + " name=\"author\", data_type=DataType.TEXT, description=\"The author of the book\"\n", + " ),\n", + " Property(\n", + " name=\"source\", data_type=DataType.TEXT, description=\"The source of the book\"\n", + " ),\n", + "]\n", + "\n", + "create_collection(client, collection_name, description, properties, vectorizer, metric)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Delete Collection\n", + "\n", + "Managing collections in Weaviate includes the ability to remove them when they're no longer needed. 
The `delete_collection` function provides a straightforward way to remove collections from your Weaviate instance.\n", + "\n", + "**Function Parameters:**\n", + "- `client`: Weaviate client instance for database connection\n", + "- `collection_name`: Name of the collection to be deleted\n", + "\n", + "**Advanced Operations:**\n", + "To remove every collection at once, use the `delete_all_collections()` function, which deletes all collections in your Weaviate instance.\n", + "\n", + "> **Important:** Collection deletion is permanent and cannot be undone. Always ensure you have appropriate backups before deleting collections in production environments." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Deleted index: BookChunk\n" + ] + } + ], + "source": [ + "def delete_collection(client, collection_name):\n", + " client.collections.delete(collection_name)\n", + " print(f\"Deleted index: {collection_name}\")\n", + "\n", + "\n", + "def delete_all_collections():\n", + " client.collections.delete_all()\n", + " print(\"Deleted all collections\")\n", + "\n", + "\n", + "# delete_all_collections() # if you want to delete all collections, uncomment this line\n", + "delete_collection(client, collection_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### List Collections\n", + "\n", + "You can list all collections in Weaviate to get an overview of your database schema and configurations. 
The `list_collections` function helps you inspect and manage your Weaviate instance's structure.\n", + "\n", + "**Key Information Returned:**\n", + "- Collection names\n", + "- Collection descriptions\n", + "- Property configurations\n", + "- Data types for each property\n", + "\n", + "> **Note:** This operation is particularly useful for database maintenance, debugging, and documentation purposes.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collections (indexes) in the Weaviate schema:\n", + "- Collection name: LangChain_4c510d6dc12d46069d5b6a74a742c4ff\n", + " Description: No description available\n", + " Properties:\n", + " - Name: text, Type: DataType.TEXT\n", + " - Name: order, Type: DataType.NUMBER\n", + " - Name: source, Type: DataType.TEXT\n", + " - Name: author, Type: DataType.TEXT\n", + " - Name: title, Type: DataType.TEXT\n", + "\n", + "- Collection name: LangChain_25ab58a0f16d476a8d261bd4a11245be\n", + " Description: No description available\n", + " Properties:\n", + " - Name: text, Type: DataType.TEXT\n", + "\n", + "- Collection name: BookChunk\n", + " Description: A chunk of a book's content\n", + " Properties:\n", + " - Name: text, Type: DataType.TEXT\n", + " - Name: order, Type: DataType.INT\n", + " - Name: title, Type: DataType.TEXT\n", + " - Name: author, Type: DataType.TEXT\n", + " - Name: source, Type: DataType.TEXT\n", + "\n", + "- Collection name: LangChain_e63c8e8a49cc4915995dae2fcdf1aef1\n", + " Description: No description available\n", + " Properties:\n", + " - Name: text, Type: DataType.TEXT\n", + " - Name: order, Type: DataType.NUMBER\n", + " - Name: source, Type: DataType.TEXT\n", + " - Name: author, Type: DataType.TEXT\n", + " - Name: title, Type: DataType.TEXT\n", + "\n", + "- Collection name: LangChain_a6190f02a2f64ff4aca85e3c24f8e8cb\n", + " Description: No description available\n", + " Properties:\n", + " - Name: 
text, Type: DataType.TEXT\n", + "\n", + "- Collection name: LangChain_be71f63889d74d09b2ade15d384ec210\n", + " Description: No description available\n", + " Properties:\n", + " - Name: text, Type: DataType.TEXT\n", + " - Name: source, Type: DataType.TEXT\n", + " - Name: author, Type: DataType.TEXT\n", + " - Name: title, Type: DataType.TEXT\n", + " - Name: order, Type: DataType.NUMBER\n", + "\n", + "- Collection name: LangChain_bd62d989508f479a8ab02fcc3190010e\n", + " Description: No description available\n", + " Properties:\n", + " - Name: text, Type: DataType.TEXT\n", + " - Name: order, Type: DataType.NUMBER\n", + " - Name: source, Type: DataType.TEXT\n", + " - Name: author, Type: DataType.TEXT\n", + " - Name: title, Type: DataType.TEXT\n", + "\n", + "- Collection name: LangChain_0a18b4c9d03f4f3d8ab2e7a6258d9a2c\n", + " Description: No description available\n", + " Properties:\n", + " - Name: text, Type: DataType.TEXT\n", + " - Name: order, Type: DataType.NUMBER\n", + " - Name: source, Type: DataType.TEXT\n", + " - Name: author, Type: DataType.TEXT\n", + " - Name: title, Type: DataType.TEXT\n", + "\n", + "- Collection name: LangChain_7ead0866ef9f4e3eb559142c74f79446\n", + " Description: No description available\n", + " Properties:\n", + " - Name: text, Type: DataType.TEXT\n", + "\n" + ] + } + ], + "source": [ + "def list_collections():\n", + " \"\"\"\n", + " Lists all collections (indexes) in the Weaviate database, including their properties.\n", + " \"\"\"\n", + " # Retrieve all collection configurations\n", + " collections = client.collections.list_all()\n", + "\n", + " # Check if there are any collections\n", + " if collections:\n", + " print(\"Collections (indexes) in the Weaviate schema:\")\n", + " for name, config in collections.items():\n", + " print(f\"- Collection name: {name}\")\n", + " print(\n", + " f\" Description: {config.description if config.description else 'No description available'}\"\n", + " )\n", + " print(f\" Properties:\")\n", + " for prop 
in config.properties:\n", + " print(f\" - Name: {prop.name}, Type: {prop.data_type}\")\n", + " print()\n", + " else:\n", + " print(\"No collections found in the schema.\")\n", + "\n", + "\n", + "list_collections()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "def lookup_collection(collection_name: str):\n", + " return client.collections.get(collection_name)\n", + "\n", + "\n", + "print(lookup_collection(collection_name))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Preprocessing\n", + "\n", + "Before storing documents in Weaviate, it's essential to preprocess them into manageable chunks. This section demonstrates how to effectively prepare your documents using the `RecursiveCharacterTextSplitter` for optimal vector storage and retrieval.\n", + "\n", + "**Key Preprocessing Steps:**\n", + "- Text chunking for better semantic representation\n", + "- Metadata assignment for enhanced searchability\n", + "- Document structure optimization\n", + "- Batch preparation for efficient storage\n", + "\n", + "> **Note:** While this example uses `RecursiveCharacterTextSplitter`, choose your text splitter based on your specific content type and requirements. The chunk size and overlap parameters significantly impact search quality and performance." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "# This is a long document we can split up.\n", + "with open(\"./data/the_little_prince.txt\", encoding=\"utf-8\") as f:\n", + " raw_text = f.read()" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[Document(metadata={}, page_content='The Little Prince\\nWritten By Antoine de Saiot-Exupery (1900〜1944)'), Document(metadata={}, page_content='[ Antoine de Saiot-Exupery ]'), Document(metadata={}, page_content='Over the past century, the thrill of flying has inspired some to perform remarkable feats of daring. For others, their desire to soar into the skies led to dramatic leaps in technology. For Antoine'), Document(metadata={}, page_content='in technology. For Antoine de Saint-Exupéry, his love of aviation inspired stories, which have touched the hearts of millions around the world.'), Document(metadata={}, page_content='Born in 1900 in Lyons, France, young Antoine was filled with a passion for adventure. When he failed an entrance exam for the Naval Academy, his interest in aviation took hold. He joined the French'), Document(metadata={}, page_content='hold. He joined the French Army Air Force in 1921 where he first learned to fly a plane. Five years later, he would leave the military in order to begin flying air mail between remote settlements in'), Document(metadata={}, page_content='between remote settlements in the Sahara desert.'), Document(metadata={}, page_content=\"For Saint-Exupéry, it was a grand adventure - one with dangers lurking at every corner. Flying his open cockpit biplane, Saint-Exupéry had to fight the desert's swirling sandstorms. Worse, still, he\"), Document(metadata={}, page_content=\"sandstorms. Worse, still, he ran the risk of being shot at by unfriendly tribesmen below. Saint-Exupéry couldn't have been more thrilled. 
Soaring across the Sahara inspired him to spend his nights\"), Document(metadata={}, page_content='him to spend his nights writing about his love affair with flying.'), Document(metadata={}, page_content='When World War II broke out, Saint-Exupéry rejoined the French Air Force. After Nazi troops overtook France in 1940, Saint-Exupéry fled to the United States. He had hoped to join the U. S. war effort'), Document(metadata={}, page_content='to join the U. S. war effort as a fighter pilot, but was dismissed because of his age. To console himself, he drew upon his experiences over the Saharan desert to write and illustrate what would'), Document(metadata={}, page_content='and illustrate what would become his most famous book, The Little Prince (1943). Mystical and enchanting, this small book has fascinated both children and adults for decades. In the book, a pilot is'), Document(metadata={}, page_content='In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little prince'), Document(metadata={}, page_content='the book, the little prince discovers the true meaning of life. At the end of his conversation with the Little Prince, the aviator manages to fix his plane and both he and the little prince continue'), Document(metadata={}, page_content='the little prince continue on their journeys'), Document(metadata={}, page_content='Shortly after completing the book, Saint-Exupéry finally got his wish. He returned to North Africa to fly a warplane for his country. On July 31, 1944, Saint-Exupéry took off on a mission. Sadly, he'), Document(metadata={}, page_content='off on a mission. Sadly, he was never heard from again.'), Document(metadata={}, page_content='[ TO LEON WERTH ]'), Document(metadata={}, page_content='I ask the indulgence of the children who may read this book for dedicating it to a grown-up. 
I have a serious reason: he is the best friend I have in the world. I have another reason: this grown-up')]\n" + ] + } + ], + "source": [ + "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", + "\n", + "text_splitter = RecursiveCharacterTextSplitter(\n", + " # Set a really small chunk size, just to show.\n", + " chunk_size=200,\n", + " chunk_overlap=30,\n", + " length_function=len,\n", + " is_separator_regex=False,\n", + ")\n", + "\n", + "split_docs = text_splitter.create_documents([raw_text])\n", + "\n", + "print(split_docs[:20])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Document Preprocessing Function\n", + "\n", + "The `preprocess_documents` function transforms pre-split documents into a format suitable for Weaviate storage. This utility function handles both document content and metadata, ensuring proper organization of your data.\n", + "\n", + "**Function Parameters:**\n", + "- `split_docs`: List of LangChain Document objects containing page content and metadata\n", + "- `metadata`: Optional dictionary of additional metadata to include with each chunk\n", + "\n", + "**Processing Steps:**\n", + "- Iterates through Document objects\n", + "- Assigns sequential order numbers\n", + "- Combines document metadata with additional metadata\n", + "- Formats data for Weaviate ingestion\n", + "\n", + "> **Best Practice:** When preprocessing documents, always maintain consistent metadata structure across your collection. This ensures efficient querying and filtering capabilities later." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[{'text': 'The Little Prince\\nWritten By Antoine de Saiot-Exupery (1900〜1944)',\n", + " 'order': 1,\n", + " 'title': 'The Little Prince',\n", + " 'author': 'Antoine de Saint-Exupéry',\n", + " 'source': 'Original Text'},\n", + " {'text': '[ Antoine de Saiot-Exupery ]',\n", + " 'order': 2,\n", + " 'title': 'The Little Prince',\n", + " 'author': 'Antoine de Saint-Exupéry',\n", + " 'source': 'Original Text'},\n", + " {'text': 'Over the past century, the thrill of flying has inspired some to perform remarkable feats of daring. For others, their desire to soar into the skies led to dramatic leaps in technology. For Antoine',\n", + " 'order': 3,\n", + " 'title': 'The Little Prince',\n", + " 'author': 'Antoine de Saint-Exupéry',\n", + " 'source': 'Original Text'},\n", + " {'text': 'in technology. For Antoine de Saint-Exupéry, his love of aviation inspired stories, which have touched the hearts of millions around the world.',\n", + " 'order': 4,\n", + " 'title': 'The Little Prince',\n", + " 'author': 'Antoine de Saint-Exupéry',\n", + " 'source': 'Original Text'},\n", + " {'text': 'Born in 1900 in Lyons, France, young Antoine was filled with a passion for adventure. When he failed an entrance exam for the Naval Academy, his interest in aviation took hold. He joined the French',\n", + " 'order': 5,\n", + " 'title': 'The Little Prince',\n", + " 'author': 'Antoine de Saint-Exupéry',\n", + " 'source': 'Original Text'},\n", + " {'text': 'hold. He joined the French Army Air Force in 1921 where he first learned to fly a plane. 
Five years later, he would leave the military in order to begin flying air mail between remote settlements in',\n", + " 'order': 6,\n", + " 'title': 'The Little Prince',\n", + " 'author': 'Antoine de Saint-Exupéry',\n", + " 'source': 'Original Text'},\n", + " {'text': 'between remote settlements in the Sahara desert.',\n", + " 'order': 7,\n", + " 'title': 'The Little Prince',\n", + " 'author': 'Antoine de Saint-Exupéry',\n", + " 'source': 'Original Text'},\n", + " {'text': \"For Saint-Exupéry, it was a grand adventure - one with dangers lurking at every corner. Flying his open cockpit biplane, Saint-Exupéry had to fight the desert's swirling sandstorms. Worse, still, he\",\n", + " 'order': 8,\n", + " 'title': 'The Little Prince',\n", + " 'author': 'Antoine de Saint-Exupéry',\n", + " 'source': 'Original Text'},\n", + " {'text': \"sandstorms. Worse, still, he ran the risk of being shot at by unfriendly tribesmen below. Saint-Exupéry couldn't have been more thrilled. 
Soaring across the Sahara inspired him to spend his nights\",\n", + " 'order': 9,\n", + " 'title': 'The Little Prince',\n", + " 'author': 'Antoine de Saint-Exupéry',\n", + " 'source': 'Original Text'},\n", + " {'text': 'him to spend his nights writing about his love affair with flying.',\n", + " 'order': 10,\n", + " 'title': 'The Little Prince',\n", + " 'author': 'Antoine de Saint-Exupéry',\n", + " 'source': 'Original Text'}]" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from typing import Dict, List, Optional\n", + "from langchain_core.documents import Document\n", + "\n", + "\n", + "def preprocess_documents(\n", + " split_docs: List[Document], metadata: Optional[Dict[str, str]] = None\n", + ") -> List[Dict[str, object]]:\n", + " \"\"\"\n", + " Processes a list of pre-split documents into a format suitable for storing in Weaviate.\n", + "\n", + " :param split_docs: List of LangChain Document objects (each containing page_content and 
metadata).\n", + " :param metadata: Additional metadata to include in each chunk (e.g., title, source).\n", + " :return: A list of flat dictionaries, each representing a chunk in the format:\n", + " {'text': ..., 'order': ..., ...metadata}\n", + " \"\"\"\n", + " processed_chunks = []\n", + "\n", + " # Iterate over Document objects, numbering chunks from 1\n", + " for idx, doc in enumerate(split_docs, start=1):\n", + " # Start from the chunk text and its position in the document\n", + " chunk_data = {\"text\": doc.page_content, \"order\": idx}\n", + " # Add the shared metadata first, then the Document's own metadata,\n", + " # so values from the Document take precedence on key collisions\n", + " if metadata:\n", + " chunk_data.update(metadata)\n", + " if doc.metadata:\n", + " chunk_data.update(doc.metadata)\n", + "\n", + " # Collect the flat property dictionary for this chunk\n", + " processed_chunks.append(chunk_data)\n", + "\n", + " return processed_chunks\n", + "\n", + "\n", + "metadata = {\n", + " \"title\": \"The Little Prince\",\n", + " \"author\": \"Antoine de Saint-Exupéry\",\n", + " \"source\": \"Original Text\",\n", + "}\n", + "\n", + "processed_chunks = preprocess_documents(split_docs, metadata=metadata)\n", + "\n", + "processed_chunks[:10]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Manage vector store\n", + "Once you have created your vector store, you can interact with it by adding and deleting items." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Add items to vector store\n", + "\n", + "Weaviate provides flexible methods for adding documents to your vector store.
This section explores two efficient approaches: standard insertion and parallel batch processing, each optimized for different use cases.\n", + "\n", + "#### Standard Insertion\n", + "Best for smaller datasets or when processing order is important:\n", + "- Sequential document processing\n", + "- Automatic UUID generation\n", + "- Built-in duplicate handling\n", + "- Real-time progress tracking\n", + "\n", + "#### Parallel Batch Processing\n", + "Optimized for large-scale document ingestion:\n", + "- Multi-threaded processing\n", + "- Configurable batch sizes\n", + "- Concurrent execution\n", + "- Enhanced throughput\n", + "\n", + "**Configuration Options:**\n", + "- `batch_size`: Control memory usage and processing chunks\n", + "- `max_workers`: Adjust concurrent processing threads\n", + "- `unique_key`: Define document identification field\n", + "- `show_progress`: Monitor ingestion progress\n", + "\n", + "**Performance Tips:**\n", + "- For datasets < 1000 documents: Use standard insertion\n", + "- For datasets > 1000 documents: Consider parallel processing\n", + "- Monitor memory usage when increasing batch size\n", + "- Adjust worker count based on available CPU cores\n", + "\n", + "> **Best Practice:** Choose your ingestion method based on dataset size and system resources. Start with conservative batch sizes and gradually optimize based on performance metrics." 
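
As a rough, library-independent illustration of the options listed above, here is a minimal sketch of batched, multi-threaded ingestion with deterministic IDs. It uses only the standard library and rests on several assumptions: the dictionary `store` is a hypothetical in-memory stand-in for a Weaviate collection, `make_id` uses `uuid.uuid5` as a substitute for Weaviate's own ID helper, and the function names are illustrative rather than part of any API.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def make_id(collection: str, doc: Dict, unique_key: str = "order") -> str:
    # Deterministic ID: the same collection name and key value always give the same UUID,
    # so re-ingesting a document overwrites it instead of creating a duplicate.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{collection}:{doc[unique_key]}"))


def chunked(items: List[Dict], batch_size: int) -> List[List[Dict]]:
    # Split the input into consecutive batches of at most batch_size items.
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


def ingest(
    docs: List[Dict],
    collection: str = "demo",
    unique_key: str = "order",
    batch_size: int = 100,
    max_workers: int = 4,
) -> Tuple[List[str], Dict[str, Dict]]:
    store: Dict[str, Dict] = {}  # hypothetical stand-in for the vector store

    def process(batch: List[Dict]) -> List[str]:
        ids = []
        for doc in batch:
            doc_id = make_id(collection, doc, unique_key)
            store[doc_id] = doc  # insert, or overwrite an existing entry
            ids.append(doc_id)
        return ids

    # Each batch is handled by one worker thread; map() keeps the batch order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_batch = list(pool.map(process, chunked(docs, batch_size)))

    return [doc_id for ids in per_batch for doc_id in ids], store


docs = [{"text": f"chunk {n}", "order": n} for n in range(1, 251)]
ids, store = ingest(docs, batch_size=100, max_workers=4)
print(len(ids), len(store))
```

Because the IDs depend only on the collection name and the `unique_key` value, running the ingestion twice yields the same IDs, which is what makes an upsert possible.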
+ ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_weaviate import WeaviateVectorStore\n", + "from langchain_openai import OpenAIEmbeddings\n", + "\n", + "embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "\n", + "vector_store = WeaviateVectorStore(\n", + " client=client, index_name=collection_name, embedding=embeddings, text_key=\"text\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Processed batch 1/7\n", + "Processed batch 2/7\n", + "Processed batch 3/7\n", + "Processed batch 4/7\n", + "Processed batch 5/7\n", + "Processed batch 6/7\n", + "Processed batch 7/7\n", + "\n", + "Processing complete\n", + "Number of successfully processed documents: 698\n", + "Total elapsed time: 316.36 seconds\n" + ] + } + ], + "source": [ + "from weaviate.util import generate_uuid5\n", + "import time\n", + "\n", + "\n", + "def upsert_documents(\n", + " vector_store: WeaviateVectorStore,\n", + " docs: List[Dict],\n", + " unique_key: str = \"order\",\n", + " batch_size: int = 100,\n", + " show_progress: bool = True,\n", + ") -> List[str]:\n", + " \"\"\"\n", + " Upserts documents into the WeaviateVectorStore.\n", + " \"\"\"\n", + " # Prepare Document objects and IDs\n", + " documents = []\n", + " ids = []\n", + "\n", + " for doc in docs:\n", + " unique_value = str(doc[unique_key])\n", + " doc_id = generate_uuid5(vector_store._index_name, unique_value)\n", + "\n", + " documents.append(\n", + " Document(\n", + " page_content=doc[\"text\"],\n", + " metadata={k: v for k, v in doc.items() if k != \"text\"},\n", + " )\n", + " )\n", + " ids.append(doc_id)\n", + "\n", + " # Generate embeddings\n", + " texts = [doc.page_content for doc in documents]\n", + " metadatas = [doc.metadata for doc in documents]\n", + " embeddings = vector_store.embeddings.embed_documents(texts)\n", + 
"\n", + " # Get the collection\n", + " collection = vector_store._client.collections.get(vector_store._index_name)\n", + " successful_ids = []\n", + "\n", + " try:\n", + " for i in range(0, len(texts), batch_size):\n", + " batch_texts = texts[i : i + batch_size]\n", + " batch_embeddings = embeddings[i : i + batch_size]\n", + " batch_ids = ids[i : i + batch_size]\n", + " batch_metadatas = metadatas[i : i + batch_size] if metadatas else None\n", + "\n", + " for j, text in enumerate(batch_texts):\n", + " properties = {\"text\": text}\n", + " if batch_metadatas:\n", + " properties.update(batch_metadatas[j])\n", + "\n", + " try:\n", + " # First, check if the object exists\n", + " exists = collection.data.exists(uuid=batch_ids[j])\n", + "\n", + " if exists:\n", + " # If the object exists, update it\n", + " collection.data.replace(\n", + " uuid=batch_ids[j],\n", + " properties=properties,\n", + " vector=batch_embeddings[j],\n", + " )\n", + " else:\n", + " # If the object does not exist, insert it\n", + " collection.data.insert(\n", + " uuid=batch_ids[j],\n", + " properties=properties,\n", + " vector=batch_embeddings[j],\n", + " )\n", + " successful_ids.append(batch_ids[j])\n", + "\n", + " except Exception as e:\n", + " print(f\"Error processing document (ID: {batch_ids[j]}): {e}\")\n", + " continue\n", + "\n", + " if show_progress:\n", + " print(\n", + " f\"Processed batch {i//batch_size + 1}/{(len(texts)-1)//batch_size + 1}\"\n", + " )\n", + "\n", + " except Exception as e:\n", + " print(f\"Error during batch processing: {e}\")\n", + "\n", + " return successful_ids\n", + "\n", + "\n", + "start_time = time.time()\n", + "\n", + "# Example usage\n", + "results = upsert_documents(\n", + " vector_store=vector_store,\n", + " docs=processed_chunks,\n", + " unique_key=\"order\",\n", + " batch_size=100,\n", + " show_progress=True,\n", + ")\n", + "\n", + "end_time = time.time()\n", + "print(f\"\\nProcessing complete\")\n", + "print(f\"Number of successfully processed documents: 
{len(results)}\")\n", + "print(f\"Total elapsed time: {end_time - start_time:.2f} seconds\")" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Processing batches: 100%|██████████| 7/7 [01:31<00:00, 13.02s/it]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Processing complete\n", + "Number of successfully processed documents: 698\n", + "Total elapsed time: 94.17 seconds\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "from typing import List, Dict, Optional\n", + "from concurrent.futures import ThreadPoolExecutor, as_completed\n", + "from tqdm import tqdm\n", + "import time\n", + "\n", + "\n", + "def upsert_documents_parallel(\n", + " vector_store: WeaviateVectorStore,\n", + " docs: List[Dict],\n", + " unique_key: str = \"order\",\n", + " batch_size: int = 100,\n", + " max_workers: Optional[int] = 4,\n", + " show_progress: bool = True,\n", + ") -> List[str]:\n", + " \"\"\"\n", + " Upserts documents in parallel to WeaviateVectorStore.\n", + "\n", + " Args:\n", + " vector_store: WeaviateVectorStore instance\n", + " docs: List of documents to upsert\n", + " unique_key: Key to use as the unique identifier\n", + " batch_size: Size of each batch\n", + " max_workers: Maximum number of workers\n", + " show_progress: Whether to show progress\n", + " Returns:\n", + " List[str]: List of IDs of successfully processed documents\n", + " \"\"\"\n", + "\n", + " # Divide data into batches\n", + " def create_batches(data: List, size: int) -> List[List]:\n", + " return [data[i : i + size] for i in range(0, len(data), size)]\n", + "\n", + " batched_docs = create_batches(docs, batch_size)\n", + "\n", + " def process_batch(batch: List[Dict]) -> List[str]:\n", + " try:\n", + " return upsert_documents(\n", + " vector_store=vector_store,\n", + " docs=batch,\n", + " 
unique_key=unique_key,\n", + " batch_size=len(batch),\n", + " show_progress=False, # Do not show progress for individual batches\n", + " )\n", + " except Exception as e:\n", + " print(f\"Error processing batch: {e}\")\n", + " return []\n", + "\n", + " successful_ids = []\n", + "\n", + " with ThreadPoolExecutor(max_workers=max_workers) as executor:\n", + " futures = {\n", + " executor.submit(process_batch, batch): i\n", + " for i, batch in enumerate(batched_docs)\n", + " }\n", + "\n", + " if show_progress:\n", + " with tqdm(total=len(batched_docs), desc=\"Processing batches\") as pbar:\n", + " for future in as_completed(futures):\n", + " batch_result = future.result()\n", + " successful_ids.extend(batch_result)\n", + " pbar.update(1)\n", + " else:\n", + " for future in as_completed(futures):\n", + " batch_result = future.result()\n", + " successful_ids.extend(batch_result)\n", + "\n", + " return successful_ids\n", + "\n", + "\n", + "# Example usage\n", + "start_time = time.time()\n", + "\n", + "results = upsert_documents_parallel(\n", + " vector_store=vector_store,\n", + " docs=processed_chunks,\n", + " unique_key=\"order\",\n", + " batch_size=100, # Set batch size\n", + " max_workers=4, # Set maximum number of workers\n", + " show_progress=True,\n", + ")\n", + "\n", + "end_time = time.time()\n", + "print(f\"\\nProcessing complete\")\n", + "print(f\"Number of successfully processed documents: {len(results)}\")\n", + "print(f\"Total elapsed time: {end_time - start_time:.2f} seconds\")" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_weaviate import WeaviateVectorStore\n", + "from langchain.chains.qa_with_sources.retrieval import RetrievalQAWithSourcesChain\n", + "from langchain_core.retrievers import BaseRetriever\n", + "from langchain_core.language_models import BaseChatModel\n", + "from weaviate.collections.classes.filters import Filter\n", + "from typing import Any, List, Dict, 
Optional, Union, Tuple\n", + "from langchain_core.documents import Document\n", + "\n", + "\n", + "class WeaviateSearch:\n", + " def __init__(self, vector_store: WeaviateVectorStore):\n", + " \"\"\"\n", + " Initialize Weaviate search class\n", + " \"\"\"\n", + " self.vector_store = vector_store\n", + " # Note: relies on private attributes of WeaviateVectorStore\n", + " self.collection = vector_store._client.collections.get(vector_store._index_name)\n", + " self.text_key = vector_store._text_key\n", + "\n", + " def _format_filter(self, filter_query: Optional[Filter]) -> str:\n", + " \"\"\"\n", + " Converts a Filter object to a readable string.\n", + "\n", + " Args:\n", + " filter_query: Weaviate Filter object\n", + "\n", + " Returns:\n", + " str: Filter description string\n", + " \"\"\"\n", + " if not filter_query:\n", + " return \"No filter\"\n", + "\n", + " try:\n", + " # Converts the internal structure of the Filter object to a string\n", + " if hasattr(filter_query, \"filters\"): # Composite filter (AND/OR)\n", + " operator = \"AND\" if filter_query.operator == \"And\" else \"OR\"\n", + " filter_strs = []\n", + " for f in filter_query.filters:\n", + " if hasattr(f, \"value\"): # Single filter\n", + " filter_strs.append(\n", + " f\"({f.target} {f.operator.lower()} {f.value})\"\n", + " )\n", + " return f\" {operator} \".join(filter_strs)\n", + " elif hasattr(filter_query, \"value\"): # Single filter\n", + " return f\"{filter_query.target} {filter_query.operator.lower()} {filter_query.value}\"\n", + " else:\n", + " return str(filter_query)\n", + " except Exception:\n", + " return \"Complex filter\"\n", + "\n", + " def similarity_search(\n", + " self,\n", + " query: str,\n", + " filter_query: Optional[Filter] = None,\n", + " k: int = 3,\n", + " **kwargs: Any,\n", + " ) -> List[Document]:\n", + " \"\"\"\n", + " Perform basic similarity search\n", + " \"\"\"\n", + " documents = self.vector_store.similarity_search(\n", + " query, k=k, filters=filter_query, **kwargs\n", + " )\n", + " return
documents\n", + "\n", + " def similarity_search_with_score(\n", + " self,\n", + " query: str,\n", + " filter_query: Optional[Filter] = None,\n", + " k: int = 3,\n", + " **kwargs: Any,\n", + " ) -> List[Tuple[Document, float]]:\n", + " \"\"\"\n", + " Perform similarity search with score\n", + " \"\"\"\n", + " documents_and_scores = self.vector_store.similarity_search_with_score(\n", + " query, k=k, filters=filter_query, **kwargs\n", + " )\n", + " return documents_and_scores\n", + "\n", + " def mmr_search(\n", + " self,\n", + " query: str,\n", + " filter_query: Optional[Filter] = None,\n", + " k: int = 3,\n", + " fetch_k: int = 10,\n", + " **kwargs: Any,\n", + " ) -> List[Document]:\n", + " \"\"\"\n", + " Perform MMR algorithm-based diverse search\n", + " \"\"\"\n", + " documents = self.vector_store.max_marginal_relevance_search(\n", + " query=query, k=k, fetch_k=fetch_k, filters=filter_query, **kwargs\n", + " )\n", + " return documents\n", + "\n", + " def hybrid_search(\n", + " self,\n", + " query: str,\n", + " filter_query: Optional[Filter] = None,\n", + " alpha: float = 0.5,\n", + " limit: int = 3,\n", + " **kwargs: Any,\n", + " ) -> List[Document]:\n", + " \"\"\"\n", + " Hybrid search (keyword + vector search)\n", + "\n", + " Args:\n", + " query: Text to search\n", + " filter_query: Weaviate Filter object to apply\n", + " alpha: Weight for keyword and vector search (0: keyword only, 1: vector only)\n", + " limit: Number of documents to return\n", + " **kwargs: Additional arguments passed to the hybrid query\n", + "\n", + " Returns:\n", + " List of Documents returned by the hybrid search\n", + " \"\"\"\n", + " embedding_vector = self.vector_store.embeddings.embed_query(query)\n", + " results = self.collection.query.hybrid(\n", + " query=query,\n", + " vector=embedding_vector,\n", + " alpha=alpha,\n", + " limit=limit,\n", + " filters=filter_query,\n", + " **kwargs,\n", + " )\n", + "\n", + " documents = []\n", + " for obj in results.objects:\n", + " metadata = {\n", + " key: value\n", + " for key, value in obj.properties.items()\n", +
if key != self.text_key\n", + " }\n", + " metadata[\"uuid\"] = str(obj.uuid)\n", + "\n", + " # Only record the score when the query actually returned one\n", + " if getattr(obj.metadata, \"score\", None) is not None:\n", + " metadata[\"score\"] = obj.metadata.score\n", + "\n", + " doc = Document(\n", + " page_content=obj.properties.get(self.text_key, str(obj.properties)),\n", + " metadata=metadata,\n", + " )\n", + "\n", + " documents.append(doc)\n", + "\n", + " return documents\n", + "\n", + " def semantic_search(\n", + " self,\n", + " query: str,\n", + " filter_query: Optional[Filter] = None,\n", + " limit: int = 3,\n", + " **kwargs: Any,\n", + " ) -> List[Document]:\n", + " \"\"\"\n", + " Semantic search (vector-based).\n", + " near_text requires a vectorizer to be configured on the collection.\n", + " \"\"\"\n", + " results = self.collection.query.near_text(\n", + " query=query, limit=limit, filters=filter_query, **kwargs\n", + " )\n", + "\n", + " documents = []\n", + " for obj in results.objects:\n", + " metadata = {\n", + " key: value\n", + " for key, value in obj.properties.items()\n", + " if key != self.text_key\n", + " }\n", + " metadata[\"uuid\"] = str(obj.uuid)\n", + " documents.append(\n", + " Document(\n", + " page_content=obj.properties.get(self.text_key, str(obj.properties)),\n", + " metadata=metadata,\n", + " )\n", + " )\n", + "\n", + " return documents\n", + "\n", + " def keyword_search(\n", + " self,\n", + " query: str,\n", + " filter_query: Optional[Filter] = None,\n", + " limit: int = 3,\n", + " **kwargs: Any,\n", + " ) -> List[Document]:\n", + " \"\"\"\n", + " Keyword-based search (BM25)\n", + " \"\"\"\n", + " results = self.collection.query.bm25(\n", + " query=query, limit=limit, filters=filter_query, **kwargs\n", + " )\n", + "\n", + " documents = []\n", + " for obj in results.objects:\n", + " metadata = {\n", + " key: value\n", + " for key, value in obj.properties.items()\n", + " if key != self.text_key\n", + " }\n", + " metadata[\"uuid\"] = str(obj.uuid)\n", + " documents.append(\n", + " Document(\n", + " page_content=obj.properties.get(self.text_key, str(obj.properties)),\n", + " metadata=metadata,\n", +
)\n", + " )\n", + "\n", + " return documents\n", + "\n", + " def create_qa_chain(\n", + " self,\n", + " llm: Optional[BaseChatModel] = None,\n", + " chain_type: str = \"stuff\",\n", + " retriever: Optional[BaseRetriever] = None,\n", + " **kwargs: Any,\n", + " ):\n", + " \"\"\"\n", + " Create search-QA chain\n", + " \"\"\"\n", + " qa_chain = RetrievalQAWithSourcesChain.from_chain_type(\n", + " llm=llm,\n", + " chain_type=chain_type,\n", + " retriever=retriever,\n", + " **kwargs,\n", + " )\n", + " return qa_chain\n", + "\n", + " def print_results(\n", + " self,\n", + " results: Union[List[Document], List[Tuple[Document, float]]],\n", + " search_type: str,\n", + " filter_query: Optional[Filter] = None,\n", + " ) -> None:\n", + " \"\"\"\n", + " Print search results in a readable format\n", + "\n", + " Args:\n", + " results: List of Document or (Document, score) tuples\n", + " search_type: Search type (e.g., \"Hybrid\", \"Semantic\")\n", + " filter_query: Filter that was applied to the search\n", + " \"\"\"\n", + " print(f\"\\n=== {search_type.upper()} SEARCH RESULTS ===\")\n", + " if filter_query:\n", + " print(f\"Filter: {self._format_filter(filter_query)}\")\n", + "\n", + " for i, result in enumerate(results, 1):\n", + " print(f\"\\nResult {i}:\")\n", + "\n", + " # Separate Document object and score\n", + " if isinstance(result, tuple):\n", + " doc, score = result\n", + " print(f\"Score: {score:.4f}\")\n", + " else:\n", + " doc = result\n", + "\n", + " # Print content\n", + " print(f\"Content: {doc.page_content}\")\n", + "\n", + " # Print metadata\n", + " if doc.metadata:\n", + " print(\"\\nMetadata:\")\n", + " for key, value in doc.metadata.items():\n", + " if (\n", + " key != \"score\" and key != \"uuid\"\n", + " ): # Skip internal fields (score, uuid)\n", + " print(f\" {key}: {value}\")\n", + "\n", + " print(\"-\" * 50)\n", + "\n", + " def print_search_comparison(\n", + " self,\n", + " query: str,\n", + " filter_query: Optional[Filter] = None,\n", + " limit: int = 5,\n", + " alpha: float =
0.5,\n", + " fetch_k: int = 10,\n", + " **kwargs: Any,\n", + " ) -> None:\n", + " \"\"\"\n", + " Print comparison of all search methods' results\n", + "\n", + " Args:\n", + " query: Search query\n", + " filter_query: Weaviate Filter applied to every search method\n", + " limit: Number of results per search method\n", + " alpha: Weight for hybrid search (0: keyword only, 1: vector only)\n", + " fetch_k: Number of candidate documents for MMR search\n", + " **kwargs: Additional search parameters\n", + " \"\"\"\n", + " search_methods = [\n", + " # 1. Basic similarity search\n", + " {\n", + " \"name\": \"Similarity Search\",\n", + " \"method\": self.similarity_search,\n", + " \"params\": {\"k\": limit},\n", + " },\n", + " # 2. Similarity search with score\n", + " {\n", + " \"name\": \"Similarity Search with Score\",\n", + " \"method\": self.similarity_search_with_score,\n", + " \"params\": {\"k\": limit},\n", + " },\n", + " # 3. MMR search\n", + " {\n", + " \"name\": \"MMR Search\",\n", + " \"method\": self.mmr_search,\n", + " \"params\": {\"k\": limit, \"fetch_k\": fetch_k},\n", + " },\n", + " # 4. Hybrid search\n", + " {\n", + " \"name\": \"Hybrid Search\",\n", + " \"method\": self.hybrid_search,\n", + " \"params\": {\"limit\": limit, \"alpha\": alpha},\n", + " },\n", + " # 5. Semantic search\n", + " {\n", + " \"name\": \"Semantic Search\",\n", + " \"method\": self.semantic_search,\n", + " \"params\": {\"limit\": limit},\n", + " },\n", + " # 6.
Keyword search\n", + " {\n", + " \"name\": \"Keyword Search\",\n", + " \"method\": self.keyword_search,\n", + " \"params\": {\"limit\": limit},\n", + " },\n", + " ]\n", + "\n", + " print(\"\\n=== SEARCH METHODS COMPARISON ===\")\n", + " print(f\"Query: {query}\")\n", + " if filter_query:\n", + " print(f\"Filter: {self._format_filter(filter_query)}\")\n", + " print(\"=\" * 50)\n", + "\n", + " for search_config in search_methods:\n", + " try:\n", + " method_params = {\n", + " **search_config[\"params\"],\n", + " \"query\": query,\n", + " \"filter_query\": filter_query,\n", + " **kwargs,\n", + " }\n", + "\n", + " results = search_config[\"method\"](**method_params)\n", + "\n", + " print(f\"\\n>>> {search_config['name'].upper()} <<<\")\n", + " self.print_results(results, search_config[\"name\"], filter_query)\n", + "\n", + " except Exception as e:\n", + " print(f\"\\nError in {search_config['name']}: {str(e)}\")\n", + "\n", + " print(\"\\n\" + \"=\" * 50)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "=== SEARCH METHODS COMPARISON ===\n", + "Query: What is the little prince about?\n", + "Filter: author equal Antoine de Saint-Exupéry\n", + "==================================================\n", + "\n", + ">>> SIMILARITY SEARCH <<<\n", + "\n", + "=== SIMILARITY SEARCH SEARCH RESULTS ===\n", + "Filter: author equal Antoine de Saint-Exupéry\n", + "\n", + "Result 1:\n", + "Content: In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life.
In the book, the little prince\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " author: Antoine de Saint-Exupéry\n", + " source: Original Text\n", + " order: 14\n", + "--------------------------------------------------\n", + "\n", + "Result 2:\n", + "Content: and illustrate what would become his most famous book, The Little Prince (1943). Mystical and enchanting, this small book has fascinated both children and adults for decades. In the book, a pilot is\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " order: 13\n", + " source: Original Text\n", + " author: Antoine de Saint-Exupéry\n", + "--------------------------------------------------\n", + "\n", + "Result 3:\n", + "Content: The Little Prince\n", + "Written By Antoine de Saiot-Exupery (1900〜1944)\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " author: Antoine de Saint-Exupéry\n", + " source: Original Text\n", + " order: 1\n", + "--------------------------------------------------\n", + "\n", + "==================================================\n", + "\n", + ">>> SIMILARITY SEARCH WITH SCORE <<<\n", + "\n", + "=== SIMILARITY SEARCH WITH SCORE SEARCH RESULTS ===\n", + "Filter: author equal Antoine de Saint-Exupéry\n", + "\n", + "Result 1:\n", + "Score: 0.7000\n", + "Content: In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little prince\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " order: 14\n", + " source: Original Text\n", + " author: Antoine de Saint-Exupéry\n", + "--------------------------------------------------\n", + "\n", + "Result 2:\n", + "Score: 0.6264\n", + "Content: and illustrate what would become his most famous book, The Little Prince (1943). Mystical and enchanting, this small book has fascinated both children and adults for decades.
In the book, a pilot is\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " order: 13\n", + " source: Original Text\n", + " author: Antoine de Saint-Exupéry\n", + "--------------------------------------------------\n", + "\n", + "Result 3:\n", + "Score: 0.6003\n", + "Content: The Little Prince\n", + "Written By Antoine de Saiot-Exupery (1900〜1944)\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " author: Antoine de Saint-Exupéry\n", + " source: Original Text\n", + " order: 1\n", + "--------------------------------------------------\n", + "\n", + "==================================================\n", + "\n", + ">>> MMR SEARCH <<<\n", + "\n", + "=== MMR SEARCH SEARCH RESULTS ===\n", + "Filter: author equal Antoine de Saint-Exupéry\n", + "\n", + "Result 1:\n", + "Content: In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little prince\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " order: 14\n", + " source: Original Text\n", + " author: Antoine de Saint-Exupéry\n", + "--------------------------------------------------\n", + "\n", + "Result 2:\n", + "Content: The Little Prince\n", + "Written By Antoine de Saiot-Exupery (1900〜1944)\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " author: Antoine de Saint-Exupéry\n", + " source: Original Text\n", + " order: 1\n", + "--------------------------------------------------\n", + "\n", + "Result 3:\n", + "Content: And that is how I made the acquaintance of the little prince.\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " author: Antoine de Saint-Exupéry\n", + " source: Original Text\n", + " order: 78\n", + "--------------------------------------------------\n", + "\n", + "==================================================\n", + "\n", + ">>> HYBRID SEARCH <<<\n", + "\n", + "=== HYBRID SEARCH SEARCH RESULTS ===\n", +
"Filter: author equal Antoine de Saint-Exupéry\n", + "\n", + "Result 1:\n", + "Content: [ Chapter 7 ]\n", + "- the narrator learns about the secret of the little prince‘s life\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " order: 174\n", + " source: Original Text\n", + " author: Antoine de Saint-Exupéry\n", + "--------------------------------------------------\n", + "\n", + "Result 2:\n", + "Content: [ Chapter 3 ]\n", + "- the narrator learns more about from where the little prince came\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " order: 79\n", + " source: Original Text\n", + " author: Antoine de Saint-Exupéry\n", + "--------------------------------------------------\n", + "\n", + "Result 3:\n", + "Content: In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little prince\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " order: 14\n", + " source: Original Text\n", + " author: Antoine de Saint-Exupéry\n", + "--------------------------------------------------\n", + "\n", + "==================================================\n", + "\n", + ">>> SEMANTIC SEARCH <<<\n", + "\n", + "=== SEMANTIC SEARCH SEARCH RESULTS ===\n", + "Filter: author equal Antoine de Saint-Exupéry\n", + "\n", + "Result 1:\n", + "Content: In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little prince\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " order: 14\n", + " source: Original Text\n", + " author: Antoine de Saint-Exupéry\n", + "--------------------------------------------------\n", + "\n", + "Result 2:\n", + "Content: and illustrate what would become his most famous book, The Little Prince (1943).
Mystical and enchanting, this small book has fascinated both children and adults for decades. In the book, a pilot is\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " order: 13\n", + " source: Original Text\n", + " author: Antoine de Saint-Exupéry\n", + "--------------------------------------------------\n", + "\n", + "Result 3:\n", + "Content: The Little Prince\n", + "Written By Antoine de Saiot-Exupery (1900〜1944)\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " order: 1\n", + " source: Original Text\n", + " author: Antoine de Saint-Exupéry\n", + "--------------------------------------------------\n", + "\n", + "==================================================\n", + "\n", + ">>> KEYWORD SEARCH <<<\n", + "\n", + "=== KEYWORD SEARCH SEARCH RESULTS ===\n", + "Filter: author equal Antoine de Saint-Exupéry\n", + "\n", + "Result 1:\n", + "Content: \"Hum! Hum!\" replied the king; and before saying anything else he consulted a bulky almanac. \"Hum! Hum! That will be about-- about-- that will be this evening about twenty minutes to eight. And you\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " order: 291\n", + " source: Original Text\n", + " author: Antoine de Saint-Exupéry\n", + "--------------------------------------------------\n", + "\n", + "Result 2:\n", + "Content: have made a new friend, they never ask you any questions about essential matters. They never say to you, \"What does his voice sound like? What games does he love best?
Does he collect butterflies?\"\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " order: 110\n", + " source: Original Text\n", + " author: Antoine de Saint-Exupéry\n", + "--------------------------------------------------\n", + "\n", + "Result 3:\n", + "Content: figures do they think they have learned anything about him.\n", + "\n", + "Metadata:\n", + " title: The Little Prince\n", + " order: 112\n", + " source: Original Text\n", + " author: Antoine de Saint-Exupéry\n", + "--------------------------------------------------\n", + "\n", + "==================================================\n" + ] + } + ], + "source": [ + "searcher = WeaviateSearch(vector_store)\n", + "\n", + "filter_query = Filter.by_property(\"author\").equal(\"Antoine de Saint-Exupéry\")\n", + "\n", + "searcher.print_search_comparison(\n", + " query=\"What is the little prince about?\",\n", + " filter_query=filter_query,\n", + " limit=3,\n", + " alpha=0.5, # hybrid weight: 0 = keyword only, 1 = vector only\n", + " fetch_k=10, # number of candidate documents for MMR search\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Delete items from vector store\n", + "\n", + "You can delete items from the vector store by applying a filter.\n", + "\n", + "First, let's search for documents that contain the text `Hum! Hum!` in the `text` property." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={'title': 'The Little Prince', 'order': 291, 'source': 'Original Text', 'author': 'Antoine de Saint-Exupéry', 'uuid': '16ddf535-a610-510c-b597-1fd3ce13360f'}, page_content='\"Hum! Hum!\" replied the king; and before saying anything else he consulted a bulky almanac. \"Hum! Hum! That will be about-- about-- that will be this evening about twenty minutes to eight.
And you'),\n", + " Document(metadata={'title': 'The Little Prince', 'order': 269, 'source': 'Original Text', 'author': 'Antoine de Saint-Exupéry', 'uuid': 'a4c46e83-a491-5c1a-be06-e6635dfa58e5'}, page_content='\"That frightens me... I cannot, any more...\" murmured the little prince, now completely abashed.\\n\"Hum! Hum!\" replied the king. \"Then I-- I order you sometimes to yawn and sometimes to--\"'),\n", + " Document(metadata={'title': 'The Little Prince', 'order': 301, 'source': 'Original Text', 'author': 'Antoine de Saint-Exupéry', 'uuid': 'a8ff68c1-db62-51f6-a03b-5e12aceda12f'}, page_content='\"Hum! Hum!\" said the king. \"I have good reason to believe that somewhere on my planet there is an old rat. I hear him at night. You can judge this old rat. From time to time you will condemn him to')]" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "filter_query = Filter.by_property(\"text\").equal(\"Hum! Hum!\")\n", + "\n", + "searcher.keyword_search(\n", + " query=\"Hum! Hum!\",\n", + " filter_query=filter_query,\n", + " limit=3,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's delete the documents that match this filter."
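
As a variation on this deletion step, the sketch below reads the counts from the object returned by `delete_many` rather than counting matches beforehand. It assumes the v4 Python client, where that return value exposes `matches`, `successful`, and `failed` counters; `delete_and_report` and the `_FakeData` stand-in are hypothetical names, included only so the snippet runs without a Weaviate instance.

```python
from types import SimpleNamespace
from typing import Any, Dict


def delete_and_report(collection: Any, filter_query: Any) -> Dict[str, int]:
    """Delete objects matching the filter and return the counts reported by the call."""
    result = collection.data.delete_many(where=filter_query)
    return {
        "matches": result.matches,
        "successful": result.successful,
        "failed": result.failed,
    }


class _FakeData:
    """Hypothetical stand-in that mimics only the method and attributes used above."""

    def __init__(self, objects):
        self.objects = objects

    def delete_many(self, where):
        # The fake treats the filter as a plain predicate function.
        matched = [o for o in self.objects if where(o)]
        self.objects[:] = [o for o in self.objects if not where(o)]
        return SimpleNamespace(matches=len(matched), successful=len(matched), failed=0)


fake_collection = SimpleNamespace(
    data=_FakeData(
        [{"text": "Hum! Hum!"}, {"text": "Hum! Hum!"}, {"text": "Good morning"}]
    )
)
report = delete_and_report(fake_collection, lambda o: "Hum!" in o["text"])
print(report)
```

With a real collection, you would pass `client.collections.get(collection_name)` and the `Filter` built earlier in place of the stand-ins.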
+ ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of documents deleted: 3\n" + ] + }, + { + "data": { + "text/plain": [ + "3" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from weaviate.collections.classes.filters import Filter\n", + "\n", + "\n", + "def delete_by_filter(collection_name: str, filter_query: Filter) -> int:\n", + " try:\n", + " # Retrieve the collection\n", + " collection = client.collections.get(collection_name)\n", + "\n", + " # Check the number of documents that match the filter before deletion\n", + " query_result = collection.query.fetch_objects(\n", + " filters=filter_query,\n", + " )\n", + " initial_count = len(query_result.objects)\n", + "\n", + " # Delete documents that match the filter condition\n", + " collection.data.delete_many(where=filter_query)\n", + "\n", + " print(f\"Number of documents deleted: {initial_count}\")\n", + " return initial_count\n", + "\n", + " except Exception as e:\n", + " print(f\"Error occurred during deletion: {e}\")\n", + " raise\n", + "\n", + "\n", + "delete_by_filter(collection_name=collection_name, filter_query=filter_query)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's verify that the document was deleted properly." + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[]" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "searcher.keyword_search(\n", + " query=\"Hum! 
Hum!\",\n", + " filter_query=filter_query,\n", + " limit=3,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Great job, now let's dive into Similarity Search with a simple example.\n", + "\n", + "----" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Finding Objects by Similarity\n", + "\n", + "Weaviate allows you to find objects that are semantically similar to your query. Let's walk through a complete example, from importing data to executing similarity searches.\n", + "\n", + "### Step 1: Preparing Your Data\n", + "\n", + "Before we can perform similarity searches, we need to populate our Weaviate instance with data. We'll start by loading and chunking a text file into manageable pieces.\n", + "\n", + "> 💡 **Tip**: Breaking down large texts into smaller chunks helps optimize vector search performance and relevance." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_openai import OpenAIEmbeddings\n", + "from langchain_weaviate.vectorstores import WeaviateVectorStore\n", + "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", + "\n", + "# This is a long document we can split up.\n", + "with open(\"./data/the_little_prince.txt\") as f:\n", + " raw_text = f.read()\n", + "\n", + "text_splitter = RecursiveCharacterTextSplitter(\n", + " # Set a really small chunk size, just to show.\n", + " chunk_size=200,\n", + " chunk_overlap=30,\n", + " length_function=len,\n", + " is_separator_regex=False,\n", + ")\n", + "\n", + "split_docs = text_splitter.create_documents([raw_text])\n", + "embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")\n", + "\n", + "vector_store = WeaviateVectorStore(\n", + " client=client, index_name=collection_name, embedding=embeddings, text_key=\"text\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 2: Perform the search\n", + "\n", + 
"We can now perform a similarity search. This will return the most similar documents to the query text, based on the embeddings stored in Weaviate and an equivalent embedding generated from the query text." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Document 1:\n", + "In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little prince\n" + ] + } + ], + "source": [ + "query = \"What is the little prince about?\"\n", + "searcher = WeaviateSearch(vector_store)\n", + "docs = searcher.similarity_search(query, k=1)\n", + "\n", + "for i, doc in enumerate(docs):\n", + " print(f\"\\nDocument {i+1}:\")\n", + " print(doc.page_content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also add filters, which will either include or exclude results based on the filter conditions. (See [more filter examples](https://weaviate.io/developers/weaviate/search/filters).)\n", + "\n", + "It is also possible to provide `k`, which is the upper limit of the number of results to return." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={'title': 'The Little Prince', 'order': 14, 'source': 'Original Text', 'author': 'Antoine de Saint-ExupĂ©ry'}, page_content='In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. 
In the book, the little prince')]" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from weaviate.classes.query import Filter\n", + "\n", + "filter_query = Filter.by_property(\"text\").equal(\"In the book, a pilot is\")\n", + "\n", + "searcher.similarity_search(\n", + " query=query,\n", + " filter_query=filter_query,\n", + " k=1,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Quantify Result Similarity\n", + "\n", + "When performing similarity searches, you might want to know not just which documents are similar, but how similar they are. Weaviate provides this information through a relevance score.\n", + "> 💡 Tip: The relevance score helps you understand the relative similarity between search results." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.700 : In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. In the book, the little prince\n", + "0.627 : and illustrate what would become his most famous book, The Little Prince (1943). Mystical and enchanting, this small book has fascinated both children and adults for decades. 
In the book, a pilot is\n", + "0.600 : The Little Prince\n", + "Written By Antoine de Saiot-Exupery (1900〜1944)\n", + "0.525 : [ Chapter 7 ]\n", + "- the narrator learns about the secret of the little prince‘s life\n", + "0.519 : [ Chapter 3 ]\n", + "- the narrator learns more about from where the little prince came\n" + ] + } + ], + "source": [ + "docs = searcher.similarity_search_with_score(query, k=5)\n", + "\n", + "for doc in docs:\n", + " print(f\"{doc[1]:.3f}\", \":\", doc[0].page_content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Search mechanism" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`similarity_search` uses Weaviate's [hybrid search](https://weaviate.io/developers/weaviate/api/graphql/search-operators#hybrid).\n", + "\n", + "A hybrid search combines a vector and a keyword search, with `alpha` as the weight of the vector search. The `similarity_search` function allows you to pass additional arguments as kwargs. See this [reference doc](https://weaviate.io/developers/weaviate/api/graphql/search-operators#hybrid) for the available arguments.\n", + "\n", + "So, you can perform a pure keyword search by adding `alpha=0` as shown below:" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Document(metadata={'title': 'The Little Prince', 'order': 110, 'source': 'Original Text', 'author': 'Antoine de Saint-ExupĂ©ry'}, page_content='have made a new friend, they never ask you any questions about essential matters. They never say to you, \"What does his voice sound like? What games does he love best? 
Does he collect butterflies?\"')" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "docs = searcher.similarity_search(query, alpha=0)\n", + "docs[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Persistence" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Any data added through `langchain-weaviate` will persist in Weaviate according to its configuration. \n", + "\n", + "Weaviate Cloud Services (WCS) instances, for example, are configured to persist data indefinitely, and Docker instances can be set up to persist data in a volume. Read more about [Weaviate's persistence](https://weaviate.io/developers/weaviate/configuration/persistence)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Multi-tenancy" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Multi-tenancy](https://weaviate.io/developers/weaviate/concepts/data#multi-tenancy) allows you to have a high number of isolated collections of data, with the same collection configuration, in a single Weaviate instance. This is great for multi-user environments such as a SaaS app, where each end user has their own isolated data collection.\n", + "\n", + "To use multi-tenancy, the vector store needs to be aware of the `tenant` parameter. \n", + "\n", + "So when adding any data, provide the `tenant` parameter as shown below." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-Jan-19 09:14 PM - langchain_weaviate.vectorstores - INFO - Tenant tenant1 does not exist in index LangChain_866945876dc24c83bb0247ce4324bdbd. Creating tenant.\n" + ] + } + ], + "source": [ + "# 2. 
Create a vector store with a specific tenant\n", + "vector_store_with_tenant = WeaviateVectorStore.from_documents(\n", + " docs, embeddings, client=client, tenant=\"tenant1\" # specify the tenant name\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\"Yes?\" said the little prince, who did not understand what the conceited man was talking about. \n", + "\"Clap your hands, one against the other,\" the conceited man now directed him.\n", + "have made a new friend, they never ask you any questions about essential matters. They never say to you, \"What does his voice sound like? What games does he love best? Does he collect butterflies?\"\n", + "figures do they think they have learned anything about him.\n" + ] + } + ], + "source": [ + "results = vector_store_with_tenant.similarity_search(\n", + " query, tenant=\"tenant1\" # use the same tenant name\n", + ")\n", + "\n", + "for doc in results:\n", + " print(doc.page_content)" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2025-Jan-19 09:14 PM - langchain_weaviate.vectorstores - INFO - Tenant tenant1 does not exist in index LangChain_c07a19db3f994319935be1ccdeb957c0. Creating tenant.\n" + ] + } + ], + "source": [ + "vector_store_with_tenant = WeaviateVectorStore.from_documents(\n", + " docs, embeddings, client=client, tenant=\"tenant1\", mt=True\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And when performing queries, provide the `tenant` parameter also." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(metadata={'title': 'The Little Prince', 'order': 313.0, 'source': 'Original Text', 'author': 'Antoine de Saint-ExupĂ©ry'}, page_content='\"Yes?\" said the little prince, who did not understand what the conceited man was talking about. \\n\"Clap your hands, one against the other,\" the conceited man now directed him.'),\n", + " Document(metadata={'title': 'The Little Prince', 'order': 110.0, 'source': 'Original Text', 'author': 'Antoine de Saint-ExupĂ©ry'}, page_content='have made a new friend, they never ask you any questions about essential matters. They never say to you, \"What does his voice sound like? What games does he love best? Does he collect butterflies?\"'),\n", + " Document(metadata={'title': 'The Little Prince', 'order': 112.0, 'source': 'Original Text', 'author': 'Antoine de Saint-ExupĂ©ry'}, page_content='figures do they think they have learned anything about him.')]" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vector_store_with_tenant.similarity_search(query, tenant=\"tenant1\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Retriever options\n", + "\n", + "Weaviate can also be used as a retriever.\n", + "\n", + "### Maximal marginal relevance search (MMR)\n", + "\n", + "In addition to similarity search, the retriever object also supports maximal marginal relevance (`mmr`), which balances relevance to the query against diversity among the returned documents." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Document(metadata={'title': 'The Little Prince', 'author': 'Antoine de Saint-ExupĂ©ry', 'source': 'Original Text', 'order': 14}, page_content='In the book, a pilot is stranded in the midst of the Sahara where he meets a tiny prince from another world traveling the universe in order to understand life. 
In the book, the little prince')" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "retriever = vector_store.as_retriever(search_type=\"mmr\")\n", + "retriever.invoke(query)[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Use with LangChain" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A known limitation of large language models (LLMs) is that their training data can be outdated, or not include the specific domain knowledge that you require.\n", + "\n", + "Take a look at the example below:" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\"The Little Prince\" is a novella written by Antoine de Saint-ExupĂ©ry, first published in 1943. The story is narrated by a pilot who crashes in the Sahara Desert and meets a young boy who appears to be a prince. The little prince hails from a small asteroid called B-612 and shares his adventures and experiences as he travels from one planet to another.\n", + "\n", + "Throughout the story, the little prince encounters various inhabitants of different planets, each representing different aspects of human nature and society, such as a king, a vain man, a drunkard, a businessman, a geographer, and a fox. These encounters serve as allegories for adult behaviors and societal norms, often highlighting themes of loneliness, love, friendship, and the loss of innocence.\n", + "\n", + "One of the central messages of the book is the importance of seeing with the heart rather than just the eyes, emphasizing that true understanding and connection come from emotional and spiritual insight rather than superficial appearances. 
The story also explores themes of childhood, imagination, and the essence of what it means to be human.\n", + "\n", + "Ultimately, \"The Little Prince\" is a poignant reflection on the nature of relationships, the value of love, and the wisdom that can be found in simplicity and innocence. It has resonated with readers of all ages and is considered a classic of world literature.\n" + ] + } + ], + "source": [ + "from langchain_openai import ChatOpenAI\n", + "\n", + "llm = ChatOpenAI(model=\"gpt-4o-mini\", temperature=0)\n", + "result = llm.invoke(query)\n", + "print(result.content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Vector stores complement LLMs by providing a way to store and retrieve relevant information. This allows you to combine the strengths of both: the LLM's reasoning and linguistic capabilities with the vector store's ability to retrieve relevant information.\n", + "\n", + "Two well-known applications for combining LLMs and vector stores are:\n", + "- Question answering\n", + "- Retrieval-augmented generation (RAG)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question Answering with Sources\n", + "\n", + "Question answering in LangChain can be enhanced by the use of vector stores. Let's see how this can be done.\n", + "\n", + "This section uses the `RetrievalQAWithSourcesChain`, which looks up the documents from an index. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can construct the chain, with the retriever specified:" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [], + "source": [ + "searcher = WeaviateSearch(vector_store)\n", + "\n", + "chain = searcher.create_qa_chain(\n", + " llm=llm, retriever=vector_store.as_retriever(), chain_type=\"stuff\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'answer': 'The Little Prince is about a pilot who is stranded in the Sahara Desert and encounters a tiny prince from another world. The prince is traveling the universe to understand life. The story is mystical and enchanting, captivating both children and adults for decades.\\n\\n',\n", + " 'sources': 'Original Text'}" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "chain.invoke(\n", + " {\"question\": query},\n", + " return_only_outputs=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Retrieval-Augmented Generation\n", + "\n", + "Another very popular application of combining LLMs and vector stores is retrieval-augmented generation (RAG). This is a technique that uses a retriever to find relevant information from a vector store, and then uses an LLM to provide an output based on the retrieved data and a prompt.\n", + "\n", + "We begin with a similar setup:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We need to construct a template for the RAG model so that the retrieved information will be populated in the template." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "input_variables=['context', 'question'] input_types={} partial_variables={} messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template=\"You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\\nQuestion: {question}\\nContext: {context}\\nAnswer:\\n\"), additional_kwargs={})]\n" + ] + } + ], + "source": [ + "from langchain_core.prompts import ChatPromptTemplate\n", + "\n", + "template = \"\"\"You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\n", + "Question: {question}\n", + "Context: {context}\n", + "Answer:\n", + "\"\"\"\n", + "prompt = ChatPromptTemplate.from_template(template)\n", + "\n", + "print(prompt)" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'\"The Little Prince\" is about a pilot who, while stranded in the Sahara, meets a young prince from another world who is exploring the universe to understand life. The story contrasts the prince\\'s innocent perspective with the often misguided views of adults. 
It explores themes of love, loss, and the importance of seeing beyond the surface.'" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain_core.output_parsers import StrOutputParser\n", + "from langchain_core.runnables import RunnablePassthrough\n", + "from langchain_openai import ChatOpenAI\n", + "\n", + "llm = ChatOpenAI(model=\"gpt-4o-mini\", temperature=0)\n", + "\n", + "rag_chain = (\n", + " {\"context\": retriever, \"question\": RunnablePassthrough()}\n", + " | prompt\n", + " | llm\n", + " | StrOutputParser()\n", + ")\n", + "\n", + "rag_chain.invoke(query)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "langchain-opentutorial-BXw0bE1H-py3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/09-VectorStore/assets/10-weaviate-credentials-01.png b/09-VectorStore/assets/10-weaviate-credentials-01.png new file mode 100644 index 000000000..c301609f1 Binary files /dev/null and b/09-VectorStore/assets/10-weaviate-credentials-01.png differ diff --git a/09-VectorStore/assets/10-weaviate-credentials-02.png b/09-VectorStore/assets/10-weaviate-credentials-02.png new file mode 100644 index 000000000..05ec72f0c Binary files /dev/null and b/09-VectorStore/assets/10-weaviate-credentials-02.png differ diff --git a/09-VectorStore/assets/10-weaviate-credentials-03.png b/09-VectorStore/assets/10-weaviate-credentials-03.png new file mode 100644 index 000000000..7f4b65e28 Binary files /dev/null and b/09-VectorStore/assets/10-weaviate-credentials-03.png differ diff --git a/09-VectorStore/assets/10-weaviate-credentials-04.png b/09-VectorStore/assets/10-weaviate-credentials-04.png 
new file mode 100644 index 000000000..f946153a0 Binary files /dev/null and b/09-VectorStore/assets/10-weaviate-credentials-04.png differ