369 changes: 369 additions & 0 deletions 11-Reranker/01-CrossEncoderReranker.ipynb
@@ -0,0 +1,369 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "T3l-XPuhit0_"
},
"source": [
"# Cross Encoder Reranker\n",
"\n",
"- Author: [Jeongho Shin](https://github.com/ThePurpleCollar)\n",
"- Design:\n",
"- Peer Review:\n",
"- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
"\n",
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/01-Basic/05-Using-OpenAIAPI-MultiModal.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/01-Basic/05-Using-OpenAIAPI-MultiModal.ipynb)\n",
"\n",
"## Overview\n",
"\n",
"The **Cross Encoder Reranker** is a technique designed to enhance the performance of Retrieval-Augmented Generation (RAG) systems. This guide explains how to implement a reranker using Hugging Face's cross-encoder model to refine the ranking of retrieved documents, promoting those most relevant to a query.\n",
"\n",
"### Table of Contents\n",
"\n",
"- [Overview](#overview)\n",
"- [Key Features and Mechanism](#key-features-and-mechanism)\n",
"- [Practical Applications](#practical-applications)\n",
"- [Implementation](#implementation)\n",
"- [Key Advantages of Reranker](#key-advantages-of-reranker)\n",
"- [Document Count Settings for Reranker](#document-count-settings-for-reranker)\n",
"- [Trade-offs When Using a Reranker](#trade-offs-when-using-a-reranker)\n",
"\n",
"### References\n",
"\n",
"[Hugging Face cross encoder models ](https://huggingface.co/cross-encoder)\n",
"\n",
"----"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ydu-Mu2tizbd"
},
"source": [
"## Key Features and Mechanism\n",
"\n",
"### Purpose\n",
"- Re-rank retrieved documents to refine their ranking, prioritizing the most relevant results for the query.\n",
"\n",
"### Structure\n",
"- Accepts both the `query` and `document` as a single input pair, enabling joint processing.\n",
"\n",
"### Mechanism\n",
"- **Single Input Pair**: \n",
" Processes the `query` and `document` as a combined input to output a relevance score directly.\n",
"- **Self-Attention Mechanism**: \n",
" Uses self-attention to jointly analyze the `query` and `document`, effectively capturing their semantic relationship.\n",
"\n",
"### Advantages\n",
"- **Higher Accuracy**: \n",
" Provides more precise similarity scores.\n",
"- **Deep Contextual Analysis**: \n",
" Explores semantic nuances between `query` and `document`.\n",
"\n",
"### Limitations\n",
"- **High Computational Costs**: \n",
" Processing can be time-intensive.\n",
"- **Scalability Issues**: \n",
" Not suitable for large-scale document collections without optimization.\n",
"\n",
"---\n",
"\n",
"## Practical Applications\n",
"- A **Bi-Encoder** quickly retrieves candidate `documents` by computing lightweight similarity scores. \n",
"- A **Cross Encoder** refines these results by deeply analyzing the semantic relationship between the `query` and the retrieved `documents`.\n",
"\n",
"---\n",
"\n",
"## Implementation\n",
"- Use Hugging Face cross encoder models, such as `BAAI/bge-reranker`.\n",
"- Easily integrate with frameworks like `LangChain` through the `CrossEncoderReranker` component.\n",
"\n",
"---\n",
"\n",
"## Key Advantages of Reranker\n",
"- **Precise Similarity Scoring**: \n",
" Delivers highly accurate measurements of relevance between the `query` and `documents`.\n",
"- **Semantic Depth**: \n",
" Analyzes deeper semantic relationships, uncovering nuances in `query-document` interactions.\n",
"- **Refined Search Quality**: \n",
" Improves the relevance and quality of the retrieved `documents`.\n",
"- **RAG System Boost**: \n",
" Enhances the performance of `Retrieval-Augmented Generation (RAG)` systems by refining input relevance.\n",
"- **Seamless Integration**: \n",
" Easily adaptable to various workflows and compatible with multiple frameworks.\n",
"- **Model Versatility**: \n",
" Offers flexibility with a wide range of pre-trained models for tailored use cases.\n",
"\n",
"---\n",
"\n",
"## Document Count Settings for Reranker\n",
"- Reranking is generally performed on the top `5–10` `documents` retrieved during the initial search.\n",
"- The ideal number of `documents` for reranking should be determined through experimentation and evaluation, as it depends on the dataset characteristics and computational resources available.\n",
"\n",
"---\n",
"\n",
"## Trade-offs When Using a Reranker\n",
"- **Accuracy vs Processing Time**: \n",
" Striking a balance between achieving higher accuracy and minimizing processing time.\n",
"- **Performance Improvement vs Computational Cost**: \n",
" Weighing the benefits of improved performance against the additional computational resources required.\n",
"- **Search Speed vs Relevance Accuracy**: \n",
" Managing the trade-off between faster retrieval and maintaining high relevance in results.\n",
"- **System Requirements**: \n",
" Ensuring the system meets the necessary hardware and software requirements to support reranking.\n",
"- **Dataset Characteristics**: \n",
" Considering the scale, diversity, and specific attributes of the `dataset` to optimize reranker performance.\n",
"---"
]
},
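{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before building the full pipeline, here is a minimal sketch of the pair-scoring mechanism described above. It assumes the `sentence-transformers` package is installed; `cross-encoder/ms-marco-MiniLM-L-6-v2` is just one example checkpoint from the Hugging Face cross-encoder collection, not a required choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: score (query, document) pairs with a cross encoder.\n",
"# Assumes the sentence-transformers package is installed.\n",
"from sentence_transformers import CrossEncoder\n",
"\n",
"# Example checkpoint; any Hugging Face cross-encoder model can be substituted\n",
"cross_encoder = CrossEncoder(\"cross-encoder/ms-marco-MiniLM-L-6-v2\")\n",
"\n",
"query = \"Can you tell me about Word2Vec?\"\n",
"candidates = [\n",
"    \"Word2Vec is a technique in NLP that maps words to a vector space.\",\n",
"    \"SQL is a programming language for managing data in databases.\",\n",
"]\n",
"\n",
"# Each (query, document) pair is processed jointly in a single forward pass,\n",
"# producing one relevance score per pair\n",
"scores = cross_encoder.predict([(query, doc) for doc in candidates])\n",
"\n",
"# Higher scores indicate stronger relevance to the query\n",
"for doc, score in sorted(zip(candidates, scores), key=lambda pair: -pair[1]):\n",
"    print(f\"{score:.4f}  {doc}\")"
]
},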
{
"cell_type": "markdown",
"metadata": {
"id": "tk9Sg0AKi76R"
},
"source": [
"Explaining the Implementation of Cross Encoder Reranker with a Simple Example"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "_DlGOaxxozWp"
},
"outputs": [],
"source": [
"# Helper function to format and print document content\n",
"def pretty_print_docs(docs):\n",
" # Print each document in the list with a separator between them\n",
" print(\n",
" f\"\\n{'-' * 100}\\n\".join( # Separator line for better readability\n",
" [f\"Document {i+1}:\\n\\n\" + d.page_content for i, d in enumerate(docs)] # Format: Document number + content\n",
" )\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "obj9TBnHjFBt",
"outputId": "f3a0ecd4-e9ba-4bcf-a509-ef38631be18a"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Document 1:\n",
"\n",
"Word2Vec\n",
"Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.\n",
"Example: In a Word2Vec model, \"king\" and \"queen\" are represented by vectors located close to each other.\n",
"Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 2:\n",
"\n",
"Token\n",
"Definition: A token refers to a smaller unit of text obtained by splitting a larger piece of text. It can be a word, phrase, or sentence.\n",
"Example: The sentence \"I go to school\" can be tokenized into \"I,\" \"go,\" \"to,\" and \"school.\"\n",
"Related Keywords: Tokenization, Natural Language Processing (NLP), Syntax Analysis\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 3:\n",
"\n",
"Example: A customer information table in a relational database is an example of structured data.\n",
"Related Keywords: Database, Data Analysis, Data Modeling\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 4:\n",
"\n",
"Schema\n",
"Definition: A schema defines the structure of a database or file, detailing how data is organized and stored.\n",
"Example: A relational database schema specifies column names, data types, and key constraints.\n",
"Related Keywords: Database, Data Modeling, Data Management\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 5:\n",
"\n",
"Keyword Search\n",
"Definition: Keyword search involves finding information based on user-inputted keywords, commonly used in search engines and database systems.\n",
"Example: Searching \n",
"When a user searches for \"coffee shops in Seoul,\" the system returns a list of relevant coffee shops.\n",
"Related Keywords: Search Engine, Data Search, Information Retrieval\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 6:\n",
"\n",
"TF-IDF (Term Frequency-Inverse Document Frequency)\n",
"Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document by considering its frequency and rarity across a corpus.\n",
"Example: Words with high TF-IDF values are often unique and critical for understanding the document.\n",
"Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 7:\n",
"\n",
"SQL\n",
"Definition: SQL (Structured Query Language) is a programming language for managing data in databases. \n",
"It allows you to perform various operations such as querying, updating, inserting, and deleting data.\n",
"Example: SELECT * FROM users WHERE age > 18; retrieves information about users aged above 18.\n",
"Related Keywords: Database, Query, Data Management\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 8:\n",
"\n",
"Open Source\n",
"Definition: Open source software allows its source code to be freely used, modified, and distributed, fostering collaboration and innovation.\n",
"Example: The Linux operating system is a well-known open source project.\n",
"Related Keywords: Software Development, Community, Technical Collaboration\n",
"Structured Data\n",
"Definition: Structured data is organized according to a specific format or schema, making it easy to search and analyze.\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 9:\n",
"\n",
"Semantic Search\n",
"Definition: Semantic search is a search technique that understands the meaning of a user's query beyond simple keyword matching, returning results that are contextually relevant.\n",
"Example: If a user searches for \"planets in the solar system,\" the system provides information about planets like Jupiter and Mars.\n",
"Related Keywords: Natural Language Processing (NLP), Search Algorithms, Data Mining\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 10:\n",
"\n",
"GPT (Generative Pretrained Transformer)\n",
"Definition: GPT is a generative language model pre-trained on vast datasets, capable of performing various text-based tasks. It generates natural and coherent text based on input.\n",
"Example: A chatbot generating detailed answers to user queries is powered by GPT models.\n",
"Related Keywords: Natural Language Processing (NLP), Text Generation, Deep Learning\n"
]
}
],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_community.vectorstores import FAISS\n",
"from langchain_huggingface import HuggingFaceEmbeddings\n",
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
"\n",
"# Load documents\n",
"documents = TextLoader(\"./data/appendix-keywords.txt\").load()\n",
"\n",
"# Configure text splitter\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)\n",
"\n",
"# Split documents into chunks\n",
"texts = text_splitter.split_documents(documents)\n",
"\n",
"# Set up the embedding model\n",
"embeddings_model = HuggingFaceEmbeddings(\n",
" model_name=\"sentence-transformers/msmarco-distilbert-dot-v5\"\n",
")\n",
"\n",
"# Create FAISS index from documents and set up retriever\n",
"retriever = FAISS.from_documents(texts, embeddings_model).as_retriever(\n",
" search_kwargs={\"k\": 10}\n",
")\n",
"\n",
"# Define the query\n",
"query = \"Can you tell me about Word2Vec?\"\n",
"\n",
"# Execute the query and retrieve results\n",
"docs = retriever.invoke(query)\n",
"\n",
"# Display the retrieved documents\n",
"pretty_print_docs(docs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "T_6qFjxQpUc3"
},
"source": [
"Now, let's wrap the base retriever with a `ContextualCompressionRetriever`. The `CrossEncoderReranker` leverages `HuggingFaceCrossEncoder` to re-rank the retrieved results.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_D3uj-DOpwUt"
},
"source": [
"Multilingual Support BGE Reranker: [`bge-reranker-v2-m3`](https://huggingface.co/BAAI/bge-reranker-v2-m3)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "f5yrJ-FTjWJB",
"outputId": "ddfebce9-e45a-4057-cadc-2767d7c26152"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Document 1:\n",
"\n",
"Word2Vec\n",
"Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.\n",
"Example: In a Word2Vec model, \"king\" and \"queen\" are represented by vectors located close to each other.\n",
"Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 2:\n",
"\n",
"Open Source\n",
"Definition: Open source software allows its source code to be freely used, modified, and distributed, fostering collaboration and innovation.\n",
"Example: The Linux operating system is a well-known open source project.\n",
"Related Keywords: Software Development, Community, Technical Collaboration\n",
"Structured Data\n",
"Definition: Structured data is organized according to a specific format or schema, making it easy to search and analyze.\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 3:\n",
"\n",
"TF-IDF (Term Frequency-Inverse Document Frequency)\n",
"Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document by considering its frequency and rarity across a corpus.\n",
"Example: Words with high TF-IDF values are often unique and critical for understanding the document.\n",
"Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining\n"
]
}
],
"source": [
"from langchain.retrievers import ContextualCompressionRetriever\n",
"from langchain.retrievers.document_compressors import CrossEncoderReranker\n",
"from langchain_community.cross_encoders import HuggingFaceCrossEncoder\n",
"\n",
"# Initialize the model\n",
"model = HuggingFaceCrossEncoder(model_name=\"BAAI/bge-reranker-v2-m3\")\n",
"\n",
"# Select the top 3 documents\n",
"compressor = CrossEncoderReranker(model=model, top_n=3)\n",
"\n",
"# Initialize the contextual compression retriever\n",
"compression_retriever = ContextualCompressionRetriever(\n",
" base_compressor=compressor, base_retriever=retriever\n",
")\n",
"\n",
"# Retrieve compressed documents\n",
"compressed_docs = compression_retriever.invoke(\"Can you tell me about Word2Vec?\")\n",
"\n",
"# Display the documents\n",
"pretty_print_docs(compressed_docs)"
]
},
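{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted in the document-count section, both the size of the candidate pool (`k`) and the number of documents kept after reranking (`top_n`) are tunable. The sketch below is illustrative only: the values are starting points to validate against your own dataset, not recommendations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative tuning sketch, reusing texts, embeddings_model, model,\n",
"# and pretty_print_docs from the earlier cells\n",
"wide_retriever = FAISS.from_documents(texts, embeddings_model).as_retriever(\n",
"    search_kwargs={\"k\": 20}  # wider candidate pool for the reranker to score\n",
")\n",
"\n",
"# Keep the top 5 documents after reranking instead of 3\n",
"wide_compressor = CrossEncoderReranker(model=model, top_n=5)\n",
"\n",
"wide_compression_retriever = ContextualCompressionRetriever(\n",
"    base_compressor=wide_compressor, base_retriever=wide_retriever\n",
")\n",
"\n",
"pretty_print_docs(wide_compression_retriever.invoke(\"Can you tell me about Word2Vec?\"))"
]
}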
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}