369 changes: 369 additions & 0 deletions 11-Reranker/01-CrossEncoderReranker.ipynb
@@ -0,0 +1,369 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "T3l-XPuhit0_"
},
"source": [
"# Cross Encoder Reranker\n",
"\n",
"- Author: [Jeongho Shin](https://github.com/ThePurpleCollar)\n",
"- Design:\n",
"- Peer Review:\n",
"- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
"\n",
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/01-Basic/05-Using-OpenAIAPI-MultiModal.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/01-Basic/05-Using-OpenAIAPI-MultiModal.ipynb)\n",
"\n",
"## Overview\n",
"\n",
"The **Cross Encoder Reranker** is a technique designed to enhance the performance of Retrieval-Augmented Generation (RAG) systems. This guide explains how to implement a reranker using Hugging Face's cross-encoder model to refine the ranking of retrieved documents, promoting those most relevant to a query.\n",
"\n",
"### Table of Contents\n",
"\n",
"- [Overview](#overview)\n",
"- [Key Features and Mechanism](#key-features-and-mechanism)\n",
"- [Practical Applications](#practical-applications)\n",
"- [Implementation](#implementation)\n",
"- [Key Advantages of Reranker](#key-advantages-of-reranker)\n",
"- [Document Count Settings for Reranker](#document-count-settings-for-reranker)\n",
"- [Trade-offs When Using a Reranker](#trade-offs-when-using-a-reranker)\n",
"\n",
"### References\n",
"\n",
"[Hugging Face cross encoder models ](https://huggingface.co/cross-encoder)\n",
"\n",
"----"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ydu-Mu2tizbd"
},
"source": [
"## Key Features and Mechanism\n",
"\n",
"### Purpose\n",
"- Re-rank retrieved documents to refine their ranking, prioritizing the most relevant results for the query.\n",
"\n",
"### Structure\n",
"- Accepts both the `query` and `document` as a single input pair, enabling joint processing.\n",
"\n",
"### Mechanism\n",
"- **Single Input Pair**: \n",
" Processes the `query` and `document` as a combined input to output a relevance score directly.\n",
"- **Self-Attention Mechanism**: \n",
" Uses self-attention to jointly analyze the `query` and `document`, effectively capturing their semantic relationship.\n",
"\n",
"### Advantages\n",
"- **Higher Accuracy**: \n",
" Provides more precise similarity scores.\n",
"- **Deep Contextual Analysis**: \n",
" Explores semantic nuances between `query` and `document`.\n",
"\n",
"### Limitations\n",
"- **High Computational Costs**: \n",
" Processing can be time-intensive.\n",
"- **Scalability Issues**: \n",
" Not suitable for large-scale document collections without optimization.\n",
"\n",
"---\n",
"\n",
"## Practical Applications\n",
"- A **Bi-Encoder** quickly retrieves candidate `documents` by computing lightweight similarity scores. \n",
"- A **Cross Encoder** refines these results by deeply analyzing the semantic relationship between the `query` and the retrieved `documents`.\n",
"\n",
"---\n",
"\n",
"## Implementation\n",
"- Use Hugging Face cross encoder models, such as `BAAI/bge-reranker`.\n",
"- Easily integrate with frameworks like `LangChain` through the `CrossEncoderReranker` component.\n",
"\n",
"---\n",
"\n",
"## Key Advantages of Reranker\n",
"- **Precise Similarity Scoring**: \n",
" Delivers highly accurate measurements of relevance between the `query` and `documents`.\n",
"- **Semantic Depth**: \n",
" Analyzes deeper semantic relationships, uncovering nuances in `query-document` interactions.\n",
"- **Refined Search Quality**: \n",
" Improves the relevance and quality of the retrieved `documents`.\n",
"- **RAG System Boost**: \n",
" Enhances the performance of `Retrieval-Augmented Generation (RAG)` systems by refining input relevance.\n",
"- **Seamless Integration**: \n",
" Easily adaptable to various workflows and compatible with multiple frameworks.\n",
"- **Model Versatility**: \n",
" Offers flexibility with a wide range of pre-trained models for tailored use cases.\n",
"\n",
"---\n",
"\n",
"## Document Count Settings for Reranker\n",
"- Reranking is generally performed on the top `5–10` `documents` retrieved during the initial search.\n",
"- The ideal number of `documents` for reranking should be determined through experimentation and evaluation, as it depends on the dataset characteristics and computational resources available.\n",
"\n",
"---\n",
"\n",
"## Trade-offs When Using a Reranker\n",
"- **Accuracy vs Processing Time**: \n",
" Striking a balance between achieving higher accuracy and minimizing processing time.\n",
"- **Performance Improvement vs Computational Cost**: \n",
" Weighing the benefits of improved performance against the additional computational resources required.\n",
"- **Search Speed vs Relevance Accuracy**: \n",
" Managing the trade-off between faster retrieval and maintaining high relevance in results.\n",
"- **System Requirements**: \n",
" Ensuring the system meets the necessary hardware and software requirements to support reranking.\n",
"- **Dataset Characteristics**: \n",
" Considering the scale, diversity, and specific attributes of the `dataset` to optimize reranker performance.\n",
"---"
]
},
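{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before building the full pipeline, here is a minimal sketch of the pair-scoring mechanism described above. It assumes the `sentence-transformers` package is installed; `cross-encoder/ms-marco-MiniLM-L-6-v2` is just one example checkpoint from the Hugging Face cross-encoder collection, not a required choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: score (query, document) pairs with a cross encoder.\n",
"# Assumes the sentence-transformers package is installed.\n",
"from sentence_transformers import CrossEncoder\n",
"\n",
"# Example checkpoint; any Hugging Face cross-encoder model can be substituted\n",
"cross_encoder = CrossEncoder(\"cross-encoder/ms-marco-MiniLM-L-6-v2\")\n",
"\n",
"query = \"Can you tell me about Word2Vec?\"\n",
"candidates = [\n",
"    \"Word2Vec is a technique in NLP that maps words to a vector space.\",\n",
"    \"SQL is a programming language for managing data in databases.\",\n",
"]\n",
"\n",
"# Each (query, document) pair is processed jointly in a single forward pass,\n",
"# producing one relevance score per pair\n",
"scores = cross_encoder.predict([(query, doc) for doc in candidates])\n",
"\n",
"# Higher scores indicate stronger relevance to the query\n",
"for doc, score in sorted(zip(candidates, scores), key=lambda pair: -pair[1]):\n",
"    print(f\"{score:.4f}  {doc}\")"
]
},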
{
"cell_type": "markdown",
"metadata": {
"id": "tk9Sg0AKi76R"
},
"source": [
"Explaining the Implementation of Cross Encoder Reranker with a Simple Example"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "_DlGOaxxozWp"
},
"outputs": [],
"source": [
"# Helper function to format and print document content\n",
"def pretty_print_docs(docs):\n",
" # Print each document in the list with a separator between them\n",
" print(\n",
" f\"\\n{'-' * 100}\\n\".join( # Separator line for better readability\n",
" [f\"Document {i+1}:\\n\\n\" + d.page_content for i, d in enumerate(docs)] # Format: Document number + content\n",
" )\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "obj9TBnHjFBt",
"outputId": "f3a0ecd4-e9ba-4bcf-a509-ef38631be18a"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Document 1:\n",
"\n",
"Word2Vec\n",
"Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.\n",
"Example: In a Word2Vec model, \"king\" and \"queen\" are represented by vectors located close to each other.\n",
"Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 2:\n",
"\n",
"Token\n",
"Definition: A token refers to a smaller unit of text obtained by splitting a larger piece of text. It can be a word, phrase, or sentence.\n",
"Example: The sentence \"I go to school\" can be tokenized into \"I,\" \"go,\" \"to,\" and \"school.\"\n",
"Related Keywords: Tokenization, Natural Language Processing (NLP), Syntax Analysis\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 3:\n",
"\n",
"Example: A customer information table in a relational database is an example of structured data.\n",
"Related Keywords: Database, Data Analysis, Data Modeling\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 4:\n",
"\n",
"Schema\n",
"Definition: A schema defines the structure of a database or file, detailing how data is organized and stored.\n",
"Example: A relational database schema specifies column names, data types, and key constraints.\n",
"Related Keywords: Database, Data Modeling, Data Management\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 5:\n",
"\n",
"Keyword Search\n",
"Definition: Keyword search involves finding information based on user-inputted keywords, commonly used in search engines and database systems.\n",
"Example: Searching \n",
"When a user searches for \"coffee shops in Seoul,\" the system returns a list of relevant coffee shops.\n",
"Related Keywords: Search Engine, Data Search, Information Retrieval\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 6:\n",
"\n",
"TF-IDF (Term Frequency-Inverse Document Frequency)\n",
"Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document by considering its frequency and rarity across a corpus.\n",
"Example: Words with high TF-IDF values are often unique and critical for understanding the document.\n",
"Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 7:\n",
"\n",
"SQL\n",
"Definition: SQL (Structured Query Language) is a programming language for managing data in databases. \n",
"It allows you to perform various operations such as querying, updating, inserting, and deleting data.\n",
"Example: SELECT * FROM users WHERE age > 18; retrieves information about users aged above 18.\n",
"Related Keywords: Database, Query, Data Management\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 8:\n",
"\n",
"Open Source\n",
"Definition: Open source software allows its source code to be freely used, modified, and distributed, fostering collaboration and innovation.\n",
"Example: The Linux operating system is a well-known open source project.\n",
"Related Keywords: Software Development, Community, Technical Collaboration\n",
"Structured Data\n",
"Definition: Structured data is organized according to a specific format or schema, making it easy to search and analyze.\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 9:\n",
"\n",
"Semantic Search\n",
"Definition: Semantic search is a search technique that understands the meaning of a user's query beyond simple keyword matching, returning results that are contextually relevant.\n",
"Example: If a user searches for \"planets in the solar system,\" the system provides information about planets like Jupiter and Mars.\n",
"Related Keywords: Natural Language Processing (NLP), Search Algorithms, Data Mining\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 10:\n",
"\n",
"GPT (Generative Pretrained Transformer)\n",
"Definition: GPT is a generative language model pre-trained on vast datasets, capable of performing various text-based tasks. It generates natural and coherent text based on input.\n",
"Example: A chatbot generating detailed answers to user queries is powered by GPT models.\n",
"Related Keywords: Natural Language Processing (NLP), Text Generation, Deep Learning\n"
]
}
],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_community.vectorstores import FAISS\n",
"from langchain_huggingface import HuggingFaceEmbeddings\n",
"from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
"\n",
"# Load documents\n",
"documents = TextLoader(\"./data/appendix-keywords.txt\").load()\n",
"\n",
"# Configure text splitter\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)\n",
"\n",
"# Split documents into chunks\n",
"texts = text_splitter.split_documents(documents)\n",
"\n",
"# Set up the embedding model\n",
"embeddings_model = HuggingFaceEmbeddings(\n",
" model_name=\"sentence-transformers/msmarco-distilbert-dot-v5\"\n",
")\n",
"\n",
"# Create FAISS index from documents and set up retriever\n",
"retriever = FAISS.from_documents(texts, embeddings_model).as_retriever(\n",
" search_kwargs={\"k\": 10}\n",
")\n",
"\n",
"# Define the query\n",
"query = \"Can you tell me about Word2Vec?\"\n",
"\n",
"# Execute the query and retrieve results\n",
"docs = retriever.invoke(query)\n",
"\n",
"# Display the retrieved documents\n",
"pretty_print_docs(docs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "T_6qFjxQpUc3"
},
"source": [
"Now, let's wrap the base retriever with a `ContextualCompressionRetriever`. The `CrossEncoderReranker` leverages `HuggingFaceCrossEncoder` to re-rank the retrieved results.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_D3uj-DOpwUt"
},
"source": [
"Multilingual Support BGE Reranker: [`bge-reranker-v2-m3`](https://huggingface.co/BAAI/bge-reranker-v2-m3)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "f5yrJ-FTjWJB",
"outputId": "ddfebce9-e45a-4057-cadc-2767d7c26152"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Document 1:\n",
"\n",
"Word2Vec\n",
"Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.\n",
"Example: In a Word2Vec model, \"king\" and \"queen\" are represented by vectors located close to each other.\n",
"Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 2:\n",
"\n",
"Open Source\n",
"Definition: Open source software allows its source code to be freely used, modified, and distributed, fostering collaboration and innovation.\n",
"Example: The Linux operating system is a well-known open source project.\n",
"Related Keywords: Software Development, Community, Technical Collaboration\n",
"Structured Data\n",
"Definition: Structured data is organized according to a specific format or schema, making it easy to search and analyze.\n",
"----------------------------------------------------------------------------------------------------\n",
"Document 3:\n",
"\n",
"TF-IDF (Term Frequency-Inverse Document Frequency)\n",
"Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document by considering its frequency and rarity across a corpus.\n",
"Example: Words with high TF-IDF values are often unique and critical for understanding the document.\n",
"Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining\n"
]
}
],
"source": [
"from langchain.retrievers import ContextualCompressionRetriever\n",
"from langchain.retrievers.document_compressors import CrossEncoderReranker\n",
"from langchain_community.cross_encoders import HuggingFaceCrossEncoder\n",
"\n",
"# Initialize the model\n",
"model = HuggingFaceCrossEncoder(model_name=\"BAAI/bge-reranker-v2-m3\")\n",
"\n",
"# Select the top 3 documents\n",
"compressor = CrossEncoderReranker(model=model, top_n=3)\n",
"\n",
"# Initialize the contextual compression retriever\n",
"compression_retriever = ContextualCompressionRetriever(\n",
" base_compressor=compressor, base_retriever=retriever\n",
")\n",
"\n",
"# Retrieve compressed documents\n",
"compressed_docs = compression_retriever.invoke(\"Can you tell me about Word2Vec?\")\n",
"\n",
"# Display the documents\n",
"pretty_print_docs(compressed_docs)"
]
},
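{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted in the document-count section, both the size of the candidate pool (`k`) and the number of documents kept after reranking (`top_n`) are tunable. The sketch below is illustrative only: the values are starting points to validate against your own dataset, not recommendations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative tuning sketch, reusing texts, embeddings_model, model,\n",
"# and pretty_print_docs from the earlier cells\n",
"wide_retriever = FAISS.from_documents(texts, embeddings_model).as_retriever(\n",
"    search_kwargs={\"k\": 20}  # wider candidate pool for the reranker to score\n",
")\n",
"\n",
"# Keep the top 5 documents after reranking instead of 3\n",
"wide_compressor = CrossEncoderReranker(model=model, top_n=5)\n",
"\n",
"wide_compression_retriever = ContextualCompressionRetriever(\n",
"    base_compressor=wide_compressor, base_retriever=wide_retriever\n",
")\n",
"\n",
"pretty_print_docs(wide_compression_retriever.invoke(\"Can you tell me about Word2Vec?\"))"
]
}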
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}