diff --git a/11-Reranker/01-CrossEncoderReranker.ipynb b/11-Reranker/01-CrossEncoderReranker.ipynb new file mode 100644 index 000000000..d93fd2e40 --- /dev/null +++ b/11-Reranker/01-CrossEncoderReranker.ipynb @@ -0,0 +1,369 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "T3l-XPuhit0_" + }, + "source": [ + "# Cross Encoder Reranker\n", + "\n", + "- Author: [Jeongho Shin](https://github.com/ThePurpleCollar)\n", + "- Design:\n", + "- Peer Review:\n", + "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n", + "\n", + "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/11-Reranker/01-CrossEncoderReranker.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/11-Reranker/01-CrossEncoderReranker.ipynb)\n", + "\n", + "## Overview\n", + "\n", + "The **Cross Encoder Reranker** is a technique designed to enhance the performance of Retrieval-Augmented Generation (RAG) systems. 
This guide explains how to implement a reranker using Hugging Face's cross-encoder model to refine the ranking of retrieved documents, promoting those most relevant to a query.\n", + "\n", + "### Table of Contents\n", + "\n", + "- [Overview](#overview)\n", + "- [Key Features and Mechanism](#key-features-and-mechanism)\n", + "- [Practical Applications](#practical-applications)\n", + "- [Implementation](#implementation)\n", + "- [Key Advantages of Reranker](#key-advantages-of-reranker)\n", + "- [Document Count Settings for Reranker](#document-count-settings-for-reranker)\n", + "- [Trade-offs When Using a Reranker](#trade-offs-when-using-a-reranker)\n", + "\n", + "### References\n", + "\n", + "[Hugging Face cross encoder models](https://huggingface.co/cross-encoder)\n", + "\n", + "----" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ydu-Mu2tizbd" + }, + "source": [ + "## Key Features and Mechanism\n", + "\n", + "### Purpose\n", + "- Re-rank retrieved documents so that the results most relevant to the query come first.\n", + "\n", + "### Structure\n", + "- Accepts both the `query` and `document` as a single input pair, enabling joint processing.\n", + "\n", + "### Mechanism\n", + "- **Single Input Pair**: \n", + " Processes the `query` and `document` as a combined input to output a relevance score directly.\n", + "- **Self-Attention Mechanism**: \n", + " Uses self-attention to jointly analyze the `query` and `document`, effectively capturing their semantic relationship.\n", + "\n", + "### Advantages\n", + "- **Higher Accuracy**: \n", + " Provides more precise relevance scores than comparing separately computed embeddings.\n", + "- **Deep Contextual Analysis**: \n", + " Explores semantic nuances between `query` and `document`.\n", + "\n", + "### Limitations\n", + "- **High Computational Costs**: \n", + " Each `query`-`document` pair requires its own forward pass through the model, so scoring is slow.\n", + "- **Scalability Issues**: \n", + " Not suitable for large-scale document collections without optimization.\n", + "\n", + "---\n", + 
"\n", + "## Practical Applications\n", + "- A **Bi-Encoder** quickly retrieves candidate `documents` by computing lightweight similarity scores. \n", + "- A **Cross Encoder** refines these results by deeply analyzing the semantic relationship between the `query` and the retrieved `documents`.\n", + "\n", + "---\n", + "\n", + "## Implementation\n", + "- Use Hugging Face cross encoder models, such as `BAAI/bge-reranker`.\n", + "- Easily integrate with frameworks like `LangChain` through the `CrossEncoderReranker` component.\n", + "\n", + "---\n", + "\n", + "## Key Advantages of Reranker\n", + "- **Precise Similarity Scoring**: \n", + " Delivers highly accurate measurements of relevance between the `query` and `documents`.\n", + "- **Semantic Depth**: \n", + " Analyzes deeper semantic relationships, uncovering nuances in `query-document` interactions.\n", + "- **Refined Search Quality**: \n", + " Improves the relevance and quality of the retrieved `documents`.\n", + "- **RAG System Boost**: \n", + " Enhances the performance of `Retrieval-Augmented Generation (RAG)` systems by refining input relevance.\n", + "- **Seamless Integration**: \n", + " Easily adaptable to various workflows and compatible with multiple frameworks.\n", + "- **Model Versatility**: \n", + " Offers flexibility with a wide range of pre-trained models for tailored use cases.\n", + "\n", + "---\n", + "\n", + "## Document Count Settings for Reranker\n", + "- Reranking is generally performed on the top `5–10` `documents` retrieved during the initial search.\n", + "- The ideal number of `documents` for reranking should be determined through experimentation and evaluation, as it depends on the dataset characteristics and computational resources available.\n", + "\n", + "---\n", + "\n", + "## Trade-offs When Using a Reranker\n", + "- **Accuracy vs Processing Time**: \n", + " Striking a balance between achieving higher accuracy and minimizing processing time.\n", + "- **Performance Improvement vs 
Computational Cost**: \n", + " Weighing the benefits of improved performance against the additional computational resources required.\n", + "- **Search Speed vs Relevance Accuracy**: \n", + " Managing the trade-off between faster retrieval and maintaining high relevance in results.\n", + "- **System Requirements**: \n", + " Ensuring the system meets the necessary hardware and software requirements to support reranking.\n", + "- **Dataset Characteristics**: \n", + " Considering the scale, diversity, and specific attributes of the `dataset` to optimize reranker performance.\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tk9Sg0AKi76R" + }, + "source": [ + "Explaining the Implementation of Cross Encoder Reranker with a Simple Example" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "_DlGOaxxozWp" + }, + "outputs": [], + "source": [ + "# Helper function to format and print document content\n", + "def pretty_print_docs(docs):\n", + " # Print each document in the list with a separator between them\n", + " print(\n", + " f\"\\n{'-' * 100}\\n\".join( # Separator line for better readability\n", + " [f\"Document {i+1}:\\n\\n\" + d.page_content for i, d in enumerate(docs)] # Format: Document number + content\n", + " )\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "obj9TBnHjFBt", + "outputId": "f3a0ecd4-e9ba-4bcf-a509-ef38631be18a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Document 1:\n", + "\n", + "Word2Vec\n", + "Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.\n", + "Example: In a Word2Vec model, \"king\" and \"queen\" are represented by vectors located close to each other.\n", + "Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity\n", + 
"----------------------------------------------------------------------------------------------------\n", + "Document 2:\n", + "\n", + "Token\n", + "Definition: A token refers to a smaller unit of text obtained by splitting a larger piece of text. It can be a word, phrase, or sentence.\n", + "Example: The sentence \"I go to school\" can be tokenized into \"I,\" \"go,\" \"to,\" and \"school.\"\n", + "Related Keywords: Tokenization, Natural Language Processing (NLP), Syntax Analysis\n", + "----------------------------------------------------------------------------------------------------\n", + "Document 3:\n", + "\n", + "Example: A customer information table in a relational database is an example of structured data.\n", + "Related Keywords: Database, Data Analysis, Data Modeling\n", + "----------------------------------------------------------------------------------------------------\n", + "Document 4:\n", + "\n", + "Schema\n", + "Definition: A schema defines the structure of a database or file, detailing how data is organized and stored.\n", + "Example: A relational database schema specifies column names, data types, and key constraints.\n", + "Related Keywords: Database, Data Modeling, Data Management\n", + "----------------------------------------------------------------------------------------------------\n", + "Document 5:\n", + "\n", + "Keyword Search\n", + "Definition: Keyword search involves finding information based on user-inputted keywords, commonly used in search engines and database systems.\n", + "Example: Searching \n", + "When a user searches for \"coffee shops in Seoul,\" the system returns a list of relevant coffee shops.\n", + "Related Keywords: Search Engine, Data Search, Information Retrieval\n", + "----------------------------------------------------------------------------------------------------\n", + "Document 6:\n", + "\n", + "TF-IDF (Term Frequency-Inverse Document Frequency)\n", + "Definition: TF-IDF is a statistical measure used to 
evaluate the importance of a word within a document by considering its frequency and rarity across a corpus.\n", + "Example: Words with high TF-IDF values are often unique and critical for understanding the document.\n", + "Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining\n", + "----------------------------------------------------------------------------------------------------\n", + "Document 7:\n", + "\n", + "SQL\n", + "Definition: SQL (Structured Query Language) is a programming language for managing data in databases. \n", + "It allows you to perform various operations such as querying, updating, inserting, and deleting data.\n", + "Example: SELECT * FROM users WHERE age > 18; retrieves information about users aged above 18.\n", + "Related Keywords: Database, Query, Data Management\n", + "----------------------------------------------------------------------------------------------------\n", + "Document 8:\n", + "\n", + "Open Source\n", + "Definition: Open source software allows its source code to be freely used, modified, and distributed, fostering collaboration and innovation.\n", + "Example: The Linux operating system is a well-known open source project.\n", + "Related Keywords: Software Development, Community, Technical Collaboration\n", + "Structured Data\n", + "Definition: Structured data is organized according to a specific format or schema, making it easy to search and analyze.\n", + "----------------------------------------------------------------------------------------------------\n", + "Document 9:\n", + "\n", + "Semantic Search\n", + "Definition: Semantic search is a search technique that understands the meaning of a user's query beyond simple keyword matching, returning results that are contextually relevant.\n", + "Example: If a user searches for \"planets in the solar system,\" the system provides information about planets like Jupiter and Mars.\n", + "Related Keywords: Natural Language Processing (NLP), 
Search Algorithms, Data Mining\n", + "----------------------------------------------------------------------------------------------------\n", + "Document 10:\n", + "\n", + "GPT (Generative Pretrained Transformer)\n", + "Definition: GPT is a generative language model pre-trained on vast datasets, capable of performing various text-based tasks. It generates natural and coherent text based on input.\n", + "Example: A chatbot generating detailed answers to user queries is powered by GPT models.\n", + "Related Keywords: Natural Language Processing (NLP), Text Generation, Deep Learning\n" + ] + } + ], + "source": [ + "from langchain_community.document_loaders import TextLoader\n", + "from langchain_community.vectorstores import FAISS\n", + "from langchain_huggingface import HuggingFaceEmbeddings\n", + "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", + "\n", + "# Load documents\n", + "documents = TextLoader(\"./data/appendix-keywords.txt\").load()\n", + "\n", + "# Configure text splitter\n", + "text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)\n", + "\n", + "# Split documents into chunks\n", + "texts = text_splitter.split_documents(documents)\n", + "\n", + "# Set up the embedding model\n", + "embeddings_model = HuggingFaceEmbeddings(\n", + " model_name=\"sentence-transformers/msmarco-distilbert-dot-v5\"\n", + ")\n", + "\n", + "# Create FAISS index from documents and set up retriever\n", + "retriever = FAISS.from_documents(texts, embeddings_model).as_retriever(\n", + " search_kwargs={\"k\": 10}\n", + ")\n", + "\n", + "# Define the query\n", + "query = \"Can you tell me about Word2Vec?\"\n", + "\n", + "# Execute the query and retrieve results\n", + "docs = retriever.invoke(query)\n", + "\n", + "# Display the retrieved documents\n", + "pretty_print_docs(docs)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T_6qFjxQpUc3" + }, + "source": [ + "Now, let's wrap the base retriever with a 
`ContextualCompressionRetriever`. The `CrossEncoderReranker` leverages `HuggingFaceCrossEncoder` to re-rank the retrieved results.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_D3uj-DOpwUt" + }, + "source": [ + "BGE reranker with multilingual support: [`bge-reranker-v2-m3`](https://huggingface.co/BAAI/bge-reranker-v2-m3)\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "f5yrJ-FTjWJB", + "outputId": "ddfebce9-e45a-4057-cadc-2767d7c26152" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Document 1:\n", + "\n", + "Word2Vec\n", + "Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.\n", + "Example: In a Word2Vec model, \"king\" and \"queen\" are represented by vectors located close to each other.\n", + "Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity\n", + "----------------------------------------------------------------------------------------------------\n", + "Document 2:\n", + "\n", + "Open Source\n", + "Definition: Open source software allows its source code to be freely used, modified, and distributed, fostering collaboration and innovation.\n", + "Example: The Linux operating system is a well-known open source project.\n", + "Related Keywords: Software Development, Community, Technical Collaboration\n", + "Structured Data\n", + "Definition: Structured data is organized according to a specific format or schema, making it easy to search and analyze.\n", + "----------------------------------------------------------------------------------------------------\n", + "Document 3:\n", + "\n", + "TF-IDF (Term Frequency-Inverse Document Frequency)\n", + "Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document by considering its frequency and rarity 
across a corpus.\n", + "Example: Words with high TF-IDF values are often unique and critical for understanding the document.\n", + "Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining\n" + ] + } + ], + "source": [ + "from langchain.retrievers import ContextualCompressionRetriever\n", + "from langchain.retrievers.document_compressors import CrossEncoderReranker\n", + "from langchain_community.cross_encoders import HuggingFaceCrossEncoder\n", + "\n", + "# Initialize the model\n", + "model = HuggingFaceCrossEncoder(model_name=\"BAAI/bge-reranker-v2-m3\")\n", + "\n", + "# Select the top 3 documents\n", + "compressor = CrossEncoderReranker(model=model, top_n=3)\n", + "\n", + "# Initialize the contextual compression retriever\n", + "compression_retriever = ContextualCompressionRetriever(\n", + " base_compressor=compressor, base_retriever=retriever\n", + ")\n", + "\n", + "# Retrieve compressed documents\n", + "compressed_docs = compression_retriever.invoke(\"Can you tell me about Word2Vec?\")\n", + "\n", + "# Display the documents\n", + "pretty_print_docs(compressed_docs)" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/11-Reranker/data/appendix-keywords.txt b/11-Reranker/data/appendix-keywords.txt new file mode 100644 index 000000000..67dede2a1 --- /dev/null +++ b/11-Reranker/data/appendix-keywords.txt @@ -0,0 +1,153 @@ +Semantic Search +Definition: Semantic search is a search technique that understands the meaning of a user's query beyond simple keyword matching, returning results that are contextually relevant. +Example: If a user searches for "planets in the solar system," the system provides information about planets like Jupiter and Mars. 
+Related Keywords: Natural Language Processing (NLP), Search Algorithms, Data Mining + +Embedding +Definition: Embedding is the process of converting textual data, such as words or sentences, into low-dimensional continuous vectors that computers can process and understand. +Example: The word "apple" can be represented as a vector like [0.65, -0.23, 0.17]. +Related Keywords: Natural Language Processing (NLP), Vectorization, Deep Learning + +Token +Definition: A token refers to a smaller unit of text obtained by splitting a larger piece of text. It can be a word, phrase, or sentence. +Example: The sentence "I go to school" can be tokenized into "I," "go," "to," and "school." +Related Keywords: Tokenization, Natural Language Processing (NLP), Syntax Analysis + +Tokenizer +Definition: A tokenizer is a tool that splits text data into tokens, often used for preprocessing in natural language processing tasks. +Example: The sentence "I love programming." is tokenized into ["I", "love", "programming", "."]. +Related Keywords: Tokenization, Natural Language Processing (NLP), Syntax Analysis. + +VectorStore +Definition: A VectorStore is a system designed to store data in vector format, enabling efficient retrieval, classification, and analysis tasks. +Example: Storing word embedding vectors in a database for quick access during semantic search. +Related Keywords: Embedding, Database, Vectorization + +SQL +Definition: SQL (Structured Query Language) is a programming language for managing data in databases. +It allows you to perform various operations such as querying, updating, inserting, and deleting data. +Example: SELECT * FROM users WHERE age > 18; retrieves information about users aged above 18. +Related Keywords: Database, Query, Data Management + +CSV +Definition: CSV (Comma-Separated Values) is a file format used for storing tabular data, where each value is separated by a comma. 
+Example: A CSV file with headers "Name, Age, Occupation" may contain data like "John, 30, Developer." +Related Keywords: Data Format, File Processing, Data Exchange + +JSON +Definition: JSON (JavaScript Object Notation) is a lightweight data-interchange format that represents data objects using readable text for both humans and machines. +Example: {"Name": "John", "Age": 30, "Occupation": "Developer"} is a JSON object. +Related Keywords: Data Exchange, Web Development, API + +Transformer +Definition: A Transformer is a type of deep learning model widely used in natural language processing tasks like translation, summarization, and text generation. It is based on the Attention mechanism. +Example: Google Translate utilizes a Transformer model for multilingual translation. +Related Keywords: Deep Learning, Natural Language Processing (NLP), Attention mechanism + +HuggingFace +Definition: HuggingFace is a library offering pre-trained models and tools for natural language processing, making NLP tasks accessible to researchers and developers. +Example: HuggingFace's Transformers library can be used for sentiment analysis and text generation. +Related Keywords: Natural Language Processing (NLP), Deep Learning, Library. + +Digital Transformation +Definition: Digital transformation refers to the integration of technology to innovate services, culture, and operations within a company, enhancing competitiveness and business models. +Example: A company adopting cloud computing to revolutionize data storage and processing demonstrates digital transformation. +Related Keywords: Innovation, Technology, Business Model + +Crawling +Definition: Crawling is the automated process of visiting web pages to gather data, commonly used for search engine optimization and data analysis. +Example: Google Search Engine crawls websites to collect and index content. 
+Related Keywords: Data Collection, Web Scraping, Search Engine + +Word2Vec +Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context. +Example: In a Word2Vec model, "king" and "queen" are represented by vectors located close to each other. +Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity + +LLM (Large Language Model) +Definition: LLMs are massive language models trained on large-scale text data, used for various natural language understanding and generation tasks. +Example: OpenAI's GPT series is a prominent example of LLMs. +Related Keywords: Natural Language Processing (NLP), Deep Learning, Text Generation + +FAISS (Facebook AI Similarity Search) +Definition: FAISS is a high-speed similarity search library developed by Facebook, optimized for searching large sets of vectors efficiently. +Example: FAISS can quickly find similar images among millions of image vectors. +Related Keywords: Vector Search, Machine Learning, Database Optimization + +Open Source +Definition: Open source software allows its source code to be freely used, modified, and distributed, fostering collaboration and innovation. +Example: The Linux operating system is a well-known open source project. +Related Keywords: Software Development, Community, Technical Collaboration +Structured Data +Definition: Structured data is organized according to a specific format or schema, making it easy to search and analyze. +Example: A customer information table in a relational database is an example of structured data. +Related Keywords: Database, Data Analysis, Data Modeling + +Parser +Definition: A parser analyzes input data (text, files, etc.) and converts it into a structured format, often used in programming language syntax analysis or file processing. +Example: Parsing an HTML document to generate its DOM structure is an instance of parsing. 
+Related Keywords: Syntax Analysis, Compiler, Data Processing + +TF-IDF (Term Frequency-Inverse Document Frequency) +Definition: TF-IDF is a statistical measure used to evaluate the importance of a word within a document by considering its frequency and rarity across a corpus. +Example: Words with high TF-IDF values are often unique and critical for understanding the document. +Related Keywords: Natural Language Processing (NLP), Information Retrieval, Data Mining + +Deep Learning +Definition: Deep learning is a subset of machine learning that uses neural networks to solve complex problems, focusing on learning high-level representations from data. +Example: Deep learning models are used for tasks like image recognition, speech recognition, and NLP. +Related Keywords: Artificial Neural Networks, Machine Learning, Data Analysis + +Schema +Definition: A schema defines the structure of a database or file, detailing how data is organized and stored. +Example: A relational database schema specifies column names, data types, and key constraints. +Related Keywords: Database, Data Modeling, Data Management + +DataFrame +Definition: A DataFrame is a tabular data structure with rows and columns, commonly used for data manipulation and analysis. +Example: In Python's Pandas library, a DataFrame can contain diverse data types and support various data operations. +Related Keywords: Data Analysis, Pandas, Data Processing + +Attention Mechanism +Definition: +The Attention mechanism is a technique in deep learning that allows models to focus more on important information. It is primarily used in processing sequential data such as text and time series. +Example: +In translation models, the Attention mechanism helps the model focus on relevant parts of the input sentence to generate accurate translations. 
+Related Keywords: Deep Learning, Natural Language Processing, Sequence Modeling + +Pandas +Definition: Pandas is a Python library offering tools for efficient data manipulation and analysis. It simplifies complex data operations. +Example: Pandas can be used to load, clean, and analyze CSV files. +Related Keywords: Data Analysis, Python, Data Processing + +GPT (Generative Pretrained Transformer) +Definition: GPT is a generative language model pre-trained on vast datasets, capable of performing various text-based tasks. It generates natural and coherent text based on input. +Example: A chatbot generating detailed answers to user queries is powered by GPT models. +Related Keywords: Natural Language Processing (NLP), Text Generation, Deep Learning + +InstructGPT +Definition: +InstructGPT is an optimized GPT model designed to perform specific tasks based on user instructions. It is built to generate more accurate and relevant results in response to given commands. +Example: When a user provides a specific instruction like "Draft an email," InstructGPT generates an email based on the provided content. +Related Keywords: Artificial Intelligence, Natural Language Understanding, Command-Based Processing + +Keyword Search +Definition: Keyword search involves finding information based on user-inputted keywords, commonly used in search engines and database systems. +Example: Searching +When a user searches for "coffee shops in Seoul," the system returns a list of relevant coffee shops. +Related Keywords: Search Engine, Data Search, Information Retrieval + +Page Rank +Definition: Page Rank is an algorithm for evaluating the importance of web pages, primarily used to rank search engine results. It analyzes the link structure of websites. +Example: Google uses Page Rank to determine the order of search results. 
+Related Keywords: Search Engine Optimization, Web Analytics, Link Analysis + +Data Mining +Definition: Data mining is the process of extracting useful information from large datasets using techniques like statistics, machine learning, and pattern recognition. +Example: Retailers analyzing customer purchase data to devise sales strategies is an application of data mining. +Related Keywords: Big Data, Pattern Recognition, Predictive Analytics + +Multimodal +Definition: Multimodal refers to combining and processing multiple types of data (e.g., text, images, and sound) to extract richer insights and predictions. +Example: A system analyzing both images and captions to perform accurate image classification demonstrates multimodal technology. +Related Keywords: Data Fusion, Artificial Intelligence, Deep Learning \ No newline at end of file