From 12bf485f8c454f3daeac94ab057a2862551df0af Mon Sep 17 00:00:00 2001 From: krkrong Date: Thu, 2 Jan 2025 03:22:06 +0900 Subject: [PATCH 1/2] Create 06-GPT4ALLEmbedding.ipynb [N-3] 08-Embedding / 06-GPT4ALLEmbedding --- 08-EMBEDDING/06-GPT4ALLEmbedding.ipynb | 301 +++++++++++++++++++++++++ 1 file changed, 301 insertions(+) create mode 100644 08-EMBEDDING/06-GPT4ALLEmbedding.ipynb diff --git a/08-EMBEDDING/06-GPT4ALLEmbedding.ipynb b/08-EMBEDDING/06-GPT4ALLEmbedding.ipynb new file mode 100644 index 000000000..2b0e6a985 --- /dev/null +++ b/08-EMBEDDING/06-GPT4ALLEmbedding.ipynb @@ -0,0 +1,301 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "635d8ebb", + "metadata": {}, + "source": [ + "# GPT4ALL\n", + "\n", + "- Author: [Do Woung Kong](https://github.com/krkrong)\n", + "- Design: \n", + "- Peer Review: \n", + "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n", + "\n", + "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [![Open in LangChain Academy](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/66e9eba12c7b7688aa3dbb5e_LCA-badge-green.svg)](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)\n", + "\n", + "## Overview\n", + "\n", + "`GPT4All` is a free-to-use, privacy-focused chatbot that runs locally.\n", + "\n", + "No GPU or internet connection is required, and `GPT4All` offers popular models such as GPT4All Falcon, Wizard, and its own models.\n", + "\n", + "This notebook explains how to use `GPT4AllEmbeddings` with `LangChain`.\n", + "\n", + "### Table of Contents\n", + "\n", + "- [Overview](#overview)\n", + "- [Environment Setup](#environment-setup)\n", + "- [Install Python Binding for GPT4All](#install-python-binding-for-gpt4all)\n", + "- [Embed the Textual 
Data](#embed-the-textual-data)\n", + "\n", + "\n", + "### References\n", + "\n", + "- [GPT4All docs](https://docs.gpt4all.io/gpt4all_python_embedding.html#gpt4all.gpt4all.Embed4All)\n", + "- [GPT4AllEmbeddings](https://python.langchain.com/api_reference/community/embeddings/langchain_community.embeddings.gpt4all.GPT4AllEmbeddings.html#langchain_community.embeddings.gpt4all.GPT4AllEmbeddings)" + ] + }, + { + "cell_type": "markdown", + "id": "c6c7aba4", + "metadata": {}, + "source": [ + "## Environment Setup\n", + "\n", + "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n", + "\n", + "**[Note]**\n", + "- `langchain-opentutorial` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials. \n", + "- You can check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "21943adb", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture --no-stderr\n", + "!pip install langchain-opentutorial" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "f25ec196", + "metadata": {}, + "outputs": [], + "source": [ + "# Install required packages\n", + "from langchain_opentutorial import package\n", + "\n", + "package.install(\n", + " [\"langchain_community\"],\n", + " verbose=False,\n", + " upgrade=False,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "aa00c3f4", + "metadata": {}, + "source": [ + "## Install Python Binding for GPT4All\n", + "\n", + "Before diving into the practical exercises, you need to install the Python bindings for `GPT4All`.\n", + "\n", + "Python bindings allow a Python program to interface with external libraries or tools, enabling seamless integration and usage of functionalities provided by those external resources.\n", + "\n", + "To install the Python bindings for 
`GPT4All`, run the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "07c0a550", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install --upgrade --quiet gpt4all > /dev/null" + ] + }, + { + "cell_type": "markdown", + "id": "7e566192", + "metadata": {}, + "source": [ + "Import the `GPT4AllEmbeddings` class from the `langchain_community.embeddings` module.\n", + "\n", + "The `GPT4AllEmbeddings` class provides functionality to embed text data into vectors using the GPT4All model.\n", + "\n", + "This class implements the embedding interface of the LangChain framework, allowing it to be used seamlessly with LangChain's various features." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "5e20fcca", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_community.embeddings import GPT4AllEmbeddings" + ] + }, + { + "cell_type": "markdown", + "id": "d4cc5b5b", + "metadata": {}, + "source": [ + "GPT4All supports the generation of high-quality embeddings for text documents of arbitrary length using a contrastive learning sentence transformer optimized for CPUs. For many tasks, these embeddings are comparable in quality to those produced by OpenAI models.\n", + "\n", + "Create an instance of the `GPT4AllEmbeddings` class.\n", + "\n", + "- The `GPT4AllEmbeddings` class is an embedding model that uses the GPT4All model to transform text data into vectors. \n", + "\n", + "- In this code, the `gpt4all_embd` variable is assigned an instance of `GPT4AllEmbeddings`. \n", + "\n", + "- You can then use `gpt4all_embd` to convert text data into vectors." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "04c9e49c", + "metadata": {}, + "outputs": [], + "source": [ + "# Create a GPT4All embedding object\n", + "gpt4all_embd = GPT4AllEmbeddings()" + ] + }, + { + "cell_type": "markdown", + "id": "f1dc8496", + "metadata": {}, + "source": [ + "Assign the string \"This is a sample sentence for testing embeddings.\" to the `text` variable." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "2584f975", + "metadata": {}, + "outputs": [], + "source": [ + "# Define a sample document text for testing\n", + "text = \"This is a sample sentence for testing embeddings.\"" + ] + }, + { + "cell_type": "markdown", + "id": "2b2fc536", + "metadata": {}, + "source": [ + "## Embed the Textual Data\n", + "\n", + "\n", + "The process of embedding text data is as follows:\n", + "\n", + "First, the text data is tokenized and converted into numerical form. \n", + "\n", + "During this step, a pre-trained tokenizer is used to split the text into tokens and map each token to a unique integer. \n", + "\n", + "Next, the tokenized data is input into an embedding layer, where it is transformed into high-dimensional dense vectors. \n", + "\n", + "In this process, each token is represented as a vector of real numbers that capture the token's meaning and context. \n", + "\n", + "Finally, the embedded vectors can be used in various natural language processing tasks. \n", + "\n", + "For example, they can serve as input data for tasks such as document classification, sentiment analysis, and machine translation, enhancing model performance. \n", + "\n", + "This process of text data embedding plays a crucial role in natural language processing, making it essential for efficiently processing and analyzing large amounts of text data.\n", + "\n", + "Use the `embed_query` method of the `gpt4all_embd` object to embed the given text (`text`). \n", + "\n", + "- The `text` variable stores the text to be embedded. 
\n", + "- The `gpt4all_embd` object uses the GPT4All model to perform text embedding. \n", + "- The `embed_query` method converts the given text into a vector format and returns it. \n", + "- The embedding result is stored in the `query_result` variable." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "e98a28df", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "384" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Generate query embeddings for the given text.\n", + "query_result = gpt4all_embd.embed_query(text)\n", + "\n", + "# Check the dimensionality of the embedding vector.\n", + "len(query_result)" + ] + }, + { + "cell_type": "markdown", + "id": "1ba709ff", + "metadata": {}, + "source": [ + "You can use the `embed_documents` method to embed multiple texts at once.\n", + "\n", + "Use the `embed_documents` method of the `gpt4all_embd` object to embed the `text` document.\n", + "\n", + "- Wrap the `text` document in a list and pass it as an argument to the `embed_documents` method. \n", + "- The `embed_documents` method returns a list of embedding vectors, one per input document. \n", + "- The resulting list is stored in the `doc_result` variable.\n", + "\n", + "Additionally, these embeddings can be mapped with Nomic's Atlas (https://docs.nomic.ai/index.html) to visualize the data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "8444d5b3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "384" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Generate document embeddings for the given text.\n", + "doc_result = gpt4all_embd.embed_documents([text])\n", + "\n", + "# Check the dimensionality of the first document's embedding.\n", + "len(doc_result[0])" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "langchain-kr-lwwSZlnu-py3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From a023b76adb118c118c88960f3f6cc3f4c8ea0e91 Mon Sep 17 00:00:00 2001 From: krkrong <101310738+krkrong@users.noreply.github.com> Date: Thu, 2 Jan 2025 21:36:20 +0900 Subject: [PATCH 2/2] Rename 06-GPT4ALLEmbedding.ipynb to 07-GPT4ALLEmbedding.ipynb --- .../{06-GPT4ALLEmbedding.ipynb => 07-GPT4ALLEmbedding.ipynb} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename 08-EMBEDDING/{06-GPT4ALLEmbedding.ipynb => 07-GPT4ALLEmbedding.ipynb} (100%) diff --git a/08-EMBEDDING/06-GPT4ALLEmbedding.ipynb b/08-EMBEDDING/07-GPT4ALLEmbedding.ipynb similarity index 100% rename from 08-EMBEDDING/06-GPT4ALLEmbedding.ipynb rename to 08-EMBEDDING/07-GPT4ALLEmbedding.ipynb