301 changes: 301 additions & 0 deletions 08-EMBEDDING/07-GPT4ALLEmbedding.ipynb
@@ -0,0 +1,301 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "635d8ebb",
"metadata": {},
"source": [
"# GPT4ALL\n",
"\n",
"- Author: [Do Woung Kong](https://github.com/krkrong)\n",
"- Design: \n",
"- Peer Review: \n",
"- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
"\n",
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [![Open in LangChain Academy](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/66e9eba12c7b7688aa3dbb5e_LCA-badge-green.svg)](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)\n",
"\n",
"## Overview\n",
"\n",
"`GPT4All` is a local execution-based privacy chatbot that can be used for free.\n",
"\n",
"No GPU or internet connection is required, and `GPT4All` offers popular models such as GPT4All Falcon, Wizard, and its own models.\n",
"\n",
"This notebook explains how to use `GPT4Allembeddings` with `LangChain`.\n",
"\n",
"### Table of Contents\n",
"\n",
"- [Overview](#overview)\n",
"- [Environement Setup](#environment-setup)\n",
"- [Install Python Binding for GPT4All](#create-a-basic-pdf-based-retrieval-chain)\n",
"- [Embed the Textual Data](#query-routing-and-document-evaluation)\n",
"\n",
"\n",
"### References\n",
"\n",
"- [GPT4All docs](https://docs.gpt4all.io/gpt4all_python_embedding.html#gpt4all.gpt4all.Embed4All)\n",
"- [GPT4AllEmbeddings](https://python.langchain.com/api_reference/community/embeddings/langchain_community.embeddings.gpt4all.GPT4AllEmbeddings.html#langchain_community.embeddings.gpt4all.GPT4AllEmbeddings)"
]
},
{
"cell_type": "markdown",
"id": "c6c7aba4",
"metadata": {},
"source": [
"## Environment Setup\n",
"\n",
"Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
"\n",
"**[Note]**\n",
"- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n",
"- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "21943adb",
"metadata": {},
"outputs": [],
"source": [
"%%capture --no-stderr\n",
"!pip install langchain-opentutorial"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f25ec196",
"metadata": {},
"outputs": [],
"source": [
"# Install required packages\n",
"from langchain_opentutorial import package\n",
"\n",
"package.install(\n",
" [\"langchain_community\"],\n",
" verbose=False,\n",
" upgrade=False,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "aa00c3f4",
"metadata": {},
"source": [
"## Install Python Binding for GPT4All\n",
"\n",
"Before diving into the practical exercises, you need to install the Python bindings for `GPT4All`.\n",
"\n",
"Python bindings allow a Python program to interface with external libraries or tools, enabling seamless integration and usage of functionalities provided by those external resources.\n",
"\n",
"To install the Python bindings for `GPT4All`, run the following command:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "07c0a550",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install --upgrade --quiet gpt4all > /dev/null"
]
},
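{
"cell_type": "markdown",
"id": "1a2b3c4d",
"metadata": {},
"source": [
"(Optional) Before moving on to the LangChain wrapper, you can sanity-check the installation with the raw `gpt4all` binding itself. The sketch below uses the `Embed4All` class described in the GPT4All docs linked in the References; the sample sentence is only illustrative, and the first call may download the default embedding model to your local cache."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e6f7a8b",
"metadata": {},
"outputs": [],
"source": [
"# (Optional) Sanity check of the raw gpt4all binding before using the LangChain wrapper.\n",
"from gpt4all import Embed4All\n",
"\n",
"embedder = Embed4All()  # downloads the default embedding model on first use\n",
"vector = embedder.embed(\"Hello, GPT4All embeddings!\")  # illustrative sample text\n",
"len(vector)  # dimensionality of the returned embedding vector"
]
},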
{
"cell_type": "markdown",
"id": "7e566192",
"metadata": {},
"source": [
"Import the `GPT4AllEmbeddings` class from the `langchain_community.embeddings` module.\n",
"\n",
"The `GPT4AllEmbeddings` class provides functionality to embed text data into vectors using the GPT4All model.\n",
"\n",
"This class implements the embedding interface of the LangChain framework, allowing it to be used seamlessly with LangChain's various features."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5e20fcca",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.embeddings import GPT4AllEmbeddings"
]
},
{
"cell_type": "markdown",
"id": "d4cc5b5b",
"metadata": {},
"source": [
"GPT4All supports the generation of high-quality embeddings for text documents of arbitrary length using a contrastive learning sentence transformer optimized for CPUs. These embeddings offer a quality comparable to many tasks using OpenAI models.\n",
"\n",
"An instance of the `GPT4AllEmbeddings` class is created.\n",
"\n",
"- The `GPT4AllEmbeddings` class is an embedding model that uses the GPT4All model to transform text data into vectors. \n",
"\n",
"- In this code, the `gpt4all_embd` variable is assigned an instance of `GPT4AllEmbeddings`. \n",
"\n",
"- You can then use `gpt4all_embd` to convert text data into vectors."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "04c9e49c",
"metadata": {},
"outputs": [],
"source": [
"# Create a GPT4All embedding object\n",
"gpt4all_embd = GPT4AllEmbeddings()"
]
},
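{
"cell_type": "markdown",
"id": "2c3d4e5f",
"metadata": {},
"source": [
"Calling `GPT4AllEmbeddings()` with no arguments uses the default embedding model. If you want to pin a specific model instead, the sketch below passes the `model_name` and `gpt4all_kwargs` parameters of `GPT4AllEmbeddings` (see the API reference linked in the References); the GGUF file name shown is only an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a7b8c9d",
"metadata": {},
"outputs": [],
"source": [
"# (Optional) Pin a specific embedding model instead of relying on the default.\n",
"custom_embd = GPT4AllEmbeddings(\n",
"    model_name=\"all-MiniLM-L6-v2.gguf2.f16.gguf\",  # example GGUF embedding model\n",
"    gpt4all_kwargs={\"allow_download\": True},  # let gpt4all fetch the model if it is missing\n",
")\n",
"\n",
"# Quick check that the pinned model produces a vector.\n",
"len(custom_embd.embed_query(\"Quick test sentence.\"))"
]
},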
{
"cell_type": "markdown",
"id": "f1dc8496",
"metadata": {},
"source": [
"Assign the string \"This is a sample sentence for testing embeddings.\" to the `text` variable."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "2584f975",
"metadata": {},
"outputs": [],
"source": [
"# Define a sample document text for testing\n",
"text = \"This is a sample sentence for testing embeddings.\""
]
},
{
"cell_type": "markdown",
"id": "2b2fc536",
"metadata": {},
"source": [
"## Embed the Textual Data\n",
"\n",
"\n",
"The process of embedding text data is as follows:\n",
"\n",
"First, the text data is tokenized and converted into numerical form. \n",
"\n",
"During this step, a pre-trained tokenizer is used to split the text into tokens and map each token to a unique integer. \n",
"\n",
"Next, the tokenized data is input into an embedding layer, where it is transformed into high-dimensional dense vectors. \n",
"\n",
"In this process, each token is represented as a vector of real numbers that capture the token's meaning and context. \n",
"\n",
"Finally, the embedded vectors can be used in various natural language processing tasks. \n",
"\n",
"For example, they can serve as input data for tasks such as document classification, sentiment analysis, and machine translation, enhancing model performance. \n",
"\n",
"This process of text data embedding plays a crucial role in natural language processing, making it essential for efficiently processing and analyzing large amounts of text data.\n",
"\n",
"Use the `embed_query` method of the `gpt4all_embd` object to embed the given text (`text`). \n",
"\n",
"- The `text` variable stores the text to be embedded. \n",
"- The `gpt4all_embd` object uses the GPT4All model to perform text embedding. \n",
"- The `embed_query` method converts the given text into a vector format and returns it. \n",
"- The embedding result is stored in the `query_result` variable."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "e98a28df",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"384"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Generate query embeddings for the given text.\n",
"query_result = gpt4all_embd.embed_query(text)\n",
"\n",
"# Check the dimensions of the embedded space.\n",
"len(query_result)"
]
},
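{
"cell_type": "markdown",
"id": "7d8e9f0a",
"metadata": {},
"source": [
"To see how such vectors are typically used, the sketch below compares the embedded sample sentence against two other illustrative sentences using cosine similarity, computed with plain `numpy` rather than any LangChain utility. A semantically closer sentence should score higher."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e9f0a1b",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def cosine_similarity(a, b):\n",
"    # Cosine similarity between two embedding vectors.\n",
"    a, b = np.array(a), np.array(b)\n",
"    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))\n",
"\n",
"\n",
"# Illustrative comparison sentences\n",
"similar_vec = gpt4all_embd.embed_query(\"This sentence is a sample used to test embeddings.\")\n",
"different_vec = gpt4all_embd.embed_query(\"The weather in Paris is rainy today.\")\n",
"\n",
"print(\"similar  :\", cosine_similarity(query_result, similar_vec))\n",
"print(\"different:\", cosine_similarity(query_result, different_vec))"
]
},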
{
"cell_type": "markdown",
"id": "1ba709ff",
"metadata": {},
"source": [
"You can use the `embed_documents` function to embed multiple text fragments.\n",
"\n",
"Use the `embed_documents` method of the `gpt4all_embd` object to embed the `text` document.\n",
"\n",
"- Wrap the `text` document in a list and pass it as an argument to the `embed_documents` method. \n",
"- The `embed_documents` method calculates and returns the embedding vector of the document. \n",
"- The resulting embedding vector is stored in the `doc_result` variable.\n",
"\n",
"Additionally, these embeddings can be mapped with Nomic's Atlas (https://docs.nomic.ai/index.html) to visualize the data."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "8444d5b3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"384"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Generate query embeddings for the given text.\n",
"doc_result = gpt4all_embd.embed_documents([text])\n",
"\n",
"# Check the dimensions of the embedded space.\n",
"len(doc_result[0])"
]
}
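,
{
"cell_type": "markdown",
"id": "9f0a1b2c",
"metadata": {},
"source": [
"The cell above passes a single-item list, but `embed_documents` accepts any number of texts. The sketch below embeds a few illustrative sentences in one call and confirms that one vector of the same dimensionality is returned per input."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a1b2c3d",
"metadata": {},
"outputs": [],
"source": [
"# Embed several illustrative documents in one call.\n",
"docs = [\n",
"    \"GPT4All runs language models locally on the CPU.\",\n",
"    \"Embeddings map text to dense numeric vectors.\",\n",
"    \"Paris is the capital of France.\",\n",
"]\n",
"\n",
"doc_vectors = gpt4all_embd.embed_documents(docs)\n",
"\n",
"# One embedding vector per input document, each with the same dimensionality.\n",
"len(doc_vectors), len(doc_vectors[0])"
]
}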
],
"metadata": {
"kernelspec": {
"display_name": "langchain-kr-lwwSZlnu-py3.11",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}