301 changes: 301 additions & 0 deletions 08-EMBEDDING/07-GPT4ALLEmbedding.ipynb
@@ -0,0 +1,301 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "635d8ebb",
"metadata": {},
"source": [
"# GPT4ALL\n",
"\n",
"- Author: [Do Woung Kong](https://github.com/krkrong)\n",
"- Design: \n",
"- Peer Review: \n",
"- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
"\n",
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [![Open in LangChain Academy](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/66e9eba12c7b7688aa3dbb5e_LCA-badge-green.svg)](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)\n",
"\n",
"## Overview\n",
"\n",
"`GPT4All` is a local execution-based privacy chatbot that can be used for free.\n",
"\n",
"No GPU or internet connection is required, and `GPT4All` offers popular models such as GPT4All Falcon, Wizard, and its own models.\n",
"\n",
"This notebook explains how to use `GPT4Allembeddings` with `LangChain`.\n",
"\n",
"### Table of Contents\n",
"\n",
"- [Overview](#overview)\n",
"- [Environement Setup](#environment-setup)\n",
"- [Install Python Binding for GPT4All](#create-a-basic-pdf-based-retrieval-chain)\n",
"- [Embed the Textual Data](#query-routing-and-document-evaluation)\n",
"\n",
"\n",
"### References\n",
"\n",
"- [GPT4All docs](https://docs.gpt4all.io/gpt4all_python_embedding.html#gpt4all.gpt4all.Embed4All)\n",
"- [GPT4AllEmbeddings](https://python.langchain.com/api_reference/community/embeddings/langchain_community.embeddings.gpt4all.GPT4AllEmbeddings.html#langchain_community.embeddings.gpt4all.GPT4AllEmbeddings)"
]
},
{
"cell_type": "markdown",
"id": "c6c7aba4",
"metadata": {},
"source": [
"## Environment Setup\n",
"\n",
"Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
"\n",
"**[Note]**\n",
"- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n",
"- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "21943adb",
"metadata": {},
"outputs": [],
"source": [
"%%capture --no-stderr\n",
"!pip install langchain-opentutorial"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f25ec196",
"metadata": {},
"outputs": [],
"source": [
"# Install required packages\n",
"from langchain_opentutorial import package\n",
"\n",
"package.install(\n",
" [\"langchain_community\"],\n",
" verbose=False,\n",
" upgrade=False,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "aa00c3f4",
"metadata": {},
"source": [
"## Install Python Binding for GPT4All\n",
"\n",
"Before diving into the practical exercises, you need to install the Python bindings for `GPT4All`.\n",
"\n",
"Python bindings allow a Python program to interface with external libraries or tools, enabling seamless integration and usage of functionalities provided by those external resources.\n",
"\n",
"To install the Python bindings for `GPT4All`, run the following command:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "07c0a550",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install --upgrade --quiet gpt4all > /dev/null"
]
},
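{
"cell_type": "markdown",
"id": "1a2b3c4d",
"metadata": {},
"source": [
"(Optional) Before moving on to the LangChain wrapper, you can sanity-check the installation with the raw `gpt4all` binding itself. The sketch below uses the `Embed4All` class described in the GPT4All docs linked in the References; the sample sentence is only illustrative, and the first call may download the default embedding model to your local cache."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e6f7a8b",
"metadata": {},
"outputs": [],
"source": [
"# (Optional) Sanity check of the raw gpt4all binding before using the LangChain wrapper.\n",
"from gpt4all import Embed4All\n",
"\n",
"embedder = Embed4All()  # downloads the default embedding model on first use\n",
"vector = embedder.embed(\"Hello, GPT4All embeddings!\")  # illustrative sample text\n",
"len(vector)  # dimensionality of the returned embedding vector"
]
},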
{
"cell_type": "markdown",
"id": "7e566192",
"metadata": {},
"source": [
"Import the `GPT4AllEmbeddings` class from the `langchain_community.embeddings` module.\n",
"\n",
"The `GPT4AllEmbeddings` class provides functionality to embed text data into vectors using the GPT4All model.\n",
"\n",
"This class implements the embedding interface of the LangChain framework, allowing it to be used seamlessly with LangChain's various features."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5e20fcca",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.embeddings import GPT4AllEmbeddings"
]
},
{
"cell_type": "markdown",
"id": "d4cc5b5b",
"metadata": {},
"source": [
"GPT4All supports the generation of high-quality embeddings for text documents of arbitrary length using a contrastive learning sentence transformer optimized for CPUs. These embeddings offer a quality comparable to many tasks using OpenAI models.\n",
"\n",
"An instance of the `GPT4AllEmbeddings` class is created.\n",
"\n",
"- The `GPT4AllEmbeddings` class is an embedding model that uses the GPT4All model to transform text data into vectors. \n",
"\n",
"- In this code, the `gpt4all_embd` variable is assigned an instance of `GPT4AllEmbeddings`. \n",
"\n",
"- You can then use `gpt4all_embd` to convert text data into vectors."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "04c9e49c",
"metadata": {},
"outputs": [],
"source": [
"# Create a GPT4All embedding object\n",
"gpt4all_embd = GPT4AllEmbeddings()"
]
},
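{
"cell_type": "markdown",
"id": "2c3d4e5f",
"metadata": {},
"source": [
"Calling `GPT4AllEmbeddings()` with no arguments uses the default embedding model. If you want to pin a specific model instead, the sketch below passes the `model_name` and `gpt4all_kwargs` parameters of `GPT4AllEmbeddings` (see the API reference linked in the References); the GGUF file name shown is only an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a7b8c9d",
"metadata": {},
"outputs": [],
"source": [
"# (Optional) Pin a specific embedding model instead of relying on the default.\n",
"custom_embd = GPT4AllEmbeddings(\n",
"    model_name=\"all-MiniLM-L6-v2.gguf2.f16.gguf\",  # example GGUF embedding model\n",
"    gpt4all_kwargs={\"allow_download\": True},  # let gpt4all fetch the model if it is missing\n",
")\n",
"\n",
"# Quick check that the pinned model produces a vector.\n",
"len(custom_embd.embed_query(\"Quick test sentence.\"))"
]
},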
{
"cell_type": "markdown",
"id": "f1dc8496",
"metadata": {},
"source": [
"Assign the string \"This is a sample sentence for testing embeddings.\" to the `text` variable."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "2584f975",
"metadata": {},
"outputs": [],
"source": [
"# Define a sample document text for testing\n",
"text = \"This is a sample sentence for testing embeddings.\""
]
},
{
"cell_type": "markdown",
"id": "2b2fc536",
"metadata": {},
"source": [
"## Embed the Textual Data\n",
"\n",
"\n",
"The process of embedding text data is as follows:\n",
"\n",
"First, the text data is tokenized and converted into numerical form. \n",
"\n",
"During this step, a pre-trained tokenizer is used to split the text into tokens and map each token to a unique integer. \n",
"\n",
"Next, the tokenized data is input into an embedding layer, where it is transformed into high-dimensional dense vectors. \n",
"\n",
"In this process, each token is represented as a vector of real numbers that capture the token's meaning and context. \n",
"\n",
"Finally, the embedded vectors can be used in various natural language processing tasks. \n",
"\n",
"For example, they can serve as input data for tasks such as document classification, sentiment analysis, and machine translation, enhancing model performance. \n",
"\n",
"This process of text data embedding plays a crucial role in natural language processing, making it essential for efficiently processing and analyzing large amounts of text data.\n",
"\n",
"Use the `embed_query` method of the `gpt4all_embd` object to embed the given text (`text`). \n",
"\n",
"- The `text` variable stores the text to be embedded. \n",
"- The `gpt4all_embd` object uses the GPT4All model to perform text embedding. \n",
"- The `embed_query` method converts the given text into a vector format and returns it. \n",
"- The embedding result is stored in the `query_result` variable."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "e98a28df",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"384"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Generate query embeddings for the given text.\n",
"query_result = gpt4all_embd.embed_query(text)\n",
"\n",
"# Check the dimensions of the embedded space.\n",
"len(query_result)"
]
},
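{
"cell_type": "markdown",
"id": "7d8e9f0a",
"metadata": {},
"source": [
"To see how such vectors are typically used, the sketch below compares the embedded sample sentence against two other illustrative sentences using cosine similarity, computed with plain `numpy` rather than any LangChain utility. A semantically closer sentence should score higher."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e9f0a1b",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def cosine_similarity(a, b):\n",
"    # Cosine similarity between two embedding vectors.\n",
"    a, b = np.array(a), np.array(b)\n",
"    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))\n",
"\n",
"\n",
"# Illustrative comparison sentences\n",
"similar_vec = gpt4all_embd.embed_query(\"This sentence is a sample used to test embeddings.\")\n",
"different_vec = gpt4all_embd.embed_query(\"The weather in Paris is rainy today.\")\n",
"\n",
"print(\"similar  :\", cosine_similarity(query_result, similar_vec))\n",
"print(\"different:\", cosine_similarity(query_result, different_vec))"
]
},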
{
"cell_type": "markdown",
"id": "1ba709ff",
"metadata": {},
"source": [
"You can use the `embed_documents` function to embed multiple text fragments.\n",
"\n",
"Use the `embed_documents` method of the `gpt4all_embd` object to embed the `text` document.\n",
"\n",
"- Wrap the `text` document in a list and pass it as an argument to the `embed_documents` method. \n",
"- The `embed_documents` method calculates and returns the embedding vector of the document. \n",
"- The resulting embedding vector is stored in the `doc_result` variable.\n",
"\n",
"Additionally, these embeddings can be mapped with Nomic's Atlas (https://docs.nomic.ai/index.html) to visualize the data."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "8444d5b3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"384"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Generate query embeddings for the given text.\n",
"doc_result = gpt4all_embd.embed_documents([text])\n",
"\n",
"# Check the dimensions of the embedded space.\n",
"len(doc_result[0])"
]
}
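,
{
"cell_type": "markdown",
"id": "9f0a1b2c",
"metadata": {},
"source": [
"The cell above passes a single-item list, but `embed_documents` accepts any number of texts. The sketch below embeds a few illustrative sentences in one call and confirms that one vector of the same dimensionality is returned per input."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a1b2c3d",
"metadata": {},
"outputs": [],
"source": [
"# Embed several illustrative documents in one call.\n",
"docs = [\n",
"    \"GPT4All runs language models locally on the CPU.\",\n",
"    \"Embeddings map text to dense numeric vectors.\",\n",
"    \"Paris is the capital of France.\",\n",
"]\n",
"\n",
"doc_vectors = gpt4all_embd.embed_documents(docs)\n",
"\n",
"# One embedding vector per input document, each with the same dimensionality.\n",
"len(doc_vectors), len(doc_vectors[0])"
]
}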
],
"metadata": {
"kernelspec": {
"display_name": "langchain-kr-lwwSZlnu-py3.11",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}