LangChain-OpenTutorial · teddylee777 · Feb 8, 2025 · Jan 14, 2025 · Jan 15, 2025 · Jan 15, 2025
diff --git a/16-Evaluations/03-HF-Upload.ipynb b/16-Evaluations/03-HF-Upload.ipynb
@@ -0,0 +1,371 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# HF-Upload\n",
+    "\n",
+    "- Author: [Sun Hyoung Lee](https://github.com/LEE1026icarus)\n",
+    "- Design: \n",
+    "- Peer Review : \n",
+    "- Proofread:\n",
+    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
+    "\n",
+    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/16-Evaluations/03-HF-Upload.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/16-Evaluations/03-HF-Upload.ipynb)\n",
+    "\n",
+    "## Overview\n",
+    "\n",
+    "This tutorial loads a local CSV file, converts it to a Hugging Face `Dataset`, and uploads it to the Hugging Face Hub as a private dataset. Once uploaded, the dataset can be shared and accessed through the Hugging Face infrastructure.\n",
+    "\n",
+    "### Table of Contents\n",
+    "\n",
+    "- [Overview](#overview)\n",
+    "- [Environment Setup](#environment-setup)\n",
+    "- [Upload Generated Dataset](#upload-generated-dataset)\n",
+    "- [Upload to HuggingFace Dataset](#upload-to-huggingface-dataset)\n",
+    "\n",
+    "\n",
+    "### References\n",
+    "- [Huggingface / Share a dataset to the Hub](https://huggingface.co/docs/datasets/upload_dataset)\n",
+    "---\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Environment Setup\n",
+    "\n",
+    "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
+    "\n",
+    " **[Note]** \n",
+    "- `langchain-opentutorial` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.\n",
+    "- Check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.\n",
+    "\n",
+    "### API Key Configuration\n",
+    "To upload a dataset to the Hugging Face Hub, you need to [obtain a Hugging Face write token](https://huggingface.co/settings/tokens).\n",
+    "\n",
+    "Once you have your token, set it as the value of the environment variable `HUGGINGFACEHUB_API_TOKEN`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture --no-stderr\n",
+    "%pip install langchain-opentutorial"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install required packages\n",
+    "from langchain_opentutorial import package\n",
+    "\n",
+    "package.install(\n",
+    "    [\"datasets\"],\n",
+    "    verbose=False,\n",
+    "    upgrade=False,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can set API keys in a `.env` file or set them manually.\n",
+    "\n",
+    "[Note] If you’re not using the `.env` file, no worries! Just enter the keys directly in the cell below, and you’re good to go.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dotenv import load_dotenv\n",
+    "from langchain_opentutorial import set_env\n",
+    "\n",
+    "# Attempt to load environment variables from a .env file; if unsuccessful, set them manually.\n",
+    "if not load_dotenv(override=True):\n",
+    "    set_env(\n",
+    "        {\n",
+    "            \"LANGCHAIN_API_KEY\": \"\",\n",
+    "            \"LANGCHAIN_TRACING_V2\": \"true\",\n",
+    "            \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n",
+    "            \"LANGCHAIN_PROJECT\": \"\", # set the project name same as the title\n",
+    "            \"HUGGINGFACEHUB_API_TOKEN\": \"\",\n",
+    "        }\n",
+    "    )\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "True"
+      ]
+     },
+     "execution_count": 45,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from dotenv import load_dotenv\n",
+    "\n",
+    "load_dotenv(override=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Upload Generated Dataset\n",
+    "Load the generated dataset from a local CSV file with pandas and preview the first rows."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>user_input</th>\n",
+       "      <th>reference_contexts</th>\n",
+       "      <th>reference</th>\n",
+       "      <th>synthesizer_name</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>Wht is an API?</td>\n",
+       "      <td>[\"Agents\\nThis combination of reasoning,\\nlogi...</td>\n",
+       "      <td>An API can be used by a model to make various ...</td>\n",
+       "      <td>single_hop_specifc_query_synthesizer</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>What are the three essential components in an ...</td>\n",
+       "      <td>['Agents\\nWhat is an agent?\\nIn its most funda...</td>\n",
+       "      <td>The three essential components in an agent's c...</td>\n",
+       "      <td>single_hop_specifc_query_synthesizer</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>What Chain-of-Thought do in agent model, how i...</td>\n",
+       "      <td>['Agents\\nFigure 1. General agent architecture...</td>\n",
+       "      <td>Chain-of-Thought is a reasoning and logic fram...</td>\n",
+       "      <td>single_hop_specifc_query_synthesizer</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>Waht is the DELETE method used for?</td>\n",
+       "      <td>['Agents\\nThe tools\\nFoundational models, desp...</td>\n",
+       "      <td>The DELETE method is a common web API method t...</td>\n",
+       "      <td>single_hop_specifc_query_synthesizer</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>How do foundational components contribute to t...</td>\n",
+       "      <td>['&lt;1-hop&gt;\\n\\nAgents\\ncombining specialized age...</td>\n",
+       "      <td>Foundational components contribute to the cogn...</td>\n",
+       "      <td>NewMultiHopQuery</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                          user_input  \\\n",
+       "0                                     Wht is an API?   \n",
+       "1  What are the three essential components in an ...   \n",
+       "2  What Chain-of-Thought do in agent model, how i...   \n",
+       "3                Waht is the DELETE method used for?   \n",
+       "4  How do foundational components contribute to t...   \n",
+       "\n",
+       "                                  reference_contexts  \\\n",
+       "0  [\"Agents\\nThis combination of reasoning,\\nlogi...   \n",
+       "1  ['Agents\\nWhat is an agent?\\nIn its most funda...   \n",
+       "2  ['Agents\\nFigure 1. General agent architecture...   \n",
+       "3  ['Agents\\nThe tools\\nFoundational models, desp...   \n",
+       "4  ['<1-hop>\\n\\nAgents\\ncombining specialized age...   \n",
+       "\n",
+       "                                           reference  \\\n",
+       "0  An API can be used by a model to make various ...   \n",
+       "1  The three essential components in an agent's c...   \n",
+       "2  Chain-of-Thought is a reasoning and logic fram...   \n",
+       "3  The DELETE method is a common web API method t...   \n",
+       "4  Foundational components contribute to the cogn...   \n",
+       "\n",
+       "                       synthesizer_name  \n",
+       "0  single_hop_specifc_query_synthesizer  \n",
+       "1  single_hop_specifc_query_synthesizer  \n",
+       "2  single_hop_specifc_query_synthesizer  \n",
+       "3  single_hop_specifc_query_synthesizer  \n",
+       "4                      NewMultiHopQuery  "
+      ]
+     },
+     "execution_count": 47,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "df = pd.read_csv(\"data/ragas_synthetic_dataset.csv\")\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Upload to HuggingFace Dataset\n",
+    "Convert the pandas DataFrame to a Hugging Face `Dataset`, then upload it to the Hub."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Dataset({\n",
+      "    features: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name'],\n",
+      "    num_rows: 10\n",
+      "})\n"
+     ]
+    }
+   ],
+   "source": [
+    "from datasets import Dataset\n",
+    "\n",
+    "# Convert pandas DataFrame to Hugging Face Dataset\n",
+    "dataset = Dataset.from_pandas(df)\n",
+    "\n",
+    "# Check the dataset\n",
+    "print(dataset)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "902d30cb54d64f1fb7115250c01672d9",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "2c60415789974ce88c9c95535dfefa4a",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "from datasets import Dataset\n",
+    "import os\n",
+    "\n",
+    "# Convert pandas DataFrame to Hugging Face Dataset\n",
+    "dataset = Dataset.from_pandas(df)\n",
+    "\n",
+    "# Set dataset name (change to your desired name)\n",
+    "hf_username = \"icarus1026\"  # Your Hugging Face Username(ID)\n",
+    "dataset_name = f\"{hf_username}/rag-synthetic-dataset\"\n",
+    "\n",
+    "# Upload dataset\n",
+    "dataset.push_to_hub(\n",
+    "    dataset_name,\n",
+    "    private=True,  # Set private=False for a public dataset\n",
+    "    split=\"test_v1\",  # Enter dataset split name\n",
+    "    token=os.getenv(\"HUGGINGFACEHUB_API_TOKEN\"),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Note] The Dataset Viewer may take some time to display.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "langchain-kr-bpXWMSjn-py3.11",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/...valuations/data/Newwhitepaper_Agents2.pdf → ...luations/assets/Newwhitepaper_Agents2.pdf b/...valuations/data/Newwhitepaper_Agents2.pdf → ...luations/assets/Newwhitepaper_Agents2.pdf