Merged
371 changes: 371 additions & 0 deletions 16-Evaluations/03-HF-Upload.ipynb
@@ -0,0 +1,371 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# HF-Upload\n",
"\n",
"- Author: [Sun Hyoung Lee](https://github.com/LEE1026icarus)\n",
"- Design: \n",
"- Peer Review : \n",
"- Proofread:\n",
"- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
"\n",
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/08-Embeeding/04-UpstageEmbeddings.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/08-Embeeding/04-UpstageEmbeddings.ipynb)\n",
"\n",
"## Overview\n",
"\n",
"The process involves loading a local CSV file, converting it to a HuggingFace Dataset format, and uploading it to the Hugging Face Hub as a private dataset. This process allows for easy sharing and access of the dataset through the HuggingFace infrastructure.\n",
"\n",
"### Table of Contents\n",
"\n",
"- [Overview](#overview)\n",
"- [Environement Setup](#environment-setup)\n",
"- [Upload Generated Dataset](#upload-generated-dataset)\n",
"- [Upload to HuggingFace Dataset](#upload-to-huggingface-dataset)\n",
"\n",
"\n",
"### References\n",
"- [Huggingface / Share a dataset to the Hub](https://huggingface.co/docs/datasets/upload_dataset)\n",
"---\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Environment Setup\n",
"\n",
"Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
"\n",
" **[Note]** \n",
"- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n",
"- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.\n",
"\n",
"### API Key Configuration\n",
"To use `HuggingFace Dataset` , you need to [obtain a HuggingFace write token](https://huggingface.co/settings/tokens).\n",
"\n",
"Once you have your API key, set it as the value for the variable `HUGGINGFACEHUB_API_TOKEN` ."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"%%capture --no-stderr\n",
"%pip install langchain-opentutorial"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Install required packages\n",
"from langchain_opentutorial import package\n",
"\n",
"package.install(\n",
" [\"datasets\"],\n",
" verbose=False,\n",
" upgrade=False,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can set API keys in a `.env` file or set them manually.\n",
"\n",
"[Note] If you’re not using the `.env` file, no worries! Just enter the keys directly in the cell below, and you’re good to go.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from dotenv import load_dotenv\n",
"from langchain_opentutorial import set_env\n",
"\n",
"# Attempt to load environment variables from a .env file; if unsuccessful, set them manually.\n",
"if not load_dotenv(override=True):\n",
" set_env(\n",
" {\n",
" \"LANGCHAIN_API_KEY\": \"\",\n",
" \"LANGCHAIN_TRACING_V2\": \"true\",\n",
" \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n",
" \"LANGCHAIN_PROJECT\": \"\", # set the project name same as the title\n",
" \"HUGGINGFACEHUB_API_TOKEN\": \"\",\n",
" }\n",
" )\n"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv(override=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Upload Generated Dataset\n",
"Import the pandas library for data upload"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>user_input</th>\n",
" <th>reference_contexts</th>\n",
" <th>reference</th>\n",
" <th>synthesizer_name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Wht is an API?</td>\n",
" <td>[\"Agents\\nThis combination of reasoning,\\nlogi...</td>\n",
" <td>An API can be used by a model to make various ...</td>\n",
" <td>single_hop_specifc_query_synthesizer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>What are the three essential components in an ...</td>\n",
" <td>['Agents\\nWhat is an agent?\\nIn its most funda...</td>\n",
" <td>The three essential components in an agent's c...</td>\n",
" <td>single_hop_specifc_query_synthesizer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>What Chain-of-Thought do in agent model, how i...</td>\n",
" <td>['Agents\\nFigure 1. General agent architecture...</td>\n",
" <td>Chain-of-Thought is a reasoning and logic fram...</td>\n",
" <td>single_hop_specifc_query_synthesizer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Waht is the DELETE method used for?</td>\n",
" <td>['Agents\\nThe tools\\nFoundational models, desp...</td>\n",
" <td>The DELETE method is a common web API method t...</td>\n",
" <td>single_hop_specifc_query_synthesizer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>How do foundational components contribute to t...</td>\n",
" <td>['&lt;1-hop&gt;\\n\\nAgents\\ncombining specialized age...</td>\n",
" <td>Foundational components contribute to the cogn...</td>\n",
" <td>NewMultiHopQuery</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" user_input \\\n",
"0 Wht is an API? \n",
"1 What are the three essential components in an ... \n",
"2 What Chain-of-Thought do in agent model, how i... \n",
"3 Waht is the DELETE method used for? \n",
"4 How do foundational components contribute to t... \n",
"\n",
" reference_contexts \\\n",
"0 [\"Agents\\nThis combination of reasoning,\\nlogi... \n",
"1 ['Agents\\nWhat is an agent?\\nIn its most funda... \n",
"2 ['Agents\\nFigure 1. General agent architecture... \n",
"3 ['Agents\\nThe tools\\nFoundational models, desp... \n",
"4 ['<1-hop>\\n\\nAgents\\ncombining specialized age... \n",
"\n",
" reference \\\n",
"0 An API can be used by a model to make various ... \n",
"1 The three essential components in an agent's c... \n",
"2 Chain-of-Thought is a reasoning and logic fram... \n",
"3 The DELETE method is a common web API method t... \n",
"4 Foundational components contribute to the cogn... \n",
"\n",
" synthesizer_name \n",
"0 single_hop_specifc_query_synthesizer \n",
"1 single_hop_specifc_query_synthesizer \n",
"2 single_hop_specifc_query_synthesizer \n",
"3 single_hop_specifc_query_synthesizer \n",
"4 NewMultiHopQuery "
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv(\"data/ragas_synthetic_dataset.csv\")\n",
Review comment (Contributor):

The file `data/ragas_synthetic_dataset.csv` does not exist.
Please review.
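
One possible way to handle the missing file, as a minimal sketch only: check that the CSV exists before calling `pd.read_csv`, so the notebook fails with a message that names the path and the working directory. `resolve_dataset_path` is a hypothetical helper introduced here for illustration; it is not part of the notebook, pandas, or `langchain-opentutorial`.

```python
from pathlib import Path


def resolve_dataset_path(csv_path: str) -> Path:
    """Return csv_path as a Path if the file exists, else raise a descriptive error.

    Hypothetical helper: the name and the error wording are assumptions.
    """
    path = Path(csv_path)
    if not path.is_file():
        # Include the working directory, since a relative path depends on it.
        raise FileNotFoundError(
            f"{path} not found (working directory: {Path.cwd()}). "
            "Generate the synthetic dataset first or adjust the path."
        )
    return path
```

In the notebook cell this could be used as `df = pd.read_csv(resolve_dataset_path("data/ragas_synthetic_dataset.csv"))`. It does not create the file; the CSV still has to be added to the repository or generated beforehand.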

"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Upload to HuggingFace Dataset\n",
"Convert a Pandas DataFrame to a Hugging Face Dataset and proceed with the upload."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset({\n",
" features: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name'],\n",
" num_rows: 10\n",
"})\n"
]
}
],
"source": [
"from datasets import Dataset\n",
"\n",
"# Convert pandas DataFrame to Hugging Face Dataset\n",
"dataset = Dataset.from_pandas(df)\n",
"\n",
"# Check the dataset\n",
"print(dataset)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "902d30cb54d64f1fb7115250c01672d9",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Pushing dataset shards to the dataset hub: 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2c60415789974ce88c9c95535dfefa4a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Creating parquet from Arrow format: 0%| | 0/1 [00:00<?, ?ba/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from datasets import Dataset\n",
"import os\n",
"\n",
"# Convert pandas DataFrame to Hugging Face Dataset\n",
"dataset = Dataset.from_pandas(df)\n",
"\n",
"# Set dataset name (change to your desired name)\n",
"hf_username = \"icarus1026\" # Your Hugging Face Username(ID)\n",
"dataset_name = f\"{hf_username}/rag-synthetic-dataset\"\n",
"\n",
"# Upload dataset\n",
"dataset.push_to_hub(\n",
" dataset_name,\n",
" private=True, # Set private=False for a public dataset\n",
" split=\"test_v1\", # Enter dataset split name\n",
" token=os.getenv(\"HUGGINGFACEHUB_API_TOKEN\"),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Note] The Dataset Viewer may take some time to display.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "langchain-kr-bpXWMSjn-py3.11",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
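
The upload cell passes `os.getenv("HUGGINGFACEHUB_API_TOKEN")` straight to `push_to_hub`, so an unset or empty token may only surface as an authentication error during the upload. A minimal sketch of failing earlier follows. `require_hf_token` is a hypothetical helper name, and raising `RuntimeError` is an assumption rather than anything the notebook or the `datasets` library prescribes.

```python
import os


def require_hf_token(var_name: str = "HUGGINGFACEHUB_API_TOKEN") -> str:
    """Return the token stored in the given environment variable.

    Hypothetical helper: raises RuntimeError if the variable is unset or blank,
    which also covers the empty-string placeholder used in the set_env call.
    """
    token = (os.getenv(var_name) or "").strip()
    if not token:
        raise RuntimeError(
            f"{var_name} is not set. Add it to your .env file or set it manually "
            "before uploading."
        )
    return token
```

The upload call could then use `token=require_hf_token()` in place of the direct `os.getenv` lookup.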