[E-2] 16-Evaluations / 03-HF-Upload.ipynb #421
Merged
11 commits
459743e
Merge remote-tracking branch 'upstream/main' into Upstage
LEE1026icarus 35912dd
Merge remote-tracking branch 'upstream/main' into eval
LEE1026icarus e976992
[Title] 03-HF-Upload
LEE1026icarus 4edaf7e
Merge remote-tracking branch 'upstream/main' into eval
LEE1026icarus c89fefe
add PDF
LEE1026icarus 89f3b84
[Title] 03-HF-Upload
LEE1026icarus bddc0ac
Merge remote-tracking branch 'upstream/main' into eval
LEE1026icarus 4f6f527
add the csv file and revise the py file.
LEE1026icarus 6646689
Update 16-Evaluations/03-HF-Upload.ipynb
LEE1026icarus 18d9939
Update 16-Evaluations/03-HF-Upload.ipynb
LEE1026icarus 4230e6a
Update 16-Evaluations/03-HF-Upload.ipynb
LEE1026icarus
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,371 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# HF-Upload\n", | ||
"\n", | ||
"- Author: [Sun Hyoung Lee](https://github.com/LEE1026icarus)\n", | ||
"- Design: \n", | ||
"- Peer Review : \n", | ||
"- Proofread:\n", | ||
"- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n", | ||
"\n", | ||
"[](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/08-Embeeding/04-UpstageEmbeddings.ipynb) [](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/08-Embeeding/04-UpstageEmbeddings.ipynb)\n", | ||
"\n", | ||
"## Overview\n", | ||
"\n", | ||
"The process involves loading a local CSV file, converting it to a HuggingFace Dataset format, and uploading it to the Hugging Face Hub as a private dataset. This process allows for easy sharing and access of the dataset through the HuggingFace infrastructure.\n", | ||
"\n", | ||
"### Table of Contents\n", | ||
"\n", | ||
"- [Overview](#overview)\n", | ||
"- [Environement Setup](#environment-setup)\n", | ||
"- [Upload Generated Dataset](#upload-generated-dataset)\n", | ||
"- [Upload to HuggingFace Dataset](#upload-to-huggingface-dataset)\n", | ||
"\n", | ||
"\n", | ||
"### References\n", | ||
"- [Huggingface / Share a dataset to the Hub](https://huggingface.co/docs/datasets/upload_dataset)\n", | ||
"---\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Environment Setup\n", | ||
"\n", | ||
"Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n", | ||
"\n", | ||
" **[Note]** \n", | ||
"- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n", | ||
"- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.\n", | ||
"\n", | ||
"### API Key Configuration\n", | ||
"To use `HuggingFace Dataset` , you need to [obtain a HuggingFace write token](https://huggingface.co/settings/tokens).\n", | ||
"\n", | ||
"Once you have your API key, set it as the value for the variable `HUGGINGFACEHUB_API_TOKEN` ." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%capture --no-stderr\n", | ||
"%pip install langchain-opentutorial" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Install required packages\n", | ||
"from langchain_opentutorial import package\n", | ||
"\n", | ||
"package.install(\n", | ||
" [\"datasets\"],\n", | ||
" verbose=False,\n", | ||
" upgrade=False,\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"You can set API keys in a `.env` file or set them manually.\n", | ||
"\n", | ||
"[Note] If you’re not using the `.env` file, no worries! Just enter the keys directly in the cell below, and you’re good to go.\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from dotenv import load_dotenv\n", | ||
"from langchain_opentutorial import set_env\n", | ||
"\n", | ||
"# Attempt to load environment variables from a .env file; if unsuccessful, set them manually.\n", | ||
"if not load_dotenv(override=True):\n", | ||
" set_env(\n", | ||
" {\n", | ||
" \"LANGCHAIN_API_KEY\": \"\",\n", | ||
" \"LANGCHAIN_TRACING_V2\": \"true\",\n", | ||
" \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n", | ||
" \"LANGCHAIN_PROJECT\": \"\", # set the project name same as the title\n", | ||
" \"HUGGINGFACEHUB_API_TOKEN\": \"\",\n", | ||
" }\n", | ||
" )\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 45, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"True" | ||
] | ||
}, | ||
"execution_count": 45, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"from dotenv import load_dotenv\n", | ||
"\n", | ||
"load_dotenv(override=True)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Upload Generated Dataset\n", | ||
"Import the pandas library for data upload" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 47, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/html": [ | ||
"<div>\n", | ||
"<style scoped>\n", | ||
" .dataframe tbody tr th:only-of-type {\n", | ||
" vertical-align: middle;\n", | ||
" }\n", | ||
"\n", | ||
" .dataframe tbody tr th {\n", | ||
" vertical-align: top;\n", | ||
" }\n", | ||
"\n", | ||
" .dataframe thead th {\n", | ||
" text-align: right;\n", | ||
" }\n", | ||
"</style>\n", | ||
"<table border=\"1\" class=\"dataframe\">\n", | ||
" <thead>\n", | ||
" <tr style=\"text-align: right;\">\n", | ||
" <th></th>\n", | ||
" <th>user_input</th>\n", | ||
" <th>reference_contexts</th>\n", | ||
" <th>reference</th>\n", | ||
" <th>synthesizer_name</th>\n", | ||
" </tr>\n", | ||
" </thead>\n", | ||
" <tbody>\n", | ||
" <tr>\n", | ||
" <th>0</th>\n", | ||
" <td>Wht is an API?</td>\n", | ||
" <td>[\"Agents\\nThis combination of reasoning,\\nlogi...</td>\n", | ||
" <td>An API can be used by a model to make various ...</td>\n", | ||
" <td>single_hop_specifc_query_synthesizer</td>\n", | ||
" </tr>\n", | ||
" <tr>\n", | ||
" <th>1</th>\n", | ||
" <td>What are the three essential components in an ...</td>\n", | ||
" <td>['Agents\\nWhat is an agent?\\nIn its most funda...</td>\n", | ||
" <td>The three essential components in an agent's c...</td>\n", | ||
" <td>single_hop_specifc_query_synthesizer</td>\n", | ||
" </tr>\n", | ||
" <tr>\n", | ||
" <th>2</th>\n", | ||
" <td>What Chain-of-Thought do in agent model, how i...</td>\n", | ||
" <td>['Agents\\nFigure 1. General agent architecture...</td>\n", | ||
" <td>Chain-of-Thought is a reasoning and logic fram...</td>\n", | ||
" <td>single_hop_specifc_query_synthesizer</td>\n", | ||
" </tr>\n", | ||
" <tr>\n", | ||
" <th>3</th>\n", | ||
" <td>Waht is the DELETE method used for?</td>\n", | ||
" <td>['Agents\\nThe tools\\nFoundational models, desp...</td>\n", | ||
" <td>The DELETE method is a common web API method t...</td>\n", | ||
" <td>single_hop_specifc_query_synthesizer</td>\n", | ||
" </tr>\n", | ||
" <tr>\n", | ||
" <th>4</th>\n", | ||
" <td>How do foundational components contribute to t...</td>\n", | ||
" <td>['<1-hop>\\n\\nAgents\\ncombining specialized age...</td>\n", | ||
" <td>Foundational components contribute to the cogn...</td>\n", | ||
" <td>NewMultiHopQuery</td>\n", | ||
" </tr>\n", | ||
" </tbody>\n", | ||
"</table>\n", | ||
"</div>" | ||
], | ||
"text/plain": [ | ||
" user_input \\\n", | ||
"0 Wht is an API? \n", | ||
"1 What are the three essential components in an ... \n", | ||
"2 What Chain-of-Thought do in agent model, how i... \n", | ||
"3 Waht is the DELETE method used for? \n", | ||
"4 How do foundational components contribute to t... \n", | ||
"\n", | ||
" reference_contexts \\\n", | ||
"0 [\"Agents\\nThis combination of reasoning,\\nlogi... \n", | ||
"1 ['Agents\\nWhat is an agent?\\nIn its most funda... \n", | ||
"2 ['Agents\\nFigure 1. General agent architecture... \n", | ||
"3 ['Agents\\nThe tools\\nFoundational models, desp... \n", | ||
"4 ['<1-hop>\\n\\nAgents\\ncombining specialized age... \n", | ||
"\n", | ||
" reference \\\n", | ||
"0 An API can be used by a model to make various ... \n", | ||
"1 The three essential components in an agent's c... \n", | ||
"2 Chain-of-Thought is a reasoning and logic fram... \n", | ||
"3 The DELETE method is a common web API method t... \n", | ||
"4 Foundational components contribute to the cogn... \n", | ||
"\n", | ||
" synthesizer_name \n", | ||
"0 single_hop_specifc_query_synthesizer \n", | ||
"1 single_hop_specifc_query_synthesizer \n", | ||
"2 single_hop_specifc_query_synthesizer \n", | ||
"3 single_hop_specifc_query_synthesizer \n", | ||
"4 NewMultiHopQuery " | ||
] | ||
}, | ||
"execution_count": 47, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"import pandas as pd\n", | ||
"\n", | ||
"df = pd.read_csv(\"data/ragas_synthetic_dataset.csv\")\n", | ||
"df.head()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Upload to HuggingFace Dataset\n", | ||
"Convert a Pandas DataFrame to a Hugging Face Dataset and proceed with the upload." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 48, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Dataset({\n", | ||
" features: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name'],\n", | ||
" num_rows: 10\n", | ||
"})\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"from datasets import Dataset\n", | ||
"\n", | ||
"# Convert pandas DataFrame to Hugging Face Dataset\n", | ||
"dataset = Dataset.from_pandas(df)\n", | ||
"\n", | ||
"# Check the dataset\n", | ||
"print(dataset)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"application/vnd.jupyter.widget-view+json": { | ||
"model_id": "902d30cb54d64f1fb7115250c01672d9", | ||
"version_major": 2, | ||
"version_minor": 0 | ||
}, | ||
"text/plain": [ | ||
"Pushing dataset shards to the dataset hub: 0%| | 0/1 [00:00<?, ?it/s]" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
}, | ||
{ | ||
"data": { | ||
"application/vnd.jupyter.widget-view+json": { | ||
"model_id": "2c60415789974ce88c9c95535dfefa4a", | ||
"version_major": 2, | ||
"version_minor": 0 | ||
}, | ||
"text/plain": [ | ||
"Creating parquet from Arrow format: 0%| | 0/1 [00:00<?, ?ba/s]" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"from datasets import Dataset\n", | ||
"import os\n", | ||
"\n", | ||
"# Convert pandas DataFrame to Hugging Face Dataset\n", | ||
"dataset = Dataset.from_pandas(df)\n", | ||
"\n", | ||
"# Set dataset name (change to your desired name)\n", | ||
"hf_username = \"icarus1026\" # Your Hugging Face Username(ID)\n", | ||
"dataset_name = f\"{hf_username}/rag-synthetic-dataset\"\n", | ||
"\n", | ||
"# Upload dataset\n", | ||
"dataset.push_to_hub(\n", | ||
" dataset_name,\n", | ||
" private=True, # Set private=False for a public dataset\n", | ||
" split=\"test_v1\", # Enter dataset split name\n", | ||
" token=os.getenv(\"HUGGINGFACEHUB_API_TOKEN\"),\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"[Note] The Dataset Viewer may take some time to display.\n" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "langchain-kr-bpXWMSjn-py3.11", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.9" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
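A note on the `reference_contexts` column in the diff above: the `df.head()` output suggests each cell holds a stringified Python list, and `pd.read_csv` reads such cells as plain strings. The sketch below shows one way to turn them back into lists before calling `Dataset.from_pandas`. The helper name `parse_contexts` is hypothetical and is not part of the notebook; the approach assumes every cell is a valid Python list literal.

```python
import ast


def parse_contexts(cell: str) -> list[str]:
    """Turn a stringified Python list from a CSV cell back into a list of strings."""
    # literal_eval only evaluates Python literals, so it is safer than eval().
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {type(value).__name__}")
    return [str(item) for item in value]
```

With the notebook's DataFrame, this could be applied as `df["reference_contexts"] = df["reference_contexts"].apply(parse_contexts)`, so the uploaded dataset stores a sequence of strings instead of one long string per row.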
The file `data/ragas_synthetic_dataset.csv` does not exist. Please review.
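The missing-file problem raised in this comment could also be surfaced earlier inside the notebook. Below is a minimal sketch of a guard that fails with a descriptive message before `pd.read_csv` is called. The helper name `require_file` is hypothetical and does not appear in the PR.

```python
from pathlib import Path


def require_file(path_str: str) -> Path:
    """Return the path if it points to an existing file, else raise a clear error."""
    path = Path(path_str)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found. Generate the synthetic dataset first "
            "or add the CSV file to the repository."
        )
    return path
```

In the notebook this could be used as `df = pd.read_csv(require_file("data/ragas_synthetic_dataset.csv"))`, so a reader who has not generated the CSV sees an actionable message instead of a bare pandas traceback.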