| Dataset Name | Size | Languages | Task Tags | Generation Method | Source | Cost | License |
|---|---|---|---|---|---|---|---|
| Dolly | 15K | English | Multi-task | Human Generated | databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the interactivity of ChatGPT. | - | CC BY-SA 3.0 |
| Code Alpaca | 20K | English | Code Generation | - | 20,022 code-generation instruction samples | - | - |
| HC3 | 37K | English, Chinese | Multi-task | - | 37,175 questions with answers from both humans and ChatGPT | - | - |
| Alpaca Dataset | 52K | English | Multi-task | text-davinci-003 | Self-Instruct from 175 seed instructions via the OpenAI API | <$500 | CC BY-NC 4.0; OpenAI terms of use |
| Alpaca Data Cleaned | 52K | English | Multi-task | Cleaned Dataset | Revised version of Alpaca Dataset | - | - |
| Alpaca GPT-4 Data | 52K | English | Multi-task | gpt-4 | Generated by GPT-4 using Alpaca prompts | - | - |
| Alpaca GPT-4 Data (Chinese) | 52K | Chinese | Multi-task | gpt-4 | Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT | - | - |
| Cabrita Dataset | 52K | Portuguese | Multi-task | Translation | Translated from Alpaca Data | - | - |
| Japanese Alpaca Dataset | 52K | Japanese | Multi-task | Translation | Translated from Alpaca Data by ChatGPT API | $45 | CC BY-NC 4.0; OpenAI terms of use |
| Traditional Chinese Alpaca Dataset | 52K | Traditional Chinese | Multi-task | Translation | Translated from Alpaca Data by ChatGPT API | $40 | Apache-2.0 license |
| Finance | 69K | English | Finance | - | 68,912 finance-related instructions | - | - |
| Vicuna Dataset | 75K | English | Multi-task | gpt-3.5-turbo, gpt-4 | ~100K ShareGPT conversations | - | - |
| InstructionTranslation | 80K | Multi-lingual | Multi-task | - | Translations generated by M2M 12B; outputs capped at 512 tokens due to a 40 GB VRAM limit. | - | MIT |
| Guanaco Dataset | 98K | English, Simplified Chinese, Traditional Chinese HK & TW, Japanese | Multi-task | Self-Instruct | 175 tasks from the Alpaca model | $6K | GPLv3 |
| InstructionWild | 104K | English, Chinese | Multi-task | Self-Instruct | 429 seed instructions, expanded to 52K per language following the Alpaca pipeline | $880 | Research only; OpenAI terms of use |
| Camel Dataset | 107K | Multi-lingual | Multi-task | Dataset Collection | Role-playing between AIs (OpenAI API) | - | - |
| Prosocial Dialog | 166K | English | Multi-task | Human Generated | 165,681 instructions produced by GPT-3 rewriting questions, refined with human feedback | - | - |
| ultrachat | 404K | English | Multi-task | gpt-3.5-turbo | To ensure generation quality, two separate ChatGPT Turbo API instances are used: one plays the user and generates queries, the other generates responses. | - | CC BY-NC 4.0 |
| GPT4All Dataset | 806K | Multi-lingual | Multi-task | Dataset Collection | Subset of LAION OIG, Stack Overflow questions, and the BigScience/P3 dataset; answered by the OpenAI API. | - | - |
| Instruct | 889K | English | Multi-task | NLP Tools Augmentation | 888,969 English instructions, augmentation using AllenAI NLP tools | - | MIT |
| MOSS | 1M | Chinese | Multi-task | gpt-3.5-turbo | Generated by gpt-3.5-turbo | - | Apache-2.0, AGPL-3.0 |
| Natural Instructions | 5M | Multi-lingual | Multi-task | - | 5,040,134 instructions collected from diverse NLP tasks | - | - |
| BELLE | 10M | Chinese | Multi-task | gpt-3.5-turbo | 10M Chinese instructions composed of subsets spanning multiple instruction types and fields | - | Research only; OpenAI terms of use |
| Firefly | 1.6M | Chinese | Multi-task | - | 1,649,398 Chinese instructions across 23 NLP tasks | - | - |
| OIG-43M Dataset | 43M | Multi-lingual | Multi-task | Dataset Collection | Compiled by Together, LAION, and Ontocord.ai. | - | - |
| xP3 | 79M | Multi-lingual | Multi-task | Dataset Collection | 78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks | - | - |
| Alpaca-CoT Dataset | - | Multi-lingual | Multi-task | Dataset Collection | Instruction Data Collection | - | ODC-By |
| HH-RLHF | - | English | Multi-task | - | The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. | - | MIT |
| stack-exchange-paired | - | English | Multi-task | Human Generated | This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. | - | cc-by-sa-4.0 |
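Several of the synthetic datasets above (ultrachat, Camel) are built by letting two chat-model instances converse: one is prompted to act as the user and generate queries, the other to act as the assistant and respond. The sketch below illustrates that loop under stated assumptions: `fake_chat` is a placeholder for a real API call (e.g. an OpenAI chat-completions client), and all names and prompts are illustrative, not taken from any of the listed projects.

```python
# Minimal sketch of the two-role "self-chat" generation loop used by
# datasets such as ultrachat and Camel. `fake_chat` is a stand-in for a
# real LLM API call; swap it for a real client to generate data.

def fake_chat(system_prompt: str, history: list[dict]) -> str:
    """Placeholder for an LLM call; returns a canned reply for illustration."""
    turn = len(history)
    return f"[{system_prompt.split(':')[0]} message #{turn}]"

def self_chat(topic: str, n_turns: int = 3) -> list[dict]:
    """Alternate a user-role model and an assistant-role model for n_turns."""
    user_sys = f"User: ask probing questions about {topic}."
    asst_sys = f"Assistant: answer helpfully about {topic}."
    dialog: list[dict] = []
    for _ in range(n_turns):
        # The user-role model produces the next query from the dialog so far.
        query = fake_chat(user_sys, dialog)
        dialog.append({"role": "user", "content": query})
        # The assistant-role model answers that query.
        reply = fake_chat(asst_sys, dialog)
        dialog.append({"role": "assistant", "content": reply})
    return dialog

if __name__ == "__main__":
    for msg in self_chat("machine translation", n_turns=2):
        print(msg["role"], ":", msg["content"])
```

In the real pipelines, each finished dialog is saved as one training conversation; quality filtering (deduplication, length limits) is typically applied before release.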
forked from voidful/awesome-chatgpt-dataset
Unlock the Power of LLM: Explore These Datasets to Train Your Own ChatGPT!
joeupwu/awesome-chatgpt-dataset