joeupwu/awesome-chatgpt-dataset

awesome-chatgpt-dataset

Unlock the Power of LLM: Explore These Datasets to Train Your Own ChatGPT!

| Dataset Name | Size | Languages | Task Tags | Generation Method | Source | Cost | License |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Dolly | 15K | English | Multi-task | Human Generated | databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. | - | CC 3.0 |
| Code Alpaca | 20K | English | Code Generation | - | Code generation task involving 20,022 samples | - | - |
| HC3 | 37K | English, Chinese | Multi-task | - | 37,175 instructions generated by ChatGPT and humans | - | - |
| Alpaca Dataset | 52K | English | Multi-task | text-davinci-003 | 175 seed instructions by OpenAI API | <$500 | CC By NC 4.0; OpenAI terms of use |
| Alpaca Data Cleaned | 52K | English | Multi-task | Cleaned Dataset | Revised version of Alpaca Dataset | - | - |
| Alpaca GPT-4 Data | 52K | English | Multi-task | gpt-4 | Generated by GPT-4 using Alpaca prompts | - | - |
| Alpaca GPT-4 Data (Chinese) | 52K | Chinese | Multi-task | gpt-4 | Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT | - | - |
| Cabrita Dataset | 52K | Portuguese | Multi-task | Translation | Translated from Alpaca Data | - | - |
| Japanese Alpaca Dataset | 52K | Japanese | Multi-task | Translation | Translated from Alpaca Data by ChatGPT API | $45 | CC By NC 4.0; OpenAI terms of use |
| Traditional Chinese Alpaca Dataset | 52K | Traditional Chinese | Multi-task | Translation | Translated from Alpaca Data by ChatGPT API | $40 | Apache-2.0 license |
| Finance | 69K | English | Finance | - | 68,912 finance-related instructions | - | - |
| Vicuna Dataset | 75K | English | Multi-task | gpt-3.5-turbo, gpt-4 | ~100k ShareGPT conversations | - | - |
| InstructionTranslation | 80K | Multi-lingual | Multi-task | - | Translations were generated by M2M 12B, and output generations were limited to 512 tokens due to the VRAM limit (40G). | - | MIT |
| Guanaco Dataset | 98K | English, Simplified Chinese, Traditional Chinese HK & TW, Japanese | Multi-task | Self-Instruct | 175 tasks from the Alpaca model | $6K | GPLv3 |
| InstructionWild | 104K | English, Chinese | Multi-task | Self-Instruct | 429 seed instructions, following Alpaca to generate 52K | $880 | Research only; OpenAI terms of use |
| Camel Dataset | 107K | Multi-lingual | Multi-task | Dataset Collection | Role-playing between AIs (OpenAI API) | - | - |
| Prosocial Dialog | 166K | English | Multi-task | Human Generated | 165,681 instructions produced from GPT-3 rewrites of questions and human feedback | - | - |
| ultrachat | 404K | English | Multi-task | gpt-3.5-turbo | To ensure generation quality, two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response. | - | cc-by-nc-4.0 |
| GPT4All Dataset | 806K | Multi-lingual | Multi-task | Dataset Collection | Subsets of LAION OIG, StackOverflow questions, and the BigScience/P3 dataset, answered by the OpenAI API. | - | - |
| Instruct | 889K | English | Multi-task | NLP Tools Augmentation | 888,969 English instructions, augmented using AllenAI NLP tools | - | MIT |
| MOSS | 1M | Chinese | Multi-task | gpt-3.5-turbo | Generated by gpt-3.5-turbo | - | Apache-2.0, AGPL-3.0 licenses |
| Firefly | 1.6M | Chinese | Multi-task | - | 1,649,398 Chinese instructions in 23 NLP tasks | - | - |
| Natural Instructions | 5M | Multi-lingual | Multi-task | - | 5,040,134 instructions collected from diverse NLP tasks | - | - |
| BELLE | 10M | Chinese | Multi-task | gpt-3.5-turbo | The 10M Chinese dataset is composed of subsets spanning multiple (instruction) types and multiple fields. | - | Research only; OpenAI terms of use |
| OIG-43M Dataset | 43M | Multi-lingual | Multi-task | Dataset Collection | Together, LAION, and Ontocord.ai. | - | - |
| xP3 | 79M | Multi-lingual | Multi-task | Dataset Collection | 78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks | - | - |
| Alpaca-CoT Dataset | - | Multi-lingual | Multi-task | Dataset Collection | Instruction Data Collection | - | ODC-By |
| HH-RLHF | - | English | Multi-task | - | The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. | - | MIT |
| stack-exchange-paired | - | English | Multi-task | Human Generated | This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. | - | cc-by-sa-4.0 |
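
When shortlisting datasets from the table, it can help to turn the abbreviated Size column into numbers and filter on the Languages column. The following is a minimal sketch, not part of the repository: the helper names `parse_size` and `filter_rows` are hypothetical, and `ROWS` is a hand-copied sample of a few table rows rather than a parsed version of the whole list.

```python
from typing import List, Optional, Tuple

# A few rows copied by hand from the table: (name, size, languages, license).
# This sample is illustrative only; it is not the full list.
ROWS: List[Tuple[str, str, str, str]] = [
    ("Dolly", "15K", "English", "CC 3.0"),
    ("Alpaca Dataset", "52K", "English", "CC By NC 4.0; OpenAI terms of use"),
    ("Traditional Chinese Alpaca Dataset", "52K", "Traditional Chinese", "Apache-2.0 license"),
    ("Instruct", "889K", "English", "MIT"),
    ("xP3", "79M", "Multi-lingual", "-"),
    ("HH-RLHF", "-", "English", "MIT"),
]

# Multipliers for the size suffixes used in the table.
_SUFFIXES = {"K": 1_000, "M": 1_000_000}


def parse_size(size: str) -> Optional[int]:
    """Convert a cell such as '52K' or '79M' to an approximate count.

    Returns None when the size is unspecified ('-' or empty).
    """
    size = size.strip()
    if size in ("", "-"):
        return None
    suffix = size[-1].upper()
    if suffix in _SUFFIXES:
        return int(float(size[:-1]) * _SUFFIXES[suffix])
    return int(size)


def filter_rows(rows, language: str, min_size: int = 0) -> List[str]:
    """Return names of datasets whose Languages cell mentions `language`
    and whose size is known and at least `min_size`."""
    selected = []
    for name, size, languages, _license in rows:
        count = parse_size(size)
        if count is None or count < min_size:
            continue
        if language.lower() in languages.lower():
            selected.append(name)
    return selected


print(filter_rows(ROWS, "English", min_size=50_000))
# → ['Alpaca Dataset', 'Instruct']
```

Rows with an unspecified size are skipped rather than treated as zero, so a dataset like HH-RLHF only appears if the size check is relaxed. License terms still need to be read on each dataset's own page before use.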
