| Dataset Name | Size | Languages | Task Tags | Generation Method | Source | Cost | License |
|---|---|---|---|---|---|---|---|
| Dolly | 15K | English | Multi-task | Human Generated | databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the interactivity of ChatGPT. | - | CC BY-SA 3.0 |
| Code Alpaca | 20K | English | Code Generation | - | 20,022 code-generation instruction samples | - | - |
| HC3 | 37K | English, Chinese | Multi-task | - | 37,175 questions with answers from both humans and ChatGPT | - | - |
| Alpaca Dataset | 52K | English | Multi-task | text-davinci-003 | Self-Instruct from 175 seed instructions via the OpenAI API | <$500 | CC BY-NC 4.0; OpenAI terms of use |
| Alpaca Data Cleaned | 52K | English | Multi-task | Cleaned Dataset | Revised version of Alpaca Dataset | - | - |
| Alpaca GPT-4 Data | 52K | English | Multi-task | gpt-4 | Generated by GPT-4 using Alpaca prompts | - | - |
| Alpaca GPT-4 Data (Chinese) | 52K | Chinese | Multi-task | gpt-4 | Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT | - | - |
| Cabrita Dataset | 52K | Portuguese | Multi-task | Translation | Translated from Alpaca Data | - | - |
| Japanese Alpaca Dataset | 52K | Japanese | Multi-task | Translation | Translated from Alpaca Data by ChatGPT API | $45 | CC BY-NC 4.0; OpenAI terms of use |
| Traditional Chinese Alpaca Dataset | 52K | Traditional Chinese | Multi-task | Translation | Translated from Alpaca Data by ChatGPT API | $40 | Apache-2.0 license |
| Finance | 69K | English | Finance | - | 68,912 finance-related instructions | - | - |
| Vicuna Dataset | 75K | English | Multi-task | gpt-3.5-turbo, gpt-4 | ~100K ShareGPT conversations | - | - |
| InstructionTranslation | 80K | Multi-lingual | Multi-task | - | Translations generated by M2M 12B; outputs capped at 512 tokens due to a 40 GB VRAM limit. | - | MIT |
| Guanaco Dataset | 98K | English, Simplified Chinese, Traditional Chinese HK & TW, Japanese | Multi-task | Self-Instruct | 175 tasks from the Alpaca model | $6K | GPLv3 |
| InstructionWild | 104K | English, Chinese | Multi-task | Self-Instruct | 429 seed instructions, expanded to 52K per language following the Alpaca pipeline | $880 | Research only; OpenAI terms of use |
| Camel Dataset | 107K | Multi-lingual | Multi-task | Dataset Collection | Role-playing between AIs (OpenAI API) | - | - |
| Prosocial Dialog | 166K | English | Multi-task | Human Generated | 165,681 instructions produced by GPT-3 rewriting questions, refined with human feedback | - | - |
| ultrachat | 404K | English | Multi-task | gpt-3.5-turbo | To ensure generation quality, two separate ChatGPT Turbo API instances are used: one plays the user and generates queries, the other generates responses. | - | CC BY-NC 4.0 |
| GPT4All Dataset | 806K | Multi-lingual | Multi-task | Dataset Collection | Subset of LAION OIG, Stack Overflow questions, and the BigScience/P3 dataset; answered by the OpenAI API. | - | - |
| Instruct | 889K | English | Multi-task | NLP Tools Augmentation | 888,969 English instructions, augmentation using AllenAI NLP tools | - | MIT |
| MOSS | 1M | Chinese | Multi-task | gpt-3.5-turbo | Generated by gpt-3.5-turbo | - | Apache-2.0, AGPL-3.0 |
| Natural Instructions | 5M | Multi-lingual | Multi-task | - | 5,040,134 instructions collected from diverse NLP tasks | - | - |
| BELLE | 10M | Chinese | Multi-task | gpt-3.5-turbo | 10M Chinese instructions composed of subsets spanning multiple instruction types and fields | - | Research only; OpenAI terms of use |
| Firefly | 1.6M | Chinese | Multi-task | - | 1,649,398 Chinese instructions across 23 NLP tasks | - | - |
| OIG-43M Dataset | 43M | Multi-lingual | Multi-task | Dataset Collection | Compiled by Together, LAION, and Ontocord.ai. | - | - |
| xP3 | 79M | Multi-lingual | Multi-task | Dataset Collection | 78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks | - | - |
| Alpaca-CoT Dataset | - | Multi-lingual | Multi-task | Dataset Collection | Instruction Data Collection | - | ODC-By |
| HH-RLHF | - | English | Multi-task | - | The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. | - | MIT |
| stack-exchange-paired | - | English | Multi-task | Human Generated | This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. | - | cc-by-sa-4.0 |
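Several of the synthetic datasets above (ultrachat, Camel) are built by letting two chat-model instances converse: one is prompted to act as the user and generate queries, the other to act as the assistant and respond. The sketch below illustrates that loop under stated assumptions: `fake_chat` is a placeholder for a real API call (e.g. an OpenAI chat-completions client), and all names and prompts are illustrative, not taken from any of the listed projects.

```python
# Minimal sketch of the two-role "self-chat" generation loop used by
# datasets such as ultrachat and Camel. `fake_chat` is a stand-in for a
# real LLM API call; swap it for a real client to generate data.

def fake_chat(system_prompt: str, history: list[dict]) -> str:
    """Placeholder for an LLM call; returns a canned reply for illustration."""
    turn = len(history)
    return f"[{system_prompt.split(':')[0]} message #{turn}]"

def self_chat(topic: str, n_turns: int = 3) -> list[dict]:
    """Alternate a user-role model and an assistant-role model for n_turns."""
    user_sys = f"User: ask probing questions about {topic}."
    asst_sys = f"Assistant: answer helpfully about {topic}."
    dialog: list[dict] = []
    for _ in range(n_turns):
        # The user-role model produces the next query from the dialog so far.
        query = fake_chat(user_sys, dialog)
        dialog.append({"role": "user", "content": query})
        # The assistant-role model answers that query.
        reply = fake_chat(asst_sys, dialog)
        dialog.append({"role": "assistant", "content": reply})
    return dialog

if __name__ == "__main__":
    for msg in self_chat("machine translation", n_turns=2):
        print(msg["role"], ":", msg["content"])
```

In the real pipelines, each finished dialog is saved as one training conversation; quality filtering (deduplication, length limits) is typically applied before release.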
forked from voidful/awesome-chatgpt-dataset
Unlock the Power of LLM: Explore These Datasets to Train Your Own ChatGPT!
joeupwu/awesome-chatgpt-dataset