llm-datasets

Here are 22 public repositories matching this topic...

neo4j-labs / text2cypher

Collection of text2cypher datasets, evaluations, and finetuning instructions

neo4j graph cypher cypher-query-language llm llms llm-training llm-datasets text2cypher

Updated Jun 13, 2024
Jupyter Notebook

dsdanielpark / open-llm-datasets

Repository for organizing datasets and papers used in Open LLM.

natural-language-processing datasets large-language-models llm llm-training llm-datasets

Updated Jul 6, 2023

discus-labs / discus

A data-centric AI package for ML/AI. Get the best high-quality data for the best results. Discord: https://discord.gg/t6ADqBKrdZ

python openai gpt synthetic-data fine-tuning synthetic-dataset-generation ner-data huggingface-transformers gpt-4 large-language-models llms llm-training llm-datasets fine-tuning-llm

Updated Nov 20, 2023
Python

asimsinan / LLM-Research

A collection of LLM-related papers, theses, tools, datasets, courses, open-source models, and benchmarks

arxiv-papers large-language-models llm llms llm-datasets llm-tools buyuk-dil-modelleri llm-research llm-theses llm-benchmarking llm-frameworks

Updated Oct 8, 2024
Python

ServiceNow / SyGra

SyGra - Graph-oriented Synthetic data generation Pipeline

python open-source ai multimodality synthetic-data synthetic-dataset-generation dpo image-datasets low-code-no-code llm-datasets llm-framework sft-data llm-training-data

Updated Dec 26, 2025
Python

amao0o0 / awesome-AI-Math-Datasets

A collection of recent open-source math datasets for training and evaluating Math LLMs

math mathematics llm ai4math llm-datasets math-llm

Updated Dec 8, 2025

ronniross / asi-core-protocol

A framework to analyze how AGI/ASI might emerge from decentralized, adaptive systems rather than from a single model deployment. It also aims to present orientation as a dynamic and self-evolving Magna Carta, helping to guide the emergence of such phenomena.

machine-learning agi dataset artificial-general-intelligence machine-learning-library datasets machine-learning-projects llm llms rlhf llm-datasets llm-framework llms-benchmarking llm-benchmarking artificial-general-super-intelligence agi-development

Updated Aug 6, 2025

altunenes / rustysozluk

Efficiently fetch and perform sentiment analysis (Turkish Only) on eksisozluk.com entries using Rust

rust scraper sentiment-analysis turkish eksisozluk rust-lang webscraping eksi-sozluk reqwest duyguanalizi rust-scraping llm-training llm-datasets

Updated Feb 8, 2024
Rust

arian-askari / SOLID

Synthetically generates intent-aware information-seeking dialogues. Useful for tasks such as training and evaluating user intent predictors, with the option of training and evaluating on real human dialogues. The backbone LLM of SOLID is Zephyr-7b-beta.

solid dataset-generation conversational-ai intent-classification llm-training llm-inference llm-datasets llm-dialogs llm-conversations zephyr-7b-beta intent-aware-conversation-generation solid-rl

Updated Aug 18, 2024
Python

neuralwork / audio2chat

Convert multi-speaker audio files to structured chat data for LLMs

chat transcription whisper speaker-diarization llm llm-datasets

Updated Jan 29, 2025
Python

tiddly-gittly / TiddlyWiki-LLM-dataset

WikiText syntax dataset generation pipeline and open dataset for auto UI generation in TiddlyWiki. (WIP)

dataset tiddlywiki wikitext llm llm-training llm-datasets

Updated Nov 20, 2024
TypeScript

DefinetlyNotAI / LLM_Data

Source code of several very famous repos, in Python, gathered in this repo as plain localdocs to train code AI

c data cpp cuda jupyter-notebook python3 code-examples llm llm-datasets data-dum programming-data programming-data-sets llm-code

Updated Dec 12, 2024
Python

dmeldrum6 / LLM-Dataset-Builder

LLM-Powered Dataset Creation Tool

synthetic-data synthetic-dataset-generation synthetic-data-generation llm llm-training llm-datasets

Updated Aug 15, 2025
HTML

JochiRaider / sievio

Sievio turns GitHub, local repos, and web PDFs into clean JSONL for LLM pretraining, fine-tuning, and RAG. It offers structure-aware chunking, reliable Unicode decoding, pluggable QC and safety checks, plus optional dataset cards and deduplication.

python data-deduplication dataset-creation data-pipelines repository-mining jsonl github-repos rag text-preprocessing quality-filtering code-mining llm llm-training llm-datasets

Updated Dec 26, 2025
Python

AmanPriyanshu / Stratified-LLM-Subsets-100K-1M-Scale

Stratified LLM Subsets delivers diverse training data at 100K-1M scales across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning distillation (Llama-Nemotron). Embedding-based k-means clustering ensures maximum diversity across 5 high-quality open datasets.

Updated Oct 4, 2025
HTML

redblock-ai / parrot-python

PARROT (Performance Assessment of Reasoning and Responses On Trivia) is a novel benchmarking framework designed to evaluate Large Language Models (LLMs) on real-world, complex, and ambiguous QA tasks.

benchmarking-framework llm-inference llm-datasets llm-qa-document llm-benchmarking

Updated Oct 14, 2024
Python

aloobun / ccpem-modified

A modified dataset consisting of English dialogs between a user and an assistant discussing movie preferences in natural language.

dataset llm-datasets

Updated Sep 29, 2023

mohammadreza-mohammadi94 / Persian-Poem-Dataset

A collection of Persian poems structured for NLP and LLM tasks. Each poem is stored as a separate file, organized by poet, and formatted for easy use in training, fine-tuning, or text analysis workflows.

dataset persian-dataset llm-datasets

Updated Jul 3, 2025

bot08 / aiua-20k

dataset generated ukrainian-language huggingface-datasets llm-datasets

Updated Feb 11, 2025

faidrapts / dclm-german

A data curation pipeline in German, inspired by DCLM-Baseline.

data-pipeline llm-datasets llm-curation

Updated Sep 24, 2025
Python
