This repository contains the official code and resources for the paper "BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets". The complete dataset is available on Zenodo: https://zenodo.org/records/17381119
BioGraphletQA is a principled and scalable framework for systematically generating complex, factually grounded Question Answering (QA) data. The core of our framework is a novel graphlet-anchored generation process, where small subgraphs from a Knowledge Graph are used as a factual base to control the complexity and ensure the grounding of questions generated by Large Language Models (LLMs).
This repository provides:
- The full data generation pipeline, from KG processing to downstream task rephrasing.
- The complete BioGraphletQA dataset, containing 119,856 complex biomedical QA pairs.
- Scripts to reproduce the experiments and analysis presented in the paper.
The dataset generation process consists of six main stages, with each numbered directory corresponding to a stage in the pipeline.
- KG Acquisition & Preprocessing: We download the OREGANO KG, then hydrate nodes with their textual names and apply a degree-based reduction to filter for nodes best suited for complex question generation. Finally, we extract graphlets (subgraphs of 3-5 nodes) to serve as the factual basis for generation.
- Prompt Ablation Study: We conduct a rigorous, systematic study to identify the optimal prompt for guiding an LLM to generate high-quality, scientifically valid questions from the graphlets.
- Large-Scale Dataset Generation: Using the best prompt, we generate the full raw dataset of QA pairs from all sampled graphlets.
- LLM-Based Filtering & Human Validation: A second LLM pass acts as a judge, filtering out lower-quality or incoherent QA pairs. This automated process is validated by a human domain expert to ensure its effectiveness.
- Supporting Document Retrieval: We enrich the dataset by retrieving relevant abstracts from PubMed for each QA pair using BM25 and then use an LLM to identify and extract the most salient supporting snippets.
- Task-Specific Rephrasing: To demonstrate utility, we rephrase a subset of the dataset into the formats of established benchmarks like MedQA (multiple-choice) and PubMedQA (yes/no) for downstream evaluation.
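As a rough illustration of the Supporting Document Retrieval stage, the sketch below implements plain BM25 scoring in pure Python. This is a simplified stand-in, not the pipeline's actual retrieval code: the real pipeline's tokenization, parameter settings, and PubMed corpus handling may differ, and `k1`/`b` here use common defaults.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with BM25."""
    N = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / N
    # Document frequency: how many documents contain each term.
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

# Toy usage with two hypothetical abstracts (not from the dataset).
corpus = [
    "aspirin inhibits cyclooxygenase reducing inflammation".split(),
    "the weather today is sunny and warm".split(),
]
query = "aspirin inflammation".split()
scores = bm25_scores(query, corpus)
# The abstract sharing terms with the query outranks the unrelated one.
```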
The code is organized sequentially by numbered directories. Generally, scripts within each directory are also numbered in the order they should be run. Directories starting with an underscore (_) contain data outputs, which will be available on Zenodo.
```
├── 0-kg_aquisition/     # Script for downloading the OREGANO KG
├── 1-kg_preprocessing/  # KG hydration, reduction, and graphlet extraction
├── 2-prompt_ablation/   # Scripts and notebooks for the prompt ablation study
├── 3-ds_generation/     # Large-scale QA pair generation from graphlets
├── 4-ds_filtering/      # LLM-based quality filtering and human evaluation analysis
├── 5-retrieval/         # LLM-based annotation of retrieved PubMed documents
├── 6-rephrasing/        # Scripts for downstream task rephrasing and evaluation
│
├── _figures/            # Figures used in the paper and READMEs
└── requirements.txt     # Python package requirements
```
This project requires two separate environments due to library dependencies.
- Graph Processing: The scripts in `1-kg_preprocessing` require `graph-tool`, which must be installed via Conda:

  ```shell
  conda create -n graphtool -c conda-forge graph-tool
  conda activate graphtool
  ```

- Main Environment: All other scripts use standard Python packages. Create a virtual environment and install them:

  ```shell
  python -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  ```
The final BioGraphletQA dataset contains 119,856 question-answer pairs, each grounded in a graphlet of 3 to 5 nodes. The table below shows the generation and acceptance statistics for each of the 29 graphlet shapes.
| Graphlet ID | Sampled Graphlets | Generated QA Pairs | Accepted QA Pairs | Acceptance Rate |
|---|---|---|---|---|
| 1 | 9,954 | 9,913 | 4,544 | 45.8% |
| 2 | 3,702 | 3,690 | 1,744 | 47.3% |
| 3 | 9,826 | 9,783 | 4,149 | 42.4% |
| ... | ... | ... | ... | ... |
| 28 | 10,067 | 9,577 | 5,593 | 58.4% |
| 29 | 10,019 | 9,639 | 6,059 | 62.9% |
(Table abridged for brevity. See 4-ds_filtering/README.md for the full table.)
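The acceptance rate in the table is simply accepted QA pairs divided by generated QA pairs, rounded to one decimal place. A quick spot-check against the rows shown above:

```python
def acceptance_rate(accepted, generated):
    """Acceptance rate as a percentage of generated QA pairs, one decimal."""
    return round(100 * accepted / generated, 1)

# Spot-check against the rows shown in the table above.
assert acceptance_rate(4544, 9913) == 45.8   # graphlet 1
assert acceptance_rate(1744, 3690) == 47.3   # graphlet 2
assert acceptance_rate(6059, 9639) == 62.9   # graphlet 29
```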
The quality of the dataset and the effectiveness of our LLM-based filter were validated by a human domain expert.
- A total of 106 QA pairs (78 accepted by the filter, 28 rejected) were manually annotated.
- The evaluation used a 5-point Likert scale across several criteria, including Scientific Validity, Question Complexity, and Answer Completeness.
- The results confirmed a significant quality difference between the accepted and rejected sets, validating our automated filtering approach.
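To give a purely illustrative sense of this accepted-vs-rejected comparison, the snippet below computes per-group mean scores for a single criterion. The Likert values here are hypothetical placeholders, not the paper's actual expert annotations; see `4-ds_filtering/` for the real analysis.

```python
from statistics import mean

# Hypothetical 5-point Likert annotations for one criterion
# (illustrative values only, NOT the paper's actual expert scores).
accepted_scores = [5, 4, 5, 4, 4, 5, 3, 4]   # filter-accepted QA pairs
rejected_scores = [2, 3, 1, 2, 3, 2]         # filter-rejected QA pairs

gap = mean(accepted_scores) - mean(rejected_scores)
print(f"accepted mean: {mean(accepted_scores):.2f}")   # 4.25
print(f"rejected mean: {mean(rejected_scores):.2f}")   # 2.17
print(f"mean gap: {gap:.2f}")
```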
If you use the BioGraphletQA framework or dataset in your research, please cite our paper: