This dataset is an extension of MathEDU, tailored for concept-level knowledge tracing tasks. It is designed to predict which specific concepts a student is likely to struggle with in future questions.
The dataset includes the concept annotations associated with each question, as well as the concepts that the student lacked or failed to master on that question.
Each entry in this dataset includes the following fields:
- id: Unique identifier for the problem. Corresponds to the index in the combined MathQA dataset (train + validation + test).
- student_id: The ID of the student who answered the problem.
- associated_concepts: A list of concepts associated with the problem. These represent the knowledge components required to solve the question.
- missing_concepts: A list of concepts that the student failed to demonstrate understanding of when answering the problem. These reflect potential learning gaps.
Each record can be linked back to the corresponding student response in the MathEDU dataset using the id and student_id fields.
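As a minimal sketch of that linkage, the pair (id, student_id) can serve as a composite key. The field names of the MathEDU response records below (answer, correct) are assumptions for illustration, not the actual MathEDU schema:

```python
# Sketch: linking a concept-level record back to a MathEDU student response.
# The MathEDU response fields ("answer", "correct") are assumed for illustration.

concept_records = [
    {"id": 17467, "student_id": 6,
     "associated_concepts": ["Volume Calculation", "Area Calculation"],
     "missing_concepts": ["Volume Calculation"]},
]

mathedu_responses = [
    {"id": 17467, "student_id": 6, "answer": "42 cubic meters", "correct": False},
    {"id": 17467, "student_id": 7, "answer": "40 cubic meters", "correct": True},
]

# Build a lookup keyed on (id, student_id), the shared composite key.
response_by_key = {(r["id"], r["student_id"]): r for r in mathedu_responses}

for rec in concept_records:
    response = response_by_key.get((rec["id"], rec["student_id"]))
    if response is not None:
        print(rec["missing_concepts"], "->", response["answer"])
```

Records with no matching (id, student_id) pair simply produce no lookup hit, so the join is safe to run over the full dataset.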
To access the full content of each question, use the id to match against the combined and indexed MathQA. Here’s how to create the indexed version:
```python
from datasets import load_dataset, concatenate_datasets

# Load the three MathQA splits and concatenate them in order:
# train, then validation, then test.
dataset = load_dataset("json", data_files={
    "train": "MathQA/train.json",
    "validation": "MathQA/dev.json",
    "test": "MathQA/test.json"
})
mathqa = concatenate_datasets([dataset["train"], dataset["validation"], dataset["test"]])

# Add a sequential index column; the `id` field in this dataset refers to it.
mathqa = mathqa.add_column("index", list(range(len(mathqa))))
mathqa.to_json("mathqa_index.json", orient="records", lines=True)
```

Here is an example of an entry in the dataset:
```json
{
    "id": 17467,
    "student_id": 6,
    "associated_concepts": [
        "Volume Calculation",
        "Area Calculation",
        "Unit Conversion"
    ],
    "missing_concepts": [
        "Volume Calculation",
        "Area Calculation"
    ]
}
```
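Given the indexed file, fetching the full question for an entry is a lookup on the index column. A minimal sketch, assuming mathqa_index.json was produced as above; the "Problem" field name is an assumption about the MathQA schema:

```python
import json

# Sketch: resolving an entry's `id` to the full MathQA question.
# The "Problem" field name is an assumption about the MathQA record schema.
entry = {"id": 17467, "student_id": 6,
         "missing_concepts": ["Volume Calculation", "Area Calculation"]}

# Stand-in for reading mathqa_index.json (one JSON record per line).
mathqa_lines = [
    json.dumps({"index": 17466, "Problem": "..."}),
    json.dumps({"index": 17467, "Problem": "A tank measures 2 m by 3 m by 4 m ..."}),
]

# Key each record by its sequential index, then look up the entry's id.
questions = {row["index"]: row for row in map(json.loads, mathqa_lines)}
question = questions[entry["id"]]
print(question["Problem"])
```

In practice the stand-in list would be replaced by iterating over the lines of mathqa_index.json.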