We propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom’s Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think aloud through step-by-step reflection and refinement.
- OpenAI API key (for GPT model)
- HuggingFace API key (for Open-source model)
- Your own tested model
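As a quick sanity check before launching the pipeline, the sketch below verifies that a credential is available for the chosen model. The environment variable names (`OPENAI_API_KEY`, `HF_TOKEN`) are assumptions for illustration only; the pipeline itself takes the key via `--api_key`.

```python
import os
import sys

# Hypothetical pre-flight check: run.py receives the key via --api_key,
# but keeping it in an environment variable avoids pasting it into shell history.
# The variable names below are assumptions, not something run.py reads.
REQUIRED_KEYS = {
    "gpt": "OPENAI_API_KEY",    # GPT model path
    "open_source": "HF_TOKEN",  # open-source (HuggingFace) model path
}

def check_key(model: str) -> str:
    """Return the API key for the chosen model, or exit with a clear message."""
    var = REQUIRED_KEYS[model]
    key = os.environ.get(var)
    if not key:
        sys.exit(f"Missing {var}; export it or pass the key directly via --api_key.")
    return key

if __name__ == "__main__":
    print("Found key of length", len(check_key("gpt")))
```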
- Clone the repository:

  ```bash
  git clone [your-repo-url]
  cd [your-repo-name]
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure the model: edit `config.json` to set your preferred model parameters (a loading sketch follows this list):

  ```json
  {
    "models": {
      "gpt": {
        "name": "o1-mini",
        "base_url": "your-api-base-url",
        "temperature": 0,
        "max_iterations": 3,
        "quality_threshold": 0.7
      }
    }
  }
  ```

- Run the pipeline:

  ```bash
  python run.py --model gpt --api_key your_api_key --num_questions 120 --max_iterations 3
  ```
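The snippet below is a minimal sketch of how `config.json` could be loaded and checked before a run. The schema follows the example above, but the loader itself is illustrative and is not the repository's actual code path.

```python
import json
from pathlib import Path

# Illustrative loader for the config.json schema shown above.
# run.py may read this file differently; this is only a sketch.
REQUIRED_FIELDS = {"name", "base_url", "temperature", "max_iterations", "quality_threshold"}

def load_model_config(path: str = "config.json", model: str = "gpt") -> dict:
    """Load one model's settings and verify the expected fields are present."""
    config = json.loads(Path(path).read_text())
    settings = config["models"][model]
    missing = REQUIRED_FIELDS - settings.keys()
    if missing:
        raise ValueError(f"config.json is missing fields for '{model}': {sorted(missing)}")
    return settings

if __name__ == "__main__":
    gpt_settings = load_model_config()
    print(f"Using {gpt_settings['name']} with quality threshold {gpt_settings['quality_threshold']}")
```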
The pipeline generates several output files:
- `bad_questions_evaluation_results.json`: Detailed evaluation results
- `round_metrics.csv`: Metrics for each iteration
- `results/cognitive_performance_table.tex`: LaTeX table of cognitive performance
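The sketch below shows one way to inspect these outputs after a run. The exact fields inside the JSON and CSV are assumptions about the output schema, so check your own files before relying on them.

```python
import csv
import json

# Sketch of post-run inspection; file names come from the output list above,
# but the internal structure of each file is assumed, not documented here.
with open("bad_questions_evaluation_results.json") as f:
    results = json.load(f)
print(f"Loaded evaluation results for {len(results)} entries")

with open("round_metrics.csv", newline="") as f:
    rows = list(csv.DictReader(f))
print(f"round_metrics.csv has {len(rows)} rows with columns: {list(rows[0]) if rows else []}")
```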
```bash
python run.py --model [gpt|open_source] --api_key YOUR_API_KEY
```

```bash
python run.py --model gpt \
  --api_key YOUR_API_KEY \
  --num_questions 50 \
  --max_iterations 5
```

- `--model`: Choose between 'gpt' or 'open_source'
- `--api_key`: Your API key for the selected model
- `--num_questions`: Number of questions to process (default: 120)
- `--max_iterations`: Maximum iterations per question (default: 3)
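To sweep over several settings instead of typing commands by hand, a small wrapper like the one below can drive `run.py` with the documented flags. The flag names come from the usage above; the swept values and the placeholder key are only examples.

```python
import subprocess
import sys

# Drive run.py over a few settings using only the flags documented above.
# The specific values swept here are examples, not recommended defaults.
API_KEY = "YOUR_API_KEY"

for num_questions in (50, 120):
    cmd = [
        sys.executable, "run.py",
        "--model", "gpt",
        "--api_key", API_KEY,
        "--num_questions", str(num_questions),
        "--max_iterations", "3",
    ]
    print("Running:", " ".join(cmd[:4]), "...")  # avoid echoing the key
    subprocess.run(cmd, check=True)
```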
The framework provides a comprehensive analysis of question quality:
- Cognitive level performance
- Quality score progression
- Agent agreement metrics
- Improvement suggestions
Results are available in both JSON and CSV formats, with LaTeX table generation for academic papers.
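As one example of working with the CSV output, the sketch below averages a quality score per iteration to show score progression. The column names `round` and `quality_score` are assumptions about the header of `round_metrics.csv`; rename them to match the actual file.

```python
import csv
from collections import defaultdict

# Average quality score per iteration from round_metrics.csv.
# Column names "round" and "quality_score" are assumed; adapt to the real header.
totals, counts = defaultdict(float), defaultdict(int)

with open("round_metrics.csv", newline="") as f:
    for row in csv.DictReader(f):
        r = int(row["round"])
        totals[r] += float(row["quality_score"])
        counts[r] += 1

for r in sorted(totals):
    print(f"iteration {r}: mean quality {totals[r] / counts[r]:.3f}")
```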
Please cite the following paper if you find our benchmark helpful!
This repo is anonymized since the paper is under review.