We propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom’s Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think aloud through step-by-step reflection and refinement.
- OpenAI API key (for GPT model)
- HuggingFace API key (for Open-source model)
- Your own tested model
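As a quick sanity check before launching the pipeline, the sketch below verifies that a credential is available for the chosen model. The environment variable names (`OPENAI_API_KEY`, `HF_TOKEN`) are assumptions for illustration only; the pipeline itself takes the key via `--api_key`.

```python
import os
import sys

# Hypothetical pre-flight check: run.py receives the key via --api_key,
# but keeping it in an environment variable avoids pasting it into shell history.
# The variable names below are assumptions, not something run.py reads.
REQUIRED_KEYS = {
    "gpt": "OPENAI_API_KEY",    # GPT model path
    "open_source": "HF_TOKEN",  # open-source (HuggingFace) model path
}

def check_key(model: str) -> str:
    """Return the API key for the chosen model, or exit with a clear message."""
    var = REQUIRED_KEYS[model]
    key = os.environ.get(var)
    if not key:
        sys.exit(f"Missing {var}; export it or pass the key directly via --api_key.")
    return key

if __name__ == "__main__":
    print("Found key of length", len(check_key("gpt")))
```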
- Clone the repository:

  ```bash
  git clone [your-repo-url]
  cd [your-repo-name]
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure the model: edit `config.json` to set your preferred model parameters (a loading sketch follows this list):

  ```json
  {
    "models": {
      "gpt": {
        "name": "o1-mini",
        "base_url": "your-api-base-url",
        "temperature": 0,
        "max_iterations": 3,
        "quality_threshold": 0.7
      }
    }
  }
  ```

- Run the pipeline:

  ```bash
  python run.py --model gpt --api_key your_api_key --num_questions 120 --max_iterations 3
  ```
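The snippet below is a minimal sketch of how `config.json` could be loaded and checked before a run. The schema follows the example above, but the loader itself is illustrative and is not the repository's actual code path.

```python
import json
from pathlib import Path

# Illustrative loader for the config.json schema shown above.
# run.py may read this file differently; this is only a sketch.
REQUIRED_FIELDS = {"name", "base_url", "temperature", "max_iterations", "quality_threshold"}

def load_model_config(path: str = "config.json", model: str = "gpt") -> dict:
    """Load one model's settings and verify the expected fields are present."""
    config = json.loads(Path(path).read_text())
    settings = config["models"][model]
    missing = REQUIRED_FIELDS - settings.keys()
    if missing:
        raise ValueError(f"config.json is missing fields for '{model}': {sorted(missing)}")
    return settings

if __name__ == "__main__":
    gpt_settings = load_model_config()
    print(f"Using {gpt_settings['name']} with quality threshold {gpt_settings['quality_threshold']}")
```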
The pipeline generates several output files:
- `bad_questions_evaluation_results.json`: Detailed evaluation results
- `round_metrics.csv`: Metrics for each iteration
- `results/cognitive_performance_table.tex`: LaTeX table of cognitive performance
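The sketch below shows one way to inspect these outputs after a run. The exact fields inside the JSON and CSV are assumptions about the output schema, so check your own files before relying on them.

```python
import csv
import json

# Sketch of post-run inspection; file names come from the output list above,
# but the internal structure of each file is assumed, not documented here.
with open("bad_questions_evaluation_results.json") as f:
    results = json.load(f)
print(f"Loaded evaluation results for {len(results)} entries")

with open("round_metrics.csv", newline="") as f:
    rows = list(csv.DictReader(f))
print(f"round_metrics.csv has {len(rows)} rows with columns: {list(rows[0]) if rows else []}")
```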
```bash
python run.py --model [gpt|open_source] --api_key YOUR_API_KEY
```

```bash
python run.py --model gpt \
  --api_key YOUR_API_KEY \
  --num_questions 50 \
  --max_iterations 5
```

- `--model`: Choose between 'gpt' or 'open_source'
- `--api_key`: Your API key for the selected model
- `--num_questions`: Number of questions to process (default: 120)
- `--max_iterations`: Maximum iterations per question (default: 3)
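To sweep over several settings instead of typing commands by hand, a small wrapper like the one below can drive `run.py` with the documented flags. The flag names come from the usage above; the swept values and the placeholder key are only examples.

```python
import subprocess
import sys

# Drive run.py over a few settings using only the flags documented above.
# The specific values swept here are examples, not recommended defaults.
API_KEY = "YOUR_API_KEY"

for num_questions in (50, 120):
    cmd = [
        sys.executable, "run.py",
        "--model", "gpt",
        "--api_key", API_KEY,
        "--num_questions", str(num_questions),
        "--max_iterations", "3",
    ]
    print("Running:", " ".join(cmd[:4]), "...")  # avoid echoing the key
    subprocess.run(cmd, check=True)
```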
The framework provides a comprehensive analysis of question quality:
- Cognitive level performance
- Quality score progression
- Agent agreement metrics
- Improvement suggestions
Results are available in both JSON and CSV formats, with LaTeX table generation for academic papers.
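As one example of working with the CSV output, the sketch below averages a quality score per iteration to show score progression. The column names `round` and `quality_score` are assumptions about the header of `round_metrics.csv`; rename them to match the actual file.

```python
import csv
from collections import defaultdict

# Average quality score per iteration from round_metrics.csv.
# Column names "round" and "quality_score" are assumed; adapt to the real header.
totals, counts = defaultdict(float), defaultdict(int)

with open("round_metrics.csv", newline="") as f:
    for row in csv.DictReader(f):
        r = int(row["round"])
        totals[r] += float(row["quality_score"])
        counts[r] += 1

for r in sorted(totals):
    print(f"iteration {r}: mean quality {totals[r] / counts[r]:.3f}")
```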
Please cite the following paper if you find our benchmark helpful!
This repo is anonymized since the paper is under review.