Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Systematic evaluation framework that automatically rates overthinking behavior in large language models.

License

Notifications You must be signed in to change notification settings

AlexCuadron/ThinkingAgent

Repository files navigation

Thinking Agent

Overview

AgentThink is a systematic evaluation framework that automatically identifies failure patterns in large language models.

Getting Started

First, clone the repository and install the required packages:

cd ThinkingAgent
pip install -r requirements.txt

The framework consists of two main components:

  1. format_message.py: Processes and formats interaction logs into a standardized format
  2. analyze_agent_think.py: Analyzes the formatted interactions and produces overthinking scores

Configuration

The framework uses a config.toml file to configure the LLM settings:

[llm]
model = "claude-3-5-sonnet-20241022"
api_key = ""  # Set via environment variable LLM_API_KEY
temperature = 0.0
max_output_tokens = 4096
num_retries = 3
retry_min_wait = 4
retry_max_wait = 10
retry_multiplier = 2

Evaluation

The evaluation process follows these steps:

  1. Data Collection: Gather interaction logs from models performing agentic tasks
  2. Message Formatting: Use format_message.py to standardize the interaction format
  3. Analysis: Run analyze_overthinking.py to evaluate overthinking behaviors
  4. Scoring: Generate scores (0-10) for each interaction based on:
    • 0-3: Always interacting with the environment
    • 4-7: Sometimes relies on internal reasoning
    • 8-10: Completely relies on internal reasoning

Usage

To analyze a set of interactions:

# Load configuration and initialize LLM
config = load_config()
llm = LLM(config)

# Analyze responses
analyze_responses("path/to/interactions", iteration_number=None)

About

Systematic evaluation framework that automatically rates overthinking behavior in large language models.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published