When you develop AI agents and complex LLM-based systems, prompt debugging becomes a critical part of the workflow. Unlike traditional programming, where you can step through code with debuggers and breakpoints, prompt engineering requires entirely different tools to understand how and why a model makes specific decisions.
This tool provides deep introspection into the token generation process, enabling you to:
- Visualize Top-K candidate probabilities for each token
- Track the impact of different prompting techniques on probability distributions
- Identify moments of model uncertainty (low confidence)
- Compare the effectiveness of different query formulations
- Understand how context and system prompts influence token selection
Requirements:
- A locally running llama.cpp server with the API enabled
- Any modern web browser
Quick start:
- Download the `logit-m.html` file from the repository
- Open the file in your browser (double-click or File → Open)
- Enter the address of your llama.cpp server (e.g., `http://127.0.0.1:8080/v1`)
- Done! The application is fully self-contained: all JavaScript and CSS are embedded in the HTML file
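If you prefer to script experiments alongside the UI, the same data the tool visualizes can be fetched directly from the server. Below is a minimal sketch, assuming your llama.cpp build exposes the OpenAI-compatible `/v1/chat/completions` endpoint and honors the `logprobs`/`top_logprobs` request fields; the helper name and the exact response field names are assumptions, so check what your server version actually returns:

```python
import math

import requests  # third-party: pip install requests

BASE_URL = "http://127.0.0.1:8080/v1"  # address of your llama.cpp server


def get_token_logprobs(prompt: str, top_k: int = 5, max_tokens: int = 64) -> list[dict]:
    """Return per-token logprob records for one completion.

    Assumes an OpenAI-style response shape:
    choices[0].logprobs.content = [{"token", "logprob", "top_logprobs": [...]}, ...]
    Adjust the field names if your llama.cpp build returns something different.
    """
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": "local",  # llama.cpp typically serves whatever model it has loaded
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "logprobs": True,
            "top_logprobs": top_k,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["logprobs"]["content"]


if __name__ == "__main__":
    for rec in get_token_logprobs("Write a product description for a smartphone"):
        # exp(log p) converts the log probability back to a plain probability
        print(f"{rec['token']!r:>16}  p={math.exp(rec['logprob']):.1%}")
```

The later sketches in this document operate on the same per-token record list that this helper returns.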
Methodology: Anchoring is a technique where key concepts or instructions are placed at the beginning and end of the prompt to strengthen their influence on the model.
How to use the tool:
- Create two prompt variants:
  - Variant A (no anchoring): "Write a product description for a smartphone"
  - Variant B (with anchoring): "**Professional marketing description**. Write a product description for a smartphone. **Focus on innovation and premium quality**."
- Run both variants and compare (a scripted version of this comparison is sketched at the end of this section):
  - Average confidence — anchoring should increase model confidence
  - Top-1 probabilities — analyze how the probabilities of key tokens change
  - Token choices — track the first 5-10 tokens: anchoring should shift selection toward more specific vocabulary
Expected results: With effective anchoring, you should see:
- 5-15% increase in average confidence
- More predictable selection of specialized tokens
- Reduced variability in Top-K candidates
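The average-confidence comparison for this anchoring experiment can also be run from a script. A minimal sketch: the record shape mirrors the OpenAI-style logprobs content assumed in the quick-start sketch, and both the function name and the sample numbers are illustrative, not output from a real run.

```python
import math


def average_confidence(content: list[dict]) -> float:
    """Mean probability of the chosen tokens.

    `content` is choices[0]["logprobs"]["content"] from the llama.cpp
    response (see the request sketch in the quick start).
    """
    if not content:
        return 0.0
    return sum(math.exp(rec["logprob"]) for rec in content) / len(content)


# Illustrative records with made-up numbers; in practice fetch both variants
# from the server and pass the real per-token logprob lists.
variant_a = [{"token": "The", "logprob": -0.69}, {"token": " phone", "logprob": -0.51}]
variant_b = [{"token": "Meet", "logprob": -0.11}, {"token": " the", "logprob": -0.22}]

print(f"A (no anchoring):   {average_confidence(variant_a):.1%}")
print(f"B (with anchoring): {average_confidence(variant_b):.1%}")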
Methodology: Using structured formats (XML, Markdown, JSON) to explicitly specify roles and hierarchy of prompt elements.
How to use the tool:
- Compare three variants:
  - Plain text: `You are an assistant. Analyze the text and extract key ideas. Text: [content]`
  - Markdown: `## Role You are a text analyst ## Task Extract key ideas ## Input [content]`
  - XML: `<role>Text analyst</role> <task>Extract key ideas</task> <input>[content]</input>`
- Analyze metrics (the confidence-distribution check is sketched in code at the end of this section):
  - Min confidence — structured format should increase minimum confidence
  - Confidence distribution — look at the percentage of High confidence (≥90%) tokens
  - Step-by-step tokens — check if the model follows the markup structure
Expected results:
- XML/Markdown markup reduces Low confidence (<70%) tokens by 10-20%
- Model better separates logical blocks in responses
- Increased consistency in output
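To quantify the confidence-distribution comparison above, the per-token probabilities can be bucketed with a few lines of code. A sketch using the same High/Low thresholds as this guide; the function name and record shape are assumptions, not part of the tool:

```python
import math


def confidence_buckets(content: list[dict]) -> dict[str, float]:
    """Share of tokens in the High (>=90%), Medium (70-90%) and Low (<70%) bands.

    `content` is the per-token logprob list from the llama.cpp response
    (see the request sketch in the quick start).
    """
    if not content:
        return {"high": 0.0, "medium": 0.0, "low": 0.0}
    probs = [math.exp(rec["logprob"]) for rec in content]
    n = len(probs)
    return {
        "high": sum(p >= 0.90 for p in probs) / n,
        "medium": sum(0.70 <= p < 0.90 for p in probs) / n,
        "low": sum(p < 0.70 for p in probs) / n,
    }


# Compare the "low" share between the plain-text and XML/Markdown prompts:
# the structured variants should shrink it.
```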
Methodology: Forcing the model to reason step-by-step through explicit instructions or examples.
How to use the tool:
- Compare prompts:
  - Direct query: `Solve the problem: Mary had 15 apples, she gave 40% to Peter. How many are left?`
  - CoT prompt: `Solve the problem step by step: 1. Determine the initial quantity 2. Calculate the percentage 3. Find the result. Problem: Mary had 15 apples, she gave 40% to Peter. How many are left?`
- Track tokens (see the token-scan sketch at the end of this section):
  - Look for reasoning patterns — tokens like "first", "then", "therefore"
  - Analyze confidence on numerical tokens — CoT should increase confidence in calculations
  - Check sequencing — the model should generate tokens in logical order
Expected results:
- Appearance of intermediate reasoning with high confidence (>95%)
- Higher probability of correct numerical tokens
- Reduction in "impulsive" low-confidence answers
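The token-tracking step can be automated with a simple scan over the generated tokens. A rough sketch; the marker list, the 95% threshold and the function name are all illustrative assumptions:

```python
import math

# Illustrative marker list; extend it for your language and domain.
REASONING_MARKERS = ("first", "then", "therefore", "step", "next", "so")


def inspect_cot(content: list[dict]) -> None:
    """Print reasoning-marker tokens and the confidence of numeric tokens.

    `content` is the per-token logprob list from the llama.cpp response
    (see the request sketch in the quick start).
    """
    for i, rec in enumerate(content):
        token = rec["token"]
        prob = math.exp(rec["logprob"])
        text = token.strip().lower()
        if any(marker in text for marker in REASONING_MARKERS):
            print(f"[{i:3d}] reasoning marker {token!r:>12}  p={prob:.1%}")
        elif any(ch.isdigit() for ch in token):
            note = "" if prob >= 0.95 else "  <- uncertain calculation step"
            print(f"[{i:3d}] numeric token    {token!r:>12}  p={prob:.1%}{note}")
```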
Methodology: Providing input/output examples for in-context learning.
How to use the tool:
- Create prompts with different numbers of examples (0-shot, 1-shot, 3-shot)
- Analyze:
  - Token probability convergence — each new example should reinforce the pattern
  - Sampled tokens ratio — with good few-shot examples, sampled (non-top-1) tokens should decrease
  - Consistency across runs — run multiple times and compare variance
Expected results:
- 3-shot prompts show 10-20% higher average confidence
- Reduced variability in token selection
- Stricter adherence to example format
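The sampled-tokens ratio mentioned above is easy to compute from the same per-token records. A sketch; the field names follow the OpenAI-style shape assumed earlier and the function name is illustrative:

```python
def sampled_token_ratio(content: list[dict]) -> float:
    """Fraction of positions where the chosen token is not the Top-1 candidate.

    Requires the request to ask for top_logprobs, so each record carries a
    `top_logprobs` list (OpenAI-style shape; adjust names to your server).
    """
    if not content:
        return 0.0
    non_top1 = 0
    for rec in content:
        best = max(rec["top_logprobs"], key=lambda c: c["logprob"])
        if best["token"] != rec["token"]:
            non_top1 += 1
    return non_top1 / len(content)


# Run it on 0-shot, 1-shot and 3-shot variants of the same task: a falling
# ratio means the examples are pushing generation toward Top-1 choices.
```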
Methodology: Understanding how context length affects model attention to different prompt parts.
How to use the tool:
- Create a long prompt with key information at different positions (beginning/middle/end)
- Ask a question requiring information from a specific position
- Track:
  - Confidence on answer tokens — information from the beginning/end has higher confidence
  - Structural token appearance — the model may "lose" structure in the middle of long context
Lost in the middle problem:
- If key information is in the middle of long context and you see low confidence on answer tokens — this is classic "lost in the middle"
- Solution: Duplicate important information at the beginning and end of the prompt
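A quick way to reproduce this experiment is to build the same prompt with the key fact planted at different positions. A sketch with entirely made-up probe content (the fact, filler and question are placeholders):

```python
# Hypothetical probe: the fact, the filler and the question are placeholders;
# only the position of the fact changes between the three variants.
FACT = "The project codename is BLUEBIRD."
FILLER = "Background sentence that pads the context without useful facts. " * 60
QUESTION = "\n\nQuestion: what is the project codename? Answer with one word."


def build_probe(position: str) -> str:
    """Return a long prompt with the key fact at the beginning, middle or end."""
    if position == "beginning":
        return FACT + " " + FILLER + FILLER + QUESTION
    if position == "middle":
        return FILLER + FACT + " " + FILLER + QUESTION
    return FILLER + FILLER + FACT + QUESTION


# Send each variant to the server and compare confidence on the answer tokens:
# noticeably lower confidence for the "middle" variant is the classic
# lost-in-the-middle signature.
```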
Problem: Agent selects wrong functions or passes incorrect parameters.
Solution through introspection:
- Review the tokens at the point where the model generates the function name
- Check Top-K candidates — if the correct function is in Top-3 but not Top-1, adjust prompt to strengthen its priority
- Analyze confidence on parameter tokens — low confidence indicates ambiguity in function description
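This inspection can be scripted as well: scan the generated tokens for the places where the function name could begin and print the Top-K candidates there. A rough sketch; the helper, its prefix-matching heuristic and the `get_weather` name in the usage comment are all hypothetical:

```python
import math


def inspect_function_choice(content: list[dict], expected: str, k: int = 3) -> None:
    """Show Top-K candidates at positions where the expected function name could start.

    `content` is the per-token logprob list (with top_logprobs requested) and
    `expected` is the function you wanted the agent to call. Field names
    follow the assumed OpenAI-style shape.
    """
    for i, rec in enumerate(content):
        candidates = sorted(rec["top_logprobs"], key=lambda c: c["logprob"], reverse=True)[:k]
        names = [c["token"].strip() for c in candidates]
        # Function names usually span several subword tokens, so match prefixes.
        if any(name and expected.startswith(name) for name in names):
            ranked = ", ".join(f"{c['token']!r}={math.exp(c['logprob']):.0%}" for c in candidates)
            print(f"[{i:3d}] chose {rec['token']!r}; Top-{k}: {ranked}")


# Example: inspect_function_choice(content, expected="get_weather")
```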
Problem: Need to choose the best prompt from several variants.
Comparison metrics:
- Average confidence — higher is better for accuracy-requiring tasks
- Min confidence — critical for safety (e.g., medical/legal advice)
- Sampled tokens ratio — lower = more deterministic behavior
- Confidence distribution — more High (≥90%) tokens = better
Recommendation: Run each prompt 10+ times on different test queries and build metric statistics.
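One way to follow that recommendation in a script is to run each candidate prompt several times and aggregate the metrics. A sketch; the helper names are assumptions, and `fetch` can be any function you use to call the server (for example the `get_token_logprobs` sketch from the quick start):

```python
import math
import statistics
from typing import Callable


def run_stats(fetch: Callable[[str], list[dict]], prompt: str, runs: int = 10) -> dict[str, float]:
    """Aggregate confidence metrics over repeated runs of one prompt.

    `fetch` returns the per-token logprob list for a prompt.
    """
    avg_conf: list[float] = []
    min_conf: list[float] = []
    for _ in range(runs):
        probs = [math.exp(rec["logprob"]) for rec in fetch(prompt)]
        if probs:
            avg_conf.append(sum(probs) / len(probs))
            min_conf.append(min(probs))
    return {
        "avg_confidence_mean": statistics.mean(avg_conf),
        "avg_confidence_stdev": statistics.stdev(avg_conf) if len(avg_conf) > 1 else 0.0,
        "min_confidence_mean": statistics.mean(min_conf),
    }
```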
Key metrics:
- Average confidence — average model confidence across all generated tokens. High values (>90%) indicate good prompt-task alignment.
- Average log p — average log probability. Closer to 0 = higher confidence (typically -3 to 0).
- Min confidence — minimum confidence. Critical indicator for identifying "weak spots" in generation.
- Top-1 candidate chosen — how often the model chose the most probable token. High percentage (>70%) = deterministic behavior.
- Filtered candidates — number of tokens filtered through top-p/min-p. Shows how aggressive the sampling is.
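Confidence and log p are two views of the same quantity, since p = exp(log p). A quick check in plain Python:

```python
import math

# log p -> probability: the "typically -3 to 0" range maps to roughly 5%-100%
for logp in (0.0, -0.105, -0.693, -1.6, -3.0):
    print(f"log p = {logp:6.3f}  ->  p = {math.exp(logp):5.1%}")
# 0.000 -> 100.0%, -0.105 -> 90.0%, -0.693 -> 50.0%, -1.600 -> 20.2%, -3.000 -> 5.0%
```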
Interpreting token probabilities:
- High probability (>90%) — the model is very confident, the token is expected
- Medium probability (50-80%) — several competing candidates
- Low probability (<50%) — high uncertainty, unexpected choices are possible
Red flags:
- If several critical tokens in a row have <20% probability — prompt needs rewriting
- If Top-1 and Top-2 have close probabilities (e.g., 35% vs 32%) — model at bifurcation point, small prompt change can dramatically alter result
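The second red flag can be detected automatically by looking at the gap between the two best candidates. A sketch; the helper name and the 5% gap threshold are arbitrary choices, not part of the tool:

```python
import math


def bifurcation_points(content: list[dict], gap: float = 0.05) -> list[int]:
    """Indices where the Top-1 and Top-2 candidate probabilities are within `gap`.

    These are the positions where a small prompt change is most likely to
    flip the output (the "35% vs 32%" situation described above). Field
    names follow the assumed OpenAI-style logprobs shape.
    """
    hits = []
    for i, rec in enumerate(content):
        top = sorted(rec["top_logprobs"], key=lambda c: c["logprob"], reverse=True)
        if len(top) >= 2:
            p1 = math.exp(top[0]["logprob"])
            p2 = math.exp(top[1]["logprob"])
            if p1 - p2 <= gap:
                hits.append(i)
    return hits
```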
Limitations:
- Works only with llama.cpp — adaptation required for other inference engines (vLLM, TGI, Transformers)
- Requires enabled logprobs — ensure llama.cpp server is running with log probability return support
- Performance impact — requesting logprobs slows generation by 10-30%, use only for debugging
- Sampling parameters — tool shows results after applying top-p/top-k/min-p, not "raw" probabilities of all vocabulary tokens
Pull requests welcome with:
- New prompt analysis methodologies
- Usage examples for specific domains (medicine, law, code generation)
- Metric visualization improvements
- Integration with other inference backends
Developed to assist in debugging LLM agents and improving prompt engineering quality.