LLM Token Generation Introspection

🎯 Why Token Introspection Matters for AI Agents

When developing AI agents and complex LLM-based systems, prompt debugging is a critical development stage. Unlike traditional programming where you can use debuggers and breakpoints, prompt engineering requires entirely different tools to understand how and why a model makes specific decisions.

This tool provides deep introspection into the token generation process, enabling you to:

  • Visualize Top-K candidate probabilities for each token
  • Track the impact of different prompting techniques on probability distributions
  • Identify moments of model uncertainty (low confidence)
  • Compare the effectiveness of different query formulations
  • Understand how context and system prompts influence token selection

🚀 Quick Start

Requirements

  • Locally running llama.cpp server with API enabled
  • Any modern web browser

Installation and Launch

  1. Download the logit-m.html file from the repository
  2. Open the file in your browser (double-click or File → Open)
  3. Enter the address of your llama.cpp server (e.g., http://127.0.0.1:8080/v1)
  4. Done! The application is fully self-contained — all JavaScript and CSS are embedded in the HTML file
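
If you want to confirm from the command line that the server actually returns the probabilities the tool visualizes, the following minimal Python sketch can help. It assumes a recent llama.cpp build whose OpenAI-compatible endpoint accepts the logprobs/top_logprobs request fields and returns an OpenAI-style payload; some builds expose a similar n_probs option on the native /completion endpoint instead, and field names may differ by version.

    import requests

    # Adjust to the address you enter in the tool
    URL = "http://127.0.0.1:8080/v1/chat/completions"

    payload = {
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 8,
        "logprobs": True,     # ask for per-token log probabilities
        "top_logprobs": 5,    # and the Top-5 candidates at each position
    }

    resp = requests.post(URL, json=payload, timeout=60).json()

    # OpenAI-style shape: choices[0].logprobs.content is a list of per-token entries
    for entry in resp["choices"][0]["logprobs"]["content"]:
        candidates = [(c["token"], round(c["logprob"], 3)) for c in entry["top_logprobs"]]
        print(repr(entry["token"]), round(entry["logprob"], 3), candidates)

If the logprobs field comes back empty, the server was most likely started without log-probability support (see Limitations below).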

📊 Prompt Analysis and Debugging Methodologies

1. Anchoring Analysis

Methodology: Anchoring is a technique where key concepts or instructions are placed at the beginning and end of the prompt to strengthen their influence on the model.

How to use the tool:

  1. Create two prompt variants:

    • Variant A: No anchoring — "Write a product description for a smartphone"
    • Variant B: With anchoring — "**Professional marketing description**. Write a product description for a smartphone. **Focus on innovation and premium quality**."
  2. Run both variants and compare (see the sketch after this section):

    • Average confidence — anchoring should increase model confidence
    • Top-1 probabilities — analyze how probabilities of key tokens change
    • Token choices — track the first 5-10 tokens: anchoring should shift selection toward more specific vocabulary

Expected results: With effective anchoring, you should see:

  • 5-15% increase in average confidence
  • More predictable selection of specialized tokens
  • Reduced variability in Top-K candidates
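
A minimal sketch of the comparison in step 2, under the same assumptions as the Quick Start example (OpenAI-style logprobs payload; the URL and prompts are just the examples from this section). Average confidence here is the mean of exp(logprob) over the generated tokens, matching the metric described later in this README.

    import math
    import requests

    URL = "http://127.0.0.1:8080/v1/chat/completions"  # adjust to your server

    def average_confidence(prompt: str, max_tokens: int = 64) -> float:
        """Mean probability of the chosen token across the whole generation."""
        resp = requests.post(URL, json={
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "temperature": 0,   # deterministic decoding keeps the A/B runs comparable
            "logprobs": True,
        }, timeout=120).json()
        tokens = resp["choices"][0]["logprobs"]["content"]
        return sum(math.exp(t["logprob"]) for t in tokens) / len(tokens)

    variant_a = "Write a product description for a smartphone"
    variant_b = ("**Professional marketing description**. Write a product description "
                 "for a smartphone. **Focus on innovation and premium quality**.")

    print("A (no anchoring):  ", average_confidence(variant_a))
    print("B (with anchoring):", average_confidence(variant_b))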

2. Semantic Markup Analysis

Methodology: Using structured formats (XML, Markdown, JSON) to explicitly specify roles and hierarchy of prompt elements.

How to use the tool:

  1. Compare three variants:

    Plain text:

    You are an assistant. Analyze the text and extract key ideas.
    Text: [content]
    

    Markdown:

    ## Role
    You are a text analyst
    
    ## Task
    Extract key ideas
    
    ## Input
    [content]
    

    XML:

    <role>Text analyst</role>
    <task>Extract key ideas</task>
    <input>[content]</input>
  2. Analyze metrics:

    • Min confidence — structured format should increase minimum confidence
    • Confidence distribution — look at the percentage of High confidence (≥90%) tokens
    • Step-by-step tokens — check if the model follows the markup structure

Expected results:

  • XML/Markdown markup reduces Low confidence (<70%) tokens by 10-20%
  • Model better separates logical blocks in responses
  • Increased consistency in output
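
One way to compute that confidence distribution outside the UI, sketched against the per-token list from an OpenAI-style response (an assumed payload shape, as above). The thresholds mirror the buckets used in this README: High ≥90%, Low <70%.

    import math

    def confidence_buckets(token_entries):
        """Share of High (>=90%), Medium (70-90%) and Low (<70%) confidence tokens.

        token_entries is the choices[0].logprobs.content list from the response.
        """
        buckets = {"high": 0, "medium": 0, "low": 0}
        for entry in token_entries:
            p = math.exp(entry["logprob"])
            if p >= 0.90:
                buckets["high"] += 1
            elif p >= 0.70:
                buckets["medium"] += 1
            else:
                buckets["low"] += 1
        total = sum(buckets.values())
        return {name: count / total for name, count in buckets.items()}

Run it on the plain-text, Markdown and XML variants and compare the low-confidence share directly.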

3. Chain-of-Thought (CoT) Introspection

Methodology: Forcing the model to reason step-by-step through explicit instructions or examples.

How to use the tool:

  1. Compare prompts:

    Direct query:

    Solve the problem: Mary had 15 apples, she gave 40% to Peter. How many are left?
    

    CoT prompt:

    Solve the problem step by step:
    1. Determine the initial quantity
    2. Calculate the percentage
    3. Find the result
    
    Problem: Mary had 15 apples, she gave 40% to Peter. How many are left?
    
  2. Track tokens:

    • Look for reasoning patterns — tokens like "first", "then", "therefore"
    • Analyze confidence on numerical tokens — CoT should increase confidence in calculations
    • Check sequencing — the model should generate tokens in a logical order

Expected results:

  • Appearance of intermediate reasoning with high confidence (>95%)
  • Higher probability of correct numerical tokens
  • Reduction in "impulsive" low-confidence answers
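
A rough sketch of the token-level checks in step 2, over the same assumed per-token list; the marker list is illustrative, not exhaustive, and subword tokenization means markers can be split across tokens.

    import math

    REASONING_MARKERS = {"first", "then", "therefore", "next", "so", "step"}

    def cot_report(token_entries):
        """Print reasoning-marker tokens and the mean confidence of numeric tokens."""
        numeric_probs = []
        for entry in token_entries:
            text = entry["token"].strip().lower()
            p = math.exp(entry["logprob"])
            if text in REASONING_MARKERS:
                print(f"reasoning marker {text!r}: p={p:.2f}")
            if any(ch.isdigit() for ch in text):
                numeric_probs.append(p)
        if numeric_probs:
            print("mean confidence on numeric tokens:",
                  round(sum(numeric_probs) / len(numeric_probs), 3))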

4. Few-Shot Learning Analysis

Methodology: Providing input/output examples for in-context learning.

How to use the tool:

  1. Create prompts with different numbers of examples (0-shot, 1-shot, 3-shot)
  2. Analyze:
    • Token probability convergence — each new example should reinforce the pattern
    • Sampled tokens ratio — with good few-shot examples, the share of sampled (non-top-1) tokens should decrease (see the sketch after this section)
    • Consistency across runs — run multiple times and compare variance

Expected results:

  • 3-shot prompts show 10-20% higher average confidence
  • Reduced variability in token selection
  • Stricter adherence to example format
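
The sampled tokens ratio can be approximated by checking, per position, whether the chosen token matches the Top-1 candidate; a sketch under the same payload assumption (the request must be made with top_logprobs >= 1 so each entry carries a candidate list).

    def non_top1_ratio(token_entries):
        """Fraction of positions where the chosen token is not the Top-1 candidate."""
        sampled = 0
        for entry in token_entries:
            top1 = max(entry["top_logprobs"], key=lambda c: c["logprob"])
            if entry["token"] != top1["token"]:
                sampled += 1
        return sampled / len(token_entries)

Compute it for the 0-shot, 1-shot and 3-shot prompts and compare. Note that with greedy decoding (temperature 0) the ratio is close to zero for every prompt, so leave sampling enabled for this particular test.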

5. Context Window and Attention Decay

Methodology: Understanding how context length affects model attention to different prompt parts.

How to use the tool:

  1. Create a long prompt with key information at different positions (beginning/middle/end)
  2. Ask a question requiring information from a specific position
  3. Track:
    • Confidence on answer tokens — information placed at the beginning or end of the context typically yields higher confidence than information in the middle
    • Structural token appearance — the model may "lose" structure in the middle of a long context

Lost in the middle problem:

  • If key information is in the middle of long context and you see low confidence on answer tokens — this is classic "lost in the middle"
  • Solution: Duplicate important information at the beginning and end of the prompt
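
A crude positional probe for this effect, with the same caveats as the earlier sketches; the filler text and the fact are placeholders, and confidence is averaged over the answer tokens only.

    import math
    import requests

    URL = "http://127.0.0.1:8080/v1/chat/completions"  # adjust to your server
    FILLER = "This sentence is unrelated filler text. " * 200
    FACT = " The project codename is BLUEBIRD. "  # placeholder fact to retrieve

    def answer_confidence(position: str) -> float:
        """Place the fact at the start, middle or end of a long context and
        return the mean confidence of the tokens in the model's answer."""
        if position == "start":
            context = FACT + FILLER
        elif position == "end":
            context = FILLER + FACT
        else:
            half = len(FILLER) // 2
            context = FILLER[:half] + FACT + FILLER[half:]
        resp = requests.post(URL, json={
            "messages": [{"role": "user",
                          "content": context + "\nWhat is the project codename?"}],
            "max_tokens": 16,
            "temperature": 0,
            "logprobs": True,
        }, timeout=300).json()
        tokens = resp["choices"][0]["logprobs"]["content"]
        return sum(math.exp(t["logprob"]) for t in tokens) / len(tokens)

    for pos in ("start", "middle", "end"):
        print(pos, round(answer_confidence(pos), 3))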

📈 Practical Usage Scenarios

Scenario 1: Debugging Tool-Calling Agent

Problem: Agent selects wrong functions or passes incorrect parameters.

Solution through introspection:

  1. Review the tokens where the model generates the function name
  2. Check the Top-K candidates — if the correct function is in the Top-3 but not Top-1, adjust the prompt to strengthen its priority (see the sketch after this list)
  3. Analyze confidence on parameter tokens — low confidence indicates ambiguity in the function description
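
The sketch referenced in step 2: dumping the Top-K candidates at every generated position lets you see whether the correct function name was the Top-1 choice. It operates on the same assumed per-token list as the earlier examples; the tool/function definitions themselves are omitted here.

    import math

    def dump_topk(token_entries, k: int = 5):
        """Print the chosen token and its Top-K competitors for each position."""
        for i, entry in enumerate(token_entries):
            cands = sorted(entry["top_logprobs"],
                           key=lambda c: c["logprob"], reverse=True)[:k]
            line = ", ".join(f"{c['token']!r}:{math.exp(c['logprob']):.2f}" for c in cands)
            print(f"{i:3d} chose {entry['token']!r} | {line}")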

Scenario 2: A/B Testing Prompts for Production

Problem: Need to choose the best prompt from several variants.

Comparison metrics:

  1. Average confidence — higher is better for accuracy-requiring tasks
  2. Min confidence — critical for safety (e.g., medical/legal advice)
  3. Sampled tokens ratio — lower = more deterministic behavior
  4. Confidence distribution — more High (≥90%) tokens = better

Recommendation: Run each prompt 10+ times on different test queries and compute statistics over these metrics, as in the sketch below.
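
A sketch of that aggregation, reusing the request shape from the earlier examples (same payload-shape assumption; sampling is left at the server defaults so that run-to-run variance stays visible).

    import math
    import statistics
    import requests

    URL = "http://127.0.0.1:8080/v1/chat/completions"  # adjust to your server

    def run_metrics(prompt: str) -> dict:
        """Average and minimum confidence for a single generation."""
        resp = requests.post(URL, json={
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 128,
            "logprobs": True,
        }, timeout=120).json()
        probs = [math.exp(t["logprob"])
                 for t in resp["choices"][0]["logprobs"]["content"]]
        return {"avg_confidence": sum(probs) / len(probs), "min_confidence": min(probs)}

    def prompt_stats(prompt: str, runs: int = 10) -> dict:
        """Mean and standard deviation of each metric over several runs."""
        samples = [run_metrics(prompt) for _ in range(runs)]
        return {key: (statistics.mean(s[key] for s in samples),
                      statistics.stdev(s[key] for s in samples))
                for key in samples[0]}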

🔬 In-Depth Analysis

Understanding Metrics

  • Average confidence — average model confidence across all generated tokens. High values (>90%) indicate good prompt-task alignment.
  • Average log p — average log probability. Closer to 0 = higher confidence (typically -3 to 0).
  • Min confidence — minimum confidence. Critical indicator for identifying "weak spots" in generation.
  • Top-1 candidate chosen — how often the model chose the most probable token. High percentage (>70%) = deterministic behavior.
  • Filtered candidates — number of tokens filtered through top-p/min-p. Shows how aggressive the sampling is.
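
These definitions map directly onto the per-token list; a sketch under the same payload-shape assumption. Filtered-candidate counts are not part of the OpenAI-style response, so they are omitted here.

    import math

    def summarize(token_entries) -> dict:
        """Summary metrics computed from choices[0].logprobs.content."""
        logps = [t["logprob"] for t in token_entries]
        probs = [math.exp(lp) for lp in logps]
        top1_hits = sum(
            1 for t in token_entries
            if t["token"] == max(t["top_logprobs"], key=lambda c: c["logprob"])["token"]
        )
        return {
            "average_confidence": sum(probs) / len(probs),
            "average_log_p": sum(logps) / len(logps),
            "min_confidence": min(probs),
            "top1_candidate_chosen": top1_hits / len(token_entries),
        }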

Probability Interpretation

High probability (>90%) — Model very confident, token is expected
Medium probability (50-80%) — Several competing variants
Low probability (<50%) — High uncertainty, unexpected choices possible

Red flags:

  • If several critical tokens in a row have <20% probability — prompt needs rewriting
  • If Top-1 and Top-2 have close probabilities (e.g., 35% vs 32%) — model at bifurcation point, small prompt change can dramatically alter result
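
Both red flags are easy to detect automatically; a sketch over the same assumed per-token list, where the 20% threshold, the 5-point gap and the run length of 3 simply mirror the examples above and can be tuned.

    import math

    def red_flags(token_entries, low_p=0.20, gap=0.05, run_len=3):
        """Report runs of very low-probability tokens and near-tie (bifurcation) positions."""
        flags, low_run = [], 0
        for i, entry in enumerate(token_entries):
            cands = sorted(entry["top_logprobs"],
                           key=lambda c: c["logprob"], reverse=True)
            p1 = math.exp(cands[0]["logprob"])
            p2 = math.exp(cands[1]["logprob"]) if len(cands) > 1 else 0.0
            if p1 - p2 < gap:
                flags.append(f"position {i}: bifurcation point ({p1:.2f} vs {p2:.2f})")
            if math.exp(entry["logprob"]) < low_p:
                low_run += 1
                if low_run == run_len:
                    flags.append(f"position {i}: {run_len} consecutive tokens below {low_p:.0%}")
            else:
                low_run = 0
        return flags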

🛠️ Limitations and Features

  • Works only with llama.cpp — adaptation required for other inference engines (vLLM, TGI, Transformers)
  • Requires enabled logprobs — make sure the llama.cpp server is running with support for returning log probabilities
  • Performance impact — requesting logprobs slows generation by 10-30%, use only for debugging
  • Sampling parameters — tool shows results after applying top-p/top-k/min-p, not "raw" probabilities of all vocabulary tokens

🤝 Contributing

Pull requests welcome with:

  • New prompt analysis methodologies
  • Usage examples for specific domains (medicine, law, code generation)
  • Metric visualization improvements
  • Integration with other inference backends

Developed to assist in debugging LLM agents and improving prompt engineering quality
