When you develop AI agents and complex LLM-based systems, prompt debugging becomes a critical part of the workflow. Unlike traditional programming, where you can step through code with debuggers and breakpoints, prompt engineering requires entirely different tools to understand how and why a model makes specific decisions.
This tool provides deep introspection into the token generation process, enabling you to:
- Visualize Top-K candidate probabilities for each token
- Track the impact of different prompting techniques on probability distributions
- Identify moments of model uncertainty (low confidence)
- Compare the effectiveness of different query formulations
- Understand how context and system prompts influence token selection
Requirements:
- A locally running llama.cpp server with the API enabled
- Any modern web browser
Quick start:
- Download the `logit-m.html` file from the repository
- Open the file in your browser (double-click or File → Open)
- Enter the address of your llama.cpp server (e.g., `http://127.0.0.1:8080/v1`)
- Done! The application is fully self-contained: all JavaScript and CSS are embedded in the HTML file
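If you prefer to script experiments alongside the UI, the same data the tool visualizes can be fetched directly from the server. Below is a minimal sketch, assuming your llama.cpp build exposes the OpenAI-compatible `/v1/chat/completions` endpoint and honors the `logprobs`/`top_logprobs` request fields; the helper name and the exact response field names are assumptions, so check what your server version actually returns:

```python
import math

import requests  # third-party: pip install requests

BASE_URL = "http://127.0.0.1:8080/v1"  # address of your llama.cpp server


def get_token_logprobs(prompt: str, top_k: int = 5, max_tokens: int = 64) -> list[dict]:
    """Return per-token logprob records for one completion.

    Assumes an OpenAI-style response shape:
    choices[0].logprobs.content = [{"token", "logprob", "top_logprobs": [...]}, ...]
    Adjust the field names if your llama.cpp build returns something different.
    """
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": "local",  # llama.cpp typically serves whatever model it has loaded
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "logprobs": True,
            "top_logprobs": top_k,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["logprobs"]["content"]


if __name__ == "__main__":
    for rec in get_token_logprobs("Write a product description for a smartphone"):
        # exp(log p) converts the log probability back to a plain probability
        print(f"{rec['token']!r:>16}  p={math.exp(rec['logprob']):.1%}")
```

The later sketches in this document operate on the same per-token record list that this helper returns.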
Methodology: Anchoring is a technique where key concepts or instructions are placed at the beginning and end of the prompt to strengthen their influence on the model.
How to use the tool:
- Create two prompt variants:
  - Variant A (no anchoring): "Write a product description for a smartphone"
  - Variant B (with anchoring): "**Professional marketing description**. Write a product description for a smartphone. **Focus on innovation and premium quality**."
- Run both variants and compare (a scripted version of this comparison is sketched at the end of this section):
  - Average confidence — anchoring should increase model confidence
  - Top-1 probabilities — analyze how the probabilities of key tokens change
  - Token choices — track the first 5-10 tokens: anchoring should shift selection toward more specific vocabulary
Expected results: With effective anchoring, you should see:
- 5-15% increase in average confidence
- More predictable selection of specialized tokens
- Reduced variability in Top-K candidates
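The average-confidence comparison for this anchoring experiment can also be run from a script. A minimal sketch: the record shape mirrors the OpenAI-style logprobs content assumed in the quick-start sketch, and both the function name and the sample numbers are illustrative, not output from a real run.

```python
import math


def average_confidence(content: list[dict]) -> float:
    """Mean probability of the chosen tokens.

    `content` is choices[0]["logprobs"]["content"] from the llama.cpp
    response (see the request sketch in the quick start).
    """
    if not content:
        return 0.0
    return sum(math.exp(rec["logprob"]) for rec in content) / len(content)


# Illustrative records with made-up numbers; in practice fetch both variants
# from the server and pass the real per-token logprob lists.
variant_a = [{"token": "The", "logprob": -0.69}, {"token": " phone", "logprob": -0.51}]
variant_b = [{"token": "Meet", "logprob": -0.11}, {"token": " the", "logprob": -0.22}]

print(f"A (no anchoring):   {average_confidence(variant_a):.1%}")
print(f"B (with anchoring): {average_confidence(variant_b):.1%}")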
Methodology: Using structured formats (XML, Markdown, JSON) to explicitly specify roles and hierarchy of prompt elements.
How to use the tool:
- Compare three variants:
  - Plain text: `You are an assistant. Analyze the text and extract key ideas. Text: [content]`
  - Markdown: `## Role You are a text analyst ## Task Extract key ideas ## Input [content]`
  - XML: `<role>Text analyst</role> <task>Extract key ideas</task> <input>[content]</input>`
- Analyze metrics (the confidence-distribution check is sketched in code at the end of this section):
  - Min confidence — structured format should increase minimum confidence
  - Confidence distribution — look at the percentage of High confidence (≥90%) tokens
  - Step-by-step tokens — check if the model follows the markup structure
Expected results:
- XML/Markdown markup reduces Low confidence (<70%) tokens by 10-20%
- Model better separates logical blocks in responses
- Increased consistency in output
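To quantify the confidence-distribution comparison above, the per-token probabilities can be bucketed with a few lines of code. A sketch using the same High/Low thresholds as this guide; the function name and record shape are assumptions, not part of the tool:

```python
import math


def confidence_buckets(content: list[dict]) -> dict[str, float]:
    """Share of tokens in the High (>=90%), Medium (70-90%) and Low (<70%) bands.

    `content` is the per-token logprob list from the llama.cpp response
    (see the request sketch in the quick start).
    """
    if not content:
        return {"high": 0.0, "medium": 0.0, "low": 0.0}
    probs = [math.exp(rec["logprob"]) for rec in content]
    n = len(probs)
    return {
        "high": sum(p >= 0.90 for p in probs) / n,
        "medium": sum(0.70 <= p < 0.90 for p in probs) / n,
        "low": sum(p < 0.70 for p in probs) / n,
    }


# Compare the "low" share between the plain-text and XML/Markdown prompts:
# the structured variants should shrink it.
```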
Methodology: Forcing the model to reason step-by-step through explicit instructions or examples.
How to use the tool:
- Compare prompts:
  - Direct query: `Solve the problem: Mary had 15 apples, she gave 40% to Peter. How many are left?`
  - CoT prompt: `Solve the problem step by step: 1. Determine the initial quantity 2. Calculate the percentage 3. Find the result. Problem: Mary had 15 apples, she gave 40% to Peter. How many are left?`
- Track tokens (see the token-scan sketch at the end of this section):
  - Look for reasoning patterns — tokens like "first", "then", "therefore"
  - Analyze confidence on numerical tokens — CoT should increase confidence in calculations
  - Check sequencing — the model should generate tokens in logical order
Expected results:
- Appearance of intermediate reasoning with high confidence (>95%)
- Higher probability of correct numerical tokens
- Reduction in "impulsive" low-confidence answers
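The token-tracking step can be automated with a simple scan over the generated tokens. A rough sketch; the marker list, the 95% threshold and the function name are all illustrative assumptions:

```python
import math

# Illustrative marker list; extend it for your language and domain.
REASONING_MARKERS = ("first", "then", "therefore", "step", "next", "so")


def inspect_cot(content: list[dict]) -> None:
    """Print reasoning-marker tokens and the confidence of numeric tokens.

    `content` is the per-token logprob list from the llama.cpp response
    (see the request sketch in the quick start).
    """
    for i, rec in enumerate(content):
        token = rec["token"]
        prob = math.exp(rec["logprob"])
        text = token.strip().lower()
        if any(marker in text for marker in REASONING_MARKERS):
            print(f"[{i:3d}] reasoning marker {token!r:>12}  p={prob:.1%}")
        elif any(ch.isdigit() for ch in token):
            note = "" if prob >= 0.95 else "  <- uncertain calculation step"
            print(f"[{i:3d}] numeric token    {token!r:>12}  p={prob:.1%}{note}")
```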
Methodology: Providing input/output examples for in-context learning.
How to use the tool:
- Create prompts with different numbers of examples (0-shot, 1-shot, 3-shot)
- Analyze:
  - Token probability convergence — each new example should reinforce the pattern
  - Sampled tokens ratio — with good few-shot examples, sampled (non-top-1) tokens should decrease
  - Consistency across runs — run multiple times and compare variance
Expected results:
- 3-shot prompts show 10-20% higher average confidence
- Reduced variability in token selection
- Stricter adherence to example format
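The sampled-tokens ratio mentioned above is easy to compute from the same per-token records. A sketch; the field names follow the OpenAI-style shape assumed earlier and the function name is illustrative:

```python
def sampled_token_ratio(content: list[dict]) -> float:
    """Fraction of positions where the chosen token is not the Top-1 candidate.

    Requires the request to ask for top_logprobs, so each record carries a
    `top_logprobs` list (OpenAI-style shape; adjust names to your server).
    """
    if not content:
        return 0.0
    non_top1 = 0
    for rec in content:
        best = max(rec["top_logprobs"], key=lambda c: c["logprob"])
        if best["token"] != rec["token"]:
            non_top1 += 1
    return non_top1 / len(content)


# Run it on 0-shot, 1-shot and 3-shot variants of the same task: a falling
# ratio means the examples are pushing generation toward Top-1 choices.
```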
Methodology: Understanding how context length affects model attention to different prompt parts.
How to use the tool:
- Create a long prompt with key information at different positions (beginning/middle/end)
- Ask a question requiring information from a specific position
- Track:
  - Confidence on answer tokens — information from the beginning/end has higher confidence
  - Structural token appearance — the model may "lose" structure in the middle of long context
Lost in the middle problem:
- If key information is in the middle of long context and you see low confidence on answer tokens — this is classic "lost in the middle"
- Solution: Duplicate important information at the beginning and end of the prompt
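A quick way to reproduce this experiment is to build the same prompt with the key fact planted at different positions. A sketch with entirely made-up probe content (the fact, filler and question are placeholders):

```python
# Hypothetical probe: the fact, the filler and the question are placeholders;
# only the position of the fact changes between the three variants.
FACT = "The project codename is BLUEBIRD."
FILLER = "Background sentence that pads the context without useful facts. " * 60
QUESTION = "\n\nQuestion: what is the project codename? Answer with one word."


def build_probe(position: str) -> str:
    """Return a long prompt with the key fact at the beginning, middle or end."""
    if position == "beginning":
        return FACT + " " + FILLER + FILLER + QUESTION
    if position == "middle":
        return FILLER + FACT + " " + FILLER + QUESTION
    return FILLER + FILLER + FACT + QUESTION


# Send each variant to the server and compare confidence on the answer tokens:
# noticeably lower confidence for the "middle" variant is the classic
# lost-in-the-middle signature.
```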
Problem: Agent selects wrong functions or passes incorrect parameters.
Solution through introspection:
- Review the tokens at the point where the model generates the function name
- Check Top-K candidates — if the correct function is in Top-3 but not Top-1, adjust prompt to strengthen its priority
- Analyze confidence on parameter tokens — low confidence indicates ambiguity in function description
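This inspection can be scripted as well: scan the generated tokens for the places where the function name could begin and print the Top-K candidates there. A rough sketch; the helper, its prefix-matching heuristic and the `get_weather` name in the usage comment are all hypothetical:

```python
import math


def inspect_function_choice(content: list[dict], expected: str, k: int = 3) -> None:
    """Show Top-K candidates at positions where the expected function name could start.

    `content` is the per-token logprob list (with top_logprobs requested) and
    `expected` is the function you wanted the agent to call. Field names
    follow the assumed OpenAI-style shape.
    """
    for i, rec in enumerate(content):
        candidates = sorted(rec["top_logprobs"], key=lambda c: c["logprob"], reverse=True)[:k]
        names = [c["token"].strip() for c in candidates]
        # Function names usually span several subword tokens, so match prefixes.
        if any(name and expected.startswith(name) for name in names):
            ranked = ", ".join(f"{c['token']!r}={math.exp(c['logprob']):.0%}" for c in candidates)
            print(f"[{i:3d}] chose {rec['token']!r}; Top-{k}: {ranked}")


# Example: inspect_function_choice(content, expected="get_weather")
```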
Problem: Need to choose the best prompt from several variants.
Comparison metrics:
- Average confidence — higher is better for accuracy-requiring tasks
- Min confidence — critical for safety (e.g., medical/legal advice)
- Sampled tokens ratio — lower = more deterministic behavior
- Confidence distribution — more High (≥90%) tokens = better
Recommendation: Run each prompt 10+ times on different test queries and build metric statistics.
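One way to follow that recommendation in a script is to run each candidate prompt several times and aggregate the metrics. A sketch; the helper names are assumptions, and `fetch` can be any function you use to call the server (for example the `get_token_logprobs` sketch from the quick start):

```python
import math
import statistics
from typing import Callable


def run_stats(fetch: Callable[[str], list[dict]], prompt: str, runs: int = 10) -> dict[str, float]:
    """Aggregate confidence metrics over repeated runs of one prompt.

    `fetch` returns the per-token logprob list for a prompt.
    """
    avg_conf: list[float] = []
    min_conf: list[float] = []
    for _ in range(runs):
        probs = [math.exp(rec["logprob"]) for rec in fetch(prompt)]
        if probs:
            avg_conf.append(sum(probs) / len(probs))
            min_conf.append(min(probs))
    return {
        "avg_confidence_mean": statistics.mean(avg_conf),
        "avg_confidence_stdev": statistics.stdev(avg_conf) if len(avg_conf) > 1 else 0.0,
        "min_confidence_mean": statistics.mean(min_conf),
    }
```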
Key metrics:
- Average confidence — average model confidence across all generated tokens. High values (>90%) indicate good prompt-task alignment.
- Average log p — average log probability. Closer to 0 = higher confidence (typically -3 to 0).
- Min confidence — minimum confidence. Critical indicator for identifying "weak spots" in generation.
- Top-1 candidate chosen — how often the model chose the most probable token. High percentage (>70%) = deterministic behavior.
- Filtered candidates — number of tokens filtered through top-p/min-p. Shows how aggressive the sampling is.
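Confidence and log p are two views of the same quantity, since p = exp(log p). A quick check in plain Python:

```python
import math

# log p -> probability: the "typically -3 to 0" range maps to roughly 5%-100%
for logp in (0.0, -0.105, -0.693, -1.6, -3.0):
    print(f"log p = {logp:6.3f}  ->  p = {math.exp(logp):5.1%}")
# 0.000 -> 100.0%, -0.105 -> 90.0%, -0.693 -> 50.0%, -1.600 -> 20.2%, -3.000 -> 5.0%
```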
Interpreting token probabilities:
- High probability (>90%) — the model is very confident, the token is expected
- Medium probability (50-80%) — several competing candidates
- Low probability (<50%) — high uncertainty, unexpected choices are possible
Red flags:
- If several critical tokens in a row have <20% probability — prompt needs rewriting
- If Top-1 and Top-2 have close probabilities (e.g., 35% vs 32%) — model at bifurcation point, small prompt change can dramatically alter result
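The second red flag can be detected automatically by looking at the gap between the two best candidates. A sketch; the helper name and the 5% gap threshold are arbitrary choices, not part of the tool:

```python
import math


def bifurcation_points(content: list[dict], gap: float = 0.05) -> list[int]:
    """Indices where the Top-1 and Top-2 candidate probabilities are within `gap`.

    These are the positions where a small prompt change is most likely to
    flip the output (the "35% vs 32%" situation described above). Field
    names follow the assumed OpenAI-style logprobs shape.
    """
    hits = []
    for i, rec in enumerate(content):
        top = sorted(rec["top_logprobs"], key=lambda c: c["logprob"], reverse=True)
        if len(top) >= 2:
            p1 = math.exp(top[0]["logprob"])
            p2 = math.exp(top[1]["logprob"])
            if p1 - p2 <= gap:
                hits.append(i)
    return hits
```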
Limitations:
- Works only with llama.cpp — adaptation required for other inference engines (vLLM, TGI, Transformers)
- Requires enabled logprobs — ensure llama.cpp server is running with log probability return support
- Performance impact — requesting logprobs slows generation by 10-30%, use only for debugging
- Sampling parameters — tool shows results after applying top-p/top-k/min-p, not "raw" probabilities of all vocabulary tokens
Pull requests welcome with:
- New prompt analysis methodologies
- Usage examples for specific domains (medicine, law, code generation)
- Metric visualization improvements
- Integration with other inference backends
Developed to assist in debugging LLM agents and improving prompt engineering quality.