@AmineF0 AmineF0 commented May 12, 2025

Updated VannaBase.extract_sql to strip out <reasoning>, <reason>, <thoughts>, and <think> tags, which models like Alibaba's Qwen3 use to wrap intermediate reasoning, before parsing SQL. Without this, the extractor sometimes misread the reasoning as part of the query.

What Changed:

Tag Removal: We loop through known reasoning tags and use a non-greedy regex (e.g. <think>.*?</think>) to remove each tag together with its contents.

Whitespace Cleanup: Removes extra blank lines to avoid confusing the parser.

Safe Defaults: If no tags are present (like with normal LLMs), nothing changes.
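
The three steps above can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual diff: the helper name `strip_reasoning_tags` and the exact whitespace-collapsing regex are assumptions, while the tag names are the ones listed in the submitted code.

```python
import re

# Tag names as listed in the PR; the helper itself is hypothetical.
REASONING_TAGS = ["reasoning", "reason", "thoughts", "think"]


def strip_reasoning_tags(llm_response: str) -> str:
    """Remove known reasoning blocks, then collapse leftover blank lines."""
    cleaned = llm_response
    for tag_name in REASONING_TAGS:
        # Non-greedy match so each block ends at its own closing tag.
        pattern = rf"<{re.escape(tag_name)}>.*?</{re.escape(tag_name)}>"
        cleaned = re.sub(pattern, "", cleaned, flags=re.DOTALL)
    # Assumed cleanup rule: collapse runs of blank lines into one newline.
    cleaned = re.sub(r"\n\s*\n+", "\n", cleaned)
    return cleaned.strip()


raw = "<think>\nThe user wants totals.\n</think>\n\n\nSELECT SUM(amount) FROM orders;"
print(strip_reasoning_tags(raw))  # SELECT SUM(amount) FROM orders;
```

A response without any of these tags and without blank lines passes through unchanged, which matches the "Safe Defaults" point.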

Why It Matters:
Qwen3 and similar LLMs output thoughts in XML-style tags. We now clean those up so only the final SQL remains, ensuring compatibility without breaking existing behavior.

Edge Cases:
Only known tags are removed; new ones won't be unless they are added to the list.

Assumes well-formed tags; malformed ones may slip through.

Regex is cautious to avoid removing too much.

Minimal performance cost.
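
To make the first two edge cases concrete, here is a small sketch using the same pattern shape as the submitted code. The helper name `strip_known_tags` and the `<scratchpad>` tag are made up for illustration; only the four tag names come from the PR.

```python
import re

REASONING_TAGS = ["reasoning", "reason", "thoughts", "think"]


def strip_known_tags(text: str) -> str:
    """Hypothetical helper mirroring the loop in the PR."""
    for tag in REASONING_TAGS:
        pattern = rf"<{re.escape(tag)}>.*?</{re.escape(tag)}>"
        text = re.sub(pattern, "", text, flags=re.DOTALL)
    return text


# An unknown tag name (invented here) is left in place.
unknown = "<scratchpad>draft</scratchpad>SELECT 1;"
# An opening tag with no closing tag never matches the pattern.
unclosed = "<think>never closed SELECT 1;"
# An opening tag carrying attributes does not match the literal "<think>".
with_attr = '<think type="x">hidden</think>SELECT 1;'

for sample in (unknown, unclosed, with_attr):
    print(strip_known_tags(sample) == sample)  # True each time
```

In all three cases the input comes back untouched, so the reasoning text would still reach the SQL extraction patterns.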


@llamapreview llamapreview bot left a comment

Auto Pull Request Review from LlamaPReview

1. Overview

1.1 Core Changes

  • Primary purpose and scope: Enhance SQL extraction to handle LLM responses with XML-style reasoning tags
  • Key components modified: VannaBase.extract_sql method
  • Cross-component impacts: Affects all SQL parsing logic using cleaned responses
  • Business value alignment: Improves compatibility with modern LLMs while maintaining backward compatibility

1.2 Technical Architecture

  • System design modifications: Added preprocessing layer for response sanitization
  • Integration points impact: Changes input processing for all SQL extraction patterns
  • Dependency changes: Introduces new regex patterns without new dependencies

2. Critical Findings

2.1 Must Fix (P0🔴)

Issue: Incorrect variable usage in WITH clause extraction

  • Analysis Confidence: High
  • Impact: Fails to apply cleaning to WITH-style queries, breaking core functionality
  • Resolution: Change llm_response to cleaned_response in WITH clause regex

Issue: Missing test coverage

  • Analysis Confidence: High
  • Impact: Risk of undetected regressions and edge case failures
  • Resolution: Add comprehensive unit tests for tag removal scenarios

2.2 Should Fix (P1🟡)

Issue: Potential false positives in SQL comments/strings

  • Analysis Confidence: Medium
  • Impact: Could corrupt valid SQL containing tag-like patterns
  • Suggested Solution: Add test cases and document limitation
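
A possible test case for this limitation is sketched below. The query and the helper name are invented for illustration; the pattern is the one from the submitted code. It shows a string-literal pair that happens to look like a reasoning block being removed from otherwise valid SQL.

```python
import re


def strip_think(text: str) -> str:
    """Hypothetical helper applying the PR's pattern for a single tag."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)


# Valid SQL whose LIKE patterns contain tag-like text.
sql = "SELECT * FROM notes WHERE body LIKE '%<think>%' OR body LIKE '%</think>%';"

print(strip_think(sql))
# SELECT * FROM notes WHERE body LIKE '%%';
```

The result still parses, but the second condition is gone and the first now matches every non-NULL row, so the corruption would be silent.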

2.3 Consider (P2🟢)

Area: Configurable tag list

  • Analysis Confidence: Medium
  • Improvement Opportunity: Future-proof against new LLM tag formats
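
One way a configurable tag list might look is sketched here. Nothing in it is part of the PR or of the existing API: the function name, the `tags` parameter, and the `<scratchpad>` tag are all assumptions made for illustration.

```python
import re
from typing import Iterable

# Defaults taken from the submitted code.
DEFAULT_REASONING_TAGS = ("reasoning", "reason", "thoughts", "think")


def strip_tags(text: str, tags: Iterable[str] = DEFAULT_REASONING_TAGS) -> str:
    """Remove <tag>...</tag> blocks for every tag name supplied by the caller."""
    for tag in tags:
        pattern = rf"<{re.escape(tag)}>.*?</{re.escape(tag)}>"
        text = re.sub(pattern, "", text, flags=re.DOTALL)
    return text


response = "<scratchpad>x</scratchpad>SELECT 1;"

# With the defaults, the invented tag is left alone.
print(strip_tags(response))  # <scratchpad>x</scratchpad>SELECT 1;

# A caller can extend the defaults without touching the function.
custom_tags = (*DEFAULT_REASONING_TAGS, "scratchpad")
print(strip_tags(response, custom_tags))  # SELECT 1;
```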

Area: Code documentation

  • Analysis Confidence: High
  • Improvement Opportunity: Better maintainability through inline comments

2.4 Summary of Action Items

  1. Fix WITH clause variable (P0 - immediate)
  2. Add test suite (P0 - before merge)
  3. Document tag-in-SQL limitation (P1 - post-merge)
  4. Consider tag configuration (P2 - future)

3. Technical Analysis

3.1 Code Logic Analysis

📁 src/vanna/base/base.py - VannaBase.extract_sql

  • Submitted PR Code:
    # WITH clause extraction
    sqls = re.findall(r"\bWITH\b .*?;", llm_response, re.DOTALL | re.IGNORECASE)
  • Analysis:
    • Uses original uncleaned response for WITH queries
    • Defeats purpose of preprocessing step
    • Creates inconsistent parsing behavior
  • LlamaPReview Suggested Improvements:
    sqls = re.findall(r"\bWITH\b .*?;", cleaned_response, re.DOTALL | re.IGNORECASE)
  • Improvement rationale:
    • Ensures consistent preprocessing
    • Maintains feature integrity
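
The effect of the wrong variable can be reproduced in isolation. In the sketch below the sample response is invented and the cleaning step is reduced to a single tag; the WITH pattern and flags are the ones quoted from the PR.

```python
import re

WITH_PATTERN = r"\bWITH\b .*?;"

# Invented response: the reasoning text happens to contain the word "with".
raw = (
    "<think>I'll go with a CTE here</think>\n"
    "WITH totals AS (SELECT id FROM orders) SELECT * FROM totals;"
)
cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()

from_raw = re.findall(WITH_PATTERN, raw, re.DOTALL | re.IGNORECASE)
from_cleaned = re.findall(WITH_PATTERN, cleaned, re.DOTALL | re.IGNORECASE)

print("</think>" in from_raw[0])  # True: reasoning text leaked into the match
print(from_cleaned)
# ['WITH totals AS (SELECT id FROM orders) SELECT * FROM totals;']
```

Because the pattern is case-insensitive and spans newlines, the match on the uncleaned text starts inside the reasoning block and runs to the first semicolon, so the extracted "SQL" begins with prose.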

📁 src/vanna/base/base.py - Tag Removal Logic

  • Submitted PR Code:
    reasoning_tags = ["reasoning", "reason", "thoughts", "think"]
    for tag_name in reasoning_tags:
        pattern = rf"<{re.escape(tag_name)}>(.*?)</{re.escape(tag_name)}>"
        cleaned_response = re.sub(pattern, "", cleaned_response, flags=re.DOTALL)
  • Analysis:
    • Effective non-greedy pattern with proper escaping
    • Lacks context awareness for SQL structure
    • No protection against comment/string matches
  • Suggested Improvements:
    # Remove reasoning sections (may affect SQL comments/strings)
    for tag_name in reasoning_tags:
        pattern = rf"<{re.escape(tag_name)}>(.*?)</{re.escape(tag_name)}>"
        cleaned_response = re.sub(pattern, "", cleaned_response, flags=re.DOTALL)

3.2 Key Quality Aspects

  • Testing strategy: Requires validation of multiline tags and mixed content
  • Documentation needs: Public API docs should mention tag-stripping behavior
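
A starting point for the tests mentioned above could look like the following pytest-style sketch. It exercises a stand-in helper, not VannaBase.extract_sql itself; the helper name and the sample responses are assumptions, and the real tests would call the actual method.

```python
import re

REASONING_TAGS = ["reasoning", "reason", "thoughts", "think"]


def strip_reasoning_tags(text: str) -> str:
    """Stand-in for the preprocessing step described in the PR."""
    for tag in REASONING_TAGS:
        pattern = rf"<{re.escape(tag)}>.*?</{re.escape(tag)}>"
        text = re.sub(pattern, "", text, flags=re.DOTALL)
    return text.strip()


def test_multiline_block_is_removed():
    raw = "<think>\nline one\nline two\n</think>\nSELECT 1;"
    assert strip_reasoning_tags(raw) == "SELECT 1;"


def test_mixed_tags_are_all_removed():
    raw = "<reasoning>a</reasoning>SELECT <thoughts>b</thoughts>1;"
    assert strip_reasoning_tags(raw) == "SELECT 1;"


def test_repeated_blocks_do_not_swallow_sql():
    # A greedy pattern would delete the query between the two blocks.
    raw = "<think>a</think>SELECT 1;<think>b</think>"
    assert strip_reasoning_tags(raw) == "SELECT 1;"


def test_untagged_response_is_unchanged():
    assert strip_reasoning_tags("SELECT 1;") == "SELECT 1;"
```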

4. Overall Evaluation

  • Technical assessment: Effective solution with critical implementation gaps
  • Business impact: High value for LLM compatibility once fixed
  • Risk evaluation: Medium risk without test coverage
  • Notable positives: Good regex safety practices with re.escape
  • Implementation quality: Requires critical fixes before production-ready
  • Final recommendation: Request Changes (Address P0 issues first)

💡 LlamaPReview Community
Have feedback on this AI Code review tool? Join our GitHub Discussions to share your thoughts and help shape the future of LlamaPReview.
