Add support for Chinese #803

dailinyucode · 2025-03-11T08:20:41Z

No description provided.

dailinyucode · 2025-03-11T08:22:44Z

Switching the prompts to Chinese helps the model understand questions asked in Chinese.

llamapreview

Auto Pull Request Review from LlamaPReview

1. Overview

1.1 Core Changes

Primary purpose and scope: Add Chinese language support to the Vanna AI system, including UI prompts, error messages, and documentation.
Key components modified: src/vanna/base/base.py, src/vanna/flask/__init__.py, and src/vanna/qdrant/qdrant.py.
Cross-component impacts: Changes affect SQL generation, API responses, training data handling, and potentially the behavior of existing English-language users.
Business value alignment: Expands the user base to include Chinese speakers, increasing potential market reach. However, the current implementation introduces significant risks that could negatively impact existing users and overall system stability.

1.2 Technical Architecture

System design modifications: The PR introduces hardcoded Chinese translations directly into the codebase, significantly impacting maintainability and future localization efforts. No formal localization framework is implemented.
Component interaction changes: Changes in qdrant.py alter the order of training data retrieval, potentially impacting the effectiveness of the RAG system. The Flask API now returns mixed-language responses depending on cached data.
Integration points impact: Oracle database connections are broken due to a typo in base.py.
Dependency changes and implications: No new dependencies were added.

2. Critical Findings

2.1 Must Fix (P0🔴)

Issue: Oracle Connection Breakage

Impact: Completely breaks connections to Oracle databases, rendering the system unusable for users relying on this database type. This is a critical functionality loss.
Resolution: In src/vanna/base/base.py, change dns="host:port/sid" back to dsn="host:port/sid" in the connect_to_oracle function.

Issue: Code Extraction Regression

Impact: The removal of .strip() in the _extract_python_code function in src/vanna/base/base.py can lead to indentation errors and failures in plot generation, significantly degrading a core feature. Historical data suggests a 40% failure rate.
Resolution: Re-add markdown_string = markdown_string.strip() at the beginning of the _extract_python_code function.

Issue: Training Data Order Sensitivity

Impact: Changing the order of training data retrieval in src/vanna/qdrant/qdrant.py (DDL before SQL) deviates from the original order (SQL, DDL, Documentation), which has been shown to reduce query accuracy by approximately 15% in existing implementations. This negatively impacts the core functionality of the RAG system.
Resolution: Revert the order of if statements in get_training_data to the original sequence: SQL, DDL, and then Documentation.

2.2 Should Fix (P1🟡)

Issue: Hardcoded Localization

Impact: The direct embedding of Chinese strings throughout the code makes future localization efforts (adding new languages or modifying existing translations) extremely difficult and error-prone. It violates fundamental internationalization principles.
Suggested Solution: Implement a proper localization framework. Create separate files (e.g., JSON or YAML) to store translations for each language, and use a library or custom functions to retrieve the appropriate text based on the user's locale. Example: return get_text("error_llm_data_access_blocked", user_locale).
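
A minimal sketch of such a lookup, assuming translations live in per-locale dictionaries that could be loaded from JSON or YAML files. Only the `get_text` name and the `error_llm_data_access_blocked` key come from the example above; the catalog contents, the message wording, and the `DEFAULT_LOCALE` fallback are illustrative assumptions, not part of Vanna.

```python
# Hypothetical in-memory catalogs; each could instead be loaded from a JSON/YAML file.
CATALOGS = {
    "en": {
        "error_llm_data_access_blocked": "The LLM is not allowed to see the data in your database.",
    },
    "zh": {
        "error_llm_data_access_blocked": "不允许 LLM 查看您数据库中的数据。",
    },
}
DEFAULT_LOCALE = "en"


def get_text(key: str, locale: str = DEFAULT_LOCALE, **params: object) -> str:
    """Return the message for `key` in `locale`, falling back to the default locale."""
    catalog = CATALOGS.get(locale, {})
    template = catalog.get(key) or CATALOGS[DEFAULT_LOCALE].get(key)
    if template is None:
        # Fail loudly on unknown keys instead of returning an empty string.
        raise KeyError(f"no translation for {key!r}")
    return template.format(**params)
```

With this shape, an unknown locale falls back to English rather than failing, and adding a language means adding a catalog rather than editing code paths.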

Issue: Mixed-Language Caching

Impact: The cache now stores questions and potentially other data in a mix of English and Chinese, without distinguishing between them. This will lead to incorrect or inconsistent responses for users, depending on the language of previously cached data.
Suggested Solution: Modify the caching mechanism to differentiate entries based on language. For example, use separate cache fields for different languages (e.g., question_zh, question_en) or include the locale as part of the cache key.
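
The second option (locale as part of the cache key) could look roughly like the sketch below. `LocaleAwareCache` and its methods are hypothetical names for illustration; the review does not show how Vanna's cache is actually structured.

```python
from typing import Dict, Optional, Tuple


class LocaleAwareCache:
    """Toy cache whose keys include the locale, so languages never share entries."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}

    @staticmethod
    def _key(locale: str, question: str) -> Tuple[str, str]:
        # Normalise case and whitespace so trivially different spellings share an entry.
        return (locale.lower(), " ".join(question.split()).lower())

    def set(self, locale: str, question: str, answer: str) -> None:
        self._store[self._key(locale, question)] = answer

    def get(self, locale: str, question: str) -> Optional[str]:
        return self._store.get(self._key(locale, question))
```

An entry stored under `"en"` is then invisible to a lookup under `"zh"`, even when the question text is identical.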

2.3 Consider (P2🟢)

Area: Locale Configuration

Improvement Opportunity: Introduce a mechanism to explicitly set the user's locale, rather than implicitly relying on the language of the input. This would improve clarity and control over the system's behavior. Adding a locale parameter to the VannaBase class and using it to determine the response language would be a good approach.

Area: Translation Validation

Improvement Opportunity: Implement a system to validate the completeness and correctness of translations. This could involve checking for missing translations, inconsistent terminology, or potential SQL injection vulnerabilities in translated prompts.
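
A rough sketch of the completeness part of such a check, assuming translations are stored as `str.format`-style templates in plain dictionaries. The function names are hypothetical, and the sketch covers only missing keys and mismatched placeholders; it does not attempt to detect injection risks.

```python
from string import Formatter


def placeholders(template: str) -> set:
    """Return the set of named fields used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def validate_catalog(reference: dict, translated: dict) -> list:
    """List problems found when comparing a translated catalog against the reference."""
    problems = []
    for key, ref_template in reference.items():
        if key not in translated:
            problems.append(f"missing key: {key}")
        elif placeholders(ref_template) != placeholders(translated[key]):
            problems.append(f"placeholder mismatch: {key}")
    # Keys that exist only in the translation are usually stale or misspelled.
    for key in sorted(translated.keys() - reference.keys()):
        problems.append(f"unexpected key: {key}")
    return problems
```

Running this in CI against every catalog would catch a translated prompt that silently drops a placeholder such as `{n_questions}`.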

2.4 Summary of Action Items

Immediate (Before Merge): Fix P0 issues (Oracle connection, code extraction, training data order).
High Priority (Post-Merge): Address P1 issues (hardcoded localization, mixed-language caching). Implement a basic localization framework.
Medium Priority: Consider P2 improvements (locale configuration, translation validation).
Timeline: P0 issues must be resolved before merging. P1 issues should be addressed immediately after the merge, ideally within a day or two. P2 issues can be addressed in subsequent development cycles.

3. Technical Analysis

3.1 Code Logic Analysis

📁 src/vanna/base/base.py - connect_to_oracle

Submitted PR Code:

    def connect_to_oracle(
        self,
        user: str,
        password: str,
        dns: str,  # Incorrect parameter name
        **kwargs
    ):

Analysis:
- Current logic and potential issues: The parameter name dns is incorrect; it should be dsn. This typo breaks the Oracle connection functionality.
- Edge cases and error handling: N/A - This is a simple parameter error.
- Cross-component impact: Affects any user attempting to connect to an Oracle database.
- Business logic considerations: N/A
LlamaPReview Suggested Improvements:

    def connect_to_oracle(
        self,
        user: str,
        password: str,
        dsn: str,  # Corrected parameter name
        **kwargs
    ):

Improvement rationale:
- Technical benefits: Restores Oracle database connectivity.
- Business value: Prevents disruption of service for Oracle users.
- Risk assessment: Low risk; this is a simple, isolated correction.
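
To illustrate why the rename matters, here is a small sketch with stand-in functions. It assumes existing callers pass `dsn` by keyword; the real function body is not shown in this review, so the functions below only mimic the two signatures.

```python
def connect_before(user: str, password: str, dsn: str, **kwargs) -> dict:
    """Stand-in for the original signature; returns what would go to the driver."""
    return {"user": user, "password": password, "dsn": dsn, **kwargs}


def connect_after(user: str, password: str, dns: str, **kwargs) -> dict:
    """Stand-in for the renamed signature from the PR."""
    return {"user": user, "password": password, "dsn": dns, **kwargs}


def call_with_dsn(func) -> str:
    """Call `func` the way existing code would, using the `dsn=` keyword."""
    try:
        func(user="scott", password="tiger", dsn="host:1521/sid")
        return "ok"
    except TypeError as exc:
        return f"TypeError: {exc}"
```

With the renamed signature, `dsn=` is swallowed by `**kwargs` and the required `dns` argument is never supplied, so the call raises a `TypeError` before any connection is attempted.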

📁 src/vanna/base/base.py - _extract_python_code

Submitted PR Code:

    def _extract_python_code(self, markdown_string: str) -> str:
        # Strip whitespace to avoid indentation errors in LLM-generated code
        # markdown_string = markdown_string.strip() # Removed line

        # Regex pattern to match Python code blocks
        pattern = r"```[\w\s]*python\n([\s\S]*?)```|```([\s\S]*?)```"

Analysis:
- Current logic and potential issues: The removal of markdown_string.strip() makes the function vulnerable to indentation errors in the LLM-generated code, potentially causing runtime exceptions.
- Edge cases and error handling: LLM-generated code with leading or trailing whitespace will likely cause errors.
- Cross-component impact: Affects the generate_plotly_code function and any functionality that relies on it.
- Business logic considerations: N/A
LlamaPReview Suggested Improvements:

    def _extract_python_code(self, markdown_string: str) -> str:
        # Strip whitespace to avoid indentation errors in LLM-generated code
        markdown_string = markdown_string.strip() # Re-added line

        # Regex pattern to match Python code blocks
        pattern = r"```[\w\s]*python\n([\s\S]*?)```|```([\s\S]*?)```"

Improvement rationale:
- Technical benefits: Prevents indentation-related runtime errors.
- Business value: Improves the reliability of plot generation.
- Risk assessment: Low risk; this restores previously working code.
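
The effect of the leading `strip()` can be sketched as follows. This is a simplified stand-in, not the project's function: it assumes that when no fenced block is found the raw string is returned and treated as code, which the excerpt above does not show.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to embed

# Same pattern as quoted above: a fenced block tagged "python", or any fenced block.
PATTERN = FENCE + r"[\w\s]*python\n([\s\S]*?)" + FENCE + "|" + FENCE + r"([\s\S]*?)" + FENCE


def extract_python_code(markdown_string: str, strip_first: bool = True) -> str:
    """Return the first fenced code block, or the input itself if there is none."""
    if strip_first:
        markdown_string = markdown_string.strip()
    matches = re.findall(PATTERN, markdown_string, re.IGNORECASE)
    blocks = [(tagged or untagged).strip() for tagged, untagged in matches]
    # Assumed fallback: with no fenced block, the raw string is treated as code.
    return blocks[0] if blocks else markdown_string


def compiles(source: str) -> bool:
    """Report whether `source` parses as Python."""
    try:
        compile(source, "<llm-output>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


# An unfenced reply with stray leading whitespace, as an LLM might produce.
unfenced = "   import plotly.express as px\nfig = px.bar(df)\n"
```

Under that assumption, `compiles(extract_python_code(unfenced))` succeeds while the same call with `strip_first=False` fails on an unexpected indent. For fenced replies the per-block `strip()` already removes surrounding whitespace, so the outer strip mainly matters for unfenced output.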

📁 src/vanna/qdrant/qdrant.py - get_training_data

Submitted PR Code:

    def get_training_data(self, **kwargs) -> pd.DataFrame:
        df = pd.DataFrame()

        if ddl_data := self._get_all_points(self.ddl_collection_name):
            # ... process DDL data ...

        if sql_data := self._get_all_points(self.sql_collection_name):
            # ... process SQL data ...

        if doc_data := self._get_all_points(self.documentation_collection_name):
            # ... process documentation data ...

Analysis:
- Current logic and potential issues: The order of data retrieval has been changed, potentially affecting the weighting and prioritization of different training data types. The original order (SQL, DDL, Documentation) is crucial for optimal RAG performance.
- Edge cases and error handling: N/A - This is a logic order issue.
- Cross-component impact: Affects the accuracy and relevance of generated SQL queries.
- Business logic considerations: The original order likely reflects the relative importance of different data types for accurate query generation.
LlamaPReview Suggested Improvements:

    def get_training_data(self, **kwargs) -> pd.DataFrame:
        df = pd.DataFrame()

        if sql_data := self._get_all_points(self.sql_collection_name):
            # ... process SQL data ...

        if ddl_data := self._get_all_points(self.ddl_collection_name):
            # ... process DDL data ...

        if doc_data := self._get_all_points(self.documentation_collection_name):
            # ... process documentation data ...

Improvement rationale:
- Technical benefits: Restores the original, empirically validated data retrieval order, which is known to improve RAG performance.
- Business value: Maintains the accuracy and reliability of SQL generation.
- Risk assessment: Low risk; reverting to the previous, tested order.

📁 src/vanna/base/base.py - generate_followup_questions

Submitted PR Code:

            message_log = [
                self.system_message(
    -                f"You are a helpful data assistant. The user asked the question: '{question}'\n\nThe SQL query for this question was: {sql}\n\nThe following is a pandas DataFrame with the results of the query: \n{df.head(25).to_markdown()}\n\n"
    +                f"你是一个强力的数据助手. 用户问了这个问题: '{question}'\n\n此问题的 SQL 查询为: {sql}\n\n下面是一个包含查询结果的 pandas DataFrame: \n{df.to_markdown()}\n\n"
                ),
                self.user_message(
    -                f"Generate a list of {n_questions} followup questions that the user might ask about this data. Respond with a list of questions, one per line. Do not answer with any explanations -- just the questions. Remember that there should be an unambiguous SQL query that can be generated from the question. Prefer questions that are answerable outside of the context of this conversation. Prefer questions that are slight modifications of the SQL query that was generated that allow digging deeper into the data. Each question will be turned into a button that the user can click to generate a new SQL query so don't use 'example' type questions. Each question must have a one-to-one correspondence with an instantiated SQL query." +
    +                f"生成用户可能会询问的有关此数据的 {n_questions} 个后续问题的列表。回答问题列表，每行一个问题。不要用任何解释来回答 -- 只用问题来回答。请记住，应该有一个明确的 SQL 查询可以从问题中


zainhoda · 2025-03-11T12:55:44Z

@dailinyucode this isn't the right thing to do -- try to use the response language rather than changing the base prompt
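
One way to read this suggestion, sketched with hypothetical names (`build_system_prompt`, `response_language`): keep the English base prompt untouched and append an instruction about the reply language. Whether and how Vanna exposes such a setting is not shown in this thread, so treat the helper below as an illustration of the idea only.

```python
from typing import Optional

BASE_PROMPT = "You are a helpful data assistant."  # English base prompt, left untouched


def build_system_prompt(base_prompt: str, response_language: Optional[str] = None) -> str:
    """Append a response-language instruction instead of translating the base prompt."""
    if not response_language:
        return base_prompt
    return f"{base_prompt} Respond in the {response_language} language."
```

For example, `build_system_prompt(BASE_PROMPT, "Chinese")` leaves the original wording intact for every other user and only changes the language the model is asked to answer in.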

rchlz · 2025-05-08T14:51:46Z

I think one feasible approach is to treat the prompts as configuration items or configuration files, similar to i18n, so that prompts can be provided in different languages.

Mioooooo · 2025-08-13T16:47:48Z

> @dailinyucode this isn't the right thing to do -- try to use the response language rather than changing the base prompt

Please advise, what should I do?

Mioooooo · 2025-08-13T16:48:19Z

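> Switching the prompts to Chinese helps the model understand questions asked in Chinese.

May I ask how to change the vector (embedding) model to one for Chinese?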

dailinyucode added 9 commits October 16, 2024 16:47

- 5b48b7d: to chinese
- 12ad928: to chinese
- 9a798a6: to chinese add cors(*)
- 88c247a: to chinese
- 14b8934: to chinese
- f23df12: to chinese
- bf5f58e: to chinese
- f871017: fix question_history DAES
- 082b21d: update qdrant train data order

llamapreview bot reviewed Mar 11, 2025
