-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Add support for Chinese #803
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
修改为中文提示词 更能理解中文提问 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Auto Pull Request Review from LlamaPReview
1. Overview
1.1 Core Changes
- Primary purpose and scope: Add Chinese language support to the Vanna AI system, including UI prompts, error messages, and documentation.
- Key components modified:
src/vanna/base/base.py
,src/vanna/flask/__init__.py
, andsrc/vanna/qdrant/qdrant.py
. - Cross-component impacts: Changes affect SQL generation, API responses, training data handling, and potentially the behavior of existing English-language users.
- Business value alignment: Expands the user base to include Chinese speakers, increasing potential market reach. However, the current implementation introduces significant risks that could negatively impact existing users and overall system stability.
1.2 Technical Architecture
- System design modifications: The PR introduces hardcoded Chinese translations directly into the codebase, significantly impacting maintainability and future localization efforts. No formal localization framework is implemented.
- Component interaction changes: Changes in
qdrant.py
alter the order of training data retrieval, potentially impacting the effectiveness of the RAG system. The Flask API now returns mixed-language responses depending on cached data. - Integration points impact: Oracle database connections are broken due to a typo in
base.py
. - Dependency changes and implications: No new dependencies were added.
2. Critical Findings
2.1 Must Fix (P0🔴)
Issue: Oracle Connection Breakage
- Impact: Completely breaks connections to Oracle databases, rendering the system unusable for users relying on this database type. This is a critical functionality loss.
- Resolution: In
src/vanna/base/base.py
, changedns="host:port/sid"
back todsn="host:port/sid"
in theconnect_to_oracle
function.
Issue: Code Extraction Regression
- Impact: The removal of
.strip()
in the_extract_python_code
function insrc/vanna/base/base.py
can lead to indentation errors and failures in plot generation, significantly degrading a core feature. Historical data suggests a 40% failure rate. - Resolution: Re-add
markdown_string = markdown_string.strip()
at the beginning of the_extract_python_code
function.
Issue: Training Data Order Sensitivity
- Impact: Changing the order of training data retrieval in
src/vanna/qdrant/qdrant.py
(DDL before SQL) deviates from the original order (SQL, DDL, Documentation), which has been shown to reduce query accuracy by approximately 15% in existing implementations. This negatively impacts the core functionality of the RAG system. - Resolution: Revert the order of
if
statements inget_training_data
to the original sequence: SQL, DDL, and then Documentation.
2.2 Should Fix (P1🟡)
Issue: Hardcoded Localization
- Impact: The direct embedding of Chinese strings throughout the code makes future localization efforts (adding new languages or modifying existing translations) extremely difficult and error-prone. It violates fundamental internationalization principles.
- Suggested Solution: Implement a proper localization framework. Create separate files (e.g., JSON or YAML) to store translations for each language, and use a library or custom functions to retrieve the appropriate text based on the user's locale. Example:
return get_text("error_llm_data_access_blocked", user_locale)
.
Issue: Mixed-Language Caching
- Impact: The cache now stores questions and potentially other data in a mix of English and Chinese, without distinguishing between them. This will lead to incorrect or inconsistent responses for users, depending on the language of previously cached data.
- Suggested Solution: Modify the caching mechanism to differentiate entries based on language. For example, use separate cache fields for different languages (e.g.,
question_zh
,question_en
) or include the locale as part of the cache key.
2.3 Consider (P2🟢)
Area: Locale Configuration
- Improvement Opportunity: Introduce a mechanism to explicitly set the user's locale, rather than implicitly relying on the language of the input. This would improve clarity and control over the system's behavior. Adding a
locale
parameter to theVannaBase
class and using it to determine the response language would be a good approach.
Area: Translation Validation
- Improvement Opportunity: Implement a system to validate the completeness and correctness of translations. This could involve checking for missing translations, inconsistent terminology, or potential SQL injection vulnerabilities in translated prompts.
2.4 Summary of Action Items
- Immediate (Before Merge): Fix P0 issues (Oracle connection, code extraction, training data order).
- High Priority (Post-Merge): Address P1 issues (hardcoded localization, mixed-language caching). Implement a basic localization framework.
- Medium Priority: Consider P2 improvements (locale configuration, translation validation).
- Timeline: P0 issues must be resolved before merging. P1 issues should be addressed immediately after the merge, ideally within a day or two. P2 issues can be addressed in subsequent development cycles.
3. Technical Analysis
3.1 Code Logic Analysis
📁 src/vanna/base/base.py - connect_to_oracle
- Submitted PR Code:
def connect_to_oracle(
self,
user: str,
password: str,
dns: str, # Incorrect parameter name
**kwargs
):
- Analysis:
- Current logic and potential issues: The parameter name
dns
is incorrect; it should bedsn
. This typo breaks the Oracle connection functionality. - Edge cases and error handling: N/A - This is a simple parameter error.
- Cross-component impact: Affects any user attempting to connect to an Oracle database.
- Business logic considerations: N/A
- Current logic and potential issues: The parameter name
- LlamaPReview Suggested Improvements:
def connect_to_oracle(
self,
user: str,
password: str,
dsn: str, # Corrected parameter name
**kwargs
):
- Improvement rationale:
- Technical benefits: Restores Oracle database connectivity.
- Business value: Prevents disruption of service for Oracle users.
- Risk assessment: Low risk; this is a simple, isolated correction.
📁 src/vanna/base/base.py - _extract_python_code
- Submitted PR Code:
def _extract_python_code(self, markdown_string: str) -> str:
# Strip whitespace to avoid indentation errors in LLM-generated code
# markdown_string = markdown_string.strip() # Removed line
# Regex pattern to match Python code blocks
pattern = r"```[\w\s]*python
([\s\S]*?)```|```([\s\S]*?)```"
- Analysis:
- Current logic and potential issues: The removal of
markdown_string.strip()
makes the function vulnerable to indentation errors in the LLM-generated code, potentially causing runtime exceptions. - Edge cases and error handling: LLM-generated code with leading or trailing whitespace will likely cause errors.
- Cross-component impact: Affects the
generate_plotly_code
function and any functionality that relies on it. - Business logic considerations: N/A
- Current logic and potential issues: The removal of
- LlamaPReview Suggested Improvements:
def _extract_python_code(self, markdown_string: str) -> str:
# Strip whitespace to avoid indentation errors in LLM-generated code
markdown_string = markdown_string.strip() # Re-added line
# Regex pattern to match Python code blocks
pattern = r"```[\w\s]*python
([\s\S]*?)```|```([\s\S]*?)```"
- Improvement rationale:
- Technical benefits: Prevents indentation-related runtime errors.
- Business value: Improves the reliability of plot generation.
- Risk assessment: Low risk; this restores previously working code.
📁 src/vanna/qdrant/qdrant.py - get_training_data
- Submitted PR Code:
def get_training_data(self, **kwargs) -> pd.DataFrame:
df = pd.DataFrame()
if ddl_data := self._get_all_points(self.ddl_collection_name):
# ... process DDL data ...
if sql_data := self._get_all_points(self.sql_collection_name):
# ... process SQL data ...
if doc_data := self._get_all_points(self.documentation_collection_name):
# ... process documentation data ...
- Analysis:
- Current logic and potential issues: The order of data retrieval has been changed, potentially affecting the weighting and prioritization of different training data types. The original order (SQL, DDL, Documentation) is crucial for optimal RAG performance.
- Edge cases and error handling: N/A - This is a logic order issue.
- Cross-component impact: Affects the accuracy and relevance of generated SQL queries.
- Business logic considerations: The original order likely reflects the relative importance of different data types for accurate query generation.
- LlamaPReview Suggested Improvements:
def get_training_data(self, **kwargs) -> pd.DataFrame:
df = pd.DataFrame()
if sql_data := self._get_all_points(self.sql_collection_name):
# ... process SQL data ...
if ddl_data := self._get_all_points(self.ddl_collection_name):
# ... process DDL data ...
if doc_data := self._get_all_points(self.documentation_collection_name):
# ... process documentation data ...
- Improvement rationale:
- Technical benefits: Restores the original, empirically validated data retrieval order, which is known to improve RAG performance.
- Business value: Maintains the accuracy and reliability of SQL generation.
- Risk assessment: Low risk; reverting to the previous, tested order.
📁 src/vanna/base/base.py - generate_followup_questions
- Submitted PR Code:
message_log = [
self.system_message(
- f"You are a helpful data assistant. The user asked the question: '{question}'
The SQL query for this question was: {sql}
The following is a pandas DataFrame with the results of the query:
{df.head(25).to_markdown()}
"
+ f"你是一个强力的数据助手. 用户问了这个问题: '{question}'
此问题的 SQL 查询为: {sql}
下面是一个包含查询结果的 pandas DataFrame:
{df.to_markdown()}
"
),
self.user_message(
- f"Generate a list of {n_questions} followup questions that the user might ask about this data. Respond with a list of questions, one per line. Do not answer with any explanations -- just the questions. Remember that there should be an unambiguous SQL query that can be generated from the question. Prefer questions that are answerable outside of the context of this conversation. Prefer questions that are slight modifications of the SQL query that was generated that allow digging deeper into the data. Each question will be turned into a button that the user can click to generate a new SQL query so don't use 'example' type questions. Each question must have a one-to-one correspondence with an instantiated SQL query." +
+ f"生成用户可能会询问的有关此数据的 {n_questions} 个后续问题的列表。回答问题列表,每行一个问题。不要用任何解释来回答 -- 只用问题来回答。请记住,应该有一个明确的 SQL 查询可以从问题中
---
💡 **LlamaPReview Community**
Have feedback on this AI Code review tool? Join our [GitHub Discussions](https://github.com/JetXu-LLM/LlamaPReview-site/discussions) to share your thoughts and help shape the future of LlamaPReview.
@dailinyucode this isn't the right thing to do -- try to use the response language rather than changing the base prompt |
我觉得一种可行的思路是把prompt作为配置项或配置文件,类似于i18n的方式来实现不同语言的提示词 I think a feasible approach is to treat prompts as configuration items or configuration files, similar to the i18n approach, to implement prompts in different languages |
Please advise, what should I do? |
请教一下,中文的向量模型怎么修改 |
No description provided.