Support image modality chat history fine-tuning #142
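…model path and add platform configuration; enhance the data processing logic to support multilingual cut types, and consolidate the cut-type list in qa_generator.py.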
…transformers version in pyproject.toml, adjust the clean_dataset configuration in settings.template.jsonc, add media_dir to the mllm template, optimize data processing logic in qa_generatorV2.py and utils.py, update the length_cdf function to support the media_dir parameter.
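…mage_max_pixels configuration; optimize the data processing logic in qa_generatorV2.py to support the image_max_pixels parameter; update the length_cdf function to support the image_max_pixels parameter.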
BugBot run
Pull Request Overview
This PR extends the chat history fine-tuning pipeline to support image modality end-to-end.
- Introduce new image-related parameters and update dataset selection logic for multimodal training.
- Enhance utilities for image file existence checks and WeChat image extraction.
- Extend data models, CLI dispatch, and configuration/examples to include multimodal definitions.
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| weclone/utils/length_cdf.py | Add media_dir & image_max_pixels parameters; update logs |
| weclone/utils/i18n.py | New MultiLangList class for bilingual label support |
| weclone/utils/config.py | Override dataset to wechat-mllm-sft when image included |
| weclone/data/utils.py | New check_image_file_exists helper for dataset/media/images |
| weclone/data/models.py | Add Message, QaPairV2, QaPairFormat enums and multimodal types |
| weclone/data/clean/strategies.py | Skip LLM cleaning for multimodal data; adjust string quoting |
| weclone/data/chat_parsers/wechat_parser.py | New script to gather encrypted WeChat images for decryption |
| weclone/cli.py | Dispatch to V2 QA generator when images present; refine logging |
| settings.template.jsonc | Bump version and remove deprecated train_pt_args section |
| examples/mllm.template.jsonc | Provide a full multimodal training config template |
| dataset/res_csv/sft/dataset_info.json | Register wechat-mllm-sft entry with ShareGPT formatting |
| README.md | Document image modality workflow and config examples |
Comments suppressed due to low confidence (2)
weclone/data/utils.py:6
- Return type mixes str and bool; consider using Optional[str] (return None on failure) for clearer API semantics.
def check_image_file_exists(file_path: str) -> str | bool:
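The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the body assumes the helper only needs to confirm that the given path is an existing file, which the PR does not show.

```python
from pathlib import Path
from typing import Optional


def check_image_file_exists(file_path: str) -> Optional[str]:
    """Return file_path if it names an existing file, otherwise None."""
    # Assumption: a plain on-disk existence check is all the helper does.
    # The real helper may resolve paths relative to dataset/media/images.
    return file_path if Path(file_path).is_file() else None
```

With a single "value or None" return type, callers can write `if (path := check_image_file_exists(p)) is not None:` instead of distinguishing a string from False.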
weclone/utils/length_cdf.py:35
- [nitpick] Docstring is outdated: it does not describe the newly added media_dir and image_max_pixels parameters. Update it to reflect the full signature.
r"""Calculate the distribution of the input lengths in the dataset.
    self.current_lang = lang
    return self
else:
    print(f"Warning: Language '{lang}' not available, using default")
Copilot
AI
Jun 6, 2025
Replace print with a structured logger call (e.g., logger.warning) for consistency with other modules.
- print(f"Warning: Language '{lang}' not available, using default")
+ logger.warning(f"Language '{lang}' not available, using default")
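For context, here is a minimal sketch of the suggested change using the standard library's logging module. The class name LangSelector, the logger name, and the choice to leave the current language untouched on a miss are all assumptions for illustration; the PR only shows the few lines quoted above, and the project may use a different logging library.

```python
import logging

logger = logging.getLogger("weclone.i18n")  # assumed logger name


class LangSelector:
    """Hypothetical stand-in for the language-switching part of MultiLangList."""

    def __init__(self, available: list[str], default_lang: str = "en") -> None:
        self.available = available
        self.current_lang = default_lang

    def set_lang(self, lang: str) -> "LangSelector":
        if lang in self.available:
            self.current_lang = lang
        else:
            # Assumption: the current language is left unchanged on a miss.
            # The warning goes through logging handlers instead of stdout.
            logger.warning("Language '%s' not available, using default", lang)
        return self
```

Routing the message through a logger lets the application control level, format, and destination in one place, which is the consistency the review comment asks for.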
- if not config.get("clean_dataset", {}).get("enable_clean"):
-     logger.info("未启用清洗功能")
+ if not config.get("clean_dataset", {}).get("enable_clean") or "image" in config.get("include_type", ""):
Copilot
AI
Jun 6, 2025
During cleaning, config may not include include_type, so checking image here won’t skip multimodal data as intended. Consider passing include_type into the cleaning config or referencing the original settings.
- if not config.get("clean_dataset", {}).get("enable_clean") or "image" in config.get("include_type", ""):
+ if not config.get("clean_dataset", {}).get("enable_clean") or "image" in self.include_type:
Bug: Default Value Error in Configuration Check
In strategies.py, the condition "image" in config.get("include_type", "") uses an incorrect default value ("") for include_type. Since include_type is expected to be a list, the default should be []. This causes the check to always evaluate to False when include_type is missing from the configuration, bypassing the intended logic.
weclone/data/clean/strategies.py, lines 101 to 102 in b5b6611:
if not config.get("clean_dataset", {}).get("enable_clean") or "image" in config.get("include_type", ""):
weclone/data/qa_generatorV2.py, lines 85 to 86 in b5b6611:
if "image" in self.config.get("include_type", []):
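To make the two variants easier to compare, the following sketch exercises the membership test with both defaults. The helper name has_image is hypothetical and exists only for this illustration. Note that when the key is missing, both defaults yield False; the behaviours diverge only when a plain string reaches the check, because `in` performs substring matching on str and element matching on list.

```python
def has_image(config: dict, default) -> bool:
    """Hypothetical helper mirroring the check quoted in the report."""
    return "image" in config.get("include_type", default)


# Key missing: neither default reports an image.
print(has_image({}, ""))   # str default
print(has_image({}, []))   # list default

# Key present as a list: element membership.
print(has_image({"include_type": ["text", "image"]}, []))

# Key present as a string: substring membership, so "images" also matches.
print(has_image({"include_type": "images"}, []))
print(has_image({"include_type": ["images"]}, []))
```

Using the list default in both files keeps the two checks consistent and documents the expected type, even though it does not change the result for a missing key.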
Bug: Type Mismatch in ChatMessage Dataclass
The ChatMessage.src field is defined as str in the dataclass, but the group_consecutive_messages function assigns a list (combined_src_list) to it. This type mismatch, noted by a # type: ignore comment, creates inconsistency and could lead to runtime errors when src is accessed elsewhere expecting a string.
weclone/data/qa_generatorV2.py, lines 406 to 417 in b5b6611:
combined_message = ChatMessage(
    id=base_msg.id,
    MsgSvrID=base_msg.MsgSvrID,
    type_name=base_msg.type_name,
    is_sender=base_msg.is_sender,
    talker=base_msg.talker,
    room_name=base_msg.room_name,
    msg=combined_content,
    src=combined_src_list,  # type: ignore
    CreateTime=messages[-1].CreateTime,  # use the timestamp of the last message
)
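One possible way to remove the need for the type: ignore comment is to widen the annotation and normalise src at the point of use. The sketch below is an assumption-laden illustration: ChatMessage is reduced to three fields, and src_as_list, group_consecutive, and the newline join are hypothetical, not taken from the PR.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ChatMessage:
    """Reduced, hypothetical version of ChatMessage with only the fields needed here."""

    id: int
    msg: str
    src: Union[str, list[str]] = ""  # widened: a single path or a list of paths


def src_as_list(message: ChatMessage) -> list[str]:
    """Normalise src to a list so callers never have to branch on its type."""
    if isinstance(message.src, list):
        return list(message.src)
    return [message.src] if message.src else []


def group_consecutive(messages: list[ChatMessage]) -> ChatMessage:
    """Merge messages into one, collecting every non-empty src entry."""
    base = messages[0]
    combined_src = [path for m in messages for path in src_as_list(m)]
    # Assumption: message bodies are joined with newlines.
    combined_content = "\n".join(m.msg for m in messages)
    return ChatMessage(id=base.id, msg=combined_content, src=combined_src)
```

With the annotation matching what is actually stored, a type checker can flag any remaining call site that still treats src as a plain string.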