Owner

@xming521 xming521 commented Jun 3, 2025

Support image modality chat history fine-tuning

xming521 added 24 commits May 24, 2025 15:32
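…model path and add platform configuration, enhance the data processing logic to support multilingual cut types, and consolidate the cut type list in qa_generator.py.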
…transformers version in pyproject.toml, adjust the clean_dataset configuration in settings.template.jsonc, add media_dir to the mllm template, optimize data processing logic in qa_generatorV2.py and utils.py, update the length_cdf function to support the media_dir parameter.
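…mage_max_pixels configuration; optimize the data processing logic in qa_generatorV2.py to support the image_max_pixels parameter; update the length_cdf function to support the image_max_pixels parameter.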
@xming521 xming521 requested a review from Copilot June 3, 2025 15:05

This comment was marked as outdated.

@xming521 xming521 changed the title from "添加图片模态聊天记录微调" (Add image modality chat history fine-tuning) to "Support image modality chat history fine-tuning" Jun 4, 2025
Owner Author

xming521 commented Jun 5, 2025

BugBot run

cursor[bot]

This comment was marked as outdated.

@xming521 xming521 requested a review from Copilot June 6, 2025 12:53
Contributor

Copilot AI left a comment

Pull Request Overview

This PR extends the chat history fine-tuning pipeline to support image modality end-to-end.

  • Introduce new image-related parameters and update dataset selection logic for multimodal training.
  • Enhance utilities for image file existence checks and WeChat image extraction.
  • Extend data models, CLI dispatch, and configuration/examples to include multimodal definitions.

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| weclone/utils/length_cdf.py | Add media_dir & image_max_pixels parameters; update logs |
| weclone/utils/i18n.py | New MultiLangList class for bilingual label support |
| weclone/utils/config.py | Override dataset to wechat-mllm-sft when image included |
| weclone/data/utils.py | New check_image_file_exists helper for dataset/media/images |
| weclone/data/models.py | Add Message, QaPairV2, QaPairFormat enums and multimodal types |
| weclone/data/clean/strategies.py | Skip LLM cleaning for multimodal data; adjust string quoting |
| weclone/data/chat_parsers/wechat_parser.py | New script to gather encrypted WeChat images for decryption |
| weclone/cli.py | Dispatch to V2 QA generator when images present; refine logging |
| settings.template.jsonc | Bump version and remove deprecated train_pt_args section |
| examples/mllm.template.jsonc | Provide a full multimodal training config template |
| dataset/res_csv/sft/dataset_info.json | Register wechat-mllm-sft entry with ShareGPT formatting |
| README.md | Document image modality workflow and config examples |
Comments suppressed due to low confidence (2)

weclone/data/utils.py:6

  • Return type mixes str and bool; consider using Optional[str] (return None on failure) for clearer API semantics.
def check_image_file_exists(file_path: str) -> str | bool:
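
The Optional[str] suggestion might look like the minimal sketch below. Only the name check_image_file_exists and the file_path parameter come from this PR; the function body is an assumption, since the real implementation in weclone/data/utils.py is not shown in this thread.

```python
import os
import tempfile
from typing import Optional


def check_image_file_exists(file_path: str) -> Optional[str]:
    """Return file_path if it points to an existing file, else None.

    Hypothetical body: the real helper may resolve paths differently.
    """
    if os.path.isfile(file_path):
        return file_path
    return None


# Callers can branch on None instead of comparing against False.
with tempfile.NamedTemporaryFile(suffix=".png") as tmp:
    found = check_image_file_exists(tmp.name)
    missing = check_image_file_exists(tmp.name + ".does-not-exist")
```

With a single return type, a type checker can narrow the result with a plain `if path is not None:` check.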

weclone/utils/length_cdf.py:35

  • [nitpick] Docstring is outdated: it does not describe the newly added media_dir and image_max_pixels parameters. Update it to reflect the full signature.
r"""Calculate the distribution of the input lengths in the dataset.

        self.current_lang = lang
        return self
    else:
        print(f"Warning: Language '{lang}' not available, using default")

Copilot AI Jun 6, 2025

Replace print with a structured logger call (e.g., logger.warning) for consistency with other modules.

Suggested change
- print(f"Warning: Language '{lang}' not available, using default")
+ logger.warning(f"Language '{lang}' not available, using default")
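
As a rough illustration of the suggestion, the sketch below routes the fallback message through a logger instead of print. It uses the standard-library logging module and a hypothetical LangSelector class; the project's own logger object and the real MultiLangList implementation may differ, and the fallback to a default language in the else branch is an assumption.

```python
import logging

logger = logging.getLogger("weclone.i18n")  # assumed logger name


class LangSelector:
    """Hypothetical stand-in for the language-switching part of MultiLangList."""

    def __init__(self, available, default):
        self.available = set(available)
        self.default = default
        self.current_lang = default

    def set_lang(self, lang):
        if lang in self.available:
            self.current_lang = lang
        else:
            # %-style arguments defer formatting until the record is emitted.
            logger.warning("Language '%s' not available, using default", lang)
            self.current_lang = self.default  # assumed fallback behaviour
        return self
```

A logger call lets the message be filtered by level and formatted like the rest of the application's output, which print cannot do.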

if not config.get("clean_dataset", {}).get("enable_clean"):
    logger.info("未启用清洗功能")  # "Cleaning is not enabled"
if not config.get("clean_dataset", {}).get("enable_clean") or "image" in config.get("include_type", ""):

Copilot AI Jun 6, 2025

During cleaning, config may not include include_type, so checking image here won’t skip multimodal data as intended. Consider passing include_type into the cleaning config or referencing the original settings.

Suggested change
- if not config.get("clean_dataset", {}).get("enable_clean") or "image" in config.get("include_type", ""):
+ if not config.get("clean_dataset", {}).get("enable_clean") or "image" in self.include_type:
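
One way to make the dependency explicit is to pass include_type alongside the cleaning config rather than looking it up inside that config. The helper below is a sketch only: should_skip_cleaning and its parameter names are hypothetical and are not part of the PR.

```python
from typing import Any, Mapping, Sequence


def should_skip_cleaning(config: Mapping[str, Any], include_type: Sequence[str]) -> bool:
    """Skip LLM cleaning when it is disabled or when image data is included."""
    cleaning_enabled = bool(config.get("clean_dataset", {}).get("enable_clean"))
    return (not cleaning_enabled) or ("image" in include_type)
```

Because include_type is a required argument, a caller cannot silently fall back to a default when the key is absent from the cleaning config.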

cursor[bot]

This comment was marked as outdated.

@xming521 xming521 merged commit dc09602 into master Jun 6, 2025
1 check passed

@cursor cursor bot left a comment

Bug: Default Value Error in Configuration Check

In strategies.py, the condition "image" in config.get("include_type", "") uses a string default ("") for include_type. Since include_type is expected to be a list, the default should be [], consistent with qa_generatorV2.py. With either default the check evaluates to False when include_type is missing from the configuration, so the intended skip for image data only takes effect when the key is actually present in the config being checked.

weclone/data/clean/strategies.py#L101-L102

if not config.get("clean_dataset", {}).get("enable_clean") or "image" in config.get("include_type", ""):

weclone/data/qa_generatorV2.py#L85-L86

if "image" in self.config.get("include_type", []):
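
The difference between the two defaults can be checked directly. The snippet below is a small sketch of how the membership test behaves; get_include_types is a hypothetical normalizing helper, not something defined in the PR.

```python
from typing import Any, List, Mapping


def get_include_types(config: Mapping[str, Any]) -> List[str]:
    """Return include_type as a list, tolerating a missing key or a bare string."""
    value = config.get("include_type", [])
    if isinstance(value, str):
        return [value]
    return list(value)


# With a missing key, both defaults make the membership test False.
missing_with_str_default = "image" in {}.get("include_type", "")
missing_with_list_default = "image" in {}.get("include_type", [])

# On a string, `in` is a substring test; on a list it compares whole items.
substring_match = "image" in "images"
exact_match = "image" in get_include_types({"include_type": "images"})
```

The practical benefit of a list default is type consistency: every call site can treat include_type as a list without special-casing strings.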


Bug: Type Mismatch in ChatMessage Dataclass

The ChatMessage.src field is defined as str in the dataclass, but the group_consecutive_messages function assigns a list (combined_src_list) to it. This type mismatch, noted by a # type: ignore comment, creates inconsistency and could lead to runtime errors when src is accessed elsewhere expecting a string.

weclone/data/qa_generatorV2.py#L406-L417

combined_message = ChatMessage(
    id=base_msg.id,
    MsgSvrID=base_msg.MsgSvrID,
    type_name=base_msg.type_name,
    is_sender=base_msg.is_sender,
    talker=base_msg.talker,
    room_name=base_msg.room_name,
    msg=combined_content,
    src=combined_src_list,  # type: ignore
    CreateTime=messages[-1].CreateTime,  # use the timestamp of the last message
)

BAIKEMARK pushed a commit to BAIKEMARK/WeClone that referenced this pull request Jun 9, 2025
Support image modality chat history fine-tuning
2 participants