Support image modality chat history fine-tuning #142

xming521 · 2025-06-03T03:44:42Z

Support image modality chat history fine-tuning

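…model path and add platform configuration; enhance the data processing logic to support multilingual cut types, and consolidate the cut type list in qa_generator.py.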

…transformers version in pyproject.toml, adjust the clean_dataset configuration in settings.template.jsonc, add media_dir to the mllm template, optimize data processing logic in qa_generatorV2.py and utils.py, update the length_cdf function to support the media_dir parameter.

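…mage_max_pixels configuration; optimize the data processing logic in qa_generatorV2.py to support the image_max_pixels parameter; update the length_cdf function to support the image_max_pixels parameter.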

xming521 · 2025-06-05T15:39:11Z

BugBot run

Copilot

Pull Request Overview

This PR extends the chat history fine-tuning pipeline to support image modality end-to-end.

Introduce new image-related parameters and update dataset selection logic for multimodal training.
Enhance utilities for image file existence checks and WeChat image extraction.
Extend data models, CLI dispatch, and configuration/examples to include multimodal definitions.

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.


| File | Description |
| --- | --- |
| weclone/utils/length_cdf.py | Add `media_dir` & `image_max_pixels` parameters; update logs |
| weclone/utils/i18n.py | New `MultiLangList` class for bilingual label support |
| weclone/utils/config.py | Override `dataset` to `wechat-mllm-sft` when `image` included |
| weclone/data/utils.py | New `check_image_file_exists` helper for `dataset/media/images` |
| weclone/data/models.py | Add `Message`, `QaPairV2`, `QaPairFormat` enums and multimodal types |
| weclone/data/clean/strategies.py | Skip LLM cleaning for multimodal data; adjust string quoting |
| weclone/data/chat_parsers/wechat_parser.py | New script to gather encrypted WeChat images for decryption |
| weclone/cli.py | Dispatch to V2 QA generator when images present; refine logging |
| settings.template.jsonc | Bump version and remove deprecated `train_pt_args` section |
| examples/mllm.template.jsonc | Provide a full multimodal training config template |
| dataset/res_csv/sft/dataset_info.json | Register `wechat-mllm-sft` entry with ShareGPT formatting |
| README.md | Document image modality workflow and config examples |
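
The `weclone/utils/config.py` row describes switching the training dataset when images are part of the data. A minimal sketch of that idea follows. The function name `select_dataset`, the `train_args` dictionary, and the text-only dataset name `wechat-sft` are assumptions made for illustration; only `wechat-mllm-sft` and the `image` check come from the summary above.

```python
from typing import Any


def select_dataset(train_args: dict[str, Any], include_type: list[str]) -> dict[str, Any]:
    """Return a copy of train_args with the dataset switched for image data."""
    updated = dict(train_args)  # copy so the caller's dict is left untouched
    if "image" in include_type:
        # Dataset name taken from the per-file summary above.
        updated["dataset"] = "wechat-mllm-sft"
    return updated


# "wechat-sft" is a hypothetical text-only dataset name.
text_only = select_dataset({"dataset": "wechat-sft"}, ["text"])
with_images = select_dataset({"dataset": "wechat-sft"}, ["text", "image"])
```

Returning a copy rather than mutating the input keeps the override easy to test in isolation; the actual code in the PR may well modify the loaded config in place.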

Comments suppressed due to low confidence (2)

weclone/data/utils.py:6

Return type mixes str and bool; consider using Optional[str] (return None on failure) for clearer API semantics.

def check_image_file_exists(file_path: str) -> str | bool:
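
The `Optional[str]` suggestion could look roughly like the sketch below. The name `find_image_file`, the `media_dir` parameter, and the lookup by file name are assumptions for illustration, not the helper's actual matching logic; the default directory is the one named in the file summary.

```python
from pathlib import Path
from typing import Optional


def find_image_file(file_path: str, media_dir: str = "dataset/media/images") -> Optional[str]:
    """Return the image path under media_dir if the file exists, else None."""
    # Assumption: images are looked up by file name inside media_dir.
    candidate = Path(media_dir) / Path(file_path).name
    return str(candidate) if candidate.is_file() else None
```

With this shape a caller can write `if (path := find_image_file(src)) is not None:` and never has to distinguish a `False` result from a string.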

weclone/utils/length_cdf.py:35

[nitpick] Docstring is outdated: it does not describe the newly added media_dir and image_max_pixels parameters. Update it to reflect the full signature.

r"""Calculate the distribution of the input lengths in the dataset.

Copilot · 2025-06-06T12:55:28Z

weclone/utils/i18n.py

+            self.current_lang = lang
+            return self
+        else:
+            print(f"Warning: Language '{lang}' not available, using default")


Replace print with a structured logger call (e.g., logger.warning) for consistency with other modules.

Suggested change

-            print(f"Warning: Language '{lang}' not available, using default")
+            logger.warning(f"Language '{lang}' not available, using default")
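
A self-contained sketch of the suggested change is shown here. It uses the standard-library `logging` module, which is an assumption: the project may rely on a different logger. `LangSelector` is a hypothetical stand-in for the language-switching part of `MultiLangList`, and resetting to the default language in the fallback branch is likewise assumed from the warning text.

```python
import logging

logger = logging.getLogger(__name__)


class LangSelector:
    """Hypothetical stand-in for the language switch in MultiLangList."""

    def __init__(self, available: list[str], default_lang: str) -> None:
        self.available = available
        self.default_lang = default_lang
        self.current_lang = default_lang

    def set_lang(self, lang: str) -> "LangSelector":
        if lang in self.available:
            self.current_lang = lang
        else:
            # Assumption: fall back to the default language, as the message says.
            self.current_lang = self.default_lang
            # %-style arguments defer string formatting until the record is emitted.
            logger.warning("Language '%s' not available, using default", lang)
        return self
```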

Copilot · 2025-06-06T12:55:29Z

weclone/data/clean/strategies.py


-        if not config.get("clean_dataset", {}).get("enable_clean"):
-            logger.info("未启用清洗功能")
+        if not config.get("clean_dataset", {}).get("enable_clean") or "image" in config.get("include_type", ""):


During cleaning, config may not include include_type, so checking image here won’t skip multimodal data as intended. Consider passing include_type into the cleaning config or referencing the original settings.

Suggested change

-        if not config.get("clean_dataset", {}).get("enable_clean") or "image" in config.get("include_type", ""):
+        if not config.get("clean_dataset", {}).get("enable_clean") or "image" in self.include_type:
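
One way to read the suggestion is that the cleaning step should receive `include_type` explicitly instead of hoping it is present in the cleaning config. The sketch below illustrates that under stated assumptions: `CleaningGate` and `should_clean` are hypothetical names, and only the `clean_dataset.enable_clean` key and the `image` check come from the diff.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CleaningGate:
    """Hypothetical helper that decides whether LLM cleaning should run."""

    clean_config: dict[str, Any]
    # Passed in explicitly rather than read from clean_config.
    include_type: list[str] = field(default_factory=list)

    def should_clean(self) -> bool:
        enabled = bool(self.clean_config.get("clean_dataset", {}).get("enable_clean"))
        # Multimodal data skips LLM cleaning, per the file summary.
        return enabled and "image" not in self.include_type
```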

cursor

Bug: Default Value Error in Configuration Check

In strategies.py, the condition "image" in config.get("include_type", "") uses an incorrect default value ("") for include_type. Since include_type is expected to be a list, the default should be []. This causes the check to always evaluate to False when include_type is missing from the configuration, bypassing the intended logic.

weclone/data/clean/strategies.py#L101-L102

WeClone/weclone/data/clean/strategies.py

Lines 101 to 102 in b5b6611

    
           if not config.get("clean_dataset", {}).get("enable_clean") or "image" in config.get("include_type", ""):

weclone/data/qa_generatorV2.py#L85-L86

WeClone/weclone/data/qa_generatorV2.py

Lines 85 to 86 in b5b6611


	if "image" in self.config.get("include_type", []):
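
When weighing this report, it may help to recall how Python's `in` operator behaves for the two defaults: with the key missing, both `""` and `[]` yield `False`, so the list default mainly keeps the types consistent. The difference shows up when a string value is present, because `in` on a string is a substring test. The helper `has_image` below is a hypothetical illustration, not code from the PR.

```python
def has_image(config: dict) -> bool:
    """Check for the image modality, tolerating a missing or string-valued key."""
    include_type = config.get("include_type", [])
    if isinstance(include_type, str):
        # Wrap a bare string so "image" is matched as an element,
        # not as a substring (for example of "images").
        include_type = [include_type]
    return "image" in include_type


# With the key missing, either default makes the membership test False.
missing_with_str_default = "image" in {}.get("include_type", "")
missing_with_list_default = "image" in {}.get("include_type", [])
# A string value triggers substring matching.
substring_match = "image" in "images"
```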


Bug: Type Mismatch in ChatMessage Dataclass

The ChatMessage.src field is defined as str in the dataclass, but the group_consecutive_messages function assigns a list (combined_src_list) to it. This type mismatch, noted by a # type: ignore comment, creates inconsistency and could lead to runtime errors when src is accessed elsewhere expecting a string.

weclone/data/qa_generatorV2.py#L406-L417

WeClone/weclone/data/qa_generatorV2.py

Lines 406 to 417 in b5b6611

    
    combined_message = ChatMessage(
        id=base_msg.id,
        MsgSvrID=base_msg.MsgSvrID,
        type_name=base_msg.type_name,
        is_sender=base_msg.is_sender,
        talker=base_msg.talker,
        room_name=base_msg.room_name,
        msg=combined_content,
        src=combined_src_list,  # type: ignore
        CreateTime=messages[-1].CreateTime,  # use the timestamp of the last message
    )
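
One possible way to resolve the mismatch without `# type: ignore` is to widen the field type and normalise it at the point of use. The sketch below is an assumption-laden illustration: `ChatMessageSketch` is a reduced, hypothetical version of `ChatMessage` with only two fields, and `src_as_list` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ChatMessageSketch:
    """Reduced, hypothetical version of ChatMessage with only two fields."""

    msg: str
    # A single source path for one message, or a list after grouping.
    src: Union[str, list[str]]


def src_as_list(message: ChatMessageSketch) -> list[str]:
    """Normalise src so callers can always iterate over a list of paths."""
    if isinstance(message.src, str):
        # An empty string is treated as "no source".
        return [message.src] if message.src else []
    return list(message.src)
```

An alternative with the same effect would be to declare `src` as `list[str]` everywhere and wrap single paths at construction time, which avoids the `isinstance` check in consumers.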



xming521 added 24 commits May 24, 2025 15:32

Bump version to 0.2.22, add WeChat image copy feature, update .gitignore to include the new directories

9f59ec4

Merge remote-tracking branch 'origin/master' into dev

9547392

Update pyproject.toml to use the llamafactory GitHub link, modify settings.template.jsonc to update…

07662a3

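…model path and add platform configuration; enhance the data processing logic to support multilingual cut types, and consolidate the cut type list in qa_generator.py.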

Merge remote-tracking branch 'origin/master' into dev

54cc16d

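Update the data models to support the ShareGPT format, add the QaPairV2 and Message classes, modify the data processing logic to support image messages, and improve the image file check.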

5c28bf0

add QaPairV2

7e4b043

Merge remote-tracking branch 'origin/master' into dev

babba72

improve qag V2

b09edad

Update the mllm template to support a maximum image count; optimize the data processing logic.

4d69b0c

Update the mllm template.

7bc76d5

Fix the length_cdf interval at 512.

0f9d737

Merge remote-tracking branch 'origin/master' into dev

db873b3

Fix the file matching logic in the check_image_file_exists function.

7ee6289

Update the transformers version in pyproject.toml and add the accelerate dependency; add to the mllm template the media_dir and i…

a12a067

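…mage_max_pixels configuration; optimize the data processing logic in qa_generatorV2.py to support the image_max_pixels parameter; update the length_cdf function to support the image_max_pixels parameter.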

Add the enable_thinking option to the mllm template, defaulting to false.

1e52a20

Merge remote-tracking branch 'origin/master' into dev

a703188

Modify wechat_parser.py to accept the WeChat personal folder path as a command-line argument and improve image copying.

4a61480

Update README.md

2a868a1

Update README.md; dynamically select the data processor in cli.py based on the configuration.

18af671

Update README.md.

f7fc362

Update pyproject.toml.

8c20c1a

Update pyproject.toml and README.md.

48f0ed3

Update dependencies

cccd2fe

xming521 requested a review from Copilot June 3, 2025 15:05


xming521 changed the title ~~Add image modality chat history fine-tuning~~ Support image modality chat history fine-tuning Jun 4, 2025


Update README.md; adjust the training parameters in the mllm template.

3842932

xming521 requested a review from Copilot June 6, 2025 12:53

Copilot AI reviewed Jun 6, 2025


Update README.md.

b5b6611

xming521 merged commit dc09602 into master Jun 6, 2025
1 check passed

cursor bot reviewed Jun 6, 2025


BAIKEMARK pushed a commit to BAIKEMARK/WeClone that referenced this pull request Jun 9, 2025

Merge pull request xming521#142 from xming521/dev

8d1bca2

Support image modality chat history fine-tuning


2 participants
