Repo for Visual Cue Enhancement and Dual Low-Rank Adaptation for Efficient Visual Instruction Fine-Tuning

pengkun-jiao/FHTC

[ICCV 2025] From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning

Repo for paper "From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning"

Framework Overview

Overview of our two key components. The Visual Cue Enhancement (VCE) module enhances high-level anchor features by aggregating local information from multi-level feature maps. The Dual Low-Rank Adaptation (Dual-LoRA) module projects the input feature into two low-rank subspaces: one for learning stable, holistic domain knowledge (the skill space) and the other for learning instruction conditions (the task space). Dual-LoRA modules are integrated into the LLM’s query and value projection layers for efficiency.
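As a rough illustration of the two-subspace idea described above, the following NumPy sketch projects an input into a low-rank skill space and a low-rank task space, rescales the skill activation element-wise with the task activation, and maps the result back to the output dimension. The combination rule (a sigmoid gate with an element-wise product), the shared up-projection, the rank, the initialization, and all names (`DualLoRALinear`, `A_skill`, `A_task`, `B`) are assumptions made for illustration; this is not the repository's implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class DualLoRALinear:
    """Frozen linear layer plus a hypothetical dual low-rank update."""

    def __init__(self, d_in, d_out, rank=4, scale=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen pretrained weight (random here; stands in for a query or value projection).
        self.W = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
        # Down-projections into the two low-rank subspaces.
        self.A_skill = rng.standard_normal((d_in, rank)) * 0.01
        self.A_task = rng.standard_normal((d_in, rank)) * 0.01
        # Up-projection, zero-initialized so the adapter starts as a no-op.
        self.B = np.zeros((rank, d_out))
        self.scale = scale

    def delta(self, x):
        skill = x @ self.A_skill         # holistic skill-space activation
        gate = sigmoid(x @ self.A_task)  # task-space gate (assumed form)
        return (skill * gate) @ self.B * self.scale

    def __call__(self, x):
        return x @ self.W + self.delta(x)


layer = DualLoRALinear(d_in=16, d_out=16, rank=4)
x = np.ones((2, 16))
y = layer(x)
print(y.shape)  # (2, 16)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen projection, which mirrors the usual LoRA initialization.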

Comparison with Existing LoRA-MoE Methods

Comparison between the proposed Dual-LoRA and existing methods. (a) Mainstream MLLMs, e.g., LLaVA and Qwen-VL, project the high-level visual feature map. (b) Our Visual Cue Enhancement module enhances high-level features by aggregating local information from multi-level feature maps. LoRA-MoE methods mitigate data conflicts by enabling localized responses, i.e., expert activation, under one of several activation strategies: (c) Sparse Activation, where only the top-k experts are activated; (d) Dense Activation, where all experts are activated; and (e) Rectified Activation, where multiple heterogeneous experts are dynamically activated. In contrast, (f) our Dual Low-Rank Adaptation rectifies a holistic knowledge (skill) space with an additional task space; it is fully differentiable, capable of learning any local response, and more structurally efficient.
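To make the difference between sparse and dense expert activation concrete, here is a minimal NumPy sketch of the two gating rules as they are commonly implemented in mixture-of-experts routers. The function names and the choice to renormalize over the selected experts are illustrative assumptions, not taken from this repository or from any specific LoRA-MoE method.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def dense_gate(logits):
    """Dense activation: every expert receives a non-zero weight."""
    return softmax(np.asarray(logits, dtype=float))


def sparse_topk_gate(logits, k):
    """Sparse activation: keep the k largest logits per row and renormalize."""
    logits = np.asarray(logits, dtype=float)
    # Indices of the k largest logits in each row (order within the k is arbitrary).
    topk = np.argpartition(logits, -k, axis=-1)[..., -k:]
    masked = np.full_like(logits, -np.inf)
    np.put_along_axis(masked, topk, np.take_along_axis(logits, topk, axis=-1), axis=-1)
    # exp(-inf) is 0, so unselected experts get exactly zero weight.
    return softmax(masked)


router_logits = np.array([[2.0, 1.0, 0.5, -1.0]])
print(dense_gate(router_logits))
print(sparse_topk_gate(router_logits, k=2))
```

The top-k selection is a hard, non-differentiable choice over experts, which is the contrast the caption draws with the fully differentiable Dual-LoRA.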

Local Enhanced Visual Cues

Feature map visualization of enhanced visual cues with VCE. (a) Enhanced visual cues emphasize key areas in food imagery, such as textures, garnishes, and toppings. (b) The cues highlight important details: text for readability, the cheetah-prey interaction, and the woman’s face and Oscar trophy on stage.

Local Enhanced Adaptation Space

Panels (a), (b), and (c) show feature visualizations for the recipe generation task: (a) the distribution of feature outputs for the holistic skill space, (b) the distribution for the rectified skill space, and (c) the entropy of the holistic skill space (blue line) and the rectified skill space (orange line). Panels (d), (e), and (f) show the corresponding results for the nutrition estimation task.
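The caption does not say how the entropy curves were computed. One plausible way to obtain such a quantity, shown here purely as an assumption, is to bin the feature outputs into a histogram and take the Shannon entropy of the resulting distribution. The helper name `histogram_entropy` and the bin count are hypothetical.

```python
import numpy as np


def histogram_entropy(values, bins=32, value_range=None):
    """Shannon entropy (in nats) of a histogram over `values`."""
    counts, _ = np.histogram(np.ravel(values), bins=bins, range=value_range)
    p = counts / counts.sum()
    p = p[p > 0]  # empty bins contribute nothing (0 * log 0 is treated as 0)
    return float(-(p * np.log(p)).sum())


# Features spread evenly over the bins give high entropy;
# features concentrated in one bin give zero entropy.
spread = np.array([0.5, 1.5, 2.5, 3.5])
peaked = np.array([0.5, 0.5, 0.5, 0.5])
print(histogram_entropy(spread, bins=4, value_range=(0, 4)))
print(histogram_entropy(peaked, bins=4, value_range=(0, 4)))
```

Under this reading, a lower entropy indicates that the feature outputs concentrate in fewer bins.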

Results on Multi-Food Computing Tasks

[Figure: results on multi-food computing tasks]
