[ICCV 2025] From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning

Repo for paper "From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning"

Framework Overview

Overview of our two key components. The Visual Cue Enhancement (VCE) module enhances high-level anchor features by aggregating local information from multi-level feature maps. The Dual Low-Rank Adaptation (Dual-LoRA) module projects the input feature into two low-rank subspaces: one for learning stable, holistic domain knowledge (the skill space) and the other for learning instruction conditions (the task space). For efficiency, Dual-LoRA modules are integrated into the LLM’s query and value projection layers.
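To make the two-subspace idea concrete, here is a minimal NumPy sketch of a Dual-LoRA-style forward pass. It is an illustration under stated assumptions, not the repository's implementation: the function name `dual_lora_forward`, the sigmoid gate used to rectify the skill-space update, and the `scale` parameter are all hypothetical choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dual_lora_forward(x, W, A_skill, B_skill, A_task, B_task, scale=1.0):
    """Frozen projection plus a task-gated low-rank update (illustrative only)."""
    skill = (x @ A_skill) @ B_skill    # holistic skill-space update, rank r
    task = (x @ A_task) @ B_task       # instruction-conditioned task-space signal
    rectified = skill * sigmoid(task)  # ASSUMPTION: rectify via an elementwise gate
    return x @ W + scale * rectified   # W stays frozen; only A_*/B_* would be trained

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 16, 4             # r << d_in keeps the adapters small
x = rng.normal(size=(3, d_in))         # 3 token features
W = rng.normal(size=(d_in, d_out))     # stands in for a query or value projection
A_s = rng.normal(size=(d_in, r))
A_t = rng.normal(size=(d_in, r))
B_s = np.zeros((r, d_out))             # zero init, as in standard LoRA
B_t = rng.normal(size=(r, d_out))
y = dual_lora_forward(x, W, A_s, B_s, A_t, B_t)
```

With the skill up-projection initialized to zero, the adapted layer starts out identical to the frozen projection. Every operation here is a smooth function of the inputs, so gradients reach both subspaces.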

Comparison with Existing LoRA-MoE Methods

Comparison between the proposed Dual-LoRA and existing methods. (a) Mainstream MLLMs, e.g., LLaVA and QwenV, project the high-level visual feature map. (b) Our Visual Cue Enhancement module enhances high-level features by aggregating local information from multi-level feature maps. LoRA-MoE methods mitigate data conflicts by enabling localized responses, i.e., expert activation, under one of several activation strategies: (c) Sparse Activation, where only the top-k experts are activated; (d) Dense Activation, where all experts are activated; and (e) Rectified Activation, where multiple heterogeneous experts are dynamically activated. In contrast, our (f) Dual Low-Rank Adaptation rectifies a holistic knowledge (skill) space with an additional task space. This design is fully differentiable, capable of learning any local response, and more structurally efficient.
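The difference between sparse and dense expert activation can be sketched in a few lines. This is a generic illustration of LoRA-MoE gating, not code from any of the compared methods; `gate_weights` and `lora_moe_delta` are hypothetical names, and the softmax router is an assumption.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gate_weights(logits, top_k=None):
    """Dense: softmax over all experts. Sparse: keep the top_k, renormalize."""
    logits = np.asarray(logits, dtype=float)
    if top_k is None:
        return softmax(logits)
    keep = np.argsort(logits)[-top_k:]     # indices of the k largest logits
    weights = np.zeros_like(logits)
    weights[keep] = softmax(logits[keep])  # unselected experts get weight 0
    return weights

def lora_moe_delta(x, experts, logits, top_k=None):
    """Weighted sum of per-expert low-rank updates x @ A @ B."""
    weights = gate_weights(logits, top_k)
    delta = np.zeros((x.shape[0], experts[0][1].shape[1]))
    for w, (A, B) in zip(weights, experts):
        if w > 0:                          # sparse mode skips inactive experts
            delta += w * (x @ A @ B)
    return delta
```

For example, `gate_weights([1.0, 2.0, 3.0], top_k=1)` puts all weight on the last expert, while omitting `top_k` spreads weight across all three. Because the hard top-k selection assigns exactly zero weight to unselected experts, they receive no gradient for that input.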

Local Enhanced Visual Cues

Feature map visualization of enhanced visual cues with VCE: (a) The enhanced visual cues emphasize key areas in food imagery, such as textures, garnishes, and toppings. (b) The cues highlight important details: text for readability, the cheetah-prey interaction, and the woman’s face and Oscar trophy on stage.
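As a rough intuition for how local aggregation could produce such cues, the sketch below adds neighborhood-averaged features from several feature maps to a high-level anchor map. It is a simplified stand-in, not the VCE module itself: the function name `enhance_anchor`, the uniform averaging over a square window, and the assumption that all maps share the anchor's spatial resolution are choices made only for this example.

```python
import numpy as np

def enhance_anchor(anchor, levels, radius=1):
    """Add locally averaged multi-level features to each anchor position.

    anchor: (H, W, C) high-level feature map.
    levels: list of (H, W, C) maps from other encoder layers (ASSUMED same size).
    """
    H, W, _ = anchor.shape
    out = anchor.astype(float)             # astype returns a copy
    window = (2 * radius + 1) ** 2
    for fmap in levels:
        # Edge padding lets border positions use a full-size neighborhood.
        padded = np.pad(fmap, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
        local = np.zeros_like(out)
        for dy in range(2 * radius + 1):
            for dx in range(2 * radius + 1):
                local += padded[dy:dy + H, dx:dx + W]
        out += local / (window * len(levels))  # uniform weights (an assumption)
    return out
```

A learned module would typically replace the uniform window average with trainable weights, so that informative neighbors contribute more than uninformative ones.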

Local Enhanced Adaption Space

Panels (a), (b), and (c) show feature visualizations for the recipe generation task: (a) the distribution of feature outputs for the holistic skill space, (b) the distribution for the rectified skill space, and (c) the entropy of the holistic skill space (blue line) and the rectified skill space (orange line). Panels (d), (e), and (f) show the corresponding results for the nutrition estimation task.
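One way to compute an entropy curve like those in panels (c) and (f) is to treat each feature vector's normalized magnitudes as a probability distribution and take its Shannon entropy. The source does not state how the entropy is computed, so the normalization below and the name `activation_entropy` are assumptions for illustration.

```python
import numpy as np

def activation_entropy(features, eps=1e-12):
    """Shannon entropy (in nats) of normalized absolute activations, per row."""
    p = np.abs(np.asarray(features, dtype=float))
    p = p / (p.sum(axis=-1, keepdims=True) + eps)  # each row sums to about 1
    return -(p * np.log(p + eps)).sum(axis=-1)

holistic = np.array([[1.0, 1.0, 1.0, 1.0]])   # evenly spread response
localized = np.array([[1.0, 0.0, 0.0, 0.0]])  # response concentrated on one unit
h_holistic = activation_entropy(holistic)
h_localized = activation_entropy(localized)
```

Under this measure, an evenly spread response has the maximum entropy, log of the feature dimension, and a response concentrated on a single unit has entropy near zero.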

Results on Multi-Food Computing Tasks
