- Authors: Yuxuan Gu, Wuyang Zhou, Giorgos Iacovides, Danilo Mandic
- Paper: https://arxiv.org/abs/2501.15674
This repository contains the implementation of TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs.
Overview: The reasoning abilities of Large Language Models (LLMs) can be improved by structurally denoising their weights, yet existing techniques primarily focus on denoising the feed-forward network (FFN) of the transformer block, and cannot efficiently utilise the Multi-head Attention (MHA) block, which is the core of transformer architectures. To address this issue, we propose a novel intuitive framework that, at its very core, performs MHA compression through a multi-head tensorisation process and the Tucker decomposition. This enables both higher-dimensional structured denoising and compression of the MHA weights, by enforcing a shared higher-dimensional subspace across the weights of the multiple attention heads. We demonstrate that this approach consistently enhances the reasoning capabilities of LLMs across multiple benchmark datasets, and for both encoder-only and decoder-only architectures, while achieving compression rates of up to ∼250 times in the MHA weights, all without requiring any additional data, training, or fine-tuning. Furthermore, we show that the proposed method can be seamlessly combined with existing FFN-only-based denoising techniques to achieve further improvements in LLM reasoning performance.
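For intuition, the snippet below is a minimal sketch of the multi-head tensorisation and Tucker decomposition idea, written with the TensorLy library on random toy weights. The tensor layout, toy dimensions, and ranks are illustrative assumptions chosen for exposition; they do not reproduce the exact implementation in this repository.

```python
# Conceptual sketch only: stack per-head Q/K/V/O weights into one higher-order
# tensor and denoise it with a truncated Tucker decomposition, so that the
# factor matrices are shared across all attention heads.
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

tl.set_backend("numpy")

d_model, n_heads = 256, 4              # toy sizes for illustration
head_dim = d_model // n_heads

# Random stand-ins for the W_Q, W_K, W_V, W_O matrices of one transformer layer.
rng = np.random.default_rng(0)
W = {name: rng.standard_normal((d_model, d_model)) for name in "QKVO"}

# Multi-head tensorisation: split each matrix into heads and stack the four
# weight types along a new mode -> tensor of shape (d_model, head_dim, n_heads, 4).
T = np.stack(
    [W[name].reshape(d_model, n_heads, head_dim).transpose(0, 2, 1) for name in "QKVO"],
    axis=-1,
)

# Truncated Tucker decomposition; truncating every mode except the head mode
# enforces a common subspace shared by all heads and all four weight types.
ranks = (32, 8, n_heads, 2)            # illustrative ranks, not tuned values
core, factors = tucker(tl.tensor(T), rank=ranks)

# Reconstruct the structurally denoised (and compressible) weights and fold
# them back into d_model x d_model matrices.
T_hat = tl.tucker_to_tensor((core, factors))
W_hat = {
    name: np.asarray(T_hat[..., i]).transpose(0, 2, 1).reshape(d_model, d_model)
    for i, name in enumerate("QKVO")
}
print({k: float(np.linalg.norm(W[k] - W_hat[k]) / np.linalg.norm(W[k])) for k in W})
```

Storing only the core tensor and the factor matrices, rather than the dense weights, is what yields the compression.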
To avoid version conflicts, it is recommended to use a separate Conda environment. Follow these steps to set it up:
- Modify the create_env.sh script:
  - Update the Conda path on line 19: in
    `eval "$(~/miniforge3/bin/conda shell.bash hook)"`
    replace `~/miniforge3/bin/conda` with the correct path to your Conda installation.
- Run the installation script (the setup takes approximately 3 minutes):
  `chmod +x create_env.sh`
  `./create_env.sh`
- Initialize Conda and activate the environment:
  `# Replace '~/miniforge3/bin/conda' with the path to your conda installation`
  `eval "$(~/miniforge3/bin/conda shell.bash hook)"`
  `conda activate TensorLLM`
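A quick way to check the environment is to import the main dependencies. The exact package list below is an assumption based on the methods used in this repository; adjust it to whatever create_env.sh actually installs.

```python
# Hypothetical sanity check for the freshly activated TensorLLM environment.
import torch
import transformers

print("CUDA available:", torch.cuda.is_available())
print("transformers version:", transformers.__version__)
```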
You can run experiments in the following modes:
- 4D_Tucker (our method only): Tucker decomposition with shared factor matrices (applied to MHA block).
- 4D_Tucker_laser: (our method for MHA) + (LASER for FFN).
- laser: Original LASER intervention.
- 3D_Tucker: Separately compresses $\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$, and $\mathbf{W}_O$ (for ablation studies; see the sketch below).
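To make the ablation concrete, the toy sketch below mirrors the 3D_Tucker mode: each weight type is tensorised and decomposed independently, so no factor matrices are shared across $\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$, $\mathbf{W}_O$ or across heads. As before, the layout, dimensions, and ranks are illustrative assumptions rather than the repository's actual code.

```python
# Illustrative sketch of the 3D_Tucker ablation: one independent Tucker
# decomposition per weight type, with no sharing across Q, K, V and O.
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

tl.set_backend("numpy")

d_model, n_heads = 256, 4                  # toy sizes, as in the earlier sketch
head_dim = d_model // n_heads
rng = np.random.default_rng(0)

W_hat_3d = {}
for name in "QKVO":
    W = rng.standard_normal((d_model, d_model))       # stand-in weight matrix
    # Tensorise a single weight type into (d_model, head_dim, n_heads) ...
    T = W.reshape(d_model, n_heads, head_dim).transpose(0, 2, 1)
    # ... and decompose it on its own, independently of the other three.
    core, factors = tucker(tl.tensor(T), rank=(32, 8, n_heads))
    T_hat = tl.tucker_to_tensor((core, factors))
    W_hat_3d[name] = np.asarray(T_hat).transpose(0, 2, 1).reshape(d_model, d_model)
```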
Below are example commands to reproduce the results from the paper. The following bash commands use the GPT-J model as an example.
python3 src/TensorLLM_intervention_gptj_bbh_qa.py --mode 4D_Tucker --lnum 27 --qkvo_rank 304 --head_dim_rank 19 --stack_rank 2 --single_experiment --device cuda
python3 src/TensorLLM_intervention_gptj_bios_profession.py --mode 4D_Tucker --lnum 18 --qkvo_rank 208 --head_dim_rank 13 --stack_rank 1 --single_experiment --device cuda
python3 src/TensorLLM_intervention_gptj_fever.py --mode 4D_Tucker --lnum 11 --qkvo_rank 800 --head_dim_rank 50 --stack_rank 2 --single_experiment --device cuda
python3 src/TensorLLM_intervention_gptj_hotpot.py --mode 4D_Tucker --lnum 27 --qkvo_rank 64 --head_dim_rank 4 --stack_rank 2 --single_experiment --device cuda
- `lnum`: specifies the layer number.
- Ranks (`qkvo_rank`, `head_dim_rank`, `stack_rank`): control the decomposition granularity.
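As a rough illustration of how these ranks drive the compression rate, the back-of-the-envelope count below uses the HotpotQA hyper-parameters from the last command and the simple tensor layout assumed in the sketch near the top of this README; it is an estimate for intuition, not an official figure from the paper.

```python
# Hypothetical parameter count for the MHA weights of one GPT-J layer under a
# single 4D Tucker decomposition with ranks (qkvo_rank, head_dim_rank, n_heads, stack_rank).
d_model, n_heads = 4096, 16
head_dim = d_model // n_heads                      # 256
qkvo_rank, head_dim_rank, stack_rank = 64, 4, 2    # values from the HotpotQA command

original = 4 * d_model * d_model                   # dense W_Q, W_K, W_V, W_O

core = qkvo_rank * head_dim_rank * n_heads * stack_rank
factors = (d_model * qkvo_rank                     # mode-1 factor
           + head_dim * head_dim_rank              # mode-2 factor
           + n_heads * n_heads                     # full-rank factor on the head mode
           + 4 * stack_rank)                       # factor over the Q/K/V/O stacking mode
compressed = core + factors

print(f"original: {original:,}  compressed: {compressed:,}  ratio: {original / compressed:.0f}x")
# Roughly a ~250x reduction, consistent with the compression rates reported in the paper.
```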
We gratefully acknowledge the use of code from the following projects: LASER
If you find our paper or code useful, please consider citing our paper:
@article{gu2025tensorllm,
title={TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs},
author={Gu, Yuxuan and Zhou, Wuyang and Iacovides, Giorgos and Mandic, Danilo},
journal={arXiv preprint arXiv:2501.15674},
year={2025}
}