Codebase for testing Top-k Attention and Top-theta Attention on Large Language Models, using the lm-eval-harness framework [1] as well as text generation tasks including HumanEval [2] and LongBench [3].
Testing was done for the LLaMA models on the arc_challenge/arc_easy/hellaswag/medmcqa datasets for Q&A evaluation (prefill-only tasks) and on the humaneval/longbench datasets (prefill + generative decoding). For the detailed list of tested variants, refer to reproduce.md.
# Create a conda/pyenv virtual environment in the local directory
conda create python=3.9.12 --prefix ./topksel
conda activate $(pwd)/topksel
# Install the human_eval repo [2] and enable unsandboxed evaluation of LLM-generated Python programs (a sanity-check sketch follows the installation steps)
git clone https://github.com/openai/human-eval.git
pushd human-eval
sed -i 's/^#\(\s*exec(check_program, exec_globals).*\)/\1/' human_eval/execution.py
sed -i 's/^.*assert len(completion_id) == len(problems), "Some problems are not attempted."/\ # assert len(completion_id) == len(problems), "Some problems are not attempted."/' human_eval/evaluation.py
pip install -e .
popd
# Install the lm-eval harness [1] and patch it with the calibration tasks for Hellaswag, Arc_Challenge, Arc_Easy and MedMCQA
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
pushd lm-evaluation-harness
git checkout v0.4.8
echo -e "include: hellaswag.yaml\ntask: hellaswag_calibration\ntest_split: train\n" > lm_eval/tasks/hellaswag/hellaswag_calibration.yaml
echo -e "include: arc_challenge.yaml\ntask: arc_challenge_calibration\ntest_split: train\n" > lm_eval/tasks/arc/arc_challenge_calibration.yaml
echo -e "include: arc_easy.yaml\ntask: arc_easy_calibration\ntest_split: train\n" > lm_eval/tasks/arc/arc_easy_calibration.yaml
pip install -e .
popd
# Clone the LongBench repo to make its task configurations accessible [3]
git clone https://github.com/THUDM/LongBench.git
# Install the current topk_attention repo with its dependencies
# git clone <HERE COMES THE GITHUB REPO>topk_attention.git or just have the topk_attention directory ready with the code
pushd topk_attention
pip install -r requirements.txt
popd
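After the setup above, both patches can be sanity-checked from Python. The following is an illustrative sketch, not part of the repository; it assumes the default HumanEval problem file shipped with the human-eval package and the TaskManager API of lm-eval-harness v0.4.x.

```python
# Illustrative sanity checks for the installation steps above (not repo code).
from human_eval.data import read_problems
from human_eval.execution import check_correctness
from lm_eval.tasks import TaskManager


def main():
    # 1) Unsandboxed-exec patch: executing a canonical solution should pass.
    problems = read_problems()
    problem = problems["HumanEval/0"]
    result = check_correctness(problem, problem["canonical_solution"], timeout=3.0)
    print("human-eval exec enabled:", result["passed"])  # expected: True

    # 2) Calibration-task patch: the new lm-eval tasks should be registered.
    task_manager = TaskManager()
    for task in ("hellaswag_calibration", "arc_challenge_calibration", "arc_easy_calibration"):
        print(task, "registered:", task in task_manager.all_tasks)


if __name__ == "__main__":
    # check_correctness runs the program in a subprocess, so keep the __main__ guard.
    main()
```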
test_llama.py- Runs Q&A task (hellaswag, arc_challenge, arc_easy, medmcqa) evaluations for Top-k and Top-theta (and baseline) on Llama models. To only calibrate thresholds - use --calibrate_only flag.gen_llama.py- Runs text generation task (humaneval, longbench-qmsum, longbench-gov_report) evaluations for Top-k and Top-theta (and baseline) on Llama models.topk_llama.py- Implements Top-k and Top-theta modifications to Vanilla Attentiontopk_llama_chunked.py- Implements Top-k and Top-theta modifications to Vanilla Attention - with support for chcked prefill to accommodate large sequence length above 30k tokens (tested with up to 50k tokens)
topk_llama.py contains the implementation of TopK_LLamaAttention class, a modified LlamaAttention layer from the transformers library. The TopK_LLamaAttention class implements all the functionality of Top-k attention and of Top-theta attention (including calibration). A few usage details:
- `mode=0` implements only Top-theta, `mode=1` implements only Top-k, and any other mode implements the baseline.
- `update_model(model)` - function to replace all `LlamaAttention` layers with `TopK_LLamaAttention` layers.
- `set_params(model, **params)` - function to set the parameters of the `TopK_LLamaAttention` layers, e.g. mode, K, etc.
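A minimal usage sketch, assuming topk_llama.py is importable from the working directory; the `mode`/`k` keyword arguments are the ones described above, and everything else is standard transformers usage:

```python
# Minimal sketch: patch a LLaMA model with TopK_LLamaAttention and run it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from topk_llama import update_model, set_params

model_name = "meta-llama/Llama-2-7b-hf"  # any LLaMA checkpoint you have access to
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

update_model(model)              # replace every LlamaAttention with TopK_LLamaAttention
set_params(model, mode=1, k=64)  # mode=1: Top-k with k=64; mode=0 would select Top-theta

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```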
test_llama.py and gen_llama.py - run the evaluations for Top-k and Top-theta on Llama models.
Outputs:
- Results are logged in the results-* directories.
- Various products of the run are dumped to a dedicated and time-stamped sub-directory under the products directory.
Instructions for reproducing the experiments are in reproduce.md
- Evaluate llama2-7b on the hellaswag task using Top-theta attention (`--mode 0`), calibrated for k=64 (`--k 64`) for all layers except layers 0 and 1 (`--layerk 0:512,1:512`), with the thresholding placed before the softmax (`--placement pre-softmax`). During calibration, an individual threshold value is determined for every (layer, head, seqlen) by taking the average of the thresholds found on the different calibration samples and increasing this average by 0.1 standard deviations (`--calib_add_sigma 0.1`); a short sketch of this rule follows these examples. In addition, the recommended topk-at-calibration feature (`--calib_tac`) is applied during calibration to emulate the presence of thresholding. During inference, softmax denominator compensation of the "offline-calibrated" type (`--sdc offline-calibrated`) and V-mean compensation (`--vmc`) are applied. The `--timestamps` option could become the default in the future, but for now it must be specified in order to create a separate products subdirectory for the files dumped during the evaluation run.
python test_llama.py --timestamps --llama 2-7 --task hellaswag --mode 0 --k 64 --layerk 0:512,1:512 --placement pre-softmax --calib_tac --calib_add_sigma 0.1 --sdc offline-calibrated --vmc
- Evaluate the codellama-34b model with Top-theta attention (`--mode 0`), calibrated for k=64 (`--k 64`) for all layers except layers 0 and 1 (`--layerk 0:512,1:512`), with the thresholding placed before the softmax (`--placement pre-softmax`). The test set consists of the first 20 out of 167 test examples (tasks) of the humaneval dataset. Only a single output per task is evaluated (the quality metric is pass@1). No SDC or VMC compensations are applied. The model is allowed to generate tokens until the EOS token is generated or until the total sequence length reaches 2048, at which point generation is halted.
python gen_llama.py --timestamps --llama 34 --mode 0 --k 64 --layerk 0:512,1:512 --placement pre-softmax --calib_add_sigma 0.1 --calib_sample_frac 1.0 --calib_tac --num_samples_per_task 1 --max_seq_len 2048
scripts/plot_th_llama.py - Plots the calibrated thresholds for the different layers, as well as the attention matrix size required by Top-theta during evaluation with the calibrated thresholds. The produced plots are written to the products subdirectory of the evaluation (`-d`) and all share the title specified after the `-t` argument.
python plot_th_llama.py -d "products/2024-04-29_17-29-01_774054" -t "CodeLLaMA-34b-arc_challenge Top-th pre-softmax (k=512,512,128,128,...) single-k calibration mean+1.0*sigma, nacc=52.39% (base=54.44%)"
scripts/plot_gen_llama.py - Plots the prompt and completion lengths.
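For reference, the calibration rule used in the first example (an individual threshold per (layer, head, seqlen), taken as the mean of the per-sample thresholds plus `--calib_add_sigma` standard deviations) amounts to the following; the snippet is an illustration with dummy values, not code from the repo:

```python
# Illustrative per-(layer, head, seqlen) threshold calibration rule.
import numpy as np

calib_add_sigma = 0.1  # --calib_add_sigma 0.1
# Thresholds found for one (layer, head, seqlen) bucket on each calibration sample (dummy values).
per_sample_thresholds = np.array([0.91, 0.87, 0.95, 0.89])

theta = per_sample_thresholds.mean() + calib_add_sigma * per_sample_thresholds.std()
print(f"calibrated threshold: {theta:.4f}")
```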
Check out the notebooks for various visualization capabilities.
- Calibration of thresholds for Top-theta is enabled only through test_llama.py (which uses the topk_llama.py implementation) and is therefore possible only on Q&A tasks with relatively short sequences (a few thousand tokens). To address this, enable the commented-out code in topk_llama_chunked.py.
- Q&A task evaluation (test_llama.py) is limited to sequence lengths of a few thousand tokens. To enable longer sequences, import TopK_LLamaAttention from topk_llama_chunked instead.
- To dump statistics that enable attention top-k visualizations (notebooks/3), pass an empty list as dump_stats_set_exclude in the set_params() call in gen_llama.py (see the sketch after this list). Only then will running gen_llama.py produce the corresponding txt output files that notebooks/3 can visualize.
- The thresholds produced by calibration (the th.txt file) do not undergo any smoothing/interpolation or function fitting. Smoothing is implemented separately in scripts/th_interpolate_and_smoothen.py, and it could be good practice to incorporate it as a permanent post-processing step after calibration.
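For the statistics-dumping note above, the change in gen_llama.py amounts to something like the following; the other keyword arguments are placeholders for whatever the script already passes:

```python
# In gen_llama.py: pass an empty exclude list so that all attention statistics
# are dumped for the notebooks/3 visualizations (mode/k shown only as placeholders).
set_params(model, mode=0, k=64, dump_stats_set_exclude=[])
```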
[1] https://github.com/EleutherAI/lm-evaluation-harness
[2] https://github.com/openai/human-eval
[3] https://github.com/THUDM/LongBench