```shell
conda activate qlora
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
```

- Copy `llama-7b-hf-transformers-4.29` to localssd.
- Run `prepare_mmlu.py` to download the MMLU data.
- Run `run_qlora.sh`, `run_gwqlora.sh`, or `run_lora.sh`. Finetuning a llama 7B model takes about 5 hours on an A100, and evaluation also takes a significant amount of time; the total running time should be within 8 hours.
- An int benchmark can be generated by modifying the code in `class TweakEvery100Steps` and then running `tweak_once.sh`.
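The periodic-tweak mechanism can be sketched as a simple step hook. This is a hypothetical illustration of what a class like `TweakEvery100Steps` presumably does (apply a weight tweak every N optimizer steps); the class name, method names, and `tweak_fn` callback here are illustrative, not the repo's actual API.

```python
class TweakEveryNSteps:
    """Call `tweak_fn(model)` once every `n` training steps.

    Hypothetical sketch of a periodic-tweak hook; the real
    TweakEvery100Steps in this repo may differ.
    """

    def __init__(self, n, tweak_fn):
        self.n = n
        self.tweak_fn = tweak_fn
        self.step = 0

    def on_step_end(self, model):
        # Count optimizer steps and fire the tweak on every n-th one.
        self.step += 1
        if self.step % self.n == 0:
            self.tweak_fn(model)
```

Generating a one-off benchmark (as `tweak_once.sh` suggests) would then amount to firing the same tweak function a single time instead of on a schedule.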
MMLU results for llama 7B (LoRA rank=4, quantization group size 64 unless noted otherwise):

| Method | STEM (0-shot) | Hums (0-shot) | Social (0-shot) | Other (0-shot) | Avg (0-shot) | STEM (5-shot) | Hums (5-shot) | Social (5-shot) | Other (5-shot) | Avg (5-shot) |
|---|---|---|---|---|---|---|---|---|---|---|
| origin | 27.3 | 33.0 | 32.4 | 37.3 | 32.6 | 30.6 | 34.1 | 38.2 | 38.2 | 35.2 |
| lora | 31.6 | 36.9 | 38.9 | 42.1 | 37.4 | | | | | |
| int3 | 28.8 | 32.6 | 31.8 | 35.2 | 32.2 | 30.1 | 33.2 | 37.4 | 38.2 | 34.6 |
| int3 tweakonce | 29.4 | 32.0 | 33.4 | 36.2 | 32.7 | 30.2 | 33.2 | 37.4 | 38.6 | 34.7 |
| gwq | 29.8 | 33.3 | 34.2 | 37.7 | 33.7 | 31.3 | 33.8 | 38.3 | 39.7 | 35.6 |
| int3-g128 | 28.4 | 31.8 | 31.3 | 35.3 | 31.8 | 28.3 | 30.5 | 31.5 | 33.3 | 30.9 |
| l4q (3bit-g128) | 27.8 | 29.5 | 32.1 | 33.3 | 30.6 | 31.0 | 29.3 | 33.5 | 30.4 | 31.8 |
The l4q numbers are taken from its original paper; it is unclear why they are worse than the plain int3-g128 baseline.
Some experiments are still pending, so parts of the table remain blank.
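For reference, the "int3 with group size" settings in the table refer to groupwise weight quantization: each group of weights shares one scale, and each weight is rounded to a 3-bit integer grid. Below is a minimal NumPy sketch of symmetric absmax groupwise 3-bit quantization, not the repo's actual implementation (the real code may use asymmetric levels, zero points, or a different rounding scheme).

```python
import numpy as np

def quantize_int3_groupwise(w, group_size=64):
    """Groupwise symmetric 3-bit quantization of a 1-D weight vector.

    Each group of `group_size` weights shares one absmax scale;
    weights are rounded onto the integer levels -3..3.
    Assumes len(w) is a multiple of group_size.
    """
    groups = w.reshape(-1, group_size)
    # One scale per group: map the group's absmax onto level 3.
    scale = np.abs(groups).max(axis=1, keepdims=True) / 3.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(groups / scale), -3, 3).astype(np.int8)
    return q, scale

def dequantize_groupwise(q, scale):
    """Reconstruct approximate float weights from 3-bit codes and scales."""
    return (q.astype(np.float32) * scale).reshape(-1)
```

With group size 64 each weight costs 3 bits plus the per-group scale overhead; shrinking the group (e.g. 64 vs. 128) lowers quantization error at the cost of storing more scales, which is consistent with int3 (group=64) scoring above int3-g128 in the table.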