Assignment 4 of Advanced Natural Language Processing (IIIT-Hyderabad, Monsoon '24)
Experiments in quantisation: quantisation from scratch (whole-model and selective) as well as bitsandbytes integration, covering 8-bit, 4-bit, and NF4 formats.
In addition, we deploy a model on a local device using llama.cpp, quantise it, and upload it to the Hugging Face Hub.
Refer to the env file to install the dependencies using conda:
conda env create -f docs/envs.yml
Quantise gpt-neo with your method of choice:
python -m src.quantize --q_type <type>
Supported types are custom_whole, custom_selective, bnb_4, bnb_8, bnb_nf4, and none.
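The bnb_* variants correspond to loading the model through bitsandbytes. As a minimal sketch, the three configurations could be expressed with transformers' BitsAndBytesConfig as below; the checkpoint name and exact options here are assumptions, not necessarily what src.quantize does:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit loading (bnb_8)
cfg_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit loading: plain FP4 (bnb_4) vs. NF4 (bnb_nf4)
cfg_nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # use "fp4" for the plain 4-bit variant
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-1.3B",            # assumed checkpoint; adjust to the one used in src.quantize
    quantization_config=cfg_nf4,
    device_map="auto",
)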
custom_whole takes a lot of memory during inference and may have to be run with the --cpu flag.
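For intuition, here is a minimal sketch of one way the from-scratch path could work: symmetric absmax int8 quantisation applied to every linear layer. The function names are illustrative, not the repository's actual API; custom_selective would apply the same step to only a chosen subset of layers.

import torch
import torch.nn as nn

def quantize_absmax_int8(w: torch.Tensor):
    # One scale per tensor: map the largest |weight| to 127.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

@torch.no_grad()
def quantize_whole_model(model: nn.Module) -> nn.Module:
    # Quantise and immediately dequantise every nn.Linear weight, so the
    # model stays directly runnable while carrying the quantisation error.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            q, scale = quantize_absmax_int8(module.weight.data)
            module.weight.data = dequantize(q, scale)
    return model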
The quantised model is saved to quantized. Run the evaluate module the same way to evaluate it:
python -m src.evaluate --q_type <type>
Quantised models can be found here: https://drive.google.com/drive/folders/1lHQnaPGtltS_SNNqdw4MLhvGHB0xKP1l?usp=sharing
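As a rough illustration, evaluation of a causal LM like gpt-neo is often reported as perplexity; a sketch of such a metric is below. The dataset and the exact procedure used by src.evaluate are not shown here and are assumptions.

import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts, max_length=512):
    # Corpus perplexity as a length-weighted average of per-text NLL.
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True,
                        max_length=max_length).to(model.device)
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].numel() - 1    # labels are shifted by one position
        total_nll += out.loss.item() * n
        total_tokens += n
    return math.exp(total_nll / total_tokens)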
reference: ggml-org/llama.cpp#2948
Set up the llama.cpp submodule stored in the llama.cpp directory as below:
git submodule init
git submodule update
The remaining commands assume you're in the llama.cpp directory:
cd llama.cpp
Build the executables by referring to the instructions in the original llama.cpp repository.
Note: huggingface-hub is required to download and upload models.
Download hf-smol-135m from Hugging Face to quantise:
python download.py
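download.py is presumably a thin wrapper around huggingface_hub; a minimal sketch of what such a script could look like is below. The repo id HuggingFaceTB/SmolLM-135M and the local directory name are assumptions, chosen to match the hf-smol folder the conversion step expects.

from huggingface_hub import snapshot_download

# Fetch the checkpoint into a local folder that convert_hf_to_gguf.py can read.
snapshot_download(
    repo_id="HuggingFaceTB/SmolLM-135M",  # assumed source repo
    local_dir="hf-smol",
)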
Quantise the model using llama.cpp:
python llama.cpp/convert_hf_to_gguf.py hf-smol \
    --outfile hf-smol.gguf \
    --outtype q8_0
Prompt the model with whatever input you want using the llama-cli executable:
./llama.cpp/build/bin/llama-cli -m hf-smol.gguf -p "What is life?"
If you want, upload the model to the Hugging Face Hub by referring to and modifying upload.py as required:
python upload.py
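upload.py can similarly be a thin huggingface_hub wrapper; a sketch of one way to push the GGUF file is below. The target repo id is a placeholder to replace with your own account and repo name.

from huggingface_hub import HfApi

api = HfApi()

# Create the target repo if it doesn't exist yet (repo id is a placeholder).
api.create_repo(repo_id="your-username/hf-smol-q8_0-gguf", exist_ok=True)

# Push the quantised GGUF file produced by convert_hf_to_gguf.py.
api.upload_file(
    path_or_fileobj="hf-smol.gguf",
    path_in_repo="hf-smol.gguf",
    repo_id="your-username/hf-smol-q8_0-gguf",
)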