Alpaca + Codellama 34b full example.ipynb - Colab
To run this, press "Runtime" and press "Run all" on a free Tesla T4 Google Colab instance!
Join Discord if you need help + support us if you can!
To install Unsloth on your own computer, follow the installation instructions on our Github page here.
You will learn how to do data prep, how to train, how to run the model, and how to save it (e.g. for Llama.cpp).
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# Must install separately since Colab has torch 2.2.1, which breaks packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass
We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes, etc.
And Yi, Qwen (llamafied), Deepseek, and all Llama- and Mistral-derived architectures.
We support 16bit LoRA or 4bit QLoRA. Both are 2x faster.
max_seq_length can be set to anything, since we do automatic RoPE Scaling via kaiokendev's method.
With PR 26037, we support downloading 4bit models 4x faster! Our repo has Llama, Mistral 4bit models.
[NEW] We make Gemma (trained on 6 trillion tokens) 2.5x faster! See our Gemma notebook.
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/codellama-34b-bnb-4bit", # "codellama/CodeLlama-34b-hf" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
/usr/local/lib/python3.10/dist-packages/unsloth/__init__.py:67: UserWarning: CUDA is not link
We shall run `ldconfig /usr/lib64-nvidia` to try to fix it.
warnings.warn(
config.json: 100% 1.10k/1.10k [00:00<00:00, 92.7kB/s]
==((====))==  Unsloth: Fast Llama patching release 2023.12
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB
O^O/ \_/ \    CUDA capability = 8.0. Xformers = 0.0.22.post7. FA = True.
\        /    Pytorch version: 2.1.0+cu121. CUDA Toolkit = 12.1
 "-____-"     bfloat16 = TRUE. Platform = Linux
You passed `quantization_config` to `from_pretrained` but the model you're loading already ha
model.safetensors.index.json: 100% 198k/198k [00:00<00:00, 12.4MB/s]
Downloading shards: 100% 4/4 [14:45<00:00, 209.39s/it]
model-00001-of-00004.safetensors: 100% 4.98G/4.98G [03:54<00:00, 21.5MB/s]
model-00002-of-00004.safetensors: 100% 5.00G/5.00G [04:14<00:00, 18.1MB/s]
model-00003-of-00004.safetensors: 100% 5.00G/5.00G [03:54<00:00, 19.7MB/s]
model-00004-of-00004.safetensors: 100% 3.21G/3.21G [02:38<00:00, 21.4MB/s]
Loading checkpoint shards: 100% 4/4 [00:06<00:00, 1.54s/it]
generation_config.json: 100% 116/116 [00:00<00:00, ...]
We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
)
Unsloth 2023.12 patched 48 layers with 48 QKV layers, 48 O layers and 48 MLP layers.
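To check the "only 1 to 10% of all parameters" claim for your own configuration, PEFT models expose a helper that reports the trainable-parameter count. A minimal sketch, assuming the object returned by get_peft_model behaves like a standard PEFT PeftModel:

# Report how many parameters the LoRA adapters actually train.
# Output looks like: trainable params: ... || all params: ... || trainable%: ...
model.print_trainable_parameters()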
Data Prep
We now use the Alpaca dataset from yahma, which is a filtered version of 52K of the original Alpaca dataset. You can replace this code section
with your own data prep.
[NOTE] To train only on completions (ignoring the user's input) read TRL's docs here.
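As a hedged illustration of that TRL feature (not part of this notebook): completion-only training masks the prompt tokens out of the loss with a DataCollatorForCompletionOnlyLM. The response_template below is an assumption matching the Alpaca-style "### Response:" marker used later in this notebook.

from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

# Compute the loss only on tokens after the response marker,
# ignoring the instruction and input portions of each example.
collator = DataCollatorForCompletionOnlyLM(
    response_template = "### Response:",
    tokenizer = tokenizer,
)

# Then pass it to the trainer defined further below:
# trainer = SFTTrainer(..., data_collator = collator, packing = False)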
Alpaca dataset preparation code
Downloading readme: 100% 11.6k/11.6k [00:00<00:00, 855kB/s]
Downloading data: 100% 44.3M/44.3M [00:02<00:00, 23.6MB/s]
Generating train split: 51760/0 [00:00<00:00, 137153.36 examples/s]
Map: 100% 51760/51760 [00:00<00:00, 102802.36 examples/s]
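The data-prep cell above is collapsed in this export. A minimal sketch of what Unsloth's Alpaca notebooks typically do there follows; the dataset id, prompt template, and function name are assumptions, so adapt them to your own data. The template matches the prompt visible in the generation output at the end of this notebook, and the resulting "text" column is what the trainer consumes.

from datasets import load_dataset

# Assumed Alpaca-style prompt template, referenced later as `alpaca_prompt`.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # append EOS, otherwise generation never stops

def formatting_prompts_func(examples):
    # Build one training string per (instruction, input, output) triple.
    texts = []
    for instruction, inp, out in zip(examples["instruction"], examples["input"], examples["output"]):
        texts.append(alpaca_prompt.format(instruction, inp, out) + EOS_TOKEN)
    return {"text": texts}

# Assumed dataset id for yahma's filtered 52K Alpaca data.
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True)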
Train the model
Now let's use Huggingface TRL's SFTTrainer! More docs here: TRL SFT docs. We do 120 steps to speed things up, but you can set num_train_epochs = 1 for a full run and turn off max_steps by setting it to None. We also support TRL's DPOTrainer (a hedged sketch follows the trainer cell below)!
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 120,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
Special tokens have been added in the vocabulary, make sure the associated word embeddings ar
Map: 100% 51760/51760 [00:09<00:00, 5660.67 examples/s]
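DPOTrainer is only mentioned in passing above; for orientation, here is a hedged, minimal sketch of how it is typically wired up in TRL versions contemporary with this notebook. dpo_dataset is a hypothetical preference dataset with "prompt", "chosen" and "rejected" columns, and the hyperparameters are placeholders, not recommendations.

from trl import DPOTrainer
from transformers import TrainingArguments

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None, # None lets TRL derive a reference model for LoRA/PEFT setups
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        learning_rate = 5e-6,
        output_dir = "dpo_outputs",
    ),
    beta = 0.1, # DPO temperature
    train_dataset = dpo_dataset,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
# dpo_trainer.train()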
Show current memory stats
GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.564 GB.
17.791 GB of memory reserved.
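The collapsed cell that prints these figures can be reproduced with standard torch.cuda calls; a sketch (variable names are assumptions), recording the reserved memory before training so the training-only usage can be derived afterwards:

import torch

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")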
trainer_stats = trainer.train()
You're using a CodeLlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, us
Unsloth: `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=F
[120/120 16:48, Epoch 0/1]
Step Training Loss
1 1.589300
2 1.634000
3 1.608500
4 1.398300
5 1.652000
6 1.335000
7 1.408300
8 1.282900
9 1.352700
10 1.055300
11 1.100300
12 0.997700
13 1.052700
14 0.966900
15 0.882100
16 0.863700
17 0.846300
18 0.886200
19 0.724900
20 1.072100
21 0.856300
22 0.827600
23 0.875100
24 0.937700
25 0.886100
26 0.885200
27 0.992800
28 0.848600
Show final memory and time stats
1020.2541 seconds used for training.
17.0 minutes used for training.
Peak reserved memory = 23.98 GB.
Peak reserved memory for training = 6.189 GB.
Peak reserved memory % of max memory = 60.611 %.
Peak reserved memory for training % of max memory = 15.643 %.
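Likewise, the collapsed cell behind these final numbers can be reconstructed from trainer_stats and torch.cuda; a sketch, reusing start_gpu_memory and max_memory from the earlier memory-stats sketch:

used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")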
Inference
Let's run the model! You can change the instruction and input - leave the output blank!
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
]*1, return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(outputs)
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1547: UserWarning: You have modified the pretrained model configuration
warnings.warn(
['<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nContinue the fibonnaci sequence.\n\n### Input:\n1, 1, 2, 3, 5, 8\n\n### Response:\n1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393']
Saving, loading finetuned models
To save the final model, either use Huggingface's push_to_hub for an online save or save_pretrained for a local save.
To save to GGUF / llama.cpp, or for model merging, use model.merge_and_unload first, then save the model. Maxime Labonne's llm-course has a nice tutorial on converting HF to GGUF! This issue might be helpful for more info.
model.save_pretrained("lora_model") # Local saving
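For completeness, a hedged sketch of the online save and the merge step described above. The repo name and token are placeholders; merge_and_unload is the standard PEFT call, and merging a 4bit-quantized base may require reloading the model in 16bit first (see the 16bit loading comment in the from_pretrained cell).

# Online saving to the Hugging Face Hub (placeholder repo name and token).
# model.push_to_hub("your_name/codellama-34b-alpaca-lora", token = "hf_...")
# tokenizer.push_to_hub("your_name/codellama-34b-alpaca-lora", token = "hf_...")

# For GGUF / llama.cpp conversion or model merging, fold the LoRA weights
# back into the base model, then save the merged weights.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("merged_model")
tokenizer.save_pretrained("merged_model")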