Alpaca + Codellama 34b full example.ipynb - Colab
To run this, press "Runtime" and press "Run all" on a free Tesla T4 Google Colab instance!
Join Discord if you need help + support us if you can!
To install Unsloth on your own computer, follow the installation instructions on our Github page here.
You will learn how to do data prep, how to train, how to run the model, and how to save it (e.g. for Llama.cpp).
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# Must install separately since Colab has torch 2.2.1, which breaks packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass
We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes, etc.
And Yi, Qwen (llamafied), Deepseek, and all Llama- and Mistral-derived architectures.
We support 16bit LoRA or 4bit QLoRA. Both are 2x faster.
max_seq_length can be set to anything, since we do automatic RoPE Scaling via kaiokendev's method.
With PR 26037, we support downloading 4bit models 4x faster! Our repo has Llama, Mistral 4bit models.
[NEW] We make Gemma (trained on 6 trillion tokens) 2.5x faster! See our Gemma notebook.
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/codellama-34b-bnb-4bit", # "codellama/CodeLlama-34b-hf" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
/usr/local/lib/python3.10/dist-packages/unsloth/__init__.py:67: UserWarning: CUDA is not link
We shall run `ldconfig /usr/lib64-nvidia` to try to fix it.
warnings.warn(
config.json: 100% 1.10k/1.10k [00:00<00:00, 92.7kB/s]
==((====))==  Unsloth: Fast Llama patching release 2023.12
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB
O^O/ \_/ \    CUDA capability = 8.0. Xformers = 0.0.22.post7. FA = True.
\        /    Pytorch version: 2.1.0+cu121. CUDA Toolkit = 12.1
 "-____-"     bfloat16 = TRUE. Platform = Linux
You passed `quantization_config` to `from_pretrained` but the model you're loading already ha
model.safetensors.index.json: 100% 198k/198k [00:00<00:00, 12.4MB/s]
Downloading shards: 100% 4/4 [14:45<00:00, 209.39s/it]
model-00001-of-00004.safetensors: 100% 4.98G/4.98G [03:54<00:00, 21.5MB/s]
model-00002-of-00004.safetensors: 100% 5.00G/5.00G [04:14<00:00, 18.1MB/s]
model-00003-of-00004.safetensors: 100% 5.00G/5.00G [03:54<00:00, 19.7MB/s]
model-00004-of-00004.safetensors: 100% 3.21G/3.21G [02:38<00:00, 21.4MB/s]
Loading checkpoint shards: 100% 4/4 [00:06<00:00, 1.54s/it]
generation_config.json: 100% 116/116 [00:00<00:00, ...]
We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
)
Unsloth 2023.12 patched 48 layers with 48 QKV layers, 48 O layers and 48 MLP layers.
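To check the "only 1 to 10% of all parameters" claim for your own configuration, PEFT models expose a helper that reports the trainable-parameter count. A minimal sketch, assuming the object returned by get_peft_model behaves like a standard PEFT PeftModel:

# Report how many parameters the LoRA adapters actually train.
# Output looks like: trainable params: ... || all params: ... || trainable%: ...
model.print_trainable_parameters()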
Data Prep
We now use the Alpaca dataset from yahma, which is a filtered version of 52K of the original Alpaca dataset. You can replace this code section
with your own data prep.
[NOTE] To train only on completions (ignoring the user's input) read TRL's docs here.
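As a hedged illustration of that TRL feature (not part of this notebook): completion-only training masks the prompt tokens out of the loss with a DataCollatorForCompletionOnlyLM. The response_template below is an assumption matching the Alpaca-style "### Response:" marker used later in this notebook.

from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

# Compute the loss only on tokens after the response marker,
# ignoring the instruction and input portions of each example.
collator = DataCollatorForCompletionOnlyLM(
    response_template = "### Response:",
    tokenizer = tokenizer,
)

# Then pass it to the trainer defined further below:
# trainer = SFTTrainer(..., data_collator = collator, packing = False)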
Alpaca dataset preparation code
Downloading readme: 100% 11.6k/11.6k [00:00<00:00, 855kB/s]
Downloading data: 100% 44.3M/44.3M [00:02<00:00, 23.6MB/s]
Generating train split: 51760/0 [00:00<00:00, 137153.36 examples/s]
Map: 100% 51760/51760 [00:00<00:00, 102802.36 examples/s]
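The data-prep cell above is collapsed in this export. A minimal sketch of what Unsloth's Alpaca notebooks typically do there follows; the dataset id, prompt template, and function name are assumptions, so adapt them to your own data. The template matches the prompt visible in the generation output at the end of this notebook, and the resulting "text" column is what the trainer consumes.

from datasets import load_dataset

# Assumed Alpaca-style prompt template, referenced later as `alpaca_prompt`.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # append EOS, otherwise generation never stops

def formatting_prompts_func(examples):
    # Build one training string per (instruction, input, output) triple.
    texts = []
    for instruction, inp, out in zip(examples["instruction"], examples["input"], examples["output"]):
        texts.append(alpaca_prompt.format(instruction, inp, out) + EOS_TOKEN)
    return {"text": texts}

# Assumed dataset id for yahma's filtered 52K Alpaca data.
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True)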
Train the model
Now let's use Huggingface TRL's SFTTrainer! More docs here: TRL SFT docs. We do 120 steps to speed things up, but you can set num_train_epochs = 1 for a full run and turn off max_steps by setting it to None. We also support TRL's DPOTrainer (a hedged sketch follows the trainer cell below)!
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 120,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
Special tokens have been added in the vocabulary, make sure the associated word embeddings ar
Map: 100% 51760/51760 [00:09<00:00, 5660.67 examples/s]
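DPOTrainer is only mentioned in passing above; for orientation, here is a hedged, minimal sketch of how it is typically wired up in TRL versions contemporary with this notebook. dpo_dataset is a hypothetical preference dataset with "prompt", "chosen" and "rejected" columns, and the hyperparameters are placeholders, not recommendations.

from trl import DPOTrainer
from transformers import TrainingArguments

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None, # None lets TRL derive a reference model for LoRA/PEFT setups
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        learning_rate = 5e-6,
        output_dir = "dpo_outputs",
    ),
    beta = 0.1, # DPO temperature
    train_dataset = dpo_dataset,
    tokenizer = tokenizer,
    max_length = 1024,
    max_prompt_length = 512,
)
# dpo_trainer.train()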
Show current memory stats
GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.564 GB.
17.791 GB of memory reserved.
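The collapsed cell that prints these figures can be reproduced with standard torch.cuda calls; a sketch (variable names are assumptions), recording the reserved memory before training so the training-only usage can be derived afterwards:

import torch

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")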
trainer_stats = trainer.train()
You're using a CodeLlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, us
Unsloth: `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=F
[120/120 16:48, Epoch 0/1]
Step Training Loss
1 1.589300
2 1.634000
3 1.608500
4 1.398300
5 1.652000
6 1.335000
7 1.408300
8 1.282900
9 1.352700
10 1.055300
11 1.100300
12 0.997700
13 1.052700
14 0.966900
15 0.882100
16 0.863700
17 0.846300
18 0.886200
19 0.724900
20 1.072100
21 0.856300
22 0.827600
23 0.875100
24 0.937700
25 0.886100
26 0.885200
27 0.992800
28 0.848600
Show final memory and time stats
1020.2541 seconds used for training.
17.0 minutes used for training.
Peak reserved memory = 23.98 GB.
Peak reserved memory for training = 6.189 GB.
Peak reserved memory % of max memory = 60.611 %.
Peak reserved memory for training % of max memory = 15.643 %.
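Likewise, the collapsed cell behind these final numbers can be reconstructed from trainer_stats and torch.cuda; a sketch, reusing start_gpu_memory and max_memory from the earlier memory-stats sketch:

used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")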
Inference
Let's run the model! You can change the instruction and input - leave the output blank!
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
]*1, return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(outputs)
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1547: UserWarning: You have modified the pretrained model configuration
warnings.warn(
['<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nContinue the fibonnaci sequence.\n\n### Input:\n1, 1, 2, 3, 5, 8\n\n### Response:\n1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393']
Saving, loading finetuned models
To save the final model, either use Huggingface's push_to_hub for an online save or save_pretrained for a local save.
To save to GGUF / llama.cpp, or for model merging, use model.merge_and_unload first, then save the model. Maxime Labonne's llm-course has a nice tutorial on converting HF to GGUF! This issue might be helpful for more info.
model.save_pretrained("lora_model") # Local saving
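For completeness, a hedged sketch of the online save and the merge step described above. The repo name and token are placeholders; merge_and_unload is the standard PEFT call, and merging a 4bit-quantized base may require reloading the model in 16bit first (see the 16bit loading comment in the from_pretrained cell).

# Online saving to the Hugging Face Hub (placeholder repo name and token).
# model.push_to_hub("your_name/codellama-34b-alpaca-lora", token = "hf_...")
# tokenizer.push_to_hub("your_name/codellama-34b-alpaca-lora", token = "hf_...")

# For GGUF / llama.cpp conversion or model merging, fold the LoRA weights
# back into the base model, then save the merged weights.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("merged_model")
tokenizer.save_pretrained("merged_model")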