Commit 332e6e5

Add 💫Starcoder in ggml
1 parent 0198211 commit 332e6e5

File tree

6 files changed: +1400 -1 lines changed


.gitignore

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
*.o
*.a
.cache/
.vs/
.vscode/
.DS_Store

build/
build-em/
build-debug/
build-release/
build-static/
build-no-accel/
build-sanitize-addr/
build-sanitize-thread/

models/*

/main
/quantize

arm_neon.h
compile_commands.json

CMakeLists.txt

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
#
# starcoder

set(TEST_TARGET starcoder)
add_executable(${TEST_TARGET} main.cpp)
target_link_libraries(${TEST_TARGET} PRIVATE ggml common common-ggml)

#
# starcoder-quantize

set(TEST_TARGET starcoder-quantize)
add_executable(${TEST_TARGET} quantize.cpp)
target_link_libraries(${TEST_TARGET} PRIVATE ggml common common-ggml)

README.md

Lines changed: 112 additions & 1 deletion
@@ -1 +1,112 @@
The old one-line README (`# StarCoder CPP`) is replaced with the following:

# 💫StarCoder in C++

This is a C++ example running 💫 StarCoder inference using the [ggml](https://github.com/ggerganov/ggml) library.

The program runs on the CPU - no video card is required.

The example supports the following 💫 StarCoder models:

- `bigcode/starcoder`
- `bigcode/gpt_bigcode-santacoder` aka the smol StarCoder

Sample performance on MacBook M1 Pro:

TODO

Sample output:

```
$ ./bin/starcoder -h
usage: ./bin/starcoder [options]

options:
  -h, --help            show this help message and exit
  -s SEED, --seed SEED  RNG seed (default: -1)
  -t N, --threads N     number of threads to use during computation (default: 8)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: random)
  -n N, --n_predict N   number of tokens to predict (default: 200)
  --top_k N             top-k sampling (default: 40)
  --top_p N             top-p sampling (default: 0.9)
  --temp N              temperature (default: 1.0)
  -b N, --batch_size N  batch size for prompt processing (default: 8)
  -m FNAME, --model FNAME
                        model path (default: models/starcoder-117M/ggml-model.bin)

$ ./bin/starcoder -m ../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin -p "def fibonnaci(" -t 4 --top_k 0 --top_p 0.95 --temp 0.2
main: seed = 1683881276
starcoder_model_load: loading model from '../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin'
starcoder_model_load: n_vocab = 49280
starcoder_model_load: n_ctx   = 2048
starcoder_model_load: n_embd  = 2048
starcoder_model_load: n_head  = 16
starcoder_model_load: n_layer = 24
starcoder_model_load: ftype   = 3
starcoder_model_load: ggml ctx size = 1794.90 MB
starcoder_model_load: memory size = 768.00 MB, n_mem = 49152
starcoder_model_load: model size  = 1026.83 MB
main: prompt: 'def fibonnaci('
main: number of tokens in prompt = 7, first 8 tokens: 563 24240 78 2658 64 2819 7

def fibonnaci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

print(fibo(10))

main: mem per token = 9597928 bytes
main:     load time =  480.43 ms
main:   sample time =   26.21 ms
main:  predict time = 3987.95 ms / 19.36 ms per token
main:    total time = 4580.56 ms
```

## Quick start
```bash
git clone https://github.com/ggerganov/ggml
cd ggml

# Convert HF model to ggml
python examples/starcoder/convert-hf-to-ggml.py bigcode/gpt_bigcode-santacoder

# Build ggml + examples
mkdir build && cd build
cmake .. && make -j4 starcoder starcoder-quantize

# quantize the model
./bin/starcoder-quantize ../models/bigcode/gpt_bigcode-santacoder-ggml.bin ../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin 3

# run inference
./bin/starcoder -m ../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin -p "def fibonnaci(" --top_k 0 --top_p 0.95 --temp 0.2
```

## Downloading and converting the original models (💫 StarCoder)

You can download the original model and convert it to `ggml` format using the script `convert-hf-to-ggml.py`:

```
# Convert HF model to ggml
python examples/starcoder/convert-hf-to-ggml.py bigcode/gpt_bigcode-santacoder
```

This conversion requires that you have Python and Transformers installed on your computer.

## Quantizing the models

You can also try to quantize the `ggml` models via 4-bit integer quantization.

```
# quantize the model
./bin/starcoder-quantize ../models/bigcode/gpt_bigcode-santacoder-ggml.bin ../models/bigcode/gpt_bigcode-santacoder-ggml-q4_1.bin 3
```

| Model | Original size | Quantized size | Quantization type |
| --- | --- | --- | --- |
| `bigcode/gpt_bigcode-santacoder` | 5396.45 MB | 1026.83 MB | 4-bit integer (q4_1) |
| `bigcode/starcoder` | 71628.23 MB | 13596.23 MB | 4-bit integer (q4_1) |
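
The 4-bit formats themselves are implemented inside ggml and applied by the `starcoder-quantize` tool above. Purely as an illustration of the idea behind `q4_1` (each block of weights keeps a minimum and a scale, and the values are stored as 4-bit indices), here is a small numpy sketch; the block size of 32 and the rounding details are simplifications for clarity, not ggml's exact on-disk layout.

```python
import numpy as np

def quantize_q4_1_like(x, block_size=32):
    # split the weights into blocks and keep a per-block min and scale
    x = x.reshape(-1, block_size)
    mins = x.min(axis=1, keepdims=True)
    scales = (x.max(axis=1, keepdims=True) - mins) / 15.0  # 4 bits -> 16 levels
    scales = np.where(scales == 0, 1.0, scales)            # avoid division by zero
    q = np.clip(np.round((x - mins) / scales), 0, 15).astype(np.uint8)
    return q, scales, mins

def dequantize_q4_1_like(q, scales, mins):
    return q * scales + mins

weights = np.random.randn(4, 32).astype(np.float32)
q, s, m = quantize_q4_1_like(weights)
restored = dequantize_q4_1_like(q, s, m).reshape(weights.shape)
print("max abs error:", np.abs(restored - weights).max())
```

This also makes the size ratio in the table plausible: 16-bit weights shrink to roughly a quarter once each value takes 4 bits plus a small per-block overhead.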

convert-hf-to-ggml.py

Lines changed: 212 additions & 0 deletions
@@ -0,0 +1,212 @@
# Convert HF models to ggml format
#

import sys
import struct
import json
import torch
import numpy as np
import re
import os

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig

# ref: https://github.com/openai/gpt-2/blob/master/src/encoder.py
def bytes_to_unicode():
    """
    Returns a list of utf-8 bytes and a corresponding list of unicode strings.
    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a significant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    And avoids mapping to whitespace/control characters the bpe code barfs on.
    """
    bs = list(range(ord("!"), ord("~")+1)) + list(range(ord("¡"), ord("¬")+1)) + list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))

if len(sys.argv) < 2:
    print("Usage: python convert-hf-to-ggml.py hf-model-name [use-f32]")
    print("Example: python convert-hf-to-ggml.py bigcode/gpt_bigcode-santacoder")
    print("Example: python convert-hf-to-ggml.py bigcode/starcoder")
    sys.exit(1)

model_name = sys.argv[1].strip()
fname_out = "models/" + sys.argv[1].strip() + "-ggml.bin"
os.makedirs(os.path.dirname(fname_out), exist_ok=True)

# use 16-bit or 32-bit floats
use_f16 = True
if len(sys.argv) > 2:
    use_f16 = False

print("Loading model: ", model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
hparams = config.to_dict()
model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16 if use_f16 else torch.float32, low_cpu_mem_usage=True, trust_remote_code=True, offload_state_dict=True)
print("Model loaded: ", model_name)

#print(model)

list_vars = model.state_dict()
#print(list_vars)

encoder = tokenizer.vocab
# Add added_tokens (special tokens) to the encoder
encoder.update(tokenizer.get_added_vocab())
print(hparams)

print("Saving ggml model to: ", fname_out)
fout = open(fname_out, "wb")

fout.write(struct.pack("i", 0x67676d6c)) # magic: ggml in hex
vocab_size = hparams["vocab_size"]
fout.write(struct.pack("i", vocab_size))
# fout.write(struct.pack("i", len(encoder)))
fout.write(struct.pack("i", hparams["n_positions"]))
fout.write(struct.pack("i", hparams["n_embd"]))
fout.write(struct.pack("i", hparams["n_head"]))
fout.write(struct.pack("i", hparams["n_layer"]))
fout.write(struct.pack("i", use_f16))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

fout.write(struct.pack("i", vocab_size))

counter = 0
# sort by value
for key in sorted(encoder, key=encoder.get):
    text = bytearray([byte_decoder[c] for c in key])
    fout.write(struct.pack("i", len(text)))
    fout.write(text)
    counter += 1

# TODO: Repeat last token until vocab_size
while counter < vocab_size:
    fout.write(struct.pack("i", len(text)))
    fout.write(text)
    counter += 1
# assert counter == config.vocab_size

for name in list_vars.keys():
    data = list_vars[name].squeeze().numpy()
    print("Processing variable: " + name + " with shape: ", data.shape)

    # rename headers to keep compatibility
    if name == "transformer.ln_f.weight":
        name = "model/ln_f/g"
    elif name == "transformer.ln_f.bias":
        name = "model/ln_f/b"
    elif name == "transformer.wte.weight":
        name = "model/wte"
    elif name == "transformer.wpe.weight":
        name = "model/wpe"
    elif name == "lm_head.weight":
        name = "model/lm_head"
    elif re.match(r"transformer\.h\.\d+\.ln_1\.weight", name):
        i = re.findall(r"\d+", name)[0]
        name = f"model/h{i}/ln_1/g"
    elif re.match(r"transformer\.h\.\d+\.ln_1\.bias", name):
        i = re.findall(r"\d+", name)[0]
        name = f"model/h{i}/ln_1/b"
    elif re.match(r"transformer\.h\.\d+\.attn\.c_attn\.weight", name):
        i = re.findall(r"\d+", name)[0]
        name = f"model/h{i}/attn/c_attn/w"
    elif re.match(r"transformer\.h\.\d+\.attn\.c_attn\.bias", name):
        i = re.findall(r"\d+", name)[0]
        name = f"model/h{i}/attn/c_attn/b"
    elif re.match(r"transformer\.h\.\d+\.attn\.c_proj\.weight", name):
        i = re.findall(r"\d+", name)[0]
        name = f"model/h{i}/attn/c_proj/w"
    elif re.match(r"transformer\.h\.\d+\.attn\.c_proj\.bias", name):
        i = re.findall(r"\d+", name)[0]
        name = f"model/h{i}/attn/c_proj/b"
    elif re.match(r"transformer\.h\.\d+\.ln_2\.weight", name):
        i = re.findall(r"\d+", name)[0]
        name = f"model/h{i}/ln_2/g"
    elif re.match(r"transformer\.h\.\d+\.ln_2\.bias", name):
        i = re.findall(r"\d+", name)[0]
        name = f"model/h{i}/ln_2/b"
    elif re.match(r"transformer\.h\.\d+\.mlp\.c_fc\.weight", name):
        i = re.findall(r"\d+", name)[0]
        name = f"model/h{i}/mlp/c_fc/w"
    elif re.match(r"transformer\.h\.\d+\.mlp\.c_fc\.bias", name):
        i = re.findall(r"\d+", name)[0]
        name = f"model/h{i}/mlp/c_fc/b"
    elif re.match(r"transformer\.h\.\d+\.mlp\.c_proj\.weight", name):
        i = re.findall(r"\d+", name)[0]
        name = f"model/h{i}/mlp/c_proj/w"
    elif re.match(r"transformer\.h\.\d+\.mlp\.c_proj\.bias", name):
        i = re.findall(r"\d+", name)[0]
        name = f"model/h{i}/mlp/c_proj/b"
    else:
        print("Unrecognized variable name: " + name)

    # we don't need these
    if name.endswith("attn.masked_bias") or name.endswith(".attn.bias"):
        print("  Skipping variable: " + name)
        continue

    n_dims = len(data.shape)

    # ftype == 0 -> float32, ftype == 1 -> float16
    ftype = 0
    if use_f16:
        if (name == "model/wte" or name == "model/lm_head" or name[-2:] == "/g" or name[-2:] == "/w") and n_dims == 2:
            print("  Converting to float16")
            data = data.astype(np.float16)
            ftype = 1
        else:
            print("  Converting to float32")
            data = data.astype(np.float32)
            ftype = 0

    # 2D tensors of interest:
    #   model/h.*/attn/c_attn/w
    #   model/h.*/attn/c_proj/w
    #   model/h.*/mlp/c_fc/w
    #   model/h.*/mlp/c_proj/w
    # The HF checkpoints store attention in MQA form (a single shared K/V head);
    # duplicate K and V so the ggml graph can treat it as regular MHA.
    if name[-14:] == "/attn/c_attn/w" or name[-14:] == "/attn/c_attn/b":
        print("  Duplicate K,V heads to use MHA instead of MQA")

        embed_dim = hparams["n_embd"]
        head_dim = embed_dim // hparams["n_head"]

        # ((n_heads + 2) * head_dim, hidden_dim) -> (3 * n_heads * head_dim, hidden_dim)
        q, k, v = np.split(data, (hparams["n_head"] * head_dim, (hparams["n_head"] + 1) * head_dim), axis=0)
        # duplicate k, v along the first axis (head_dim, hidden_dim) -> (n_heads * head_dim, hidden_dim)
        if len(k.shape) == 2:
            k = np.tile(k, (hparams["n_head"], 1))
            v = np.tile(v, (hparams["n_head"], 1))
        elif len(k.shape) == 1:
            k = np.tile(k, (hparams["n_head"]))
            v = np.tile(v, (hparams["n_head"]))
        # concat q, k, v along the first axis (n_heads * head_dim, hidden_dim) -> (3 * n_heads * head_dim, hidden_dim)
        data = np.concatenate((q, k, v), axis=0)

    # header: number of dims, length of the name, ftype, then the (reversed) shape and the name
    str_name = name.encode('utf-8')
    fout.write(struct.pack("iii", n_dims, len(str_name), ftype))
    for i in range(n_dims):
        fout.write(struct.pack("i", data.shape[n_dims - 1 - i]))
    fout.write(str_name)

    # data
    data.tofile(fout)

fout.close()

print("Done. Output file: " + fname_out)
print("")
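
The header written above is just a sequence of packed 32-bit integers, followed by the length-prefixed vocabulary and the tensors. As a quick sanity check after running the conversion, you can read the hyperparameters back; this is a minimal sketch based on the write calls in the script (the file path is the example output from the commands above):

```python
# Minimal sanity check: read back the 7-int header written by convert-hf-to-ggml.py
# (magic, vocab_size, n_positions, n_embd, n_head, n_layer, use_f16).
import struct

with open("models/bigcode/gpt_bigcode-santacoder-ggml.bin", "rb") as f:
    magic, n_vocab, n_positions, n_embd, n_head, n_layer, f16 = struct.unpack("7i", f.read(7 * 4))

assert magic == 0x67676d6c, "not a ggml file produced by this script"
print(f"n_vocab={n_vocab} n_positions={n_positions} n_embd={n_embd} "
      f"n_head={n_head} n_layer={n_layer} f16={f16}")
```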

0 commit comments
