pip install -q transformers

Fine-tuning

Get the StarCoderBase Megatron-LM checkpoint: git clone https://huggingface.co/bigcode/starcoderbase-megatron
Get Megatron-LM: git clone -b mtf https://github.com/bigcode-project/Megatron-LM
Prepare a Python environment with PyTorch. (TODO: There may be some other packages needed that you will find out about when training fails)
Prepare dataset: Preapre a finetuning dataset in the form of a single jsonl file with two keys: inputs & outputs. inputs should contain the prompt and instruction while outputs contains the targets. Loss will only be computed over outputs. See dataset/commits_to_jsonl.py for an example of doing this. In that example we put the instruction (commit message) in the target, but it's better to put it in the input.
Tokenize the fine-tuning dataset by modifying dataset/preprocess.sh to point to your jsonl dataset. Also modify the path of the tokenizer, in our case point to the StarCoder's tokenizer.json (wget https://huggingface.co/bigcode/starcoderbase/raw/main/tokenizer.json). Finally specify an output prefix where the tokenized data will be stored. Then run it with bash dataset/preprocess.sh.
Create two files train_data_paths.txt.tmp and valid_data_paths.txt.tmp that contain the paths to the above created tokenized dataset. For example they could look like "train: 1.0 0:0.95 output_prefix" and "valid: 1.0 0.95:1.0 output_prefix. In this case the dataset is split into 95% training and 5% validation. The first number is the weight of the dataset, the second number is the start of the dataset and the third number is the end of the dataset.
Rename the checkpoint downloaded to release i.e. mv starcoderbase-megatron/iter* starcoderbase-megatron/release and create a file starcoderbase-megatron/latest_checkpointed_iteration.txt that contains simply release (echo release > starcoderbase-megatron/latest_checkpointed_iteration.txt).
Modify training/finetune_starcoderbase.sh to adapt CHECKPOINT_PATH to point to the downloaded Megatron-LM checkpoint, WEIGHTS_TRAIN & WEIGHTS_VALID to point to the above created txt files, TOKENIZER_FILE to StarCoder's tokenizer.json, point to your environment and cache locations, and modify the SBATCH settings to suit your setup. Then run it with bash training/finetune_starcoderbase.sh. You can interrupt and resume training, however, if you resume, you need to remove --no_load_optim and --no_load_rng from the command line arguments in the script to load the optimizer and random number generator state from the newly saved checkpoint (we only do not want to load them from starcoderbase).
Convert the saved checkpoint using the instructions below.

Checkpoint conversion

Update the paths in convert_large.sh & download the marked repos & run it

Other

for idx in ["00001", "00002", "00003", "00004", "00005", "00006", "00007"]: x = torch.load(f"/gpfsscratch/rech/ajs/commun/Bigcode-large-megatron_conv/base/shard2/pytorch_model-{idx}-of-00007.bin") y = torch.load(f"/gpfsscratch/rech/ajs/commun/starcoderbase/pytorch_model-{idx}-of-00007.bin") assert x.keys() == y.keys() for k in x.keys(): if not((x[k] == y[k]).all()): print(k) print(x[k].shape) print(y[k].shape) break

pip install -q transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "/gpfsscratch/rech/ajs/commun/Bigcode-large-megatron_conv/base/shard"
checkpoint = "/gpfsscratch/rech/ajs/commun/Bigcode-large-megatron_conv/base3/"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, 
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
).to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1)
print(tokenizer.decode(outputs[0]))

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
dataset		dataset
evaluation		evaluation
training		training
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fine-tuning

Checkpoint conversion

Other

pip install -q transformers

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Fine-tuning

Checkpoint conversion

Other

pip install -q transformers

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages