Loading sharded (GGUF) model files from HF with LLama.from_pretrained() 'additional_files' argument #1457

Gnurro · 2024-05-14T14:36:22Z

The added code allows specifying multiple files to load from the HuggingFace Hub in Llama.from_pretrained(). The new 'additional_files' argument takes a list of strings, each handled the same way as the 'filename' string argument. The code could likely be more elegant (e.g. parallel downloads, but I'm not familiar enough with the HF hub library), but it works.
Tested and working on Windows 10 and Ubuntu (inside a Docker stack).
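
For readers who want a picture of the behaviour described above, here is a minimal sketch. It is not the diff from this PR. The helper name `download_model_files` and its injectable `download` parameter are hypothetical, introduced only so the loop can be run offline with a stub. By default it falls back to `hf_hub_download` from the `huggingface_hub` package.

```python
from typing import Callable, List, Optional


def download_model_files(
    repo_id: str,
    filename: str,
    additional_files: Optional[List[str]] = None,
    download: Optional[Callable[..., str]] = None,
) -> List[str]:
    """Download `filename`, then each entry of `additional_files`, in order.

    Returns the local paths in the same order. Hypothetical helper, not part
    of llama-cpp-python.
    """
    if download is None:
        # Imported lazily so the sketch can be exercised without the package.
        from huggingface_hub import hf_hub_download

        download = hf_hub_download

    # Every name is fetched the same way; the first one is the main file.
    names = [filename, *(additional_files or [])]
    return [download(repo_id=repo_id, filename=name) for name in names]
```

The downloads are sequential, which matches the "could be more elegant" remark above. The first returned path would be the one handed to the model loader.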

VinayHajare

Tested with VinayHajare/Meta-Llama3-70B-Instruct-v2-GGUF.
It's working.

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="VinayHajare/Meta-Llama3-70B-Instruct-v2-GGUF",
    filename="Meta-Llama-3-70B-Instruct-v2.Q6_K-00001-of-00002.gguf",
    additional_files=["Meta-Llama-3-70B-Instruct-v2.Q6_K-00002-of-00002.gguf"],
    local_dir="./models",
    flash_attn=True,
    n_gpu_layers=81,
    n_batch=1024,
    n_ctx=8000,
)
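
For repos with many shards, typing every name into 'additional_files' gets tedious. The sketch below derives the remaining shard names from the first one. It assumes the `-NNNNN-of-NNNNN.gguf` suffix seen in the file names above. `additional_shards` is a hypothetical helper, not something this PR or llama-cpp-python provides.

```python
import re
from typing import List

# Matches names like "model.Q6_K-00001-of-00002.gguf" (assumed naming scheme).
_SHARD_RE = re.compile(r"^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<count>\d{5})\.gguf$")


def additional_shards(first_shard: str) -> List[str]:
    """Return the names of shards 2..N given the name of shard 1.

    Returns an empty list when the name does not look like a shard.
    """
    match = _SHARD_RE.match(first_shard)
    if match is None:
        return []
    if int(match["index"]) != 1:
        raise ValueError("expected the first shard (index 00001)")
    count = int(match["count"])
    prefix = match["prefix"]
    return [f"{prefix}-{i:05d}-of-{count:05d}.gguf" for i in range(2, count + 1)]
```

With the file name used above, `additional_shards("Meta-Llama-3-70B-Instruct-v2.Q6_K-00001-of-00002.gguf")` yields a one-element list containing the `00002-of-00002` name, which could be passed as 'additional_files'.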

Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
(…)70B-Instruct-v2.Q6_K-00001-of-00002.gguf: 100%
 32.1G/32.1G [14:58<00:00, 34.7MB/s]
(…)70B-Instruct-v2.Q6_K-00002-of-00002.gguf: 100%
 25.7G/25.7G [09:45<00:00, 58.5MB/s]
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 25 key-value pairs and 723 tensors from ./models/Meta-Llama-3-70B-Instruct-v2.Q6_K-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = models
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   5:                          llama.block_count u32              = 80
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 18
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                                   split.no u16              = 0
llama_model_loader: - kv  23:                                split.count u16              = 2
llama_model_loader: - kv  24:                        split.tensors.count i32              = 723
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q6_K:  562 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 70.55 B
llm_load_print_meta: model size       = 53.91 GiB (6.56 BPW) 
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size =    0.37 MiB
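
As a quick sanity check that both shards were picked up, the figures in the log can be cross-checked with a few lines of arithmetic. This is only a sketch: it assumes the progress bars report decimal gigabytes and that the log's "GiB" means 2^30 bytes, and the inputs are the rounded values printed above.

```python
GIB = 1024 ** 3  # bytes per GiB

model_size_gib = 53.91       # "model size" from the log
n_params = 70.55e9           # "model params" from the log
downloaded_gb = 32.1 + 25.7  # the two progress bars, assumed decimal GB

# Bits per weight: total bits divided by parameter count.
bits_per_weight = model_size_gib * GIB * 8 / n_params

# Model size converted to decimal GB for comparison with the download sizes.
size_gb = model_size_gib * GIB / 1e9

print(f"{bits_per_weight:.2f} BPW, {size_gb:.1f} GB vs {downloaded_gb:.1f} GB downloaded")
```

The computed bits per weight agrees with the 6.56 BPW in the log, and the converted size lands within rounding distance of the two downloads combined, which is consistent with the full model being loaded from both files.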

roG0d · 2024-08-26T23:16:58Z

Hi, I've been tinkering with the from_pretrained method to, first, use GPU layer offloading and, second, download multiple shards of a given Hugging Face model repo. I've also tried to mimic the cache checks for already-downloaded models.

Anyway, I'd like to include these modifications in your PR if you see them as proper additions! It may be good to refine the code a little to meet certain quality standards.
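
The cache check mentioned above could look roughly like the following. This is a sketch under assumptions, not roG0d's actual code: `files_missing_locally` is a hypothetical helper, and it only tests whether each file exists in a local directory, without verifying size or checksum.

```python
import os
from typing import List


def files_missing_locally(local_dir: str, filenames: List[str]) -> List[str]:
    """Return the subset of `filenames` not yet present in `local_dir`.

    Existence check only; a partially downloaded file would still count
    as present.
    """
    return [
        name
        for name in filenames
        if not os.path.isfile(os.path.join(local_dir, name))
    ]
```

Only the names returned here would then need to be downloaded. An empty result means every shard is already on disk.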

Gnurro added 3 commits May 11, 2024 22:39

Add loading sharded GGUF files from HuggingFace with Llama.from_pretrained(additional_files=[...])

0e67a83

Merge branch 'abetlen:main' into main

6f78164

Merge branch 'abetlen:main' into main

34a385a

This was referenced May 14, 2024

Is there support for loading a sharded gguf file ? #1341

Closed

New model entries for May 2024 clp-research/clemcore#92

Merged

Gnurro added 3 commits May 16, 2024 15:58

Merge branch 'abetlen:main' into main

1649d04

Merge branch 'abetlen:main' into main

d6331c0

Merge branch 'abetlen:main' into main

948fa84

VinayHajare reviewed Jul 12, 2024

Merge branch 'abetlen:main' into main

de188b8

Gnurro and others added 2 commits August 31, 2024 12:58

Merge branch 'main' into main

9523cf2

Merge branch 'main' into main

4127d01

abetlen merged commit 84c0920 into abetlen:main Sep 19, 2024
14 checks passed
