
Conversation

arnavgarg1 (Contributor) commented on Jan 19, 2024

Microsoft updated the modeling_phi.py file 1 day ago: https://huggingface.co/microsoft/phi-2/blob/main/modeling_phi.py. They did this for Phi-1, Phi-1_5 and Phi-2.

The net effect of this change is that these checkpoints no longer share the architecture of the originally released models; they now use GQA (huggingface/transformers#28163). This means that Wqkv and out_proj are no longer valid target modules, so the current default LoRA target modules for Phi are incompatible with the latest versions of these models. The failure is silent, leading to poor LoRA fine-tuning performance.

This PR updates the default target modules to match the new model architecture.
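
A quick way to double-check which module names are valid LoRA targets for a given checkpoint is to list its linear submodules (a minimal sketch; filtering on torch.nn.Linear is just one convenient way to enumerate candidates, and the printed names below correspond to the model dumps further down):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# Collect the leaf names of every linear layer -- these are the strings
# that LoraConfig.target_modules is matched against.
linear_names = {
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear)
}
print(sorted(linear_names))
# With the updated modeling_phi.py this prints something like:
# ['dense', 'fc1', 'fc2', 'k_proj', 'lm_head', 'q_proj', 'v_proj']
# The old 'Wqkv' and 'out_proj' names are gone, so defaults that reference
# them silently match nothing.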

Using Transformers 4.36.2

>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00,  6.02s/it]
Some weights of the model checkpoint at microsoft/phi-2 were not used when initializing PhiForCausalLM: ['model.layers.7.self_attn.v_proj.weight', 'model.layers.21.self_attn.k_proj.weight', 'model.layers.25.self_attn.q_proj.bias', 'model.layers.1.self_attn.q_proj.bias', 'model.layers.3.self_attn.q_proj.bias', 'model.layers.11.self_attn.v_proj.weight', 'model.layers.10.self_attn.q_proj.weight', 'model.layers.24.self_attn.q_proj.weight', 'model.layers.7.self_attn.q_proj.bias', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.31.self_attn.k_proj.weight', 'model.layers.20.self_attn.q_proj.bias', 'model.layers.1.self_attn.q_proj.weight', 'model.layers.9.self_attn.q_proj.weight', 'model.layers.18.self_attn.q_proj.weight', 'model.layers.27.self_attn.k_proj.weight', 'model.layers.27.self_attn.q_proj.bias', 'model.layers.21.self_attn.v_proj.bias', 'model.layers.25.self_attn.v_proj.bias', 'model.layers.12.self_attn.k_proj.bias', 'model.layers.21.self_attn.q_proj.bias', 'model.layers.17.self_attn.v_proj.weight', 'model.layers.6.self_attn.k_proj.bias', 'model.layers.16.self_attn.k_proj.bias', 'model.layers.11.self_attn.q_proj.bias', 'model.layers.22.self_attn.v_proj.bias', 'model.layers.27.self_attn.q_proj.weight', 'model.layers.23.self_attn.v_proj.weight', 'model.layers.20.self_attn.v_proj.bias', 'model.layers.5.self_attn.k_proj.weight', 'model.layers.22.self_attn.q_proj.weight', 'model.layers.15.self_attn.v_proj.weight', 'model.layers.13.self_attn.q_proj.bias', 'model.layers.26.self_attn.q_proj.weight', 'model.layers.26.self_attn.k_proj.bias', 'model.layers.8.self_attn.q_proj.weight', 'model.layers.25.self_attn.k_proj.weight', 'model.layers.14.self_attn.k_proj.weight', 'model.layers.4.self_attn.v_proj.weight', 'model.layers.29.self_attn.q_proj.weight', 'model.layers.2.self_attn.q_proj.bias', 'model.layers.31.self_attn.q_proj.weight', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.10.self_attn.v_proj.bias', 'model.layers.13.self_attn.k_proj.bias', 'model.layers.28.self_attn.v_proj.bias', 'model.layers.14.self_attn.q_proj.bias', 'model.layers.0.self_attn.k_proj.bias', 'model.layers.5.self_attn.k_proj.bias', 'model.layers.12.self_attn.v_proj.bias', 'model.layers.8.self_attn.v_proj.weight', 'model.layers.13.self_attn.v_proj.bias', 'model.layers.3.self_attn.k_proj.bias', 'model.layers.11.self_attn.q_proj.weight', 'model.layers.26.self_attn.v_proj.weight', 'model.layers.15.self_attn.q_proj.weight', 'model.layers.5.self_attn.q_proj.bias', 'model.layers.7.self_attn.k_proj.bias', 'model.layers.18.self_attn.k_proj.weight', 'model.layers.16.self_attn.v_proj.bias', 'model.layers.7.self_attn.k_proj.weight', 'model.layers.10.self_attn.k_proj.bias', 'model.layers.20.self_attn.v_proj.weight', 'model.layers.27.self_attn.k_proj.bias', 'model.layers.4.self_attn.k_proj.bias', 'model.layers.12.self_attn.q_proj.bias', 'model.layers.12.self_attn.v_proj.weight', 'model.layers.5.self_attn.q_proj.weight', 'model.layers.10.self_attn.k_proj.weight', 'model.layers.19.self_attn.q_proj.bias', 'model.layers.30.self_attn.k_proj.bias', 'model.layers.30.self_attn.v_proj.weight', 'model.layers.9.self_attn.q_proj.bias', 'model.layers.16.self_attn.q_proj.weight', 'model.layers.14.self_attn.q_proj.weight', 'model.layers.3.self_attn.q_proj.weight', 'model.layers.9.self_attn.k_proj.bias', 'model.layers.10.self_attn.v_proj.weight', 'model.layers.26.self_attn.k_proj.weight', 'model.layers.31.self_attn.v_proj.weight', 'model.layers.18.self_attn.k_proj.bias', 'model.layers.8.self_attn.v_proj.bias', 
'model.layers.25.self_attn.q_proj.weight', 'model.layers.6.self_attn.k_proj.weight', 'model.layers.20.self_attn.q_proj.weight', 'model.layers.14.self_attn.v_proj.bias', 'model.layers.13.self_attn.k_proj.weight', 'model.layers.23.self_attn.v_proj.bias', 'model.layers.5.self_attn.v_proj.bias', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.26.self_attn.v_proj.bias', 'model.layers.22.self_attn.k_proj.bias', 'model.layers.30.self_attn.v_proj.bias', 'model.layers.16.self_attn.q_proj.bias', 'model.layers.15.self_attn.k_proj.bias', 'model.layers.7.self_attn.q_proj.weight', 'model.layers.11.self_attn.k_proj.bias', 'model.layers.2.self_attn.q_proj.weight', 'model.layers.29.self_attn.k_proj.weight', 'model.layers.24.self_attn.k_proj.weight', 'model.layers.22.self_attn.v_proj.weight', 'model.layers.9.self_attn.k_proj.weight', 'model.layers.9.self_attn.v_proj.weight', 'model.layers.26.self_attn.q_proj.bias', 'model.layers.30.self_attn.k_proj.weight', 'model.layers.19.self_attn.q_proj.weight', 'model.layers.25.self_attn.v_proj.weight', 'model.layers.21.self_attn.k_proj.bias', 'model.layers.1.self_attn.v_proj.bias', 'model.layers.21.self_attn.q_proj.weight', 'model.layers.10.self_attn.q_proj.bias', 'model.layers.19.self_attn.v_proj.weight', 'model.layers.2.self_attn.v_proj.weight', 'model.layers.1.self_attn.v_proj.weight', 'model.layers.23.self_attn.k_proj.bias', 'model.layers.18.self_attn.v_proj.weight', 'model.layers.19.self_attn.k_proj.bias', 'model.layers.5.self_attn.v_proj.weight', 'model.layers.19.self_attn.k_proj.weight', 'model.layers.25.self_attn.k_proj.bias', 'model.layers.1.self_attn.k_proj.weight', 'model.layers.17.self_attn.k_proj.bias', 'model.layers.3.self_attn.v_proj.bias', 'model.layers.4.self_attn.q_proj.bias', 'model.layers.13.self_attn.v_proj.weight', 'model.layers.15.self_attn.v_proj.bias', 'model.layers.4.self_attn.q_proj.weight', 'model.layers.28.self_attn.q_proj.bias', 'model.layers.23.self_attn.q_proj.weight', 'model.layers.22.self_attn.k_proj.weight', 'model.layers.24.self_attn.q_proj.bias', 'model.layers.4.self_attn.k_proj.weight', 'model.layers.24.self_attn.k_proj.bias', 'model.layers.20.self_attn.k_proj.weight', 'model.layers.12.self_attn.k_proj.weight', 'model.layers.18.self_attn.q_proj.bias', 'model.layers.31.self_attn.q_proj.bias', 'model.layers.16.self_attn.k_proj.weight', 'model.layers.11.self_attn.k_proj.weight', 'model.layers.23.self_attn.q_proj.bias', 'model.layers.12.self_attn.q_proj.weight', 'model.layers.11.self_attn.v_proj.bias', 'model.layers.30.self_attn.q_proj.weight', 'model.layers.29.self_attn.v_proj.bias', 'model.layers.31.self_attn.v_proj.bias', 'model.layers.1.self_attn.k_proj.bias', 'model.layers.13.self_attn.q_proj.weight', 'model.layers.28.self_attn.k_proj.weight', 'model.layers.19.self_attn.v_proj.bias', 'model.layers.22.self_attn.q_proj.bias', 'model.layers.28.self_attn.q_proj.weight', 'model.layers.30.self_attn.q_proj.bias', 'model.layers.23.self_attn.k_proj.weight', 'model.layers.14.self_attn.k_proj.bias', 'model.layers.24.self_attn.v_proj.bias', 'model.layers.14.self_attn.v_proj.weight', 'model.layers.16.self_attn.v_proj.weight', 'model.layers.9.self_attn.v_proj.bias', 'model.layers.6.self_attn.v_proj.bias', 'model.layers.31.self_attn.k_proj.bias', 'model.layers.4.self_attn.v_proj.bias', 'model.layers.27.self_attn.v_proj.bias', 'model.layers.15.self_attn.q_proj.bias', 'model.layers.20.self_attn.k_proj.bias', 'model.layers.17.self_attn.q_proj.weight', 'model.layers.29.self_attn.v_proj.weight', 'model.layers.6.self_attn.q_proj.weight', 
'model.layers.29.self_attn.q_proj.bias', 'model.layers.28.self_attn.k_proj.bias', 'model.layers.17.self_attn.q_proj.bias', 'model.layers.2.self_attn.k_proj.weight', 'model.layers.29.self_attn.k_proj.bias', 'model.layers.7.self_attn.v_proj.bias', 'model.layers.8.self_attn.k_proj.weight', 'model.layers.27.self_attn.v_proj.weight', 'model.layers.0.self_attn.q_proj.bias', 'model.layers.15.self_attn.k_proj.weight', 'model.layers.17.self_attn.v_proj.bias', 'model.layers.21.self_attn.v_proj.weight', 'model.layers.3.self_attn.k_proj.weight', 'model.layers.17.self_attn.k_proj.weight', 'model.layers.6.self_attn.v_proj.weight', 'model.layers.18.self_attn.v_proj.bias', 'model.layers.6.self_attn.q_proj.bias', 'model.layers.28.self_attn.v_proj.weight', 'model.layers.8.self_attn.k_proj.bias', 'model.layers.0.self_attn.v_proj.bias', 'model.layers.3.self_attn.v_proj.weight', 'model.layers.2.self_attn.k_proj.bias', 'model.layers.8.self_attn.q_proj.bias', 'model.layers.2.self_attn.v_proj.bias', 'model.layers.24.self_attn.v_proj.weight']
- This IS expected if you are initializing PhiForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing PhiForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of PhiForCausalLM were not initialized from the model checkpoint at microsoft/phi-2 and are newly initialized: ['model.layers.13.self_attn.query_key_value.bias', 'model.layers.15.self_attn.query_key_value.weight', 'model.layers.12.self_attn.query_key_value.bias', 'model.layers.30.self_attn.query_key_value.bias', 'model.layers.25.self_attn.query_key_value.weight', 'model.layers.14.self_attn.query_key_value.bias', 'model.layers.24.self_attn.query_key_value.bias', 'model.layers.26.self_attn.query_key_value.weight', 'model.layers.21.self_attn.query_key_value.bias', 'model.layers.25.self_attn.query_key_value.bias', 'model.layers.14.self_attn.query_key_value.weight', 'model.layers.17.self_attn.query_key_value.weight', 'model.layers.11.self_attn.query_key_value.bias', 'model.layers.18.self_attn.query_key_value.bias', 'model.layers.23.self_attn.query_key_value.weight', 'model.layers.31.self_attn.query_key_value.bias', 'model.layers.2.self_attn.query_key_value.bias', 'model.layers.12.self_attn.query_key_value.weight', 'model.layers.9.self_attn.query_key_value.bias', 'model.layers.20.self_attn.query_key_value.weight', 'model.layers.26.self_attn.query_key_value.bias', 'model.layers.30.self_attn.query_key_value.weight', 'model.layers.7.self_attn.query_key_value.weight', 'model.layers.28.self_attn.query_key_value.weight', 'model.layers.22.self_attn.query_key_value.bias', 'model.layers.2.self_attn.query_key_value.weight', 'model.layers.8.self_attn.query_key_value.weight', 'model.layers.15.self_attn.query_key_value.bias', 'model.layers.1.self_attn.query_key_value.weight', 'model.layers.27.self_attn.query_key_value.bias', 'model.layers.10.self_attn.query_key_value.weight', 'model.layers.16.self_attn.query_key_value.bias', 'model.layers.28.self_attn.query_key_value.bias', 'model.layers.29.self_attn.query_key_value.weight', 'model.layers.3.self_attn.query_key_value.bias', 'model.layers.19.self_attn.query_key_value.bias', 'model.layers.4.self_attn.query_key_value.bias', 'model.layers.31.self_attn.query_key_value.weight', 'model.layers.18.self_attn.query_key_value.weight', 'model.layers.16.self_attn.query_key_value.weight', 'model.layers.21.self_attn.query_key_value.weight', 'model.layers.22.self_attn.query_key_value.weight', 'model.layers.29.self_attn.query_key_value.bias', 'model.layers.5.self_attn.query_key_value.bias', 'model.layers.8.self_attn.query_key_value.bias', 'model.layers.9.self_attn.query_key_value.weight', 'model.layers.3.self_attn.query_key_value.weight', 'model.layers.7.self_attn.query_key_value.bias', 'model.layers.27.self_attn.query_key_value.weight', 'model.layers.1.self_attn.query_key_value.bias', 'model.layers.6.self_attn.query_key_value.weight', 'model.layers.19.self_attn.query_key_value.weight', 'model.layers.0.self_attn.query_key_value.bias', 'model.layers.6.self_attn.query_key_value.bias', 'model.layers.0.self_attn.query_key_value.weight', 'model.layers.20.self_attn.query_key_value.bias', 'model.layers.23.self_attn.query_key_value.bias', 'model.layers.11.self_attn.query_key_value.weight', 'model.layers.24.self_attn.query_key_value.weight', 'model.layers.4.self_attn.query_key_value.weight', 'model.layers.5.self_attn.query_key_value.weight', 'model.layers.13.self_attn.query_key_value.weight', 'model.layers.17.self_attn.query_key_value.bias', 'model.layers.10.self_attn.query_key_value.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Using Transformers Master

>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:22<00:00, 11.24s/it]
>>> model
PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (dense): Linear(in_features=2560, out_features=2560, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (final_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=2560, out_features=51200, bias=True)
)

With the existing defaults, you can see that LoRA is only applied to fc1 and fc2, while Wqkv and out_proj are silently skipped because those modules no longer exist in the new architecture.

>>> from peft import LoraConfig, get_peft_model
>>> lora_config = LoraConfig()
>>> lora_config
LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type=None, inference_mode=False, r=8, target_modules=None, lora_alpha=8, lora_dropout=0.0, fan_in_fan_out=False, bias='none', modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={})
>>> get_peft_model(model, lora_config)
PeftModel(
  (base_model): LoraModel(
    (model): PhiForCausalLM(
      (model): PhiModel(
        (embed_tokens): Embedding(51200, 2560)
        (embed_dropout): Dropout(p=0.0, inplace=False)
        (layers): ModuleList(
          (0-31): 32 x PhiDecoderLayer(
            (self_attn): PhiAttention(
              (q_proj): Linear(in_features=2560, out_features=2560, bias=True)
              (k_proj): Linear(in_features=2560, out_features=2560, bias=True)
              (v_proj): Linear(in_features=2560, out_features=2560, bias=True)
              (dense): Linear(in_features=2560, out_features=2560, bias=True)
              (rotary_emb): PhiRotaryEmbedding()
            )
            (mlp): PhiMLP(
              (activation_fn): NewGELUActivation()
              (fc1): lora.Linear(
                (base_layer): Linear(in_features=2560, out_features=10240, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2560, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=10240, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (fc2): lora.Linear(
                (base_layer): Linear(in_features=10240, out_features=2560, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=10240, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2560, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
            )
            (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
            (resid_dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (final_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
      )
      (lm_head): Linear(in_features=2560, out_features=51200, bias=True)
    )
  )
)
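
To confirm this programmatically instead of eyeballing the module tree, one can list which layers actually received adapters (a minimal sketch, assuming a freshly loaded model and the default LoraConfig from above; the name parsing is just for illustration):

from peft.tuners.lora import LoraLayer

peft_model = get_peft_model(model, lora_config)

# Collect the base module names that were wrapped with LoRA adapters.
wrapped = {
    name.split(".")[-1]
    for name, module in peft_model.named_modules()
    if isinstance(module, LoraLayer)
}
print(sorted(wrapped))   # ['fc1', 'fc2'] -- Wqkv/out_proj matched nothing
peft_model.print_trainable_parameters()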

However, with the proposed changes, this is what the output looks like:

>>> lora_config.target_modules = ["q_proj", "v_proj", "fc1", "fc2"]
>>> get_peft_model(model, lora_config)
PeftModel(
  (base_model): LoraModel(
    (model): PhiForCausalLM(
      (model): PhiModel(
        (embed_tokens): Embedding(51200, 2560)
        (embed_dropout): Dropout(p=0.0, inplace=False)
        (layers): ModuleList(
          (0-31): 32 x PhiDecoderLayer(
            (self_attn): PhiAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=2560, out_features=2560, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2560, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2560, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear(in_features=2560, out_features=2560, bias=True)
              (v_proj): lora.Linear(
                (base_layer): Linear(in_features=2560, out_features=2560, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2560, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2560, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (dense): Linear(in_features=2560, out_features=2560, bias=True)
              (rotary_emb): PhiRotaryEmbedding()
            )
            (mlp): PhiMLP(
              (activation_fn): NewGELUActivation()
              (fc1): lora.Linear(
                (base_layer): Linear(in_features=2560, out_features=10240, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2560, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=10240, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (fc2): lora.Linear(
                (base_layer): Linear(in_features=10240, out_features=2560, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=10240, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2560, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
            )
            (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
            (resid_dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (final_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
      )
      (lm_head): Linear(in_features=2560, out_features=51200, bias=True)
    )
  )
)
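
Until a PEFT release ships the updated defaults, the same result can be obtained by passing the target modules explicitly (a usage sketch; the r and lora_alpha values are illustrative, not recommendations):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    # Match the new PhiAttention/PhiMLP module names instead of Wqkv/out_proj.
    target_modules=["q_proj", "v_proj", "fc1", "fc2"],
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()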

younesbelkada (Contributor) left a comment

Makes sense thanks!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

younesbelkada merged commit 1c1c7fd into huggingface:main on Jan 24, 2024
BenjaminBossan pushed a commit to BenjaminBossan/peft that referenced this pull request Mar 14, 2024
Guy-Bilitski pushed a commit to Guy-Bilitski/peft that referenced this pull request May 13, 2025