Why are the text and image feature inputs treated as unfrozen embeddings? #56

@gmeehan96

Hi - thanks for the really useful framework. I'm confused by the choice, in many of the multimodal models, to load the image and text features as unfrozen embeddings, e.g.:

self.image_embedding = nn.Embedding.from_pretrained(self.v_feat, freeze=False)
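To make the distinction concrete, here is a minimal sketch (using a small random tensor as a hypothetical stand-in for `self.v_feat`) showing that `freeze=False` leaves the embedding weights trainable, while `freeze=True` would keep them a fixed lookup table:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the precomputed visual features (v_feat).
v_feat = torch.randn(4, 8)

# freeze=False (as in the repo) registers the weights as trainable parameters...
unfrozen = nn.Embedding.from_pretrained(v_feat, freeze=False)
# ...whereas freeze=True keeps them constant during training.
frozen = nn.Embedding.from_pretrained(v_feat, freeze=True)

print(unfrozen.weight.requires_grad)  # True  -> updated by the optimizer
print(frozen.weight.requires_grad)    # False -> stays fixed
```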

While these are detached from gradients in the construction of the KNN graphs etc., they are treated as trainable variables for all models which have any MLP or linear transformation of the input features, e.g. MGCN:

image_feats = self.image_trs(self.image_embedding.weight)

My understanding is that the text/image features should be treated as fixed input tensors and stay constant during training. However, since the embeddings are not frozen, they will be updated during training - I've verified this by training a PGL model on the Sports dataset and comparing the model weights before and after training is complete. These are the image embedding weights before:

[Screenshot: image embedding weights before training]

and after training:

[Screenshot: image embedding weights after training]
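The drift described above can be reproduced in a few lines. This is a hedged sketch, not the repo's actual training loop: a random tensor stands in for the precomputed features, and a plain `nn.Linear` plays the role of `image_trs`. One optimizer step is enough to change the "pretrained" weights:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for precomputed image features.
v_feat = torch.randn(4, 8)
emb = nn.Embedding.from_pretrained(v_feat, freeze=False)
proj = nn.Linear(8, 2)  # stand-in for the image_trs transformation

before = emb.weight.detach().clone()

opt = torch.optim.SGD(list(emb.parameters()) + list(proj.parameters()), lr=0.1)
loss = proj(emb.weight).sum()  # any loss through the embedding works
loss.backward()
opt.step()

# The "pretrained" feature matrix has already drifted after a single step.
print(torch.equal(before, emb.weight.detach()))  # False
```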

This seems like a strange use of the content features, as they're essentially treated like a separate set of learnable item ID embeddings (in some cases, with very large dimensions) and may lose most of the original information about visual or textual content due to gradient updates during training. As far as I can tell, this choice is also not made clear in any of the relevant papers for these models. Please could you clarify if this is a deliberate design choice?

Thanks!
