Description
Hi - thanks for the really useful framework. I'm confused by the choice, in many of the multimodal models, to load the image and text features as unfrozen embeddings, e.g.:
```python
self.image_embedding = nn.Embedding.from_pretrained(self.v_feat, freeze=False)
```
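For context, here's a minimal standalone sketch of what the `freeze` flag controls in `nn.Embedding.from_pretrained` (plain PyTorch, not this repo's code; `v_feat` is a made-up stand-in for the precomputed features):

```python
import torch
import torch.nn as nn

# Pretend these are precomputed visual features for 4 items, dim 8.
v_feat = torch.randn(4, 8)

frozen = nn.Embedding.from_pretrained(v_feat, freeze=True)
unfrozen = nn.Embedding.from_pretrained(v_feat, freeze=False)

print(frozen.weight.requires_grad)    # False: excluded from gradient updates
print(unfrozen.weight.requires_grad)  # True: the optimizer will update these
```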
While these are detached from gradients when the KNN graphs etc. are constructed, they are treated as trainable variables in every model that applies an MLP or linear transformation to the input features, e.g. MGCN:
```python
image_feats = self.image_trs(self.image_embedding.weight)
```
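To make the concern concrete, here's a hypothetical minimal example (the `Linear` layer is a stand-in for the model's projection module, not the repo's exact code) showing that the backward pass reaches the embedding weight when it isn't frozen:

```python
import torch
import torch.nn as nn

v_feat = torch.randn(4, 8)
image_embedding = nn.Embedding.from_pretrained(v_feat, freeze=False)
image_trs = nn.Linear(8, 8)  # stand-in for the model's projection layer

image_feats = image_trs(image_embedding.weight)
loss = image_feats.sum()  # placeholder loss
loss.backward()

# Non-None gradient: an optimizer step would change the "fixed" features.
print(image_embedding.weight.grad is not None)  # True
```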
My understanding is that the text/image features should be treated as fixed input tensors that stay constant during training. However, since the embeddings are not frozen, they are updated during training. I've verified this by training a PGL model on the Sports dataset and comparing the model weights before and after training: the image embedding weights change.

[Screenshots of the image embedding weights before and after training]
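For reproducibility, here's a self-contained toy version of that check, assuming the same `from_pretrained` pattern (a dummy loss and a single Adam step stand in for the framework's real training loop):

```python
import torch
import torch.nn as nn

v_feat = torch.randn(4, 8)

emb = nn.Embedding.from_pretrained(v_feat, freeze=False)
trs = nn.Linear(8, 8)
opt = torch.optim.Adam(list(emb.parameters()) + list(trs.parameters()), lr=1e-3)

before = emb.weight.detach().clone()

# One dummy training step: any loss that touches emb.weight will move it.
loss = trs(emb.weight).pow(2).mean()
loss.backward()
opt.step()

after = emb.weight.detach()
print(torch.equal(before, after))           # False: the "content features" moved
print((after - before).abs().max().item())  # size of the largest update
```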
This seems like a strange use of the content features: they're essentially treated as a separate set of learnable item ID embeddings (in some cases with very large dimensions), and may lose most of the original visual or textual information through gradient updates during training. As far as I can tell, this choice is also not made explicit in any of the relevant papers for these models. Could you please clarify whether this is a deliberate design choice?
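For comparison, here's a sketch of how the features could be kept constant, assuming nothing else relies on the weight being trainable (the module and names are hypothetical; either option below would do it):

```python
import torch
import torch.nn as nn

class FixedFeatureModel(nn.Module):
    """Hypothetical module showing two ways to keep content features constant."""

    def __init__(self, v_feat: torch.Tensor, dim: int):
        super().__init__()
        # Option 1: freeze at construction so the optimizer never updates it.
        self.image_embedding = nn.Embedding.from_pretrained(v_feat, freeze=True)
        # Option 2: register the raw features as a non-trainable buffer.
        self.register_buffer("v_feat_fixed", v_feat)
        self.image_trs = nn.Linear(v_feat.size(1), dim)

    def forward(self) -> torch.Tensor:
        # Either source yields the same fixed inputs; only image_trs trains.
        return self.image_trs(self.image_embedding.weight)
```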
Thanks!