Description
Hi - thanks for the really useful framework. I'm confused by the choice, in many of the multimodal models, to load the image and text features as unfrozen embeddings, e.g.:
```python
self.image_embedding = nn.Embedding.from_pretrained(self.v_feat, freeze=False)
```
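For context, here's a minimal standalone sketch of what the `freeze` flag controls in `nn.Embedding.from_pretrained` (plain PyTorch, not this repo's code; `v_feat` is a made-up stand-in for the precomputed features):

```python
import torch
import torch.nn as nn

# Pretend these are precomputed visual features for 4 items, dim 8.
v_feat = torch.randn(4, 8)

frozen = nn.Embedding.from_pretrained(v_feat, freeze=True)
unfrozen = nn.Embedding.from_pretrained(v_feat, freeze=False)

print(frozen.weight.requires_grad)    # False: excluded from gradient updates
print(unfrozen.weight.requires_grad)  # True: the optimizer will update these
```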
While these are detached from gradients when the KNN graphs etc. are constructed, they are treated as trainable variables in every model that applies an MLP or linear transformation to the input features, e.g. MGCN:
```python
image_feats = self.image_trs(self.image_embedding.weight)
```
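To make the concern concrete, here's a hypothetical minimal example (the `Linear` layer is a stand-in for the model's projection module, not the repo's exact code) showing that the backward pass reaches the embedding weight when it isn't frozen:

```python
import torch
import torch.nn as nn

v_feat = torch.randn(4, 8)
image_embedding = nn.Embedding.from_pretrained(v_feat, freeze=False)
image_trs = nn.Linear(8, 8)  # stand-in for the model's projection layer

image_feats = image_trs(image_embedding.weight)
loss = image_feats.sum()  # placeholder loss
loss.backward()

# Non-None gradient: an optimizer step would change the "fixed" features.
print(image_embedding.weight.grad is not None)  # True
```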
My understanding is that the text/image features should be treated as fixed input tensors that stay constant during training. However, since the embeddings are not frozen, they are updated during training. I've verified this by training a PGL model on the Sports dataset and comparing the model weights before and after training: the image embedding weights change.

[Screenshots of the image embedding weights before and after training]
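For reproducibility, here's a self-contained toy version of that check, assuming the same `from_pretrained` pattern (a dummy loss and a single Adam step stand in for the framework's real training loop):

```python
import torch
import torch.nn as nn

v_feat = torch.randn(4, 8)

emb = nn.Embedding.from_pretrained(v_feat, freeze=False)
trs = nn.Linear(8, 8)
opt = torch.optim.Adam(list(emb.parameters()) + list(trs.parameters()), lr=1e-3)

before = emb.weight.detach().clone()

# One dummy training step: any loss that touches emb.weight will move it.
loss = trs(emb.weight).pow(2).mean()
loss.backward()
opt.step()

after = emb.weight.detach()
print(torch.equal(before, after))           # False: the "content features" moved
print((after - before).abs().max().item())  # size of the largest update
```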
This seems like a strange use of the content features: they're essentially treated as a separate set of learnable item ID embeddings (in some cases with very large dimensions), and may lose most of the original visual or textual information through gradient updates during training. As far as I can tell, this choice is also not made explicit in any of the relevant papers for these models. Could you please clarify whether this is a deliberate design choice?
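For comparison, here's a sketch of how the features could be kept constant, assuming nothing else relies on the weight being trainable (the module and names are hypothetical; either option below would do it):

```python
import torch
import torch.nn as nn

class FixedFeatureModel(nn.Module):
    """Hypothetical module showing two ways to keep content features constant."""

    def __init__(self, v_feat: torch.Tensor, dim: int):
        super().__init__()
        # Option 1: freeze at construction so the optimizer never updates it.
        self.image_embedding = nn.Embedding.from_pretrained(v_feat, freeze=True)
        # Option 2: register the raw features as a non-trainable buffer.
        self.register_buffer("v_feat_fixed", v_feat)
        self.image_trs = nn.Linear(v_feat.size(1), dim)

    def forward(self) -> torch.Tensor:
        # Either source yields the same fixed inputs; only image_trs trains.
        return self.image_trs(self.image_embedding.weight)
```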
Thanks!