Paper • Datasets • Symile vs. CLIP • Questions • Citation
Multimodal representation learning works for 2 modalities, but what if you're working with 3+ modalities, like in healthcare, robotics, or video?
Meet Symile: A flexible, architecture-agnostic framework for contrastive pre-training across any number of modalities. Symile maintains the simplicity of CLIP while delivering superior performance, even when some modalities are missing.
No more specialized architectures, complex fusion models, or applying CLIP to pairs of modalities (e.g. ImageBind). Now, with Symile, you can learn modality-specific representations simultaneously for any number of modalities!
For a similarity metric, Symile uses the multilinear inner product (MIP), a simple generalization of the dot product to more than two vectors that allows for the simultaneous contrasting of all modalities and enables zero-shot applications such as classification and retrieval.
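Concretely, for three vectors the MIP just sums their elementwise product, recovering the standard dot product when only two vectors are involved. A minimal sketch (variable names are ours):

import torch

u, v, w = torch.randn(3, 8)   # three hypothetical 8-dim representations
mip = (u * v * w).sum()       # MIP(u, v, w) = sum_i u_i * v_i * w_i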
To learn more, check out our paper (NeurIPS 2024)!
To install the Symile package via pip:
pip install symile
Example usage of the Symile loss and MIP similarity metric for three modalities:
import torch
import torch.nn.functional as F
from symile import Symile, MIPSimilarity

batch_size, input_dim, num_candidates = 16, 32, 10  # placeholder sizes

# `model` is any user-defined network that returns one representation per
# modality along with the exponentiated logit scale (temperature).
inputs_a = torch.randn(batch_size, input_dim)
inputs_b = torch.randn(batch_size, input_dim)
inputs_c = torch.randn(batch_size, input_dim)
outputs_a, outputs_b, outputs_c, logit_scale_exp = model(inputs_a, inputs_b, inputs_c)

# normalize each modality's representations to unit length
outputs_a = F.normalize(outputs_a, p=2.0, dim=1)
outputs_b = F.normalize(outputs_b, p=2.0, dim=1)
outputs_c = F.normalize(outputs_c, p=2.0, dim=1)

### train step ###
symile_loss = Symile()
loss = symile_loss([outputs_a, outputs_b, outputs_c], logit_scale_exp)

### evaluation step ###
# score candidate representations for modality A against the (B, C) pairs
mip_similarity = MIPSimilarity()
inputs_a_candidates = torch.randn(num_candidates, input_dim)
outputs_a_candidates = model.encoder_a(inputs_a_candidates)
outputs_a_candidates = F.normalize(outputs_a_candidates, p=2.0, dim=1)
similarity_scores = mip_similarity(outputs_a_candidates, [outputs_b, outputs_c])
similarity_scores = logit_scale_exp * similarity_scores
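For zero-shot retrieval or classification, you would then pick the highest-scoring candidate. A hypothetical final step, assuming similarity_scores holds one score per candidate along its last dimension:

predictions = similarity_scores.argmax(dim=-1)  # index of the best candidate per query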
We provide a very simple example script that uses the Symile loss and the MIP similarity metric to train and test 8 linear encoders for the following data generating procedure:
a, b, c, d, e, f, g ~ Bernoulli(0.5)
h = a XOR b XOR c XOR d XOR e XOR f XOR g
The zero-shot classification task is to predict whether a is 0 or 1 given the remaining variables b, c, d, e, f, g, h.
After cloning the repository, first install the necessary dependencies from the root directory and then run the script:
> poetry install --with examples
> poetry run python examples/binary_xor.py
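For intuition, the data-generating procedure can be sketched in a few lines (our own illustration, not the script itself):

import torch

n = 1000
bits = torch.randint(0, 2, (n, 7))   # columns: a, b, c, d, e, f, g
h = bits.sum(dim=1) % 2              # h = a XOR b XOR ... XOR g
# no single variable predicts a, but b through h jointly determine it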
Symile learns by contrasting positive samples with negative samples. Like CLIP, Symile constructs negatives for each positive by using other samples within the batch. Let's say you have a batch of 4 samples, consisting of three modalities A, B, and C:
A1 B1 C1
A2 B2 C2
A3 B3 C3
A4 B4 C4
Each of the above triples is a positive sample. How do we construct negatives? Symile offers two strategies:
The first strategy, O(N) negatives, randomly shuffles the non-anchor modalities within the batch to create N − 1 negatives per anchor. If A1 is our anchor, we might get:
Positive: A1-B1-C1
Negatives: A1-B3-C4
A1-B4-C2
A1-B2-C3
To use this approach, you can either initialize Symile() with no arguments, or explicitly set the negative_sampling argument:
symile_loss = Symile()
# or
symile_loss = Symile(negative_sampling="n")
The second strategy, O(N²) negatives, creates all possible combinations of the non-anchor modalities, yielding N² − 1 negatives per anchor. Taking A1 as our anchor again:
Positive: A1-B1-C1
Negatives: A1-B1-C2, A1-B1-C3, A1-B1-C4
A1-B2-C1, A1-B2-C2, A1-B2-C3, A1-B2-C4
A1-B3-C1, A1-B3-C2, A1-B3-C3, A1-B3-C4
A1-B4-C1, A1-B4-C2, A1-B4-C3, A1-B4-C4
To use this approach, set the negative_sampling argument to "n_squared":
symile_loss = Symile(negative_sampling="n_squared")
What if some samples in your dataset don’t contain all modalities? For instance, a patient may be missing lab results, or a social media post might not include an image. Symile can be easily adapted to handle missing modalities by passing as inputs to the model both the data (using any placeholder value for missing modalities) and binary indicators that signal which modalities are present for each sample. This approach lets Symile model the relationships between whichever modalities are present in each sample.
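For example, one hypothetical way to set this up, reusing inputs_b and batch_size from the usage example above: zero out missing samples of modality B and append a presence indicator before encoding.

import torch

b_missing = torch.rand(batch_size) < 0.3                      # True where B is missing
inputs_b = inputs_b.masked_fill(b_missing.unsqueeze(1), 0.0)  # any placeholder value works
indicator_b = (~b_missing).float().unsqueeze(1)               # 1 where B is present
inputs_b = torch.cat([inputs_b, indicator_b], dim=1)          # pass both to the encoder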
We provide a simple script demonstrating how to train Symile with missing modalities. The data is generated as follows:
a, b ~ Bernoulli(0.5)
c = a XOR b
The zero-shot classification task is to predict whether a is 0 or 1 given the remaining variables b, c. To simulate missingness in the training and validation sets, values in a, b, and c are randomly set to 0.5 with probability args.missingness_prob. The vectors a, b, c and their missingness indicators are then passed to the encoders. To run the script:
> poetry install --with examples
> poetry run python examples/binary_xor_missing.py
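The missingness simulation itself boils down to something like the following (a sketch; p stands in for the script's args.missingness_prob flag):

import torch

p = 0.2
a = torch.randint(0, 2, (1000, 1)).float()
missing = torch.rand_like(a) < p
a = torch.where(missing, torch.full_like(a, 0.5), a)  # 0.5 marks missing values
indicator_a = (~missing).float()                      # passed to the encoder alongside a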
Note that binary indicators are just one simple way to ensure that missing data is out-of-support: any out-of-support placeholder will work (provided your model is expressive enough). For example, with text data, you could use a special token that's outside of your model's vocabulary (e.g., [MISSING]), as we did in our paper's experiments.
As part of this research, we release two novel multimodal datasets:
- Symile-M3: a multilingual collection of 33 million image, text, and audio samples.
- Symile-MIMIC: a clinical dataset of chest X-rays, electrocardiograms, and laboratory measurements.
To reproduce the experiments from our paper using these datasets, navigate to the experiments/ directory and follow the step-by-step instructions in the dedicated README.
The Symile loss targets total correlation, which is the higher-order generalization of mutual information to any number of random variables. Total correlation can be decomposed into a summation of mutual information terms. For example, in the case of three random variables A, B, and C:

TC(A; B; C) = I(A; B) + I(A; C) + I(B; C | A)
While, like many contrastive approaches, CLIP was designed to capture the shared information between modalities, the above equation indicates that when there are more than two modalities, the scope of what to capture should extend beyond pairwise information to include conditional interactions. Because it targets total correlation, Symile captures strictly more information than CLIP, guaranteeing performance that matches or surpasses CLIP!
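As a concrete check, take the three-variable XOR setup from the missing-modalities example above (a, b ~ Bernoulli(0.5), c = a XOR b): every pair of variables is independent, so all pairwise mutual information terms are zero and a purely pairwise objective like CLIP's has nothing to capture, yet the three variables together carry one full bit of total correlation, all of it coming from the conditional term.

import math

# the joint distribution is uniform over {(0,0,0), (0,1,1), (1,0,1), (1,1,0)}
H_a = H_b = H_c = 1.0           # each marginal is a fair coin (in bits)
H_abc = math.log2(4)            # four equally likely joint outcomes
print(H_a + H_b + H_c - H_abc)  # TC = 1.0 bit, though I(a;b) = I(a;c) = I(b;c) = 0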
Most real-world applications will exhibit a combination of both pairwise and higher-order information. For example, in order to diagnose acute pancreatitis, one might consider a patient’s clinical history of abdominal pain, elevated levels of digestive enzymes, and imaging results consistent with inflammation. While each of these modalities would provide useful information about the likelihood of pancreatitis (i.e., pairwise information between the modality and the diagnosis is non-zero), none of them alone would be diagnostic of the condition.
Bottom line: if you're looking to do contrastive pre-training with more than two modalities, use Symile!
We welcome all questions and feedback! Here's how to reach us:
- Paper: Join the discussion on alphaXiv.
- Code: Feel free to open an issue in this repository.
- Contact: Shoot Adriel an email at [email protected].
Please don't hesitate to reach out—your questions help make this project better for everyone! 🚀
@inproceedings{saporta2024symile,
title = {Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities},
author = {Saporta, Adriel and Puli, Aahlad and Goldstein, Mark and Ranganath, Rajesh},
booktitle = {Advances in Neural Information Processing Systems},
year = {2024}
}