- Group Name: deficienza_computazionale
- Authors: Luigi Pizza, Aryan Sood, Alessandro Tazza, Federico Villa
Create a conda environment and activate it:

```bash
conda create -n CIL python=3.12
conda activate CIL
```

Install pip inside the conda environment:

```bash
conda install pip
```

Alternatively, create a Python virtual environment:

```bash
python -m venv CIL
```

Then install the required packages:

```bash
pip install -r requirements.txt
```

The file example_training.py contains an example for running a model. It is the same script we used to achieve our best Kaggle score.
The example file calls the begin_training_loop function, which starts the main training process. Its use_random_split parameter is a boolean: set it to True to split the data at random into a training set and a validation set at runtime. If use_random_split = False, a custom split must be provided, so both the train_split and val_split parameters have to be non-empty.
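For instance, the random-split mode can be invoked as follows (a minimal sketch; we assume here that begin_training_loop can be imported from example_training.py, where it is called):

```python
# Sketch only: the import location is an assumption based on example_training.py.
from example_training import begin_training_loop

# Random split: a training and a validation set are created at random at runtime,
# so no custom split needs to be passed.
begin_training_loop(use_random_split=True)
```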
Here is an example of such parameters, using the training and validation splits created by clustering images based on their cosine similarity (as we did in our experiments):

```python
import pandas as pd

train_split = pd.read_csv("train_split.csv")["file_name"].to_list()
val_split = pd.read_csv("val_split.csv")["file_name"].to_list()
```
Check the files train_split.csv and val_split.csv in the repository root folder to see how to format them. Both are Comma-Separated Values (.csv) files with a single column, file_name, whose entries alternate between depth map (.npy) and RGB image (.png) filenames.
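If you build your own split files, a small sanity check like the one below can confirm the alternating pattern (illustrative only; it assumes nothing beyond the single file_name column described above):

```python
import pandas as pd

# Illustrative check: entries should alternate between depth maps (.npy)
# and RGB images (.png); the pairing order follows the CSVs in the repo root.
names = pd.read_csv("train_split.csv")["file_name"].to_list()
suffixes = [name.rsplit(".", 1)[-1] for name in names]
assert set(suffixes) <= {"npy", "png"}
assert all(a != b for a, b in zip(suffixes, suffixes[1:]))  # strictly alternating
```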
To evaluate a previously trained model, run the script test_evaluation.py after setting the path to the model weights, the path to your test data, and the directory where the predictions should be saved (refer to the comments inside the evaluation file for details on how to change these values to evaluate a specific model).
To use the same model weights and test data as in our best experiment, do not change any variables in the script: simply download the model weights and place the checkpoint file in the root directory of this repository.
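For orientation, loading a checkpoint in PyTorch generally follows the pattern below; the file name is a placeholder, and the actual loading logic lives in test_evaluation.py:

```python
import torch

# Placeholder path -- the real checkpoint filename is set inside test_evaluation.py.
state = torch.load("checkpoint.pth", map_location="cpu")
# The matching architecture must then be instantiated and the weights restored, e.g.:
#   model.load_state_dict(state)   # or state["state_dict"], depending on how it was saved
#   model.eval()
```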
After evaluation, to generate a Kaggle submission file, run the script create_outputs.py.
The models/ subfolder contains the files for the different models we tested. The models listed below were trained from scratch on the data referenced in train_split.csv; a generic sketch of the kind of building block they share follows the list.
- Base U-Net: models/large_unet.py
- UNet++: models/unetpp.py
- Swin transformer encoder with a U-Net decoder: models/unet_swin_depth_estimator.py
- ViT transformer encoder with a U-Net decoder: models/unet_vit_depth_estimator.py
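The exact layer definitions live in the files above; purely as orientation (this is not copied from the repository), a typical U-Net style double-convolution block looks like this:

```python
import torch
import torch.nn as nn

# Generic illustration of the double-convolution block that U-Net style
# encoder/decoder stages are commonly built from.
class DoubleConv(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)
```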
The model created using a ResNet encoder plus a U-Net decoder used pre-trained weights for the encoder and was fine-tuned on the training data referenced in train_split.csv. The encoder's pre-trained weights were taken from torchvision.models.ResNet50_Weights.
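For reference, a pre-trained ResNet-50 backbone can be obtained from torchvision as sketched below; how it is split into encoder stages and wired to the U-Net decoder is defined in models/resnet_unet_decoder.py, and the exact weight variant used there may differ:

```python
import torch.nn as nn
from torchvision.models import ResNet50_Weights, resnet50

# Sketch: load a ResNet-50 with torchvision's pre-trained weights and drop the
# classification head so it can serve as a convolutional encoder.
backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
encoder = nn.Sequential(*list(backbone.children())[:-2])  # remove avgpool and fc
```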
The files detailing ResNet based models are:
- ResNet: models/resnet_unet_decoder.py
- ResNet-Transformer: models/resnet_transformer_unet.py
The model built from a SegFormer encoder plus a SegFormer decoder (modified by us to output a depth mask) used pre-trained weights for the encoder and was fine-tuned on the training data referenced in train_split.csv. The encoder's pre-trained weights were taken from nvidia/segformer-b5-finetuned-ade-640-640.
For more details on the model's implementation, refer to models/segformer_depth.py.
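A sketch of how the pre-trained encoder can be pulled from the Hugging Face Hub is shown below; the modified decoder that produces the depth mask is implemented in models/segformer_depth.py, and the way the checkpoint is actually consumed there may differ:

```python
from transformers import SegformerModel

# Sketch: load only the SegFormer encoder weights from the Hub checkpoint
# (the segmentation head of the original checkpoint is not used here).
encoder = SegformerModel.from_pretrained("nvidia/segformer-b5-finetuned-ade-640-640")
```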
The model created using a Mask2Former architecture plus a double convolutional layer used pre-trained weights for the Mask2Former architecture and was fine-tuned on the training data referenced in train_split.csv. The pre-trained weights were taken from facebook/mask2former-swin-small-coco-instance.
For more details on the model's implementation, refer to models/Mask2Former_depth.py.
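Similarly, the pre-trained Mask2Former weights can be loaded from the Hugging Face Hub as sketched below; the additional double convolutional layer that turns the output into a depth mask is defined in models/Mask2Former_depth.py:

```python
from transformers import Mask2FormerForUniversalSegmentation

# Sketch: load the pre-trained Mask2Former backbone from the Hub checkpoint.
mask2former = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-small-coco-instance"
)
```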
Table: Depth estimation performance across different architectures
| Model | Validation Loss | Training Loss | Kaggle Public Score |
|---|---|---|---|
| U-Net Based models | | | |
| Base U-Net | 0.27101 | 0.32242 | -- |
| UNet++ | 0.36963 | 0.44006 | -- |
| ResNet Based models | | | |
| ResNet | 0.15886 | 0.09157 | 0.16063 |
| ResNet-Transformer | 0.17421 | 0.11357 | 0.21408 |
| Mask2Former model | | | |
| mask2former-swin-small-coco-instance | 0.12976 | 0.097775 | 0.13343 |
| SegFormer models | | | |
| segformer-b4-512-512 | 0.14236 | 0.11561 | 0.14048 |
| segformer-b5-640-640 | 0.11474 | 0.15951 | 0.12621 |
Note: The SegFormer based model achieved the best results overall. In terms of validation loss, segformer-b5-640-640 shows a 27.8% improvement over the ResNet based model, a 57.7% improvement over the Base U-Net, and a 68.9% improvement over UNet++.
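These percentages are the relative reductions in validation loss with respect to the values in the table above; a quick check:

```python
# Relative reduction in validation loss of segformer-b5-640-640 vs. three baselines,
# using the values from the table above.
segformer_b5 = 0.11474
baselines = {"ResNet": 0.15886, "Base U-Net": 0.27101, "UNet++": 0.36963}
for name, loss in baselines.items():
    print(name, round((loss - segformer_b5) / loss, 4))
# ResNet 0.2777, Base U-Net 0.5766, UNet++ 0.6896
```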
To evaluate our results, we provide the weights of the following models:
| Model | File |
|---|---|
| segformer-b4-512-512 | polybox |
| segformer-b5-640-640 | polybox |
| mask2former-swin-small-coco-instance | polybox |
| ResNet50 | polybox |