High PSNR after fine-tuning, but the individual views separate

Excellent work! I loaded the Hugging Face weights and fine-tuned them on the nuScenes dataset. The nuScenes training set has 1000+ scenes; each scene has 40 key frames, and each key frame has 6 camera images. I split each key frame into 2 "scenes": one with CAM_FRONT, CAM_FRONT_LEFT, and CAM_FRONT_RIGHT, and one with CAM_BACK, CAM_BACK_RIGHT, and CAM_BACK_LEFT. For example:

```json
{
  "scenes_all": [
    {
      "scene_name": "000100000",
      "images": [
        {
          "camera_idx": 0,
          "camera": "CAM_FRONT",
          "image_filepath": "samples/CAM_FRONT/n015-2018-07-18-11-07-57+0800__CAM_FRONT__1531883530412470.jpg"
        },
        {
          "camera_idx": 1,
          "camera": "CAM_FRONT_LEFT",
          "image_filepath": "samples/CAM_FRONT_LEFT/n015-2018-07-18-11-07-57+0800__CAM_FRONT_LEFT__1531883530404844.jpg"
        },
        {
          "camera_idx": 2,
          "camera": "CAM_FRONT_RIGHT",
          "image_filepath": "samples/CAM_FRONT_RIGHT/n015-2018-07-18-11-07-57+0800__CAM_FRONT_RIGHT__1531883530420339.jpg"
        }
      ]
    },
    {
      "scene_name": "000100001",
      "images": [
        {
          "camera_idx": 0,
          "camera": "CAM_BACK",
          "image_filepath": "samples/CAM_BACK/n015-2018-07-18-11-07-57+0800__CAM_BACK__1531883530437525.jpg"
        },
        {
          "camera_idx": 1,
          "camera": "CAM_BACK_RIGHT",
          "image_filepath": "samples/CAM_BACK_RIGHT/n015-2018-07-18-11-07-57+0800__CAM_BACK_RIGHT__1531883530427893.jpg"
        },
        {
          "camera_idx": 2,
          "camera": "CAM_BACK_LEFT",
          "image_filepath": "samples/CAM_BACK_LEFT/n015-2018-07-18-11-07-57+0800__CAM_BACK_LEFT__1531883530447423.jpg"
        }
      ]
    },
......
]
}
```
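The split described above can be sketched roughly as follows. This is only an illustration of the idea, not the actual data-preparation script: the helper `split_keyframe`, the `cam_to_path` mapping, and the scene-naming scheme are all hypothetical.

```python
# Hypothetical sketch: turn one nuScenes key frame into two 3-image "scenes".
FRONT = ["CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT"]
BACK = ["CAM_BACK", "CAM_BACK_RIGHT", "CAM_BACK_LEFT"]


def split_keyframe(frame_idx, cam_to_path):
    """Build the front and back scene entries for a single key frame.

    cam_to_path maps a camera name to its image path (assumed input format).
    """
    scenes = []
    for group_idx, group in enumerate((FRONT, BACK)):
        scenes.append({
            # Naming scheme is an assumption: frame index plus a group suffix.
            "scene_name": f"{frame_idx:08d}{group_idx}",
            "images": [
                {"camera_idx": i, "camera": cam, "image_filepath": cam_to_path[cam]}
                for i, cam in enumerate(group)
            ],
        })
    return scenes
```

Collecting the returned entries for every key frame into a list under `"scenes_all"` yields a file with the same shape as the JSON above.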
This gives me more than 68k scenes, each with just 3 images. Every image is resized from 1600\*900 to 448\*448.
I trained on 4 \* H800 GPUs and rewrote the relevant data-loading code.
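Since 1600\*900 to 448\*448 is a non-uniform resize, the pixel-space intrinsics have to be scaled by a different factor per axis. A minimal sketch of that step, assuming a plain stretch (no cropping) and a standard 3x3 pinhole matrix in pixel units; `resize_intrinsics` is a hypothetical helper, not part of the repository, and it ignores half-pixel offset conventions:

```python
import numpy as np


def resize_intrinsics(K, src_wh=(1600, 900), dst_wh=(448, 448)):
    """Scale a 3x3 pinhole matrix for an image stretched from src_wh to dst_wh."""
    sx = dst_wh[0] / src_wh[0]  # horizontal scale (0.28 for 1600 -> 448)
    sy = dst_wh[1] / src_wh[1]  # vertical scale (about 0.498 for 900 -> 448)
    S = np.diag([sx, sy, 1.0])
    # Row 0 (fx, skew, cx) is scaled by sx; row 1 (fy, cy) is scaled by sy.
    return S @ np.asarray(K, dtype=np.float64)
```

After this stretch, `fx` and `fy` differ by a factor of 900/1600 even if they were equal before, so the intrinsics passed to the model should reflect that.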
The config is below:
```yaml
model:
  encoder:
    backbone:
      name: croco
      model: ViTLarge_BaseDecoder
      patch_embed_cls: PatchEmbedDust3R
      asymmetry_decoder: true
      intrinsics_embed_loc: encoder
      intrinsics_embed_degree: 4
      intrinsics_embed_type: token
    name: anysplat
    opacity_mapping:
      initial: 0.0
      final: 0.0
      warm_up: 1
    num_monocular_samples: 32
    num_surfaces: 1
    predict_opacity: false
    gaussians_per_pixel: 1
    gaussian_adapter:
      gaussian_scale_min: 0.5
      gaussian_scale_max: 15.0
      sh_degree: 4
    d_feature: 32
    visualizer:
      num_samples: 8
      min_resolution: 256
      export_ply: false
    apply_bounds_shim: true
    gs_params_head_type: dpt_gs
    pose_free: true
    pretrained_weights: ''
    scale_align: false
    voxel_size: 0.002
    n_offsets: 2
    anchor_feat_dim: 128
    add_view: false
    color_attr: 3D
    mlp_type: unified
    scaffold: true
    intrinsics_embed_loc: encoder
    intrinsics_embed_type: token
    pred_pose: true
    gs_prune: false
    pred_head_type: depth
    freeze_backbone: false
    distill: true
    render_conf: false
    conf_threshold: 0.1
    freeze_module: patch_embed
    voxelize: true
    intermediate_layer_idx:
    - 4
    - 11
    - 17
    - 23
  decoder:
    name: splatting_cuda
    background_color:
    - 1.0
    - 1.0
    - 1.0
    make_scale_invariant: false
loss:
  mse:
    weight: 1.0
    conf: false
  lpips:
    weight: 0.05
    apply_after_step: 0
    conf: false
  depth_consis:
    weight: 0.1
    sigma_image: null
    use_second_derivative: false
    loss_type: MSE
wandb:
  project: anysplat
  entity: scene-representation-group
  name: custom
  mode: disabled
  log_video: false
  media_kwargs:
    video: None
  tags:
  - custom
  - 448x448
mode: train
data_loader:
  train:
    num_workers: 16
    persistent_workers: true
    batch_size: 4
    seed: 1234
  test:
    num_workers: 4
    persistent_workers: false
    batch_size: 1
    seed: 2345
  val:
    num_workers: 1
    persistent_workers: true
    batch_size: 1
    seed: 3456
optimizer:
  lr: 0.0002
  warm_up_steps: 2000
  backbone_lr_multiplier: 0.1
checkpointing:
  load: ckpts/ckpt
  every_n_train_steps: 2000
  save_top_k: 5
  save_weights_only: false
train:
  output_path: ${hydra.run.dir}
  depth_mode: null
  extended_visualization: false
  print_log_every_n_steps: 10
  distiller: ''
  distill_max_steps: 1000000
  random_context_views: false
  pose_loss_alpha: 1.0
  pose_loss_delta: 1.0
  cxt_depth_weight: 0.0
  weight_pose: 10.0
  weight_depth: 1.0
  weight_normal: 0.0
test:
  output_path: outputs/test-nopo
  align_pose: true
  pose_align_steps: 100
  rot_opt_lr: 0.005
  trans_opt_lr: 0.005
  compute_scores: true
  save_image: true
  save_video: false
  save_compare: true
  generate_video: false
  mode: inference
  image_folder: examples/bungeenerf
seed: 111123
trainer:
  max_steps: 30000
  val_check_interval: 2000
  gradient_clip_val: 0.5
  num_nodes: 1
  accumulate_grad_batches: 1
  precision: bf16-mixed
dataset:
  custom:
    make_baseline_1: false
    relative_pose: true
    augment: true
    background_color:
    - 1.0
    - 1.0
    - 1.0
    overfit_to_scene: null
    skip_bad_shape: true
    rescale_to_1cube: false
    view_sampler:
      name: all
      num_context_views: 3
      num_target_views: 1
      max_img_per_gpu: 24
    name: custom
    roots:
    - datasets/nuscenes
    scenes_json_path: ./dataset_scenes.json
    scene_data_json_path: ./dataset_scene_cams.ndjson
    input_image_shape:
    - 448
    - 448
    original_image_shape:
    - 448
    - 448
    cameras_are_circular: false
    baseline_min: 0.001
    baseline_max: 100.0
    max_fov: 180.0
    avg_pose: false
    intr_augment: true
    normalize_by_pts3d: false

```

In the later stages of training, roughly after epoch 9, I get a high PSNR of 40+ dB, for example:

![Image](https://github.com/user-attachments/assets/9699666b-b49d-4f97-b781-bbb49b8450bc)
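For reference, the PSNR figures quoted here follow the usual definition, PSNR = 10 · log10(MAX² / MSE). A small sketch, assuming images are float arrays in [0, 1]; the function name `psnr` is illustrative and not taken from the repository:

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

On this scale, 40 dB corresponds to an RMS error of about 1% of the value range and 20 dB to about 10%. The metric is computed per rendered view, so by itself it says nothing about whether the Gaussians predicted from different views are geometrically aligned.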

The 3 input images of one scene:

<img width="685" height="223" alt="Image" src="https://github.com/user-attachments/assets/a46e9828-8f2a-454b-bfd0-49fc6ebc948a" />

Output of the fine-tuned model:

<img width="685" height="223" alt="Image" src="https://github.com/user-attachments/assets/74575335-6604-4f0c-92e5-6bef1d24149d" />

I wanted to see the entire scene, so I modified the intrinsics to broaden the field of view of the center view. However, the scene appears to have separated into 3 scenes that are not aligned with each other:

<img width="761" height="448" alt="Image" src="https://github.com/user-attachments/assets/2953f053-fe3c-4d0c-ab07-9b0f852361e1" />
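The wide-angle render above amounts to shrinking the focal length of the center camera while keeping its pose fixed. A minimal sketch of that kind of change, assuming a 3x3 pinhole matrix in pixel units; `widen_fov` and `horizontal_fov_deg` are hypothetical helpers, and the actual intrinsics convention in the code base (for example, normalized by image size) may differ:

```python
import math

import numpy as np


def widen_fov(K, factor):
    """Return a copy of K with both focal lengths divided by factor (>1 widens)."""
    K = np.array(K, dtype=np.float64)  # copy so the input is not modified
    K[0, 0] /= factor
    K[1, 1] /= factor
    return K


def horizontal_fov_deg(K, width):
    """Horizontal field of view in degrees for an image of the given pixel width."""
    return math.degrees(2.0 * math.atan(width / (2.0 * K[0, 0])))
```

For example, with `fx = 224` on a 448-pixel-wide image the horizontal field of view is 90 degrees, and `widen_fov(K, 2.0)` brings it to roughly 127 degrees. The principal point is left unchanged, so the render stays centered on the same viewing direction.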


Before fine-tuning, although the PSNR was only 20+ dB and there was ghosting in the overlapping areas, the entire scene felt unified:

<img width="685" height="223" alt="Image" src="https://github.com/user-attachments/assets/a6676557-8328-41ed-8891-2a7aaeaaf988" />

<img width="761" height="448" alt="Image" src="https://github.com/user-attachments/assets/f46cbb10-2976-474e-bfc0-981e2966d623" />

Is this caused by overfitting to the dataset? Do you have any suggestions on how to resolve it?
Looking forward to your reply. Thank you!

