Excellent work! I loaded the Hugging Face weights to fine-tune on the nuScenes dataset. The nuScenes train split has 1000+ scenes, each scene has 40 key frames, and each key frame has 6 camera images. I split every key frame into two "scenes": one with CAM_FRONT, CAM_FRONT_LEFT, CAM_FRONT_RIGHT, and one with CAM_BACK, CAM_BACK_RIGHT, CAM_BACK_LEFT. For example:
{
"scenes_all": [
{
"scene_name": "000100000",
"images": [
{
"camera_idx": 0,
"camera": "CAM_FRONT",
"image_filepath": "samples/CAM_FRONT/n015-2018-07-18-11-07-57+0800__CAM_FRONT__1531883530412470.jpg"
},
{
"camera_idx": 1,
"camera": "CAM_FRONT_LEFT",
"image_filepath": "samples/CAM_FRONT_LEFT/n015-2018-07-18-11-07-57+0800__CAM_FRONT_LEFT__1531883530404844.jpg"
},
{
"camera_idx": 2,
"camera": "CAM_FRONT_RIGHT",
"image_filepath": "samples/CAM_FRONT_RIGHT/n015-2018-07-18-11-07-57+0800__CAM_FRONT_RIGHT__1531883530420339.jpg"
}
]
},
{
"scene_name": "000100001",
"images": [
{
"camera_idx": 0,
"camera": "CAM_BACK",
"image_filepath": "samples/CAM_BACK/n015-2018-07-18-11-07-57+0800__CAM_BACK__1531883530437525.jpg"
},
{
"camera_idx": 1,
"camera": "CAM_BACK_RIGHT",
"image_filepath": "samples/CAM_BACK_RIGHT/n015-2018-07-18-11-07-57+0800__CAM_BACK_RIGHT__1531883530427893.jpg"
},
{
"camera_idx": 2,
"camera": "CAM_BACK_LEFT",
"image_filepath": "samples/CAM_BACK_LEFT/n015-2018-07-18-11-07-57+0800__CAM_BACK_LEFT__1531883530447423.jpg"
}
]
},
......
]
}
So I get more than 68k scenes, and every scene has exactly 3 images. Every image is resized from 1600*900 to 448*448.
I trained on 4 H800 GPUs and rewrote the relevant data-loading code.
The config is as below:
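For reference, the key-frame splitting described above can be sketched as follows. This is a minimal illustration of the JSON layout, not the actual preprocessing code; the function name `split_keyframe` and the `cam_to_path` argument are hypothetical.

```python
# Split one nuScenes key frame (6 cameras) into two 3-camera "scenes",
# mirroring the JSON structure shown above. Names are illustrative only.
FRONT = ["CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT"]
BACK = ["CAM_BACK", "CAM_BACK_RIGHT", "CAM_BACK_LEFT"]

def split_keyframe(frame_id: int, cam_to_path: dict) -> list:
    """cam_to_path maps a camera name to its image file path."""
    scenes = []
    for half_idx, cams in enumerate([FRONT, BACK]):
        scenes.append({
            # e.g. frame 10000, half 0 -> scene_name "000100000"
            "scene_name": f"{frame_id:08d}{half_idx}",
            "images": [
                {"camera_idx": i,
                 "camera": cam,
                 "image_filepath": cam_to_path[cam]}
                for i, cam in enumerate(cams)
            ],
        })
    return scenes
```

Applied to all 40 key frames of all train scenes, this yields the 68k+ three-image "scenes" mentioned above.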
model:
encoder:
backbone:
name: croco
model: ViTLarge_BaseDecoder
patch_embed_cls: PatchEmbedDust3R
asymmetry_decoder: true
intrinsics_embed_loc: encoder
intrinsics_embed_degree: 4
intrinsics_embed_type: token
name: anysplat
opacity_mapping:
initial: 0.0
final: 0.0
warm_up: 1
num_monocular_samples: 32
num_surfaces: 1
predict_opacity: false
gaussians_per_pixel: 1
gaussian_adapter:
gaussian_scale_min: 0.5
gaussian_scale_max: 15.0
sh_degree: 4
d_feature: 32
visualizer:
num_samples: 8
min_resolution: 256
export_ply: false
apply_bounds_shim: true
gs_params_head_type: dpt_gs
pose_free: true
pretrained_weights: ''
scale_align: false
voxel_size: 0.002
n_offsets: 2
anchor_feat_dim: 128
add_view: false
color_attr: 3D
mlp_type: unified
scaffold: true
intrinsics_embed_loc: encoder
intrinsics_embed_type: token
pred_pose: true
gs_prune: false
pred_head_type: depth
freeze_backbone: false
distill: true
render_conf: false
conf_threshold: 0.1
freeze_module: patch_embed
voxelize: true
intermediate_layer_idx:
- 4
- 11
- 17
- 23
decoder:
name: splatting_cuda
background_color:
- 1.0
- 1.0
- 1.0
make_scale_invariant: false
loss:
mse:
weight: 1.0
conf: false
lpips:
weight: 0.05
apply_after_step: 0
conf: false
depth_consis:
weight: 0.1
sigma_image: null
use_second_derivative: false
loss_type: MSE
wandb:
project: anysplat
entity: scene-representation-group
name: custom
mode: disabled
log_video: false
media_kwargs:
video: None
tags:
- custom
- 448x448
mode: train
data_loader:
train:
num_workers: 16
persistent_workers: true
batch_size: 4
seed: 1234
test:
num_workers: 4
persistent_workers: false
batch_size: 1
seed: 2345
val:
num_workers: 1
persistent_workers: true
batch_size: 1
seed: 3456
optimizer:
lr: 0.0002
warm_up_steps: 2000
backbone_lr_multiplier: 0.1
checkpointing:
load: ckpts/ckpt
every_n_train_steps: 2000
save_top_k: 5
save_weights_only: false
train:
output_path: ${hydra.run.dir}
depth_mode: null
extended_visualization: false
print_log_every_n_steps: 10
distiller: ''
distill_max_steps: 1000000
random_context_views: false
pose_loss_alpha: 1.0
pose_loss_delta: 1.0
cxt_depth_weight: 0.0
weight_pose: 10.0
weight_depth: 1.0
weight_normal: 0.0
test:
output_path: outputs/test-nopo
align_pose: true
pose_align_steps: 100
rot_opt_lr: 0.005
trans_opt_lr: 0.005
compute_scores: true
save_image: true
save_video: false
save_compare: true
generate_video: false
mode: inference
image_folder: examples/bungeenerf
seed: 111123
trainer:
max_steps: 30000
val_check_interval: 2000
gradient_clip_val: 0.5
num_nodes: 1
accumulate_grad_batches: 1
precision: bf16-mixed
dataset:
custom:
make_baseline_1: false
relative_pose: true
augment: true
background_color:
- 1.0
- 1.0
- 1.0
overfit_to_scene: null
skip_bad_shape: true
rescale_to_1cube: false
view_sampler:
name: all
num_context_views: 3
num_target_views: 1
max_img_per_gpu: 24
name: custom
roots:
- datasets/nuscenes
scenes_json_path: ./dataset_scenes.json
scene_data_json_path: ./dataset_scene_cams.ndjson
input_image_shape:
- 448
- 448
original_image_shape:
- 448
- 448
cameras_are_circular: false
baseline_min: 0.001
baseline_max: 100.0
max_fov: 180.0
avg_pose: false
intr_augment: true
normalize_by_pts3d: false
In the later stages of training, perhaps after epoch 9, I get a high PSNR of 40+ dB, for example:
One scene's 3 input images:
Fine-tuned model output:
I wanted to see the entire scene, so I modified the intrinsics to broaden the field of view of the center camera. However, the entire scene appears to have been split into 3 sub-scenes without alignment:
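For context, broadening the field of view by scaling down the focal lengths in the intrinsics can be sketched as below. This is an assumed approach for illustration, not the exact modification used.

```python
import numpy as np

def widen_fov(K: np.ndarray, factor: float) -> np.ndarray:
    """Return a copy of the 3x3 intrinsics with fx and fy divided by
    `factor`. Since fov_x = 2 * atan(w / (2 * fx)), dividing fx widens
    the horizontal field of view; the principal point is left unchanged."""
    K_wide = K.copy()
    K_wide[0, 0] /= factor  # fx
    K_wide[1, 1] /= factor  # fy
    return K_wide
```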
Before fine-tuning, although the PSNR was only 20+ dB and there was ghosting in the overlapping areas, the entire scene felt unified:
Is this phenomenon caused by overfitting to the dataset? Do you have any suggestions on how to resolve it?
Looking forward to your reply! Thank you!
