Enquiry regarding training and dataset downloading

Hi Authors,

Thanks for your great work and your beautiful and clean code!

I have 4 questions about training and dataset downloading.

1. Can you elaborate on the training details, such as GPU type, number of GPUs, and how long the training process takes?
2. I followed the download process from the original BEDLAM GitHub repository, but there is a lot of data and I am not sure which parts are used for this training process. Can you elaborate on that as well? I currently do not have much free disk space.

The reason I ask is that your tram script uses jpg images, but I cannot find any jpg files on the official BEDLAM website.

The following script is from the official BEDLAM download page.

```
Run this script with desired target data type from that folder 
#      to download data.
#      Do not use `all` if you don't need depth data.
#      We recommend to start with smallest folder first. 
#      + `all`:   will download everything (depth,gt,masks,mp4,png), ~6TB local space needed
#      + `depth`: depth images (EXR), ~3.8TB
#      + `gt`:    scene ground truth (CSV), ~100MB
#      + `masks`: segmentation masks (PNG), ~30GB
#      + `mp4`:   movies (MP4), ~20GB
#      + `png`:   image sequences (PNG), ~2.2TB
#
#      Example: `bash ./be_download.sh mp4` 
```
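To show why disk space is a concern, here is a minimal sketch that compares the sizes quoted above against the free space on a drive. The sizes are copied from the script comments (I am assuming decimal units, 1 TB = 1000 GB), and the helper names `required_gb` and `fits_on_disk` are my own, not part of BEDLAM or TRAM.

```python
import shutil

# Approximate sizes in GB, copied from the be_download.sh comments quoted above.
# Assumption: decimal units (1 TB = 1000 GB).
BEDLAM_SIZES_GB = {
    "depth": 3800.0,
    "gt": 0.1,
    "masks": 30.0,
    "mp4": 20.0,
    "png": 2200.0,
}


def required_gb(data_types):
    """Sum the approximate download size for the chosen data types."""
    return sum(BEDLAM_SIZES_GB[t] for t in data_types)


def fits_on_disk(data_types, path="."):
    """Return True if the chosen data types fit in the free space at `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return required_gb(data_types) <= free_gb


if __name__ == "__main__":
    # Two hypothetical selections; which one training needs is my question.
    for choice in (["gt", "mp4"], ["gt", "png"]):
        print(choice, f"~{required_gb(choice):.1f} GB needed,",
              "fits:", fits_on_disk(choice))
```

If training only needs `gt` plus `mp4`, the download is about 20 GB; if it needs `png`, it is over 2 TB, which is why I would like to know the exact subset.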

3. I am a little confused by this part of estimate_camera.py:
camera = {'pred_cam_R': cam_R.numpy(), 'pred_cam_T': cam_T.numpy(), 
          'world_cam_R': wd_cam_R.numpy(), 'world_cam_T': wd_cam_T.numpy(),
          'img_focal': cam_int[0], 'img_center': cam_int[2:], 'spec_focal': spec_f}
I commented out 'pred_cam_R' and 'pred_cam_T' and the results still look good, so I am not sure what these two values are used for.
It looks like 'world_cam_R' (wd_cam_R.numpy()) and 'world_cam_T' (wd_cam_T.numpy()) form the extrinsic matrix.
For the intrinsic matrix, is it fx = fy = 'img_focal' and (cx, cy) = 'img_center' in your case? Have you tried using different focal lengths for x and y? What does 'spec_focal' mean? As for the skew factor s, is it 0 in all of your cases?
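To make the question concrete, here is a minimal sketch of how I currently read these fields. It rests on assumptions I would like confirmed: fx = fy = 'img_focal', (cx, cy) = 'img_center', skew 0, and a rotation plus translation that together map a point into the camera frame. The helper names `build_intrinsics` and `build_extrinsics` and the numeric values are my own and do not come from the repository.

```python
import numpy as np


def build_intrinsics(img_focal, img_center, skew=0.0):
    """Assemble a 3x3 pinhole matrix K, assuming fx = fy = img_focal."""
    f = float(img_focal)
    cx, cy = (float(v) for v in img_center)
    return np.array([[f, skew, cx],
                     [0.0, f, cy],
                     [0.0, 0.0, 1.0]])


def build_extrinsics(R, T):
    """Stack one 3x3 rotation and one 3-vector translation into a 4x4 matrix."""
    E = np.eye(4)
    E[:3, :3] = np.asarray(R, dtype=float)
    E[:3, 3] = np.asarray(T, dtype=float).reshape(3)
    return E


if __name__ == "__main__":
    # Made-up single-frame values for illustration only.
    K = build_intrinsics(600.0, (320.0, 240.0))
    E = build_extrinsics(np.eye(3), [0.0, 0.0, 2.0])

    # Assumption: E maps a homogeneous world point into the camera frame.
    point_world = np.array([0.0, 0.0, 0.0, 1.0])
    point_cam = (E @ point_world)[:3]
    uvw = K @ point_cam
    print(uvw[:2] / uvw[2])  # pixel coordinates of the projected point
```

If the stored 'world_cam_R' and 'world_cam_T' are instead camera-to-world, the 4x4 matrix would need to be inverted before projecting, which is part of what I am unsure about.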

4. I noticed an obvious drop in model performance on relatively dense multi-person videos. People were floating in the air and the model struggled to place them on the same plane. Have you encountered this, and do you have any possible solutions?

https://github.com/user-attachments/assets/b0de4924-f040-4093-808c-7ef15ab2364c

Thanks a lot for your time and help in advance!
