Thank you for your work!
I ran inference on custom data using only one frame with 6 surround-view images.
The prediction results are poor; in particular, the relative positions of the surround views seem to be predicted incorrectly.
Could you please help me understand what might be going wrong?
Does this have anything to do with the input order of the images?

(Image A: ground-truth pointmap)

(Image B: inference result)
I also cropped some of the image borders, and the results became even worse.
Does the model require all surround-view images to have the same size?

(Image C: inference result after cropping)
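For reference, this is roughly how I handled the crop. I did not adjust any camera intrinsics, since I'm not sure whether the model consumes them; if it does, cropping shifts the principal point, and the sketch below (my own illustration with made-up matrix values, not this repo's API) shows the correction I believe would be needed:

```python
import numpy as np

def crop_with_intrinsics(image, K, top, left, new_h, new_w):
    """Crop an image and shift the principal point so the
    pinhole intrinsics still describe the cropped view.

    image: (H, W, C) array
    K:     3x3 intrinsic matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    """
    cropped = image[top:top + new_h, left:left + new_w]
    K_new = K.astype(float).copy()
    K_new[0, 2] -= left  # cx shifts by the left crop offset
    K_new[1, 2] -= top   # cy shifts by the top crop offset
    return cropped, K_new

# toy example: 900x1600 image, 100 px cropped from the top and left
img = np.zeros((900, 1600, 3), dtype=np.uint8)
K = np.array([[1266.0, 0.0, 800.0],
              [0.0, 1266.0, 450.0],
              [0.0, 0.0, 1.0]])
cropped, K_new = crop_with_intrinsics(img, K, top=100, left=100,
                                      new_h=800, new_w=1500)
print(cropped.shape, K_new[0, 2], K_new[1, 2])  # (800, 1500, 3) 700.0 350.0
```

Is a correction like this expected on the model's input side, or does the model estimate geometry purely from the images?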