MVsplat
donydchen.github.io/mvsplat
(Fig. 1 graphic: Input Views, pixelSplat’s 3D Gaussians, MVSplat’s 3D Gaussians, and rendered Novel Views, alongside plots comparing pixelSplat and MVSplat.)
Fig. 1: Our MVSplat outperforms pixelSplat [1] in terms of both appearance and
geometry quality with 10× fewer parameters and more than 2× faster inference speed.
1 Introduction
We consider the problem of 3D scene reconstruction and novel view synthe-
sis from very sparse (i.e., as few as two) images in just one forward pass of a
trained model. While remarkable progress has been made using neural scene rep-
resentations, e.g., Scene Representation Networks (SRN) [34], Neural Radiance
Fields (NeRF) [24] and Light Field Networks (LFN) [33], these methods are still
not satisfactory for practical applications due to expensive per-scene optimiza-
tion [26, 39, 43], high memory cost [3, 18, 44] and slow rendering speed [40, 49].
Recently, 3D Gaussian Splatting (3DGS) [19] has emerged as an efficient and
expressive 3D representation thanks to its fast rendering speed and high quality.
Using rasterization-based rendering, 3DGS inherently avoids the expensive vol-
umetric sampling process of NeRF, leading to highly efficient and high-quality
3D reconstruction and novel view synthesis.
Very recently, several feed-forward Gaussian Splatting methods have been
proposed to explore 3D reconstruction from sparse view images, notably Splat-
ter Image [37] and pixelSplat [1]. Splatter Image regresses pixel-aligned Gaus-
sian parameters using a standard image-to-image architecture, which achieves
promising results for single-view object-level 3D reconstruction. However, recon-
structing a 3D scene from a single image is inherently ill-posed and ambiguous,
posing a significant challenge when applied to a more general and larger scene,
which is the key focus of our paper. pixelSplat [1] proposes to regress Gaussian
parameters for the binocular reconstruction problem. Specifically, it predicts
a probabilistic depth distribution for each input view and then samples depths
from that predicted distribution. Even though pixelSplat learns cross-view-aware
features with an epipolar Transformer, it is still challenging to predict a reliable
probabilistic depth distribution solely from image features, which makes pixelSplat’s geometry reconstruction comparably low in quality and prone to noisy artifacts (see Fig. 1 and Fig. 4). Improving its geometry reconstruction requires slow depth fine-tuning with an additional depth regularization loss.
To accurately localize the 3D Gaussian centers, our solution is to build a
cost volume representation via plane sweeping [7,46,48] in the 3D space. Specifi-
cally, the cost volume stores cross-view feature similarities for all potential depth
candidates, where the similarities provide valuable geometry cues for localizing 3D surfaces (i.e., a high similarity likely indicates a surface
point). With our cost volume representation, the task is formulated as learn-
ing to perform feature matching to identify the Gaussian centers, unlike the
data-driven 3D regression from image features in previous works [1, 37]. Such a
formulation reduces the task’s learning difficulty, enabling our method to achieve
state-of-the-art performance with lightweight model size and fast speed.
We obtain 3D Gaussian centers by unprojecting the multi-view-consistent depths estimated from our constructed multi-view cost volumes with a 2D network. In parallel with the depths, we also predict the other Gaussian properties (covariance, opacity, and spherical harmonics coefficients). This enables the
rendering of novel view images using the predicted 3D Gaussians with the differ-
entiable splatting operation [19]. Our full model MVSplat is trained end-to-end
purely with the photometric loss between rendered and ground truth images.
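To make the unprojection step concrete, below is a minimal Python (PyTorch) sketch that lifts one view’s predicted depth map to 3D points used as Gaussian centers. The function name, the camera-to-world pose convention, and the tensor layouts are assumptions for illustration only, not the released MVSplat implementation.
\begin{verbatim}
import torch

def depth_to_gaussian_centers(depth, K, cam_to_world):
    """Unproject a per-view depth map to 3D points used as Gaussian centers.

    depth:        (B, H, W)  predicted per-pixel depth for one input view
    K:            (B, 3, 3)  camera intrinsics
    cam_to_world: (B, 4, 4)  camera-to-world extrinsics (assumed convention)
    returns:      (B, H*W, 3) Gaussian centers in world coordinates
    """
    B, H, W = depth.shape
    device = depth.device
    v, u = torch.meshgrid(torch.arange(H, device=device),
                          torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float().view(1, 3, -1)

    rays = torch.linalg.inv(K) @ pix                 # camera-frame ray directions
    points_cam = rays * depth.view(B, 1, -1)         # scale rays by predicted depth
    points_h = torch.cat([points_cam,
                          torch.ones(B, 1, H * W, device=device)], dim=1)
    points_world = (cam_to_world @ points_h)[:, :3]  # (B, 3, H*W)
    return points_world.transpose(1, 2)              # (B, H*W, 3)

# The full set of Gaussian centers is the union (concatenation) over all K views:
# centers = torch.cat([depth_to_gaussian_centers(d, K_i, c2w_i) for ...], dim=1)
\end{verbatim}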
On the large-scale RealEstate10K [54] and ACID [21] benchmarks, MVS-
plat achieves state-of-the-art performance with the fastest feed-forward infer-
ence speed (22 fps). More impressively, compared to the state-of-the-art pixel-
Splat [1] (see Fig. 1), MVSplat uses 10× fewer parameters and infers more than
2× faster while providing higher appearance and geometry quality as well as
better cross-dataset generalization. Furthermore, extensive ablation studies and
analysis underscore the significance of our feature matching-based cost volume
design in enabling highly efficient feed-forward 3D Gaussian Splatting models.
2 Related Work
(Fig. 2 graphic: input images I^i, …, I^j → Multi-View Transformer (w/ cross-view attention) → Feature → matching → Cost Volume → 2D U-Net (w/ cross-view attention) → unproject → per-view Gaussian parameters (μ_i, α_i, Σ_i, c_i) and (μ_j, α_j, Σ_j, c_j) → union (∪) → 3D Gaussians → Render → Novel View.)
Fig. 2: Overview of MVSplat. Given multiple posed images as input, MVSplat first
extracts multi-view image features with a Transformer. Then, per-view cost volumes are constructed using plane sweeping. The Transformer features and cost volumes are
concatenated together as input to a 2D U-Net (with cross-view attention) for cost
volume refinement and predicting per-view depth maps. The per-view depth maps are
unprojected to 3D and combined using a simple deterministic union operation as the 3D
Gaussian centers. The opacity, covariance and color Gaussian parameters are predicted
jointly with the depth maps. Finally, novel views are rendered from the predicted 3D
Gaussians with the rasterization operation.
3 Method
f_{\bm \theta }: \{ ({\bm I}^{i}, {\bm P}^i) \}_{i=1}^K \mapsto \{(\bm {\mu }_j, \alpha _j, \bm {\Sigma }_j, \bm {c}_j )\}^{H \times W \times K}_{j=1}, (1)
We take view $i$'s cost volume construction as an example. Given the near and far depth ranges, we first uniformly sample $D$ depth candidates $\{d_m\}_{m=1}^{D}$ in the inverse depth domain and then warp view $j$'s feature ${\bm F}^j$ to view $i$ with the camera projection matrices ${\bm P}^i, {\bm P}^j$ and each depth candidate $d_m$, to obtain $D$ warped features
\label {eq:warp} {\bm F}^{j \to i}_{d_m} = \mathcal {W}({\bm F}^j, {\bm P}^i, {\bm P}^j, d_m) \in \mathbb {R}^{\frac {H}{4} \times \frac {W}{4} \times C}, \quad m = 1, 2, \cdots , D, (2)
where $\mathcal{W}$ denotes the warping operation [46]. We then compute the dot product [46, 47] between ${\bm F}^i$ and ${\bm F}^{j \to i}_{d_m}$ to obtain the correlation
\label {eq:correlation} {\bm C}_{d_m}^{i} = \frac {{\bm F}^i \cdot {\bm F}^{j \to i}_{d_m}}{\sqrt {C}} \in \mathbb {R}^{\frac {H}{4} \times \frac {W}{4}}, \quad m = 1, 2, \cdots , D. (3)
When more than two views are given as input, we similarly warp each of the other views' features to view i as in Eq. (2) and compute the corresponding correlations via Eq. (3). Finally,
all the correlations are pixel-wise averaged, enabling the model to accept an
arbitrary number of views as inputs.
Collecting all the correlations, we obtain view i's cost volume
\label {eq:cost_volume} {\bm C}^{i} = [{\bm C}_{d_1}^{i}, {\bm C}_{d_2}^{i}, \cdots , {\bm C}_{d_D}^{i}] \in \mathbb {R}^{\frac {H}{4} \times \frac {W}{4} \times D}. (4)
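To make Eqs. (2)–(4) concrete, the following is a minimal Python (PyTorch) sketch of the plane-sweep cost volume for one reference view. The function names (`warp_feature`, `build_cost_volume`), the pinhole-camera warping with separate intrinsics and relative extrinsics (rather than the combined projection matrices $\bm{P}$), and the tensor layouts are assumptions for illustration, not the released MVSplat code.
\begin{verbatim}
import torch
import torch.nn.functional as F

def warp_feature(feat_src, K_ref, K_src, T_src_from_ref, depth):
    """Warp source-view features into the reference view for one depth plane.

    feat_src:       (B, C, h, w) source-view features F^j (e.g., 1/4 resolution)
    K_ref, K_src:   (B, 3, 3)    intrinsics scaled to the feature resolution
    T_src_from_ref: (B, 4, 4)    rigid transform from reference to source camera
    depth:          scalar depth candidate d_m
    """
    B, C, h, w = feat_src.shape
    device = feat_src.device

    # Reference-view pixel grid in homogeneous coordinates.
    v, u = torch.meshgrid(torch.arange(h, device=device),
                          torch.arange(w, device=device), indexing="ij")
    ones = torch.ones_like(u)
    pix = torch.stack([u, v, ones], dim=0).float().view(1, 3, -1)   # (1, 3, h*w)

    # Back-project at the candidate depth, then map into the source camera.
    cam_ref = torch.linalg.inv(K_ref) @ pix * depth                 # (B, 3, h*w)
    cam_ref_h = torch.cat([cam_ref,
                           torch.ones(B, 1, h * w, device=device)], dim=1)
    cam_src = (T_src_from_ref @ cam_ref_h)[:, :3]                   # (B, 3, h*w)

    # Project to the source image plane; normalize to [-1, 1] for grid_sample.
    proj = K_src @ cam_src
    xy = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    grid_x = 2.0 * xy[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * xy[:, 1] / (h - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1).view(B, h, w, 2)

    return F.grid_sample(feat_src, grid, align_corners=True)        # F^{j->i}_{d_m}

def build_cost_volume(feat_ref, feats_src, K_ref, Ks_src,
                      Ts_src_from_ref, depth_candidates):
    """Per-view cost volume C^i of Eq. (4), stored here as (B, D, h, w):
    correlations for D depth candidates, averaged over all source views so
    any number of input views is supported."""
    B, C, h, w = feat_ref.shape
    volume = []
    for d in depth_candidates:                       # D inverse-depth-sampled candidates
        corr = 0.0
        for feat_src, K_src, T in zip(feats_src, Ks_src, Ts_src_from_ref):
            warped = warp_feature(feat_src, K_ref, K_src, T, d)     # Eq. (2)
            corr = corr + (feat_ref * warped).sum(dim=1) / C ** 0.5 # Eq. (3)
        volume.append(corr / len(feats_src))         # pixel-wise average over views
    return torch.stack(volume, dim=1)                # (B, D, h, w)
\end{verbatim}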
\label {eq:softmax_depth} {\bm V}^i = \mathrm {softmax} (\hat {\bm C}^i) {\bm G} \in \mathbb {R}^{H \times W}. (6)
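Eq. (6) regresses per-pixel depth as a softmax-weighted average over the depth candidates. Assuming $\hat{\bm C}^i$ denotes the refined full-resolution cost volume and ${\bm G}$ stacks the depth candidates (as the surrounding equations suggest), a minimal sketch:
\begin{verbatim}
import torch

def softmax_depth(cost_volume_refined, depth_candidates):
    """Soft-argmin depth regression in the spirit of Eq. (6).

    cost_volume_refined: (B, D, H, W) refined full-resolution cost volume
    depth_candidates:    (D,)         candidate depths d_1..d_D
    returns:             (B, H, W)    per-pixel depth map V^i
    """
    weights = torch.softmax(cost_volume_refined, dim=1)   # distribution over candidates
    return (weights * depth_candidates.view(1, -1, 1, 1)).sum(dim=1)
\end{verbatim}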
4 Experiments
4.1 Settings
Datasets. We assess our model on the large-scale RealEstate10K [54] and
ACID [21] datasets. RealEstate10K contains real estate videos downloaded from
YouTube, which are split into 67,477 training scenes and 7,289 testing scenes,
while ACID contains nature scenes captured by aerial drones, which are split
into 11,075 training scenes and 1,972 testing scenes. Both datasets provide esti-
mated camera intrinsic and extrinsic parameters for each frame. Following pixelSplat [1], we evaluate all methods on three target novel viewpoints for each test
Table 1: Comparisons with the state of the art. Running time includes both the encoder and the renderer; note that 3DGS-based methods (pixelSplat and MVSplat) render dramatically faster (∼500 FPS for rendering alone). Performances are averaged over thou-
sands of test scenes in each dataset. For each scene, the model takes two views as input
and renders three novel views for evaluation. MVSplat performs the best in terms of
all visual metrics and runs the fastest with a lightweight model size.
Fig. 3: Comparisons with the state of the art. The first three rows are from
RealEstate10K (indoor scenes), while the last one is from ACID (outdoor scenes).
Models are trained with a collection of training scenes from each indicated dataset, and
tested on novel scenes from the same dataset. MVSplat surpasses all other competitive
models in rendering challenging regions due to the effectiveness of our cost volume-
based geometry representation.
ments in the LPIPS metric, which is better aligned with human perception. This
includes pixelNeRF [49], GPNR [35], AttnRend [10] and pixelSplat [1], with re-
sults taken directly from the pixelSplat [1] paper, and the recent state-of-the-art
NeRF-based method MuRF [44], which we re-train and evaluate using the officially released code.
The qualitative comparisons of the three best-performing models are visualized in Fig. 3. MVSplat achieves the highest quality on novel view results even under challenging conditions, such as regions with repeated patterns (“window frames” in the 1st row), regions present in only one of the input views (“stair handrail” and “lampshade” in the 2nd and 3rd rows), or large-scale outdoor objects captured from distant viewpoints (“bridge” in the 4th row). The
baseline methods exhibit obvious artifacts for these regions, while MVSplat
shows no such artifacts due to our cost volume-based geometry representation.
More evidence and detailed analysis regarding how MVSplat effectively infers
the geometry structures are presented in Sec. 4.3.
elSplat degrade dramatically; the main reason is likely that pixelSplat relies purely on feature aggregations that are tied to the absolute scale of feature values, hindering its performance when it receives different image features from other datasets. The quantitative results reported in Tab. 2 further support this observation. Note that MVSplat significantly outperforms pixelSplat in terms of
LPIPS, and the gain is larger when the domain gap between source and tar-
get datasets becomes larger. More surprisingly, our cross-dataset generalization
results on ACID even slightly surpass the pixelSplat model that is specifically
trained on ACID (see Tab. 1). We attribute such results to the larger scale of
the RealEstate10K training set (∼ 7× larger than ACID) and our superior gen-
eralization ability. This also suggests the potential of our method for training on
more diverse and larger-scale datasets in the future.
(Fig. 6 panels: Input Views | Ground Truth | w/o cost volume | w/o cross-attn | w/o U-Net | base | base + refine)
4.3 Ablations
robustness. Due to the over-fitting issue, we report this variant specifically with
its best performance instead of the final over-fitted one.
Ablations on the cost volume refinement U-Net. The initial cost volume
might be less effective in challenging regions; thus, we propose to use a U-Net for refinement. To investigate its importance, we performed a study (“w/o U-Net”) that removes the U-Net architecture. Fig. 6 reveals that the “w/o U-Net” variant renders the middle regions, where content is present in both input views, as well as the “base” model does; however, its left and right parts, where content is present in only one of the input views, show obviously degraded quality compared with the “base” model. This is because our cost volume cannot find any
matches in these regions, leading to poorer geometry cues. In such a scenario, the
U-Net refinement is important for mapping high-frequency details from input
views to the Gaussian representation, resulting in an overall improvement of
∼ 0.7 dB PSNR as reported in Tab. 3.
Ablations on depth refinement. Additional depth refinement helps improve
the depth quality, ultimately leading to better visual quality. As shown in Tab. 3, “base + refine” achieves the best results and is therefore used as our final model.
More ablations. We further demonstrate in Appendix A of the supplementary material that our cost volume-based design can also greatly enhance pixelSplat [1], that the Swin Transformer is more suitable for MVSplat than the Epipolar Transformer [15], that MVSplat benefits from predicting more Gaussian primitives, and that MVSplat remains superior to existing methods even when the entire model is trained from completely random initialization.
5 Conclusion
References
1. Charatan, D., Li, S., Tagliasacchi, A., Sitzmann, V.: pixelsplat: 3d gaussian splats
from image pairs for scalable generalizable 3d reconstruction. In: CVPR (2024)
2. Chen, A., Xu, H., Esposito, S., Tang, S., Geiger, A.: Lara: Efficient large-baseline
radiance fields. In: ECCV (2024)
3. Chen, A., Xu, Z., Zhao, F., Zhang, X., Xiang, F., Yu, J., Su, H.: Mvsnerf: Fast
generalizable radiance field reconstruction from multi-view stereo. In: ICCV (2021)
4. Chen, G., Wang, W.: A survey on 3d gaussian splatting. arXiv (2024)
5. Chen, Y., Xu, H., Wu, Q., Zheng, C., Cham, T.J., Cai, J.: Explicit correspondence
matching for generalizable neural radiance fields. arXiv (2023)
6. Chibane, J., Bansal, A., Lazova, V., Pons-Moll, G.: Stereo radiance fields (srf):
Learning view synthesis for sparse views of novel scenes. In: CVPR (2021)
7. Collins, R.T.: A space-sweep approach to true multi-image matching. In: CVPR
(1996)
8. Deng, K., Liu, A., Zhu, J.Y., Ramanan, D.: Depth-supervised nerf: Fewer views
and faster training for free. In: CVPR (2022)
9. Ding, Y., Yuan, W., Zhu, Q., Zhang, H., Liu, X., Wang, Y., Liu, X.: Transmvs-
net: Global context-aware multi-view stereo network with transformers. In: CVPR
(2022)
10. Du, Y., Smith, C., Tewari, A., Sitzmann, V.: Learning to render novel views from
wide-baseline stereo pairs. In: CVPR (2023)
11. Fan, Z., Cong, W., Wen, K., Wang, K., Zhang, J., Ding, X., Xu, D., Ivanovic, B.,
Pavone, M., Pavlakos, G., et al.: Instantsplat: Unbounded sparse-view pose-free
gaussian splatting in 40 seconds. arXiv (2024)
12. Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P.,
Barron, J.T., Poole, B.: Cat3d: Create anything in 3d with multi-view diffusion
models. arXiv (2024)
13. Gu, X., Fan, Z., Zhu, S., Dai, Z., Tan, F., Tan, P.: Cascade cost volume for high-
resolution multi-view stereo and stereo matching. In: CVPR (2020)
14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: CVPR (2016)
15. He, Y., Yan, R., Fragkiadaki, K., Yu, S.I.: Epipolar transformers. In: CVPR (2020)
16. Henzler, P., Reizenstein, J., Labatut, P., Shapovalov, R., Ritschel, T., Vedaldi, A.,
Novotny, D.: Unsupervised learning of 3d object categories from videos in the wild.
In: CVPR (2021)
17. Jensen, R., Dahl, A., Vogiatzis, G., Tola, E., Aanæs, H.: Large scale multi-view
stereopsis evaluation. In: CVPR (2014)
18. Johari, M.M., Lepoittevin, Y., Fleuret, F.: Geonerf: Generalizing nerf with geom-
etry priors. In: CVPR (2022)
19. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for
real-time radiance field rendering. TOG 42(4) (2023)
20. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR
(2015)
21. Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N., Kanazawa, A.: Infinite
nature: Perpetual view generation of natural scenes from a single image. In: ICCV
(2021)
22. Liu, Y., Peng, S., Liu, L., Wang, Q., Wang, P., Theobalt, C., Zhou, X., Wang, W.:
Neural rays for occlusion-aware image-based rendering. In: CVPR (2022)
23. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin trans-
former: Hierarchical vision transformer using shifted windows. In: ICCV (2021)
24. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng,
R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: ECCV
(2020)
25. Miyato, T., Jaeger, B., Welling, M., Geiger, A.: Gta: A geometry-aware attention
mechanism for multi-view transformers. In: ICLR (2024)
26. Niemeyer, M., Barron, J.T., Mildenhall, B., Sajjadi, M.S., Geiger, A., Radwan, N.:
Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs.
In: CVPR (2022)
27. Reizenstein, J., Shapovalov, R., Henzler, P., Sbordone, L., Labatut, P., Novotny, D.:
Common objects in 3d: Large-scale learning and evaluation of real-life 3d category
reconstruction. In: ICCV (2021)
28. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: CVPR (2022)
29. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. In: MICCAI (2015)
30. Sajjadi, M.S., Meyer, H., Pot, E., Bergmann, U., Greff, K., Radwan, N., Vora, S.,
Lučić, M., Duckworth, D., Dosovitskiy, A., et al.: Scene representation transformer:
Geometry-free novel view synthesis through set-latent scene representations. In:
CVPR (2022)
31. Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection
for unstructured multi-view stereo. In: ECCV (2016)
32. Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: Mvdream: Multi-view diffusion
for 3d generation. ICLR (2024)
33. Sitzmann, V., Rezchikov, S., Freeman, B., Tenenbaum, J., Durand, F.: Light field
networks: Neural scene representations with single-evaluation rendering. NeurIPS
(2021)
34. Sitzmann, V., Zollhöfer, M., Wetzstein, G.: Scene representation networks: Con-
tinuous 3d-structure-aware neural scene representations. NeurIPS (2019)
35. Suhail, M., Esteves, C., Sigal, L., Makadia, A.: Generalizable patch-based neural
rendering. In: ECCV (2022)
36. Szymanowicz, S., Insafutdinov, E., Zheng, C., Campbell, D., Henriques, J., Rup-
precht, C., Vedaldi, A.: Flash3d: Feed-forward generalisable 3d scene reconstruction
from a single image. arXiv (2024)
37. Szymanowicz, S., Rupprecht, C., Vedaldi, A.: Splatter image: Ultra-fast single-view
3d reconstruction. In: CVPR (2024)
38. Tang, J., Chen, Z., Chen, X., Wang, T., Zeng, G., Liu, Z.: Lgm: Large multi-view
gaussian model for high-resolution 3d content creation. arXiv (2024)
39. Truong, P., Rakotosaona, M.J., Manhardt, F., Tombari, F.: Sparf: Neural radiance
fields from sparse and noisy poses. In: CVPR (2023)
40. Wang, Q., Wang, Z., Genova, K., Srinivasan, P.P., Zhou, H., Barron, J.T., Martin-
Brualla, R., Snavely, N., Funkhouser, T.: Ibrnet: Learning multi-view image-based
rendering. In: CVPR (2021)
41. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment:
from error visibility to structural similarity. TIP 13(4) (2004)
42. Wewer, C., Raj, K., Ilg, E., Schiele, B., Lenssen, J.E.: latentsplat: Autoencoding
variational gaussians for fast generalizable 3d reconstruction. ECCV (2024)
43. Wu, R., Mildenhall, B., Henzler, P., Park, K., Gao, R., Watson, D., Srinivasan,
P.P., Verbin, D., Barron, J.T., Poole, B., et al.: Reconfusion: 3d reconstruction
with diffusion priors. arXiv (2023)
44. Xu, H., Chen, A., Chen, Y., Sakaridis, C., Zhang, Y., Pollefeys, M., Geiger, A.,
Yu, F.: Murf: Multi-baseline radiance fields. In: CVPR (2024)
45. Xu, H., Zhang, J., Cai, J., Rezatofighi, H., Tao, D.: Gmflow: Learning optical flow
via global matching. In: CVPR (2022)
46. Xu, H., Zhang, J., Cai, J., Rezatofighi, H., Yu, F., Tao, D., Geiger, A.: Unifying
flow, stereo and depth estimation. PAMI (2023)
47. Xu, H., Zhang, J.: Aanet: Adaptive aggregation network for efficient stereo match-
ing. In: CVPR (2020)
48. Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L.: Mvsnet: Depth inference for unstruc-
tured multi-view stereo. In: ECCV (2018)
49. Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelnerf: Neural radiance fields from
one or few images. In: CVPR (2021)
50. Yu, Z., Peng, S., Niemeyer, M., Sattler, T., Geiger, A.: Monosdf: Exploring monoc-
ular geometric cues for neural implicit surface reconstruction. NeurIPS (2022)
51. Zhang, K., Bi, S., Tan, H., Xiangli, Y., Zhao, N., Sunkavalli, K., Xu, Z.: Gs-lrm:
Large reconstruction model for 3d gaussian splatting. arXiv (2024)
52. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable
effectiveness of deep features as a perceptual metric. In: CVPR (2018)
53. Zheng, S., Zhou, B., Shao, R., Liu, B., Zhang, S., Nie, L., Liu, Y.: Gps-gaussian:
Generalizable pixel-wise 3d gaussian splatting for real-time human novel view syn-
thesis. In: CVPR (2024)
54. Zhou, T., Tucker, R., Flynn, J., Fyffe, G., Snavely, N.: Stereo magnification: learn-
ing view synthesis using multiplane images. TOG p. 65 (2018)
All experiments in this section follow the same settings as in Sec. 4.3 unless otherwise specified: models are trained on RealEstate10K [54] and results are reported by averaging over the full test set.
Using cost volume in pixelSplat. In the main paper, we have demonstrated
the importance of our cost volume design for learning feed-forward Gaussian
models. We note that this concept is general and not tied to a specific architecture. To verify this, we replace pixelSplat’s probability density-based depth prediction module with our cost volume-based approach while keeping the other components intact. The modified model significantly outperforms the original pixelSplat in Tab. A, again demonstrating the importance of the cost volume and indicating the general applicability of our proposed method.
Table A: Using cost volume in pixelSplat. Our cost volume-based depth predic-
tion approach can also be used in the pixelSplat [1] model by replacing its probability density-based depth branch, significantly boosting its performance and demonstrating the general applicability of our method.
Setups: base + refine | base | w/o U-Net | w/o cross-view attention | w/o cost volume
Fig. A: Validation curves of the ablations. The setup of each model is shown at the top and matches the one in Tab. 3 of the main paper. The cost volume
plays a fundamental role in our full model, and interestingly, the model without cross-
view attention suffers from over-fitting after certain training iterations.
(Fig. B panels: Input | MVSplat | Ground Truth | Error Map)
Fig. B: Failure cases. Our MVSplat might be less effective on the non-Lambertian
and reflective surfaces.
Limitation. Our MVSplat might be less effective on non-Lambertian and re-
flective surfaces, as shown in Fig. B. Integrating additional BRDF properties into the rendering and training the model on more diverse datasets might help address this issue in the future.
Potential negative societal impacts. Our model may produce unreliable
outcomes, particularly when applied to complex real-world scenes. Therefore, it
is imperative to exercise caution when implementing our model in safety-critical
situations, e.g., when augmenting data to train models for autonomous vehicles
with synthetic data rendered from our model.
Fig. C: More comparisons with the state of the art. These are the extended
visual results of Fig. 3. Scenes in the first four rows come from RealEstate10K, while scenes in the last two rows come from ACID. Our MVSplat performs the best
in all cases.
different cost volumes. The subsequent depth refinement U-Net shares a similar configuration, except that we apply 2× down-sampling 4 times and add attention at the 16× down-sampled level.
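As a rough illustration of such a configuration, the sketch below builds a small U-Net in Python (PyTorch) with four 2× down-samplings and self-attention at the 16× down-sampled level. The channel widths, the attention block, and the one-channel output head are assumptions for illustration, not the released architecture.
\begin{verbatim}
import torch
import torch.nn as nn

class DepthRefineUNet(nn.Module):
    """Minimal U-Net sketch: four 2x down-samplings, self-attention at 16x."""

    def __init__(self, in_ch=64, out_ch=1, base=32, heads=4):
        super().__init__()
        chs = [base * 2 ** i for i in range(5)]      # e.g. 32, 64, 128, 256, 512
        self.stem = nn.Conv2d(in_ch, chs[0], 3, padding=1)
        self.downs = nn.ModuleList([                 # 4 times of 2x down-sampling
            nn.Sequential(nn.Conv2d(chs[i], chs[i + 1], 3, stride=2, padding=1),
                          nn.GELU())
            for i in range(4)
        ])
        self.attn = nn.MultiheadAttention(chs[4], heads, batch_first=True)
        self.ups = nn.ModuleList([
            nn.ConvTranspose2d(chs[i + 1], chs[i], 2, stride=2)
            for i in reversed(range(4))
        ])
        self.fuse = nn.ModuleList([
            nn.Sequential(nn.Conv2d(2 * chs[i], chs[i], 3, padding=1), nn.GELU())
            for i in reversed(range(4))
        ])
        self.head = nn.Conv2d(chs[0], out_ch, 3, padding=1)

    def forward(self, x):
        x = self.stem(x)
        skips = []
        for down in self.downs:
            skips.append(x)
            x = down(x)                              # 16x down-sampled after the loop

        b, c, h, w = x.shape                         # self-attention at the coarsest level
        tokens = x.flatten(2).transpose(1, 2)        # (B, h*w, C)
        x = x + self.attn(tokens, tokens, tokens)[0].transpose(1, 2).view(b, c, h, w)

        for up, fuse, skip in zip(self.ups, self.fuse, reversed(skips)):
            x = fuse(torch.cat([up(x), skip], dim=1))  # decoder with skip connections
        return self.head(x)                           # e.g. a per-pixel depth residual
\end{verbatim}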
More training details. As mentioned above, we initialize the backbone of MVSplat in all experiments with the UniMatch [46] pretrained weights. Our default model is trained on a single A100 GPU. The batch size is set to 14, where each batch element contains one training scene with two input views and four target views. Similar to pixelSplat [1], the frame distance between the two input views is gradually increased as training progresses. For both RealEstate10K and ACID, we empirically set the near and far depth planes to 1 and 100, respectively, while for DTU, we set them to 2.125 and 4.525 as provided by the dataset.
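For reference, the hyper-parameters stated above can be gathered into a small configuration sketch; the field names are hypothetical and only the values come from the text.
\begin{verbatim}
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Backbone initialized from UniMatch [46] pretrained weights; single A100 GPU.
    backbone_pretrained: str = "unimatch"
    # Each batch element is one scene with 2 input views and 4 target views.
    batch_size: int = 14
    num_input_views: int = 2
    num_target_views: int = 4
    # Near/far depth planes used for plane sweeping on RealEstate10K and ACID.
    near_depth: float = 1.0
    far_depth: float = 100.0
    # For DTU, the dataset-provided range is used instead: near = 2.125, far = 4.525.
\end{verbatim}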