Figure 1. We propose DP-Recon, which capitalizes on pre-trained diffusion models for complete and decompositional neural scene reconstruction. This approach significantly improves reconstruction quality in less captured regions, where previous methods often struggle. Additionally, our method enables flexible text-based editing of geometry and appearance, as well as photorealistic VFX editing. (Panels compare ObjectSDF++ with 10 and 100 views against ours with 10 views, and show the reconstructed mesh in Blender, object geometry editing, object appearance editing, scene stylization, and VFX editing.)
neural implicit representations [23, 48] have enabled significant progress in novel-view rendering [2, 12, 80, 81] and 3D geometry reconstruction [53, 72, 83]. Despite these advances, existing approaches are limited by representing an entire scene as a whole. Decompositional reconstruction [25, 78], on the other hand, aims to break the implicit 3D representation down into individual objects in the scene and facilitate broader applications in embodied AI [1, 14, 16, 17, 26], robotics [13, 21, 40, 64, 92], and more [6, 9].

Existing methods [33, 38, 49, 79] for decompositional neural reconstruction still fall short of what downstream applications expect: they struggle to reconstruct complete 3D geometry and accurate appearance (see Fig. 1), especially in less densely captured or heavily occluded areas with sparse inputs. To address the challenge of sparse-view reconstruction, many approaches incorporate semantic or geometric regularizations [18, 24, 52, 81]. Still, they often exhibit significant degradation in non-observable regions, since they fail to provide additional information for the underconstrained areas. Thus, we believe the key is to introduce supplementary information for these areas based on the observations from known views.

In this paper, we propose DP-Recon to facilitate decompositional neural reconstruction with a generative diffusion prior. Given multiple posed images, the neural implicit representation is optimized to represent both individual objects and the background within the scene. Besides the reconstruction loss, we employ a 2D diffusion model as a critic to supervise the optimization of each object through SDS [55], which iteratively refines the 3D representation by evaluating the quality of novel views produced by differentiable rendering. We use the pretrained Stable Diffusion [58], a more general diffusion model, without fine-tuning on specific datasets. We meticulously design the optimization pipeline so that the generative prior optimizes both the geometry and appearance of each object alongside the reconstruction loss, filling in the missing information in unobserved and occluded regions.

However, directly integrating the diffusion prior into the reconstruction pipeline may compromise overall consistency, particularly in observed regions, due to the potential conflict between the two objectives. Ideally, we want to preserve the visible areas from the input images while the diffusion prior completes the rest. To alleviate this problem, we propose a novel visibility modeling approach that captures the visibility of 3D points across the input views using a learnable grid. The visibility information is derived from the accumulated transmittance in volume rendering, enabling us to optimize the visibility grid without introducing computationally intensive external visibility priors [65]. For each novel view, a visibility map can be rendered from this grid and used to dynamically adjust the per-pixel SDS and rendering loss weights, benefiting both the geometry and appearance optimization stages.

Extensive experimental results on Replica [67] and ScanNet++ [85] demonstrate that our method significantly surpasses all state-of-the-art methods in both geometry and appearance reconstruction, particularly in heavily occluded regions. Remarkably, with only 10 input views, our method achieves object reconstruction quality superior to baseline methods that rely on 100 input views for heavily occluded scenes, as shown in Fig. 1. Ablative studies highlight the effectiveness of incorporating the generative diffusion prior with visibility guidance. Our method enables seamless scene-level and object-level editing, e.g., geometry and appearance stylization, using SDS optimization. It produces decomposed object meshes with detailed UV maps, enabling photorealistic rendering and VFX editing in common 3D software, thereby supporting various downstream applications.

In summary, our main contributions are three-fold:
• We introduce DP-Recon, a novel method that incorporates a generative prior into decompositional scene reconstruction, significantly improving geometry and appearance recovery, particularly in heavily occluded regions.
• We propose a visibility-guided approach to dynamically adjust the SDS loss, alleviating the conflict between the reconstruction objective and the generative prior guidance.
• Extensive experiments demonstrate that our model significantly enhances both geometry and appearance. Our method enables seamless geometry and appearance editing, yielding decomposed object meshes with detailed UV maps for broad downstream applications.

2. Related Work

2.1. Neural Implicit Surface Reconstruction

Recent advances in neural implicit representations [7, 8, 35, 44, 50, 51, 89] have inspired efforts to bridge the volume density in Neural Radiance Fields (NeRF) [48, 90] with iso-surface representations, e.g., occupancy [47] or signed distance functions (SDF) [53, 54, 72, 83], enabling reconstruction from 2D images. To facilitate practical applications like scene editing [69, 73] and manipulation [30, 32], advanced methods [25, 43, 78] target compositional scene reconstruction by decomposing the implicit 3D representation into individual objects, incorporating additional information from semantics [33, 79] or physics simulation [49, 82]. While these methods achieve plausible object disentanglement, they still face significant challenges in recovering complete objects and smooth backgrounds, especially in regions unobserved from the input images, such as uncaptured areas or objects behind occlusions. In this paper, we aim to enhance the reconstruction quality of both geometry and appearance, recovering objects with complete shape and texture for more versatile downstream applications [13, 19, 20, 39].
2.2. Sparse-view NeRF

NeRFs [2, 3, 48], while demonstrating impressive results, rely on hundreds of posed images during training to effectively optimize the 3D representation. Many works have attempted to reduce NeRF's reliance on dense image capture through various regularization techniques [15, 27, 61, 65, 66, 68, 70, 75, 77, 84, 86]. For example, RegNeRF [52] uses a depth smoothness loss alongside a normalizing flow to regularize both geometry and appearance in novel views. Other methods incorporate depth information from Structure-from-Motion (SfM) [11, 57] or monocular estimators [71, 88]. DietNeRF [18] improves novel-view geometry through cross-view semantic consistency, while FreeNeRF [81] reduces artifacts in novel views by regularizing the frequency of NeRF's positional encoding features. While these methods yield plausible results in regions with limited image coverage, they still fail in areas with no captured observations. We argue that the key to addressing this issue lies in introducing external knowledge, i.e., from pretrained diffusion models, and harmonizing generative and reconstruction guidance.

2.3. Diffusion Prior for 3D Reconstruction

Recently, diffusion models have proven effective in providing prior knowledge for reconstruction [5, 36, 42, 56, 62, 93]. DreamFusion [55] introduces SDS with Stable Diffusion [58] to guide 3D object generation from text prompts. Methods such as ReconFusion [80], NeRFiller [76], MVIP-NeRF [4], and ExtraNeRF [63] employ fine-tuned 2D diffusion models to recover or inpaint high-fidelity NeRFs from sparse input views. More recent approaches leverage video diffusion models [12, 34, 37, 41, 46] for improved consistency across views. However, these methods focus only on novel view synthesis, applying diffusion priors to the entire scene without awareness of individual objects. While their results seem reasonable, they fail to maintain 3D consistency of objects across views, do not recover the 3D geometry of objects, and cannot reconstruct regions behind occlusions. In contrast, our method leverages the benefits of decompositional scene reconstruction and applies a generative prior to each object. This substantially enhances the reconstruction quality of both individual objects and the overall scene, in terms of both geometry and appearance. Moreover, we identify a critical issue overlooked in prior work, the conflict between generative and reconstruction guidance, and introduce a novel visibility-guided strategy that dynamically adjusts the SDS loss during training, effectively resolving this conflict.

3. Method

Given a set of posed RGB images and corresponding instance masks, we aim to reconstruct the geometry and appearance of the objects and the background in the scene. Fig. 2 presents an overview of our proposed DP-Recon.

3.1. Background

SDF-based Neural Implicit Surfaces. NeRF [48] learns a density field σ(p) and a color field c(p, d) from input images for each point p and view direction d. To reconstruct the 3D geometric surface, current approaches [72, 83] replace the NeRF density σ(p) with a learnable transformation of the SDF value s(p), defined as the signed distance from point p to the boundary surface. Following MonoSDF [88], for a ray r with view direction d, we use the SDF s(p_i) and the color c(p_i, d) of each point p_i along the ray to render the color Ĉ(r) by volume rendering:

\hat{C}(r) = \sum_{i=0}^{n-1} T_i \alpha_i c_i, \qquad (1)

where T_i is the discrete accumulated transmittance and α_i is the discrete opacity, defined as

T_i = \prod_{j=0}^{i-1} (1 - \alpha_j), \qquad \alpha_i = 1 - \exp(-\sigma_i \delta_i), \qquad (2)

where δ_i denotes the distance between neighboring sample points along the ray. The depth D̂(r) and normal N̂(r) can also be derived through volume rendering.

Object-compositional Scene Reconstruction. Following previous work [33, 49, 78, 79], we consider the decompositional reconstruction of objects using their corresponding masks and treat the background as an object. Specifically, for a scene with k objects, we predict k SDFs for each point p, where the j-th SDF s_j(p) (1 ≤ j ≤ k) corresponds to the j-th object. The scene SDF s(p) is the minimum of the object SDFs. We use the SDF s_j(p) and the color c(p_i, d) to render the color Ĉ_j(r), depth D̂_j(r), and normal N̂_j(r) of the j-th object. See the supplementary for more details.

Score Distillation Sampling. DreamFusion [55] enables the optimization of any differentiable image generator, e.g., a 3D NeRF, from textual descriptions by employing a pretrained 2D diffusion model [58, 59]. Formally, let x = g(θ) represent an image rendered by a differentiable generator g with parameters θ. DreamFusion leverages a diffusion model ϕ to provide a score function ϵ̂_ϕ(x_t; y, t), which predicts the sampled noise ϵ given the noisy image x_t, text embedding y, and noise level t. This score function guides the direction of the gradient for updating the parameters θ, and the gradient is calculated by Score Distillation Sampling (SDS):

\nabla_\theta \mathcal{L}_{SDS}(\phi, x) = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\hat{\epsilon}_\phi(x_t; y, t) - \epsilon\big)\, \frac{\partial x}{\partial \theta} \right], \qquad (3)

where w(t) is a weighting function.
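To make Eq. (3) concrete, the following is a minimal PyTorch-style sketch of one SDS update. The `render_and_encode` and `predict_noise` callables, the timestep range, and the weighting w(t) = 1 − ᾱ_t are illustrative assumptions rather than DP-Recon's actual implementation; the key point is that the frozen critic's noise residual is injected as a gradient on the rendered latent, so backpropagation only needs to supply ∂z/∂θ.

```python
import torch

def sds_gradient_step(render_and_encode, predict_noise, text_emb,
                      alphas_cumprod, optimizer):
    """One Score Distillation Sampling update following Eq. (3) -- a sketch.

    render_and_encode: callable returning the latent z of the current rendering,
                       differentiable w.r.t. the 3D parameters theta.
    predict_noise:     frozen diffusion noise predictor eps_hat(z_t; y, t).
    alphas_cumprod:    1-D tensor of cumulative alphas of the diffusion schedule.
    """
    z = render_and_encode()                                  # z depends on theta
    t = torch.randint(20, len(alphas_cumprod) - 20, (1,), device=z.device)
    a_t = alphas_cumprod[t].view(-1, *([1] * (z.dim() - 1)))

    eps = torch.randn_like(z)
    z_t = a_t.sqrt() * z + (1.0 - a_t).sqrt() * eps          # forward diffusion of z

    with torch.no_grad():                                    # the critic stays frozen
        eps_hat = predict_noise(z_t, t, text_emb)

    w_t = 1.0 - a_t                                          # one common choice of w(t)
    grad = w_t * (eps_hat - eps)                             # score residual of Eq. (3)

    # Skip the U-Net Jacobian: treating `grad` as dL/dz lets autograd supply
    # only dz/dtheta, which is exactly the SDS gradient.
    loss = (grad.detach() * z).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```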
Figure 2. Overview of DP-Recon. We first use the reconstruction loss L_recon for decompositional neural reconstruction, followed by a prior-guided geometry optimization stage that incorporates the SDS loss L^{g-v}_SDS. We finally export the object meshes and optimize their appearance with L^{a-v}_SDS. The visibility balances the guidance from the prior and the reconstruction by dynamically adjusting the per-pixel SDS loss.
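As a reading aid for the pipeline in Fig. 2, the sketch below spells out the three-stage schedule in plain Python; every callable is a placeholder standing in for the components described in Secs. 3.2-3.4, not the released implementation.

```python
from typing import Any, Callable, Sequence

def train_dp_recon_schedule(recon_step: Callable[[], None],
                            fit_visibility_grid: Callable[[], Any],
                            geometry_sds_step: Callable[[Any], None],
                            export_meshes: Callable[[], Sequence[Any]],
                            appearance_step: Callable[[Sequence[Any], Any], None],
                            n_recon: int, n_geo: int, n_app: int) -> Sequence[Any]:
    """Schematic three-stage schedule of Fig. 2 (placeholders, not the actual code)."""
    # Stage 1: decompositional reconstruction with the reconstruction loss only.
    for _ in range(n_recon):
        recon_step()

    # Fit the visibility grid from accumulated transmittance, then keep it frozen.
    visibility = fit_visibility_grid()

    # Stage 2: reconstruction loss + visibility-weighted geometry SDS per object.
    for _ in range(n_geo):
        geometry_sds_step(visibility)

    # Stage 3: export per-object meshes and optimize their appearance with
    # visibility-weighted appearance SDS and the color rendering loss.
    meshes = export_meshes()
    for _ in range(n_app):
        appearance_step(meshes, visibility)
    return meshes
```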
3.2. 3D Reconstruction with Generative Priors

The latent neural representation of the 3D scene is primarily optimized by the reconstruction loss L_recon derived from volume rendering, following prior work [33, 78, 79]. However, regions with sparse capture or heavy occlusion often lead to suboptimal geometry and appearance recovery due to insufficient information for reconstruction guidance. To mitigate this gap, we introduce a diffusion prior to optimize the 3D model, in both geometry and appearance, so that it looks realistic at novel unobserved views.

Prior-guided Geometry Optimization. We adopt the decompositional neural implicit surface as our 3D representation, which is parameterized by a series of multi-layer perceptrons (MLPs) with parameters θ. The rendering functions in Sec. 3.1 serve as the image generator g(θ). At each training iteration, we sample the j-th object and render its normal map and mask map at a randomly sampled camera pose. Following previous work [5, 56], we use a concatenated map ñ_j of the normal and mask maps as the input to the diffusion model to improve the stability of geometric optimization. We then employ the SDS loss to compute the gradient for updating θ as follows:

\nabla_\theta \mathcal{L}^{g}_{SDS} = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\hat{\epsilon}_\phi(z_t; y, t) - \epsilon\big)\, \frac{\partial z}{\partial \tilde{n}_j} \frac{\partial \tilde{n}_j}{\partial \theta} \right], \qquad (4)

where z is the latent code of ñ_j. The background is also treated as one object for geometry optimization.

Prior-guided Appearance Optimization. To produce object meshes with detailed UV maps, which are friendly to photorealistic rendering in common 3D software and enable more downstream applications, we directly optimize the mesh appearance rather than NeRF's appearance field. More specifically, we export the mesh of each object after the geometry optimization stage. Using NVDiffrast [28] for differentiable mesh rendering, we employ another small network ψ to predict the color of the mesh surface points. At each training iteration, the color map c_j of the j-th object is rendered at a randomly selected camera view, and the appearance SDS loss is used to compute the gradient for updating ψ:

\nabla_\psi \mathcal{L}^{a}_{SDS} = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\hat{\epsilon}_\phi(z_t; y, t) - \epsilon\big)\, \frac{\partial z}{\partial c_j} \frac{\partial c_j}{\partial \psi} \right], \qquad (5)

where z is the latent code of c_j. Note that the color rendering loss from the input views is also used to optimize ψ.

Background Appearance Optimization. Applying the appearance SDS of Eq. (5) to background optimization can lead to degenerate results, e.g., introducing non-existent objects, as the background lacks clear geometric cues from the local camera perspective. To mitigate this shape-appearance ambiguity, we use depth-guided inpainting [91] for the background panorama color map and employ the inpainted panorama to supervise the background color during the appearance optimization stage. The inpainting mask is based on the visibility of each pixel in the panorama, derived from our visibility modeling introduced in Sec. 3.3.

3.3. Visibility-guided Optimization

Score Distillation Sampling (SDS), despite its wide application, has been shown to suffer from significant artifacts [29, 87], such as oversaturation, oversmoothing, and low diversity, as well as optimization instability [45, 74]. These issues become even more pronounced when optimizing the latent 3D representation under both reconstruction and SDS guidance, due to their potential conflict, leading to inconsistencies with the observations. We address this problem with a visibility-guided approach that adjusts the geometry and appearance SDS losses based on how visible each pixel rendered from a novel view is in the input views.

Visibility Modeling. We introduce a learnable visibility grid G to model the visibility v of a 3D point p in the input views. We employ view-independent modeling for visibility, i.e., v = G(p), as visibility depends only on the input views and is independent of the ray direction from novel views.
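A minimal sketch of such a visibility grid, assuming a dense voxel parameterization with trilinear lookup (the resolution, scene bounds, and the use of `grid_sample` are our own illustrative choices; the paper only specifies a learnable grid queried per 3D point):

```python
import torch
import torch.nn.functional as F

class VisibilityGrid(torch.nn.Module):
    """Learnable, view-independent visibility field v = G(p) -- a sketch."""

    def __init__(self, resolution: int = 128,
                 aabb_min=(-1.0, -1.0, -1.0), aabb_max=(1.0, 1.0, 1.0)):
        super().__init__()
        # One scalar per voxel, initialized to zero as described in Sec. 3.3.
        self.values = torch.nn.Parameter(
            torch.zeros(1, 1, resolution, resolution, resolution))
        self.register_buffer("aabb_min", torch.tensor(aabb_min))
        self.register_buffer("aabb_max", torch.tensor(aabb_max))

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        """points: (N, 3) world coordinates -> (N,) interpolated visibility."""
        # Map points into the [-1, 1]^3 cube expected by grid_sample.
        p = (points - self.aabb_min) / (self.aabb_max - self.aabb_min) * 2.0 - 1.0
        grid = p.view(1, 1, 1, -1, 3)
        # For 5-D inputs, mode="bilinear" performs trilinear interpolation.
        v = F.grid_sample(self.values, grid, mode="bilinear",
                          padding_mode="border", align_corners=True)
        return v.view(-1)
```

A visibility map for a novel view can then be volume-rendered from these per-point values exactly as colors are in Eq. (1).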
Table 1. Quantitative results on reconstruction and novel view synthesis. Our method achieves superior or comparable reconstruction and rendering quality compared to the baselines. This highlights the effectiveness of incorporating a generative prior to improve overall reconstruction quality, while the visibility modeling ensures stability in the observable parts, preventing drastic changes.

                              Reconstruction                 Rendering
Dataset     Method         CD↓     F-Score↑  NC↑      PSNR↑   SSIM↑   LPIPS↓  MUSIQ↑
Replica     ZeroNVS*       21.53   16.41     79.43    14.47   0.515   0.428   45.78
            FreeNeRF       67.75    6.63     48.59    13.69   0.437   0.513   37.54
            MonoSDF        12.57   43.25     83.14    22.44   0.809   0.246   36.02
            RICO           17.36   27.89     82.27    19.85   0.746   0.356   31.82
            ObjectSDF++     8.57   50.11     85.44    24.66   0.865   0.198   41.42
            Ours_geo        7.91   50.99     89.36    25.08   0.868   0.196   43.33
            Ours            7.91   50.99     89.36    24.52   0.846   0.286   49.22
ScanNet++   ZeroNVS*       36.69    6.48     69.61    12.22   0.463   0.443   41.36
            FreeNeRF      134.34    1.50     46.46    15.32   0.533   0.481   35.42
            MonoSDF        18.52   26.72     76.26    17.58   0.646   0.451   28.63
            RICO           23.64   21.28     65.28    15.74   0.616   0.467   26.95
            ObjectSDF++    10.67   44.78     78.18    19.43   0.741   0.332   33.64
            Ours_geo       10.18   45.13     81.87    20.13   0.752   0.319   38.43
            Ours           10.18   45.13     81.87    20.17   0.715   0.442   45.71
Figure 3. Qualitative comparison of 10-view reconstruction. We present examples from ScanNet++ [85] and Replica [67]. In each example, the first row shows the background, the second the full scene, and the third individual objects. We reconstruct more complete and reasonable 3D geometry, especially in less captured and occluded regions, such as the chair behind the table and the background.
Figure 4. Qualitative results of novel view synthesis. Our method significantly improves rendering quality, particularly in less captured regions with low visibility (shown in darker colors in the visibility maps), such as the highlighted corner of the wall.
Ideally, points observed in more input views should have higher visibility. The accumulated transmittance T of a 3D point p represents the probability that the corresponding ray reaches p without hitting any other particles; a higher transmittance T therefore means a greater probability of being visible in the input views. We initialize G to zero and use the transmittance T from the input views to optimize the visibility grid G via

\mathcal{L}_v = \sum_{i=0}^{n} \max\big(T_i - G(p_i),\, 0\big). \qquad (6)

We detach the gradient of T_i to avoid influencing the reconstruction network. We optimize G after finishing the decompositional reconstruction stage to ensure the accuracy of the transmittance, and freeze G during the geometry and appearance optimization stages with the generative diffusion prior.

Visibility-guided SDS. We obtain the visibility map V under a novel view by volume rendering, similar to Eq. (1); V for a ray r is computed as V(r) = \sum_{i=0}^{n-1} T_i \alpha_i v_i. The visibility weighting function w^v(z) is defined as

w^v(z) = \begin{cases} w_0 + m_0 V(z) & \text{if } V(z) \le \tau, \\ w_1 + m_1 V(z) & \text{if } V(z) > \tau, \end{cases} \qquad (7)

where the w and m are piecewise-linear coefficients, V(z) denotes the pixel-wise visibility associated with the latent z, and τ is a threshold separating high- and low-visibility areas. We reduce the SDS loss weight in high-visibility regions to strengthen the reconstruction guidance, while increasing the SDS loss weight in low-visibility regions for stronger generative prior guidance. We then rewrite Eq. (4) and Eq. (5) as

\nabla_\theta \mathcal{L}^{g\text{-}v}_{SDS} = \mathbb{E}_{t,\epsilon}\!\left[ w^v(z)\, w(t)\,\big(\hat{\epsilon}_\phi(z_t; y, t) - \epsilon\big)\, \frac{\partial z}{\partial \tilde{n}_j} \frac{\partial \tilde{n}_j}{\partial \theta} \right],
\nabla_\psi \mathcal{L}^{a\text{-}v}_{SDS} = \mathbb{E}_{t,\epsilon}\!\left[ w^v(z)\, w(t)\,\big(\hat{\epsilon}_\phi(z_t; y, t) - \epsilon\big)\, \frac{\partial z}{\partial c_j} \frac{\partial c_j}{\partial \psi} \right]. \qquad (8)

3.4. To Make Prior-Guided Optimization Work

Training process. Our training consists of three stages:
(1) Object-compositional reconstruction. The implicit surfaces are optimized to decompose the scene into individual objects with the reconstruction loss L_recon, following prior work [33, 79] (see the supplementary for details). After this stage, we optimize the visibility grid G with L_v in Eq. (6) and keep it frozen in the following two stages.
(2) Geometry optimization. In addition to L_recon, we apply the visibility-guided geometry SDS L^{g-v}_SDS to each object to optimize the latent representation.
(3) Appearance optimization. We export the mesh of each object after the geometry optimization and optimize the appearance network ψ with L^{a-v}_SDS and the color rendering loss. The appearance of the background is additionally reconstructed with the inpainted panorama.

Effective Rendering for SDS. As introduced above, L^{g-v}_SDS requires normal and mask maps from volume rendering for iterative optimization. However, traditional volume rendering is slow; e.g., VolSDF [83] takes about 0.5 seconds to render a full image at 128 × 128 resolution, which is impractical for optimization. To address this, we apply the OccGrid sampling method [31] to render the normal and mask maps for SDS novel views, reducing the rendering time to only 0.01 seconds for a 128 × 128 image.

Novel View Selection. SDS optimization requires novel-view images rendered under object-centric camera poses. However, due to insufficient constraints with sparse views, NeRF-based methods produce floating artifacts throughout the scene, making it difficult to render object-centric images for each object. To address this, we use the visibility grid G to predict the boundary of each object and filter out the floaters. After filtering, we obtain the object's bounding box and sample novel views for SDS around it. For the background, we randomly sample novel views within the scene.
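Tying the visibility machinery together, below is a small sketch of the grid supervision of Eq. (6) and the piecewise per-pixel weighting of Eq. (7). The numeric coefficients and threshold are illustrative placeholders, since the paper only specifies that the weight shrinks in high-visibility regions and grows in low-visibility ones.

```python
import torch

def visibility_grid_loss(transmittance: torch.Tensor,
                         grid_values: torch.Tensor) -> torch.Tensor:
    """Eq. (6): hinge loss pushing the grid value at each sample point up to the
    accumulated transmittance T_i from the input views. T_i is detached so the
    reconstruction network itself is not affected."""
    return torch.clamp(transmittance.detach() - grid_values, min=0.0).sum()


def visibility_sds_weight(V: torch.Tensor, tau: float = 0.5,
                          w0: float = 1.0, m0: float = -1.0,
                          w1: float = 0.2, m1: float = -0.1) -> torch.Tensor:
    """Eq. (7): piecewise-linear per-pixel weight w^v applied to the SDS loss.

    V is the rendered visibility map in [0, 1]. The coefficients and threshold are
    placeholders chosen so that low-visibility pixels receive a larger weight
    (more prior guidance) and high-visibility pixels a smaller one."""
    low_branch = w0 + m0 * V     # used where V <= tau
    high_branch = w1 + m1 * V    # used where V > tau
    return torch.where(V <= tau, low_branch, high_branch)
```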
Table 2. Decompositional object reconstruction. Our approach significantly outperforms baselines, recovering complete object meshes and smoother backgrounds with the generative prior.

Replica     RICO          10.32   49.26   61.27   71.21   13.35   39.73   85.32
            ObjectSDF++    7.49   56.69   64.75   71.72   10.33   44.19   86.34
            Ours           5.54   67.71   73.50   88.21    9.39   46.14   92.83
ScanNet++   RICO          24.09   39.26   58.26   42.25   18.37   34.72   78.26
            ObjectSDF++   14.52   46.87   61.57   45.73   13.20   38.92   80.47
            Ours           5.03   66.55   72.91   70.01   11.51   40.12   86.24

Figure 5. Visualized novel view instance masks. Our method can synthesize consistent and complete novel view instance masks. (Columns: RICO, ObjectSDF++, Ours, Visibility Map, GT.)
4. Experiments

We evaluate DP-Recon on both geometry and appearance recovery for sparse-view 3D reconstruction. Additionally, we provide generalization results on YouTube videos and failure cases, and discuss limitations, in the supplementary.

4.1. Settings

Datasets. We conduct experiments on the synthetic dataset Replica [67] with 8 scenes, following MonoSDF [88] and ObjectSDF++ [79]. Additionally, we use 6 scenes from the real-world dataset ScanNet++ [85]. We use 10 input views for each scene, except for the experiments on different numbers of input views in Tab. 3. See the supplementary for more details.

Baselines. We compare against the state-of-the-art sparse-view NeRF method FreeNeRF [81] and the dense-view method MonoSDF [88] with geometric regularization. We also compare with RICO [33] and ObjectSDF++ [79] for decompositional reconstruction. We adapt ZeroNVS [60], which synthesizes novel views of scenes from a single image, to the multiview setting following ReconFusion [80].

Metrics. For reconstruction, we evaluate Chamfer Distance (CD), F-Score, and Normal Consistency (NC) following MonoSDF [88], for both the full scene and the decompositional reconstruction. Additionally, we assess the novel-view mask Mean Intersection over Union (mIoU) to evaluate the completeness of object reconstruction. For rendering, we evaluate full-reference (FR) and no-reference (NR) metrics following ExtraNeRF [63]. For the FR metrics we use PSNR, SSIM, and LPIPS, and for NR we employ MUSIQ [22] to evaluate the visual quality of rendered images. We randomly sample 10 novel views within each scene to evaluate the rendering metrics and mask mIoU.

4.2. Results

Holistic Scene Reconstruction. As shown in Tab. 1, our method improves both reconstruction and rendering results compared to all baselines. These improvements stem from incorporating generative priors into the reconstruction pipeline, which enables more accurate reconstruction in less captured areas, more precise object structures, smoother background reconstruction, and fewer floating artifacts, as illustrated in Fig. 3. Our appearance prior also supplies reasonable additional information in these less captured regions, allowing our NR rendering metric, MUSIQ, to significantly outperform the baselines. This indicates that our rendering achieves higher quality in these areas, in contrast to the artifacts present in the baseline results, as shown in Fig. 4.

Decompositional Object Reconstruction. Incorporating the generative prior substantially improves the reconstruction in occluded regions, including more accurate object structures and fewer floating artifacts, e.g., the chair behind the table and the occluded background in Fig. 3. The significant increase in novel-view mask mIoU over all baselines shows that our method achieves complete and multi-view-consistent object shapes, as illustrated in Tab. 2 and Fig. 5. Our results, shown in Fig. 3, remain faithful to the input images in observed regions, confirming that our visibility-guided approach effectively mitigates conflicts between the guidance of the generative prior and the input images.

Performance under Different View Numbers. Tab. 3 shows that our method consistently outperforms the baselines across varying numbers of input views, improving both geometry and appearance. Notably, it achieves better object reconstruction with just 5 views than the baselines do with 15 views. Our method is especially effective in large scenes with heavy occlusions, as demonstrated in Fig. 1, where our method with 10 views outperforms the baseline with 100.

4.3. Ablation Studies

We design ablative studies on the geometry prior (GP), visibility guidance for the geometry prior (VG), appearance prior (AP), and visibility guidance for the appearance prior (VA).
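For reference, here is a simple brute-force (O(N·M)) sketch of the Chamfer Distance and F-Score used above, computed on point sets sampled from the predicted and ground-truth meshes. The 5 cm threshold and the particular averaging convention are assumptions; the paper defers these details to the MonoSDF evaluation protocol.

```python
import torch

def chamfer_and_fscore(pred_pts: torch.Tensor, gt_pts: torch.Tensor,
                       thresh: float = 0.05):
    """Chamfer Distance and F-Score between point sets of shape (N, 3) and (M, 3)."""
    d = torch.cdist(pred_pts, gt_pts)         # (N, M) pairwise Euclidean distances
    acc = d.min(dim=1).values                 # pred -> gt (accuracy)
    comp = d.min(dim=0).values                # gt -> pred (completeness)

    chamfer = 0.5 * (acc.mean() + comp.mean())

    precision = (acc < thresh).float().mean()
    recall = (comp < thresh).float().mean()
    fscore = 2 * precision * recall / (precision + recall + 1e-8)
    return chamfer.item(), fscore.item()
```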
Table 3. Quantitative results on different view numbers. Our method outperforms baselines significantly under sparse-view settings.

              Scene Reconstruction (CD↓ / NC↑)                 Object Reconstruction (CD↓ / mIoU↑)            Rendering (PSNR↑ / MUSIQ↑)
Method        5 views         10 views        15 views         5 views         10 views        15 views       5 views         10 views        15 views
ZeroNVS*      31.86 / 75.26   21.53 / 79.43   20.80 / 81.81    -               -               -              13.72 / 39.01   14.47 / 45.78   15.02 / 46.39
FreeNeRF      76.83 / 48.19   67.76 / 48.59   65.26 / 49.71    -               -               -              12.94 / 27.71   13.69 / 37.54   13.89 / 37.13
MonoSDF       31.26 / 76.03   12.57 / 83.14    8.94 / 87.23    -               -               -              18.57 / 31.47   22.44 / 36.02   26.20 / 42.25
RICO          37.83 / 72.30   17.36 / 82.27    8.84 / 86.28    15.81 / 45.27   10.32 / 71.21   8.43 / 78.04   16.78 / 28.69   19.85 / 31.82   20.57 / 30.81
ObjectSDF++   35.49 / 68.86    8.57 / 85.44    7.21 / 88.21     9.35 / 47.63    7.49 / 71.72   6.93 / 80.01   17.43 / 28.13   24.66 / 41.42   27.33 / 44.36
Ours          12.63 / 83.72    7.91 / 89.36    6.78 / 91.79     6.66 / 69.48    5.54 / 88.21   4.88 / 87.39   20.43 / 39.28   24.52 / 49.22   28.12 / 50.71
Table 4. Ablation study. Geometry (GP) and appearance (AP) priors improve the reconstruction and rendering quality, while the visibility guidance (VG & VA) further enhances the consistency.

GP  VG  AP  VA     Scene Recon.        Rendering            Object Recon.       BG Recon.
                   CD↓      NC↑        PSNR↑    MUSIQ↑      CD↓      mIoU↑      CD↓      NC↑
×   ×   ×   ×      8.51     86.13      24.31    41.51       7.67     73.31      10.06    87.36
✓   ×   ×   ×      8.14     88.68      24.83    41.98       6.35     84.45       9.83    91.36
✓   ✓   ×   ×      7.91     89.36      25.08    43.33       5.54     88.21       9.39    92.83
✓   ×   ✓   ×      8.14     88.68      22.83    50.25       6.35     84.45       9.83    91.36
✓   ✓   ✓   ×      7.91     89.36      23.42    52.34       5.54     88.21       9.39    92.83
✓   ✓   ✓   ✓      7.91     89.36      24.52    49.22       5.54     88.21       9.39    92.83

Tab. 4 and Fig. 6 reveal several key observations:
1. The integration of generative priors (GP & AP) substantially improves reconstruction and rendering quality. However, directly incorporating them creates inconsistencies with the input views, potentially undermining the reconstruction of observed regions (see Fig. 6 (b, e, f)).
2. As shown in Tab. 4 and Fig. 6 (c, g), visibility guidance (VG and VA) effectively regulates the influence of the diffusion priors. By adaptively weighting the SDS loss according to per-pixel visibility, it preserves consistency with the input views.

Figure 6. Qualitative ablation comparison. We show the meshes in the first row along with their textures in the second, demonstrating that prior knowledge can supplement missing information while the visibility modeling ensures consistency with input views. (Panels: (a) Base, (b) GP, (c) GP, VG; (d) GP, VG, (e) GP, AP, (f) GP, VG, AP, (g) Full.)

4.4. Scene Editing

Text-based Editing. Leveraging decompositional reconstruction and text-guided generative priors, DP-Recon enables seamless text-based editing of both geometry (e.g., "A teddy bear") and appearance (e.g., "Space-themed") for individual objects and the background, as shown in Fig. 7. This allows the generation of numerous digital replicas [67] or cousins [10] while preserving the original spatial layout.

VFX Editing. Our method reconstructs high-fidelity, decomposed object meshes with detailed UV maps, supporting VFX workflows in common 3D software like Blender. Fig. 7 demonstrates diverse photorealistic VFX edits for individual objects (e.g., "Ignite it" or "Break it by a ball"), which could benefit filmmaking and game development.

[Figure 7 panel labels: "Break it by a ball!", "A teddy bear", "Space-themed style", "Ignite it!", "A fire extinguisher", "Super Mario style".]

5. Conclusion

We present DP-Recon, a novel pipeline that utilizes generative priors in the form of SDS to optimize the neural representation of each object. It employs a visibility-guided mechanism that dynamically adjusts the SDS loss, ensuring improved results while maintaining consistency with the input images. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in both geometry and appearance reconstruction, effectively enhancing quality in unobserved regions while preserving fidelity in observed areas. Our method enables seamless text-based editing for geometry and stylization, and generates decomposed object meshes with detailed UV maps, supporting a wide range of downstream applications.
References Demetri Terzopoulos, Song-Chun Zhu, et al. Arnold:
A benchmark for language-grounded task learning with
[1] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark continuous states in realistic 3d scenes. arXiv preprint
Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and arXiv:2304.04321, 2023. 2
Anton Van Den Hengel. Vision-and-language navigation:
[14] Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang
Interpreting visually-grounded navigation instructions in real
Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei,
environments. In Conference on Computer Vision and Pat-
Yunchao Yao, et al. Maniskill2: A unified bench-
tern Recognition (CVPR), 2018. 2
mark for generalizable manipulation skills. arXiv preprint
[2] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P
arXiv:2302.04659, 2023. 2
Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded
[15] Shoukang Hu, Kaichen Zhou, Kaiyu Li, Longhui Yu, Lan-
anti-aliased neural radiance fields. In Conference on Com-
qing Hong, Tianyang Hu, Zhenguo Li, Gim Hee Lee, and
puter Vision and Pattern Recognition (CVPR), 2022. 2, 3
Ziwei Liu. Consistentnerf: Enhancing neural radiance fields
[3] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P
with 3d consistency for sparse view synthesis. arXiv preprint
Srinivasan, and Peter Hedman. Zip-nerf: Anti-aliased grid-
arXiv:2305.11031, 2023. 3
based neural radiance fields. In International Conference on
[16] Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun
Computer Vision (ICCV), pages 19697–19705, 2023. 3
Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu,
[4] Honghua Chen, Chen Change Loy, and Xingang Pan. Mvip-
Baoxiong Jia, and Siyuan Huang. An embodied generalist
nerf: Multi-view 3d inpainting on nerf scenes via diffusion
agent in 3d world. In International Conference on Machine
prior. In Conference on Computer Vision and Pattern Recog-
Learning (ICML), 2024. 2
nition (CVPR), 2024. 3
[17] Jiangyong Huang, Baoxiong Jia, Yan Wang, Ziyu Zhu,
[5] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fan-
Xiongkun Linghu, Qing Li, Song-Chun Zhu, and Siyuan
tasia3d: Disentangling geometry and appearance for high-
Huang. Unveiling the mist over 3d vision-language under-
quality text-to-3d content creation. In International Confer-
standing: Object-centric evaluation with chain-of-analysis.
ence on Computer Vision (ICCV), 2023. 3, 4
In Conference on Computer Vision and Pattern Recognition
[6] Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, Xiaofeng
(CVPR), 2025. 2
Yang, Yikai Wang, Zhongang Cai, Lei Yang, Huaping Liu,
and Guosheng Lin. Gaussianeditor: Swift and controllable [18] Ajay Jain, Matthew Tancik, and Pieter Abbeel. Putting nerf
3d editing with gaussian splatting. In Conference on Com- on a diet: Semantically consistent few-shot view synthe-
puter Vision and Pattern Recognition (CVPR), 2024. 2 sis. In International Conference on Computer Vision (ICCV),
2021. 2, 3
[7] Yixin Chen, Junfeng Ni, Nan Jiang, Yaowei Zhang, Yixin
Zhu, and Siyuan Huang. Single-view 3d scene reconstruc- [19] Nan Jiang, Zimo He, Zi Wang, Hongjie Li, Yixin Chen,
tion with high-fidelity shape and texture. In International Siyuan Huang, and Yixin Zhu. Autonomous character-scene
Conference on 3D Vision (3DV), 2024. 2 interaction synthesis from text instruction. In SIGGRAPH
[8] Julian Chibane, Thiemo Alldieck, and Gerard Pons-Moll. Asia 2024 Conference Papers, 2024. 2
Implicit functions in feature space for 3d shape reconstruc- [20] Nan Jiang, Zhiyuan Zhang, Hongjie Li, Xiaoxuan Ma, Zan
tion and completion. In Conference on Computer Vision and Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, and Siyuan
Pattern Recognition (CVPR), 2020. 2 Huang. Scaling up dynamic human-scene interaction model-
[9] Jieming Cui, Ziren Gong, Baoxiong Jia, Siyuan Huang, Zi- ing. In Conference on Computer Vision and Pattern Recog-
long Zheng, Jianzhu Ma, and Yixin Zhu. Probio: A protocol- nition (CVPR), 2024. 2
guided multimodal dataset for molecular biology lab. In Ad- [21] Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang,
vances in Neural Information Processing Systems (NeurIPS), Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anand-
2023. 2 kumar, Yuke Zhu, and Linxi Fan. Vima: General robot
[10] Tianyuan Dai, Josiah Wong, Yunfan Jiang, Chen Wang, Cem manipulation with multimodal prompts. arXiv preprint
Gokmen, Ruohan Zhang, Jiajun Wu, and Li Fei-Fei. Acdc: arXiv:2210.03094, 2022. 2
Automated creation of digital cousins for robust policy learn- [22] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and
ing. arXiv preprint arXiv:2410.07408, 2024. 8 Feng Yang. Musiq: Multi-scale image quality transformer. In
[11] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ra- International Conference on Computer Vision (ICCV), 2021.
manan. Depth-supervised NeRF: Fewer views and faster 7
training for free. In Conference on Computer Vision and Pat- [23] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler,
tern Recognition (CVPR), 2022. 3 and George Drettakis. 3d gaussian splatting for real-time ra-
[12] Ruiqi Gao*, Aleksander Holynski*, Philipp Henzler, Arthur diance field rendering. In ACM SIGGRAPH / Eurographics
Brussee, Ricardo Martin-Brualla, Pratul P. Srinivasan, Symposium on Computer Animation (SCA), 2023. 2
Jonathan T. Barron, and Ben Poole*. Cat3d: Create any- [24] Mijeong Kim, Seonguk Seo, and Bohyung Han. Infon-
thing in 3d with multi-view diffusion models. In Advances erf: Ray entropy minimization for few-shot neural volume
in Neural Information Processing Systems (NeurIPS), 2024. rendering. In Conference on Computer Vision and Pattern
2, 3 Recognition (CVPR), 2022. 2
[13] Ran Gong, Jiangyong Huang, Yizhou Zhao, Haoran Geng, [25] Xin Kong, Shikun Liu, Marwan Taher, and Andrew J Davi-
Xiaofeng Gao, Qingyang Wu, Wensi Ai, Ziheng Zhou, son. vmap: Vectorised object mapping for neural field slam.
In Conference on Computer Vision and Pattern Recognition [39] Yu Liu, Baoxiong Jia, Ruijie Lu, Junfeng Ni, Song-Chun
(CVPR), 2023. 2 Zhu, and Siyuan Huang. Building interactable replicas of
[26] Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, complex articulated objects via gaussian splatting. In In-
and Stefan Lee. Beyond the nav-graph: Vision-and-language ternational Conference on Learning Representations (ICLR),
navigation in continuous environments. In European Confer- 2025. 2
ence on Computer Vision (ECCV), 2020. 2 [40] Guanxing Lu, Shiyi Zhang, Ziwei Wang, Changliu Liu, Ji-
[27] Minseop Kwak, Jiuhn Song, and Seungryong Kim. Gecon- wen Lu, and Yansong Tang. Manigaussian: Dynamic gaus-
erf: Few-shot neural radiance fields via geometric consis- sian splatting for multi-task robotic manipulation. In Euro-
tency. In International Conference on Machine Learning pean Conference on Computer Vision (ECCV), 2024. 2
(ICML), 2023. 3 [41] Ruijie Lu, Yixin Chen, Yu Liu, Jiaxiang Tang, Junfeng Ni,
[28] Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Diwen Wan, Gang Zeng, and Siyuan Huang. Taco: Taming
Jaakko Lehtinen, and Timo Aila. Modular primitives for diffusion for in-the-wild video amodal completion. arXiv
high-performance differentiable rendering. In ACM SIG- preprint arXiv:2503.12049, 2025. 3
GRAPH / Eurographics Symposium on Computer Animation [42] Ruijie Lu, Yixin Chen, Junfeng Ni, Baoxiong Jia, Yu Liu,
(SCA), 2020. 4 Diwen Wan, Gang Zeng, and Siyuan Huang. Movis: En-
[29] Kyungmin Lee, Kihyuk Sohn, and Jinwoo Shin. Dreamflow: hancing multi-object novel view synthesis for indoor scenes.
High-quality text-to-3d generation by approximating proba- In Conference on Computer Vision and Pattern Recognition
bility flow. arXiv preprint arXiv:2403.14966, 2024. 4 (CVPR), 2025. 3
[30] Puhao Li, Tengyu Liu, Yuyang Li, Muzhi Han, Haoran Geng, [43] Xiaoyang Lyu, Chirui Chang, Peng Dai, Yang-Tian Sun, and
Shu Wang, Yixin Zhu, Song-Chun Zhu, and Siyuan Huang. Xiaojuan Qi. Total-decom: Decomposed 3d scene recon-
Ag2manip: Learning novel manipulation skills with agent- struction with minimal interaction. In Conference on Com-
agnostic visual and action representations. In International puter Vision and Pattern Recognition (CVPR), 2024. 2
Conference on Intelligent Robots and Systems (IROS), 2024. [44] Julien NP Martel, David B Lindell, Connor Z Lin, Eric R
2 Chan, Marco Monteiro, and Gordon Wetzstein. Acorn:
[31] Ruilong Li, Hang Gao, Matthew Tancik, and Angjoo Adaptive coordinate networks for neural scene representa-
Kanazawa. Nerfacc: Efficient sampling accelerates nerfs. In tion. ACM SIGGRAPH / Eurographics Symposium on Com-
International Conference on Computer Vision (ICCV), 2023. puter Animation (SCA), 2021. 2
6 [45] David McAllister, Songwei Ge, Jia-Bin Huang, David W
[32] Yuyang Li, Bo Liu, Yiran Geng, Puhao Li, Yaodong Yang, Jacobs, Alexei A Efros, Aleksander Holynski, and Angjoo
Yixin Zhu, Tengyu Liu, and Siyuan Huang. Grasp multiple Kanazawa. Rethinking score distillation as a bridge between
objects with one hand. In International Conference on Intel- image distributions. arXiv preprint arXiv:2406.09417, 2024.
ligent Robots and Systems (IROS), 2024. 2 4
[33] Zizhang Li, Xiaoyang Lyu, Yuanyuan Ding, Mengmeng [46] Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia
Wang, Yiyi Liao, and Yong Liu. Rico: Regularizing the un- Neverova, Andrea Vedaldi, Oran Gafni, and Filippos Kokki-
observable for indoor compositional reconstruction. In Inter- nos. Im-3d: Iterative multiview diffusion and reconstruction
national Conference on Computer Vision (ICCV), 2023. 2, for high-quality 3d generation. In International Conference
3, 4, 6, 7 on Machine Learning (ICML), 2024. 3
[34] Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, [47] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Se-
Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. Re- bastian Nowozin, and Andreas Geiger. Occupancy networks:
conx: Reconstruct any scene from sparse views with video Learning 3d reconstruction in function space. In Conference
diffusion model. arXiv preprint arXiv:2408.16767, 2024. 3 on Computer Vision and Pattern Recognition (CVPR), 2019.
[35] Haolin Liu, Yujian Zheng, Guanying Chen, Shuguang Cui, 2
and Xiaoguang Han. Towards high-fidelity single-view [48] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik,
holistic reconstruction of indoor scenes. In European Con- Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf:
ference on Computer Vision (ECCV), 2022. 2 Representing scenes as neural radiance fields for view
[36] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tok- synthesis. In European Conference on Computer Vision
makov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: (ECCV), 2020. 2, 3
Zero-shot one image to 3d object. In International Confer- [49] Junfeng Ni, Yixin Chen, Bohan Jing, Nan Jiang, Bin Wang,
ence on Computer Vision (ICCV), 2023. 3 Bo Dai, Puhao Li, Yixin Zhu, Song-Chun Zhu, and Siyuan
[37] Xi Liu, Chaoyi Zhou, and Siyu Huang. 3dgs- Huang. Phyrecon: Physically plausible neural scene recon-
enhancer: Enhancing unbounded 3d gaussian splatting struction. In Advances in Neural Information Processing
with view-consistent 2d diffusion priors. arXiv preprint Systems (NeurIPS), 2024. 2, 3
arXiv:2410.16266, 2024. 3 [50] Yinyu Nie, Xiaoguang Han, Shihui Guo, Yujian Zheng, Jian
[38] Yu Liu, Baoxiong Jia, Yixin Chen, and Siyuan Huang. Chang, and Jian Jun Zhang. Total3dunderstanding: Joint lay-
Slotlifter: Slot-guided feature lifting for learning object- out, object pose and mesh reconstruction for indoor scenes
centric radiance fields. In European Conference on Com- from a single image. In Conference on Computer Vision and
puter Vision (ECCV), 2024. 2 Pattern Recognition (CVPR), 2020. 2
[51] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and and Xiao Yang. Mvdream: Multi-view diffusion for 3d gen-
Andreas Geiger. Differentiable volumetric rendering: Learn- eration. arXiv preprint arXiv:2308.16512, 2023. 3
ing implicit 3d representations without 3d supervision. In [63] Meng-Li Shih, Wei-Chiu Ma, Lorenzo Boyice, Aleksander
Conference on Computer Vision and Pattern Recognition Holynski, Forrester Cole, Brian L. Curless, and Janne Kon-
(CVPR), 2020. 2 tkanen. Extranerf: Visibility-aware view extrapolation of
[52] Michael Niemeyer, Jonathan T. Barron, Ben Mildenhall, neural radiance fields with diffusion models. In Conference
Mehdi S. M. Sajjadi, Andreas Geiger, and Noha Radwan. on Computer Vision and Pattern Recognition (CVPR), 2024.
Regnerf: Regularizing neural radiance fields for view syn- 3, 7
thesis from sparse inputs. In Conference on Computer Vision [64] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport:
and Pattern Recognition (CVPR), 2022. 2, 3 What and where pathways for robotic manipulation. In Con-
[53] Michael Oechsle, Songyou Peng, and Andreas Geiger. ference on Robot Learning, pages 894–906. PMLR, 2022.
Unisurf: Unifying neural implicit surfaces and radiance 2
fields for multi-view reconstruction. In International Con- [65] Nagabhushan Somraj and Rajiv Soundararajan. ViP-NeRF:
ference on Computer Vision (ICCV), 2021. 2 Visibility prior for sparse input neural radiance fields. In
[54] Jeong Joon Park, Peter Florence, Julian Straub, Richard ACM SIGGRAPH / Eurographics Symposium on Computer
Newcombe, and Steven Lovegrove. Deepsdf: Learning con- Animation (SCA), 2023. 2, 3
tinuous signed distance functions for shape representation. [66] Nagabhushan Somraj, Adithyan Karanayil, and Rajiv
In Conference on Computer Vision and Pattern Recognition Soundararajan. SimpleNeRF: Regularizing sparse input neu-
(CVPR), 2019. 2 ral radiance fields with simpler solutions. In SIGGRAPH
[55] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Milden- Asia, 2023. 3
hall. Dreamfusion: Text-to-3d using 2d diffusion. In In- [67] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik
ternational Conference on Learning Representations (ICLR), Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal,
2022. 2, 3 Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan,
[56] Lingteng Qiu, Guanying Chen, Xiaodong Gu, Qi Zuo, Mu- Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang
tian Xu, Yushuang Wu, Weihao Yuan, Zilong Dong, Liefeng Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler
Bo, and Xiaoguang Han. Richdreamer: A generalizable Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva,
normal-depth diffusion model for detail richness in text-to- Dhruv Batra, Hauke M. Strasdat, Renzo De Nardi, Michael
3d. In Conference on Computer Vision and Pattern Recogni- Goesele, Steven Lovegrove, and Richard Newcombe. The
tion (CVPR), 2024. 3, 4 Replica dataset: A digital replica of indoor spaces. arXiv
[57] Barbara Roessle, Jonathan T. Barron, Ben Mildenhall, preprint arXiv:1906.05797, 2019. 2, 5, 7, 8
Pratul P. Srinivasan, and Matthias Nießner. Dense depth [68] Xiaotian Sun, Qingshan Xu, Xinjie Yang, Yu Zang, and
priors for neural radiance fields from sparse input views. Cheng Wang. Global and hierarchical geometry consistency
In Conference on Computer Vision and Pattern Recognition priors for few-shot nerfs in indoor scenes. In Conference on
(CVPR), 2022. 3 Computer Vision and Pattern Recognition (CVPR), 2024. 3
[58] Robin Rombach, Andreas Blattmann, Dominik Lorenz, [69] Jiaxiang Tang, Ruijie Lu, Xiaokang Chen, Xiang Wen, Gang
Patrick Esser, and Björn Ommer. High-resolution image syn- Zeng, and Ziwei Liu. Intex: Interactive text-to-texture syn-
thesis with latent diffusion models. In Conference on Com- thesis via unified depth-aware inpainting. arXiv preprint
puter Vision and Pattern Recognition (CVPR), 2022. 2, 3 arXiv:2403.11878, 2024. 2
[59] Chitwan Saharia, William Chan, Saurabh Saxena, Lala [70] Prune Truong, Marie-Julie Rakotosaona, Fabian Manhardt,
Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed and Federico Tombari. Sparf: Neural radiance fields from
Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, sparse and noisy poses. In Conference on Computer Vision
Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J and Pattern Recognition (CVPR), 2023. 3
Fleet, and Mohammad Norouzi. Photorealistic text-to-image [71] Guangcong Wang, Zhaoxi Chen, Chen Change Loy, and Zi-
diffusion models with deep language understanding. In Ad- wei Liu. Sparsenerf: Distilling depth ranking for few-shot
vances in Neural Information Processing Systems (NeurIPS), novel view synthesis. In International Conference on Com-
2022. 3 puter Vision (ICCV), 2023. 3
[60] Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, [72] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku
Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry La- Komura, and Wenping Wang. Neus: Learning neural im-
gun, Li Fei-Fei, Deqing Sun, and Jiajun Wu. ZeroNVS: plicit surfaces by volume rendering for multi-view recon-
Zero-shot 360-degree view synthesis from a single real im- struction. In Advances in Neural Information Processing
age. In Conference on Computer Vision and Pattern Recog- Systems (NeurIPS), 2021. 2, 3
nition (CVPR), 2024. 7 [73] Qi Wang, Ruijie Lu, Xudong Xu, Jingbo Wang, Michael Yu
[61] Seunghyeon Seo, Donghoon Han, Yeonjin Chang, and Nojun Wang, Bo Dai, Gang Zeng, and Dan Xu. Roomtex: Textur-
Kwak. Mixnerf: Modeling a ray with mixture density for ing compositional indoor scenes via iterative inpainting. In
novel view synthesis from sparse inputs. In Conference on European Conference on Computer Vision (ECCV), 2024. 2
Computer Vision and Pattern Recognition (CVPR), 2023. 3 [74] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan
[62] Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and
diverse text-to-3d generation with variational score distilla- [87] Xin Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Song-
tion. In Advances in Neural Information Processing Systems Hai Zhang, and Xiaojuan Qi. Text-to-3d with classifier score
(NeurIPS), 2024. 4 distillation. arXiv preprint arXiv:2310.19415, 2023. 4
[75] Frederik Warburg*, Ethan Weber*, Matthew Tancik, Alek- [88] Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sat-
sander Hołyński, and Angjoo Kanazawa. Nerfbusters: Re- tler, and Andreas Geiger. Monosdf: Exploring monocu-
moving ghostly artifacts from casually captured nerfs. In lar geometric cues for neural implicit surface reconstruc-
International Conference on Computer Vision (ICCV), 2023. tion. In Advances in Neural Information Processing Systems
3 (NeurIPS), 2022. 3, 7
[76] Ethan Weber, Aleksander Holynski, Varun Jampani, Saurabh [89] Cheng Zhang, Zhaopeng Cui, Yinda Zhang, Bing Zeng,
Saxena, Noah Snavely, Abhishek Kar, and Angjoo Marc Pollefeys, and Shuaicheng Liu. Holistic 3d scene un-
Kanazawa. Nerfiller: Completing scenes via generative 3d derstanding from a single image with implicit representation.
inpainting. In Conference on Computer Vision and Pattern In Conference on Computer Vision and Pattern Recognition
Recognition (CVPR), 2024. 3 (CVPR), 2021. 2
[77] Yi Wei, Shaohui Liu, Yongming Rao, Wang Zhao, Jiwen Lu, [90] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen
and Jie Zhou. Nerfingmvs: Guided optimization of neural Koltun. Nerf++: Analyzing and improving neural radiance
radiance fields for indoor multi-view stereo. In International fields. arXiv preprint arXiv:2010.07492, 2020. 2
Conference on Computer Vision (ICCV), 2021. 3 [91] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding
conditional control to text-to-image diffusion models. In In-
[78] Qianyi Wu, Xian Liu, Yuedong Chen, Kejie Li, Chuanxia
ternational Conference on Computer Vision (ICCV), 2023.
Zheng, Jianfei Cai, and Jianmin Zheng. Object-
4
compositional neural implicit surfaces. In European Con-
ference on Computer Vision (ECCV), 2022. 2, 3, 4 [92] Zihang Zhao, Yuyang Li, Wanlin Li, Zhenghao Qi, Lecheng
Ruan, Yixin Zhu, and Kaspar Althoefer. Tac-Man: Tactile-
[79] Qianyi Wu, Kaisiyuan Wang, Kejie Li, Jianmin Zheng, and
informed prior-free manipulation of articulated objects.
Jianfei Cai. Objectsdf++: Improved object-compositional
Transactions on Robotics (T-RO), 2025. 2
neural implicit surfaces. In International Conference on
[93] Zhizhuo Zhou and Shubham Tulsiani. Sparsefusion: Dis-
Computer Vision (ICCV), 2023. 2, 3, 4, 6, 7
tilling view-conditioned diffusion for 3d reconstruction. In
[80] Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Conference on Computer Vision and Pattern Recognition
Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor (CVPR), 2023. 3
Verbin, Jonathan T. Barron, Ben Poole, and Aleksander
Holynski. Reconfusion: 3d reconstruction with diffusion pri-
ors. In Conference on Computer Vision and Pattern Recog-
nition (CVPR), 2024. 2, 3, 7
[81] Jiawei Yang, Marco Pavone, and Yue Wang. Freenerf: Im-
proving few-shot neural rendering with free frequency reg-
ularization. In Conference on Computer Vision and Pattern
Recognition (CVPR), 2023. 2, 3, 7
[82] Yandan Yang, Baoxiong Jia, Peiyuan Zhi, and Siyuan
Huang. Physcene: Physically interactable 3d scene synthe-
sis for embodied ai. In Conference on Computer Vision and
Pattern Recognition (CVPR), 2024. 2
[83] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Vol-
ume rendering of neural implicit surfaces. In Advances in
Neural Information Processing Systems (NeurIPS), 2021. 2,
3, 6
[84] Sheng Ye, Yuze He, Matthieu Lin, Jenny Sheng, Ruoyu Fan,
Yiheng Han, Yubin Hu, Ran Yi, Yu-Hui Wen, Yong-Jin Liu,
and Wenping Wang. Pvp-recon: Progressive view planning
via warping consistency for sparse-view surface reconstruc-
tion. arXiv preprint arXiv:2409.05474, 2024. 3
[85] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner,
and Angela Dai. Scannet++: A high-fidelity dataset of 3d
indoor scenes. In International Conference on Computer Vi-
sion (ICCV), 2023. 2, 5, 7
[86] Mae Younes, Amine Ouasfi, and Adnane Boukhayma. Spar-
secraft: Few-shot neural reconstruction through stereopsis
guided geometric linearization. In European Conference on
Computer Vision (ECCV), 2024. 3