Block-NeRF: Scalable Large Scene Neural View Synthesis

arXiv:2202.05263v1 [cs.CV] 10 Feb 2022
Figure 1. Block-NeRF is a method that enables large-scale scene reconstruction by representing the environment using multiple compact
NeRFs that each fit into memory. At inference time, Block-NeRF seamlessly combines renderings of the relevant NeRFs for the given area.
In this example, we reconstruct the Alamo Square neighborhood in San Francisco using data collected over 3 months. Block-NeRF can
update individual blocks of the environment without retraining on the entire scene, as demonstrated by the construction on the right. Video
results can be found on the project website waymo.com/research/block-nerf.
Abstract

We present Block-NeRF, a variant of Neural Radiance Fields that can represent large-scale environments. Specifically, we demonstrate that when scaling NeRF to render city-scale scenes spanning multiple blocks, it is vital to decompose the scene into individually trained NeRFs. This decomposition decouples rendering time from scene size, enables rendering to scale to arbitrarily large environments, and allows per-block updates of the environment. We adopt several architectural changes to make NeRF robust to data captured over months under different environmental conditions. We add appearance embeddings, learned pose refinement, and controllable exposure to each individual NeRF, and introduce a procedure for aligning appearance between adjacent NeRFs so that they can be seamlessly combined. We build a grid of Block-NeRFs from 2.8 million images to create the largest neural scene representation to date, capable of rendering an entire neighborhood of San Francisco.

1. Introduction

Recent advancements in neural rendering such as Neural Radiance Fields [42] have enabled photo-realistic reconstruction and novel view synthesis given a set of posed camera images [3, 40, 45]. Earlier works tended to focus on small-scale and object-centric reconstruction. Though some methods now address scenes the size of a single room or building, these are generally still limited and do not naïvely scale up to city-scale environments. Applying these methods to large environments typically leads to significant artifacts and low visual fidelity due to limited model capacity.

Reconstructing large-scale environments enables several important use-cases in domains such as autonomous driving [32, 44, 68] and aerial surveying [14, 35]. One example is mapping, where a high-fidelity map of the entire operating domain is created to act as a powerful prior for a variety of problems, including robot localization, navigation, and collision avoidance. Furthermore, large-scale scene reconstructions can be used for closed-loop robotic simulations [13]. Autonomous driving systems are commonly evaluated by re-simulating previously encountered scenarios; however, any deviation from the recorded encounter may change the vehicle's trajectory, requiring high-fidelity novel view renderings along the altered path. Beyond basic view synthesis, scene conditioned NeRFs are also capable of changing environmental lighting conditions such as camera exposure, weather, or time of day, which can be used to further augment simulation scenarios.

*Work done as an intern at Waymo.
Reconstructing such large-scale environments introduces additional challenges, including the presence of transient objects (cars and pedestrians), limitations in model capacity, along with memory and compute constraints. Furthermore, training data for such large environments is highly unlikely to be collected in a single capture under consistent conditions. Rather, data for different parts of the environment may need to be sourced from different data collection efforts, introducing variance in both scene geometry (e.g., construction work and parked cars), as well as appearance (e.g., weather conditions and time of day).

We extend NeRF with appearance embeddings and learned pose refinement to address these variations.
Geometry-based Image Reprojection. Many approaches to view synthesis start by applying traditional 3D reconstruction techniques to build a point cloud or triangle mesh representing the scene. This geometric "proxy" is then used to reproject pixels from the input images into new camera views, where they are blended by heuristic [6] or learning-based methods [24, 52, 53]. This approach has been scaled to long trajectories of first-person video [31], panoramas collected along a city street [30], and single landmarks from the Photo Tourism dataset [41]. Methods reliant on geometry proxies are limited by the quality of the initial 3D reconstruction, which hurts their performance in scenes with complex geometry or reflectance effects.

Volumetric Scene Representations. Recent view synthesis work has focused on unifying reconstruction and rendering and learning this pipeline end-to-end, typically using a volumetric scene representation. Methods for rendering small baseline view interpolation often use feed-forward networks to learn a mapping directly from input images to an output volume [15, 76], while methods such as Neural Volumes [37] that target larger-baseline view synthesis run a global optimization over all input images to reconstruct every new scene, similar to traditional bundle adjustment. Neural Radiance Fields (NeRF) [42] combines this single-scene optimization setting with a neural scene representation capable of representing complex scenes much more efficiently than a discrete 3D voxel grid; however, its rendering model scales very poorly to large-scale scenes in terms of compute. Followup work has proposed making NeRF more efficient by partitioning space into smaller regions, each containing its own lightweight NeRF network [48, 49]. Unlike our method, these network ensembles must be trained jointly, limiting their flexibility. Another approach is to provide extra capacity in the form of a coarse 3D grid of latent codes [36]. This approach has also been applied to compress detailed 3D shapes into neural signed distance functions [62] and to represent large scenes using occupancy networks [46].

We build our Block-NeRF implementation on top of mip-NeRF [3], which improves aliasing issues that hurt NeRF's performance in scenes where the input images observe the scene from many different distances. We incorporate techniques from NeRF in the Wild (NeRF-W) [40], which adds a latent code per training image to handle inconsistent scene appearance when applying NeRF to landmarks from the Photo Tourism dataset. NeRF-W creates a separate NeRF for each landmark from thousands of images, whereas our approach combines many NeRFs to reconstruct a coherent large environment from millions of images. Our model also incorporates a learned camera pose refinement which has been explored in previous works [34, 59, 66, 69, 70].

Some NeRF-based methods use segmentation data to isolate and reconstruct static [67] or moving objects (such as people or cars) [44, 73] across video sequences. As we focus primarily on reconstructing the environment itself, we choose to simply mask out dynamic objects during training.

Figure 3. Our model is an extension of the model presented in mip-NeRF [3]. The first MLP fσ predicts the density σ for a position x in space. The network also outputs a feature vector that is concatenated with the viewing direction d, the exposure level, and an appearance embedding. These are fed into a second MLP fc that outputs the color for the point. We additionally train a visibility network fv to predict whether a point in space was visible in the training views, which is used for culling Block-NeRFs during inference.

2.3. Urban Scene Camera Simulation

Camera simulation has become a popular data source for training and validating autonomous driving systems on interactive platforms [2, 28]. Early works [13, 19, 51, 54] synthesized data from scripted scenarios and manually created 3D assets. These methods suffered from domain mismatch and limited scene-level diversity. Several recent works tackle the simulation-to-reality gap by minimizing the distribution shifts in the simulation and rendering pipeline. Kar et al. [26] and Devaranjan et al. [12] proposed to minimize the scene-level distribution shift from rendered outputs to real camera sensor data through a learned scenario generation framework. Richter et al. [50] leveraged intermediate rendering buffers in the graphics pipeline to improve photorealism of synthetically generated camera images.

Towards the goal of building photo-realistic and scalable camera simulation, prior methods [9, 32, 68] leverage rich multi-sensor driving data collected during a single drive to reconstruct 3D scenes for object injection [9] and novel view synthesis [68] using modern machine learning techniques, including image GANs for 2D neural rendering. Relying on a sophisticated surfel reconstruction pipeline, SurfelGAN [68] is still susceptible to errors in graphical reconstruction and can suffer from the limited range and vertical field-of-view of LiDAR scans. In contrast to existing efforts, our work tackles the 3D rendering problem and is capable of modeling the real camera data captured from multiple drives under varying environmental conditions, such as weather and time of day, which is a prerequisite for reconstructing large-scale areas.
3. Background

We build upon NeRF [42] and its extension mip-NeRF [3]. Here, we summarize relevant parts of these methods. For details, please refer to the original papers.

3.1. NeRF and mip-NeRF Preliminaries

Neural Radiance Fields (NeRF) [42] is a coordinate-based neural scene representation that is optimized through a differentiable rendering loss to reproduce the appearance of a set of input images from known camera poses. After optimization, the NeRF model can be used to render previously unseen viewpoints.

The NeRF scene representation is a pair of multilayer perceptrons (MLPs). The first MLP fσ takes in a 3D position x and outputs volume density σ and a feature vector. This feature vector is concatenated with a 2D viewing direction d and fed into the second MLP fc, which outputs an RGB color c. This architecture ensures that the output color can vary when observed from different angles, allowing NeRF to represent reflections and glossy materials, but that the underlying geometry represented by σ is only a function of position.

Each pixel in an image corresponds to a ray r(t) = o + td through 3D space. To calculate the color of r, NeRF randomly samples distances {t_i}, i = 0, …, N, along the ray and passes the points r(t_i) and direction d through its MLPs to calculate σ_i and c_i. The resulting output color is

    c_out = Σ_{i=1}^{N} w_i c_i,   where   w_i = T_i (1 − e^{−Δ_i σ_i}),      (1)

    T_i = exp(−Σ_{j<i} Δ_j σ_j),   Δ_i = t_i − t_{i−1}.                       (2)

The full implementation of NeRF iteratively resamples the points t_i (by treating the weights w_i as a probability distribution) in order to better concentrate samples in areas of high density.
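For concreteness, the quadrature in Equations (1)–(2) can be written in a few lines. The following NumPy sketch is our own illustration of the standard volume-rendering math, not the paper's implementation; array shapes are assumptions:

```python
import numpy as np

def composite_ray(t, sigma, c):
    """Volume-rendering quadrature of Eqs. (1)-(2).

    t:     (N+1,) sample distances t_0 .. t_N along the ray
    sigma: (N,)   densities sigma_i predicted at r(t_i), i = 1 .. N
    c:     (N, 3) RGB colors c_i predicted at r(t_i)
    """
    delta = t[1:] - t[:-1]                      # Delta_i = t_i - t_{i-1}
    alpha = 1.0 - np.exp(-delta * sigma)        # opacity of each interval
    # T_i = exp(-sum_{j<i} Delta_j sigma_j): transmittance before sample i
    T = np.exp(-np.concatenate([[0.0], np.cumsum(delta * sigma)[:-1]]))
    w = T * alpha                               # weights w_i in Eq. (1)
    c_out = (w[:, None] * c).sum(axis=0)        # composited pixel color
    return c_out, w
```

The returned weights w_i are also what NeRF reuses as a sampling distribution in the hierarchical resampling step described above.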
To enable the NeRF MLPs to represent higher frequency detail [63], the inputs x and d are each preprocessed by a componentwise sinusoidal positional encoding γPE:

    γPE(z) = [sin(2^0 z), cos(2^0 z), …, sin(2^{L−1} z), cos(2^{L−1} z)],      (3)

where L is the number of levels of positional encoding.

NeRF's MLP fσ takes a single 3D point as input. However, this ignores both the relative footprint of the corresponding image pixel and the length of the interval [t_{i−1}, t_i] along the ray r containing the point, resulting in aliasing artifacts when rendering novel camera trajectories. Mip-NeRF [3] remedies this issue by using the projected pixel footprint to sample conical frustums along the ray rather than intervals. To feed these frustums into the MLP, mip-NeRF approximates each of them as a Gaussian distribution with parameters µ_i, Σ_i and replaces the positional encoding γPE with its expectation over the input Gaussian,

    γIPE(µ, Σ) = E_{X∼N(µ,Σ)}[γPE(X)],                                        (4)

referred to as an integrated positional encoding.
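As an illustration of Equations (3)–(4), and not the released mip-NeRF code, both encodings can be sketched as follows. For a Gaussian with diagonal covariance, the expectation in Eq. (4) has a closed form in which each frequency is attenuated by exp(−0.5 · 4^ℓ · variance); the grouping of sin and cos terms below differs from the interleaved order in Eq. (3) but is equivalent up to a permutation:

```python
import numpy as np

def positional_encoding(z, L):
    """gamma_PE of Eq. (3), applied componentwise to a vector z of shape (..., D)."""
    freqs = 2.0 ** np.arange(L)                     # 2^0 ... 2^(L-1)
    angles = z[..., None] * freqs                   # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*z.shape[:-1], -1)

def integrated_positional_encoding(mu, var, L):
    """gamma_IPE of Eq. (4) for a Gaussian with mean mu and per-dimension
    variance var (the diagonal of Sigma). Under the Gaussian, the expected
    sin/cos of frequency 2^l is damped by exp(-0.5 * 4^l * var)."""
    freqs = 2.0 ** np.arange(L)
    angles = mu[..., None] * freqs                  # (..., D, L)
    damping = np.exp(-0.5 * (freqs ** 2) * var[..., None])
    enc = np.concatenate([np.sin(angles) * damping, np.cos(angles) * damping], axis=-1)
    return enc.reshape(*mu.shape[:-1], -1)
```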
4. Method

Training a single NeRF does not scale when trying to represent scenes as large as cities. We instead propose splitting the environment into a set of Block-NeRFs that can be independently trained in parallel and composited during inference. This independence enables the ability to expand the environment with additional Block-NeRFs or update blocks without retraining the entire environment (see Figure 1). We dynamically select relevant Block-NeRFs for rendering, which are then composited in a smooth manner when traversing the scene. To aid with this compositing, we optimize the appearance codes to match lighting conditions and use interpolation weights computed based on each Block-NeRF's distance to the novel view.

4.1. Block Size and Placement

The individual Block-NeRFs should be arranged to collectively ensure full coverage of the target environment. We typically place one Block-NeRF at each intersection, covering the intersection itself and any connected street 75% of the way until it converges into the next intersection (see Figure 1). This results in a 50% overlap between any two adjacent blocks on the connecting street segment, making appearance alignment easier between them. Following this procedure means that the block size is variable; where necessary, additional blocks may be introduced as connectors between intersections. We ensure that the training data for each block stays exactly within its intended bounds by applying a geographical filter. This procedure can be automated and only relies on basic map data such as OpenStreetMap [22]. Note that other placement heuristics are also possible, as long as the entire environment is covered by at least one Block-NeRF. For example, for some of our experiments, we instead place blocks along a single street segment at uniform distances and define the block size as a sphere around the Block-NeRF origin (see Figure 2).
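A minimal sketch of such a geographical filter, assuming each block is parameterized by an origin and a spherical radius of influence (the function name and interface are ours, not the paper's):

```python
import numpy as np

def assign_images_to_blocks(image_positions, block_origins, block_radius):
    """Return, for each block, the indices of training images whose camera
    position falls inside that block's sphere of influence.

    image_positions: (M, 3) camera positions
    block_origins:   (B, 3) Block-NeRF origins (e.g., intersection centers)
    block_radius:    scalar or (B,) radius per block
    """
    d = np.linalg.norm(image_positions[:, None, :] - block_origins[None, :, :], axis=-1)
    inside = d <= np.asarray(block_radius)          # (M, B) boolean mask
    return [np.nonzero(inside[:, b])[0] for b in range(block_origins.shape[0])]
```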
4.2. Training Individual Block-NeRFs

4.2.1 Appearance Embeddings

Given that different parts of our data may be captured under different environmental conditions, we follow NeRF-W [40] and use Generative Latent Optimization [5] to optimize per-image appearance embedding vectors, as shown in Figure 3.
Figure 4. The appearance codes allow the model to represent different lighting and weather conditions.
This allows the NeRF to explain away several appearance-changing conditions, such as varying weather and lighting. We can additionally manipulate these appearance embeddings to interpolate between different conditions observed in the training data (such as cloudy versus clear skies, or day and night). Examples of rendering with different appearances can be seen in Figure 4. In § 4.3.3, we use test-time optimization over these embeddings to match the appearance of adjacent Block-NeRFs, which is important when combining multiple renderings.
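Conceptually, Generative Latent Optimization here just means that each training image owns a free embedding vector that is optimized by the same photometric loss as the network weights. A hedged PyTorch-style sketch (the 32-dimensional code size follows the supplement; the module interface is illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class AppearanceEmbeddings(nn.Module):
    """One free latent code per training image, optimized jointly with the
    NeRF weights (Generative Latent Optimization)."""

    def __init__(self, num_images: int, dim: int = 32):
        super().__init__()
        self.codes = nn.Embedding(num_images, dim)
        nn.init.normal_(self.codes.weight, std=0.01)

    def forward(self, image_ids: torch.Tensor) -> torch.Tensor:
        # image_ids: (num_rays,) index of the image each training ray came from.
        return self.codes(image_ids)

# The returned code is concatenated with the viewing direction and exposure
# features before the second MLP f_c (Figure 3).
```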
4.2.2 Learned Pose Refinement

Although we assume that camera poses are provided, we find it advantageous to learn regularized pose offsets for further alignment. Pose refinement has been explored in previous NeRF-based models [34, 59, 66, 70]. These offsets are learned per driving segment and include both a translation and a rotation component. We optimize these offsets jointly with the NeRF itself, significantly regularizing the offsets in the early phase of training to allow the network to first learn a rough structure prior to modifying the poses.
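A sketch of how such per-segment offsets and their decaying regularizer could look; the penalty schedule values are taken from the supplement, while the SVD-based re-orthonormalization is only one possible choice for the "normalization" mentioned there, and the interface is ours:

```python
import torch
import torch.nn as nn

class PoseRefinement(nn.Module):
    """Learned per-segment pose offsets: a translation and a residual rotation
    added to the identity and projected back to a valid rotation. How the
    result is composed with the original camera-to-world transform is an
    implementation choice; this sketch only produces the correction."""

    def __init__(self, num_segments: int):
        super().__init__()
        self.translation = nn.Parameter(torch.zeros(num_segments, 3))
        self.rot_residual = nn.Parameter(torch.zeros(num_segments, 3, 3))

    def forward(self, segment_ids: torch.Tensor):
        R = torch.eye(3, device=self.rot_residual.device) + self.rot_residual[segment_ids]
        U, _, Vh = torch.linalg.svd(R)      # one way to re-orthonormalize
        return U @ Vh, self.translation[segment_ids]

def pose_regularizer(module: PoseRefinement, step: int, total: int = 5000):
    # Penalty weight decays linearly from 1e5 to 1e-1 over the first `total` steps.
    w = 1e5 + (1e-1 - 1e5) * min(step / total, 1.0)
    return w * (module.translation.pow(2).sum() + module.rot_residual.pow(2).sum())
```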
4.2.3 Exposure Input

Training images may be captured across a wide range of exposure levels, which can impact NeRF training if left unaccounted for. We find that feeding the camera exposure information to the appearance prediction part of the model allows the NeRF to compensate for the visual differences (see Figure 3). Specifically, the exposure information is processed as γPE(shutter speed × analog gain / t), where γPE is a sinusoidal positional encoding with 4 levels, and t is a scaling factor (we use 1,000 in practice). An example of different learned exposures can be found in Figure 5.

Figure 5. Our model is conditioned on exposure, which helps account for exposure changes present in the training data. This allows users to alter the appearance of the output images in a human-interpretable manner during inference.
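A small sketch of this exposure feature, using the 4 levels and the scaling factor t = 1000 stated in the text (the function signature is illustrative):

```python
import numpy as np

def exposure_encoding(shutter_speed, analog_gain, t=1000.0, levels=4):
    """gamma_PE(shutter_speed * analog_gain / t) with 4 frequency levels."""
    e = np.asarray(shutter_speed * analog_gain / t, dtype=np.float64)
    freqs = 2.0 ** np.arange(levels)
    angles = e[..., None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```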
4.2.4 Transient Objects

While our method accounts for variation in appearance using the appearance embeddings, we assume that the scene geometry is consistent across the training data. Any movable objects (e.g. cars, pedestrians) typically violate this assumption. We therefore use a semantic segmentation model [10] to produce masks of common movable objects, and ignore masked areas during training. While this does not account for changes in otherwise static parts of the environment, e.g. construction, it accommodates most common types of geometric inconsistency.

4.2.5 Visibility Prediction

When merging multiple Block-NeRFs, it can be useful to know whether a specific region of space was visible to a given NeRF during training. We extend our model with an additional small MLP fv that is trained to learn an approximation of the visibility of a sampled point (see Figure 3). For each sample along a training ray, fv takes in the location and view direction and regresses the corresponding transmittance of the point (T_i in Equation 2). The model is trained alongside fσ, which provides supervision. Transmittance represents how visible a point is from a particular input camera: points in free space or on the surface of the first intersected object will have transmittance near 1, and points inside or behind the first visible object will have transmittance near 0. If a point is seen from some viewpoints but not others, the regressed transmittance value will be the average over all training cameras and lie between zero and one, indicating that the point is partially observed. Our visibility prediction is similar to the visibility fields proposed by Srinivasan et al. [58]. However, they used an MLP to predict visibility to environment lighting for the purpose of recovering a relightable NeRF model, while we predict visibility to training rays.

The visibility network is small and can be run independently from the color and density networks. This proves useful when merging multiple NeRFs, since it can help to determine whether a specific NeRF is likely to produce meaningful outputs for a given location, as explained in § 4.3.1. The visibility predictions can also be used to determine locations to perform appearance matching between two NeRFs, as detailed in § 4.3.3.
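The supervision of fv reduces to regressing the transmittance values that the density network already produces along training rays. A hedged sketch of that loss term (the 10^−6 scale is from the supplement; the network interface is an assumption):

```python
import torch

def visibility_loss(f_v, encoded_points, encoded_dirs, transmittance, scale=1e-6):
    """Train f_v to regress T_i (Eq. 2) for each sample along a training ray.

    encoded_points: (S, Dx) positional encodings of the samples
    encoded_dirs:   (S, Dd) encodings of the corresponding view directions
    transmittance:  (S,)    T_i computed by the density network (no gradient)
    """
    pred = f_v(torch.cat([encoded_points, encoded_dirs], dim=-1)).squeeze(-1)
    return scale * torch.mean((pred - transmittance.detach()) ** 2)
```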
4.3. Merging Multiple Block-NeRFs

4.3.1 Block-NeRF Selection

The environment can be composed of an arbitrary number of Block-NeRFs. For efficiency, we utilize two filtering mechanisms to only render relevant blocks for the given target viewpoint. We only consider Block-NeRFs that are within a set radius of the target viewpoint. Additionally, for each of these candidates, we compute the associated visibility. If the mean visibility is below a threshold, we discard the Block-NeRF. An example of visibility filtering is provided in Figure 2. Visibility can be computed quickly because its network is independent of the color network, and it does not need to be rendered at the target image resolution. After filtering, there are typically one to three Block-NeRFs left to merge.
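The two filters amount to only a few lines. A sketch under the assumption that each block exposes its origin, radius, and visibility network; the threshold value is a placeholder, not a number from the paper:

```python
import numpy as np

def select_blocks(target_position, blocks, visibility_threshold=0.1):
    """Keep blocks whose radius covers the target view and whose mean
    predicted visibility for that view exceeds a threshold.

    Each `block` is assumed to provide .origin, .radius and a
    .mean_visibility(position) helper that evaluates its small f_v network
    on a coarse set of rays (no color rendering required).
    """
    candidates = [b for b in blocks
                  if np.linalg.norm(target_position - b.origin) <= b.radius]
    return [b for b in candidates
            if b.mean_visibility(target_position) >= visibility_threshold]
```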
4.3.2 Block-NeRF Compositing

We render color images from each of the filtered Block-NeRFs and interpolate between them in image space, using inverse distance weighting between the camera origin and the Block-NeRF centers (see § 5.4 for comparisons with alternative interpolation schemes).

4.3.3 Appearance Matching

The appearance of adjacent Block-NeRFs can differ after independent training. Given a fixed target appearance for one Block-NeRF, we optimize the appearance embeddings of its neighbors against the target in order to reduce the ℓ2 loss between the respective area renders. This optimization is quick, converging within 100 iterations. While not necessarily yielding perfect alignment, this procedure aligns most global and low-frequency attributes of the scene, such as time of day, color balance, and weather, which is a prerequisite for successful compositing. Figure 6 shows an example optimization, where appearance matching turns a daytime scene into nighttime to match the adjacent Block-NeRF.

The optimized appearance is iteratively propagated through the scene. Starting from one root Block-NeRF, we optimize the appearance of the neighboring ones and continue the process from there. If multiple blocks surrounding a target Block-NeRF have already been optimized, we consider each of them when computing the loss.
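Appearance matching is a small test-time optimization: with all network weights frozen, only the neighbor's appearance code is updated to minimize the ℓ2 difference between the two renders of a shared region. A sketch (the rendering interface and learning rate are assumptions; the 100-iteration budget is from the text):

```python
import torch

def match_appearance(render_fn_target, render_fn_neighbor, code, iters=100, lr=1e-2):
    """Optimize the neighbor's appearance code so its render of an overlapping
    region matches a frozen render from the Block-NeRF with the target appearance.

    render_fn_target():    -> (H, W, 3) render with the fixed target appearance
    render_fn_neighbor(c): -> (H, W, 3) render of the same region conditioned on code c
    code: (D,) initial appearance embedding of the neighboring Block-NeRF
    """
    code = code.detach().clone().requires_grad_(True)
    target = render_fn_target().detach()
    opt = torch.optim.Adam([code], lr=lr)
    for _ in range(iters):                  # converges within ~100 iterations
        opt.zero_grad()
        loss = torch.mean((render_fn_neighbor(code) - target) ** 2)
        loss.backward()
        opt.step()
    return code.detach()
```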
5. Results and Experiments

In this section we will discuss our datasets and experiments. The architectural and optimization specifics are provided in the supplement. The supplement also provides comparisons to reconstructions from COLMAP [55], a traditional Structure from Motion approach. This reconstruction is sparse and fails to represent reflective surfaces and the sky.
Figure 6. When rendering scenes based on multiple Block-NeRFs, we use appearance matching to obtain a consistent appearance across the
scene. Given a fixed target appearance for one of the Block-NeRFs (left image), we optimize the appearances of the adjacent Block-NeRFs
to match. In this example, appearance matching produces a consistent night appearance across Block-NeRFs.
San Francisco Alamo Square Dataset. We select San Francisco's Alamo Square neighborhood as the target area for our scalability experiments. The dataset spans an area of approximately 960 m × 570 m, and was recorded in June, July, and August of 2021. We divide this dataset into 35 Block-NeRFs. Example renderings and Block-NeRF placements can be seen in Figure 1. To best appreciate the scale of the reconstruction, please refer to the supplementary videos. Each Block-NeRF was trained on data from 38 to 48 different data collection runs, adding up to a total driving time of 18 to 28 minutes each. After filtering out some redundant image captures (e.g. stationary captures), each Block-NeRF is trained on between 64,575 and 108,216 images. The overall dataset is composed of 13.4 h of driving time sourced from 1,330 different data collection runs, with a total of 2,818,745 training images.

San Francisco Mission Bay Dataset. We choose San Francisco's Mission Bay District as the target area for our baseline, block size, and placement experiments. Mission Bay is an urban environment with challenging geometry and reflective facades. We identified a long stretch on Third Street with far-range visibility, making it an interesting test case. Notably, this dataset was recorded in a single capture in November 2020, with consistent environmental conditions allowing for simple evaluation. This dataset was recorded over 100 s, in which the data collection vehicle traveled 1.08 km and captured 12,000 total images from 12 cameras. We will release this single-capture dataset to aid reproducibility.

5.2. Model Ablations

We ablate our model modifications on a single intersection from the Alamo Square dataset. We report PSNR, SSIM, and LPIPS [75] metrics for the test image reconstructions in Table 1. The test images are split in half vertically, with the appearance embeddings being optimized on one half and tested on the other. We also provide qualitative examples in Figure 7. Mip-NeRF alone fails to properly reconstruct the scene and is prone to adding non-existent geometry and cloudy artifacts to explain the differences in appearance. When our method is not trained with appearance embeddings, these artifacts are still present. If our method is not trained with pose optimization, the resulting scene is blurrier and can contain duplicated objects due to pose misalignment. Finally, the exposure input marginally improves the reconstruction, but more importantly provides us with the ability to change the exposure during inference.

                       PSNR↑   SSIM↑   LPIPS↓
  mip-NeRF             17.86   0.563   0.509
  Ours  -Appearance    20.13   0.611   0.458
  Ours  -Exposure      23.55   0.649   0.418
  Ours  -Pose Opt.     23.05   0.625   0.442
  Ours  Full           23.60   0.649   0.417

Table 1. Ablations of different Block-NeRF components on a single intersection in the Alamo Square dataset. We show the performance of mip-NeRF as a baseline, as well as the effect of removing individual components from our method.

5.3. Block-NeRF Size and Placement

We compare performance on our Mission Bay dataset versus the number of Block-NeRFs used. We show details in Table 2, where depending on granularity, the Block-NeRF sizes range from as small as 54 m to as large as 544 m. We ensure that each pair of adjacent blocks overlaps by 50% and compare other overlap percentages in the supplement.

  # Blocks   Weights / Total   Size    Compute   PSNR↑   SSIM↑   LPIPS↓
  1          0.25M / 0.25M     544 m   1×        23.83   0.825   0.381
  4          0.25M / 1.00M     271 m   2×        25.55   0.868   0.318
  8          0.25M / 2.00M     116 m   2×        26.59   0.890   0.278
  16         0.25M / 4.00M     54 m    2×        27.40   0.907   0.242
  1          1.00M / 1.00M     544 m   1×        24.90   0.852   0.340
  4          0.25M / 1.00M     271 m   0.5×      25.55   0.868   0.318
  8          0.13M / 1.00M     116 m   0.25×     25.92   0.875   0.306
  16         0.07M / 1.00M     54 m    0.125×    25.98   0.877   0.305

Table 2. Comparison of different numbers of Block-NeRFs for reconstructing the Mission Bay dataset. Splitting the scene into multiple Block-NeRFs improves the reconstruction accuracy, even when holding the total number of weights constant (bottom section). The number of blocks determines the size of the area each block is trained on and the relative compute expense at inference time.
Figure 7. Model ablation results on multi-segment data (columns: Ground Truth, mip-NeRF, and the Block-NeRF variants -Appearance, -Exposure, -Pose Opt., and Full). Appearance embeddings help the network avoid adding cloudy geometry to explain away changes in the environment like weather and lighting. Removing exposure slightly decreases the accuracy. The pose optimization helps sharpen the results and removes ghosting from repeated objects, as observed with the telephone pole in the first row.
All were evaluated on the same set of held-out test images spanning the entire trajectory. We consider two regimes, one where each Block-NeRF contains the same number of weights (top section) and one where the total number of weights across all Block-NeRFs is fixed (bottom section). In both cases, we observe that increasing the number of models improves the reconstruction metrics. In terms of computational expense, parallelization during training is trivial as each model can be optimized independently across devices. At inference, our method only requires rendering Block-NeRFs near the target view. Depending on the scene and NeRF layout, we typically render between one and three NeRFs. We report the relative compute expense in each setting without assuming any parallelization, which however would be possible and lead to an additional speed-up. Our results imply that splitting the scene into multiple lower-capacity models can reduce the overall computational cost, as not all of the models need to be evaluated (see bottom section of Table 2).

5.4. Interpolation Methods

We explore different interpolation methods in Table 3. The simple method of only rendering the nearest Block-NeRF to the camera requires the least amount of compute but results in harsh jumps when transitioning between blocks. These transitions can be smoothed by using inverse distance weighting (IDW) between the camera and Block-NeRF centers, as described in § 4.3.2. We also explored a variant of IDW where the interpolation was performed over projected 3D points predicted by the expected Block-NeRF depth. This method suffers when the depth prediction is incorrect, leading to artifacts and temporal incoherence.

Finally, we experiment with weighing the Block-NeRFs based on per-pixel and per-image predicted visibility. This produces sharper reconstructions of further-away areas but is prone to temporal inconsistency. Therefore, these methods are best used only when rendering still images. Further details are provided in the supplement.

  Interpolation            Consistent?   PSNR↑   SSIM↑   LPIPS↓
  Nearest                  –             26.40   0.887   0.280
  IDW 2D                   ✓             26.59   0.890   0.278
  IDW 3D                   –             26.57   0.890   0.278
  Pixelwise Visibility     –             27.39   0.906   0.242
  Imagewise Visibility     –             27.41   0.907   0.242

Table 3. Comparison of interpolation methods. For our flythrough video results, we opt for 2D inverse distance weighting (IDW) as it produces temporally consistent results.
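Image-space IDW amounts to blending whole renders with scalar weights proportional to the inverse distance between the camera and each Block-NeRF center raised to a power p (4 for Alamo Square and 1 for Mission Bay, per the supplement). A sketch, not the paper's implementation:

```python
import numpy as np

def composite_idw(renders, block_centers, camera_origin, p=4.0):
    """Blend per-block renders with image-space inverse distance weighting.

    renders:       list of (H, W, 3) images from the selected Block-NeRFs
    block_centers: (B, 3) centers of those Block-NeRFs
    camera_origin: (3,)   position of the target camera
    p:             IDW power (4 for Alamo Square, 1 for Mission Bay)
    """
    d = np.linalg.norm(np.asarray(block_centers) - camera_origin, axis=-1)
    d = np.maximum(d, 1e-6)             # avoid division by zero at a block center
    w = d ** (-p)
    w = w / w.sum()
    return sum(w_i * img for w_i, img in zip(w, renders))
```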
6. Limitations and Future Work

The proposed method handles transient objects by filtering them out during training via masking using a segmentation algorithm. If objects are not properly masked, they can cause artifacts in the resulting renderings. For example, the shadows of cars often remain, even when the car itself is correctly removed. Vegetation also breaks this assumption as foliage changes seasonally and moves in the wind; this results in blurred representations of trees and plants. Similarly, temporal inconsistencies in the training data, such as construction work, are not automatically handled and require the manual retraining of the affected blocks. Further, the inability to render scenes containing dynamic objects currently limits the applicability of Block-NeRF towards closed-loop simulation tasks in robotics. In the future, these issues could be addressed by learning transient objects during the optimization [40], or directly modeling dynamic objects [44, 67]. In particular, the scene could be composed of multiple Block-NeRFs of the environment and individual controllable object NeRFs. Separation can be facilitated by the use of segmentation masks or bounding boxes.
In our model, distant objects in the scene are not sampled with the same density as nearby objects, which leads to blurrier reconstructions. This is an issue with sampling unbounded volumetric representations. Techniques proposed in NeRF++ [74] and concurrent Mip-NeRF 360 [4] could potentially be used to produce sharper renderings of distant objects.

In many applications, real-time rendering is key, but NeRFs are computationally expensive to render (up to multiple seconds per image). Several NeRF caching techniques [20, 25, 72] or a sparse voxel grid [36] could be used to enable real-time Block-NeRF rendering. Similarly, multiple concurrent works have demonstrated techniques to speed up training of NeRF-style representations by multiple orders of magnitude [43, 60, 71].

7. Conclusion

In this paper we propose Block-NeRF, a method that reconstructs arbitrarily large environments using NeRFs. We demonstrate the method's efficacy by building an entire neighborhood in San Francisco from 2.8M images, forming the largest neural scene representation to date. We accomplish this scale by splitting our representation into multiple blocks that can be optimized independently. At such a scale, the data collected will necessarily have transient objects and variations in appearance, which we account for by modifying the underlying NeRF architecture. We hope that this can inspire future work in large-scale scene reconstruction using modern neural rendering methods.

References

[1] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day. Communications of the ACM, 2011. 2
[2] Alexander Amini, Igor Gilitschenski, Jacob Phillips, Julia Moseyko, Rohan Banerjee, Sertac Karaman, and Daniela Rus. Learning robust control policies for end-to-end autonomous driving from data-driven simulation. IEEE Robotics and Automation Letters, 2020. 3
[3] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. ICCV, 2021. 1, 3, 4
[4] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. arXiv preprint arXiv:2111.12077, 2021. 9
[5] Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. Optimizing the latent space of generative networks. arXiv:1707.05776, 2017. 4
[6] Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael Cohen. Unstructured lumigraph rendering. Computer graphics and interactive techniques, 2001. 3
[7] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. CVPR, 2020. 6
[8] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. CVPR, 2019. 6
[9] Yun Chen, Frieda Rong, Shivam Duggal, Shenlong Wang, Xinchen Yan, Sivabalan Manivasagam, Shangjie Xue, Ersin Yumer, and Raquel Urtasun. Geosim: Realistic video simulation via geometry-aware composition for self-driving. CVPR, 2021. 3
[10] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. CVPR, 2020. 5, 6
[11] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. CVPR, 2016. 6
[12] Jeevan Devaranjan, Amlan Kar, and Sanja Fidler. Meta-sim2: Unsupervised learning of scene structure for synthetic data generation. ECCV, 2020. 3
[13] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. Conference on robot learning, 2017. 1, 3
[14] Dawei Du, Yuankai Qi, Hongyang Yu, Yifan Yang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, and Qi Tian. The unmanned aerial vehicle benchmark: Object detection and tracking. ECCV, 2018. 1
[15] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to predict new views from the world's imagery. CVPR, 2016. 3
[16] Christian Früh and Avideh Zakhor. An automated method for large-scale, ground-based city model acquisition. IJCV, 2004. 2
[17] Yasutaka Furukawa, Brian Curless, Steven M Seitz, and Richard Szeliski. Towards internet-scale multi-view stereo. CVPR, 2010. 2
[18] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi-view stereopsis. IEEE TPAMI, 2010. 2
[19] Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. CVPR, 2016. 3
[20] Stephan J Garbin, Marek Kowalski, Matthew Johnson, Jamie Shotton, and Julien Valentin. Fastnerf: High-fidelity neural rendering at 200fps. arXiv:2103.10380, 2021. 9
[21] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. CVPR, 2012. 6
[22] Mordechai Haklay and Patrick Weber. Openstreetmap: User-generated street maps. IEEE Pervasive computing, 2008. 4
[23] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004. 2
[24] Peter Hedman, Julien Philip, True Price, Jan-Michael Frahm, George Drettakis, and Gabriel Brostow. Deep blending for free-viewpoint image-based rendering. ACM Transactions on Graphics (TOG), 2018. 3
[25] Peter Hedman, Pratul P Srinivasan, Ben Mildenhall, Jonathan T Barron, and Paul Debevec. Baking neural radiance fields for real-time view synthesis. arXiv:2103.14645, 2021. 9
[26] Amlan Kar, Aayush Prakash, Ming-Yu Liu, Eric Cameracci, Justin Yuan, Matt Rusiniak, David Acuna, Antonio Torralba, and Sanja Fidler. Meta-sim: Learning to generate synthetic datasets. ICCV, 2019. 3
[27] Michael Kazhdan and Hugues Hoppe. Screened poisson surface reconstruction. ACM Transactions on Graphics (ToG), 2013. 13
[28] Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. Drivegan: Towards a controllable high-quality neural simulation. CVPR, 2021. 3
[29] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015. 12
[30] Johannes Kopf, Billy Chen, Richard Szeliski, and Michael Cohen. Street slide: browsing street level imagery. ACM Transactions on Graphics (TOG), 2010. 3
[31] Johannes Kopf, Michael Cohen, and Rick Szeliski. First-person hyperlapse videos. SIGGRAPH, 2014. 3
[32] Wei Li, CW Pan, Rong Zhang, JP Ren, YX Ma, Jin Fang, FL Yan, QC Geng, XY Huang, HJ Gong, et al. Aads: Augmented autonomous driving simulation using data-driven algorithms. Science robotics, 2019. 1, 3
[33] Xiaowei Li, Changchang Wu, Christopher Zach, Svetlana Lazebnik, and Jan-Michael Frahm. Modeling and recognition of landmark image collections using iconic scene graphs. ECCV, 2008. 2
[34] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. Barf: Bundle-adjusting neural radiance fields. arXiv preprint arXiv:2104.06405, 2021. 3, 5
[35] Andrew Liu, Richard Tucker, Varun Jampani, Ameesh Makadia, Noah Snavely, and Angjoo Kanazawa. Infinite nature: Perpetual view generation of natural scenes from a single image. ICCV, 2021. 1
[36] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. NeurIPS, 2020. 3, 9
[37] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. SIGGRAPH, 2019. 3
[38] Frank Losasso and Hugues Hoppe. Geometry clipmaps: terrain rendering using nested regular grids. Siggraph, 2004. 2
[39] David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004. 2
[40] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. CVPR, 2021. 1, 3, 4, 8
[41] Moustafa Meshry, Dan B. Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey, Noah Snavely, and Ricardo Martin-Brualla. Neural rerendering in the wild. CVPR, 2019. 3
[42] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. ECCV, 2020. 1, 3, 4
[43] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. arXiv:2201.05989, Jan. 2022. 9
[44] Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. Neural scene graphs for dynamic scenes. CVPR, 2021. 1, 3, 8
[45] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. ICCV, 2021. 1
[46] Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional occupancy networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 523–540. Springer, 2020. 3
[47] Marc Pollefeys, David Nistér, J-M Frahm, Amir Akbarzadeh, Philippos Mordohai, Brian Clipp, Chris Engels, David Gallup, S-J Kim, Paul Merrell, et al. Detailed real-time urban 3d reconstruction from video. IJCV, 2008. 2
[48] Daniel Rebain, Wei Jiang, Soroosh Yazdani, Ke Li, Kwang Moo Yi, and Andrea Tagliasacchi. Derf: Decomposed radiance fields. CVPR, 2021. 3
[49] Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. KiloNeRF: Speeding up neural radiance fields with thousands of tiny MLPs. ICCV, 2021. 3
[50] Stephan R Richter, Hassan Abu AlHaija, and Vladlen Koltun. Enhancing photorealism enhancement. arXiv:2105.04619, 2021. 3
[51] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. ECCV, 2016. 3
[52] Gernot Riegler and Vladlen Koltun. Free view synthesis. ECCV, 2020. 3
[53] Gernot Riegler and Vladlen Koltun. Stable view synthesis. CVPR, 2021. 3
[54] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. CVPR, 2016. 3
[55] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. CVPR, 2016. 2, 6, 12
[56] Qi Shan, Riley Adams, Brian Curless, Yasutaka Furukawa, and Steven M. Seitz. The visual turing test for scene reconstruction. 3DV, 2013. 2
[57] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3d. SIGGRAPH, 2006. 2
[58] Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T. Barron. NeRV: Neural reflectance and visibility fields for relighting and view synthesis. CVPR, 2021. 5
[59] Shih-Yang Su, Frank Yu, Michael Zollhöfer, and Helge Rhodin. A-nerf: Articulated neural radiance fields for learning human shape, appearance, and pose. Advances in Neural Information Processing Systems, 34, 2021. 3, 5
[60] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. arXiv preprint arXiv:2111.11215, 2021. 9
[61] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. CVPR, 2020. 6
[62] Towaki Takikawa, Joey Litalien, Kangxue Yin, Karsten Kreis, Charles Loop, Derek Nowrouzezahrai, Alec Jacobson, Morgan McGuire, and Sanja Fidler. Neural geometric level of detail: Real-time rendering with implicit 3D shapes. CVPR, 2021. 3
[63] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS, 2020. 4
[64] Sebastian Thrun. Probabilistic robotics. Communications of the ACM, 2002. 2
[65] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. International workshop on vision algorithms, 1999. 2
[66] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf–: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021. 3, 5
[67] Bangbang Yang, Yinda Zhang, Yinghao Xu, Yijin Li, Han Zhou, Hujun Bao, Guofeng Zhang, and Zhaopeng Cui. Learning object-compositional neural radiance field for editable scene rendering. ICCV, 2021. 3, 8
[68] Zhenpei Yang, Yuning Chai, Dragomir Anguelov, Yin Zhou, Pei Sun, Dumitru Erhan, Sean Rafferty, and Henrik Kretzschmar. Surfelgan: Synthesizing realistic sensor data for autonomous driving. CVPR, 2020. 1, 3
[69] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33, 2020. 3
[70] Lin Yen-Chen, Pete Florence, Jonathan T. Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. iNeRF: Inverting neural radiance fields for pose estimation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021. 3, 5
[71] Alex Yu, Sara Fridovich-Keil, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. arXiv preprint arXiv:2112.05131, 2021. 9
[72] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. Plenoctrees for real-time rendering of neural radiance fields. arXiv:2103.14024, 2021. 9
[73] Jiakai Zhang, Xinhang Liu, Xinyi Ye, Fuqiang Zhao, Yanshun Zhang, Minye Wu, Yingliang Zhang, Lan Xu, and Jingyi Yu. Editable free-viewpoint video using a layered neural representation. ACM Transactions on Graphics (TOG), 2021. 3
[74] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020. 9
[75] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. CVPR, 2018. 7
[76] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv:1805.09817, 2018. 3
[77] Siyu Zhu, Runze Zhang, Lei Zhou, Tianwei Shen, Tian Fang, Ping Tan, and Long Quan. Very large-scale global SFM by distributed motion averaging. CVPR, 2018. 2
A. Model Parameters / Optimization Details

Our network follows the mip-NeRF structure. The network fσ is composed of 8 layers with width 512 (Mission Bay experiments) or 1024 (all other experiments). fc has 3 layers with width 128 and fv has 4 layers with width 128. The appearance embeddings are 32-dimensional. We train each Block-NeRF using the Adam [29] optimizer for 300 K iterations with a batch size of 16384. Similar to mip-NeRF, the learning rate is annealed logarithmically from 2·10^−3 to 2·10^−5, with a warm-up phase during the first 1024 iterations. The coarse and fine networks are sampled 256 times during training and 512 times when rendering the videos. The visibility is supervised with an MSE loss scaled by 10^−6. The learned pose correction consists of a position offset and a 3 × 3 residual rotation matrix, which is added to the identity matrix and normalized before being applied to ensure it is orthogonal. The pose corrections are initialized to 0 and their element-wise ℓ2 norm is regularized during training. This regularization is scaled by 10^5 at the start of training and linearly decays to 10^−1 after 5000 iterations. This allows the network to learn initial geometry prior to applying pose offsets.

Each Block-NeRF takes between 9 and 24 hours to train (depending on hyperparameters). We train each Block-NeRF on 32 TPU v3 cores available through Google Cloud Compute, which combined offer a total of 1680 TFLOPS and 512 GB of memory. Rendering a 1200 × 900 px image for a single Block-NeRF takes approximately 5.9 seconds. Multiple Block-NeRFs can be processed in parallel during inference (typically fewer than 3 Block-NeRFs need to be rendered for a single frame).
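The learning rate schedule above can be sketched as follows; the endpoints, the 300 k step count, and the 1024 warm-up iterations are from the text, while the linear warm-up shape is our assumption:

```python
import numpy as np

def learning_rate(step, lr_init=2e-3, lr_final=2e-5, max_steps=300_000, warmup=1024):
    """Log-linear interpolation from lr_init to lr_final, with a linear warm-up."""
    t = np.clip(step / max_steps, 0.0, 1.0)
    lr = np.exp((1.0 - t) * np.log(lr_init) + t * np.log(lr_final))
    if step < warmup:
        lr *= step / warmup      # assumed linear ramp up to the annealed value
    return lr
```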
B. Block-NeRF Size and Placement

We include qualitative comparisons in Figure 9 on the Mission Bay dataset to complement the quantitative comparisons in § 5.3 (Table 2). In this figure, we provide comparisons on two regimes, one where each Block-NeRF contains the same number of weights (left section) and one where the total number of weights across all Block-NeRFs is fixed (right section).

C. Block-NeRF Overlap Comparison

In the main paper, we include experiments on Block-NeRF size and placement (§ 5.3). For these experiments, we assumed a relative overlap of 50% between each pair of Block-NeRFs, which aids with appearance alignment. Table 4 is a direct extension of Table 2 in the main paper and shows the effect of varying block overlap in the 8 block scenario. Note that varying the overlap changes the spatial block size. The original setting in the main paper is marked with an asterisk.

The metrics imply that reducing overlap is beneficial for image quality metrics. However, this can likely be attributed to the resulting reduction in block size. In practice, having an overlap between blocks is important to avoid temporal artifacts when interpolating between Block-NeRFs.

  Overlap   Size    PSNR↑   SSIM↑   LPIPS↓
  0%        77 m    26.77   0.895   0.262
  25%       97 m    26.75   0.894   0.269
  50%*      116 m   26.59   0.890   0.278
  75%       136 m   26.51   0.887   0.283

Table 4. Effect of different NeRF overlaps in the 8 block scenario with 0.25M weights per block (2M weights in total). The original setting used in the main paper is marked with *.

D. Block-NeRF Interpolation Details

We experiment with multiple methods to interpolate between Block-NeRFs and find that simple inverse distance weighting (IDW) in image space produces the most appealing videos due to temporal smoothness. We use an IDW power p of 4 for the Alamo Square renderings and a power of 1 for the Mission Bay renderings. We experiment with 3D inverse distance weighting for each individual pixel by projecting the rendered pixels into 3D space using the expected ray termination depth from the Block-NeRF closest to the target view. The color value of the projected pixel is then determined using inverse distance weighting with the nearest Block-NeRFs. Artifacts occur in the resulting composited renders due to noise in the depth predictions. We also experiment with using the Block-NeRF predicted visibility for interpolation. We consider imagewise visibility, where we take the mean visibility of the entire image, and pixelwise visibility, where we directly utilize the per-pixel visibility predictions. Both of these methods lead to sharper results but come at the cost of temporal inconsistencies. Finally, we compare to nearest neighbor interpolation, where we only render the Block-NeRF closest to the target view. This results in harsh jumps when transitioning between Block-NeRFs.

E. Structure from Motion (COLMAP)

We use COLMAP [55] to reconstruct the Mission Bay dataset. We first split the dataset into 8 overlapping blocks with a 97 m radius each based on camera positions (each block has roughly 25% overlap with the adjacent block). The bundle adjustment step takes most of the time in reconstruction and we do not see significant improvements if we increase the radius per block. We mask out movable objects when extracting feature points for matching, using the same segmentation model as Block-NeRF. We assume a pinhole camera model and provide camera intrinsics and camera pose as priors for running structure-from-motion.
We then run multi-view stereo within each block to produce dense depth and normal maps in 3D and produce a dense point cloud of the scene. In our preliminary experiments, we ran Poisson meshing [27] on the fused dense pointcloud to reconstruct textured meshes, but found that the method fails to produce reasonable-looking results due to the challenging geometry and depth errors introduced by reflective surfaces and the sky. Instead, we leverage the fused pointcloud and explore two alternatives, namely point rendering and surfel rendering. To render the test view, we select the nearest scene and use OSMesa off-screen rendering (https://docs.mesa3d.org/osmesa.html), assuming a Lambertian model and a single light source.

In Table 5, we compare the two different rendering options for the densely reconstructed pointcloud. We discard the invisible pixels when computing the PSNR for both methods, making the quantitative results comparable to our Block-NeRF setting.

In Figure 8, we show the qualitative comparisons between the two rendering options with PSNR on the corresponding images. This reconstruction is sparse and fails to represent reflective surfaces and the sky.

Figure 8. Qualitative results for COLMAP. We demonstrate the two rendering options (point rendering and surfel rendering, shown against the ground truth) using the fused pointcloud computed by COLMAP.

F. Examples from our Datasets

In Figure 10, we show the camera images from our Mission Bay dataset. In Figure 11, we show both camera images and corresponding segmentation masks from our Alamo Square dataset.

G. Societal Impact

G.1. Methodological

Our method inherits the heavy compute footprint of NeRF models, and we propose to apply them at an unprecedented scale. Our method also unlocks new use-cases for neural rendering, such as building detailed maps of the environment (mapping), which could cause more wide-spread use in favor of less computationally involved alternatives. Depending on the scale this work is being applied at, its compute demands can lead to or worsen environmental damage if the energy used for compute leads to increased carbon emissions. As mentioned in the paper, we foresee further work, such as caching methods, that could reduce the compute demands and thus mitigate the environmental damage.

G.2. Application

We apply our method to real city environments. During our own data collection efforts for this paper, we were careful to blur faces and sensitive information, such as license plates, and limited our driving to public roads. Future applications of this work might entail even larger data collection efforts, which raises further privacy concerns. While detailed imagery of public roads can already be found on services like Google Street View, our methodology could promote repeated and more regular scans of the environment. Several companies in the autonomous vehicle space are also known to perform regular area scans using their fleet of vehicles; however, some might only utilize LiDAR scans, which can be less sensitive than collecting camera imagery.
Figure 9. Qualitative results on Block-NeRF size and placement. We show results on the Mission Bay dataset using the different options discussed in § 5.3 of the main paper (left group: fixed number of weights per block; right group: fixed total number of weights; columns: ground truth followed by 1, 4, 8, and 16 Block-NeRFs).
Figure 11. Selection of front-facing images from our Alamo Square Dataset, alongside their transient object mask predicted by a pretrained
semantic segmentation model.