Block-NeRF: Scalable Large Scene Neural View Synthesis

arXiv:2202.05263v1 [cs.CV] 10 Feb 2022
Figure 1. Block-NeRF is a method that enables large-scale scene reconstruction by representing the environment using multiple compact
NeRFs that each fit into memory. At inference time, Block-NeRF seamlessly combines renderings of the relevant NeRFs for the given area.
In this example, we reconstruct the Alamo Square neighborhood in San Francisco using data collected over 3 months. Block-NeRF can
update individual blocks of the environment without retraining on the entire scene, as demonstrated by the construction on the right. Video
results can be found on the project website waymo.com/research/block-nerf.
Abstract

We present Block-NeRF, a variant of Neural Radiance Fields that can represent large-scale environments. Specifically, we demonstrate that when scaling NeRF to render city-scale scenes spanning multiple blocks, it is vital to decompose the scene into individually trained NeRFs. This decomposition decouples rendering time from scene size, enables rendering to scale to arbitrarily large environments, and allows per-block updates of the environment. We adopt several architectural changes to make NeRF robust to data captured over months under different environmental conditions. We add appearance embeddings, learned pose refinement, and controllable exposure to each individual NeRF, and introduce a procedure for aligning appearance between adjacent NeRFs so that they can be seamlessly combined. We build a grid of Block-NeRFs from 2.8 million images to create the largest neural scene representation to date, capable of rendering an entire neighborhood of San Francisco.

1. Introduction

Recent advancements in neural rendering such as Neural Radiance Fields [42] have enabled photo-realistic reconstruction and novel view synthesis given a set of posed camera images [3, 40, 45]. Earlier works tended to focus on small-scale and object-centric reconstruction. Though some methods now address scenes the size of a single room or building, these are generally still limited and do not naïvely scale up to city-scale environments. Applying these methods to large environments typically leads to significant artifacts and low visual fidelity due to limited model capacity.

Reconstructing large-scale environments enables several important use-cases in domains such as autonomous driving [32, 44, 68] and aerial surveying [14, 35]. One example is mapping, where a high-fidelity map of the entire operating domain is created to act as a powerful prior for a variety of problems, including robot localization, navigation, and collision avoidance. Furthermore, large-scale scene reconstructions can be used for closed-loop robotic simulations [13]. Autonomous driving systems are commonly evaluated by re-simulating previously encountered scenarios; however, any deviation from the recorded encounter may change the vehicle's trajectory, requiring high-fidelity novel view renderings along the altered path. Beyond basic view synthesis, scene conditioned NeRFs are also capable of changing environmental lighting conditions such as camera exposure, weather, or time of day, which can be used to further augment simulation scenarios.

*Work done as an intern at Waymo.
Reconstructing such large-scale environments introduces additional challenges, including the presence of transient objects (cars and pedestrians), limitations in model capacity, along with memory and compute constraints. Furthermore, training data for such large environments is highly unlikely to be collected in a single capture under consistent conditions. Rather, data for different parts of the environment may need to be sourced from different data collection efforts, introducing variance in both scene geometry (e.g., construction work and parked cars), as well as appearance (e.g., weather conditions and time of day).

We extend NeRF with appearance embeddings and learned pose refinement to address these variations.
Geometry-based Image Reprojection. Many approaches to view synthesis start by applying traditional 3D reconstruction techniques to build a point cloud or triangle mesh representing the scene. This geometric "proxy" is then used to reproject pixels from the input images into new camera views, where they are blended by heuristic [6] or learning-based methods [24, 52, 53]. This approach has been scaled to long trajectories of first-person video [31], panoramas collected along a city street [30], and single landmarks from the Photo Tourism dataset [41]. Methods reliant on geometry proxies are limited by the quality of the initial 3D reconstruction, which hurts their performance in scenes with complex geometry or reflectance effects.

Volumetric Scene Representations. Recent view synthesis work has focused on unifying reconstruction and rendering and learning this pipeline end-to-end, typically using a volumetric scene representation. Methods for rendering small baseline view interpolation often use feed-forward networks to learn a mapping directly from input images to an output volume [15, 76], while methods such as Neural Volumes [37] that target larger-baseline view synthesis run a global optimization over all input images to reconstruct every new scene, similar to traditional bundle adjustment. Neural Radiance Fields (NeRF) [42] combines this single-scene optimization setting with a neural scene representation capable of representing complex scenes much more efficiently than a discrete 3D voxel grid; however, its rendering model scales very poorly to large-scale scenes in terms of compute. Followup work has proposed making NeRF more efficient by partitioning space into smaller regions, each containing its own lightweight NeRF network [48, 49]. Unlike our method, these network ensembles must be trained jointly, limiting their flexibility. Another approach is to provide extra capacity in the form of a coarse 3D grid of latent codes [36]. This approach has also been applied to compress detailed 3D shapes into neural signed distance functions [62] and to represent large scenes using occupancy networks [46].

We build our Block-NeRF implementation on top of mip-NeRF [3], which improves aliasing issues that hurt NeRF's performance in scenes where the input images observe the scene from many different distances. We incorporate techniques from NeRF in the Wild (NeRF-W) [40], which adds a latent code per training image to handle inconsistent scene appearance when applying NeRF to landmarks from the Photo Tourism dataset. NeRF-W creates a separate NeRF for each landmark from thousands of images, whereas our approach combines many NeRFs to reconstruct a coherent large environment from millions of images. Our model also incorporates a learned camera pose refinement which has been explored in previous works [34, 59, 66, 69, 70].

Some NeRF-based methods use segmentation data to isolate and reconstruct static [67] or moving objects (such as people or cars) [44, 73] across video sequences. As we focus primarily on reconstructing the environment itself, we choose to simply mask out dynamic objects during training.

Figure 3. Our model is an extension of the model presented in mip-NeRF [3]. The first MLP fσ predicts the density σ for a position x in space. The network also outputs a feature vector that is concatenated with the viewing direction d, the exposure level, and an appearance embedding. These are fed into a second MLP fc that outputs the color for the point. We additionally train a visibility network fv to predict whether a point in space was visible in the training views, which is used for culling Block-NeRFs during inference.

2.3. Urban Scene Camera Simulation

Camera simulation has become a popular data source for training and validating autonomous driving systems on interactive platforms [2, 28]. Early works [13, 19, 51, 54] synthesized data from scripted scenarios and manually created 3D assets. These methods suffered from domain mismatch and limited scene-level diversity. Several recent works tackle the simulation-to-reality gap by minimizing the distribution shifts in the simulation and rendering pipeline. Kar et al. [26] and Devaranjan et al. [12] proposed to minimize the scene-level distribution shift from rendered outputs to real camera sensor data through a learned scenario generation framework. Richter et al. [50] leveraged intermediate rendering buffers in the graphics pipeline to improve photorealism of synthetically generated camera images.

Towards the goal of building photo-realistic and scalable camera simulation, prior methods [9, 32, 68] leverage rich multi-sensor driving data collected during a single drive to reconstruct 3D scenes for object injection [9] and novel view synthesis [68] using modern machine learning techniques, including image GANs for 2D neural rendering. Relying on a sophisticated surfel reconstruction pipeline, SurfelGAN [68] is still susceptible to errors in graphical reconstruction and can suffer from the limited range and vertical field-of-view of LiDAR scans. In contrast to existing efforts, our work tackles the 3D rendering problem and is capable of modeling the real camera data captured from multiple drives under varying environmental conditions, such as weather and time of day, which is a prerequisite for reconstructing large-scale areas.
3. Background

We build upon NeRF [42] and its extension mip-NeRF [3]. Here, we summarize relevant parts of these methods. For details, please refer to the original papers.

3.1. NeRF and mip-NeRF Preliminaries

Neural Radiance Fields (NeRF) [42] is a coordinate-based neural scene representation that is optimized through a differentiable rendering loss to reproduce the appearance of a set of input images from known camera poses. After optimization, the NeRF model can be used to render previously unseen viewpoints.

The NeRF scene representation is a pair of multilayer perceptrons (MLPs). The first MLP fσ takes in a 3D position x and outputs volume density σ and a feature vector. This feature vector is concatenated with a 2D viewing direction d and fed into the second MLP fc, which outputs an RGB color c. This architecture ensures that the output color can vary when observed from different angles, allowing NeRF to represent reflections and glossy materials, but that the underlying geometry represented by σ is only a function of position.

Each pixel in an image corresponds to a ray r(t) = o + td through 3D space. To calculate the color of r, NeRF randomly samples distances {t_i}, i = 0, …, N, along the ray and passes the points r(t_i) and direction d through its MLPs to calculate σ_i and c_i. The resulting output color is

    c_out = Σ_{i=1}^{N} w_i c_i,   where   w_i = T_i (1 − e^{−Δ_i σ_i}),      (1)

    T_i = exp(−Σ_{j<i} Δ_j σ_j),   Δ_i = t_i − t_{i−1}.                       (2)

The full implementation of NeRF iteratively resamples the points t_i (by treating the weights w_i as a probability distribution) in order to better concentrate samples in areas of high density.
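For concreteness, the quadrature in Equations (1)–(2) can be written in a few lines. The following NumPy sketch is our own illustration of the standard volume-rendering math, not the paper's implementation; array shapes are assumptions:

```python
import numpy as np

def composite_ray(t, sigma, c):
    """Volume-rendering quadrature of Eqs. (1)-(2).

    t:     (N+1,) sample distances t_0 .. t_N along the ray
    sigma: (N,)   densities sigma_i predicted at r(t_i), i = 1 .. N
    c:     (N, 3) RGB colors c_i predicted at r(t_i)
    """
    delta = t[1:] - t[:-1]                      # Delta_i = t_i - t_{i-1}
    alpha = 1.0 - np.exp(-delta * sigma)        # opacity of each interval
    # T_i = exp(-sum_{j<i} Delta_j sigma_j): transmittance before sample i
    T = np.exp(-np.concatenate([[0.0], np.cumsum(delta * sigma)[:-1]]))
    w = T * alpha                               # weights w_i in Eq. (1)
    c_out = (w[:, None] * c).sum(axis=0)        # composited pixel color
    return c_out, w
```

The returned weights w_i are also what NeRF reuses as a sampling distribution in the hierarchical resampling step described above.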
To enable the NeRF MLPs to represent higher frequency detail [63], the inputs x and d are each preprocessed by a componentwise sinusoidal positional encoding γPE:

    γPE(z) = [sin(2^0 z), cos(2^0 z), …, sin(2^{L−1} z), cos(2^{L−1} z)],      (3)

where L is the number of levels of positional encoding.

NeRF's MLP fσ takes a single 3D point as input. However, this ignores both the relative footprint of the corresponding image pixel and the length of the interval [t_{i−1}, t_i] along the ray r containing the point, resulting in aliasing artifacts when rendering novel camera trajectories. Mip-NeRF [3] remedies this issue by using the projected pixel footprint to sample conical frustums along the ray rather than intervals. To feed these frustums into the MLP, mip-NeRF approximates each of them as a Gaussian distribution with parameters µ_i, Σ_i and replaces the positional encoding γPE with its expectation over the input Gaussian,

    γIPE(µ, Σ) = E_{X∼N(µ,Σ)}[γPE(X)],                                        (4)

referred to as an integrated positional encoding.
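As an illustration of Equations (3)–(4), and not the released mip-NeRF code, both encodings can be sketched as follows. For a Gaussian with diagonal covariance, the expectation in Eq. (4) has a closed form in which each frequency is attenuated by exp(−0.5 · 4^ℓ · variance); the grouping of sin and cos terms below differs from the interleaved order in Eq. (3) but is equivalent up to a permutation:

```python
import numpy as np

def positional_encoding(z, L):
    """gamma_PE of Eq. (3), applied componentwise to a vector z of shape (..., D)."""
    freqs = 2.0 ** np.arange(L)                     # 2^0 ... 2^(L-1)
    angles = z[..., None] * freqs                   # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*z.shape[:-1], -1)

def integrated_positional_encoding(mu, var, L):
    """gamma_IPE of Eq. (4) for a Gaussian with mean mu and per-dimension
    variance var (the diagonal of Sigma). Under the Gaussian, the expected
    sin/cos of frequency 2^l is damped by exp(-0.5 * 4^l * var)."""
    freqs = 2.0 ** np.arange(L)
    angles = mu[..., None] * freqs                  # (..., D, L)
    damping = np.exp(-0.5 * (freqs ** 2) * var[..., None])
    enc = np.concatenate([np.sin(angles) * damping, np.cos(angles) * damping], axis=-1)
    return enc.reshape(*mu.shape[:-1], -1)
```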
4. Method

Training a single NeRF does not scale when trying to represent scenes as large as cities. We instead propose splitting the environment into a set of Block-NeRFs that can be independently trained in parallel and composited during inference. This independence enables the ability to expand the environment with additional Block-NeRFs or update blocks without retraining the entire environment (see Figure 1). We dynamically select relevant Block-NeRFs for rendering, which are then composited in a smooth manner when traversing the scene. To aid with this compositing, we optimize the appearance codes to match lighting conditions and use interpolation weights computed based on each Block-NeRF's distance to the novel view.

4.1. Block Size and Placement

The individual Block-NeRFs should be arranged to collectively ensure full coverage of the target environment. We typically place one Block-NeRF at each intersection, covering the intersection itself and any connected street 75% of the way until it converges into the next intersection (see Figure 1). This results in a 50% overlap between any two adjacent blocks on the connecting street segment, making appearance alignment easier between them. Following this procedure means that the block size is variable; where necessary, additional blocks may be introduced as connectors between intersections. We ensure that the training data for each block stays exactly within its intended bounds by applying a geographical filter. This procedure can be automated and only relies on basic map data such as OpenStreetMap [22]. Note that other placement heuristics are also possible, as long as the entire environment is covered by at least one Block-NeRF. For example, for some of our experiments, we instead place blocks along a single street segment at uniform distances and define the block size as a sphere around the Block-NeRF origin (see Figure 2).
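A minimal sketch of such a geographical filter, assuming each block is parameterized by an origin and a spherical radius of influence (the function name and interface are ours, not the paper's):

```python
import numpy as np

def assign_images_to_blocks(image_positions, block_origins, block_radius):
    """Return, for each block, the indices of training images whose camera
    position falls inside that block's sphere of influence.

    image_positions: (M, 3) camera positions
    block_origins:   (B, 3) Block-NeRF origins (e.g., intersection centers)
    block_radius:    scalar or (B,) radius per block
    """
    d = np.linalg.norm(image_positions[:, None, :] - block_origins[None, :, :], axis=-1)
    inside = d <= np.asarray(block_radius)          # (M, B) boolean mask
    return [np.nonzero(inside[:, b])[0] for b in range(block_origins.shape[0])]
```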
4.2. Training Individual Block-NeRFs

4.2.1 Appearance Embeddings

Given that different parts of our data may be captured under different environmental conditions, we follow NeRF-W [40] and use Generative Latent Optimization [5] to optimize per-image appearance embedding vectors, as shown in Figure 3.
Figure 4. The appearance codes allow the model to represent different lighting and weather conditions.
This allows the NeRF to explain away several appearance-changing conditions, such as varying weather and lighting. We can additionally manipulate these appearance embeddings to interpolate between different conditions observed in the training data (such as cloudy versus clear skies, or day and night). Examples of rendering with different appearances can be seen in Figure 4. In § 4.3.3, we use test-time optimization over these embeddings to match the appearance of adjacent Block-NeRFs, which is important when combining multiple renderings.
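Conceptually, Generative Latent Optimization here just means that each training image owns a free embedding vector that is optimized by the same photometric loss as the network weights. A hedged PyTorch-style sketch (the 32-dimensional code size follows the supplement; the module interface is illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class AppearanceEmbeddings(nn.Module):
    """One free latent code per training image, optimized jointly with the
    NeRF weights (Generative Latent Optimization)."""

    def __init__(self, num_images: int, dim: int = 32):
        super().__init__()
        self.codes = nn.Embedding(num_images, dim)
        nn.init.normal_(self.codes.weight, std=0.01)

    def forward(self, image_ids: torch.Tensor) -> torch.Tensor:
        # image_ids: (num_rays,) index of the image each training ray came from.
        return self.codes(image_ids)

# The returned code is concatenated with the viewing direction and exposure
# features before the second MLP f_c (Figure 3).
```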
4.2.2 Learned Pose Refinement

Although we assume that camera poses are provided, we find it advantageous to learn regularized pose offsets for further alignment. Pose refinement has been explored in previous NeRF-based models [34, 59, 66, 70]. These offsets are learned per driving segment and include both a translation and a rotation component. We optimize these offsets jointly with the NeRF itself, significantly regularizing the offsets in the early phase of training to allow the network to first learn a rough structure prior to modifying the poses.
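A sketch of how such per-segment offsets and their decaying regularizer could look; the penalty schedule values are taken from the supplement, while the SVD-based re-orthonormalization is only one possible choice for the "normalization" mentioned there, and the interface is ours:

```python
import torch
import torch.nn as nn

class PoseRefinement(nn.Module):
    """Learned per-segment pose offsets: a translation and a residual rotation
    added to the identity and projected back to a valid rotation. How the
    result is composed with the original camera-to-world transform is an
    implementation choice; this sketch only produces the correction."""

    def __init__(self, num_segments: int):
        super().__init__()
        self.translation = nn.Parameter(torch.zeros(num_segments, 3))
        self.rot_residual = nn.Parameter(torch.zeros(num_segments, 3, 3))

    def forward(self, segment_ids: torch.Tensor):
        R = torch.eye(3, device=self.rot_residual.device) + self.rot_residual[segment_ids]
        U, _, Vh = torch.linalg.svd(R)      # one way to re-orthonormalize
        return U @ Vh, self.translation[segment_ids]

def pose_regularizer(module: PoseRefinement, step: int, total: int = 5000):
    # Penalty weight decays linearly from 1e5 to 1e-1 over the first `total` steps.
    w = 1e5 + (1e-1 - 1e5) * min(step / total, 1.0)
    return w * (module.translation.pow(2).sum() + module.rot_residual.pow(2).sum())
```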
4.2.3 Exposure Input

Training images may be captured across a wide range of exposure levels, which can impact NeRF training if left unaccounted for. We find that feeding the camera exposure information to the appearance prediction part of the model allows the NeRF to compensate for the visual differences (see Figure 3). Specifically, the exposure information is processed as γPE(shutter speed × analog gain / t), where γPE is a sinusoidal positional encoding with 4 levels, and t is a scaling factor (we use 1,000 in practice). An example of different learned exposures can be found in Figure 5.

Figure 5. Our model is conditioned on exposure, which helps account for exposure changes present in the training data. This allows users to alter the appearance of the output images in a human-interpretable manner during inference.
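A small sketch of this exposure feature, using the 4 levels and the scaling factor t = 1000 stated in the text (the function signature is illustrative):

```python
import numpy as np

def exposure_encoding(shutter_speed, analog_gain, t=1000.0, levels=4):
    """gamma_PE(shutter_speed * analog_gain / t) with 4 frequency levels."""
    e = np.asarray(shutter_speed * analog_gain / t, dtype=np.float64)
    freqs = 2.0 ** np.arange(levels)
    angles = e[..., None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```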
4.2.4 Transient Objects

While our method accounts for variation in appearance using the appearance embeddings, we assume that the scene geometry is consistent across the training data. Any movable objects (e.g. cars, pedestrians) typically violate this assumption. We therefore use a semantic segmentation model [10] to produce masks of common movable objects, and ignore masked areas during training. While this does not account for changes in otherwise static parts of the environment, e.g. construction, it accommodates most common types of geometric inconsistency.

4.2.5 Visibility Prediction

When merging multiple Block-NeRFs, it can be useful to know whether a specific region of space was visible to a given NeRF during training. We extend our model with an additional small MLP fv that is trained to learn an approximation of the visibility of a sampled point (see Figure 3). For each sample along a training ray, fv takes in the location and view direction and regresses the corresponding transmittance of the point (T_i in Equation 2). The model is trained alongside fσ, which provides supervision. Transmittance represents how visible a point is from a particular input camera: points in free space or on the surface of the first intersected object will have transmittance near 1, and points inside or behind the first visible object will have transmittance near 0. If a point is seen from some viewpoints but not others, the regressed transmittance value will be the average over all training cameras and lie between zero and one, indicating that the point is partially observed. Our visibility prediction is similar to the visibility fields proposed by Srinivasan et al. [58]. However, they used an MLP to predict visibility to environment lighting for the purpose of recovering a relightable NeRF model, while we predict visibility to training rays.

The visibility network is small and can be run independently from the color and density networks. This proves useful when merging multiple NeRFs, since it can help to determine whether a specific NeRF is likely to produce meaningful outputs for a given location, as explained in § 4.3.1. The visibility predictions can also be used to determine locations to perform appearance matching between two NeRFs, as detailed in § 4.3.3.
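The supervision of fv reduces to regressing the transmittance values that the density network already produces along training rays. A hedged sketch of that loss term (the 10^−6 scale is from the supplement; the network interface is an assumption):

```python
import torch

def visibility_loss(f_v, encoded_points, encoded_dirs, transmittance, scale=1e-6):
    """Train f_v to regress T_i (Eq. 2) for each sample along a training ray.

    encoded_points: (S, Dx) positional encodings of the samples
    encoded_dirs:   (S, Dd) encodings of the corresponding view directions
    transmittance:  (S,)    T_i computed by the density network (no gradient)
    """
    pred = f_v(torch.cat([encoded_points, encoded_dirs], dim=-1)).squeeze(-1)
    return scale * torch.mean((pred - transmittance.detach()) ** 2)
```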
4.3. Merging Multiple Block-NeRFs

4.3.1 Block-NeRF Selection

The environment can be composed of an arbitrary number of Block-NeRFs. For efficiency, we utilize two filtering mechanisms to only render relevant blocks for the given target viewpoint. We only consider Block-NeRFs that are within a set radius of the target viewpoint. Additionally, for each of these candidates, we compute the associated visibility. If the mean visibility is below a threshold, we discard the Block-NeRF. An example of visibility filtering is provided in Figure 2. Visibility can be computed quickly because its network is independent of the color network, and it does not need to be rendered at the target image resolution. After filtering, there are typically one to three Block-NeRFs left to merge.
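The two filters amount to only a few lines. A sketch under the assumption that each block exposes its origin, radius, and visibility network; the threshold value is a placeholder, not a number from the paper:

```python
import numpy as np

def select_blocks(target_position, blocks, visibility_threshold=0.1):
    """Keep blocks whose radius covers the target view and whose mean
    predicted visibility for that view exceeds a threshold.

    Each `block` is assumed to provide .origin, .radius and a
    .mean_visibility(position) helper that evaluates its small f_v network
    on a coarse set of rays (no color rendering required).
    """
    candidates = [b for b in blocks
                  if np.linalg.norm(target_position - b.origin) <= b.radius]
    return [b for b in candidates
            if b.mean_visibility(target_position) >= visibility_threshold]
```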
4.3.2 Block-NeRF Compositing

We render color images from each of the filtered Block-NeRFs and interpolate between them in image space, using inverse distance weighting between the camera origin and the Block-NeRF centers (see § 5.4 for comparisons with alternative interpolation schemes).

4.3.3 Appearance Matching

The appearance of adjacent Block-NeRFs can differ after independent training. Given a fixed target appearance for one Block-NeRF, we optimize the appearance embeddings of its neighbors against the target in order to reduce the ℓ2 loss between the respective area renders. This optimization is quick, converging within 100 iterations. While not necessarily yielding perfect alignment, this procedure aligns most global and low-frequency attributes of the scene, such as time of day, color balance, and weather, which is a prerequisite for successful compositing. Figure 6 shows an example optimization, where appearance matching turns a daytime scene into nighttime to match the adjacent Block-NeRF.

The optimized appearance is iteratively propagated through the scene. Starting from one root Block-NeRF, we optimize the appearance of the neighboring ones and continue the process from there. If multiple blocks surrounding a target Block-NeRF have already been optimized, we consider each of them when computing the loss.
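Appearance matching is a small test-time optimization: with all network weights frozen, only the neighbor's appearance code is updated to minimize the ℓ2 difference between the two renders of a shared region. A sketch (the rendering interface and learning rate are assumptions; the 100-iteration budget is from the text):

```python
import torch

def match_appearance(render_fn_target, render_fn_neighbor, code, iters=100, lr=1e-2):
    """Optimize the neighbor's appearance code so its render of an overlapping
    region matches a frozen render from the Block-NeRF with the target appearance.

    render_fn_target():    -> (H, W, 3) render with the fixed target appearance
    render_fn_neighbor(c): -> (H, W, 3) render of the same region conditioned on code c
    code: (D,) initial appearance embedding of the neighboring Block-NeRF
    """
    code = code.detach().clone().requires_grad_(True)
    target = render_fn_target().detach()
    opt = torch.optim.Adam([code], lr=lr)
    for _ in range(iters):                  # converges within ~100 iterations
        opt.zero_grad()
        loss = torch.mean((render_fn_neighbor(code) - target) ** 2)
        loss.backward()
        opt.step()
    return code.detach()
```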
5. Results and Experiments

In this section we will discuss our datasets and experiments. The architectural and optimization specifics are provided in the supplement. The supplement also provides comparisons to reconstructions from COLMAP [55], a traditional Structure from Motion approach. This reconstruction is sparse and fails to represent reflective surfaces and the sky.
Figure 6. When rendering scenes based on multiple Block-NeRFs, we use appearance matching to obtain a consistent appearance across the
scene. Given a fixed target appearance for one of the Block-NeRFs (left image), we optimize the appearances of the adjacent Block-NeRFs
to match. In this example, appearance matching produces a consistent night appearance across Block-NeRFs.
San Francisco Alamo Square Dataset. We select San Francisco's Alamo Square neighborhood as the target area for our scalability experiments. The dataset spans an area of approximately 960 m × 570 m, and was recorded in June, July, and August of 2021. We divide this dataset into 35 Block-NeRFs. Example renderings and Block-NeRF placements can be seen in Figure 1. To best appreciate the scale of the reconstruction, please refer to the supplementary videos. Each Block-NeRF was trained on data from 38 to 48 different data collection runs, adding up to a total driving time of 18 to 28 minutes each. After filtering out some redundant image captures (e.g. stationary captures), each Block-NeRF is trained on between 64,575 and 108,216 images. The overall dataset is composed of 13.4 h of driving time sourced from 1,330 different data collection runs, with a total of 2,818,745 training images.

San Francisco Mission Bay Dataset. We choose San Francisco's Mission Bay District as the target area for our baseline, block size, and placement experiments. Mission Bay is an urban environment with challenging geometry and reflective facades. We identified a long stretch on Third Street with far-range visibility, making it an interesting test case. Notably, this dataset was recorded in a single capture in November 2020, with consistent environmental conditions allowing for simple evaluation. This dataset was recorded over 100 s, in which the data collection vehicle traveled 1.08 km and captured 12,000 total images from 12 cameras. We will release this single-capture dataset to aid reproducibility.

5.2. Model Ablations

We ablate our model modifications on a single intersection from the Alamo Square dataset. We report PSNR, SSIM, and LPIPS [75] metrics for the test image reconstructions in Table 1. The test images are split in half vertically, with the appearance embeddings being optimized on one half and tested on the other. We also provide qualitative examples in Figure 7. Mip-NeRF alone fails to properly reconstruct the scene and is prone to adding non-existent geometry and cloudy artifacts to explain the differences in appearance. When our method is not trained with appearance embeddings, these artifacts are still present. If our method is not trained with pose optimization, the resulting scene is blurrier and can contain duplicated objects due to pose misalignment. Finally, the exposure input marginally improves the reconstruction, but more importantly provides us with the ability to change the exposure during inference.

                       PSNR↑   SSIM↑   LPIPS↓
  mip-NeRF             17.86   0.563   0.509
  Ours  -Appearance    20.13   0.611   0.458
  Ours  -Exposure      23.55   0.649   0.418
  Ours  -Pose Opt.     23.05   0.625   0.442
  Ours  Full           23.60   0.649   0.417

Table 1. Ablations of different Block-NeRF components on a single intersection in the Alamo Square dataset. We show the performance of mip-NeRF as a baseline, as well as the effect of removing individual components from our method.

5.3. Block-NeRF Size and Placement

We compare performance on our Mission Bay dataset versus the number of Block-NeRFs used. We show details in Table 2, where depending on granularity, the Block-NeRF sizes range from as small as 54 m to as large as 544 m. We ensure that each pair of adjacent blocks overlaps by 50% and compare other overlap percentages in the supplement.

  # Blocks   Weights / Total   Size    Compute   PSNR↑   SSIM↑   LPIPS↓
  1          0.25M / 0.25M     544 m   1×        23.83   0.825   0.381
  4          0.25M / 1.00M     271 m   2×        25.55   0.868   0.318
  8          0.25M / 2.00M     116 m   2×        26.59   0.890   0.278
  16         0.25M / 4.00M     54 m    2×        27.40   0.907   0.242
  1          1.00M / 1.00M     544 m   1×        24.90   0.852   0.340
  4          0.25M / 1.00M     271 m   0.5×      25.55   0.868   0.318
  8          0.13M / 1.00M     116 m   0.25×     25.92   0.875   0.306
  16         0.07M / 1.00M     54 m    0.125×    25.98   0.877   0.305

Table 2. Comparison of different numbers of Block-NeRFs for reconstructing the Mission Bay dataset. Splitting the scene into multiple Block-NeRFs improves the reconstruction accuracy, even when holding the total number of weights constant (bottom section). The number of blocks determines the size of the area each block is trained on and the relative compute expense at inference time.
Figure 7. Model ablation results on multi-segment data (columns: Ground Truth, mip-NeRF, and the Block-NeRF variants -Appearance, -Exposure, -Pose Opt., and Full). Appearance embeddings help the network avoid adding cloudy geometry to explain away changes in the environment like weather and lighting. Removing exposure slightly decreases the accuracy. The pose optimization helps sharpen the results and removes ghosting from repeated objects, as observed with the telephone pole in the first row.
All were evaluated on the same set of held-out test images spanning the entire trajectory. We consider two regimes, one where each Block-NeRF contains the same number of weights (top section) and one where the total number of weights across all Block-NeRFs is fixed (bottom section). In both cases, we observe that increasing the number of models improves the reconstruction metrics. In terms of computational expense, parallelization during training is trivial as each model can be optimized independently across devices. At inference, our method only requires rendering Block-NeRFs near the target view. Depending on the scene and NeRF layout, we typically render between one and three NeRFs. We report the relative compute expense in each setting without assuming any parallelization, which however would be possible and lead to an additional speed-up. Our results imply that splitting the scene into multiple lower-capacity models can reduce the overall computational cost, as not all of the models need to be evaluated (see bottom section of Table 2).

5.4. Interpolation Methods

We explore different interpolation methods in Table 3. The simple method of only rendering the nearest Block-NeRF to the camera requires the least amount of compute but results in harsh jumps when transitioning between blocks. These transitions can be smoothed by using inverse distance weighting (IDW) between the camera and Block-NeRF centers, as described in § 4.3.2. We also explored a variant of IDW where the interpolation was performed over projected 3D points predicted by the expected Block-NeRF depth. This method suffers when the depth prediction is incorrect, leading to artifacts and temporal incoherence.

Finally, we experiment with weighing the Block-NeRFs based on per-pixel and per-image predicted visibility. This produces sharper reconstructions of further-away areas but is prone to temporal inconsistency. Therefore, these methods are best used only when rendering still images. Further details are provided in the supplement.

  Interpolation            Consistent?   PSNR↑   SSIM↑   LPIPS↓
  Nearest                  –             26.40   0.887   0.280
  IDW 2D                   ✓             26.59   0.890   0.278
  IDW 3D                   –             26.57   0.890   0.278
  Pixelwise Visibility     –             27.39   0.906   0.242
  Imagewise Visibility     –             27.41   0.907   0.242

Table 3. Comparison of interpolation methods. For our flythrough video results, we opt for 2D inverse distance weighting (IDW) as it produces temporally consistent results.
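Image-space IDW amounts to blending whole renders with scalar weights proportional to the inverse distance between the camera and each Block-NeRF center raised to a power p (4 for Alamo Square and 1 for Mission Bay, per the supplement). A sketch, not the paper's implementation:

```python
import numpy as np

def composite_idw(renders, block_centers, camera_origin, p=4.0):
    """Blend per-block renders with image-space inverse distance weighting.

    renders:       list of (H, W, 3) images from the selected Block-NeRFs
    block_centers: (B, 3) centers of those Block-NeRFs
    camera_origin: (3,)   position of the target camera
    p:             IDW power (4 for Alamo Square, 1 for Mission Bay)
    """
    d = np.linalg.norm(np.asarray(block_centers) - camera_origin, axis=-1)
    d = np.maximum(d, 1e-6)             # avoid division by zero at a block center
    w = d ** (-p)
    w = w / w.sum()
    return sum(w_i * img for w_i, img in zip(w, renders))
```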
6. Limitations and Future Work

The proposed method handles transient objects by filtering them out during training via masking using a segmentation algorithm. If objects are not properly masked, they can cause artifacts in the resulting renderings. For example, the shadows of cars often remain, even when the car itself is correctly removed. Vegetation also breaks this assumption as foliage changes seasonally and moves in the wind; this results in blurred representations of trees and plants. Similarly, temporal inconsistencies in the training data, such as construction work, are not automatically handled and require the manual retraining of the affected blocks. Further, the inability to render scenes containing dynamic objects currently limits the applicability of Block-NeRF towards closed-loop simulation tasks in robotics. In the future, these issues could be addressed by learning transient objects during the optimization [40], or directly modeling dynamic objects [44, 67]. In particular, the scene could be composed of multiple Block-NeRFs of the environment and individual controllable object NeRFs. Separation can be facilitated by the use of segmentation masks or bounding boxes.
In our model, distant objects in the scene are not sampled with the same density as nearby objects, which leads to blurrier reconstructions. This is an issue with sampling unbounded volumetric representations. Techniques proposed in NeRF++ [74] and concurrent Mip-NeRF 360 [4] could potentially be used to produce sharper renderings of distant objects.

In many applications, real-time rendering is key, but NeRFs are computationally expensive to render (up to multiple seconds per image). Several NeRF caching techniques [20, 25, 72] or a sparse voxel grid [36] could be used to enable real-time Block-NeRF rendering. Similarly, multiple concurrent works have demonstrated techniques to speed up training of NeRF-style representations by multiple orders of magnitude [43, 60, 71].

7. Conclusion

In this paper we propose Block-NeRF, a method that reconstructs arbitrarily large environments using NeRFs. We demonstrate the method's efficacy by building an entire neighborhood in San Francisco from 2.8M images, forming the largest neural scene representation to date. We accomplish this scale by splitting our representation into multiple blocks that can be optimized independently. At such a scale, the data collected will necessarily have transient objects and variations in appearance, which we account for by modifying the underlying NeRF architecture. We hope that this can inspire future work in large-scale scene reconstruction using modern neural rendering methods.

References

[1] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day. Communications of the ACM, 2011. 2
[2] Alexander Amini, Igor Gilitschenski, Jacob Phillips, Julia Moseyko, Rohan Banerjee, Sertac Karaman, and Daniela Rus. Learning robust control policies for end-to-end autonomous driving from data-driven simulation. IEEE Robotics and Automation Letters, 2020. 3
[3] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. ICCV, 2021. 1, 3, 4
[4] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. arXiv preprint arXiv:2111.12077, 2021. 9
[5] Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. Optimizing the latent space of generative networks. arXiv:1707.05776, 2017. 4
[6] Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael Cohen. Unstructured lumigraph rendering. Computer graphics and interactive techniques, 2001. 3
[7] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. CVPR, 2020. 6
[8] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. CVPR, 2019. 6
[9] Yun Chen, Frieda Rong, Shivam Duggal, Shenlong Wang, Xinchen Yan, Sivabalan Manivasagam, Shangjie Xue, Ersin Yumer, and Raquel Urtasun. Geosim: Realistic video simulation via geometry-aware composition for self-driving. CVPR, 2021. 3
[10] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. CVPR, 2020. 5, 6
[11] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. CVPR, 2016. 6
[12] Jeevan Devaranjan, Amlan Kar, and Sanja Fidler. Meta-sim2: Unsupervised learning of scene structure for synthetic data generation. ECCV, 2020. 3
[13] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. Conference on robot learning, 2017. 1, 3
[14] Dawei Du, Yuankai Qi, Hongyang Yu, Yifan Yang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, and Qi Tian. The unmanned aerial vehicle benchmark: Object detection and tracking. ECCV, 2018. 1
[15] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to predict new views from the world's imagery. CVPR, 2016. 3
[16] Christian Früh and Avideh Zakhor. An automated method for large-scale, ground-based city model acquisition. IJCV, 2004. 2
[17] Yasutaka Furukawa, Brian Curless, Steven M Seitz, and Richard Szeliski. Towards internet-scale multi-view stereo. CVPR, 2010. 2
[18] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi-view stereopsis. IEEE TPAMI, 2010. 2
[19] Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. CVPR, 2016. 3
[20] Stephan J Garbin, Marek Kowalski, Matthew Johnson, Jamie Shotton, and Julien Valentin. Fastnerf: High-fidelity neural rendering at 200fps. arXiv:2103.10380, 2021. 9
[21] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. CVPR, 2012. 6
[22] Mordechai Haklay and Patrick Weber. Openstreetmap: User-generated street maps. IEEE Pervasive computing, 2008. 4
[23] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004. 2
[24] Peter Hedman, Julien Philip, True Price, Jan-Michael Frahm, George Drettakis, and Gabriel Brostow. Deep blending for free-viewpoint image-based rendering. ACM Transactions on Graphics (TOG), 2018. 3
[25] Peter Hedman, Pratul P Srinivasan, Ben Mildenhall, Jonathan T Barron, and Paul Debevec. Baking neural radiance fields for real-time view synthesis. arXiv:2103.14645, 2021. 9
[26] Amlan Kar, Aayush Prakash, Ming-Yu Liu, Eric Cameracci, Justin Yuan, Matt Rusiniak, David Acuna, Antonio Torralba, and Sanja Fidler. Meta-sim: Learning to generate synthetic datasets. ICCV, 2019. 3
[27] Michael Kazhdan and Hugues Hoppe. Screened poisson surface reconstruction. ACM Transactions on Graphics (ToG), 2013. 13
[28] Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. Drivegan: Towards a controllable high-quality neural simulation. CVPR, 2021. 3
[29] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015. 12
[30] Johannes Kopf, Billy Chen, Richard Szeliski, and Michael Cohen. Street slide: browsing street level imagery. ACM Transactions on Graphics (TOG), 2010. 3
[31] Johannes Kopf, Michael Cohen, and Rick Szeliski. First-person hyperlapse videos. SIGGRAPH, 2014. 3
[32] Wei Li, CW Pan, Rong Zhang, JP Ren, YX Ma, Jin Fang, FL Yan, QC Geng, XY Huang, HJ Gong, et al. Aads: Augmented autonomous driving simulation using data-driven algorithms. Science robotics, 2019. 1, 3
[33] Xiaowei Li, Changchang Wu, Christopher Zach, Svetlana Lazebnik, and Jan-Michael Frahm. Modeling and recognition of landmark image collections using iconic scene graphs. ECCV, 2008. 2
[34] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. Barf: Bundle-adjusting neural radiance fields. arXiv preprint arXiv:2104.06405, 2021. 3, 5
[35] Andrew Liu, Richard Tucker, Varun Jampani, Ameesh Makadia, Noah Snavely, and Angjoo Kanazawa. Infinite nature: Perpetual view generation of natural scenes from a single image. ICCV, 2021. 1
[36] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. NeurIPS, 2020. 3, 9
[37] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. SIGGRAPH, 2019. 3
[38] Frank Losasso and Hugues Hoppe. Geometry clipmaps: terrain rendering using nested regular grids. Siggraph, 2004. 2
[39] David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004. 2
[40] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. CVPR, 2021. 1, 3, 4, 8
[41] Moustafa Meshry, Dan B. Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey, Noah Snavely, and Ricardo Martin-Brualla. Neural rerendering in the wild. CVPR, 2019. 3
[42] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. ECCV, 2020. 1, 3, 4
[43] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. arXiv:2201.05989, Jan. 2022. 9
[44] Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. Neural scene graphs for dynamic scenes. CVPR, 2021. 1, 3, 8
[45] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. ICCV, 2021. 1
[46] Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional occupancy networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 523–540. Springer, 2020. 3
[47] Marc Pollefeys, David Nistér, J-M Frahm, Amir Akbarzadeh, Philippos Mordohai, Brian Clipp, Chris Engels, David Gallup, S-J Kim, Paul Merrell, et al. Detailed real-time urban 3d reconstruction from video. IJCV, 2008. 2
[48] Daniel Rebain, Wei Jiang, Soroosh Yazdani, Ke Li, Kwang Moo Yi, and Andrea Tagliasacchi. Derf: Decomposed radiance fields. CVPR, 2021. 3
[49] Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. KiloNeRF: Speeding up neural radiance fields with thousands of tiny MLPs. ICCV, 2021. 3
[50] Stephan R Richter, Hassan Abu AlHaija, and Vladlen Koltun. Enhancing photorealism enhancement. arXiv:2105.04619, 2021. 3
[51] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. ECCV, 2016. 3
[52] Gernot Riegler and Vladlen Koltun. Free view synthesis. ECCV, 2020. 3
[53] Gernot Riegler and Vladlen Koltun. Stable view synthesis. CVPR, 2021. 3
[54] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. CVPR, 2016. 3
[55] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. CVPR, 2016. 2, 6, 12
[56] Qi Shan, Riley Adams, Brian Curless, Yasutaka Furukawa, and Steven M. Seitz. The visual turing test for scene reconstruction. 3DV, 2013. 2
[57] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3d. SIGGRAPH, 2006. 2
[58] Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T. Barron. NeRV: Neural reflectance and visibility fields for relighting and view synthesis. CVPR, 2021. 5
[59] Shih-Yang Su, Frank Yu, Michael Zollhöfer, and Helge Rhodin. A-nerf: Articulated neural radiance fields for learning human shape, appearance, and pose. Advances in Neural Information Processing Systems, 34, 2021. 3, 5
[60] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. arXiv preprint arXiv:2111.11215, 2021. 9
[61] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. CVPR, 2020. 6
[62] Towaki Takikawa, Joey Litalien, Kangxue Yin, Karsten Kreis, Charles Loop, Derek Nowrouzezahrai, Alec Jacobson, Morgan McGuire, and Sanja Fidler. Neural geometric level of detail: Real-time rendering with implicit 3D shapes. CVPR, 2021. 3
[63] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS, 2020. 4
[64] Sebastian Thrun. Probabilistic robotics. Communications of the ACM, 2002. 2
[65] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. International workshop on vision algorithms, 1999. 2
[66] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf–: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021. 3, 5
[67] Bangbang Yang, Yinda Zhang, Yinghao Xu, Yijin Li, Han Zhou, Hujun Bao, Guofeng Zhang, and Zhaopeng Cui. Learning object-compositional neural radiance field for editable scene rendering. ICCV, 2021. 3, 8
[68] Zhenpei Yang, Yuning Chai, Dragomir Anguelov, Yin Zhou, Pei Sun, Dumitru Erhan, Sean Rafferty, and Henrik Kretzschmar. Surfelgan: Synthesizing realistic sensor data for autonomous driving. CVPR, 2020. 1, 3
[69] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33, 2020. 3
[70] Lin Yen-Chen, Pete Florence, Jonathan T. Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. iNeRF: Inverting neural radiance fields for pose estimation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021. 3, 5
[71] Alex Yu, Sara Fridovich-Keil, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. arXiv preprint arXiv:2112.05131, 2021. 9
[72] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. Plenoctrees for real-time rendering of neural radiance fields. arXiv:2103.14024, 2021. 9
[73] Jiakai Zhang, Xinhang Liu, Xinyi Ye, Fuqiang Zhao, Yanshun Zhang, Minye Wu, Yingliang Zhang, Lan Xu, and Jingyi Yu. Editable free-viewpoint video using a layered neural representation. ACM Transactions on Graphics (TOG), 2021. 3
[74] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020. 9
[75] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. CVPR, 2018. 7
[76] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv:1805.09817, 2018. 3
[77] Siyu Zhu, Runze Zhang, Lei Zhou, Tianwei Shen, Tian Fang, Ping Tan, and Long Quan. Very large-scale global SFM by distributed motion averaging. CVPR, 2018. 2
A. Model Parameters / Optimization Details

Our network follows the mip-NeRF structure. The network fσ is composed of 8 layers with width 512 (Mission Bay experiments) or 1024 (all other experiments). fc has 3 layers with width 128 and fv has 4 layers with width 128. The appearance embeddings are 32-dimensional. We train each Block-NeRF using the Adam [29] optimizer for 300 K iterations with a batch size of 16384. Similar to mip-NeRF, the learning rate is annealed logarithmically from 2·10^−3 to 2·10^−5, with a warm-up phase during the first 1024 iterations. The coarse and fine networks are sampled 256 times during training and 512 times when rendering the videos. The visibility is supervised with an MSE loss scaled by 10^−6. The learned pose correction consists of a position offset and a 3 × 3 residual rotation matrix, which is added to the identity matrix and normalized before being applied to ensure it is orthogonal. The pose corrections are initialized to 0 and their element-wise ℓ2 norm is regularized during training. This regularization is scaled by 10^5 at the start of training and linearly decays to 10^−1 after 5000 iterations. This allows the network to learn initial geometry prior to applying pose offsets.

Each Block-NeRF takes between 9 and 24 hours to train (depending on hyperparameters). We train each Block-NeRF on 32 TPU v3 cores available through Google Cloud Compute, which combined offer a total of 1680 TFLOPS and 512 GB of memory. Rendering a 1200 × 900 px image for a single Block-NeRF takes approximately 5.9 seconds. Multiple Block-NeRFs can be processed in parallel during inference (typically fewer than 3 Block-NeRFs need to be rendered for a single frame).
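The learning rate schedule above can be sketched as follows; the endpoints, the 300 k step count, and the 1024 warm-up iterations are from the text, while the linear warm-up shape is our assumption:

```python
import numpy as np

def learning_rate(step, lr_init=2e-3, lr_final=2e-5, max_steps=300_000, warmup=1024):
    """Log-linear interpolation from lr_init to lr_final, with a linear warm-up."""
    t = np.clip(step / max_steps, 0.0, 1.0)
    lr = np.exp((1.0 - t) * np.log(lr_init) + t * np.log(lr_final))
    if step < warmup:
        lr *= step / warmup      # assumed linear ramp up to the annealed value
    return lr
```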
B. Block-NeRF Size and Placement

We include qualitative comparisons in Figure 9 on the Mission Bay dataset to complement the quantitative comparisons in § 5.3 (Table 2). In this figure, we provide comparisons on two regimes, one where each Block-NeRF contains the same number of weights (left section) and one where the total number of weights across all Block-NeRFs is fixed (right section).

C. Block-NeRF Overlap Comparison

In the main paper, we include experiments on Block-NeRF size and placement (§ 5.3). For these experiments, we assumed a relative overlap of 50% between each pair of Block-NeRFs, which aids with appearance alignment. Table 4 is a direct extension of Table 2 in the main paper and shows the effect of varying block overlap in the 8 block scenario. Note that varying the overlap changes the spatial block size. The original setting in the main paper is marked with an asterisk.

The metrics imply that reducing overlap is beneficial for image quality metrics. However, this can likely be attributed to the resulting reduction in block size. In practice, having an overlap between blocks is important to avoid temporal artifacts when interpolating between Block-NeRFs.

  Overlap   Size    PSNR↑   SSIM↑   LPIPS↓
  0%        77 m    26.77   0.895   0.262
  25%       97 m    26.75   0.894   0.269
  50%*      116 m   26.59   0.890   0.278
  75%       136 m   26.51   0.887   0.283

Table 4. Effect of different NeRF overlaps in the 8 block scenario with 0.25M weights per block (2M weights in total). The original setting used in the main paper is marked with *.

D. Block-NeRF Interpolation Details

We experiment with multiple methods to interpolate between Block-NeRFs and find that simple inverse distance weighting (IDW) in image space produces the most appealing videos due to temporal smoothness. We use an IDW power p of 4 for the Alamo Square renderings and a power of 1 for the Mission Bay renderings. We experiment with 3D inverse distance weighting for each individual pixel by projecting the rendered pixels into 3D space using the expected ray termination depth from the Block-NeRF closest to the target view. The color value of the projected pixel is then determined using inverse distance weighting with the nearest Block-NeRFs. Artifacts occur in the resulting composited renders due to noise in the depth predictions. We also experiment with using the Block-NeRF predicted visibility for interpolation. We consider imagewise visibility, where we take the mean visibility of the entire image, and pixelwise visibility, where we directly utilize the per-pixel visibility predictions. Both of these methods lead to sharper results but come at the cost of temporal inconsistencies. Finally, we compare to nearest neighbor interpolation, where we only render the Block-NeRF closest to the target view. This results in harsh jumps when transitioning between Block-NeRFs.

E. Structure from Motion (COLMAP)

We use COLMAP [55] to reconstruct the Mission Bay dataset. We first split the dataset into 8 overlapping blocks with a 97 m radius each based on camera positions (each block has roughly 25% overlap with the adjacent block). The bundle adjustment step takes most of the time in reconstruction and we do not see significant improvements if we increase the radius per block. We mask out movable objects when extracting feature points for matching, using the same segmentation model as Block-NeRF. We assume a pinhole camera model and provide camera intrinsics and camera pose as priors for running structure-from-motion.
We then run multi-view stereo within each block to produce dense depth and normal maps in 3D and produce a dense point cloud of the scene. In our preliminary experiments, we ran Poisson meshing [27] on the fused dense pointcloud to reconstruct textured meshes, but found that the method fails to produce reasonable-looking results due to the challenging geometry and depth errors introduced by reflective surfaces and the sky. Instead, we leverage the fused pointcloud and explore two alternatives, namely point rendering and surfel rendering. To render the test view, we select the nearest scene and use OSMesa off-screen rendering (https://docs.mesa3d.org/osmesa.html), assuming a Lambertian model and a single light source.

In Table 5, we compare the two different rendering options for the densely reconstructed pointcloud. We discard the invisible pixels when computing the PSNR for both methods, making the quantitative results comparable to our Block-NeRF setting.

In Figure 8, we show the qualitative comparisons between the two rendering options with PSNR on the corresponding images. This reconstruction is sparse and fails to represent reflective surfaces and the sky.

Figure 8. Qualitative results for COLMAP. We demonstrate the two rendering options (point rendering and surfel rendering, shown against the ground truth) using the fused pointcloud computed by COLMAP.

F. Examples from our Datasets

In Figure 10, we show the camera images from our Mission Bay dataset. In Figure 11, we show both camera images and corresponding segmentation masks from our Alamo Square dataset.

G. Societal Impact

G.1. Methodological

Our method inherits the heavy compute footprint of NeRF models, and we propose to apply them at an unprecedented scale. Our method also unlocks new use-cases for neural rendering, such as building detailed maps of the environment (mapping), which could cause more wide-spread use in favor of less computationally involved alternatives. Depending on the scale this work is being applied at, its compute demands can lead to or worsen environmental damage if the energy used for compute leads to increased carbon emissions. As mentioned in the paper, we foresee further work, such as caching methods, that could reduce the compute demands and thus mitigate the environmental damage.

G.2. Application

We apply our method to real city environments. During our own data collection efforts for this paper, we were careful to blur faces and sensitive information, such as license plates, and limited our driving to public roads. Future applications of this work might entail even larger data collection efforts, which raises further privacy concerns. While detailed imagery of public roads can already be found on services like Google Street View, our methodology could promote repeated and more regular scans of the environment. Several companies in the autonomous vehicle space are also known to perform regular area scans using their fleet of vehicles; however, some might only utilize LiDAR scans, which can be less sensitive than collecting camera imagery.
Figure 9. Qualitative results on Block-NeRF size and placement. We show results on the Mission Bay dataset using the different options discussed in § 5.3 of the main paper (left group: fixed number of weights per block; right group: fixed total number of weights; columns: ground truth followed by 1, 4, 8, and 16 Block-NeRFs).
Figure 11. Selection of front-facing images from our Alamo Square Dataset, alongside their transient object mask predicted by a pretrained
semantic segmentation model.