1 Introduction
Recent learning-based methods have shown impressive results in 3D modeling. In particular, implicit neural volume rendering [27, 31, 22, 28] has become a popular framework to learn compact 3D scene representations from a sparse set of images. Among these methods, Neural Radiance Fields (NeRF) [28] has received a lot of attention given its ability to render photorealistic novel views of a scene. Additionally, several works have shown that the 3D representations learned by NeRF can be used for different downstream tasks, such as camera-pose recovery [38], 3D semantic segmentation [16] or depth estimation [28]. Even though all these tasks have relevant applications in fields such as robotics or augmented reality, existing NeRF-based approaches are limited in these scenarios because they are unable to provide information about the confidence associated with the model outputs. For instance, consider a robot using Neural Radiance Fields to reason about its environment. In order to plan optimal actions and reduce potential risks, the robot must take into account not only the outputs produced by NeRF, but also their associated uncertainty.
In this context, we propose Stochastic Neural Radiance Fields (S-NeRF), a generalization of the original NeRF framework able to quantify the uncertainty associated with the implicit 3D representation. Unlike standard NeRF, which only estimates deterministic radiance-density values for all the spatial locations in the scene, S-NeRF models these pairs as stochastic variables following a distribution whose parameters are optimized during learning. In this manner, our method implicitly encodes a distribution over all the possible radiance fields modelling the scene. The introduction of this stochasticity enables S-NeRF to quantify the uncertainty associated with the resulting outputs in different tasks such as novel-view rendering or depth-map estimation (see Figure LABEL:fig:intro). During learning, we follow a Bayesian approach to estimate the posterior distribution over all the possible radiance fields given the training data. To make this optimization problem tractable, we devise a learning procedure for S-NeRF based on Variational Inference [5]. Conducting exhaustive experiments over benchmark datasets, we show that S-NeRF provides more reliable uncertainty estimates than generic approaches previously proposed for uncertainty estimation in other domains. In particular, we evaluate the ability of S-NeRF to quantify the uncertainty in novel-view synthesis and depth-map estimation.
2 Related Work
Neural Radiance Fields.
Similar to other neural volumetric approaches such as Scene Representation Networks
[35] or Neural Volumes [22], NeRF uses a collection of sparse 2D views to learn a neural network encoding an implicit 3D representation of the scene. NeRF employs a simple yet effective approach where the network predicts the volume density and emitted radiance for any given view direction and spatial coordinate. These outputs are then combined with volume rendering techniques
[26] to synthesize novel views or estimate the implicit 3D geometry of the scene. Since it was first introduced, many works have extended the original NeRF framework to address some of its limitations. For instance, [21, 20, 29, 23] explored several techniques to accelerate the time-consuming training and rendering process. Other works [17, 39] introduced the notion of “scene priors”, allowing a single NeRF model to encode the information of different scenes and generalize to novel ones. Similarly, [25] proposed to account for illumination changes and transient occluders in order to leverage in-the-wild training views. Other recent works [33, 8, 34, 19, 37, 32] have extended NeRF to scenes containing dynamic objects.
Different from the aforementioned methods, the proposed S-NeRF explicitly addresses the problem of estimating the uncertainty associated with the learned implicit 3D representation. Our framework is a probabilistic generalization of the original NeRF and thus, our formulation can be easily combined with most previous works in order to improve different aspects of the model.
NeRF Applications. Implicit representations learned by NeRF can be used to infer useful scene information for domains such as Robotics or Augmented Reality (AR). For instance, the ability to render novel views, estimate 3D meshes [28, 33] or recover camera poses [38] can be used to allow robots to reason about the environment and plan navigation paths or object manipulations. Additionally, [30] proposed to incorporate a compositional 3D scene representation into a generative model to achieve controllable novel-view synthesis. This capability is especially interesting in AR scenarios. Despite these potential applications, previous NeRF approaches are still limited in the aforementioned domains. The reason is that they are not able to quantify the underlying uncertainty of the model and thus, it is not possible to evaluate the risk associated with downstream decisions based on the output estimations.
Recently, NeRF-in-the-Wild (NeRF-W) [25] considered the problem of identifying the uncertainty produced by transient objects in the scene, such as pedestrians or vehicles. In particular, the authors proposed to estimate a value indicating the variance for each rendered pixel in a novel synthetic view. This variance is computed by treating it as an additional value to be rendered analogously to pixel RGB intensities. However, this approach has two critical limitations. Firstly, pixel colors are produced by a specific physical process that is not related to the model uncertainty. As a consequence, estimating the latter with volume rendering techniques is not theoretically founded and can lead to suboptimal results. Secondly, while NeRF-W is able to predict the variance associated with the rendered pixels, it does not explicitly model the uncertainty of the radiance field representing the scene. Hence, it cannot quantify confidence estimates about the underlying 3D geometry.
To the best of our knowledge, S-NeRF is the first approach to explicitly model the uncertainty of the implicit representation learned by NeRF. In contrast to NeRF-W, our method allows us to quantify the uncertainty associated not only with rendered views, but also with estimates related to the 3D geometry (see Fig. LABEL:fig:intro).
Uncertainty Estimation
is a longstanding problem in Deep Learning
[1, 10, 13, 9, 18]. To address it, a popular approach adopts the Bayesian learning framework [4] to estimate the posterior distribution over the model given the observed data. This posterior distribution can be used during inference to quantify the uncertainty of the model outputs. In the context of deep learning, Bayesian Neural Networks [36, 7, 13, 24] use different strategies to learn the posterior distribution of the network parameters given the training set. However, these approaches are typically computationally expensive and require significant modifications to network architectures and training procedures. To address this limitation, other works have explored strategies to implicitly learn the parameter distribution. For instance, dropout-based methods [9, 2, 12, 6]
introduce stochasticity over the intermediate neurons of the network in order to efficiently encode different possible solutions in the parameter space. By evaluating the model with different dropout configurations over the same input, the uncertainty can be quantified by computing the variance over the set of obtained outputs. A similar strategy consists of using deep ensembles
[18, 14], where a finite set of independent networks are trained and evaluated in order to measure the output variance. Whereas these solutions are simpler and more efficient than Bayesian Neural Networks, they still require multiple model evaluations. This limits their application in NeRF, where the rendering process is already computationally expensive for a single model. Different from the previous approaches, which learn a posterior distribution over the model parameters, the proposed S-NeRF learns a single network encoding the distribution over all the possible radiance fields modelling the scene. As we discuss in the following sections, this allows us to efficiently obtain uncertainty estimates without evaluating multiple model instances.
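As an illustration of this family of sampling-based estimators, the toy sketch below (hypothetical callables standing in for trained networks, not a NeRF model) computes a predictive mean and variance from multiple forward passes, as both deep ensembles and MC-dropout do:

```python
import numpy as np

def ensemble_uncertainty(models, x):
    """Predictive mean and variance from a set of models (deep-ensemble style).

    `models` is any iterable of callables mapping an input array to an output
    array; with MC-dropout the same stochastic network would instead be
    evaluated K times with different dropout masks.
    """
    preds = np.stack([m(x) for m in models])  # (K, ...): one prediction per member
    return preds.mean(axis=0), preds.var(axis=0)

# Toy "ensemble": three linear models with slightly different weights.
models = [lambda x, w=w: w * x for w in (0.9, 1.0, 1.1)]
mean, var = ensemble_uncertainty(models, np.array([2.0]))
# Disagreement between members shows up as a non-zero predictive variance.
```

The variance over the K outputs is the uncertainty estimate; its cost is K full evaluations, which is precisely what makes these baselines expensive for NeRF rendering.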
3 Stochastic Neural Radiance Fields
3.1 From Standard to Stochastic NeRF
Standard NeRF [28] models a 3D volumetric scene as a radiance field defining a set:
(1)  $\mathcal{R} = \big\{ \big(\mathbf{r}(\mathbf{x}, \mathbf{d}),\, \sigma(\mathbf{x})\big) : \mathbf{x} \in \mathbb{R}^{3},\ \mathbf{d} \in \mathbb{S}^{2} \big\}$

where $\sigma(\mathbf{x})$ is the volume density at a specific 3D spatial location $\mathbf{x}$, and $\mathbf{r}(\mathbf{x}, \mathbf{d})$ is the emitted RGB radiance, which also depends on the view direction $\mathbf{d}$.
To model the radiance field $\mathcal{R}$, NeRF uses a parametric function $f_{\theta}(\mathbf{x}, \mathbf{d}) = (\mathbf{r}, \sigma)$ which encodes the radiance and density for every possible location-view pair in the scene. Concretely, this function is implemented by a deep neural network with parameters $\theta$.
NeRF Optimization: For a given scene, NeRF optimizes the network by leveraging a training set $\mathcal{T} = \{(\mathbf{c}_i, \mathbf{o}_i, \mathbf{d}_i)\}$ formed by triplets, where $\mathbf{c}_i$ is an RGB pixel color captured by a camera located at a 3D position $\mathbf{o}_i$ in the scene. Additionally, $\mathbf{d}_i$ is the normalized direction from the camera origin to the pixel in world coordinates. This training set can be obtained by capturing a collection of views of the scene using different cameras with known poses.
By assuming that training samples are independent observations, the network parameters are optimized by minimizing the negative log-likelihood:

(2)  $\theta^{*} = \arg\min_{\theta}\, -\sum_{i} \log p(\mathbf{c}_i \mid \theta) = \arg\min_{\theta}\, \sum_{i} \big\| \mathbf{c}_i - \hat{\mathbf{c}}(\mathbf{o}_i, \mathbf{d}_i) \big\|_{2}^{2}$
where the quadratic error follows from defining $p(\mathbf{c}_i \mid \theta)$ as a Gaussian distribution with unit variance and a mean $\hat{\mathbf{c}}(\mathbf{o}_i, \mathbf{d}_i)$ defined by the volumetric rendering function:
(3)  $\hat{\mathbf{c}}(\mathbf{o}, \mathbf{d}) = \int_{0}^{t_f} T(t)\, \sigma_t\, \mathbf{r}_t \, dt, \qquad T(t) = \exp\Big( -\!\int_{0}^{t} \sigma_s \, ds \Big)$
where $\mathbf{x}_t = \mathbf{o} + t\mathbf{d}$ is a specific spatial location along a ray with direction $\mathbf{d}$ which crosses the scene from the pixel position $\mathbf{o}$ in world coordinates to the far point $\mathbf{x}_{t_f}$. Additionally, we express $\mathbf{r}(\mathbf{x}_t, \mathbf{d})$ and $\sigma(\mathbf{x}_t)$ as $\mathbf{r}_t$ and $\sigma_t$, respectively. More details about the volumetric rendering function in Eq. (3) and how it is approximated can be found in [28].
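For reference, the discrete alpha-compositing approximation of the rendering integral in Eq. (3) used by standard NeRF can be sketched as follows (a minimal NumPy version with hypothetical sample values; the full implementation in [28] also handles stratified and hierarchical sampling):

```python
import numpy as np

def render_ray(sigmas, radiances, deltas):
    """Approximate the volume rendering integral by alpha compositing
    over discrete samples along a ray.

    sigmas:    (N,)   volume densities at the sampled locations
    radiances: (N, 3) emitted RGB radiance at those locations
    deltas:    (N,)   distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)  # per-sample opacity
    # Accumulated transmittance T_i: probability the ray reaches sample i.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = trans * alphas                 # contribution of each sample
    return (weights[:, None] * radiances).sum(axis=0)  # expected pixel color

# A ray hitting a nearly opaque red sample returns (almost) pure red.
sigmas = np.array([0.0, 50.0])
radiances = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
color = render_ray(sigmas, radiances, np.array([0.1, 0.1]))
```

The weights sum to at most one; unassigned mass corresponds to rays that pass through the scene without terminating.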
Stochastic NeRF is a generalization of the previously described framework. Specifically, instead of learning a single radiance field $\mathcal{R}$, S-NeRF models a distribution over all the possible fields modelling the scene. For that purpose, we consider that for each location-view pair $(\mathbf{x}, \mathbf{d})$, the volume density $\sigma(\mathbf{x})$ and emitted radiance $\mathbf{r}(\mathbf{x}, \mathbf{d})$ are random variables following an unknown joint distribution. In this manner, any radiance field $\mathcal{R}$ defined in Eq. (1) can be considered a realization of this distribution. As we will discuss in Sec. 3.3, treating the radiance field as a set of stochastic variables allows us to reason about the underlying uncertainty in the implicit 3D representation.

S-NeRF Optimization: Different from the optimization strategy used in the original NeRF, S-NeRF adopts a Bayesian approach where the goal is to estimate the posterior distribution over the possible radiance fields given the observed training set $\mathcal{T}$:
(4)  $p(\mathcal{R} \mid \mathcal{T}) = \dfrac{p(\mathcal{T} \mid \mathcal{R})\, p(\mathcal{R})}{p(\mathcal{T})}$
where $p(\mathcal{T} \mid \mathcal{R})$ is the likelihood of $\mathcal{T}$ given a radiance field $\mathcal{R}$, and $p(\mathcal{R})$ is a distribution modelling our prior knowledge about the radiance and density pairs over the different spatial locations in the scene.
3.2 Learning S-NeRF with Variational Inference
Given that the explicit computation of the posterior in Eq. (4) is intractable, we employ variational inference [5] in order to approximate it. In particular, we define a parametric distribution $q_{\theta}(\mathcal{R})$ approximating the true posterior and optimize its parameters by minimizing the Kullback-Leibler (KL) divergence between the two:
(5)  $\theta^{*} = \arg\min_{\theta}\, \operatorname{KL}\!\big( q_{\theta}(\mathcal{R}) \,\|\, p(\mathcal{R} \mid \mathcal{T}) \big) = \arg\min_{\theta}\, -\mathbb{E}_{q_{\theta}}\big[ \log p(\mathcal{T} \mid \mathcal{R}) \big] + \operatorname{KL}\!\big( q_{\theta}(\mathcal{R}) \,\|\, p(\mathcal{R}) \big)$
Intuitively, the first term in Eq. (5) measures the expected training-set likelihood over the radiance field distribution $q_{\theta}(\mathcal{R})$. On the other hand, the second term measures the KL divergence between the approximate posterior and the prior distribution $p(\mathcal{R})$. In the following, we detail how S-NeRF addresses this optimization problem.
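For completeness, the equivalence between minimizing the divergence to the true (intractable) posterior and minimizing the two tractable terms above follows from Bayes' rule in Eq. (4); expanding the definition of the KL divergence:

```latex
\operatorname{KL}\big(q_{\theta}(\mathcal{R}) \,\|\, p(\mathcal{R}\mid\mathcal{T})\big)
  = \mathbb{E}_{q_{\theta}}\big[\log q_{\theta}(\mathcal{R}) - \log p(\mathcal{R}\mid\mathcal{T})\big] \\
  = \mathbb{E}_{q_{\theta}}\big[\log q_{\theta}(\mathcal{R}) - \log p(\mathcal{T}\mid\mathcal{R}) - \log p(\mathcal{R})\big] + \log p(\mathcal{T}) \\
  = -\mathbb{E}_{q_{\theta}}\big[\log p(\mathcal{T}\mid\mathcal{R})\big]
    + \operatorname{KL}\big(q_{\theta}(\mathcal{R}) \,\|\, p(\mathcal{R})\big) + \log p(\mathcal{T}).
```

Since the evidence $\log p(\mathcal{T})$ does not depend on $\theta$, minimizing the divergence to the posterior is equivalent to maximizing the standard evidence lower bound (ELBO).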
3.2.1 Modelling the approximate posterior
In order to make the approximate posterior tractable, we define it as a fully-factorized distribution:

(6)  $q_{\theta}(\mathcal{R}) = \prod_{\mathbf{x}, \mathbf{d}} q_{\theta}\big(\mathbf{r}(\mathbf{x}, \mathbf{d})\big)\, q_{\theta}\big(\sigma(\mathbf{x})\big)$
where we assume that the density and radiance are independent variables given any location-view pair $(\mathbf{x}, \mathbf{d})$. In particular, S-NeRF models $q_{\theta}$ with a neural network defining a function $f_{\theta}(\mathbf{x}, \mathbf{d}) = (\mu_{r}, s_{r}, \mu_{\sigma}, s_{\sigma})$, where $\mu_{r}$ and $s_{r}$ are a mean and standard deviation defining the radiance distribution $q_{\theta}(\mathbf{r}(\mathbf{x}, \mathbf{d}))$. Similarly, $\mu_{\sigma}$ and $s_{\sigma}$ define the density distribution $q_{\theta}(\sigma(\mathbf{x}))$.

Given that the radiance values need to be bounded between 0 and 1, we use a logistic normal distribution [3] for each RGB channel independently. In this manner, $\mathbf{r}(\mathbf{x}, \mathbf{d})$ is a random variable defined by:

(7)  $\mathbf{r}(\mathbf{x}, \mathbf{d}) = \operatorname{Sigmoid}(\tilde{\mathbf{r}}), \qquad \tilde{\mathbf{r}} \sim \mathcal{N}(\mu_{r}, s_{r}^{2})$

resulting from applying a sigmoid function to a Gaussian variable $\tilde{\mathbf{r}}$ with mean $\mu_{r}$ and standard deviation $s_{r}$. Similarly, the support of the density distribution needs to be non-negative. Therefore, we model $\sigma(\mathbf{x})$ as a random variable following a rectified normal distribution [11]:

(8)  $\sigma(\mathbf{x}) = \max(0, \tilde{\sigma}), \qquad \tilde{\sigma} \sim \mathcal{N}(\mu_{\sigma}, s_{\sigma}^{2})$

where $\max(0, \cdot)$ is a rectified linear unit that sets all negative values to 0.
3.2.2 Computing the log-likelihood
In the following, we introduce how S-NeRF computes the likelihood term in Eq. (5). Firstly, note that given the variational posterior $q_{\theta}$ and the training set $\mathcal{T}$, the expected log-likelihood $\mathbb{E}_{q_{\theta}}[\log p(\mathcal{T} \mid \mathcal{R})]$ is equivalent to:

(9)  $\sum_{i} \mathbb{E}_{q_{\theta}(\mathcal{R}_i)}\big[ \log p(\mathbf{c}_i \mid \mathcal{R}_i) \big]$

where: (i) $X_i = \{\mathbf{x}_t = \mathbf{o}_i + t\mathbf{d}_i\}$ is the set of 3D coordinates along a ray with direction $\mathbf{d}_i$ and origin $\mathbf{o}_i$, (ii) $\mathcal{R}_i = \{(\mathbf{r}_t, \sigma_t) : \mathbf{x}_t \in X_i\}$ is a set of radiance-density pairs for each ray position, and (iii) $p(\mathbf{c}_i \mid \mathcal{R}_i)$ is the probability of the pixel color given the radiance and density values accumulated along the ray. The latter probability is defined similarly to standard NeRF, where we assume that $\mathbf{c}_i$ follows a normal distribution with a mean defined by applying the volumetric rendering function in Eq. (3) to the radiance-density trajectory along the ray.
Given the previous definitions, Eq. (9) can be computed using a Monte-Carlo approximation:

(10)  $\mathbb{E}_{q_{\theta}(\mathcal{R}_i)}\big[ \log p(\mathbf{c}_i \mid \mathcal{R}_i) \big] \approx \frac{1}{K} \sum_{k=1}^{K} \log p(\mathbf{c}_i \mid \mathcal{R}_i^{k})$

where each $\mathbf{r}_t^{k} \in \mathcal{R}_i^{k}$ is a sample from the radiance distribution $q_{\theta}(\mathbf{r}_t)$. These samples can be generated using Eq. (7) with parameters obtained by evaluating the network $f_{\theta}$. Similarly, each $\sigma_t^{k}$ is a sample from the volume density distribution obtained using Eq. (8), with mean and variance parameters also defined by the network output. An illustration of the whole process is provided in Figure 1. The introduced strategy is used during training to compute the log-likelihood and apply stochastic gradient descent to optimize the parameters $\theta$
. This is possible by using the reparameterization trick [15] to backpropagate the gradients through the generated samples $\mathbf{r}_t^{k}$ and $\sigma_t^{k}$. See Appendix A.2 for a more detailed explanation.

3.2.3 Estimating the posterior-prior KL divergence
As previously discussed, the KL term in Eq. (5) measures the difference between the approximate posterior over the radiance fields learned by S-NeRF and a prior distribution. Similarly to the definition of $q_{\theta}(\mathcal{R})$ in Eq. (6), we model the prior $p(\mathcal{R})$ as a fully-factorized distribution, where the radiance and density priors are assumed to be the same for all the spatial locations in the scene. Concretely, the radiance prior $p(\mathbf{r})$ is again modelled with a logistic normal distribution, as in the case of $q_{\theta}(\mathbf{r})$. In this case, however, the mean parameter is optimized during training and its variance is fixed to 10. This high value models our knowledge that, without considering any observation, the uncertainty over the radiance values must be high. Analogously, the density prior $p(\sigma)$ is modelled with a rectified normal distribution, also with an optimized mean and a variance fixed to 10.
Given the previous definitions, the KL term in Eq. (5) can be expressed as:

(11)  $\operatorname{KL}\!\big( q_{\theta}(\mathcal{R}) \,\|\, p(\mathcal{R}) \big) = \sum_{\mathbf{x}, \mathbf{d}} \operatorname{KL}\!\big( q_{\theta}(\mathbf{r}(\mathbf{x}, \mathbf{d})) \,\|\, p(\mathbf{r}) \big) + \sum_{\mathbf{x}} \operatorname{KL}\!\big( q_{\theta}(\sigma(\mathbf{x})) \,\|\, p(\sigma) \big)$
which is equivalent to the sum of the KL divergences between the posterior and prior distributions for all the possible location-view pairs in the scene. During training, Eq. (11) is minimized by sampling random location-view pairs in 3D and approximating the radiance and density KL terms using automatic integration. See Appendix A.3 for more details.
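The radiance terms admit a simple expression: since the sigmoid is an invertible map and the KL divergence is invariant under invertible transformations, the divergence between two logistic-normal distributions equals that of their underlying Gaussians, which is available in closed form. The helper below is our own illustration of that Gaussian KL, not the integration scheme used in practice (the rectified density terms are not covered by this identity, since rectification is not invertible):

```python
import numpy as np

def gaussian_kl(mu_q, sd_q, mu_p, sd_p):
    """Closed-form KL( N(mu_q, sd_q^2) || N(mu_p, sd_p^2) ).

    For logistic-normal pairs the same value applies to the transformed
    variables, because the sigmoid is an invertible map.
    """
    return (np.log(sd_p / sd_q)
            + (sd_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sd_p ** 2)
            - 0.5)

# A confident posterior (sd 0.1) diverges strongly from a wide prior
# (variance 10), which is exactly the pressure the likelihood term must
# overcome in observed regions.
kl_confident = gaussian_kl(0.0, 0.1, 0.0, np.sqrt(10.0))
```

This makes the role of the high-variance prior concrete: the posterior can only become confident where the data term in Eq. (5) pays for the resulting KL cost.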
In our preliminary experiments, we observed that it is also beneficial to consider a different prior distribution for location-view pairs belonging to the rays traced to estimate the pixel-color likelihood in Eq. (9). The reason is that, in these locations, we know that the provided observations reduce the uncertainty about the radiance and density. Therefore, setting a prior with a high variance for these distributions contradicts this prior knowledge. For this reason, we also compute the KL term for the spatial locations sampled along these rays but, in this case, the variances defining the prior distributions are not fixed but also optimized during training. A pseudo-algorithm summarizing the learning process of S-NeRF is provided in Appendix B.
Neg. Log-Likelihood (NLL)  MSE-Uncertainty Correlation  
MC-DO [9]  D. Ens. [18]  NeRF-W [25]  S-NeRF  MC-DO [9]  D. Ens. [18]  NeRF-W [25]  S-NeRF  
Flower  4.63  1.63  1.71  1.27  0.38  0.59  0.49  0.63 
Fortress  5.19  2.29  1.04  0.03  0.24  0.37  0.44  0.55 
Leaves  2.72  2.66  0.79  0.68  0.39  0.57  0.65  0.73 
Horns  4.18  2.17  0.78  0.60  0.43  0.50  0.50  0.70 
Trex  4.10  2.28  1.91  1.37  0.42  0.53  0.66  0.68 
Fern  4.90  2.47  2.16  2.01  0.50  0.65  0.59  0.69 
Orchids  5.74  2.23  2.24  1.95  0.50  0.60  0.60  0.65 
Room  5.06  2.13  4.93  2.35  0.46  0.65  0.38  0.74 
Avg.  4.57  2.23  1.95  1.27  0.40  0.56  0.54  0.67 
3.3 Inference and Uncertainty Estimation
By learning a distribution over radiance fields, S-NeRF is able to quantify the uncertainty associated with rendered views for any given camera pose. For this purpose, we first sample a set of color values for each pixel in the rendered image. As illustrated in Figure 1, these values are obtained by applying the volume rendering equation to different radiance-density trajectories along a ray traced from the pixel coordinates. Intuitively, each of the sampled colors represents an estimate produced by a single radiance field in the learned distribution. Finally, we treat the mean and variance over the samples as the predicted pixel color and its associated uncertainty.
Similar to the case of image rendering, S-NeRF is also able to quantify the uncertainty associated with estimated depth maps. In this case, we ignore the radiance values and use trajectories obtained by sampling density values along the ray. Then, for each sampled trajectory, we compute the expected termination depth of the ray as in [28]. In this way, we obtain a set of depth samples for each pixel in the depth map. The mean and variance of these samples correspond to the estimated depth and its uncertainty.
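The inference procedure above can be sketched end to end. The snippet below uses hypothetical shapes and toy parameter values, with a simplified alpha-compositing renderer standing in for the paper's quadrature: it draws K radiance-density trajectories from the learned per-location distributions of Eqs. (7)-(8) and returns the per-pixel color mean and variance:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def render(sigmas, rads, deltas):
    # Alpha compositing of a single radiance-density trajectory.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    return ((trans * alphas)[:, None] * rads).sum(axis=0)

def pixel_mean_and_uncertainty(mu_r, sd_r, mu_s, sd_s, deltas, K=64, seed=0):
    """Sample K trajectories from the per-location Gaussians (sigmoid for
    radiance, rectification for density) and return the mean color and the
    per-channel variance over the K rendered colors.

    Hypothetical shapes: mu_r, sd_r are (N, 3); mu_s, sd_s, deltas are (N,).
    """
    rng = np.random.default_rng(seed)
    colors = []
    for _ in range(K):
        rads = sigmoid(rng.normal(mu_r, sd_r))            # Eq. (7)
        sigmas = np.maximum(rng.normal(mu_s, sd_s), 0.0)  # Eq. (8)
        colors.append(render(sigmas, rads, deltas))
    colors = np.stack(colors)
    return colors.mean(axis=0), colors.var(axis=0)

# A confident model (tiny standard deviations) yields low color variance.
mu_r = np.zeros((4, 3)); mu_s = np.full(4, 2.0)
mean_c, var_c = pixel_mean_and_uncertainty(mu_r, np.full((4, 3), 1e-3),
                                           mu_s, np.full(4, 1e-3),
                                           deltas=np.full(4, 0.25))
```

For depth uncertainty, the same loop would replace `render` with the expected ray-termination depth of each sampled density trajectory.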
4 Experiments
4.1 Experimental setup
Datasets. We conduct our experiments on the LLFF benchmark dataset introduced in [28]. It contains multiple views with calibrated camera poses for 8 different scenes, including indoor (Horns, Trex, Room, Fortress, Fern) and outdoor environments (Flower, Leaves, Orchids). Given that our goal is to evaluate the reliability of the quantified uncertainty, we use a smaller number of scene views during training compared to the experimental setup used in the original paper. The rationale is the following: in low-data regimes, uncertainty estimation is of particular importance, given that the model should be able to identify the parts of the scene that are not covered by the training views. In these cases, the model is expected to automatically assign a high uncertainty to these regions. Motivated by this observation, we randomly choose only a small fraction of the total views for training and use the rest for testing.
Baselines. As discussed in Sec. 2, only NeRF-W [25] has attempted to quantify uncertainty in Neural Radiance Fields. For this reason, we also compare S-NeRF with state-of-the-art approaches that have been proposed in other domains for the same purpose. In particular, we consider MC-dropout [9] and Deep Ensembles [18]. In the first case, we add a dropout layer after each odd layer in the network to sample multiple outputs using random dropout configurations. Considering a trade-off between computation and performance, we use five samples in our experiments and compute their variance as the uncertainty value.
On the other hand, for Deep Ensembles we train and evaluate five different NeRF models in parallel. Again, the variance of their outputs is used as the uncertainty associated with the prediction. Finally, we also compare S-NeRF with the strategy proposed in NeRF-W [25] for uncertainty estimation. Given that there is no variable illumination and there are no moving objects in the scenes of the evaluated dataset, we remove the latent embedding component of their approach and keep only the uncertainty estimation layers.
Evaluation Metrics. Previous works typically evaluate the rendered novel views using image-quality metrics such as PSNR, SSIM, and LPIPS. However, these validation criteria are not informative in our context, given that we aim to measure the reliability of the uncertainty estimates. For this reason, we use two alternative metrics: the negative log-likelihood (NLL) and the correlation between the Mean Squared Error (MSE) and the obtained uncertainty values. The use of the NLL is motivated by the observation that all the evaluated methods provide uncertainty estimates based on a predicted variance for each estimated pixel color. In this manner, we can compute the NLL for each pixel as the probability of the ground truth under a Gaussian distribution with mean equal to the estimated color and variance equal to the predicted uncertainty. More intuitively, this metric measures the average MSE with respect to the color ground truth, weighted by the model confidence associated with each pixel color. For the second metric, we compute the correlation between the MSE for each pixel and the estimated uncertainty values. Note that this correlation will be higher if the model assigns higher uncertainty to estimates that are more likely to be inaccurate. Therefore, this metric indicates whether the uncertainty estimates can be used as a predictor of the expected error in real scenarios where no ground truth is available.
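Both metrics are straightforward to compute from per-pixel predictions. A minimal sketch, assuming flattened arrays of ground-truth colors, predicted means and predicted variances (our own illustration of the protocol described above):

```python
import numpy as np

def gaussian_nll(gt, mean, var, eps=1e-6):
    """Average per-pixel negative log-likelihood of the ground truth under
    a Gaussian with the predicted mean and variance."""
    var = var + eps  # guard against zero predicted variance
    return float(np.mean(0.5 * (np.log(2 * np.pi * var) + (gt - mean) ** 2 / var)))

def mse_uncertainty_corr(gt, mean, var):
    """Pearson correlation between per-pixel squared error and predicted
    uncertainty; higher is better."""
    err = (gt - mean) ** 2
    return float(np.corrcoef(err.ravel(), var.ravel())[0, 1])

# A model whose predicted variance tracks its error gets a high correlation.
gt = np.linspace(0.0, 1.0, 100)
mean = gt + np.linspace(0.0, 0.2, 100)   # error grows along the image...
var = np.linspace(1e-3, 0.05, 100)       # ...and so does the predicted variance
corr = mse_uncertainty_corr(gt, mean, var)
```

Note that the NLL rewards calibration (confident and correct, or uncertain and wrong), while the correlation only measures the ranking of errors by uncertainty.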
Implementation details. To implement the different compared baselines, we use the same network architecture, hyperparameters and optimization process employed in the original NeRF paper (https://github.com/bmild/nerf). For the S-NeRF implementation, we also use the same architecture to implement the function $f_{\theta}$. The only introduced modification is in the last layer, where we double the number of outputs to account for the mean and variance parameters of the density and radiance distributions. During training, we uniformly sample 128 spatial locations along each ray. Then, for each location, we sample radiance-density pairs from the distributions defined by the output parameters. Finally, to compute the volume rendering formula in Eq. (3), we approximate its integral using the trapezoidal rule (detailed in Appendix A.4). In our preliminary experiments, this integration method showed better stability than the original alpha compositing used in standard NeRF.
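A minimal sketch of this trapezoidal alternative for a single ray (our own illustration; the exact quadrature is detailed in Appendix A.4):

```python
import numpy as np

def render_ray_trapezoid(ts, sigmas, rads):
    """Approximate c = integral of T(t) * sigma(t) * r(t) dt with the
    trapezoidal rule, where the transmittance T(t) = exp(-int_0^t sigma ds)
    is itself accumulated trapezoidally.

    ts: (N,) sample depths; sigmas: (N,) densities; rads: (N, 3) radiances.
    """
    dt = np.diff(ts)
    # Trapezoidal running integral of sigma gives the transmittance T(t_i).
    seg = 0.5 * (sigmas[:-1] + sigmas[1:]) * dt
    T = np.exp(-np.concatenate(([0.0], np.cumsum(seg))))
    integrand = (T * sigmas)[:, None] * rads  # T(t) sigma(t) r(t) per sample
    return 0.5 * ((integrand[:-1] + integrand[1:]) * dt[:, None]).sum(axis=0)

# Sanity check against a homogeneous medium, where the integral has the
# analytic value 1 - exp(-sigma * t_far) for constant white radiance.
ts = np.linspace(0.0, 4.0, 2048)
sigmas = np.full_like(ts, 2.0)
rads = np.ones((ts.size, 3))
color = render_ray_trapezoid(ts, sigmas, rads)
```

Unlike alpha compositing, which is exact for piecewise-constant density, this quadrature treats both the transmittance and the integrand as piecewise linear between samples.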
4.2 Uncertainty estimation in novel-view synthesis
Quantitative results of S-NeRF and the other evaluated methods can be found in Table 1. As we can observe, our method outperforms all the previous approaches across all the scenes and metrics. In particular, S-NeRF improves over the previous state of the art with an average decrease of 35% in NLL and an increase of more than 10% in MSE-Uncertainty correlation. The better results obtained by our approach can be explained as follows. Firstly, the quality of the uncertainty estimates provided by Deep Ensembles and MC-Dropout typically increases when more models in the ensemble or more dropout samples are used, respectively. In our experiments, we have limited this number to five, which can partially explain the worse results of MC-Dropout and Deep Ensembles compared to S-NeRF. While increasing the number of model evaluations in these approaches could improve their performance, this strategy is not practical for NeRF, where the rendering time grows dramatically as a result. This can be seen in Table 2, where we report the rendering time for a single scene required by the different methods. Note that MC-Dropout and Deep Ensembles increase the computational complexity of NeRF by a factor of five. Whereas NeRF-W has a computational complexity similar to S-NeRF, our method still obtains significantly improved results in all metrics. As discussed in Sec. 2, NeRF-W treats the variance for each pixel as an additional value rendered in the same manner as pixel RGB intensities. This strategy is not theoretically founded and can lead to suboptimal results. In contrast, S-NeRF obtains its uncertainty estimates by sampling multiple color values from the posterior distribution over the radiance fields modelling the scene. As we have shown empirically, this strategy produces more accurate uncertainty estimates without increasing the rendering time.
Qualitative results. To provide more insight into the previous experiments, Figure 3 shows an example of the qualitative results produced by the evaluated models on a test view. It is important to notice that the right part of the rendered image corresponds to a region that is not covered by the training views used to learn the model. Therefore, we expect high error and high uncertainty estimates for the pixels belonging to this area. As can be observed, the uncertainty values estimated by MC-dropout, Deep Ensembles and NeRF-W are poorly correlated with their predictive error. As expected, the MSE is high in the right image region that was not observed in the training views. However, the corresponding uncertainty values provided by these methods are low. In contrast, S-NeRF is able to assign a high uncertainty to the pixels belonging to the scene region that was not covered by the training views. The ability of our model to identify these regions can be explained by the high variance defining our prior distributions. Concretely, note that the minimized KL divergence in Eq. (11) forces the learned posterior distribution to resemble this prior when no spatial locations are observed. As a consequence, the rendered parts of the scene which were not covered by the training views will have an associated high uncertainty. In the following, we conduct an ablation study in order to analyse the effect of the prior distributions.
Neg. Log-Likelihood (NLL)  MSE-Uncertainty Correlation  
w/o KL  w/ KL  S-NeRF  w/o KL  w/ KL  S-NeRF  
Flower  1.41  2.18  1.27  0.54  0.51  0.63 
Fortress  1.26  0.82  0.03  0.31  0.53  0.55 
Leaves  0.98  0.96  0.68  0.60  0.65  0.73 
Horns  3.08  0.75  0.60  0.53  0.64  0.70 
Trex  1.82  1.54  1.37  0.58  0.66  0.68 
Fern  2.46  1.63  2.01  0.51  0.66  0.69 
Orchids  4.45  2.37  1.95  0.31  0.59  0.65 
Room  4.39  3.49  2.35  0.54  0.65  0.74 
Avg.  2.48  1.72  1.28  0.49  0.61  0.67 
4.3 Analysing the effect of the prior distribution
The prior distribution defined by S-NeRF allows it to identify regions in the scene that are not observed in the training views. Additionally, we also impose a prior with a learned variance for the spatial locations belonging to rays crossing the scene from the observed pixels in the training set (see Sec. 3.2.3). To validate the effectiveness of this approach, we have evaluated three different variants of S-NeRF trained using different strategies: (i) minimizing only the negative log-likelihood term defined in Sec. 3.2.2 and ignoring the KL term, (ii) also considering the KL divergence in Eq. (11) between the prior and posterior distributions, and (iii) using the proposed optimization objective, where we additionally impose the learned prior over the “observed” spatial locations.
According to the results reported in Table 3, compared to the case where only the log-likelihood is optimized (w/o KL), minimizing the KL divergence using a high-variance prior significantly improves the performance on both NLL and MSE-Uncertainty correlation (w/ KL). As previously discussed, this allows S-NeRF to identify the scene regions which are not observed in the training views and assign a high uncertainty to the corresponding pixels in rendered images. However, we can also observe a significant improvement for our proposed S-NeRF optimization process, where we impose a learned prior for the observed spatial locations. The reason is that defining a high-variance prior in these areas can lead to suboptimal results, given that the KL term then prevents the model from minimizing the negative log-likelihood. In contrast, this is effectively addressed by applying our proposed prior with learned parameters in observed spatial locations.
4.4 Uncertainty estimation in depth-map synthesis
One of the main advantages of S-NeRF compared to the evaluated methods is that it is also able to quantify the uncertainty associated with the 3D geometry of the scene. In order to illustrate this, Figure 4 shows estimated depth maps and their associated uncertainty generated for different scenes. Looking at the figure, we can see that our framework can also provide useful information about the model's confidence in the underlying 3D geometry of the scene. For instance, we can observe high uncertainty at the borders of foreground objects. This is because these borders correspond to discontinuous changes in the depth map, which produce highly uncertain estimates. Additionally, we can also observe in the bottom example how S-NeRF is able to assign low confidence values to the depth associated with pixels corresponding to areas of the scene that were not observed in the training set.
5 Conclusions
We have presented Stochastic Neural Radiance Fields (S-NeRF), a novel framework to address the problem of uncertainty estimation in neural volume rendering. The proposed approach is a probabilistic generalization of the original NeRF, which is able to produce uncertainty estimates by modelling a distribution over all the possible radiance fields representing the scene. Compared to state-of-the-art approaches that can be applied to this problem, we have shown that the proposed method achieves significantly better results without increasing the computational complexity. Additionally, we have also illustrated the ability of S-NeRF to provide uncertainty estimates for different tasks such as depth-map estimation. To conclude, it is also worth mentioning that our formulation is generic and can be combined with any existing or future method based on the NeRF framework in order to incorporate uncertainty estimation in neural 3D representations.
References
 [1] (2021) A review of uncertainty quantification in deep learning: techniques, applications and challenges. Information Fusion 76, pp. 243–297. External Links: ISSN 15662535 Cited by: §2.
 [2] (2018) Global SNR estimation of speech signals using entropy and uncertainty estimates from dropout networks. In INTERSPEECH, Cited by: §2.
 [3] (1980) Logistic-normal distributions: some properties and uses. Biometrika 67 (2), pp. 261–272. External Links: ISSN 00063444 Cited by: §3.2.1.
 [4] (2009) Bayesian theory. Vol. 405, John Wiley & Sons. Cited by: §2.
 [5] (2017) Variational inference: a review for statisticians. Journal of the American statistical Association. Cited by: §1, §3.2.
 [6] (2015) Variational dropout and the local reparameterization trick. In Adv. Neural Inform. Process. Syst., Cited by: §2.
 [7] (2020) Posterior network: uncertainty estimation without ood samples via densitybased pseudocounts. In Adv. Neural Inform. Process. Syst., Cited by: §2.
 [8] (2021) Dynamic neural radiance fields for monocular 4D facial avatar reconstruction. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §2.
 [9] (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In Int. Conf. Machine Learning, Cited by: §2, §2, Table 1, Table 2, §4.1.
 [10] (2017) On calibration of modern neural networks. In Int. Conf. Machine Learning, pp. 1321–1330. Cited by: §2.
 [11] (2007) Variational learning for rectified factor analysis. Signal Processing. Cited by: §3.2.1.
 [12] (2020) Improving predictive uncertainty estimation using dropout–hamiltonian monte carlo. Soft Computing 24, pp. 4307–4322. Cited by: §2.
 [13] (2015) Probabilistic backpropagation for scalable learning of bayesian neural networks. In Int. Conf. Machine Learning, Cited by: §2.
 [14] (2020) Maximizing overall diversity for improved uncertainty estimates in deep ensembles. In AAAI, Cited by: §2.
 [15] (2014) Auto-encoding variational Bayes. In Int. Conf. Learn. Represent.
 [16] (2020) Semantic implicit neural scene representations with semi-supervised training. In 3DV.
 [17] (2021) NeRF-VAE: a geometry aware 3D scene generative model. In PMLR.
 [18] (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Adv. Neural Inform. Process. Syst.
 [19] (2021) Neural scene flow fields for space-time view synthesis of dynamic scenes. In IEEE Conf. Comput. Vis. Pattern Recog.
 [20] (2021) AutoInt: automatic integration for fast neural volume rendering. In IEEE Conf. Comput. Vis. Pattern Recog.
 [21] (2020) Neural sparse voxel fields. In Adv. Neural Inform. Process. Syst.
 [22] (2019) Neural volumes: learning dynamic renderable volumes from images. ACM Trans. Graph. 38 (4), pp. 65:1–65:14.
 [23] (2021) Mixture of volumetric primitives for efficient neural rendering. ACM Trans. Graph. 40, pp. 1–13.
 [24] (2019) A simple baseline for Bayesian uncertainty in deep learning. In Adv. Neural Inform. Process. Syst.
 [25] (2021) NeRF in the Wild: neural radiance fields for unconstrained photo collections. In IEEE Conf. Comput. Vis. Pattern Recog.
 [26] (1995) Optical models for direct volume rendering. IEEE Trans. Vis. Comput. Graph.
 [27] (2019) Occupancy Networks: learning 3D reconstruction in function space. In IEEE Conf. Comput. Vis. Pattern Recog.
 [28] (2020) NeRF: representing scenes as neural radiance fields for view synthesis. In Eur. Conf. Comput. Vis.
 [29] (2021) DONeRF: towards real-time rendering of compact neural radiance fields using depth oracle networks. Comput. Graph. Forum 40 (4).
 [30] (2021) GIRAFFE: representing scenes as compositional generative neural feature fields. In IEEE Conf. Comput. Vis. Pattern Recog.
 [31] (2019) DeepSDF: learning continuous signed distance functions for shape representation. In IEEE Conf. Comput. Vis. Pattern Recog.
 [32] (2021) Neural Body: implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In IEEE Conf. Comput. Vis. Pattern Recog.
 [33] (2021) D-NeRF: neural radiance fields for dynamic scenes. In IEEE Conf. Comput. Vis. Pattern Recog.
 [34] (2021) PVA: pixel-aligned volumetric avatars. In IEEE Conf. Comput. Vis. Pattern Recog.
 [35] (2019) Scene representation networks: continuous 3D-structure-aware neural scene representations. In Adv. Neural Inform. Process. Syst.
 [36] (2019) Bayesian layers: a module for neural network uncertainty. In Adv. Neural Inform. Process. Syst.
 [37] (2021) Space-time neural irradiance fields for free-viewpoint video. In IEEE Conf. Comput. Vis. Pattern Recog.
 [38] (2021) iNeRF: inverting neural radiance fields for pose estimation. In IROS.
 [39] (2021) pixelNeRF: neural radiance fields from one or few images. In IEEE Conf. Comput. Vis. Pattern Recog.
Supplemental Materials
Appendix A Methods
In the following, we provide more technical details about our proposed Stochastic Neural Radiance Fields described in Sec. 3.
A.1 Distributions
We present the explicit mathematical expression of the specific distributions used by S-NeRF to model the radiance and density, respectively. Given that the radiance values need to be bounded between 0 and 1, we use a logistic normal distribution for each RGB channel independently. Concretely, its probability density function is defined as:
(12)  p(c) = \frac{1}{\beta_c \sqrt{2\pi}\, c(1-c)} \exp\!\left(-\frac{(\operatorname{logit}(c)-\mu_c)^2}{2\beta_c^2}\right),
where \mu_c and \beta_c are the mean and std. deviation of the logit form of the radiance variable c, which are output by the neural network. Similarly, we model the positive density value \sigma as a random variable following a rectified normal distribution. Its cumulative density function (CDF) and probability density function (PDF) are:
(13)  F(\sigma) = \Phi\!\left(\frac{\sigma-\mu_\sigma}{\beta_\sigma}\right), \quad \sigma \geq 0,
(14)  p(\sigma) = \Phi\!\left(\frac{-\mu_\sigma}{\beta_\sigma}\right)\delta(\sigma) + \frac{1}{\beta_\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(\sigma-\mu_\sigma)^2}{2\beta_\sigma^2}\right)\mathbb{1}[\sigma>0],
where \Phi denotes the standard normal CDF, \delta the Dirac delta, and \mu_\sigma and \beta_\sigma are again the mean and std. deviation of the underlying normal variable output by the S-NeRF network.
A.2 Backpropagation through sampling with the reparametrization trick
In the following, we provide a detailed explanation of how to properly sample from the learned distributions and backpropagate their gradients. Sampling directly from the learned distributions for density and radiance is not differentiable, which prevents computing the gradients of their parameters during backpropagation. Inspired by [15], we introduce a normally distributed auxiliary variable \epsilon to reparameterize the density and radiance variables. Concretely, we sample density values as in Eq. (8) and radiance values as in Eq. (7), where \epsilon is a unit Gaussian variable with zero mean and unit std. deviation. In this manner, the gradients of the distribution parameters can be computed, given that this process is fully differentiable w.r.t. them.
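As a minimal illustrative sketch (not the paper's actual implementation, and with hypothetical variable names), the reparameterized sampling of both variables can be written as a deterministic transform of a unit Gaussian draw: a sigmoid maps the Gaussian sample to a logistic-normal radiance in (0, 1), and a ReLU maps it to a rectified-normal density:

```python
import numpy as np

def sample_radiance_density(mu_c, beta_c, mu_s, beta_s, rng):
    """Reparameterized sampling (sketch; names are illustrative).

    Radiance: logistic-normal -> c = sigmoid(mu_c + beta_c * eps)
    Density:  rectified-normal -> s = relu(mu_s + beta_s * eps)
    The randomness is isolated in eps ~ N(0, 1), so gradients w.r.t.
    (mu, beta) flow through the deterministic transform.
    """
    eps_c = rng.standard_normal(np.shape(mu_c))
    eps_s = rng.standard_normal(np.shape(mu_s))
    c = 1.0 / (1.0 + np.exp(-(mu_c + beta_c * eps_c)))  # bounded in (0, 1)
    s = np.maximum(0.0, mu_s + beta_s * eps_s)          # non-negative
    return c, s
```

In an autodiff framework the same two lines (sigmoid and ReLU of `mu + beta * eps`) make the sampling step differentiable with respect to the predicted distribution parameters.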
A.3 The posterior-prior KL divergence
As we discuss in Sec. 3.2, S-NeRF optimization involves the minimization of the KL term in Eq. (3.2). This divergence measures the difference between the approximate posterior learned by S-NeRF and a prior distribution over the radiance fields modelling the scene. Given that computing the KL divergence for all the possible location-view pairs in the scene is intractable, we approximate the sum in Eq. (3.2.3) by sampling random 3D spatial locations in the scene as follows. Firstly, we define the space bounds along each 3D axis from the captured images of the scene, e.g., the left and right bounds of the x axis. Then we partition the bounded range into evenly-spaced bins and randomly draw a sample within each bin. After applying this stratified sampling strategy on each axis, we obtain a set of points paired with random view directions. Secondly, for each sampled location-view pair, we use the network to compute the posterior distribution parameters. Finally, we compute the KL divergence with the prior for the density and radiance variables at each spatial location. Given the explicit form of the logistic normal and rectified normal distributions (Sec. A.1), the KL divergence between the prior and the posterior in both cases has the following explicit expression:
(15)  \mathrm{KL}(q\,\|\,p) = F_q(0)\log\frac{F_q(0)}{F_p(0)} + \int_{0^+}^{\infty} p_q(x)\log\frac{p_q(x)}{p_p(x)}\,dx,
where F_q, F_p are computed by Eq. (13) and p_q, p_p by Eq. (14), respectively. Note that the density value is bounded to be positive by a rectified linear unit that sets all the negative values to 0, which gives rise to the point mass at zero (for the radiance variable this term vanishes, since the logistic normal places no mass at the boundary). We show the intuitive graphics in Figure 2. In practice, we use a Monte-Carlo estimator over the density variable during optimization to approximate the integral in the previous equation.
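The Monte-Carlo estimator mentioned above can be sketched as follows. This is an illustrative standalone version (not the paper's code; the helper names and scalar parameterization are assumptions): the point mass that both rectified normals place at zero is handled analytically, while the continuous part over positive values is averaged over samples drawn from the posterior's underlying Gaussian:

```python
import math
import numpy as np

def _norm_cdf(x):
    # standard normal CDF, Phi(x)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def _norm_logpdf(x, mu, beta):
    # log density of N(mu, beta^2)
    return -0.5 * math.log(2.0 * math.pi) - np.log(beta) - 0.5 * ((x - mu) / beta) ** 2

def kl_rectified_normal_mc(mu_q, beta_q, mu_p, beta_p, n_samples=10000, seed=0):
    """Monte-Carlo KL(q || p) between two rectified normals (sketch).

    Discrete part: both q and p put probability Phi(-mu/beta) on x = 0.
    Continuous part (x > 0): the rectified density equals the underlying
    Gaussian density, so we average log(q/p) over positive samples.
    """
    rng = np.random.default_rng(seed)
    fq0, fp0 = _norm_cdf(-mu_q / beta_q), _norm_cdf(-mu_p / beta_p)
    kl = fq0 * math.log(fq0 / fp0)                      # point mass at zero
    x = mu_q + beta_q * rng.standard_normal(n_samples)  # draws from N(mu_q, beta_q^2)
    pos = x > 0                                         # samples landing in (0, inf)
    log_ratio = _norm_logpdf(x[pos], mu_q, beta_q) - _norm_logpdf(x[pos], mu_p, beta_p)
    kl += np.sum(log_ratio) / n_samples                 # expectation under q
    return kl
```

Dividing by `n_samples` (rather than by the number of positive samples) is deliberate: the negative draws belong to the point mass at zero, whose contribution is already accounted for analytically.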
A.4 Trapezoidal rule
To estimate the continuous integral in Eq. (3), we first follow the original NeRF [28] and use a stratified sampling strategy to sample spatial locations along each ray. Concretely, we partition the ray interval [t_n, t_f] into N evenly-spaced bins and then randomly draw a sample within each bin:
(16)  t_i \sim \mathcal{U}\!\left[t_n + \tfrac{i-1}{N}(t_f - t_n),\; t_n + \tfrac{i}{N}(t_f - t_n)\right].
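This stratified sampling step is straightforward to implement; the following is a minimal sketch (function name illustrative):

```python
import numpy as np

def stratified_samples(t_near, t_far, n_bins, rng):
    """Draw one uniform sample inside each of n_bins evenly-spaced bins
    partitioning [t_near, t_far], as in NeRF's stratified sampling."""
    edges = np.linspace(t_near, t_far, n_bins + 1)
    # place each sample uniformly at random within its own bin
    return edges[:-1] + (edges[1:] - edges[:-1]) * rng.random(n_bins)
```

Because each bin contributes exactly one sample, the resulting locations are strictly increasing along the ray while still covering the full interval stochastically.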
For these samples, we can utilize our framework to produce density-radiance pairs at each spatial location along each ray. As mentioned in Sec. 4.1, we use a different strategy to estimate the continuous integral in Eq. (3) compared to the original NeRF. The motivation is that, in our preliminary experiments, we observed that using an alternative trapezoidal rule² (²Rahman, Qazi I.; Schmeisser, Gerhard (December 1990), "Characterization of the speed of convergence of the trapezoidal rule", Numerische Mathematik, 57 (1): 123–138.) to approximate the volume rendering integral is much more stable than the traditional alpha compositing used in the original paper [28]. The reason is that the latter can produce extremely large density values when large variances are associated with the sampled distributions, which in turn causes numerical instabilities during optimization. The alternative trapezoidal method used to approximate the aforementioned integral addresses this limitation and can be expressed as:
(17)  \hat{C}^{(k)}(\mathbf{r}) = \sum_{i=1}^{N-1} \frac{\delta_i}{2}\left(T_i^{(k)}\sigma_i^{(k)}\mathbf{c}_i^{(k)} + T_{i+1}^{(k)}\sigma_{i+1}^{(k)}\mathbf{c}_{i+1}^{(k)}\right),
where \delta_i = t_{i+1} - t_i is the distance between adjacent sampled spatial locations along the ray, (\sigma_i^{(k)}, \mathbf{c}_i^{(k)}) is the k-th sample of the density-radiance pairs at the i-th ray location, and T_i^{(k)} is the accumulated transmittance up to that location.
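As a concrete illustration of this quadrature, the following sketch (an assumption-laden simplification, not the paper's implementation) renders a single ray by trapezoidal integration of the transmittance-weighted integrand, with the transmittance accumulated as in NeRF:

```python
import numpy as np

def render_ray_trapezoid(sigmas, colors, deltas):
    """Trapezoidal approximation of the volume rendering integral (sketch).

    sigmas: (N,) sampled densities along the ray
    colors: (N, 3) sampled RGB radiance values
    deltas: (N-1,) distances between adjacent samples
    Transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j).
    """
    # accumulated optical depth at each sample (zero at the first sample)
    accum = np.concatenate([[0.0], np.cumsum(sigmas[:-1] * deltas)])
    T = np.exp(-accum)
    g = T[:, None] * sigmas[:, None] * colors  # integrand at each sample
    # trapezoid: average adjacent integrand values, weight by segment length
    return 0.5 * np.sum(deltas[:, None] * (g[:-1] + g[1:]), axis=0)
```

Averaging adjacent integrand values, rather than alpha-compositing per segment, avoids the exponentiation of single large sampled densities that causes the instabilities described above.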
Appendix B Pseudoalgorithm
In Algorithm 1, we provide the pseudocode for the learning process of SNeRF.
Appendix C Additional Qualitative Results
In Figure 5, we show additional qualitative results obtained by our S-NeRF across different scenes in the evaluated dataset. For each scene, we show not only the quantified uncertainty (third column) associated with the rendered novel view (second column), but also the estimated uncertainty (fifth column) associated with the generated depth-map (fourth column).