## 16.3 Emerging Topics

Rendering research continues to be a vibrant field, as should be evident from
the length of the “Further Reading” sections at the conclusions of the
previous chapters. In addition to the topics discussed earlier, there are
two important emerging areas of rendering research that we have not covered
in this book—inverse and differentiable rendering and the use of machine
learning techniques in image synthesis. Work in these areas is
progressing rapidly, and so we believe that it would be premature to
include implementations of associated techniques in `pbrt` and to discuss
them in the book text; whichever algorithms we chose would likely be obsolete
in a year or two. However, given the amount of activity in these areas, we
will briefly summarize the landscape of each.

### 16.3.1 Inverse and Differentiable Rendering

This book has so far focused on *forward* rendering, in which rendering
algorithms convert an input scene description $x$ into a
synthetic image $y$ taken in the corresponding virtual world. Assuming that
the underlying computation is consistent across runs, we can think of the
entire process as the evaluation of an intricate function
$f: \mathcal{X} \to \mathcal{Y}$ satisfying $f(x) = y$. The main
appeal of physically based forward-rendering methods is that they account for
global light transport effects, which improves the visual realism of the output
$y$.

However, many applications instead require an *inverse*
$f^{-1}(y) = x$ to infer a scene description $x$ that is
consistent with a given image $y$, which may be a real-world
photograph. Examples of disciplines where such inverses are needed include
autonomous driving, robotics, biomedical imaging, microscopy, architectural
design, and many others.

Evaluating $f^{-1}$ is a surprisingly difficult and ambiguous problem: for example, a bright spot on a surface could be alternatively explained by texture or shape variation, illumination from a light source, focused reflection from another object, or simply shadowing at all other locations. Resolving this ambiguity requires multiple observations of the scene and reconstruction techniques that account for the interconnected nature of light transport and scattering. In other words, physically based methods are not just desirable; they are a prerequisite.

Directly inverting $f$ is possible in some cases, though doing so tends to involve drastic simplifying assumptions: consider measurements taken by an X-ray CT scanner, which require further processing to reveal a specimen’s interior structure. (X-rays are electromagnetic radiation just like visible light, simply characterized by much shorter wavelengths in the 0.01–10 nm range.) Standard methods for this reconstruction assume a purely absorbing medium, in which case a 3D density can be found using a single pass over all data. However, this approximate inversion leads to artifacts when dense bone or metal fragments reflect some of the X-rays.
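The purely absorbing case can be sketched in a few lines (a hypothetical one-ray example, not actual CT code): under the Beer–Lambert law, a single logarithm per measurement recovers the line integral of the density along the ray, which is why one pass over the data suffices.

```python
import numpy as np

# Beer-Lambert law: in a purely absorbing medium, the transmission along a
# ray is T = exp(-sum_i sigma_i * dt), so a single logarithm inverts the
# measurement and recovers the line integral of the density directly.
sigma = np.array([0.5, 2.0, 1.0])   # hypothetical per-voxel densities along one ray
dt = 0.1                            # voxel thickness

T = np.exp(-np.sum(sigma * dt))     # simulated X-ray detector measurement
recovered = -np.log(T)              # line integral of the density

print(recovered)                    # 0.35 (up to floating-point rounding)
```

Once scattering contributes to the measurement (e.g., reflection off bone or metal), it no longer follows this simple law, which is the source of the artifacts mentioned above.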

The function $f$ that is computed by a physically based renderer like `pbrt` is
beyond the reach of such an explicit inversion. Furthermore, a scene that
perfectly reproduces images seen from a given set of viewpoints may not exist at all. Inverse
rendering methods therefore pursue a relaxed minimization problem of the form

$$
x^* = \operatorname{argmin}_x \; g(f(x)),
\tag{16.1}
$$

where $g: \mathcal{Y} \to \mathbb{R}$ refers to a *loss function*
that quantifies the quality of a rendered image of the scene $x$. For example, the
definition $g(y) = \lVert y - y_{\mathrm{ref}} \rVert$ could be used to
measure the distance to a reference image $y_{\mathrm{ref}}$. This type of optimization
is often called *analysis-by-synthesis* due to the reliance on
repeated simulation (synthesis) to gain understanding about an inverse problem.
The approach easily generalizes to simultaneous optimization of multiple viewpoints.
An extra *regularization* term $R(x)$ depending only on the scene
parameters is often added on the right hand side to encode prior knowledge about
reasonable parameter ranges.
Composition with further computation is also possible: for example, we could
alternatively optimize $g(f(h(z)))$, where $h$ is a neural
network that produces the scene $x = h(z)$ from learned parameters $z$.

Irrespective of such extensions, the nonlinear optimization problem in Equation (16.1) remains too challenging to solve in one step and must be handled using iterative methods. The usual caveats about their use apply here: iterative methods require a starting guess and may not converge to the optimal solution. This means that selecting an initial configuration and incorporating prior information (valid parameter ranges, expected smoothness of the solution, etc.) are both important steps in any inverse rendering task. The choice of loss and parameterization of the scene can also have a striking impact on the convexity of the optimization task (for example, direct optimization of triangle meshes tends to be particularly fragile, while implicit surface representations are better behaved).

Realistic scene descriptions are composed of millions of floating-point values that together specify the shapes, BSDFs, textures, volumes, light sources, and cameras. Each value contributes a degree of freedom to an extremely high-dimensional optimization domain (for example, a quadrilateral with an RGB image map texture adds roughly 1.7 million dimensions to $x$). Systematic exploration of a space with that many dimensions is not possible, making gradient-based optimization the method of choice for this problem. The gradient is invaluable here because it provides a direction of steepest descent that can guide the optimization toward higher-quality regions of the scene parameter space.

Let us consider the most basic gradient descent update equation for this problem:

$$
x_{i+1} = x_i - \alpha \, \nabla (g \circ f)(x_i),
\tag{16.2}
$$

where $\alpha$ denotes the step size. A single iteration of this optimization can be split into four individually simpler steps via the chain rule:

$$
\begin{aligned}
y_i &= f(x_i), \\
\delta_y &= J_g(y_i)^T, \\
\delta_x &= J_f(x_i)^T \, \delta_y, \\
x_{i+1} &= x_i - \alpha \, \delta_x,
\end{aligned}
\tag{16.3}
$$

where $J_f \in \mathbb{R}^{m \times n}$ and $J_g \in \mathbb{R}^{1 \times m}$ are the Jacobian matrices of the rendering algorithm and loss function, and $n$ and $m$ respectively denote the number of scene parameters and rendered pixels. These four steps correspond to:

1. Rendering an image of the scene $x_i$.
2. Differentiating the loss function to obtain an image-space gradient vector $\delta_y$. (A positive component in this vector indicates that increasing the value of the associated pixel in the rendered image would increase the loss; the equivalent applies for a negative component.)
3. Converting the image-space gradient $\delta_y$ into a parameter-space gradient $\delta_x$.
4. Taking a gradient step.
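To make the four steps concrete, here is a minimal sketch with a deliberately trivial "renderer": a linear map $f(x) = Ax$, for which $J_f = A$, and the loss $g(y) = \lVert y - y_{\mathrm{ref}} \rVert^2$. All names and values are hypothetical.

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])        # toy "renderer": f(x) = A @ x, so J_f = A
x_ref = np.array([1.0, -2.0])    # scene parameters we hope to recover
y_ref = A @ x_ref                # reference image (3 "pixels", 2 parameters)
alpha = 0.1                      # step size

x = np.zeros(2)                  # initial guess
for _ in range(200):
    y = A @ x                    # 1. render an image of the scene x_i
    delta_y = 2.0 * (y - y_ref)  # 2. image-space gradient of g(y) = ||y - y_ref||^2
    delta_x = A.T @ delta_y      # 3. parameter-space gradient J_f^T delta_y
    x = x - alpha * delta_x      # 4. gradient step

print(x)                         # converges to x_ref = [1, -2]
```

Of course, a real renderer could never materialize $A$; as discussed shortly, the whole difficulty of step 3 is evaluating that product without the matrix.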

In practice, more sophisticated descent variants than the one in
Equation (16.3) are often used for step 4—for
example, to introduce per-variable momentum and track the variance of gradients,
as is done in the commonly used *Adam* (Kingma and Ba 2014) optimizer.
Imposing a metric on the optimization domain to pre-condition gradient steps
can substantially accelerate convergence, as demonstrated by Nicolet et
al. (2021) in the case of differentiable mesh optimization.
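For reference, here is a minimal sketch of the Adam update (our simplified rendition of Kingma and Ba's method, with hypothetical names; not `pbrt` code), showing the per-variable momentum and gradient-variance tracking mentioned above:

```python
import numpy as np

def adam_step(x, grad, m, v, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: m carries per-variable momentum, v tracks the
    (uncentered) variance of recent gradients; both are bias-corrected."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)   # corrects the bias from zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Minimize the toy loss (x - 3)^2 from a distant starting guess.
x, m, v = 10.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2.0 * (x - 3.0), m, v, t)
# x is now close to the minimizer at 3
```

Dividing by the running gradient variance normalizes the step size per variable, which is particularly helpful when scene parameters have wildly different scales.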

The third step evaluates the vector-matrix product $\delta_y^T J_f$,
which is the main challenge in this sequence. At size $m \times n$, the Jacobian $J_f$
of the rendering algorithm is far too large to store or even compute, as both
$m$ and $n$ could be in the range of multiple millions of elements. Methods
in the emerging field of *differentiable rendering* therefore directly evaluate this
product without ever constructing the matrix $J_f$. The remainder of
this subsection reviews the history and principles of these methods.

For completeness, we note that a great variety of techniques have used derivatives to improve or accelerate the process of physically based rendering; these are discussed in “Further Reading” sections throughout the book. In the following, we exclusively focus on parametric derivatives for inverse problems.

Inverse problems are of central importance in computer vision, and so it should
be of no surprise that the origins of differentiable rendering as well as many
recent advances can be found there: following pioneering work on *OpenDR*
by Loper and Black (2014), a number of approximate
differentiable rendering techniques have been proposed and applied to
challenging inversion tasks. For example, Rhodin et al. (2015)
reconstructed the pose of humans by optimizing a translucent medium composed of
Gaussian functions. Kato et al. (2018) and Liu et
al. (2019a) proposed different ways of introducing smoothness into
the traditional rasterization pipeline. Laine et al. (2020)
recently proposed a highly efficient modular GPU-accelerated rasterizer based
on deferred shading followed by a differentiable antialiasing step. While
rasterization-based methods can differentiate the rendering of directly lit
objects, they cannot easily account for effects that couple multiple scene
objects like shadows or interreflection.

Early work that used physically based differentiable rendering focused on the optimization of a small number of parameters, where there is considerable flexibility in how the differentiation is carried out. For example, Gkioulekas et al. (2013b) used stochastic gradient descent to reconstruct homogeneous media represented by a low-dimensional parameterization. Khungurn et al. (2015) differentiated a transport simulation to fit fabric parameters to the appearance in a reference photograph. Hašan and Ramamoorthi (2013) used volumetric derivatives to enable near-instant edits of path-traced heterogeneous media. Gkioulekas et al. (2016) studied the challenges of differentiating local properties of heterogeneous media, and Zhao et al. (2016) performed local gradient-based optimization to drastically reduce the size of heterogeneous volumes while preserving their appearance.

Besides the restriction to volumetric representations, a shared limitation of
these methods is that they cannot efficiently differentiate a simulation with
respect to the full set of scene parameters, particularly when $n$ and $m$ are
large (in other words, they are not practical choices for the third step of the
previous procedure). Subsequent work has adopted *reverse-mode differentiation*,
which can simultaneously propagate derivatives to an essentially arbitrarily
large number of parameters. (The same approach also powers
training of neural networks, where it is known as *backpropagation*.)
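The core mechanism of reverse-mode differentiation can be sketched in a few lines: traverse the forward computation's steps in reverse, multiplying the incoming gradient by each step's transposed Jacobian, so that $\delta_x = J_f^T \delta_y$ is obtained without ever building $J_f$. The two-step "renderer" below is hypothetical.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])

def forward(x):
    u = A @ x          # intermediate value: must be stored for the reverse pass
    y = u ** 2         # elementwise nonlinearity
    return y, u

def backward(u, delta_y):
    # Reverse order: transposed Jacobian of the elementwise square first,
    # then of the linear map. The full Jacobian J_f is never formed.
    delta_u = 2.0 * u * delta_y
    delta_x = A.T @ delta_u
    return delta_x

x = np.array([1.0, 2.0])
delta_y = np.array([1.0, 1.0, 1.0])   # image-space gradient from the loss
y, u = forward(x)
delta_x = backward(u, delta_y)        # J_f^T delta_y; equals [16, 22] here
```

Note that the intermediate value `u` must be kept alive until the reverse pass; this is exactly the storage burden discussed later in this section.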

Of particular note is the groundbreaking work by Li et al. (2018)
along with their *redner* reference implementation, which performs
reverse-mode derivative propagation using a hand-crafted implementation of the
necessary derivatives. In the paper, the authors make the important observation that 3D scenes are
generally riddled with visibility-induced discontinuities at object
silhouettes, where the radiance function undergoes sudden changes. These
are normally no problem in a Monte Carlo renderer, but they cause a severe
problem following differentiation. To see why, consider a hypothetical integral
that computes the average incident illumination at some position $p$.
When computing the derivative of such a calculation, it is normally fine to
exchange the order of differentiation and integration:

$$
\frac{\partial}{\partial x} \int_{S^2} L_i(p, \omega, x) \, d\omega
= \int_{S^2} \frac{\partial}{\partial x} L_i(p, \omega, x) \, d\omega .
$$
The left-hand side is the desired answer, while the right-hand side represents the result of differentiating the simulation code. Unfortunately, the equality generally no longer holds when the integrand is discontinuous in the argument being integrated. Li et al. recognized that an extra correction term must be added to account for how perturbations of the scene parameters cause the discontinuities to shift. They resolved primary visibility by integrating out discontinuities via the pixel reconstruction filter and used a hierarchical data structure to place additional edge samples on silhouettes to correct for secondary visibility.
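A one-dimensional caricature (ours, unrelated to any renderer) exposes the failure: for $I(\theta) = \int_0^1 \mathbf{1}[t < \theta]\,dt = \theta$ we have $dI/d\theta = 1$, yet the integrand's pointwise derivative with respect to $\theta$ is zero almost everywhere, so naively differentiating a Monte Carlo estimator yields zero.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, 100_000)
theta = 0.5

# The estimator itself is fine: it converges to I(theta) = theta.
estimate = np.mean(t < theta)

# Differentiating the integrand pointwise: d/dtheta [t < theta] = 0 for
# almost every fixed t, so the "differentiated estimator" is identically 0
# even though the true derivative dI/dtheta is 1.
naive_grad = np.mean(np.zeros_like(t))

print(estimate, naive_grad)   # ~0.5, 0.0
```

The missing contribution is exactly the boundary term caused by the moving discontinuity, which the correction described above restores.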

Building on the Reynolds transport theorem, Zhang et al. (2019) generalized this approach into a more general theory of differential transport that also accounts for participating media. (In that framework, the correction by Li et al. (2018) can also be understood as an application of the Reynolds transport theorem to a simpler 2D integral.) Zhang et al. also studied further sources of problematic discontinuities such as open boundaries and shading discontinuities and showed how they can also be differentiated without bias.

Gkioulekas et al. (2016) and Azinović et al. (2019) observed that the gradients produced by a differentiable renderer are generally biased unless extra care is taken to decorrelate the forward and differential computation (i.e., steps 1 and 3)—for example, by using different random seeds.

Manual differentiation of simulation code can be a significant development and
maintenance burden. This problem can be addressed using tools for *automatic
differentiation* (AD), in which case derivatives are obtained by mechanically
transforming each step of the forward simulation code. See the excellent book
by Griewank and Walther (2008) for a review of AD
techniques. A curious aspect of differentiation is that the computation becomes
unusually dynamic and problem-dependent: for example, derivative propagation
may only involve a small subset of the program variables, which may not be
known until the user launches the actual optimization.

Mirroring similar developments in the machine learning world, recent work on
differentiable rendering has therefore involved combinations of AD with
*just-in-time* (JIT) compilation to embrace the dynamic nature of this
problem and take advantage of optimization opportunities. There are several
noteworthy differences between typical machine learning and rendering
workloads: the former tend to be composed of a relatively small number of
arithmetically intense operations like matrix multiplications and convolutions,
while the latter use vast numbers of simple arithmetic operations. Besides this
difference, ray-tracing operations and polymorphism are ubiquitous in rendering
code; polymorphism refers to the property that function calls (e.g., texture
evaluation or BSDF sampling) can indirectly branch to many different parts of a
large codebase. These differences have led to tailored AD/JIT frameworks for
differentiable rendering.

The *Mitsuba 2* system described by Nimier-David et
al. (2019) traces the flow of computation in rendering
algorithms while applying forward- or reverse-mode AD; the resulting code is
then JIT-compiled into wavefront-style GPU kernels. Later work on the
underlying *Enoki* just-in-time compiler added more flexibility: in
addition to wavefront-style execution, the system can also generate megakernels
with reduced memory usage. Polymorphism-aware optimization passes simplify the
resulting kernels, which are finally compiled into vectorized machine code that
runs on the CPU or GPU.

A fundamental issue of any method based on reverse-mode differentiation (whether using AD or hand-written derivatives) is that the backpropagation step requires access to certain intermediate values computed by the forward simulation. The sequence of accesses to these values occurs in reverse order compared to the original program execution, which is inconvenient because they must either be stored or recomputed many times. The intermediate state needed to differentiate a realistic simulation can easily exhaust the available system memory, limiting performance and scalability.

Nimier-David et al. (2020) and
Stam (2020) observed that differentiating a
light transport simulation can be interpreted as a simulation in its own right,
where a differential form of radiance propagates through the scene. This
derivative radiation is “emitted” from the camera, reflected by scene
objects, and eventually “received” by scene objects with differentiable
parameters. This idea, termed *radiative backpropagation*, can drastically
improve the scalability limitation mentioned above (the authors report speedups
of multiple orders of magnitude compared to naive AD). Following this idea, costly
recording of program state followed by reverse-mode differentiation can be
replaced by a Monte Carlo simulation of the “derivative radiation.” The
runtime complexity of the original radiative backpropagation method is
quadratic in the length of the simulated light paths, which can be prohibitive
in highly scattering media. Vicini et al. (2021) addressed this
flaw and enabled backpropagation in linear time by exploiting two different
flavors of reversibility: the physical reciprocity of light and the
mathematical invertibility of deterministic computations in the rendering code.

We previously mentioned how visibility-related discontinuities can bias computed gradients unless precautions are taken. A drawback of the original silhouette edge sampling approach by Li et al. (2018) was relatively poor scaling with geometric complexity. Zhang et al. (2020) extended differentiable rendering to Veach’s path space formulation, which brings unique benefits in such challenging situations: analogous to how path space forward-rendering methods open the door to powerful sampling techniques, differential path space methods similarly enable access to previously infeasible ways of generating silhouette edges. For example, instead of laboriously searching for silhouette edges that are visible from a specific scene location, we can start with any triangle edge in the scene and simply trace a ray to find suitable scene locations. Zhang et al. (2021b) later extended this approach to a larger path space including volumetric scattering interactions.

Loubet et al. (2019) made the observation that discontinuous integrals themselves are benign: it is the fact that they move with respect to scene parameter perturbations that causes problems under differentiation. They therefore proposed a reparameterization of all spherical integrals that has the curious property that it moves along with each discontinuity. The integrals are then static in the new coordinates, which makes differentiation under the integral sign legal.
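The idea can be illustrated with the one-dimensional integral $I(\theta) = \int_0^1 \mathbf{1}[t < \theta]\,dt$ (our toy construction, not Loubet et al.'s actual reparameterization): substituting $t = \theta u$ makes the step function constant in the new coordinate, and the substitution's Jacobian carries the $\theta$-dependence smoothly.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, 100_000)

def I_naive(theta):
    # Sample t directly: the discontinuity [t < theta] moves with theta.
    return np.mean(u < theta)

def I_reparam(theta):
    # Substitute t = theta * u: the step becomes [u < 1] (static in the new
    # coordinate) and the substitution's Jacobian, theta, is smooth.
    return np.mean(np.where(u < 1.0, theta, 0.0))

# Finite differences with *shared* samples stand in for differentiating the
# estimator: the reparameterized version recovers dI/dtheta = 1.
h = 1e-6
grad = (I_reparam(0.5 + h) - I_reparam(0.5 - h)) / (2.0 * h)

grad_naive = (I_naive(0.5 + h) - I_naive(0.5 - h)) / (2.0 * h)
# almost surely 0.0 (no sample lands in the 2e-6-wide window), and wildly
# noisy whenever one does: the moving discontinuity defeats this approach.

print(grad)   # 1.0 (up to floating-point rounding)
```

Because the discontinuity is static in the new coordinates, differentiation under the integral sign is legal, which is exactly the property the reparameterization is designed to restore.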

Bangaru et al. (2020) differentiated the rendering equation and applied the divergence theorem to convert a troublesome boundary integral into a more convenient interior integral, which they subsequently showed to be equivalent to a reparameterization. They furthermore identified a flaw in Loubet et al.’s method that causes bias in computed gradients and proposed a construction that finally enables unbiased differentiation of discontinuous integrals.

Differentiating under the integral sign changes the integrand, which means that sampling strategies that were carefully designed for a particular forward computation may no longer be appropriate for its derivative. Zeltner et al. (2021) investigated the surprisingly large space of differential rendering algorithms that results from differentiating standard constructions like importance sampling and MIS in different ways (for example, differentiation followed by importance sampling is not the same as importance sampling followed by differentiation). They also proposed a new sampling strategy specifically designed for the differential transport simulation. In contrast to ordinary rendering integrals, their differentiated counterparts also contain both positive and negative-valued regions, which means that standard sampling approaches like the inversion method are no longer optimal from the viewpoint of minimizing variance. Zhang et al. (2021a) applied antithetic sampling to reduce gradient variance involving challenging cases that arise when optimizing the geometry of objects in scenes with glossy interreflection.
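Antithetic sampling itself is a classical variance reduction technique that is easy to demonstrate in isolation (a toy integrand, not Zhang et al.'s estimator): pairing each sample $u$ with its mirror $1 - u$ cancels much of the variance of a monotone integrand such as $e^t$ on $[0, 1]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, n_trials = 1000, 200

def estimate(antithetic):
    """Monte Carlo estimate of the integral of e^t over [0, 1] (= e - 1)."""
    u = rng.uniform(0.0, 1.0, n_pairs)
    if antithetic:
        samples = 0.5 * (np.exp(u) + np.exp(1.0 - u))   # mirrored pair
    else:
        v = rng.uniform(0.0, 1.0, n_pairs)
        samples = 0.5 * (np.exp(u) + np.exp(v))         # independent pair
    return samples.mean()

# Empirical variance over repeated runs: the antithetic estimator is far
# less noisy at the same sample count, since e^u and e^(1-u) are
# negatively correlated.
var_anti = np.var([estimate(True) for _ in range(n_trials)])
var_indep = np.var([estimate(False) for _ in range(n_trials)])
```

In differentiable rendering the same pairing idea is applied to the sign-flipping derivative integrands, where cancellation between correlated sample pairs is especially valuable.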

While differentiable rendering still remains challenging, fragile, and computationally expensive, steady advances continue to improve its practicality over time, leading to new applications made possible by this capability.

### 16.3.2 Machine Learning and Rendering

As noted by Hertzmann (2003) in a prescient early paper, machine learning offers effective approaches to many important problems in computer graphics, including regression and clustering. Yet until recently, application of ideas from that field was limited. However, just as in other areas of computer science, machine learning and deep neural networks have recently become an important component of many techniques at the frontiers of rendering research.

This work can be (roughly) organized into three broad categories that are progressively farther afield from the topics discussed in this book:

- Application of *learned data structures*, typically based on neural networks, to replace traditional data structures in traditional rendering algorithms.
- Using machine learning–based algorithms (often deep convolutional neural networks) to improve images generated by traditional rendering algorithms.
- Directly synthesizing photorealistic images using deep neural networks.

Early work in the first category includes Nowrouzezahrai et al. (2009), who used neural networks to encode spherical harmonic coefficients that represented the reflectance of dynamic objects; Dachsbacher (2011), who used neural networks to represent inter-object visibility; and Ren et al. (2013), who encoded scenes’ radiance distributions using neural networks.

Previous chapters’ “Further Reading” sections have discussed many techniques based on learned data structures, including approaches that use neural networks to represent complex materials (Rainer et al. 2019, 2020; Kuznetsov et al. 2021), complex light sources (Zhu et al. 2021), and the scene’s radiance distribution to improve sampling (Müller et al. 2019, 2020, 2021). Many other techniques based on caching and interpolating radiance in the scene can be viewed through the lens of learned data structures, ranging from Vorba et al.’s (2014) use of Gaussian mixture models to techniques as early as irradiance caching (Ward et al. 1988).

One challenge in using learned data structures with traditional rendering
algorithms is that the ability to just evaluate a learned function is often
not sufficient, since effective Monte Carlo integration generally requires
the ability to draw samples from a matching distribution and to quantify
their density. Another challenge is that *online learning* is often
necessary, where the learned data structure is constructed while rendering
proceeds rather than being initialized ahead of time. For interactive
rendering of dynamic scenes, incrementally updating learned representations
can be especially beneficial.

More broadly, it may be desirable to represent an entire scene with a
neural representation; there is no requirement that the abstractions of
meshes, BRDFs, textures, lights, and media be separately and explicitly
encoded. Furthermore, learning the parameters to such representations in
inverse rendering applications can be challenging due to the ambiguities
noted earlier. At writing, *neural radiance fields* (NeRF)
(Mildenhall et al. 2020) are seeing widespread adoption as a learned scene
representation due to the effectiveness and efficiency of the approach.
NeRF is a volumetric representation that gives radiance and opacity at a
given point and viewing direction. Because it is based on volume
rendering, it has the additional advantage that it avoids the challenges of
discontinuities in the light transport integral discussed in the previous
section.

In rendering, work in the second category—using machine learning to improve conventionally rendered images—began with neural denoising algorithms, which are discussed in the “Further Reading” section at the end of Chapter 5. These algorithms can be remarkably effective; as with many areas of computer vision, deep convolutional neural networks have rapidly become much more effective at this problem than previous non-learned techniques.

*Figure 16.2: Image denoised with the NVIDIA OptiX 7.3 denoiser.*

Figure 16.2 shows an example of the result of using such a denoiser. Given a noisy image rendered with 32 samples per pixel as well as two auxiliary images that encode the surface albedo and surface normal, the denoiser is able to produce a noise-free image in a few tens of milliseconds. Given such results, the alternative of paying the computational cost of rendering a clean image by taking thousands of pixel samples is unappealing; doing so would take much longer, especially given that Monte Carlo error only decreases at a rate of $O(1/\sqrt{n})$ in the number of samples $n$. Furthermore, neural denoisers are usually effective at eliminating the noise from spiky high-variance pixels, which otherwise would require enormous numbers of samples to achieve acceptable error.
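The $O(1/\sqrt{n})$ behavior is easy to confirm empirically with a toy estimator of $\mathbb{E}[U] = 1/2$ for uniform $U$ (all names here are ours): multiplying the sample count by 16 only reduces the RMS error by a factor of about 4.

```python
import numpy as np

rng = np.random.default_rng(0)

def rms_error(n, trials=400):
    """RMS error of an n-sample Monte Carlo estimate of E[U] = 0.5,
    measured over many independent trials."""
    estimates = rng.uniform(0.0, 1.0, (trials, n)).mean(axis=1)
    return np.sqrt(np.mean((estimates - 0.5) ** 2))

e_small, e_large = rms_error(256), rms_error(4096)
print(e_small / e_large)   # ~4: 16x the samples buys only 4x less error
```

This diminishing return is precisely why post-hoc denoising at a modest sample count is so much more attractive than brute-force sampling.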

Most physically based renderers today are therefore used with denoisers.
This leads to an important question: *what is the role of the
renderer, if its output is to be consumed by a neural network?* Given a
denoiser, the renderer’s task is no longer to try to make the most accurate
or visually pleasing image for a human observer, but to generate output
that is most easily converted by the neural network to the desired final
representation. This question has deep implications for the design of both
renderers and denoisers and is likely to see much attention in coming
years. (For an example of recent work in this area, see the paper by Cho
et al. (2021), who improved denoising by incorporating
information directly from the paths traced by the renderer and not just
from image pixels.)

The question of the renderer’s role is further provoked by neural
post-rendering approaches that do much more than denoise images; a recent
example is *GANcraft*, which converts low-fidelity blocky images of
*Minecraft* scenes to be near-photorealistic (Hao et al. 2021).
A space of techniques lies in between this extreme and less intrusive
post-processing approaches like denoising: *deep shading* (Nalbach et al. 2017)
synthesizes expensive effects starting from a cheaply computed set of G-buffers
(normals, albedo, etc.). Granskog et al. (2020) improved
shading inference using additional view-independent context extracted from a
set of high-quality reference images.
More generally, *neural style transfer* algorithms (Gatys et al. 2016) can
be an effective way to achieve a desired visual style without fully simulating
it in a renderer. Providing nuanced artistic control to such approaches
remains an open problem, however.

In the third category, a number of researchers have investigated training
deep neural networks to encode a full rendering algorithm that goes from a
scene description to an image. See Hermosilla et
al. (2019) and Chen et al. (2021) for
recent work in this area.
Images may also be synthesized without using conventional rendering
algorithms at all, but solely from characteristics learned from real-world
images. A recent example of such a *generative model*
is *StyleGAN*, which was developed by Karras et
al. (2018, 2020); it is capable of
generating high-resolution and photorealistic images of a variety of
objects, including human faces, cats, cars, and interior scenes. Techniques
based on *segmentation maps* (Chen and Koltun 2017; Park et al. 2019) allow a
user to denote that regions of an image should be of general categories
like “sky,” “water,” “mountain,” or “car” and then synthesize a
realistic image that follows those categories. See the report by Tewari et
al. (2020) for a comprehensive summary of recent work in
such areas.