① Modify soac so that the code it generates leaves
objects in AOS layout in memory and recompile pbrt. (You will need to
manually update a few places in the WavefrontPathIntegrator that only
access a single field of a structure, as well.) How is performance affected
by this change?
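For concreteness, here is a minimal sketch of the two layouts for a hypothetical two-field structure (not pbrt code); soac generates code following the SOA pattern, and this exercise asks for the AOS one instead:

    struct Sample { float u; float pdf; };

    // AOS: a single array of complete structures. Loading sample i's u also
    // pulls its pdf (and neighboring samples' fields) into the same cache lines.
    struct SamplesAOS {
        Sample *samples;
    };

    // SOA: one array per field, so a kernel that touches only u reads a dense,
    // coalesced stream and never loads pdf.
    struct SamplesSOA {
        float *u;
        float *pdf;
    };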
② pbrt’s SampledWavelengths class stores two
Floats for each wavelength: one for the wavelength value and one for
its PDF. This class is passed along between almost all kernels. Render a
scene on the GPU and work out an estimate of the amount of bandwidth
consumed in communicating these values between kernels. (You may need to
make some assumptions to do so.)
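To give a sense of scale for such an estimate (the figures here are assumptions to replace with your own measurements): with pbrt’s default of 4 spectral samples, each SampledWavelengths holds 8 Floats, or 32 bytes. If a million pixel samples are in flight per pass and each of, say, ten kernels both reads and writes these values through DRAM, that alone is 2 × 32 bytes × one million samples × ten kernels ≈ 640 MB of traffic per pass.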
Then, implement an alternative SOA representation for
SampledWavelengths that stores only two values: the Float
sample used to originally sample the wavelengths and a Boolean value that
indicates whether the secondary wavelengths have been terminated. You
might use the sign bit to encode the Boolean value, or you might even try a
16-bit encoding, with the sample value quantized to 15 bits and the
16th used to indicate termination. Write code to encode
SampledWavelengths to this representation when they are pushed to a
queue and to decode this representation back to SampledWavelengths
when work is read from the queue via a call to
Film::SampleWavelengths() and then possibly a call to
SampledWavelengths::TerminateSecondary(). Estimate how much
bandwidth your improved representation saves. How is runtime performance
affected? Can you draw any conclusions about whether these kernels are
compute or bandwidth limited on your GPU?
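A minimal sketch of the 16-bit variant described above follows; the function names are hypothetical and the quantization is plain fixed point:

    #include <algorithm>
    #include <cstdint>

    // Pack the wavelength sample u in [0, 1) into 15 bits, with the low bit
    // recording whether the secondary wavelengths have been terminated.
    uint16_t EncodeWavelengthSample(float u, bool secondaryTerminated) {
        uint16_t q = (uint16_t)std::min(int(u * 32768.f), 32767);
        return uint16_t((q << 1) | (secondaryTerminated ? 1 : 0));
    }

    void DecodeWavelengthSample(uint16_t enc, float *u, bool *secondaryTerminated) {
        *secondaryTerminated = (enc & 1) != 0;
        // Reconstruct u at the center of its quantization bucket.
        *u = ((enc >> 1) + 0.5f) / 32768.f;
    }

On the decode path, the recovered sample value is what is passed to Film::SampleWavelengths(), with SampledWavelengths::TerminateSecondary() called afterward if the flag is set.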
② The direct lighting code in the
EvaluateMaterialsAndBSDFs()
kernel may suffer from divergence in the Light::SampleLi() call
if the scene has a variety of types of light source. Construct such a
scene and then experiment with moving light sampling into a separate
kernel, supplying it with work via a work queue and pushing the resulting
light samples onto another queue for the rest of the direct lighting
computation.
What is the effect on performance for your test scene? Is performance
negatively impacted for scenes with just a single type of light?
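One possible shape for the split, as a sketch (none of these names or fields are pbrt’s, and the exact set needed depends on how much of the direct lighting computation stays in the material kernel; Point2f and friends are pbrt’s vector types):

    // Pushed by the material kernel in place of an inline SampleLi() call.
    struct LightSampleWorkItem {
        int lightIndex;   // which light to sample; items could be sorted by
                          // light type to eliminate the divergence
        Point2f u;        // uniform sample to pass to Light::SampleLi()
        Point3f p;        // shading point for the light sampling context
        Normal3f n;       // shading normal at that point
        int pixelIndex;   // where the resulting contribution belongs
    };

The light-sampling kernel then drains this queue, calls SampleLi(), and pushes each result onto a second queue for shadow-ray tracing and the remainder of the direct lighting computation.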
③ Add support for ray differentials to the
WavefrontPathIntegrator, including both generating them for camera
rays and computing updated differentials for reflected and refracted rays.
(You will likely want to repurpose the code in the implementation of the
SurfaceInteraction::SpawnRay() method in
Section 10.1.3.)
After ensuring that texture filtering results match pbrt running on the
CPU, measure the performance impact of your changes. How much performance
is lost from the bandwidth used in passing ray differentials between
kernels? Do any kernels have better performance? If so, can you explain
why?
Next, implement one of the more space-efficient techniques for representing
derivative information with rays that are described by Akenine-Möller
et al. (2019). How do performance and filtering
quality compare to ray differentials?
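One of the representations from that paper is the ray cone: each ray carries only a width and a spread angle instead of two auxiliary differential rays. A minimal sketch of the propagation rule, with hypothetical names:

    struct RayCone {
        float width;        // cone width at the current point; 0 at the camera
        float spreadAngle;  // initialized from the angle subtended by a pixel
    };

    // Advance the cone a distance t to the next hit; beta accounts for the
    // curvature of the surface at the previous hit (0 for a planar reflector).
    RayCone Propagate(RayCone c, float t, float beta) {
        return RayCone{c.width + c.spreadAngle * t, c.spreadAngle + beta};
    }

The cone width at each hit then takes the place of the differentials’ footprint when computing texture-space derivatives for filtering.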
③ The WavefrontPathIntegrator’s performance can suffer
from scenes with very high maximum ray depths when there are few active
rays remaining at high depths and, in turn, insufficient parallelism for the
GPU to reach its peak capabilities. One approach to address this problem
is path regeneration, which was described by Novák et
al. (2010).
Following this approach, modify pbrt so that each ray traced handles its
termination individually when it reaches the maximum depth. Execute a
modified camera ray generation kernel each time through the main rendering
loop so that additional pixel samples are taken and camera rays are
generated until the current RayQueue is filled or there are no more
samples to take. Note that you will have to handle Film updates in a
different way than the current implementation—for example, via a work queue when
rays terminate. You may also have to handle the case of multiple threads
updating the same pixel sample. Finally, implement a mechanism for the GPU
to notify the CPU when all rays have terminated so that it knows when to
stop launching kernels.
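A sketch of the regeneration logic, written as standalone C++ with hypothetical names: a shared atomic counter hands out flat pixel-sample indices, and each invocation of the modified camera-ray kernel claims one until none remain.

    #include <atomic>
    #include <cstdint>

    // Hands out flat pixel-sample indices across kernel launches.
    std::atomic<int64_t> nextSampleIndex{0};

    // Claims the next pixel sample, if any remain. Once this returns false
    // and all queues have drained, the CPU knows to stop launching kernels.
    bool RegenerateCameraRay(int64_t totalSamples /* resolution times spp */) {
        int64_t sampleIndex = nextSampleIndex.fetch_add(1);
        if (sampleIndex >= totalSamples)
            return false;
        // Decode sampleIndex into (pixel, sample number), seed the Sampler,
        // generate the camera ray, and push it onto the RayQueue here.
        return true;
    }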
With all that taken care of, measure pbrt’s performance for a scene with a
high maximum ray depth. (Scenes that include volumetric scattering in media with very
high albedos are a good choice for this measurement.) How much is
performance improved with your approach? How is performance affected for
easier scenes with lower maximum depths that do not suffer from this problem?
③ In pbrt’s current implementation, the wavefront path tracer
is usually slower than the VolPathIntegrator when running on the CPU.
Render a few scenes using both approaches and benchmark pbrt’s
performance. Are any opportunities to improve the performance of the
wavefront approach on the CPU evident?
Next, measure how performance changes as you increase or decrease the queue
sizes (and consequently, the number of pixel samples that are evaluated in
parallel). Performance may be suboptimal with the current value of
WavefrontPathIntegrator::maxQueueSize, which leads to queues
much larger than can fit in the on-chip caches. However, too small a queue size may
offer insufficient parallelism or may lead to too little work being done in
each ParallelFor() call, which may also hurt performance. Are there
better default queue sizes for the CPU than the ones used currently?
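As a rough point of reference for the cache argument above (illustrative numbers, not pbrt’s defaults): a queue of one million entries whose work items occupy 64 bytes each takes 64 MB, several times the last-level cache of a typical desktop CPU, so each pass streams it through DRAM; a queue of a few tens of thousands of entries could instead remain resident in L2 or L3.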
③ When the WavefrontPathIntegrator runs on the CPU,
there is currently minimal performance benefit from organizing work in
queues. However, the queues offer the possibility of making it easier to
use SIMD instructions on the CPU: kernels might remove 8 work items at a
time, for example, processing them together using the 8 elements of a
256-bit SIMD register. Implement this approach and investigate pbrt’s
performance. (You may want to consider using a language such as
ispc (Pharr and Mark 2012) to avoid the challenges of manually writing
code using SIMD intrinsics.)
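A minimal sketch of the kind of inner loop this enables, assuming AVX2 with FMA and an SOA queue in which the x components of eight queued rays are contiguous in memory (names hypothetical):

    #include <immintrin.h>

    // Compute ox[i] += t[i] * dx[i] for eight rays at once; the same pattern
    // applies field by field to each batch of dequeued work items.
    void AdvanceOriginsX8(float *ox, const float *dx, const float *t) {
        __m256 o  = _mm256_loadu_ps(ox);
        __m256 d  = _mm256_loadu_ps(dx);
        __m256 tt = _mm256_loadu_ps(t);
        _mm256_storeu_ps(ox, _mm256_fmadd_ps(tt, d, o));
    }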
③ Implement a GPU ray tracer that is based on pbrt’s class
implementations from previous chapters but uses the GPU’s ray-tracing API for
scheduling rendering work instead of the wavefront-based architecture used
in this chapter. (You may want to start by supporting only a subset of the
full functionality of the WavefrontPathIntegrator.) Measure the
performance of the two implementations and discuss their differences. You
may find it illuminating to use a profiler to measure the bandwidth
consumed by each implementation. Can you find cases where the wavefront
integrator’s performance is limited by available memory bandwidth but yours
is not?
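As a starting point, the overall structure might resemble the following OptiX-style raygen program (pbrt’s GPU wavefront path already depends on OptiX; everything here apart from the optix* calls is hypothetical, and the trace and shading details are elided):

    #include <optix.h>

    extern "C" __global__ void __raygen__pathTrace() {
        uint3 idx = optixGetLaunchIndex();
        // Generate a camera ray for pixel (idx.x, idx.y), then trace the
        // entire path within this one thread: each bounce calls optixTrace(),
        // shades the hit recorded in the ray payload, samples the BSDF, and
        // continues, with no intermediate work queues between kernels.
        (void)idx;
    }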