1.4 Parallelization of pbrt

It’s now nearly impossible to buy a new laptop or desktop computer with only one processing core. (The same even holds for mobile phones, for that matter.) The computer systems of today and of the future will increasingly feature multiple processing cores, both on CPUs and on highly parallel throughput processors like GPUs. At the same time, the computational capabilities of their cores are only improving slowly; as such, significant increases in performance over time are now only available to programs that can run in parallel, with many separate threads of execution performing computation simultaneously on multiple cores.

Writing a parallel program is more difficult than writing a serial one. When two computations that the programmer believes are independent execute simultaneously but interact unexpectedly, the program may crash or produce incorrect results. Worse, such a bug may not manifest itself on a subsequent run if those particular computations happen not to execute simultaneously that time. Fortunately, good tools to help developers find these sorts of interactions are increasingly available.

For a parallel program to scale well to many processors, it needs to be able to provide a substantial amount of independent computation: any computation dependent on results of prior computation can’t be run concurrently with the computation it depends on. Fortunately, most rendering algorithms based on ray tracing have abundant parallelism; for the SamplerIntegrator, each image sample is independent of all of the other ones, and many millions of samples may be used for high-quality images.

One of the biggest challenges with parallel ray tracing is the impact of non-parallel phases of computation. For example, it’s not as easy to effectively parallelize the construction of many acceleration structures while the scene is being constructed as it is to parallelize rendering. While this may seem like a minor issue, Amdahl’s law, which describes the speedup of a workload that has both serial and parallel phases, points to the challenge. Given n cores performing computation and a workload where the fraction s of its overall computation is inherently serial, then the maximum speedup possible is

\[ \frac{1}{s + \frac{1}{n}(1 - s)}. \]

Thus, even with an infinite number of cores, the maximum speedup is $1/s$. If, for example, a seemingly innocuous 5% of the run time is spent in a serial phase of parsing the scene file and building acceleration structures, the maximum speedup possible is $1/0.05 = 20\times$, no matter how quickly the parallel phase executes.
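To make the bound concrete, the following stand-alone C++ sketch (not part of pbrt) evaluates Amdahl's law for a 5% serial fraction at several core counts; the speedup creeps toward, but never reaches, 20.

#include <cstdio>

// Maximum speedup predicted by Amdahl's law for a workload whose
// serial fraction is s, running on n cores.
double AmdahlSpeedup(double s, int n) {
    return 1 / (s + (1 - s) / n);
}

int main() {
    // With s = 0.05, these print roughly 3.5, 9.1, 15.4, and 19.6.
    for (int n : {4, 16, 64, 1024})
        printf("n = %4d: speedup = %.2f\n", n, AmdahlSpeedup(0.05, n));
    return 0;
}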

1.4.1 Data Races and Coordination

In pbrt, we assume that the computation is running on processors that provide coherent shared memory. The main idea of coherent shared memory is that all threads can read and write to a common set of memory locations and that changes to memory made by one thread will eventually be seen by other threads. These properties greatly simplify the implementation of the system as there’s no need to explicitly communicate data between cores. (Coherent shared memory is generally available on today’s CPUs and is likely to continue to be on future CPUs. On the other hand, if one is running a computation across multiple computers in a cluster, coherent shared memory is generally not available.)

Although coherent shared memory relieves the need for separate threads to explicitly communicate data with each other, they still need to coordinate their access to shared data; a danger of coherent shared memory is data races. If two threads modify the same memory location without coordination between the two of them, the program will almost certainly compute incorrect results or even crash. Consider the example of two processors simultaneously running the following innocuous-looking code, where globalCounter starts with a value of two:

extern int globalCounter;
if (--globalCounter == 0)
    printf("done!\n");

Because the two threads don’t coordinate their reading and writing of globalCounter, it is possible that “done” will be printed zero, one, or even two times! The assembly instructions generated by the compiler are likely to correspond to steps something like the following:

extern int globalCounter;
int temp = globalCounter;
temp = temp - 1;
globalCounter = temp;
if (globalCounter == 0)
    printf("done!\n");

Now, consider different ways this code could be executed on two processors. For example, the second processor could start executing slightly after the first one, but the first one could go idle for a few cycles after executing the first few instructions:

Thread A                        Thread B
int temp = globalCounter;
temp = temp - 1;                (idle)
globalCounter = temp;           int temp = globalCounter;
// (idle)                       temp = temp - 1;
                                globalCounter = temp;
if (globalCounter == 0)         if (globalCounter == 0)
    printf("done!\n");              printf("done!\n");

(Many unpredictable events can cause these sorts of execution bubbles, ranging from the OS interrupting the thread to cache misses.) In this ordering, both threads read the value of zero from globalCounter, and both execute the printf() call. In this case, the error is not fatal, but if instead the system were freeing resources in the if block, then it would attempt to free the same resources twice, which would very likely cause a crash. Consider now this potential execution order:

Thread A                        Thread B
int temp = globalCounter;       int temp = globalCounter;
temp = temp - 1;                temp = temp - 1;
globalCounter = temp;           // (idle)
// (idle)                       globalCounter = temp;
if (globalCounter == 0)         if (globalCounter == 0)
    printf("done!\n");              printf("done!\n");

In this case, globalCounter ends up with a value of one, and neither thread executes the if block. These examples illustrate the principle that when multiple threads of execution are accessing shared modified data, they must somehow synchronize their access.

Two main mechanisms are available today for doing this type of synchronization: mutual exclusion and atomic operations. Mutual exclusion is implemented with std::mutex objects in pbrt. A std::mutex can be used to protect access to some resource, ensuring that only one thread can access it at a time. Consider the following updated version of the previous computation; here a std::lock_guard object acquires a lock on the mutex and releases it when it goes out of scope at the final brace.

extern int globalCounter;
extern std::mutex globalCounterMutex;
{ std::lock_guard<std::mutex> lock(globalCounterMutex);
  int temp = globalCounter;
  temp = temp - 1;
  globalCounter = temp;
  if (globalCounter == 0)
      printf("done!\n");
}

If two threads are executing this code and try to acquire the mutex at the same time, then the mutex will allow only one of them to proceed, stalling the other one in the std::lock_guard constructor. Only when the first thread has finished the computation and its std::lock_guard goes out of scope, releasing the lock on the mutex, is the second thread able to acquire the mutex itself and continue the computation.

Thread A                                 Thread B
{ std::lock_guard<std::mutex> lock(      { std::lock_guard<std::mutex> lock(
      globalCounterMutex);                     globalCounterMutex);
  int temp = globalCounter;                // (stalled by mutex)
  // …
} // (mutex released)                      // (mutex acquired)
                                           int temp = globalCounter;
                                           // …
                                         } // (mutex released)

With correct mutual exclusion here, the printf() will only be executed once, no matter what the ordering of execution between the two threads is.
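The behavior is easy to verify with a small self-contained program (a sketch using std::thread, not code from pbrt) that runs the protected update on two threads; regardless of how they are scheduled, "done!" is printed exactly once.

#include <cstdio>
#include <mutex>
#include <thread>

int globalCounter = 2;
std::mutex globalCounterMutex;

void Decrement() {
    // The lock makes the read-modify-write sequence and the following
    // test execute atomically with respect to the other thread.
    std::lock_guard<std::mutex> lock(globalCounterMutex);
    int temp = globalCounter;
    temp = temp - 1;
    globalCounter = temp;
    if (globalCounter == 0)
        printf("done!\n");
}

int main() {
    std::thread a(Decrement), b(Decrement);
    a.join();
    b.join();
    return 0;
}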

Atomic memory operations (or atomics) are the other option for correctly performing this type of memory update with multiple threads. Atomics are machine instructions that guarantee that their respective memory updates will be performed in a single transaction. (Atomic in this case refers to the notion that the memory updates are indivisible.) The implementations of atomic operations in pbrt are from the C++11 standard library and are further discussed in Appendix A.6.2. Using atomics, the computation above could be written to use the std::atomic<int> type, which has overloaded add, subtract, increment, and decrement operations, as below:

extern std::atomic<int> globalCounter;
if (--globalCounter == 0)
    printf("done!\n");

The std::atomic<int> decrement operator, --, subtracts 1 from the given variable, globalCounter, and returns its new value. Using an atomic operation ensures that if two threads simultaneously try to update the variable, then not only will the final value of the variable be the expected value but each thread will be returned the value of the variable after its own update alone. In this example, then, globalCounter will end up with a value of zero, as expected, with one thread guaranteed to have the value one returned from the atomic subtraction and the other thread guaranteed to have zero returned.
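Again, this can be demonstrated with a small stand-alone sketch (not pbrt code): the two threads receive the distinct values one and zero from the atomic decrement, so exactly one of them prints.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> globalCounter(2);

void Decrement() {
    // The decrement is a single indivisible operation; each thread is
    // returned a distinct post-update value, so the test succeeds once.
    if (--globalCounter == 0)
        printf("done!\n");
}

int main() {
    std::thread a(Decrement), b(Decrement);
    a.join();
    b.join();
    return 0;
}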

An additional option, transactional memory, is just starting to become available in CPUs as of this writing. With transactional memory, a set of memory writes is bundled as a transaction; if no other threads access those memory locations while the transaction is executing, then all of the writes are committed in a single atomic operation. Otherwise, the transaction is rolled back, none of the writes reach memory, and the computation has had no effect; the transaction must then be tried again. Transactional memory helps bridge the gap between the fine-grained operation of atomics and the higher overhead of mutexes. However, because it isn't yet widely available, transactional memory isn't currently used in pbrt.

Section A.6 in Appendix A has more information about parallel programming, with additional details on performance issues and pitfalls, as well as the various routines used in the parallel implementation of pbrt.

1.4.2 Conventions in pbrt

In pbrt (as is the case for most ray tracers) the vast majority of data at render time is read only (e.g., the scene description and texture maps). Almost all of the parsing of the scene file and creation of the scene representation in memory is done with a single thread of execution, so there are no synchronization issues during that phase of execution. During rendering, concurrent read access to all of the read-only data by multiple threads works with no problems; we only need to be concerned with situations where data in memory is being modified.

When adding new code to pbrt, it’s important to make sure to not inadvertently add code that modifies shared data without proper synchronization. This is usually straightforward; for example, when adding a new Shape implementation, the Shape will normally only perform read accesses to its member variables after it has been created. Sometimes, however, shared data may be inadvertently introduced. Consider the following code idiom, often seen in single-threaded code:

static bool firstCall = true;
if (firstCall) {
    // additional initialization …
    firstCall = false;
}

This code is unsafe with multiple threads of execution, as multiple threads may see the value of firstCall as true and all execute the initialization code. Writing this safely requires either atomic operations or mutexes. (This particular idiom can also be implemented safely using the std::call_once() function.)
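For reference, a thread-safe version of the idiom using std::call_once() might look like the following (a sketch of the standard-library approach; the function name is hypothetical):

#include <mutex>  // std::once_flag, std::call_once

static std::once_flag initFlag;

void InitializeIfNeeded() {
    // The lambda body runs exactly once, no matter how many threads
    // call InitializeIfNeeded() concurrently; later callers block
    // until the first call has completed.
    std::call_once(initFlag, [] {
        // additional initialization ...
    });
}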

1.4.3 Thread Safety Expectations in pbrt

Many class methods in pbrt are required to be safe for multiple concurrent threads of execution. Particular instances of these methods must be safe either naturally, because they do not update shared global data, or because they use mutexes or atomic operations to safely perform any updates that are needed.

As a general rule, the low-level classes and structures in the system are not thread-safe. For example, the Point3f class, which stores three float values to represent a point in 3D space, is not safe for multiple threads to call methods that modify it at the same time. (Multiple threads can use Point3fs as read-only data simultaneously, of course.) The run-time overhead to make Point3f thread-safe would have a substantial effect on performance with little benefit in return.

The same is true for classes like Vector3f, Normal3f, Spectrum, Transform, Quaternion, and SurfaceInteraction. These classes are usually either created at scene construction time and then used as read-only data or allocated on the stack during rendering and used only by a single thread.

The utility classes MemoryArena (used for high-performance temporary memory allocation) and RNG (pseudo-random number generation) are also not safe for use by multiple threads; these classes store state that is modified when their methods are called, and the overhead from protecting modification to their state with mutual exclusion would be excessive relative to the amount of computation they perform. Consequently, in code like the SamplerIntegrator::Render() method above, the implementation allocates per-thread instances of these classes on the stack.
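The pattern is roughly as follows (a simplified sketch modeled on the Render() method of Section 1.3.4, assuming nTiles has been computed there); because each object is a local variable in the per-tile loop body, it is private to the thread running that iteration:

ParallelFor2D([&](Point2i tile) {
    // Stack-allocated, per-thread instances: no other thread can see
    // them, so their methods can be called without any locking.
    MemoryArena arena;
    RNG rng(tile.y * nTiles.x + tile.x);  // seeded uniquely per tile
    // ... render the tile, using arena and rng freely ...
}, nTiles);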

With two exceptions, implementations of the higher level abstract base classes listed in Table 1.1 are all expected to be safe for multiple threads to use simultaneously. With a little care, it is usually straightforward to implement specific instances of these base classes so they don’t modify any shared state in their methods.

The first exceptions are the SamplerIntegrator and Light Preprocess() methods. These are called by the system during scene construction, and implementations of them generally modify shared state in their implementations—for example, by building data structures that represent the distribution of illumination in the scene. Therefore, it’s helpful to allow the implementer to assume that only a single thread will call into these methods. (This is a separate issue from the consideration that implementations of these methods that are computationally intensive may use ParallelFor() to parallelize their computation.)

The second exception is the Sampler; its methods are also not expected to be thread safe. This is another instance where this requirement would impose an excessive performance and scalability impact; many threads simultaneously trying to get samples from a single Sampler would limit the system’s overall performance. Therefore, as described in Section 1.3.4, a unique Sampler is created for each image tile using Sampler::Clone(); this sampler can then be used for just the one tile, without any mutual exclusion overhead.
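In code, the per-tile setup is along these lines (see Section 1.3.4 for the full tile loop; the seed is derived from the tile's position so that every tile's sample values differ):

// Each tile gets its own Sampler clone; only the thread rendering
// this tile ever touches tileSampler, so no synchronization is needed.
int seed = tile.y * nTiles.x + tile.x;
std::unique_ptr<Sampler> tileSampler = sampler->Clone(seed);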

All stand-alone functions in pbrt are thread-safe (as long as multiple threads don’t pass pointers to the same data to them).