Further Reading

Hacker’s Delight (Warren 2006) is a delightful and thought-provoking exploration of bit-twiddling algorithms like those used in some of the utility routines in this appendix. Sean Anderson (2004) has a Web page filled with a collection of bit-twiddling techniques like the ones in IsPowerOf2() and RoundUpPow2() at graphics.stanford.edu/~seander/bithacks.html.
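As a concrete illustration of this flavor of trick, here is a sketch of how such routines are often written (a hypothetical version, not necessarily pbrt’s exact code): a power of 2 has exactly one bit set, so clearing its lowest set bit yields zero, and rounding up to a power of 2 can be done by smearing the highest set bit into all lower positions.

```cpp
#include <cstdint>

// Is v a power of 2? v & (v - 1) clears the lowest set bit; for a power
// of 2 that leaves zero. The leading v && ... rejects v == 0.
inline bool IsPowerOf2(int32_t v) { return v && !(v & (v - 1)); }

// Round v up to the next power of 2: decrement, smear the highest set bit
// rightward into every lower position, then increment. The initial v--
// ensures that inputs that are already powers of 2 are returned unchanged.
inline int32_t RoundUpPow2(int32_t v) {
    v--;
    v |= v >> 1;  v |= v >> 2;
    v |= v >> 4;  v |= v >> 8;
    v |= v >> 16;
    return v + 1;
}
```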

Numerical Recipes, by Press et al. (1992), and Atkinson’s book (1993) on numerical analysis both discuss algorithms for matrix inversion and solving linear systems.

The PCG random number generator was developed by O’Neill (2014). The paper describing its implementation is well written and also features extensive discussion of a range of previous pseudo-random number generators and the challenges that they have faced in passing rigorous tests of their quality (L’Ecuyer and Simard 2007).
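As a point of reference, the core of the 32-bit generator is only a few lines. The sketch below (the struct name PCG32 and method name Next() are ours) follows the widely circulated minimal C implementation from pcg-random.org, with the reference code’s default constants:

```cpp
#include <cstdint>

// Sketch of the 32-bit PCG generator (the PCG-XSH-RR variant): a 64-bit
// linear congruential state update, followed by an output permutation
// (an xorshift, then a data-dependent rotation) that scrambles the
// statistically weak low-order bits of the LCG.
struct PCG32 {
    uint64_t state = 0x853c49e6748fea9bULL;  // default seed from the reference code
    uint64_t inc = 0xda3e39cb94b95bdbULL;    // stream selector; must be odd

    uint32_t Next() {
        uint64_t oldState = state;
        state = oldState * 6364136223846793005ULL + inc;
        uint32_t xorShifted = (uint32_t)(((oldState >> 18u) ^ oldState) >> 27u);
        uint32_t rot = (uint32_t)(oldState >> 59u);
        return (xorShifted >> rot) | (xorShifted << ((32u - rot) & 31u));
    }
};
```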

Many papers have been written on cache-friendly programming techniques; only a few are surveyed here. Ericson’s chapter (2004) on high-performance programming techniques has very good coverage of this topic. Lam, Rothberg, and Wolf (1991) investigated blocking (tiling) for improving cache performance and developed techniques for selecting appropriate block sizes, given the size of the arrays and the cache size. Grunwald, Zorn, and Henderson (1993) were one of the first groups of researchers to investigate the interplay between memory allocation algorithms and the cache behavior of applications.
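To make the blocking idea concrete, here is a hypothetical sketch (not code from any of these papers) of the classic example, matrix multiplication restructured so that small tiles of the inputs are reused while they are still resident in the cache; choosing blockSize well, given the matrix and cache sizes, is exactly the problem Lam, Rothberg, and Wolf studied.

```cpp
// C = A * B for N x N row-major matrices, with loop blocking (tiling).
// Each blockSize x blockSize tile of A and B is reused many times while
// it is still in cache, rather than being evicted between uses. Assumes
// C is zero-initialized and N is a multiple of blockSize.
void BlockedMatMul(const float* A, const float* B, float* C, int N,
                   int blockSize) {
    for (int ii = 0; ii < N; ii += blockSize)
        for (int kk = 0; kk < N; kk += blockSize)
            for (int jj = 0; jj < N; jj += blockSize)
                // Multiply one tile of A by one tile of B into C.
                for (int i = ii; i < ii + blockSize; ++i)
                    for (int k = kk; k < kk + blockSize; ++k)
                        for (int j = jj; j < jj + blockSize; ++j)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```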

In pbrt, we only worry about cache layout issues for dynamically allocated data. However, Calder et al. (1998) show a profile-driven system that optimizes memory layout of global variables, constant values, data on the stack, and dynamically allocated data from the heap in order to reduce cache conflicts among them all, giving an average 30% reduction in data cache misses for the applications they studied.

Blocking for tree data structures was investigated by Chilimbi et al. (1999b); they ensured that nodes of the tree and a few levels of their children were allocated contiguously. Among other applications, they applied their tool to the layout of the acceleration octree in the Radiance renderer and reported a 42% speedup in run time. Chilimbi, Davidson, and Larus (1999a) also evaluated the effectiveness of reordering fields inside structures to improve locality.
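A minimal sketch of the field-reordering idea (hypothetical, not taken from their paper): group the fields that the traversal loop actually touches at the front of the structure so they share a cache line, and push rarely used fields to the end.

```cpp
#include <cstdint>

// "Hot" fields, read on every traversal step, are grouped so they fit in
// the node's first cache line; "cold" diagnostic data is pushed to the
// end (or, more aggressively, split out behind a pointer to a separately
// allocated cold section).
struct OctreeNode {
    float splitPoint[3];       // hot: tested on every traversal step
    uint32_t childOffsets[8];  // hot: followed to descend the tree
    uint16_t flags;            // hot
    // --- cold fields below ---
    double buildTime;          // cold: used only for statistics
    const char* debugName;     // cold: used only when debugging
};
```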

Drepper’s paper (2007) is a useful resource for understanding performance issues related to caches, cache coherence, and main memory access, particularly in multicore systems.

Boehm’s paper “Threads Cannot Be Implemented as a Library” (2005) makes the remarkable (and disconcerting) observation that multi-threading cannot be reliably implemented without the compiler having explicit knowledge that multi-threaded execution is expected. Boehm presented a number of examples that demonstrate the corresponding dangers with 2005-era compilers and language standards like those for C and C++, which had no awareness of threading. Fortunately, the C++11 and C11 standards addressed the issues that he identified.
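A small example in the spirit of the hazards Boehm describes (an illustration of the general issue, not one of his exact examples): if the flag below were a plain bool, a compiler unaware of threads could legally hoist its load out of the loop, since single-threaded semantics say nothing in the loop changes it, and the wait might never terminate. C++11’s std::atomic makes the cross-thread communication visible to the compiler and forbids that transformation.

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> done{false};  // with a plain bool, this program has a data race

void Worker() {
    // ... perform some work ...
    done = true;  // atomic store: guaranteed to become visible to main()
}

int main() {
    std::thread t(Worker);
    while (!done)  // atomic load: re-read on every iteration
        ;          // a real program would block or yield rather than busy-wait
    t.join();
    return 0;
}
```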

pbrt’s parallel for loop-based approach to multi-threading is a widely used model for multi-threaded programming; the OpenMP standard supports a similar construct (and much more) (OpenMP Architecture Review Board 2013). A slightly more general model for multi-core parallelism is offered by task systems, where a computation is broken up into a set of independent tasks that can be executed concurrently. Blumofe et al. (1996) described the task scheduler in Cilk, and Blumofe and Leiserson (1999) described the work-stealing algorithm that is the mainstay of many current high-performance task systems.
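A minimal sketch of the parallel for idea (much simpler than pbrt’s implementation, which keeps a persistent pool of worker threads and supports nested loops; the name ParallelFor and its parameters here are illustrative): worker threads repeatedly claim chunks of the index range from a shared atomic counter until the range is exhausted.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Run body(i) for every i in [0, count) across all hardware threads.
// Each worker grabs the next chunk of chunkSize indices via fetch_add,
// so the loop iterations are distributed dynamically.
void ParallelFor(int64_t count, const std::function<void(int64_t)>& body,
                 int64_t chunkSize = 64) {
    std::atomic<int64_t> nextIndex{0};
    unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t)
        workers.emplace_back([&] {
            for (;;) {
                int64_t begin = nextIndex.fetch_add(chunkSize);
                if (begin >= count) break;
                int64_t end = std::min(begin + chunkSize, count);
                for (int64_t i = begin; i < end; ++i) body(i);
            }
        });
    for (std::thread& w : workers) w.join();
}
```

A call such as ParallelFor(nPixels, [&](int64_t i) { /* work for pixel i */ }); then spreads the loop body across all cores.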

References

  1. Anderson, S. 2004. graphics.stanford.edu/~seander/bithacks.html.
  2. Atkinson, K. 1993. Elementary Numerical Analysis. New York: John Wiley & Sons.
  3. Berger, E. D., B. G. Zorn, and K. S. McKinley. 2001. Composing high-performance memory allocators. In SIGPLAN Conference on Programming Language Design and Implementation, 114–24.
  4. Berger, E. D., B. G. Zorn, and K. S. McKinley. 2002. Reconsidering custom memory allocation. In Proceedings of ACM OOPSLA 2002.
  5. Blumofe, R., and C. Leiserson. 1999. Scheduling multithreaded computations by work stealing. Journal of the ACM 46 (5), 720–48.
  6. Blumofe, R., C. Joerg, B. Kuszmaul, C. Leiserson, K. Randall, and Y. Zhou. 1996. Cilk: an efficient multithreaded runtime system. Journal of Parallel and Distributed Computing 37 (1), 55–69.
  7. Boehm, H.-J. 2005. Threads cannot be implemented as a library. ACM SIGPLAN Notices 40 (6), 261–68.
  8. Calder, B., K. Chandra, S. John, and T. Austin. 1998. Cache-conscious data placement. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), San Jose.
  9. Chilimbi, T. M., B. Davidson, and J. R. Larus. 1999a. Cache-conscious structure definition. In SIGPLAN Conference on Programming Language Design and Implementation, 13–24.
  10. Chilimbi, T. M., M. D. Hill, and J. R. Larus. 1999b. Cache-conscious structure layout. In SIGPLAN Conference on Programming Language Design and Implementation, 1–12.
  11. Drepper, U. 2007. What every programmer should know about memory. people.redhat.com/drepper/cpumemory.pdf.
  12. Ericson, C. 2004. Real-Time Collision Detection. Morgan Kaufmann Series in Interactive 3D Technology. San Francisco: Morgan Kaufmann.
  13. Grunwald, D., B. G. Zorn, and R. Henderson. 1993. Improving the cache locality of memory allocation. In SIGPLAN Conference on Programming Language Design and Implementation, 177–86.
  14. Johnstone, M. S., and P. R. Wilson. 1999. The memory fragmentation problem: solved? ACM SIGPLAN Notices 34 (3), 26–36.
  15. Kainz, F., R. Bogart, and D. Hess. 2004. The OpenEXR File Format. In R. Fernando (Ed.), GPU Gems. Reading, Massachusetts: Addison-Wesley, 425–44.
  16. L’Ecuyer, P., and R. Simard. 2007. TestU01: a C library for empirical testing of random number generators. ACM Transactions on Mathematical Software 33 (4).
  17. Lam, M. S., E. E. Rothberg, and M. E. Wolf. 1991. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), Palo Alto, California.
  18. O’Neill, M. 2014. PCG: A family of simple fast space-efficient statistically good algorithms for random number generation. Unpublished manuscript. http://www.pcg-random.org/paper.html.
  19. OpenMP Architecture Review Board. 2013. OpenMP Application Program Interface. http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf.
  20. Hennessy, J., and D. Patterson. 2006. Computer Architecture: A Quantitative Approach. San Francisco: Morgan Kaufmann.
  21. Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing (2nd ed.). Cambridge: Cambridge University Press.
  22. Truong, D. N., F. Bodin, and A. Seznec. 1998. Improving cache behavior of dynamically allocated data structures. In IEEE PACT, 322–29.
  23. Warren, H. 2006. Hacker’s Delight. Reading, Massachusetts: Addison-Wesley.
  24. Wilson, P. R., M. S. Johnstone, M. Neely, and D. Boles. 1995. Dynamic storage allocation: a survey and critical review. In Proceedings International Workshop on Memory Management, Kinross, Scotland.