A.7 Statistics

Collecting data about the run-time behavior of the system can provide a substantial amount of insight into its behavior and opportunities for improving its performance. For example, we might want to track the average number of primitive intersection tests performed for all of the rays; if this number is surprisingly high, then there may be a latent bug somewhere in the system. pbrt’s statistics system makes it possible to measure and aggregate this sort of data in a variety of ways.

It’s important to make it as easy as possible to add new measurements to track the system’s run-time behavior; the easier it is to do this, the more measurements end up being added to the system, and the more likely that “interesting” data will be discovered, leading to new insights and improvements. Therefore, adding new measurements to the system is fairly straightforward. For example, the following lines declare two counters that can be used to record how many times the corresponding events happen.

STAT_COUNTER("Integrator/Regular ray intersection tests", nIntersectionTests);
STAT_COUNTER("Integrator/Shadow ray intersection tests", nShadowTests);

As appropriate, counters can be incremented with simple statements like

++nIntersectionTests;

With no further intervention from the developer, the preceding is enough for the statistics system to be able to automatically print out nicely formatted results like the following when rendering completes:

Integrator
    Regular ray intersection tests                               752982
    Shadow ray intersection tests                               4237165

The statistics system supports the following measurements:

  • STAT_COUNTER("name", var): A simple count of the number of instances of an event. The counter variable var can be updated as if it were a regular int variable; for example, ++var and var += 10 are both valid.
  • STAT_MEMORY_COUNTER("name", var): A specialized counter for recording memory usage. In particular, the values reported at the end of rendering are in terms of kilobytes, megabytes, or gigabytes, as appropriate. The counter is updated the same way as a regular counter: var += count * sizeof(MyStruct) and so forth.
  • STAT_INT_DISTRIBUTION("name", dist): Tracks the distribution of some value; at the end of rendering, the minimum, maximum, and average of the reported values are reported. Call ReportValue(dist, val) to include val in the distribution.
  • STAT_FLOAT_DISTRIBUTION("name", dist): This counter also tracks the distribution of a value but expects floating-point values to be reported. ReportValue(dist, val) is also used to report values.
  • STAT_PERCENT("name", num, denom): Tracks how often a given event happens; the aggregate value is reported as the percentage num/denom when statistics are printed. Both num and denom can be incremented as if they were integers—for example, one might write if (event) ++num; or ++denom.
  • STAT_RATIO("name", num, denom): This tracks how often an event happens but reports the result as a ratio num/denom rather than a percentage. This is often a more useful presentation if num is often greater than denom. (For example, we might record the percentage of ray–triangle intersection tests that resulted in an intersection but the ratio of triangle intersection tests to the total number of rays traced.)
  • STAT_TIMER("name", timer): The timer can be used to record how much time is spent in a given block of code. To run the timer, one can add a declaration such as StatTimer t(&timer); to the start of a block of code; the timer will run until the declared t variable goes out of scope and the StatTimer destructor runs. Note that there is some overhead to getting the current system time, so timers shouldn’t be used for very small blocks of code.
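To make the percent-versus-ratio distinction above concrete, here is a small stand-alone sketch; the helper names and formatting are ours for illustration, not pbrt's actual report-printing code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical formatting helpers illustrating the two presentations.
std::string FormatPercent(int64_t num, int64_t denom) {
    char buf[64];
    snprintf(buf, sizeof(buf), "%.2f%%", 100.0 * double(num) / double(denom));
    return buf;
}

std::string FormatRatio(int64_t num, int64_t denom) {
    char buf[64];
    snprintf(buf, sizeof(buf), "%.2fx", double(num) / double(denom));
    return buf;
}
```

For 3 hits out of 12 tests, the percentage presentation reads "25.00%"; for 18 triangle tests over 12 rays, the ratio presentation reads "1.50x", which is easier to interpret when num exceeds denom.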

All of the macros that define statistics trackers can be used only at file scope and only in .cpp files (for reasons that will become apparent as we dig into their implementations); they specifically should not be used in header files or inside function or class definitions.

Note also that the string names provided for each measurement should be of the form “category/statistic.” When values are reported, everything under the same category is reported together (as in the preceding example).

A.7.1 Implementation

There are a number of challenges in making the statistics system both efficient and easy to use. The efficiency challenges stem from pbrt being multi-threaded: if there were no parallelism, we could associate regular integer or floating-point variables with each measurement and update them like regular variables. In the presence of multiple concurrent threads of execution, however, we must ensure that two threads don't try to modify these variables at the same time (recall the discussion of mutual exclusion in Section 1.4).

While atomic operations like those described in Section A.6.2 could be used to safely increment counters without using a mutex, there would still be a performance impact from multiple threads modifying the same location in memory. Recall from Section A.6.1 that the cache coherence protocols can introduce substantial overhead in this case. Because the statistics measurements are updated so frequently during the course of rendering, we found that an atomics-based implementation caused the overall renderer to be 10–15% slower than the current implementation, which avoids the overhead of multiple threads frequently modifying the same memory location.

The implementation here is based on having separate counters for each running thread, allowing the counters to be updated without atomics and without cache coherence overhead (since each thread increments its own counters). This approach means that in order to report statistics, it’s necessary to merge all of these per-thread counters into final aggregate values, which we’ll see is possible with a bit of trickiness.
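The core idea can be sketched in a few lines of stand-alone C++ (all of the names here are ours, not pbrt's): each thread bumps a thread_local counter with ordinary increments, and a single mutex-protected merge per thread folds the per-thread values into the aggregate when the thread finishes.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

static thread_local int64_t tlCount = 0;  // private to each thread: no atomics needed
static int64_t total = 0;                 // shared aggregate
static std::mutex totalMutex;

// Called once per thread, at thread exit: the only synchronized operation.
void MergeThreadCount() {
    std::lock_guard<std::mutex> lock(totalMutex);
    total += tlCount;
    tlCount = 0;
}

int64_t CountInThreads(int nThreads, int perThread) {
    std::vector<std::thread> threads;
    for (int t = 0; t < nThreads; ++t)
        threads.push_back(std::thread([perThread] {
            for (int i = 0; i < perThread; ++i)
                ++tlCount;        // cheap: touches only this thread's memory
            MergeThreadCount();   // one merge per thread, not per increment
        }));
    for (auto &th : threads) th.join();
    return total;
}
```

The increments in the inner loop never touch shared memory, so there is no cache coherence traffic until the final merge.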

To see how this all works, we’ll dig into the implementation for regular counters; the other types of measurements are all along similar lines. First, here is the STAT_COUNTER macro, which packs three different things into its definition.

<<Statistics Macros>>= 
#define STAT_COUNTER(title, var)                           \
    static thread_local int64_t var;                       \
    static void STATS_FUNC##var(StatsAccumulator &accum) { \
        accum.ReportCounter(title, var);                   \
        var = 0;                                           \
    }                                                      \
    static StatRegisterer STATS_REG##var(STATS_FUNC##var)

First, and most obviously, the macro defines a 64-bit integer variable named var, the second argument passed to the macro. The variable definition has the thread_local qualifier, which indicates that there should be a separate copy of the variable for each executing thread. Given these per-thread instances, we need to be able to sum together the per-thread values and to aggregate all of the individual counters into the final program output.

To this end, the macro next defines a function, giving it a (we hope!) unique name derived from var. When called, this function passes along the value of var for the current thread to an instance of the StatsAccumulator class. StatsAccumulator accumulates values of measurement statistics counters into its own storage; here is its ReportCounter() method, which totals up the given values, associating them with their respective string names.

<<StatsAccumulator Public Methods>>= 
void ReportCounter(const std::string &name, int64_t val) {
    counters[name] += val;
}

<<StatsAccumulator Private Data>>= 
std::map<std::string, int64_t> counters;

All we need now is for the STATS_FUNC##var() function to be called for each per-thread instance of each counter variable when rendering has finished; this is what the last line of the STAT_COUNTER macro takes care of. To understand what it does, first recall that in C++, constructors of global static objects run when program execution starts; thus, each static instance of the StatRegisterer class runs its constructor before main() starts running.

The constructor for StatRegisterer objects in turn adds the function passed to it to a vector that holds all of the various STATS_FUNC##var() functions. Note that because the StatRegisterer constructors run at program start time, the constructor for the funcs vector (another global static) may not yet have run. Therefore, funcs is a pointer to a vector rather than a vector, so that it can be explicitly constructed before it’s first used.

<<StatRegisterer Public Methods>>= 
StatRegisterer(std::function<void(StatsAccumulator &)> func) {
    if (!funcs)
        funcs = new std::vector<std::function<void(StatsAccumulator &)>>;
    funcs->push_back(func);
}

<<StatRegisterer Private Data>>= 
static std::vector<std::function<void(StatsAccumulator &)>> *funcs;
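The same construct-on-first-use pattern can be demonstrated in isolation. In this sketch (Registry and SumRegistered are our own names, not pbrt's), two file-scope objects register callbacks during static initialization, and the registry vector is heap-allocated by whichever constructor happens to run first:

```cpp
#include <cassert>
#include <functional>
#include <vector>

struct Registry {
    Registry(std::function<int()> f) {
        // Construct the vector on first use: safe regardless of the order in
        // which static constructors run across translation units.
        if (!funcs)
            funcs = new std::vector<std::function<int()>>;
        funcs->push_back(f);
    }
    static std::vector<std::function<int()>> *funcs;
};
std::vector<std::function<int()>> *Registry::funcs = nullptr;

// These constructors run before main(), registering their callbacks.
static Registry reg1([] { return 1; });
static Registry reg2([] { return 2; });

int SumRegistered() {
    int sum = 0;
    for (auto &f : *Registry::funcs)
        sum += f();
    return sum;
}
```

Had funcs been a static vector object rather than a pointer, a Registry constructor in another translation unit could run before the vector's own constructor, corrupting it.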

Now all the pieces can come together. Each thread launched for multi-threading in pbrt runs the function workerThreadFunc(). When this function exits at the end of program execution, it calls ReportThreadStats(), which in turn causes the per-thread measurements for the thread to be merged into the given StatsAccumulator. (The initial processing thread also calls this function after rendering has completed so that its values are reported.)

The ReportThreadStats() method uses a mutex to ensure that no other thread can be updating the StatsAccumulator concurrently and then dispatches to StatRegisterer::CallCallbacks(), passing in a single StatsAccumulator to store all of the aggregated values.

<<Statistics Definitions>>= 
void ReportThreadStats() {
    static std::mutex mutex;
    std::lock_guard<std::mutex> lock(mutex);
    StatRegisterer::CallCallbacks(statsAccumulator);
}

<<Statistics Local Variables>>= 
static StatsAccumulator statsAccumulator;

Finally, CallCallbacks() calls each of the function pointers to the STATS_FUNC##var() functions that were assembled when the program started. This is where the magic happens: because these functions are called once in each thread, each call to a STATS_FUNC##var() function reports the calling thread's value of the corresponding per-thread variable. In the end, we will have aggregated all of the various measurement values.

<<Statistics Definitions>>+=  
void StatRegisterer::CallCallbacks(StatsAccumulator &accum) {
    for (auto func : *funcs)
        func(accum);
}

The PrintStats() function prints all of the statistics that have been accumulated in StatsAccumulator. This function is called by pbrtWorldEnd() at the end of rendering. Its task is straightforward: it sorts the measurements by category and formats the values so that columns of numbers line up and so forth.

A.7.2 Profiling

When doing significant work on optimizing the performance of any sort of software system, using a stand-alone profiler is a necessity. Modern profilers provide a wealth of information about program execution—which functions, and even which lines of code, most execution time is spent in. Equally useful, many also provide information about the program's interaction with the computer system—which lines of code suffered the most from CPU cache misses, places in the program where cache coherence overhead was high, and so forth.

We have found that a useful supplement to stand-alone profilers is a simple profiling system embedded in pbrt. This system provides high-level information about the fraction of overall run time consumed by various parts of the system. The information it provides is useful for understanding how pbrt's time is distributed differently across scenes and for guiding more focused profiling investigations. In exchange for providing only high-level information, it adds minimal overhead; in practice, we have measured a roughly 2.5% increase in run time.

Here is an example of the profiler’s output for one of the test scenes:

Integrator::Render()                                        95.0 %
  Sampler::StartPixelSample()                               10.1 %
  SamplerIntegrator::Li()                                   81.1 %
    Accelerator::Intersect()                                 3.6 %
      Triangle::Intersect()                                  1.1 %
    BSDF::f()                                                0.1 %
    Direct lighting                                         75.1 %
      Accelerator::Intersect()                              16.4 %
        Triangle::Intersect()                                4.3 %

There are a few things to know in order to understand this output. First, all percentages reported are with respect to the system’s overall run time. Thus, here we can see that about 5% of the run time was spent parsing the scene file and constructing the scene, and the remainder was spent rendering it. Most of the rendering time was spent doing direct lighting calculations.

Second, note that there is a hierarchy in the reported results. For example, Accelerator::Intersect() appears twice, and we can see that 3.6% of the run time was spent in calls to the accelerator’s traversal method from Li(), but 16.4% of the run time was spent in calls to it from the direct lighting code. The percentage reported for each level of this hierarchy includes the time of all of the indented items reported underneath it.

The profiling system is based on explicit annotations in the program code to denote these phases of execution. (We haven’t included those annotations in the program source code in the book in the interests of simplicity.) In turn, pbrt uses operating-system-specific functionality to periodically interrupt the system and record which phases are currently active. At the end of execution, it counts up how many times each phase was found to be active and reports these counts with respect to the totals, as above.

As suggested by the profiling output, there are a series of categories that a phase of execution can be charged to; these are represented by the Prof enumerant.

<<Statistics Declarations>>= 
enum class Prof {
    IntegratorRender,
    SamplerIntegratorLi,
    DirectLighting,
    AccelIntersect,
    AccelIntersectP,
    TriIntersect,
    <<Remainder of Prof enum entries>> 
};

<<Remainder of Prof enum entries>>= 
TriIntersectP, ComputeScatteringFuncs, GenerateCameraRay,
BSDFEvaluation, BSSRDFEvaluation, MergeFilmTile, SplatFilm,
StartPixel, TexFiltTrilerp, TexFiltEWA, NumProfEvents

The hierarchy of reported percentages is implicitly defined by the ordering of values in this enumerant; when multiple execution phases are active, later values in the enumerant are assumed to be children of earlier ones.
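Under that convention, the innermost active phase encoded in a state bitmask is simply its highest set bit. A brief sketch, using a hypothetical helper name of our own:

```cpp
#include <cassert>
#include <cstdint>

// With later enumerant values treated as children of earlier ones, the
// innermost (deepest) active phase in a state word is its highest set bit.
// Returns -1 if no phase is active. (Not part of pbrt; for illustration only.)
int InnermostPhase(uint32_t state) {
    int phase = -1;
    for (int b = 0; b < 32; ++b)
        if (state & (1u << b))
            phase = b;  // keep updating: the last hit is the highest bit
    return phase;
}
```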

One bit in the ProfilerState variable is used to indicate whether each category in Prof is currently active. Note that this variable has the thread_local qualifier; thus, a separate instance is maintained for each program execution thread.

<<Statistics Declarations>>+=  
extern thread_local uint32_t ProfilerState;

CurrentProfilerState() makes the state available to the ParallelFor() implementation; this makes it possible for work done in parallel for loop bodies to be charged to the execution phases that were active when ParallelFor() was called.

<<Statistics Declarations>>+=  
inline uint32_t CurrentProfilerState() { return ProfilerState; }

In turn, a method like BVHAccel::Intersect() starts with a single line of code for the profiling system’s benefit:

ProfilePhase p(Prof::AccelIntersect);

The ProfilePhase helper class's constructor records that the provided phase of execution has begun; when the ProfilePhase instance goes out of scope and its destructor runs, it records that the corresponding phase is no longer active.

<<Statistics Declarations>>+= 
class ProfilePhase {
  public:
    <<ProfilePhase Public Methods>> 
  private:
    <<ProfilePhase Private Data>> 
};

The constructor starts by mapping the provided phase to a bit and then enabling that bit in ProfilerState. In the presence of recursion (notably, via recursive calls to SamplerIntegrator::Li()), it’s possible that a higher level of the recursion will already have indicated the start of an execution phase. In this case, reset is set to false, and the ProfilePhase destructor won’t clear the corresponding bit when it runs, leaving that task to the destructor of the earlier ProfilePhase instance that first set that bit.

<<ProfilePhase Public Methods>>= 
ProfilePhase(Prof p) {
    categoryBit = (1 << (int)p);
    reset = (ProfilerState & categoryBit) == 0;
    ProfilerState |= categoryBit;
}

<<ProfilePhase Private Data>>= 
bool reset;
uint32_t categoryBit;

<<ProfilePhase Public Methods>>+= 
~ProfilePhase() {
    if (reset)
        ProfilerState &= ~categoryBit;
}
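The effect of the reset flag under recursion can be verified with a simplified stand-alone guard; gState and PhaseGuard here are our own stand-ins for ProfilerState and ProfilePhase, with only the bit logic described above.

```cpp
#include <cassert>
#include <cstdint>

static uint32_t gState = 0;  // stand-in for pbrt's thread_local ProfilerState

class PhaseGuard {
  public:
    PhaseGuard(int phase) {
        bit = 1u << phase;
        reset = (gState & bit) == 0;  // false if this phase is already active
        gState |= bit;
    }
    ~PhaseGuard() {
        if (reset) gState &= ~bit;  // only the outermost guard clears the bit
    }
  private:
    bool reset;
    uint32_t bit;
};

uint32_t StateAfterNestedScopes() {
    {
        PhaseGuard outer(3);
        {
            PhaseGuard inner(3);  // same phase again, as in a recursive call
        }   // inner's destructor must NOT clear bit 3 here
        if ((gState & (1u << 3)) == 0)
            return 0xdead;  // would indicate the nested guard cleared too early
    }
    return gState;  // all guards destroyed: every bit should be cleared
}
```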

The InitProfiler() function has two tasks. First, it allocates the profileSamples array; this array has one entry for each possible combination of active phases of execution. While many of these combinations aren't actually possible, using a simple data structure like this makes it possible to very efficiently record the sample counts of the current phase of execution. (However, note that the number of entries grows as 2^n, where n is the number of distinct phases of execution in the Prof enumerant. Thus, this implementation is not suited to extremely fine-grained decomposition of profiling categories.)
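A stripped-down sketch of this indexing scheme (names ours; pbrt's actual table uses std::atomic entries, since the signal handler may race with other code): with n = 4 phases the table has 2^4 = 16 entries, and the current state bitmask indexes it directly.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kNumPhases = 4;                       // hypothetical phase count
std::vector<uint64_t> samples(1 << kNumPhases, 0);  // one tally per state bitmask

// Each profiling interrupt would bump the tally for the current state.
void RecordSample(uint32_t state) { samples[state]++; }

uint64_t TallyFor(uint32_t state) { return samples[state]; }
```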

Second, InitProfiler() uses system-dependent functionality to ask that pbrt be periodically interrupted to record the current phase of execution; the POSIX-specific code that makes this request is shown in the fragment that follows.

<<Statistics Definitions>>+=  
void InitProfiler() {
    profileSamples.reset(new std::atomic<uint64_t>[1 << NumProfEvents]);
    for (int i = 0; i < (1 << NumProfEvents); ++i)
        profileSamples[i] = 0;
    <<Set timer to periodically interrupt the system for profiling>> 
}

<<Set timer to periodically interrupt the system for profiling>>= 
#ifndef PBRT_IS_WINDOWS
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = ReportProfileSample;
    sa.sa_flags = SA_RESTART | SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    static struct itimerval timer;
    timer.it_interval.tv_sec = 0;
    timer.it_interval.tv_usec = 1000000 / 100;  // 100 Hz sampling
    timer.it_value = timer.it_interval;
    if (setitimer(ITIMER_PROF, &timer, NULL) != 0)
        Error("Timer could not be initialized");
#endif

<<Statistics Local Variables>>+= 
static std::unique_ptr<std::atomic<uint64_t>[]> profileSamples;

Each time the system interrupt occurs for a thread, ReportProfileSample() is called. This function is extremely simple and just increments the tally for the current set of active phases as encoded in ProfilerState.

<<Statistics Definitions>>+= 
static void ReportProfileSample(int, siginfo_t *, void *) {
    if (profileSamples)
        profileSamples[ProfilerState]++;
}

At the end of rendering, the ReportProfilerResults() function is called. It aggregates the profile counters and generates reports of the form above; its implementation isn’t particularly interesting with respect to its length, so we won’t include it here.