A.4 Memory Management

Memory management is often a complex issue in a system written in a language without garbage collection. The situation is mostly simple in pbrt, since most dynamic memory allocation is done as the scene description file is parsed, and most of this memory remains in use until rendering is finished. Nevertheless, there are a few issues related to memory management—most of them performance related—that warrant classes and utility routines to address them.

A.4.1 Variable Stack Allocation

Sometimes it is necessary to allocate a variable amount of memory that will be used temporarily in a single function but isn’t needed after the function returns. If only a small amount of memory is needed, the overhead of new and delete (or malloc() and free()) may be high relative to the amount of actual computation being done. Instead, it is frequently more efficient to use alloca(), which allocates memory on the stack with just a few machine instructions. This memory is automatically deallocated when the function exits, which also saves bookkeeping work in the routine that uses it.

alloca() is an extremely useful tool, but there are two pitfalls to be aware of when using it. First, because the memory is deallocated when the function that called alloca() returns, the pointer must not be returned from the function or stored in a data structure with a longer lifetime than the function that allocated it. (However, the pointer may be passed to functions called by the allocating function.) Second, stack size is limited, and so alloca() shouldn’t be used for more than a few kilobytes of storage. Unfortunately, there is no way to detect the error condition when more space is requested from alloca() than is available on the stack, so it’s important to be conservative with its use.

pbrt provides a macro that makes it easy to allocate space for a given number of objects of a given type.

<<Global Macros>>= 
#define ALLOCA(TYPE, COUNT) (TYPE *)alloca((COUNT) * sizeof(TYPE))
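
As an illustration, here is a brief sketch of how ALLOCA() might be used for a temporary buffer inside a function. The function and its parameters are hypothetical and not part of pbrt.

// Hypothetical illustration (not pbrt code): filter a variable number of
// samples through a temporary buffer. The buffer lives on the stack and is
// reclaimed automatically when the function returns; no free() is needed.
float AverageOfSquares(const float *samples, int n) {
    float *sq = ALLOCA(float, n);
    for (int i = 0; i < n; ++i)
        sq[i] = samples[i] * samples[i];
    float sum = 0;
    for (int i = 0; i < n; ++i)
        sum += sq[i];
    return n > 0 ? sum / n : 0;
}

Note that sq is never returned or stored beyond the function's lifetime, in keeping with the first pitfall described above.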

A.4.2 Cache-Friendly Memory Usage

The speed at which memory can respond to read requests has historically improved by only about 10% per year, while the computational capabilities of modern CPUs have grown much more quickly. As a result, a CPU typically has to wait a hundred or so execution cycles to read from main memory. The CPU is usually idle for much of this time, so a substantial amount of its computational potential may be lost.

One of the most effective techniques to address this problem is the judicious use of small, fast cache memory located in the CPU itself. The cache holds recently accessed data and is able to service memory requests much faster than main memory, thus greatly reducing the frequency of stalls in the CPU.

Because of the high penalty for accessing main memory, designing algorithms and data structures that make good use of the cache can substantially improve overall system performance. This section will discuss general programming techniques for improving cache performance. These techniques are used in many parts of pbrt, particularly the KdTreeAccel, BVHAccel, MIPMap, and Film. We assume that the reader has a basic familiarity with computer architecture and caching technology; readers needing a review are directed to a computer architecture text such as Hennessy and Patterson (1997). In particular, the reader should be generally familiar with topics like cache lines, cache associativity, and the difference between compulsory, capacity, and conflict misses.

One easy way to reduce the number of cache misses incurred by pbrt is to make sure that some key memory allocations are aligned with the blocks of memory that the cache manages. (pbrt’s overall performance was improved by approximately 3% when allocation for the kd-tree accelerator in Section 4.4 was rewritten to use cache-aligned allocation.) Figure A.1 illustrates the basic technique.

Figure A.1: The Layout of Three 16-Byte Objects in Memory on a System with 32-Byte Cache Lines. Cache-aligned memory allocation ensures that the address returned by the memory allocation routines is aligned with the start of a cache line. (a) The starting address is not cache aligned; the first and last of the three objects span two cache lines, such that two cache misses may be incurred when accessing their elements. (b) The starting address is cache aligned, guaranteeing that a maximum of one cache miss will be incurred per object.

The AllocAligned() and FreeAligned() functions provide an interface to allocate and release cache-aligned memory blocks. If the preprocessor constant PBRT_L1_CACHE_LINE_SIZE is not set, a default cache line size of 64 bytes is used, which is representative of many current architectures.

<<Global Constants>>+= 
#ifndef PBRT_L1_CACHE_LINE_SIZE
#define PBRT_L1_CACHE_LINE_SIZE 64
#endif

Unfortunately, there is no portable way to allocate memory aligned to a particular granularity, so AllocAligned() must call operating-system-specific functions to perform these allocations.

<<Memory Allocation Functions>>= 
void *AllocAligned(size_t size) {
#if defined(PBRT_IS_WINDOWS)
    return _aligned_malloc(size, PBRT_L1_CACHE_LINE_SIZE);
#elif defined(PBRT_IS_OPENBSD) || defined(PBRT_IS_OSX)
    void *ptr;
    if (posix_memalign(&ptr, PBRT_L1_CACHE_LINE_SIZE, size) != 0)
        ptr = nullptr;
    return ptr;
#else
    return memalign(PBRT_L1_CACHE_LINE_SIZE, size);
#endif
}

A convenience routine is also provided for allocating a collection of objects so that code like AllocAligned<Foo>(n) can be written to allocate an array of n instances of type Foo.

<<Memory Declarations>>+=  
template <typename T> T *AllocAligned(size_t count) {
    return (T *)AllocAligned(count * sizeof(T));
}

The routine for freeing aligned memory calls the corresponding operating-system-specific routine. We won’t include its implementation here.

<<Memory Declarations>>+=  
void FreeAligned(void *);
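
As a brief, hypothetical usage sketch (calling code not from pbrt), an array of structures can be allocated on cache-line boundaries and later released with the matching deallocation call:

// Hypothetical usage: allocate a cache-aligned array of 1000 nodes, use it,
// and release it with the corresponding FreeAligned() call.
struct Node { float split; int children[2]; };

void BuildExample() {
    Node *nodes = AllocAligned<Node>(1000);
    nodes[0].split = 0.5f;
    // ... build and traverse the data structure ...
    FreeAligned(nodes);
}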

Another family of techniques for improving cache performance is based on reorganizing data structures themselves. For example, using bit fields to reduce the size of a frequently used data structure can be helpful. This approach improves the spatial locality of memory access at run time, since code that accesses multiple packed values won’t incur more than one cache miss to get them all. Furthermore, by reducing the overall size of the structure, this technique can reduce capacity misses if fewer cache lines are consequently needed to store the structure.
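
For example, a sketch along the following lines (hypothetical code, not from pbrt) packs several small fields into a single 32-bit word rather than using a separate int for each:

// Hypothetical example of using bit fields to shrink a frequently used
// structure: three fields share one 32-bit word instead of occupying
// three separate ints (12 bytes).
struct CompactNodeInfo {
    unsigned int axis : 2;        // splitting axis: 0, 1, or 2
    unsigned int isLeaf : 1;      // interior node or leaf?
    unsigned int primCount : 29;  // number of primitives in a leaf
};

All three fields now share a single 4-byte word, so code that reads them together typically touches only one cache line.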

If not all of the elements of a structure are frequently accessed, there are a few possible strategies to improve cache performance. For example, if the structure has a size of 128 bytes and the computer has 64-byte cache lines, two cache misses may be needed to access it. If the commonly used fields are collected into the first 64 bytes rather than being spread throughout, then no more than one cache miss will be incurred when only those fields are needed (Truong et al. 1998).
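
A sketch of this idea (hypothetical structure, not from pbrt): if traversal code mostly reads bounds and child offsets, those fields can be placed at the start of the structure so that they fall within the first cache line, with rarely used data afterward.

// Hypothetical example: frequently accessed fields first, so they share
// the first cache line; rarely used fields follow.
struct SceneNode {
    // Hot: read on nearly every traversal step (32 bytes, within one line).
    float bounds[6];
    int children[2];
    // Cold: only touched during construction or debugging.
    char name[64];
    double buildTime;
};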

A related technique is splitting, where data structures are split into “hot” and “cold” parts, each stored in separate regions of memory. For example, given an array of some structure type, we can split it into two arrays, one for the more frequently accessed (or “hot”) portions and one for the less frequently accessed (or “cold”) portions. This way, cold data doesn’t displace useful information in the cache except when it is actually needed.
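
A minimal sketch of splitting (hypothetical types, not pbrt code): rather than one array of complete records, the hot portion is kept in its own densely packed array, with a parallel array holding the cold portion.

#include <vector>

// Hypothetical example of hot/cold splitting: per-frame code reads only the
// hot array, so cold data never competes for cache space during that work.
struct ParticleHot  { float position[3]; float radius; };
struct ParticleCold { char name[64]; double creationTime; };

struct ParticleSystem {
    std::vector<ParticleHot> hot;    // accessed every frame
    std::vector<ParticleCold> cold;  // rarely accessed; cold[i] pairs with hot[i]
};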

Cache-friendly programming is a complex engineering task, and we will not cover all the variations here. Readers are directed to the “Further Reading” section of this appendix for more information.

A.4.3 Arena-Based Allocation

Conventional wisdom says that the system’s memory allocation routines (e.g., malloc() and new) are slow and that custom allocation routines for objects that are frequently allocated or freed can provide a measurable performance gain. However, this conventional wisdom seems to be wrong. Wilson et al. (1995), Johnstone and Wilson (1999), and Berger, Zorn, and McKinley (2001, 2002) all investigated the performance impact of memory allocation in real-world applications and found that custom allocators almost always result in worse performance than a well-tuned generic system memory allocator, in both execution time and memory use.

One type of custom allocation technique that has proved to be useful in some cases is arena-based allocation, which allows the user to quickly allocate objects from a large contiguous region of memory. In this scheme, individual objects are never explicitly freed; the entire region of memory is released when the lifetime of all of the allocated objects ends. This type of memory allocator is a natural fit for many of the objects in pbrt.

There are two main advantages to arena-based allocation. First, allocation is extremely fast, usually just requiring a pointer increment. Second, it can improve locality of reference and lead to fewer cache misses, since the allocated objects are contiguous in memory. A more general dynamic memory allocator will typically prepend a bookkeeping structure to each block it returns, which adversely affects locality of reference.

pbrt provides the MemoryArena class to implement this approach; it supports variable-sized allocation from the arena.

The MemoryArena quickly allocates memory for objects of variable size by handing out pointers into a preallocated block. It does not support freeing of individual blocks of memory, only freeing of all of the memory in the arena at once. Thus, it is useful when a number of allocations need to be done quickly and all of the allocated objects have similar lifetimes.
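
Before looking at the implementation, here is a sketch of the intended usage pattern; the calling code is hypothetical and not from pbrt.

// Hypothetical usage of MemoryArena: per-item temporary allocations,
// released all at once by Reset().
void ProcessItems(int nItems) {
    MemoryArena arena;
    for (int i = 0; i < nItems; ++i) {
        // Fast, pointer-bump allocations; no individual frees.
        float *scratch = arena.Alloc<float>(256);
        scratch[0] = 0;
        // ... use scratch for this item only ...
        arena.Reset();   // reuse the arena's memory for the next item
    }
}   // the arena's destructor releases the underlying blocks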

<<Memory Declarations>>+=  
class MemoryArena {
  public:
    <<MemoryArena Public Methods>>
  private:
    <<MemoryArena Private Data>>
};

MemoryArena allocates memory in chunks of size MemoryArena::blockSize, the value of which is set by a parameter passed to the constructor. If no value is provided to the constructor, a default of 256 kB is used.

<<MemoryArena Public Methods>>= 
MemoryArena(size_t blockSize = 262144) : blockSize(blockSize) { }

<<MemoryArena Private Data>>= 
const size_t blockSize;

The implementation maintains a pointer to the current block of memory, currentBlock, and the offset of the first free location in the block, currentBlockPos. currentAllocSize stores the total size of the currentBlock allocation; it generally has the value blockSize but is larger in certain cases (discussed in the following).

<<MemoryArena Private Data>>+=  
size_t currentBlockPos = 0, currentAllocSize = 0;
uint8_t *currentBlock = nullptr;

To service an allocation request, the allocation routine first rounds the requested amount of memory up so that it meets the computer’s word alignment requirements. The routine then checks to see if the current block has enough space to handle the request, allocating a new block if necessary. Finally, it returns the pointer and updates the current block offset.

<<MemoryArena Public Methods>>+=  
void *Alloc(size_t nBytes) {
    <<Round up nBytes to minimum machine alignment>>
    if (currentBlockPos + nBytes > currentAllocSize) {
        <<Add current block to usedBlocks list>>
        <<Get new block of memory for MemoryArena>>
    }
    void *ret = currentBlock + currentBlockPos;
    currentBlockPos += nBytes;
    return ret;
}

Most modern computer architectures impose alignment requirements on the positioning of objects in memory. For example, it is frequently a requirement that float values be stored at memory locations that are word aligned. To be safe, the implementation always hands out 16-byte-aligned pointers (i.e., their address is a multiple of 16).

<<Round up nBytes to minimum machine alignment>>= 
nBytes = ((nBytes + 15) & (~15));

If a new block of memory must be dynamically allocated to service an allocation request, the MemoryArena stores the pointer to the current block of memory in the usedBlocks list so that it is not lost. Later, when MemoryArena::Reset() is called, it will be able to reuse the block for the next series of allocations.

<<Add current block to usedBlocks list>>= 
if (currentBlock) {
    usedBlocks.push_back(std::make_pair(currentAllocSize, currentBlock));
    currentBlock = nullptr;
}

MemoryArena uses two linked lists to hold pointers to blocks of memory that have been fully used as well as available blocks that were previously allocated but aren’t currently in use.

<<MemoryArena Private Data>>+= 
std::list<std::pair<size_t, uint8_t *>> usedBlocks, availableBlocks;

If a block of memory of suitable size isn’t available from availableBlocks, a new one is allocated.

<<Get new block of memory for MemoryArena>>= 
<<Try to get memory block from availableBlocks>>
if (!currentBlock) {
    currentAllocSize = std::max(nBytes, blockSize);
    currentBlock = AllocAligned<uint8_t>(currentAllocSize);
}
currentBlockPos = 0;

The allocation routine first checks to see if there are any already allocated free blocks in availableBlocks.

<<Try to get memory block from availableBlocks>>= 
for (auto iter = availableBlocks.begin(); iter != availableBlocks.end(); ++iter) {
    if (iter->first >= nBytes) {
        currentAllocSize = iter->first;
        currentBlock = iter->second;
        availableBlocks.erase(iter);
        break;
    }
}

The MemoryArena also provides a convenience template method to allocate an array of objects of the given type.

<<MemoryArena Public Methods>>+=  
template<typename T>
T *Alloc(size_t n = 1, bool runConstructor = true) {
    T *ret = (T *)Alloc(n * sizeof(T));
    if (runConstructor)
        for (size_t i = 0; i < n; ++i)
            new (&ret[i]) T();
    return ret;
}

When the user is done with all of the memory, the arena resets its offset in the current block and moves all of the memory from the usedBlocks list onto the availableBlocks list.

<<MemoryArena Public Methods>>+= 
void Reset() {
    currentBlockPos = 0;
    availableBlocks.splice(availableBlocks.begin(), usedBlocks);
}
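
Finally, the MemoryArena destructor releases all of the blocks the arena owns, and TotalAllocated() reports the total amount of memory the arena has allocated across its current, used, and available blocks.

<<MemoryArena Public Methods>>+= 
~MemoryArena() {
    FreeAligned(currentBlock);
    for (auto &block : usedBlocks) FreeAligned(block.second);
    for (auto &block : availableBlocks) FreeAligned(block.second);
}
size_t TotalAllocated() const {
    size_t total = currentAllocSize;
    for (const auto &alloc : usedBlocks) total += alloc.first;
    for (const auto &alloc : availableBlocks) total += alloc.first;
    return total;
}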

A.4.4 Blocked 2D Arrays

In C++, 2D arrays are arranged in memory so that entire rows of values are contiguous, as shown in Figure A.2(a). This is not always an optimal layout, however; for such an array indexed by (u, v), nearby (u, v) array positions will often map to distant memory locations. For all but the smallest arrays, the adjacent values in the v direction will be on different cache lines; thus, if the cost of a cache miss is incurred to reference a value at a particular location (u, v), there is no chance that handling that miss will also load into the cache the data for values (u, v+1), (u, v-1), and so on. Thus, spatially coherent array indices in (u, v) do not necessarily lead to the spatially coherent memory access patterns that modern memory caches depend on.

Figure A.2: (a) In C++, the natural layout for a 2D array of size width*height is a block of width*height entries, where the (u, v) array element is at offset u + v*width. (b) A blocked array has been split into smaller square blocks, each of which is laid out linearly. Although it is slightly more complex to find the memory location associated with a given (u, v) array position in the blocked scheme, the improvement in cache performance due to more coherent memory access patterns often more than makes up for this in overall faster performance.

To address this problem, the BlockedArray template implements a generic 2D array of values, with the items ordered in memory using a blocked memory layout, as shown in Figure A.2(b). The array is subdivided into square blocks of a small fixed size that is a power of 2. Each block is laid out row by row, as if it were a separate 2D C++ array. This organization substantially improves the memory coherence of 2D array references in practice and requires only a small amount of additional computation to determine the memory address for a particular position (Lam et al. 1991).

To ensure that the block size is a power of 2, the caller specifies its logarithm (base 2), which is given by the template parameter logBlockSize.

<<Memory Declarations>>+= 
template <typename T, int logBlockSize> class BlockedArray {
  public:
    <<BlockedArray Public Methods>>
  private:
    <<BlockedArray Private Data>>
};

The constructor allocates space for the array and optionally initializes its values from a pointer to a standard C++ array. Because the array size may not be an exact multiple of the block size, it may be necessary to round up the size in one or both directions to find the total amount of memory needed for the blocked array. The BlockedArray::RoundUp() method rounds both dimensions up to be a multiple of the block size.

<<BlockedArray Public Methods>>= 
BlockedArray(int uRes, int vRes, const T *d = nullptr)
    : uRes(uRes), vRes(vRes), uBlocks(RoundUp(uRes) >> logBlockSize) {
    int nAlloc = RoundUp(uRes) * RoundUp(vRes);
    data = AllocAligned<T>(nAlloc);
    for (int i = 0; i < nAlloc; ++i)
        new (&data[i]) T();
    if (d)
        for (int v = 0; v < vRes; ++v)
            for (int u = 0; u < uRes; ++u)
                (*this)(u, v) = d[v * uRes + u];
}

<<BlockedArray Private Data>>= 
T *data; const int uRes, vRes, uBlocks;

<<BlockedArray Public Methods>>+=  
constexpr int BlockSize() const { return 1 << logBlockSize; }
int RoundUp(int x) const {
    return (x + BlockSize() - 1) & ~(BlockSize() - 1);
}
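
The destructor runs the destructor of each array element and then frees the underlying allocation.

<<BlockedArray Public Methods>>+= 
~BlockedArray() {
    for (int i = 0; i < uRes * vRes; ++i)
        data[i].~T();
    FreeAligned(data);
}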

For convenience, the BlockedArray can also report its size in each dimension:

<<BlockedArray Public Methods>>+=  
int uSize() const { return uRes; } int vSize() const { return vRes; }

Looking up a value from a particular left-parenthesis u comma v right-parenthesis position in the array requires some indexing work to find the memory location for that value. There are two steps to this process: finding which block the value is in and finding its offset within that block. Because the block sizes are always powers of 2, the logBlockSize low-order bits in each of the u and v array positions give the offset within the block, and the high-order bits give the block number (Figure A.3).

Figure A.3: Given a binary array coordinate, the number of the block that it lies in can be found by shifting off the logBlockSize low-order bits of both u and v. For example, with a logBlockSize of 2 and thus a block size of 4, this correctly maps 1D array positions 0 through 3 to block 0, 4 through 7 to block 1, and so on. To find the offset within the particular block, it is just necessary to mask off the high-order bits, leaving the logBlockSize low-order bits. Because the block size is a power of 2, these computations can all be done with efficient bit operations.

<<BlockedArray Public Methods>>+=  
int Block(int a) const { return a >> logBlockSize; }
int Offset(int a) const { return (a & (BlockSize() - 1)); }

Then, given the block number (bu, bv) and the offset within the block (ou, ov), it is necessary to compute what memory location this maps to in the blocked array layout. First consider the task of finding the starting address of the block; since the blocks are laid out row by row, this corresponds to the block number bu + bv * uBlocks, where uBlocks is the number of blocks in the u direction. Because each block has BlockSize()*BlockSize() values in it, the product of the block number and this value gives us the offset to the start of the block. We then just need to account for the additional offset from the start of the block, which is ou + ov * BlockSize().

<<BlockedArray Public Methods>>+= 
T &operator()(int u, int v) {
    int bu = Block(u), bv = Block(v);
    int ou = Offset(u), ov = Offset(v);
    int offset = BlockSize() * BlockSize() * (uBlocks * bv + bu);
    offset += BlockSize() * ov + ou;
    return data[offset];
}
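
A const variant of the indexing operator performs the same computation, and GetLinearArray() copies the array's values out to a standard row-major layout.

<<BlockedArray Public Methods>>+= 
const T &operator()(int u, int v) const {
    int bu = Block(u), bv = Block(v);
    int ou = Offset(u), ov = Offset(v);
    int offset = BlockSize() * BlockSize() * (uBlocks * bv + bu);
    offset += BlockSize() * ov + ou;
    return data[offset];
}
void GetLinearArray(T *a) const {
    for (int v = 0; v < vRes; ++v)
        for (int u = 0; u < uRes; ++u)
            *a++ = (*this)(u, v);
}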