Swallowing the elephant (part 2)

Flush from a first victory of getting pbrt to parse Disney’s Moana island scene description a bit more efficiently, I dug into memory use next. There was still plenty more to do with respect to running time, but I thought it would be helpful to first better understand the overall lay of the land before doing more work on that.

As with runtime, I started with pbrt’s built-in statistics; pbrt has manual instrumentation around significant memory allocations to track memory usage and a summary of allocations is printed out when rendering finishes. Here’s pbrt’s original take on what it had allocated for this scene:

  Memory
    BVH tree                                             9.01 GiB
    Curves                                               1.44 GiB
    Texture MIP maps                                     2.00 GiB
    Triangle meshes                                     11.02 GiB

And as with runtime, the built-in statistics came up short, only reporting about 24 GB worth of allocations of known objects. top reported about 70 GB of memory actually used, which leaves us with about 45 GB unaccounted for. Some discrepancy is expected: dynamic memory allocators need to allocate extra space for bookkeeping, there’s some amount of fragmentation, and so forth. But 45 GB? There’s definitely something stinky to be found.

In order to figure out what was missing (and to validate the correctness of what had been tracked), I used massif to trace the actual dynamic memory allocations. It’s fairly slow, but at least it works well.

Primitive woes

The first thing I found in the massif trace was two lines of code that allocated instances of the Primitive base class that weren’t tracked in the statistics. A small oversight, easily enough fixed. With that, we now see:

    Primitives                                          24.67 GiB

Ouch. So what’s a primitive, and what’s with all that memory?

pbrt makes a distinction between a Shape, which is pure geometry (a sphere, a triangle, etc.), and a Primitive, which is the combination of a shape, a material, possibly an emission function, and which participating media, if any, are inside and outside of the shape’s surface.

There are a few variants of the Primitive base class: GeometricPrimitive, which is the common case: a vanilla combination of shape, material, etc., and TransformedPrimitive, which represents a primitive that has an additional transformation applied to it, either as an object instance, or for moving primitives with time-varying transformations. It turns out that for this scene, there’s wasted space in both of them.

GeometricPrimitive: 50% wasted space hurts

Note: there were some mistaken assumptions in the analysis here; they’re revisited in a later post.

4.3 GB was used for GeometricPrimitives. It’s a funny world to live in where 4.3 GB of RAM used is far from your biggest concern, but let’s nevertheless look at why there are 4.3 GB of GeometricPrimitives. Here are the relevant parts of the class definition:

class GeometricPrimitive : public Primitive {
    std::shared_ptr<Shape> shape;
    std::shared_ptr<Material> material;
    std::shared_ptr<AreaLight> areaLight;
    MediumInterface mediumInterface;
};

We have a vtable pointer, three more pointers, and then a MediumInterface, which holds two more pointers for a total of 48 bytes. For this scene, only a handful of the shapes are emissive, so areaLight is almost always a null pointer and there is no participating media, so the mediumInterface pointers are also both null. Thus, if we had a specialized Primitive implementation that was used instead when there was no emission function and no participating media, we’d save about half the storage for GeometricPrimitives—roughly 2 GB here.

However, I didn’t go ahead and make the fix and add a new Primitive implementation to pbrt. As much as possible we try minimize divergences between the pbrt-v3 source code on github and the system described in the book, for the simple reason that keeping them in sync makes it easier to read the book and then work with the code (and vice versa). In this case, I decided that an entirely new Primitive implementation, never mentioned in the book, would be too big a divergence. That fix will certainly be there in the next version of pbrt.

Before we continue, a rendering for sustenance:

Moana island beach rendered with pbrt-v3 at 2048x858 resolution with 256 samples per pixel. Total rendering time using a 12 core / 24 thread Google Compute Engine instance running at 2 GHz with the latest version of pbrt-v3 was 2h 25m 43s.

TransformedPrimitives: 95% wasted space really hurts

The 4.3 GB allocation for GeometricPrimitives was painful enough, but what about the 17.4 GB for TransformedPrimitives?

As mentioned before, TransformedPrimitive is both used for moving objects with a time-varying transformation and for object instances. In both cases, we want to apply an additional transformation to an existing Primitive. There are only two members of the TransformedPrimitive class:

    std::shared_ptr<Primitive> primitive;
    const AnimatedTransform PrimitiveToWorld;

So far, so good: a pointer to a primitive and a time-varying transformation. But what exactly does an AnimatedTransform store?

    const Transform *startTransform, *endTransform;
    const Float startTime, endTime;
    const bool actuallyAnimated;
    Vector3f T[2];
    Quaternion R[2];
    Matrix4x4 S[2];
    bool hasRotation;
    struct DerivativeTerm {
        // ...
        Float kc, kx, ky, kz;
    };
    DerivativeTerm c1[3], c2[3], c3[3], c4[3], c5[3];

In addition to pointers to two transformation matrices and their associated times, there’s also a decomposition of the matrices into translation, rotation, and scale components, as well as precomputed values used to bound the volume swept by moving bounding boxes (see Section 2.4.9 of Physically Based Rendering). It adds up to 456 bytes.

For this scene, nothing is moving. All we need transformation-wise for the object instances is one transformation pointer, no decomposition, and no values for bounding moving bounding boxes. (Otherwise known as 8 bytes). With a separate Primitive implementation for instanced-but-not-moving objects, the 17.4 GB would go down to just over 900 MB(!).

As with GeometricPrimitive, fixing that is a non-trivial change to what’s described in the book, so we’ll also punt it to pbrt-next for now. At least we understand what’s going on with the mess that is 24.7 GB of Primitive memory.

The transform cache embarrassment

The next biggest block of unaccounted-for memory that massif identified was the TransformCache, which was allocating about 16 GB. (Here’s a link to the original implementation.) The idea is that the same transformation matrix is often used multiple times in a scene, so it’s best to only have a single copy of it in memory and let everyone who uses it just store a pointer to the same transformation.

TransformCache used a std::map to store the cache and massif reported that 6 GB of that 16 GB was for red-black tree nodes in the std::map. That seemed awfully high; it’s 60% of what’s used for the actual transformations themselves. Let’s look at the declaration for that map:

std::map<Transform, std::pair<Transform *, Transform *>> cache;

A+ work right there: we were using entire Transforms as keys for the map. Even better, in pbrt a Transform stores two 4x4 matrices (the transformation and its inverse), so that works out to 128 bytes stuffed in each tree node, all of it completely redundant with the value stored for it.

This was maybe a sort-of ok design in a world where we were worried about the same transformation matrix used for hundreds or thousands of primitives but didn’t have too many transformation matrices in total, but it’s a terrible approach for a scene with a ton of mostly unique transformation matrices, as we have here.

Beyond the storage waste from the keys, there’s a lot of pointer-chasing involved with lookups in std::map to go through the red-black tree, so it seemed reasonable to try something completely new. Fortunately, TransformCache is barely discussed in the book, so completely rewriting it was fair game.

One last thing before we get started: another issue is evident from inspection of the Lookup() method’s signature:

void Lookup(const Transform &t, Transform **tCached,
            Transform **tCachedInverse)

When the caller provides a Transform, the cache helpfully stores and returns pointers to a transform equal to the provided one but also its inverse. To be able to do this, the original implementation would always compute and store the inverse when adding a transform to the cache, so that it would be ready to be returned.

The foolishness here is that most of the call sites that use the transform cache don’t ask for and don’t use the inverse. Thus, all sorts of memory was being wasted for unused inverse transformations.

The new implementation has a number of improvements:

It uses a hash table, which allows for faster lookups and no need to store anything besides an array of Transform *s, which basically brings memory usage down to the amount actually needed to store the Transforms.
The lookup method signature is now Transform *Lookup(const Transform &t); In the one place where the caller wants the inverse from the cache as well, it just calls Lookup() twice.

For hashing, I used the FNV1a hash function. After implementing it, I made my way to Aras’s post on hash functions; I probably should just use xxHash or CityHash instead for their better performance; I’ll probably shame myself into fixing that at some point.

With the new TransformCache implementation, overall system startup time improved markedly, to 21m 42s—an incremental savings of 5m 7s, or another 1.27x speedup. Further, the improved memory efficiency brought us from 16 GB to 5.7 GB of storage for transformation matrices, which is about as much as we need for what’s actually there, barring something like trying to take advantage of the fact that they aren’t projective and storing 3x4 matrices rather than 4x4s. (Normally, I’d be skeptical about the value of that sort of optimization, but here it’d save us over a gigabyte—real memory! It’d certainly be worth doing in a production renderer.)

A small performance optimization to finish

The generality of TransformedPrimitive costs both memory and time: the profiler had reported that a meaningful chunk of time at startup was spent in the AnimatedTransform::Decompose() function, which decomposes matrix transformations into a quaternion rotation, a translation, and a scale. Because nothing’s moving in this scene, that work is unneeded, and a quick double-check of the AnimatedTransform implementation made it clear that none of those values are accessed if the two transformation matrices are in fact the same.

With a two-line addition to the constructor to not do the transformation decompositions if they aren’t needed, we save another 1m 31s at startup time: that brings us to end today down to 20m 9s, in aggregate a 1.73x speedup from where we started.

Next time we’ll finally attack the parser for real and will look at something that started to matter once other things became faster.