Swallowing the elephant (part 5)
To conclude this series, we’ll start by looking at the performance of rendering Disney’s Moana island scene with pbrt-next, a branch of pbrt that I use to try out new ideas, where it’s possible to make more radical changes to the system than is possible in pbrt-v3, which has to stay true to the system described in the book. We’ll wrap up with a discussion of a few possible directions for further improvements, ranging from the straightforward to the perhaps slightly extreme.
Rendering time
pbrt-next has many changes in its light transport algorithms, including changes to how BSDFs are sampled and improvements to Russian roulette algorithms. It ends up tracing more rays than pbrt-v3 to render this scene, so it’s not possible to directly compare the runtime of the two renderers. Performance is generally similar, with one important exception: when rendering the view of the Moana island scene below, pbrt-v3 spends 14.5% of runtime doing ptex texture lookups, which had seemed not unreasonable to me before. pbrt-next spends just 2.2% of runtime on them. That’s awfully interesting.
Looking at the statistics, we have:1
```
pbrt-v3:
    Ptex block reads       20828624
    Ptex lookups          712324767

pbrt-next:
    Ptex block reads        3378524
    Ptex lookups          825826507
```
We can see that in pbrt-v3, a ptex block is read from disk once every 34 texture lookups. In pbrt-next, it’s only once every 244 lookups, roughly a 7x reduction in disk I/O. I guessed that this was because pbrt-next computes ray differentials for indirect rays, causing it to access higher MIP levels for them; that in turn gives a more coherent set of accesses to the ptex texture cache, reducing the number of misses and thence the amount of I/O.2 A quick test confirmed the guess: with ray differentials disabled, ptex performance was much worse.
The effect of better ptex performance went beyond the savings in computation and I/O. On a 32 CPU system, pbrt-v3 achieved only a 14.9x speedup during rendering, after the scene description was parsed. pbrt typically exhibits near-linear parallel scaling, so that was pretty disappointing. Thanks to much less lock contention in ptex, pbrt-next achieves a 29.2x speedup on a 32 CPU system and a 94.9x speedup on a 96 CPU system: back to its happy place.
Moana island beach roots rendered with pbrt at 2048x858 resolution with 256 samples per pixel. Total rendering time using a 96 vCPU Google Compute Engine instance running at 2 GHz with pbrt-next was 41m 22s. Speedup from multithreading during rendering was 94.9x. (I'm not sure what's going on with the bump mapping here.)
Future work
Reducing memory use with such a complex scene is rather addictive: saving multiple gigabytes from a change is much more exciting than the tens of MB that might be saved with a scene of lesser complexity. I have a good list of things I’m hoping to look into in the future, time permitting. Here’s an overview.
Further reducing triangle buffer memory
Even with the reuse of buffers that store the same values for multiple triangle meshes, a fair amount of memory is still used for triangle buffers. Here’s a breakdown of memory use for various triangle buffer types in the scene:
| Type | Memory |
|---|---|
| Positions | 2.5 GB |
| Normals | 2.5 GB |
| UVs | 98 MB |
| Indices | 252 MB |
I’m wary of doing anything at all to the provided vertex positions, but there are a few possibilities for the others. There are a number of memory-efficient representations for normal vectors, offering various storage/compute trade-offs. Adopting a 24-bit or 32-bit representation would reduce normal storage to 663 MB or 864 MB, respectively, saving over 1.5 GB of RAM.
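As one example from that family of techniques, here is a sketch of a 32-bit octahedral normal encoding; this isn’t pbrt’s code, and the names are made up, but it shows the flavor of the trade-off: a unit normal shrinks from twelve bytes to four at the cost of a little arithmetic on decode.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical sketch of 32-bit octahedral normal encoding (not pbrt's code).
// A unit normal is projected onto the octahedron |x|+|y|+|z| = 1 and the
// lower hemisphere is folded over the diagonals, giving two values in [-1,1]
// that are stored as 16-bit signed fixed point.
struct OctNormal { int16_t x, y; };

static float SignNotZero(float v) { return v >= 0.f ? 1.f : -1.f; }

OctNormal EncodeNormal(float nx, float ny, float nz) {
    float l1 = std::abs(nx) + std::abs(ny) + std::abs(nz);
    float px = nx / l1, py = ny / l1;
    if (nz < 0) {  // fold the lower hemisphere over the diagonals
        float fx = (1 - std::abs(py)) * SignNotZero(px);
        float fy = (1 - std::abs(px)) * SignNotZero(py);
        px = fx; py = fy;
    }
    return { (int16_t)std::lround(px * 32767.f),
             (int16_t)std::lround(py * 32767.f) };
}

void DecodeNormal(OctNormal o, float *nx, float *ny, float *nz) {
    float px = o.x / 32767.f, py = o.y / 32767.f;
    float z = 1 - std::abs(px) - std::abs(py);
    if (z < 0) {  // undo the fold
        float fx = (1 - std::abs(py)) * SignNotZero(px);
        float fy = (1 - std::abs(px)) * SignNotZero(py);
        px = fx; py = fy;
    }
    float len = std::sqrt(px * px + py * py + z * z);
    *nx = px / len; *ny = py / len; *nz = z / len;
}
```

With 16 bits per component, the decoded normal typically matches the original to about three decimal places, which is plenty for shading.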
Memory use for texture coordinates and index buffers for this scene is surprisingly low. I assume that’s because this scene has many procedurally generated plants and that all of the variations of a given plant type have the same topology (thus, index buffer) and parameterization (thus, UV coordinates). In turn, reuse of matching buffers is quite effective.
For other scenes, it would probably be fine to quantize UV texture coordinates to 16-bits or to use half-precision floats, depending on their range. For this scene, it seems that all of the texture coordinate values are either zero or one and thus, could be represented with a single bit—thus, a 32x reduction in storage for those is possible. This state of affairs is presumably thanks to the use of ptex for texturing, eliminating the need for UV atlasing. Given how little memory is used for texture coordinates now, implementing that isn’t particularly compelling.
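If one did want to quantize, the conversion is tiny; the sketch below (hypothetical, not in pbrt) stores a [0,1] texture coordinate as 16-bit fixed point, halving UV storage relative to 32-bit floats:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical sketch: quantize a texture coordinate in [0,1] to 16-bit
// fixed point. Values of exactly 0 and 1 round-trip exactly; everything
// else is recovered to within 1/131070 of the original.
uint16_t QuantizeUV(float u) {
    return (uint16_t)std::lround(u * 65535.f);
}

float DequantizeUV(uint16_t q) {
    return q / 65535.f;
}
```

The exact round trip of zero and one is what matters for this scene, given that those are apparently the only values present.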
pbrt always uses 32-bit integers for index buffers. For small meshes with fewer than 256 vertices, only 8 bits per index is needed and for meshes with fewer than 65,536 vertices, 16 bits could be used. Changing pbrt to adapt in that manner wouldn’t be too difficult. If we wanted to get fancy, we could allocate just enough bits for the indices to represent the necessary range, at the cost of some complexity in looking up their values. With just a quarter of a GB of memory used for vertex indices at this point, it’s hard to get too excited about any of those options for this scene.
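A sketch of what that adaptation might look like (hypothetical, not pbrt’s code), picking the narrowest of 8, 16, or 32 bits per index based on the mesh’s vertex count:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch: store triangle vertex indices using 1, 2, or 4 bytes
// each, chosen from the number of vertices in the mesh. Assumes a
// little-endian host, so copying the low bytes of a uint32_t is sufficient.
class CompactIndexBuffer {
  public:
    CompactIndexBuffer(const std::vector<uint32_t> &indices, size_t nVertices) {
        if (nVertices <= (1u << 8)) width = 1;
        else if (nVertices <= (1u << 16)) width = 2;
        else width = 4;
        data.resize(indices.size() * width);
        for (size_t i = 0; i < indices.size(); ++i)
            std::memcpy(&data[i * width], &indices[i], width);
    }
    uint32_t operator[](size_t i) const {
        uint32_t v = 0;
        std::memcpy(&v, &data[i * width], width);
        return v;
    }
    size_t BytesUsed() const { return data.size(); }

  private:
    int width;
    std::vector<uint8_t> data;
};
```

The `memcpy`-based lookup is the “slight overhead” trade: one unaligned load per index access in exchange for up to a 4x reduction in index storage for small meshes.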
The BVH construction memory spike
A detail that hasn’t been mentioned in the discussion of memory use so far: there’s a brief spike of 10 GB of additional memory use right before rendering. This happens when the (big) BVH for the entire scene is built. pbrt’s BVH construction code is written to work in two phases: first it creates a BVH with a traditional representation: two child pointers at each node. Once the tree is built, that tree is converted to a memory-efficient layout, where the first child of a node is immediately after it in memory and the offset to the second child is stored as an integer.
The idea behind that separation was pedagogical—that it would be easier to understand BVH construction algorithms without the messiness of their needing to also convert the tree to the compact representation along the way. The result is that memory spike, however; fixing that has some more appeal now given its impact for this scene.
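For reference, the compact layout is along these lines: a simplified sketch of pbrt’s `LinearBVHNode`, with a stand-in `Bounds3f`. The first child of an interior node is implicit, since it immediately follows its parent in the array, so only the offset of the second child needs to be stored.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in bounding box type for this sketch.
struct Bounds3f { float pMin[3], pMax[3]; };

// Simplified sketch of pbrt's compact BVH node. Interior nodes store only
// the offset of their second child; the first child is the next node in
// the flattened array.
struct LinearBVHNode {
    Bounds3f bounds;
    union {
        int primitivesOffset;   // leaf: offset into the primitive array
        int secondChildOffset;  // interior: index of the second child node
    };
    uint16_t nPrimitives;       // 0 -> interior node
    uint8_t axis;               // interior: split axis, used for traversal order
    uint8_t pad;                // pad the struct to 32 bytes
};
```

Thirty-two bytes per node means two nodes per 64-byte cache line, which is part of why the conversion pass is worth doing at all.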
Converting pointers to integers
There are many 64-bit pointers in various data structures that could be represented as 32-bit integers. For example, each `SimplePrimitive` has a pointer to a `Material`. Most `Material` instances are shared among many primitives in the scene and there are never more than a few thousand of them; as such, we could maintain a single global vector of all materials,

```cpp
std::vector<Material *> allMaterials;
```

and could then just store 32-bit integer offsets into that vector in `SimplePrimitive`, saving 4 bytes. The same trick could be used for the pointer to the `TriangleMesh` in each `Triangle` and in many other places.
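Concretely, the change might look something like this; it’s a sketch, and the field and helper names are made up rather than pbrt’s:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of replacing a Material* with a 32-bit index into a global table.
// RegisterMaterial and materialIndex are illustrative names, not pbrt's.
struct Material { float roughness; };

std::vector<Material *> allMaterials;

uint32_t RegisterMaterial(Material *m) {
    allMaterials.push_back(m);
    return (uint32_t)(allMaterials.size() - 1);
}

struct SimplePrimitive {
    uint32_t materialIndex;  // 4 bytes, versus 8 for a 64-bit pointer
    const Material *GetMaterial() const { return allMaterials[materialIndex]; }
};
```

Each access costs one extra indirection through the table, but with only a few thousand materials the table stays hot in cache.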
With that change, there’d be a slight overhead to accessing the actual pointers, and the system would be a little less clear to students trying to understand how it works. This is probably a case where, in the context of pbrt, it’s better to keep the implementation a bit more understandable at the cost of not fully optimizing memory use.
Arena-based allocation
A separate call to `new` (actually, `make_unique`, but same difference) is made for each individual `Triangle` and primitive in the scene. Those allocations all carry some bookkeeping overhead that presumably accounts for most of the five or so GB of memory unaccounted for in the statistics. Because all of those allocations have the same lifetime, lasting until rendering finishes, we could save that allocation overhead by allocating them from a memory arena.
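A minimal bump-pointer arena sketch, assuming the objects placed in it never need their destructors run; this is hypothetical code, though pbrt has its own arena along broadly similar lines:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Hypothetical bump-pointer arena: objects are carved out of large blocks,
// so per-allocation bookkeeping disappears. All memory is released at once
// when the arena is destroyed; destructors are never run, which restricts
// what can safely live in it.
class MemoryArena {
  public:
    void *Alloc(size_t n) {
        n = (n + 15) & ~size_t(15);  // keep 16-byte alignment
        if (blocks.empty() || pos + n > currentSize) {
            currentSize = std::max(n, kBlockSize);
            blocks.push_back(std::make_unique<uint8_t[]>(currentSize));
            pos = 0;
        }
        void *p = blocks.back().get() + pos;
        pos += n;
        return p;
    }
    // Placement-new an object of type T into the arena.
    template <typename T, typename... Args>
    T *New(Args &&...args) {
        return new (Alloc(sizeof(T))) T(std::forward<Args>(args)...);
    }

  private:
    static constexpr size_t kBlockSize = 256 * 1024;
    std::vector<std::unique_ptr<uint8_t[]>> blocks;
    size_t currentSize = 0, pos = 0;
};
```

The common-case allocation is just an add and a compare, and the only per-object overhead is the padding to the alignment boundary.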
vtable hacks
My last idea is gross, and I apologize, but I’m intrigued by it.

Each triangle in the scene carries an overhead of at least two vtable pointers: one for its `Triangle` and one for its `SimplePrimitive`. That’s 16 bytes. The Moana island scene has a total of 146,162,124 unique triangles, which adds up to almost 2.2 GB of highly-redundant vtable pointers.

What if we didn’t have an abstract base class for `Shape` and what if each shape implementation didn’t inherit from anything? That’d save us the vtable pointers, but of course then, given a pointer to a shape, we wouldn’t know what kind of shape it was, which would be sort of useless.
It turns out that on x86 CPUs today, only 48 bits of 64-bit pointers are actually used. As such, there are 16 extra bits that we could hijack to store some information in, such as… the type of shape that it’s pointing to. In turn, with a little bit of work, we can work our way back to being able to do the equivalent of virtual function calls.
Here’s how it would go: we’d first define a `ShapeMethods` structure that held function pointers, like this:3

```cpp
struct ShapeMethods {
    Bounds3f (*WorldBound)(void *);
    // Intersect, etc. ...
};
```
Each shape implementation would implement a bounding function, an intersection function, and so forth, taking the equivalent of a `this` pointer as its first argument, like this:

```cpp
Bounds3f TriangleWorldBound(void *t) {
    // Only Triangle pointers will ever be passed to this function.
    Triangle *tri = (Triangle *)t;
    // ...
}
```
We’d have a global table of `ShapeMethods` structures, where the nth entry was for the shape type with index n:

```cpp
ShapeMethods shapeMethods[] = {
    { TriangleWorldBound, /*...*/ },
    { CurveWorldBound, /*...*/ },
    // ...
};
```
When a shape was created, we’d encode its type into some of the unused bits in the returned pointer. Later, given a shape pointer whose method we wanted to call, we’d extract its type index from the pointer and use it to index into `shapeMethods` to find the appropriate function pointer. We’d basically be implementing a vtable manually, handling dispatch ourselves. If that were done for both shapes and primitives, we’d save 16 bytes per `Triangle`, though following a pretty grungy path to get there.
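The pointer tagging itself might look like the following sketch, which assumes 48-bit user-space virtual addresses with the upper bits zero, as on x86-64 today; the function names are illustrative:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of tagged pointers: stash a shape type index in the top 16 bits
// of a 64-bit pointer and strip it off before dereferencing. Assumes the
// upper 16 bits of user-space pointers are zero (48-bit addressing).
constexpr uintptr_t kAddressMask = (uintptr_t(1) << 48) - 1;

inline void *TagPointer(void *p, uint16_t typeIndex) {
    return (void *)(((uintptr_t)p & kAddressMask) |
                    ((uintptr_t)typeIndex << 48));
}

inline uint16_t PointerTypeIndex(const void *p) {
    return (uint16_t)((uintptr_t)p >> 48);
}

inline void *UntagPointer(void *p) {
    return (void *)((uintptr_t)p & kAddressMask);
}
```

Dispatch then reads something like `shapeMethods[PointerTypeIndex(s)].WorldBound(UntagPointer(s))`. A tagged pointer must always be untagged before any dereference, and note that features like five-level paging or pointer authentication can invalidate the 48-bit assumption.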
I assume this hack to do virtual function dispatch isn’t novel, but I haven’t so far been able to find an online reference to it. There is a Wikipedia page on tagged pointers, though that only talks about things like reference counts. Please send me an email if you know of a better reference.
With that goofball hack outlined, this series of postings is done for now. Huge thanks again to Disney for making this scene available. It’s been a delight to work with it; the gears in my head continue to churn.
notes
1. pbrt-next ends up tracing more rays for this scene than pbrt-v3 does, which presumably explains the increase in the number of lookups.
2. Ray differentials for indirect rays in pbrt-next are computed with the same hack used in the texture cache extension for pbrt-v3. It seems to work well enough, but doesn’t feel particularly principled.
3. This is how rayshade handles method dispatch, as is done when one wants the equivalent of virtual methods in C. rayshade didn’t do anything fancy to eliminate the per-object pointer, though.