Swallowing the elephant (part 5)
To conclude this series, we’ll start by looking at the performance of rendering Disney’s Moana island scene with pbrt-next, a branch of pbrt that I use to try out new ideas, where it’s possible to make more radical changes to the system than is possible in pbrt-v3, which has to stay true to the system described in the book. We’ll wrap up with a discussion of a few possible directions for further improvements, ranging from the straightforward to the perhaps slightly extreme.
Rendering time
pbrt-next has many changes in its light transport algorithms, including changes to how BSDFs are sampled and improvements to Russian roulette algorithms. It ends up tracing more rays than pbrt-v3 to render this scene, so it’s not possible to directly compare the runtime of the two renderers. Performance is generally similar, with one important exception: when rendering the view of the Moana island scene below, pbrt-v3 spends 14.5% of runtime doing ptex texture lookups, which had seemed not unreasonable to me before. pbrt-next spends just 2.2% of runtime on them. That’s awfully interesting.
Looking at the statistics, we have:1
```
pbrt-v3:
    Ptex block reads       20828624
    Ptex lookups          712324767

pbrt-next:
    Ptex block reads        3378524
    Ptex lookups          825826507
```
We can see that in pbrt-v3, a ptex block is read from disk once every 34 texture lookups. In pbrt-next, it’s only once every 244 lookups, roughly a 7x reduction in disk I/O. I guessed that this was because pbrt-next computes ray differentials for indirect rays, causing it to access higher MIP levels for them; that in turn gives a more coherent set of accesses to the ptex texture cache, reducing the number of misses and thence the amount of I/O.2 A quick test confirmed the guess: with ray differentials disabled, ptex performance was much worse.
The effect of better ptex performance went beyond the savings in computation and I/O. On a 32 CPU system, pbrt-v3 achieved only a 14.9x speedup during rendering, after the scene description was parsed. pbrt typically exhibits near-linear parallel scaling, so that was pretty disappointing. Thanks to much less lock contention in ptex, pbrt-next achieves a 29.2x speedup on a 32 CPU system and a 94.9x speedup on a 96 CPU system: back to its happy place.
Moana island beach roots rendered with pbrt at 2048x858 resolution with 256 samples per pixel. Total rendering time using a 96 vCPU Google Compute Engine instance running at 2 GHz with pbrt-next was 41m 22s. Speedup from multithreading during rendering was 94.9x. (I'm not sure what's going on with the bump mapping here.)
Future work
Reducing memory use with such a complex scene is rather addictive: saving multiple gigabytes from a change is much more exciting than the tens of MB that might be saved with a scene of lesser complexity. I have a good list of things I’m hoping to look into in the future, time permitting. Here’s an overview.
Further reducing triangle buffer memory
Even with the reuse of buffers that store the same values for multiple triangle meshes, a fair amount of memory is still used for triangle buffers. Here’s a breakdown of memory use for various triangle buffer types in the scene:
| Type | Memory |
|---|---|
| Positions | 2.5 GB |
| Normals | 2.5 GB |
| UVs | 98 MB |
| Indices | 252 MB |
I’m wary of doing anything at all to the provided vertex positions, but there are a few possibilities for the others. There are a number of memory-efficient representations for normal vectors, offering various storage/compute trade-offs. Adopting a 24-bit or 32-bit representation would reduce normal storage to 663 MB or 864 MB, respectively, saving over 1.5 GB of RAM.
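As one example from that family of techniques, here is a sketch of a 32-bit octahedral normal encoding; this isn’t pbrt’s code, and the names are made up, but it shows the flavor of the trade-off: a unit normal shrinks from twelve bytes to four at the cost of a little arithmetic on decode.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical sketch of 32-bit octahedral normal encoding (not pbrt's code).
// A unit normal is projected onto the octahedron |x|+|y|+|z| = 1 and the
// lower hemisphere is folded over the diagonals, giving two values in [-1,1]
// that are stored as 16-bit signed fixed point.
struct OctNormal { int16_t x, y; };

static float SignNotZero(float v) { return v >= 0.f ? 1.f : -1.f; }

OctNormal EncodeNormal(float nx, float ny, float nz) {
    float l1 = std::abs(nx) + std::abs(ny) + std::abs(nz);
    float px = nx / l1, py = ny / l1;
    if (nz < 0) {  // fold the lower hemisphere over the diagonals
        float fx = (1 - std::abs(py)) * SignNotZero(px);
        float fy = (1 - std::abs(px)) * SignNotZero(py);
        px = fx; py = fy;
    }
    return { (int16_t)std::lround(px * 32767.f),
             (int16_t)std::lround(py * 32767.f) };
}

void DecodeNormal(OctNormal o, float *nx, float *ny, float *nz) {
    float px = o.x / 32767.f, py = o.y / 32767.f;
    float z = 1 - std::abs(px) - std::abs(py);
    if (z < 0) {  // undo the fold
        float fx = (1 - std::abs(py)) * SignNotZero(px);
        float fy = (1 - std::abs(px)) * SignNotZero(py);
        px = fx; py = fy;
    }
    float len = std::sqrt(px * px + py * py + z * z);
    *nx = px / len; *ny = py / len; *nz = z / len;
}
```

With 16 bits per component, the decoded normal typically matches the original to about three decimal places, which is plenty for shading.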
Memory use for texture coordinates and index buffers for this scene is surprisingly low. I assume that’s because this scene has many procedurally generated plants and that all of the variations of a given plant type have the same topology (thus, index buffer) and parameterization (thus, UV coordinates). In turn, reuse of matching buffers is quite effective.
For other scenes, it would probably be fine to quantize UV texture coordinates to 16-bits or to use half-precision floats, depending on their range. For this scene, it seems that all of the texture coordinate values are either zero or one and thus, could be represented with a single bit—thus, a 32x reduction in storage for those is possible. This state of affairs is presumably thanks to the use of ptex for texturing, eliminating the need for UV atlasing. Given how little memory is used for texture coordinates now, implementing that isn’t particularly compelling.
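If one did want to quantize, the conversion is tiny; the sketch below (hypothetical, not in pbrt) stores a [0,1] texture coordinate as 16-bit fixed point, halving UV storage relative to 32-bit floats:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical sketch: quantize a texture coordinate in [0,1] to 16-bit
// fixed point. Values of exactly 0 and 1 round-trip exactly; everything
// else is recovered to within 1/131070 of the original.
uint16_t QuantizeUV(float u) {
    return (uint16_t)std::lround(u * 65535.f);
}

float DequantizeUV(uint16_t q) {
    return q / 65535.f;
}
```

The exact round trip of zero and one is what matters for this scene, given that those are apparently the only values present.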
pbrt always uses 32-bit integers for index buffers. For small meshes with fewer than 256 vertices, only 8 bits per index is needed and for meshes with fewer than 65,536 vertices, 16 bits could be used. Changing pbrt to adapt in that manner wouldn’t be too difficult. If we wanted to get fancy, we could allocate just enough bits for the indices to represent the necessary range, at the cost of some complexity in looking up their values. With just a quarter of a GB of memory used for vertex indices at this point, it’s hard to get too excited about any of those options for this scene.
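A sketch of what that adaptation might look like (hypothetical, not pbrt’s code), picking the narrowest of 8, 16, or 32 bits per index based on the mesh’s vertex count:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch: store triangle vertex indices using 1, 2, or 4 bytes
// each, chosen from the number of vertices in the mesh. Assumes a
// little-endian host, so copying the low bytes of a uint32_t is sufficient.
class CompactIndexBuffer {
  public:
    CompactIndexBuffer(const std::vector<uint32_t> &indices, size_t nVertices) {
        if (nVertices <= (1u << 8)) width = 1;
        else if (nVertices <= (1u << 16)) width = 2;
        else width = 4;
        data.resize(indices.size() * width);
        for (size_t i = 0; i < indices.size(); ++i)
            std::memcpy(&data[i * width], &indices[i], width);
    }
    uint32_t operator[](size_t i) const {
        uint32_t v = 0;
        std::memcpy(&v, &data[i * width], width);
        return v;
    }
    size_t BytesUsed() const { return data.size(); }

  private:
    int width;
    std::vector<uint8_t> data;
};
```

The `memcpy`-based lookup is the “slight overhead” trade: one unaligned load per index access in exchange for up to a 4x reduction in index storage for small meshes.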
The BVH construction memory spike
A detail that hasn’t been mentioned in the discussion of memory use so far: there’s a brief spike of 10 GB of additional memory use right before rendering. This happens when the (big) BVH for the entire scene is built. pbrt’s BVH construction code is written to work in two phases: first it creates a BVH with a traditional representation: two child pointers at each node. Once the tree is built, that tree is converted to a memory-efficient layout, where the first child of a node is immediately after it in memory and the offset to the second child is stored as an integer.
The idea behind that separation was pedagogical—that it would be easier to understand BVH construction algorithms without the messiness of their needing to also convert the tree to the compact representation along the way. The result is that memory spike, however; fixing that has some more appeal now given its impact for this scene.
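For reference, the compact layout is along these lines: a simplified sketch of pbrt’s `LinearBVHNode`, with a stand-in `Bounds3f`. The first child of an interior node is implicit, since it immediately follows its parent in the array, so only the offset of the second child needs to be stored.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in bounding box type for this sketch.
struct Bounds3f { float pMin[3], pMax[3]; };

// Simplified sketch of pbrt's compact BVH node. Interior nodes store only
// the offset of their second child; the first child is the next node in
// the flattened array.
struct LinearBVHNode {
    Bounds3f bounds;
    union {
        int primitivesOffset;   // leaf: offset into the primitive array
        int secondChildOffset;  // interior: index of the second child node
    };
    uint16_t nPrimitives;       // 0 -> interior node
    uint8_t axis;               // interior: split axis, used for traversal order
    uint8_t pad;                // pad the struct to 32 bytes
};
```

Thirty-two bytes per node means two nodes per 64-byte cache line, which is part of why the conversion pass is worth doing at all.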
Converting pointers to integers
There are many 64-bit pointers in various data structures that could be represented as 32-bit integers. For example, each `SimplePrimitive` has a pointer to a `Material`. Most `Material` instances are shared among many primitives in the scene and there are never more than a few thousand of them; as such, we could maintain a single global vector of all materials,

```cpp
std::vector<Material *> allMaterials;
```

and could then just store 32-bit integer offsets into that vector in `SimplePrimitive`, saving 4 bytes. The same trick could be used for the pointer to the `TriangleMesh` in each `Triangle` and in many other places.
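Concretely, the change might look something like this; it’s a sketch, and the field and helper names are made up rather than pbrt’s:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of replacing a Material* with a 32-bit index into a global table.
// RegisterMaterial and materialIndex are illustrative names, not pbrt's.
struct Material { float roughness; };

std::vector<Material *> allMaterials;

uint32_t RegisterMaterial(Material *m) {
    allMaterials.push_back(m);
    return (uint32_t)(allMaterials.size() - 1);
}

struct SimplePrimitive {
    uint32_t materialIndex;  // 4 bytes, versus 8 for a 64-bit pointer
    const Material *GetMaterial() const { return allMaterials[materialIndex]; }
};
```

Each access costs one extra indirection through the table, but with only a few thousand materials the table stays hot in cache.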
With that change, there’d be a slight overhead to accessing the actual pointers, and the system would be a little less clear to students trying to understand how it works. This is probably a case where, in the context of pbrt, it’s better to keep the implementation a bit more understandable at the cost of not fully optimizing memory use.
Arena-based allocation
A separate call to `new` (actually, `make_unique`, but same difference) is made for each individual `Triangle` and primitive in the scene. Those allocations all carry some bookkeeping overhead that presumably accounts for most of the five or so GB of memory unaccounted for in the statistics. Because all of those allocations have the same lifetime, lasting until rendering finishes, we could save that allocation overhead by allocating them from a memory arena.
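A minimal bump-pointer arena sketch, assuming the objects placed in it never need their destructors run; this is hypothetical code, though pbrt has its own arena along broadly similar lines:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Hypothetical bump-pointer arena: objects are carved out of large blocks,
// so per-allocation bookkeeping disappears. All memory is released at once
// when the arena is destroyed; destructors are never run, which restricts
// what can safely live in it.
class MemoryArena {
  public:
    void *Alloc(size_t n) {
        n = (n + 15) & ~size_t(15);  // keep 16-byte alignment
        if (blocks.empty() || pos + n > currentSize) {
            currentSize = std::max(n, kBlockSize);
            blocks.push_back(std::make_unique<uint8_t[]>(currentSize));
            pos = 0;
        }
        void *p = blocks.back().get() + pos;
        pos += n;
        return p;
    }
    // Placement-new an object of type T into the arena.
    template <typename T, typename... Args>
    T *New(Args &&...args) {
        return new (Alloc(sizeof(T))) T(std::forward<Args>(args)...);
    }

  private:
    static constexpr size_t kBlockSize = 256 * 1024;
    std::vector<std::unique_ptr<uint8_t[]>> blocks;
    size_t currentSize = 0, pos = 0;
};
```

The common-case allocation is just an add and a compare, and the only per-object overhead is the padding to the alignment boundary.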
vtable hacks
My last idea is gross, and I apologize, but I’m intrigued by it.

Each triangle in the scene carries an overhead of at least two vtable pointers: one for its `Triangle` and one for its `SimplePrimitive`. That’s 16 bytes. The Moana island scene has a total of 146,162,124 unique triangles, which adds up to almost 2.2 GB of highly-redundant vtable pointers.

What if we didn’t have an abstract base class for `Shape` and what if each shape implementation didn’t inherit from anything? That’d save us the vtable pointers, but of course then, given a pointer to a shape, we wouldn’t know what kind of shape it was, which would be sort of useless.
It turns out that on x86 CPUs today, only 48 bits of 64-bit pointers are actually used. As such, there are 16 extra bits that we could hijack to store some information in, such as… the type of shape that it’s pointing to. In turn, with a little bit of work, we can work our way back to being able to do the equivalent of virtual function calls.
Here’s how it would go: we’d first define a `ShapeMethods` structure that held function pointers, like this:3

```cpp
struct ShapeMethods {
    Bounds3f (*WorldBound)(void *);
    // Intersect, etc. ...
};
```
Each shape implementation would implement a bounding function, an intersection function, and so forth, taking the equivalent of a `this` pointer as its first argument, like this:

```cpp
Bounds3f TriangleWorldBound(void *t) {
    // Only Triangle pointers will ever be passed to this function.
    Triangle *tri = (Triangle *)t;
    // ...
}
```
We’d have a global table of `ShapeMethods` structures, where the nth entry was for the shape type with index n:

```cpp
ShapeMethods shapeMethods[] = {
    { TriangleWorldBound, /*...*/ },
    { CurveWorldBound, /*...*/ },
    // ...
};
```
When a shape was created, we’d encode its type into some of the unused bits in the returned pointer. Later, given a shape pointer whose method we wanted to call, we’d extract its type index from the pointer and use it to index into `shapeMethods` to find the appropriate function pointer. We’d basically be implementing a vtable manually, handling dispatch ourselves. If that were done for both shapes and primitives, we’d save 16 bytes per `Triangle`, though following a pretty grungy path to get there.
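The pointer tagging itself might look like the following sketch, which assumes 48-bit user-space virtual addresses with the upper bits zero, as on x86-64 today; the function names are illustrative:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of tagged pointers: stash a shape type index in the top 16 bits
// of a 64-bit pointer and strip it off before dereferencing. Assumes the
// upper 16 bits of user-space pointers are zero (48-bit addressing).
constexpr uintptr_t kAddressMask = (uintptr_t(1) << 48) - 1;

inline void *TagPointer(void *p, uint16_t typeIndex) {
    return (void *)(((uintptr_t)p & kAddressMask) |
                    ((uintptr_t)typeIndex << 48));
}

inline uint16_t PointerTypeIndex(const void *p) {
    return (uint16_t)((uintptr_t)p >> 48);
}

inline void *UntagPointer(void *p) {
    return (void *)((uintptr_t)p & kAddressMask);
}
```

Dispatch then reads something like `shapeMethods[PointerTypeIndex(s)].WorldBound(UntagPointer(s))`. A tagged pointer must always be untagged before any dereference, and note that features like five-level paging or pointer authentication can invalidate the 48-bit assumption.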
I assume this hack to do virtual function dispatch isn’t novel, but I haven’t so far been able to find an online reference to it. There is a Wikipedia page on tagged pointers, though that only talks about things like reference counts. Please send me an email if you know of a better reference.
With that goofball hack outlined, this series of postings is done for now. Huge thanks again to Disney for making this scene available. It’s been a delight to work with it; the gears in my head continue to churn.
notes
1. pbrt-next ends up tracing more rays for this scene than pbrt-v3 does, which presumably explains the increase in the number of lookups.
2. Ray differentials for indirect rays in pbrt-next are computed with the same hack used in the texture cache extension for pbrt-v3. It seems to work well enough, but doesn’t feel particularly principled.
3. This is how rayshade handles method dispatch, as is done when one wants the equivalent of virtual methods in C. rayshade didn’t do anything fancy to eliminate the per-object pointer, though.