Swallowing the elephant (part 2)
Flush from a first victory of getting pbrt to parse Disney’s Moana island scene description a bit more efficiently, I dug into memory use next. There was still plenty more to do with respect to running time, but I thought it would be helpful to first better understand the overall lay of the land before doing more work on that.
As with runtime, I started with pbrt’s built-in statistics; pbrt has manual instrumentation around significant memory allocations to track memory usage and a summary of allocations is printed out when rendering finishes. Here’s pbrt’s original take on what it had allocated for this scene:
Memory
BVH tree 9.01 GiB
Curves 1.44 GiB
Texture MIP maps 2.00 GiB
Triangle meshes 11.02 GiB
And as with runtime, the built-in statistics came up short, only reporting
about 24 GB worth of allocations of known objects. top
reported about 70
GB of memory actually used, which leaves us with about 45 GB unaccounted
for. Some discrepancy is expected: dynamic memory allocators need to
allocate extra space for bookkeeping, there’s some amount of fragmentation,
and so forth. But 45 GB? There’s definitely something stinky to be found.
In order to figure out what was missing (and to validate the correctness of what had been tracked), I used massif to trace the actual dynamic memory allocations. It’s fairly slow, but at least it works well.
Primitive woes
The first thing I found in the massif trace was two lines of code that
allocated instances of the Primitive
base class that weren’t tracked in
the statistics. A small oversight, easily enough
fixed.
With that, we now see:
Primitives 24.67 GiB
Ouch. So what’s a primitive, and what’s with all that memory?
pbrt makes a distinction between a Shape
, which is pure geometry (a
sphere, a triangle, etc.), and a Primitive
, which is the combination of a
shape, a material, possibly an emission function, and which participating
media, if any, are inside and outside of the shape’s surface.
There are a few
variants
of the Primitive
base class: GeometricPrimitive
, which is the common
case: a vanilla combination of shape, material, etc., and
TransformedPrimitive
, which represents a primitive that has an additional
transformation applied to it, either as an object instance, or for moving
primitives with time-varying transformations. It turns out that for this
scene, there’s wasted space in both of them.
GeometricPrimitive: 50% wasted space hurts
Note: there were some mistaken assumptions in the analysis here; they’re revisited in a later post.
4.3 GB was used for GeometricPrimitive
s. It’s a funny world to live in
where 4.3 GB of RAM used is far from your biggest concern, but let’s
nevertheless look at why there are 4.3 GB of GeometricPrimitive
s. Here
are the relevant parts of the class definition:
class GeometricPrimitive : public Primitive {
std::shared_ptr<Shape> shape;
std::shared_ptr<Material> material;
std::shared_ptr<AreaLight> areaLight;
MediumInterface mediumInterface;
};
We have a vtable
pointer, three more
pointers, and then a MediumInterface
, which holds two more pointers for a
total of 48 bytes. For this scene, only a handful of the shapes are
emissive, so areaLight
is almost always a null pointer and there is no
participating media, so the mediumInterface
pointers are also both null.
Thus, if we had a specialized Primitive
implementation that was used
instead when there was no emission function and no participating media,
we’d save about half the storage for GeometricPrimitive
s—roughly 2 GB
here.
However, I didn’t go ahead and make the fix and add a new Primitive
implementation to pbrt. As much as possible we try minimize divergences
between the pbrt-v3 source code on github and the system described in the
book, for the simple reason that keeping them in sync makes it easier to
read the book and then work with the code (and vice versa). In this case,
I decided that an entirely new Primitive
implementation, never mentioned
in the book, would be too big a divergence. That fix will certainly be
there in the next version of pbrt.
Before we continue, a rendering for sustenance:
Moana island beach rendered with pbrt-v3 at 2048x858 resolution with 256 samples per pixel. Total rendering time using a 12 core / 24 thread Google Compute Engine instance running at 2 GHz with the latest version of pbrt-v3 was 2h 25m 43s.
TransformedPrimitives: 95% wasted space really hurts
The 4.3 GB allocation for GeometricPrimitive
s was painful enough,
but what about the 17.4 GB for TransformedPrimitive
s?
As mentioned before, TransformedPrimitive
is both used for moving objects
with a time-varying transformation and for object instances. In both
cases, we want to apply an additional transformation to an existing
Primitive
. There are only two members of the TransformedPrimitive
class:
std::shared_ptr<Primitive> primitive;
const AnimatedTransform PrimitiveToWorld;
So far, so good: a pointer to a primitive and a time-varying
transformation. But what exactly does an AnimatedTransform
store?
const Transform *startTransform, *endTransform;
const Float startTime, endTime;
const bool actuallyAnimated;
Vector3f T[2];
Quaternion R[2];
Matrix4x4 S[2];
bool hasRotation;
struct DerivativeTerm {
// ...
Float kc, kx, ky, kz;
};
DerivativeTerm c1[3], c2[3], c3[3], c4[3], c5[3];
In addition to pointers to two transformation matrices and their associated times, there’s also a decomposition of the matrices into translation, rotation, and scale components, as well as precomputed values used to bound the volume swept by moving bounding boxes (see Section 2.4.9 of Physically Based Rendering). It adds up to 456 bytes.
For this scene, nothing is moving. All we need
transformation-wise for the object instances is one transformation pointer,
no decomposition, and no values for bounding moving bounding boxes.
(Otherwise known as 8 bytes). With a separate Primitive
implementation
for instanced-but-not-moving objects, the 17.4 GB would go down to just
over 900 MB(!).
As with GeometricPrimitive
, fixing that is a non-trivial change to what’s
described in the book, so we’ll also punt it to pbrt-next for now. At
least we understand what’s going on with the mess that is 24.7 GB of
Primitive
memory.
The transform cache embarrassment
The next biggest block of unaccounted-for memory that massif identified was
the TransformCache
, which was allocating about 16 GB. (Here’s a link to
the original
implementation.)
The idea is that the same transformation matrix is often used multiple
times in a scene, so it’s best to only have a single copy of it in memory
and let everyone who uses it just store a pointer to the same
transformation.
TransformCache
used a std::map
to store the cache and massif reported
that 6 GB of that 16 GB was for red-black tree nodes in the std::map
.
That seemed awfully high; it’s 60% of what’s used for the actual
transformations themselves. Let’s look at the declaration for that map:
std::map<Transform, std::pair<Transform *, Transform *>> cache;
A+ work right there: we were using entire Transform
s as keys for the map.
Even better, in pbrt a Transform
stores two 4x4 matrices (the
transformation and its inverse), so that works out to 128 bytes stuffed in
each tree node, all of it completely redundant with the value stored for it.
This was maybe a sort-of ok design in a world where we were worried about the same transformation matrix used for hundreds or thousands of primitives but didn’t have too many transformation matrices in total, but it’s a terrible approach for a scene with a ton of mostly unique transformation matrices, as we have here.
Beyond the storage waste from the keys, there’s a lot of pointer-chasing
involved with lookups in std::map
to go through the red-black tree, so it
seemed reasonable to try something completely new. Fortunately,
TransformCache
is barely discussed in the book, so completely rewriting
it was fair game.
One last thing before we get started: another issue is evident from
inspection of the Lookup()
method’s signature:
void Lookup(const Transform &t, Transform **tCached,
Transform **tCachedInverse)
When the caller provides a Transform
, the cache helpfully stores and
returns pointers to a transform equal to the provided one but also its
inverse. To be able to do this, the original implementation would always
compute and store the inverse when adding a transform to the cache, so that
it would be ready to be returned.
The foolishness here is that most of the call sites that use the transform cache don’t ask for and don’t use the inverse. Thus, all sorts of memory was being wasted for unused inverse transformations.
The new implementation has a number of improvements:
-
It uses a hash table, which allows for faster lookups and no need to store anything besides an array of
Transform *
s, which basically brings memory usage down to the amount actually needed to store theTransform
s. -
The lookup method signature is now
Transform *Lookup(const Transform &t)
; In the one place where the caller wants the inverse from the cache as well, it just callsLookup()
twice.
For hashing, I used the FNV1a hash function. After implementing it, I made my way to Aras’s post on hash functions; I probably should just use xxHash or CityHash instead for their better performance; I’ll probably shame myself into fixing that at some point.
With the new TransformCache
implementation, overall system startup time
improved markedly, to 21m 42s—an incremental savings of 5m 7s, or another
1.27x speedup. Further, the improved memory efficiency brought us from 16 GB
to 5.7 GB of storage for transformation matrices, which is about as much
as we need for what’s actually there, barring something like trying to take
advantage of the fact that they aren’t projective and storing 3x4 matrices
rather than 4x4s. (Normally, I’d be skeptical about the value of that sort
of optimization, but here it’d save us over a gigabyte—real memory! It’d
certainly be worth doing in a production renderer.)
A small performance optimization to finish
The generality of TransformedPrimitive
costs both memory and time: the
profiler had reported that a meaningful chunk of time at startup was spent
in the AnimatedTransform::Decompose()
function, which decomposes matrix
transformations into a quaternion rotation, a translation, and a scale.
Because nothing’s moving in this scene, that work is unneeded, and a quick
double-check of the AnimatedTransform
implementation made it clear that
none of those values are accessed if the two transformation matrices are in
fact the same.
With a two-line addition to the constructor to not do the transformation decompositions if they aren’t needed, we save another 1m 31s at startup time: that brings us to end today down to 20m 9s, in aggregate a 1.73x speedup from where we started.
Next time we’ll finally attack the parser for real and will look at something that started to matter once other things became faster.