Swallowing the Elephant (Part 11): Once More Unto The Beach

There are two perspectives that one might take in assessing how far things have come with pbrt-v4’s performance with Disney’s Moana Island scene: the optimist’s and the pessimist’s. As far as rendering performance goes, I can’t find much to be pessimistic about. The GPU especially has done well on that front, to the point that the time spent parsing the scene description and preparing data structures for rendering (the time to first pixel) is a significant contributor to overall run time when the GPU is used.

About time to first pixel: here is the optimist’s view, which looks at progress measured with respect to where we started.

Version	Time to first pixel	Speedup
pbrt-v3 (July 2018)	2098s	1x
pbrt-next (August 2018)	558s	3.76x
pbrt-v4 (April 2021 start)	410s	5.12x
pbrt-v4 (April 2021 end)	97s	21.6x
pbrt-v4 GPU (July 2021)	59.6s	35.2x

From that perspective, it’s been a smashing success, bringing a heavy scene from something that is nearly intolerable to render to something that has a bit of a hitch getting started, no big deal for what you get in return.

For an alternative viewpoint, here is the graph of pbrt’s CPU utilization over those 59.6 seconds up until the start of rendering, with all of those improvements in there and measured with 64 threads on a 32-core AMD 3970X CPU. As before, the x axis is measured in seconds, a value of 1 on the y axis represents all 64 threads running, and the time at which rendering starts is indicated with a vertical dashed line. Unlike the last two posts, this graph starts from the very beginning, so parsing and related work is back in there again.

CPU utilization (64 threads, starting point)

After what seemed like good progress from addressing post-parsing bottlenecks, it’s disheartening to see how bad the complete graph is—there’s still an enormous amount of space above the line, all of it wasted potential. Therefore, today it’ll be one more go at improving parallelism and reducing time to first pixel.

Before getting to work, let’s distract ourselves with an image. Here’s the beach view, again at 256 samples per pixel, rendered on both the CPU and the GPU. My hacky GPU Ptex implementation fares better here than in the roots view, though there are still issues on the sand dunes and the tree trunks.

With the version of pbrt that is today’s starting point, the RTX A6000 GPU renders its image in 36.3 seconds, taking a total of 105 seconds of wall-clock time including parsing the scene and setting up the data structures. The 32-core AMD 3970X takes 928 seconds to render it, or 1033 seconds including everything. Rendering alone is slightly over 25x faster with the GPU.

Speed of Light?

That graph made it clear that time to first pixel could be better. But how much better? The area under the graph represents the amount of CPU work that is done over the course of getting started, and summing it up gives us the total amount of CPU work required.

It’s easy enough to modify pbrt’s logging code to track the total CPU utilization, from which we can learn that it would be 463 seconds of single-threaded work to get everything up and running. Thus, in a fantasy world of perfect parallel scaling, we would start rendering after 463 / 64 ≈ 7.2 seconds on a 32-core/64-thread system, as long as the GPU didn’t become the bottleneck.

Put another way, the 59 second time to first pixel with 64 threads is slightly more than 8x worse than speed of light performance. Knowing that pbrt was still that far off was a painful realization, but it was enough to motivate giving the code more attention. So what’s holding it back?
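The arithmetic behind that speed-of-light estimate can be sketched in a few lines. This is only an illustration, not pbrt's logging code: the function names are made up for this example, and the utilization samples are assumed to be fractions of busy threads taken at a fixed interval.

```cpp
#include <cassert>
#include <vector>

// Total CPU-seconds under a utilization curve. Each sample is the fraction
// of threads that were busy (0 to 1) during one interval of dt seconds.
double TotalCpuSeconds(const std::vector<double> &utilization, double dt,
                       int nThreads) {
    double busy = 0;
    for (double u : utilization)
        busy += u;
    return busy * dt * nThreads;
}

// Lower bound on wall-clock time if cpuSeconds of work were spread
// perfectly across nThreads hardware threads.
double IdealSeconds(double cpuSeconds, int nThreads) {
    assert(nThreads > 0);
    return cpuSeconds / nThreads;
}

// How many times slower a measured time is than that lower bound.
double SlowdownVersusIdeal(double measuredSeconds, double cpuSeconds,
                           int nThreads) {
    return measuredSeconds / IdealSeconds(cpuSeconds, nThreads);
}
```

With the numbers from this post, IdealSeconds(463, 64) comes out at about 7.2 and SlowdownVersusIdeal(59.6, 463, 64) at a little over 8.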

Serialization Everywhere

A hindrance that comes from pbrt’s design is that we have thus far been trying to parallelize within distinct phases of computation but never across them. It starts reading textures only after parsing is finished; it starts creating lights only after textures are done; the BVHs for non-instanced geometry are created before the BVHs for the object instances, and so forth. This constraint means that available parallelism is often limited, making it harder to effectively scale up to use many CPU cores.

Concretely, within the first milliseconds of the start of parsing the Moana Island scene, we know that there’s an environment light with its emission specified by a PNG image; however, the slow (single threaded) reading and decompression of that PNG doesn’t get started until more than 30 seconds later when parsing has finished. Then, it’s pretty much the only work available and all the other threads are unable to do anything useful. Why not start sooner and keep an otherwise-idle thread busy while other threads are processing those few last large scene files?

Honestly, pbrt’s startup phase had mostly been designed without parallelism in mind; the biggest goal was that it be easily understood by the reader and that it be simple enough to not use too many pages in the printed book. Thus, the idea of “first there is a parsing phase”, “now there is a light creation phase”, and so forth. However, given the state of CPU utilization in that earlier graph, I spent a while mulling over whether that design should be revisited. I was sure that there was performance to be had from doing so, but I worried about making the system harder to understand and debug. At tension with the goal for pbrt to be understandable is an aspiration to show best practices and to illustrate more broadly-useful programming techniques; if that part of the system has really bad parallel scaling, then that isn’t really best practices, is it?

Something that helped tip the balance was the fact that the C++ standard library has built-in support for futures¹ (std::future has been available since C++11) and has reasonably clean support for asynchronous tasks. With the building blocks in the standard library, showing what the concepts are good for seemed like it might be worthwhile in exchange for any added complexity. And the beauty of futures is that your thread will just stall if it tries to access something before it’s ready; there’s not a risk of tricky race conditions.

As I started exploring a redesign of that part of pbrt, I found that C++’s built-ins weren’t quite right out of the box: I wanted to run asynchronous tasks using pbrt’s already-existing thread pool, which isn’t supported by std::async. Fortunately, it was under 30 lines of code to wrap up an asynchronous job into something that could run in pbrt’s task system. The other thing necessary was a small wrapper around std::future that calls into the job system to do work in the current thread if a future that is being waited on isn’t ready. That avoids deadlock, which would otherwise be a problem with a fixed number of threads in a thread pool: if every thread blocked waiting on a future, none would be left to run the jobs that produce the values.²
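To make those two pieces concrete, here is a minimal sketch of the idea. It is not pbrt's code: ThreadPool, RunOneJob(), this Future class, and the RunAsync() signature are all stand-ins invented for illustration, and the short polling wait in Get() is a simplification.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical fixed-size thread pool (pbrt's real one differs).
class ThreadPool {
  public:
    explicit ThreadPool(int nThreads) {
        assert(nThreads >= 0);
        for (int i = 0; i < nThreads; ++i)
            workers.emplace_back([this] { WorkerLoop(); });
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex);
            shutdown = true;
        }
        cv.notify_all();
        for (std::thread &t : workers) t.join();
    }
    void Enqueue(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex);
            jobs.push_back(std::move(job));
        }
        cv.notify_one();
    }
    // Runs one queued job in the calling thread; false if the queue is empty.
    bool RunOneJob() {
        std::function<void()> job;
        {
            std::lock_guard<std::mutex> lock(mutex);
            if (jobs.empty()) return false;
            job = std::move(jobs.front());
            jobs.pop_front();
        }
        job();
        return true;
    }

  private:
    void WorkerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex);
                cv.wait(lock, [this] { return shutdown || !jobs.empty(); });
                if (jobs.empty()) return;  // shutting down, queue drained
                job = std::move(jobs.front());
                jobs.pop_front();
            }
            job();
        }
    }
    std::mutex mutex;
    std::condition_variable cv;
    std::deque<std::function<void()>> jobs;
    std::vector<std::thread> workers;
    bool shutdown = false;
};

// Wrapper around std::future that helps run queued jobs instead of blocking.
template <typename T>
class Future {
  public:
    Future(std::future<T> f, ThreadPool *pool) : fut(std::move(f)), pool(pool) {}
    T Get() {
        while (fut.wait_for(std::chrono::seconds(0)) != std::future_status::ready)
            if (!pool->RunOneJob())  // nothing queued: wait briefly, then recheck
                fut.wait_for(std::chrono::milliseconds(1));
        return fut.get();
    }

  private:
    std::future<T> fut;
    ThreadPool *pool;
};

// Queues func on the pool and returns a Future for its result.
template <typename F>
auto RunAsync(ThreadPool &pool, F func) -> Future<std::invoke_result_t<F>> {
    using R = std::invoke_result_t<F>;
    // std::function needs a copyable callable, so the move-only
    // packaged_task is held through a shared_ptr.
    auto task = std::make_shared<std::packaged_task<R()>>(std::move(func));
    Future<R> result(task->get_future(), &pool);
    pool.Enqueue([task] { (*task)(); });
    return result;
}
```

The point of Get() helping out shows up when a job waits on another job: with a single worker thread (or none at all), a plain std::future::get() inside a job could wait forever on work that is still sitting in the queue, whereas here the waiting thread picks that work up itself.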

Embracing The Asynchronous Life

With infrastructure that made it easy to kick off asynchronous work, it was time to start putting it to use. An easy first step was to asynchronously create the Media objects that represent participating media; that doesn’t matter for the Moana Island scene, but it was the simplest first thing to port over. It wasn’t much code and, auspiciously, it worked the first time.

Next up: lights, the known troublemaker. Again, not too much code, and more importantly, having written it, I don’t think there’s anything too complex going on there: as soon as the renderer hears about a light in the scene, it can kick off a call to RunAsync() to create it. It holds on to a vector of Future<Light>s and then consumes the values returned in the futures much later—usually well after the time they were actually created. That change alone improved time to first pixel with the Moana Island scene by 5 seconds, which was more than enough encouragement to keep going.
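The pattern described above is roughly the following. This sketch uses std::async so that it stands alone, whereas pbrt runs the work on its own thread pool through RunAsync(); Light, CreateLight(), and SceneBuilder are simplified, hypothetical stand-ins rather than pbrt's actual types.

```cpp
#include <cassert>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for a light; the real class is far richer.
struct Light {
    std::string name;
};

// Stand-in for expensive creation work, such as reading and decompressing
// an environment map image.
Light CreateLight(const std::string &name) { return Light{name}; }

class SceneBuilder {
  public:
    // Called by the parser as soon as a light shows up in the scene file:
    // creation starts right away on another thread.
    void AddLight(const std::string &name) {
        lightFutures.push_back(
            std::async(std::launch::async, CreateLight, name));
    }

    // Called much later, once the lights are actually needed. Each get()
    // blocks only if that particular light is not finished yet.
    std::vector<Light> FinishLights() {
        std::vector<Light> lights;
        lights.reserve(lightFutures.size());
        for (std::future<Light> &f : lightFutures) {
            assert(f.valid());
            lights.push_back(f.get());
        }
        lightFutures.clear();
        return lights;
    }

  private:
    std::vector<std::future<Light>> lightFutures;
};
```

Because the futures are consumed in the order they were stored, the resulting lights keep the order in which they appeared in the scene description, no matter which one finished first.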

Textures were next and they weren’t too tricky either, mostly rearranging preexisting texture creation code. They were good for another few seconds improvement, which brought pbrt to just over 50 seconds time to first pixel. Building the top-level acceleration structures for non-instanced geometry asynchronously while BVHs for instances were being created was another easy one and gave another second or so’s improvement on top of that.

Here’s how things look with those changes. Note that the “Lights/Textures/Materials” category has disappeared into nothing, with all of that work already done during parsing and ready by the time it is needed afterward. (That work is charged to parsing in the CPU utilization reported in this graph. For simplicity, we’ll report CPU utilization by phases even as we blur the lines between them.)

CPU utilization (64 threads, asynchronous light, texture, and BVH creation)

Mind The Parser

Eyeballing that graph, we can see that we’re down to just over 10 seconds of post-parsing work; nearly 40 seconds of parsing time remain, and that’s where we ought to look for further improvements. I realized that I hadn’t ever profiled pbrt-v4’s parsing and initial scene processing code with the Moana Island scene. A quick run of perf delivered: nearly 20% of parsing time was spent in the method that is called to record each use of an object instance in the scene.

There are 39 million instances in the scene so that method gets a workout, but nearly 20% of total parsing time there seemed high. Half of that was in a single std::vector::push_back() call, so I looked more closely at what was going on with it.

That vector is used to record the uses of object instances—for each one, it needs to store both the name of the object instance being used as well as a transformation matrix. Even with 39 million instances, that still seemed excessive. I looked at the definition of the object that it stores, InstanceSceneEntity. It was essentially:

struct InstanceSceneEntity : public SceneEntity {
    AnimatedTransform *renderFromInstanceAnim = nullptr;
    Transform *renderFromInstance = nullptr;
};

So you’ve got your instance transformation in one of two flavors and then whatever we get from SceneEntity. What does it have?

struct SceneEntity {
    std::string name;
    FileLoc loc;
    ParameterDictionary parameters;
};

There’s name, which here stores the name of the object instance, there’s loc, which stores its location in a scene description file, handy if we need it for an error message, and then we’ve got ourselves a ParameterDictionary.

What’s a ParameterDictionary? Unnecessary is what it is. It’s a fairly heavy structure that stores user-specified parameters for entities in the scene description file—e.g., “this sphere has a float-valued parameter, radius, that has a value of 2.5.” It isn’t needed at all for object instances—it’s just weight with lots of extra unused data.

The fix, to stop inheriting from SceneEntity and to store name and loc directly in InstanceSceneEntity, changed 6 lines of code. Time to first pixel improved by 6 seconds, with the fix apparently also improving performance in other places that were copying InstanceSceneEntity objects. A bit more time was shaved off in a follow-on change that tuned up the hash table used in TransformCache. Together those fixes brought pbrt to 44 seconds to first pixel:
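The effect of dropping the base class can be sketched with stand-in types. Everything here is an assumption for illustration: FileLoc and ParameterDictionary are heavily simplified, HeavyInstance and SlimInstance are made-up names for the before and after layouts, and the exact sizes depend on the platform and standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-ins; pbrt's real definitions differ.
struct FileLoc {
    const char *filename = nullptr;
    int line = 1, column = 0;
};
struct ParameterDictionary {
    std::vector<std::string> params;
};
class Transform;
class AnimatedTransform;

struct SceneEntity {
    std::string name;
    FileLoc loc;
    ParameterDictionary parameters;
};

// Before: every instance use drags along an unused ParameterDictionary.
struct HeavyInstance : public SceneEntity {
    AnimatedTransform *renderFromInstanceAnim = nullptr;
    Transform *renderFromInstance = nullptr;
};

// After: only the fields that an instance use actually needs.
struct SlimInstance {
    std::string name;
    FileLoc loc;
    AnimatedTransform *renderFromInstanceAnim = nullptr;
    Transform *renderFromInstance = nullptr;
};

// Bytes no longer stored (and no longer moved when a vector grows)
// for nInstances instance uses.
std::size_t BytesSaved(std::size_t nInstances) {
    // Guard the unsigned subtraction below against wrapping around.
    assert(sizeof(SlimInstance) <= sizeof(HeavyInstance));
    return nInstances * (sizeof(HeavyInstance) - sizeof(SlimInstance));
}
```

Even a few dozen bytes per entry adds up across tens of millions of instance uses, both in memory touched and in the work that push_back() does each time the vector has to reallocate and move its elements.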

CPU utilization (64 threads, with parser improvements)

Increasing Imports

That 18 second tail of low CPU utilization at the end of parsing had become untenable. Looking at pbrt’s logs, it was easy to see that the “isBeach” and “isCoral” models were laggards there. A few additional uses of the new-ish Import statement, which allows parallel parsing, put more threads to work on those files. That was enough to bring us to 28.7 seconds to first pixel, now 73 times faster than pbrt-v3 was when all this began.

Put it all together, and here is where things stand with 64 threads:

CPU utilization (64 threads, with more use of Import)

Here now, I’m happy to stop, even with some potential left above the line. For the last ten or so seconds the GPU is busy building acceleration structures, so non-full CPU utilization there is perfectly fine. I am a little surprised that the CPU utilization isn’t better at the very start of parsing, though I haven’t dug into that further.

With 4 cores and 8 threads, the news is even better: the CPU spends much of its time at close to full utilization and it’s 53 seconds to first pixel, which is better than it was with 32 cores when we started today. Here is the CPU utilization graph for that case:

CPU utilization (8 threads)

(This graph uses the same x axis scale as the other graphs in this post but here the value 1 on the y axis corresponds to all 8 threads being busy.)

I was surprised to see pbrt spending a few seconds with low CPU utilization in “Lights/Textures/Materials” there; presumably that is waiting for the environment light source’s future, but I’m not sure why that work wouldn’t have been finished earlier along the way. I’ll also leave answering that question for another time.

Conclusion

As it usually goes with the Moana Island scene, it’s been quite a journey. Three years after its release, the complexity it offers continues to be the best kind of challenging, even after I think I’ve already learned all of my lessons from it.

Going back to that beach view from earlier in this post: at the start, it took a total of 105 seconds of wall-clock time to render with the GPU at 256 samples per pixel from start to finish. With the version of pbrt-v4 at the end, that’s down to 67 seconds, roughly evenly split between processing the scene and doing actual rendering, which makes this a fine place to wrap up this go-round with the island scene.

Notes

¹ What the heck, Google??!?
² As was learned during the initial implementation of this feature.