With memory use under control, today’s topic will be “time to first pixel” when rendering Disney’s Moana Island scene with pbrt—that is, how much time goes by between launching the renderer and the start of actual rendering. This measure covers all of the costs of system startup, including parsing the scene, creating lights, shapes, and textures, and building acceleration structures.

Time to first pixel is a useful metric in that it is often the main constraint on iteration time: have a bug in a ray intersection routine and want to see if your fix took care of it? Moved a light and want to see how the image looks? Time to first pixel is a big part of how quickly you get those answers.

Before we dig into the numbers, here is another view of the Moana Island scene rendered with pbrt-v4:

Moana Island beach view rendered with pbrt-v4. Rendering time at 1920x804 resolution with 1,024 samples per pixel was 63m58s on a 64 core AMD 3970X CPU.


The starting point was pretty ugly when I was first given access to the Moana Island scene 2.5 years ago: pbrt-v3’s time to first pixel was 34m58s, which is nothing short of horrific. By the end of my efforts then, it was down to 9m18s, a 3.76x improvement. At the time, that felt pretty good. This is, after all, a scene that exhibits the complexity of what is present in film production today (or at least, what was present 5 or so years ago when Moana was made), and so it is to be expected that there will be some work to be done before it’s ready to render.

To get started, I measured time to first pixel with the latest version of pbrt-v4 using a single thread; it was 6m50s. Good news at the start for once! Here is a table that summarizes these timings and the respective speedups:

                                 Time   Speedup
pbrt-v3 (original)              2098s        1x
pbrt-next (2.5 years ago)        558s     3.76x
pbrt-v4 (starting point today)   410s     5.12x

Where did that unexpected performance improvement come from? Part of it is that I ran the pbrt-v4 test on a modern CPU, while the earlier measurements were on a Google Compute Engine instance with what is now a 5 or so year old CPU. Thus, the latest measurement benefited from a higher CPU clock rate and a few years of microarchitectural improvements.

Another factor is an improvement to the surface area heuristic cost computation in pbrt’s BVH construction algorithm. In pbrt-v3 it used an O(n^2) algorithm, while it’s an O(n) algorithm in pbrt-v4. In my defense, n in this case is fixed at 12, which saves me from the full infamy of Dawson’s law, though it’s still pretty indefensible. Anyway, if I remember correctly, improving that in pbrt-v4 roughly doubled the performance of BVH construction, so that was surely part of it.
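To make the difference concrete, here is a sketch of the linear-time sweep over SAH buckets; for brevity it tracks a one-dimensional "extent" rather than full 3D bounding boxes and surface areas, and the names (`Bucket`, `SplitCosts`) are illustrative rather than pbrt's exact code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <limits>

// Hypothetical simplified SAH bucket: a primitive count plus a 1-D bound;
// real BVH construction tracks 3-D bounding boxes and surface areas.
struct Bucket {
    int count = 0;
    float lo = std::numeric_limits<float>::max();
    float hi = std::numeric_limits<float>::lowest();
};

constexpr int kNumBuckets = 12;

// O(n) cost sweep: a backward pass accumulates the "right side" of every
// candidate split, then a forward pass adds the "left side". The O(n^2)
// alternative recomputes both sides from scratch for each candidate split.
std::array<float, kNumBuckets - 1> SplitCosts(
        const std::array<Bucket, kNumBuckets> &buckets) {
    auto extent = [](float lo, float hi) { return std::max(0.f, hi - lo); };
    std::array<float, kNumBuckets - 1> cost{};

    // Backward sweep: cost[i] starts out as rightCount * rightExtent.
    int rightCount = 0;
    float rLo = std::numeric_limits<float>::max();
    float rHi = std::numeric_limits<float>::lowest();
    for (int i = kNumBuckets - 2; i >= 0; --i) {
        rightCount += buckets[i + 1].count;
        rLo = std::min(rLo, buckets[i + 1].lo);
        rHi = std::max(rHi, buckets[i + 1].hi);
        cost[i] = rightCount * extent(rLo, rHi);
    }

    // Forward sweep: add leftCount * leftExtent for each split.
    int leftCount = 0;
    float lLo = std::numeric_limits<float>::max();
    float lHi = std::numeric_limits<float>::lowest();
    for (int i = 0; i < kNumBuckets - 1; ++i) {
        leftCount += buckets[i].count;
        lLo = std::min(lLo, buckets[i].lo);
        lHi = std::max(lHi, buckets[i].hi);
        cost[i] += leftCount * extent(lLo, lHi);
    }
    return cost;
}
```

With n fixed at 12 the asymptotic difference is small, which is why this fix alone only roughly doubled BVH construction speed rather than transforming it.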

For a starting point, here is a plot of CPU utilization over time with single-threaded pbrt-v4 with the Moana Island scene. The horizontal axis is time in seconds and the vertical is CPU utilization, measured with respect to the 64 threads offered by an AMD 3970X CPU. There’s not a lot to see vertically, but the graph gives us a baseline and also shows where the time is going: mostly parsing and BVH construction, both for instances and for the top-level BVH.

CPU utilization with one thread

(For this and following graphs, I’ve put light and texture creation into a single category, since it’s about 8 seconds for both of them, most of those spent reading the PNG for the environment light source.)

About all those idle threads…

Over the past year, I had already spent some time working on reducing pbrt-v4’s time to first pixel. That work was largely motivated by pbrt’s GPU rendering path: it wasn’t unusual to spend more time loading the scene description than rendering it with the GPU. Optimizing startup was therefore the most effective way to speed up rendering—Amdahl’s law strikes again.

That work was easier to do with pbrt-v4’s redesign of the scene parsing code: once the high-level ParsedScene object is initialized, various opportunities for parallelism are easily harvested. With pbrt-v3, parsing the scene description was intermingled with creating the scene data structures, so there was less opportunity for extracting parallelism and all of the work until the start of rendering was single-threaded.

With pbrt-v4, it’s easy to parallelize loading textures when the parameters for all of the ones to be loaded are at hand in a single vector. Shapes are created in parallel as well; in practice, this means that meshes provided in PLY format files can be loaded in parallel. Finally, the BVHs for object instances are created in parallel, since they’re all independent. This is all opportunistic parallelism: loops over independent items that can be processed concurrently. It doesn’t scale well when there are only a few items to loop over and it’s susceptible to load imbalance, but it’s something, and it’s worth taking when it comes this cheaply.
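The pattern behind all of these is a parallel loop over independent items. A minimal sketch of that pattern follows; pbrt’s actual ParallelFor is more sophisticated (persistent thread pool, chunked index ranges), so take this as the idea rather than the implementation.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Minimal opportunistic parallel-for: worker threads repeatedly grab the
// next index from a shared atomic counter until the range is exhausted.
void ParallelFor(size_t n, const std::function<void(size_t)> &func) {
    std::atomic<size_t> next{0};
    unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t)
        workers.emplace_back([&] {
            for (size_t i = next.fetch_add(1); i < n; i = next.fetch_add(1))
                func(i);
        });
    for (auto &w : workers) w.join();
}
```

Given a vector of texture parameters, loading them concurrently then looks like `ParallelFor(textures.size(), [&](size_t i) { LoadTexture(textures[i]); });` (with `LoadTexture` as a stand-in for the real creation code).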

The BVH construction code has also been (slightly) parallelized: sub-trees are built in parallel when there are many primitives. This isn’t the state of the art in parallel BVH construction, but it, too, is something.
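The shape of that parallelization can be sketched as follows. Here the “primitives” are 1-D points and the split is a simple median rather than the SAH, and the names and threshold are made up, but the pattern is the one described above: once a node covers enough primitives to amortize the threading cost, its two children are built concurrently.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <memory>
#include <vector>

struct Node {
    float lo, hi;                       // bounds of this sub-tree
    std::unique_ptr<Node> left, right;  // null for leaf nodes
};

std::unique_ptr<Node> Build(float *prims, size_t n) {
    auto [mn, mx] = std::minmax_element(prims, prims + n);
    auto node = std::make_unique<Node>();
    node->lo = *mn;
    node->hi = *mx;
    if (n <= 2) return node;  // small enough: make a leaf

    size_t mid = n / 2;
    std::nth_element(prims, prims + mid, prims + n);  // median split
    constexpr size_t kParallelThreshold = 4096;  // made-up cutoff
    if (n >= kParallelThreshold) {
        // Enough work below this node: build the right child on another
        // thread while this thread builds the left child.
        auto rightFut =
            std::async(std::launch::async, Build, prims + mid, n - mid);
        node->left = Build(prims, mid);
        node->right = rightFut.get();
    } else {
        node->left = Build(prims, mid);
        node->right = Build(prims + mid, n - mid);
    }
    return node;
}
```

Only the upper levels of the tree spawn threads, which is why the speedup is modest: by the time there is abundant independent work, the sub-trees are small enough that they are built serially.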

Given those improvements and 64 threads, pbrt-v4 does better; here is a graph of CPU usage over time until rendering begins. Note that this graph has the same scale as the earlier one, so we can directly see how much time to first pixel has been reduced—it’s about 115 seconds less.

CPU utilization with 64 threads

The big wins come from instance BVH construction and creating the final top-level scene-wide BVH, which are sped up by factors of 4.2x and 2.4x, respectively. Neither is an exemplar of ideal parallel speedup, but again, it’s not bad for not much work.

Parsing in parallel

It is evident from that graph that parsing performance must be improved in order to make a meaningful further reduction in time to first pixel—with BVH construction performance improved, parsing is about 5/6 of the total time. After a bit of thought, I realized that pbrt-v4’s new approach to parsing and scene construction also offered the opportunity to parse the scene description in parallel.

For context, pbrt has always offered an Include directive in its scene description files; it corresponds to #include in C/C++ and is semantically the same as expanding the text of the file inline at the point where it is included. This is a handy capability to have, but it effectively requires serial processing of included files. For example, if one first Includes a file that has

Material "conductor"

and then Includes a file that has

Shape "trianglemesh" ...

then the triangle mesh will have the “conductor” material applied to it.

While one could perhaps imagine a more sophisticated implementation of Include that allowed parsing files in parallel and then patching things up afterward, I decided to add a new directive, Import. It’s the same idea as Include—parse the given file and then add the stuff described in it to the scene description—but its semantics are different. While it inherits the current graphics state—the current material, transformation matrix, and so forth—at the start of its parsing, changes to the graphics state do not persist afterward. However, the shapes, lights, object instances, and participating media that are specified in the file are added to the scene. In practice, most uses of Include can be replaced with an Import.
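As a hypothetical example of the difference (the file name here is made up), the sphere below ends up with the “diffuse” material regardless of what the imported file does to the graphics state internally:

```
Material "diffuse"
Import "palmTree.pbrt"   # even if this file sets Material "conductor",
                         # that change is dropped when the Import ends
Shape "sphere" "float radius" [ 1 ]   # still gets the "diffuse" material
```

With Include instead of Import, a Material statement in the included file would persist and apply to the sphere.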

Thanks to the ParsedScene representation, we can kick off a new thread to parse each Imported file, have it initialize a separate ParsedScene, and then merge that one in with the main ParsedScene when the thread has finished parsing. It’s a hundred or so lines of code in the end.
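That parse-then-merge structure can be sketched as below. To be clear, the names here (MiniScene, ParseFile, ParseWithImports) are illustrative stand-ins, not pbrt’s actual API, and the "parser" is a stub.

```cpp
#include <cassert>
#include <future>
#include <iterator>
#include <string>
#include <vector>

// Stand-in for the parsed-scene representation: each imported file is
// parsed on its own thread into a private scene object, which is merged
// into the main scene once its parse finishes.
struct MiniScene {
    std::vector<std::string> shapes;  // stand-in for shapes, lights, etc.
    void Merge(MiniScene &&other) {
        shapes.insert(shapes.end(),
                      std::make_move_iterator(other.shapes.begin()),
                      std::make_move_iterator(other.shapes.end()));
    }
};

// Stand-in for actually parsing a .pbrt file into a scene object.
MiniScene ParseFile(const std::string &filename) {
    return MiniScene{{filename}};
}

MiniScene ParseWithImports(const std::vector<std::string> &imports) {
    // Kick off one asynchronous parse per Imported file...
    std::vector<std::future<MiniScene>> futures;
    futures.reserve(imports.size());
    for (const auto &f : imports)
        futures.push_back(std::async(std::launch::async, ParseFile, f));
    // ...then merge each result into the main scene as it completes.
    MiniScene scene;
    for (auto &fut : futures) scene.Merge(fut.get());
    return scene;
}
```

Because each thread fills in its own private scene object, no locking is needed during parsing; synchronization happens only at the final merge.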

Turning to the Moana scene, Disney’s original pbrt conversion of it has a top-level file, island.pbrt, that then has 20 Include statements for the main parts of the scene: the geometry of the mountains, the ocean surface, the beach dunes, the various hero trees, and so forth. All of those can safely be brought in using Import.

With that simple change, parsing is 3.5x faster and time to first pixel is down to 123 seconds. Progress!

CPU utilization with top-level Import statements

Parsing time has greatly improved, though for most of that phase only four threads are running, trailing off to a single thread for the last few seconds. This is a classic load imbalance: most of the imported files are parsed quickly, but the four heaviest ones are the bottleneck.

Each of those four has a ~5GB pbrt file to be parsed along the way. I went ahead and manually split each of those into 10 or so smaller files that are themselves loaded via Import. With that, parsing has sped up by a total of 6.1x and we’re down to 97 seconds of time to first pixel. Since 64 threads are only giving a 6.1x speedup, one might expect that fewer cores would do nearly as well, and that is the case: with 16 cores, it’s 124 seconds to first pixel, and with 8, it’s 149.

CPU utilization with multiple levels of Import statements

The second half of the parsing phase is still just a few CPU cores chugging along, but I ran out of motivation to keep manually splitting up the big files; that’s the sort of thing that would ideally be done automatically by an exporter anyway.


All told, pbrt-v4 is now 21.6x faster in time to first pixel for the Moana Island scene than pbrt-v3 was originally, and 4.2x faster than where things stood at the start of this write-up, with all of that additional improvement due to parallelism. That’s satisfying progress, and I have to admit that seeing 30 threads all simultaneously working on parsing the scene description is a thrill; I wouldn’t have expected any parallelism in that phase a few weeks ago.

That said, even in the final graph there’s a lot more area above the curve where CPUs are idle than there is below where they’re getting things done. Eyeballing it, I’d guess that if the 64 CPU threads were used to their full capabilities, it might be possible to have a 20 second time to first pixel.

Getting to that point would require much more complexity in pbrt’s implementation, however: it would likely end up kicking off work as soon as it was available (“start creating this light in a separate thread”, etc.). Having all of the object creation happening concurrently with parsing would also introduce the risks of ugly bugs due to race conditions, which I’m not sure is worth it, especially for a system with pbrt’s goals. Therefore, we’ll stop there for time to first pixel, at least for the moment.

To wrap up these updates, next time we’ll look at pbrt-v4’s performance rendering this scene, focusing on the GPU rendering path.