With memory use under control, today’s topic will be “time to first pixel” when rendering Disney’s Moana Island scene with pbrt—that is, how much time goes by between launching the renderer and the start of actual rendering. This measure covers all of the costs of system startup, including parsing the scene, creating lights, shapes, and textures, and building acceleration structures.

Time to first pixel is a useful metric in that it is often the main constraint on iteration time: have a bug in a ray intersection routine and want to see if your fix took care of it? Moved a light and want to see how the image looks? Time to first pixel is a big part of how quickly you get those answers.

Before we dig into the numbers, here is another view of the Moana Island scene rendered with pbrt-v4:

Moana Island beach view rendered with pbrt-v4. Rendering time at 1920x804 resolution with 1,024 samples per pixel was 63m58s on a 64 core AMD 3970X CPU.


The starting point was pretty ugly when I was first given access to the Moana Island scene 2.5 years ago: pbrt-v3’s time to first pixel was 34m58s, which is nothing short of horrific. By the end of my efforts then, it was down to 9m18s, a 3.76x improvement. At the time, that felt pretty good. This is, after all, a scene that exhibits the complexity of what is present in film production today (or at least, what was present 5 or so years ago when Moana was made), and so it is to be expected that there will be some work to be done before it’s ready to render.

To get started, I measured time to first pixel with the latest version of pbrt-v4 using a single thread; it was 6m50s. Good news at the start for once! Here is a table that summarizes these timings and the respective speedups:

                                 Time   Speedup
pbrt-v3 (original)              2098s        1x
pbrt-next (2.5 years ago)        558s     3.76x
pbrt-v4 (starting point today)   410s     5.12x

Where did that unexpected performance improvement come from? Part of it is that I ran the pbrt-v4 test on a modern CPU, while the earlier measurements were on a Google Compute Engine instance with what is now a 5 or so year old CPU. Thus, the latest measurement benefited from a higher CPU clock rate and a few years of microarchitectural improvements.

Another factor is an improvement to the surface area heuristic cost computation in pbrt’s BVH construction algorithm. In pbrt-v3 it used an O(n^2) algorithm, while it’s an O(n) algorithm in pbrt-v4. In my defense, n in this case is fixed at 12, which saves me from the full infamy of Dawson’s law, though it’s still pretty indefensible. Anyway, if I remember correctly, improving that in pbrt-v4 roughly doubled the performance of BVH construction, so that was surely part of it.
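To make the difference concrete, here is a sketch of the linear-time sweep over SAH buckets; for brevity it tracks a one-dimensional "extent" rather than full 3D bounding boxes and surface areas, and the names (`Bucket`, `SplitCosts`) are illustrative rather than pbrt's exact code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <limits>

// Hypothetical simplified SAH bucket: a primitive count plus a 1-D bound;
// real BVH construction tracks 3-D bounding boxes and surface areas.
struct Bucket {
    int count = 0;
    float lo = std::numeric_limits<float>::max();
    float hi = std::numeric_limits<float>::lowest();
};

constexpr int kNumBuckets = 12;

// O(n) cost sweep: a backward pass accumulates the "right side" of every
// candidate split, then a forward pass adds the "left side". The O(n^2)
// alternative recomputes both sides from scratch for each candidate split.
std::array<float, kNumBuckets - 1> SplitCosts(
        const std::array<Bucket, kNumBuckets> &buckets) {
    auto extent = [](float lo, float hi) { return std::max(0.f, hi - lo); };
    std::array<float, kNumBuckets - 1> cost{};

    // Backward sweep: cost[i] starts out as rightCount * rightExtent.
    int rightCount = 0;
    float rLo = std::numeric_limits<float>::max();
    float rHi = std::numeric_limits<float>::lowest();
    for (int i = kNumBuckets - 2; i >= 0; --i) {
        rightCount += buckets[i + 1].count;
        rLo = std::min(rLo, buckets[i + 1].lo);
        rHi = std::max(rHi, buckets[i + 1].hi);
        cost[i] = rightCount * extent(rLo, rHi);
    }

    // Forward sweep: add leftCount * leftExtent for each split.
    int leftCount = 0;
    float lLo = std::numeric_limits<float>::max();
    float lHi = std::numeric_limits<float>::lowest();
    for (int i = 0; i < kNumBuckets - 1; ++i) {
        leftCount += buckets[i].count;
        lLo = std::min(lLo, buckets[i].lo);
        lHi = std::max(lHi, buckets[i].hi);
        cost[i] += leftCount * extent(lLo, lHi);
    }
    return cost;
}
```

With n fixed at 12 the asymptotic difference is small, which is why this fix alone only roughly doubled BVH construction speed rather than transforming it.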

For a starting point, here is a plot of CPU utilization over time with single-threaded pbrt-v4 with the Moana Island scene. The horizontal axis is time in seconds and the vertical is CPU utilization, measured with respect to the 64 threads offered by an AMD 3970X CPU. There’s not a lot to see vertically, but the graph gives us a baseline and also shows where the time is going: mostly parsing and BVH construction, both for instances and for the top-level BVH.

CPU utilization with one thread

(For this and following graphs, I’ve put light and texture creation into a single category, since it’s about 8 seconds for both of them, most of those spent reading the PNG for the environment light source.)

About all those idle threads…

Over the past year, I had already spent some time working on reducing pbrt-v4’s time to first pixel. That work was largely motivated by pbrt’s GPU rendering path: it wasn’t unusual to spend more time loading the scene description than rendering it with the GPU. Optimizing startup was therefore the most effective way to speed up rendering—Amdahl’s law strikes again.

That work was easier to do with pbrt-v4’s redesign of the scene parsing code: once the high-level ParsedScene object is initialized, various opportunities for parallelism are easily harvested. With pbrt-v3, parsing the scene description was intermingled with creating the scene data structures, so there was less opportunity for extracting parallelism and all of the work until the start of rendering was single-threaded.

With pbrt-v4, it’s easy to parallelize loading textures when the parameters for all of the ones to be loaded are at hand in a single vector. Shapes are created in parallel as well; in practice, this means that meshes provided in PLY format files can be loaded in parallel. Finally, the BVHs for object instances are created in parallel, since they’re all independent. This is all opportunistic parallelism: loops over independent items that can be processed concurrently. It doesn’t scale well when there are only a few items to loop over and it’s susceptible to load imbalance, but it’s something, and it’s worth taking when it comes this cheaply.
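The pattern behind all of these is a parallel loop over independent items. A minimal sketch of that pattern follows; pbrt’s actual ParallelFor is more sophisticated (persistent thread pool, chunked index ranges), so take this as the idea rather than the implementation.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Minimal opportunistic parallel-for: worker threads repeatedly grab the
// next index from a shared atomic counter until the range is exhausted.
void ParallelFor(size_t n, const std::function<void(size_t)> &func) {
    std::atomic<size_t> next{0};
    unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t)
        workers.emplace_back([&] {
            for (size_t i = next.fetch_add(1); i < n; i = next.fetch_add(1))
                func(i);
        });
    for (auto &w : workers) w.join();
}
```

Given a vector of texture parameters, loading them concurrently then looks like `ParallelFor(textures.size(), [&](size_t i) { LoadTexture(textures[i]); });` (with `LoadTexture` as a stand-in for the real creation code).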

The BVH construction code has also been (slightly) parallelized: sub-trees are built in parallel when there are many primitives. This isn’t the state of the art in parallel BVH construction, but it, too, is something.
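The shape of that parallelization can be sketched as follows. Here the “primitives” are 1-D points and the split is a simple median rather than the SAH, and the names and threshold are made up, but the pattern is the one described above: once a node covers enough primitives to amortize the threading cost, its two children are built concurrently.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <memory>
#include <vector>

struct Node {
    float lo, hi;                       // bounds of this sub-tree
    std::unique_ptr<Node> left, right;  // null for leaf nodes
};

std::unique_ptr<Node> Build(float *prims, size_t n) {
    auto [mn, mx] = std::minmax_element(prims, prims + n);
    auto node = std::make_unique<Node>();
    node->lo = *mn;
    node->hi = *mx;
    if (n <= 2) return node;  // small enough: make a leaf

    size_t mid = n / 2;
    std::nth_element(prims, prims + mid, prims + n);  // median split
    constexpr size_t kParallelThreshold = 4096;  // made-up cutoff
    if (n >= kParallelThreshold) {
        // Enough work below this node: build the right child on another
        // thread while this thread builds the left child.
        auto rightFut =
            std::async(std::launch::async, Build, prims + mid, n - mid);
        node->left = Build(prims, mid);
        node->right = rightFut.get();
    } else {
        node->left = Build(prims, mid);
        node->right = Build(prims + mid, n - mid);
    }
    return node;
}
```

Only the upper levels of the tree spawn threads, which is why the speedup is modest: by the time there is abundant independent work, the sub-trees are small enough that they are built serially.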

Given those improvements and 64 threads, pbrt-v4 does better; here is a graph of CPU usage over time until rendering begins. Note that this graph has the same scale as the earlier one, so we can directly see how much time to first pixel has been reduced—it’s about 115 seconds less.

CPU utilization with 64 threads

The big wins come from instance BVH construction and creating the final top-level scene-wide BVH, which are sped up by factors of 4.2x and 2.4x, respectively. Neither is an exemplar of ideal parallel speedup, but again, it’s not bad for not much work.

Parsing in parallel

It is evident from that graph that parsing performance must be improved in order to make a meaningful further reduction in time to first pixel—with BVH construction performance improved, parsing is about 5/6 of the total time. After a bit of thought, I realized that pbrt-v4’s new approach to parsing and scene construction also offered the opportunity to parse the scene description in parallel.

For context, pbrt has always offered an Include directive in its scene description files; it corresponds to #include in C/C++ and is semantically the same as expanding the text of the file inline at the point where it is included. This is a handy capability to have, but it effectively requires serial processing of included files. For example, if one first Includes a file that has

Material "conductor"

and then Includes a file that has

Shape "trianglemesh" ...

then the triangle mesh will have the “conductor” material applied to it.

While one could perhaps imagine a more sophisticated implementation of Include that allowed parsing files in parallel and then patching things up afterward, I decided to add a new directive, Import. It’s the same idea as Include—parse the given file and then add the stuff described in it to the scene description—but its semantics are different. While it inherits the current graphics state—the current material, transformation matrix, and so forth—at the start of its parsing, changes to the graphics state do not persist afterward. However, the shapes, lights, object instances, and participating media that are specified in the file are added to the scene. In practice, most uses of Include can be replaced with an Import.
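As a hypothetical example of the difference (the file name here is made up), the sphere below ends up with the “diffuse” material regardless of what the imported file does to the graphics state internally:

```
Material "diffuse"
Import "palmTree.pbrt"   # even if this file sets Material "conductor",
                         # that change is dropped when the Import ends
Shape "sphere" "float radius" [ 1 ]   # still gets the "diffuse" material
```

With Include instead of Import, a Material statement in the included file would persist and apply to the sphere.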

Thanks to the ParsedScene representation, we can kick off a new thread to parse each Imported file, have it initialize a separate ParsedScene, and then merge that one in with the main ParsedScene when the thread has finished parsing. It’s a hundred or so lines of code in the end.
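That parse-then-merge structure can be sketched as below. To be clear, the names here (MiniScene, ParseFile, ParseWithImports) are illustrative stand-ins, not pbrt’s actual API, and the "parser" is a stub.

```cpp
#include <cassert>
#include <future>
#include <iterator>
#include <string>
#include <vector>

// Stand-in for the parsed-scene representation: each imported file is
// parsed on its own thread into a private scene object, which is merged
// into the main scene once its parse finishes.
struct MiniScene {
    std::vector<std::string> shapes;  // stand-in for shapes, lights, etc.
    void Merge(MiniScene &&other) {
        shapes.insert(shapes.end(),
                      std::make_move_iterator(other.shapes.begin()),
                      std::make_move_iterator(other.shapes.end()));
    }
};

// Stand-in for actually parsing a .pbrt file into a scene object.
MiniScene ParseFile(const std::string &filename) {
    return MiniScene{{filename}};
}

MiniScene ParseWithImports(const std::vector<std::string> &imports) {
    // Kick off one asynchronous parse per Imported file...
    std::vector<std::future<MiniScene>> futures;
    futures.reserve(imports.size());
    for (const auto &f : imports)
        futures.push_back(std::async(std::launch::async, ParseFile, f));
    // ...then merge each result into the main scene as it completes.
    MiniScene scene;
    for (auto &fut : futures) scene.Merge(fut.get());
    return scene;
}
```

Because each thread fills in its own private scene object, no locking is needed during parsing; synchronization happens only at the final merge.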

Turning to the Moana scene, Disney’s original pbrt conversion of it has a top-level file, island.pbrt, that then has 20 Include statements for the main parts of the scene: the geometry of the mountains, the ocean surface, the beach dunes, the various hero trees, and so forth. All of those can safely be brought in using Import.

With that simple change, parsing is 3.5x faster and time to first pixel is down to 123 seconds. Progress!

CPU utilization with top-level Import statements

Parsing time has greatly improved, though for most of that phase only four threads are running, trailing off to a single thread for the last few seconds. This is a classic load imbalance: most of the imported files are parsed quickly, but the four heaviest ones are the bottleneck.

Each of those four has a ~5GB pbrt file to be parsed along the way. I went ahead and manually split each of those into 10 or so smaller files that are themselves loaded via Import. With that, parsing has sped up by a total of 6.1x and we’re down to 97 seconds of time to first pixel. Since 64 threads are only giving a 6.1x speedup, one might expect that fewer cores would do nearly as well, and that is the case: with 16 cores, it’s 124 seconds to first pixel, and with 8, it’s 149.

CPU utilization with multiple levels of Import statements

The second half of the parsing phase is still just a few CPU cores chugging along, but I ran out of motivation to keep manually splitting up the big files; that’s the sort of thing that would ideally be done automatically by an exporter anyway.


All told, pbrt-v4 is now 21.6x faster in time to first pixel for the Moana Island scene than pbrt-v3 was originally, and 4.2x faster than where things stood at the start of this write-up, with all of that additional improvement due to parallelism. That’s satisfying progress, and I have to admit that seeing 30 threads all simultaneously working on parsing the scene description is a thrill; I wouldn’t have expected any parallelism in that phase a few weeks ago.

That said, even in the final graph there’s a lot more area above the curve where CPUs are idle than there is below where they’re getting things done. Eyeballing it, I’d guess that if the 64 CPU threads were used to their full capabilities, it might be possible to have a 20 second time to first pixel.

Getting to that point would require much more complexity in pbrt’s implementation, however: it would likely end up kicking off work as soon as it was available (“start creating this light in a separate thread”, etc.). Having all of the object creation happening concurrently with parsing would also introduce the risks of ugly bugs due to race conditions, which I’m not sure is worth it, especially for a system with pbrt’s goals. Therefore, we’ll stop there for time to first pixel, at least for the moment.

To wrap up these updates, next time we’ll look at pbrt-v4’s performance rendering this scene, focusing on the GPU rendering path.