Swallowing the Elephant (Part 10): Rendering on the GPU

We’re big fans of actually making pictures around here, so avoiding the topic of rendering performance and focusing on performance while getting ready for rendering may seem a little off. A flimsy defense for that is “vegetables before dessert”; we must attend to our responsibilities before we go off and have fun. A better justification is more or less Amdahl’s law: as rendering time decreases, overall performance is increasingly determined by the cost of the rest of the work that happens before rendering. That motivation should be clearer by the end of this post.

Preliminaries: Textures and Curves

Making pbrt-v4’s GPU rendering path capable of rendering Disney’s Moana Island scene required attending to two details that I hadn’t gotten around to until recently: supporting both Ptex textures and pbrt’s curve shape on the GPU. Both are used extensively in the Moana Island scene.

Ptex is a texture representation from Walt Disney Animation Studios that works around the uv-mapping problem by splitting meshes into faces (that may themselves be collections of triangles or quads) and then applying a plain old [0,1]² parameterization to each face. The Ptex library handles the details of things like loading textures on demand, caching parts of them in memory, and texture filtering.

To my knowledge, there isn’t a Ptex implementation that runs on the GPU. One way to work around this problem is to round-trip to the CPU and service Ptex requests there; that approach was taken by Chris Hellmuth when he rendered the Moana Island scene on the GPU. That approach gives gold-standard results, but at the cost of synchronization and data transfer between the two processors as well as the risk of the CPU being the performance bottleneck.

Ingo Wald took a different approach when he got the Moana Island rendering on the GPU; his implementation resampled each face’s texture at a low resolution and then packed the results into large flat 2D textures. That keeps everything on the GPU, but risks blurred texture lookups due to insufficient resolution.

And then there’s the approach I took: for each face, pbrt-v4 computes the face’s average texture value and stores it in an array. A texture lookup is then a simple index into that array using the face index. (Thus, it’s basically Ingo’s approach with “low resolution” taken all the way to a single pixel.) The only thing defensible about my approach is that it’s just a few lines of code to convert Ptex textures into this representation, and texture lookup is near-trivial indexing into that array.
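A minimal sketch of that idea, in Python rather than pbrt-v4's C++ and not taken from its source: the names `build_face_averages` and `lookup` are hypothetical, the texel data is passed in as plain lists instead of being read from a Ptex file, and the average is assumed to be a simple unweighted mean over each face's texels.

```python
# Hypothetical sketch of per-face average textures; not pbrt-v4's actual code.
# Texel data is supplied as plain Python lists here. A real converter would
# read each face's texels from the Ptex file instead.

def build_face_averages(face_texels):
    """Return one averaged RGB triple per face.

    face_texels[i] is the non-empty list of (r, g, b) texels for face i.
    """
    averages = []
    for texels in face_texels:
        n = len(texels)
        averages.append(tuple(sum(t[c] for t in texels) / n for c in range(3)))
    return averages


def lookup(averages, face_index, u, v):
    """Texture lookup: (u, v) is ignored, so every face is one flat color."""
    return averages[face_index]


face_texels = [
    [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)],  # face 0: two texels
    [(0.2, 0.4, 0.6)],                    # face 1: one texel
]
averages = build_face_averages(face_texels)
print(lookup(averages, 0, 0.25, 0.75))
```

Because `lookup` never reads `u` or `v`, the result is constant across each face, which is the source of the faceting and missing bump detail discussed later in the post.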

I can’t say that I’m proud of that solution, but I can at least say that it works great for objects that are far away where all you need is the top MIP level anyway. As we will see shortly, it is certainly not production-ready. I hope to get around to replacing it with something better in the future, but for now it gets us up and rendering.

The other thing to take care of was supporting curves on the GPU. pbrt-v4’s built-in curve shape uses a recursive intersection algorithm that is a poor fit for the GPU; I didn’t even try running it there. OptiX does provide a highly optimized curve primitive, though some plumbing would be necessary to wire up pbrt’s curve representation to use it. Impatient to get the scene up and rendering, I wrote a simple function that dices curves into bilinear patches. (This, too, is something to return to in the future.)
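To make "dicing" concrete, here is one way such a function could look. This is a sketch under stated assumptions, not pbrt-v4's implementation: it assumes a single cubic Bézier segment with a width that varies linearly along it, picks the ribbon's sideways direction by crossing the tangent with a coordinate axis, and assumes the tangent is never zero. All function names are hypothetical.

```python
import math

# Hypothetical sketch of dicing a cubic Bezier curve into a strip of
# four-cornered patches; not pbrt-v4's actual code.

def bezier_point(cp, t):
    """Evaluate a cubic Bezier with control points cp (four 3-tuples) at t."""
    s = 1.0 - t
    w = (s * s * s, 3 * s * s * t, 3 * s * t * t, t * t * t)
    return tuple(sum(w[i] * cp[i][k] for i in range(4)) for k in range(3))


def bezier_tangent(cp, t):
    """First derivative of the cubic Bezier at t."""
    s = 1.0 - t
    w = (3 * s * s, 6 * s * t, 3 * t * t)
    return tuple(sum(w[i] * (cp[i + 1][k] - cp[i][k]) for i in range(3))
                 for k in range(3))


def side_direction(tangent):
    """Unit vector perpendicular to the tangent (assumes a non-zero tangent)."""
    # Cross the tangent with the coordinate axis it is least aligned with.
    axis = min(range(3), key=lambda k: abs(tangent[k]))
    ref = tuple(1.0 if k == axis else 0.0 for k in range(3))
    c = (tangent[1] * ref[2] - tangent[2] * ref[1],
         tangent[2] * ref[0] - tangent[0] * ref[2],
         tangent[0] * ref[1] - tangent[1] * ref[0])
    length = math.sqrt(sum(x * x for x in c))
    return tuple(x / length for x in c)


def dice_curve(cp, width0, width1, n_segments):
    """Return n_segments patches, each a tuple of four corner points."""
    edges = []
    for i in range(n_segments + 1):
        t = i / n_segments
        p = bezier_point(cp, t)
        d = side_direction(bezier_tangent(cp, t))
        half = 0.5 * ((1.0 - t) * width0 + t * width1)
        left = tuple(p[k] - half * d[k] for k in range(3))
        right = tuple(p[k] + half * d[k] for k in range(3))
        edges.append((left, right))
    # Adjacent patches share an edge, so the strip has no gaps.
    return [(edges[i][0], edges[i][1], edges[i + 1][0], edges[i + 1][1])
            for i in range(n_segments)]


cp = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
patches = dice_curve(cp, 0.2, 0.2, 3)
print(len(patches))
```

The patches are fixed geometry that does not depend on the ray being traced. One limitation of this particular sketch is that the sideways direction can flip abruptly on a curve that twists, since the chosen reference axis may change from one sample to the next.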

To my delight, once those two additions were debugged, everything just worked the first time I tried rendering the Moana Island on the GPU.

Images and Performance

Here again is the main view of the island, rendered on both the CPU and the GPU with 256 samples per pixel at 1920x804 resolution. The images are displayed using Jeri, which makes it possible to do all sorts of pixel peeping right in the browser. (Click on the tabs to switch between CPU and GPU, or use the number keys after selecting the image. Hit ‘f’ to go full-screen. You can also pan and zoom using the mouse.) If you’d prefer to examine the EXRs directly, here they are: CPU, GPU.

The differences between the image rendered on the CPU and the one rendered on the GPU are entirely due to the differences in how Ptex textures and curves are handled. For Ptex, you can see the problems with the current approach on the far-away mountainside as well as on the trunks of the palm trees. And then there’s a striking difference in the palm fronds; we’ll return to that shortly, but it’s the GPU that has the more accurate result there, not the CPU.

Oh, and about rendering performance? It’s 26.7 seconds on the GPU (an NVIDIA RTX A6000) versus 326.5 seconds on the CPU (a 32 core AMD 3970X). Work out the division and that’s 12.2x faster on the GPU. If you’d prefer a clean 2048 sample per pixel rendering, the GPU gets through that in 215.6 seconds, once again over 12x faster than the CPU doing the same.

And so it’s obvious why time to first pixel matters so much. From start to finish, that 256 sample per pixel rendering takes about 90 seconds of wall-clock time. Two thirds of it is spent getting things ready to render, 30% is rendering, and the rest is a few seconds of shutting things down at the end. With rendering being that fast, if you want to see that image sooner, optimizing startup time can be a better place to focus than optimizing rendering time. Naturally, startup time matters less as the number of pixel samples increases, but that has to go well into the thousands of them before startup time starts to be insignificant.
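The arithmetic behind that claim can be sketched in a few lines of Python. The 90-second total, the two-thirds startup share, and the 26.7-second render time come from the numbers above; the assumptions are that startup and shutdown costs are independent of the sample count and that render time scales linearly with it (the 2048-sample time of 215.6 seconds is close to 8 × 26.7 = 213.6, which suggests that is reasonable).

```python
# Back-of-the-envelope model of where wall-clock time goes as the sample
# count grows. Assumes startup and shutdown are fixed costs and that render
# time is linear in samples per pixel (spp).

TOTAL_S_256SPP = 90.0    # approximate wall-clock time at 256 spp
RENDER_S_256SPP = 26.7   # GPU render time at 256 spp
STARTUP_S = TOTAL_S_256SPP * 2.0 / 3.0                     # "two thirds" of the total
SHUTDOWN_S = TOTAL_S_256SPP - STARTUP_S - RENDER_S_256SPP  # the remainder


def startup_fraction(spp):
    """Fraction of total wall-clock time spent before rendering starts."""
    render_s = RENDER_S_256SPP * spp / 256.0
    return STARTUP_S / (STARTUP_S + render_s + SHUTDOWN_S)


for spp in (256, 2048, 16384):
    print(spp, f"{startup_fraction(spp):.1%}")
```

Under these assumptions startup is still roughly a fifth of the total at 2048 samples per pixel, and it only drops to a few percent once the sample count is well into the thousands.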

There is good news and bad news about memory: the scene needs “just” 29.0 GB of GPU memory to render. I’m happy with that in absolute terms, but unfortunately it limits how many GPUs can handle the scene. It would be nice to find a way to fit it in 24 GB, in which case it could be rendered on an RTX 3090, but for now the full scene requires something along the lines of an RTX A6000. (As a workaround, removing the “isIronwoodA1” and “isCoral” models gets it down under 24 GB with limited visual impact and, bonus, takes time to first pixel down to 51 seconds.)

As far as where the time is spent, pbrt-v4 offers a --stats command-line option that prints out various statistics after rendering finishes. When the GPU is used, it gives a summary of where the GPU spent its time during rendering. Here, sorted and summarized, is what it has to say about rendering the main view of the Moana Island:

Category                                   Total Time (ms)  Percentage
Tracing closest hit rays                          11976.57       45.4%
Tracing shadow rays                                8441.60       32.0%
Material/BSDF evaluation and shading               4294.59       16.3%
Generating samples                                  638.44        2.4%
Generating camera rays                              443.17        1.7%
Handling escaped rays                               267.56        1.0%
Updating the film                                   127.35        0.5%
Handling emitters hit by indirect rays               97.47        0.4%
Other                                                75.61        0.3%

That’s 77.4% of the total runtime spent on ray intersection tests and associated enqueuing of subsequent work. Most of the rest is in that 16.3% of material and BSDF work that is done at each intersection. It includes evaluating textures, sampling light sources, and evaluating and sampling BSDFs. For less complex scenes, that’s where most of the runtime is normally spent.
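For anyone who wants to check the totals, the shares can be recomputed from the millisecond figures in the table above. The dictionary below simply transcribes those numbers; the helper name `share` is made up for this sketch.

```python
# Recompute the percentage breakdown from the timings reported above.

gpu_time_ms = {
    "Tracing closest hit rays": 11976.57,
    "Tracing shadow rays": 8441.60,
    "Material/BSDF evaluation and shading": 4294.59,
    "Generating samples": 638.44,
    "Generating camera rays": 443.17,
    "Handling escaped rays": 267.56,
    "Updating the film": 127.35,
    "Handling emitters hit by indirect rays": 97.47,
    "Other": 75.61,
}

total_ms = sum(gpu_time_ms.values())


def share(*names):
    """Percentage of total GPU time taken by the named categories."""
    return 100.0 * sum(gpu_time_ms[n] for n in names) / total_ms


ray_share = share("Tracing closest hit rays", "Tracing shadow rays")
shading_share = share("Material/BSDF evaluation and shading")
print(f"total {total_ms / 1000.0:.2f} s, rays {ray_share:.1f}%, "
      f"shading {shading_share:.1f}%")
```

The categories sum to a little over 26 seconds, in line with the reported render time, and the two ray tracing rows together account for about 77% of it.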

With apologies to the artists who spent untold hours on the textures for this scene, here is another view of the island scene, this one the “rootsCam” camera. (Direct links to the EXRs: CPU, GPU.)

Again rendered at 256 samples per pixel, this took 32.1 seconds to render on the GPU and 557.6 seconds on the CPU. That’s 17.4x faster on the GPU, with a similar breakdown of where the time was spent.

With this viewpoint the shortcomings of pbrt-v4’s current approach for Ptex on the GPU are even more obvious; not only do the sand dunes appear faceted from lack of texture detail, but we have lost all of the fine bump-mapping detail. (Turns out, taking differences of a constant function to compute shading normals doesn’t work out so well.)
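The parenthetical is easy to demonstrate. Bump mapping of this kind estimates how the displacement changes across the surface by evaluating it at slightly shifted positions and taking differences. The sketch below is an illustration in Python with hypothetical names, not pbrt-v4's bump mapping code: a displacement that is constant across a face yields a zero gradient, so the shading normal is never perturbed.

```python
# Forward-difference estimate of a displacement function's gradient,
# illustrating why a per-face constant texture produces no bump detail.

def displacement_gradient(displacement, u, v, delta=1e-3):
    """Approximate (d/du, d/dv) of displacement at (u, v)."""
    d0 = displacement(u, v)
    ddu = (displacement(u + delta, v) - d0) / delta
    ddv = (displacement(u, v + delta) - d0) / delta
    return ddu, ddv


def constant_per_face(u, v):
    """What a per-face average returns: the same value everywhere on the face."""
    return 0.35


def ramp(u, v):
    """A displacement with real variation, for comparison."""
    return 2.0 * u + 3.0 * v


print(displacement_gradient(constant_per_face, 0.5, 0.5))
print(displacement_gradient(ramp, 0.5, 0.5))
```

With the constant function both differences are exactly zero, while the ramp gives a gradient close to (2, 3).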

However, it is clear from these images that it is the GPU that is giving the better result in those tufts of grass in the lower right. The CPU’s rendering path isn’t getting the self-shadowing correct down there, while the GPU’s is. (The same thing is happening in the palm fronds in the main view.) This discrepancy was unexpected and is something to chase down in the future. I suspect the issue stems from the CPU curve implementation needing a fairly large epsilon at ray intersections; this is necessary to avoid self-intersections since the CPU’s curve shape orients itself to face the ray being traced. On the GPU, a much smaller epsilon is possible because true geometry is used for curves.

Conclusion

The 12-17x speedups on the GPU are not based on a comparison of perfectly matching implementations, though apart from the ray intersection routines, curves, and Ptex, both run the same C++ code. Each is better and worse than the other in different ways: while the diced curve representation used on the GPU turned out to be superior to the CPU’s curve shape, the lack of proper Ptex texturing on the GPU at the moment is a loss.

One nice thing about the performance breakdown on the GPU is that there’s plenty of headroom to do more shading work. With 77% of runtime spent on ray intersections and 16% on shading, even doubling the cost of shading with a more complete Ptex implementation wouldn’t increase the overall runtime very much. I expect that the GPU’s speedup wouldn’t be too different with the curve and Ptex differences harmonized.
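As a rough what-if, the effect of doubling the shading cost can be estimated from the main view's numbers. This is only a sketch: it assumes the 16.3% shading share applies uniformly to the 26.7-second GPU render, that nothing else gets slower, and that the CPU time stays at 326.5 seconds because the CPU path already does full Ptex lookups.

```python
# Estimate GPU render time and speedup if shading became twice as expensive.
# All inputs are the main-view figures from this post; the model is a
# simple Amdahl-style scaling of one component.

SHADING_SHARE = 0.163   # fraction of GPU time in material/BSDF work
GPU_S = 26.7            # GPU render time, 256 spp
CPU_S = 326.5           # CPU render time, 256 spp


def scaled_runtime(base_s, component_share, component_scale):
    """Runtime after one component's cost is multiplied by component_scale."""
    return base_s * (1.0 + component_share * (component_scale - 1.0))


new_gpu_s = scaled_runtime(GPU_S, SHADING_SHARE, 2.0)
new_speedup = CPU_S / new_gpu_s
print(f"{new_gpu_s:.1f} s on the GPU, {new_speedup:.1f}x faster than the CPU")
```

Under those assumptions the GPU render takes about 16% longer, and the speedup for the main view moves from roughly 12x to somewhere between 10x and 11x.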

Next time we’ll come back to review where pbrt-v4 stands in the time to first pixel department and then end this series, at least for now, with some renderer-design retrospection.