Last time around we finally got started digging into pbrt-v4’s performance with the Moana Island scene using its GPU rendering path. Then, as today, the focus was limited to all of the processing that goes on before rendering begins. There was plenty left unresolved by the end, including 16 seconds spent building BVHs for the object instances that featured poor utilization on both CPU and GPU.

Before we get into trying to improve that, here is the far-away view of the Moana Island scene, again rendered on an NVIDIA RTX A6000. As before, mum’s the word on performance until next time, but once again, this didn’t take too long.

Moana Island main view rendered at 1920x804 resolution with pbrt-v4 on an NVIDIA RTX A6000 GPU with 2048 samples per pixel.

Now back to work. In the 16 seconds that pbrt-v4 spends in its Build instance BVHs phase, it does the following three things for each geometric object that is instanced repeatedly in the scene:

  • Reads any geometry specified via PLY files from disk. (Any geometry not specified in PLY files has already been read during regular parsing of the scene description.)
  • Converts the geometry into the in-memory geometric representation that OptiX takes as input to its BVH construction routines.
  • Has OptiX build its BVH.

The first two steps run on the CPU and the third runs on the GPU.

There are a total of 312 such instance definitions and the work for one is independent of the work for all of the others; this is a friendly problem to parallelize. Yet if we look at the CPU and GPU utilization graphs from where we left off last time, the results are unimpressive during the Build instance BVHs phase:

CPU utilization
GPU utilization

(As before, the x axis is seconds, y axis is processor utilization, and the dashed line indicates the start of rendering.)

It starts out looking promising with 40% of the CPU in use, but after less than two seconds of that, CPU utilization drops down to just a few cores until all the instances are finished. The GPU is occasionally fully occupied, but it’s idle for much of the time. Thus, we can’t blame all of that CPU idling on threads waiting for the GPU.

Starting With The Obvious

The natural place to start is to parallelize the outermost loop that does the three things outlined above; honestly, there’s no excuse for it not having been parallel from the start. The change is barely any work at all, and in a world where the performance gods were feeling generous, that would be the end of it. The only thing that one might worry about in the parallelization is contention on the mutex used to serialize updates to the hash table, but with just 312 instances, it seems unlikely that will be a big problem.

The good news is that this change did reduce time to first pixel; the bad news is that it was only down to 68.5 seconds—an improvement of just 3.7 seconds. I’m always happy to take a 5% improvement in overall performance, but one has to feel a little mixed about that when it might have been much more. Here are the performance graphs—as before they start after parsing has finished and the lights, materials, and textures have been created:

CPU utilization
GPU utilization

We can see that 3.7 second improvement; we can see that CPU utilization is better throughout instance processing; and we can see that the GPU spends less time idle. Yet, there’s nothing thrilling in the graphs: the CPU is still sitting around not making much of itself and the GPU isn’t being pushed very hard.

No Luck From The Next Three Obvious Things

Parallelizing the outermost loop isn’t enough if there’s something that serializes work in the middle of it. I soon remembered such a thing in the function that launches the OptiX kernels to build BVHs. BVH construction there is serialized; not only is all work submitted to the main CUDA command stream, but the CPU synchronizes with the GPU twice along the way. As a result, the GPU can only work on one BVH at a time, and if another thread shows up wanting to build its own independent BVH, its work is held up until the GPU finishes whatever it is already in the middle of.

I decided that leaving the synchronization in wouldn’t be too terrible if pbrt allowed each thread to independently submit work to the GPU. That was mostly a matter of taking a cudaStream_t in that function’s arguments and passing a different stream for each thread. Thus, each thread will wait for its own BVH to be built but it won’t be prevented from starting its own work by other threads. In turn, the GPU can work on multiple BVHs in parallel, which is helpful when there are instances that aren’t very complex.

Sadly, and to my surprise, the performance benefit from that change was nil.

I flailed a bit at that point, parallelizing two more inner loops in instance BVH construction (1) (2), hoping that giving the CPU more available work would help. From that, too, there was no meaningful change in performance. Time to get scientific again.

Too Much Performance To Handle

Giving up on an easy win from semi-informed guesses, I returned to my tried and true performance bottleneck finder: running pbrt under the debugger, interrupting execution when CPU utilization was low, and seeing what was actually going on. A quick scan of all of the threads’ backtraces at one of these points showed that all 64 were in the TriangleMesh constructor. (“That’s funny—we shouldn’t be spending much time there at all” was my first reaction; that reaction is almost always good news when one is looking for ways to improve performance.)

Not only were all the threads in that constructor, but all but one was held up waiting for the same mutex, which was held by the remaining one. And yet, there’s nary a mutex to be seen in that code…

A closer look at the stack traces and it became clear that the mutex in pbrt’s GPU memory allocator was the point of contention; if 64 threads are trying to allocate storage for meshes all at once, things will understandably go bad there.

I updated the allocator to use per-thread slabs, where each thread periodically goes out to the main allocator for a 1MB chunk of memory but otherwise allocates memory from its own chunk directly, no mutex required. I assumed that this would be enough to make that allocation much less of a bottleneck, but there’s no way to know until you actually run the code.

I could tell that I was getting somewhere when my computer locked up and shut down when I ran pbrt with that fix. I restarted it and tried again, and was thrilled when my computer died once again. Progress!

To explain: I’ve been using a slightly under-powered power supply with this computer for a while. It’s enough to power the CPU at full utilization and is enough to power the GPU at full utilization. Both at the same time? A bit too much to ask for. It hadn’t been much of a problem; I just got used to not running big compilation jobs at the same time that the GPU was busy. In most of my day-to-day work, it’s one processor or the other at work at a time. Those crashes were a good hint that I had gotten both processors to be busy at once, which seemed promising.

Unwilling to wait for a new power supply to be delivered in the mail, I braved a trip to Best Buy for quick gratification and 1000 W of future-proof power. I could reliably measure performance after swapping out the old power supply; time to first pixel was down to 64.9 seconds—another 3.6 seconds improvement—and graphs that were looking better:

CPU utilization
GPU utilization

The extra good news in those graphs is that the first phase, Process non-instanced geometry, unexpectedly got one second faster—look at that spike in CPU utilization there now! Apparently instance BVH construction wasn’t the only thing bottlenecked on that mutex.

And yet even with that fixed, CPU utilization there was still middling.

It’s Always a Mutex

Another round with the debugger as profiler and there was still lots of mutex contention under the TriangleMesh constructor. A little more digging and it became clear that the BufferCache::LookupOrAdd() method was the culprit.

BufferCache is something I added to pbrt after my first go-round with the Moana Island; it uses a hash table to detect redundant vertex and index buffers in the scene and only stores each one once. It makes a difference—even though the Moana Island comes in highly-instanced, BufferCache saves 4.9 GB of memory when used with it.

What was its problem? It’s bad enough that it’s worth copying a few lines of code here:

    std::lock_guard<std::mutex> lock(mutex);
    Buffer lookupBuffer(buf.data(), buf.size());
    if (auto iter = cache.find(lookupBuffer); iter != cache.end()) {
       ...

Not only do we have our single contended mutex, but if you trace through the Buffer class and the hashing flow, it turns out that the hash of the buffer data is computed with the mutex held—totally unnecessary and awfully rude, especially if one’s buffer covers many megabytes of memory.

The fix makes three improvements: the hash is computed before the lock is acquired, a reader-writer lock is used so that multiple threads can check for their buffer in the cache concurrently, and the hash table is broken up into 64 shards, each protected by its own mutex. In short, an attempt to atone for the initial failure with plenty of parallelism prophylaxis.

Survey says? 5.3 seconds faster, which brings us down to 59.6 seconds for the time to first pixel. An extra bonus is that the first phase, Process non-instanced geometry, sped up by another 0.8 seconds, bringing it down to 3.0 seconds (versus the 4.9 seconds it was at the start of today’s work). The performance graphs are starting to go somewhere:

CPU utilization
GPU utilization

There’s still plenty of CPU idle time in Build instance BVHs, though one might make the observation that CPU utilization and GPU utilization are roughly inversely correlated there. That fits with the fact that CPU threads go idle while waiting for the GPU to build their BVHs, which is a shortcoming that I think we will stick with for now in the interests of implementation simplicity.

Conclusion

With the one-minute mark broken, my motivation started to wane. The most obvious remaining problems are low CPU utilization at the start of Process non-instanced geometry and at the end of Build instance BVHs. I believe that those both correspond to the CPU working its way through a large PLY file in a single thread; splitting those files into multiple smaller ones that could be read in parallel would likely shave a few more seconds off.

Next time, we’ll turn to rendering, covering some of the details related to getting this scene to render on the GPU in the first place and finally looking at rendering performance.