Swallowing the Elephant (Part 12): A Postscript On Disk Bandwidth

At the ostensible end of these updates about pbrt-v4’s performance when rendering Disney’s Moana Island scene, there was an unresolved question about why CPU utilization wasn’t better at the very start when pbrt was parsing the scene description. As a refresher, with 64 threads on a 32-core AMD 3970X CPU and pbrt’s GPU-based rendering path, it looked like this. (As before, the vertical dashed line indicates when rendering begins.)

CPU utilization (64 threads, with more use of Import)

Starting at the 17 second mark, low CPU utilization isn’t necessarily bad since that’s when the GPU starts getting involved building acceleration structures, but before that it’s all on the CPU. For the first twelve or so seconds, total CPU utilization is between 0.2 and 0.4, which corresponds to roughly 13–26 of those 64 threads actually making something of themselves; that’s not enough of them to be satisfying.

It started to nag at me whether limited disk bandwidth might have something to do with that—i.e., is the issue that threads are stalled waiting for I/O? I made a few measurements to try to answer that question and learned enough along the way that here we go again.

How Far We Have Come

Three years ago when I first looked at pbrt’s performance with Moana Island I was using a Google Compute Engine instance with a spinny disk for benchmarks. Nowadays you might hope for around 100 MB/s of read bandwidth from such a disk. pbrt-v4 reads a total of 27,766 MB from disk when loading this scene and it takes a lot of 100 MBs to get through all of that. Therefore, when I was doing benchmarks then I was careful to flush the OS’s buffer cache between runs so that the true cost of I/O was measured and everything didn’t come out of RAM after the first time at rates much better than 100 MB/s.
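On Linux, the usual way to flush the buffer cache between runs is to write to `/proc/sys/vm/drop_caches` as root. The following is a minimal sketch of that step, not necessarily the procedure used for the benchmarks here. The function name is made up for illustration, and the `path` parameter exists only so the function can be exercised without root.

```python
import os

def drop_buffer_cache(path="/proc/sys/vm/drop_caches"):
    """Ask the Linux kernel to drop clean page cache, dentries, and inodes.

    Needs root when `path` is the real procfs file.
    """
    os.sync()  # write dirty pages back first so that they become droppable
    with open(path, "w") as f:
        f.write("3\n")  # 1 = page cache, 2 = dentries and inodes, 3 = both
```

Calling `drop_buffer_cache()` before each timed run means the next run has to fetch everything from the disk again.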

This time around, I didn’t mention the disk on the system I used for benchmarking and I didn’t worry about the buffer cache. That wasn’t an oversight, but was more of a “I’m pretty sure this doesn’t matter”¹ sort of thing, like which version of the Linux kernel was running or whether it was DDR4 3200 or DDR4 3600 RAM in the system. (For the record, 5.8.0 and the former.)

The disk I’m using now is an NVMe disk; a quick benchmark showed that it delivers a peak of 2,022 MB/s of read bandwidth. I didn’t think that could be a bottleneck, though if you distribute those 2,022 MB/s evenly to 64 threads, it’s just 32 MB/s per thread. Thinking about it in those terms made me worry that bandwidth might be tight, so I decided to make some direct measurements and see what they had to show.

Starting Position

First, I measured pbrt’s disk bandwidth use over time to get a sense of whether it ever approached the peak and to see how disk reads were distributed over the course of loading the scene. (This and following measurements were made with an empty buffer cache, just to be safe.) iostat made that easy to do, though sadly it doesn’t seem to be able to report I/O at finer than one-second granularity, which is coarser than one would like given a roughly 30 second time to first pixel. In any case, here is a graph of what it had to say; the disk’s measured maximum I/O bandwidth is marked with a dashed horizontal line.

Disk read bandwidth over time while loading the scene (64 threads)

For the first 20 or so seconds, pbrt is mostly parsing text *.pbrt scene description files; it starts out consuming plenty of bandwidth but then slows as there are fewer files left to get through. The second wave of I/O starting at 20 seconds corresponds to reading all of the PLY files for the object instances in the scene. The news in this graph is mostly good: pbrt doesn’t seem to ever top out at the maximum bandwidth, suggesting that it’s not I/O bound, though it’s close enough at 9 seconds there that it’s not possible to be sure from these measurements.

This data also makes it possible to compute an alternative speed of light measurement for time to first pixel. If we divide the total size of data read, 27,766 MB, by the peak read bandwidth of 2,022 MB/s, we can see that we can’t hope to have a time to first pixel under 13.7 seconds. That’s already an interesting result, as it shows that the earlier speed of light calculation that only considered the CPU didn’t tell the whole story: then, neglecting I/O limits, we estimated 7.2 seconds as the best possible time to first pixel.
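The arithmetic behind that bound is short enough to write down. This sketch uses only the numbers quoted above; the function and constant names are made up for illustration. Taking the larger of the two bounds assumes I/O and computation overlap perfectly, which is the most optimistic case.

```python
def io_bound_seconds(total_mb, peak_mb_per_s):
    """Lower bound on load time if the disk ran at peak bandwidth throughout."""
    return total_mb / peak_mb_per_s

TOTAL_MB = 27_766       # data pbrt-v4 reads for the uncompressed scene
PEAK_MB_PER_S = 2_022   # measured peak read bandwidth of the NVMe disk
CPU_BOUND_S = 7.2       # earlier CPU-only speed of light estimate

io_bound = io_bound_seconds(TOTAL_MB, PEAK_MB_PER_S)
print(f"I/O bound: {io_bound:.1f} s")                      # 13.7 s
print(f"combined:  {max(CPU_BOUND_S, io_bound):.1f} s")    # 13.7 s
```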

Another thing this graph shows is that pbrt is close enough to being I/O bound at the start that there isn’t a lot of reason to worry about the relatively low CPU utilization then. We might improve it some by finding more things to start reading sooner, but the benefit would be limited since we would soon hit peak disk bandwidth and be limited by that. Further performance improvements would then require a better balance of I/O requests over time.
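As for iostat’s one-second floor mentioned earlier: on Linux the same counters can be read directly from `/proc/diskstats` at whatever interval you like. Here is a minimal sketch of that idea, not what was used for the graph above. The device name `nvme0n1` is a placeholder, and the function names are made up for illustration.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors regardless of the device

def sectors_read(diskstats_text, device):
    """Return the cumulative sectors-read counter for `device`."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Fields: major, minor, name, reads completed, reads merged, sectors read, ...
        if len(fields) > 5 and fields[2] == device:
            return int(fields[5])
    raise ValueError(f"device {device!r} not found")

def read_mb_per_s(before, after, seconds):
    """Convert two sectors-read samples into MB/s (1 MB = 10**6 bytes)."""
    return (after - before) * SECTOR_BYTES / 1e6 / seconds

def sample(device="nvme0n1", interval=0.1, count=300):
    """Print elapsed time and read bandwidth for `device` every `interval` seconds."""
    def snapshot():
        with open("/proc/diskstats") as f:
            return sectors_read(f.read(), device), time.monotonic()

    prev, t_prev = snapshot()
    t_start = t_prev
    for _ in range(count):
        time.sleep(interval)
        cur, t_cur = snapshot()
        print(f"{t_cur - t_start:6.2f} {read_mb_per_s(prev, cur, t_cur - t_prev):8.0f}")
        prev, t_prev = cur, t_cur
```

Running `sample()` in a second terminal while pbrt loads the scene would give roughly ten samples per second with the defaults shown.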

Turning The Bandwidth Screw

The data already seemed fairly conclusive about not being I/O bound, but I was curious about how performance varied with disk read bandwidth—how crucial is that lovely abundant NVMe bandwidth to pbrt’s performance with this scene? One way to find out is to start reducing the amount of disk read bandwidth available to pbrt and to see how that affects performance.

Once you find the right trick it’s surprisingly easy, at least on Linux, to use systemd-run to launch a process that has a limited amount of disk read bandwidth available to it. I did a quick study, dialing the bandwidth down from the 2,000 MB/s that my NVMe drive offers to the sad 50 MB/s that a middling spinning disk today might provide.
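For the curious, one way to do this is systemd’s `IOReadBandwidthMax=` resource-control property. The sketch below builds such a command line. It is an assumption about the general approach rather than the exact invocation used for these measurements, and the device path and scene file name are placeholders. It typically needs root and a system with the cgroup v2 I/O controller available.

```python
import shlex

def throttled_command(device, mb_per_s, argv):
    """Build a systemd-run command line that caps read bandwidth on `device`."""
    # The M suffix is documented as megabytes to the base of 1000.
    limit = f"IOReadBandwidthMax={device} {mb_per_s}M"
    return ["systemd-run", "--scope", "-p", limit, *argv]

# Hypothetical device path and scene file name, for illustration only.
cmd = throttled_command("/dev/nvme0n1", 500, ["pbrt", "island.pbrt"])
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd)`, or the printed line can be pasted into a shell.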

Here is a graph of pbrt-v4’s time to first pixel with the Moana Island scene as a function of available disk bandwidth, running with both 8 threads on 4 cores and 64 threads on 32 cores. Note that the y axis has a logarithmic scale, the better to fit the sadness that is a nearly 600 second time to first pixel given 50 MB/s.

Time to first pixel as a function of available disk read bandwidth (8 and 64 threads, logarithmic y axis)

There are a number of things to see in this graph. First, it offers further confirmation that pbrt-v4 is not bandwidth limited for this scene: the fact that performance doesn’t immediately decrease as bandwidth starts to decrease from 2,000 MB/s indicates that more bandwidth isn’t going to make things faster. Both lines seem to have hit their asymptote, though the 64 thread one just barely so.

This graph also shows how much bandwidth can decrease before performance is meaningfully affected. With 64 threads, you only have to go to 1,400 MB/s to slow down time to first pixel by 10%, but with 8 threads you can go all the way to 800 MB/s before there’s a 10% drop. This isn’t surprising (the more threads you’ve got, the more bandwidth you’re capable of consuming), but it’s nevertheless interesting to see how much farther one can go with fewer threads.

Finally, note that below 500 MB/s, the two curves are effectively the same. Here, too, there’s no big surprise: if you’re trying to drink through a narrow straw, having more thirsty people waiting in line on the end of it isn’t going to get the water through more quickly, to grossly overstretch a metaphor.

DEFLATE, DEFLATE, DEFLATE

Compression algorithms make it possible to trade off bandwidth for computation, so my last experiment was to look at performance with the scene description compressed using gzip. Thanks to a recent patch from Jim Price, pbrt-v4 now supports reading gzip-compressed scene description files, and RPly, the PLY file reader by Diego Nehab that pbrt uses, already supported gzip-compressed PLY files. All of that made it easy to run the same experiments with a compressed scene description.
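Reading a file that may or may not be gzip-compressed is simple in principle. The sketch below detects gzip by its two magic bytes; that is an illustrative choice, not a description of how pbrt-v4 or RPly decide, and the function name is made up.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def read_maybe_gzipped(path):
    """Return the file's bytes, decompressing if it starts with the gzip magic."""
    with open(path, "rb") as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as f:
        return f.read()
```

The caller gets the same bytes either way, so the parser downstream doesn’t need to know whether the file on disk was compressed.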

With the *.pbrt and PLY files compressed using gzip -5, pbrt-v4 reads a total of just 5,570 MB from disk—nearly 5x less than with the uncompressed scene description. Using zlib for decompression with 64 threads and the full NVMe disk bandwidth, pbrt takes 40 seconds to first pixel with a compressed scene—12 seconds slower than with everything uncompressed. Given that it wasn’t bandwidth-limited before, that isn’t surprising—we have just increased the amount of CPU work that needs to be done to get the scene into memory.

Here is the graph of disk I/O consumption over those 40 seconds; it shows that now there is plenty of headroom with never more than 500 MB/s of bandwidth used.

Disk read bandwidth over time with the gzip-compressed scene (64 threads)

As we were going to press, I saw that Aras Pranckevičius just put up a nice series of blog posts about compression in OpenEXR. Those led me down all sorts of ratholes, and one of them reminded me about libdeflate, a highly optimized library that can decompress gzip-encoded files (among others). It wasn’t too much code to swap that in for zlib in pbrt and bam: down to 34 seconds to first pixel with a compressed scene. And that’s actually only using libdeflate for the *.pbrt files but still using zlib for the 1,152 MB worth of compressed PLY files, since using libdeflate with RPly would have required more complicated plumbing.

Anyway, here’s a graph that shows time to first pixel with all three options, again with 64 threads. libdeflate gets an asterisk, since it isn’t being used for PLY files (and there is thus presumably some performance being left on the floor).

Time to first pixel as a function of available disk read bandwidth: uncompressed, zlib, and libdeflate* (64 threads)

There’s lots of good stuff to see there. As advertised, libdeflate is certainly faster than zlib. It starts being the fastest option overall at around 1,300 MB/s of bandwidth. From there on down, the additional CPU work to do decompression is worth the disk bandwidth savings in return. (In contrast, zlib doesn’t make up for its computational overhead until around 1,000 MB/s.)

Both decompressors have more or less constant performance all the way down to roughly 300 MB/s from the disk. Past there, their performance converges: at that point, data is coming in so slowly that how quickly it’s decompressed doesn’t make much difference. We can also see that compression is especially helpful way down at 50 MB/s, where it leads to a sprightly 127 seconds to first pixel, 4.6x faster than the uncompressed scene is with that little bandwidth.
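A crude way to reason about these curves is to treat load time as the slower of two things: the time measured at full bandwidth and the time needed just to pull the bytes off the disk. The sketch below is that back-of-the-envelope model, built only from numbers quoted in this post; it is not a fit to the measured data, and all of the names in it are made up.

```python
def modeled_seconds(full_bandwidth_s, total_mb, bandwidth_mb_per_s):
    """Crude estimate: load time is the slower of compute and disk transfer."""
    return max(full_bandwidth_s, total_mb / bandwidth_mb_per_s)

# (time to first pixel at full bandwidth in seconds, MB read from disk)
UNCOMPRESSED = (28, 27_766)
ZLIB = (40, 5_570)
LIBDEFLATE = (34, 5_570)

print("MB/s  uncompressed    zlib  libdeflate")
for bw in (2000, 1300, 1000, 500, 300, 50):
    row = [modeled_seconds(t, mb, bw) for t, mb in (UNCOMPRESSED, ZLIB, LIBDEFLATE)]
    print(f"{bw:4d}  " + "  ".join(f"{s:10.1f}" for s in row))
```

This model is optimistic. It puts the uncompressed scene at about 555 seconds and the compressed ones at about 111 seconds given 50 MB/s, and it has libdeflate pulling ahead only below roughly 817 MB/s (27,766 MB / 34 s) rather than the measured 1,300 MB/s. Presumably that’s because the real loads don’t overlap I/O and computation perfectly.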

Discussion

For once we have gotten through these investigations without finding any surprising bottlenecks and so today has not brought any changes to pbrt’s implementation, though I suspect pbrt will switch from zlib to libdeflate fairly soon.

Perhaps the most useful result from today is a more accurate estimate of pbrt’s best possible performance when preparing to render this scene: 13.7 seconds given the disk I/O limits of the system. With that limit known, it’s easier to accept the 28 seconds to first pixel that the system delivers today—apparently only 2x off the maximum attainable performance—and to stop fiddling with the details.

And yet… I hope that the attentive reader might quibble with the logic behind that conclusion: with the compressed scene, we found ourselves with a mere 5,570 MB of disk I/O, and that’s something this computer can deliver in 2.75 seconds, which puts us once again 10x off the mark. It seems that part of speed of light is in how you define it, but nevertheless I think it’s time to leave things where they lie for now.

Note

¹ The road to disastrous performance is paved with “pretty sure” assumptions about a system’s behavior, so that assumption was admittedly not wise.