Today we’ll keep the discussion to the topic of runtime assertions in renderers; next time it’ll be on to end-to-end tests, which will start to lead us into a more image-focused view of graphics debugging that will keep us busy for a while.

A principle in the last post on unit testing for renderers was the idea that you’d like your debugging problem to be as simple as possible; one way to achieve that is if bugs manifest themselves in a way other than “some of these pixels don’t look right…” While there will always be plenty of that sort of bug, those are usually a much harder debugging problem than a conventional one like “the program printed an error and crashed.” A good set of runtime assertions can be an effective way to turn obscure bugs into more obvious ones.

An assertion is a simple thing: a statement that a condition is always true at some point in the execution of a program. It seems that the original idea of them dates to Goldstine and von Neumann in 1947.1 If such a statement is ever found to be false, then a fundamental assumption underlying the system’s implementation has been violated. The implications—to the performance of the program or to the correctness of its output—may be wide-ranging and possibly impossible to recover from. Assertions are a great way to catch little things early, before they turn into big things that are only evident much later.

In contrast to unit tests, which just have to be fast enough to not be annoying to run often, assertions must be efficient, since they often run in the innermost loops of the renderer. In return, they have the advantage that they can check many more situations than a unit test. It turns out that a myriad of unexpected edge cases come up as you trace billions of rays in many different scenes. Yet an assertion that has no chance of firing is only a drag on overall performance without offering any value. The art lies in writing the ones that you don’t think will ever fire but that sometimes do.

For a well-written general discussion of assertions, see John Regehr’s blog post on the topic.

The Basics

While C++ provides an assert macro in the standard library, it has a few shortcomings:

  • Assertions are either enabled or disabled, via the NDEBUG macro. Often, they are disabled completely for optimized builds, which in turn means that they run rarely and do not catch many bugs.

  • When an assertion fails, only the text of the assertion (e.g., “x > 0”) and its location in the source code are printed, without any further context.

pbrt-v4 therefore has its own set of assertion macros, which are also integrated with pbrt’s runtime logging system. pbrt’s assertion macros are based on those in Google’s glog package. They include assertions that are always enabled, even in release builds, as well as ones that are only enabled in debug builds, where more costly checks may be acceptable. They also provide much more helpful information than assert() does when an assertion fails.
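
As a rough sketch of the mechanics (pbrt’s actual macros are more capable, integrating with its logging system and capturing stack traces, but the basic shape is similar):

#include <cstdio>
#include <cstdlib>

// Always-on comparison check in the spirit of pbrt's CHECK_GE(): on
// failure, report the asserted expression, its location, and the
// values of both operands, then abort.
#define SKETCH_CHECK_GE(a, b)                                        \
    do {                                                             \
        auto va_ = (a);                                              \
        auto vb_ = (b);                                              \
        if (!(va_ >= vb_)) {                                         \
            std::fprintf(stderr,                                     \
                         "%s:%d: Check failed: %s >= %s with "       \
                         "%s = %g, %s = %g\n", __FILE__, __LINE__,   \
                         #a, #b, #a, double(va_), #b, double(vb_));  \
            std::abort();                                            \
        }                                                            \
    } while (0)

// Debug-only variant that compiles away entirely in optimized builds.
#ifdef NDEBUG
#define SKETCH_DCHECK_GE(a, b) ((void)0)
#else
#define SKETCH_DCHECK_GE(a, b) SKETCH_CHECK_GE(a, b)
#endif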

Beyond a basic Boolean assertion (CHECK()), there are separate assertions for checking equality, inequality, and greater-than/less-than. For example, CHECK_GE() checks that the first value provided to it is greater than or equal to the second. Here is an example of its use in pbrt:

CHECK_GE(1 - pAbsorb - pScatter, -1e-6);

There’s a bit of context packed into that simple check: we have two probabilities, pAbsorb and pScatter, and if you look at the code before it you can see that the light transport algorithm has just computed three probabilities, where the third, pNull, is 1 - pAbsorb - pScatter. Thus, the assertion is effectively making sure that we are using valid probabilities when computing pNull.
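
For context, here is a sketch of the surrounding computation (the variable names sigma_a, sigma_s, and sigma_maj are illustrative; the actual pbrt code differs in its details):

// sigma_a and sigma_s are the absorption and scattering coefficients
// at the sampled point; sigma_maj is the majorant for the region.
Float pAbsorb = sigma_a / sigma_maj;
Float pScatter = sigma_s / sigma_maj;
// If sigma_maj truly bounds sigma_a + sigma_s, this quantity can only
// fall below zero by a small floating-point error.
CHECK_GE(1 - pAbsorb - pScatter, -1e-6);
Float pNull = std::max<Float>(0, 1 - pAbsorb - pScatter);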

More broadly, that check is in the context of pbrt’s code for sampling volumetric scattering. That code requires that the volumetric representation provide a majorant that bounds the density of the volume over a region of space. The CHECK_GE() then is effectively checking that the majorant is a valid bound. Thus, it’s really a check on the validity of the code that computes those bounds, which is far away in the system from where the check is made.

While that decoupling has the disadvantage that a failing assertion may require searching to find the code actually responsible for the bug, the advantage is that the check is made at every sample taken in every volumetric medium that is provided to pbrt for rendering; it gives the majorant computations a thorough workout. That check has found many bugs in that code since it was introduced; there are plenty of corner cases in the majorant computations, especially with trilinear interpolation, which requires considering a larger footprint, and with NanoVDB’s nested grid representation.

If that assertion fails, pbrt dumps more information than just the text of the assertion:2

[ tid 12129819 @     1.252s cpu/integrators.cpp:1004 ]
    FATAL Check failed: 1 - pAbsorb - pScatter >= -1e-6
        with 1 - pAbsorb - pScatter = -0.3336507, -1e-6 = -0.000001

In addition to the id of the thread in which the assertion failed, we have the elapsed time since rendering began (about 1.25 seconds here), the location of the assertion in the source code, what was asserted, as well as both of the values that were passed to CHECK_GE(). Having those values immediately at hand is often helpful. In the best case, one can understand the bug immediately, for example by seeing that an edge case that had been assumed to be impossible actually happens in practice. For this one, knowing whether the value was slightly outside of the limit or far outside of the limit (as it was here) may be a good starting point for further investigation.

A full stack trace then follows; that, too, can give a useful first pointer for understanding the issue. It is especially useful for still getting something out of bug reports from users when it’s not possible to reproduce the bug locally, as well as when pbrt is used for assignments in classes. In the latter case, the conversation often goes something like this:

  • “pbrt is buggy! It crashes when I call the function to normalize a vector.”
  • “That’s interesting—what does it print when it crashes?”
  • (pbrt’s output)
  • “That’s not a crash; it’s a failing assertion. The problem is that the foo() function that you added there is passing a degenerate vector to the vector normalization routine.”

Given that students often don’t seem to read that output in the first place, I’m not sure if any lessons are being learned about the value of assertions through that exercise, but you can at least work through that cycle much more quickly if it doesn’t require the student to fire up the debugger to provide more information.

Resilience Versus Rigidity

When an assertion fails, a program generally terminates. That’s a harsh punishment, especially if the program is well into a lengthy computation. One can treat failed assertions as exceptions and terminate just part of the computation (maybe just a small part, like a single ray path), or one can try to recover from the failing case and carry on. How to approach all this is something of a philosophical question.

A widely-accepted principle about assertions is that they should not be used for error handling: invalid input from the user should never lead to an assertion failure but rather should be caught sooner (and a helpful error message printed, even if the program then terminates). An assertion failure should only represent an actual bug in the system: a mistake on the programmer’s side, not on the user’s, even if something goofy provided by the user is what tripped up the program. That to me seems like an unquestionably good principle.

But even with assertions limited to errors in the implementation, what else might one do when one fails? One might try to recover, patching over the underlying issue (for example, forcing the third probability to zero in the majorant case), but that approach isn’t fully satisfying. One issue is that the code paths for the error cases will only run rarely, so they won’t be well tested—it’s then hard to have confidence in their correctness.
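
In the majorant example, such a recovery might look something like the following sketch (not pbrt’s actual approach), where the assertion is replaced with a clamp:

Float pNull = 1 - pAbsorb - pScatter;
if (pNull < 0) {
    // Silently patch over an invalid majorant. This branch runs
    // rarely, so it is itself poorly tested, and the evidence of the
    // underlying bug is discarded.
    pNull = 0;
}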

For a commercial product (or one that is not open source), not annoying your users with an unexpected program termination is probably a good idea, though I have to say that in my experience the error handling you get is often not much better.

More optimistically, assertion failures represent useful data points. Papering over them is ignoring evidence of a deeper issue. Perhaps your code for recovering from the failed assertion is running all the time and there’s a massive bug lurking but you have no idea it exists in the first place.

So I have come to believe that the best approach is to be strict, at least for a system like pbrt. Include error handling code to deal with invalid user input, add cases as necessary to make your algorithms general-purpose and robust, but when things go wrong in a way that you hadn’t thought was possible, don’t try to muddle through it—fail if a null vector is to be normalized and abort if the majorants are seriously off. Those sorts of unexpected cases merit investigation and resolution. By making them impossible to ignore you reduce the chance of letting something serious fester for a long time. It’s an annoyance in the moment, but it makes the system much more robust in the end.

Track Down Rare Failures(!)

About not letting things fester… One of the reasons I’ve come to the rigidity view is an experience I had with the first version of pbrt. That version was more on the resilience side of things, or perhaps it was just negligence. Over the course of rendering the image below it would always print a handful of warnings about rays having not-a-number (NaN) values in their direction vectors.

I expected that something obscure was occasionally going wrong in the middle of BSDF sampling, but I didn’t dig in for years after first seeing those warnings. Part of my laziness came from the (correct) assumption that it would be painful to debug, since the warnings didn’t appear until rendering had gone on for some time. The underlying bug didn’t seem important to fix since it happened so rarely.

Eventually I chased it down. As with many difficult bugs, the fix was a single-character change: a greater-than that should have been a greater-than-or-equals—“equals” being a case that otherwise led to a division by zero.

        // Handle total internal reflection for transmission
-       if (sint2 > 1.) return 0.;
+       if (sint2 >= 1.) return 0.;

When I rendered that scene afterward, not only were the warnings gone, but the entire rendering computation was \(1.25\times\) faster than it was before. I couldn’t understand why that would be so and spent hours trying to figure out what was going on. At first I assumed the speedup must be due to something else, like a different setting for compiler optimizations, but I found that it truly was entirely due to that one-character fix.

Eventually I got to the bottom of it. Here is where things were going catastrophically wrong—with a few lines of code elided, this is the heart of the kd-tree traversal code in pbrt-v1:

int axis = node->SplitAxis();
float tplane = (node->SplitPos() - ray.o[axis]) * invDir[axis];
// ...
if (tplane > tmax || tplane <= 0) {
    // visit first child node next
} else if (tplane < tmin) {
    // visit second child node next
} else {
    // enqueue second child to visit later and visit first child next
}

Consider that code through the lens of not-a-number. There are two rules to keep in mind: a calculation that includes a NaN will yield a NaN, and any comparison that includes a NaN evaluates to false. (Thus, the fun idiom of testing x == x as a way to check for a NaN.) Above, tplane will be NaN since the inverse ray direction is NaN. The condition in the first “if” test will be false, since both comparisons include a NaN. The condition in the second “if” test will also be false. In turn, the third case is always taken and every node of the kd-tree will be visited.
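
Both rules are easy to verify with a standalone program (assuming standard IEEE floating point; aggressive options like -ffast-math break them):

#include <cassert>
#include <limits>

int main() {
    float x = std::numeric_limits<float>::quiet_NaN();
    assert(!(x > 1.f));   // comparisons with a NaN are false...
    assert(!(x <= 0.f));  // ...in either direction,
    assert(!(x == x));    // ...including against itself (the x == x idiom).
    float y = x * 0.f + 1.f;
    assert(!(y == y));    // arithmetic involving a NaN yields a NaN
    return 0;
}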

Thus, a NaN-direction ray is intersected with each and every primitive in the scene. For a complex scene, that’s a lot of intersection tests, and thus the performance impact of just a handful of those rays was substantial. Good times.
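
One defensive takeaway, sketched here with pbrt-style macros (this particular check is my illustration, not code from pbrt-v1): assert that ray directions are valid before traversal begins, so that the failure is reported where it is cheap to diagnose rather than surfacing as mysteriously slow rendering.

// Hypothetical check at the entry to acceleration-structure traversal.
DCHECK(!std::isnan(ray.d.x) && !std::isnan(ray.d.y) && !std::isnan(ray.d.z));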

Conclusion

Here we are with two posts in a row that consist of me arguing for a particular way of doing things and then ending with a story about me not practicing what I’m preaching. One could take this to mean that I don’t know what I’m talking about, or one could take it to mean that my pain has the potential to be your gain. Either way works for me.

More generally, I’ve come to learn that if something seems a little stinky or uncertain in code, it really is worth stopping to take the time to chase down whether there is in fact something wrong. You have in hand evidence of a problem in a particular place in a system—that’s valuable. If you ignore it and there is a bug there, often that bug will later manifest itself in a way that’s much more obscure, maybe not evidently connected to that part of the system at all. You end up spending hours chasing it down just to discover that if you had investigated the questionable behavior when you first encountered it, you’d have fixed the underlying issue much earlier and much more easily.

Notes

  1. Goldstine and von Neumann. 1948. Planning and Coding of Problems for an Electronic Computing Instrument. Technical report, Institute for Advanced Study.

  2. To my previous frequent frustration, the CHECK macros in Google’s glog package do not print floating-point values with their full precision, which leads to error messages like Check failed: x != 0 with x = 0 being printed when x is very small but not actually zero. This is another reason pbrt provides its own CHECK macros.