Here we are, a year and a half after I posted an introduction that was full of talk about a forthcoming series of blog posts about debugging renderers. When I posted that I already had a text file full of notes and had the idea that I’d get through a series of 8 or so posts over the following few weeks.

…and it’s been nothing but crickets after that setup.

There’s no good reason for my poor follow-through, though this series did turn into one of those things that got more daunting to return to the longer time went by; I felt like the bar kept getting higher and that my eventual postings would have to make up for the bait and switch.

Now that I’m at it again, I can’t promise that these posts will make up for the wait; in general, you get what you pay for around here. But let’s reset and try getting back into it.

To get back in the right mood, here are a pair of images from the first time I tried to implement Greg Ward’s irradiance caching algorithm, back when I was in grad school:

In the left image (which was rendered from right to left for some reason), there was a bug that caused energy to grow without bound as the cache was populated (no doubt a missing factor of $$1/\pi$$ that led to a feedback loop). I always liked how that image went from ok to a little too bright to thermonuclear by the time it was halfway through. The image on the right is my eventual success, with a slightly different scene layout.

There’s nothing fun about an image that starts out ok and then goes bad, or about your renderer crashing after it’s been running for an hour with a stack trace 20 levels deep. There’s lots to be unhappy about:

• Things are broken, but they’re not utterly broken, which suggests that the underlying bug will be subtle and thus difficult to track down.

• There’s an enormous amount of state to reason about—the scene in all its complexity, all of the derived data structures, and everything that happened since the start of rendering until things evidently went wrong. Any bit of it may hold the problem that led to disaster.

• More specifically, the actual bug may be in code that ran long before the bug became evident; some incorrect value computed earlier that messed things up later, possibly in an indirect way. This is a particular challenge with algorithms that reuse earlier results, be it spatially, temporally or otherwise.

• It may be minutes or even hours into rendering before the bug manifests itself; each time you think you’ve fixed it, you’ve got to again wait that much longer to confirm that you’re right.

Anything you can do to avoid that sad situation reduces the amount of time you spend on gnarly debugging problems and in turn makes you more productive (and lets you have more fun, actually implementing new things rather than trying to make the old things work correctly). That goal leads to the first principle of renderer debugging:

Try to make it a conventional debugging problem (“given these inputs, this function produces this incorrect output”) and not an unbounded “this image is wrong and I don’t know why” problem.

One of the best ways to have more bugs be in the first category is to have a good suite of unit tests. There’s nothing glamorous about writing unit tests, at least in the moment, but they can give you a lot in return for not too much work. Not only does a failing unit test immediately narrow down the source of a bug to the few things that the test exercises, but it also generally gives you an easier debugging problem than a failure in the context of the full renderer.

## Starting Simple

A good unit test is crisp—easy to understand and just testing one thing. Writing tests becomes more fun if you embrace that way of going about it—it’s easy coding since the whole goal is to not be tricky, with the idea that you want to minimize the chance that your test itself has bugs. A good testing framework helps by making it easy to add tests; I’ve been using googletest for years, but there are plenty of others.

It’s good to start out by testing the most obvious things you can think of. That may be counter-intuitive—it’s tempting to start with devious tests that poke all the edge cases. However, if you think about it from the perspective of encountering a failing test, then the simpler the test is, the easier it is to reason about the correct behavior, and the easier debugging will be. (There is an analogy here to the old joke about the drunk searching for his car keys under the street light.) Only once the basics are covered in your tests is it worth getting more clever. If your simpler tests pass and only the more complex ones fail, then at least you can assume that simple stuff is functioning correctly; that may help you reason about why the harder cases have gone wrong.

Here is an example of a simple test from pbrt-v4. pbrt provides an AtomicFloat class that can atomically add values to a floating-point variable.1 This test ensures that AtomicFloat isn’t utterly broken.

TEST(FloatingPoint, AtomicFloat) {
    AtomicFloat af(0);
    Float f = 0.;
    EXPECT_EQ(f, af);

    f += 1.0251;
    af.Add(1.0251);
    EXPECT_EQ(f, af);

    f += 2.;
    af.Add(2.);
    EXPECT_EQ(f, af);
}


The test is as simple as it could be: it performs a few additions and makes sure that the result is the same as if a regular float had been used. It’s hard to imagine that this test would ever fail, but if it did, jackpot! We have an easy case to reason about and trace through.
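
For reference, here’s a sketch of what a class like AtomicFloat can look like, built on the atomic compare/exchange approach mentioned in the note at the end. (This is my own minimal version for illustration, not pbrt’s exact code.)

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Minimal AtomicFloat sketch: additions are performed with an atomic
// compare/exchange loop on the float's bit representation, since the
// C++ standard library doesn't provide atomic float addition directly.
class AtomicFloat {
  public:
    explicit AtomicFloat(float v = 0) : bits(FloatToBits(v)) {}

    operator float() const {
        return BitsToFloat(bits.load(std::memory_order_relaxed));
    }

    void Add(float v) {
        uint32_t oldBits = bits.load(std::memory_order_relaxed), newBits;
        do {
            // Recompute the sum from the freshest value each time the
            // compare/exchange fails due to a concurrent update.
            newBits = FloatToBits(BitsToFloat(oldBits) + v);
        } while (!bits.compare_exchange_weak(oldBits, newBits));
    }

  private:
    static uint32_t FloatToBits(float f) {
        uint32_t b;
        std::memcpy(&b, &f, sizeof(f));
        return b;
    }
    static float BitsToFloat(uint32_t b) {
        float f;
        std::memcpy(&f, &b, sizeof(b));
        return f;
    }
    std::atomic<uint32_t> bits;
};
```

Since a single thread performs exactly the same sequence of float additions as the plain `float` does, the test’s equality comparisons hold exactly.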

Here’s another example of a not-very-clever test from pbrt-v4. Most of the sampling functions there now provide an inversion function that goes from sampled values back to the original $$[0,1]^n$$ sample space. Thus, it’s worth checking that a round trip brings you back to (more or less) where you started. The following test takes a bunch of random samples u, warps them to directions dir on the hemisphere, then warps the directions back to points up in the canonical $$[0,1]^2$$ square, before checking that the result is pretty much back where it started.

TEST(Sampling, InvertUniformHemisphere) {
    for (Point2f u : Uniform2D(1000)) {
        Vector3f dir = SampleUniformHemisphere(u);
        Point2f up = InvertUniformHemisphereSample(dir);

        EXPECT_LT(std::abs(u.x - up.x), 1e-3);
        EXPECT_LT(std::abs(u.y - up.y), 1e-3);
    }
}


There’s not much to that test, but it’s a nice one to have in the bag. Once it passes, you can feel pretty good about your InvertUniformHemisphereSample function, at least if you have independent confidence that SampleUniformHemisphere works. And how long does it take to write? No more than a minute or two. Once it is passing, you can more confidently make improvements to the implementations of either of those functions knowing that this test has a good chance of failing if you mess something up.
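
For concreteness, here’s a standalone sketch of what such a pair of functions can look like, using the standard mapping where z comes directly from u.x and the azimuthal angle from u.y. (This is a simplified illustration, not pbrt’s exact code.)

```cpp
#include <algorithm>
#include <cmath>

constexpr float Pi = 3.14159265358979323846f;

struct Vec3 { float x, y, z; };

// Map a point in [0,1)^2 to a direction on the unit hemisphere (z >= 0):
// z is taken directly from ux and the azimuthal angle phi from uy.
Vec3 SampleUniformHemisphere(float ux, float uy) {
    float z = ux;
    float r = std::sqrt(std::max(0.f, 1.f - z * z));
    float phi = 2 * Pi * uy;
    return {r * std::cos(phi), r * std::sin(phi), z};
}

// Invert the mapping: read z and phi back off the direction.
void InvertUniformHemisphereSample(Vec3 d, float *ux, float *uy) {
    *ux = d.z;
    float phi = std::atan2(d.y, d.x);  // in (-pi, pi]
    if (phi < 0) phi += 2 * Pi;        // wrap to [0, 2*pi)
    *uy = phi / (2 * Pi);
}
```

Because the sampling routine is a simple change of variables, the inversion has a closed form, which is what makes the round-trip test so cheap to write.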

About succinctness in tests: Uniform2D in that test is a little thing I wrote purely to make unit tests more concise. It’s crafted to be used with C++ range-based for loops and here generates 1000 uniformly distributed 2D sample values to be looped over. It and a handful of other sample point generators save a few lines of code in each test that otherwise needs a number of random values of some dimensionality and pattern. I’ve found that just about anything that reduces friction when writing tests ends up being worthwhile, in that each of those things generally leads to more tests being written in the end.
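
As an illustration of what such a helper can look like, here’s a rough sketch of a Uniform2D-style class that works with range-based for loops. (This is my own approximation of the idea, not pbrt’s actual implementation.)

```cpp
#include <random>

struct Point2f { float x, y; };

// Iterable that yields n uniformly distributed points in [0,1)^2,
// designed for use with C++ range-based for loops.
class Uniform2D {
  public:
    explicit Uniform2D(int n, unsigned seed = 0) : n(n), seed(seed) {}

    class Iterator {
      public:
        Iterator(int i, unsigned seed) : i(i), rng(seed) {}
        Point2f operator*() {
            // Each dereference draws two fresh values from the RNG.
            std::uniform_real_distribution<float> u;
            return {u(rng), u(rng)};
        }
        Iterator &operator++() { ++i; return *this; }
        bool operator!=(const Iterator &o) const { return i != o.i; }
      private:
        int i;
        std::mt19937 rng;
    };

    Iterator begin() const { return Iterator(0, seed); }
    Iterator end() const { return Iterator(n, seed); }

  private:
    int n;
    unsigned seed;
};
```

With this in hand, a test body shrinks to `for (Point2f u : Uniform2D(1000)) { ... }`, which is exactly the kind of friction reduction that leads to more tests getting written.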

## The Challenge of Sampling

One of the challenges in implementing a Monte Carlo renderer is that the computation is statistical in nature; sometimes it’s hard to tell if a given sample value is incorrect or if it’s a valid outlier. Bugs often only become evident in the aggregate with many samples. That challenge extends to writing unit tests—for example, given a routine to draw samples from some distribution, how can we be sure the samples are in fact from the expected distribution?

The Right Thing to do is to apply proper statistical tests. For example, Wenzel has written code that applies a $$\chi^2$$-test to pbrt’s BSDF sampling routines. Those tests recently helped him chase down and fix a tricky bug in pbrt’s rough dielectric sampling code. Much respect for doing it the right way.
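
The flavor of such a test can be conveyed with a much simpler 1D example (my own sketch here, not Wenzel’s BSDF test code): draw samples from a known distribution, histogram them, and compute the $$\chi^2$$ statistic against the analytically expected bucket counts.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Chi^2 sketch for a 1D sampler: sample p(x) = 2x on [0,1] via its
// inverse CDF (x = sqrt(u)), histogram the samples into equal-width
// buckets, and compare observed counts against expected counts
// (F(x) = x^2, so the expected count in [x0, x1) is n * (x1^2 - x0^2)).
double ChiSquaredStatistic(int nSamples, int nBuckets) {
    std::vector<int> observed(nBuckets, 0);
    for (int i = 0; i < nSamples; ++i) {
        double u = (i + 0.5) / nSamples;  // stratified, for determinism
        double x = std::sqrt(u);          // inverse CDF of p(x) = 2x
        int b = std::min(int(x * nBuckets), nBuckets - 1);
        ++observed[b];
    }
    double stat = 0;
    for (int b = 0; b < nBuckets; ++b) {
        double x0 = double(b) / nBuckets, x1 = double(b + 1) / nBuckets;
        double expected = nSamples * (x1 * x1 - x0 * x0);
        double d = observed[b] - expected;
        stat += d * d / expected;
    }
    return stat;
}
```

In a real test the statistic would be compared against a critical value of the $$\chi^2$$ distribution with nBuckets − 1 degrees of freedom; a large value means the observed histogram is implausible under the claimed distribution.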

My discipline is not always as strong as Wenzel’s, though there are some more straightforward alternatives that are also effective. For example, pbrt has many little sampling functions that draw samples from some distribution. An easy way to test them is to evaluate the underlying function to create a tabularized distribution and to confirm that both it and the sampling method to be tested more or less generate the same samples with the same probabilities. As an example, here is an excerpt from the test for sampling a trimmed exponential:

auto exp = [&](Float x) { return std::exp(-c * x); };
auto values = Sample1DFunction(exp, 32768, 16, 0, xMax);
PiecewiseConstant1D distrib(values, 0, xMax);

for (Float u : Uniform1D(100)) {
    Float sampledX = SampleTrimmedExponential(u, c, xMax);
    Float sampledProb = TrimmedExponentialPDF(sampledX, c, xMax);

    Float discreteProb;
    Float discreteX = distrib.Sample(u, &discreteProb);
    EXPECT_LT(std::abs(sampledX - discreteX), 1e-2);
    EXPECT_LT(std::abs(sampledProb - discreteProb), 1e-2);
}


The Sample1DFunction utility routine takes a function and evaluates it in a specified number of buckets covering a specified range, returning a vector of values. PiecewiseConstant1D then computes the corresponding piecewise-constant 1D distribution. We then take samples using the exact sampling routine and the piecewise-constant routine and ensure that each sample value is approximately the same and each returned sample probability is close as well. (This test implicitly depends on both sampling approaches warping uniform samples to samples from the function with values of u close to zero at the lower end of the exponential and u close to one at the upper end, which is the case here.)

To be clear: SampleTrimmedExponential could still be buggy even when that test passes. One might fret about those fairly large 1e-2 epsilons used for the equality tests, for example. It is possible that the looseness of those epsilons might mask something subtly wrong, but we can at least trust that the function isn’t completely broken, off by a significant constant factor or the like.

Writing this sort of test requires trusting your functions for sampling tabularized distributions, but those too have their own tests; eventually one can be confident in all of the foundations. For example, one of them compares the computed results to a case where the expected result can be worked out by hand and ensures that they match.

## Preserving the Evidence

Another good use for unit tests is for isolating bugs, both for debugging them when they first occur and for ensuring that a subsequent change to the system doesn’t inadvertently reintroduce them.

Disney’s Moana Island scene helped surface all sorts of bugs in pbrt; many of them were fairly painful to debug since they were of the form of “render for a few hours before the crash happens.” For those, I found it useful to turn them into small unit tests as soon as I could narrow down what was going wrong.

Here’s one for a ray-triangle intersection that went bad. We have a degenerate triangle (note that the x and z coordinates are all equal), and so the intersection test should never return true. But for the specific ray here, it once did, and then things went south from there. Trying potential fixes with a small test like this was a nice way to work through the issue in the first place—it was easy to try a fix, recompile, and quickly see if it worked.

TEST(Triangle, BadCases) {
    Transform identity;
    std::vector<int> indices{0, 1, 2};
    std::vector<Point3f> p{Point3f(-1113.45459, -79.0496140, -56.2431908),
                           Point3f(-1113.45459, -87.0922699, -56.2431908),
                           Point3f(-1113.45459, -79.2090149, -56.2431908)};
    TriangleMesh mesh(identity, false, indices, p, {}, {}, {}, {});
    auto tris = Triangle::CreateTriangles(&mesh, Allocator());

    Ray ray(Point3f(-1081.47925, 99.9999542, 87.7701111),
            Vector3f(-32.1072998, -183.355865, -144.607635), 0.9999);

    EXPECT_FALSE(tris[0].Intersect(ray).has_value());
}


One thing to note when extracting failure cases like this is that it’s critical to get every last digit of floating-point values: if the floats you test with aren’t precisely the same as the ones that led to the bug, you may not hit the bug at all in a test run.
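
One way to capture values exactly (a workflow suggestion on my part, not anything pbrt-specific) is to print them as hexadecimal floats with the "%a" format, which represents the bits exactly, rather than relying on a rounded decimal string:

```cpp
#include <cstdio>
#include <cstdlib>

// Print a float as an exact hexadecimal float (e.g. with printf "%a")
// and parse it back; the round trip is bit-exact, unlike a decimal
// representation printed with too few significant digits.
float HexRoundTrip(float v) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%a", v);  // exact hex representation
    return std::strtof(buf, nullptr);
}
```

Pasting hexadecimal float constants into a test guarantees that the test runs with precisely the values that triggered the bug. (Decimal also works if you print enough digits, e.g. 9 significant digits for a 32-bit float, which is presumably how the constants in the triangle test above were produced.)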

## Never Defer Looking into a Failing Test

A cautionary tale to wrap up: a few months ago a bug report about a failing unit test in pbrt-v4 came in. It had the following summary:

• gcc-8.4 has stuck forever on ZSobolSampler.ValidIndices test
• gcc-9.3 passed all tests
• gcc-10.3 gives me the following message (in an eternal cycle) during tests

/src/pbrt/samplers_test.cpp:182: Failure
Value of: returnedIndices.find(index) == returnedIndices.end()
Actual: false
Expected: true

The ZSobolSampler implements Ahmed and Wonka’s blue noise sampler, which is based on permuting a set of low-discrepancy samples in a way that improves their blue noise characteristics. pbrt’s ZSobolSampler.ValidIndices test essentially just checks that the permutation is correct by verifying that the same sample isn’t returned for two different pixels. That test had been helpful when I first implemented the sampler, but it hadn’t caused any trouble for months by the time that bug report arrived.

When the bug report came in, I took a quick look at that test and couldn’t imagine how it would ever run forever. No one else had reported anything similar and so, to my shame, I assumed it must be a problem with the compiler installation on the user’s system or some other one-off error. I didn’t look at it again for almost two months.

When I gave it more attention, I immediately found that I could reproduce the bug using those compilers, just as reported. It was a gnarly bug—one that disappeared when I recompiled in a debug build and even disappeared in an optimized build that included debugging symbols. The bug would randomly disappear if I added print statements to log the program’s execution. Eventually I thought to try UBSan, and it saved the day, identifying this line of code as the problem:

int p = (MixBits(higherDigits ^ (0x55555555 * dimension)) >> 24) % 24;


0x55555555 is a signed integer, and multiplying it by dimension (an integer that starts at 0 and counts up from there) quickly led to overflow, which is undefined behavior (UB) in C++. In turn, gcc was presumably assuming that there was no UB in the program and optimizing accordingly, leading in one case to an infinite loop and in another to a bogus sample permutation.

At least the fix was easy—all is fine with an unsigned integer, where overflow is allowed and well-defined:

int p = (MixBits(higherDigits ^ (0x55555555u * dimension)) >> 24) % 24;
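
To see why the one-character change matters, here’s a tiny standalone illustration (with MixBits and the surrounding sampler state omitted): the unsigned product wraps modulo $$2^{32}$$, which is well-defined and perfectly fine for a hash-style scramble, while the same product with signed ints exceeds INT_MAX and is UB.

```cpp
#include <cstdint>

// With unsigned arithmetic, 0x55555555u * dimension wraps modulo 2^32,
// which is well-defined; the signed product overflows INT_MAX already
// at dimension == 2 (0x55555555 * 2 == 2863311530 > 2147483647),
// which is undefined behavior in C++.
uint32_t ScrambleDigits(uint32_t higherDigits, uint32_t dimension) {
    return higherDigits ^ (0x55555555u * dimension);
}
```
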


Leaving aside the joys of undefined behavior in C++, it was hard enough to chase that bug down with it already narrowed down to a failing test. If the bug had been something like “images are slightly too dark with gcc-10.3” (as could conceivably happen with repeated sample values, depending on how they were being repeated), it surely would have been an even longer and more painful journey. Score +1 for unit tests and -1 for me.

## Conclusion

We’re not done with testing! With the unit testing lecture over, next time it will be on to some thoughts about writing effective assertions and how end-to-end tests fit in for testing renderers.

## Note

1. That capability isn’t provided by the C++ standard library since floating-point addition is not associative, so different execution orders may give different results. For pbrt’s purposes, that’s not a concern, so AtomicFloat provides that functionality through atomic compare/exchange operations.