Debugging Your Renderer (5/n): Rendering Deterministically

Deterministic program execution has a lot going for it. For most programs, it’s the natural way of being: for any particular input, the program generates the same output. Determinism makes debugging much easier, as it saves you from having to re-run the system repeatedly to trigger a bug that only happens sometimes, and it’s great for end-to-end tests, since you can safely make strict assertions about cases where the program’s output should remain absolutely unchanged (e.g., that float parser example).

However, deterministic execution doesn’t always come naturally when you’re rendering, especially when you’re rendering in parallel. Today’s post will go into some of the ways that deterministic execution can be lost, talk about how to maintain determinism, and then finish with some further discussion of its benefits.

The Basics

To start, let’s settle on a more precise definition of deterministic rendering than “same input gives same output.” It is too much to ask for bit accuracy in output across machines; not only will we encounter different standard math libraries with different levels of precision, but there are a number of corners of C++ that allow for things like variation in order of evaluation across compilers that can lead to innocuous differences in output.

Therefore, we’ll define the observable effect of determinism as: on a particular system with a particular compiler, repeatedly running the renderer on the same input always produces the same value at every pixel. Implicit in that definition is that the same computations are performed to compute each pixel’s value, though not necessarily in the same order. That definition is plenty for our needs; the benefits from nailing it down further almost certainly wouldn’t be worth the trouble.

A render running on a single core should naturally achieve that goal. If it does not, fixing that is the first order of business. Most likely it’s an uninitialized memory access, other memory corruption, or code somewhere that randomly seeds a random number generator based on something that varies like the process id or current time. (I won’t say more about fixing those sorts of problems here, as it’s all rendering-independent and is regular everyday debugging.)

Rendering in parallel is when things get more complicated. Indeed, none of the versions of pbrt before the latest, pbrt-v4, was deterministic. That was always a minor annoyance when debugging and testing the system, though I honestly didn’t realize what a productivity drag it was until determinism was achieved.

Consistent Samples

For rendering to be deterministic, the Monte Carlo sampling routines must use exactly the same random sample points at every sample taken in every pixel. If they do not, then determinism is lost from the start, since different rays will be traced each time due to slightly different rays leaving the camera, different sampling decisions will be made at intersections, and so forth. One might assume that determinism is the natural way of being for the Samplers that generate those points, but that was not so prior to pbrt-v4. There were two issues: the placement of low discrepancy point sets and carried state in samplers that led to nondeterminism with multithreading.

When using low discrepancy point sets like Halton points, pbrt-v3 aligns the origin of the points with the upper left pixel of the image. That’s normally \((0,0)\), but if the user specifies a crop window to render just part of the image, the low discrepancy points all shift in compensation. That was always a bother for debugging, since you couldn’t narrow in on a problem pixel without perturbing all of the samples, which often meant no longer hitting the bug. That detail was easy enough to fix given attention to it.

The other issue came from the fact that each thread maintains its own Sampler instance. This way samplers can maintain state that depends on the current pixel and pixel sample (e.g., an offset into the Halton sequence). Many samplers also use pseudorandom number generators (RNGs) in their work; those, too, are per-sampler state. (For example, the stratified sampler uses an RNG to jitter sample locations and low discrepancy samplers use RNGs for randomization via scrambling.)

In pbrt-v3, those per-sampler RNGs are seeded once at system startup time and then chug along, generating random numbers as requested. Because threads are dynamically assigned to work on regions of the image, they may not work on the same pixels over multiple runs. In turn, the values that an RNG returns at a pixel depend both on which thread was assigned that pixel and on how many random numbers it had supplied previously for other pixels.

The fix was easy: reseed the RNG before generating sample points at a particular pixel sample. The Sampler interface includes a StartPixelSample() method that is called before samples are requested at a given pixel sample, so it’s just a few lines of code to put those RNGs in a known state. Here’s that method in IndependentSampler, which generates uniform independent samples without any further nuance:

void StartPixelSample(Point2i p, int sampleIndex, int dimension) {
    rng.SetSequence(Hash(p, seed));
    rng.Advance(sampleIndex * 65536ull + dimension);
}

There are two things to note in StartPixelSample()’s implementation. First, pbrt uses the PCG RNG, which allows the specification of both a particular sequence of pseudorandom values and an offset into that sequence. Thus, we choose a sequence according to the pixel coordinates and then offset into it according to the index of the sample being taken in the pixel and the starting dimension.

The other thing to mention there is Hash(), which has been useful all over the place in pbrt-v4. Here is its signature:

template <typename... Args> uint64_t Hash(Args... args);

You can pass a bunch of values or objects straight away to it and it marshals them up and passes them to MurmurHash to hash them.¹ In its use in the IndependentSampler, we also allow the user to specify a seed for random number generation; Hash() makes it simple to mush that together with the current pixel coordinates to choose a pseudorandom sequence for the current pixel.
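To make the marshal-then-hash idea concrete, here is a minimal sketch. It is not pbrt's implementation: HashSketch and Fnv1a64 are hypothetical names, and 64-bit FNV-1a is swapped in for MurmurHash purely to keep the example short. It assumes C++17 and arguments that are trivially copyable.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// 64-bit FNV-1a over a byte range; a simple stand-in for MurmurHash.
inline uint64_t Fnv1a64(const unsigned char *data, size_t n) {
    uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (size_t i = 0; i < n; ++i) {
        h ^= data[i];
        h *= 1099511628211ULL;  // FNV prime
    }
    return h;
}

// Hypothetical analogue of Hash(): copy the raw bytes of every argument
// into one contiguous buffer, then hash that buffer.
template <typename... Args>
uint64_t HashSketch(Args... args) {
    static_assert((std::is_trivially_copyable_v<Args> && ...),
                  "arguments are hashed via their in-memory bytes");
    constexpr size_t sz = (sizeof(Args) + ... + 0);
    unsigned char buf[sz > 0 ? sz : 1] = {};
    size_t offset = 0;
    // Append each argument's bytes in order.
    ((std::memcpy(buf + offset, &args, sizeof(Args)), offset += sizeof(Args)), ...);
    return Fnv1a64(buf, sz);
}
```

Because the result depends only on the bytes passed in, HashSketch(3, 7) returns the same value on every call within a given build, which is the property the samplers rely on.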

There is, needless to say, a short unit test that ensures all of the samplers consistently generate the same sample values.
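The sequence-plus-offset reseeding described above can be sketched in a self-contained form. Everything here is an assumption made for illustration: Pcg32 is a stripped-down PCG32 (pbrt's RNG class differs in detail), MixPixelSeed is an arbitrary bit mixer standing in for Hash(p, seed), and StartPixelSampleSketch is a hypothetical free-function version of the method shown earlier.

```cpp
#include <cstdint>

// Minimal PCG32 sketch (XSH-RR output function); not pbrt's RNG class.
struct Pcg32 {
    static constexpr uint64_t kMult = 6364136223846793005ULL;
    uint64_t state = 0;
    uint64_t inc = 1;  // always odd; selects which sequence is generated

    // Select a sequence and rewind to its start, discarding any prior state.
    void SetSequence(uint64_t sequenceIndex) {
        state = 0;
        inc = (sequenceIndex << 1) | 1u;
        Next();
    }

    uint32_t Next() {
        uint64_t old = state;
        state = old * kMult + inc;
        uint32_t xorShifted = uint32_t(((old >> 18) ^ old) >> 27);
        uint32_t rot = uint32_t(old >> 59);
        return (xorShifted >> rot) | (xorShifted << ((32u - rot) & 31u));
    }

    // Jump ahead by delta steps in O(log delta) time by repeatedly
    // squaring the affine step state -> state * kMult + inc.
    void Advance(uint64_t delta) {
        uint64_t curMult = kMult, curPlus = inc;
        uint64_t accMult = 1, accPlus = 0;
        while (delta > 0) {
            if (delta & 1) {
                accMult *= curMult;
                accPlus = accPlus * curMult + curPlus;
            }
            curPlus = (curMult + 1) * curPlus;
            curMult *= curMult;
            delta >>= 1;
        }
        state = accMult * state + accPlus;
    }
};

// Hypothetical stand-in for Hash(p, seed): mixes pixel coordinates and seed.
inline uint64_t MixPixelSeed(int x, int y, uint64_t seed) {
    uint64_t v = (uint64_t(uint32_t(x)) << 32) | uint32_t(y);
    v ^= seed * 0x9E3779B97F4A7C15ULL;
    v ^= v >> 31;
    v *= 0x7fb5d329728ea185ULL;
    v ^= v >> 27;
    v *= 0x81dadef4bc2dd44dULL;
    v ^= v >> 33;
    return v;
}

// Put the RNG in a state that depends only on the pixel, sample index,
// dimension, and seed, never on what the RNG was used for before.
inline void StartPixelSampleSketch(Pcg32 &rng, int x, int y, int sampleIndex,
                                   int dimension, uint64_t seed) {
    rng.SetSequence(MixPixelSeed(x, y, seed));
    rng.Advance(uint64_t(sampleIndex) * 65536ull + uint64_t(dimension));
}
```

Two generators with different histories produce identical values once both have been passed through StartPixelSampleSketch() with the same arguments, so it no longer matters which thread happens to pick up a pixel.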

Other Moments of Randomness

Samplers were much of the trouble in bringing pbrt-v4 into the land of deterministic output, though two other places in the system that made random decisions without the involvement of a sampler needed attention.

First was a stochastic alpha test, deep in the primitive intersection code. For shapes that have an alpha texture assigned to them, we’d like to ignore any intersections where the alpha texture is zero and randomly accept ones with fractional alpha with probability according to their alpha value. The sampler isn’t available in the ray intersection routines and keeping a persistent RNG in that code has obvious problems, so here is what we do instead:

if (Float a = alpha.Evaluate(si->intr); a < 1) {
    // Possibly ignore intersection based on stochastic alpha test
    Float u = (a <= 0) ? 1.f : HashFloat(ray.o, ray.d);
    if (u > a) {
        // Ignore this intersection and trace a new ray
        [...]

Given a less-than-one alpha value, a call to HashFloat() gives a uniform random floating-point value between 0 and 1. It’s a buddy of Hash() and is also happy to take whichever-all values you pass it to turn into a random floating-point value. (Above, it’s the ray origin and direction.)

template <typename... Args>
Float HashFloat(Args... args) {
    return uint32_t(Hash(args...)) * 0x1p-32f;
}

Thus, the results are deterministic for any given ray.
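A self-contained sketch of that idea follows, with several assumptions: Vec3 is a hypothetical padding-free stand-in for pbrt's point and vector types, HashFloatSketch uses 64-bit FNV-1a rather than MurmurHash, Float is taken to be float, and PassesAlphaTest is a hypothetical helper that mirrors the logic of the excerpt above rather than reproducing pbrt's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Vec3 {
    float x, y, z;  // three floats, so no padding between members
};

// Hash the raw bytes of the ray origin and direction (FNV-1a stand-in)
// and map the low 32 bits to a float, as HashFloat() does.
inline float HashFloatSketch(const Vec3 &o, const Vec3 &d) {
    unsigned char buf[2 * sizeof(Vec3)];
    std::memcpy(buf, &o, sizeof(Vec3));
    std::memcpy(buf + sizeof(Vec3), &d, sizeof(Vec3));
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < sizeof(buf); ++i) {
        h ^= buf[i];
        h *= 1099511628211ULL;
    }
    return uint32_t(h) * 0x1p-32f;
}

// Returns true if the intersection is kept. The decision is a pure
// function of alpha and the ray, so it repeats exactly from run to run.
inline bool PassesAlphaTest(float alpha, const Vec3 &o, const Vec3 &d) {
    if (alpha >= 1) return true;
    float u = (alpha <= 0) ? 1.f : HashFloatSketch(o, d);
    return !(u > alpha);
}
```

No RNG state is carried between calls, so the outcome for a given ray cannot depend on which thread traced it or on what was traced beforehand.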

The second case was in pbrt-v4’s LayeredBxDF class, which implements Guo et al.’s algorithm for stochastic evaluation and sampling of the BRDFs of layered materials. That needs an unbounded number of independent random samples, so we instantiate an RNG for each evaluation, but seed it via the incident and outgoing directions. Thus again, for any pair of directions passed to the BRDF evaluation method, the same set of random samples will be generated and the returned value will be deterministic.

Consistent Pixel Sums

With what we have so far, the same rays will be traced each time the renderer runs and in turn, if an assertion fires along the way, it will do so consistently. That’s a big benefit for debugging, but we have not yet achieved deterministic output, which is important for making end-to-end tests maximally useful.

The remaining challenge lies in summing sample values to compute each pixel’s final value. Because floating-point addition is not associative, if the image samples that contribute to a pixel are not accumulated carefully, the order of summation may differ across runs of the program and so the output may change. That was a problem in pbrt-v3 due to how it computed final pixel values: there, the image is decomposed into rectangular regions that are assigned to threads and threads generate samples within their regions, updating the pixels that each sample contributes to.

This figure illustrates the problem with that, showing all of the samples that contribute to a particular output pixel (black dot):

We have two threads responsible for adjacent \(4 \times 4\) pixel regions of the image (thick boxes). For an output image pixel near the boundary of the two regions that has a reconstruction filter that is wider than the pixel spacing (shaded circle), some of the samples that contribute will be taken by thread 1 (orange dots) and some will come from samples taken by thread 2 (blue dots). Because the threads are independent, the filtered sample values are not accumulated in a deterministic order and thus, the final pixel value is not deterministic.
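The order dependence is easy to reproduce in isolation. The sketch below uses a hypothetical SumInOrder() helper and values chosen to exaggerate the effect; it assumes IEEE single-precision arithmetic with round-to-nearest and no fast-math style reordering by the compiler.

```cpp
#include <vector>

// Accumulate in exactly the order given, the way a thread adds filtered
// sample contributions to a pixel as they arrive.
inline float SumInOrder(const std::vector<float> &values) {
    float sum = 0.f;
    for (float v : values) sum += v;
    return sum;
}
```

Under those assumptions, SumInOrder({1e8f, -1e8f, 1.f}) gives 1 while SumInOrder({1e8f, 1.f, -1e8f}) gives 0: in the second ordering the 1 is absorbed when it is added to 1e8, where adjacent floats are 8 apart. Real sample values differ far less dramatically, but any difference in the last bit is enough to break a strict image comparison.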

pbrt-v4 addresses this issue by adopting Ernst et al.’s filter importance sampling approach. Independent samples are taken for each output pixel, with no sample sharing with other pixels. If only a single thread works on a pixel at a time, then the samples for each output pixel are naturally generated in a consistent order, giving a consistent sum. (Filter importance sampling has a number of additional advantages that are detailed in the paper, including better preservation of the benefits of high-quality sampling patterns.) With that tuned up, we (almost) have deterministic output.

Those Pesky Splats

One more thing… pbrt-v4’s output is not quite deterministic if a light transport algorithm that traces paths starting from the light sources is being used. In that case, light path vertices are splatted into the image at whichever pixel they are visible; if multiple threads end up splatting into the same pixel, then we are back to nondeterminism from unordered floating-point addition.

This issue could be addressed by having each thread splat into its own image and then summing the images at the end, though that would incur a cost in memory use that scales with the number of threads. Alternatively, we might use fixed-point rather than floating-point to store those pixel values. For now that issue is unaddressed; it rarely causes any trouble, especially since those splatted values are accumulated in double precision and generally converted all the way down to half-float precision for storage. Most of the time that loss of precision hides any sloppy sums.
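As a rough sketch of the fixed-point alternative (not something pbrt-v4 does), each splat can be quantized to an integer before it is added. Integer addition is associative and commutative, so the total no longer depends on the order in which threads arrive. FixedPointSplat and its 2^-32 resolution are hypothetical choices; the range is limited to magnitudes below 2^31, and overflow handling is omitted.

```cpp
#include <atomic>
#include <cmath>
#include <cstdint>

// Hypothetical fixed-point splat accumulator for one pixel channel.
class FixedPointSplat {
  public:
    // Quantize to a multiple of 2^-32, then add atomically. Each value is
    // rounded on its own, so the integer total is order-independent.
    void Add(double v) {
        int64_t q = std::llround(v * kScale);
        sum_.fetch_add(q, std::memory_order_relaxed);
    }

    double Value() const {
        return double(sum_.load(std::memory_order_relaxed)) / kScale;
    }

  private:
    static constexpr double kScale = 4294967296.0;  // 2^32
    std::atomic<int64_t> sum_{0};
};
```

The trade-off is a fixed dynamic range and a quantization error of up to 2^-33 per splat, which would need to be weighed against the radiance values a given scene produces.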

The Joys of --debugstart

The greatest benefit of deterministic rendering has been the ability to quickly iterate on bugs: you can add some logging code or more assertions, recompile, and re-render, confident that the new code will see the same inputs that triggered the bug. Samplers that give exactly the same samples at each pixel also mean that you can speed things up by just rendering a crop window or even a single pixel as you’re chasing a bug.

Even better, it was easy to go even further and add support for retracing just a single offending ray path. pbrt-v4 has a CheckCallbackScope class that uses RAII to register a callback function that will run if an assertion fails or if the renderer crashes. Here is how it is used in most of pbrt’s CPU integrators:

thread_local Point2i threadPixel;
thread_local int threadSampleIndex;

CheckCallbackScope _([&]() {
    return StringPrintf("Rendering failed at pixel (%d, %d) sample %d. Debug with "
                        "\"--debugstart %d,%d,%d\"\n",
                        threadPixel.x, threadPixel.y, threadSampleIndex,
                        threadPixel.x, threadPixel.y, threadSampleIndex);
});

As rendering proceeds, each thread keeps its thread-local threadPixel and threadSampleIndex variables up to date and if the renderer aborts due to an error, you get a message like:

Rendering failed at pixel (915, 249) sample 83. Debug with "--debugstart 915,249,83"

at the bottom of the crash output. If you then rerun pbrt passing it that --debugstart option, a specialized code path traces just that single ray path in the main thread of execution. That gives a simpler debugging context than launching a bunch of threads and waiting for the bug to hit again; it’s delightfully helpful for bugs that otherwise only happen after a substantial amount of time has gone by.
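The RAII registration pattern behind that message can be sketched as follows. This is a simplified, single-threaded illustration and not pbrt's CheckCallbackScope: FailureCallbackScope and CollectContext() are hypothetical names, and a real implementation would need to cope with scopes being opened on multiple threads.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: register a callback for the lifetime of a scope.
class FailureCallbackScope {
  public:
    explicit FailureCallbackScope(std::function<std::string()> callback) {
        Callbacks().push_back(std::move(callback));
    }
    // Leaving the scope unregisters the most recently added callback.
    ~FailureCallbackScope() { Callbacks().pop_back(); }

    FailureCallbackScope(const FailureCallbackScope &) = delete;
    FailureCallbackScope &operator=(const FailureCallbackScope &) = delete;

    // Intended to be called from the assertion-failure or crash path:
    // gathers context from every live scope, innermost first.
    static std::string CollectContext() {
        std::string message;
        const auto &callbacks = Callbacks();
        for (auto it = callbacks.rbegin(); it != callbacks.rend(); ++it)
            message += (*it)();
        return message;
    }

  private:
    static std::vector<std::function<std::string()>> &Callbacks() {
        static std::vector<std::function<std::string()>> callbacks;
        return callbacks;
    }
};
```

Because the callback is only invoked on failure, capturing the current pixel and sample index by reference costs nothing on the normal rendering path.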

Conclusion

We’ve made it past “detecting rendering bugs” and have made our way to “reliably replicating those bugs.” Next time will be a few thoughts about performance bugs before we get into actual debugging techniques.

note

The attentive reader of the Hash() implementation will note that if a struct or class that has padding between elements is passed to it, the results may be nondeterministic since it hashes their in-memory contents directly. It would be nice to use a C++ SFINAE trick to get a compilation error in that case, but I’m not aware of a way to detect that at compile time.
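One candidate worth evaluating is C++17's std::has_unique_object_representations trait, which is false for types with padding bits. The sketch below is only an illustration: the struct names and kSafeToHashBytes are hypothetical, and the results shown for the two structs assume a common ABI where int32_t is 4-byte aligned. A caveat is that the trait may also report false for floating-point types, so it could not be applied unchanged to calls that hash points and vectors of floats.

```cpp
#include <cstdint>
#include <type_traits>

struct PaddedExample {
    char tag;       // typically followed by 3 padding bytes
    int32_t value;
};

struct PackedExample {
    int32_t a;
    int32_t b;      // no padding on common ABIs
};

// Hypothetical guard: true only when equal values always have equal bytes,
// which rules out types containing padding.
template <typename T>
constexpr bool kSafeToHashBytes = std::has_unique_object_representations_v<T>;
```

A byte-hashing function could then reject unsuitable arguments with static_assert(kSafeToHashBytes<T>), at the cost of needing a separate path for floating-point data.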