Accurate Differences of Products with Kahan’s Algorithm
I’ve recently returned to the land of floatingpoint error analysis to try to polish up a few more details for the next edition of Physically Based Rendering. Floatingpoint is an interesting corner of computation, full of surprises (good and bad) and neat tricks to help with the bad. Along the way, I came across this posting on StackOverflow, which pointed me at a nifty algorithm for accurately computing \(a \times bc \times d\).
Before we dig into that algorithm, what’s so tricky about \(a \times bc \times d\), anyway? Consider \(a=33962.035\), \(b=30438.8\), \(c=41563.4\), and \(d=24871.969\). (Those are actual values that came up during a run of pbrt.) With 32bit floats, \(a \times b=1.03376365 \times 10^9\) and \(c \times d=1.03376352 \times 10^9\). Subtract those, and you get \(128\). If you do the math in double precision and convert to float at the end, you get \(75.1656\). What happened?
The problem is that the value of each product is way off in \(1 \times 10^9\) land, where the spacing between representable floating point values is huge—it’s 64. Thus, when \(a \times b\) and \(c \times d\) are respectively rounded to the nearest representable float, they turn into multiples of 64. In turn, their difference will be a multiple of 64 and there’s no hope of coming any closer to \(75.1656\) than \(64\). In this case, the result was even farther away, due to the way the two products were rounded at \(1 \times 10^9\). We’ve run straight into good old catastrophic cancellation.^{1}
Here’s that better approach:^{2}
inline float DifferenceOfProducts(float a, float b, float c, float d) {
float cd = c * d;
float err = std::fma(c, d, cd);
float dop = std::fma(a, b, cd);
return dop + err;
}
DifferenceOfProducts()
computes \(a \times bc \times d\) in a way that avoids
catastrophic cancellation. It was originally described by the legend,
William Kahan, in On the Cost of FloatingPoint Computation Without
ExtraPrecise
Arithmetic. Incidentally,
Kahan’s write ups are generally fun reads, full of commentary about the
state of the floatingpoint world as well as math and technical insight.
That one’s conclusion includes:
Those of us who have grown old fighting against the vagaries of floatingpoint arithmetic and illconceived compiler “optimizations” take pride in our victories in that battle. But bequeathing the same battle to the generations that follow us would be contrary to the essence of civilization. Our experience indicts programming languages and development systems as sources of too many of the vagaries against which we have had to fight. Too many are unnecessary, as are certain of the tempting “optimizations” safe for integers but occasionally fatal for floatingpoint.
With that salt given its due appreciation, back to
DifferenceOfProducts()
: its use of fused multiplyadd (FMA) instructions
is the key to its cleverness.^{3} Mathematically, FMA(a,b,c)
is
\(a \times b+c\), so at first it may appear to just be useful as a
microoptimization: one instruction in the place of two. However, FMA
offers something special—it only rounds once.
With regular old \( a \times b+c \), first \(a \times b\) is computed, and that value, which generally can’t be represented exactly in floating point, is rounded to the nearest float. Then, \(c\) is added to that rounded value, and that result is again rounded to the nearest float. FMA is implemented so that rounding only happens at the end—the intermediate value of \(a \times b\) is maintained at sufficient accuracy so that when \(c\) is added to it, the final result is the closest float to the true value of \(a \times b+c\).
With that understanding of FMA in mind, back to
DifferenceOfProducts()
. Here again are the first two lines:
float cd = c * d;
float err = std::fma(c, d, cd);
The first computes the rounded value of \(c \times d\), and the second… subtracts
\(c \times d\) from its product? If you didn’t know how FMAs work, you’d think that
err
would always be zero. With an FMA, the second line actually extracts
the amount of rounding error in the computed value of \(c \times d\) and stores it
in err
. From there on out, it’s a straight shot:
float dop = std::fma(a, b, cd);
return dop + err;
A second FMA computes the difference of products with an FMA, rounding only
at the end. Thus, it’s immune to catastrophic cancellation, but it had to
work with the rounded value of \(c \times d\). The return
statement patches that
up by adding in the error extracted in the second line. Jeannenrod et
al. showed that the
result is correct to 1.5 ulps, which is stellar: FMA and the basic
floatingpoint operations are accurate to 0.5 ulps, so that’s almost as
good as could be.
Using the New Hammer
DifferenceOfProducts()
turns out to be useful surprisingly often once you
start looking for places to apply it. Computing a quadratic discriminant?
Call DifferenceOfProducts(b, b, 4 * a, c)
.^{4} Computing the determinant
of a 2x2 matrix? It’s got you covered. I count over 80 uses of it in the
implementation of next version of pbrt. Of them, my favorite is the
cross product function. It had always been a source of trouble, which
eventually led to hands being thrown up and doubles being used in the
implementation in order to avoid catastrophic cancellation:
inline Vector3f Cross(const Vector3f &v1, const Vector3f &v2) {
double v1x = v1.x, v1y = v1.y, v1z = v1.z;
double v2x = v2.x, v2y = v2.y, v2z = v2.z;
return Vector3f(v1y * v2z  v1z * v2y,
v1z * v2x  v1x * v2z,
v1x * v2y  v1y * v2x);
}
Now, we can stick with floats and use DifferenceOfProducts()
.
inline Vector3f Cross(const Vector3f &v1, const Vector3f &v2) {
return Vector3f(DifferenceOfProducts(v1.y, v2.z, v1.z, v2.y),
DifferenceOfProducts(v1.z, v2.x, v1.x, v2.z),
DifferenceOfProducts(v1.x, v2.y, v1.y, v2.x));
}
That tricky example at the start of this post was actually part of a cross product. At some point pbrt needed to compute the cross product of the vectors \((33962.035, 41563.4, 7706.415)\) and \((24871.969, 30438.8, 5643.727)\). Computed using floats, the resulting vector would have been \((1552, 1248, 128)\). (As a general rule, getting notverybig integer values back from a floatingpoint calculation that involved larger numbers is a pretty good tell that you’ve got catastrophic cancellation going on in there.)
In double precision, that cross product is \((1556.0276, 1257.5151,
75.1656)\) and we can see that with floats, \(x\) was sort of ok,
\(y\) was getting meh, and \(z\) supplied the disaster we used for
motivation. With DifferenceOfProducts()
and floats all the way?
\((1556.0276, 1257.5153, 75.1656)\). \(x\) and \(z\) match double
precision exactly and \(y\) is just barely off—there’s that extra ulp.
And what about performance? DifferenceOfProducts()
does two FMAs, a
multiply, and an add. The naive algorithm can be implemented with one FMA
and one multiply, which might seem that it would take half as much time.
In practice, it doesn’t seem to matter much once you’ve got the values at
hand in registers: in a synthetic benchmark on my laptop, I measured
DifferenceOfProducts()
to be only 1.09x more expensive than the naive
algorithm. Double precision was 2.98x slower.
Once you start to become aware of catastrophic cancellation, all sorts of
innocuouslooking expressions in code can start to make you nervous.
DifferenceOfProducts()
turns out to be good medicine for a good number of
them. It’s easy to use, and there’s not much reason to not use it widely.
notes

Catastrophic cancellation isn’t a worry when subtracting quantities with different signs or when adding quantities with the same sign. Conversely, it can be a problem when adding values with different signs. Thus, sums should generally be looked at with the same wariness as differences. ↩

I’ll leave it as an exercise to the reader to work out a
SumOfProducts()
function that protects against catastrophic cancellation. For a slightly more challenging exercise, explain why, inDifferenceOfProducts()
,dop + err == dop
if the signs ofa*b
andc*d
are different. ↩ 
An FMA instruction has been available on GPUs for over a decade, and on most CPUs for at least five years now. On CPUs, you may need to add compiler flags to get your compiler to emit it directly when you use
std::fma()
;march=native
works on gcc and clang. ↩ 
In IEEE floatingpoint, multiplication by powers of two is exact, so
4 * a
doesn’t incur any rounding error, assuming there is no overflow. ↩