The story of ispc: more on optimizations and performance (part 8)
It wasn’t quite fair to describe volta as a dumb compiler earlier; we’ll revisit that topic today.
Not only were a number of the language features carefully designed to map well to CPU hardware, but a number of custom optimization passes for LLVM IR turned out to be important for keeping performance competitive with intrinsics. Many of them were influenced by the early users at Intel and their close scrutiny of volta’s assembly output.
Uniform
The uniform qualifier is one of the most important language features for performance. uniform is a type qualifier that describes a value that’s the same across all of the SPMD program instances being executed. It corresponds to a scalar value and maps nicely to the CPU, which I’m told supports scalar computation as well as SIMD. The concept has been an easy one for programmers to pick up and has a direct mapping to how CPU intrinsics programmers structure their code.
Declaring a variable to be uniform can give two benefits:
- Any control flow based on a uniform value is just like control flow in regular programs: all of the SPMD program instances follow the same path, and we don’t need to worry about updating the execution mask.
- Memory access of uniform values is easy to handle and efficient: for example, a uniform read corresponds to a simple scalar load.
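To make those benefits concrete, here’s a small sketch of my own (not from the original post; the function and names are made up):
float f(float x, uniform int mode) {
    // scale is a single scalar value shared by all program instances;
    // reading it is a simple scalar load rather than a vector operation.
    uniform float scale = 2;
    // Branching on the uniform mode is ordinary scalar control flow: all
    // program instances take the same path, so no mask update is needed.
    if (mode == 0)
        return x * scale;
    return x - scale;
}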
My first exposure to the general idea of uniform came from the RenderMan Shading Language (RSL), which turns out to be a SPMD language that operates on grids of points to be shaded. It also has a uniform keyword, signifying a value that is the same for all points. As far as I know, RSL implementations never targeted SIMD CPU hardware, but scalar CPU implementations maintained a mask of which points were active and could apply uniform for similar benefits for control flow. Closing the circle, it was fun when Pixar posted a note about the benefits of using ispc for shaders a few years ago.
It turns out that RSL had originally been designed for custom SIMD hardware in the 1980s, and it also turns out there were precursors to uniform in other SPMD languages of that era that targeted multiprocessors; again, see the ispc paper for more on previous work in this area.
Minimizing masked instructions
Handling all the details of masked vector computation can cause a fair amount of bloat in x86 assembly. It turned out to be worthwhile to design some of volta’s language features to foster situations where the compiler could determine that all of the program instances were active and then take advantage of that in code generation.
One example of that is a specialized loop construct that volta provides, foreach.1 It describes a loop over one or more dimensions where SPMD program instances are mapped to a given range of values.
We’ll use this short volta function as an example in the following; presumably it’s obvious what it does:
void increment(uniform float ptr[], uniform int count) {
    foreach (i = 0 ... count) {
        ptr[i] += 1;
    }
}
Consider now a foreach loop over 130 values with an 8-wide SIMD target: there will be 16 loop iterations where the execution mask is all on that take care of the first 128 values. Then, there will be a single iteration at the end with a mixed mask for the ragged extra two elements. ispc/volta generates two versions of the code for the loop body, the first one specialized for an all-on mask.
Modern ispc generates this AVX2 assembly for the unmasked iterations, plus the extra few instructions to see if another loop iteration is needed and then jump to the appropriate place:
vaddps (%rdi,%rdx), %ymm0, %ymm1
vmovups %ymm1, (%rdi,%rdx)
Just what you’d want, unless you’re the sort of person who’s bothered by a possibly-unnecessary use of an unaligned vector store. Assuming count is large, the vast majority of iterations will just run that code.
There’s a lot more that has to be done in the general case. Here’s the code for the last iteration:
# Compute the execution mask: lane l is active only if i + l < count.
vmovd %eax, %xmm0
vpbroadcastd %xmm0, %ymm0
vpaddd LCPI0_1(%rip), %ymm0, %ymm0   # add the constant (0,1,...,7)
vmovd %esi, %xmm1
vpbroadcastd %xmm1, %ymm1
vpcmpgtd %ymm0, %ymm1, %ymm0         # compare against the broadcast count
# Convert the index i into a byte offset into the array.
shll $2, %eax
cltq
# Masked load, vector add of 1.0, masked store.
vmaskmovps (%rdi,%rax), %ymm0, %ymm1
vbroadcastss LCPI0_0(%rip), %ymm2
vaddps %ymm2, %ymm1, %ymm1
vmaskmovps %ymm1, %ymm0, (%rdi,%rax)
The first handful of instructions determine the execution mask, disabling the vector lanes that correspond to entries in the array past count. Then it’s necessary to use that mask with a masked load instruction, vmaskmovps, when loading the last values from the array so that we don’t accidentally read memory past the end of it. Then there’s the add. Then finally, a masked store when writing the results back so that we don’t clobber memory after the array.
Without foreach and the “all on” optimizations that it enabled, we’d be going through a lot of that each time through the loop. (And the intrinsics programmers would rightly roll their eyes and walk away.)
If you know that the number of items you have to process is always a multiple of the SIMD width, you can add something like this before the loop:
count &= ~7;
That’d be a no-op in practice, but it’s enough to let the compiler figure out that the second version of the loop body is never needed and so doesn’t have to be emitted.2 (By the “compiler”, I mean LLVM in this case; ispc goes ahead and emits IR for both cases, but after LLVM’s regular optimization passes have done their thing, LLVM’s dead code elimination pass takes care of business.)
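For example, here’s a hypothetical variant of increment() with that promise baked in; on an 8-wide target, LLVM can then discard the masked version of the loop body entirely:
void increment_multiple_of_8(uniform float ptr[], uniform int count) {
    count &= ~7;  // round down to a multiple of 8 (a no-op if it already is one)
    foreach (i = 0 ... count) {
        ptr[i] += 1;
    }
}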
There are other places beyond foreach where we can do similar things. When an application first calls from C/C++ to volta code, by definition all of the program instances are running, so the execution mask can statically be determined to be all on until SPMD control flow enters the picture. Until then, the same sorts of optimizations can also be applied.
Furthermore, not all’s lost when SPMD control flow starts happening: it can be worthwhile to check at runtime whether all of the program instances are active. volta/ispc provides separate control flow keywords that allow the programmer to indicate expected-to-be-coherent control flow: cif, cfor, and so forth.
As an example, when cif is used, volta/ispc generates code that tests the execution mask right after updating it for the effect of the ‘if’ condition. If it’s all on, then we can jump to a specialized code path that starts with the “mask all on” assumption. If it’s all off, then we can jump directly past the ‘if’ statement’s body. Otherwise we have to execute the if’s code with regular masking, as a regular if does. There’s obviously a cost from cif in terms of code duplication, but for the cases where all of the program instances are in fact active, it can give a meaningful performance benefit.
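Here’s a small sketch of my own (the function is hypothetical, not from the original post) of what using it looks like:
float clamped(float x) {
    // We expect x > 0 for all (or nearly all) program instances, so cif is
    // used: the generated code tests the execution mask and jumps to an
    // unmasked fast path when every instance takes the same branch.
    cif (x > 0) {
        return x;
    } else {
        return 0;
    }
}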
I felt it was important to make that an explicit language feature, rather than avoiding new keywords and trying to come up with reasonable heuristics for the optimizer to decide when to add that check and when not to. It’s admittedly a little extra mental overhead for the programmer, and I’m sure that there are cases where people write ispc code without using those keywords that would benefit from them. But again, it’s a case where I’d prefer that the compiler be straightforward and predictable in the code that it generates; it’s another case of focusing on the performance-oriented programmer as the most important type of user.
Efficient SPMD loads and stores
Loads and stores from SPMD programs running on SIMD hardware are “interesting”. In general, each program instance in the vector may be reading from or writing to completely different locations in memory. Further, some of them may be inactive, and must not issue reads or writes at all in that case. (We saw a bit of this issue earlier when we looked at implementing scatter.)
In general, we need to issue a gather instruction for a SPMD read and a scatter for a write; these allow full flexibility in the memory locations accessed. Even if these are available as native instructions (as they are in AVX-512), performance is much better if we use vector loads and stores in the cases where the memory locations accessed are contiguous—it’s better to only use gather and scatter when we really need it.
Unfortunately, we need to make this decision at compile time. With GPUs (as I understand how they work these days), all of this is pretty much a run-time distinction: the compiler just emits the equivalent of a gather or scatter, and the actual performance at run time depends on whether the locations accessed were coherent in various ways or not. (Avoiding bank conflicts and all that.) That’s much better, since the locations accessed are in general data-dependent and thus their coherence can never be known as well at compile time as at run time.
That’s not how CPUs work, so we’ll have to do our best in the compiler.
The volta front-end doesn’t at all try to be clever about this; other than scalar memory accesses through uniforms, it starts out by just generating tentative gathers and scatters for all SPMD reads and writes. volta declares a whole bunch of pseudo gather and scatter functions in LLVM IR, for example:
declare <WIDTH x float> @__pseudo_gather64_float(<WIDTH x i64>,
                                                 <WIDTH x MASK>) nounwind readonly
(WIDTH and MASK get set to concrete values via a macro expansion step.)
Then, for any SPMD memory read of floats (e.g. loading a SIMD vector’s worth of values in that increment() example above), on a 64-bit target, volta emits a call to __pseudo_gather64_float, providing a unique pointer for each SIMD lane as well as the execution mask.
Those pseudo functions are left undefined through early LLVM optimization passes. In time, custom volta optimization passes start trying to improve them. There’s a lot that can be done better:
- If all of the pointers have the same value, volta substitutes a scalar load and vector broadcast.
- If the SPMD instances can be determined to be reading contiguous memory locations (as in increment()), it uses a vector load.
- If a gather truly is necessary, or the compiler can’t be sure, then a gather it is. (And then the target-specific IR will either emit a native instruction or do the equivalent with a series of instructions.)
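To give a sense of how those cases show up in source code, here’s a sketch of my own (the function and variable names are made up, not from the original post):
void loads(uniform float a[], uniform int index[], uniform float out[],
           uniform int count) {
    foreach (i = 0 ... count) {
        float x = a[i];           // contiguous locations: becomes a vector load
        float y = a[index[i]];    // data-dependent locations: remains a gather
        out[i] = x + y;
    }
}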
I wish all the complexity around doing that wasn’t necessary, but it’s a significant performance cliff if a gather or scatter is emitted unnecessarily, so it was worth banging on it a lot. In the end, those optimization passes got a lot of attention and came to be pretty robust at detecting the appropriate patterns, which I think is the best that could be hoped for.
Later on, I spent some time implementing some slightly more sophisticated approaches to avoiding gathers for reads that could be better expressed as vector loads and shuffles. Consider this function:
void reduce(uniform float ptr[], uniform float result[], uniform int count) {
    foreach (i = 0 ... count) {
        result[i] = ptr[i/2];
    }
}
On an 8-wide target, it’d be better to issue a 4-wide load and do a vector shuffle—much faster than a gather. For a function like this:
void deinterleave(uniform float ptr[], uniform float a[],
                  uniform float b[], uniform int count) {
    foreach (i = 0 ... count) {
        a[i] = ptr[2*i];
        b[i] = ptr[2*i + 1];
    }
}
an experienced intrinsics programmer would use two vector loads and then some shuffles. When I try that with modern ispc, though, a gather is emitted; I swear those used to be taken care of by that optimization. Another little thing to dig into later.
Anyway, in the end, all this added up to about 6k LOC for custom LLVM optimization passes in ispc; so maybe the compiler wasn’t completely dumb after all. There isn’t too much deep compiler magic in all that, however, which I think makes the output of the compiler still fairly predictable.
Tomorrow: for an exciting time, make it an open sourcing volta time.
Next: The open source release and the end of volta
notes
1. volta’s foreach was inspired by a related construct in a GPU-targeted language that Mark Lacey, T. Foley, Jefferson Montgomery, and Geoff Berry were building. ↩
2. A fair criticism would be that we’re starting to head to the land of the programmer doing fiddly things to make the optimizer do what they want. I think there’s an interesting broader question about how a programmer might cleanly and directly provide the compiler with information about the characteristics of the data that the program will be processing for help in optimization. That wasn’t a problem that I ended up tackling in volta, though. ↩