The story of ispc: more on optimizations and performance (part 8)
It wasn’t quite fair to describe volta as a dumb compiler earlier; we’ll revisit that topic today.
Not only were a number of the language features carefully designed to map well to CPU hardware, but a number of custom optimization passes for LLVM IR turned out to be important for keeping performance competitive with intrinsics. Many of them were influenced by the early users at Intel and their close scrutiny of volta’s assembly output.
Uniform
The uniform qualifier is one of the most important language features for performance. uniform is a type qualifier that describes a value that’s the same across all of the SPMD program instances being executed. It corresponds to a scalar value and maps nicely to the CPU, which I’m told supports scalar computation as well as SIMD. The concept has been an easy one for programmers to pick up and has a direct mapping to how CPU intrinsics programmers structure their code.
Declaring a variable to be uniform can give two benefits:
- Any control flow based on a uniform value is just like control flow in regular programs: all of the SPMD program instances follow the same path, and we don’t need to worry about updating the execution mask.
- Memory access of uniform values is easy to handle and efficient: for example, a uniform read corresponds to a simple scalar load.
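To make those benefits concrete, here’s a small sketch of my own (not from the original post; the function and names are made up):
float f(float x, uniform int mode) {
    // scale is a single scalar value shared by all program instances;
    // reading it is a simple scalar load rather than a vector operation.
    uniform float scale = 2;
    // Branching on the uniform mode is ordinary scalar control flow: all
    // program instances take the same path, so no mask update is needed.
    if (mode == 0)
        return x * scale;
    return x - scale;
}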
My first exposure to the general idea of uniform came from the RenderMan Shading Language (RSL), which turns out to be a SPMD language that operates on grids of points to be shaded. It also has a uniform keyword, signifying a value that is the same for all points. As far as I know, RSL implementations never targeted SIMD CPU hardware, but scalar CPU implementations maintained a mask of which points were active and could apply uniform for similar benefits for control flow. Closing the circle, it was fun when Pixar posted a note about the benefits of using ispc for shaders a few years ago.
It turns out that RSL had originally been designed for custom SIMD hardware in the 1980s, and it also turns out there were precursors to uniform in other SPMD languages of that era that targeted multiprocessors; again, see the ispc paper for more on previous work in this area.
Minimizing masked instructions
Handling all the details of masked vector computation can cause a fair amount of bloat in x86 assembly. It turned out to be worthwhile to design some of volta’s language features to foster situations where the compiler could determine that all of the program instances were active and then take advantage of that in code generation.
One example of that is a specialized loop construct that volta provides, foreach.1 It describes a loop over one or more dimensions where SPMD program instances are mapped to a given range of values.
We’ll use this short volta function as an example in the following; presumably it’s obvious what it does:
void increment(uniform float ptr[], uniform int count) {
    foreach (i = 0 ... count) {
        ptr[i] += 1;
    }
}
Consider now a foreach loop over 130 values with an 8-wide SIMD target: there will be 16 loop iterations where the execution mask is all on that take care of the first 128 values. Then, there will be a single iteration at the end with a mixed mask for the ragged extra two elements. ispc/volta generates two versions of the code for the loop body, the first one specialized for an all-on mask.
Modern ispc generates this AVX2 assembly for the unmasked iterations, plus the extra few instructions to see if another loop iteration is needed and then jump to the appropriate place:
vaddps (%rdi,%rdx), %ymm0, %ymm1
vmovups %ymm1, (%rdi,%rdx)
Just what you’d want, unless you’re the sort of person who’s bothered by a possibly-unnecessary use of an unaligned vector store. Assuming count is large, the vast majority of iterations will just run that code.
There’s a lot more that has to be done in the general case. Here’s the code for the last iteration:
# Compute the execution mask: lane l is active only if i + l < count.
vmovd %eax, %xmm0
vpbroadcastd %xmm0, %ymm0
vpaddd LCPI0_1(%rip), %ymm0, %ymm0   # add the constant (0,1,...,7)
vmovd %esi, %xmm1
vpbroadcastd %xmm1, %ymm1
vpcmpgtd %ymm0, %ymm1, %ymm0         # compare against the broadcast count
# Convert the index i into a byte offset into the array.
shll $2, %eax
cltq
# Masked load, vector add of 1.0, masked store.
vmaskmovps (%rdi,%rax), %ymm0, %ymm1
vbroadcastss LCPI0_0(%rip), %ymm2
vaddps %ymm2, %ymm1, %ymm1
vmaskmovps %ymm1, %ymm0, (%rdi,%rax)
The first handful of instructions determine the execution mask, disabling the vector lanes that correspond to entries in the array past count. Then it’s necessary to use that mask with a masked load instruction, vmaskmovps, when loading the last values from the array so that we don’t accidentally read memory past the end of it. Then there’s the add. Then finally, a masked store when writing the results back so that we don’t clobber memory after the array.
Without foreach and the “all on” optimizations that it enabled, we’d be going through a lot of that each time through the loop. (And the intrinsics programmers would rightly roll their eyes and walk away.)
If you know that the number of items you have to process is always a multiple of the SIMD width, you can add something like this before the loop:
count &= ~7;
That’d be a no-op in practice, but it’s enough to let the compiler figure out that the second version of the loop body is never needed and so doesn’t have to be emitted.2 (By the “compiler”, I mean LLVM in this case; ispc goes ahead and emits IR for both cases, but after LLVM’s regular optimization passes have done their thing, LLVM’s dead code elimination pass takes care of business.)
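For example, here’s a hypothetical variant of increment() with that promise baked in; on an 8-wide target, LLVM can then discard the masked version of the loop body entirely:
void increment_multiple_of_8(uniform float ptr[], uniform int count) {
    count &= ~7;  // round down to a multiple of 8 (a no-op if it already is one)
    foreach (i = 0 ... count) {
        ptr[i] += 1;
    }
}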
There are other places beyond foreach where we can do similar things. When an application first calls from C/C++ to volta code, by definition all of the program instances are running, so the execution mask can statically be determined to be all on until SPMD control flow enters the picture. Until then, the same sorts of optimizations can also be applied.
Furthermore, not all’s lost when SPMD control flow starts happening: it can be worthwhile to check at runtime whether all of the program instances are active. volta/ispc provides separate control flow keywords that allow the programmer to indicate expected-to-be-coherent control flow: cif, cfor, and so forth.
As an example, when cif is used, volta/ispc generates code that tests the execution mask right after updating it for the effect of the ‘if’ condition. If it’s all on, then we can jump to a specialized code path that starts with the “mask all on” assumption. If it’s all off, then we can jump directly past the ‘if’ statement’s body. Otherwise we have to execute the if’s code with regular masking, as a regular if does. There’s obviously a cost from cif in terms of code duplication, but for the cases where all of the program instances are in fact active, it can give a meaningful performance benefit.
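Here’s a small sketch of my own (the function is hypothetical, not from the original post) of what using it looks like:
float clamped(float x) {
    // We expect x > 0 for all (or nearly all) program instances, so cif is
    // used: the generated code tests the execution mask and jumps to an
    // unmasked fast path when every instance takes the same branch.
    cif (x > 0) {
        return x;
    } else {
        return 0;
    }
}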
I felt it was important to make that an explicit language feature, rather than avoiding new keywords and trying to come up with reasonable heuristics for the optimizer to decide when to add that check and when not to. It’s admittedly a little extra mental overhead for the programmer, and I’m sure that there are cases where people write ispc code without using those keywords that would benefit from them. But again, it’s a case where I’d prefer that the compiler be straightforward and predictable in the code that it generates; it’s another case of focusing on the performance-oriented programmer as the most important type of user.
Efficient SPMD loads and stores
Loads and stores from SPMD programs running on SIMD hardware are “interesting”. In general, each program instance in the vector may be reading from or writing to completely different locations in memory. Further, some of them may be inactive, and must not issue reads or writes at all in that case. (We saw a bit of this issue earlier when we looked at implementing scatter.)
In general, we need to issue a gather instruction for a SPMD read and a scatter for a write; these allow full flexibility in the memory locations accessed. Even if these are available as native instructions (as they are in AVX-512), performance is much better if we use vector loads and stores in the cases where the memory locations accessed are contiguous—it’s better to only use gather and scatter when we really need it.
Unfortunately, we need to make this decision at compile time. With GPUs (as I understand how they work these days), all of this is pretty much a run-time distinction: the compiler just emits the equivalent of a gather or scatter, and the actual performance at run time depends on whether the locations accessed were coherent in various ways or not. (Avoiding bank conflicts and all that.) That’s much better, since the locations accessed are in general data-dependent and thus their coherence can never be known as well at compile time as at run time.
That’s not how CPUs work, so we’ll have to do our best in the compiler.
The volta front-end doesn’t at all try to be clever about this; other than scalar memory accesses through uniforms, it starts out by just generating tentative gathers and scatters for all SPMD reads and writes. volta declares a whole bunch of pseudo gather and scatter functions in LLVM IR, for example:
declare <WIDTH x float> @__pseudo_gather64_float(<WIDTH x i64>,
                                                 <WIDTH x MASK>) nounwind readonly
(WIDTH and MASK get set to concrete values via a macro expansion step.)
Then, for any SPMD memory read of floats (e.g. loading a SIMD vector’s worth of values in that increment() example above), on a 64-bit target, volta emits a call to __pseudo_gather64_float, providing a unique pointer for each SIMD lane as well as the execution mask.
Those pseudo functions are left undefined through early LLVM optimization passes. In time, custom volta optimization passes start trying to improve them. There’s a lot that can be done better:
- If all of the pointers have the same value, volta substitutes a scalar load and vector broadcast.
- If the SPMD instances can be determined to be reading contiguous memory locations (as in increment()), it uses a vector load.
- If a gather truly is necessary, or the compiler can’t be sure, then a gather it is. (And then the target-specific IR will either emit a native instruction or do the equivalent with a series of instructions.)
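To give a sense of how those cases show up in source code, here’s a sketch of my own (the function and variable names are made up, not from the original post):
void loads(uniform float a[], uniform int index[], uniform float out[],
           uniform int count) {
    foreach (i = 0 ... count) {
        float x = a[i];           // contiguous locations: becomes a vector load
        float y = a[index[i]];    // data-dependent locations: remains a gather
        out[i] = x + y;
    }
}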
I wish all the complexity around doing that wasn’t necessary, but it’s a significant performance cliff if a gather or scatter is emitted unnecessarily, so it was worth banging on it a lot. In the end, those optimization passes got a lot of attention and came to be pretty robust at detecting the appropriate patterns, which I think is the best that could be hoped for.
Later on, I spent some time implementing some slightly more sophisticated approaches to avoiding gathers for reads that could be better expressed as vector loads and shuffles. Consider this function:
void reduce(uniform float ptr[], uniform float result[], uniform int count) {
    foreach (i = 0 ... count) {
        result[i] = ptr[i/2];
    }
}
On an 8-wide target, it’d be better to issue a 4-wide load and do a vector shuffle—much faster than a gather. For a function like this:
void deinterleave(uniform float ptr[], uniform float a[],
                  uniform float b[], uniform int count) {
    foreach (i = 0 ... count) {
        a[i] = ptr[2*i];
        b[i] = ptr[2*i + 1];
    }
}
an experienced intrinsics programmer would use two vector loads and then some shuffles. When I try that with modern ispc, though, a gather is emitted; I swear those used to be taken care of by that optimization. Another little thing to dig into later.
Anyway, in the end, all this added up to about 6k LOC for custom LLVM optimization passes in ispc; so maybe the compiler wasn’t completely dumb after all. There isn’t too much deep compiler magic in all that, however, which I think makes the output of the compiler still fairly predictable.
Tomorrow: for an exciting time, make it an open sourcing volta time.
Next: The open source release and the end of volta
notes
1. volta’s foreach was inspired by a related construct in a GPU-targeted language that Mark Lacey, T. Foley, Jefferson Montgomery, and Geoff Berry were building. ↩
2. A fair criticism would be that we’re starting to head to the land of the programmer doing fiddly things to make the optimizer do what they want. I think there’s an interesting broader question about how a programmer might cleanly and directly provide the compiler with information about the characteristics of the data that the program will be processing for help in optimization. That wasn’t a problem that I ended up tackling in volta, though. ↩