Come late 2010, I was pretty excited about the imminent arrival of AVX on Sandybridge CPUs: after years of SSE’s 4-wide vectors (for 32-bit data types), AVX was doubling that, making it possible to do 8-wide 32-bit vector ops instead. This was probably the most exciting thing that had happened in Intel’s SIMD ISA since SSE arrived in the first place in 1999. The arrival of AVX was particularly exciting for volta—in the best case, lots of things would run roughly twice as fast, thanks to having twice as many vector lanes, and in the worst case, this whole SPMD on SIMD thing might be shown to be not so exciting after all.

Twice as fast is huge: the 1990s was the last time you saw anything close to “twice as fast” in a single CPU generation. Today, single-core CPU performance may improve by 10-20% per generation thanks to a better semiconductor process, slightly faster clocks, and microarchitectural improvements, but that’s about it.

The funny thing about it was, Intel was about to ship AVX but there’d be a delay, in some cases of years, before AVX made anything run faster. For things the autovectorizer could handle, it’d just be a recompile. For everything that had been written in SSE intrinsics, well, someone’d have to go and recode it in AVX intrinsics before it’d be any faster. For all the scalar code in the world that didn’t use SIMD, there’d be no benefit from AVX. What was the incentive to write all those intrinsics in the first place if you’d have to do it all again in a few years anyway?

Lots of things are wrong with coding in intrinsics—not just the eternal puzzle of what gets a single underscore and what gets a double underscore before it, but also the fact that it completely ties you to a particular ISA and its capabilities. The state of affairs is completely different with GPUs, where vendors are able to make significant architectural changes from generation to generation, delivering speedups with more cores and more vector lanes without programmers needing to modify their code.

For the most part, people at Intel didn’t seem too bothered by things being this way; I never really understood it. Well, some were bothered, but I didn’t understand why leadership wasn’t actively freaking out about it—you’re shipping a CPU with double the computational capability of the one from a year before, but almost no one will enjoy that benefit?

My only guess is that it was the legacy of many years where C (and Fortran) mapped perfectly well to Intel’s CPU architectures; before multi-core and SIMD were important, there was no need for them to worry about programming models themselves, so I guess they got comfortable with that not being their concern.

While there was plenty of noise around parallel programming models in Intel’s compiler group, it didn’t have the sense of “the future of the company depends on it”, like the way that NVIDIA approached CUDA, for example. Who knows, maybe the future of the company doesn’t depend on it; I guess Intel’s still in business. It still seemed strange to be increasing the SIMD width without having a plan for how developers would actually make good use of it.

In any case, I’d take those vector lanes if they were giving ‘em to me. I started work on AVX support for volta as soon as early support for AVX started appearing in LLVM.

Adding a new backend to volta

Adding a new backend to volta basically involves enabling the corresponding LLVM code generator and then writing a bunch of LLVM IR by hand to bridge the gap between basic operations the compiler wanted to be able to perform and the specifics of a given ISA. For example, the volta standard library provides a min() function that operates on various types.

Here’s the implementation (written in volta) for float:

static inline float min(float a, float b) {
    return __min_varying_float(a, b);
}

In turn, each backend needs to provide an implementation of __min_varying_float(), written manually in LLVM IR. For AVX, there’s a corresponding instruction and LLVM exposes it via an intrinsic, and we can just call that.

Here’s the definition for AVX:

define <8 x float> @__min_varying_float(<8 x float>, <8 x float>) {
  %call = call <8 x float> @llvm.x86.avx.min.ps.256(<8 x float> %0, <8 x float> %1)
  ret <8 x float> %call
}

LLVM turns that into a single vminps instruction for a call to min() in volta.

If AVX didn’t have a single instruction that did this operation, the IR for the AVX target would need to do whatever made the most sense to implement the computation using other operations. (Things like scatter and gather for SSE4 are implemented that way.)
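To make that concrete, here’s a sketch (not taken from volta’s actual builtins) of how __min_varying_float() might be written for a hypothetical target that had no vector min instruction, building the operation out of a compare and a select instead:

define <8 x float> @__min_varying_float(<8 x float>, <8 x float>) {
  ; lanewise comparison: which elements of %0 are less than those of %1?
  %lt = fcmp olt <8 x float> %0, %1
  ; take the element from %0 where the comparison was true, from %1 otherwise
  %min = select <8 x i1> %lt, <8 x float> %0, <8 x float> %1
  ret <8 x float> %min
}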

Banging on AVX support in LLVM

As explained previously, volta never would have been possible without LLVM; if the out-of-the-box SSE4 code generation hadn’t been as good as it was, I’d likely have ended my early experiments and moved on to a new project. I owed LLVM big, so I wanted to do something helpful in return.

I started trying to use the LLVM AVX backend before it was even complete; I imagine the developers probably weren’t ready to have anyone banging on it at that point. I was really excited to see how AVX worked out for volta, though, and I also figured I could help out a bit with testing their implementation.

It turned out that volta was pretty effective at exercising LLVM’s vector code generation. Not only did it generate lots of vectorized LLVM IR, it also emitted lots of x86 vector intrinsics directly (like __min_varying_float()); both characteristics were fairly different from the IR that most other LLVM-based compilers generated. That made it easy to find lots of bugs in that early AVX backend in LLVM.

Just to give a sense of typical output, here’s a semi-random selection of some of the code generated for the deferred shading example, using AVX with modern ispc.

	vmovups   1856(%rsp), %ymm3
	vdivps    %ymm2, %ymm3, %ymm13
	vmulps    %ymm13, %ymm1, %ymm1
	vdivps    1792(%rsp), %ymm1, %ymm4
	vmulps    608(%rsp), %ymm13, %ymm1
	vdivps    1376(%rsp), %ymm1, %ymm11
	vmulps    %ymm4, %ymm4, %ymm1
	vmulps    %ymm11, %ymm11, %ymm2
	vaddps    %ymm2, %ymm1, %ymm1
	vmulps    %ymm13, %ymm13, %ymm2
	vaddps    %ymm1, %ymm2, %ymm1
	vrsqrtps  %ymm1, %ymm2

Here’s all of the assembly for that example: deferred.S.

As I got started trying to use the nascent AVX backend, the first LLVM bugs I hit were generally crashes and assertion failures: various things in LLVM that hadn’t been exercised for the new target, or that made assumptions about their input that were no longer true with it. I’d boil down a little test case with LLVM’s bugpoint, a nifty tool that does an automated binary search to find a minimal test case, and then send it off.
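To give a flavor of that workflow (the file name and CPU flag here are made up for illustration, not from any specific bug report), reducing a crash in the code generator went something like this:

bugpoint -run-llc crash.ll --tool-args -mcpu=corei7-avx

bugpoint then re-runs llc on smaller and smaller slices of the input until it can’t shrink the crashing case any further, which leaves a small test case to attach to the bug report instead of a few thousand lines of IR.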

After things would compile, the next step was correctness. I had a test suite of a few hundred volta programs that I used during development; each one was a short function that did a small computation and then verified that the result matched an expected value. Not only were these tests useful for verifying volta’s own correctness as I was developing it, but they also worked out well for finding LLVM vector codegen correctness bugs. Whenever one of those tests failed on a new target, I’d dig in; sometimes it was my own bug, e.g. in the IR I’d written for the backend, and sometimes an LLVM codegen bug. Once all of those tests passed, I could confidently start compiling larger programs.
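As a rough illustration (this isn’t the actual test-suite format, and min_test is a made-up name), one of those tests might have looked something like this, written in today’s ispc syntax:

export uniform int min_test() {
    float a = programIndex;              // 0, 1, 2, ... across the program instances
    float m = min(a, 4.0);               // exercises the vector min code path
    bool wrong = (m != (a < 4.0 ? a : 4.0));
    return any(wrong) ? 1 : 0;           // nonzero return means some lane disagreed
}

Each test boils down to a tiny computation plus a check against the expected result, which is what made the suite so handy for isolating codegen bugs.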

As LLVM’s vector code correctness became solid for a given backend, I spent a lot of time looking at the generated assembly (as did volta’s users); this led to lots of observations of cases where LLVM’s vector code quality could be improved.

I seem to have filed a total of 144 LLVM bugs over the course of development of volta/ispc. The LLVM developers generally fixed them remarkably quickly. That made it a lot of fun—it felt like we were making good progress together, and I could go on to find progressively more esoteric bugs as they fixed the earlier ones. In the end, the LLVM backends for AVX and beyond came to be super solid; I’d like to think the issues volta found helped a bit with that process.

On the LLVM side, many thanks to Nadav Rotem, who did a lot of key work on vector select in LLVM; Bruno Cardoso Lopes, who did a lot of work on AVX codegen and fixed most of those bugs; and Craig Topper, who did a lot for AVX2. And of course, huge thanks to Chris Lattner for starting the whole effort in the first place, as well as the rest of the LLVM team.

Survey says…

All the sweat was worth it. It was really exciting once things started working with AVX and I could start measuring performance. Speedups of 1.5x to 2x from AVX were typical. And it just took a recompile; existing volta code didn’t need to be modified to see those performance benefits. Once again, it was a relief that there hadn’t been some unexpected hiccough that kept things from going as well as hoped.

Here are a few results measured with today’s ispc, showing the speedup on a single core versus scalar code.

Workload            SSE4 speedup    AVX1 speedup    AVX1:SSE4 ratio
Black-Scholes       4.13x           6.12x           1.48x
Ray tracer          2.60x           5.42x           2.08x
Deferred shading    4.15x           5.00x           1.20x
Aobench             3.33x           4.86x           1.46x

Single core speedups for a few workloads, showing the performance benefit from AVX (measured with today’s ispc).

I could have sworn that Black-Scholes was essentially 2x faster when AVX landed. Something to dig into at some point, but those are the numbers today.

AVX2 was a big step forward as well, since 8-wide 32-bit integer operations also became available:

Workload            SSE4 speedup    AVX2 speedup    AVX2:SSE4 ratio
Black-Scholes       4.13x           6.97x           1.68x
Ray tracer          2.60x           6.56x           2.52x
Deferred shading    4.15x           6.38x           1.54x
Aobench             3.33x           6.78x           2.03x


Needless to say, it was amazing to see those speedups actually happen. A doubling of SIMD vector width is relatively cheap, transistor- and power-wise. I don’t know the actual numbers, but it’s way cheaper by those metrics to double vector width than to double the number of cores on a CPU. And it turns out, if you have a reasonable programming model, compiler, and amenable workloads, you can see something approaching a doubling of performance at sub-linear silicon cost. Victory!

Next time, more details on some of the nitty-gritty of getting things to run fast.

Next: More on optimizations and performance