The story of ispc: origins (part 1)

I’ve decided to write up a little history of ispc, the compiler I wrote when I was at Intel. There’s a lot to say, so it’ll come out in a series of posts over the next few weeks. While I’ve tried to get all the details right and properly credit people, this is all from my memory. For anyone who was around at the time, please send an email if you see any factual errors.

Elegy for Larrabee

To understand the origins of ispc, it’s helpful to know a little bit about Larrabee. Larrabee (LRB) was Intel’s foray into building a high-end GPU. The project spanned roughly 2005 to 2010. After years of graphics being allotted years-old semiconductor process lines and tiny amounts of chip area, Intel was going to go big with Larrabee: GPUs on PCI-Express cards, the leading semiconductor process, competing for real at the high end, with the goal of being competitive with AMD and NVIDIA.

Intel executives fell in love with Larrabee because it was based on x86. “See, x86 can do anything! We don’t need to build some weird GPU architecture to be successful in graphics,” is what I’m sure they all told themselves. It was a seductive pitch, and on the face of it, it seemed reasonable. Just add a big vector unit on each core, add some texture units, have some programmer person write some code, and next thing you know, you’re selling more high-margin chips, putting the hurt on NVIDIA and their GPU computing ambitions to boot. (And the idea of LRB did seem reasonable to me for quite a long time, too, though I didn’t have quite the same attachment to the ISA and CPU architecture as the rest of the company did culturally.)

There are many reasons Larrabee didn’t work out and perhaps I’ll write something about my take on that at some point. (In the meantime, Tom Forsyth has a nice writeup about his views on this topic that’s worth checking out.)

One of the big problems was that there was a 16-wide vector unit on each core, yet there was no good way to write code that actually used it other than the shader compiler that was being written just for DX and OpenGL. If you’re not lighting up the vector units, you’re running at 1/16th of Larrabee’s potential performance; at that point, you’d be better running on a smaller number of regular CPU cores, sporting higher clock rates, out-of-order execution, bigger caches, and all that.

I saw one of the LRB hardware architects go out time and time again and tell developers that LRB was great because they could just program in C as usual, but now get multiple TFLOPs of performance, just from recompiling their pre-existing code.

We’d all try to explain to the hardware architects who believed that it would just take a recompile that it wasn’t quite that easy and sure, although multi-threading was pretty well understood by programmers, you still needed something for the vector unit and in all honesty, for that there was nothing. The usual response was a few head nods with slightly glazed eyes—agreement that ok, maybe it wasn’t quite that simple, but how hard could it actually be? It was quite a contrast to the panic that many software folks were starting to experience.

In general, the Intel hardware architects knew remarkably little about programming (that Forsyth fellow notwithstanding), and I’m sure those that thought that way really believed it. (To be fair, I don’t know much about actually doing HW architecture, though I guess I’m not out telling the hardware architects bogus things about how best to implement a branch predictor.)

The Intel compiler team assured the hardware architects that they had it all under control. They had the best loop vectorizer in the business—once they wrote a new backend for LRB, we’d be all set. C, C++, and even Fortran programmers would be able to light up those 16 vector lanes with ease, not needing to even think about it. (Just for calibration, it was a point of pride for them that Intel had the best Fortran compiler in the business as well.)

And there were a few folks who were thrilled to write intrinsics—Mike Abrash and the other great programmers at RAD who were writing the rasterizer wanted little more than that, and Tim Sweeney was salivating at the possibilities. I imagine that fact that they were so cool with the intrinsics option made the hardware architects think those of us sounding the alarm were just not very good programmers and thus not worth worrying about. (And to be clear, I’m a lousy programmer compared to Mike Abrash. And Tim Sweeney.) But building programmable hardware that only 5 people in the world can program ain’t a winning strategy, I’d humbly suggest.

In the end, it wasn’t the lack of a compiler for the vector units that doomed LRB: the hardware was late, the software rasterizer was late, and the whole project was caught out by the market transition to a world where power efficiency was much more important than it had been a few years before—consumers wanted mobile and battery-powered computing, and the LRB architecture was less power efficient than a conventional GPU architecture.

So LRB came to an end, but at least we’ve got AVX-512 on (some) CPUs now. From the LRB experience, though, it became clear to quite a few of us there at the time that this vector unit issue was an important one to address, if only for CPUs, where more and more processing was becoming available through SIMD.

Let’s solve this together with the compiler team!

Steeped in generating great code for regular loops that performed dense matrix math, for a long time most of the Intel compiler team denied that anything more than their auto-vectorizer was needed to take care of vector unit utilization. We quickly fell into a cycle:

They’d inform the graphics folks that they’d improved their auto-vectorizer in response to our requests and that it did everything we had asked for.
We’d try it and find that though it was better, boy was it easy to write code that wasn’t actually compiled to vector code—it’d fail unpredictably.
We’d give them failing cases, a few months would would pass and they’d inform us that the latest version solved the problem.

And so on.

It didn’t take much to fall off the vectorization path. They tried to patch things up at first but eventually they threw up their hands and came up with #pragma simd, which would disable the “is it safe to vectorize this” checks in the auto-vectorizer and vectorize the following loop no matter what. (Once a #pragma is proposed to solve a hard problem, you know things aren’t in a good place.)

So there was #pragma simd, which sort of worked, unless you called an external function; that problem never got solved. They never understood why someone would want to write a large system that ran completely using all of the vector lanes and couldn’t imagine it was an important use case. (The attentive reader may realize that this execution model precisely describes GPUs.)

Auto-vectorization is not a programming model

I think that the fatal flaw with the approach the compiler team was trying to make work was best diagnosed by T. Foley, who’s full of great insights about this stuff: auto-vectorization is not a programming model.

The problem with an auto-vectorizer is that as long as vectorization can fail (and it will), then if you’re a programmer who actually cares about what code the compiler generates for your program, you must come to deeply understand the auto-vectorizer. Then, when it fails to vectorize code you want to be vectorized, you can either poke it in the right ways or change your program in the right ways so that it works for you again. This is a horrible way to program; it’s all alchemy and guesswork and you need to become deeply specialized about the nuances of a single compiler’s implementation—something you wouldn’t otherwise need to care about one bit.

And God help you when they release a new version of the compiler with changes to the auto-vectorizer’s implementation.

With a proper programming model, then the programmer learns the model (which is hopefully fairly clean), one or more compilers implement it, the generated code is predictable (no performance cliffs), and everyone’s happy.

Along the way, many graphics people at Intel tried to explain to people in the compiler team that there were interesting things in GPU programming models that they might do well to understand and that those ideas could be profitably applied to not just to LRB, but to general CPU vector programming as well.

These interesting things boiled down to the SPMD programming model, familiar to GPU programmers both from shaders and languages like CUDA: you write code that mostly looks like it’s serial, just describing a computation on a single data element (vertex, pixel, etc.). In turn, that code is run in parallel on the hardware with many different inputs—many vertices are transformed at once, many pixels are shaded together, etc.

In this model, the parallelism is implicit. For the most part, the programmer only needs to think about operating on one piece of data, and doesn’t need to worry about how their program is mapped to hardware. (It isn’t always that simple in CUDA and in more recent versions of DirectX and OpenGL, but it’s mostly true.) Parallel execution is handled automatically, and as long as you provide the GPU with enough independent work to do, you can get great parallel utilization.

As graphics programmers have learned, SPMD is a really nice way to write high-performance parallel code. Sure, it doesn’t have serial semantics like code that the auto-vectorizer starts with, and serial semantics are great as long as they don’t inhibit performance, but as far as parallel programming models go, SPMD is conceptually clean and fairly easy to compile to SIMD hardware. (More on that point later.) Most programmers writing shaders don’t need to think about the fact that their programs are parallel at all.

It’s not really a vectorization problem, you see…

Looking back, I think the Intel compiler folks were thinking about the problem wrong, and we graphics folks failed to bridge the gap to get them to see it how we did. (But boy did we try.) To them, this was an outer loop vectorization problem: you’re not vectorizing inner loops, you’re just vectorizing the outer-most loop of your program. While that is in a sense an accurate description of the problem, it always seemed to me a strange way to think of it. (It misses, for example, the notion of communication between multiple running program instances that can be expressed in some SPMD models.)

The flaw with this mindset became clear from one nit that one of their lead architects kept coming back to in these discussions: “what happens when the CUDA compiler fails to vectorize?” He was baffled about how this problem was handled in CUDA. One had the sense that he felt that if he could just understand that, then that would be the key to fixing Intel’s auto-vectorizer and making us go away.

Of course, CUDA does not vectorize at all, and so CUDA never fails to vectorize; the question made no sense. You write your program, and although it looks mostly serial, it can and will run in parallel on the GPU, because that’s the programming model and it maps nicely to the hardware. Done and done.

We really tried to explain, multiple times, but the explanations never stuck.

In a meeting soon afterward, this same person angrily told us, “I don’t tell Toyota how to design a car; I might request features, but then designing it is their job.” He and others grew tired of the graphics people trying to tell them how they might improve their vector programming model and that their current one was insufficient for the kinds of programs we wanted to write. We all grew tired of saying the same things over and over without making any headway; at that point, it seemed impossible to convince them to do something about it.

Stay tuned for the next installment, feat. a summer in Sweden and some goofing around with LLVM that started to get interesting.

Next: Volta is born