The story of ispc: first benchmark results (part 5)

As before, this is from memory. I did my best to get the details right, but get in touch if I got anything wrong.

There was a small set of benchmarks that the compiler team used to evaluate parallel programming models—things like Black-Scholes options pricing, evaluating the Mandelbrot set, a little stencil computation. Most of them were just tens of lines of code.

The most complex, aobench, weighed in at roughly 300 lines. If I remember correctly, the graphics folks pushed aobench on them as something that was at least vaguely representative of the sort of irregularity that was typical of graphics workloads. Anything more complex or more realistic was a non-starter: it would have been too much to handle for a number of the parallel programming models at hand.

aobench, via modern ispc. 15.6x faster than serial code on a 2-core laptop with AVX2.

Intel had all sorts of parallel programming models; some only targeted multi-core and some only targeted SIMD. There was Cilk for multi-core in the Intel C compiler, there was the auto-vectorization and #pragma simd stuff there, there was the OpenCL compiler, there was the metaprogramming stuff from RapidMind that merged with Intel Ct and ended up as Intel Array Building Blocks, there was Threading Building Blocks, and there was Intel Concurrent Collections. Probably other stuff I’ve forgotten, too.

In general, the models that only targeted multi-core exhibited linear scaling with the number of cores, and those that targeted SIMD exhibited linear scaling with SIMD width for computations without control flow but didn’t work at all for those that did—computing the Mandelbrot set and aobench. Different vector lanes wanting to follow different execution paths was too much for them.
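To make the divergence problem concrete, here is a rough sketch of how a group of SIMD lanes can run the Mandelbrot loop together under a per-lane mask. It is an illustration only, not volta's or ispc's actual code generation: the function names `mandel_scalar` and `mandel_gang` are made up for this example, and plain Python lists stand in for vector registers and mask instructions.

```python
def mandel_scalar(c_re, c_im, max_iters):
    """Serial reference: count iterations until |z| > 2, capped at max_iters."""
    z_re, z_im = c_re, c_im
    i = 0
    while i < max_iters:
        if z_re * z_re + z_im * z_im > 4.0:
            break
        new_re = z_re * z_re - z_im * z_im + c_re
        new_im = 2.0 * z_re * z_im + c_im
        z_re, z_im = new_re, new_im
        i += 1
    return i


def mandel_gang(c_re, c_im, max_iters):
    """Emulated SPMD on SIMD: every lane steps in lockstep under a mask.

    c_re and c_im hold one value per lane. Lists stand in for vector
    registers here; this is an assumption made for illustration.
    """
    width = len(c_re)
    z_re, z_im = list(c_re), list(c_im)
    count = [0] * width
    active = [True] * width  # the execution mask, one flag per lane

    for _ in range(max_iters):
        # The per-lane "break": a lane that has escaped is masked off
        # rather than leaving the loop on its own.
        for lane in range(width):
            if active[lane] and z_re[lane] * z_re[lane] + z_im[lane] * z_im[lane] > 4.0:
                active[lane] = False

        # The loop as a whole only exits once every lane is done.
        if not any(active):
            break

        # Masked update: only still-active lanes change their state.
        for lane in range(width):
            if active[lane]:
                new_re = z_re[lane] * z_re[lane] - z_im[lane] * z_im[lane] + c_re[lane]
                new_im = 2.0 * z_re[lane] * z_im[lane] + c_im[lane]
                z_re[lane], z_im[lane] = new_re, new_im
                count[lane] += 1

    return count
```

Each lane ends up with the same iteration count the serial loop would give it, but the group keeps iterating until its slowest lane finishes. That is the price of divergence, and handling it requires the mask bookkeeping above, which is the piece the SIMD-only models had no way to express.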

Anyhoo, once volta came to be fairly complete and I was happy with the quality of the code coming out of it, I coded some of the benchmarks up in volta and measured the performance. I was rather surprised: volta beat #pragma simd (the closest contender) for many of them and was quite close for the rest. Except for the stencil. I think.

And it wasn’t just that it won for things like aobench where it could actually deal with the control flow, but it was faster even for a number of the simple benchmarks. It was just by a few percent, but it won. I ran and re-ran the tests, just to be sure I hadn’t messed something up.

Intel had hundreds of people working on the compiler and prided themselves on generating better x86 code than any other compiler. That it had come together that well with volta, admittedly for a set of simple benchmarks, was fairly shocking, shall we say.

I don’t remember how I first communicated those results to the compiler team, but it was just as surprising to them: that the combination of volta, with its strange-graphics-people programming model, and this LLVM thing that they’d heard of a little bit would work out so well together was pretty much unimaginable.

There was soon a meeting to discuss all this with ten or so folks from the compiler team.

Beyond collective surprise, reactions were split. Most of the people were intrigued by the result—LLVM wasn’t the well-known powerhouse that it is today, and the idea that one programmer leveraging it could beat the icc compiler… Surely there were interesting things to learn from what had happened. We had a great, healthy discussion of the results and dug into some of the differences in the generated code. I may have explained what SPMD on SIMD was again.

An alternative interpretation of the results

One or two of them came to another conclusion: there was only one way to explain it—I must have cheated. I assume they imagined that I must have specialized the compiler to have special cases to detect the benchmark programs and then just spit out perfect pre-prepared code when it saw them, no compilation involved at all. That being the most likely way to explain beating icc, of course.

Assuming typical game theory for the jerks, here’s what the thinking would have been: I was a jerk too, and my real goal here was not to actually solve a problem, but was to leverage volta either to usurp them in their roles in parallel programming models in the compiler group or to advance some other nefarious agenda.

Under that scenario, I’d naturally hold my cards close, keeping the compiler source code secret and only begrudgingly letting them try the binary. Maybe I’d try to defer even that for a few months, claiming I wanted to make a few more improvements first. If I was cheating, I’d try to prevent their discovery of that as long as possible in the hopes of being successful in my evil plan first.

Alternatively, if I actually had a good idea, the thing to do would be to keep the details of it to myself, to prevent them from taking those details and claiming them as their own to maintain their position.

Me, I was still just trying to convince the professionals to write this compiler so I didn’t have to. I did exactly the thing we’d told them about lots of times already. So I emailed around a tarball of the source code after the meeting.

Forgive me, but I’m pretty sure I included an apology in the email: it’s the first time I’ve written a compiler, so forgive me if parts of the implementation aren’t very good. A lesson learned from the jerks: sometimes it’s fun to twist the knife a little.

It turned out that I wasn’t cheating, though it was noted that volta’s transcendental functions weren’t as precise as the other compilers’. I modified volta to use Intel’s Intrinsics for Short Vector Math Library (SVML), which is what the others were using. The performance gap narrowed on options pricing, but volta was still winning. It was unchanged on the others, which didn’t use transcendentals.

Keep your source code secret—seriously?

The idea of keeping the source code secret may seem strange. After all, we all worked at the same company, right?

As it turns out, some teams would jealously guard their source code, only making binary releases available to other teams at Intel, and only at well-defined delivery points. It was one defense against the jerks.

Here’s how it went: if you were working on something that others wanted to attack, sometimes they’d take the in-progress version of the system and pick it apart, finding a bunch of examples where it didn’t yet work well, and putting together an argument that your thing was in terrible shape, wasn’t working, and thus should be canceled.

And sometimes that sort of tactic actually worked; management was shockingly receptive to this sort of hysteria. Maybe it was that they were too far away from the technology to be able to evaluate the arguments on their merits, or maybe again it was an appreciation for gladiatorial combat as a decision-making process.

The best-case scenario was that your team would have to spend a lot of time convincing management that they were actually on track and that things were fine. The easier thing was to just not share your code in the first place.

Good times.

Next time we’ll talk about parallel programming model bake-offs and how things went with the initial internal users of volta.

Next: First users and modern CPUs coming through