One of the nice developments in architectures in the 1990s was the advent of so-called SIMD (single-instruction/multiple-data) instructions. Intel/AMD users will recognize this as the MMX/3DNow/SSE/SSE2/SSE3/SSE4 instruction sets. Users of the PowerPC will recognize it as the VMX/Altivec/Velocity Engine instruction set. Believe it or not, Sun was actually first to the punch with SPARC’s VIS (Visual Instruction Set) in 1995. Unfortunately I couldn’t find a good introductory tutorial on VIS from an assembly programmer’s point-of-view, so I decided to write my own. This article assumes a general familiarity with SPARC assembly, though if you know any RISC assembly, I suspect you’ll do alright.
The intention of VIS and other SIMD instruction sets is to perform multiple arithmetic or logical operations in one clock cycle. Those who are familiar with SPARC can probably believe that if you have one 32-bit register, and another 32-bit register, the CPU can add those two together in one clock cycle. Now imagine that instead of having a 32-bit integer in each register, you had 4 8-bit integers in each register. So you have 4 8-bit integers and another 4 8-bit integers, and what you want to do is add them in parallel, so that if register 1 has the 4 integers a, b, c and d, and register 2 has the 4 integers w, x, y and z, the output should be the 4 integers (a + w), (b + x), (c + y) and (d + z). If you think about how CPUs perform addition, it shouldn’t be hard to convince yourself that you’re not actually doing any more work: adding two sets of 4 8-bit integers is exactly the same as adding two sets of 32-bit integers, except that at every 8th bit, you’re throwing away the carry.
That’s the basic set-up. For reasons of practicality, VIS does not use 32-bit integer registers for its SIMD instructions: 32 bits is just too small. Rather, it uses the floating-point registers, which are 64 bits wide (sort of: we use %f0/%f1 as a single 64-bit register, %f2/%f3 is another one, etc.). It allows you to either pack 4 16-bit integers, or 2 32-bit integers, into a register. Using the floating-point registers is going to cause us a bit of pain, because getting values into or out of FPU registers on the SPARC is stupidly painful—basically you have to store an integer register to memory, then read it from memory into an FPU register—but we can figure out a few tricks to get around that.
To demonstrate VIS, I’ve chosen to tackle the standard C function char const *strchr(char const *, int). If you don’t have the man page handy, strchr takes in a string and a character, and searches through the string to find the first occurrence of that character. It returns a pointer to the location in the string where it found the character if successful; otherwise, it returns the null pointer. Here’s a straight-forward implementation of strchr in SPARC assembly:
strchr_slow: loop: ldub [%o0], %o5 ! load in a character from the string tst %o5 ! check to see if it's the null character bz notfound cmp %o5, %o1 ! compare it to the character we're looking for bne loop ! if we haven't found it, jump back to the top of inc %o0 ! [DS] the loop and advance to the next char retl ! if we've fallen through, we've found our character sub %o0, 1, %o0 ! just back up one character notfound: retl ! we've hit the end of the string; just return null clr %o0
(Note: I’ve implemented this as a leaf function. If you don’t like leaf functions, pretend there are save and restore instructions, %o0/%o1 are %i0/%i1 and %o5 is %l0.) This function works very well. Basically we read in a character and see if it’s what we’re looking for: if it’s not, read in the next character, and so on. From a performance standpoint, however, it sucks that we have to read in one character at a time and compare one character at a time. What if we could read in four characters at a time and compare four characters at a time? Could we perform this function four times as quickly?
The first thing we’re going to do is make sure we’re not looking for the null character. Our algorithm will work much better if we can safely assume that we’re not looking for the null character. So, handle that off the bat:
strchr_fast: save %sp, -96, %sp tst %i1 ! are we looking for the null character? bnz not_null ! hopefully not.... nop ! for clarity, I'll leave this unfilled call strlen ! find out the length of string, and use that for our mov %i0, %o0 ! return value ret restore %o0, %i0, %o0
So far so good. If the second argument is the null character, just hand our problems off to strlen to find out where that is. Now we deal with the case where we’re looking for an actual character. This is where the fun begins. The first thing we’re going to do is make four copies of our character. A character is 8 bits, and in order to do four comparisons at once, we would like four copies of it:
not_null: sll %i1, 8, %l0 ! make another copy 8 bits over or %i1, %l0, %l0 ! and merge them together sll %l0, 16, %l1 ! make a copy of that 16 bits over or %l1, %l0, %l0 ! and merge them together
Sweet. If %i1 originally started out as 000000ab, where each digit is a hex digit, then %l0 is now abababab, exactly what we want. The next thing we do is move this value into an FPU register:
st %l0, [%fp] ! move abababab into a spot on the stack ld [%fp], %f0 ! load it into %f0
I said before that for VIS we deal in 64-bit double-registers, not 32-bit registers. What we really want is 00ab00ab00ab00ab instead of abababab. To do this we make another FPU register which starts off as all zeroes. We then use our very first “real” VIS instruction—fpmerge—to “merge” the registers together:
fzero %f2 fpmerge %f2, %f0, %f0
What fpmerge does is take one byte from %f2, then one byte from %f0, then another byte from %f2, then another byte from %f0, and so on. The end result is we get our “ab”s interleaved with zeroes, so that we now have 4 16-bit values in the %f0/%f1 double-register. Time for the magic to happen! We’re going to start off our loop by loading 4 bytes all at once. We’ll then use our good friend fpmerge to interleave zeroes in there again:
loop: ld [%i0], %f4 ! load in 4 bytes fpmerge %f2, %f4, %f4 ! space it out a bit
So now %f0/%f1 is 00ab00ab00ab00ab, and %f4/%f5 is some 00cd00ef00gh00ij, corresponding to the next 4 characters we read in from the string.
This would be a good point to mention a hidden assumption I’m making: it’s okay to read past the end of a string as long as it’s not by too much. For example, consider that our string is only 2 bytes long: we’ve just read 4 bytes! Is that okay? The short answer is “yes”. The longer is that the only way we will run into a segmentation fault is if we run across a page boundary, and assuming that %i0 is divisible by 4 (see below), reading in 4 bytes will never cause us to cross a page boundary.
The more contentious assumption I’m making is that the string is word-aligned, i.e., that %i0 is divisible by 4. This assumption is actually not true, though in practice it will be true so long as we don’t pass strchr_fast a pointer to the middle of a string. We can actually modify the code here to handle the case where %i0 is not divisible by 4, by handling the first 1 to 3 characters specially until we hit a word boundary. I’m not including it here because it does nothing to the algorithm except complicate it, though for robustness’ sake, we should include it.
Anyway back to the code. We’ve just read 4 characters of the string into %f4 and spaced it out so that %f4/%f5 looks like 00cd00ef00gh00ij. Now comes the cool part. Careful you don’t miss it: this is where all the action happens in this algorithm. If you understand nothing else from this tutorial, understand this. We are going to do a couple parallel compares:
fcmpeq16 %f0, %f4, %l0 fcmpeq16 %f2, %f4, %l1
The fcmpeq16 instruction does 4 simultaneous compare instructions. The “16” in fcmpeq16 means that it does compares assuming 4 16-bit values, as opposed to 2 32-bit values. As you might guess, the “eq” in fcmpeq16 means that we’re comparing for equality. Let’s look at the first fcmpeq16, because that will be most clear. It compares 00ab to 00cd (the first 16 bits), then (in parallel) compares 00ab to 00ef (the next 16 bits), then compares 00ab to 00gh, and then 00ab to 00ij. For each comparison it does, it writes a bit to %l0. If 00ab and 00ij compare equal, then the rightmost bit of %l0 is set to 1; otherwise it’s set to 0. If 00ab and 00gh compare equal, then the second-from-right bit of %l0 is set to 1; otherwise it’s set to 0, and so on.
The second fcmpeq16 is doing the exact same thing, except we’re comparing against %f2/%f3. What’s %f2/%f3? It’s all zeroes! In effect we are comparing the characters in %f4/%f5 against the null characters, so that we can determine if we’ve hit the end of the string.
Now we get really fancy. Remember that the character we’re looking for is not the null character, because we dealt with that at the very top of the function. What this means is that %l0 and %l1—the results of our comparisons—are equal if and only if they are both zero. In other words, %l0 and %l1 compare equal if and only if we have not found the character we are looking for and we have not found the end of the string. We exploit this insight thusly:
cmp %l0, %l1 be loop ! if they're equal, we haven't found anything inc 4, %i0 ! so loop back to read the next 4 bytes!
Note this highlights nicely where the performance gains come from using VIS. In the old version of our code, we incremented 1 character per loop. In this version, we increment 4 characters per loop. Hence, we iterate through our loop one quarter as many times, which means one quarter as many branches.
If we fall through that branch, that means we have either found the character we’re looking for, or we’ve found the null character, or we’ve found both. But because we’re comparing 4 characters at once, we care about where we’ve found our match. First of all, we’re most interested in whether we found the character first or the null character first. If we found the null character first, then we should return null, to say that we didn’t find the string. Let’s deal with that off the bat.
Convince yourself that if %l0 compares less than %l1 numerically, then that means we have found the null character first, and if %l0 compares greater than %l1, then that means we have found the other character first. Also note that the condition codes are still set from the “cmp %l0, %l1″ above, so there is no reason to do another comparison.
bg found dec 5, %i0 ! [DS] back up the truck to before we advanced by 4
(Note: we decrement by 5 instead of 4 because of the inc instruction below.) If we’ve fallen through that branch, then that means we have hit the end of the string first. All we have to do is return null. Easy as pie:
ret ! return 0 restore %g0, %g0, %o0
Now comes the hard part. We have to deal with the fact that we found the character we’re looking for. It’s not just enough to know that we found the character; we have to know where we found the character. I regret to inform you that the best way I’ve found to do this is using a loop. I know, I fail. I won’t go into great detail here, but what I do is shift %l0 to the leftmost bits and keep shifting out until it turns negative, indicating I’ve found my leftmost one bit. Caution: this is tricky, and without a lot of documentation:
found: sll %l0, 28, %l0 ! shift the 4 bits so they are the leftmost 4 bits next_bit: tst %l0 ! is the leftmost bit 1? inc %i0 bpos next_bit ! we haven't found it yet; keep looping sll %l0, 1, %l0
If you followed that, then hopefully you believe the return value is sitting in %i0 for us. All we have to do is return.
I may revisit this page in the future to clarify a few points, especially near the end where I fear things may get muddled. The important thing to understand is the main loop, where we read in 4 bytes and compare 4 bytes at a time. It’s only once we get out of the loop that things get really icky, in my opinion.
By the way, if you’re compiling this with gcc or gas, use the -mcpu=ultrasparc argument to gcc; otherwise it will complain that you’re using super advanced instructions.
I said before that this would hopefully lead us to a strchr that is four times as fast. We have one quarter as many loads and one quarter as many compares, so it should be four times as fast, right? My benchmarking shows it’s only about three times as fast; I haven’t put in the effort to find out exactly why, but nothing’s ever as good as it is in theory, and honestly I think three times as fast is pretty good.
VIS and SIMD instructions in general are used in all sorts of applications, not just searching through strings. Multimedia is the biggest application, which is why Intel marketed theirs as MMX (MultiMedia eXtensions) and Sun marketed theirs as VIS (Visual Instruction Set). In audio processing it’s incredibly easy to see where SIMD speeds things up: audio processing is really nothing more than doing some simple operation, like adding or multiplying, to every element in an array. Even better, in audio processing, usually each element in your array is only 16 bits in size, because CD quality audio uses 16-bit sample depth, though for professional mixing I suppose 32-bit sample depth is all the rage these days.
Enjoy! If you would like to learn more about VIS instructions, a page like this one may serve you well, though bear in mind that documentation on VIS instructions is very hard to come by.