One of the nice developments in architectures in the 1990s was the advent of so-called SIMD (single-instruction/multiple-data) instructions. Intel/AMD users will recognize this as the MMX/3DNow/SSE/SSE2/SSE3/SSE4 instruction sets. Users of the PowerPC will recognize it as the VMX/Altivec/Velocity Engine instruction set. Believe it or not, Sun actually beat Intel to the punch with SPARC’s VIS (Visual Instruction Set) in 1995. Unfortunately I couldn’t find a good introductory tutorial on VIS from an assembly programmer’s point-of-view, so I decided to write my own. This article assumes a general familiarity with SPARC assembly, though if you know any RISC assembly, I suspect you’ll do alright.

The intention of VIS and other SIMD instruction sets is to perform multiple arithmetic or logical operations in one clock cycle. Those who are familiar with SPARC can probably believe that if you have one 32-bit register, and another 32-bit register, the CPU can add those two together in one clock cycle. Now imagine that instead of having a 32-bit integer in each register, you had 4 8-bit integers in each register. So you have 4 8-bit integers and another 4 8-bit integers, and what you want to do is add them in parallel, so that if register 1 has the 4 integers a, b, c and d, and register 2 has the 4 integers w, x, y and z, the output should be the 4 integers (a + w), (b + x), (c + y) and (d + z). If you think about how CPUs perform addition, it shouldn’t be hard to convince yourself that you’re not actually doing any more work: adding two sets of 4 8-bit integers is exactly the same as adding two 32-bit integers, except that at every 8th bit, you’re throwing away the carry instead of passing it along to the next bit.
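If you'd like to convince yourself of the "throw away the carry" claim without a SPARC handy, here is a minimal sketch in plain C. It is not a VIS instruction, just an emulation; the name add4x8 is my own invention for illustration.

```c
#include <stdint.h>

/* Add four 8-bit lanes packed into 32-bit words, discarding the carry
 * out of each lane. A portable emulation, not a VIS instruction. */
uint32_t add4x8(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every lane; no carry can leave a lane. */
    uint32_t low = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);
    /* Fold the top bit of each lane back in without carrying out. */
    return low ^ ((a ^ b) & 0x80808080u);
}
```

The lane holding 0xff plus the lane holding 0x01 wraps around to 0x00 and its neighbours never notice.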

That’s the basic set-up. For reasons of practicality, VIS does not use 32-bit integer registers for its SIMD instructions: 32 bits is just too small. Rather, it uses the floating-point registers, which are 64 bits wide (sort of: we use %f0/%f1 as a single 64-bit register, %f2/%f3 is another one, etc.). It allows you to either pack 4 16-bit integers, or 2 32-bit integers, into a register. Using the floating-point registers is going to cause us a bit of pain, because getting values into or out of FPU registers on the SPARC is stupidly painful—basically you have to store an integer register to memory, then read it from memory into an FPU register—but we can figure out a few tricks to get around that.

To demonstrate VIS, I’ve chosen to tackle the standard C function char const *strchr(char const *, int). If you don’t have the man page handy, strchr takes in a string and a character, and searches through the string to find the first occurrence of that character. It returns a pointer to the location in the string where it found the character if successful; otherwise, it returns the null pointer. Here’s a straight-forward implementation of strchr in SPARC assembly:

loop:
  ldub [%o0], %o5   ! load in a character from the string
  tst %o5           ! check to see if it's the null character
  bz notfound
  cmp %o5, %o1      ! [DS] compare it to the character we're looking for
  bne loop          ! if we haven't found it, jump back to the top of
  inc %o0           ! [DS] the loop and advance to the next char
  retl              ! if we've fallen through, we've found our character
  sub %o0, 1, %o0   ! [DS] just back up one character
notfound:
  retl              ! we've hit the end of the string; just return null
  clr %o0           ! [DS]

(Note: I’ve implemented this as a leaf function. If you don’t like leaf functions, pretend there are save and restore instructions, %o0/%o1 are %i0/%i1 and %o5 is %l0.) This function works very well. Basically we read in a character and see if it’s what we’re looking for: if it’s not, read in the next character, and so on. From a performance standpoint, however, it sucks that we have to read in one character at a time and compare one character at a time. What if we could read in four characters at a time and compare four characters at a time? Could we perform this function four times as quickly?
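For readers who think better in C, the byte-at-a-time loop corresponds roughly to the sketch below. The name strchr_slow is hypothetical, chosen so it doesn't clash with the C library. Like the assembly, it checks for the terminator first, so asking it for the null character returns null.

```c
#include <stddef.h>

/* Byte-at-a-time search mirroring the assembly loop: one load and one
 * compare per character. */
const char *strchr_slow(const char *s, int c)
{
    for (;; s++) {
        if (*s == '\0')
            return NULL;      /* hit the end of the string: not found */
        if (*s == (char)c)
            return s;         /* found the character */
    }
}
```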

The first thing we’re going to do is make sure we’re not looking for the null character. Our algorithm will work much better if we can safely assume that we’re not looking for the null character. So, handle that off the bat:

  save %sp, -96, %sp
  tst %i1           ! are we looking for the null character?
  bnz not_null      ! hopefully not....
  nop               ! for clarity, I'll leave this unfilled
  call strlen       ! find out the length of string, and use that for our
  mov %i0, %o0      ! [DS] return value
  ret               ! return string + length, i.e. where the null lives
  restore %o0, %i0, %o0

So far so good. If the second argument is the null character, just hand our problems off to strlen to find out where that is. Now we deal with the case where we’re looking for an actual character. This is where the fun begins. The first thing we’re going to do is make four copies of our character. A character is 8 bits, and in order to do four comparisons at once, we would like four copies of it:

not_null:
  sll %i1, 8, %l0   ! make another copy 8 bits over
  or %i1, %l0, %l0  ! and merge them together
  sll %l0, 16, %l1  ! make a copy of that 16 bits over
  or %l1, %l0, %l0  ! and merge them together

Sweet. If %i1 originally started out as 000000ab, where each digit is a hex digit, then %l0 is now abababab, exactly what we want. The next thing we do is move this value into an FPU register:
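The same shift-and-or trick in C, as a small sketch. The function name broadcast_byte is made up for illustration, and the initial mask with 0xff is my addition: the assembly assumes the upper 24 bits of the argument are already zero.

```c
#include <stdint.h>

/* Copy the low byte of c into all four bytes of a 32-bit word. */
uint32_t broadcast_byte(int c)
{
    uint32_t w = (uint32_t)c & 0xffu;  /* 000000ab */
    w |= w << 8;                       /* 0000abab */
    w |= w << 16;                      /* abababab */
    return w;
}
```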

  st %l0, [%fp - 4] ! move abababab into a scratch word in our own frame
  ld [%fp - 4], %f0 ! load it into %f0

I said before that for VIS we deal in 64-bit double-registers, not 32-bit registers. What we really want is 00ab00ab00ab00ab instead of abababab. To do this we make another FPU register which starts off as all zeroes. We then use our very first “real” VIS instruction—fpmerge—to “merge” the registers together:

  fzero %f2
  fpmerge %f2, %f0, %f0

What fpmerge does is take one byte from %f2, then one byte from %f0, then another byte from %f2, then another byte from %f0, and so on. The end result is we get our “ab”s interleaved with zeroes, so that we now have 4 16-bit values in the %f0/%f1 double-register. Time for the magic to happen! We’re going to start off our loop by loading 4 bytes all at once. We’ll then use our good friend fpmerge to interleave zeroes in there again:

loop:
  ld [%i0], %f4      ! load in 4 bytes
  fpmerge %f2, %f4, %f4  ! space it out a bit

So now %f0/%f1 is 00ab00ab00ab00ab, and %f4/%f5 is some 00cd00ef00gh00ij, corresponding to the next 4 characters we read in from the string.
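If the interleaving is hard to picture, here is a rough C model of what fpmerge does to two 32-bit inputs. The name fpmerge_emul and its parameter names are mine; this is only an emulation for building intuition, not how the hardware does it.

```c
#include <stdint.h>

/* Interleave the bytes of two 32-bit values into one 64-bit value:
 * first a byte from rs1, then a byte from rs2, leftmost bytes first. */
uint64_t fpmerge_emul(uint32_t rs1, uint32_t rs2)
{
    uint64_t out = 0;
    for (int shift = 24; shift >= 0; shift -= 8) {
        out = (out << 8) | ((rs1 >> shift) & 0xffu);
        out = (out << 8) | ((rs2 >> shift) & 0xffu);
    }
    return out;
}
```

Merging a zero register with abababab gives 00ab00ab00ab00ab, which is exactly the spacing we were after.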

This would be a good point to mention a hidden assumption I’m making: it’s okay to read past the end of a string as long as it’s not by too much. For example, consider that our string is only 2 bytes long: we’ve just read 4 bytes! Is that okay? The short answer is “yes”. The longer answer is that the only way we will run into a segmentation fault is if we run across a page boundary into an unmapped page, and assuming that %i0 is divisible by 4 (see below), reading in 4 bytes will never cause us to cross a page boundary.
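The page-boundary argument is just arithmetic, and a tiny C check makes it concrete. The helper name load_crosses_page is hypothetical, and the 8192-byte page size in the usage note is an assumption on my part; any power-of-two page size of at least 4 bytes behaves the same way.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a load of `size` bytes starting at `addr` touches two pages. */
bool load_crosses_page(uintptr_t addr, unsigned size, uintptr_t page_size)
{
    return (addr / page_size) != ((addr + size - 1) / page_size);
}
```

With 8192-byte pages, a 4-byte load at address 8188 stays inside its page, while the same load at the unaligned address 8190 would spill into the next one.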

The more contentious assumption I’m making is that the string is word-aligned, i.e., that %i0 is divisible by 4. This assumption is actually not true, though in practice it will be true so long as we don’t pass strchr_fast a pointer to the middle of a string. We can actually modify the code here to handle the case where %i0 is not divisible by 4, by handling the first 1 to 3 characters specially until we hit a word boundary. I’m not including it here because it does nothing to the algorithm except complicate it, though for robustness’ sake, we should include it.

Anyway back to the code. We’ve just read 4 characters of the string into %f4 and spaced it out so that %f4/%f5 looks like 00cd00ef00gh00ij. Now comes the cool part. Careful you don’t miss it: this is where all the action happens in this algorithm. If you understand nothing else from this tutorial, understand this. We are going to do a couple parallel compares:

  fcmpeq16 %f0, %f4, %l0
  fcmpeq16 %f2, %f4, %l1

The fcmpeq16 instruction does 4 simultaneous compare instructions. The “16” in fcmpeq16 means that it does compares assuming 4 16-bit values, as opposed to 2 32-bit values. As you might guess, the “eq” in fcmpeq16 means that we’re comparing for equality. Let’s look at the first fcmpeq16, because that will be most clear. It compares 00ab to 00cd (the first 16 bits), then (in parallel) compares 00ab to 00ef (the next 16 bits), then compares 00ab to 00gh, and then 00ab to 00ij. For each comparison it does, it writes a bit to %l0. If 00ab and 00ij compare equal, then the rightmost bit of %l0 is set to 1; otherwise it’s set to 0. If 00ab and 00gh compare equal, then the second-from-right bit of %l0 is set to 1; otherwise it’s set to 0, and so on.

The second fcmpeq16 is doing the exact same thing, except we’re comparing against %f2/%f3. What’s %f2/%f3? It’s all zeroes! In effect we are comparing the characters in %f4/%f5 against the null characters, so that we can determine if we’ve hit the end of the string.
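Here is a small C model of the comparison, in case the bit numbering is confusing. The name fcmpeq16_emul is mine; it only imitates the result described above, where the rightmost result bit belongs to the rightmost 16-bit lane.

```c
#include <stdint.h>

/* Compare four 16-bit lanes of a and b; set one result bit per equal
 * lane, with bit 0 for the rightmost lane and bit 3 for the leftmost. */
unsigned fcmpeq16_emul(uint64_t a, uint64_t b)
{
    unsigned mask = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        if (x == y)
            mask |= 1u << lane;
    }
    return mask;
}
```

For example, comparing 00ab00ab00ab00ab against 006100ab00000062 yields binary 0100 (the second lane from the left matches), and comparing all zeroes against the same data yields binary 0010 (the third lane is the null character).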

Now we get really fancy. Remember that the character we’re looking for is not the null character, because we dealt with that at the very top of the function. What this means is that %l0 and %l1—the results of our comparisons—are equal if and only if they are both zero. In other words, %l0 and %l1 compare equal if and only if we have not found the character we are looking for and we have not found the end of the string. We exploit this insight thusly:

  cmp %l0, %l1
  be loop            ! if they're equal, we haven't found anything
  inc 4, %i0         ! so loop back to read the next 4 bytes!

Note this highlights nicely where the performance gains come from using VIS. In the old version of our code, we incremented 1 character per loop. In this version, we increment 4 characters per loop. Hence, we iterate through our loop one quarter as many times, which means one quarter as many branches.

If we fall through that branch, that means we have either found the character we’re looking for, or we’ve found the null character, or we’ve found both. But because we’re comparing 4 characters at once, we care about where we’ve found our match. First of all, we’re most interested in whether we found the character first or the null character first. If we found the null character first, then we should return null, to say that we didn’t find the character. Let’s deal with that off the bat.

Convince yourself that if %l0 compares less than %l1 numerically, then that means we have found the null character first, and if %l0 compares greater than %l1, then that means we have found the other character first. Also note that the condition codes are still set from the “cmp %l0, %l1” above, so there is no reason to do another comparison.
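If you'd rather be convinced by code, this C sketch spells out the three-way decision. The enum and function names are invented for illustration. The reasoning is that the leftmost character owns the highest bit, and no position can be both a match and a null, so whichever mask is numerically larger had the earlier hit.

```c
/* What to do after comparing one 4-character chunk. */
enum outcome { KEEP_LOOKING, FOUND_CHAR, HIT_END };

/* match_mask plays the role of %l0, null_mask the role of %l1. */
enum outcome classify(unsigned match_mask, unsigned null_mask)
{
    if (match_mask == null_mask)
        return KEEP_LOOKING;            /* only possible when both are 0 */
    return match_mask > null_mask ? FOUND_CHAR : HIT_END;
}
```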

  bg found
  dec 5, %i0         ! [DS] back up the truck to before we advanced by 4

(Note: we decrement by 5 instead of 4 because of the inc instruction below.) If we’ve fallen through that branch, then that means we have hit the end of the string first. All we have to do is return null. Easy as pie:

  ret                ! return 0
  restore %g0, %g0, %o0

Now comes the hard part. We have to deal with the fact that we found the character we’re looking for. It’s not just enough to know that we found the character; we have to know where we found the character. I regret to inform you that the best way I’ve found to do this is using a loop. I know, I fail. I won’t go into great detail here, but what I do is shift %l0 to the leftmost bits and keep shifting out until it turns negative, indicating I’ve found my leftmost one bit. Caution: this is tricky, and without a lot of documentation:

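found:
  sll %l0, 28, %l0   ! shift the 4 bits so they are the leftmost 4 bits
next_bit:
  tst %l0            ! is the leftmost bit 1?
  inc %i0            ! advance one character (inc leaves the flags alone)
  bpos next_bit      ! we haven't found it yet; keep looping
  sll %l0, 1, %l0    ! [DS] line up the next bit for testing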

If you followed that, then hopefully you believe the return value is sitting in %i0 for us. All we have to do is return: a ret, with restore %i0, %g0, %o0 in its delay slot to hand %i0 back to the caller.
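Since I skimped on the explanation, here is the same bit-hunting loop as a C sketch. The name first_match_index is made up, and the function assumes at least one of the low 4 bits is set, which is guaranteed by the time we reach this point in the assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Return the position (0 = leftmost character of the chunk) of the
 * highest set bit among the low 4 bits of mask. */
int first_match_index(unsigned mask)
{
    assert((mask & 0xfu) != 0);            /* otherwise we'd loop forever */
    uint32_t bits = (uint32_t)mask << 28;  /* move the 4 bits to the top */
    int index = 0;
    while ((bits & 0x80000000u) == 0) {    /* top bit clear: not here yet */
        bits <<= 1;
        index++;
    }
    return index;
}
```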


I may revisit this page in the future to clarify a few points, especially near the end where I fear things may get muddled. The important thing to understand is the main loop, where we read in 4 bytes and compare 4 bytes at a time. It’s only once we get out of the loop that things get really icky, in my opinion.
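To tie the pieces together, here is the whole algorithm as a portable C sketch. It emulates the VIS steps with ordinary integer code, so it won't be fast; it is only meant to show the control flow. All the names (spread, eq16, strchr4) are hypothetical. It assumes the character is not the null character and that the string lives in a buffer padded to a multiple of 4 bytes, so reading a whole chunk is safe. Because it assembles each chunk byte by byte, it doesn't need the word-alignment assumption the assembly makes.

```c
#include <stddef.h>
#include <stdint.h>

/* Spread 4 packed bytes into four 16-bit lanes, which is what fpmerge
 * against a zero register achieves. */
static uint64_t spread(uint32_t w)
{
    uint64_t out = 0;
    for (int shift = 24; shift >= 0; shift -= 8)
        out = (out << 16) | ((w >> shift) & 0xffu);
    return out;
}

/* 4-bit equality mask, bit 3 for the leftmost lane (models fcmpeq16). */
static unsigned eq16(uint64_t a, uint64_t b)
{
    unsigned mask = 0;
    for (int lane = 0; lane < 4; lane++)
        if ((uint16_t)(a >> (16 * lane)) == (uint16_t)(b >> (16 * lane)))
            mask |= 1u << lane;
    return mask;
}

/* Assumes c != '\0' and that s is padded to a multiple of 4 bytes. */
const char *strchr4(const char *s, int c)
{
    uint32_t ch = (uint32_t)c & 0xffu;
    uint64_t needle = spread(ch * 0x01010101u);   /* 00ab00ab00ab00ab */

    for (;; s += 4) {
        const unsigned char *p = (const unsigned char *)s;
        uint32_t word = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
                      | ((uint32_t)p[2] << 8)  | (uint32_t)p[3];
        uint64_t lanes = spread(word);
        unsigned found = eq16(needle, lanes);     /* plays %l0 */
        unsigned ended = eq16(0, lanes);          /* plays %l1 */

        if (found == ended)
            continue;                             /* nothing in this chunk */
        if (found < ended)
            return NULL;                          /* terminator came first */

        int i = 0;
        while (!(found & (8u >> i)))              /* find the leftmost hit */
            i++;
        return s + i;
    }
}
```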

By the way, if you’re compiling this with gcc or gas, use the -mcpu=ultrasparc argument to gcc; otherwise it will complain that you’re using super advanced instructions.

I said before that this would hopefully lead us to a strchr that is four times as fast. We have one quarter as many loads and one quarter as many compares, so it should be four times as fast, right? My benchmarking shows it’s only about three times as fast; I haven’t put in the effort to find out exactly why, but nothing’s ever as good as it is in theory, and honestly I think three times as fast is pretty good.

VIS and SIMD instructions in general are used in all sorts of applications, not just searching through strings. Multimedia is the biggest application, which is why Intel marketed theirs as MMX (MultiMedia eXtensions) and Sun marketed theirs as VIS (Visual Instruction Set). In audio processing it’s incredibly easy to see where SIMD speeds things up: audio processing is really nothing more than doing some simple operation, like adding or multiplying, to every element in an array. Even better, in audio processing, usually each element in your array is only 16 bits in size, because CD quality audio uses 16-bit sample depth, though for professional mixing I suppose 32-bit sample depth is all the rage these days.

Enjoy! If you would like to learn more about VIS instructions, a page like this one may serve you well, though bear in mind that documentation on VIS instructions is very hard to come by.

I haven’t updated the blog in a bit, so I thought I’d write a bit about what’s happened. First of all, I have some vindication for my 833 project, yesno. This past week there was a small thread on comp.functional about Boehm GC on OS X and how it sucks. Apparently there is some problem with reliably getting the registers on OS X—one poster claimed it to be “Apple’s fault”. In any case, I don’t feel so bad about having to work around libgc weirdness with yesno now.

My 833 project apparently went alright too. I was nervous handing it in, as ghc is fickle to get compiling and working, and I just sort of duct taped it together as best I could and handed it to Watt, half-expecting him to email me back the next day saying “how in the hell do you get this thing working?”. But I checked my mark today: after two courses I’m now sitting with a 100.0 GPA—it’s a little unsettling—so I’ll have to talk to him some time and see what he liked about it.

I’ve shifted my focus to static analysis of functional languages with real-time applications: primarily guaranteeing time and memory constraints. I’m vaguely playing around with Martin-Löf type theory, trying to come up with a good model to capture caching policy, with the aim of hopefully coming up with some cool cache model to submit to POPL this summer.

This afternoon I’ve been reworking my presentation for yesno such that it’s suitable for UWORCS. This has involved gutting most of my slides, and not just because I need to cut it down for time. Yesno, for those who are not familiar, is a programming language which aims to be complete, i.e., every program must yield some values (“halt”). The trade-off is there is some inconsistency.

I have a more concrete idea of what yesno should and should not do now. In a lot of my slides from when I was proposing the project, I was totally off on what it should do. I’m also reminded of how basic yesno is: it’s got a long road of revision and research and refinement ahead of it before it’s broadly usable. If the evolution of a programming language is like chiselling marble, then mainstream languages have abandoned even the most minute chisel in favour of polish, and I’m still whacking away with a sledgehammer.

But, it’s reaffirmed my interest in a complete framework. It’s a bit too far off to say whether completeness/inconsistency is best served by a language or a library/macro system for some existing language. At this point I think a language serves the purpose well, but who knows.

It seems to me that there’s a use for data that hasn’t finished being computed yet: think of a web browser rendering a half-loaded page, or a soft real-time application providing some sort of result because it’s running out of time. It’s a nice framework for these things, I think, and one which hasn’t had its fair due of research attention.

I may even pick up yesno again in the summer and actually formalize the semantics and what’s missing about the semantics. It worked as a class project, but there are some quite large holes in the semantics that need serious thought.

For the past few days I’ve been working on my poster in Pages, part of Apple’s iWork package. So far I have to say it’s pretty slick, though there is a learning curve.

Here’s a low-resolution preview of my poster for UWORCS:

Trying to explain yesno on a poster is exceedingly difficult, especially if you want the broad strokes to catch the reader’s eye at some distance. And I’m quickly running out of room…I may have to shrink some of the text by a few points. Keep in mind this poster is tentatively A0 sized, about 120cm by 80cm. And I have to fix the colour in a lot of places. But so far I’m pleased with my results.

I’ve mentioned before my CS888 project, which is taking GHC’s garbage collector and adding real-time hooks into it. There will be two functions exported on the Haskell level: Control.RealTime.gcNextCollectionUpperBound :: IO Int and Control.RealTime.gcBound :: Int -> IO a -> IO a. The latter function is “where the magic happens” (after Mark pointed out that I used this phrase 3 times in my comments in my CS833 project source, I realize I use it all the time). Indeed the former function isn’t even necessary, but is useful for making garbage collection more predictable. But the latter function is used to force the garbage collector to limit itself during some chunk of code and not go off on some tangent collecting, behaviour which would be disastrous in real-time code. Note that calls to gcBound can be nested.

I bolded the word “force” up there, because I want to stay hard real-time if at all possible. If the user wants soft and I provide hard, that’s just a bonus. If the user wants hard and I provide soft, that’s no good.

The more I think about gcNextCollectionUpperBound, the trickier it gets. In case its name is not clear, its job is to estimate how long it thinks the next round of garbage collection will take, in milliseconds. This function should be fairly quick (it shouldn’t simulate a collection just to figure out how much garbage there is); it should be moderately accurate (hopefully within an order of magnitude in most cases); and it should return a hard upper bound, such that it is impossible for garbage collection to take more than the number returned.

So, consider the worst case scenario for garbage collection. The worst case is that garbage collection will have to swap in, from disk, the objects that it’s collecting. In the worst case, pages will be scattered all throughout, such that it will have to do a swap for every page. Thus, if we assume an 8ms seek time on the hard drive, the hard upper bound is 8p milliseconds, where p is the number of pages to be collected. Then you think: maybe even assuming an 8ms seek time isn’t being hard enough; maybe the user is running this real-time system with a network-mounted swap partition, and the swap file is actually on Mars. Then the hard upper bound is something like 2400000p. Suddenly it seems obvious that getting a bound within an order of magnitude, one of the properties I wanted for gcNextCollectionUpperBound, is looking hopeless.

Probably I will have to abandon my dream of returning a hard upper bound. This isn’t the end of the world: as long as gcBound provides a hard guarantee, that’s all that matters in the end. Probably what I will do to get a soft upper bound is:

  1. set up some compile-time latency constants: something like 8ms for swap, 1μs for RAM, 1ns for cache;
  2. come up with some arbitrary hand-wavish heuristics for semi-conservatively guessing when a page is going to come up from one of those 3 locations;
  3. look at total memory usage and start multiplying page numbers by constants.

It’s point #2 that’s the hard one, and I’m sceptical as to how much I’ll be able to glean about the structure of the heap without actually diving into it.
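Point #3 at least is easy to sketch. GHC's runtime is written in C, so here is a rough C version of the multiplication step, using the three latency constants from the list above. The function name and the idea of receiving the per-location page counts as arguments are my own assumptions; producing those counts is exactly the hard part from point #2, and nothing here reflects GHC's actual runtime interface. I work in nanoseconds to keep everything in integers.

```c
#include <stdint.h>

/* Compile-time latency constants from the list above, in nanoseconds. */
#define SWAP_NS  8000000ull   /* 8 ms per page from swap */
#define RAM_NS   1000ull      /* 1 us per page from RAM */
#define CACHE_NS 1ull         /* 1 ns per page from cache */

/* Soft estimate of collection time, given a guess at how many pages
 * will come from each location. The guesses themselves are not
 * computed here. */
uint64_t gc_soft_bound_ns(uint64_t swap_pages, uint64_t ram_pages,
                          uint64_t cache_pages)
{
    return swap_pages * SWAP_NS
         + ram_pages * RAM_NS
         + cache_pages * CACHE_NS;
}
```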

This is sadly the only research I’ve found on the subject.

A Google search for real-time haskell yields little of interest, which I suppose isn’t too surprising. Haskell, if you’re not familiar, is not just a pure functional language, but a lazy one. On the surface, this makes it a dreadfully poor choice for a couple of reasons.

  1. Haskell code tends to be slow. I’ve started using the Gentoo language shootout as my baseline for language comparisons—it seems fairly unbiased if nothing else—and with a few exceptions, Haskell just doesn’t stack up well. This isn’t too surprising: laziness adds overhead. My mentioning performance might get the hardcore real-timers in a tiff—theoretically performance has nothing to do with real-time—but in the real world, real-time is typically either done in embedded situations where hardware resources are scarce, or else we’re in a soft real-time situation playing pr0n, and speed is important.
  2. Haskell code is unpredictable. This is the biggest flaw in Haskell that Simon Peyton-Jones has offered (that I’m aware of). When tracing through Java code, it’s trivial to derive complexity, and even a rough gauge on absolute running time, just by glancing at the code. Laziness takes this all away. I’ll talk about this in a later post perhaps.
  3. Haskell is garbage collected.

It’s that last point I want to talk about. Garbage collection is typically not used in real-time systems, particularly hard real-time systems, for good reason. Imagine writing the quintessential real-time example code—nuclear power plant code; yes, I can dream—and you’ve reserved your time slice, grabbed your mutex, and then the garbage collector decides to run for 400ms. Not cool.

For my CS 888 project, I think I’m going to propose changing Haskell’s garbage collector to provide a real-time interface. Something along the lines of:
gcNextCollectionUpperBound :: IO Int
gcBound :: Int -> IO a -> IO a

The first function would give an upper bound on how long the next round of garbage collection would take (in milliseconds). The second function/combinator would execute a section of code, guaranteeing a bound on the amount of time garbage collection will take, such as:
do  gcTime <- gcNextCollectionUpperBound
    gcBound (gcTime + 200) $ do
        requestTimeSlice (gcTime + 200)
        mutex <- getMutex
        answer <- saveLives
        giveUpMutex mutex
        return answer

This way we’re guaranteed that we complete saveLives before the garbage collector eats up all our time. So that’s about all I need for the interface; all I need now is the implementation. Yay, I’m half done!

GHC’s documentation on its garbage collector is a bit weak, so I’ll have to dig into the source. My understanding through a bit of reading is that right now GHC uses a hybrid system, with almost all programs using the copy collector exclusively, and the other part of the system being a generational collector. This is somewhat exciting because I haven’t found any papers yet discussing real-time copy collector systems.

We’ll see how far I get in this. The nice thing about a project like this is that a naïve implementation can be done hopefully without too much work (do garbage collecting until you run out of time, then just give up). Even coming up with a naïve upper bound is easy (return INT_MAX). But the project offers itself to putting effort into a more finessed solution, too.
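To make the naïve version concrete, here is a toy C sketch of "collect until you run out of time, then give up", plus the INT_MAX bound. Everything in it is hypothetical: I pretend collection is split into steps whose costs in milliseconds are known up front, which a real collector would not hand you for free, and none of the names correspond to anything in GHC.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* The laziest possible hard upper bound. */
int gc_next_collection_upper_bound_naive(void)
{
    return INT_MAX;
}

/* Run collection steps in order while the next one still fits in the
 * budget; give up at the first step that would overrun. Returns the
 * number of steps completed. */
size_t collect_bounded(const uint32_t *step_cost_ms, size_t n_steps,
                       uint32_t budget_ms)
{
    uint32_t spent = 0;
    size_t done = 0;
    while (done < n_steps && step_cost_ms[done] <= budget_ms - spent) {
        spent += step_cost_ms[done];
        done++;
    }
    return done;
}
```

With step costs of 5, 10 and 20 ms and a 16 ms budget, the first two steps run and the third is abandoned.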

I don’t know, we’ll see what Watt says about it.