A few weeks ago I finally got IPv6 working on my router via 6to4. Of course this has no real use beyond just playing around, but it’s still neat that now my computer is globally addressable. There are a few IPv6 hosts around, too; today I took a look at the output of netstat on my machine on totally unrelated business and noticed that I was connected over IPv6 to some web page I was browsing at the time. After I got 6to4 hooked up, incidentally, I emailed my ISP to see if they had any IPv6 plans. Much to my delight, they replied saying they already have their IPv6 address block allocated and will be rolling it out to customers at some point in the near future. No timeline, of course, but still very very cool.

While teaching my networking course this term, I took a lecture to talk about SCTP. It’s a beautifully designed transport-layer protocol, intended to replace—maybe not entirely, but mostly—TCP and UDP. Like IPv6, it has the problem that it was designed after “mostly good enough” solutions had already been established. Unlike IPv6, it doesn’t need any help from ISP or backbone infrastructure, but it has in some ways a much bigger problem: it needs application-level protocols to use it if it’s to gain any traction.

I felt a little guilty teaching SCTP when I’d never actually played with it seriously myself, so I gave it a try today. Unfortunately most consumer operating systems don’t support SCTP out of the box; on OS X it requires a kernel extension.

One of the neatest things about SCTP is that a single SCTP connection—an “association” in SCTP lingo—can be multiplexed into any number of streams, and these streams really do operate in parallel. There’s no head-of-line blocking as in TCP, where the whole connection gets plugged up if a packet arrives out of order. In SCTP, if one stream gets plugged up, all the other streams are entirely unaffected. This also has the benefit, of course, that you can have multiple connections without the overhead of creating multiple connections; researchers are evidently keen to see how this will apply to HTTP. For me, I would have guessed the nicest application of SCTP would be SSH tunnelling and port forwarding and whatnot, and I was a little disappointed to see no one’s taken a serious look at SSH over SCTP.

One of the more troublesome aspects of SCTP is that the number of streams your “association” contains is negotiated at initialization time; by my reading of the RFC, there’s no way to add or remove streams dynamically.

So this morning I set up a little experiment to see how SCTP—or Michael Tuexen’s kernel extension for OS X anyway—would fare with absurd numbers of streams. I figured if you could create an association with a tonne of streams without any noticeable performance degradation, the limitation of having to specify the number of streams up front wouldn’t be so bad.

The bad news is that even though the RFC theoretically gives you up to 65535 streams, I started getting “Invalid argument” errors as soon as I got up to about 2000 streams. The good news is there was no change at all in performance or server/client behaviour whether I used 1 stream or 2000 streams. I suspect there are some possibly large memory structures built inside the kernel when you establish an association with 2000 streams, but that’s hard to see from userspace. If I get time I’ll poke through the source code to see how things are structured.

The other bad news is that SCTP is slow. Sending data as fast as I could to ::1 (my local machine, not over the network) was about a quarter as fast as TCP. I don’t think there’s anything about SCTP that would require this slowness; it’s almost certainly just due to the immaturity of the implementation. This implementation, if I remember right, was taken from FreeBSD. I’ve heard stories that Linux’s SCTP implementation is the worst of all, much slower even than FreeBSD’s, which is sad, but maybe things have improved recently.

There’s another big advantage of SCTP—multi-homing—which I haven’t had time to play with. With multi-homing I could, theoretically, establish an association/connection over wired Ethernet, then transition to WiFi, with the connection totally unbroken and, what’s more, the application-layer totally unaware that anything had happened. I’m thinking with a bit of work one could even migrate SSH sessions from one host to another. It could be cool beans.

I’m sure there’s some irony in me putting effort into playing around with these networking protocols just so I can talk to myself. Now I know how Esperanto speakers must feel. It’s fun, though.

I apologize for the long delays between posts. Rest assured no news is good news, and it’s just business as usual.

One bit of exciting news, though, is that I’ve received an email from my old supervisor, Robin Cockett, at the University of Calgary. He has a project for a pretty cool sounding programming language based on Martin Hofmann’s work—where programs are restricted to polynomial time—and needs someone to worry about the implementation details. I’m pretty excited about the prospects.

Also I’ve decided to turn my idle attention to my Zipit, borrowed from my friend Albert. It’s kind of a fun little thing, if annoying to type on, but sports a 60MHz ARMv4 chip in it. I’m toying with the idea of having my compiler target not just C, but also ARM. I’ve got an ARM cross-assembler installed, so it’s just a matter of working out the networking details.

ARM is a surprisingly beautiful ISA. Everyone knows it for its pervasive use of conditional instructions, of course, but its addressing modes are quite nice as well. My only other exposure to pre- and post-index addressing was with the PDP-11, but ARM does it in a much cleaner way.

I regularly skim through digests of comp.risks, a newsgroup devoted to the risks of computing, as it’s directly related to my research and, above all, pretty entertaining. I was recently catching up, and the subject line “A risk of static analysis tools — and refereeing” understandably caught my eye. I reproduce it shamelessly below without attribution, as I don’t know its original source, though it’s included in comp.risks 25.02 (January 14, 2008):

Interesting anecdote: Some years ago a simple static code analysis tool was submitted to a conference. Two of the three reviewers, both of whom were extremely careful programmers, ran it on their own code just to see what would happen.

The tool produced 100% false positives (not 95, not 99, but fully 100%). As a result, the paper wasn’t accepted.

The same paper was later submitted to another conference (where reviewers didn’t try this), where it was accepted and won the best paper award.

In the fall-out of SOFSEM 2008, I have a bit of a to-do list to take care of. There are the usual responsibilities: readings and lecture preparations for the course I’m taking; marking, office hours, and lab hours for CS342; organizing the upcoming internal CS conference, UWORCS; and keeping up with the usual political and governing bodies on campus, namely SOGS, CSGSC, GEC, and the department chair selection process. I also have to sort out all my receipts from SOFSEM to claim the travel expenses: currently all my receipts are basically in a number of giant balls of papery mess.

I also have to try and keep up with my research. Currently I have three things on my plate as far as research is concerned. The first is to keep in touch with the aforementioned Matej to see how applicable my research will be to the noble goal of robust operating systems. I’ve been reading up on Pict, a language based on the π-calculus, for the past few days and have just sent him another email to see what can be done. The research here seems overwhelmingly practical and I’m less inclined to spend a lot of time coming up with nice theoretical models. I think it would be more prudent to come up with a quick addition to the Pict syntax just to show that it works. But it does have a lot of potential down the road, perhaps, for a better foundation.

Another avenue of research, which I spent my office hours on this morning, is coming up with a compiler for my language, Ca. Incidentally, I now realize I should have spent more time coming up with a good name for it.

I spent this morning writing my own copy collector in C. I like the idea of compiling the language down to C. I also like the idea of having the data structures totally under my control, as my research naturally leads away from garbage collection towards other memory management schemes like regions, or some hybrid. Further, I want my garbage collector to be precise, and it’s hard to get away from conservative garbage collectors in C. For static analysis purposes, it’s also nice to have complete control over the semantics of the memory manager. And, for the purposes of writing short papers, it’s nice to have a memory manager with very simple semantics haha.

Like I said, it’s a copy collector, a rather simple one. No generations or concurrency or speculation or anything like that. Object overhead is 8 bytes right now, though hopefully that will come down. It handles unboxed objects nicely, and precisely.

The downside is that, after much time fiddling with setjmp(), getcontext(), and even writing hand-coded assembly, I’ve given up on finding a nice universal way to derive roots, so code has to explicitly use add_root() and remove_root(). Seeing as the code will be automatically generated by the compiler, I’m hoping this won’t be too onerous, but we’ll see.

I suppose everyone’s different, but for me the memory management is the hardest part, so I’m expecting the rest to be mostly grunt work.

After I get the compiler stable enough that I can play around with it a bit, I can start looking at developing the algorithms to actually give some nice bounds on it statically. We shall see.

I’m now fairly settled back in Canada, having arrived back from SOFSEM 2008 in Slovakia. In the middle of the picture at right you might see a round building—like a short tube on its end—which was our hotel and conference site.

The event was a few firsts for me: my first time presenting at an international conference, my first time in Europe, and of course my first time at SOFSEM. I’d highly recommend the conference to those in Canada—or anywhere outside of the former Czechoslovakia, really—who might not have heard of it. It was a really good mix of all sorts of computer science topics and a nice balance of theory and practice. The conference was a good size and generally a lot of fun.

As luck would have it, I got roomed with a fellow named Matej Košík, a Ph.D. student from Bratislava. He was looking forward to my talk, and his talk turned out to be one of the highlights for me. He was building a robust operating system kernel, sort of carrying all the advantages of a conventional microkernel, but based on static analysis—with the help of the specialized programming language Pict—instead of memory protection. One of the open problems for him was that kernel components can use up arbitrary CPU cycles or memory, which is where my research might fit in.

Other contributions of interest for me were some proofs of upper and lower bounds of proving group properties—i.e., proving that a semi-group is a group, or proving that a groupoid is a quasigroup—in matrix representation on a quantum computer; a new technique for getting shorter regular expressions on a DFA to regular expression transformation; a proof that valence-2 graph isomorphism is logspace-complete; a bunch of semantic web stuff, which I don’t follow very closely, but I like to hear about it and see the pretty pictures; an invited talk by Jarkko Kari on a novel way to prove non-termination of Wang tiles; deniable encryption; electronic voting. It was just a cornucopia of cool stuff; even if the conference didn’t attract the World Expert in any one topic, it was very cool to see what was going on in so many interesting areas.

In between all of this, and sometimes in lieu of it, of course, was going out to see the mountains and trying some of the local alcohol with some very cool grad students, though strangely Slovakian beer was pretty hard to find, Czech beer being much more common. I have taken a number of pictures.

For those who are interested, my research page has a link to my paper and to my presentation. If you don’t understand catamorphisms on the first try of reading the paper, maybe the presentation will make things more clear ha!

I just fixed a rather stupid bug that had been bugging me for hours. This is rather Cocoa and Objective C specific, so bear with me.

My application has an NSTableView—a widget to display a table of data. An NSTableView doesn’t store its own data; rather, it has a delegate object called a data source that supplies data to it. In order for the NSTableView to get its data in a well defined manner, it requires that its data source implement two methods: numberOfRowsInTableView: and tableView:objectValueForTableColumn:row:. These two methods return the number of rows in the table and the value at a particular column/row, respectively. Side note: why does the NSTableView need to query for the number of rows, but not the number of columns? Because it’s extremely poorly designed, that’s why.

Anyway, normally the way one would require that an object implement some method is via a “protocol.” A protocol in Objective C is what Java calls an interface, basically just some static guarantee that an object implements something. Using a protocol allows fun stuff, like the ability to verify at compile-time that an object will implement some method.

However, for reasons that are not clear to me, NeXT/Apple did not make the data source into a protocol. They made it what they call an “informal protocol,” which basically means that things will be checked at run-time, not at compile-time or link-time. Another side note: being C based, you might expect that Objective C would be a static language. Quite the opposite. It’s very common to, for example, add methods to an object or class at run-time.

Anyway, I got my program all working perfectly. Until I tried to build a “release” version of it. I narrowed it down to: if I compile it with -O0, it works perfectly; if I compile it with -O1 or higher, suddenly NSTableView doesn’t think that my data source responds to the appropriate methods. Since NSTableView doesn’t trust my object, it doesn’t display any data.

After an hour or so of trying to narrow down the bug, I finally found and fixed it. It cropped up out of my stupidity more than anything else—of course—but some stronger typing would have helped things out a little more.

The object model of Objective C, as I said before, is entirely dynamic, modelled on Smalltalk. A side-effect of this is that the object model of Objective C is not statically typed. I’m lying a little bit, but only a little bit. So the “C” in Objective C is statically weakly typed, and the “Object” part in Objective C is dynamically weakly typed.

The problem came when I wrote the “init” method for my data source. In NeXT/Apple’s Objective C framework, creation of an object happens in two stages. The “alloc” and friends methods deal with memory allocation, and the “init” and friends methods deal with object initialization. This is kind of handy because it allows memory allocation and object initialization to be independent and orthogonal. You can do stuff like: [[Foo allocWithZone: [self zone]] initWithSize: 5], using a combination of allocator and initializer that the class creator might not have thought of.

Anyway, because I’m an idiot, for some reason when I was writing my class’ initializer, I thought that “init” didn’t have a return value. So I went on my merry way and wrote:

- (void)init
{
  [super init];
  size = 0;
  /* other initializations ... */
}

That (void) you see before “init” is the return type, and it comes into play only now and then, and even then only sort of; its purpose is to deal with C being statically typed. Another diversion: in Objective C, every message of the same name is assumed to have the same type signature, so init in one class has the same type signature as init in another. Thus, any code actually calling my init method is none the wiser that I’ve erroneously declared it to have no return value.

So calling code calls init and looks in register eax (on Intel, or r3 on PowerPC) and assumes that that’s the return value. By coincidence, it happened to work out perfectly for -O0 but not for -O1 or higher. So, after changing the method to:

- init
{
  self = [super init]; /* yes, I know this is weird. it works in Objective C */
  if (self) {
    size = 0;
    /* ... other initializations */
  }
  return self;
}

Everything worked perfectly. If only there had been some sort of type error to clue me in!

One of the nice developments in architectures in the 1990s was the advent of so-called SIMD (single-instruction/multiple-data) instructions. Intel/AMD users will recognize this as the MMX/3DNow/SSE/SSE2/SSE3/SSE4 instruction sets. Users of the PowerPC will recognize it as the VMX/Altivec/Velocity Engine instruction set. Believe it or not, Sun was actually first to the punch with SPARC’s VIS (Visual Instruction Set) in 1995. Unfortunately I couldn’t find a good introductory tutorial on VIS from an assembly programmer’s point-of-view, so I decided to write my own. This article assumes a general familiarity with SPARC assembly, though if you know any RISC assembly, I suspect you’ll do alright.

The intention of VIS and other SIMD instruction sets is to perform multiple arithmetic or logical operations in one clock cycle. Those who are familiar with SPARC can probably believe that if you have one 32-bit register, and another 32-bit register, the CPU can add those two together in one clock cycle. Now imagine that instead of having a 32-bit integer in each register, you had 4 8-bit integers in each register. So you have 4 8-bit integers and another 4 8-bit integers, and what you want to do is add them in parallel, so that if register 1 has the 4 integers a, b, c and d, and register 2 has the 4 integers w, x, y and z, the output should be the 4 integers (a + w), (b + x), (c + y) and (d + z). If you think about how CPUs perform addition, it shouldn’t be hard to convince yourself that you’re not actually doing any more work: adding two sets of 4 8-bit integers is exactly the same as adding two sets of 32-bit integers, except that at every 8th bit, you’re throwing away the carry.

That’s the basic set-up. For reasons of practicality, VIS does not use 32-bit integer registers for its SIMD instructions: 32 bits is just too small. Rather, it uses the floating-point registers, which are 64 bits wide (sort of: we use %f0/%f1 as a single 64-bit register, %f2/%f3 is another one, etc.). It allows you to either pack 4 16-bit integers, or 2 32-bit integers, into a register. Using the floating-point registers is going to cause us a bit of pain, because getting values into or out of FPU registers on the SPARC is stupidly painful—basically you have to store an integer register to memory, then read it from memory into an FPU register—but we can figure out a few tricks to get around that.

To demonstrate VIS, I’ve chosen to tackle the standard C function char *strchr(const char *, int). If you don’t have the man page handy, strchr takes in a string and a character, and searches through the string to find the first occurrence of that character. It returns a pointer to the location in the string where it found the character if successful; otherwise, it returns the null pointer. Here’s a straight-forward implementation of strchr in SPARC assembly:

loop:
  ldub [%o0], %o5   ! load in a character from the string
  tst %o5           ! check to see if it's the null character
  bz notfound
  cmp %o5, %o1      ! compare it to the character we're looking for
  bne loop          ! if we haven't found it, jump back to the top of
  inc %o0           ! [DS] the loop and advance to the next char
  retl              ! if we've fallen through, we've found our character
  sub %o0, 1, %o0   ! just back up one character
notfound:
  retl              ! we've hit the end of the string; just return null
  clr %o0

(Note: I’ve implemented this as a leaf function. If you don’t like leaf functions, pretend there are save and restore instructions, %o0/%o1 are %i0/%i1 and %o5 is %l0.) This function works very well. Basically we read in a character and see if it’s what we’re looking for: if it’s not, read in the next character, and so on. From a performance standpoint, however, it sucks that we have to read in one character at a time and compare one character at a time. What if we could read in four characters at a time and compare four characters at a time? Could we perform this function four times as quickly?

The first thing we’re going to do is make sure we’re not looking for the null character. Our algorithm will work much better if we can safely assume that we’re not looking for the null character. So, handle that off the bat:

strchr_fast:
  save %sp, -96, %sp
  tst %i1           ! are we looking for the null character?
  bnz not_null      ! hopefully not....
  nop               ! for clarity, I'll leave this unfilled
  call strlen       ! find out the length of string, and use that for our
  mov %i0, %o0      ! return value
  ret
  restore %o0, %i0, %o0
not_null:

So far so good. If the second argument is the null character, just hand our problems off to strlen to find out where that is. Now we deal with the case where we’re looking for an actual character. This is where the fun begins. The first thing we’re going to do is make four copies of our character. A character is 8 bits, and in order to do four comparisons at once, we would like four copies of it:

  sll %i1, 8, %l0   ! make another copy 8 bits over
  or %i1, %l0, %l0  ! and merge them together
  sll %l0, 16, %l1  ! make a copy of that 16 bits over
  or %l1, %l0, %l0  ! and merge them together

Sweet. If %i1 originally started out as 000000ab, where each digit is a hex digit, then %l0 is now abababab, exactly what we want. The next thing we do is move this value into an FPU register:

  st %l0, [%fp-4]   ! move abababab into a scratch spot on the stack
  ld [%fp-4], %f0   ! load it into %f0

I said before that for VIS we deal in 64-bit double-registers, not 32-bit registers. What we really want is 00ab00ab00ab00ab instead of abababab. To do this we make another FPU register which starts off as all zeroes. We then use our very first “real” VIS instruction—fpmerge—to “merge” the registers together:

  fzero %f2
  fpmerge %f2, %f0, %f0

What fpmerge does is take one byte from %f2, then one byte from %f0, then another byte from %f2, then another byte from %f0, and so on. The end result is we get our “ab”s interleaved with zeroes, so that we now have 4 16-bit values in the %f0/%f1 double-register. Time for the magic to happen! We’re going to start off our loop by loading 4 bytes all at once. We’ll then use our good friend fpmerge to interleave zeroes in there again:

loop:
  ld [%i0], %f4          ! load in 4 bytes
  fpmerge %f2, %f4, %f4  ! space it out a bit

So now %f0/%f1 is 00ab00ab00ab00ab, and %f4/%f5 is some 00cd00ef00gh00ij, corresponding to the next 4 characters we read in from the string.

This would be a good point to mention a hidden assumption I’m making: it’s okay to read past the end of a string as long as it’s not by too much. For example, consider that our string is only 2 bytes long: we’ve just read 4 bytes! Is that okay? The short answer is “yes”. The longer is that the only way we will run into a segmentation fault is if we run across a page boundary, and assuming that %i0 is divisible by 4 (see below), reading in 4 bytes will never cause us to cross a page boundary.

The more contentious assumption I’m making is that the string is word-aligned, i.e., that %i0 is divisible by 4. This assumption is actually not true, though in practice it will be true so long as we don’t pass strchr_fast a pointer to the middle of a string. We can actually modify the code here to handle the case where %i0 is not divisible by 4, by handling the first 1 to 3 characters specially until we hit a word boundary. I’m not including it here because it does nothing to the algorithm except complicate it, though for robustness’ sake, we should include it.

Anyway back to the code. We’ve just read 4 characters of the string into %f4 and spaced it out so that %f4/%f5 looks like 00cd00ef00gh00ij. Now comes the cool part. Careful you don’t miss it: this is where all the action happens in this algorithm. If you understand nothing else from this tutorial, understand this. We are going to do a couple parallel compares:

  fcmpeq16 %f0, %f4, %l0
  fcmpeq16 %f2, %f4, %l1

The fcmpeq16 instruction does 4 simultaneous compare instructions. The “16” in fcmpeq16 means that it does compares assuming 4 16-bit values, as opposed to 2 32-bit values. As you might guess, the “eq” in fcmpeq16 means that we’re comparing for equality. Let’s look at the first fcmpeq16, because that will be most clear. It compares 00ab to 00cd (the first 16 bits), then (in parallel) compares 00ab to 00ef (the next 16 bits), then compares 00ab to 00gh, and then 00ab to 00ij. For each comparison it does, it writes a bit to %l0. If 00ab and 00ij compare equal, then the rightmost bit of %l0 is set to 1; otherwise it’s set to 0. If 00ab and 00gh compare equal, then the second-from-right bit of %l0 is set to 1; otherwise it’s set to 0, and so on.

The second fcmpeq16 is doing the exact same thing, except we’re comparing against %f2/%f3. What’s %f2/%f3? It’s all zeroes! In effect we are comparing the characters in %f4/%f5 against the null characters, so that we can determine if we’ve hit the end of the string.

Now we get really fancy. Remember that the character we’re looking for is not the null character, because we dealt with that at the very top of the function. What this means is that %l0 and %l1—the results of our comparisons—are equal if and only if they are both zero. In other words, %l0 and %l1 compare equal if and only if we have not found the character we are looking for and we have not found the end of the string. We exploit this insight thusly:

  cmp %l0, %l1
  be loop            ! if they're equal, we haven't found anything
  inc 4, %i0         ! so loop back to read the next 4 bytes!

Note this highlights nicely where the performance gains come from using VIS. In the old version of our code, we incremented 1 character per loop. In this version, we increment 4 characters per loop. Hence, we iterate through our loop one quarter as many times, which means one quarter as many branches.

If we fall through that branch, that means we have either found the character we’re looking for, or we’ve found the null character, or we’ve found both. But because we’re comparing 4 characters at once, we care about where we’ve found our match. First of all, we’re most interested in whether we found the character first or the null character first. If we found the null character first, then we should return null, to say that we didn’t find the character. Let’s deal with that off the bat.

Convince yourself that if %l0 compares less than %l1 numerically, then that means we have found the null character first, and if %l0 compares greater than %l1, then that means we have found the other character first. Also note that the condition codes are still set from the “cmp %l0, %l1” above, so there is no reason to do another comparison.

  bg found
  dec 5, %i0         ! [DS] back up the truck to before we advanced by 4

(Note: we decrement by 5 instead of 4 because of the inc instruction below.) If we’ve fallen through that branch, then that means we have hit the end of the string first. All we have to do is return null. Easy as pie:

  ret                ! return 0
  restore %g0, %g0, %o0

Now comes the hard part. We have to deal with the fact that we found the character we’re looking for. It’s not just enough to know that we found the character; we have to know where we found the character. I regret to inform you that the best way I’ve found to do this is using a loop. I know, I fail. I won’t go into great detail here, but what I do is shift %l0 to the leftmost bits and keep shifting out until it turns negative, indicating I’ve found my leftmost one bit. Caution: this is tricky, and without a lot of documentation:

found:
  sll %l0, 28, %l0   ! shift the 4 bits so they are the leftmost 4 bits
next_bit:
  tst %l0            ! is the leftmost bit 1?
  inc %i0
  bpos next_bit      ! we haven't found it yet; keep looping
  sll %l0, 1, %l0

If you followed that, then hopefully you believe the return value is sitting in %i0 for us. All we have to do is return.


I may revisit this page in the future to clarify a few points, especially near the end where I fear things may get muddled. The important thing to understand is the main loop, where we read in 4 bytes and compare 4 bytes at a time. It’s only once we get out of the loop that things get really icky, in my opinion.

By the way, if you’re compiling this with gcc or gas, use the -mcpu=ultrasparc argument to gcc; otherwise it will complain that you’re using super advanced instructions.

I said before that this would hopefully lead us to a strchr that is four times as fast. We have one quarter as many loads and one quarter as many compares, so it should be four times as fast, right? My benchmarking shows it’s only about three times as fast; I haven’t put in the effort to find out exactly why, but nothing’s ever as good as it is in theory, and honestly I think three times as fast is pretty good.

VIS and SIMD instructions in general are used in all sorts of applications, not just searching through strings. Multimedia is the biggest application, which is why Intel marketed theirs as MMX (MultiMedia eXtensions) and Sun marketed theirs as VIS (Visual Instruction Set). In audio processing it’s incredibly easy to see where SIMD speeds things up: audio processing is really nothing more than doing some simple operation, like adding or multiplying, to every element in an array. Even better, in audio processing, usually each element in your array is only 16 bits in size, because CD quality audio uses 16-bit sample depth, though for professional mixing I suppose 32-bit sample depth is all the rage these days.

Enjoy! If you would like to learn more about VIS instructions, a page like this one may serve you well, though bear in mind that documentation on VIS instructions is very hard to come by.
