Rants


This image is stolen from a reddit link.

It shows ideas in computer science moving from academic research to industrial research to product to successful product. Speech recognition I find pretty amusing.

Being me, I’m disappointed no programming language concepts made it to the chart, not even objects! I can’t think off the top of my head what the first commercial object-oriented development system was: I suppose in those days there wasn’t a clear divide between selling hardware and selling software.

Anyway this is relevant to me since, when contemplating my future after my Ph.D., I often fantasize about bringing my own research to market. The first major fear I have, of course, is that no one would want to spend much money on a compiler that guarantees resource bounds; I hope I’m wrong there. The second fear, though, is that it would take far longer than I could ever expect to turn a proof-of-concept into a marketable product. This chart confirms that, historically, there is a gap of at least five years or so between a research idea and a consumer product.

The chart also shows the importance of industrial research and development, the blue lines. With some notable exceptions, like Microsoft and IBM, I fear industrial research has been getting squeezed out in the past few years, which is a shame.

We need more blue!

I’ve been thinking about the compiler again and how best to implement higher-order functions, once the time comes to do that. Currently the compiler uses a lambda lifter; i.e., there are no closures. Every function is lifted out of its enclosing scope to the top level, collecting extra parameters along the way. This is fine for first-order programs, since you can change every call to a nested function into a call with extra arguments. In higher-order programs, however, this causes problems, because the arities have to match. For instance:

f x = let g y = x + y; in
  g (x + 4);

This gets transformed into:

f_g x y = x + y;
f x = f_g x (x + 4);

Every function is now at the top level and g has all the variables it needs. This causes problems with higher-order functions, though:

f x zs = let addX y = x + y; in
  map addX zs;

Naïvely this is:

f_addX x y = x + y;
f x zs = map (f_addX x) zs;

Ahh, but the first argument to map is a closure, not a function! We were supposed to be implementing closures as lambda lifted functions, so we can’t very well implement lambda lifted functions as closures! There are myriad other problems with this approach, but suffice it to say it won’t work.

Anyway, I was contemplating abandoning lambda lifting entirely and just leaving closures implemented as closures in C. The GNU C Compiler has a little-known and, sadly, little-used feature usually called nested functions. They’re half-closures, in a way: they’re closures up until their lexical scope disappears. Well, such is the C way. But they would do what I needed.
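A minimal sketch of what they look like (my own example, not anything from the GCC manual):

#include <stdio.h>

/* Apply f to each element of xs. */
static void each(int *xs, int n, void (*f)(int))
{
    for (int i = 0; i < n; i++)
        f(xs[i]);
}

int main(void)
{
    int sum = 0;

    /* A GNU C nested function that closes over sum; taking its
       address (to pass it to each) is what forces GCC to build a
       trampoline on the stack. */
    void add(int x) { sum += x; }

    int xs[] = { 1, 2, 3 };
    each(xs, 3, add);
    printf("%d\n", sum);    /* prints 6 */
    return 0;
}

The catch, of course, is that add is only valid while main’s frame is live; returning it and calling it later is the usual C-flavoured undefined behaviour. That’s the “half-closure” part.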

I was reading the original 1988 USENIX paper by Thomas Breuel, which describes a technique, usually called “trampolining,” for implementing nested functions quite smartly in C. As I was reading, I discovered that the author had originally intended to implement unnamed functions in C!

He gives two suggestions for syntax. I like the first, which looks something like (int (*)(int x) { return x + 2; }). For instance, you could write (int (*)(int x) { return x + 2; })(3), which would evaluate to 5. Perhaps not the most useful example, but you get the idea.

These unnamed functions are really no more of a bother to implement than named functions, once you allow nested functions, but it appears this feature was never implemented in GCC! I suppose to get good use out of them you’d need a less bulky syntax. Type inference could help there.
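For what it’s worth, something like his unnamed functions can be faked in GNU C by combining nested functions with statement expressions, another GCC extension. A quick sketch of my own, not anything from the paper:

#include <stdio.h>

int main(void)
{
    /* A statement expression whose value is a pointer to a nested
       function: a poor man's unnamed function. */
    int (*succ2)(int) = ({ int fn(int x) { return x + 2; } fn; });

    printf("%d\n", succ2(3));   /* prints 5, as in the example above */
    return 0;
}

It has the same lifetime caveat as any nested function, and the syntax is at least as bulky as Breuel’s.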

Anyway, the point of this post was to register my dismay at GCC never implementing unnamed functions, despite there being no technical reason for not doing so. They can be quite handy, those unnamed functions.

For what it’s worth, I think I’ve decided against relying on GCC’s trampolining to do higher-order functions and lambda abstractions in Ca. It is very clever and very efficient, but the semantics are a bit hairy for the 21st century. For example, I just looked at some SPARC assembler that GCC produced for it, and it involves a system call to make the stack executable. In today’s security-conscious kernels, executing the stack, as trampolining necessarily requires, is less practical than it once was. Maybe one day we can get rid of these blunt MMUs, but that’s a rant for another time. In any case, MMU concerns aside, executing the stack mucks with the instruction cache on architectures with separate caches. As I said, hairy semantics.

In Breuel’s paper, he dismisses other, simpler ways of implementing closures as impractical because they change the calling convention. Well, I’m making my own language and I can make my own calling conventions. I expect that higher-order functions will simply be passed as two pointers instead of one. It saves everyone a lot of headaches.
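Concretely, I imagine the convention looking something like the following C sketch; the names and representation are mine and nothing is final:

/* A closure is just two pointers: code and environment. */
struct closure {
    int (*code)(void *env, int y);  /* the lambda lifted function */
    void *env;                      /* its captured variables */
};

/* The lifted addX from the earlier example, with its environment
   passed explicitly instead of gathered from the stack. */
struct addX_env { int x; };

static int addX_code(void *env, int y)
{
    return ((struct addX_env *)env)->x + y;
}

/* map calls any closure uniformly through the pair; no trampolines,
   no executable stack. */
static void map(struct closure f, int *zs, int n)
{
    for (int i = 0; i < n; i++)
        zs[i] = f.code(f.env, zs[i]);
}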

Progress on the Ca compiler is going well. The lambda lifter is done. I really couldn’t be much more pleased about how things are going.

I decided now would be a good time to clean up and document the code. I’m a big fan of literate programming and so have been investigating literate programming tools. No matter how long your multi-line comments are, I find documentation inside code to be completely useless for getting “big picture” concepts from source code. Literate programming seems to be a good solution.

The idea behind literate programming is that you have one source file. This source file is both a typeset document source file and a programming language source file. For example, through Donald Knuth’s CWEB tool, you write a foo.w file which becomes both a foo.tex file and a foo.c file.

My experience with literate programming up until this weekend has been a bit of a lie. It was with Literate Haskell, which Wikipedia—I now realize correctly—says is semi-literate, not literate. In Literate Haskell, basically you interleave LaTeX code and Haskell code, and the Haskell compiler is clever enough to only compile the Haskell code. It’s cute and it worked well enough to do my thesis, but it’s not really literate programming.

If you’ve never sat down and played with Knuth’s WEB or CWEB, I would highly recommend you do so. Knuth’s vision of literate programming is not just to interleave documentation and code but, to borrow the names of Knuth’s own tools, to weave and tangle them together. Documentation and code are not interleaved; they’re indistinguishable, almost. CWEB gives you the freedom to refactor and reorder your code in ways that make it look totally unlike something intelligible by a C compiler, to make it look like literature—quite a feat for C—but to give you a valid program at the end.
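To give a flavour, here is a toy CWEB file of my own devising. cweave typesets it; ctangle stitches the named sections back together into C the compiler will accept, whatever order they were written in:

@ A trivial literate program. Note that the header files are
defined in a later section, after the point where they're used;
|ctangle| doesn't mind.

@c
@<Header files@>
int main(void)
{
    printf("Hello, literate world!\n");
    return 0;
}

@ And the headers themselves, placed wherever they read best.

@<Header files@>=
#include <stdio.h>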

In a sense it’s not really fair to compare Literate Haskell with CWEB, as they’re working in different domains. For example, one of the nice things about CWEB is that it lets you reorder code. That need doesn’t exist in the world of Haskell, since Haskell never really had any substantial restrictions on the ordering of code in the first place.

Anyway, my Ca compiler cac is currently written in two languages: Haskell and C. Eventually I’ll have to find a way to deal with its grammar as well, but I’ll cross that bridge later. Literate Haskell works well enough for the Haskell side of things. It’s quite inoffensive; you can structure your LaTeX code however you like.

For the C side of things, I’ve spent most of today installing tools, reading their documentation, playing with them, trying to get them to work. I played with noweb and nuweb, but neither seemed too impressive. Once I installed CWEB, things improved a bit.

CWEB really is just a flat-out brilliantly written piece of software. It does exactly what you want it to do, and a whole lot more, all simply, logically, and robustly. It weaves together documentation and code beautifully. There is only one problem with it: it produces Plain TeX code. I don’t know Plain TeX.

No problem; there is a piece of software out there called latex-CWEB, which is a LaTeX class that will render CWEB documentation. Note, I said class. So far as I can tell, this means any document you create with latex-CWEB cannot be an article; it cannot be a report; it cannot be a book; it cannot be any type of document other than a “cweb” document. This is a bit restrictive if you consider that you might want a document which contains things other than CWEB documentation, such as Literate Haskell documentation.

Since, as I said before, Literate Haskell is so inoffensive, combining it with the latex-CWEB class will probably be workable, and that might turn out to be the best option.

The other competing option is to try and learn Plain TeX. I’d never considered it before; I was one of those “why use TeX when we have LaTeX?” kind of people. But my opinion is changing a bit after seeing how CWEB renders my humble C code: it’s gorgeous.

One final note before I head off to bed. It has frustratingly come to my attention throughout the day just how marginalized literate programming is. I did some Google searches for a “literate IDE”. Needless to say, I didn’t find anything. It really is a shame. Literate programming is a fantastic idea, but all we have now is a smattering of simple, mostly one-off tools, almost none of them maintained since the mid 1990s, all incompatible, all structuring things in different ways. The state of the art right now is that it really is a big hassle to write literate code unless you either resign yourself to second-rate documentation or commit yourself to exactly one language with exactly one tool.

This unfortunately comes at the busiest time in a long time for me, so I will make this brief. There is rumour that the Canadian government is coming out with new copyright legislation. The legislation has not been tabled yet, so all we have to go on currently is rumour, but the rumour is not good. It could be some of the most restrictive copyright legislation in the world.

Of particular interest: some of you may have heard of the infamous Sony Rootkit. The Sony Rootkit was a piece of software which would silently and automatically load on Windows computers when you tried to play a Sony music CD. The software was intended to prevent you from copying music off that CD. Needless to say, it introduced a huge number of security holes, leaving your computer wide open to malware, and degraded performance and stability; infamously, it could cause a Blue Screen of Death if you attempted to stop it. If the rumours of the upcoming legislation are true, it would be illegal for you to remove software like the Sony Rootkit or attempt to prevent it from being loaded in the first place.

The government had attempted to bring in legislation like this before, under the moniker of Bill C-60. Bill C-60 failed to pass, and I have no reason to think that, if the upcoming legislation is as bad as it sounds, it wouldn’t also be defeated.

Still, it’s a little worrying that this is the direction the government is trying to go with copyright. The government says that it is required to draft legislation of this sort in accordance with the 1997 WIPO Treaty, which Canada signed. I’m kind of ignorant about the 1997 WIPO Treaty, so I can’t comment on that. I hope it’s not true.

I’ll post again later once I become better informed, maybe after I get back from the Dominican Republic. In the meantime, you might be interested in Michael Geist’s list of things you can do.

I’m a teaching assistant for an introductory assembly language course this term. Specifically, it’s SPARC assembly. One of my favourite things about SPARC assembly is that the assembler does a terrible job of offering any sort of macro features. This may not sound like much of a benefit, but it means we get the students to use m4 for their macro needs. Hence, we’ve already drilled into them the pipeline of running their code through m4 and then assembling it.

Anyway, I’ve been biding my time and biting my tongue for a while, saying “oh yes you can do crazy stuff in m4 like loops, but don’t worry about that. We won’t be doing any of that.” However, today I was talking a bit more about macros and how macros can invoke other macros, and their eyes lit up like Christmas trees.

So next lab I’m planning on breaking into it a little bit. I expect I’ll have a lot of other stuff to talk about, but if I have time, I’ll talk about m4 and how it’s Turing-complete, and maybe gauge their interest in non-imperative programming languages.

Can m4 be a gateway drug to functional programming?

I suppose it’s further proof that Haskell has corrupted me forever. For my hobby projects, I’m typically doing Objective C Cocoa these days. Usually when I find myself working in non-Haskell languages, I get frustrated at the lack of laziness. Laziness is a wonderful thing. Treating a computation as an infinite data structure just makes your life a lot easier in a number of ways, most notably by letting you reuse components that were designed to work on finite data structures. From here on, every time I say “data structure”, pretend I mean “collection” or whatever they’re called these days: abstract data types that are meant to hold a bunch of things.

The more I work, though, the more I’ve noticed that I’ve fallen into a strange design pattern: I practically never pass around actual data structures (unless they’re mutable and I’m adding something to them, for example); I practically always pass around an enumerator—NSEnumerator in Cocoa.

It’s really quite wonderful. The most immediate properties of this are:

  1. I don’t have to waste time wondering if I’m operating on an NSSet or NSArray or whatever, or if they have a common superclass;
  2. I’d have to create an NSEnumerator eventually anyway to do anything useful, so I’m not adding any new overhead;
  3. the big one, if you haven’t guessed already: it’s trivial to introduce infinite data structures, i.e., data structures which aren’t “data” at all, but rather code.

I whipped up what I call CallbackEnumerator and its friend, the Enumeratable protocol (protocols are what Java calls interfaces. The world of OO wouldn’t be complete without 3 contradictory nomenclatures). The Enumeratable protocol contains just the -enumerateNext message (and variants to allow for arguments), basically as a means to pretend that a certain object is actually a closure. The CallbackEnumerator class does nothing but transform NSEnumerator messages into Enumeratable messages.

So far it hasn’t been terribly useful, as I’m only using it in one spot in my program, in my tokenizer, the idea being that you create a tokenizer with a string and, as if by magic, it instantly gives you an NSEnumerator with all the tokens. Arguably I could have just made a special TokenizerEnumerator and skipped the whole CallbackEnumerator nonsense, but I like the potential for reuse.

Where the enumerator fails is in providing structure. The enumerator does a wonderful job of simulating a lazy list. But what about a lazy tree? It seems rather pointless to create an enumerator-alike for each abstract data type.

In general, I think the role of the enumerator in object-oriented languages like Objective C is backwards. What would make more sense is, for each abstract data type, have an “enumerated” subtype. For example, for NSArray, there could be an NSEnumeratedArray, such that NSEnumeratedArray is a subclass of NSArray. NSEnumeratedArray operates exactly like NSArray, and has the same methods, but rather than containing a giant chunk of memory from which to return elements, it executes code. Similarly for NSDictionary or any other abstract data type.

Consider then the -objectEnumerator method that NSSet implements. What this is really saying to me, semantically, is a transformation of the elements of NSSet, so that they are no longer represented as a set, but rather as a lazy list/array. So, replace that method with -enumeratedArray and you have the same thing. I should mention that, in Objective C, there is a sharp divide between immutable objects and mutable objects. NSArray is necessarily immutable, and thus it’s easy to say that an NSEnumeratedArray would respond to the same messages as an NSArray. As to what an NSMutableEnumeratedArray would look like, that’s another matter.

The benefit is that you are no longer conflating two ideas. An enumerator is two things: it is a way of representing laziness, and it is a flat, ordered view of elements. There can be a separation between the two. So you can have your NSEnumeratedArray, which is exactly as an NSEnumerator is now: it provides you a way to get the elements out of a list, one at a time, sequentially. Or you could have an NSEnumeratedDictionary or NSEnumeratedTree or what have you.

I suppose it’s somewhat telling that I had to invent types to prove the point. Unfortunately, Foundation—the part of Cocoa that provides abstract data types—doesn’t offer much beyond arrays, sets, and dictionaries, so maybe there isn’t much of a need. Still, it’s a bit telling that while all of these classes have initializers like -initWithObjects: or -initWithSet:, there is no -initWithEnumerator: to be found. The concept of chaining lazy data types together doesn’t really exist in the API, which is unfortunate.

Why am I just finding out about Hoogle now? It’s a Google interface to Haskell’s APIs. It’s brilliant. So many times I’ve thought “okay, I have an [a] and an a -> IO b, now what function makes an IO [b]?” (mapM, as it turns out). Especially with all the list functions, all the scans and iterates and interleaves and folds and maps tend to run together a bit. It’s especially weird since 99% of these functions can be reimplemented without much thought in 2 lines of code, but it’s some stubborn form of Haskell pride to never reimplement functionality.

Seriously, though, I think a lot of languages and libraries could benefit from a Google interface. None of this “man” or “javadocs” nonsense.
