Thursday, December 30, 2010

Recursion

In my personal experience, recursion is a topic that takes many students aback. It certainly knocked me over at my first few goes at it. Why?

Of course, this is a monologue, so such questions are purely rhetorical. I'll answer it myself. I was taught that recursion is an evil thing, for all of these reasons:
1.) It's more complicated to understand than iteration.
2.) Anything recursive can be expressed iteratively with a stack, so there isn't any point.
3.) It uses more memory than iteration, to the point where you can eat the entire stack.
4.) It's slower than iteration.


Wow. What a useless feature. The original FORTRAN got it right when it didn't allow it. What? There are entire paradigms that discourage iteration in favor of recursion? They must be awful.

Ahhh! This was seriously the type of instruction I received from three different instructors. I've heard similar experiences from other CS students. It seems CS people tend not to like recursion.

But then, one fine day, a math professor of mine decided to explain recursion. In 20 minutes. I didn't think it was possible. Past instructors of mine typically spent a week on the subject, and even then no one understood it. I prepared for the worst.

The professor began by writing down the first few terms of the Fibonacci sequence. He talked about it a bit, then wrote down how to arrive at any term:
F(0) = 0
F(1) = 1
F(n) = F(n-1) + F(n-2)

That was it. There was no writing out a call stack to get the correct sequence to generate F(5), or explanation of how exactly the functions are going to return their values. It's math; what's a call stack?

With this example, I got recursion. There are one or more base cases, and a recursive case. As long as you can reach a base case from the recursive case, it will eventually terminate. The problem with all of my previous instruction is that it focused entirely on the nitty-gritty, all the tiny details needed to arrive at a solution. Of course recursion sounded terrible presented that way. When you get that low level, you completely take away the declarative nature of recursion, reducing it to something imperative. Worse yet, all my CS instructors would explicitly walk through every recursive call. This is tantamount to needing twice as much time to explain "for 0 to 10" as opposed to "for 0 to 5".
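
To show just how directly the math maps to code, here's a minimal Lisp sketch of that same definition (my own toy translation, nothing the professor wrote):

;; A direct translation of the mathematical definition of Fibonacci.
;; F(0) = 0 and F(1) = 1 are the base cases; everything else recurses.
(defun fib (n)
  (cond ((= n 0) 0)              ; base case
        ((= n 1) 1)              ; base case
        (t (+ (fib (- n 1))      ; recursive case: F(n-1) + F(n-2)
              (fib (- n 2))))))

;; (fib 5) => 5, matching the sequence 0 1 1 2 3 5 ...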

In this light, I feel that my first point is just plain untrue; the feeling is an artifact of how recursion is taught. Personally, I find that recursion tends to be easier to understand, and often requires less code. The base case often doubles as a catch for edge conditions, such as an empty list or null. Additionally, recursion tends to flow more naturally with my thought process. Take the car-cdr design pattern, wherein one processes the head of a list, then recursively processes the rest of the list. With recursion, I first write out my processing logic, then pass off the rest to a recursive call. With iteration, I first have to worry about setting up a loop to go over the list, then write the processing code. I have the short-term memory of a goldfish, and halfway through writing the loop syntax I often forget what processing I need to do.
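
As a toy illustration of the car-cdr pattern (a made-up example, not anything from a real project), here's a function that doubles every number in a list: catch the empty list, process the head, and recurse on the rest:

;; Hypothetical example of the car-cdr pattern: the empty list is both
;; the base case and the edge case.
(defun double-all (lst)
  (if (null lst)
      nil                               ; base case: empty list
      (cons (* 2 (car lst))             ; process the head
            (double-all (cdr lst)))))   ; recurse on the rest of the list

;; (double-all '(1 2 3)) => (2 4 6)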

Now for point #2. It's absolutely true. Recursion uses the call stack as an implicit stack, as opposed to an explicit one. However, keep in mind that if iteration is used to replace a naturally recursive algorithm, one must then manipulate that stack by hand. There must be extra code for a stack somewhere, and the processing logic is going to be cluttered up with push/pop calls. The recursive version will almost assuredly be less code, and require less mental overhead to understand. There is just plain something elegant about a recursive DFS on a tree.
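
For instance, here's a toy sketch of a recursive DFS over a tree, where each node is just a list of the form (value child1 child2 ...); note that there is no explicit stack anywhere in sight:

;; Toy recursive DFS. The call stack quietly does the bookkeeping that an
;; iterative version would need an explicit stack for.
(defun dfs-find (tree target)
  (cond ((null tree) nil)                    ; empty tree: not found
        ((equal (car tree) target) tree)     ; found the target at this node
        (t (some (lambda (child)             ; otherwise, search each child
                   (dfs-find child target))
                 (cdr tree)))))

;; (dfs-find '(1 (2 (4) (5)) (3 (6))) 5) => (5)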

Ahh yes, points #3 and #4. Performance is the mind-killer (no, I have not read Frank Herbert's Dune, but I hear it's excellent). This is a definite maybe. A maybe elucidated via a cute little exercise/tangent.

Consider the factorial function (!). As a reminder, 4! = 4*3*2*1 = 24. Here is an iterative implementation, where N is the number to take the factorial of:

int result = N;
for( int x = N - 1; x > 1; x-- ) {
    result *= x;
}
return result;

This seems to work, except for one little bug: 0! = 1, and this will return 0 on that input. Drat. You could probably restructure the code a bit to handle this case naturally, but the simplest thing to do is to check for it like so:
int result = ( N == 0 ) ? 1 : N;

Darn edge cases. But wait, recursion can actually use edge cases as part of the main logic. So here we have a recursive definition, in LISP:

(defun factorial (N)
  (if (= N 0) 1
      (* N (factorial (- N 1)))))

Cool. That's a bit shorter, and it actually uses that pesky edge case as a base case. This takes advantage of the fact that N! is really N*((N-1)!). Eventually, N-1 will be 0, which is the base case.

But the careful reader notices that this isn't the same as the iterative definition. The iterative definition uses a constant amount of memory, whereas this uses an amount of stack space linearly proportional to N. N must be decremented until 0 is reached, at which point all the multiplications occur as the individual calls return. Not only does this eat up stack, it's going to take time to unwind all those calls. Recursion sucks!
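
To see the problem concretely, here is roughly how (factorial 4) expands with the definition above; every multiplication is left hanging until the base case is hit:

;; Rough expansion of (factorial 4) with the naive definition. Every frame
;; is stuck waiting on the inner call, so the pending work piles up:
;;   (factorial 4)
;;   (* 4 (factorial 3))
;;   (* 4 (* 3 (factorial 2)))
;;   (* 4 (* 3 (* 2 (factorial 1))))
;;   (* 4 (* 3 (* 2 (* 1 (factorial 0)))))
;;   (* 4 (* 3 (* 2 (* 1 1))))   ; base case reached; now everything unwinds
;;   24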

But wait! What if this definition were tweaked a bit, like so:
(defun factorial (N)
  (factorial-rec N 1))

(defun factorial-rec (N accum)
  (if (= N 0) accum
      (factorial-rec (- N 1) (* N accum))))

This accomplishes the same thing, but there is a twist now. The decrementing and multiplication occur before the recursive call is made. No call needs to hold on to its local variables, because nothing is used after the recursive call returns. This is called tail recursion: the last thing a function does is make the recursive call. In the previous implementation, the multiplication was the last operation, so the recursive call was not in tail position. By utilizing an accumulator that builds up the answer as it goes, one can achieve this.
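
Compare the rough trace of the same computation with the accumulator version; nothing is ever left hanging behind a call:

;; Rough trace of (factorial 4) with the accumulator version. No work is
;; pending behind any call, so each one can simply replace the last:
;;   (factorial-rec 4 1)
;;   (factorial-rec 3 4)
;;   (factorial-rec 2 12)
;;   (factorial-rec 1 24)
;;   (factorial-rec 0 24)
;;   24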

If your language is smart, it will notice this, and perform a tail recursion optimization on it. (By smart, I mean read the manual. Most functional and logic languages support it, including most Common LISP implementations, Scala, and Prolog.) This means that internally the code will perform an imperative-style goto to the top of the function, as opposed to an actual call. This causes the function to use a constant amount of stack space, although it looks and feels like recursion. If this happens, then points #3 and #4 are moot.
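
To give a rough idea of what that means (this is just a sketch of the concept, not any particular compiler's output), the optimized version behaves like a plain loop that rebinds the arguments and jumps back to the top:

;; Sketch of what a tail-call-optimizing compiler effectively turns
;; factorial-rec into: the recursive "call" becomes a rebinding of the
;; arguments plus a jump to the top, so the stack never grows.
(defun factorial-looped (n)
  (let ((accum 1))
    (loop until (= n 0)
          do (setf accum (* n accum)
                   n     (- n 1)))
    accum))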

"But that used just as much code as the imperative definition!". Yeah, it did. Plus, the imperative definition is a tad easier to read. I said tends to be more elegant, not always.

I'm not saying that recursion should be used for everything; I'm saying that it tends to be underutilized due to improper instruction. In a language that doesn't support tail recursion optimizations, anything that can be made tail recursive should probably be expressed iteratively anyway. Points #3 and #4 are big, and if your language can't do it for you then usually you're out of luck. (One exception to this is, amazingly, FORTRAN; see http://www.esm.psu.edu/~ajm138/fortranexamples.html.) Understanding and using recursion is vital for anyone who does a significant amount of programming, and it's not really possible to get into functional or logic programming without it.

Wednesday, December 15, 2010

Lines of Code and Productivity

What sparked this post was a homework assignment of mine for a class on artificial intelligence.  We were instructed to write a program that could solve instances of Knuth's conjecture (more info at: http://www.perlmonks.org/?node_id=443037).  That is, given a certain positive integer, it would produce the series of mathematical operations necessary to transform 4 into that integer.  We were to use breadth-first search, and had our choice of Lisp, Scheme, or Python.

Initially, this sounded like a lot of work.  I have implemented solvers for similarly simple problems in both C++ and MIPS assembly.  In C++, given a good design, the problem takes several hundred lines.  I assume it would be shorter in a language with a garbage collector, perhaps Java, as a lot of the trickiness had to do with memory management.  With MIPS...well, ouch.  That was over 1,000 lines of C that I hand-translated into about 3,200 lines of assembly.  Plus, the MIPS version was actually only a very simple DFS as opposed to a BFS.  My hands were cramping before I even started to write anything.

I'm way more familiar with Lisp than Python, so I went ahead with that language.  To my surprise, I went from design to completely working code in about two hours.  Stripping out whitespace and comments, it was only 52 lines in the end.  Not only that, the solution it printed was actual Lisp code, which could then be executed to verify that it was indeed correct (it was).

"But it's slow!"  Hmm...not really.  For every number I gave it, it had a solution within a second.  Because of the factorials involved, numbers tended to quickly overflow, which my design would treat as a dead search path.  I'd have to do this in C++ or anything else.  Perhaps I should have used a special data type that can handle arbitrarily large numbers, but this was just a toy.  Plus, I could have improved its performance significantly if I had used destructive list operations as opposed to nondestructive ones.

I would really hate to implement this in a more traditional language.  Printing out actual executable code would require much more work than in Lisp.  (For comparison, in LISP, say there is an expression E that has generated the previous number, and say that square root is used to get the next number.  Adding square root to the previous expression is simply `(sqrt ,E), which returns a new expression with the square root tacked onto the old one.)  Let me translate: more code.  More code is more lines.  People often somehow translate this to "better" or "more productive."  It's not.  It's just foolish.
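
To make that concrete, here's a small sketch of the idea (a hypothetical helper, not my actual homework code), where each search step wraps the previous expression in one of the allowed operations:

;; Hypothetical helper: given an expression EXPR that evaluates to the
;; current number, build the expressions for the next search states by
;; wrapping it in each allowed operation.
(defun successor-expressions (expr)
  (list `(sqrt ,expr)           ; square root of the current value
        `(floor ,expr)          ; floor of the current value
        `(factorial ,expr)))    ; factorial (assumes a FACTORIAL function exists)

;; Starting from the literal 4:
;;   (successor-expressions 4) => ((SQRT 4) (FLOOR 4) (FACTORIAL 4))
;; Each result is itself Lisp code, so a finished solution can be EVAL'd
;; to verify it.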

I have two projects with my name tagged on them that have passed the 10,000-line mark (note that my code/comment ratio is around 50/50, so 5,000 is a better estimate).  One is in Perl, and the other in Scala (a hybrid functional/object-oriented language that runs alongside Java on the JVM).  Without going into detail about what they are, rest assured that the Scala project is far more complicated, and it does far more.  (A little more detail is that the Perl project is a Web interface over a database, and that the Scala project is an implementation of my own programming language.)  With the Perl project, a significant amount of my time consists of actual coding.  There is very little thought involved; much of it is self-explanatory.  The limiting factor of that project is how fast I type.

The Scala project is the complete opposite.  For that, I have dozens of pages of notes describing design and thought processes.  The limiting factor for that project is my imagination.  For example, I spent two days trying to come up with a routine that ended up being around 100 lines.  After I wrote the code for it, I stared at it for 10 or 20 minutes, flabbergasted at what had just happened.  I exploited multiple language features of Scala that are just plain not often seen, both in terms of traditional languages and traditional thought.  I took full advantage of recursion, anonymous functions, closures, and pattern matching in addition to the more traditional objects.  The code was clean, made sense, and it worked the first time.  The difficult part was not the code, but the ideas that went into it.  I was making the language do the work that normally my fingers would have.  100 lines in two days?  It can be a lot more than you think.

Don't fall into the trap of equating line counts with effort!  If I had a job where I was paid by the line, I would implement everything in assembly.  I'd be rich, without having much of anything to show for it.

Arsenic(?) Bacteria

Recently, a strain of bacteria was identified in Mono Lake, CA that could supposedly use arsenic instead of phosphorus.  The media immediately jumped on this, reporting all sorts of miraculous-sounding things.  I'll start off with a list of things that the media misreported or misinterpreted:
  1. The discovery wasn't an accident. It has been previously theorized that arsenic could be used. To test it, they found an environment with a lot of naturally occurring arsenic, scooped some stuff out, and looked to see if anything could grow on an arsenic diet.
  2. Arsenic and phosphorus behave very similarly chemically. Arsenic compounds just tend to be less stable than the phosphorus ones.
  3. It grows much faster with phosphorus than with arsenic. In other words, it prefers phosphorus.
  4. It does not appear to be able to differentiate between arsenic and phosphorus.
  5. It appears to create an internal microenvironment in which arsenic just happens to be more stable than it usually is. This microenvironment is characteristic of this bacterial species, including strains that are not known to use arsenic. In other words, the arsenic-usage ability is tagging along with something else it needs, so it's not all that different.
Assuming the bacteria really do use arsenic, I'm just not too impressed.  A quick glance at Wikipedia shows that Mono Lake formed about 760,000 years ago.  The lake has no outlet to the ocean, which has allowed dissolved materials to remain trapped in the lake.  This normally refers to salts and alkaline compounds, but the same goes for arsenic.  This bacterium did not suddenly grow accustomed to arsenic over the course of mere decades due to human pollution; rather, it was gradually exposed to slowly increasing levels of arsenic over a long period of time.  I'd be impressed if the adaptation had happened over such a short period, but given the amount of time involved it is not unreasonable.

Again, this is all assuming that the bacteria really do use arsenic.  Quite frankly, the evidence is not too convincing, which has been pointed out elsewhere.  I don't have a deep background in chemistry, but I am familiar with a few of the techniques that were used in determining whether the bacteria use arsenic.  One of these techniques is a phenol-chloroform extraction, which is commonly used to separate out different parts of cells based on polarity characteristics.  Cell components with certain characteristics congregate into one of three portions of the solution.  It looks much like a mixture of oil and vinegar that has been allowed to sit for some time; there is a distinct separation between the layers.

After performing this separation on cells, they determined how much arsenic was in each portion (well, sort of.  More on this below).  Table 2 of the paper confidently says "Chloroform (lipids)", and says how much arsenic is in this portion.  However, many more things than lipids can dissolve in the chloroform layer.  Additionally, some of the numbers have extremely high standard errors, such as 1.5 ± 0.8 for the aforementioned value.  The use of this simple technique along with such high error suggests to me that the paper was rushed out the door, and that further examination could tell a different story.

There is an additional problem with the phenol-chloroform extraction: there is nothing to compare these values to.  These values measure the percentage of arsenic found in the given portion, not the actual amount.  Being that my own body contains trace amounts of arsenic, my own cells could very well have the same percentages.  In fact, being that both my own cells and these bacterial cells cannot differentiate between arsenic and phosphorus, they likely do have similar values!  It would be nice to have these values compared to some sort of control to see if there really is anything special going on.  There are some bacteria which can live in arsenic-contaminated environments by actively pumping out arsenic or otherwise detoxifying it, and these would have made great candidates for controls.  After all, with such behavior one would expect different percentages for arsenic.

In conclusion, I'm just not impressed with either the bacteria or the paper's methods.  I can understand the need to publish a discovery like this ASAP, but I fear that someone jumped the gun before we could say anything for sure.  If these bacteria are not actually using arsenic but are merely very tolerant of it, then this could become a major embarrassment.  For such dramatic findings I would expect dramatically conclusive data, and I'm just not seeing it.

Tuesday, December 7, 2010

The Ping-Pong Ball Theory of Thought

Hello random person on the Internet!  Welcome to my first blog post.  I have no idea if this is going to do anything sensical, so I might as well make it nonsensical.  According to this, I am going to describe how my thought process works.  Or at least how I explain it to other people.

Picture a human skull.  Now picture millions upon millions of file cabinets lining the inside of that skull.  Each contains a related set of ideas.  Now this is the skull of someone alive, so there is a brain.  But it's not a brain, it's a ping-pong ball.  If you were able to make that leap, then the next one won't be too bad.  There is no gravity or atmosphere in this environment, and the walls magically allow any object that hits them to conserve momentum.  The point?  When this ball takes off, it's going forever.  Or I think it will in that environment.  I'm not really into physics.

Well, the ball takes off.  Each time it hits a file cabinet, a new thought comes to mind.  Er, ball.  Mind?  Sure.  The ballmind is able to keep a small, finite number of these on hand at once.  It's not the same number every time, because some of these cabinets contain more files than others, but even when all else is equal it isn't.  The point is that it's not the most predictable of things.

Now all these ideas then get mashed up.  Strawberry, television, blue&gold macaw.  That makes no sense.  And it doesn't.  It usually doesn't.  But every now and then something cool comes to mind.  I personally believe that the person who came up with peanut butter and jelly had the same kind of mind.  Seriously, who saw that coming?  Well, I did, but then again it existed a few decades before I was born so I was indoctrinated with the sandwich agenda.

Random.  Welcome to my mind.