Sunday, February 20, 2011

"Cancer" is too Generic

An article was recently published in Nature that looked at genome-wide differences between seven different prostate cancer tumors (link).  The results are pretty amazing.  The authors took advantage of massively parallel sequencing, which makes it possible to analyze entire human genomes in a fraction of the time that previous methods required.  By looking at the whole genome at once, instead of individual genes as with the traditional approach, it becomes possible to see the proverbial forest for the trees.

By looking at the whole genome, the authors were able to see all the different kinds of mutations that were occurring in prostate tumor cells at once.  This revealed the general types of mutations that were occurring.  For one, it was found that a certain type of mutation, one that shifts a DNA strand by around two base pairs, was fairly common in the tumor cells.  Such mutations are also seen in breast cancer tumor cells, though they are much less common there.  In fact, the most common type of mutation seen in breast cancer tumor cells is not as common in the prostate cancer tumor cells.  Considering that these are both cancerous tumors, this is somewhat surprising.  They look and act in similar ways, yet internally they are quite different.

They also found that chromosomes were frequently rearranged in a variety of ways in the prostate cancer tumor cells.  These rearrangements, although common, seemed to be a side effect of the cancer rather than its cause.  I say this because there was little pattern among the tumors; of the seven, there were seven very different patterns of chromosomal rearrangement.  If prostate cancer is being caused by some DNA repair mechanism being turned off, then these results are to be expected.  Such rearrangements may naturally occur by chance, but a cell with a functioning repair system would either be able to fix them, or destroy itself via apoptosis (sacrificing itself for the benefit of the person as a whole).  Without such functioning repair mechanisms, such gross mutations could accumulate largely unchecked.

While the rearrangements were seemingly random, certain genes were found to commonly be mutated.  Some of these genes are very crucial to the proper functioning of the cell, namely the genes for histones.  Histones wrap around DNA, and can prevent the genes in wrapped DNA from being expressed.  Considering that all your cells are genetically identical, expressing every gene at once in a single cell is almost certainly detrimental.  A liver cell is a liver cell because it expresses different genes than a lung cell.  With the mutation of histones, however, gene expression can radically change.  Cancer is a logical result of such a dramatic, mostly random change in gene expression.  That said, I still think the histone mutations are more of a side effect, or perhaps a co-effect.  They are at a deeper level than the chromosomal rearrangements, but I'm not sure they are the root cause.

The greatest take-home point of the paper is that prostate cancer is genomically quite distinct.  Given that the technology behind this study only recently became available, there is a lack of data on the genome-wide uniqueness of other cancer types.  Personally, I would expect such studies on other tumor cell types to reveal the same sort of distinctiveness.  Even though cancer generally presents in the same sort of ways across all tumor types, this type of study shows that the similarities may be more superficial than we think.  At a deeper level, there are great differences between the different kinds of cancers.  A cure for breast cancer or prostate cancer may be possible, but a general, all-purpose cure for cancer may be far-fetched.  Although this isn't the positive message one may have hoped for, it does open new lines of inquiry.  Besides this, it is helpful to know that these things are more different than they appear, lest we spend resources unnecessarily on a general cure that can't possibly exist.

Ethical Issues of Direct-to-Consumer Genetic Tests

I recently read a study that analyzed the effects of direct-to-consumer genomewide profiling on customers of such services (link).  Before I get into the study itself, some background is necessary.

Arguably all diseases and disorders have a genetic component to them.  For certain ones, such as cystic fibrosis, there is a clear genetic link.  For other disorders, such as Parkinson's disease, the link is not so clear.  These two disorders are at opposite ends of the spectrum. 

Now consider something like high cholesterol.  High cholesterol is a major health concern in the US.  Diet and exercise influence cholesterol levels greatly, but there is an established genetic component as well.  For some people, cholesterol levels are naturally elevated simply due to their genetic makeup.  Such people may not know this, or may not know what to do about it.

This is where a genetic test can be helpful.  Such tests can reveal underlying health information that would otherwise be unavailable.  In theory, one can gain a better understanding of certain health risks through these tests.  In the case of high cholesterol, if one sees that he is genetically at a higher risk than the general population, he could take action as a preventative measure.

Such is the theory, anyway.  The problem is that this isn't what people do.  The study found that few people made any life changes whatsoever after gaining insight into their genetic makeup.  They also found that in the majority of cases, there was no substantial amount of additional anxiety due to the tests, although many people did share their test results with a doctor.  Part of this might be explained by the fact that the test group wasn't representative of the general population, but of people with more than a passing familiarity with such tests.

It cannot be emphasized enough that this was not the general population.  I suspect that the general population would show a much greater degree of anxiety, with people taking improper actions in response to their test results.  Why do I think this?  Because the general public doesn't understand what the results mean.  Even for something relatively simple like cystic fibrosis, where one either has or does not have the disorder (with little in between), there can be confusion.  Someone may find that he is a carrier of the allele that causes cystic fibrosis.  This means there is an increased risk of his offspring having the disorder (assuming his mate's cystic fibrosis gene status is unknown).  It does not, however, mean he has the disease, nor does it mean he will ever get the disease.  That said, the expected reaction of the general public to seeing a positive result for anything involving cystic fibrosis would be horror.  This is not the proper response, but without anything else to go on, panic tends to be common.

I'm not trying to say that people are stupid, or anything like that.  The study itself points out that roughly 90% of doctors don't feel they have the necessary background to understand and interpret the results of such genetic tests.  This is not some systematic failure of education, but simply due to the fact that most doctors never need to know such information.  It's just not part of the job description.

Everything previously mentioned assumes that the test makers are doing everything ethically and right.  Of course, reality is something far different.  Test makers know that there isn't a good general understanding of genetics, and they use this to their advantage to market products.  For example, many of the genomewide tests test for very hazy things, such as diabetes or Parkinson's.  We still don't know exactly what causes these diseases, and at best we have identified a fairly wide assortment of genes that seem to be linked to them.  As to how they are linked, we usually don't know.  If we can't even decide whether a gene is involved with a disease, how is knowing which allele we have for that gene helpful in determining our risk for the disease?  Yes, there is a test, and it is looking for a specific allele, but we don't know what it means to have that allele.  In this case, we get a result, but no one really has any clue what it means.  To someone with no understanding, seeing "Diabetes....positive" leads to immediate panic and perhaps irrational decisions.

Many of these tests are in ethically murky territory.  The companies making the tests are in it to make a profit, not to actually help anyone, though this isn't clear without a deep analysis of what they are testing.  There are few lies, but lots of deception (personally I define lying as saying something one knows to be false, and deception as neglecting to say something one knows is true).  Perhaps to save face, many companies offer genetic counseling after one receives test results.  Now, the test may be a few hundred dollars, but usually the counseling is several hundred, if not over a thousand dollars.  Even in the study, where genetic counseling was offered for free, most people didn't take advantage of these services.  Personally, this baffles me, and the only thing that comes to mind is that the majority of the people in the (biased) sample set already had a thorough understanding of what the test results meant.

I think that there is good that can come of such testing, but it has to be well regulated.  The current system takes advantage of naive people for the sole purpose of profit, and the companies are very good at it.  Expecting the average person to have a good enough understanding of genetics to interpret such tests is absurd; others must get involved to provide genetic counseling.  With regulation, there is hope for it, but in its current state it needs to be stopped completely.

(Random note: such testing is illegal in New York state, as one cannot see the results of a genetic test without being in the presence of a doctor, so that the doctor may explain the results.)

Thursday, February 10, 2011

The JVM and Type Erasure

Type erasure is the term for what is, in my opinion, one of the dirtiest hacks in existence.  Before getting into that, let's start with a story.

Originally, Java did not have support for generics.  That meant that code like:
List< Integer > list = new ArrayList< Integer >();
list.add( new Integer( 4 ) );
System.out.println( list.get( 0 ).intValue() + 5 );

...would instead be...
List list = new ArrayList();
list.add( new Integer( 4 ) );
System.out.println( ((Integer)list.get( 0 )).intValue() + 5 );

Ew.  Ew.  Granted, the type definition is shorter, but that's only because it has less information to store to begin with.  Then we have to do this ugly cast.  Not only that, we're free to do such terrible things as:
list.add( new Integer( 4 ) );
list.add( list );
list.add( new Random() ); 

The compiler puts absolutely no restriction on what goes in, as long as it's an object.  This can obviously lead to bugs, as someone can end up with a heterogeneous collection of objects if one isn't careful.  Checks that could have happened at compile time are instead deferred until the class cast at run time.  Just ew overall.

In Java J2SE 5.0, this all changed.  Support for generics was included, and the people cheered.  But, at what cost?

The JVM can't do this sort of magic.  (I've heard that .NET can, hence the title "The JVM and Type Erasure".  I've never dabbled with any of the .NET languages, though I'd consider this a plus for them.)  Java runs on the JVM, so Java can't do it either.

But it does do it, right?  I mean, it has to.  We have type-safe collections and generic types!

No, it's lies, all lies!  Yes, there is a check performed, and this check asserts type safety.  However, the semantics are identical to the old, pre-generics version.  In other words, your wonderful, generic-using, type-safe code below:
List< Integer > list = new ArrayList< Integer >();
list.add( new Integer( 1 ) );
list.add( new Integer( 2 ) );
list.add( new Integer( list.get( 0 ).intValue() + list.get( 1 ).intValue() ) );

...is actually converted into this at compile time:
List list = new ArrayList();
list.add( new Integer( 1 ) );
list.add( new Integer( 2 ) );
list.add( new Integer( ((Integer)list.get( 0 )).intValue() + ((Integer)list.get( 1 )).intValue() ) );


It's a trick!  A dirty trick!  Yes, it's type safe, but that safety is checked by the compiler at a certain stage of compilation.  After this stage the type parameters are gone, so technically the code isn't type safe anymore; it's just that the earlier stage made sure that nothing bad can happen from these class casts.
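
If you want to see the erasure with your own eyes, a tiny experiment along these lines should do it (the class name is just something I made up for the demo):
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {
  public static void main( String[] args ) {
    List< Integer > ints = new ArrayList< Integer >();
    List< String > strings = new ArrayList< String >();

    // both print "class java.util.ArrayList": the type parameters are
    // gone by run time, so the JVM sees the exact same class for both
    System.out.println( ints.getClass() );
    System.out.println( strings.getClass() );
    System.out.println( ints.getClass() == strings.getClass() ); // true
  }
}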

Maybe I'm being too hard on it.  It is type safe, and it accomplishes roughly the same thing as a C++ template.  Who cares that it's implemented in such a way, besides maybe someone obsessed with performance?

Well, the evil does descend into other places.  For example, even though one has to pass concrete types to instantiate a generic type, this type information isn't actually kept around.  Consider the type variable "T" in a generic class.  Even though "T" must be instantiated to some actual type to form instances of the class, the person writing the class has no idea what "T" is, and after erasure neither does the JVM.  For example, consider the following overloaded method definitions:
public int myMethod( List< Integer > list ) {...}
public int myMethod( List< String > list ) {...}


If you try to compile this, javac greets you with the following error message:
name clash: myMethod(java.util.List<java.lang.String>) and myMethod(java.util.List<java.lang.Integer>) have the same erasure

This is because after type erasure, these two methods actually both look like:
public int myMethod( List list ) {...}

...because erasure simply replaces the type parameter with its upper bound (plain old Object when no bound is given), leaving just the raw List in the signature.

A related issue is that Java doesn't let you make a generic array.  For instance, if you try:
public class Test< T > {
  public T[] array = new T[ 5 ];

  ...
}

...javac greets you with:
generic array creation

So what if you want to create a type safe, generic array?

You don't.  Yay Java.

Well, ok, that's not *quite* true, but it's close.  First, you're strongly encouraged to use something that's already type safe, like an ArrayList, which can do pretty much everything an array can do.  Then if you're still complaining, people tell you to use newInstance, from the reflection class java.lang.reflect.Array.  newInstance will return a type safe array, but you need to pass it a Class< T > object.  How do you get this?  You explicitly specify it.  Seriously?  Seriously?  This is not a solution, this is just asking for more errors.  Effectively, you end up specifying the type twice: once as < ClassName > for the generic type parameter, and a second time as ClassName.class for the corresponding class object.  There is nothing stopping you from specifying < String > and Integer.class.  So you end up with a type safe array, but only by way of a very type unsafe operation.  So much ick.
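
To make the double specification concrete, here is roughly what the newInstance dance looks like (a sketch; the TypedHolder class name is mine, and the unchecked cast is unavoidable):
import java.lang.reflect.Array;

public class TypedHolder< T > {
  private T[] array;

  @SuppressWarnings( "unchecked" )
  public TypedHolder( Class< T > clazz, int size ) {
    // Array.newInstance returns Object, so we still need an unchecked cast,
    // and nothing forces clazz to actually correspond to T
    array = (T[])Array.newInstance( clazz, size );
  }
}

Creating one looks like new TypedHolder< String >( String.class, 10 ): the element type written out twice, exactly as described above.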

Don't get me wrong.  Generics are better than no generics.  I'm just saying that there are a tremendous number of associated gotchas.

Wednesday, February 9, 2011

HPV Virulence

I recently read an article describing a virulence factor in Human Papillomavirus (HPV) (link).  Virulence factors affect how virulent (roughly, "bad") a given pathogen is.  In the case of the article, they discussed what is known as the PDZ binding motif.  Certain strains of HPV lead to the production of a protein, E6, that carries the PDZ binding motif.  Certain naturally-occurring proteins in human cells contain PDZ domains, so the viral proteins with the PDZ binding motif are able to bind to these normal proteins.

Past studies have shown that viruses with the PDZ binding motif are more virulent than those without it.  This paper investigates why.  Previous studies showed a variety of things that E6 bound to, but it wasn't known exactly which of these was the worst.

The way the authors investigated this was as follows.  They started with cell lines that were already infected with PDZ-binding-motif-carrying HPV, named HeLa and CaSki.  They then added siRNA that was complementary to the mRNA for E6.  When the two interact, they bind to each other, and are eventually degraded by the cell.  In other words, the authors were able to selectively turn off E6 production in the cells.  They then analyzed the expression of proteins over 72 hours, to see if any had any drastic increases.  In the presence of E6, proteins that bind to E6 would fail to function, and eventually be degraded by the cell.  However, without E6, such proteins could flourish without degradation.  As such, it is expected that without E6, the expression of naturally occurring proteins that bind to E6 would increase dramatically.

The results showed increased expression of MAGI-1, another protein that had been identified in other studies as binding to E6.  This result is significant.  MAGI-1 is involved in the formation of tight junctions.  Such junctions help to hold cells tightly together.  In normal, healthy tissue, such junctions are crucial to keeping everything together and ordered.  However, this is not the case for cancer cells.  Cancerous cells have to break free of surrounding cells in order to spread as aggressively as possible.  This way, they can grow unrestricted and metastasize.  This explains why E6 is a virulence factor in HPV.  The paper shows a causal relationship between the presence of E6 and the degradation of tight junctions.  E6 binds to components of such junctions, preventing them from functioning properly and leading to their degradation.  Without the tight junctions interfering, cancer cells can flourish.  It's thus in the best interest of a cancer cell to destroy such junctions, so it's in the best interest of these HPV-driven cancer cells to keep expressing E6.

This opens doors for research into HPV-caused cervical cancer.  If one could develop a drug to specifically target and somehow destroy or disable E6, then the resulting HPV infection would be far less virulent.  It wouldn't be harmless by any means, but the prognosis would be far better.  I don't think this is infeasible, but it would require an extensive amount of additional research.  This may very well turn out to be a seminal paper on the road to a cure for HPV-caused cervical cancer.

Fruit Flies Do Distributed Computing

First the ants jump on the bandwagon, and now the fruit flies.  This is from a fairly recent article published in Science (link). 

A common problem in distributed computing is to determine a local set of master/leader nodes in a group of nodes.  The resulting set is called a maximal independent set, or MIS.  The goal is to have all non-master nodes connected to a master, and to not have masters connected to each other.  Although it sounds simple from a top-down view, one is not afforded that luxury; after all, this is distributed computing.  Looking at it from the bottom up, it's a much more difficult problem.  A given node can only see a very limited amount around it.  From this standpoint, the problem is so difficult that constructing an MIS deterministically is impossible (assuming all nodes are equal).

For a moment, let's leave the realm of computer science and enter the realm of biology.  In biological systems, there is often a need to be very precise, all without any central guiding intelligence.  This is especially true of embryonic development.  Cells must differentiate from a somewhat generic mass to more specialized, highly ordered structures.  The paper notes the notum of a fly, where there are many neat rows of small bristles, with a few scattered (yet evenly spaced) rows of long bristles.  The structure shows a high degree of organization, which all came about via simple cell to cell communication.  There was no signal for "put this row here", but instead a passing of short chemical messages to one another.  It's truly amazing that such order can come from relative chaos with so little intervention.

The researchers first looked at exactly how this chemical communication occurred.  For one, the bristle cells were found to have properties like masters: they acted in a different manner than their surrounding cells.  Upon further examination, the researchers found that these cells expressed large amounts of a cell-membrane protein called "Delta".  Surrounding cells had a membrane protein called "Notch", which could recognize Delta.  It was found that Delta would inhibit surrounding cells with the Notch receptor from becoming bristle cells.  Essentially, the message is "I'm already forming a bristle, so don't form one yourself," although the actual message is much shorter than that (connected/not connected).

This phenomenon is very similar to the MIS problem.  The only difference is that the biological system is able to do it perfectly almost every time, and that such systems have been doing this for a very long time.  As such, the researchers applied the method used by the biological system to form an actual MIS algorithm.  The algorithm works in multiple passes.  Without getting too involved in the details, there are two exchanges that occur during each pass.  In the first exchange, with a certain probability (empirically determined from the fruit flies), a node passes a message to all of its neighbors.  If it doesn't receive any message back, then the node becomes a master and drops out of the algorithm.  If it does receive a message back, it stays undecided and the passes repeat, up to a set number of them.
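
To make that a little more concrete, here's a rough, single-machine sketch of that local rule in Java.  This is my own toy version, not the authors' code: the neighbor lists, the fixed probability p, and the pass count are all made-up parameters, and I'm filling in one detail glossed over above, namely that once a node becomes a master, its undecided neighbors attach to it and stop competing.
import java.util.*;

public class FlyMis {
  // status per node: 0 = undecided, 1 = master, 2 = attached to a master
  public static int[] run( List< List< Integer > > neighbors, double p,
                           int passes, Random rng ) {
    int n = neighbors.size();
    int[] status = new int[ n ];

    for( int pass = 0; pass < passes; pass++ ) {
      // first exchange: each undecided node speaks up with probability p
      boolean[] spoke = new boolean[ n ];
      for( int v = 0; v < n; v++ ) {
        spoke[ v ] = ( status[ v ] == 0 ) && ( rng.nextDouble() < p );
      }
      // second exchange: a node that spoke and heard nothing back becomes
      // a master; its undecided neighbors attach to it and drop out
      for( int v = 0; v < n; v++ ) {
        if( !spoke[ v ] ) continue;
        boolean heardBack = false;
        for( int u : neighbors.get( v ) ) {
          if( spoke[ u ] ) { heardBack = true; break; }
        }
        if( !heardBack ) {
          status[ v ] = 1;
          for( int u : neighbors.get( v ) ) {
            if( status[ u ] == 0 ) status[ u ] = 2;
          }
        }
      }
    }
    return status; // a node left at 0 is the rare "not connected to anything" case
  }
}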

They found that the algorithm was very effective, and provably generates a proper MIS.  With high probability, all nodes fall into place, although, on occasion, a node failed to be connected to anything.  The authors acknowledge this, and claim that such events are rare and can easily be corrected by means outside of the algorithm.  As for the performance of the algorithm, it runs in O(log(n)^2) rounds, where n is the number of nodes in the network, and the messages passed need only be one bit in size.  Pretty impressive, especially when one considers the difficulty of the problem.  I guess fruit flies are smarter than we think.

Tuesday, February 8, 2011

Ants Can Optimally Solve the Towers of Hanoi

...which is good for the ants, because I'm terrible at it.  This paper was recently published in The Journal of Experimental Biology (link).  Of course, the ants weren't actually moving the disks around a bunch of pegs.  Instead, they created a maze based on the various paths one can take in the game.  Junctions of the maze represent game states.  Edges to other junctions represent the possible legal actions one can perform from the given state. 

There is a single start location and a single end location, at opposite ends of the maze.  Both correspond to starting states of the Towers of Hanoi puzzle.  Note that the maze they used actually just glues two expanded game mazes together, hence why the start and end locations are technically the same state.  Thematically, the ants must first play a classical game.  Once they arrive at the solution (at the middle of the maze), they must then play a separate game from its solution back to its beginning (corresponding to the other half of the maze; they never have to go backwards in the maze).

The simple, one sentence version is that they created a large maze in which the optimal paths were already known.  During experiments, they would place food at one end of the maze, and ants at the opposite end of the maze.  They then tracked the movement of ants through the maze over time.

Many paths to the food were initially found.  However, over time, the ants would prune the paths they took.  In the overwhelming majority of cases, they found that the ants would end up using only a single, optimal path.  Additionally, if this path was then blocked mid-experiment, the ants were able to find the new optimal path the overwhelming majority of the time.  So the ants could find the best path both statically and dynamically.  As an aside, larger colonies were more capable of finding the optimal path than small ones, as were ants that had been allowed to explore the maze without the presence of food.

There are a number of practical uses for this style of behavior.  These ants were able to solve the shortest path problem using only pheromones deposited by other ants.  There was no central guiding intelligence, only "swarm" intelligence (as pointed out by the authors).  This translates well to communication between computers on the Internet.  Routing algorithms must find the shortest path between systems without a central intelligence.  Ideally, only a minimal amount of information is communicated to other systems in the process.  The ants do this, only millions of years before the computer.

The paper points out Ant Colony Optimization (ACO) as an algorithm that exploits this behavior.  However, this algorithm is designed to work on static instances of problems, not on dynamically changing problems as in the shortest path experiment set up in this paper.  Essentially, algorithms that already model themselves on ant behavior are not exploiting everything they could exploit. 

With the findings of this paper, there seems to be enough information about actual ant behavior to design a new, dynamic algorithm, perhaps for routing.  Who knows?  Perhaps the minimum spanning trees (MSTs) of today will give way to the digital ants of tomorrow.  As it is, the current "dynamic" MST algorithms used in routing typically just re-solve static instances of the MST problem over and over.  They re-solve fast enough that the result appears mostly dynamic, although it isn't truly dynamic.  The ants have the advantage of being truly dynamic.  If not routing, then there must be some practical application that can use this behavior.  It just seems to have too many interesting properties to not somehow be useful to someone.

Cancer Dodging the Immune System

I read a paper describing the interplay between two proteins, CD47 and cell-surface calreticulin (CRT), on the surface of cancer cells (link).  The paper makes a number of fascinating points.

Obviously, it is in the body's overall best interest to destroy any cancer cells.  Such cells can be seen as foreign invaders attacking the body.  Typically, when the body is under such microbial attack, there is a response from the immune system.  The immune system is able to detect that there is an attack, and to specifically identify what the attackers are.  Once identified, they are tagged with antibodies, and can then be engulfed by white blood cells, eventually to be destroyed.

Such is the case for a normal attack.  However, this is often not seen in cancer.  Frequently, the body does not recognize that cells are cancerous.  Considering that such cells come from the same body, and may have previously been functioning properly, this isn't completely surprising.  However, cancer cells just plain don't behave like normal cells anymore.  They forcefully break away from other cells, they absorb vast quantities of nutrients, and they proliferate without control.  Even so, the immune system fails to detect them.  Of course, if the immune system doesn't see them, then as far as it knows there is no problem. 

From the perspective of a bunch of cancer cells, it is in their best interest to avoid the immune system.  They no longer work for the body, but they happily take up shelter and food within it.  If their numbers are not being reduced by the immune system, then they can grow that much more quickly.  In other words, it is no random fluke that the immune system is blind to them: mutations that help them hide from the immune system are both favorable and likely to be selected for.

The authors of the paper investigated this phenomenon in detail.  It has been previously known that cancer cells often have large amounts of CRT on them.  CRT is a signal to the immune system.  Phagocytes, which engulf and destroy invaders, are attracted to CRT.  So something carrying large amounts of CRT is essentially screaming to be eaten by the immune system.  However, this seems contradictory, given that cancer typically avoids the immune system.

The reason for this was large volumes of CD47, expressed in tandem with CRT.  CD47 is sort of the counter-CRT.  Phagocytes avoid things covered with CD47, making it essentially a "don't eat me" signal.  The authors performed a significant amount of experimentation proving that it was, in fact, the CD47 that was preventing the cancer cells from being engulfed.

There are a number of follow-up questions to this.  First and foremost, why don't cancer cells simply not produce CRT?  The default of a healthy immune system is not to eat something, so without the CRT signal it probably wouldn't attack.  Instead, cancer cells go through the trouble of producing both CRT and CD47.  Personally, I think this suggests that CRT evolved much earlier than CD47, and it has been tied into a variety of core metabolic pathways.  The paper mentions that stressed or damaged cells generally give off CRT, and the bizarre behavior of cancer cells constantly puts them under high stress.  CRT is probably produced by some necessary pathway under stressful conditions.  The cancer cell needs the pathway, and it can't avoid the stress, so something else must be done.  It can't intelligently say "turn off CRT production", so the quick mutational fix is to overproduce CD47.

There is a second important question.  If cancer cells have CD47, then is a possible course of treatment to selectively destroy or disable CD47 and allow the immune system to do its job?  The paper mentions this possibility, pointing out that most (but not all) tissue types do not typically have CD47.  Such tissues should be theoretically unaffected by a treatment that affects only CD47, though one can never know for sure without actual clinical trials.  Personally, I think there is some promise in this, though there still needs to be a significant amount of research performed before an anti-CD47 cancer treatment is developed.

Factorial Function

I was going to include this in my post on functional looping alternatives, but I think this deserves its own post.  The factorial function, i.e. n!, is frequently used as an example function for a variety of things.  There are a number of wrong (er, poor) ways of implementing it, but only a few good ways. 

First, let's start at what is probably the most wrong way.  Ironically, this is often used to demonstrate recursion to intro-level CS students.  Might have something to do with why intro-level CS students often hate recursion.  The recursive definition is like so:
fac( 0 ) = 1
fac( 1 ) = 1
fac( n ) = n * fac( n - 1 )

Nice and simple.  In pure math land, beautiful.  In code land, not so much (using LISP):
(defun factorial (n)
  (if (or (= n 0) (= n 1))
      1
      (* n (factorial (- n 1)))))


Oh sure, it's short.  But it's brutally inefficient.  We don't actually start any multiplications until we have recursed all the way down from n to 1, keeping every intermediate call open along the way.  Worse yet, all that space is on the stack.  Ick.

So let's improve things a bit.  We can make this tail-recursive with an accumulator, or we can reach into the imperative toolbox.  I've actually shown the former elsewhere in my blog, so let's do the latter.  But instead of just generating the number and being done with it, let's go for something better.

Consider an actual program that frequently calls factorial for a variety of inputs.  Oftentimes we see the same inputs, yet we must regenerate the factorial (an O(n) operation) for every call.  Why not save the result after computing it?  That way, every call after the first will be O(1).  (Thanks to amortization, n calls will be O(n) instead of O(n^2).  Cool beans.)  This general trick is known as memoization (not memorization).  I've used it a number of times on different problems, and it can seriously improve performance if you do everything right.

Without further ado, here is some Java code that utilizes the technique:
import java.util.*;

public class Factorial {

  private List< Long > memo;
  public Factorial() {
    memo = new ArrayList< Long >();
    memo.add( 1L ); // 0! = 1
    memo.add( 1L ); // 1! = 1
  }
  public long factorial( int n ) {
    int current = memo.size();
    // fill in any factorials we haven't computed yet, reusing earlier results
    while( current <= n ) {
        memo.add( current * memo.get( current - 1 ).longValue() );
        current++;
    }
    return memo.get( n ).longValue();
  }

}

The while loop within factorial only does anything when we haven't already generated a value for the given value of n.  In this way, we take advantage of previously computed values.  Note that in the general case of memoization, one uses a map instead of a list.  In such a case, inputs are the keys and their results are the values.  Here, since the inputs are integers, and since computing input n requires all lesser inputs anyway, we were able to get away with a list.
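
For the general map-based pattern, a sketch might look something like this (my own illustration, using the naive recursion just to keep it short):
import java.util.*;

public class MemoFactorial {
  private Map< Integer, Long > memo = new HashMap< Integer, Long >();

  public long factorial( int n ) {
    if( n <= 1 ) return 1L;
    Long cached = memo.get( n );           // the input is the key...
    if( cached != null ) return cached.longValue();
    long result = n * factorial( n - 1 );
    memo.put( n, result );                 // ...and the result is the value
    return result;
  }
}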


So how does all of this relate to functional looping alternatives?  Well, for certain mathematical functions, they shine like none other in terms of conciseness.  Factorial is one of these.  The following gets the factorial of n, in Scala:
( 1 to n ).foldRight( 1 )( _ * _ )

...that's it.  It's shorter than the mathematical definition, for crying out loud.  This first generates all the numbers between 1 and n inclusive.  It then does a fold, going from the highest numbers at the right to the lowest numbers on the left, constantly multiplying them.  The given 1 acts as a default value, catching 0 (0! = 1).  In all other cases, it's a harmless no-op, since 1 * n = n.

This is just as bad as the original definition, if not more so.  Again, we explicitly generate all numbers between 1 and n.  Granted, now we generate them on the heap, but we still make them.  Furthermore, foldRight isn't tail-recursive.  It has to go all the way to the end of the list before it can do any processing, so it has the same problem as the original definition.  Only now its memory usage is O(n) for BOTH the heap and the stack!

Turns out we can do better.  Why start from the right when we can just reverse it and start from the left?  Reversing doesn't need the stack, and folding from the left is O(1) with respect to the stack (it's tail-recursive), so doing this is more efficient:
( 1 to n ).reverse.foldLeft( 1 )( _ * _ )

Not as pretty as before, but now it's O(1) on the stack and O(n) on the heap, which is still better than the original definition, which is O(n) on the stack and O(1) on the heap.

Note that if you actually have to use the factorial in your code, memoization is probably the way to go, or at least a method that is O(1) for all memory.  This was just so short and so cute that I couldn't pass it up.
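
For reference, the O(1)-memory version I'm alluding to is nothing fancier than the plain imperative loop (a quick sketch):
public static long factorial( int n ) {
  long result = 1L;
  // one accumulator, no stack frames, no saved list: O(1) memory
  for( int i = 2; i <= n; i++ ) {
    result *= i;
  }
  return result;
}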

Functional Thinking Part 1: Looping Alternatives

I intend this to be the first in a multi-part series explaining some of the advantages of functional programming.  For all these parts, I'm generally going to be using Scala, because of its similarity to Java (which should be fairly well-known).  That said, I'm not going to try to get too Scala-specific.  These techniques should transfer over to other functional languages easily, and may even transfer over to non-functional languages.

This part is going to be on various alternatives to the general imperative looping constructs (for( ;; ){}, while(){}, and do{} while();).

In imperative languages, the aforementioned basic looping constructs are your bread and butter.  They do just about anything and everything related to looping.  There are literally an infinite number of problems that require looping of the sort, yet they can all be broken down into one of several simple constructs.  Cool.  In a Zen sort of way.

However, this implies that there is a lot of overlap between the different problems.  Inevitably, this leads to repetitive code, which leads to bloat, lost time, and boredom.  We can do better.

"If you don't want looping constructs, then you must mean you want recursion.  That doesn't always make things prettier, you know!"  Hmm...not quite.  Recursion is kind of the functional equivalent of the for, while, etc. loops.  Both its strength and its weakness is that it is so general.  With recursion, it might be easier to infer code correctness, but it won't help on the repetition.

What kind of repetition?  Well, let's consider loops that iterate over the elements of lists/arrays.  This is a very common problem.  So common, in fact, that many languages have ways to shorten up the loops that must traverse them.  For instance, Java has the for-each loop, like so:

// assume "list" is a List of Integers
for( Integer current : list ) {}

This is much cleaner than a more general while loop that grabs elements from an Iterator, not to mention shorter.  But this isn't what I'm talking about.  I'm talking about what people then use the loop body for.  Consider the following:

// assume "list" is a List of Integers
for( Integer current : list ) {
  if ( current.intValue() > 3 ) {
    ...
  }
}

The nested "if" adds a bit of complexity.  We go over every element of the list, but we only want to do the "real" body for every element that is greater than 3.  This is known as filtering, where we filter out only those elements which have some property that is of interest (in this case, only those > 3).  This is a very common operation.

In a typical imperative language, this is often taken as-is.  Yeah, the if forces the code to become more indented, and it's a little long-winded, but there isn't a lot that can be done about it.  However, in functional languages, where there is typically a means of very concisely writing and passing functions as parameters, this is not so.  One can instead write code for a general filtering loop, which will keep all the items that match a given predicate function.

In Scala, this operation is known as, well, filter.  The above Java code fragment instead becomes:

list.filter( _.intValue() > 3 )

This is clearly much shorter.  This returns a new list containing only the elements greater than 3.

Generally, the type of the filter operation is "( (A -> Boolean ),  [A] ) -> [A]", where "A" is a type, and "[A]" refers to a list of A.  The general contract is that the size of the returned list is <= the size of the original list, and that the order has been preserved.
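
Just to underline what Scala is buying us here, writing that general filtering loop once in Java land might look something like this (a sketch; the Predicate interface is one we'd have to define ourselves, since this is pre-lambda Java):
import java.util.*;

interface Predicate< A > {
  boolean holds( A item );
}

class Loops {
  // the general filtering loop, written once and for all
  public static < A > List< A > filter( Predicate< A > p, List< A > list ) {
    List< A > result = new LinkedList< A >();
    for( A current : list ) {
      if( p.holds( current ) ) {
        result.add( current );
      }
    }
    return result;
  }
}

Even then, each call site still needs an anonymous class for the predicate, which is exactly the boilerplate the Scala one-liner avoids.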

Consider another general problem.  Given a list, one performs some processing over every element of the list.  This processing results in the generation of a new list of the same size, where the nth element of the new list was generated by processing the nth element in the original list.  

Explained with a bit more Java code:
// assume "list" is a List of Integers
List< Integer > newList = new LinkedList< Integer >();
for( Integer current : list ) {
  newList.add( new Integer( current.intValue() + 1 ) );
}

"newList" results from adding one to each element of the given list "list".  Granted, this is a simple, pretty useless example, but this is also a common real-world operation.  

In functional programming land, this type of operation is known as "map", as one establishes a mapping between input and output.  In Scala, the above Java code looks like:
list.map( _.intValue() + 1 )

This returns a new list in the same way as before, only in a much clearer way.  The type of the map operation is usually ((A -> B), [A]) -> [B], where one creates a list of the return type of the given function.  The general contract is that the resulting list is the exact same length as the input list, and that the nth element in the result list resulted from applying the given function to the nth element in the first list (i.e. ordering is preserved).  

There is a third type of general loop: a reduction.  This is when one walks over every element of a list, processing each element in a certain way.  Once all the looping is complete, there is a single result.  For instance, summing the elements in a list is a reduction.  Getting the maximal/minimal element in a list is also a reduction.

Reductions sound tricky.  They vary quite a bit from problem to problem.  However, many times these can be simplified, too.  Consider the following Java code, which sums all the elements of the given list:

 // assume "list" is a List of Integers
int retval = 0;
for ( Integer current : list ) {
 retval += current.intValue();
}

This code is surprisingly concise for what it does.  For one, 0 acts as an initial value.  It doesn't change anything to add zero, and it doubles as a default value for the case where there are no elements in the list.  It sounds like this would be difficult to generalize.

However, it doesn't actually add much complication: it just means the addition of a parameter.  Any general operation must take an initial value as a parameter.

In functional programming, this is called a fold.  One can fold both right and left, referring to the order in which we traverse elements.  (One can go left to right, a left fold, or right to left, a right fold.  Left folds start at the front of the list, and right folds start at the back of the list.)  The following Scala code is an equivalent version of the above Java snippet:

list.foldLeft( 0 )( _ + _ )

This simply takes 0 as an initial value, and instructs it to add each element together.  The item to the left of the "+" refers to the result as we are building it, and the item on the right is the current list element.  The type of the fold operation is generally ( (A,  A) -> A, A, [A] ) -> A, where one is given a function that reduces two elements into one element (of the same type), an initial value, and the list, returning a single element.  

Note that there is a very similar operation called reduce, which doesn't take an initial element.  When there is only one element in the list, it returns that element.  When there are none, this is considered an error, and Scala will throw an exception under such conditions.  This is good for when there is no good default value, and one expects there to be at least some data.

This has just scratched the surface of functional looping constructs.  Scala alone has several more, and I bet that there are a variety of other, less common ones that are not implemented in Scala.  Maybe I'll go into detail about those later.

As a brief aside, to any C# fans, some of this may sound quite similar to certain operations in LINQ.  You're absolutely right.  "filter" is equivalent to LINQ's "where", "map" is equivalent to "select", and "reduceLeft" is equivalent to "aggregate".  LINQ is actually based on functional programming ideas, which is why such operations are there.  Pretty cool, and pretty useful.