Sunday, August 18, 2013

Parser Combinators in Common Lisp

Within the field of programming languages, parsing is often considered an afterthought.  Yes, we need to be able to parse a program in order to execute it, but for most people this is not the "interesting" part.  We often ignore this step entirely, treating parsing as a solved problem when it isn't.  Ignoring it does not make it any less difficult or important.  For many languages, parsing realistically requires knowledge of parser generators like ANTLR or Bison, which are used to automatically generate parsers from some input grammar.  While such generators are effective and can result in well-performing parsers, there can be a definite learning curve.  Additionally, depending on the parser generator chosen, an unwanted degree of separation can be introduced between the parser and the rest of the code.  The parser is largely a black box, which may or may not be desirable.

An alternative approach to parser generators is to use what are known as parser combinators.  With parser combinators, the parser can be represented directly in the metalanguage (i.e. the language your interpreter/compiler is written in), and it can be easily manipulated.  Moreover, these tend to be easier to use than parser generators; only a handful of APIs are needed for a working parser, as opposed to knowledge of a whole new tool.  The resulting code also tends to mirror the BNF grammar being parsed, allowing for easier maintainability.  The downside is that they are not the fastest things around by any means, but they can get a working parser off the ground quickly.  But instead of blabbing on about parser combinators, let's let some code do the talking.

Moving forward, I assume that the target language has already been tokenized.  Given that this process is typically a series of one-to-one translations between source strings and tokens, this should be a fair assumption.  If not, one could always encode the whole underlying language in the parser, complete with (usually) unimportant pieces like whitespace.

Parser combinators are based on a simple, usually true, assumption: any parser can be viewed as a function that takes a stream of tokens and returns some result.  More formally, this can be seen as:

parser: Stream[Token] -> ParserResult

This seems obvious enough, given that a parser by its very nature reads tokens and yields some result.  More interesting is exactly what ParserResult looks like.

For simplicity, let's say that we are only interested in recognizing whether or not a particular string is a member of the language specified by the CFG (we'll relax this restriction later).  With this restriction, we are interested only in whether or not a parser was able to parse in some tokens.  If a parser fails to parse, then the tokens were not recognized, and the parser thus returns some failure indicator.  For the remainder of this discussion, let's refer to this as ParserFailure, which is a subtype of ParserResult.

What of the case where the parser succeeds?  Intuitively, in this case we return the opposite of failure, which will be called ParserSuccess in the remainder of this discussion.  However, we must also return one extra piece of information.  In parsing some form, the parser consumed some number of tokens from the input stream.  When the parser returns, we must know where it left off in this stream, i.e. what was not consumed.  As such, ParserSuccess also returns a stream consisting of all the tokens following what was successfully parsed.

To better illustrate this, consider the code below that generates a parser that can recognize a given symbol lit.  In this code, if a string is returned, then it is assumed to be indicative of parser failure.  This allows for error messages to be automatically generated during parsing.  If a list is returned instead of a string, then it's assumed to be indicative of parser success, where the returned list consists of the tokens remaining in the stream (i.e. everything after the literal that was parsed).
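A minimal sketch of such a literal parser, following the conventions just described.  The name lit-parser, the error message wording, and the choice to hold tokens as symbols in a plain list are assumptions made for illustration:

```lisp
;; Hypothetical helper: builds a parser that recognizes the single token LIT.
;; Assumption: tokens are symbols (compared with EQL) held in a plain list.
(defun lit-parser (lit)
  (lambda (tokens)
    (cond
      ;; Nothing left to read: fail by returning a message string.
      ((null tokens)
       (format nil "Expected ~a but reached end of input" lit))
      ;; The next token matches: succeed by returning the remaining tokens.
      ((eql (first tokens) lit)
       (rest tokens))
      ;; The next token is something else: fail by returning a message string.
      (t
       (format nil "Expected ~a but found ~a" lit (first tokens))))))

;; Success leaves the unconsumed tokens; failure yields a string.
(funcall (lit-parser 'a) '(a b c))  ; => (B C)
(funcall (lit-parser 'a) '(b c))    ; => an error message string
```

Note that success with nothing left over returns the empty list, which is still not a string, so the two cases stay distinguishable.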

This parser is clearly very simple, and may not appear to be very useful at first glance.  After all, it is easy enough to define a parser like this using only regular expressions.  However, the difference is that parser combinators allow for, well, combination of simpler parsers to form more complex parsers.  For example, we may want two parsers to run in sequence (i.e. AND), or we may want to indicate that multiple potential forms are possible (i.e. OR).  It is possible to derive such AND and OR parsers using these same techniques, all in a manner which is independent of the underlying parsers being combined.

First, consider the AND case.  AND first runs the first parser it was given.  If this parser returns success, then AND runs the second parser on the remaining tokens in the stream, returning the result of the second parser.  If either parser fails, then AND fails.  This behavior can be implemented like so:
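One way this could look, as a sketch under the same string-means-failure convention (the name and-parser is illustrative):

```lisp
;; Hypothetical AND combinator: run FIRST-PARSER, then SECOND-PARSER on
;; whatever tokens FIRST-PARSER left behind.
;; Assumption: a string result means failure; any list means success.
(defun and-parser (first-parser second-parser)
  (lambda (tokens)
    (let ((result (funcall first-parser tokens)))
      (if (stringp result)
          result                              ; first parser failed
          (funcall second-parser result)))))  ; continue with the rest
```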

Oddly enough, the AND parser's definition is shorter than that of the basic literal parser.

Now let us consider the OR case.  The OR parser first tries to run the first parser.  If that succeeds, then it simply forwards the result.  If not, then it runs the second parser on the same input.  This is implemented below:
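A sketch of the OR behavior under the same conventions (the name or-parser is illustrative).  Note that the second parser is given the original tokens, not the result of the first:

```lisp
;; Hypothetical OR combinator: try FIRST-PARSER; if it fails, try
;; SECOND-PARSER on the same, untouched input.
;; Assumption: a string result means failure; any list means success.
(defun or-parser (first-parser second-parser)
  (lambda (tokens)
    (let ((result (funcall first-parser tokens)))
      (if (stringp result)
          (funcall second-parser tokens)  ; first failed, fall back
          result))))                      ; first succeeded, forward it
```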

Once again, the definition of OR is quite short.

At this point, we have enough infrastructure to start parsing in some real things.  After all, conjunction (AND) and disjunction (OR) are the bread-and-butter of BNF grammars.  However, this parser is still not very useful - it can only recognize languages, while ideally we would like to generate ASTs.  With a bit of modification, we can add this functionality in an extensible way.

Looking again at ParserSuccess, it seems clear that whatever result is generated should be carried along inside ParserSuccess while parsing proceeds.  For example, while generating an AST for an expression of the form -e, the fact that - was encountered should be retained while e is parsed.  We can do this by modifying ParserSuccess to return a pair holding our custom result along with the tokens that remain in the stream.  Whenever parsing succeeds, we also need to apply some user-defined function to return our processed result.  While we could modify all of the parsers defined so far to have this extra functionality, a more modular approach is to introduce a new kind of parser that handles this behavior.  This parser will first attempt to apply some other parser to a stream.  If the passed parser succeeds, then it will apply a user-defined function to the result of the other parser.  If the passed parser fails, then the failure is simply propagated.  This parser is detailed below.  Note that lists of two elements are used to represent a pair:
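A sketch of such a result parser.  The name result-parser is illustrative, and success is assumed to be a two-element list of the form (result remaining-tokens):

```lisp
;; Hypothetical result parser: wraps PARSER so that, on success, the
;; user-supplied function FN is applied to the parsed result.
;; Assumption: success is the pair (result remaining-tokens), held as a
;; two-element list; failure is still a string.
(defun result-parser (parser fn)
  (lambda (tokens)
    (let ((result (funcall parser tokens)))
      (if (stringp result)
          result                                   ; propagate failure
          (list (funcall fn (first result))        ; processed result
                (second result))))))               ; untouched remainder
```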

Of course, the AND parser still needs to be modified in order to use the new definition of ParserSuccess.  Not only does it need to grab the correct element of the tuple in ParserSuccess, it must also hold the results of the parsers it runs.  These are simply held in tuples, until the all-important result parser is triggered again.  Our new definition of the AND parser becomes the following:
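A sketch of the revised AND parser under the pair convention, with the two results collected into a two-element list (the name and-parser is again illustrative):

```lisp
;; Hypothetical revised AND combinator for the pair-based ParserSuccess.
;; Assumption: success is (result remaining-tokens); failure is a string.
(defun and-parser (first-parser second-parser)
  (lambda (tokens)
    (let ((r1 (funcall first-parser tokens)))
      (if (stringp r1)
          r1                                        ; first parser failed
          ;; Feed the tokens left by the first parser to the second.
          (let ((r2 (funcall second-parser (second r1))))
            (if (stringp r2)
                r2                                  ; second parser failed
                ;; Hold both results together until a result parser
                ;; processes them; pass along what the second left over.
                (list (list (first r1) (first r2))
                      (second r2))))))))
```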

With this new definition of the AND parser handy, we can run silly little examples like so:
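For instance, a self-contained sketch that puts the pieces together.  All names here, including the HELLO/WORLD token grammar and the :greeting tag, are made up for illustration:

```lisp
;; Literal parser, revised for the pair convention: on success it returns
;; (matched-token remaining-tokens); on failure it returns a string.
(defun lit-parser (lit)
  (lambda (tokens)
    (cond ((null tokens)
           (format nil "Expected ~a but reached end of input" lit))
          ((eql (first tokens) lit)
           (list lit (rest tokens)))
          (t
           (format nil "Expected ~a but found ~a" lit (first tokens))))))

;; Applies FN to the result of PARSER on success; propagates failure.
(defun result-parser (parser fn)
  (lambda (tokens)
    (let ((result (funcall parser tokens)))
      (if (stringp result)
          result
          (list (funcall fn (first result)) (second result))))))

;; Runs two parsers in sequence, pairing up their results.
(defun and-parser (first-parser second-parser)
  (lambda (tokens)
    (let ((r1 (funcall first-parser tokens)))
      (if (stringp r1)
          r1
          (let ((r2 (funcall second-parser (second r1))))
            (if (stringp r2)
                r2
                (list (list (first r1) (first r2)) (second r2))))))))

;; Recognize HELLO followed by WORLD and build a tiny tagged "AST".
(defparameter *greeting*
  (result-parser (and-parser (lit-parser 'hello) (lit-parser 'world))
                 (lambda (pair)
                   (list :greeting (first pair) (second pair)))))

(funcall *greeting* '(hello world again))
;; => ((:GREETING HELLO WORLD) (AGAIN))
(funcall *greeting* '(hello there))
;; => an error message string
```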

A complete definition of the parser combinators shown here, along with an example parsing in a simple grammar for sentences, is on GitHub.

Thursday, September 1, 2011

On Dynamically Typed Languages

In programming languages, types are a very fundamental thing.  All expressions have a type, or at least can be represented as types.  For example, in the expression "1 + 2", where 1 and 2 are integers, the type of the expression is most likely that of an integer.  I say "most likely", because usually languages are defined so that "+" with two integers results in an integer.  (In more formal terms, the type signature of "+" is integer -> integer -> integer, meaning it takes two integers as parameters and it returns an integer.)

Typing information is important to a language's implementation.  In general, information regarding types must be known in order to execute a program.  For example, the expression "'moo' + 1" is incorrect with regard to the previous definition of the "+" operator, as 'moo' is not an integer.  This is an example of a type error: the expected types do not match the given types.

To illustrate that type errors are a significant problem, consider assembly language.  In general, assembly languages have no concept of types.  Data is passed around, and the type of the data depends only on the context it is used in.  The same data can represent a variety of things: an integer, a floating point number, a string of characters, a memory address, etc.  Everything is considered "valid", though it is very easy to write a nonsensical command.  Such errors show themselves only through the program behaving incorrectly, and bugs of that kind are some of the most difficult to work out.

Clearly, typing is important.  What is also important, and getting back to the title of this post, is when typing information is determined.  In a statically typed language, typing information is assigned at compile time, before a program is executed.  In these languages, type errors reveal themselves as compile time errors.  In a dynamically typed language, typing information is assigned on the fly as the program runs.  In these languages, type errors reveal themselves as runtime errors of varying severity.  (The severity is determined by other language properties not relevant to this discussion.)
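To make the dynamic case concrete, here is a small sketch in Common Lisp, a dynamically typed language.  The function names are hypothetical, and the behavior described assumes a typical implementation at default safety settings, where "+" signals a type error for a non-numeric argument:

```lisp
;; Hypothetical example: nothing in this definition says X must be a number,
;; so the definition itself is accepted without complaint.
(defun add-one (x)
  (+ x 1))

;; Calls ADD-ONE and reports whether a type error surfaced at run time.
(defun try-add-one (x)
  (handler-case (add-one x)
    (type-error () :type-error)))

(try-add-one 2)      ; a number is fine
(try-add-one "moo")  ; the mistake only shows up when this call runs
```

A statically typed language would reject the second call before the program ever ran.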

Both types of languages exist in the real world, and there are a number in both camps.  For example, C, C++, and Java are all statically typed languages.  Perl, Python, and JavaScript are all examples of dynamically typed languages.

There are pros and cons for both methods.  For statically typed languages, determining types at compile time means that more efficient code can be generated.  There is no overhead at runtime in verifying types, as such verification is performed at compile time.  Additionally, many common programming errors reveal themselves as type errors, resulting in theoretically more reliable code.  The downside is that certain problems are difficult to express in statically typed languages.  One often focuses more on getting the types just right so code will compile as opposed to actually solving the problem at hand.  Additionally, for many languages, specifying type information requires bulky annotations which take up a significant chunk of the code.  Not to mention the specification of these annotations becomes tedious.

The second listed drawback of statically typed languages, that of bulky type annotations, is frequently heralded as the key advantage of dynamically typed languages.  I will be the first to admit that such annotations are an utter pain in languages like Java, accounting for probably between 10% and 20% of the characters typed.  Imagine code being 10% to 20% shorter, just because the language is dynamically typed.

There is, however, a middle ground: type inferencers.  These are built into the compilers of certain statically typed languages, such as Scala, Haskell, and OCaml.  Type inferencers allow the programmer to skip specifying most, if not all, type annotations in code.  When code is compiled, the type inferencer goes to work, filling in the gaps. 

That's the theory, anyway.  Different inferencers have different capabilities.  Personally, I'm very familiar with Scala's inferencer.  In Scala, type annotations must be specified for input parameters to functions.  Additionally, annotations must be provided for the return values of overloaded functions and for recursive functions.  There are also a few edge cases where annotations must be specified, though these are quite rare.  Even though all of these cases require annotations, it cuts down the number significantly.  Comparing Scala code to Java code, I would safely say in half.  This pales in comparison to Haskell's type inferencer, which is so powerful that frequently all type annotations can be omitted.  Haskell code often looks like types are determined dynamically, precisely because explicit type information is notably absent.

One might have an aversion to such inferencers, at least in theory.  What if the inference is wrong, for example?  In my experience with Scala, I can say that this is rare, but it does happen.  That said, when it happens, it usually means I'm doing something REALLY wrong.  In this case, it means I'm consistently doing the wrong thing.  Any inconsistency results in a failure to infer type, as the inferred type is different in different contexts.  As such, the code is still valid, but it suffers from a logical error.  One can do this in a traditional statically typed language as well, but it is even more rare given how frequently typing information must be specified.  The repetition makes it so one tends to notice more.

The way I view it, the typing argument is a continuum between statically typed languages and dynamically typed languages, with various type inferencing mechanisms in between.

I will admit that true dynamic typing has advantages.  For example, one of Lisp's greatest strengths is its macros, which allow for the creation of entirely new language-level abstractions.  Such macros expand to arbitrary code, and can usually accept arbitrary code.  With a perfect type inferencer that never requires type annotations, this isn't a problem.  The issue is that there is no such thing; inevitably typing information will need to be specified.  With a macro, this type information may need to be inserted in the final expanded code, instead of as a parameter to the macro.  If this is true, then macros can't be implemented properly: one would need a way to manipulate the expanded code, which largely defeats the purpose of macros. 

Personally, I prefer statically typed languages with type inferencers.  I tend to hit a wall with true dynamically typed languages after about a thousand or so lines.  By this point, I have defined several simple data structures, along with operations that act on them.  In a dynamically typed language, I can pass around the wrong data structure into the wrong operation quite easily.  Worse yet, depending on how the data structure is implemented, I may never know except for incorrect program behavior.  For example, if all the data structures are lists of length >= 1, and I have an operation that looks at only the first element and performs some very general operation on the element, then probably every data structure can be passed to this operation without a resulting type error.  By the thousand line mark, I can't keep it all in my head at the same time, and I make these simple errors.  In a statically typed language, this is revealed immediately at compile time, but not so with a dynamically typed language.  At this point I spend more time testing code than writing code, tests that I get "for free" from a statically typed language.
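A small sketch of that scenario in Common Lisp.  The two record layouts and all of the names are made up for illustration:

```lisp
;; Two hypothetical data structures, both represented as plain lists.
(defun make-point (x y)
  (list x y))                ; layout: (x y)

(defun make-employee (name salary)
  (list name salary))        ; layout: (name salary)

;; Intended for employees only, but it merely reads the first element and
;; prints it, which works on any non-empty list.
(defun employee-label (employee)
  (format nil "Employee: ~a" (first employee)))

(employee-label (make-employee "Ada" 100))  ; => "Employee: Ada"
(employee-label (make-point 3 4))           ; => "Employee: 3", no error
```

The second call is a mistake, yet nothing complains; the only symptom is a program that quietly does the wrong thing.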

To put it shortly, most of my programmatic errors reveal themselves as type errors, so static typing is important to me.  This issue is so important that I feel that people with a strong preference for dynamically typed languages must think in a way fundamentally different than I.  I know people who can write correct code just as fast in dynamically typed languages as I can in statically typed languages, so there must be something.  Maybe it's practice, or maybe something else is going on.  That would make an interesting case study...

Monday, May 16, 2011

Meta-Research

...or the research of research, as I see it.  I recently read an article published in PLoS about how most published research is actually false (link). 

This very idea makes me cringe.  If that's true, why do we even bother?  Why spend enormous amounts of money to support something that's false? 

The paper doesn't really address either of those points, but rather it talks about how it can make such a claim.  Statistically, it is difficult to confirm things without absolutely enormous amounts of data.  Of course, getting data sets that are large enough for arbitrary experiments can range from difficult to impossible, with infeasibility being common.  This can put the statisticians at odds with the scientists.  The author's argument drives at this, pointing out that the data sets usually used are not large enough to be able to get a statistically valid answer. 

There is another problem.  People are people.  We are inherently biased.  It has been said that data is objective, and I used to believe that was true, at least in theory.  But then the question was posed to me: why don't scientists measure _everything measurable_ regarding an experiment?  Of course, this would mean an enormous amount of data, most of which is probably irrelevant.  But do we really know it's irrelevant?  The answer is no.  Our bias isn't shown so much in what we measure, but rather in what we choose not to measure - those things we think are irrelevant.

Research costs money.  This usually means getting a grant.  Getting a grant usually means convincing someone that your research is going to do some good, be it cure a disease or (more commonly) make money.  With that in mind, why would anyone pay any amount of money for an application that reads "We want to do X.  We drew it out of a hat.  We have no idea what it does, and we have no idea what could come of this."  That is mostly unbiased (who put the ideas in the hat?).  It is also the least convincing argument I have ever heard for giving someone money. 

Now try "We want to research X, because it seems to have an effect on weight retention.  If this is true, we could develop an effective drug for weight loss."  Now we have something profitable.  But here's the problem: everyone involved wants it to be true.  More than likely, even someone working within ethical bounds is going to act differently when the desired outcome is known ahead of time.  I've been told repeatedly that one should never do data analysis until all the data is in.  However, we do not do this.  I have watched people stare at long running experiments that appear to deviate from expectations.  Frequently the target is personified, "Why are you putting a band there?  You're supposed to put it over here!"  I do this type of thing myself.  We already know what the experiment is "going" to do; we just need the formality of it actually doing it.

All in all, I think the paper is particularly interesting.  It gives a feel for how heterogeneous science really is, and it illustrates the ever-present (though shunned) human element.

Research shows that most research is wrong.

Monday, May 9, 2011

Vaccinations and Autism

And now for something completely different.  I read the original paper that supposedly linked autism to the measles/mumps/rubella (MMR) vaccine (link).  I know that this can get to be a heated topic, but for the moment I'm going to try to focus on the paper itself.  (Of course I'm going to be biased, but I'll try not to be!)

The paper suggests that a new disorder has been discovered.  The disorder is characterized by inflammatory bowel disease (IBD)-like symptoms combined with autism-like symptoms.  The most notable feature of this disorder is that sufferers consistently presented with it between 24 hours and 18 months of receiving the MMR vaccine.  Most sufferers presented with symptoms within two weeks.  Such a disorder would be quite interesting, as the gastrointestinal tract and brain are two very different areas.  The authors' original data in support of this disorder was a case study of 12 people.  Shortly after the paper was published with the original 12, an additional 40 patients were observed, of whom 39 were found to have this new syndrome.

Those are the facts, as presented by the authors.  Without going beyond the paper, this is not very convincing data of a new disorder.  Within the paper itself, there is only complete data presented for the original 12.  Of these 12, there is still considerable variability between patients.  Additionally, there is no control group; these 12 were hand picked by the authors.  The authors openly acknowledge this.  This was published as an "Early Report", and was more or less intended to be a springboard from which further research could be conducted.  To directly quote the paper, "We did not prove an association between measles, mumps, and rubella vaccine and the syndrome described."  Though the evidence suggests an association, there is simply not enough data to be able to make a scientifically valid determination.  Even if there were sufficient data to back an association, one would still have to determine whether the relationship is causal or merely a correlation.  (For example, when hot cocoa drinking is up, the crime rate goes down.  The reason is that it's typically cold when people drink hot cocoa, and the crime rate is known to drop in cold weather.)  Medical studies need hundreds if not thousands of patients to be able to draw any hard and fast conclusions, and 12 patients is not enough to make such a claim.

Now I'll go beyond the paper.  For one, the main author (Dr. Wakefield) was covertly being paid by a law firm that was intending to sue the MMR vaccine manufacturers.  This is a conflict of interest.  Generally, conflicts of interest are rare in published research.  If they exist at all, they should be openly acknowledged.  (Here is a link to a paper with an open acknowledgment of a conflict of interest.)  This is a red flag.  Science is supposed to be as objective as possible, but with a conflict of interest it can be disadvantageous to be objective.

The more troubling problem is that most of the data itself is just plain not true.  Although 10/12 patients were listed as having something classifiable as autism (9/12 if you ignore data with question marks next to it), it was revealed that 3 of them never had a formal diagnosis.  Only a professional can make such a diagnosis.  Many of the symptoms of autism appear in other disorders, and only someone skilled in seeing all these disorders can actually make this judgment.  (I'm sorry, you cannot diagnose yourself as having a complex disorder just by reading a few pages on Wikipedia.)  As such, this is fraud. 

Another point is that earlier drafts of the paper used lengthier values for the time between exposure to MMR and first signs of symptoms.  As it came closer to the final draft, these time intervals shrank dramatically.

A third point is that much of the data was acquired not directly by doctors at the time of visit, but rather from parents at other times.  In the case of one of the children, such data was not acquired until 2 1/2 years after symptoms first appeared.  For something as complex as autism, nonspecific data acquisition is not sufficient.  There are particular things that professionals look for, preferably directly as opposed to through a medical file.

I could go on and on about the different kinds of fraud and deception that occur in this paper.  A complete description of all these things can be found here.  Note that this is from BMJ, which is a peer-reviewed source of legitimate medical information.  This is not some random website that some anonymous person made.  I must make that point clear, as there is a lot of misinformation on the Internet regarding this situation.

There have been a substantial number of follow-up peer-reviewed publications that have shown no link between autism and vaccinations, including this one.  However, the damage has already been done.  Many members of the general public think that there is a link because of this paper.  It has left a bad taste in people's mouths, with big bad science coming along to give our children autism.  This blog post is just going to be part of the fodder in this battle, which will likely continue without merit for years to come. 

People who still think there is a link will likely associate me with some evil corporate machine, and dismiss me.  Fine.  It would not be the first time someone has written me off that easily.  Let's assume there is a link, that this paper was correct, that it should never have been retracted, and this is all part of some conspiracy to cover the truth.  So if that's all true, then why does no one relate vaccinations to inflammatory bowel disease (IBD)?  The bulk of the data of the paper is in support of IBD, not autism.  Dr. Wakefield is neither a psychologist nor a pediatrician, though he does specialize in the gastrointestinal tract.  The paper is not suggesting a link between autism and MMR - it is suggesting a link between autism, MMR and IBD.  It brands the combination of these three under a new disorder.  Removing one element means something else entirely, something the authors were not discussing.  In other words, if one believes what this paper is claiming, then it is self-contradictory to say that there is a link between autism and MMR without IBD involved.  As to how it happened to be autism and not IBD that was picked up by the media, I'll never know.

Monday, May 2, 2011

Recycling: It Can Save Your Life

...assuming you're a lung cell.  I recently read an article published in the Public Library of Science (link) about how Pseudomonas aeruginosa infects people.  The bacterium can infect the lungs of people with other preexisting lung conditions, including pneumonia and chronic obstructive pulmonary disease (COPD).

Pseudomonas aeruginosa is an interesting infection, mostly because it requires a bit of sophistication on the part of the bacterium.  In the lungs, there is a protective layer of mucus that coats the outer layer of cells.  This outer layer of cells is known as the epithelium.  The mucus prevents most everything that is foreign to the lungs from directly contacting the epithelium, which can prevent many kinds of damage and infection.  Pseudomonas aeruginosa can't break through this layer, so it uses a different strategy: send specially manufactured vesicles that can.  These vesicles have proteins on the surface that allow them to bind and fuse with cells in the epithelium, and they contain proteins that cause cellular change.  They are somewhat analogous to so-called "bunker buster" bombs, which are able to penetrate a formidable outer shell and deliver a payload to the inside of the structure.  Only these are released with little guidance.


As for the payload, Pseudomonas aeruginosa causes a subtle but severe change in infected cells.  In healthy cells there is a protein, namely CFTR, that regulates the amount of mucus there is in the lungs.  The protein must reside on the surface of cells to have any effect.  As part of normal cellular activities, this protein is occasionally ubiquitinated, meaning a ubiquitin group is bound to it.  This triggers a pipeline of events.  The ubiquitinated protein is first sequestered from the cell membrane.  It then can follow one of two paths.  In one path, the ubiquitin group is removed, and the protein returns to the cell membrane.  In the other path, the ubiquitin group remains bound, and the protein is eventually degraded.  In healthy cells, these two paths run in tandem.  This is necessary to remove CFTR proteins that no longer function and are essentially just wasting space on the membrane.

What was previously known is that Pseudomonas aeruginosa infection somehow selectively shuts down the path that causes CFTR to return to the cell membrane.  As such, all the sequestered CFTR ends up being degraded.  The cell ends up degrading more CFTR than it can spare, and proper function is lost.  Without CFTR to regulate mucus properly, mucus builds up.  This is beneficial to Pseudomonas aeruginosa, as the once protective mucus ends up being its home, but this is to the detriment of its victim.  This mucus buildup is how people can literally drown in their own lung fluids, not to mention that it makes for a friendly environment for other opportunistic pathogens.

This paper investigated exactly how Pseudomonas aeruginosa is able to shut down the recycling pathway, forcing all ubiquitinated CFTR to be degraded.  The authors found that the vesicle payload contains a protein called Cif.  Although they were unable to determine exactly how, they found that Cif prevents the enzyme that deubiquitinates CFTR from functioning properly.  The reason why is somewhat complicated.  There is another protein, namely G3BP1, that can inhibit the deubiquitination enzyme.  This protein is naturally occurring in lung cells, and it is presumably necessary for normal function.  G3BP1 can bind to the deubiquitination enzyme, temporarily preventing it from functioning.  In healthy cells, G3BP1 does not bind with very high affinity, at least not without other naturally occurring factors to help it along, so the net effect on the deubiquitination enzyme is minimal.

This is where Cif comes in for infected lung cells.  Cif stabilizes the interaction between G3BP1 and the deubiquitination enzyme, preventing the enzyme from functioning for much longer than with G3BP1 alone.  The effect is that the overwhelming majority of ubiquitinated CFTR never ends up getting deubiquitinated, as the deubiquitination enzyme has been inhibited by the interaction between G3BP1 and Cif.

I have a few questions regarding this mechanism, which could make for good future work.  For one, I suspect that some people are naturally immune to Pseudomonas aeruginosa infection, simply because they have mutations in either G3BP1 or the deubiquitination enzyme that prevent Cif from binding well.  It should be possible to conduct a clinical study on people with preexisting lung disorders, looking for those who for some reason never develop Pseudomonas aeruginosa infections, despite the significant likelihood.

I also think that knowledge of this mechanism could lead to a novel drug treatment that prevents Cif from working properly.  Such a drug would somehow induce a conformational change in Cif that would prevent its proper binding to G3BP1.

The overall infection mechanism could be exploited for other purposes, as well.  Classically, specific drug delivery is a problem.  However, Pseudomonas aeruginosa is able to release vesicles that seem specific to lung tissues and contain specific payloads for said tissues.  With genetic engineering, it should be possible to change the payload to be whatever is necessary at the moment.  The result would be a targeted drug delivery system, injecting a specific drug into a specific tissue at a (below) microscopic level.  Perhaps we could even deliver an anti-Cif drug via the same mechanism used to inject Cif in the first place, of all ironic things.

Monday, April 25, 2011

Chatty Bacteria

I recently read an article published in the Public Library of Science (link) about biofilm development on nematodes.  Before getting into the article, some background is needed.

Bacteria are classically seen as unicellular organisms that exist independently of one another.  These cells do not communicate with each other, and are really just a large group of individuals.  Cells in multicellular organisms, in contrast, communicate with each other extensively through a variety of means.  There are individuals, but individuals exist for the good of the whole.  (Cancer is an example of individuals acting in the interest of individuals, as opposed to acting in the interest of the whole organism.)

This model is nice and simple, but untrue.  Different species and strains of bacteria show certain levels of communication.  Though none of these forms of communication are as extensive as those seen in multicellular organisms, they are still significant.  A fairly common type of bacterial communication is known as quorum sensing.  In quorum sensing, bacterial cells are able to send a message to each other that essentially reads "we have reached a certain size".

How bacteria respond to this message depends on the particular species.  For certain pathogenic bacteria, it is interpreted as an attack message.  For a small group of bacteria, attacking a host would be certain death.  The numbers are too small to cause significant damage to the host, minimizing the amount of gain from an attack.  More importantly, the host will mount defenses in the form of an immune response, and a small group could very quickly be eradicated.  For a small group, it is much more advantageous to sit and wait.  The group's numbers slowly build, but the bacteria are proverbially under the radar of the host.  As long as the bacteria are not actually harming the host, the host has no advantage in expending energy and attacking the bacteria.  At some point, the bacterial numbers become significant, to the point where an immune response would not be able to dispatch the bacteria so quickly.  It is at this point that the size signal is sent, triggering the bacteria to attack the host.  Such behavior is quite advantageous, showing the power of such a seemingly simple signal.

In the paper, the authors looked at biofilm development in a certain group of bacteria, namely Yersinia.  (Yersinia includes the infamous Yersinia pestis, which causes bubonic plague.)  Biofilms are the closest bacteria get to being multicellular.  Within a biofilm, bacteria live in close quarters with each other, producing a variety of compounds that benefit the group as a whole.  Biofilms act as a platform for growth, and as a whole tend to be resistant to things that would otherwise kill off bacteria, including antibiotics.  The creation of biofilms is no simple feat for bacteria, and it is often mediated by chemical signals that the cells send to each other.

Enter the poor nematode.  This is a simple, very tiny worm, which is often used as a model organism in biology.  Yersinia can actually make its home on nematodes, and is even capable of making biofilms on them.  The authors investigated how such biofilms were made.  Given that nematodes can (and do) move around, such biofilms are an interesting area of study, as many biofilms tend to develop on static surfaces.  Sure enough, the construction of these motile biofilms is mediated by the same quorum sensing signals as seen in other bacteria.  Biofilms are loaded with the quorum sensing signal, namely N-acylhomoserine lactone (AHL).  The authors genetically engineered bacteria that were incapable of making AHL, and the resulting bacteria were unable to develop substantial biofilms.  In addition to biofilm production, they also found that quorum sensing signals triggered pathogenesis in general, as evidenced by the need for AHLs to make virulon proteins.

Though quorum sensing appears to be widely utilized by bacteria, there appears to be a large amount of variation on the common theme.  There are a lot of different ways in which a "we number this many" signal can be used advantageously by bacteria.  Life, through evolution, tends to explore many niches, and experimentally it seems that quorum sensing is no exception.  The authors note how a number of other pathogens utilize quorum sensing in their own specific ways.

This leads to an interesting topic for experimental drugs.  Without the quorum sensing signal, certain pathogens never actually express pathogenic behavior.  If we can develop a drug that prevents this signal from ever reaching its target, be it through destroying the signal, blocking its receptor, or some other means, then the bacteria in question never mount an attack.  While they are still there, they are effectively harmless.  It seems that quorum sensing is specific to bacteria, so presumably such drugs would target bacteria specifically.  Additionally, given that quorum sensing is a common theme among pathogens, such drugs may specifically target pathogenic bacteria, sparing "good" bacteria.  This is unlike modern broad spectrum antibiotics, which usually kill off everything.  (Many of the negative side effects of antibiotics are due to good bacteria getting killed.)  There seems to be a lot of good that could come of quorum sensing research, and I'm excited to see what the future holds for it in terms of medicine.

Monday, April 18, 2011

Recursive Pathogens

I recently read an article published in Nature letters (link).  The topic of the article is that of a newly discovered pathogen: the virophage. 

The virophage is something completely out of the ordinary, compared to usual pathogens.  Virophages, like viruses, are not actually alive.  They lack their own molecular machinery for reproduction, and must rely on the host's machinery for this purpose.  For a typical virus, this is fairly simple conceptually.  A typical virus hijacks the molecular machinery of the cell, using it to produce viral proteins and induce other behavior advantageous to the virus.  The cell is forced to create new viruses with its own machinery, allowing for the creation and spreading of even more viruses.

In the respect of hijacking a host's machinery, the virophage is no different from a typical virus.  What is atypical, however, is that virophages actually hijack the already hijacked cellular machinery.  That is, virophages require that some other virus has already modified the molecular machinery of a cell in a way that the virophage can use.  The virophage alone cannot infect a cell; it requires both the cell and another virus infecting the cell.

For that matter, it may be wrong to say the virophage infects the cell.  Based on the results of the paper, it seems more accurate to say that the virophage infects the other virus, which happens to reside in a cell.  Infection with virophage caused many of the viral components normally produced to be nonfunctional.  That is, the virophage impeded the spread of the infecting virus.  The virophage actually had a beneficial effect for cells.  Significantly fewer cells died when infected with virus + virophage instead of just virus (virophage + cells was no different from cells alone).

Although this is not too difficult to understand, it's a very different way of thinking.  The common terms "pathogen" and "host", which used to have clear definitions, become blurred.  The virophage is not a cellular pathogen, but rather a viral pathogen.  Given that viruses are not alive, this is a paradox: how can something nonliving be a pathogen to something else that is nonliving?  This gets at the very root of what it means to be "alive", which has been hotly debated in the past by people across a wide variety of fields.

I think there are a lot of directions in which this research could go.  For one, it is suggested that virophages are extremely common in oceans, and perhaps elsewhere.  So far, all virophages discovered have come from common cooling towers, so they exist outside the ocean as well.  I wonder how many different kinds of virophages there are.  Perhaps we could find a virophage for existing viral human pathogens, although this is probably jumping the gun.

A logical next step is to determine exactly how the virophage is hijacking the other virus.  The nonfunctional viral particles produced are very strange, and it does not seem obvious how they come about. 

Another question that comes to mind is selection advantage and the evolution of virophages.  Consider an extremely virulent virus.  This virus usually kills its host.  For a virus, it is unfavorable to kill off the host, since the host is required for reproduction.  Additionally, it is unfavorable to adversely affect the host significantly.  Generally, very sick people partially quarantine themselves from the rest of the population, namely by bedrest.  It is in the virus' best interest to spread to as many people as possible, and a very sick host cannot do that.  This is partially why the cold virus is so ubiquitous - people rarely get sick to the point of avoiding others, which in turn spreads the virus.  In short, a highly virulent virus is bad both for the host it infects and for the virus itself.

This is where I see a virophage coming in.  Although the virophage is a viral pathogen, in this case, it is actually in the virus' best interest not to be so virulent.  If the virophage prevents the host virus from being so pathogenic, then the end result is that the host virus can spread to more people.  Granted, much less of it is spreading, but considering that only one virus is theoretically needed to start an infection, this reduction may be acceptable.  The virophage is also beneficial to the cell, as cells simultaneously infected with virophage and virus usually do much better than cells infected with only virus.

That's my suspicion anyway.  As stated before, there are a lot of paths this research can take from here, and I only scratched the surface with these ideas.  Time to revise the textbooks.