USS Clueless - Analyzing the genome

Stardate 20030919.1600

(On Screen): I think that one of the reasons why a lot of people think that some subjects are boring and impenetrable is that when they were exposed to those subjects, the presentation was incompetent.

When I was in grade school and in high school, history was easily my least favorite subject and the one where I consistently got the worst grades. But once I was out of college, I ended up being fascinated by it and have been studying it ever since. The problem wasn't that history is dull; it's that the books I was taught from were dull.

There are a lot of people who have a similar feeling about science and mathematics, and if you try to discuss such things with them on a technical level, either their eyes glaze over or they'll react negatively in some way to it. I just ran into a post by Zachary Latif who says that he used to have that kind of reaction to technical subjects but has been finding them more and more accessible as time has gone on (in part because of reading some of my stuff, blush).

With respect to history, I think it's also the case that I needed to develop more. It wasn't until I myself had become a bit more humanized and socialized that I began to have the kind of intellectual resources to be able to understand those things. All of us continue to develop during our lives; our mental toolkit doesn't freeze when we get handed our last diploma. Something of the kind may be happening to him, too.

In the course of his post, having mentioned in passing such things as Russell's Paradox and Turing's "Stopping Problem" (more usually called the Halting Problem), he writes this:

This is also why the Gene Code can't be deciphered with such ease as well because our genetic code is sufficiently complex to the degree that it would take a pretty long time to able to full understand the reactions of one gene to another and algorithm cannot model every possible gene interaction.

It's worse than that. Mechanical analysis of the human genome via computer algorithm is isomorphic to the Stopping Problem.

The sequencing of the human genome is a modern miracle of science and technology. That it is possible at all to sequence even a short stretch of DNA is amazing, but that sequencing could be done so efficiently and rapidly as to make it practical to do this for the entire body of genetic information we carry is even more amazing. Of course, we're not all identical genetically, but we're far more similar than different, and that resulting body of data is of great value to us all, both in what it says in the places where every one of us has the same genes, and in what it says in the places where our genes are different. As time goes on and there's further work, the library will be enhanced with information about places where there's a lot of variation, documenting the kinds of differences that can be found amongst us.

But having created an immense library, divided into 24 volumes, of long sequences made up exclusively of billions of A's, C's, G's and T's, the letters by which genetic "words" are spelled, we find that trying to figure out what it all means is an even worse problem.

The evidence is that most of it doesn't mean anything. The estimate is that 90% of it is "junk" which is never expressed at all. There actually turns out to be a quick and dirty way to tell (or at least get a good indication of) whether a given section of the genome is junk or non-junk. That's because of what are known as "neutral mutations".

Mutation is the term we use for what turn out to be heritable genetic copying mistakes. The process by which DNA is copied is extremely good, with a very low error rate, but the copying process is not perfect and the error rate is not zero. However, most copying mistakes don't get passed on, since they happen in cells not related to production of sperm or eggs. Heritable mutations happen during the actual process of meiosis, and they can also happen during embryonic development in cells which are developmentally ancestral to those which will eventually participate in meiosis.

If such mutations (in the genotype) cause differences in the result (the phenotype) then natural selection operates on them. Natural selection is an inefficient statistical process and evaluates each organism which plays the game as a whole. Either they survive or they don't; either they breed or they don't. But with a large population and a lot of time, generally speaking advantageous mutations become more common and deleterious ones become less common. (With the process further complicated by the way that some genes are both, not to mention the fact that many are recessive.)

Given that the majority of mutations in functional genes break them or make them worse, it means that statistically speaking natural selection tends to act as a second level of quality control on genetic copying, since mistakes which are bad for the phenotype get weeded out in terms of survival.
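To make the statistical point concrete, here is a minimal sketch of selection acting on a single mutation over many generations. It is a deliberately crude model and nothing in it comes from the discussion above beyond the basic idea: it assumes a haploid, effectively infinite population with no random drift, no recessiveness and no interaction between genes. The names `next_frequency`, `simulate` and the selection coefficient `s` are illustrative assumptions.

```python
def next_frequency(p: float, s: float) -> float:
    """Frequency of a mutant allele after one generation of selection.

    p is the current frequency of the mutant (0..1).
    s is its assumed selection coefficient: carriers leave (1 + s)
    offspring for every 1 left by non-carriers. Positive s is
    advantageous, negative s is deleterious, zero is neutral.
    """
    mutant = p * (1.0 + s)      # weighted share contributed by carriers
    original = 1.0 - p          # share contributed by everyone else
    return mutant / (mutant + original)


def simulate(p0: float, s: float, generations: int) -> float:
    """Apply next_frequency repeatedly, starting from frequency p0."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, s)
    return p


if __name__ == "__main__":
    # Start each mutation at 1% of the population and run 1000 generations.
    for s in (0.01, 0.0, -0.01):
        print(f"s = {s:+.2f}: final frequency = {simulate(0.01, s, 1000):.6f}")
```

Even a 1% edge per generation is enough, in this idealized setting, to carry a rare mutation to near-universality given enough time, while a 1% handicap drives it toward extinction. A mutation with no effect at all stays where it started, because selection has nothing to act on. A real population is finite and noisy, so this should be read as an illustration of the trend and not as a prediction.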

But there are a lot of places where changes in the genotype do not cause any change in the phenotype. For one thing, the code is redundant. Individual amino acids are encoded using a sequence of three "letters". With four letters to choose from at each of the three positions, there are 4 x 4 x 4 = 64 possible combinations, but only 20 amino acids plus "stop" to code for. That means some amino acids have more than one sequence, and it turns out that a lot of them are rather nicely grouped. I reproduced the encodin