Chapter 7 What sequences should I choose?

Phylogenies are only as good as the data used to infer them, so it’s worth it to the spend some time carefully choosing the genomic regions and samples you will use. Good planning from the beginning will save you headaches further downstream.

First, you should ask yourself: what information are you hoping to gain from the tree? Are you hoping to reconstruct the history of organisms, or the history of a region of DNA, or the history of a protein? The answer will guide your choice of sequence and samples.

For closely related species in which you might hope to figure out some sort of information about the divergence between the species (including the timing of the divergence), you would use areas of the genome known to accumulate changes rapidly (non-coding regions that also do not have functionality, or whose functionality is not easily changed by changes in the DNA base sequence). Some examples of rapidly changing genetic regions include the mitochondrial control region, the wobble base on mitochondrial coding regions, and nuclear introns. It is also important to use “dense taxon sampling” among closely related species because any small change can seem disproportionately important in a recent divergence. Having multiple individuals sampled from each phylogenetic unit of interest (could be species, subspecies, or populations) helps to compensate by showing the genetic divergence within a group. This within-group divergence can then be accurately compared to the genetic divergence between two groups.

For more divergent species and comparisons, you use areas of the genome that do not change as rapidly. For example, if you wanted to do a survey of the placental mammals, you could choose a gene region that is under enough selection pressure that it mutates more slowly than the regions you would choose for closely related species.

If you are examining the relationships among deeply divergent species, or when the base-pair signal is completely swamped out over time, you might search for amino acid sequence similarities instead of DNA sequence similarities. Because of wobble, amino acid sequence can remain the same even when bases change. Sometimes amino acids of similar size/charge/shape can be substituted for others, which would result in a complete change in base pair sequence (and loss of ability to find similar sequences), but allows for finding similarities through amino acid sequence. Even when using protein sequence, it is often helpful to extract the coding sequence (once similar protein sequences have been found), because that adds extra information to fine tune the phylogenetic analysis. This technique can also be used when the primary goal is to trace the history of a particular gene (when the changes in the gene itself are of interest.)

To attempt a reconstruction of the evolutionary history of organisms, you really should use multiple lines of evidence and not rely solely on genomic data. For example to reconstruct primate evolution, one looks at the fossil record, molecular divergence, and also phylogeographic evidence (how these things map onto our understanding of the geography of the earth at various crucial time points along primate evolution). Examples of phylogeographic evidence include understanding when terrestrial (land-based) organisms might have been cut off from each other due to the formation of a river or lake, the eradication of a land bridge by melting glaciers and a rise in the earth’s temperature (which raises the sea level).

Alas, for this book, we are limited to only using genomic data.