Chapter 10 Why do we align sequences?
Sequence alignment is the art of lining up sequences from different samples in a way that reflects a shared quality. When we perform an alignment in preparation for phylogenetic analyses, we aim to line up our sequences so that the complete alignment reflects the evolutionary relationships among all the samples. In practice, this means that long stretches of the aligned sequences should be fairly similar, with smaller regions of dissimilarity scattered throughout.
As the samples become more distantly related to each other, the regions of similarity will become smaller and the regions of dissimilarity will become larger. How the regions of dissimilarity are arranged can change, depending on the assumptions we choose.
All alignment programs will assign a “price” to each potential alignment they create, then return the least costly alignment as the final result. Most programs build potential alignments using an algorithm that assigns a similarity score to each pairwise comparison, then use these scores to choose the final alignment. The algorithm also assigns penalties to alignments that include undesirable features. In general, an alignment algorithm can apply two major costs:
- gap opening cost: we can apply a penalty for opening (or starting) any gap (indicating an insertion or deletion event)
- gap extension cost: we can apply a penalty for making a gap longer
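To make the two gap costs concrete, here is a minimal sketch that prices an already-aligned pair of sequences. The function name `alignment_cost`, the default cost values, and the convention that the first position of a gap pays the opening cost while each further position pays the extension cost are all assumptions chosen for illustration; real alignment programs use their own scoring schemes and defaults.

```python
def alignment_cost(seq_a, seq_b, mismatch=1, gap_open=5, gap_extend=1):
    """Price a pairwise alignment (hypothetical helper, illustrative costs).

    Both inputs must already be aligned: same length, with '-' marking gaps.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    cost = 0
    in_gap_a = False  # True while we are inside a run of gaps in seq_a
    in_gap_b = False  # True while we are inside a run of gaps in seq_b

    for a, b in zip(seq_a, seq_b):
        if a == "-" and b == "-":
            # A column of only gaps carries no information; skip it.
            continue
        if a == "-":
            # First gap position pays gap_open, later ones pay gap_extend.
            cost += gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif b == "-":
            cost += gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            in_gap_a = in_gap_b = False
            if a != b:
                cost += mismatch
    return cost


# One gap of length 2 versus two separate gaps of length 1.
print(alignment_cost("ACGTACGT", "AC--ACGT"))  # 6  (open 5 + extend 1)
print(alignment_cost("ACGTACGT", "A-G-ACGT"))  # 10 (open 5 + open 5)
```

With these illustrative values, a single two-position gap is cheaper than two one-position gaps, so the cheaper alignment explains the difference with one insertion or deletion event rather than two. Changing the ratio between the opening and extension costs changes which arrangement of gaps is preferred.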
Alignments can be created using either the nucleotide sequence or the amino acid sequence. Amino acid sequences can be useful when dealing with more diverse samples, where the nucleotide sequences include many regions of dissimilarity and few regions of similarity. Because there are only 4 nucleotides, compared to 20 amino acids, two unrelated positions are much more likely to match by chance in a nucleotide alignment, so amino acid sequence alignments tend to be less noisy than nucleotide sequence alignments. Amino acid sequences are also slower to change than nucleotide sequences, due to silent (or synonymous) nucleotide mutations that don’t affect the amino acid sequence.
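The effect of silent mutations can be seen in a small sketch. The two sequences below are made up for illustration, and `CODON_TABLE` is only a partial copy of the standard genetic code covering the codons used here; the helper names `translate` and `count_differences` are likewise hypothetical.

```python
# Partial standard genetic code: only the codons needed for this example.
CODON_TABLE = {
    "GCT": "A", "GCC": "A",  # alanine
    "AAA": "K", "AAG": "K",  # lysine
    "GAT": "D", "GAC": "D",  # aspartic acid
    "TGG": "W",              # tryptophan
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring any trailing bases."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_differences(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Made-up sequences that differ only at third codon positions.
seq1 = "GCTAAAGATTGG"
seq2 = "GCCAAGGACTGG"

print(count_differences(seq1, seq2))                        # 3
print(count_differences(translate(seq1), translate(seq2)))  # 0
```

The two nucleotide sequences differ at three positions, yet both translate to the same four amino acids, so the differences disappear entirely at the protein level.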
Keep in mind that any alignment we use is still just a hypothesis - it may be a well-supported hypothesis that represents our best knowledge, but it may still not be correct. We may never know what the “true” alignment is.