Chapter 21 What is a network?

Phylogenetic networks (sometimes called splits networks) are a way to examine conflicting phylogenetic relationships in your data. Like a tree, a network visually represents evolutionary relationships among taxa. Unlike a tree, a network can show conflicting signals in the data, when more than one relationship pattern is supported. (A tree will simply show the relationship with the most data support.)

These types of analyses started gaining popularity with the advent of big data in the mid-2000s, because researchers were discovering that not all genes within an organism had the same evolutionary history. Phenomena such as incomplete lineage sorting, gene loss or duplication, hybridization, and horizontal gene transfer mean that genes in an organism’s genome can come from a variety of sources. Our simple understanding of speciation (a population splits in two, and the resulting daughter populations become genetically isolated from each other and develop into two new, different species) was no longer sufficient to describe what actually happens in nature. This is the essential idea behind the difference between gene trees and species trees.

Take a look at this figure (Daniel Huson, ISMB-Tutorial 2007: Introduction to Phylogenetic Networks):

[Figure: trees T1 and T2, and the network that combines them]

Figures T1 and T2 represent possible phylogenetic relationships among taxa. Let’s say 60% of the data support the relationship in T1, while the other 40% support the relationship in T2. If we were to use all the data together to infer a single consensus tree, we would only see the relationships in T1; the information in T2 would be completely obscured. A phylogenetic network, on the other hand, will show us both possible relationships (the diamond shape you see in the third figure). It’s as if figure T1 were overlaid on figure T2 (with some extra branches drawn). In a network, each diamond shape marks a place where the data support more than one possible phylogenetic relationship.
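To make the 60/40 idea concrete, here is a minimal sketch. It assumes four made-up taxa (A to D) and describes each tree by the split (bipartition) it supports; the taxon names, the helper functions, and the thresholds are all illustrative and are not taken from the figure or from any particular program. A majority-rule tree keeps only splits supported by more than half the data, while a network keeps every supported split, including pairs that cannot fit on one tree.

```python
from itertools import combinations

TAXA = frozenset("ABCD")  # hypothetical taxa, for illustration only

# Fraction of the data supporting each split (named by one side of it).
# AB|CD stands in for tree T1, AC|BD stands in for tree T2.
split_support = {
    frozenset("AB"): 0.6,
    frozenset("AC"): 0.4,
}

def compatible(side1, side2, taxa=TAXA):
    """Two splits fit on the same tree if and only if at least one of the
    four pairwise intersections of their sides is empty."""
    other1, other2 = taxa - side1, taxa - side2
    return any(not (x & y) for x in (side1, other1) for y in (side2, other2))

def majority_tree_splits(support, threshold=0.5):
    """Splits a majority-rule tree would keep."""
    return {s for s, w in support.items() if w > threshold}

def network_splits(support, threshold=0.0):
    """Splits a network would keep: everything with some support."""
    return {s for s, w in support.items() if w > threshold}

def conflicting_pairs(splits):
    """Pairs of splits that cannot be drawn on one tree together."""
    ordered = sorted(splits, key=sorted)
    return [(a, b) for a, b in combinations(ordered, 2) if not compatible(a, b)]

tree = majority_tree_splits(split_support)
network = network_splits(split_support)
print("tree keeps:", [sorted(s) for s in tree])
print("network keeps:", [sorted(s) for s in network])
print("conflicts drawn as diamonds:", len(conflicting_pairs(network)))
```

Each conflicting pair is what shows up as a diamond (box) in the drawing: the two splits are both displayed, so neither signal is thrown away.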

21.1 Why do we see competing phylogenetic relationships?

Speciation sounds straightforward (a population splits into two reproductively isolated daughter populations, which then become two new species), but in reality it is a very slow and messy process. Sometimes we are examining taxa that have only recently become separate. In this situation, we might be seeing the result of incomplete lineage sorting. When populations first split, individuals in both daughter populations will carry the same alleles for particular genes or genomic regions. Usually these shared alleles will sort, or become extinct in one population but not the other. However, this process takes time, and the bigger the daughter population, the longer it takes.
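The effect of population size on sorting can be sketched with a standard coalescent result that this chapter does not derive: for three taxa, the chance that a gene tree disagrees with the species tree is (2/3)·exp(−T), where T is the length of the internal branch in coalescent units (generations divided by 2Nₑ for a diploid population). The function name and the example numbers below are assumptions chosen purely for illustration.

```python
import math

def discordance_probability(generations, effective_size):
    """Probability that a gene tree differs from the species tree for
    three taxa, assuming a diploid population of constant effective size
    along the internal branch of the species tree."""
    t_coalescent = generations / (2.0 * effective_size)
    return (2.0 / 3.0) * math.exp(-t_coalescent)

# Same waiting time between the two splits, different population sizes.
for ne in (1_000, 10_000, 100_000):
    p = discordance_probability(generations=20_000, effective_size=ne)
    print(f"Ne = {ne:>7}: P(discordant gene tree) = {p:.3f}")
```

With the time between splits held fixed, larger populations leave more genes unsorted, so a larger share of gene trees will conflict with the species tree.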

If the two daughter populations have been separated long enough for genomic sorting to have occurred, the conflicting phylogenetic relationships could be the result of introgression. Among sexually reproducing species, members of two separate species might interbreed or hybridize (this is very common in some groups, especially ducks!). In bacteria, genes can be transferred between bacteria of different species via horizontal gene transfer. Viruses will also swap entire genes in a similar process.

Finally, it may not be possible to reconstruct the phylogenetic relationships among groups if speciation happened over a very short time period (similar to what you expect in an adaptive radiation). In this case, the relationships among species might look like a polytomy instead of a series of bifurcating nodes.

Gene trees and species trees

One of the most important things to remember in phylogenetics is that we estimate gene trees. We often use these as proxies for species trees, but they are not the same thing. Due to the processes of sorting and introgression, the phylogenetic relationships supported by a certain percentage of genes or genomic regions will actually differ from the species tree. When you are looking at trees built with data from only one gene, you can’t guarantee that your gene tree reflects the species tree!

This is problematic now that we’re using whole genomes to infer phylogenies. The methods we use for inferring phylogenies were designed based on the idea that all the sequences evolved following a single model of evolution, with no complicating factors like lateral gene transfer, duplication events, or recombination. Unfortunately, the genome is a wild hodgepodge of coding regions, noncoding regions, introns, exons, regulatory regions, pseudogenes, and a bunch of other things we probably don’t know about yet. Each of these regions may have its own evolutionary history, and each may be best described by a different model of molecular evolution.

There are two basic approaches to dealing with the whole genome mess. Both of these approaches are computationally complex and generally can only be done with neighbor joining or parsimony methods, at least for now. Researchers can consider each genomic region separately (the consensus gene tree approach), and then find an evolutionary history that best fits the distribution of topologies generated by the separate genomic regions. Alternatively, researchers can “glue” the whole genome together into a single fasta file (the concatenation approach), which they then use for analysis, assuming they can figure out how to align the genomes. Gene duplication, gene deletion, and the presence of pseudogene regions make this a not-insignificant problem.
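As a rough illustration of the concatenation approach, the sketch below glues per-gene alignments into one sequence per taxon and writes them out in fasta format. It assumes each gene has already been aligned on its own, and it pads a taxon that lacks a gene with gap characters; the function names and the toy sequences are made up for this example and do not come from any specific pipeline.

```python
def concatenate_alignments(gene_alignments, gap="-"):
    """Join per-gene alignments (each a dict of taxon -> aligned sequence)
    into one sequence per taxon. A taxon missing from a gene alignment is
    padded with gap characters so every row keeps the same length."""
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one gene alignment must have equal length")
        (length,) = lengths
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, gap * length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}

def to_fasta(sequences):
    """Format a dict of name -> sequence as fasta text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())

# Toy alignments, invented for illustration.
gene1 = {"taxonA": "ATGC", "taxonB": "ATGA", "taxonC": "ATGG"}
gene2 = {"taxonA": "GGT", "taxonC": "GAT"}  # taxonB lacks this gene

combined = concatenate_alignments([gene1, gene2])
print(to_fasta(combined))
```

Note how the missing gene in taxonB becomes a block of gaps. With real genomes, gene loss and duplication make these blocks common, which is part of why alignment is the hard step.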

When you are working with microbial genomes, you have the added difficulty of the pan-genome problem. In many bacteria, genes can generally be sorted into two groups: the core genes (genes that are present in all members of a species) and the accessory genes (genes that can be present, but don’t have to be). The collection of all core genes plus all accessory genes is called the pan-genome. A typical S. aureus genome contains about 2800 genes (1000 core genes and 1800 accessory genes). Unfortunately, the S. aureus pan-genome contains a little more than 7400 genes. That means the 1800 accessory genes in any given S. aureus genome are selected from roughly 6400 possibilities (7400 pan-genome genes minus the 1000 core genes).
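The core, accessory, and pan-genome definitions map directly onto set operations. The sketch below uses three invented strains with invented gene names, then repeats the S. aureus arithmetic using the numbers quoted above; nothing here is real annotation data.

```python
def pan_genome_summary(genomes):
    """Given a dict of strain -> set of gene names, return the core genes
    (present in every strain), the accessory genes (present in some but
    not all strains), and the pan-genome (present in at least one)."""
    gene_sets = [set(genes) for genes in genomes.values()]
    core = set.intersection(*gene_sets)
    pan = set.union(*gene_sets)
    return core, pan - core, pan

# Invented strains and gene names, for illustration only.
genomes = {
    "strain1": {"geneA", "geneB", "geneC"},
    "strain2": {"geneA", "geneB", "geneD"},
    "strain3": {"geneA", "geneB", "geneE", "geneF"},
}
core, accessory, pan = pan_genome_summary(genomes)
print(len(core), "core,", len(accessory), "accessory,", len(pan), "pan-genome")

# The S. aureus numbers quoted in the text.
pan_size, core_size, genome_size = 7400, 1000, 2800
accessory_pool = pan_size - core_size           # genes a strain may or may not carry
accessory_per_genome = genome_size - core_size  # accessory genes in one typical genome
print(accessory_per_genome, "accessory genes drawn from a pool of", accessory_pool)
```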

This should not discourage you from building whole genome phylogenetic trees. New methods and programs are coming out all the time to deal with these problems, and trees from whole genome sequences are becoming a reality. If this is something you’re interested in, take a look at programs like kSNP3.