Chapter 20 Adding calibration points to your phylogenies

This week we will be working with both RStudio and BEAST. Within RStudio, we will be using the ape package again.

20.1 Choosing calibration points

This week, you should choose at least one calibration point to add to your tree. You can pick any two taxa from a tree and figure out the estimated age of the divergence between them (or the estimated age of the most recent common ancestor between them). The website TimeTree has divergence estimates for many different taxa pairs.

When two taxa (usually species, but it could be taxa from higher taxonomic levels) are entered into TimeTree, the website first checks whether these taxa are in the TimeTree of Life database. If they are, the divergence estimate between the two (the age of the MRCA between them) is retrieved. If one or both of the taxa are not in the database, the website scans NCBI taxonomy to find the closest relative of the species and uses that as a proxy to find the MRCA for two taxa.

If your taxa are not on TimeTree, you will have to be creative in finding a calibration point. You might be able to find information in the literature. If you are working with viral samples, you can use the date the sample was collected as a tip calibration point. If nothing works and you are stuck, email me and I will send you the grass dataset to work with.

For the grass dataset, let’s use the divergence between bread wheat (Triticum aestivum) and crested wheatgrass (Agropyron cristatum). When we search TimeTree, we see several estimated divergence times.

Major point!! example image

TimeTree gives us two divergence estimates: a median estimate (with 95% confidence interval), and an adjusted time, which is quite a bit older than the median estimate. This is what TimeTree says about their adjusted time estimate: “Answer: Due to conflicting time estimates between studies, ancestral nodes can be assigned younger ages than their descendants. A smoothing technique is used to adjust these times so the resulting tree is ultrametric.” (An ultrametric tree is one where all the a rooted tree with edge lengths where all leaves are equidistant from the root. When trees have been scaled using calibration points, they often become ultrametric.)

In this case, the very youngest divergence estimate from TimeTree is 2.6 million years ago (MYA), while the oldest is 16.5 MYA. We will use these estimates going forward. We could have easily chosen the other estimates as well. There isn’t necessarily a right or wrong choice here. You just need to be upfront about the estimates you chose.

20.2 Adding calibration points to internal nodes for ML trees

When we look at the grass tree, we see the node that indicates the MRCA between crested wheatgrass and bread wheat (labeled wheat on the tree) is pretty deep within the tree.

Major point!! example image

If we trace the branches until we find the node where they connect, we can see it is the basal node of the tree (the node which connects the ingroup to the outgroup). In order to add a calibration date to our ML tree in R, we need to know how R has identified (or labeled) that particular node. You can either load your saved ML tree or create it again.

library(ape)
library(phangorn)

grass.align <- read.phyDat("grass_aligned-renamed.fasta", format = "fasta")
dist <- dist.ml(grass.align)
nj.tree <- nj(dist)

fit <- pml(nj.tree, data = grass.align)
fitGTR.G <- update(fit, model = "GTR", k = 4)
fitGTR.G <- optim.pml(fitGTR.G, model = "GTR", optGamma = T, rearrangement = "stochastic", control = pml.control(trace = 0))

tree.root <- root(fitGTR.G$tree, outgroup = c('barley_D-hordein','Siberian wild rye_D-hordein'))
plot(tree.root)
nodelabels()

Major point!! example image

The node that connects crested wheatgrass and wheat (and indicates the MRCA between the two) is node 21. We will use that information, as well as the estimated divergence times from TimeTree, to apply a timescale to the ML tree using the chronos command from the ape package. If you’d like to read more about it, the ape manual is here.

The chronos command takes several arguments. The first is a lambda starting estimate (the smoothing parameter). The next argument is the model of rate substitution variation (how much should the mutation rate vary among the branches of the tree?). chronos can implement a strict clock (no variation, called “clock”), a correlated clock (where the variation is correlated between branches, but does exist, called “correlated”), a discrete clock (where estimates are discrete for each branch and requires multiple estimated parameters, called “discrete”), or a relaxed clock. The relaxed clock (called “relaxed”) allows for discrete mutation rates on the branches, but doesn’t involve estimating a different parameter for each branch. We’ll implement the discrete clock in this example. (The estimation is fairly quick, so you can easily implement all four and see how the resulting trees differ.)

The remaining arguments is the calibration data frame (which includes the actual calibration points). We use the command makeChronosCalib command to create the calibration data frame. This command takes the ML tree as the first argument.

calib <- makeChronosCalib(tree.root, node = 21, age.min = 2.6, age.max = 16.5)

calib
##   node age.min age.max soft.bounds
## 1   21     2.6    16.5       FALSE

If we want to add more calibration points, we can use the c (concatenate) command. This tells R that a string of multiple entries are being entered at once.

calib2 <- makeChronosCalib(tree.root, node = c("21", "19"), age.min = c(2.6,3.0), age.max = c(16.5, 16.5))

calib2
##   node age.min age.max soft.bounds
## 1   21     2.6    16.5       FALSE
## 2   19     3.0    16.5       FALSE

Creating the time tree, we simply combine the chronos and makeChronosCalib commands.

time.tree <- as.phylo(tree.root)

time.tree_dated <- chronos(time.tree, lambda = 1, model = "discrete", quiet = FALSE, calibration = calib, control = chronos.control())
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -4.102921 
## Optimising rates... frequencies... dates... -4.102921 
## Optimising rates... frequencies... dates... -3.795125 
## 
## log-Lik = -3.795125 
## PHIIC = 67.59

Before we plot the tree, we need to do a bit of formatting of the branch lengths (called edge.length). R will be default save these values to 10 decimal places, which can make visualization a little tricky. We’re going to format them so they only have two decimal places, then plot them on top of the time tree.

time.tree_dated$edge.length <- round(time.tree_dated$edge.length, digits = 2)

plot(time.tree_dated)
edgelabels(time.tree_dated$edge.length, bg="black", col="white", font=2)

Behold, you have a time tree. The branch lengths are now scaled to million of years, so you can estimate the age of MRCA between any two taxa on the tree. For example, based on the discrete clock model, our tree suggests the MRCA between mosquito grass and medusahead rye lived ~11 MYA (the length of the branch between mosquito grass and the node that denotes the common ancestor with medusahead rye).

Before you end your session, make sure you save your tree (and then download the .tre file to your desktop).

write.tree(time.tree_dated, "ml_grass_timetree.tre")

20.3 Adding calibration points for Bayesian trees

We can add calibration points using the BEAUTi gui you worked with last week. In this case, we open BEAUTi and load our dataset. We have two choices at this point: we can choose the “Tip Dates” tab and specify the date each sample was taken (useful for viral trees, especially during an outbreak), or we can choose the “Clock Model” tab (if you do this, use a strict clock model unless you are certain your data are not clock-like).

If you’ve chosen to not use Tip Dates, we will add additional calibration points in the “Priors” tab. These instructions are taken from the tutorial you worked with last week.

Major point!! example image

Major point!! example image

The rest of your parameters (including your substitution model and priors) can be set to the same values you used last week. Follow the instructions from last week to get your final Bayesian tree.

20.4 Visualizing the time trees in FigTree

After you finish creating your trees, you can visualize both the ML and Bayesian tree using FigTree. Play around with the options to make the tree look nice - I recommend making the tip labels larger. Be sure to include your branch lengths! Those are now scaled to be the divergence times for both the ML and Bayesian trees.

Major point!! example image

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] phangorn_2.5.5 ape_5.4-1     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.10     highr_0.8       bslib_0.4.2     compiler_4.0.2 
##  [5] pillar_1.9.0    jquerylib_0.1.4 tools_4.0.2     digest_0.6.25  
##  [9] jsonlite_1.7.1  evaluate_0.20   lifecycle_1.0.3 tibble_3.2.1   
## [13] nlme_3.1-149    lattice_0.20-41 pkgconfig_2.0.3 rlang_1.1.0    
## [17] igraph_1.2.6    fastmatch_1.1-0 Matrix_1.2-18   cli_3.6.1      
## [21] yaml_2.2.1      parallel_4.0.2  xfun_0.26       fastmap_1.1.1  
## [25] stringr_1.4.0   knitr_1.33      fs_1.5.0        vctrs_0.6.1    
## [29] sass_0.4.5      hms_0.5.3       grid_4.0.2      glue_1.4.2     
## [33] R6_2.4.1        fansi_0.4.1     ottrpal_1.0.1   rmarkdown_2.10 
## [37] bookdown_0.24   readr_1.4.0     magrittr_2.0.3  htmltools_0.5.5
## [41] quadprog_1.5-8  utf8_1.1.4      stringi_1.5.3   cachem_1.0.7