Chapter 9 Downloading the sequences from GenBank

Now that we have identified the sequences for our tree, we need to download those sequences from GenBank into R. One option is to download the sequences directly from GenBank as a fasta file. If you are interested in this option, here is a good tutorial on how to do it. This will work and the subsequent fasta file can be uploaded into R.

9.1 The DNA.bin object

However, the library ape has a command that allows us to download sequences from GenBank directly into R and store the sequences as a DNA.bin object. This is a data structure that stores information like DNA sequence, how long each sequence is, information about the species identification of each sequence, and the total base percentages of all the sequences.

The command we’re using is read.GenBank, which takes an argument of the accession number we want to download from GenBank. Since we want to download multiple sequences, We use c(““) to concatenate a string of accession numbers that read.Genbank will interpret.

library(ape) #if you haven't previously loaded ape

read.GenBank(c("JX915632","EF105403.1","DQ073553.1",
      "FJ481575.1","EF204545.1","AJ314771.1","FJ481569.1",
      "DQ073533.1","AY804128.1","AY303125.2","KF887414.1",
      "D82941.1","JX276655.1"))

## 13 DNA sequences in binary format stored in a list.
## 
## Mean sequence length: 1814.385 
##    Shortest sequence: 1476 
##     Longest sequence: 2490 
## 
## Labels:
## JX915632
## EF105403.1
## DQ073553.1
## FJ481575.1
## EF204545.1
## AJ314771.1
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.310 0.303 0.271 0.115 
## (Total: 23.59 kb)

Now that you have seen what read.Genbank does, we will save it as an object, and also specify that we want the sequences in ATGC form. When as.character=TRUE is not included (like above), read.GenBank saves all the sequence data in a binary format. Binary is great for computers, but harder for humans to quickly interpret.

grass <- read.GenBank(c("JX915632","EF105403.1","DQ073553.1",
      "FJ481575.1","EF204545.1","AJ314771.1","FJ481569.1",
      "DQ073533.1","AY804128.1","AY303125.2","KF887414.1",
      "D82941.1","JX276655.1"))

R BASICS

In R, you can do two things with the output of a command. First, you can have the output displayed immediately. This is what you did in the first block of R code above. This can be really helpful if you want to immediately see what your command did, but it’s less helpful if you want to do something with the output. In the first block of code, we managed to download sequences from GenBank and print them to the screen, but we don’t have a way to build trees from the printed screen. (Print in this case refers to the display you see on the R console.)

Second, you can tell R to save the output as an object. This is what we did second block of code with grass <- read.GenBank. The <- operator tells R to redirect the output from read.GenBank to an object (or data structure) named grass. Everything you saw printed on the screen from the first block of code is now saved to grass.

If we ever want to see what objects we have saved in our R session, we can do so by typing the command

ls()

This tells R to list objects. We can see what each object contains by either typing the name of the object or by using the str (structure) command. The syntax of the structure command is

str(object_name)

9.2 The fasta format

While ape and related R packages have no difficulty interpreting a DNA.bin object, other programs need the data in the fasta format. Fasta is a really common format for saving bioinformatic data (probably the most common format used!).

The format itself is quite simple and consists of two lines. The first line is the header line, which starts with >, immediately followed by a unique identifier. The sequence itself is on the second line. The sequence can either the standard IUPAC nucleic acid or amino acid codes, with additional characters for gaps (-), uracil (U), or translation stops (*).

The first 60 nucleotides from the Triticum aestivium sequence from above might look like this in fasta format:

JX915632_Triticum_aestivium atggctaagcggctggtcctctttgcagcagtagccgtcgccctcgtggctctcaccgcc

We can convert and save our DNA.bin object in fasta format using a tool from the ape package. The write.dna commands takes three arguments: the first argument tells R the DNA.bin file to use, the second argument says what to name the new file, and the third argument says what format to use for the new file.

write.dna( grass, file = 'grass.fasta', format = 'fasta' )

If you check your RStudio files (on the lower left side of the screen, you’ll see a tab named Files), you should see a newly-created file called “grass.fasta”. If you open it up, the file will look something like this:

Major point!! example image

Going forward, we will use both the DNA.bin object and the fasta file.

R BASICS

You might have noticed that we didn’t save the output of write.dna to an object. That’s because any of the write commands are automatically saving the output to a file on your computer (or, on AnVIL, to your persistent disk). The output is saved in what as known as your working directory. You can check what your current working directory is during any R session with the command

getwd()

sessionInfo()

## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ape_5.7-1
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.8       utf8_1.2.4       generics_0.1.3   xml2_1.3.6      
##  [5] lattice_0.21-9   stringi_1.8.3    hms_1.1.3        digest_0.6.34   
##  [9] magrittr_2.0.3   grid_4.3.2       evaluate_0.23    timechange_0.3.0
## [13] bookdown_0.41    fastmap_1.1.1    rprojroot_2.0.4  jsonlite_1.8.8  
## [17] processx_3.8.3   chromote_0.3.1   ps_1.7.6         promises_1.2.1  
## [21] httr_1.4.7       fansi_1.0.6      ottrpal_1.3.0    jquerylib_0.1.4 
## [25] cli_3.6.2        rlang_1.1.4      cachem_1.0.8     yaml_2.3.8      
## [29] parallel_4.3.2   tools_4.3.2      tzdb_0.4.0       dplyr_1.1.4     
## [33] vctrs_0.6.5      R6_2.5.1         lifecycle_1.0.4  lubridate_1.9.3 
## [37] snakecase_0.11.1 stringr_1.5.1    janitor_2.2.0    pkgconfig_2.0.3 
## [41] pillar_1.9.0     bslib_0.6.1      later_1.3.2      glue_1.7.0      
## [45] Rcpp_1.0.12      xfun_0.48        tibble_3.2.1     tidyselect_1.2.0
## [49] knitr_1.48       nlme_3.1-164     htmltools_0.5.7  websocket_1.4.2 
## [53] rmarkdown_2.25   webshot2_0.1.1   readr_2.1.5      compiler_4.3.2  
## [57] askpass_1.2.0    openssl_2.1.1