Chapter 9 Downloading the sequences from GenBank
Now that we have identified the sequences for our tree, we need to download those sequences from GenBank into R. One option is to download the sequences directly from GenBank as a fasta file. If you are interested in this option, here is a good tutorial on how to do it. This will work and the subsequent fasta file can be uploaded into R.
9.1 The DNA.bin object
However, the library ape
has a command that allows us to download sequences from GenBank directly into R and store the sequences as a DNA.bin
object. This is a data structure that stores information like DNA sequence, how long each sequence is, information about the species identification of each sequence, and the total base percentages of all the sequences.
The command we’re using is read.GenBank
, which takes an argument of the accession number we want to download from GenBank. Since we want to download multiple sequences, We use c(““) to concatenate a string of accession numbers that read.Genbank will interpret.
library(ape) #if you haven't previously loaded ape
read.GenBank(c("JX915632","EF105403.1","DQ073553.1",
"FJ481575.1","EF204545.1","AJ314771.1","FJ481569.1",
"DQ073533.1","AY804128.1","AY303125.2","KF887414.1",
"D82941.1","JX276655.1"))
## 13 DNA sequences in binary format stored in a list.
##
## Mean sequence length: 1814.385
## Shortest sequence: 1476
## Longest sequence: 2490
##
## Labels:
## JX915632
## EF105403.1
## DQ073553.1
## FJ481575.1
## EF204545.1
## AJ314771.1
## ...
##
## Base composition:
## a c g t
## 0.310 0.303 0.271 0.115
## (Total: 23.59 kb)
Now that you have seen what read.Genbank does, we will
save it as an object, and also specify that we want the sequences in ATGC form. When as.character=TRUE
is not included (like above), read.GenBank
saves all the sequence data in a binary format. Binary is great for computers, but harder for humans to quickly interpret.
<- read.GenBank(c("JX915632","EF105403.1","DQ073553.1",
grass "FJ481575.1","EF204545.1","AJ314771.1","FJ481569.1",
"DQ073533.1","AY804128.1","AY303125.2","KF887414.1",
"D82941.1","JX276655.1"))
R BASICS
In R, you can do two things with the output of a command. First, you can have the output displayed immediately. This is what you did in the first block of R code above. This can be really helpful if you want to immediately see what your command did, but it’s less helpful if you want to do something with the output. In the first block of code, we managed to download sequences from GenBank and print them to the screen, but we don’t have a way to build trees from the printed screen. (Print in this case refers to the display you see on the R console.)
Second, you can tell R to save the output as an object. This is what we did second block of code with grass <- read.GenBank
. The <-
operator tells R to redirect the output from read.GenBank
to an object (or data structure) named grass
. Everything you saw printed on the screen from the first block of code is now saved to grass
.
If we ever want to see what objects we have saved in our R session, we can do so by typing the command
ls()
This tells R to list objects. We can see what each object contains by either typing the name of the object or by using the str
(structure) command. The syntax of the structure command is
str(object_name)
9.2 The fasta format
While ape
and related R packages have no difficulty interpreting a DNA.bin
object, other programs need the data in the fasta format. Fasta is a really common format for saving bioinformatic data (probably the most common format used!).
The format itself is quite simple and consists of two lines. The first line is the header line, which starts with >, immediately followed by a unique identifier. The sequence itself is on the second line. The sequence can either the standard IUPAC nucleic acid or amino acid codes, with additional characters for gaps (-), uracil (U), or translation stops (*).
The first 60 nucleotides from the Triticum aestivium sequence from above might look like this in fasta format:
JX915632_Triticum_aestivium atggctaagcggctggtcctctttgcagcagtagccgtcgccctcgtggctctcaccgcc
We can convert and save our DNA.bin
object in fasta format using a tool from the ape
package. The write.dna
commands takes three arguments: the first argument tells R the DNA.bin file to use, the second argument says what to name the new file, and the third argument says what format to use for the new file.
write.dna( grass, file = 'grass.fasta', format = 'fasta' )
If you check your RStudio files (on the lower left side of the screen, you’ll see a tab named Files), you should see a newly-created file called “grass.fasta”. If you open it up, the file will look something like this:
Going forward, we will use both the DNA.bin
object and the fasta file.
R BASICS
You might have noticed that we didn’t save the output of write.dna
to an object. That’s because any of the write
commands are automatically saving the output to a file on your computer (or, on AnVIL, to your persistent disk). The output is saved in what as known as your working directory. You can check what your current working directory is during any R session with the command
getwd()
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ape_5.4-1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.10 knitr_1.33 magrittr_2.0.3 hms_0.5.3
## [5] lattice_0.20-41 R6_2.4.1 rlang_1.1.0 fastmap_1.1.1
## [9] fansi_0.4.1 stringr_1.4.0 tools_4.0.2 parallel_4.0.2
## [13] grid_4.0.2 nlme_3.1-149 xfun_0.26 utf8_1.1.4
## [17] cli_3.6.1 jquerylib_0.1.4 htmltools_0.5.5 ottrpal_1.0.1
## [21] yaml_2.2.1 digest_0.6.25 tibble_3.2.1 lifecycle_1.0.3
## [25] bookdown_0.24 readr_1.4.0 sass_0.4.5 vctrs_0.6.1
## [29] fs_1.5.0 glue_1.4.2 cachem_1.0.7 evaluate_0.20
## [33] rmarkdown_2.10 stringi_1.5.3 pillar_1.9.0 compiler_4.0.2
## [37] bslib_0.4.2 jsonlite_1.7.1 pkgconfig_2.0.3