This analysis has been adapted from this refine.bio-examples notebook
In this analysis, we will use this acute myeloid leukemia sample dataset from @Shih2017 and pre-processed by @refinebio.
The data that we downloaded from refine.bio for this analysis has 19 samples (obtained from 19 mice with acute myeloid leukemia (AML)), containing RNA-sequencing results for types of AML under controlled treatment conditions.
This dataset can be downloaded from this page on refine.bio. 00-download-data.py
downloads it already processed and quantile normalized.
# Create the data folder if it doesn't exist
if (!dir.exists("data")) {
dir.create("data")
}
# Define the file path to the plots directory
plots_dir <- "plots"
# Create the plots folder if it doesn't exist
if (!dir.exists(plots_dir)) {
dir.create(plots_dir)
}
# Define the file path to the results directory
results_dir <- "results"
# Create the results folder if it doesn't exist
if (!dir.exists(results_dir)) {
dir.create(results_dir)
}
# Define the file path to the data directory
data_dir <- file.path("data", "SRP070849")
# Declare the file path to the gene expression matrix file
data_file <- file.path(data_dir, "SRP070849.tsv")
# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
metadata_file <- file.path(data_dir, "metadata_SRP070849.tsv")
We will use pheatmap
[@Slowikowski2017] for clustering and creating a heatmap.
if (!("pheatmap" %in% installed.packages())) {
# Install pheatmap
install.packages("pheatmap", update = FALSE)
}
Attach the pheatmap
and magrittr
libraries:
# Attach the `pheatmap` library
library(pheatmap)
package ‘pheatmap’ was built under R version 4.0.3
# We will need this so we can use the pipe: %>%
library(magrittr)
# Set the seed so our results are reproducible:
set.seed(12345)
This chunk of code will read in both TSV files and add them as data frames to your environment.
# Read in metadata TSV file
metadata <- readr::read_tsv(metadata_file)
── Column specification ─────────────────────────────────────────────────────────────────────────
cols(
.default = col_character(),
refinebio_age = col_logical(),
refinebio_cell_line = col_logical(),
refinebio_compound = col_logical(),
refinebio_disease = col_logical(),
refinebio_disease_stage = col_logical(),
refinebio_genetic_information = col_logical(),
refinebio_processed = col_logical(),
refinebio_processor_id = col_double(),
refinebio_race = col_logical(),
refinebio_sex = col_logical(),
refinebio_source_archive_url = col_logical(),
refinebio_time = col_logical()
)
ℹ Use `spec()` for the full column specifications.
# Read in data TSV file
expression_df <- readr::read_tsv(data_file) %>%
# Here we are going to store the gene IDs as row names so that
# we can have only numeric values to perform calculations on later
tibble::column_to_rownames("Gene")
── Column specification ─────────────────────────────────────────────────────────────────────────
cols(
.default = col_double(),
Gene = col_character()
)
ℹ Use `spec()` for the full column specifications.
Let’s take a look at the metadata object that we read into the R environment.
head(metadata)
Now let’s ensure that the metadata and data are in the same sample order.
# Make the data in the order of the metadata
expression_df <- expression_df %>%
dplyr::select(metadata$refinebio_accession_code)
# Check if this is in the same order
all.equal(colnames(expression_df), metadata$refinebio_accession_code)
[1] TRUE
Now we are going to use the pheatmap
package to look at how are samples and genes are clustering.
For this example, we will sort genes by variance and select genes in the upper quartile, but there are many alternative criterion by which you may want to sort your genes, e.g. fold change, t-statistic, membership in a particular gene ontology, so on.
# Calculate the variance for each gene
variances <- apply(expression_df, 1, var)
# Determine the upper quartile variance cutoff value
upper_var <- quantile(variances, 0.75)
# Filter the data choosing only genes whose variances are in the upper quartile
df_by_var <- data.frame(expression_df) %>%
dplyr::filter(variances > upper_var)
Let’s save these to our results folder as a TSV file.
readr::write_tsv(df_by_var, file.path(results_dir, "top_90_var_genes.tsv"))
From the accompanying paper, we know that the mice with IDH2
mutant AML were treated with vehicle or AG-221 (the first small molecule in-vivo inhibitor of IDH2 to enter clinical trials) and the mice with TET2
mutant AML were treated with vehicle or 5-Azacytidine (Decitabine, hypomethylating agent). [@Shih2017]
# Let's prepare the annotation for the uncollapsed `DESeqData` set object
# which will be used to annotate the heatmap
annotation_df <- metadata %>%
# Create a variable to store the cancer type information
dplyr::mutate(
mutation = dplyr::case_when(
startsWith(refinebio_title, "TET2") ~ "TET2",
startsWith(refinebio_title, "IDH2") ~ "IDH2",
startsWith(refinebio_title, "WT") ~ "WT",
# If none of the above criteria are satisfied,
# we mark the `mutation` variable as "unknown"
TRUE ~ "unknown"
)
) %>%
# select only the columns we need for annotation
dplyr::select(
refinebio_accession_code,
mutation,
refinebio_treatment
) %>%
# The `pheatmap()` function requires that the row names of our annotation
# data frame match the column names of our `DESeaDataSet` object
tibble::column_to_rownames("refinebio_accession_code")
# Create and store the annotated heatmap object
heatmap_annotated <-
pheatmap(
df_by_var,
cluster_rows = TRUE,
cluster_cols = TRUE,
show_rownames = FALSE,
annotation_col = annotation_df, # Specify our annotation here
main = "Annotated Heatmap",
colorRampPalette(c(
"deepskyblue",
"black",
"yellow"
))(25
),
scale = "row" # Scale values in the direction of genes (rows)
)
You can switch this to save to a JPEG or TIFF by changing the function and file name within the function to the respective file suffix.
# Open a PNG file
png(file.path(
plots_dir,
"aml_heatmap.png" # Replace with a relevant file name
))
# Print the heatmap
heatmap_annotated
# Close the PNG file:
dev.off()
null device
1
# Print session info
sessioninfo::session_info()
─ Session info ────────────────────────────────────────────────────────────────────────────────
─ Packages ────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.3)
cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.0)
colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.3)
evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.3)
fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0)
farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.3)
fs 1.5.0 2020-07-31 [1] RSPM (R 4.0.3)
generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.3)
gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.3)
hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
knitr 1.33 2021-09-29 [1] Github (yihui/knitr@a1052d1)
lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2)
leanbuild 0.1.2 2021-09-29 [1] Github (jhudsl/leanbuild@dc8f933)
lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.2)
magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2)
munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.3)
pheatmap * 1.0.12 2019-01-04 [1] RSPM (R 4.0.3)
pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3)
purrr 0.3.4 2020-04-17 [1] RSPM (R 4.0.3)
R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
RColorBrewer 1.1-2 2014-12-07 [1] RSPM (R 4.0.3)
readr 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
rlang 0.4.10 2021-09-29 [1] Github (r-lib/rlang@f0c9be5)
rmarkdown 2.10 2021-09-29 [1] Github (rstudio/rmarkdown@02d3c25)
rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.3)
sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3)
tibble 3.0.3 2020-07-10 [1] RSPM (R 4.0.2)
tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.3)
tinytex 0.26 2020-09-22 [1] RSPM (R 4.0.2)
vctrs 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
xfun 0.26 2021-09-29 [1] Github (yihui/xfun@74c2a66)
yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3)
[1] /usr/local/lib/R/site-library
[2] /usr/local/lib/R/library