Chapter 8 Cell type annotation
Unless you are already an expert in neuronal cell expression, you probably didn’t know Gad1 and Slc6a1 are known interneuron markers until you were told. Luckily, we have several high-quality references databases that can be used for annotating scRNA-seq datasets. Some of these references are useful for identifying cell types (the Zeisel dataset we have been using is a well-known reference for identifying neuronal cell types). Others, such as the Gene Ontology (GO) or Kyoto Encyclopedia of Genes and Genomes (KEGG) collections, are useful for identifying biological processes associated with each cluster.
8.1 Annotating cell types
There are three basic strategies for annotating datasets: match the expression profile of each individual cell to the expression profile of cells from a reference dataset (the reference dataset approach); identify sets of marker genes highly expressed in each cell and match to gene sets from known cell types (the gene set approach); or perform a gene-set enrichment analysis on the marker genes that define each cluster.
The Zeisel dataset we have been working with has actually been annotated all this time.
# calculate the top marker genes assigned to each cell type (level1class) in the Zeisel dataset
<- pairwiseWilcox(sce.zeisel.qc, sce.zeisel.qc$level1class,
wilcox.z lfc = 1, direction = "up")
<- getTopMarkers(wilcox.z$statistics, wilcox.z$pairs,
markers.z pairwise = FALSE, n = 50)
# look at how many cell-type categories there are, as well as how many cells assigned to each category
lengths(markers.z)
## astrocytes_ependymal endothelial-mural interneurons
## 78 83 119
## microglia oligodendrocytes pyramidal CA1
## 69 80 124
## pyramidal SS
## 147
This dataset has grouped the individual cells into 7 different categories of neuronal subtypes. We can use the gene sets that define these categories to annotate a second brain scRNA-seq dataset from Tasic et al. (2016). (We will not go through all the data cleaning and clustering steps for this new dataset - having made it this far, we trust you can do that!)
#load the new dataset
<- TasicBrainData() sce.tasic
We first create the gene set lists using the GSEABase
package. The AUCell
package identifies and ranks marker sets highly expressed in each cell using an area under the curve (AUC) approach. We can assign cell type identity in the Tasic dataset by taking the marker set with the highest AUC as the label for the cell.
# AnVIL::install("GSEABase")
library(GSEABase)
#create a dataset that contains just the information about the Zeisel cell-type categories and the marker genes that define each cell-type
<- lapply(names(markers.z), function(x) {
all.sets GeneSet(markers.z[[x]], setName = x)
})<- GeneSetCollection(all.sets)
all.sets
# AnVIL::install("AUCell")
library(AUCell)
# rank genes by expression values within each cell
<- AUCell_buildRankings(counts(sce.tasic),
rankings plotStats = FALSE, verbose = FALSE)
# calculate AUC for each previously-defined marker set (from Zeisel) in the Tasic data
<- AUCell_calcAUC(all.sets, rankings)
cell.aucs <- t(assay(cell.aucs)) results
After assigning cell type identities to each cluster, a researcher should always verify that the identities make sense. Since the Tasic dataset has also been annotated, we can compare our annotation to the researcher-provided annotation as a sort of sanity check.
# assign cell type identity in the Tasic dataset by assumig the marker set with the top AUC is the proper label
<- colnames(results)[max.col(results)]
new.labels
# compare our annotations to the annotations provided by the Tasic dataset
<- table(new.labels, sce.tasic$broad_type)
tab tab
##
## new.labels Astrocyte Endothelial Cell GABA-ergic Neuron
## astrocytes_ependymal 43 2 0
## endothelial-mural 0 27 0
## interneurons 0 0 759
## microglia 0 0 0
## oligodendrocytes 0 0 1
## pyramidal SS 0 0 1
##
## new.labels Glutamatergic Neuron Microglia Oligodendrocyte
## astrocytes_ependymal 0 0 0
## endothelial-mural 0 0 0
## interneurons 2 0 0
## microglia 0 22 0
## oligodendrocytes 0 0 38
## pyramidal SS 810 0 0
##
## new.labels Oligodendrocyte Precursor Cell Unclassified
## astrocytes_ependymal 19 4
## endothelial-mural 0 2
## interneurons 0 15
## microglia 0 1
## oligodendrocytes 3 0
## pyramidal SS 0 60
As you can see, the labels applied by the researchers and by our annotation mostly match. (Pyramidal SS nerves are primarily glutamatergic, so although the categories are labeled differently, those two categories do indeed match!) The major exception is the oligodendrocyte precursor cells, which our annotation called astrocytes. However, this mismatch isn’t as concerning as you might think, once you know that both astrocytes and oligodendrocytes come from the same precursor cell lineage.
QUESTION 1. Should we be concerned when 1 or 2 cells are assigned a different cell type identity by our annotation than by the researchers? Why or why not?
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] AUCell_1.16.0 GSEABase_1.56.0
## [3] graph_1.72.0 annotate_1.72.0
## [5] XML_3.99-0.9 AnnotationDbi_1.56.2
## [7] pheatmap_1.0.12 BiocSingular_1.10.0
## [9] scran_1.22.1 scater_1.22.0
## [11] ggplot2_3.3.5 scuttle_1.4.0
## [13] scRNAseq_2.8.0 SingleCellExperiment_1.16.0
## [15] SummarizedExperiment_1.24.0 Biobase_2.54.0
## [17] GenomicRanges_1.46.1 GenomeInfoDb_1.30.1
## [19] IRanges_2.28.0 S4Vectors_0.32.4
## [21] BiocGenerics_0.40.0 MatrixGenerics_1.6.0
## [23] matrixStats_0.61.0
##
## loaded via a namespace (and not attached):
## [1] AnnotationHub_3.2.2 BiocFileCache_2.2.1
## [3] igraph_1.3.1 lazyeval_0.2.2
## [5] BiocParallel_1.28.3 digest_0.6.29
## [7] ensembldb_2.18.4 htmltools_0.5.2
## [9] viridis_0.6.2 fansi_1.0.3
## [11] magrittr_2.0.3 memoise_2.0.1
## [13] ScaledMatrix_1.2.0 cluster_2.1.2
## [15] limma_3.50.3 Biostrings_2.62.0
## [17] R.utils_2.12.2 prettyunits_1.1.1
## [19] colorspace_2.0-3 blob_1.2.3
## [21] rappdirs_0.3.3 ggrepel_0.9.1
## [23] xfun_0.26 dplyr_1.0.8
## [25] crayon_1.5.1 RCurl_1.98-1.6
## [27] jsonlite_1.8.0 glue_1.6.2
## [29] gtable_0.3.0 zlibbioc_1.40.0
## [31] XVector_0.34.0 DelayedArray_0.20.0
## [33] scales_1.2.1 DBI_1.1.2
## [35] edgeR_3.36.0 Rcpp_1.0.8.3
## [37] viridisLite_0.4.0 xtable_1.8-4
## [39] progress_1.2.2 dqrng_0.3.0
## [41] bit_4.0.4 rsvd_1.0.5
## [43] metapod_1.2.0 httr_1.4.2
## [45] RColorBrewer_1.1-3 ellipsis_0.3.2
## [47] R.methodsS3_1.8.1 farver_2.1.0
## [49] pkgconfig_2.0.3 sass_0.4.1
## [51] dbplyr_2.1.1 locfit_1.5-9.5
## [53] utf8_1.2.2 tidyselect_1.1.2
## [55] rlang_1.0.2 later_1.3.0
## [57] munsell_0.5.0 BiocVersion_3.14.0
## [59] tools_4.1.3 cachem_1.0.6
## [61] cli_3.2.0 generics_0.1.2
## [63] RSQLite_2.2.12 ExperimentHub_2.2.1
## [65] evaluate_0.15 stringr_1.4.0
## [67] fastmap_1.1.0 yaml_2.3.5
## [69] knitr_1.33 bit64_4.0.5
## [71] purrr_0.3.4 KEGGREST_1.34.0
## [73] AnnotationFilter_1.18.0 sparseMatrixStats_1.6.0
## [75] mime_0.12 R.oo_1.24.0
## [77] xml2_1.3.3 biomaRt_2.50.3
## [79] compiler_4.1.3 beeswarm_0.4.0
## [81] filelock_1.0.2 curl_4.3.2
## [83] png_0.1-7 interactiveDisplayBase_1.32.0
## [85] statmod_1.4.36 tibble_3.1.6
## [87] bslib_0.3.1 stringi_1.7.6
## [89] highr_0.9 GenomicFeatures_1.46.5
## [91] lattice_0.20-45 bluster_1.4.0
## [93] ProtGenerics_1.26.0 Matrix_1.4-0
## [95] vctrs_0.4.1 pillar_1.7.0
## [97] lifecycle_1.0.1 BiocManager_1.30.16
## [99] jquerylib_0.1.4 BiocNeighbors_1.12.0
## [101] data.table_1.14.2 bitops_1.0-7
## [103] irlba_2.3.5 httpuv_1.6.5
## [105] rtracklayer_1.54.0 R6_2.5.1
## [107] BiocIO_1.4.0 bookdown_0.24
## [109] promises_1.2.0.1 gridExtra_2.3
## [111] vipor_0.4.5 assertthat_0.2.1
## [113] rjson_0.2.21 withr_2.5.0
## [115] GenomicAlignments_1.30.0 Rsamtools_2.10.0
## [117] GenomeInfoDbData_1.2.7 parallel_4.1.3
## [119] hms_1.1.1 grid_4.1.3
## [121] beachmat_2.10.0 rmarkdown_2.10
## [123] DelayedMatrixStats_1.16.0 Rtsne_0.16
## [125] shiny_1.7.1 ggbeeswarm_0.6.0
## [127] restfulr_0.0.13