3 Identify cell populations
In this section, we will analyse further the data and try to identify sub-populations of cells. The objective is to create clusters of cells, i.e to make groups of cells that share similar expression profile. The main steps are as follow :
- Select a subset of genes to perfom the downstream analyses
- Perform a dimension reduction
- Cluster the cells
3.1 Select highly variables (hvg)
The downstream analyses (dimension reduction and then clustering) will be performed on the subset of genes. The aims are to exclude genes with minor biological relevance and decrease the computational load.
# Find Variable Genes :
mydata_filtrd <- FindVariableFeatures(mydata_filtrd,
selection.method = "vst",
nfeatures = 2000,
verbose = FALSE)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(mydata_filtrd), 10)
# you can also access the complete list using :
list_hvg = mydata_filtrd@assays$RNA@var.features
3.2 Dimension reduction
Before performing a dimension reduction technique, it is usually advised to scale the data.
3.2.1 Scaling
Expression of each gene is scaled (centered around its mean with a unit standard deviation). This step gives equal weight to all genes while performing the dimension reduction, and insures that the most expressed genes do not dominate.
## Centering and scaling data matrix
The scaled data are stored in the scaled
slot.
By default, scaling is only performed on the variable genes.
3.2.2 PCA
Even though, the gene selection step reduces the data considerably, there is a further need to reduce dimension to keep the most important variability in the data. There are several methods for dimension reduction, the most commonly used is PCA, but you can also check ICA or more single-cell specific methods that tries to cope with over-dispersion of the data such as ZIFA and pCMF.
## PC_ 1
## Positive: nos, spz, del, CG8507, stai, CG14545, Sesn, shu, CG17698, CG6287
## CanB2, BigH1, pgc, 26-29-p, stil, PyK, Eno, CG7530, krimp, aret
## CG17270, Taf12L, Cen, fabp, CG31689, Gapdh1, exu, CG17658, CG6967, CG9917
## Negative: CG4440, CG10035, Bacc, CG13427, Df31, Tom, bl, fax, CG15628, BobA
## hth, ed, CG5059, ci, tna, Gp150, sqd, sdt, l(3)neo38, N
## Ppa, grh, vfl, Dl, roX1, wech, HmgD, CR43887, ASPP, Brd
## PC_ 2
## Positive: twi, ventrally-expressed-protein-D, CG4500, Cyp310a1, sna, ltl, Ilp4, CG12177, CG16758, Mes2
## stumps, if, htl, CG14688, CadN, zfh1, Mdr49, CG3036, Pka-C3, tin
## sty, Asph, sprt, Inx3, CR45361, Mef2, Nplp2, be, Act87E, Ndae1
## Negative: Lac, ASPP, Ptr, cno, grh, cv-2, aop, jbug, CG2162, btsz
## sdt, crb, mew, Tom, fra, CG34383, 5-HT2A, CG34224, Doc1, CG45263
## blot, hbs, Mipp1, ci, CG10176, pnr, wun2, dpp, bib, CG42788
## PC_ 3
## Positive: CG8147, srp, ps, Lapsyn, fkh, peb, CG2930, hkb, Doc3, Oatp74D
## CG15236, MRE23, exex, Gmap, Doc2, CG32053, Fas2, DNaseII, Ptx1, egr
## CG31431, ImpE2, kek1, mspo, tup, Doc1, sas, ush, CG18754, Pdp1
## Negative: dan, SoxN, sca, ths, CG6398, CG5059, bl, lea, ImpL2, Imp
## Pino, rdx, neur, RnrS, Meltrin, sbb, brk, Shroom, noc, pyr
## mid, sog, danr, Toll-6, wb, slp1, lok, pico, Atg18a, path
## PC_ 4
## Positive: aay, apt, Cys, sog, CG2930, phyl, mnd, fkh, CG32053, bib
## MRE23, DNaseII, brk, pDsRed, RpL10Ab, Gmap, ps, CG8654, Sox21a, CG18754
## a, CG13427, CG31431, Lac, SoxN, exex, Pdp1, Ocho, vnd, Obp99a
## Negative: net, Ama, Nrt, CrebA, CycE, emc, mirr, zfh1, rst, chrb
## dap, how, pAbp, Alk, hbs, Glut4EF, CG45263, wake, kay, SNCF
## Dtg, CG14427, zen, spen, if, CadN, chn, ptc, vfl, shep
## PC_ 5
## Positive: SNCF, zen, zen2, Ama, C15, Z600, CG13653, rho, net, kay
## Alk, CG8960, opa, alphagamma-element:CR32865, CG14111, Dtg, CG13454, CG13654, wntD, toc
## Ance, rst, ush, dap, Ect4, CG14915, CR44953, egr, Doc3, E(spl)m5-HLH
## Negative: dm, Bsg, Tet, spoon, Dl, CG12535, Trf2, CG9821, mt:CoI, apt
## sm, Hrb27C, pAbp, Ssdp, CG34380, mt:ND1, Oatp74D, fs(1)h, Smr, roX1
## sqd, pum, msi, CR45874, CG45050, CG11138, AGO1, CG43736, east, RapGAP1
3.2.2.1 Explore the PCA results
It is worth analysing the genes that contribute the most to each axes. It may help you to identify genes whose expression decrease/increase together. The DimHeatmap
function in Seurat helps you to visualize the genes that are driving the components and allows to get some insight about the heterogeneity of the data.
## PC_ 1
## Positive: nos, spz, del, CG8507, stai
## Negative: CG4440, CG10035, Bacc, CG13427, Df31
## PC_ 2
## Positive: twi, ventrally-expressed-protein-D, CG4500, Cyp310a1, sna
## Negative: Lac, ASPP, Ptr, cno, grh
## PC_ 3
## Positive: CG8147, srp, ps, Lapsyn, fkh
## Negative: dan, SoxN, sca, ths, CG6398
## PC_ 4
## Positive: aay, apt, Cys, sog, CG2930
## Negative: net, Ama, Nrt, CrebA, CycE
## PC_ 5
## Positive: SNCF, zen, zen2, Ama, C15
## Negative: dm, Bsg, Tet, spoon, Dl
3.2.2.2 Choose the number of axes
After PCA, you need to decide how many components you want to keep for UMAP and/or clustering. Seurat proposes a statistical methods with the function JackStraw
. Alternatively you can plot the explained variance using the ElbowPlot
. In practical case, you can keep around 20-50 principal components.
3.2.3 Cluster cells
There are two main steps to cluster the cells:
- Make a shared Nearest Neighbor graph:
- Identify the k-nearest neighbours (k-nn) of each cell (by default Seurat takes \(k=20\) neighbours).
- The distance between cells is calculated using the coordinates obtained with the PCA.
- Make clusters (community of cells) using Leiden algorithm.
- The number of clusters depends on the resolution parameter chosen: the higher the value of the parameter is, the more groups you will get.
nPC = 30 # number of PC kepts for the analysis
# k-nn graoh
mydata_filtrd <- FindNeighbors(mydata_filtrd, dims = 1:nPC)
## Computing nearest neighbor graph
## Computing SNN
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 399
## Number of edges: 15473
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7319
## Number of communities: 4
## Elapsed time: 0 seconds
Usually the clustering results are presented using UMAP which visually separates the group more nicely than PCA.
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 01:44:16 UMAP embedding parameters a = 0.9922 b = 1.112
## 01:44:16 Read 399 rows and found 30 numeric columns
## 01:44:16 Using Annoy for neighbor search, n_neighbors = 30
## 01:44:16 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 01:44:16 Writing NN index file to temp file /tmp/RtmpZPRGzl/file1cbdf53be796ad
## 01:44:16 Searching Annoy index using 1 thread, search_k = 3000
## 01:44:16 Annoy recall = 100%
## 01:44:17 Commencing smooth kNN distance calibration using 1 thread
## 01:44:17 Initializing from normalized Laplacian + noise
## 01:44:18 Commencing optimization for 500 epochs, with 14684 positive edges
## 01:44:19 Optimization finished
3.2.4 Group exercise
Try to play with the number of PCs and the clustering parameters. How could you decide which set of parameters is the best ?
3.2.5 Understand your clusters
Hurray, you succeed to cluster your cells ! Now, your biological work starts. You need to annotate the clusters, and check if your results make sense. One way is to look at genes that are differentially expressed.
# find all markers of cluster 1
cluster1.markers <- FindMarkers(mydata_filtrd, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)
## p_val avg_logFC pct.1 pct.2 p_val_adj
## Mes2 2.017839e-57 1.755482 1.000 0.264 2.524519e-53
## stumps 2.291799e-55 1.504331 0.825 0.068 2.867270e-51
## sna 1.130465e-54 1.918047 0.913 0.169 1.414324e-50
## twi 4.368472e-54 2.240741 0.932 0.226 5.465395e-50
## zfh1 6.612339e-53 2.117815 0.854 0.111 8.272697e-49
You can also check if some genes of interest are specifically expressed in some clusters.