3 Identify cell populations

In this section, we will analyse further the data and try to identify sub-populations of cells. The objective is to create clusters of cells, i.e to make groups of cells that share similar expression profile. The main steps are as follow :

Select a subset of genes to perfom the downstream analyses
Perform a dimension reduction
Cluster the cells

3.1 Select highly variables (hvg)

The downstream analyses (dimension reduction and then clustering) will be performed on the subset of genes. The aims are to exclude genes with minor biological relevance and decrease the computational load.

# Find Variable Genes :
mydata_filtrd <- FindVariableFeatures(mydata_filtrd,
                                      selection.method = "vst",
                                      nfeatures = 2000,
                                      verbose = FALSE)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(mydata_filtrd), 10)

# you can also access the complete list using : 
list_hvg = mydata_filtrd@assays$RNA@var.features

3.2 Dimension reduction

Before performing a dimension reduction technique, it is usually advised to scale the data.

3.2.1 Scaling

Expression of each gene is scaled (centered around its mean with a unit standard deviation). This step gives equal weight to all genes while performing the dimension reduction, and insures that the most expressed genes do not dominate.

mydata_filtrd <- ScaleData(mydata_filtrd)

## Centering and scaling data matrix

The scaled data are stored in the scaled slot.

By default, scaling is only performed on the variable genes.

3.2.2 PCA

Even though, the gene selection step reduces the data considerably, there is a further need to reduce dimension to keep the most important variability in the data. There are several methods for dimension reduction, the most commonly used is PCA, but you can also check ICA or more single-cell specific methods that tries to cope with over-dispersion of the data such as ZIFA and pCMF.

mydata_filtrd <- RunPCA(mydata_filtrd, features = VariableFeatures(object = mydata_filtrd))

## PC_ 1 
## Positive:  nos, spz, del, CG8507, stai, CG14545, Sesn, shu, CG17698, CG6287 
##     CanB2, BigH1, pgc, 26-29-p, stil, PyK, Eno, CG7530, krimp, aret 
##     CG17270, Taf12L, Cen, fabp, CG31689, Gapdh1, exu, CG17658, CG6967, CG9917 
## Negative:  CG4440, CG10035, Bacc, CG13427, Df31, Tom, bl, fax, CG15628, BobA 
##     hth, ed, CG5059, ci, tna, Gp150, sqd, sdt, l(3)neo38, N 
##     Ppa, grh, vfl, Dl, roX1, wech, HmgD, CR43887, ASPP, Brd 
## PC_ 2 
## Positive:  twi, ventrally-expressed-protein-D, CG4500, Cyp310a1, sna, ltl, Ilp4, CG12177, CG16758, Mes2 
##     stumps, if, htl, CG14688, CadN, zfh1, Mdr49, CG3036, Pka-C3, tin 
##     sty, Asph, sprt, Inx3, CR45361, Mef2, Nplp2, be, Act87E, Ndae1 
## Negative:  Lac, ASPP, Ptr, cno, grh, cv-2, aop, jbug, CG2162, btsz 
##     sdt, crb, mew, Tom, fra, CG34383, 5-HT2A, CG34224, Doc1, CG45263 
##     blot, hbs, Mipp1, ci, CG10176, pnr, wun2, dpp, bib, CG42788 
## PC_ 3 
## Positive:  CG8147, srp, ps, Lapsyn, fkh, peb, CG2930, hkb, Doc3, Oatp74D 
##     CG15236, MRE23, exex, Gmap, Doc2, CG32053, Fas2, DNaseII, Ptx1, egr 
##     CG31431, ImpE2, kek1, mspo, tup, Doc1, sas, ush, CG18754, Pdp1 
## Negative:  dan, SoxN, sca, ths, CG6398, CG5059, bl, lea, ImpL2, Imp 
##     Pino, rdx, neur, RnrS, Meltrin, sbb, brk, Shroom, noc, pyr 
##     mid, sog, danr, Toll-6, wb, slp1, lok, pico, Atg18a, path 
## PC_ 4 
## Positive:  aay, apt, Cys, sog, CG2930, phyl, mnd, fkh, CG32053, bib 
##     MRE23, DNaseII, brk, pDsRed, RpL10Ab, Gmap, ps, CG8654, Sox21a, CG18754 
##     a, CG13427, CG31431, Lac, SoxN, exex, Pdp1, Ocho, vnd, Obp99a 
## Negative:  net, Ama, Nrt, CrebA, CycE, emc, mirr, zfh1, rst, chrb 
##     dap, how, pAbp, Alk, hbs, Glut4EF, CG45263, wake, kay, SNCF 
##     Dtg, CG14427, zen, spen, if, CadN, chn, ptc, vfl, shep 
## PC_ 5 
## Positive:  SNCF, zen, zen2, Ama, C15, Z600, CG13653, rho, net, kay 
##     Alk, CG8960, opa, alphagamma-element:CR32865, CG14111, Dtg, CG13454, CG13654, wntD, toc 
##     Ance, rst, ush, dap, Ect4, CG14915, CR44953, egr, Doc3, E(spl)m5-HLH 
## Negative:  dm, Bsg, Tet, spoon, Dl, CG12535, Trf2, CG9821, mt:CoI, apt 
##     sm, Hrb27C, pAbp, Ssdp, CG34380, mt:ND1, Oatp74D, fs(1)h, Smr, roX1 
##     sqd, pum, msi, CR45874, CG45050, CG11138, AGO1, CG43736, east, RapGAP1

3.2.2.1 Explore the PCA results

It is worth analysing the genes that contribute the most to each axes. It may help you to identify genes whose expression decrease/increase together. The DimHeatmap function in Seurat helps you to visualize the genes that are driving the components and allows to get some insight about the heterogeneity of the data.

# Examine and visualize PCA results :
print(mydata_filtrd[["pca"]], dims = 1:5, nfeatures = 5)

## PC_ 1 
## Positive:  nos, spz, del, CG8507, stai 
## Negative:  CG4440, CG10035, Bacc, CG13427, Df31 
## PC_ 2 
## Positive:  twi, ventrally-expressed-protein-D, CG4500, Cyp310a1, sna 
## Negative:  Lac, ASPP, Ptr, cno, grh 
## PC_ 3 
## Positive:  CG8147, srp, ps, Lapsyn, fkh 
## Negative:  dan, SoxN, sca, ths, CG6398 
## PC_ 4 
## Positive:  aay, apt, Cys, sog, CG2930 
## Negative:  net, Ama, Nrt, CrebA, CycE 
## PC_ 5 
## Positive:  SNCF, zen, zen2, Ama, C15 
## Negative:  dm, Bsg, Tet, spoon, Dl

VizDimLoadings(mydata_filtrd, dims = 1:2, reduction = "pca")

# Plot expression of gene driving principal component
DimHeatmap(mydata_filtrd, dims = 1:2)

3.2.2.2 Choose the number of axes

After PCA, you need to decide how many components you want to keep for UMAP and/or clustering. Seurat proposes a statistical methods with the function JackStraw. Alternatively you can plot the explained variance using the ElbowPlot. In practical case, you can keep around 20-50 principal components.

library(ggplot2)
DimPlot(mydata_filtrd, reduction = "pca")

ElbowPlot(mydata_filtrd)

3.2.3 Cluster cells

There are two main steps to cluster the cells:

Make a shared Nearest Neighbor graph:
- Identify the k-nearest neighbours (k-nn) of each cell (by default Seurat takes \(k=20\) neighbours).
- The distance between cells is calculated using the coordinates obtained with the PCA.
Make clusters (community of cells) using Leiden algorithm.
- The number of clusters depends on the resolution parameter chosen: the higher the value of the parameter is, the more groups you will get.

nPC = 30 # number of PC kepts for the analysis
# k-nn graoh
mydata_filtrd <- FindNeighbors(mydata_filtrd, dims = 1:nPC)

## Computing nearest neighbor graph

## Computing SNN

# make the clusters
mydata_filtrd <- FindClusters(mydata_filtrd, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 399
## Number of edges: 15473
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7319
## Number of communities: 4
## Elapsed time: 0 seconds

Usually the clustering results are presented using UMAP which visually separates the group more nicely than PCA.

mydata_filtrd <- RunUMAP(mydata_filtrd, dims = 1:nPC)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 01:44:16 UMAP embedding parameters a = 0.9922 b = 1.112

## 01:44:16 Read 399 rows and found 30 numeric columns

## 01:44:16 Using Annoy for neighbor search, n_neighbors = 30

## 01:44:16 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 01:44:16 Writing NN index file to temp file /tmp/RtmpZPRGzl/file1cbdf53be796ad
## 01:44:16 Searching Annoy index using 1 thread, search_k = 3000
## 01:44:16 Annoy recall = 100%
## 01:44:17 Commencing smooth kNN distance calibration using 1 thread
## 01:44:17 Initializing from normalized Laplacian + noise
## 01:44:18 Commencing optimization for 500 epochs, with 14684 positive edges
## 01:44:19 Optimization finished

DimPlot(mydata_filtrd, reduction = "umap")

3.2.4 Group exercise

Try to play with the number of PCs and the clustering parameters. How could you decide which set of parameters is the best ?

3.2.5 Understand your clusters

Hurray, you succeed to cluster your cells ! Now, your biological work starts. You need to annotate the clusters, and check if your results make sense. One way is to look at genes that are differentially expressed.

# find all markers of cluster 1
cluster1.markers <- FindMarkers(mydata_filtrd, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)

##               p_val avg_logFC pct.1 pct.2    p_val_adj
## Mes2   2.017839e-57  1.755482 1.000 0.264 2.524519e-53
## stumps 2.291799e-55  1.504331 0.825 0.068 2.867270e-51
## sna    1.130465e-54  1.918047 0.913 0.169 1.414324e-50
## twi    4.368472e-54  2.240741 0.932 0.226 5.465395e-50
## zfh1   6.612339e-53  2.117815 0.854 0.111 8.272697e-49

You can also check if some genes of interest are specifically expressed in some clusters.

VlnPlot(mydata_filtrd, features = c("Mes2", "sna"))