To circumvent this problem, we are going to use the PCA as a dimension reduction method.
<div class="pencadre">
Use the `prcomp()` function to compute `data_pca` from the 600 most variable genes.
You can check the results with the following code; the `cell_annotation` variable contains the cell type label of each cell in the dataset:
```{r, eval=F}
fviz_pca_ind(
  data_pca,
  geom = "point",
  col.ind = as.factor(cell_annotation)
)
```
</div>
```{r, eval=F}
data_hclust %>% plot()
```
</p>
</details>
Too much information can drown the information; the `cutree()` function can help you solve this problem.
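For example, a minimal sketch (the `data_cutree` name is just for illustration, assuming the `data_hclust` object computed above):

```{r, eval=F}
# cut the dendrogram into k = 9 groups
data_cutree <- cutree(data_hclust, k = 9)
# count how many cells fall in each group
table(data_cutree)
```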
<div class="red_pencadre">
Which choice of `k` would you take?
</div>
<details><summary>Solution</summary>
<p>
```{r, eval=F}
data_pca %>%
  fviz_pca_ind(
    geom = "point",
    col.ind = as.factor(cutree(data_hclust, k = 9))
  )
```
</p>
</details>
The adjusted Rand index can be computed to compare two classifications. This index has an expected value of zero in the case of random partitions, and it is bounded above by 1 in the case of perfect agreement between the two partitions.
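As a toy illustration (hypothetical labels; we assume the `adjustedRandIndex()` function comes from the `mclust` package):

```{r, eval=F}
library(mclust)
labels_a <- c(1, 1, 2, 2, 3, 3)
labels_b <- c("x", "x", "y", "y", "z", "y")
# the two partitions mostly agree: positive index
adjustedRandIndex(labels_a, labels_b)
# a random permutation of the labels: index close to 0 on average
adjustedRandIndex(labels_a, sample(labels_a))
```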
<div class="pencadre">
Use the `adjustedRandIndex()` function to compare the `cell_annotation` to your hierarchical clustering.
</div>
```{r}
data_kmeans <- data_pca$x[, 1:3] %>%
  kmeans(centers = 9)
```
<div class="pencadre">
Use the `str()` function to explore the `data_kmeans` object, then make the following plot from your k-means results.
</div>
```{r, echo = F}
data_pca %>%
  fviz_pca_ind(
    geom = "point",
    col.ind = as.factor(data_kmeans$cluster)
  )
```
Maybe the real number of clusters in the PCs data is not $k=9$. We can use different metrics to choose the number of clusters:

- The elbow method plots the total within-cluster sum of squares (WSS) as a function of the number of clusters: the $k$ where the curve bends (the "elbow") is a good candidate.
- The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation): $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$, with $a(i)$ the mean distance between $i$ and the cells in its own cluster and $b(i)$ the smallest mean distance between $i$ and the cells of any other cluster. We plot the average silhouette $\frac{1}{n}\sum_{i=1}^n s(i)$.
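As an illustration, a minimal sketch of the average silhouette computation with the `cluster` package (assuming the `data_kmeans` result from above):

```{r, eval=F}
library(cluster)
# silhouette() takes the cluster labels and the distance matrix
sil <- silhouette(data_kmeans$cluster, dist(data_pca$x[, 1:3]))
# average silhouette width over all cells
mean(sil[, "sil_width"])
```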
<div class="pencadre">
Use the `fviz_nbclust()` function to plot these two metrics as a function of the number of clusters.
</div>
<details><summary>Solution</summary>
<p>
```{r}
fviz_nbclust(data_pca$x[, 1:3], kmeans, method = "wss")
fviz_nbclust(data_pca$x[, 1:3], kmeans, method = "silhouette")
fviz_nbclust(data_pca$x[, 1:3], hcut, method = "silhouette")
```
</p>
</details>
<div class="red_pencadre">
Explain the discrepancy between these results and $k=9$.
</div>
## Graph-based clustering
We are going to use the `cluster_louvain()` function to perform graph-based clustering.
This function takes an undirected graph as input instead of a distance matrix.
The `nng()` function computes a $k$-nearest neighbor graph. With the `mutual = T` option, an edge is kept only when the two cells are each among the other's nearest neighbors, which makes the graph undirected.
<div class="pencadre">
Check the effect of the `mutual = T` option on `data_knn` with the following code:
```{r, echo=F}
data_knn <- data_dist %>%
as.matrix() %>%
nng(k = 30, mutual = T)
```
</div>
<details><summary>Solution</summary>
<p>
```{r, eval=F}
# mutual = T keeps an edge only between mutual nearest neighbors: undirected graph
data_knn <- data_dist %>%
  as.matrix() %>%
  nng(k = 30, mutual = T)
# mutual = F links every cell to its 30 nearest neighbors: directed graph
data_knn_F <- data_dist %>%
  as.matrix() %>%
  nng(k = 30, mutual = F)
# compare the two graph structures (number and direction of edges)
str(data_knn)
str(data_knn_F)
```
</p>
</details>
<div class="red_pencadre">
Why do we need a $k$-NN graph?
</div>
The `cluster_louvain()` function implements the multi-level modularity optimization algorithm for finding community structure in a graph. Use this function on `data_knn` to create a `data_louvain` variable.
You can check the clustering results with `membership(data_louvain)`.
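To see how the `resolution` parameter changes the clustering, you can sweep over a few values; this loop is just an illustrative sketch (the Louvain algorithm is stochastic, so the counts may vary between runs):

```{r, eval=F}
# count the number of communities found at a few resolution values
for (res in c(0.1, 0.25, 0.5, 1, 2)) {
  n_clusters <- data_knn %>%
    cluster_louvain(resolution = res) %>%
    membership() %>%
    unique() %>%
    length()
  cat("resolution =", res, "->", n_clusters, "clusters\n")
}
```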
<div class="pencadre">
For which `resolution` value do you get 9 clusters?
</div>
<details><summary>Solution</summary>
<p>
```{r}
data_louvain <- data_knn %>%
cluster_louvain(resolution = 0.41)
```
</p>
</details>
```{r}
data_pca %>%
fviz_pca_ind(
geom = "point",
col.ind = as.factor(membership(data_louvain))
)
```
<div class="pencadre">
Use the `adjustedRandIndex()` function to compare the `cell_annotation` to your graph-based clustering.
</div>
<details><summary>Solution</summary>
<p>
```{r}
adjustedRandIndex(
membership(data_louvain), cell_annotation
)
```
</p>
</details>
## Graph-based dimension reduction
Uniform Manifold Approximation and Projection (UMAP) is an algorithm for dimension reduction. Its details are described by [McInnes, Healy, and Melville](https://arxiv.org/abs/1802.03426) and its official implementation is available through the Python package [umap-learn](https://github.com/lmcinnes/umap).
```{r}
library(umap)
data_umap <- umap(data_pca$x[, 1:10])
data_umap$layout %>%
as_tibble(.name_repair = "universal") %>%
mutate(cell_type = cell_annotation) %>%
ggplot() +
geom_point(aes(x = ...1, y = ...2, color = cell_type))
```
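The layout depends on UMAP's hyperparameters; here is a small sketch of how you could override the package defaults (the values below are arbitrary, for exploration only):

```{r, eval=F}
# start from the package defaults and change a few settings
custom_config <- umap.defaults
custom_config$n_neighbors <- 50
custom_config$min_dist <- 0.5
data_umap_custom <- umap(data_pca$x[, 1:10], config = custom_config)
```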
<div class="red_pencadre">
What can you say about the axes of this plot?
</div>
[The .Rmd file corresponding to this page is available here under the AGPL3 Licence](https://lbmc.gitbiopages.ens-lyon.fr/hub/formations/ens_m1_ml/Practical_b.Rmd)
## Implementing your own $k$-means clustering algorithm
The $k$-means algorithm follows these steps:

1. choose $k$ initial centers, for example by picking $k$ cells at random
2. assign each cell to the cluster of the closest center
3. recompute each center as the mean of the cells assigned to it
4. repeat steps 2 and 3 until the cluster assignments no longer change
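The practical wraps these steps in a `kmeans_example()` function, which the final plot below uses; its original implementation is not reproduced here. A minimal sketch of what such a function could look like, assuming Euclidean distances and random initialization:

```{r, eval=F}
kmeans_example <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # step 1: pick k cells at random as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # step 2: assign each cell to its closest center (Euclidean distance)
    dists <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    new_cluster <- max.col(-dists)
    # step 4: stop when the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # step 3: recompute each center as the mean of its assigned cells
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  cluster
}
```

The returned vector of cluster labels can then be wrapped in `as.factor()` for plotting, as in the chunk below.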
```{r}
data_pca %>%
  fviz_pca_ind(
    geom = "point",
    col.ind = as.factor(kmeans_example(data_pca$x[, 1:2], k = 9))
  )
```