To circumvent this problem, we are going to use the PCA as a dimension reduction method.
<div class="pencadre">
Use the `prcomp()` function to compute `data_pca` from the 600 most variable genes.
You can check the results with the following code; the `cell_annotation` variable contains the cell type label of each cell in the dataset:
```{r, eval=F}
fviz_pca_ind(
  data_pca,
  geom = "point",
  col.ind = as.factor(cell_annotation)
)
```
</div>
```{r, eval=F}
data_hclust %>% plot()
```
</p>
</details>
Too much information can drown the information; the `cutree()` function can help you solve this problem.
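For example, a minimal sketch (the `data_cutree` name is just for illustration, assuming the `data_hclust` object computed above):

```{r, eval=F}
# cut the dendrogram into k = 9 groups
data_cutree <- cutree(data_hclust, k = 9)
# count how many cells fall in each group
table(data_cutree)
```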
<div class="red_pencadre">
Which choice of `k` would you take?
</div>
<details><summary>Solution</summary>
<p>
```{r, eval=F}
data_pca %>%
  fviz_pca_ind(
    geom = "point",
    col.ind = as.factor(cutree(data_hclust, k = 9))
  )
```
</p>
</details>
The adjusted Rand index can be computed to compare two classifications. This index has an expected value of zero in the case of random partitions, and it is bounded above by 1 in the case of perfect agreement between the two partitions.
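As a toy illustration (hypothetical labels; we assume the `adjustedRandIndex()` function comes from the `mclust` package):

```{r, eval=F}
library(mclust)
labels_a <- c(1, 1, 2, 2, 3, 3)
labels_b <- c("x", "x", "y", "y", "z", "y")
# the two partitions mostly agree: positive index
adjustedRandIndex(labels_a, labels_b)
# a random permutation of the labels: index close to 0 on average
adjustedRandIndex(labels_a, sample(labels_a))
```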
<div class="pencadre">
Use the `adjustedRandIndex()` function to compare the `cell_annotation` to your hierarchical clustering.
</div>
```{r}
data_kmeans <- data_pca$x[, 1:3] %>%
  kmeans(centers = 9)
```
<div class="pencadre">
Use the `str()` function to explore the `data_kmeans` object, then make the following plot from your k-means results.
</div>
```{r, echo = F}
data_pca %>%
  fviz_pca_ind(
    geom = "point",
    col.ind = as.factor(data_kmeans$cluster)
  )
```
Maybe the real number of clusters in the PCs data is not $k=9$. We can use different metrics to choose the number of clusters:

- The elbow method plots the total within-cluster sum of squares (WSS) as a function of the number of clusters: the $k$ where the curve bends (the "elbow") is a good candidate.
- The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation): $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$, with $a(i)$ the mean distance between $i$ and the cells in its own cluster and $b(i)$ the smallest mean distance between $i$ and the cells of any other cluster. We plot the average silhouette $\frac{1}{n}\sum_{i=1}^n s(i)$.
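As an illustration, a minimal sketch of the average silhouette computation with the `cluster` package (assuming the `data_kmeans` result from above):

```{r, eval=F}
library(cluster)
# silhouette() takes the cluster labels and the distance matrix
sil <- silhouette(data_kmeans$cluster, dist(data_pca$x[, 1:3]))
# average silhouette width over all cells
mean(sil[, "sil_width"])
```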
<div class="pencadre">
Use the `fviz_nbclust()` function to plot these two metrics as a function of the number of clusters.
</div>
<details><summary>Solution</summary>
<p>
```{r}
fviz_nbclust(data_pca$x[, 1:3], kmeans, method = "wss")
fviz_nbclust(data_pca$x[, 1:3], kmeans, method = "silhouette")
fviz_nbclust(data_pca$x[, 1:3], hcut, method = "silhouette")
```
</p>
</details>
<div class="red_pencadre">
Explain the discrepancy between these results and $k=9$.
</div>
## Graph-based clustering
We are going to use the `cluster_louvain()` function to perform graph-based clustering.
This function takes an undirected graph as input instead of a distance matrix.
The `nng()` function computes a $k$-nearest neighbor graph. With the `mutual = T` option, an edge is kept only when the two cells are each among the other's nearest neighbors, which makes the graph undirected.
<div class="pencadre">
Check the effect of the `mutual = T` option on `data_knn` with the following code:
```{r, echo=F}
data_knn <- data_dist %>%
as.matrix() %>%
nng(k = 30, mutual = T)
```
</div>
<details><summary>Solution</summary>
<p>
```{r, eval=F}
# mutual = T keeps an edge only between mutual nearest neighbors: undirected graph
data_knn <- data_dist %>%
  as.matrix() %>%
  nng(k = 30, mutual = T)
# mutual = F links every cell to its 30 nearest neighbors: directed graph
data_knn_F <- data_dist %>%
  as.matrix() %>%
  nng(k = 30, mutual = F)
# compare the two graph structures (number and direction of edges)
str(data_knn)
str(data_knn_F)
```
</p>
</details>
<div class="red_pencadre">
Why do we need a $k$-NN graph?
</div>
The `cluster_louvain()` function implements the multi-level modularity optimization algorithm for finding community structure in a graph. Use this function on `data_knn` to create a `data_louvain` variable.
You can check the clustering results with `membership(data_louvain)`.
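To see how the `resolution` parameter changes the clustering, you can sweep over a few values; this loop is just an illustrative sketch (the Louvain algorithm is stochastic, so the counts may vary between runs):

```{r, eval=F}
# count the number of communities found at a few resolution values
for (res in c(0.1, 0.25, 0.5, 1, 2)) {
  n_clusters <- data_knn %>%
    cluster_louvain(resolution = res) %>%
    membership() %>%
    unique() %>%
    length()
  cat("resolution =", res, "->", n_clusters, "clusters\n")
}
```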
<div class="pencadre">
For which `resolution` value do you get 9 clusters?
</div>
<details><summary>Solution</summary>
<p>
```{r}
data_louvain <- data_knn %>%
cluster_louvain(resolution = 0.41)
```
</p>
</details>
```{r}
data_pca %>%
fviz_pca_ind(
geom = "point",
col.ind = as.factor(membership(data_louvain))
)
```
<div class="pencadre">
Use the `adjustedRandIndex()` function to compare the `cell_annotation` to your graph-based clustering.
</div>
<details><summary>Solution</summary>
<p>
```{r}
adjustedRandIndex(
membership(data_louvain), cell_annotation
)
```
</p>
</details>
## Graph-based dimension reduction
Uniform Manifold Approximation and Projection (UMAP) is an algorithm for dimension reduction. Its details are described by [McInnes, Healy, and Melville](https://arxiv.org/abs/1802.03426) and its official implementation is available through the Python package [umap-learn](https://github.com/lmcinnes/umap).
```{r}
library(umap)
data_umap <- umap(data_pca$x[, 1:10])
data_umap$layout %>%
as_tibble(.name_repair = "universal") %>%
mutate(cell_type = cell_annotation) %>%
ggplot() +
geom_point(aes(x = ...1, y = ...2, color = cell_type))
```
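The layout depends on UMAP's hyperparameters; here is a small sketch of how you could override the package defaults (the values below are arbitrary, for exploration only):

```{r, eval=F}
# start from the package defaults and change a few settings
custom_config <- umap.defaults
custom_config$n_neighbors <- 50
custom_config$min_dist <- 0.5
data_umap_custom <- umap(data_pca$x[, 1:10], config = custom_config)
```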
<div class="red_pencadre">
What can you say about the axes of this plot?
</div>
[The .Rmd file corresponding to this page is available here under the AGPL3 Licence](https://lbmc.gitbiopages.ens-lyon.fr/hub/formations/ens_m1_ml/Practical_b.Rmd)
## Implementing your own $k$-means clustering algorithm
The $k$-means algorithm follows these steps:

1. choose $k$ initial centers, for example by picking $k$ cells at random
2. assign each cell to the cluster of the closest center
3. recompute each center as the mean of the cells assigned to it
4. repeat steps 2 and 3 until the cluster assignments no longer change
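The practical wraps these steps in a `kmeans_example()` function, which the final plot below uses; its original implementation is not reproduced here. A minimal sketch of what such a function could look like, assuming Euclidean distances and random initialization:

```{r, eval=F}
kmeans_example <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # step 1: pick k cells at random as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # step 2: assign each cell to its closest center (Euclidean distance)
    dists <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    new_cluster <- max.col(-dists)
    # step 4: stop when the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # step 3: recompute each center as the mean of its assigned cells
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  cluster
}
```

The returned vector of cluster labels can then be wrapped in `as.factor()` for plotting, as in the chunk below.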
```{r}
data_pca %>%
  fviz_pca_ind(
    geom = "point",
    col.ind = as.factor(kmeans_example(data_pca$x[, 1:2], k = 9))
  )
```