From f1402350d27b3afbf317653bdce67be9cd340507 Mon Sep 17 00:00:00 2001
From: Laurent Modolo <laurent.modolo@ens-lyon.fr>
Date: Tue, 7 Sep 2021 18:20:10 +0200
Subject: [PATCH] session_3: improve the Q & A system

---
 session_3/session_3.Rmd | 129 +++++++++++++++++++++++++++++++---------
 1 file changed, 102 insertions(+), 27 deletions(-)

diff --git a/session_3/session_3.Rmd b/session_3/session_3.Rmd
index 5295b4f..8e2b7e4 100644
--- a/session_3/session_3.Rmd
+++ b/session_3/session_3.Rmd
@@ -49,42 +49,59 @@ Like in the previous sessions, it's good practice to create a new **.R** file to
 # `ggplot2` statistical transformations
+In the previous session, we plotted the data as they are, using the variables' values as **x** or **y** coordinates, color shade, size or transparency.
+When dealing with categorical variables, also called **factors**, it can be interesting to perform some simple statistical transformations.
+For example, we may want to have coordinates on an axis proportional to the number of records for a given category.
 We are going to use the `diamonds` data set included in `tidyverse`.
-- Use the `help` and `view` command to explore this data set.
+<div class="pencadre">
+
+- Use the `help` and `View` commands to explore this data set.
+- How many records does this dataset contain ?
 - Try the `str` command, which information are displayed ?
+</div>
+
 ```{r str_diamon}
 str(diamonds)
 ```
-We saw scatterplot (`geom_point()`), smoothplot (`geom_smooth()`). Now barplot with `geom_bar()` :
+## Introduction to `geom_bar`
-```{r diamonds_barplot, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+We saw scatterplot (`geom_point()`), smoothplot (`geom_smooth()`).
+Now barplot with `geom_bar()` :
+
+```{r diamonds_barplot, cache = TRUE, fig.width=8, fig.height=4.5}
 ggplot(data = diamonds, mapping = aes(x = cut)) +
   geom_bar()
 ```
 More diamonds are available with high quality cuts.
-On the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds!
+On the x-axis, the chart displays **cut**, a variable from diamonds. On the y-axis, it displays **count**, **but count is not a variable in diamonds!**
-The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation. The figure below describes how this process works with `geom_bar()`.
-
+## **geom** and **stat**
+The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation.
+The figure below describes how this process works with `geom_bar()`.
-You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using `stat_count()` instead of `geom_bar()`:
+
+
+You can generally use **geoms** and **stats** interchangeably. For example, you can recreate the previous plot using `stat_count()` instead of `geom_bar()`:
-```{r diamonds_stat_count, include=TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+```{r diamonds_stat_count, include=TRUE, fig.width=8, fig.height=4.5}
 ggplot(data = diamonds, mapping = aes(x = cut)) +
   stat_count()
 ```
-Every geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. There are three reasons you might need to use a stat explicitly:
+Every **geom** has a default **stat**; and every **stat** has a default **geom**. This means that you can typically use **geoms** without worrying about the underlying statistical transformation. There are three reasons you might need to use a **stat** explicitly:
+
+## Why **stat** ?
-- You might want to override the default stat.
+You might want to override the default stat.
+For example, in the following `demo` dataset we already have a variable for the **counts** per `cut`.
 ```{r 3_a, include=TRUE, fig.width=8, fig.height=4.5}
 demo <- tribble(
@@ -101,36 +118,66 @@ demo <- tribble(
 to guess at their meaning from the context, and you will learn exactly what they do soon!)
+<div class="pencadre">
+So, instead of using the default `geom_bar` parameter `stat = "count"`, try to use `"identity"`.
+</div>
+
+<details><summary>Solution</summary>
+<p>
 ```{r 3_ab, include=TRUE, fig.width=8, fig.height=4.5}
 ggplot(data = demo, mapping = aes(x = cut, y = freq)) +
   geom_bar(stat = "identity")
 ```
+</p>
+</details>
+
+You might want to override the default mapping from transformed variables to aesthetics ( e.g. proportion).
-- You might want to override the default mapping from transformed variables to aesthetics ( e.g. proportion).
 ```{r 3_b, include=TRUE, fig.width=8, fig.height=4.5}
 ggplot(data = diamonds, mapping = aes(x = cut, y = ..prop.., group = 1)) +
   geom_bar()
 ```
-- In our proportion bar chart, we need to set `group = 1`. Why?
+<div class="pencadre">
+In our proportion bar chart, we need to set `group = 1`. Why?
+</div>
+<details><summary>Solution</summary>
+<p>
 ```{r diamonds_stats_challenge, include=TRUE, message=FALSE, fig.width=8, fig.height=4.5}
 ggplot(data = diamonds, mapping = aes(x = cut, y = ..prop..)) +
   geom_bar()
 ```
 If group is not used, the proportion is calculated with respect to the data that contains that field and is ultimately going to be 100% in any case. For instance, The proportion of an ideal cut in the ideal cut specific data will be 1.
+</p>
+</details>
+
+## More details with `stat_summary`
-- You might want to draw greater attention to the statistical transformation in your code.
-you might use stat_summary(), which summarises the y values for each unique x
-value, to draw attention to the summary that you are computing:
+<div class="pencadre">
+You might want to draw greater attention to the statistical transformation in your code.
+You might use `stat_summary()`, which summarizes the **y** values for each unique **x**
+value, to draw attention to the summary that you are computing.
+</div>
+<details><summary>Solution</summary>
+<p>
 ```{r 3_c, include=TRUE, fig.width=8, fig.height=4.5, message=FALSE}
 ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
   stat_summary()
+```
+</p>
+</details>
-
+<div class="pencadre">
+Set `fun.min`, `fun.max` and `fun` to the `min`, `max` and `median` functions respectively.
+</div>
+
+<details><summary>Solution</summary>
+<p>
+```{r 3_d, include=TRUE, fig.width=8, fig.height=4.5, message=FALSE}
 ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
   stat_summary(
     fun.min = min,
@@ -138,54 +185,80 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
     fun = median
   )
 ```
+</p>
+</details>
+# Coloring area plots
-# Position adjustments
-
-You can colour a bar chart using either the `color` aesthetic,
+<div class="pencadre">
+You can colour a bar chart using either the `color` aesthetic, or, more usefully, `fill`:
+Try both solutions on a `cut` histogram.
+</div>
+<details><summary>Solution</summary>
+<p>
 ```{r diamonds_barplot_color, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
 ggplot(data = diamonds, mapping = aes(x = cut, color = cut)) +
   geom_bar()
 ```
-or, more usefully, `fill`:
-
 ```{r diamonds_barplot_fill, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
 ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
   geom_bar()
 ```
+</p>
+</details>
+<div class="pencadre">
 You can also use `fill` with another variable:
+Try to color by `clarity`. Is `clarity` a continuous or categorical variable ?
+</div>
+<details><summary>Solution</summary>
+<p>
 ```{r diamonds_barplot_fill_clarity, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
 ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
   geom_bar()
 ```
+</p>
+</details>
+
+# Position adjustments
-The stacking is performed by the position adjustment `position`
+The stacking of the `fill` parameter is performed by the position adjustment `position`
-## fill
+<div class="pencadre">
+Try the following values of the `position` parameter for `geom_bar`: `"fill"`, `"dodge"` and `"jitter"`
+</div>
+
+<details><summary>Solution</summary>
+<p>
 ```{r diamonds_barplot_pos_fill, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
 ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
   geom_bar( position = "fill")
 ```
-## dodge
-
 ```{r diamonds_barplot_pos_dodge, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
 ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
   geom_bar( position = "dodge")
 ```
-## jitter
-
 ```{r diamonds_barplot_pos_jitter, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
 ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
   geom_bar( position = "jitter")
 ```
+</p>
+</details>
+`jitter` is often used for plotting points when they are stacked on top of each other.
+
+<div class="pencadre">
+Compare `geom_point` to `geom_jitter` to plot `cut` versus `depth` and color by `clarity`
+</div>
+
+<details><summary>Solution</summary>
+<p>
 ```{r dia_jitter2, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
 ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
   geom_point()
@@ -195,6 +268,8 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
 ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
   geom_jitter()
 ```
+</p>
+</details>
 ## violin
-- 
GitLab
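
The `stat = "identity"` and `group = 1` exercises touched by this patch can be checked by computing the summary by hand. The following is a minimal sketch, assuming `dplyr` is installed alongside `ggplot2`; the names `counts`, `n_records` and `prop` are illustrative and do not appear in `session_3.Rmd`.

```r
library(ggplot2)
library(dplyr)

# Roughly what stat_count() computes behind geom_bar():
# one row per cut, with the number of records in that category.
counts <- diamonds %>%
  count(cut, name = "n_records") %>%
  # Proportion over the whole data set, i.e. what `group = 1` asks for.
  mutate(prop = n_records / sum(n_records))

print(counts)

# The table is already summarised, so the bars must use the values as-is.
ggplot(data = counts, mapping = aes(x = cut, y = n_records)) +
  geom_bar(stat = "identity")

ggplot(data = counts, mapping = aes(x = cut, y = prop)) +
  geom_bar(stat = "identity")
```

If the two plots match the `geom_bar()` and `..prop..` plots from the exercises, that supports the explanation given in the solution: without `group = 1`, each bar is its own group and its proportion is 1.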