@@ -49,42 +49,59 @@ Like in the previous sessions, it's good practice to create a new **.R** file to
...
# `ggplot2` statistical transformations
In the previous session, we plotted the data as they are, using the variables' values as **x** or **y** coordinates, color shade, size or transparency.
When dealing with categorical variables, also called **factors**, it can be interesting to perform some simple statistical transformations.
For example, we may want the coordinates on an axis to be proportional to the number of records in a given category.
We are going to use the `diamonds` data set included in `tidyverse`.
<div class="pencadre">
- Use the `help` and `View` commands to explore this data set.
- How many records does this data set contain?
- Try the `str` command. What information is displayed?
</div>
```{r str_diamon}
str(diamonds)
```
We have seen scatter plots (`geom_point()`) and smoothed plots (`geom_smooth()`). Now let's draw a bar plot with `geom_bar()`:
More diamonds are available with high quality cuts.
On the x-axis, the chart displays **cut**, a variable from diamonds. On the y-axis, it displays **count**, **but count is not a variable in diamonds!**
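
To see where **count** comes from, the following sketch draws the bar plot and then computes the same numbers by hand with `count()` from `dplyr`. It assumes the `tidyverse` is installed; the object names `bar_plot` and `cut_counts` are only used for illustration.

```{r count_by_hand}
library(tidyverse)

# bar plot: geom_bar() counts the rows for each value of cut
bar_plot <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
bar_plot

# the same counts computed explicitly; count() stores them in a column named n
cut_counts <- diamonds %>%
  count(cut)
cut_counts
```

The heights of the bars should match the `n` column of `cut_counts`, one row per level of `cut`.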

## **geom** and **stat**
The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation.
The figure below describes how this process works with `geom_bar()`.

You can generally use **geoms** and **stats** interchangeably. For example, you can recreate the previous plot using `stat_count()` instead of `geom_bar()`:
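
A minimal sketch of this equivalence, assuming the `tidyverse` is installed (the names `p_geom` and `p_stat` are only used for illustration):

```{r stat_count_sketch}
library(tidyverse)

# bar plot built from the geom: geom_bar() uses stat_count() by default
p_geom <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# the same plot built from the stat: stat_count() uses the bar geom by default
p_stat <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))

p_geom
p_stat
```

Both calls should produce the same bars, because each one falls back on the other as its default.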
Every **geom** has a default **stat**; and every **stat** has a default **geom**. This means that you can typically use **geoms** without worrying about the underlying statistical transformation. There are three reasons you might need to use a **stat** explicitly:
## Why **stat** ?
You might want to override the default stat.
For example, in the following `demo` dataset we already have a variable for the **counts** per `cut`.
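
As a sketch of this case, the chunk below builds a stand-in for `demo` directly from `diamonds` and plots it with `stat = "identity"`. The column name `freq` and the way `demo` is constructed here are assumptions made for illustration; the course's own `demo` table may be defined differently.

```{r demo_identity}
library(tidyverse)

# hypothetical stand-in for the demo dataset: one row per cut, counts in freq
demo <- diamonds %>%
  count(cut, name = "freq")

# stat = "identity" makes geom_bar() use freq as the bar height
# instead of counting the rows of demo
p_demo <- ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
p_demo
```

With the default stat, every bar would have a height of 1, since `demo` contains a single row per `cut`.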
If `group` is not set, each proportion is computed within the subset of the data for a single value of `cut`, so it is always 100%. For instance, the proportion of ideal cuts among the ideal-cut diamonds is 1.
</p>
</details>
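
The effect of `group` on proportions can be checked with the sketch below. It assumes a version of `ggplot2` that provides `after_stat()` (3.3.0 or later; older versions use the `..prop..` notation instead), and the plot names are only used for illustration.

```{r prop_group}
library(tidyverse)

# without group: each cut is its own group, so every bar has prop = 1
p_no_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# with group = 1: all rows form one group, so prop is the share of each cut
p_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

p_no_group
p_group
```

`layer_data(p_group)` returns the computed values if you want to inspect the proportions rather than read them off the plot.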
## More details with `stat_summary`
<div class="pencadre">
You might want to draw greater attention to the statistical transformation in your code.
For example, you might use `stat_summary()`, which summarizes the **y** values for each unique **x** value, to draw attention to the summary that you are computing.
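
A possible sketch with `diamonds` is shown below. The choice of `depth` as the **y** variable and of `min`, `max` and `median` as summaries is an assumption made for illustration, and the `fun`, `fun.min` and `fun.max` arguments require `ggplot2` 3.3.0 or later (older versions use `fun.y`, `fun.ymin` and `fun.ymax`).

```{r stat_summary_sketch}
library(tidyverse)

# for each cut: a point at the median depth and a line from the min to the max
p_summary <- ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
p_summary
```

Writing the summary functions explicitly makes it clear to the reader which statistics are drawn.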