---
title: 'R.3: Transformations with ggplot2'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2022"
output:
  rmdformats::downcute:
    self_contained: true
    use_bookdown: true
    default_style: "light"
    lightbox: true
    css: "../www/style_Rmd.css"
---
library(fontawesome)
[`r fa(name = "fas fa-house", fill = "grey", height = "1em")`](https://can.gitbiopages.ens-lyon.fr/R_basis/)
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
Introduction
In the last session, we saw how to use `ggplot2` and The Grammar of Graphics. The goal of this practical is to practice more advanced features of `ggplot2`.
The objectives of this session will be to:
- learn about statistical transformations
- practice position adjustments
- change the coordinate systems
The first step is to load the `tidyverse`.
Solution
```{r packageloaded, include=TRUE, message=FALSE}
library("tidyverse")
```
Like in the previous sessions, it's good practice to create a new .R file to write your code instead of using the R terminal directly.
`ggplot2` statistical transformations
In the previous session, we have plotted the data as they are by using the variable values as x or y coordinates, color shade, size or transparency. When dealing with categorical variables, also called factors, it can be interesting to perform some simple statistical transformations. For example, we may want to have coordinates on an axis proportional to the number of records for a given category.
We are going to use the `diamonds` data set included in `tidyverse`.
- Use the `help` and `View` commands to explore this data set.
- How many records does this dataset contain?
- Try the `str` command. What information is displayed?
str(diamonds)
Introduction to `geom_bar`
We have seen scatterplots (`geom_point()`) and smooth plots (`geom_smooth()`).
Now let's make a barplot with `geom_bar()`:
ggplot(data = diamonds, mapping = aes(x = cut)) +
geom_bar()
More diamonds are available with high quality cuts.
On the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds!
geom and stat
The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.
The figure below describes how this process works with `geom_bar()`.
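As a minimal sketch of what the stat does behind the scenes, the counts can be computed by hand with `dplyr::count()` and compared with the data that `geom_bar()` actually draws. The helper `layer_data()` comes from `ggplot2`; the object names `cut_counts` and `bar_data` are only illustrative.

```r
library(ggplot2)
library(dplyr)

# Count the records per cut by hand: the same table the count stat builds.
cut_counts <- diamonds %>% count(cut)
cut_counts

# layer_data() returns the data frame that a layer draws,
# including the computed `count` column.
p <- ggplot(data = diamonds, mapping = aes(x = cut)) +
  geom_bar()
bar_data <- layer_data(p)
bar_data[, c("x", "count")]
```

Both tables should hold one row per cut with the same counts, which is why `count` appears on the y-axis even though it is not a variable in `diamonds`.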
You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using `stat_count()` instead of `geom_bar()`:
ggplot(data = diamonds, mapping = aes(x = cut)) +
stat_count()
Every geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. There are three reasons you might need to use a stat explicitly:
Why stat ?
You might want to override the default stat.
For example, in the following `demo` dataset we already have a variable for the counts per `cut`.
demo <- tribble(
~cut, ~freq,
"Fair", 1610,
"Good", 4906,
"Very Good", 12082,
"Premium", 13791,
"Ideal", 21551
)
(Don't worry that you haven't seen `tribble()` before. You might be able
to guess its meaning from the context, and you will learn exactly what
it does soon!)
Solution
```{r 3_ab, include=TRUE, fig.width=8, fig.height=4.5}
ggplot(data = demo, mapping = aes(x = cut, y = freq)) +
  geom_bar(stat = "identity")
```
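For pre-computed counts, `ggplot2` also provides `geom_col()`, which uses the identity stat by default. The sketch below rebuilds the `demo` table so that it runs on its own. Turning `cut` into a factor is an extra step, not part of the original exercise, that keeps the bars in quality order instead of alphabetical order.

```r
library(ggplot2)
library(tibble)

demo <- tribble(
  ~cut,        ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good", 12082,
  "Premium",   13791,
  "Ideal",     21551
)

# Assumption: we want the bars ordered as in the table, not alphabetically.
demo$cut <- factor(demo$cut, levels = demo$cut)

# geom_col() plots the y values as they are, without counting rows.
p_col <- ggplot(data = demo, mapping = aes(x = cut, y = freq)) +
  geom_col()
p_col
```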
You might want to override the default mapping from transformed variables to aesthetics (e.g., proportion).
ggplot(data = diamonds, mapping = aes(x = cut, y = ..prop.., group = 1)) +
geom_bar()
Solution
```{r diamonds_stats_challenge, include=TRUE, message=FALSE, fig.width=8, fig.height=4.5}
ggplot(data = diamonds, mapping = aes(x = cut, y = ..prop..)) +
  geom_bar()
```
If `group` is not set, each level of `cut` forms its own group, so every proportion is computed within its own bar and is always equal to 1 (100%). For instance, the proportion of Ideal cuts within the Ideal-cut data is 1.
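In recent versions of `ggplot2` (3.4.0 onwards, as far as we know), the `..prop..` notation triggers a deprecation warning and `after_stat(prop)` is the recommended spelling. The sketch below is the same proportion plot written that way, with a quick check that the proportions sum to 1 when `group = 1` is set. The names `p_prop` and `prop_data` are illustrative.

```r
library(ggplot2)

# Same plot as with ..prop.., using the after_stat() helper instead.
p_prop <- ggplot(
  data = diamonds,
  mapping = aes(x = cut, y = after_stat(prop), group = 1)
) +
  geom_bar()
p_prop

# With a single group, the five proportions add up to 1.
prop_data <- layer_data(p_prop)
sum(prop_data$prop)
```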
More details with `stat_summary`
<details><summary>Solution</summary>
<p>

```{r 3_c, include=TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
  stat_summary()
```

</p>
</details>
<div class="pencadre">
Set the `fun.min`, `fun.max` and `fun` arguments to the `min`, `max` and `median` functions respectively.
</div>
<details><summary>Solution</summary>
<p>

```{r 3_d, include=TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
  stat_summary(
    fun.min = min,
    fun.max = max,
    fun = median
  )
```

</p>
</details>
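To see what `stat_summary()` computes with these three functions, the same summary can be built by hand with `dplyr` and drawn with `geom_pointrange()`, the default geom of `stat_summary()`. This is only a sketch; the column names `ymin`, `y` and `ymax` are chosen here to match the aesthetics that `geom_pointrange()` expects.

```r
library(ggplot2)
library(dplyr)

# One row per cut with the minimum, median and maximum depth.
depth_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    ymin = min(depth),
    y = median(depth),
    ymax = max(depth)
  )
depth_summary

# Draw the pre-computed summary: a point at the median, a line from min to max.
ggplot(data = depth_summary,
       mapping = aes(x = cut, y = y, ymin = ymin, ymax = ymax)) +
  geom_pointrange()
```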
Coloring area plots
Solution
```{r diamonds_barplot_color, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, color = cut)) +
  geom_bar()
```
ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
geom_bar()
Solution
```{r diamonds_barplot_fill_clarity, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar()
```
Position adjustments
The stacking of the bars colored with the `fill` aesthetic is performed by the position adjustment set with the `position` argument.
Solution
```{r diamonds_barplot_pos_fill, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(position = "fill")
```
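With `position = "fill"`, each stack is rescaled to a total height of 1, so the segments show the share of each clarity within a cut. A rough equivalent computed by hand, assuming `dplyr` is loaded (the name `clarity_props` is illustrative):

```r
library(ggplot2)
library(dplyr)

# Share of each clarity within each cut.
clarity_props <- diamonds %>%
  count(cut, clarity) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Within each cut, the shares add up to 1.
prop_totals <- clarity_props %>%
  group_by(cut) %>%
  summarise(total = sum(prop))
prop_totals

# Stacking the pre-computed shares gives bars of height 1.
ggplot(data = clarity_props,
       mapping = aes(x = cut, y = prop, fill = clarity)) +
  geom_col()
```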
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar( position = "dodge")
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar( position = "jitter")
`jitter` is often used for plotting points when they are stacked on top of each other.
Solution
```{r dia_jitter2, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
  geom_point()
```
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
geom_jitter()
Solution
```{r dia_jitter4, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
  geom_jitter(width = .1, height = .1)
```
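`geom_jitter()` is a shortcut for `geom_point()` with a jitter position adjustment. Because the noise is random, the plot changes slightly at each run. A possible way to make it reproducible is to pass a `seed` to `position_jitter()`. The seed value `1` below is arbitrary.

```r
library(ggplot2)

# Same jitter amounts as above, with a fixed seed so the noise is reproducible.
p_jitter <- ggplot(
  data = diamonds,
  mapping = aes(x = cut, y = depth, color = clarity)
) +
  geom_point(
    position = position_jitter(width = 0.1, height = 0.1, seed = 1)
  )
p_jitter
```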
In the `geom_jitter` plot that we made, we cannot really see the limits of the different clarity groups. Instead, we can use `geom_violin` to see their density.
Solution
```{r dia_violon, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
  geom_violin()
```
Coordinate systems
The default coordinate system is the Cartesian coordinate system, where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
geom_boxplot()
Solution
```{r dia_boxplot_flip, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
  geom_boxplot() +
  coord_flip()
```
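As an alternative sketch, recent versions of `ggplot2` (3.3.0 onwards, to our knowledge) can draw horizontal boxplots without `coord_flip()`: mapping the categorical variable to `y` and the continuous one to `x` is enough for `geom_boxplot()` to detect the orientation. The name `p_swapped` is illustrative.

```r
library(ggplot2)

# Horizontal boxplots obtained by swapping the x and y aesthetics.
p_swapped <- ggplot(
  data = diamonds,
  mapping = aes(x = depth, y = cut, color = clarity)
) +
  geom_boxplot()
p_swapped
```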
ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
geom_bar( show.legend = FALSE, width = 1 ) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
Solution
```{r diamonds_bar2, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
  geom_bar(show.legend = FALSE, width = 1) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL) +
  coord_polar()
```
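By default, `coord_polar()` maps x to the angle, which turns the barplot into a Coxcomb chart. As a sketch of a variation, a classic pie chart can be obtained by stacking all cuts in a single bar and mapping y to the angle with `theta = "y"`. The constant `factor(1)` is just a placeholder x value.

```r
library(ggplot2)

# One stacked bar (constant x), then the bar height becomes the angle.
p_pie <- ggplot(data = diamonds, mapping = aes(x = factor(1), fill = cut)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)
p_pie
```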
By combining the right geom, coordinates and faceting functions, you can build a large number of different plots to present your results.
See you in R.4: data transformation
To go further: animated plots from xls files
In order to read information from an xls file, we will use the `openxlsx` package. To generate animations, we will use the `gganimate` package. The additional `gifski` package will allow R to save your animation in the GIF format (Graphics Interchange Format).
install.packages(c("openxlsx", "gganimate", "gifski"))
library(openxlsx)
library(gganimate)
library(gifski)
Solution
Two solutions:
Use the URL directly
gapminder <- read.xlsx("https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx")
Download the file, save it in the same directory as your script, then use the local path
gapminder <- read.xlsx("gapminder.xlsx")
This dataset contains 4 variables of interest for us to display per country:
- `gdpPercap`: the GDP per capita (US$, inflation-adjusted)
- `lifeExp`: the life expectancy at birth, in years
- `pop`: the population size
- `continent`: a factor with 5 levels
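Depending on how the file is read, text columns such as `continent` may come back as character vectors rather than factors (this is our assumption about the usual behaviour of `read.xlsx()`). The sketch below shows how such a column could be checked and converted. The small table is a hypothetical stand-in with made-up country names, not the real data.

```r
# Hypothetical stand-in for a few rows of the table (made-up values).
gapminder_sample <- data.frame(
  country = c("A", "B", "C"),
  continent = c("Europe", "Asia", "Europe"),
  stringsAsFactors = FALSE
)

# Check the type of the column, then convert it to a factor.
str(gapminder_sample$continent)
gapminder_sample$continent <- factor(gapminder_sample$continent)
levels(gapminder_sample$continent)
```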
Solution
```{r gapminder_plot_a}
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point()
```
Solution
```{r gapminder_plot_b}
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point() +
  scale_x_log10()
```
For this, we need to add a `transition_time` layer to our plot, which takes `year` as an argument.
Solution
```{r gapminder_plot_c}
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point() +
  scale_x_log10() +
  transition_time(year) +
  labs(title = 'Year: {as.integer(frame_time)}')
```
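To keep the animation as a file, one possible approach is to render it explicitly with `animate()` and write it with `anim_save()`, both from `gganimate`. To stay self-contained, this sketch takes the data from the CRAN `gapminder` package, assuming it holds the same columns as the xlsx file. The frame settings and the output name `gapminder.gif` are arbitrary choices.

```r
library(ggplot2)
library(gganimate)

# Assumption: the CRAN `gapminder` package provides the same columns
# (gdpPercap, lifeExp, pop, continent, year) as the xlsx file.
gapminder <- gapminder::gapminder

anim <- ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point() +
  scale_x_log10() +
  transition_time(year) +
  labs(title = "Year: {as.integer(frame_time)}")

# Render 50 frames at 10 frames per second with the gifski renderer,
# then save the result as a GIF in the working directory.
rendered <- animate(anim, nframes = 50, fps = 10, renderer = gifski_renderer())
anim_save("gapminder.gif", animation = rendered)
```

Fewer frames render faster while you are testing. Increase `nframes` for a smoother final animation.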