Skip to content
Snippets Groups Projects

Compare revisions

Changes are shown as if the source revision was being merged into the target revision. Learn more about comparing revisions.

Source

Select target project
No results found
Select Git revision
  • main
  • master
2 results

Target

Select target project
  • LBMC/hub/formations/R_basis
  • can/R_basis
2 results
Select Git revision
  • main
  • master
  • quarto-rebuild
3 results
Show changes
Showing
with 18725 additions and 1552 deletions
---
title: 'R.1: Installing packages from Bioconductor'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2021"
output:
rmdformats::downcute:
self_contain: false
use_bookdown: true
default_style: "dark"
lightbox: true
css: "../src/style.css"
---
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
To install packages from [Bioconducor](http://www.bioconductor.org) you need to
\ No newline at end of file
# Does news coverage boost support for presidential candidates in the Democratic primary?
# https://www.jacob-long.com/post/news-coverage-candidate-support/
library(tidyverse)
library(jtools)
library(tsibble)
################################ Getting the data ##############################
cable_mentions <- read_csv("https://github.com/fivethirtyeight/data/raw/master/media-mentions-2020/cable_weekly.csv")
online_mentions <- read_csv("https://github.com/fivethirtyeight/data/raw/master/media-mentions-2020/online_weekly.csv")
# Immediately convert `end_date` to date class
polls <- read_csv("https://projects.fivethirtyeight.com/polls-page/president_primary_polls.csv")
candidates <- c("Amy Klobuchar", "Andrew Yang", "Bernard Sanders",
"Beto O'Rourke", "Bill de Blasio", "Cory A. Booker",
"Elizabeth Warren", "Eric Swalwell", "Jay Robert Inslee",
"Joe Sestak", "John Hickenlooper", "John K. Delaney",
"Joseph R. Biden Jr.", "Julián Castro", "Kamala D. Harris",
"Kirsten E. Gillibrand", "Marianne Williamson",
"Michael F. Bennet", "Pete Buttigieg", "Seth Moulton",
"Steve Bullock", "Tim Ryan", "Tom Steyer", "Tulsi Gabbard",
"Wayne Messam")
candidates_clean <- c("Amy Klobuchar", "Andrew Yang", "Bernie Sanders",
"Beto O'Rourke", "Bill de Blasio", "Cory Booker",
"Elizabeth Warren", "Eric Swalwell", "Jay Inslee",
"Joe Sestak", "John Hickenlooper", "John Delaney",
"Joe Biden", "Julian Castro", "Kamala Harris",
"Kirsten Gillibrand", "Marianne Williamson",
"Michael Bennet", "Pete Buttigieg", "Seth Moulton",
"Steve Bullock", "Tim Ryan", "Tom Steyer",
"Tulsi Gabbard", "Wayne Messam")
########################### formating data #####################################
polls <- polls %>%
# Convert date to date format
mutate(end_date = as.Date(end_date, format = "%m/%d/%y")) %>%
filter(
# include only polls of at least modest quality
fte_grade %in% c("C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"),
# only include polls ending on or after 12/30/2018
end_date >= as.Date("12/30/2018", "%m/%d/%Y"),
# only include *Democratic* primary polls
party == "DEM",
# only include the selected candidates
candidate_name %in% candidates,
# only national polls
is.na(state),
# Exclude some head-to-head results, etc.
notes %nin% c("head-to-head poll",
"HarrisX/SR Democrat LV, definite voter",
"open-ended question")
) %>%
mutate(
# Have to add 1 to the date to accommodate tsibble's yearweek()
# starting on Monday rather than Sunday like our other data sources
week = as.Date(yearweek(end_date + 1)) - 1,
# Convert candidate names to factor so I can relabel them
candidate_name = factor(candidate_name, levels = candidates, labels = candidates_clean)
)
polls_agg <- polls %>%
group_by(week, candidate_name) %>%
summarize(
pct_polls = weighted.mean(pct, log(sample_size))
)
library(ggplot2)
top_candidates <- c("Joe Biden", "Elizabeth Warren", "Bernie Sanders",
"Pete Buttigieg", "Kamala Harris", "Beto O'Rourke",
"Cory Booker")
ggplot(filter(polls_agg, candidate_name %in% top_candidates),
aes(x = week, y = pct_polls, color = candidate_name)) +
geom_line() +
theme_nice()
media <-
inner_join(cable_mentions, online_mentions, by = c("date", "name")) %>%
mutate(
# Create new variables that put the media coverage variables on
# same scale as poll numbers
pct_cable = 100 * pct_of_all_candidate_clips,
pct_online = 100 * pct_of_all_candidate_stories
)
top_candidates <- c("Joe Biden", "Elizabeth Warren", "Bernie Sanders",
"Pete Buttigieg", "Kamala Harris", "Beto O'Rourke",
"Cory Booker")
ggplot(filter(media, name %in% top_candidates),
aes(x = date, y = pct_cable, color = name)) +
geom_line() +
theme_nice()
top_candidates <- c("Joe Biden", "Elizabeth Warren", "Bernie Sanders",
"Pete Buttigieg", "Kamala Harris", "Beto O'Rourke",
"Cory Booker")
ggplot(filter(media, name %in% top_candidates),
aes(x = date, y = pct_online, color = name)) +
geom_line() +
theme_nice()
######################### Combine the data #####################################
joined <- inner_join(polls_agg, media,
by = c("candidate_name" = "name", "week" = "date"))
library(panelr)
# panel_data needs a number or ordered factor as wave variable
joined$wave <- as.ordered(joined$week)
joined_panel <- panel_data(ungroup(joined), id = candidate_name, wave = wave)
joined_pdata <- as_pdata.frame(joined_panel)
---
title: 'R.1: Installing packages from github'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2021"
output:
rmdformats::downcute:
self_contain: false
use_bookdown: true
default_style: "dark"
lightbox: true
css: "../src/style.css"
---
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
To install packages from [github](https://github.com/) you need to
\ No newline at end of file
This diff is collapsed.
---
title: "R.2: introduction to Tidyverse"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2021"
output:
rmdformats::downcute:
self_contain: false
use_bookdown: true
default_style: "dark"
lightbox: true
css: "../src/style.css"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr);\nHélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
if("conflicted" %in% .packages())
conflicted::conflicts_prefer(dplyr::filter)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
```{r download_data, include=FALSE, eval=FALSE}
```{r download_data, include=FALSE, eval=T}
library("tidyverse")
tmp <- tempfile(fileext = ".zip")
download.file("http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip",
......@@ -114,7 +108,7 @@ read_csv("data-raw/vehicles.csv") %>%
write_csv("mpg.csv")
```
# Introduction
## Introduction
In the last session, we have gone through the basis of R.
Instead of continuing to learn more about R programming, in this session we are going to jump directly to rendering plots.
......@@ -122,7 +116,7 @@ Instead of continuing to learn more about R programming, in this session we are
We make this choice for three reasons:
- Rendering nice plots is directly rewarding
- You will be able to apply what you learn in this session to your own data (given that they are *correctly formated*)
- You will be able to apply what you learn in this session to your own data (given that they are *correctly formatted*)
- We will come back to R programming later, when you have all the necessary tools to visualize your results.
......@@ -133,7 +127,7 @@ The objectives of this session will be to:
- Learn the different aesthetics in R plots
- Compose complex graphics
## Tidyverse
### Tidyverse
The `tidyverse` package is a collection of R packages designed for data science that include `ggplot2`.
......@@ -154,7 +148,7 @@ Luckily for you, `tidyverse` is preinstalled on your Rstudio server. So you just
library("tidyverse")
```
## Toy data set `mpg`
### Toy data set `mpg`
This dataset contains a subset of the fuel economy data that the EPA makes available on [fueleconomy.gov](http://fueleconomy.gov).
It contains only models which had a new release every year between 1999 and 2008.
......@@ -170,9 +164,13 @@ For that we are going to use the command `read_csv` which is able to read a [csv
This command also works for file URL
```{r mpg_download, cache=TRUE, message=FALSE}
```{r mpg_download_local, cache=TRUE, message=FALSE, echo = F, include=F}
new_mpg <- read_csv("./mpg.csv")
```
```{r mpg_download, cache=TRUE, message=FALSE, eval = F}
new_mpg <- read_csv(
"http://perso.ens-lyon.fr/laurent.modolo/R/mpg.csv"
"https://can.gitbiopages.ens-lyon.fr/R_basis/session_2/mpg.csv"
)
```
......@@ -199,13 +197,13 @@ new_mpg
Here we can see that `new_mpg` is a `tibble` we will come back to `tibble` later.
## New script
### New script
Like in the last session, instead of typing your commands directly in the console, you are going to write them in an R script.
![](./img/formationR_session2_scriptR.png)
# First plot with `ggplot2`
## First plot with `ggplot2`
We are going to make the simplest plot possible to study the relationship between two variables: the scatterplot.
......@@ -237,7 +235,15 @@ ggplot(data = <DATA>) +
What happend when you use only the command `ggplot(data = mpg)` ?
</div>
<details><summary>Solution</summary>
<p>
```{r only_ggplot, cache = TRUE, fig.width=4.5, fig.height=2}
ggplot(data = new_mpg)
```
</p>
</details>
<div class="pencadre">
Make a scatterplot of `hwy` ( fuel efficiency ) vs. `cyl` ( number of cylinders ).
</div>
......@@ -251,12 +257,22 @@ ggplot(data = new_mpg, mapping = aes(x = hwy, y = cyl)) +
```
</p>
<div class="pencadre">
What seems to be the problem ?
</div>
<details><summary>Solution</summary>
<p>
Dots with the same coordinates are superposed.
</p>
</details>
</details>
# Aesthetic mappings
## Aesthetic mappings
`ggplot2` will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. `ggplot2` will also add a legend that explains which levels correspond to which values.
......@@ -266,7 +282,7 @@ Try the following aesthetic:
- `alpha`
- `shape`
## `color` mapping
### `color` mapping
```{r new_mpg_plot_e, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
......@@ -274,21 +290,21 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
```
## `size` mapping
### `size` mapping
```{r new_mpg_plot_f, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, size = class)) +
geom_point()
```
## `alpha` mapping
### `alpha` mapping
```{r new_mpg_plot_g, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, alpha = class)) +
geom_point()
```
## `shape` mapping
### `shape` mapping
```{r new_mpg_plot_h, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, shape = class)) +
......@@ -325,7 +341,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
</p>
</details>
## Mapping a **continuous** variable to a color.
### Mapping a **continuous** variable to a color.
You can also map continuous variable to a color
......@@ -347,7 +363,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = displ < 5)) +
</p>
</details>
# Facets
## Facets
You can create multiple plots at once by faceting. For this you can use the command `facet_wrap`.
This command takes a formula as input.
......@@ -362,7 +378,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
```
<div class="pencadre">
Now try to facet your plot by `fl + class`
Now try to facet your plot by `fuel + class`
</div>
......@@ -371,14 +387,14 @@ Now try to facet your plot by `fl + class`
Formulas allow you to express complex relationship between variables in R !
```{r new_mpg_plot_l, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ fl + class, nrow = 2)
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ fuel + class, nrow = 2)
```
</p>
</details>
# Composition
## Composition
There are different ways to represent the information :
......@@ -416,7 +432,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
\
We can use different `data` for different layers (you will lean more on `filter()` later)
We can use different `data` (here new_mpg and mpg tables) for different layers (you will lean more on `filter()` later)
```{r new_mpg_plot_t, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
......@@ -424,14 +440,14 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = filter(mpg, class == "subcompact"))
```
# Challenge !
## Challenge !
## First challenge
### First challenge
<div class="pencadre">
Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
</div>
```R
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) +
geom_point(show.legend = FALSE) +
geom_smooth(se = FALSE)
```
......@@ -441,61 +457,231 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
- What does the `se` argument to `geom_smooth()` do?
</div>
## Second challenge
<details><summary>Solution</summary>
<p>
```{r soluce_challenge_1, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) +
geom_point(show.legend = FALSE) +
geom_smooth(se = FALSE)
```
</p>
</details>
### Second challenge
<div class="pencadre">
How being a `2seater` car impact the engine size versus fuel efficiency relationship ?
How being a `Two Seaters` car (*class column*) impact the engine size (*displ column*) versus fuel efficiency relationship (*hwy column*) ?
1. Make a plot of `hwy` in function of `displ `
1. *Colorize* this plot in another color for `Two Seaters` class
2. *Split* this plot for each *class*
Make a plot *colorizing* this information
</div>
<details><summary>Solution</summary>
<details><summary>Solution 1</summary>
<p>
```{r new_mpg_plot_color_2seater1, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point()
```
</p>
</details>
<details><summary>Solution 2</summary>
<p>
```{r new_mpg_plot_color_2seater, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
```{r new_mpg_plot_color_2seater2, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(mpg, class == "2seater"), color = "red")
geom_point(data = filter(new_mpg, class == "Two Seaters"), color = "red")
```
</p>
</details>
<details><summary>Solution 3</summary>
<p>
```{r new_mpg_plot_color_2seater_facet, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(new_mpg, class == "Two Seaters"), color = "red") +
facet_wrap(~class)
```
</p>
</details>
<div class="pencadre">
Write a `function` called `plot_color_2seater` that can take as sol argument the variable `mpg` and plot the same graph.
Write a `function` called `plot_color_a_class` that can take as argument the class and plot the same graph for this class
</div>
<details><summary>Solution</summary>
<p>
```{r new_mpg_plot_color_2seater_fx, cache = TRUE, fig.width=8, fig.height=4.5}
plot_color_2seater <- function(mpg) {
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
plot_color_a_class <- function(my_class) {
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(mpg, class == "2seater"), color = "red")
geom_point(data = filter(new_mpg, class == my_class), color = "red") +
facet_wrap(~class)
}
plot_color_2seater(mpg)
plot_color_a_class("Two Seaters")
plot_color_a_class("Compact Cars")
```
</p>
</details>
## Third challenge
### Third challenge
<div class="pencadre">
Recreate the R code necessary to generate the following graph
Recreate the R code necessary to generate the following graph (see "linetype" option of "geom_smooth")
</div>
```{r new_mpg_plot_u, echo = FALSE, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = fuel)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
geom_smooth(linetype = "dashed", color = "black") +
facet_wrap(~fuel)
```
<details><summary>Solution</summary>
<p>
```{r new_mpg_plot_v, eval=F}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = fuel)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
geom_smooth(linetype = "dashed", color = "black") +
facet_wrap(~fuel)
```
</p>
</details>
\ No newline at end of file
</details>
### See you in [R.3: Transformations with ggplot2](/session_3/session_3.html)
## To go further: publication ready plots
Once you have created the graph you need for your publication, you have to save it.
You can do it with the `ggsave` function.
First save your plot in a variable :
```{r}
p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point()
```
Then save it in the wanted format:
```{r, eval=F}
ggsave("test_plot_1.png", p1, width = 12, height = 8, units = "cm")
```
```{r, eval=F}
ggsave("test_plot_1.pdf", p1, width = 12, height = 8, units = "cm")
```
You may also change the appearance of your plot by adding a `theme` layer to your plot:
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p1 + theme_bw()
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p1 + theme_minimal()
```
You may have to combine several plots, for that you can use the `cowplot` package which is a `ggplot2` extension.
First install it :
```{r, eval=F}
install.packages("cowplot")
```
```{r, include=F, echo =F}
if (! require("cowplot")) {
install.packages("cowplot")
}
```
Then you can use the function `plot` grid to combine plots in a publication ready style:
```{r,message=FALSE}
library(cowplot)
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p1 <- ggplot(data = new_mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
p1
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p2 <- ggplot(data = new_mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
p2
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
plot_grid(p1, p2, labels = c('A', 'B'), label_size = 12)
```
You can also save it in a file.
```{r, eval=F}
p_final = plot_grid(p1, p2, labels = c('A', 'B'), label_size = 12)
ggsave("test_plot_1_and_2.png", p_final, width = 20, height = 8, units = "cm")
```
You can learn more features about `cowplot` on [https://wilkelab.org/cowplot/articles/introduction.html](its website).
<div class="pencadre">
Use the `cowplot` documentation to reproduce this plot and save it.
</div>
```{r, echo=F}
p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() + theme_bw()
p2 <- ggplot(data = new_mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_point() + theme_bw()
p_row <- plot_grid(p1 + theme(legend.position = "none"), p2 + theme(legend.position = "none"), labels = c('A', 'B'), label_size = 12)
p_legend <- get_legend(p1 + theme(legend.position = "top"))
plot_grid(p_row, p_legend, nrow = 2, rel_heights = c(1,0.2))
```
<details><summary>Solution</summary>
<p>
```{r , echo = TRUE, eval = F}
p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() + theme_bw()
p2 <- ggplot(data = new_mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_point() + theme_bw()
p_row <- plot_grid(p1 + theme(legend.position = "none"), p2 + theme(legend.position = "none"), labels = c('A', 'B'), label_size = 12)
p_legend <- get_legend(p1 + theme(legend.position = "top"))
p_final <- plot_grid(p_row, p_legend, nrow = 2, rel_heights = c(1,0.2))
p_final
```
```{r , echo = TRUE, eval = F}
ggsave("plot_1_2_and_legend.png", p_final, width = 20, height = 8, units = "cm")
```
</p>
</details>
There are a lot of other available `ggplot2` extensions which can be useful (and also beautiful).
You can take a look at them here: [https://exts.ggplot2.tidyverse.org/gallery/]( ggplot2 gallery)
---
title: "R#2: introduction to Tidyverse"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "24 Oct 2019"
output:
slidy_presentation:
highlight: tango
beamer_presentation:
theme: metropolis
slide_level: 3
fig_caption: no
df_print: tibble
highlight: tango
latex_engine: xelatex
---
```{r setup, include=FALSE, cache=TRUE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
tmp <- tempfile(fileext = ".zip")
download.file("http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip",
tmp,
quiet = TRUE)
unzip(tmp, exdir = "data-raw")
new_class_level <- c(
"Compact Cars",
"Large Cars",
"Midsize Cars",
"Midsize Cars",
"Midsize Cars",
"Compact Cars",
"Minivan",
"Minivan",
"Pickup Trucks",
"Pickup Trucks",
"Pickup Trucks",
"Sport Utility Vehicle",
"Sport Utility Vehicle",
"Compact Cars",
"Special Purpose Vehicle",
"Special Purpose Vehicle",
"Special Purpose Vehicle",
"Special Purpose Vehicle",
"Special Purpose Vehicle",
"Special Purpose Vehicle",
"Sport Utility Vehicle",
"Sport Utility Vehicle",
"Pickup Trucks",
"Pickup Trucks",
"Pickup Trucks",
"Pickup Trucks",
"Sport Utility Vehicle",
"Sport Utility Vehicle",
"Compact Cars",
"Two Seaters",
"Vans",
"Vans",
"Vans",
"Vans"
)
new_fuel_level <- c(
"gas",
"Diesel",
"Regular",
"gas",
"gas",
"Regular",
"Regular",
"Hybrid",
"Hybrid",
"Regular",
"Regular",
"Hybrid",
"Hybrid"
)
read_csv("data-raw/vehicles.csv") %>%
select(
"id",
"make",
"model",
"year",
"VClass",
"trany",
"drive",
"cylinders",
"displ",
"fuelType",
"highway08",
"city08"
) %>%
rename(
"class" = "VClass",
"trans" = "trany",
"drive" = "drive",
"cyl" = "cylinders",
"displ" = "displ",
"fuel" = "fuelType",
"hwy" = "highway08",
"cty" = "city08"
) %>%
filter(drive != "") %>%
drop_na() %>%
arrange(make, model, year) %>%
mutate(class = factor(as.factor(class), labels = new_class_level)) %>%
mutate(fuel = factor(as.factor(fuel), labels = new_fuel_level)) %>%
write_csv("2_data.csv")
```
## R#2: introduction to Tidyverse
The goal of this practical is to familiarize yourself with `ggplot2`.
The objectives of this session will be to:
- Create basic plot with `ggplot2`
- Understand the `tibble` type
- Learn the different aesthetics in R plots
- Compose graphics
## Tidyverse
The tidyverse is a collection of R packages designed for data science.
All packages share an underlying design philosophy, grammar, and data structures.
```{r install_tidyverse, cache = TRUE, eval = FALSE}
install.packages("tidyverse")
```
```{r load_tidyverse, cache = TRUE}
library("tidyverse")
```
## Toy data set `mpg`
This dataset contains a subset of the fuel economy data that the EPA makes available on **http://fueleconomy.gov**. It contains only models which had a new release every year between 1999 and 2008.
```{r mpg_inspect, cache = TRUE, eval=FALSE}
?mpg
mpg
dim(mpg)
View(mpg)
```
## Updated version of the data
`mpg` is loaded with tidyverse, we want to be able to read our own data from
**http://perso.ens-lyon.fr/laurent.modolo/R/2_data.csv**
```{r mpg_download, cache=TRUE, message=FALSE}
new_mpg <- read_csv(
"http://perso.ens-lyon.fr/laurent.modolo/R/2_data.csv"
)
```
**http://perso.ens-lyon.fr/laurent.modolo/R/2_a**
## First plot with `ggplot2`
Relationship between engine size `displ` and fuel efficiency `hwy`.
```{r new_mpg_plot_a, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
## Composition of plot with `ggplot2`
Composition of plot with `ggplot2`
```R
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```
- you begin a plot with the function `ggplot()`
- you complete your graph by adding one or more layers
- `geom_point()` adds a layer with a scatterplot
- each geom function in `ggplot2` takes a `mapping` argument
- the `mapping` argument is always paired with `aes()`
## First challenge
- Run `ggplot(data = new_mpg)`. What do you see?
- How many rows are in `new_mpg`? How many columns?
- What does the `cty` variable describe? Read the help for `?mpg` to find out.
- Make a scatterplot of `hwy` vs. `cyl`.
- What happens if you make a scatterplot of `class` vs. `drive`? Why is the plot not useful?
## Run `ggplot(data = mpg)`. What do you see?
```{r empty_plot, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg)
```
## How many rows are in `new_mpg`? How many columns?
```{r size_of_mpg, cache = TRUE, fig.width=8, fig.height=4.5}
new_mpg
```
## Make a scatterplot of `hwy` vs. `cyl`.
```{r new_mpg_plot_b, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg) +
geom_point(mapping = aes(x = hwy, y = cyl))
```
## What happens if you make a scatterplot of `class` vs. `drive`?
Why is the plot not useful?
```{r new_mpg_plot_c, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg) +
geom_point(mapping = aes(x = class, y = drive))
```
## Aesthetic mappings
How can you explain these cars?
```{r new_mpg_plot_d, echo = FALSE, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_point(data = mpg %>% filter(class == "2seater"),
mapping = aes(x = displ, y = hwy), color = "red")
```
## Aesthetic mapping `color`
```{r new_mpg_plot_e, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
```
## Aesthetic mappings
`ggplot2` will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. `ggplot2` will also add a legend that explains which levels correspond to which values.
Try the following aesthetic:
- `size`
- `alpha`
- `shape`
## Aesthetic mapping `size`
```{r new_mpg_plot_f, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
```
## Aesthetic mapping `alpha`
```{r new_mpg_plot_g, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
```
## Aesthetic mapping `shape`
```{r new_mpg_plot_h, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```
## Aesthetic
You can also set the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue:
```{r new_mpg_plot_i, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```
## Second challenge
- What’s gone wrong with this code? Why are the points not blue?
```R
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
```
- Which variables in `mpg` are **categorical**? Which variables are **continuous**? (Hint: type `mpg`)
- Map a **continuous** variable to color, size, and shape.
- What does the `stroke` aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
- What happens if you map an aesthetic to something other than a variable name, like `color = displ < 5`?
## Facets
```{r new_mpg_plot_j, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~class)
```
## Facets
```{r new_mpg_plot_k, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~class, nrow = 2)
```
## Facets
```{r new_mpg_plot_l, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ fl + class, nrow = 2)
```
## Composition
There are different ways to represent the information
```{r new_mpg_plot_o, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
## Composition
There are different ways to represent the information
```{r new_mpg_plot_p, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
```
## Composition
We can add as many layers as we want
```{r new_mpg_plot_q, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
```
## Composition
We can avoid code duplication
```{r new_mpg_plot_r, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
```
## Composition
We can make `mapping` layer specific
```{r new_mpg_plot_s, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
```
## Composition
We can use different `data` for different layer (You will lean more on `filter()` later)
```{r new_mpg_plot_t, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"))
```
## Fird challenge
- Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
```R
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
```
**http://perso.ens-lyon.fr/laurent.modolo/R/2_d**
- What does `show.legend = FALSE` do?
- What does the `se` argument to `geom_smooth()` do?
## Third challenge
- Recreate the R code necessary to generate the following graph
```{r new_mpg_plot_u, echo = FALSE, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
```
## Third challenge
```{r new_mpg_plot_v, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
```
\ No newline at end of file
File added
---
title: 'R.3: Transformations with ggplot2'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2021"
output:
rmdformats::downcute:
self_contain: false
use_bookdown: true
default_style: "dark"
lightbox: true
css: "../src/style.css"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
# Introduction
## Introduction
In the last session, we have seen how to use `ggplot2` and [The Grammar of Graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/ref=as_li_ss_tl). The goal of this practical is to practices more advanced features of `ggplot2`.
......@@ -45,46 +35,63 @@ library("tidyverse")
</p>
</details>
Like in the previous sessions, it's good practice to create a new **.R** file to write your code instead of using directly the R terminal.
Like in the previous sessions, it's good practice to create a new **.R** file to write your code instead of using the R terminal directly.
# `ggplot2` statistical transformations
## `ggplot2` statistical transformations
In the previous session, we have plotted the data as they are by using the variable values as **x** or **y** coordinates, color shade, size or transparency.
When dealing with categorical variables, also called **factors**, it can be interesting to perform some simple statistical transformations.
For example, we may want to have coordinates on an axis proportional to the number of records for a given category.
We are going to use the `diamonds` data set included in `tidyverse`.
- Use the `help` and `view` command to explore this data set.
<div class="pencadre">
- Use the `help` and `View` command to explore this data set.
- How much records does this dataset contain ?
- Try the `str` command, which information are displayed ?
</div>
```{r str_diamon}
str(diamonds)
```
We saw scatterplot (`geom_point()`), smoothplot (`geom_smooth()`). Now barplot with `geom_bar()` :
### Introduction to `geom_bar`
```{r diamonds_barplot, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
We saw scatterplot (`geom_point()`), smoothplot (`geom_smooth()`).
Now barplot with `geom_bar()` :
```{r diamonds_barplot, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = diamonds, mapping = aes(x = cut)) +
geom_bar()
```
More diamonds are available with high quality cuts.
On the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds!
On the x-axis, the chart displays **cut**, a variable from diamonds. On the y-axis, it displays **count**, **but count is not a variable in diamonds!**
The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation. The figure below describes how this process works with `geom_bar()`.
![](img/visualization-stat-bar.png)
### **geom** and **stat**
The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation.
The figure below describes how this process works with `geom_bar()`.
![](img/visualization-stat-bar.png)
You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using `stat_count()` instead of `geom_bar()`:
You can generally use **geoms** and **stats** interchangeably. For example, you can recreate the previous plot using `stat_count()` instead of `geom_bar()`:
```{r diamonds_stat_count, include=TRUE, fig.width=8, fig.height=4.5, message=FALSE}
```{r diamonds_stat_count, include=TRUE, fig.width=8, fig.height=4.5}
ggplot(data = diamonds, mapping = aes(x = cut)) +
stat_count()
```
Every geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. There are three reasons you might need to use a stat explicitly:
Every **geom** has a default **stat**; and every **stat** has a default **geom**. This means that you can typically use **geoms** without worrying about the underlying statistical transformation. There are three reasons you might need to use a **stat** explicitly:
- You might want to override the default stat.
### Why **stat** ?
You might want to override the default stat.
For example, in the following `demo` dataset we already have a variable for the **counts** per `cut`.
```{r 3_a, include=TRUE, fig.width=8, fig.height=4.5}
demo <- tribble(
......@@ -101,36 +108,66 @@ demo <- tribble(
to guess at their meaning from the context, and you will learn exactly what
they do soon!)
<div class="pencadre">
So instead of using the default `geom_bar` parameter `stat = "count"` try to use `"identity"`
</div>
<details><summary>Solution</summary>
<p>
```{r 3_ab, include=TRUE, fig.width=8, fig.height=4.5}
ggplot(data = demo, mapping = aes(x = cut, y = freq)) +
geom_bar(stat = "identity")
```
</p>
</details>
You might want to override the default mapping from transformed variables to aesthetics ( e.g., proportion).
- You might want to override the default mapping from transformed variables to aesthetics ( e.g. proportion).
```{r 3_b, include=TRUE, fig.width=8, fig.height=4.5}
ggplot(data = diamonds, mapping = aes(x = cut, y = ..prop.., group = 1)) +
geom_bar()
```
- In our proportion bar chart, we need to set `group = 1`. Why?
<div class="pencadre">
In our proportion bar chart, we need to set `group = 1`. Why?
</div>
<details><summary>Solution</summary>
<p>
```{r diamonds_stats_challenge, include=TRUE, message=FALSE, fig.width=8, fig.height=4.5}
ggplot(data = diamonds, mapping = aes(x = cut, y = ..prop..)) +
geom_bar()
```
If group is not used, the proportion is calculated with respect to the data that contains that field and is ultimately going to be 100% in any case. For instance, The proportion of an ideal cut in the ideal cut specific data will be 1.
If group is not used, the proportion is calculated with respect to the data that contains that field and is ultimately going to be 100% in any case. For instance, the proportion of an ideal cut in the ideal cut specific data will be 1.
</p>
</details>
### More details with `stat_summary`
- You might want to draw greater attention to the statistical transformation in your code.
you might use stat_summary(), which summarises the y values for each unique x
value, to draw attention to the summary that you are computing:
<div class="pencadre">
You might want to draw greater attention to the statistical transformation in your code.
you might use `stat_summary()`, which summarize the **y** values for each unique **x**
value, to draw attention to the summary that you are computing
</div>
<details><summary>Solution</summary>
<p>
```{r 3_c, include=TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
stat_summary()
```
</p>
</details>
<div class="pencadre">
Set the `fun.min`, `fun.max` and `fun` to the `min`, `max` and `median` function respectively
</div>
<details><summary>Solution</summary>
<p>
```{r 3_d, include=TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
stat_summary(
fun.min = min,
......@@ -138,54 +175,80 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
fun = median
)
```
</p>
</details>
## Coloring area plots
# Position adjustments
You can colour a bar chart using either the `color` aesthetic,
<div class="pencadre">
You can color a bar chart using either the `color` aesthetic, or, more usefully `fill`:
Try both solutions on a `cut`, histogram.
</div>
<details><summary>Solution</summary>
<p>
```{r diamonds_barplot_color, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, color = cut)) +
geom_bar()
```
or, more usefully, `fill`:
```{r diamonds_barplot_fill, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
geom_bar()
```
</p>
</details>
<div class="pencadre">
You can also use `fill` with another variable:
Try to color by `clarity`. Is `clarity` a continuous or categorial variable ?
</div>
<details><summary>Solution</summary>
<p>
```{r diamonds_barplot_fill_clarity, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar()
```
</p>
</details>
## Position adjustments
The stacking is performed by the position adjustment `position`
The stacking of the `fill` parameter is performed by the position adjustment `position`
## fill
<div class="pencadre">
Try the following `position` parameter for `geom_bar`: `"fill"`, `"dodge"` and `"jitter"`
</div>
<details><summary>Solution</summary>
<p>
```{r diamonds_barplot_pos_fill, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar( position = "fill")
```
## dodge
```{r diamonds_barplot_pos_dodge, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar( position = "dodge")
```
## jitter
```{r diamonds_barplot_pos_jitter, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar( position = "jitter")
```
</p>
</details>
`jitter` is often used for plotting points when they are stacked on top of each other.
<div class="pencadre">
Compare `geom_point` to `geom_jitter` plot `cut` versus `depth` and color by `clarity`
</div>
<details><summary>Solution</summary>
<p>
```{r dia_jitter2, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
geom_point()
......@@ -195,61 +258,166 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
geom_jitter()
```
</p>
</details>
<div class="pencadre">
What parameters of `geom_jitter` control the amount of jittering ?
</div>
<details><summary>Solution</summary>
<p>
```{r dia_jitter4, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
geom_jitter(width = .1, height = .1)
```
</p>
</details>
## violin
In the `geom_jitter` plot that we made, we cannot really see the limits of the different clarity groups. Instead we can use the `geom_violin` to see their density.
<details><summary>Solution</summary>
<p>
```{r dia_violon, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
geom_violin()
```
</p>
</details>
# Coordinate systems
## Coordinate systems
Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.
```{r dia_boxplot, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
geom_boxplot()
```
<div class="pencardre">
Add the `coord_flip()` layer to the previous plot
</div>
<details><summary>Solution</summary>
<p>
```{r dia_boxplot_flip, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
geom_boxplot() +
coord_flip()
```
</p>
</details>
<div class="pencardre">
Add the `coord_polar()` layer to this plot:
```{r diamonds_bar, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE, eval=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
geom_bar( show.legend = FALSE, width = 1 ) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
```
</div>
```{r dia_12, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = depth, y = table)) +
geom_point() +
geom_abline()
<details><summary>Solution</summary>
<p>
```{r diamonds_bar2, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
geom_bar( show.legend = FALSE, width = 1 ) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL) +
coord_polar()
```
</p>
</details>
By combining the right **geom**, **coordinates** and **faceting** functions, you can build a large number of different plots to present your results.
## See you in [R.4: data transformation](/session_4/session_4.html)
## To go further: animated plots from xls files
In order to be able to read information from a xls file, we will use the `openxlsx` packages. To generate animation we will use the `ggannimate` package. The additional `gifski` package will allow R to save your animation in the gif format (Graphics Interchange Format)
```{r dia_quickmap, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = depth, y = table)) +
geom_point() +
geom_abline() +
coord_quickmap()
```{r install_readxl, eval=F}
install.packages(c("openxlsx", "gganimate", "gifski"))
```
```{r load_readxl}
library(openxlsx)
library(gganimate)
library(gifski)
```
<div class="pencardre">
Use the `openxlsx` package to save the [https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx](https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx) file to the `gapminder` variable
</div>
<details><summary>Solution</summary>
<p>
2 solutions :
Use directly the url
```{r load_xlsx_url, eval = F}
gapminder <- read.xlsx("https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx")
```
```{r diamonds_bar, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
bar <- ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
geom_bar( show.legend = FALSE, width = 1 ) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
Dowload the file, save it in the same directory as your script then use the local path
```{r load_xlsx}
gapminder <- read.xlsx("gapminder.xlsx")
```
bar
</p>
</details>
This dataset contains 4 variables of interest for us to display per country:
- `gdpPercap` the GDP par capita (US$, inflation-adjusted)
- `lifeExp` the life expectancy at birth, in years
- `pop` the population size
- `contient` a factor with 5 levels
<div class="pencardre">
Using `ggplot2`, build a scatterplot of the `gdpPercap` vs `lifeExp`. Add the `pop` and `continent` information to this plot.
</div>
<details><summary>Solution</summary>
<p>
```{r gapminder_plot_a}
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point()
```
</p>
</details>
<div class="pencardre">
What's wrong ?
You can use the `scale_x_log10()` to display the `gdpPercap` on the `log10` scale.
</div>
```{r diamonds_bar_polar, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
bar + coord_polar()
<details><summary>Solution</summary>
<p>
```{r gapminder_plot_b}
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point() +
scale_x_log10()
```
</p>
</details>
<div class="pencardre">
We would like to add the `year` information to the plots. We could use a `facet_wrap`, but instead we are going to use the `gganimate` package.
For this we need to add a `transition_time` layer that will take as an argument `year` to our plot.
</div>
<details><summary>Solution</summary>
<p>
```{r gapminder_plot_c}
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point() +
scale_x_log10() +
transition_time(year) +
labs(title = 'Year: {as.integer(frame_time)}')
```
</p>
</details>
\ No newline at end of file
---
title: "R#3: stats with ggplot2"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "08 Nov 2019"
output:
slidy_presentation:
highlight: tango
beamer_presentation:
theme: metropolis
slide_level: 3
fig_caption: no
df_print: tibble
highlight: tango
latex_engine: xelatex
---
```{r setup, include=FALSE, cache=TRUE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
```
## R#3: stats with ggplot2
The goal of this practical is to practices advanced features of `ggplot2`.
The objectives of this session will be to:
- learn about statistical transformations
- practices position adjustments
- change the coordinate systems
## `ggplot2` statistical transformations
We are going to use the `diamonds` data set included in `tidyverse`.
- Use the `help` and `View` command to explore this data set.
- Try the `str` command, which information are displayed ?
## `ggplot2` statistical transformations
```{r diamonds_barplot, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
More diamonds are available with high quality cuts.
## `ggplot2` statistical transformations
On the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds!
The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation. The figure below describes how this process works with `geom_bar()`.
\includegraphics[width=\textwidth]{img/visualization-stat-bar.png}
## `ggplot2` statistical transformations
You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using `stat_count()` instead of `geom_bar()`:
```{r diamonds_stat_count, eval=FALSE, message=FALSE}
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
```
## `ggplot2` statistical transformations
Every geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. There are three reasons you might need to use a stat explicitly:
- You might want to override the default stat. **3_a**
- You might want to override the default mapping from transformed variables to aesthetics. **3_b**
- You might want to draw greater attention to the statistical transformation in your code. **3_c**
## Statistical transformation challenge
- What does `geom_col()` do? How is it different to `geom_bar()`?
- What variables does `stat_smooth()` compute? What parameters control its behaviour?
- In our proportion bar chart, we need to set `group = 1`. Why? In other words what is the problem with these two graphs?
```{r diamonds_stats_challenge, eval=FALSE, message=FALSE}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop..))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
```
## Position adjustments
You can colour a bar chart using either the `colour` aesthetic, or, more usefully, `fill`:
```{r diamonds_barplot_color, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
```
## Position adjustments
You can colour a bar chart using either the `colour` aesthetic, or, more usefully, `fill`:
```{r diamonds_barplot_fill, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
```
## Position adjustments
You can also use `fill` with another variable:
```{r diamonds_barplot_fill_clarity, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
```
## Position adjustments
The stacking is performed by the position adjustment `position`
```{r diamonds_barplot_pos_identity, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds,
mapping = aes(x = cut, colour = clarity)) +
geom_bar(fill = NA, position = "identity")
```
## Position adjustments
The stacking is performed by the position adjustment `position`
```{r diamonds_barplot_pos_fill, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity),
position = "fill")
```
## Position adjustments
The stacking is performed by the position adjustment `position`
```{r diamonds_barplot_pos_dodge, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity),
position = "dodge")
```
## Position adjustments
The stacking is performed by the position adjustment `position`
```{r mpg_point_pos_jitter, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy),
position = "jitter")
```
## Position adjustments
The stacking is performed by the position adjustment `position`
```{r mpg_jitter, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg) +
geom_jitter(mapping = aes(x = displ, y = hwy))
```
## Position adjustments challenges
- What is the problem with this plot? How could you improve it?
```{r mpg_point, eval=F, message=FALSE}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
```
- What parameters to `geom_jitter()` control the amount of jittering?
- Compare and contrast `geom_jitter()` with `geom_count()`
- What’s the default position adjustment for `geom_boxplot()` ? Create a visualisation of the `mpg` dataset that demonstrates it.
## Coordinate systems
Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.
## Coordinate systems
```{r mpg_boxplot, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
```
## Coordinate systems
```{r mpg_boxplot_flip, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
```
## Coordinate systems
```{r diamonds_bar, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
bar <- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
show.legend = FALSE,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
```
**3_d**
## Coordinate systems
```{r diamonds_bar_plot, echo=F, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
bar
```
**3_d**
## Coordinate systems
```{r diamonds_bar_flip, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
bar + coord_flip()
```
## Coordinate systems
```{r mpg_jitter_noquickmap, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg) +
geom_jitter(mapping = aes(x = cty, y = hwy))
```
## Coordinate systems
```{r mpg_jitter_quickmap, cache = TRUE, fig.width=3.5, fig.height=3.5, message=FALSE}
ggplot(data = mpg) +
geom_jitter(mapping = aes(x = cty, y = hwy)) +
coord_quickmap()
```
## Coordinate systems
```{r mpg_jitter_log, cache = TRUE, fig.width=8.5, fig.height=3.5, message=FALSE}
ggplot(data = mpg) +
geom_jitter(mapping = aes(x = cty, y = hwy)) +
scale_y_log10() +
scale_x_log10()
```
## Coordinate systems
```{r diamonds_bar_polar, cache = TRUE, fig.width=5, fig.height=3.5, message=FALSE}
bar + coord_polar()
```
## Coordinate systems challenges
- Turn a stacked bar chart into a pie chart using `coord_polar()`.
- What does `labs()` do? Read the documentation.
- What does the plot below tell you about the relationship between `city` and highway `mpg`? Why is `coord_fixed()` important? What does `geom_abline()` do?
```{r mpg_point_fixed, eval = F, cache = TRUE, fig.width=4.5, fig.height=3.5, message=FALSE}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
```
## Coordinate systems challenges
```{r diamonds_barplot_pos_fill_polar, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity),
position = "fill") +
coord_polar()
```
## Coordinate systems challenges
```{r mpg_point_nofixed_plot, eval = T, cache = TRUE, fig.width=8, fig.height=3.5, message=FALSE}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() + geom_abline()
```
## Coordinate systems challenges
```{r mpg_point_fixed_plot, eval = T, cache = TRUE, fig.width=8, fig.height=3.5, message=FALSE}
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() + geom_abline() + coord_fixed()
```
This diff is collapsed.
This diff is collapsed.
---
title: "R#4: data transformation"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "Mars 2020"
output:
html_document: default
pdf_document: default
---
<style type="text/css">
h3 { /* Header 3 */
position: relative ;
color: #729FCF ;
left: 5%;
}
h2 { /* Header 2 */
color: darkblue ;
left: 10%;
}
h1 { /* Header 1 */
color: #034b6f ;
}
#pencadre{
border:1px;
border-style:solid;
border-color: #034b6f;
background-color: #EEF3F9;
padding: 1em;
text-align: center ;
border-radius : 5px 4px 3px 2px;
}
legend{
color: #034b6f ;
}
#pquestion {
color: darkgreen;
font-weight: bold;
}
</style>
```{r setup, include=FALSE, cache=TRUE}
knitr::opts_chunk$set(echo = TRUE)
```
The goal of this practical is to practices data transformation with `tidyverse`.
The objectives of this session will be to:
- Filter rows with `filter()`
- Arrange rows with `arrange()`
- Select columns with `select()`
- Add new variables with `mutate()`
- Combining multiple operations with the pipe `%>%`
```R
install.packages("nycflights13")
```
```{r packageloaded, include=TRUE, message=FALSE}
library("tidyverse")
library("nycflights13")
```
\
# Data set : nycflights13
`nycflights13::flights`contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in `?flights`
```{r display_data, include=TRUE}
flights
```
- **int** stands for integers.
- **dbl** stands for doubles, or real numbers.
- **chr** stands for character vectors, or strings.
- **dttm** stands for date-times (a date + a time).
- **lgl** stands for logical, vectors that contain only TRUE or FALSE.
- **fctr** stands for factors, which R uses to represent categorical variables with fixed possible values.
- **date** stands for dates.
\
# Filter rows with `filter()`
`filter()` allows you to subset observations based on their values.
```{r filter_month_day, include=TRUE}
filter(flights, month == 1, day == 1)
```
\
`dplyr` functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, `<-`
```{r filter_month_day_sav, include=TRUE}
jan1 <- filter(flights, month == 1, day == 1)
```
\
R either prints out the results, or saves them to a variable.
```{r filter_month_day_sav_display, include=TRUE}
(dec25 <- filter(flights, month == 12, day == 25))
```
\
# Logical operators
Multiple arguments to `filter()` are combined with “and”: every expression must be true in order for a row to be included in the output.
![](./img/transform-logical.png)
\
Test the following operations:
```{r filter_logical_operators, include=TRUE}
filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
```
\
# Missing values
One important feature of R that can make comparison tricky are missing values, or `NA`s (“not availables”).
```{r filter_logical_operators_NA, include=TRUE}
NA > 5
10 == NA
NA + 10
```
```{r filter_logical_operators_test_NA, include=TRUE}
is.na(NA)
```
\
# Arrange rows with `arrange()`
\
`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
```{r arrange_ymd, include=TRUE}
arrange(flights, year, month, day)
```
\
Use `desc()` to re-order by a column in descending order:
```{r arrange_desc, include=TRUE}
arrange(flights, desc(dep_delay))
```
Missing values are always sorted at the end:
```{r arrange_NA, include=TRUE}
arrange(tibble(x = c(5, 2, NA)), x)
arrange(tibble(x = c(5, 2, NA)), desc(x))
```
\
# Select columns with `select()`
\
`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
```{r select_ymd, , include=TRUE}
select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))
```
\
here are a number of helper functions you can use within `select()`:
- `starts_with("abc")`: matches names that begin with “abc”.
- `ends_with("xyz")`: matches names that end with “xyz”.
- `contains("ijk")`: matches names that contain “ijk”.
- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
See `?select` for more details.
\
# Add new variables with `mutate()`
\
It’s often useful to add new columns that are functions of existing columns. That’s the job of `mutate()`.
```{r mutate, include=TRUE}
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)
flights_sml
mutate(flights_sml, gain = dep_delay - arr_delay,
speed = distance / air_time * 60)
```
\
```{r mutate_reuse, include=TRUE}
flights_sml <- mutate(flights_sml, gain = dep_delay - arr_delay,
speed = distance / air_time * 60)
```
\
### Useful creation functions
- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means.
- Logical comparisons, `<`, `<=`, `>`, `>=`, `!=`, and `==`
- Ranking: there are a number of ranking functions, but you should start with `min_rank()`. There is also `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`
\
# Combining multiple operations with the pipe
\
We don't want to create useless intermediate variables so we can use the pipe operator: `%>%`
( or `ctrl + shift + M`).
<div id="pquestion"> - Find the 10 most delayed flights using a ranking function. `min_rank()` </div>
```{r pipe_example_a, include=TRUE}
flights_md <- mutate(flights,
most_delay = min_rank(desc(dep_delay)))
flights_md <- filter(flights_md, most_delay < 10)
flights_md <- arrange(flights_md, most_delay)
```
\
```{r pipe_example_b, include=TRUE}
flights_md2 <- flights %>%
mutate(most_delay = min_rank(desc(dep_delay))) %>%
filter(most_delay < 10) %>%
arrange(most_delay)
select(flights_md2, year:day, flight, origin, dest, dep_delay, most_delay)
```
\
Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
\
Working with the pipe is one of the key criteria for belonging to the `tidyverse`. The only exception is `ggplot2`: it was written before the pipe was discovered. Unfortunately, the next iteration of `ggplot2`, `ggvis`, which does use the pipe, isn’t quite ready for prime time yet.
# Grouped summaries with `summarise()`
`summarise()` collapses a data frame to a single row:
```{r load_data, include=TRUE}
flights %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
```
### The power of `summarise()` with `group_by()`
This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the `dplyr` verbs on a grouped data frame they’ll be automatically applied “by group”.
```{r summarise_group_by, include=TRUE, fig.width=8, fig.height=3.5}
flights_delay <- flights %>%
group_by(year, month) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE), sd = sd(dep_delay, na.rm = TRUE)) %>%
arrange(month)
flights_delay
ggplot(data = flights_delay, mapping = aes(x = month, y = delay)) +
geom_bar(stat="identity", color="black", fill = "#619CFF") +
geom_errorbar(mapping = aes( ymin=0, ymax=delay+sd)) +
theme(axis.text.x = element_blank())
```
### Missing values
You may have wondered about the na.rm argument we used above. What happens if we don’t set it?
```{r summarise_group_by_NA, include=TRUE}
flights %>%
group_by(dest) %>%
summarise(
dist = mean(distance),
delay = mean(arr_delay)
)
```
Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value.
# Counts
Whenever you do any aggregation, it’s always a good idea to include either a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
```{r summarise_group_by_count, include = TRUE, warning=F, message=F, fig.width=8, fig.height=3.5}
summ_delay_filghts <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
summ_delay_filghts
ggplot(data = summ_delay_filghts, mapping = aes(x = dist, y = delay, size = count)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme(legend.position='none')
```
## Thank you !
\
## For curious or motivated people: Challenge time!
\
\
---
title: "Challenge time!"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "Mars 2020"
output:
html_document: default
pdf_document: default
---
<style type="text/css">
h3 { /* Header 3 */
position: relative ;
color: #729FCF ;
left: 5%;
}
h2 { /* Header 2 */
color: darkblue ;
left: 10%;
}
h1 { /* Header 1 */
color: #034b6f ;
}
#pencadre{
border:1px;
border-style:solid;
border-color: #034b6f;
background-color: #EEF3F9;
padding: 1em;
text-align: center ;
border-radius : 5px 4px 3px 2px;
}
legend{
color: #034b6f ;
}
#pquestion {
color: darkgreen;
font-weight: bold;
}
</style>
```{r setup, include=FALSE, cache=TRUE}
knitr::opts_chunk$set(echo = TRUE)
```
### Filter challenges :
Find all flights that:
- Had an arrival delay of two or more hours
- Were operated by United, American, or Delta
- Departed between midnight and 6am (inclusive)
Another useful dplyr filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges?
How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent?
Why is `NA ^ 0` not `NA`? Why is `NA | TRUE` not `NA`? Why is `FALSE & NA` not `NA`? Can you figure out the general rule? (`NA * 0` is a tricky counter-example!)
### Arrange challenges :
- Sort flights to find the most delayed flights. Find the flights that left earliest.
- Sort flights to find the fastest flights.
- Which flights traveled the longest? Which traveled the shortest?
### Select challenges :
- Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
- What does the `one_of()` function do? Why might it be helpful in conjunction with this vector?
```{r select_one_of, eval=F, message=F, cache=T}
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
```
- Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
```{r select_contains, eval=F, message=F, cache=T}
select(flights, contains("TIME"))
```
### Mutate challenges :
- Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
```{r mutate_challenges_a, eval=F, message=F, cache=T}
mutate(
flights,
dep_time = (dep_time %/% 100) * 60 +
dep_time %% 100,
sched_dep_time = (sched_dep_time %/% 100) * 60 +
sched_dep_time %% 100
)
```
\
- Compare `dep_time`, `sched_dep_time`, and `dep_delay`. How would you expect those three numbers to be related?
```{r mutate_challenge_b, eval=F, message=F, cache=T}
mutate(
flights,
dep_time = (dep_time %/% 100) * 60 +
dep_time %% 100,
sched_dep_time = (sched_dep_time %/% 100) * 60 +
sched_dep_time %% 100
)
```
\
### Challenge with `summarise()` and `group_by()`
Imagine that we want to explore the relationship between the distance and average delay for each location.
here are three steps to prepare this data:
- Group flights by destination.
- Summarise to compute distance, average delay, and number of flights.
- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
```{r summarise_group_by_ggplot_a, eval = F}
flights %>%
group_by(dest)
```
\
Imagine that we want to explore the relationship between the distance and average delay for each location.
- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
```{r summarise_group_by_ggplot_b, eval = F}
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
```
session_4/img/colorsR.png

370 KiB

session_4/img/transform-logical.png

70.2 KiB | W: | H:

session_4/img/transform-logical.png

82.8 KiB | W: | H:

session_4/img/transform-logical.png
session_4/img/transform-logical.png
session_4/img/transform-logical.png
session_4/img/transform-logical.png
  • 2-up
  • Swipe
  • Onion skin