R#2: introduction to Tidyverse

Laurent Modolo laurent.modolo@ens-lyon.fr

24 Oct 2019

The goal of this practical is to familiarize yourself with ggplot2.

The objectives of this session will be to:

Tidyverse

The tidyverse is a collection of R packages designed for data science.

All packages share an underlying design philosophy, grammar, and data structures.

install.packages("tidyverse")
library("tidyverse")
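
Running install.packages() on every session downloads the packages again. A minimal sketch of one way to avoid that, assuming you only want to install when the package is missing (requireNamespace() is base R; the guard itself is not part of the original practical):

```r
# Install the tidyverse only when it is not already available,
# so re-running the script does not trigger a fresh download.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Attach the core tidyverse packages (ggplot2, dplyr, readr, ...).
library("tidyverse")
```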

Toy data set mpg

This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. It contains only models which had a new release every year between 1999 and 2008.

?mpg
mpg
dim(mpg)
View(mpg)
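
View() opens a spreadsheet-style viewer and is mostly useful in an interactive session such as RStudio. As a console alternative, the sketch below uses glimpse() and names(); these calls are a suggestion and are not part of the original slides.

```r
library(tidyverse)

# glimpse() prints one line per column: name, type and first values.
glimpse(mpg)

# Column names only, handy when writing aes() mappings later on.
names(mpg)
```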

Updated version of the data

mpg is loaded with the tidyverse (it ships with ggplot2). We also want to be able to read our own data, here from http://perso.ens-lyon.fr/laurent.modolo/R/2_data.csv

new_mpg <- read_csv(
  "http://perso.ens-lyon.fr/laurent.modolo/R/2_data.csv"
  )
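
read_csv() accepts a local file path as well as a URL. The sketch below shows the round trip on a temporary file; the file, the variable names tmp_path and local_mpg, and the five-row subset are illustrative assumptions, not data from the practical.

```r
library(tidyverse)

# Write a few rows of the built-in mpg data to a temporary CSV file
# (illustrative only; the practical reads from the URL above).
tmp_path <- tempfile(fileext = ".csv")
write_csv(head(mpg, 5), tmp_path)

# read_csv() guesses a type for each column and returns a tibble.
local_mpg <- read_csv(tmp_path)
local_mpg
```

The message printed by read_csv() lists the type guessed for each column, which is worth checking before plotting.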

http://perso.ens-lyon.fr/laurent.modolo/R/2_a

First plot with ggplot2

Relationship between engine size displ and fuel efficiency hwy.

ggplot(data = new_mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

Composition of plot with ggplot2

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
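
To make the template concrete, here is one possible way to fill the three slots, using the built-in mpg data and a boxplot geom. The choice of geom_boxplot() and of the class and hwy variables is an arbitrary illustration.

```r
library(ggplot2)

# <DATA>          -> mpg
# <GEOM_FUNCTION> -> geom_boxplot
# <MAPPINGS>      -> x = class, y = hwy
p_template <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))
p_template
```

Swapping only the geom function while keeping the data and mappings is often enough to get a different view of the same variables.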

First challenge

Run ggplot(data = new_mpg). What do you see?

ggplot(data = new_mpg)

How many rows are in new_mpg? How many columns?

new_mpg
## # A tibble: 40,440 x 12
##       id make  model   year class trans drive   cyl displ fuel    hwy   cty
##    <dbl> <chr> <chr>  <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
##  1 13309 Acura 2.2CL…  1997 Subc… Auto… Fron…     4   2.2 Regu…    26    20
##  2 13310 Acura 2.2CL…  1997 Subc… Manu… Fron…     4   2.2 Regu…    28    22
##  3 13311 Acura 2.2CL…  1997 Subc… Auto… Fron…     6   3   Regu…    26    18
##  4 14038 Acura 2.3CL…  1998 Subc… Auto… Fron…     4   2.3 Regu…    27    19
##  5 14039 Acura 2.3CL…  1998 Subc… Manu… Fron…     4   2.3 Regu…    29    21
##  6 14040 Acura 2.3CL…  1998 Subc… Auto… Fron…     6   3   Regu…    26    17
##  7 14834 Acura 2.3CL…  1999 Subc… Auto… Fron…     4   2.3 Regu…    27    20
##  8 14835 Acura 2.3CL…  1999 Subc… Manu… Fron…     4   2.3 Regu…    29    21
##  9 14836 Acura 2.3CL…  1999 Subc… Auto… Fron…     6   3   Regu…    26    17
## 10 11789 Acura 2.5TL   1995 Comp… Auto… Fron…     5   2.5 Prem…    23    18
## # … with 40,430 more rows

Make a scatterplot of hwy vs. cyl.

ggplot(data = new_mpg) + 
  geom_point(mapping = aes(x = hwy, y = cyl))

What happens if you make a scatterplot of class vs. drive?

Why is the plot not useful?

ggplot(data = new_mpg) + 
  geom_point(mapping = aes(x = class, y = drive))
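
Both variables are categorical, so many cars are drawn on top of each other at the same position. A hedged sketch of how to make the hidden counts visible, written against the built-in mpg data (where the drive column is named drv) so that it runs without downloading new_mpg:

```r
library(tidyverse)

# How many cars share each class / drive-train combination?
combos <- count(mpg, class, drv)
combos

# geom_count() maps the number of overlapping observations to point size.
ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))
```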

Aesthetic mappings

How can you explain these cars?

Aesthetic mapping color

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Aesthetic mappings

ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. ggplot2 will also add a legend that explains which levels correspond to which values.
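
The result of scaling can be inspected with ggplot_build(), which returns the data as it will be drawn. The sketch below checks that each class received exactly one colour; the variable names p_color and built are illustrative.

```r
library(ggplot2)

p_color <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# First layer of the built plot: one row per point, after scaling.
built <- ggplot_build(p_color)$data[[1]]

# Number of distinct classes and number of distinct colours assigned.
length(unique(mpg$class))
length(unique(built$colour))
```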

Try the following aesthetics:

Aesthetic mapping size

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = class))

Aesthetic mapping alpha

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

Aesthetic mapping shape

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
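
By default the shape palette of ggplot2 provides only six shapes, so with the seven classes in mpg one class is left unplotted and a warning is issued. One possible workaround, sketched here, is to supply the shapes manually; the values 0 to 6 are an arbitrary choice.

```r
library(ggplot2)

# Assumption: shapes 0 to 6 are picked arbitrarily, one per class in mpg.
p_shape <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)
p_shape
```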

Aesthetic

You can also set the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
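
Other properties can be set in the same way, as arguments of the geom placed outside aes(). In this sketch the size, shape and alpha values are arbitrary choices for illustration.

```r
library(ggplot2)

# Arguments outside aes() apply one fixed value to every point.
# size 3, shape 17 (filled triangle) and alpha 0.5 are arbitrary.
p_fixed <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    color = "blue", size = 3, shape = 17, alpha = 0.5
  )
p_fixed
```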

Second challenge

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

Facets

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~class)

Facets

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~class, nrow = 2)

Facets

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ fl + class, nrow = 2)
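
With two variables in the formula, facet_wrap() draws one panel per combination of fl and class that actually occurs in the data. A small sketch, assuming dplyr's distinct() is used to list those combinations beforehand:

```r
library(tidyverse)

# Each row is one fuel type / class pair present in mpg,
# and therefore one panel in the faceted plot.
panel_combos <- distinct(mpg, fl, class)
panel_combos
nrow(panel_combos)
```

Checking this number first helps to choose a sensible nrow or to decide that the plot would have too many panels to be readable.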

Composition

There are different ways to represent the information

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

Composition

There are different ways to represent the information

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))

Composition

We can add as many layers as we want

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

Composition

We can avoid code duplication

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() +
  geom_smooth()

Composition

We can make mappings layer-specific

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) +
  geom_smooth()

Composition

We can use different data for different layers (you will learn more about filter() later)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = filter(mpg, class == "subcompact"))
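
To see what the smoothing layer receives, the filter() call can be run on its own. This is a minimal sketch; the name subcompact_mpg is illustrative.

```r
library(tidyverse)

# filter() keeps only the rows for which the condition is TRUE.
subcompact_mpg <- filter(mpg, class == "subcompact")
subcompact_mpg

# The columns are unchanged, so the x and y mappings inherited
# from ggplot() still apply to this smaller table.
names(subcompact_mpg)
```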

Third challenge

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

http://perso.ens-lyon.fr/laurent.modolo/R/2_d

Third challenge

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() +
  geom_smooth(mapping = aes(linetype = drv))