-
Laurent Modolo authoredLaurent Modolo authored
title: "R#6: tidydata"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "19 Dec 2019"
output:
slidy_presentation:
highlight: tango
beamer_presentation:
theme: metropolis
slide_level: 3
fig_caption: no
df_print: tibble
highlight: tango
latex_engine: xelatex
library(tidyverse)
library(nycflights13)
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
Tidydata
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
library(tidyverse)
pivot longer
table4a # number of TB cases
pivot longer
table4a %>%
pivot_longer(-country,
names_to = "year",
values_to = "case")
pivot wider
table2
pivot wider
table2 %>%
pivot_wider(names_from = type,
values_from = count)
Relational data
Sometime the information can be split between different table
library(nycflights13)
flights
airlines
airports
weather
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
Relational data
knitr::include_graphics('img/relational-nycflights.png')
joints
knitr::include_graphics('img/join-venn.png')
inner_joint()
Matches pairs of observations whenever their keys are equal
flights2 %>%
inner_join(airlines)
left_joint()
keeps all observations in x
flights2 %>%
left_join(airlines)
right_joint()
keeps all observations in y
flights2 %>%
right_join(airlines)
full_joint()
keeps all observations in x
and y
flights2 %>%
full_join(airlines)
Defining the key columns
The default, by = NULL
, uses all variables that appear in both tables, the so called natural join.
flights2 %>%
left_join(weather)
Defining the key columns
The default, by = NULL
, uses all variables that appear in both tables, the so called natural join.
flights2 %>%
left_join(planes, by = "tailnum")
Defining the key columns
A named character vector: by = c("a" = "b")
. This will match variable a
in table x
to variable b
in table y
.
flights2 %>%
left_join(airports, c("dest" = "faa"))
Filtering joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
-
semi_join(x, y)
keeps all observations inx
that have a match iny
. -
anti_join(x, y)
drops all observations inx
that have a match iny
.
Filtering joins
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
flights %>%
semi_join(top_dest)
Set operations
These expect the x and y inputs to have the same variables, and treat the observations like sets:
-
intersect(x, y)
: return only observations in bothx
andy
. -
union(x, y)
: return unique observations inx
andy
. -
setdiff(x, y)
: return observations inx
, but not iny
.