Skip to content
Snippets Groups Projects
title: "R#6: tidydata"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "19 Dec 2019"
output:
  slidy_presentation:
    highlight: tango
  beamer_presentation:
    theme: metropolis
    slide_level: 3
    fig_caption: no
    df_print: tibble
    highlight: tango
    latex_engine: xelatex
library(tidyverse)
library(nycflights13)
flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)

Tidydata

There are three interrelated rules which make a dataset tidy:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.
library(tidyverse)

pivot longer

table4a # number of TB cases

pivot longer

table4a %>% 
  pivot_longer(-country,
               names_to = "year",
               values_to = "case")

pivot wider

table2

pivot wider

table2 %>% 
  pivot_wider(names_from = type,
              values_from = count)

Relational data

Sometime the information can be split between different table

library(nycflights13)
flights
airlines
airports
weather
flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)

Relational data

knitr::include_graphics('img/relational-nycflights.png')

joints

knitr::include_graphics('img/join-venn.png')

inner_joint()

Matches pairs of observations whenever their keys are equal

flights2 %>%
  inner_join(airlines)

left_joint()

keeps all observations in x

flights2 %>%
  left_join(airlines)

right_joint()

keeps all observations in y

flights2 %>%
  right_join(airlines)

full_joint()

keeps all observations in x and y

flights2 %>%
  full_join(airlines)

Defining the key columns

The default, by = NULL, uses all variables that appear in both tables, the so called natural join.

flights2 %>% 
  left_join(weather)

Defining the key columns

The default, by = NULL, uses all variables that appear in both tables, the so called natural join.

flights2 %>% 
  left_join(planes, by = "tailnum")

Defining the key columns

A named character vector: by = c("a" = "b"). This will match variable a in table x to variable b in table y.

flights2 %>% 
  left_join(airports, c("dest" = "faa"))

Filtering joins

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:

  • semi_join(x, y) keeps all observations in x that have a match in y.
  • anti_join(x, y) drops all observations in x that have a match in y.

Filtering joins

top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
flights %>% 
  semi_join(top_dest)

Set operations

These expect the x and y inputs to have the same variables, and treat the observations like sets:

  • intersect(x, y): return only observations in both x and y.
  • union(x, y): return unique observations in x and y.
  • setdiff(x, y): return observations in x, but not in y.