title: '#8 Factors'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "31 Jan 2020"
always_allow_html: yes
output:
  slidy_presentation:
    highlight: tango
  beamer_presentation:
    theme: metropolis
    slide_level: 3
    fig_caption: no
    df_print: tibble
    highlight: tango
    latex_engine: xelatex
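
# Setup: echo code chunks in the slides and load the tidyverse
# (provides forcats, ggplot2, dplyr and readr, all used below).
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)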

Creating factors

Imagine that you have a variable that records month:

x1 <- c("Dec", "Apr", "Jan", "Mar")

Using a string to record this variable has two problems:

  1. There are only twelve possible months, and there’s nothing saving you from typos:
x2 <- c("Dec", "Apr", "Jam", "Mar")
  2. It doesn’t sort in a useful way:
sort(x1)

Creating factors

You can fix both of these problems with a factor.

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(x1, levels = month_levels)
y1
sort(y1)
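
To make the difference concrete, here is a minimal sketch in base R comparing the two sorts. It only reuses x1 and month_levels from the slides above; nothing else is assumed.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x1 <- c("Dec", "Apr", "Jan", "Mar")
y1 <- factor(x1, levels = month_levels)

# A character vector sorts alphabetically.
sorted_chr <- sort(x1)
sorted_chr
# "Apr" "Dec" "Jan" "Mar"

# A factor sorts by the position of each value in its levels.
sorted_fct <- as.character(sort(y1))
sorted_fct
# "Jan" "Mar" "Apr" "Dec"

# Internally, each value is stored as an index into the levels.
codes <- as.integer(y1)
codes
# 12 4 1 3
```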

Creating factors

And any values not in the set will be converted to NA. factor() does this silently, whereas readr’s parse_factor() also emits a warning:

y2 <- parse_factor(x2, levels = month_levels)
y2
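
As a rough illustration of the NA conversion, the sketch below uses base factor() on the vector with the typo and locates the value that was lost. The name y2_silent is only illustrative and does not come from the slides.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Dec", "Apr", "Jam", "Mar")

# factor() turns the unknown value "Jam" into NA without any message.
y2_silent <- factor(x2, levels = month_levels)
na_flags <- is.na(y2_silent)
na_flags
# FALSE FALSE  TRUE FALSE

# The original strings at the NA positions reveal the typo.
typos <- x2[na_flags]
typos
# "Jam"
```

Checking is.na() against the original vector like this is one simple way to catch typos when you cannot rely on the warning.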

Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data.

f2 <- x1 %>% factor() %>% fct_inorder()
f2
levels(f2)
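
For comparison, a small sketch of what fct_inorder() changes. The base R alternative with unique() is shown as an equivalent under the assumption that you still have the original character vector; f_default and f_base are illustrative names.

```r
library(forcats)

x1 <- c("Dec", "Apr", "Jan", "Mar")

# By default, factor() orders the levels alphabetically.
f_default <- factor(x1)
levels(f_default)
# "Apr" "Dec" "Jan" "Mar"

# fct_inorder() reorders the levels by first appearance in the data.
f2 <- fct_inorder(f_default)
levels(f2)
# "Dec" "Apr" "Jan" "Mar"

# Base R alternative: set the levels at creation time.
f_base <- factor(x1, levels = unique(x1))
levels(f_base)
# "Dec" "Apr" "Jan" "Mar"
```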

General Social Survey

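The examples in the rest of this section use forcats::gss_cat, a sample of data from the General Social Survey. count() shows how many observations fall in each level of a factor:

gss_cat %>%
  count(race)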

General Social Survey

By default, ggplot2 will drop levels that don’t have any values. You can force them to display with:

ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
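
Empty levels are easy to miss in tables as well as in plots. The sketch below uses a small hypothetical data frame (toy, not part of gss_cat) with one unused level to show how base table() and dplyr::count() treat it.

```r
library(dplyr)

# Hypothetical data: level "c" is declared but never observed.
toy <- data.frame(grp = factor(c("a", "a", "b"), levels = c("a", "b", "c")))

# table() always reports every level, including empty ones.
tab <- table(toy$grp)
tab

# count() drops empty levels by default...
counted <- toy %>% count(grp)
nrow(counted)
# 2

# ...and keeps them, with n = 0, when .drop = FALSE.
counted_all <- toy %>% count(grp, .drop = FALSE)
nrow(counted_all)
# 3
```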

Modifying factor order

It’s often useful to change the order of the factor levels in a visualisation.

relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()

8_a

Modifying factor order

It is difficult to interpret this plot because there’s no overall pattern. We can improve it by reordering the levels of relig using fct_reorder(). fct_reorder() takes three arguments:

  • f, the factor whose levels you want to modify (named .f in current forcats releases).
  • x, a numeric vector that you want to use to reorder the levels (.x).
  • Optionally, fun, a function that’s used if there are multiple values of x for each value of f (.fun). The default value is median.
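
The role of the summary function is easiest to see on a toy example. In this sketch f and x are made-up values, chosen so that the median and the mean disagree. It assumes a forcats version where the third argument is named .fun.

```r
library(forcats)

# Hypothetical data: level "a" has one large outlier.
f <- factor(c("a", "a", "a", "b", "b", "b"))
x <- c(1, 1, 100, 5, 5, 5)

# Default summary is the median: a = 1, b = 5, so "a" comes first.
by_median <- levels(fct_reorder(f, x))
by_median
# "a" "b"

# With the mean: a = 34, b = 5, so "b" comes first.
by_mean <- levels(fct_reorder(f, x, .fun = mean))
by_mean
# "b" "a"
```

In relig_summary there is a single tvhours value per religion, so the choice of summary function does not change the plot.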

Modifying factor order

ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()

8_b

Modifying factor order

As you start making more complicated transformations, I’d recommend moving them out of aes() and into a separate mutate() step. For example, you could rewrite the plot above as:

relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
    geom_point()
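
A short sketch of what the mutate() step does to the data: only the level order changes, while the rows stay where they are. The data frame toy and its values are hypothetical stand-ins for relig_summary.

```r
library(dplyr)
library(forcats)

# Hypothetical summary table with one row per group.
toy <- data.frame(
  relig = factor(c("A", "B", "C")),
  tvhours = c(3.2, 1.5, 2.4)
)

reordered <- toy %>%
  mutate(relig = fct_reorder(relig, tvhours))

# Levels are now sorted by increasing tvhours.
levels(reordered$relig)
# "B" "C" "A"

# The row order of the data frame is unchanged.
as.character(reordered$relig)
# "A" "B" "C"
```

ggplot2 places discrete values on an axis in level order, which is why reordering the levels is enough to reorder the plot.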

8_c

fct_reorder2()

Another type of reordering is useful when you are colouring the lines on a plot. fct_reorder2() reorders the factor by the y values associated with the largest x values. This makes the plot easier to read because the line colours line up with the legend.

by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))
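
As a sanity check on by_age, the proportions should sum to 1 within each age, since prop is computed after group_by(age). A minimal sketch of that check follows; check is an illustrative name.

```r
library(dplyr)
library(forcats)  # provides the gss_cat dataset

by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

# by_age is still grouped by age, so summarise() returns one row per age.
check <- by_age %>%
  summarise(total = sum(prop))

# Every total should be 1, up to floating point error.
all(abs(check$total - 1) < 1e-8)
```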

8_d

fct_reorder2()

ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)

8_e

fct_reorder2()

ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
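
To see how fct_reorder2() picks the order, here is a toy sketch with two groups measured at two x values. The vectors g, x and y are made up for illustration.

```r
library(forcats)

# Hypothetical data: two lines, each observed at x = 1 and x = 2.
g <- factor(c("a", "a", "b", "b"))
x <- c(1, 2, 1, 2)
y <- c(10, 1, 2, 8)

# At the largest x (x = 2), "a" has y = 1 and "b" has y = 8.
# Levels are sorted by that y value in decreasing order.
new_levels <- levels(fct_reorder2(g, x, y))
new_levels
# "b" "a"
```

The level with the highest y at the right-hand edge of the plot comes first, so the legend entries appear in the same top-to-bottom order as the line ends.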

8_f