---
title: "R.8: Factors"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2021"
output:
  rmdformats::downcute:
    self_contained: true
    use_bookdown: true
    default_style: "light"
    lightbox: true
    css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
---
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
klippy::klippy(
  position = c('top', 'right'),
  color = "white",
  tooltip_message = 'Click to copy',
  tooltip_success = 'Copied !')

Introduction

In this session, you will learn more about the factor type in R. Factors can be very useful, but you have to be mindful of the implicit conversions from simple vectors to factors! They are the source of a lot of pain for R programmers.
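
To make the warning about conversions concrete, here is a minimal base R sketch of one classic pitfall. The vector `sizes` is hypothetical data invented for this illustration: numbers that were stored as text and then turned into a factor.

```r
# Hypothetical data: numbers stored as text, then converted to a factor
sizes <- factor(c("10", "20", "10", "5"))

# as.numeric() returns the internal integer codes of the levels,
# not the numbers that the labels spell out
codes <- as.numeric(sizes)

# Going through character first recovers the intended values
values <- as.numeric(as.character(sizes))

codes   # 1 2 1 3
values  # 10 20 10 5
```

The levels are sorted as text ("10", "20", "5"), which is why "5" receives the code 3.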

As usual we will need the `tidyverse` library.

```{r load_data, eval=TRUE, message=FALSE}
library(tidyverse)
```

Creating factors

Imagine that you have a variable that records month:

x1 <- c("Dec", "Apr", "Jan", "Mar")

Using a string to record this variable has two problems:

  1. There are only twelve possible months, and there’s nothing saving you from typos:
x2 <- c("Dec", "Apr", "Jam", "Mar")
  2. It doesn’t sort in a useful way:
sort(x1)

You can fix both of these problems with a factor.

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(x1, levels = month_levels)
y1
sort(y1)

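Any values not in the set of levels will be converted to NA. With factor() this happens silently; parse_factor() from readr does the same conversion but also raises a warning: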

y2 <- parse_factor(x2, levels = month_levels)
y2
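
The conversion to NA can also be checked by hand in base R. The sketch below is only an illustration: the names `months_typo`, `y_silent` and `bad` are made up for the example, and the built-in constant `month.abb` stands in for `month_levels`.

```r
# month.abb is a base R constant: "Jan", "Feb", ..., "Dec"
months_typo <- c("Dec", "Apr", "Jam", "Mar")  # "Jam" is a typo

# factor() converts unknown values to NA without any message
y_silent <- factor(months_typo, levels = month.abb)
is.na(y_silent)

# Listing the offending values before converting makes the problem visible
bad <- setdiff(months_typo, month.abb)
bad
```

Checking with setdiff() first is one possible safeguard when you want to stay with factor() but not lose values unnoticed.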

Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data.

f2 <- x1 %>% factor() %>% fct_inorder()
f2
levels(f2)
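
As a small comparison, the sketch below contrasts the default alphabetical levels with the order of first appearance. The vector `months_seen` is hypothetical, and the last line shows what is assumed here to be a base R equivalent of fct_inorder().

```r
library(forcats)

# Hypothetical data with a repeated value
months_seen <- c("Dec", "Apr", "Jan", "Mar", "Apr")

# Default: levels are sorted alphabetically
alphabetical <- levels(factor(months_seen))

# fct_inorder(): levels follow the first appearance in the data
by_appearance <- levels(fct_inorder(factor(months_seen)))

# Base R alternative: pass the unique values as levels
base_equiv <- levels(factor(months_seen, levels = unique(months_seen)))

alphabetical
by_appearance
```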

General Social Survey

For the rest of this session we use `forcats::gss_cat`, a sample of data from the General Social Survey. Counting the rows per level gives a quick overview of a factor:

gss_cat %>%
  count(race)

By default, ggplot2 will drop levels that don’t have any values. You can force them to display with:

ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
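
Empty levels are also worth checking outside of plots. The following base R sketch uses a small hypothetical factor (not the real `gss_cat` data) with one level that never occurs.

```r
# Hypothetical factor: "Not applicable" is a valid level but has no rows
race <- factor(
  c("White", "Black", "White", "Other"),
  levels = c("Other", "Black", "White", "Not applicable")
)

# table() keeps every level, including those with a count of zero
counts <- table(race)
counts

# droplevels() removes the levels that are not used
used_only <- droplevels(race)
levels(used_only)
```

With dplyr, `count(race, .drop = FALSE)` should give the analogous result of keeping zero-count levels in the summary.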

Modifying factor order

It’s often useful to change the order of the factor levels in a visualisation.

relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()

It is difficult to interpret this plot because there’s no overall pattern. We can improve it by reordering the levels of relig using fct_reorder(). fct_reorder() takes three arguments:

  • .f, the factor whose levels you want to modify.
  • .x, a numeric vector that you want to use to reorder the levels.
  • Optionally, .fun, a function that’s used if there are multiple values of .x for each value of .f. The default value is median.
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()
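
The effect of the three arguments is easier to see on a toy vector than on a plot. In this sketch `grp` and `val` are hypothetical data: each level has two values, so the summary function decides the order.

```r
library(forcats)

# Hypothetical data: two values per level
grp <- factor(c("a", "a", "b", "b"))
val <- c(1, 10, 4, 5)

# Default summary is the median: a = 5.5, b = 4.5, so "b" comes first
by_median <- levels(fct_reorder(grp, val))

# With .fun = min: a = 1, b = 4, so "a" comes first
by_min <- levels(fct_reorder(grp, val, .fun = min))

by_median
by_min
```

Only the order of the levels changes; the values stored in the factor stay the same.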

As you start making more complicated transformations, I’d recommend moving them out of aes() and into a separate mutate() step. For example, you could rewrite the plot above as:

relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
    geom_point()

fct_reorder2()

Another type of reordering is useful when you are colouring the lines on a plot. fct_reorder2() reorders the factor by the y values associated with the largest x values. This makes the plot easier to read because the line colours line up with the legend.

by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))
ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
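
To see what fct_reorder2() does to the levels, independently of the plot, here is a minimal sketch on hypothetical data (`grp`, `x_val` and `y_val` are made up). It relies on the defaults of fct_reorder2(), which take the y value at the largest x for each level and sort the levels in decreasing order of that value.

```r
library(forcats)

# Hypothetical data: two lines ("a" and "b") observed at x = 1 and x = 2
grp   <- factor(c("a", "a", "b", "b"))
x_val <- c(1, 2, 1, 2)
y_val <- c(5, 1, 2, 9)

# At the largest x (x = 2), "a" has y = 1 and "b" has y = 9,
# so "b" becomes the first level
reordered <- fct_reorder2(grp, x_val, y_val)
levels(reordered)
```

In a line plot, this puts the legend entries in the same top-to-bottom order as the right-hand ends of the lines.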