---
title: '#8 Factors'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "31 Jan 2020"
always_allow_html: yes
output:
  slidy_presentation:
    highlight: tango
  beamer_presentation:
    theme: metropolis
    slide_level: 3
    fig_caption: no
    df_print: tibble
    highlight: tango
    latex_engine: xelatex
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```

## Creating factors

Imagine that you have a variable that records month:

```{r declare_month, eval=T, cache=T}
x1 <- c("Dec", "Apr", "Jan", "Mar")
```

Using a string to record this variable has two problems:

1. There are only twelve possible months, and there’s nothing saving you from typos:

```{r declare_month2, eval=T, cache=T}
x2 <- c("Dec", "Apr", "Jam", "Mar")
```

2. It doesn’t sort in a useful way:

```{r sort_month, eval=T, cache=T}
sort(x1)
```

## Creating factors

You can fix both of these problems with a factor.

```{r sort_month_factor, eval=T, cache=T}
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(x1, levels = month_levels)
y1
sort(y1)
```
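One way to see why sorting now works: a factor stores each value as an integer code that indexes into its levels, and `sort()` orders by those codes. The chunk below is a minimal sketch using only base R; `months_chr` and `months_fct` are illustrative names, and `month.abb` is the built-in vector of English month abbreviations.

```{r factor_codes_sketch}
months_chr <- c("Dec", "Apr", "Jan", "Mar")
months_fct <- factor(months_chr, levels = month.abb)

# Integer codes: the position of each value in the levels
as.integer(months_fct)

# Indexing the levels with the codes recovers the original strings
levels(months_fct)[as.integer(months_fct)]

# Sorting follows the codes, so the result is in calendar order
as.character(sort(months_fct))
```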

## Creating factors

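Any value that is not in the set of levels is converted to `NA`. `factor()` does this silently, whereas `parse_factor()` from readr also raises a warning: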

```{r sort_month_factor2, eval=T, cache=T}
y2 <- parse_factor(x2, levels = month_levels)
y2
```
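If you want to know which values were lost, you can compare the input with the levels. This is only a sketch in base R, with `months_typo` and `as_fct` as illustrative names:

```{r typo_check_sketch}
months_typo <- c("Dec", "Apr", "Jam", "Mar")
as_fct <- factor(months_typo, levels = month.abb)

# Values that became NA during the conversion (and were not NA before)
months_typo[is.na(as_fct) & !is.na(months_typo)]

# The same check, done before converting
setdiff(months_typo, month.abb)
```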

Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data.

```{r inorder_month_factor, eval=T, cache=T}
f2 <- x1 %>% factor() %>% fct_inorder()
f2
levels(f2)
```
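For comparison, the sketch below puts the default alphabetical levels next to the order of first appearance, and shows that a base R call with `unique()` should give the same levels as `fct_inorder()`. The name `months_chr` is illustrative.

```{r inorder_sketch}
library(forcats)

months_chr <- c("Dec", "Apr", "Jan", "Mar")

# Default: levels are the sorted unique values
levels(factor(months_chr))

# Order of first appearance, with forcats
levels(fct_inorder(factor(months_chr)))

# Order of first appearance, with base R only
levels(factor(months_chr, levels = unique(months_chr)))
```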

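## General Social Survey

`gss_cat` is a sample of data from the General Social Survey, provided with the forcats package. Counting a factor column shows how many observations fall in each level: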

```{r race_count, eval=T, cache=T}
gss_cat %>%
  count(race)
```
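`count()` only lists the levels that occur in the data. A quick way to check whether a factor also has unused levels is to compare `levels()` with the observed values, as in this sketch (`unused_levels` is an illustrative name):

```{r race_levels_sketch}
library(forcats)

# All levels declared for the factor, used or not
levels(gss_cat$race)

# table() keeps the levels with zero observations
table(gss_cat$race)

# Levels that never appear in the data
unused_levels <- setdiff(levels(gss_cat$race), unique(as.character(gss_cat$race)))
unused_levels
```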

## General Social Survey

By default, ggplot2 will drop levels that don’t have any values. You can force them to display with:

```{r race_plot, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
```

## Modifying factor order

It’s often useful to change the order of the factor levels in a visualisation.

```{r tv_hour, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
```

**8_a**

## Modifying factor order

It is difficult to interpret this plot because there’s no overall pattern. We can improve it by reordering the levels of `relig` using `fct_reorder()`. `fct_reorder()` takes three arguments:

- `.f`, the factor whose levels you want to modify.
- `.x`, a numeric vector that you want to use to reorder the levels.
- Optionally, `.fun`, a function that’s used if there are multiple values of `.x` for each value of `.f`. The default value is `median`.
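The effect of the summary function is easiest to see on a small made-up data set. In the sketch below, `toy_groups` is a hypothetical data frame with two values per group; the levels are reordered first by the default median, then by the minimum.

```{r fct_reorder_sketch}
library(forcats)

# Hypothetical data: two values per group
toy_groups <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  value = c(10, 30, 1, 3, 5, 100)
)

# Medians are a = 20, b = 2, c = 52.5
by_median <- fct_reorder(toy_groups$group, toy_groups$value)
levels(by_median)

# Minima are a = 10, b = 1, c = 5
by_min <- fct_reorder(toy_groups$group, toy_groups$value, .fun = min)
levels(by_min)
```

Only the levels change: the values stored in the factor stay in their original positions.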

## Modifying factor order

```{r tv_hour_order, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()
```

**8_b**

## Modifying factor order

As you start making more complicated transformations, I’d recommend moving them out of `aes()` and into a separate `mutate()` step. For example, you could rewrite the plot above as:

```{r tv_hour_order_mutate, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
    geom_point()
```

**8_c**

## `fct_reorder2()`

Another type of reordering is useful when you are colouring the lines on a plot. `fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values. This makes the plot easier to read because the line colours line up with the legend.

```{r fct_reorder2, eval=TRUE}
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))
```

**8_d**

## `fct_reorder2()`

```{r fct_reorder2a, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)
```

**8_e**

## `fct_reorder2()`

```{r fct_reorder2b, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
```

**8_f**
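To see what `fct_reorder2()` does to the levels without drawing a plot, here is a sketch on a small made-up data set. `toy_lines` is a hypothetical data frame with three series measured at `x = 1` and `x = 2`; the levels should end up sorted by the `y` value at the largest `x`, from highest to lowest.

```{r fct_reorder2_sketch}
library(forcats)

# Hypothetical data: three series, two points each
toy_lines <- data.frame(
  series = factor(rep(c("s1", "s2", "s3"), each = 2)),
  x = rep(c(1, 2), times = 3),
  y = c(5, 1, 1, 9, 3, 4)
)

# At x = 2 the y values are s1 = 1, s2 = 9, s3 = 4
reordered <- fct_reorder2(toy_lines$series, toy_lines$x, toy_lines$y)
levels(reordered)
```

The descending order is what makes the legend entries follow the right-hand end of the lines, from top to bottom.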