Commit a7ea281f authored by GD

Simpson's paradox example and correlation

parent 5ec13af6
---
title: "counter-intuitive examples regarding correlations"
output:
  html_document: default
  pdf_document: default
date: "`r Sys.Date()`"
---
Also known as Simpson's paradox^[Ref: https://en.wikipedia.org/wiki/Simpson%27s_paradox].
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r req, include=FALSE}
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
```
## Example: Palmer penguins dataset^[Cf. https://allisonhorst.github.io/palmerpenguins/]
```{r ex1_data}
library(palmerpenguins)
data(package = 'palmerpenguins')
head(penguins)
```
> Note: this is a good dataset for teaching, and an alternative to Fisher's Iris dataset.
**Global correlation between bill depth and bill length**^[See https://allisonhorst.github.io/palmerpenguins/#bill-dimensions]:
```{r ex1_global_corr, echo=FALSE}
penguins %>% drop_na() %>%
  summarize(corr = cor(bill_depth_mm, bill_length_mm)) %>% pull(corr)
```
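
`drop_na()` without arguments removes every row that has a missing value in any column (for example `sex`), not only the rows that lack bill measurements. As a hedged sketch, not part of the original analysis, the chunk below restricts the filter to the two bill columns and checks that the result matches `cor(..., use = "complete.obs")`. The names `bill_only`, `all_complete`, `cor_bill_only` and `cor_complete_obs` are illustrative.

```{r ex1_global_corr_na_sketch}
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Keep rows where both bill measurements are present (other columns may be NA)
bill_only <- penguins %>% drop_na(bill_depth_mm, bill_length_mm)

# Keep rows with no missing value in any column (what the chunk above does)
all_complete <- penguins %>% drop_na()

# Compare how many rows each filter retains
c(rows_bill_only = nrow(bill_only), rows_all_complete = nrow(all_complete))

# Correlation on the bill-only subset
cor_bill_only <- cor(bill_only$bill_depth_mm, bill_only$bill_length_mm)

# Same quantity, letting cor() discard incomplete pairs itself
cor_complete_obs <- cor(penguins$bill_depth_mm, penguins$bill_length_mm,
                        use = "complete.obs")

c(bill_only = cor_bill_only, complete_obs = cor_complete_obs)
```

The value may differ slightly from the one obtained with a bare `drop_na()`, because the two filters do not retain the same rows.
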
**Intra-species correlation between bill depth and bill length**:
```{r ex1_group_corr, echo=FALSE}
penguins %>% drop_na() %>% group_by(species) %>%
  summarize(corr = cor(bill_depth_mm, bill_length_mm)) %>% ungroup() %>% print()
```
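> Across all penguins, the correlation between bill depth and bill length is negative, but within each species it is positive: pooling the species reverses the sign of the association.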
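
The sign reversal can be reproduced with simulated data, which makes the mechanism easier to see. The sketch below is an illustration under stated assumptions, not penguin data: three hypothetical groups (`A`, `B`, `C`) whose centers move down as `x` increases, while inside each group `y` increases with `x`. All names and numeric settings (`group_centers`, `n_per_group`, the seed) are arbitrary choices made for the example.

```{r simpson_synthetic_sketch}
set.seed(42)  # arbitrary seed, only for reproducibility
n_per_group <- 100

# Hypothetical group centers: larger x center goes with smaller y center
group_centers <- data.frame(
  group    = c("A", "B", "C"),
  x_center = c(0, 5, 10),
  y_center = c(10, 5, 0)
)

# Within each group, y rises with x (slope 1) plus a little noise
sim <- do.call(rbind, lapply(seq_len(nrow(group_centers)), function(i) {
  x_dev <- rnorm(n_per_group)
  data.frame(
    group = group_centers$group[i],
    x     = group_centers$x_center[i] + x_dev,
    y     = group_centers$y_center[i] + x_dev + rnorm(n_per_group, sd = 0.5)
  )
}))

# Pooled correlation, ignoring the grouping variable
global_corr <- cor(sim$x, sim$y)

# Correlation computed separately inside each group
within_corr <- sapply(split(sim, sim$group), function(d) cor(d$x, d$y))

global_corr
within_corr
```

With these settings the differences between group centers dominate the pooled correlation, so it comes out negative even though every within-group correlation is positive.
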
### Graphical representation
Code from https://github.com/apreshill/palmerpenguins-useR-2022
```{r ex1_graphics, echo=FALSE, message=FALSE, warning=FALSE, results='hide', fig.keep='all'}
#| fig.alt: "Scatterplot of bill length versus bill depth for the three penguin species, showing a positive linear relationship within species. If species is omitted as a variable, the relationship switches to a negative trend, an example of Simpson’s paradox in the data"
library(ggpubr)
library(paletteer)
library(ggiraph)
# Simpson's Paradox example (bill dimensions, omitting species):
simpson_nospecies_base <- penguins %>%
  # doing this so ggiraph recognizes species across plots
  mutate(species = as.character(species)) %>%
  mutate(species = case_when(
    species == "Adelie" ~ "Adélie",
    TRUE ~ species)
  ) %>%
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  theme(panel.border = element_rect(fill = NA, color = "gray70")) +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)")
# Bill dimensions, including species:
simpson_wspecies_base <-
  penguins %>%
  mutate(species = as.character(species)) %>%
  mutate(species = case_when(
    species == "Adelie" ~ "Adélie",
    TRUE ~ species)
  ) %>%
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, group = species)) +
  scale_color_paletteer_d("colorblindr::OkabeIto") +
  theme(panel.border = element_rect(fill = NA, color = "gray70")) +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)") +
  guides(color = guide_legend("Species"),
         shape = guide_legend("Species"))
# Tooltip text for the pooled plot (bill depth is measured in mm)
nospecies_tooltip <- c(str_c("Bill length (mm) = ", penguins$bill_length_mm,
                             "\n Bill depth (mm) = ", penguins$bill_depth_mm,
                             "\n Species = ", penguins$species))
simpson_nospecies_int <-
  simpson_nospecies_base +
  geom_point_interactive(aes(tooltip = nospecies_tooltip,
                             data_id = species),
                         size = 2,
                         alpha = 0.6) +
  geom_smooth_interactive(method = lm,
                          se = FALSE,
                          color = "black"
  )
# Tooltip text for the per-species plot (bill depth is measured in mm)
wspecies_tooltip <- c(str_c("Bill length (mm) = ", penguins$bill_length_mm,
                            "\n Bill depth (mm) = ", penguins$bill_depth_mm,
                            "\n Species = ", penguins$species))
simpson_wspecies_int <-
  simpson_wspecies_base +
  geom_point_interactive(aes(color = species,
                             shape = species,
                             tooltip = wspecies_tooltip,
                             data_id = species),
                         size = 2,
                         alpha = 0.6) +
  geom_smooth_interactive(aes(color = species,
                              data_id = species),
                          method = lm,
                          se = FALSE
  )
ggarrange(simpson_nospecies_int, simpson_wspecies_int, widths = c(1.05, 1.5))
```
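
The trend lines in the figure are linear fits, so the same reversal can be read from regression slopes. As a minimal sketch that is not part of the original commit, the chunk below compares the slope of bill depth on bill length in a pooled model with the slope in a model that also includes `species`. The additive model assumes one common slope across species, which is a simplification of the per-species lines drawn in the plot. The names `fit_pooled`, `fit_by_species`, `slope_pooled` and `slope_adjusted` are illustrative.

```{r ex1_lm_sketch}
library(palmerpenguins)

# Pooled fit: species is ignored (lm() drops rows with missing values by default)
fit_pooled <- lm(bill_depth_mm ~ bill_length_mm, data = penguins)

# Adjusted fit: one intercept per species, with a common slope for bill length
fit_by_species <- lm(bill_depth_mm ~ bill_length_mm + species, data = penguins)

# Extract the bill-length slope from each model
slope_pooled   <- coef(fit_pooled)[["bill_length_mm"]]
slope_adjusted <- coef(fit_by_species)[["bill_length_mm"]]

c(pooled = slope_pooled, adjusted = slope_adjusted)
```

The pooled slope is negative and the species-adjusted slope is positive, in line with the correlations computed earlier.
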