Commit 5aa6b602 authored by Gilquin

Merge branch 'session-5_challenges2-3' into 'main'

fix: rework challenges 2 and 3. Ref #2

See merge request !10
parents 09c9102c 67a2be32
Pipeline #2188 passed
...@@ -288,43 +288,51 @@ Which day would you prefer to book a flight ? ...@@ -288,43 +288,51 @@ Which day would you prefer to book a flight ?
</p> </p>
</details> </details>
We can add error bars to this plot to justify our decision. We can add error bars to this plot to justify our decision. Brainstorm a way to construct the error bars.
Brainstorm a way to have access to the mean and standard deviation or the `prop_cancel_day` and `av_delay`.
**Hints**:
1. We can define the error bars with confidence intervals.
2. `cancel_day` can be modeled as a Bernoulli random variable: $X \sim \mathcal{B}(p)$.\
The corresponding $\alpha=5\%$ two-sided confidence interval is defined by:
$$ \left[ \ \hat{p} \pm q_{1-\frac{\alpha}{2}} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} \ \right] $$
3. `dep_delay` can be modeled as a Gaussian random variable: $X \sim \mathcal{N}(\mu, \sigma^2)$.\
The corresponding $\alpha=5\%$ two-sided confidence interval is defined by:
$$ \left[ \ \hat{\mu} \pm t_{1-\frac{\alpha}{2}, n-1} \frac{\hat{\sigma}}{\sqrt{n}} \ \right] $$
4. We can draw error bars with the functions `geom_errorbar` and `geom_errorbarh`.
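The two interval formulas in the hints can be checked on a small example before applying them per weekday. The sketch below uses made-up toy vectors (`canceled_toy` and `delay_toy` are hypothetical and not taken from `flights`) and only base R functions (`qnorm`, `qt`, `sd`); it assumes the normal approximation for the proportion and the Student quantile for the mean, as stated in hints 2 and 3.

```r
alpha <- 0.05

# Hypothetical toy data: 20 canceled flights out of 1000 (assumption, not real data)
canceled_toy <- c(rep(TRUE, 20), rep(FALSE, 980))
n <- length(canceled_toy)
p_hat <- mean(canceled_toy)  # estimated cancellation probability

# Bernoulli case: p_hat +/- q_{1 - alpha/2} * sqrt(p_hat * (1 - p_hat) / n)
half_width_p <- qnorm(1 - alpha / 2) * sqrt(p_hat * (1 - p_hat) / n)
ci_p <- c(lower = p_hat - half_width_p, upper = p_hat + half_width_p)

# Hypothetical toy delays in minutes (assumption, not real data)
delay_toy <- c(-3, 0, 5, 12, 47, -1, 8, 20, 2, 15)
m <- length(delay_toy)
mu_hat <- mean(delay_toy)  # estimated mean delay

# Gaussian case: mu_hat +/- t_{1 - alpha/2, m - 1} * sigma_hat / sqrt(m)
half_width_mu <- qt(1 - alpha / 2, df = m - 1) * sd(delay_toy) / sqrt(m)
ci_mu <- c(lower = mu_hat - half_width_mu, upper = mu_hat + half_width_mu)

print(ci_p)
print(ci_mu)
```

Both intervals are symmetric around the point estimate, so the same half-widths can be passed as `ymin`/`ymax` and `xmin`/`xmax` offsets to `geom_errorbar` and `geom_errorbarh`.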
<details><summary>Solution</summary> <details><summary>Solution</summary>
<p> <p>
```{r grouping_challenges_b2, eval=T, message=FALSE, cache=T, fig.width=8, fig.height=3.5} ```{r grouping_challenges_b2, eval=T, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
alpha <- 0.05
flights %>% flights %>%
mutate( mutate(
canceled = is.na(dep_time) | is.na(arr_time) canceled = is.na(dep_time) | is.na(arr_time),
) %>% wday = strftime(time_hour, "%A")
mutate(wday = strftime(time_hour, "%A")) %>%
group_by(day) %>%
mutate(
prop_cancel_day = sum(canceled) / sum(!canceled),
av_delay = mean(dep_delay, na.rm = TRUE)
) %>% ) %>%
group_by(wday) %>% group_by(wday) %>%
summarize( summarize(
mean_cancel_day = mean(prop_cancel_day, na.rm = TRUE), n_obs = n(),
sd_cancel_day = sd(prop_cancel_day, na.rm = TRUE), prop_cancel_day = sum(canceled) / n_obs,
mean_av_delay = mean(av_delay, na.rm = TRUE), sd_cancel_day = sqrt(prop_cancel_day * (1 - prop_cancel_day)),
sd_av_delay = sd(av_delay, na.rm = TRUE) av_delay = mean(dep_delay, na.rm = T),
sd_delay = sd(dep_delay, na.rm = T)
) %>% ) %>%
ggplot(mapping = aes(x = mean_av_delay, y = mean_cancel_day, color = wday)) + ggplot(mapping = aes(x = av_delay, y = prop_cancel_day, color = wday)) +
geom_point() + geom_point() +
geom_errorbarh( geom_errorbarh(
mapping = aes( mapping = aes(
xmin = -sd_av_delay + mean_av_delay, xmin = av_delay - qt(1 - alpha / 2, n_obs - 1) * sd_delay / sqrt(n_obs),
xmax = sd_av_delay + mean_av_delay xmax = av_delay + qt(1 - alpha / 2, n_obs - 1) * sd_delay / sqrt(n_obs)
) )
) + ) +
geom_errorbar( geom_errorbar(
mapping = aes( mapping = aes(
ymin = -sd_cancel_day + mean_cancel_day, ymin = prop_cancel_day - qnorm(1 - alpha / 2) * sd_cancel_day / sqrt(n_obs),
ymax = sd_cancel_day + mean_cancel_day ymax = prop_cancel_day + qnorm(1 - alpha / 2) * sd_cancel_day / sqrt(n_obs)
) )
) ) +
theme_linedraw()
``` ```
</p> </p>
</details> </details>
...@@ -371,28 +379,57 @@ flights %>% ...@@ -371,28 +379,57 @@ flights %>%
summarise( summarise(
carrier_delay = mean(arr_delay, na.rm = T) carrier_delay = mean(arr_delay, na.rm = T)
) %>% ) %>%
mutate(carrier = fct_reorder(carrier, carrier_delay)) %>% mutate(carrier = fct_reorder(carrier, carrier_delay, .na_rm = T)) %>%
ggplot(mapping = aes(x = carrier, y = carrier_delay)) + ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
geom_col(alpha = 0.5) geom_col(alpha = 0.5)
``` ```
</p> </p>
</details> </details>
Can you disentangle the effects of bad airports vs. bad carriers? (Hint: think about `group_by(carrier, dest) %>% summarise(n=n())`) Can you disentangle the effects of bad airports vs. bad carriers?
**Hints**:
1. Think about `group_by(carrier, dest)`.
2. We can color points per airport destination with the function `geom_jitter`.
3. We can label points per airport destination with the function `geom_text_repel` from package `ggrepel`.
4. We can control the jitter randomness with the function `position_jitter`.
<details><summary>Solution</summary> <details><summary>Solution</summary>
<p> <p>
```{r grouping_challenges_c1, eval=F, echo = T, message=FALSE, cache=T} ```{r grouping_challenges_c1, eval=F, echo = T, message=FALSE, cache=T}
require(ggrepel)
flights %>% flights %>%
group_by(carrier, dest) %>% group_by(carrier, dest) %>%
summarise( summarise(
carrier_delay = mean(arr_delay, na.rm = T), carrier_delay = mean(arr_delay, na.rm = T),
number_of_flight = n() nflight = n()
) %>% ) %>%
mutate(carrier = fct_reorder(carrier, carrier_delay)) %>% ungroup() %>%
mutate(carrier = fct_reorder(carrier, carrier_delay, .na_rm = T)) %>%
ggplot(mapping = aes(x = carrier, y = carrier_delay)) + ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
geom_boxplot() + geom_boxplot(outlier.shape = NA) +
geom_jitter(height = 0) geom_jitter(
aes(color = dest), # color points per destination
position = position_jitter(
width = 0.2, # small horizontal jitter
height = 0, # no vertical jitter
seed = 1 # to be reproducible
),
show.legend = FALSE # remove legend
) +
geom_text_repel(
aes(label = dest, color = dest), # color label per destination
max.overlaps = 10, # allow more labels to be drawn
position = position_jitter(
width = 0.2,
height = 0,
seed = 1
),
show.legend = FALSE
) +
theme_linedraw()
``` ```
</p> </p>
</details> </details>
......