Commit 5aa6b602, authored 10 months ago by Gilquin

Merge branch 'session-5_challenges2-3' into 'main'

fix: rework challenges 2 and 3. Ref #2

See merge request !10

Parents: 09c9102c, 67a2be32
1 merge request: !10 fix: rework challenges 2 and 3. Ref #2
Pipeline #2188 passed 10 months ago (stage: deploy)
Showing 1 changed file: session_5/session_5.Rmd, with 62 additions and 25 deletions
@@ -288,43 +288,51 @@ Which day would you prefer to book a flight ?
 </p>
 </details>
-We can add error bars to this plot to justify our decision.
-Brainstorm a way to have access to the mean and standard deviation or the `prop_cancel_day` and `av_delay`.
+We can add error bars to this plot to justify our decision. Brainstorm a way to construct the error bars.
+**Hints**:
+1. We can define the error bars with confidence intervals.
+2. `cancel_day` can be modeled as a Bernoulli random variable: $X \sim \mathcal{B}(p)$.\
+The corresponding $\alpha=5\%$ two-sided confidence interval is defined by:
+$$ \left[ \ \hat{p} \pm q_{1-\frac{\alpha}{2}} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} \ \right] $$
+3. `dep_delay` can be modeled as a Gaussian random variable: $X \sim \mathcal{N}(\mu, \sigma^2)$.\
+The corresponding $\alpha=5\%$ two-sided confidence interval is defined by:
+$$ \left[ \ \hat{\mu} \pm t_{1-\frac{\alpha}{2}, n-1} \frac{\hat{\sigma}}{\sqrt{n}} \ \right] $$
+4. We can draw error bars with the functions `geom_errorbar` and `geom_errorbarh`.
 <details><summary>Solution</summary>
 <p>
 ```{r grouping_challenges_b2, eval=T, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
+alpha <- 0.05
 flights %>%
   mutate(
-    canceled = is.na(dep_time) | is.na(arr_time)
-  ) %>%
-  mutate(wday = strftime(time_hour, "%A")) %>%
-  group_by(day) %>%
-  mutate(
-    prop_cancel_day = sum(canceled) / sum(!canceled),
-    av_delay = mean(dep_delay, na.rm = TRUE)
+    canceled = is.na(dep_time) | is.na(arr_time),
+    wday = strftime(time_hour, "%A")
   ) %>%
   group_by(wday) %>%
   summarize(
-    mean_cancel_day = mean(prop_cancel_day, na.rm = TRUE),
-    sd_cancel_day = sd(prop_cancel_day, na.rm = TRUE),
-    mean_av_delay = mean(av_delay, na.rm = TRUE),
-    sd_av_delay = sd(av_delay, na.rm = TRUE)
+    n_obs = n(),
+    prop_cancel_day = sum(canceled) / n_obs,
+    sd_cancel_day = sqrt(prop_cancel_day * (1 - prop_cancel_day)),
+    av_delay = mean(dep_delay, na.rm = T),
+    sd_delay = sd(dep_delay, na.rm = T)
   ) %>%
-  ggplot(mapping = aes(x = mean_av_delay, y = mean_cancel_day, color = wday)) +
+  ggplot(mapping = aes(x = av_delay, y = prop_cancel_day, color = wday)) +
   geom_point() +
   geom_errorbarh(
     mapping = aes(
-      xmin = -sd_av_delay + mean_av_delay,
-      xmax = sd_av_delay + mean_av_delay
+      xmin = av_delay - qt(1 - alpha / 2, n_obs - 1) * sd_delay / sqrt(n_obs),
+      xmax = av_delay + qt(1 - alpha / 2, n_obs - 1) * sd_delay / sqrt(n_obs)
     )
   ) +
   geom_errorbar(
     mapping = aes(
-      ymin = -sd_cancel_day + mean_cancel_day,
-      ymax = sd_cancel_day + mean_cancel_day
+      ymin = prop_cancel_day - qnorm(1 - alpha / 2) * sd_cancel_day / sqrt(n_obs),
+      ymax = prop_cancel_day + qnorm(1 - alpha / 2) * sd_cancel_day / sqrt(n_obs)
     )
-  )
+  ) +
+  theme_linedraw()
 ```
 </p>
 </details>
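
The two intervals named in the hints can be checked by hand outside the plotting pipeline. What follows is a minimal base-R sketch, assuming made-up vectors: `canceled` and `dep_delay` here are toy data, not the `flights` columns, and every variable name is chosen for illustration only. It computes the normal-approximation interval for a proportion and the Student interval for a mean, using the same `qnorm` and `qt` quantiles as the solution chunk.

```r
alpha <- 0.05

# Bernoulli case: 1 = canceled, 0 = not canceled (toy data, 20 of 200 canceled)
canceled <- rep(c(1, 0), times = c(20, 180))
n <- length(canceled)
p_hat <- mean(canceled)
half_width_p <- qnorm(1 - alpha / 2) * sqrt(p_hat * (1 - p_hat) / n)
ci_p <- c(lower = p_hat - half_width_p, upper = p_hat + half_width_p)

# Gaussian case: toy delays in minutes, with two missing values
dep_delay <- c(5, -2, NA, 12, 30, 0, NA, 7, 3, 15)
delay <- dep_delay[!is.na(dep_delay)]  # drop missing values before counting
m <- length(delay)                     # number of observed delays only
mu_hat <- mean(delay)
half_width_mu <- qt(1 - alpha / 2, df = m - 1) * sd(delay) / sqrt(m)
ci_mu <- c(lower = mu_hat - half_width_mu, upper = mu_hat + half_width_mu)

print(ci_p)
print(ci_mu)
```

One choice in this sketch differs from the chunk above: the sample size for the delay interval counts only non-missing delays, whereas `n_obs = n()` counts every flight in the group, including those with a missing `dep_delay`. Which count is appropriate for the course material is a judgment call; on the toy vector the Student interval computed this way agrees with `t.test(delay)$conf.int`.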
@@ -371,28 +379,57 @@ flights %>%
   summarise(
     carrier_delay = mean(arr_delay, na.rm = T)
   ) %>%
-  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
+  mutate(carrier = fct_reorder(carrier, carrier_delay, .na_rm = T)) %>%
   ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
   geom_col(alpha = 0.5)
 ```
 </p>
 </details>
-Can you disentangle the effects of bad airports vs. bad carriers? (Hint: think about `group_by(carrier, dest) %>% summarise(n=n())`)
+Can you disentangle the effects of bad airports vs. bad carriers?
+**Hints**:
+1. Think about `group_by(carrier, dest)`.
+2. We can color points per airport destination with the function `geom_jitter`.
+3. We can label points per airport destination with the function `geom_text_repel` from package `ggrepel`.
+4. We can control the jitter randomness with the function `position_jitter`.
 <details><summary>Solution</summary>
 <p>
 ```{r grouping_challenges_c1, eval=F, echo = T, message=FALSE, cache=T}
+require(ggrepel)
 flights %>%
   group_by(carrier, dest) %>%
   summarise(
     carrier_delay = mean(arr_delay, na.rm = T),
-    number_of_flight = n()
+    nflight = n()
   ) %>%
-  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
+  ungroup() %>%
+  mutate(carrier = fct_reorder(carrier, carrier_delay, .na_rm = T)) %>%
   ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
-  geom_boxplot() +
-  geom_jitter(height = 0)
+  geom_boxplot(outlier.shape = NA) +
+  geom_jitter(
+    aes(color = dest), # color points per destination
+    position = position_jitter(
+      width = 0.2, # small horizontal jitter
+      height = 0, # no vertical jitter
+      seed = 1 # to be reproducible
+    ),
+    show.legend = FALSE # remove legend
+  ) +
+  geom_text_repel(
+    aes(label = dest, color = dest), # color label per destination
+    max.overlaps = 10, # allow more labels to be drawn
+    position = position_jitter(
+      width = 0.2,
+      height = 0,
+      seed = 1
+    ),
+    show.legend = FALSE
+  ) +
+  theme_linedraw()
 ```
 </p>
 </details>
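
The solution above passes the same `position_jitter(width, height, seed)` to both the point layer and the label layer, which is what keeps each label next to its point. The sketch below illustrates that idea on a small made-up data frame; `toy` and the helper `jitter_pos()` are hypothetical names, and plain `geom_text` is swapped in for `geom_text_repel` so that only `ggplot2` is needed.

```r
library(ggplot2)

# Toy data (hypothetical carriers and destinations, not taken from flights)
toy <- data.frame(
  carrier = c("AA", "AA", "BB", "BB"),
  dest = c("X1", "X2", "X1", "X3"),
  carrier_delay = c(3, 12, -1, 8)
)

# Hypothetical helper: builds the same seeded jitter for every layer
jitter_pos <- function() {
  position_jitter(width = 0.2, height = 0, seed = 1)
}

p <- ggplot(toy, mapping = aes(x = carrier, y = carrier_delay)) +
  geom_point(aes(color = dest), position = jitter_pos(), show.legend = FALSE) +
  geom_text(
    aes(label = dest, color = dest),
    position = jitter_pos(), # same seed, so same horizontal offsets
    vjust = -1,              # lift the label slightly above its point
    show.legend = FALSE
  ) +
  theme_linedraw()

# Jittered x positions as computed for each layer
x_points <- as.numeric(layer_data(p, 1)$x)
x_labels <- as.numeric(layer_data(p, 2)$x)
```

With a fixed seed, `x_points` and `x_labels` should coincide, whereas leaving `seed` unset would draw independent offsets for the two layers and detach labels from points.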
Gilquin (@lgilquin) mentioned in issue #2 (closed) · 8 months ago