Compare revisions

06d9eb16 · 2887e584 · 06d9eb16 · 2887e584 · 06d9eb16 · 06d9eb16
--- a/session_4/session_4.Rmd
+++ b/session_4/session_4.Rmd
--- a/session_4/slides.Rmd
+++ b/session_4/slides.Rmd
--- a/session_5/session_5.Rmd
+++ b/session_5/session_5.Rmd
--- a/session_5/slides.Rmd
+++ b/session_5/slides.Rmd
---
-title: "R#5: data transformation"
-author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
-date: "28 Nov 2019"
-output:
-  slidy_presentation:
-    highlight: tango
-  beamer_presentation:
-    theme: metropolis
-    slide_level: 3
-    fig_caption: no
-    df_print: tibble
-    highlight: tango
-    latex_engine: xelatex
---
-
-```{r setup, include=FALSE, cache=TRUE}
-knitr::opts_chunk$set(echo = FALSE)
-library(tidyverse)
-```
-
-## Grouped summaries with `summarise()`
-
-`summarise()` collapses a data frame to a single row:
-
-```{r load_data, eval=T, message=FALSE, cache=T}
-library(nycflights13)
-library(tidyverse)
-flights %>% 
-  summarise(delay = mean(dep_delay, na.rm = TRUE))
-```
-
-## The power of `summarise()` with `group_by()`
-
-This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the `dplyr` verbs on a grouped data frame they’ll be automatically applied “by group”.
-
-
-```{r summarise_group_by, eval=T, message=FALSE, cache=T}
-flights %>% 
-  group_by(year, month, day) %>% 
-  summarise(delay = mean(dep_delay, na.rm = TRUE))
-```
-
-**5_a**
-
-## Challenge with `summarise()` and `group_by()`
-
-Imagine that we want to explore the relationship between the distance and average delay for each location. 
-here are three steps to prepare this data: 
-
- Group flights by destination.
- Summarise to compute distance, average delay, and number of flights.
- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
-
-```{r summarise_group_by_ggplot_a, eval = F}
-flights %>% 
-  group_by(dest)
-```
-
-## Challenge with `summarise()` and `group_by()`
-
-Imagine that we want to explore the relationship between the distance and average delay for each location. 
-
- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
-
-```{r summarise_group_by_ggplot_b, eval = F}
-flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    count = n(),
-    dist = mean(distance, na.rm = TRUE),
-    delay = mean(arr_delay, na.rm = TRUE)
-  )
-```
-
-## Missing values
-
-You may have wondered about the na.rm argument we used above. What happens if we don’t set it?
-
-```{r summarise_group_by_NA, cache = TRUE, fig.width=8, fig.height=4.5, message = FALSE}
-flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    dist = mean(distance),
-    delay = mean(arr_delay)
-  )
-```
-
-Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value.
-
-## Counts
-
-Whenever you do any aggregation, it’s always a good idea to include either a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
-
-```{r summarise_group_by_count, cache = TRUE, fig.width=8, fig.height=4.5, message = FALSE}
-flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    count = n(),
-    dist = mean(distance, na.rm = TRUE),
-    delay = mean(arr_delay, na.rm = TRUE)
-  )
-```
-
-## Challenge with `summarise()` and `group_by()`
-
-Imagine that we want to explore the relationship between the distance and average delay for each location. 
-
- Summarise to compute distance, average delay, and number of flights.
- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
-
-```{r summarise_group_by_ggplot_c, eval = F}
-flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    count = n(),
-    dist = mean(distance, na.rm = TRUE),
-    delay = mean(arr_delay, na.rm = TRUE)
-  ) %>% 
-  filter(count > 20, dest != "HNL")
-```
-
-## Challenge with `summarise()` and `group_by()`
-
-Imagine that we want to explore the relationship between the distance and average delay for each location. 
-
-```{r summarise_group_by_ggplot_d, eval = F}
-flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    count = n(),
-    dist = mean(distance, na.rm = TRUE),
-    delay = mean(arr_delay, na.rm = TRUE)
-  ) %>% 
-  filter(count > 20, dest != "HNL") %>% 
-  ggplot(mapping = aes(x = dist, y = delay)) +
-  geom_point(aes(size = count), alpha = 1/3) +
-  geom_smooth(se = FALSE)
-```
-
-**5_b**
-
-## Challenge with `summarise()` and `group_by()`
-
-```{r summarise_group_by_ggplot, echo = FALSE, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
-flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    count = n(),
-    dist = mean(distance, na.rm = TRUE),
-    delay = mean(arr_delay, na.rm = TRUE)
-  ) %>% 
-  filter(count > 20, dest != "HNL") %>% 
-  ggplot(mapping = aes(x = dist, y = delay)) +
-  geom_point(aes(size = count), alpha = 1/3) +
-  geom_smooth(se = FALSE)
-```
-
-## Ungrouping
-
-
-If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`.
-
-```{r ungroup, eval=T, message=FALSE, cache=T}
-flights %>% 
-  group_by(year, month, day) %>% 
-  ungroup() %>%
-  summarise(delay = mean(dep_delay, na.rm = TRUE))
-```
-
-## Grouping challenges
-
- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (`strftime(x,'%A')` give you the name of the day from a POSIXct date)
- Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)
-
-
-## Grouping challenges
-
- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (`strftime(x,'%A')` give you the name of the day from a POSIXct date)
-
-```{r grouping_challenges_a, eval=F, message=FALSE, cache=T}
-flights %>% 
-  mutate(
-    canceled = is.na(dep_time) | is.na(arr_time)
-  ) %>% 
-  filter(canceled) %>% 
-  mutate(wday = strftime(time_hour,'%A')) %>% 
-  group_by(wday) %>% 
-  summarise(
-    cancel_day = n()
-  ) %>%
-  ggplot(mapping = aes(x = fct_reorder(wday, cancel_day), y = cancel_day)) +
-  geom_col()
-```
-
-**5_b**
-
-## Grouping challenges
-
- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (`strftime(x,'%A')` give you the name of the day from a POSIXct date)
-
-```{r grouping_challenges_b, eval=T, echo = F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
-flights %>% 
-  mutate(
-    canceled = is.na(dep_time) | is.na(arr_time)
-  ) %>% 
-  mutate(wday = strftime(time_hour,'%A')) %>% 
-  group_by(wday) %>% 
-  summarise(
-    cancel_day = n()
-  ) %>%
-  ggplot(mapping = aes(x = wday, y = cancel_day)) +
-  geom_col()
-```
-
-## Grouping challenges
-
- Which carrier has the worst delays?
-
-```{r grouping_challenges_c, eval=F, echo = T, message=FALSE, cache=T}
-flights %>% 
-  group_by(carrier) %>% 
-  summarise(
-    carrier_delay = mean(arr_delay, na.rm = T)
-  ) %>%
-  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
-  ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
-  geom_col(alpha = 0.5)
-```
-
-**5_c**
-
-## Grouping challenges
-
- Which carrier has the worst delays?
-
-```{r grouping_challenges_d, eval=T, echo = F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
-flights %>% 
-  group_by(carrier) %>% 
-  summarise(
-    carrier_delay = mean(arr_delay, na.rm = T)
-  ) %>%
-  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
-  ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
-  geom_col(alpha = 0.5)
-```
-
-## Grouped mutates (and filters)
-
-Grouping is also useful in conjunction with `mutate()` and `filter()`
-
- Find all groups bigger than a threshold:
- Standardise to compute per group metrics:
-
-```{r group_filter, eval=F}
-flights %>% 
-  group_by(dest, year) %>% 
-  filter(n() > 10000) %>% 
-  filter(arr_delay > 0) %>% 
-  mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
-  select(year:day, dest, arr_delay, prop_delay)
-```
-
-## Goup by challenges
-
- What time of day should you fly if you want to avoid delays as much as possible?
-
-```{r group_filter_b, eval=T, echo = F, warning=F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
-flights %>% 
-  group_by(hour) %>% 
-  summarise(
-    mean_delay = mean(arr_delay, na.rm = T),
-    sd_delay = sd(arr_delay, na.rm = T),
-  ) %>% 
-  ggplot() +
-  geom_errorbar(mapping = aes(
-    x = hour,
-    ymax = mean_delay + sd_delay,
-    ymin = mean_delay - sd_delay)) +
-  geom_point(mapping = aes(
-    x = hour,
-    y = mean_delay,
-  ))
-```
- **5_d**
\ No newline at end of file
--- a/session_6/img/overview_joins.png
+++ b/session_6/img/overview_joins.png
--- a/session_6/img/overview_set.png
+++ b/session_6/img/overview_set.png
--- a/session_6/img/pivot_longer.png
+++ b/session_6/img/pivot_longer.png
--- a/session_6/img/pivot_wider.png
+++ b/session_6/img/pivot_wider.png
--- a/session_6/session_6.Rmd
+++ b/session_6/session_6.Rmd
--- a/session_6/slides.Rmd
+++ b/session_6/slides.Rmd
--- a/session_7/session_7.Rmd
+++ b/session_7/session_7.Rmd
--- a/session_8/session_8.Rmd
+++ b/session_8/session_8.Rmd
--- a/src/Dockerfile
+++ b/src/Dockerfile
+FROM rocker/tidyverse
+
+RUN apt-get update \
+    && apt-get install -y \
+    libxt6 \
+    cargo
+
+RUN Rscript -e "install.packages('rmdformats')"
+
+#session 1
+RUN Rscript -e "install.packages('rvest')"
+RUN Rscript -e "install.packages('remotes'); remotes::install_github('rlesur/klippy')"
+
+#session 3
+RUN Rscript -e "install.packages('gganimate')"
+RUN Rscript -e "install.packages('gifski')"
+RUN Rscript -e "install.packages('openxlsx')"
+
+#session4
+RUN Rscript -e "install.packages(c('ghibli', 'nycflights13','viridis','ggrepel'))"
\ No newline at end of file
--- a/src/create_docker_to_deploy_the_course.sh
+++ b/src/create_docker_to_deploy_the_course.sh
--- a/src/create_users_from_mail.sh
+++ b/src/create_users_from_mail.sh
--- a/src/create_users_from_user_list_csv.sh
+++ b/src/create_users_from_user_list_csv.sh
--- a/src/create_users_from_user_pwd_list.sh
+++ b/src/create_users_from_user_pwd_list.sh
--- a/src/Raleway-Regular.ttf
+++ b/src/Raleway-Regular.ttf
--- a/src/YanoneKaffeesatz-Bold.ttf
+++ b/src/YanoneKaffeesatz-Bold.ttf
--- a/www/github-pandoc.css
+++ b/www/github-pandoc.css
No results found