From b9a4a8e3a0eb1144069db79dd96b15ab239a6f6b Mon Sep 17 00:00:00 2001
From: hpolvech <helene.polveche@ens-lyon.fr>
Date: Thu, 26 Mar 2020 16:32:52 +0100
Subject: [PATCH] fin session3, decomp session4 tuto + challengeTime

---
 session_3/HTML_tuto_s3.Rmd  |   4 +-
 session_4/HTML_toto_s4.Rmd  | 342 ++++++++++++++++++++++++++++++++++++
 session_4/challengeTime.Rmd | 139 +++++++++++++++
 3 files changed, 483 insertions(+), 2 deletions(-)
 create mode 100644 session_4/HTML_toto_s4.Rmd
 create mode 100644 session_4/challengeTime.Rmd

diff --git a/session_3/HTML_tuto_s3.Rmd b/session_3/HTML_tuto_s3.Rmd
index c388fa9..4612e49 100644
--- a/session_3/HTML_tuto_s3.Rmd
+++ b/session_3/HTML_tuto_s3.Rmd
@@ -1,5 +1,5 @@
 ---
-title: "R#3: Transformations with ggplot2"
+title: 'R#3: Transformations with ggplot2'
 author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
 date: "Mars 2020"
 output:
@@ -295,4 +295,4 @@ bar + coord_polar()
 ```
 
 
-##See you to Session#4 : ""
\ No newline at end of file
+##See you to Session#4 : "data transformation"
\ No newline at end of file
diff --git a/session_4/HTML_toto_s4.Rmd b/session_4/HTML_toto_s4.Rmd
new file mode 100644
index 0000000..cb620ee
--- /dev/null
+++ b/session_4/HTML_toto_s4.Rmd
@@ -0,0 +1,342 @@
+---
+title: "R#4: data transformation"
+author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
+date: "Mars 2020"
+output:
+ html_document: default
+ pdf_document: default
+---
+<style type="text/css">
+h3 { /* Header 3 */
+ position: relative ;
+ color: #729FCF ;
+ left: 5%;
+}
+h2 { /* Header 2 */
+ color: darkblue ;
+ left: 10%;
+}
+h1 { /* Header 1 */
+ color: #034b6f ;
+}
+#pencadre{
+ border:1px;
+ border-style:solid;
+ border-color: #034b6f;
+ background-color: #EEF3F9;
+ padding: 1em;
+ text-align: center ;
+ border-radius : 5px 4px 3px 2px;
+}
+legend{
+ color: #034b6f ;
+}
+#pquestion {
+ color: darkgreen;
+ font-weight: bold;
+}
+</style>
+
+```{r setup, include=FALSE, cache=TRUE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+The goal of this practical is to practice data transformation with `tidyverse`.
+The objectives of this session will be to:
+
+- Filter rows with `filter()`
+- Arrange rows with `arrange()`
+- Select columns with `select()`
+- Add new variables with `mutate()`
+- Combine multiple operations with the pipe `%>%`
+
+```R
+install.packages("nycflights13")
+```
+
+```{r packageloaded, include=TRUE, message=FALSE}
+library("tidyverse")
+library("nycflights13")
+```
+
+ \
+
+# Data set : nycflights13
+
+`nycflights13::flights` contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in `?flights`.
+
+
+```{r display_data, include=TRUE}
+flights
+```
+
+- **int** stands for integers.
+- **dbl** stands for doubles, or real numbers.
+- **chr** stands for character vectors, or strings.
+- **dttm** stands for date-times (a date + a time).
+- **lgl** stands for logical, vectors that contain only TRUE or FALSE.
+- **fctr** stands for factors, which R uses to represent categorical variables with fixed possible values.
+- **date** stands for dates.
+
+ \
+
+# Filter rows with `filter()`
+
+`filter()` allows you to subset observations based on their values.
+
+```{r filter_month_day, include=TRUE}
+filter(flights, month == 1, day == 1)
+```
+
+ \
+
+`dplyr` functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, `<-`.
+
+```{r filter_month_day_sav, include=TRUE}
+jan1 <- filter(flights, month == 1, day == 1)
+```
+
+ \
+
+R either prints out the results, or saves them to a variable. To do both, wrap the assignment in parentheses:
+
+```{r filter_month_day_sav_display, include=TRUE}
+(dec25 <- filter(flights, month == 12, day == 25))
+```
+
+ \
+
+# Logical operators
+
+Multiple arguments to `filter()` are combined with “and”: every expression must be true in order for a row to be included in the output.
+
+
+
+ \
+
+Test the following operations:
+
+```{r filter_logical_operators, include=TRUE}
+filter(flights, month == 11 | month == 12)
+filter(flights, month %in% c(11, 12))
+filter(flights, !(arr_delay > 120 | dep_delay > 120))
+filter(flights, arr_delay <= 120, dep_delay <= 120)
+```
+
+ \
+
+# Missing values
+
+One important feature of R that can make comparison tricky is missing values, or `NA`s (“not availables”).
+
+```{r filter_logical_operators_NA, include=TRUE}
+NA > 5
+10 == NA
+NA + 10
+```
+
+
+```{r filter_logical_operators_test_NA, include=TRUE}
+is.na(NA)
+```
+
+ \
+
+# Arrange rows with `arrange()`
+
+ \
+
+`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
+
+```{r arrange_ymd, include=TRUE}
+arrange(flights, year, month, day)
+```
+
+ \
+Use `desc()` to re-order by a column in descending order:
+
+```{r arrange_desc, include=TRUE}
+arrange(flights, desc(dep_delay))
+```
+
+Missing values are always sorted at the end:
+
+```{r arrange_NA, include=TRUE}
+arrange(tibble(x = c(5, 2, NA)), x)
+arrange(tibble(x = c(5, 2, NA)), desc(x))
+```
+
+ \
+
+# Select columns with `select()`
+
+ \
+
+`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
+
+```{r select_ymd, include=TRUE}
+select(flights, year, month, day)
+select(flights, year:day)
+select(flights, -(year:day))
+```
+
+ \
+
+There are a number of helper functions you can use within `select()`:
+
+- `starts_with("abc")`: matches names that begin with “abc”.
+- `ends_with("xyz")`: matches names that end with “xyz”.
+- `contains("ijk")`: matches names that contain “ijk”.
+- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
+
+See `?select` for more details.
+
+ \
+
+# Add new variables with `mutate()`
+
+ \
+
+It’s often useful to add new columns that are functions of existing columns. That’s the job of `mutate()`.
+
+```{r mutate, include=TRUE}
+flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)
+
+flights_sml
+
+mutate(flights_sml, gain = dep_delay - arr_delay,
+       speed = distance / air_time * 60)
+```
+
+ \
+
+```{r mutate_reuse, include=TRUE}
+flights_sml <- mutate(flights_sml, gain = dep_delay - arr_delay,
+                      speed = distance / air_time * 60)
+
+```
+
+ \
+
+### Useful creation functions
+
+- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
+- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means.
+- Logical comparisons: `<`, `<=`, `>`, `>=`, `!=`, and `==`.
+- Ranking: there are a number of ranking functions, but you should start with `min_rank()`. There is also `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`.
+
+ \
+
+# Combining multiple operations with the pipe
+
+ \
+
+We don't want to create useless intermediate variables so we can use the pipe operator: `%>%`
+(or `Ctrl + Shift + M` in RStudio).
+
+<div id="pquestion"> - Find the 10 most delayed flights using a ranking function.
`min_rank()` </div>
+
+```{r pipe_example_a, include=TRUE}
+flights_md <- mutate(flights,
+                     most_delay = min_rank(desc(dep_delay)))
+flights_md <- filter(flights_md, most_delay <= 10)
+flights_md <- arrange(flights_md, most_delay)
+```
+
+ \
+
+
+```{r pipe_example_b, include=TRUE}
+flights_md2 <- flights %>%
+  mutate(most_delay = min_rank(desc(dep_delay))) %>%
+  filter(most_delay <= 10) %>%
+  arrange(most_delay)
+
+select(flights_md2, year:day, flight, origin, dest, dep_delay, most_delay)
+```
+
+ \
+
+Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
+
+ \
+
+Working with the pipe is one of the key criteria for belonging to the `tidyverse`. The only exception is `ggplot2`: it was written before the pipe was discovered. Unfortunately, the next iteration of `ggplot2`, `ggvis`, which does use the pipe, isn’t quite ready for prime time yet.
+
+# Grouped summaries with `summarise()`
+
+`summarise()` collapses a data frame to a single row:
+
+```{r load_data, include=TRUE}
+flights %>%
+  summarise(delay = mean(dep_delay, na.rm = TRUE))
+```
+
+### The power of `summarise()` with `group_by()`
+
+`group_by()` changes the unit of analysis from the complete dataset to individual groups. Then, when you use the `dplyr` verbs on a grouped data frame they’ll be automatically applied “by group”.
+
+```{r summarise_group_by, include=TRUE, fig.width=8, fig.height=3.5}
+flights_delay <- flights %>%
+  group_by(year, month) %>%
+  summarise(delay = mean(dep_delay, na.rm = TRUE), sd = sd(dep_delay, na.rm = TRUE)) %>%
+  arrange(month)
+
+flights_delay
+
+ggplot(data = flights_delay, mapping = aes(x = month, y = delay)) +
+  geom_bar(stat="identity", color="black", fill = "#619CFF") +
+  geom_errorbar(mapping = aes( ymin=0, ymax=delay+sd)) +
+  theme(axis.text.x = element_blank())
+
+```
+
+
+### Missing values
+
+You may have wondered about the `na.rm` argument we used above. What happens if we don’t set it?
+
+```{r summarise_group_by_NA, include=TRUE}
+flights %>%
+  group_by(dest) %>%
+  summarise(
+    dist = mean(distance),
+    delay = mean(arr_delay)
+  )
+```
+
+Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value.
+
+
+# Counts
+
+Whenever you do any aggregation, it’s always a good idea to include a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
+
+```{r summarise_group_by_count, include = TRUE, warning=F, message=F, fig.width=8, fig.height=3.5}
+summ_delay_flights <- flights %>%
+  group_by(dest) %>%
+  summarise(
+    count = n(),
+    dist = mean(distance, na.rm = TRUE),
+    delay = mean(arr_delay, na.rm = TRUE)
+  )
+summ_delay_flights
+
+ggplot(data = summ_delay_flights, mapping = aes(x = dist, y = delay, size = count)) +
+  geom_point() +
+  geom_smooth(method = lm, se = FALSE) +
+  theme(legend.position='none')
+
+```
+
+## Thank you !
+
+ \
+
+## For curious or motivated people: Challenge time!
+
+ \
+
+ \
+
+
diff --git a/session_4/challengeTime.Rmd b/session_4/challengeTime.Rmd
new file mode 100644
index 0000000..1986436
--- /dev/null
+++ b/session_4/challengeTime.Rmd
@@ -0,0 +1,139 @@
+---
+title: "Challenge time!"
+author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
+date: "Mars 2020"
+output:
+ html_document: default
+ pdf_document: default
+---
+ <style type="text/css">
+ h3 { /* Header 3 */
+ position: relative ;
+ color: #729FCF ;
+ left: 5%;
+ }
+h2 { /* Header 2 */
+ color: darkblue ;
+ left: 10%;
+}
+h1 { /* Header 1 */
+ color: #034b6f ;
+}
+#pencadre{
+border:1px;
+border-style:solid;
+border-color: #034b6f;
+ background-color: #EEF3F9;
+ padding: 1em;
+text-align: center ;
+border-radius : 5px 4px 3px 2px;
+}
+legend{
+ color: #034b6f ;
+}
+#pquestion {
+color: darkgreen;
+font-weight: bold;
+}
+</style>
+
+ ```{r setup, include=FALSE, cache=TRUE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+
+### Filter challenges :
+
+Find all flights that:
+
+ - Had an arrival delay of two or more hours
+- Were operated by United, American, or Delta
+- Departed between midnight and 6am (inclusive)
+
+Another useful dplyr filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges?
+
+How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent?
+
+Why is `NA ^ 0` not `NA`? Why is `NA | TRUE` not `NA`? Why is `FALSE & NA` not `NA`? Can you figure out the general rule? (`NA * 0` is a tricky counter-example!)
+
+### Arrange challenges :
+
+- Sort flights to find the most delayed flights. Find the flights that left earliest.
+- Sort flights to find the fastest flights.
+- Which flights traveled the longest? Which traveled the shortest?
+
+### Select challenges :
+
+- Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
+- What does the `one_of()` function do? Why might it be helpful in conjunction with this vector?
+```{r select_one_of, eval=F, message=F, cache=T}
+vars <- c("year", "month", "day", "dep_delay", "arr_delay")
+```
+- Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
+```{r select_contains, eval=F, message=F, cache=T}
+select(flights, contains("TIME"))
+```
+
+
+### Mutate challenges :
+
+- Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
+
+
+```{r mutate_challenges_a, eval=F, message=F, cache=T}
+mutate(
+  flights,
+  dep_time = (dep_time %/% 100) * 60 +
+    dep_time %% 100,
+  sched_dep_time = (sched_dep_time %/% 100) * 60 +
+    sched_dep_time %% 100
+)
+```
+
+\
+
+- Compare `dep_time`, `sched_dep_time`, and `dep_delay`. How would you expect those three numbers to be related?
+
+```{r mutate_challenge_b, eval=F, message=F, cache=T}
+mutate(
+  flights,
+  dep_time = (dep_time %/% 100) * 60 +
+    dep_time %% 100,
+  sched_dep_time = (sched_dep_time %/% 100) * 60 +
+    sched_dep_time %% 100
+)
+```
+
+\
+
+### Challenge with `summarise()` and `group_by()`
+
+Imagine that we want to explore the relationship between the distance and average delay for each location.
+There are three steps to prepare this data:
+
+- Group flights by destination.
+- Summarise to compute distance, average delay, and number of flights.
+- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
+
+```{r summarise_group_by_ggplot_a, eval = F}
+flights %>%
+  group_by(dest)
+```
+
+ \
+
+Imagine that we want to explore the relationship between the distance and average delay for each location.
+
+- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
+
+```{r summarise_group_by_ggplot_b, eval = F}
+flights %>%
+  group_by(dest) %>%
+  summarise(
+    count = n(),
+    dist = mean(distance, na.rm = TRUE),
+    delay = mean(arr_delay, na.rm = TRUE)
+  )
+```
+
+
--
GitLab