From b9a4a8e3a0eb1144069db79dd96b15ab239a6f6b Mon Sep 17 00:00:00 2001
From: hpolvech <helene.polveche@ens-lyon.fr>
Date: Thu, 26 Mar 2020 16:32:52 +0100
Subject: [PATCH] finish session3, split session4 into tuto + challengeTime

---
 session_3/HTML_tuto_s3.Rmd  |   4 +-
 session_4/HTML_toto_s4.Rmd  | 342 ++++++++++++++++++++++++++++++++++++
 session_4/challengeTime.Rmd | 139 +++++++++++++++
 3 files changed, 483 insertions(+), 2 deletions(-)
 create mode 100644 session_4/HTML_toto_s4.Rmd
 create mode 100644 session_4/challengeTime.Rmd

diff --git a/session_3/HTML_tuto_s3.Rmd b/session_3/HTML_tuto_s3.Rmd
index c388fa9..4612e49 100644
--- a/session_3/HTML_tuto_s3.Rmd
+++ b/session_3/HTML_tuto_s3.Rmd
@@ -1,5 +1,5 @@
 ---
-title: "R#3: Transformations with ggplot2"
+title: 'R#3: Transformations with ggplot2'
 author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
 date: "Mars 2020"
 output:
@@ -295,4 +295,4 @@ bar + coord_polar()
 ```
 
 
-##See you to Session#4 : ""
\ No newline at end of file
+## See you in Session#4: "data transformation"
\ No newline at end of file
diff --git a/session_4/HTML_toto_s4.Rmd b/session_4/HTML_toto_s4.Rmd
new file mode 100644
index 0000000..cb620ee
--- /dev/null
+++ b/session_4/HTML_toto_s4.Rmd
@@ -0,0 +1,342 @@
+---
+title: "R#4: data transformation"
+author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
+date: "March 2020"
+output:
+  html_document: default
+  pdf_document: default
+---
+<style type="text/css">
+h3 { /* Header 3 */
+  position: relative ;
+  color: #729FCF ;
+  left: 5%;
+}
+h2 { /* Header 2 */
+  color: darkblue ;
+  left: 10%;
+} 
+h1 { /* Header 1 */
+  color: #034b6f ;
+} 
+#pencadre{
+  border:1px; 
+  border-style:solid; 
+  border-color: #034b6f; 
+  background-color: #EEF3F9; 
+  padding: 1em;
+  text-align: center ;
+  border-radius : 5px 4px 3px 2px;
+}
+legend{
+  color: #034b6f ;
+}
+#pquestion {
+  color: darkgreen;
+  font-weight: bold;
+}
+</style>
+
+```{r setup, include=FALSE, cache=TRUE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+The goal of this practical is to practice data transformation with `tidyverse`.
+The objectives of this session will be to:
+
+- Filter rows with `filter()`
+- Arrange rows with `arrange()`
+- Select columns with `select()`
+- Add new variables with `mutate()`
+- Combine multiple operations with the pipe `%>%`
+
+```R
+install.packages("nycflights13")
+```
+
+```{r packageloaded, include=TRUE, message=FALSE}
+library("tidyverse")
+library("nycflights13")
+```
+
+ \ 
+ 
+# Data set : nycflights13
+
+`nycflights13::flights` contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in `?flights`.
+
+
+```{r display_data, include=TRUE}
+flights
+```
+
+- **int** stands for integers.
+- **dbl** stands for doubles, or real numbers.
+- **chr** stands for character vectors, or strings.
+- **dttm** stands for date-times (a date + a time).
+- **lgl** stands for logical, vectors that contain only TRUE or FALSE.
+- **fct** stands for factors, which R uses to represent categorical variables with fixed possible values.
+- **date** stands for dates.
+
+ \ 
+ 
+# Filter rows with `filter()`
+
+`filter()` allows you to subset observations based on their values. 
+
+```{r filter_month_day, include=TRUE}
+filter(flights, month == 1, day == 1)
+```
+
+ \ 
+ 
+`dplyr` functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, `<-`:
+
+```{r filter_month_day_sav, include=TRUE}
+jan1 <- filter(flights, month == 1, day == 1)
+```
+
+ \ 
+ 
+R either prints out the results, or saves them to a variable. To do both, wrap the assignment in parentheses:
+
+```{r filter_month_day_sav_display, include=TRUE}
+(dec25 <- filter(flights, month == 12, day == 25))
+```
+
+ \ 
+ 
+# Logical operators
+
+Multiple arguments to `filter()` are combined with “and”: every expression must be true in order for a row to be included in the output.
+
+![](./img/transform-logical.png)
+
+ \ 
+
+Test the following operations:
+
+```{r filter_logical_operators, include=TRUE}
+filter(flights, month == 11 | month == 12)
+filter(flights, month %in% c(11, 12))
+filter(flights, !(arr_delay > 120 | dep_delay > 120))
+filter(flights, arr_delay <= 120, dep_delay <= 120)
+```
+
+ \ 
+ 
+# Missing values
+
+One important feature of R that can make comparison tricky is missing values, or `NA`s (“not availables”). Almost any operation involving a missing value returns a missing value.
+
+```{r filter_logical_operators_NA, include=TRUE}
+NA > 5
+10 == NA
+NA + 10
+```
+
+
+```{r filter_logical_operators_test_NA, include=TRUE}
+is.na(NA)
+```
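+
+As a small sketch of why this matters for `filter()`: it only keeps rows where the condition is `TRUE`, so rows where the condition evaluates to `NA` are dropped unless you ask for them explicitly. The tibble `df` below is a made-up example, not part of `nycflights13`.
+
+```{r filter_keep_NA, include=TRUE}
+library("dplyr")
+# toy data (hypothetical): one small value, one missing value, one large value
+df <- tibble(x = c(1, NA, 3))
+# the NA row is dropped because NA > 1 is NA, not TRUE
+without_na <- filter(df, x > 1)
+without_na
+# keep the missing value explicitly with is.na()
+with_na <- filter(df, is.na(x) | x > 1)
+with_na
+```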
+
+ \ 
+ 
+# Arrange rows with `arrange()`
+
+ \ 
+
+`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
+
+```{r arrange_ymd, include=TRUE}
+arrange(flights, year, month, day)
+```
+
+ \ 
+Use `desc()` to re-order by a column in descending order:
+
+```{r arrange_desc, include=TRUE}
+arrange(flights, desc(dep_delay))
+```
+
+Missing values are always sorted at the end:
+
+```{r arrange_NA, include=TRUE}
+arrange(tibble(x = c(5, 2, NA)), x)
+arrange(tibble(x = c(5, 2, NA)), desc(x))
+```
+
+ \ 
+
+# Select columns with `select()`
+
+ \ 
+ 
+`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
+
+```{r select_ymd, include=TRUE}
+select(flights, year, month, day)
+select(flights, year:day)
+select(flights, -(year:day))
+```
+
+ \ 
+
+There are a number of helper functions you can use within `select()`:
+
+- `starts_with("abc")`: matches names that begin with “abc”.
+- `ends_with("xyz")`: matches names that end with “xyz”.
+- `contains("ijk")`: matches names that contain “ijk”.
+- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
+
+See `?select` for more details.
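+
+The helpers above can be tried on a small made-up tibble. The column names in `toy` are hypothetical and chosen only to match the patterns in the list.
+
+```{r select_helpers_toy, include=TRUE}
+library("dplyr")
+# toy data (hypothetical column names, one row is enough to see which columns match)
+toy <- tibble(x1 = 1, x2 = 2, x3 = 3, abc_total = 4, total_xyz = 5)
+select(toy, starts_with("abc"))   # names that begin with "abc"
+select(toy, ends_with("xyz"))     # names that end with "xyz"
+select(toy, contains("total"))    # names that contain "total"
+select(toy, num_range("x", 1:3))  # x1, x2 and x3
+```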
+
+ \ 
+ 
+# Add new variables with `mutate()`
+
+ \ 
+ 
+It’s often useful to add new columns that are functions of existing columns. That’s the job of `mutate()`.
+
+```{r mutate, include=TRUE}
+flights_sml <- select(flights,  year:day, ends_with("delay"), distance, air_time)
+
+flights_sml
+
+mutate(flights_sml, gain = dep_delay - arr_delay,
+            speed = distance / air_time * 60)
+```
+
+ \ 
+
+```{r mutate_reuse, include=TRUE}
+flights_sml <- mutate(flights_sml, gain = dep_delay - arr_delay,
+            speed = distance / air_time * 60)
+
+```
+
+ \ 
+ 
+### Useful creation functions
+
+- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
+- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means. 
+- Logical comparisons: `<`, `<=`, `>`, `>=`, `!=`, and `==`.
+- Ranking: there are a number of ranking functions, but you should start with `min_rank()`. There is also `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`.
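+
+A minimal sketch of a few of these functions inside `mutate()`, on a made-up tibble (`toy`, `previous`, `change`, `running_total` and `rk` are names chosen for illustration only):
+
+```{r creation_functions_toy, include=TRUE}
+library("dplyr")
+# toy data (hypothetical), with one tie to show how min_rank() handles it
+toy <- tibble(x = c(10, 20, 20, 50))
+toy_new <- mutate(toy,
+  previous = lag(x),            # value of the previous row: NA, 10, 20, 20
+  change = x - lag(x),          # running difference: NA, 10, 0, 30
+  running_total = cumsum(x),    # cumulative sum: 10, 30, 50, 100
+  rk = min_rank(desc(x))        # rank from largest to smallest: 4, 2, 2, 1
+)
+toy_new
+```
+
+Note that tied values share the smallest rank, so no row gets rank 3 here.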
+
+ \ 
+ 
+# Combining multiple operations with the pipe
+
+ \ 
+ 
+We don't want to create useless intermediate variables, so we can chain operations with the pipe operator `%>%`
+(keyboard shortcut in RStudio: `Ctrl + Shift + M`).
+
+<div id="pquestion"> - Find the 10 most delayed flights using a ranking function such as `min_rank()`. </div>
+
+```{r pipe_example_a, include=TRUE}
+flights_md <- mutate(flights,
+                     most_delay = min_rank(desc(dep_delay)))
+flights_md <- filter(flights_md, most_delay <= 10)
+flights_md <- arrange(flights_md, most_delay)
+```
+
+ \ 
+ 
+
+```{r pipe_example_b, include=TRUE}
+flights_md2 <- flights %>%
+    mutate(most_delay = min_rank(desc(dep_delay))) %>% 
+    filter(most_delay <= 10) %>% 
+    arrange(most_delay)
+
+select(flights_md2, year:day, flight, origin, dest, dep_delay, most_delay)
+```
+
+ \ 
+
+Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. 
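+
+To check this rewriting rule on a small case, the sketch below builds the same result twice, once with nested calls and once with the pipe. The tibble `df` is a made-up example.
+
+```{r pipe_equivalence_toy, include=TRUE}
+library("dplyr")
+# toy data (hypothetical)
+df <- tibble(x = c(3, 1, 2))
+# nested form: has to be read from the inside out
+nested <- arrange(filter(df, x > 1), x)
+# piped form: reads left-to-right
+piped <- df %>% filter(x > 1) %>% arrange(x)
+# compare the two results
+all.equal(nested, piped)
+```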
+
+ \ 
+
+Working with the pipe is one of the key criteria for belonging to the `tidyverse`. The only exception is `ggplot2`: it was written before the pipe was discovered. Unfortunately, the next iteration of `ggplot2`, `ggvis`, which does use the pipe, isn’t quite ready for prime time yet.
+
+# Grouped summaries with `summarise()`
+
+`summarise()` collapses a data frame to a single row:
+
+```{r load_data, include=TRUE}
+flights %>% 
+  summarise(delay = mean(dep_delay, na.rm = TRUE))
+```
+
+### The power of `summarise()` with `group_by()`
+
+`group_by()` changes the unit of analysis from the complete dataset to individual groups. Then, when you use the `dplyr` verbs on a grouped data frame, they’ll be automatically applied “by group”.
+
+```{r summarise_group_by, include=TRUE, fig.width=8, fig.height=3.5}
+flights_delay <- flights %>% 
+  group_by(year, month) %>% 
+  summarise(delay = mean(dep_delay, na.rm = TRUE), sd = sd(dep_delay, na.rm = TRUE)) %>% 
+  arrange(month)
+
+flights_delay
+
+ggplot(data = flights_delay, mapping = aes(x = month, y = delay)) +
+  geom_bar(stat="identity", color="black", fill = "#619CFF") +
+  geom_errorbar(mapping = aes( ymin=0, ymax=delay+sd)) + 
+  theme(axis.text.x = element_blank())
+
+```
+
+
+### Missing values
+
+You may have wondered about the `na.rm` argument we used above. What happens if we don’t set it?
+
+```{r summarise_group_by_NA, include=TRUE}
+flights %>% 
+  group_by(dest) %>% 
+  summarise(
+    dist = mean(distance),
+    delay = mean(arr_delay)
+  )
+```
+
+Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value.
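+
+The same rule can be seen on a plain vector. This is only a sketch with made-up values (`delays` is a hypothetical vector, not taken from `flights`):
+
+```{r na_rm_toy, include=TRUE}
+# toy vector (hypothetical) with one missing value
+delays <- c(10, NA, 30)
+mean(delays)                # NA: one missing input makes the result missing
+mean(delays, na.rm = TRUE)  # 20: missing values are removed before averaging
+sum(!is.na(delays))         # 2: number of non-missing values actually used
+```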
+
+
+# Counts
+
+Whenever you do any aggregation, it’s always a good idea to include either a count (`n()`) or a count of non-missing values (`sum(!is.na(x))`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
+
+```{r summarise_group_by_count, include = TRUE, warning=F, message=F, fig.width=8, fig.height=3.5}
+summ_delay_flights <- flights %>% 
+                      group_by(dest) %>% 
+                      summarise(
+                          count = n(),
+                          dist = mean(distance, na.rm = TRUE),
+                          delay = mean(arr_delay, na.rm = TRUE)
+                      )
+summ_delay_flights
+
+ggplot(data = summ_delay_flights, mapping = aes(x = dist, y = delay, size = count)) +
+  geom_point() +
+  geom_smooth(method = lm, se = FALSE) +
+  theme(legend.position='none')
+
+```
+
+## Thank you !
+
+ \ 
+ 
+## For curious or motivated people: Challenge time!
+
+ \ 
+ 
+ \ 
+ 
+ 
diff --git a/session_4/challengeTime.Rmd b/session_4/challengeTime.Rmd
new file mode 100644
index 0000000..1986436
--- /dev/null
+++ b/session_4/challengeTime.Rmd
@@ -0,0 +1,139 @@
+---
+title: "Challenge time!"
+author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
+date: "March 2020"
+output:
+  html_document: default
+  pdf_document: default
+---
+<style type="text/css">
+h3 { /* Header 3 */
+  position: relative ;
+  color: #729FCF ;
+  left: 5%;
+}
+h2 { /* Header 2 */
+  color: darkblue ;
+  left: 10%;
+} 
+h1 { /* Header 1 */
+  color: #034b6f ;
+} 
+#pencadre{
+  border:1px; 
+  border-style:solid; 
+  border-color: #034b6f; 
+  background-color: #EEF3F9; 
+  padding: 1em;
+  text-align: center ;
+  border-radius : 5px 4px 3px 2px;
+}
+legend{
+  color: #034b6f ;
+}
+#pquestion {
+  color: darkgreen;
+  font-weight: bold;
+}
+</style>
+
+```{r setup, include=FALSE, cache=TRUE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+
+### Filter challenges :
+
+Find all flights that:
+
+- Had an arrival delay of two or more hours
+- Were operated by United, American, or Delta
+- Departed between midnight and 6am (inclusive)
+
+Another useful dplyr filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges?
+
+How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent?
+
+Why is `NA ^ 0` not `NA`? Why is `NA | TRUE` not `NA`? Why is `FALSE & NA` not `NA`? Can you figure out the general rule? (`NA * 0` is a tricky counter-example!)
+
+### Arrange challenges :
+
+- Sort flights to find the most delayed flights. Find the flights that left earliest.
+- Sort flights to find the fastest flights.
+- Which flights traveled the longest? Which traveled the shortest?
+
+### Select challenges :
+
+- Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
+- What does the `one_of()` function do? Why might it be helpful in conjunction with this vector?
+```{r select_one_of, eval=F, message=F, cache=T}
+vars <- c("year", "month", "day", "dep_delay", "arr_delay")
+```
+- Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
+```{r select_contains, eval=F, message=F, cache=T}
+select(flights, contains("TIME"))
+```
+
+
+### Mutate challenges :
+
+- Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
+
+
+```{r mutate_challenges_a, eval=F, message=F, cache=T}
+mutate(
+  flights,
+  dep_time = (dep_time %/% 100) * 60 +
+    dep_time %% 100,
+  sched_dep_time = (sched_dep_time %/% 100) * 60 +
+    sched_dep_time %% 100
+)
+```
+
+\ 
+
+- Compare `dep_time`, `sched_dep_time`, and `dep_delay`. How would you expect those three numbers to be related?
+
+```{r mutate_challenge_b, eval=F, message=F, cache=T}
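+mutate(
+  flights,
+  dep_time = (dep_time %/% 100) * 60 + 
+    dep_time %% 100,
+  sched_dep_time = (sched_dep_time %/% 100) * 60 +
+    sched_dep_time %% 100,
+  # expected relation: dep_delay = dep_time - sched_dep_time (in minutes),
+  # so this illustrative column should be 0 when the relation holds
+  dep_delay_diff = dep_delay - (dep_time - sched_dep_time)
+)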
+
+\ 
+
+### Challenge with `summarise()` and `group_by()`
+
+Imagine that we want to explore the relationship between the distance and average delay for each location. 
+There are three steps to prepare this data: 
+
+- Group flights by destination.
+- Summarise to compute distance, average delay, and number of flights.
+- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
+
+```{r summarise_group_by_ggplot_a, eval = F}
+flights %>% 
+  group_by(dest)
+```
+
+ \ 
+
+Imagine that we want to explore the relationship between the distance and average delay for each location. 
+
+- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
+
+```{r summarise_group_by_ggplot_b, eval = F}
+flights %>% 
+  group_by(dest) %>% 
+  summarise(
+    count = n(),
+    dist = mean(distance, na.rm = TRUE),
+    delay = mean(arr_delay, na.rm = TRUE)
+  )
+```
+
+
-- 
GitLab