From f41f9805ccd5d1c78ab37e678452a1f615c567e7 Mon Sep 17 00:00:00 2001
From: Carine Rey <carine.rey@ens-lyon.fr>
Date: Tue, 4 Oct 2022 16:24:29 +0200
Subject: [PATCH] add items in select

---
 session_4/session_4.Rmd | 130 +++++++++++++++++++++++++++++-----------
 1 file changed, 95 insertions(+), 35 deletions(-)

diff --git a/session_4/session_4.Rmd b/session_4/session_4.Rmd
index a89d1d0..12bd776 100644
--- a/session_4/session_4.Rmd
+++ b/session_4/session_4.Rmd
@@ -228,7 +228,7 @@ One important feature of R that can make comparison tricky is missing values, or
 Indeed each of the variable type can contain either a value of this type (i.e., `2` for an **int**) or nothing.
 The *nothing recorded in a variable* status is represented with the `NA` symbol.
 
-As operations with `NA` values don t make sense, if you have `NA` somewhere in your operation, the results will be `NA`
+As operations with `NA` values don't make sense, if you have `NA` somewhere in your operation, the result will be `NA`.
 
 ```{r filter_logical_operators_NA, include=TRUE}
 NA > 5
@@ -245,16 +245,19 @@ is.na(NA)
 `filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values. If you want to preserve missing values, ask for them explicitly:
 
 ```{r filter_logical_operators_test_NA2, include=TRUE}
-df <- tibble(x = c(1, NA, 3))
-filter(df, x > 1)
-filter(df, is.na(x) | x > 1)
+df <- tibble( x = c("A","B","C"),
+              y = c(1, NA, 3)
+            )
+df
+filter(df, y > 1)
+filter(df, is.na(y) | y > 1)
 ```
 
 ## Challenges
 
 <div class="pencadre">
 Find all flights that:
-- Had an arrival delay of two or more hours (you can check `?flights`)
+- Had an arrival delay (`arr_delay`) of two or more hours (you can check `?flights`)
 - Flew to Houston (IAH or HOU)
 </div>
 
@@ -289,7 +292,7 @@ Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA`
 ```{r filter_chalenges_d, eval=TRUE}
 NA ^ 0 # ^ 0 is always 1 it's an arbitrary rule not a computation
 NA | TRUE # if a member of a OR operation is TRUE the results is TRUE
-FALSE & NA # if a member of a AN operation is FALSE the results is TRUE
+FALSE & NA # if a member of an AND operation is FALSE, the result is FALSE
 NA * 0 # here we have a true computation
 ```
 </p>
@@ -300,54 +303,55 @@ NA * 0 # here we have a true computation
 `arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
 
 ```{r arrange_ymd, include=TRUE}
-arrange(flights, year, month, day)
+arrange(flights, distance, dep_delay)
 ```
 
-<div class="pencadre">
-Use `desc()` to reorder by a column in descending order:
-</div>
 
-<details><summary>Solution</summary>
-<p>
+You can use `desc()` to reorder by a column in descending order:
 
 ```{r arrange_desc, include=TRUE}
-arrange(flights, desc(dep_delay))
+arrange(flights, distance, desc(dep_delay))
 ```
-</p>
-</details>
+
 
 ## Missing values
 
 Missing values are always sorted at the end:
 
 ```{r arrange_NA, include=TRUE}
-arrange(tibble(x = c(5, 2, NA)), x)
-arrange(tibble(x = c(5, 2, NA)), desc(x))
+df <- tibble( x = c("A","B","C"),
+              y = c(1, NA, 3)
+            )
+df
+
+arrange(df, y)
+arrange(df, desc(y))
 ```
 
 ## Challenges
 <div class="pencadre">
 
-- Find the most delayed flight.
-- Find the flight that left earliest.
-- How could you arrange all missing values to the start ?
+- Find the most delayed flight at arrival (`arr_delay`).
+- Find the flight that left earliest (`dep_delay`).
+- How could you sort all missing values to the start of the `df` tibble?
 
 </div>
 
 <details><summary>Solution</summary>
 <p>
 
-Find the most delayed flight.
+Find the most delayed flight at arrival.
 ```{r chalange_arrange_desc_a, include=TRUE}
-arrange(flights, desc(dep_delay))
+arrange(flights, desc(arr_delay))
 ```
 Find the flight that left earliest.
 ```{r chalange_arrange_desc_b, include=TRUE}
 arrange(flights, dep_delay)
 ```
-How could you arrange all missing values to the start
+How could you sort all missing values to the start of the `df` tibble?
+
 ```{r chalange_arrange_desc_c, include=TRUE}
-arrange(tibble(x = c(5, 2, NA)), desc(is.na(x)))
+arrange(df, desc(is.na(y)))
 ```
 </p>
 </details>
@@ -358,47 +362,75 @@ arrange(tibble(x = c(5, 2, NA)), desc(is.na(x)))
 `select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
 
 You can select by column names
+
 ```{r select_ymd_a, include=TRUE}
 select(flights, year, month, day)
 ```
 
 By defining a range of columns
+
 ```{r select_ymd_b, include=TRUE}
 select(flights, year:day)
 ```
 
-Or you can do a negative (`-`) to remove columns.
+Or you can use a negative selection (`-`) to remove columns.
+
 ```{r select_ymd_c, include=TRUE}
 select(flights, -(year:day))
 ```
 
+You can also rename columns on the fly.
+
+```{r select_ymd_d, include=TRUE}
+select(flights, Y = year, M = month, D = day)
+```
+
+
 ## Helper functions
 
 here are a number of helper functions you can use within `select()`:
 
-- `starts_with("abc")`: matches names that begin with `"abc"`.
-- `ends_with("xyz")`: matches names that end with `"xyz"`.
-- `contains("ijk")`: matches names that contain `"ijk"`.
+- `starts_with("abc")`: matches column names that begin with `"abc"`.
+- `ends_with("xyz")`: matches column names that end with `"xyz"`.
+- `contains("ijk")`: matches column names that contain `"ijk"`.
 - `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
+- `where(test_function)`: selects columns for which `test_function` returns `TRUE`.
 
 See `?select` for more details.
 
 ## Challenges
 
 <div class="pencadre">
+<p>
 
-- Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
-<details><summary>Solution</summary>
+- Brainstorm as many ways as possible to select only `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`. You can combine several selection arguments with `|`, `&` and `!`.
+
+The simplest way to start:
+
+```{r challenge_select_a1_simple}
+df_dep_arr <- select(flights, dep_time, dep_delay, arr_time, arr_delay)
+colnames(df_dep_arr)
+```
+
+</p>
+</div>
+
+<details><summary>Other solutions</summary>
 <p>
 
-```{r challenge_select_a, eval=FALSE}
-select(flights, contains("time") | contains("delay"))
-select(flights, contains("_") & !starts_with("sched") & !starts_with("time"))
+```{r challenge_select_a1, eval=FALSE}
+select(flights, dep_time, dep_delay, arr_time, arr_delay)
+select(flights, starts_with("dep"), starts_with("arr") )
+select(flights, starts_with("dep") | starts_with("arr") )
+select(flights, matches("^(dep|arr)") )
+select(flights, dep_time : arr_delay & !starts_with("sched"))
 ```
 </p>
 </details>
 
-- What does the `one_of()` function do? Why might it be helpful in conjunction with this vector? 
+- What does the `any_of()` function do?
+- Why might it be helpful in conjunction with this vector? How does it differ from `all_of()` (hint: add `"toto"` to `vars`)?
+
 
 ```{r select_one_of, eval=T, message=F, cache=T}
 vars <- c("year", "month", "day", "dep_delay", "arr_delay")
@@ -408,12 +440,40 @@ vars <- c("year", "month", "day", "dep_delay", "arr_delay")
 <p>
 
 ```{r challenge_select_b, eval=FALSE}
-select(flights, one_of(vars))
+select(flights, any_of(vars))
+select(flights, all_of(vars))
+```
+
+From the help page (`?all_of`):
+
+ - all_of() is for strict selection. If any of the variables in the character vector is missing, an error is thrown.
+ - any_of() doesn't check for missing variables. It is especially useful with negative selections, when you would like to make sure a variable is removed.
+ 
+```{r challenge_select_b2, eval=FALSE}
+vars <- c(vars, "toto")
+select(flights, any_of(vars))
+select(flights, all_of(vars))
+```
+</p>
+</details>
+
+- Select all columns which contain character values. Then select all columns which contain numeric values.
+
+</p>
+</div>
+
+<details><summary>Solution</summary>
+<p>
+
+```{r challenge_select_e1, eval=FALSE}
+select(flights, where(is.character))
+select(flights, where(is.numeric))
 ```
 </p>
 </details>
 
 - Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
+
 ```{r select_contains, eval=F, message=F, cache=T}
 select(flights, contains("TIME"))
 ```
-- 
GitLab