Commit f41f9805 authored by Carine Rey

add items in select

parent 4e6dd703
1 merge request: !6 Switch to main as default branch
...@@ -228,7 +228,7 @@ One important feature of R that can make comparison tricky is missing values, or ...@@ -228,7 +228,7 @@ One important feature of R that can make comparison tricky is missing values, or
Indeed each of the variable type can contain either a value of this type (i.e., `2` for an **int**) or nothing. Indeed each of the variable type can contain either a value of this type (i.e., `2` for an **int**) or nothing.
The *nothing recorded in a variable* status is represented with the `NA` symbol. The *nothing recorded in a variable* status is represented with the `NA` symbol.
As operations with `NA` values don t make sense, if you have `NA` somewhere in your operation, the results will be `NA` As operations with `NA` values don't make sense, if you have `NA` somewhere in your operation, the results will be `NA`
```{r filter_logical_operators_NA, include=TRUE} ```{r filter_logical_operators_NA, include=TRUE}
NA > 5 NA > 5
...@@ -245,16 +245,19 @@ is.na(NA) ...@@ -245,16 +245,19 @@ is.na(NA)
`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values. If you want to preserve missing values, ask for them explicitly: `filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values. If you want to preserve missing values, ask for them explicitly:
```{r filter_logical_operators_test_NA2, include=TRUE} ```{r filter_logical_operators_test_NA2, include=TRUE}
df <- tibble(x = c(1, NA, 3)) df <- tibble( x = c("A","B","C"),
filter(df, x > 1) y = c(1, NA, 3)
filter(df, is.na(x) | x > 1) )
df
filter(df, y > 1)
filter(df, is.na(y) | y > 1)
``` ```
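The behaviour described above can be checked on a small example. This is a minimal sketch, assuming the dplyr package is installed; the tibble `df_na` and the result names `kept_default` and `kept_with_na` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: one missing value in column y
df_na <- tibble(
  x = c("A", "B", "C"),
  y = c(1, NA, 3)
)

# Any comparison involving NA yields NA, not TRUE or FALSE
df_na$y > 1

# filter() keeps only rows where the condition is TRUE,
# so the row with a missing y is dropped silently
kept_default <- filter(df_na, y > 1)

# Ask for missing values explicitly to keep them
kept_with_na <- filter(df_na, is.na(y) | y > 1)

nrow(kept_default)  # 1
nrow(kept_with_na)  # 2
```

Comparing the two row counts makes the silent loss of `NA` rows visible, which is easy to miss on a large table such as `flights`.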
## Challenges ## Challenges
<div class="pencadre"> <div class="pencadre">
Find all flights that: Find all flights that:
- Had an arrival delay of two or more hours (you can check `?flights`) - Had an arrival delay (`arr_delay`) of two or more hours (you can check `?flights`)
- Flew to Houston (IAH or HOU) - Flew to Houston (IAH or HOU)
</div> </div>
...@@ -289,7 +292,7 @@ Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` ...@@ -289,7 +292,7 @@ Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA`
```{r filter_chalenges_d, eval=TRUE} ```{r filter_chalenges_d, eval=TRUE}
NA ^ 0 # ^ 0 is always 1 it's an arbitrary rule not a computation NA ^ 0 # ^ 0 is always 1 it's an arbitrary rule not a computation
NA | TRUE # if a member of a OR operation is TRUE the results is TRUE NA | TRUE # if a member of a OR operation is TRUE the results is TRUE
FALSE & NA # if a member of a AN operation is FALSE the results is TRUE FALSE & NA # if a member of a AND operation is FALSE the results is FALSE
NA * 0 # here we have a true computation NA * 0 # here we have a true computation
``` ```
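One way to read these results is to treat `NA` as "unknown": the result is known only when it does not depend on what the unknown value is. The sketch below uses base R only; the variable names are arbitrary.

```r
# The result is known whenever it does not depend on the missing value
or_result  <- NA | TRUE    # TRUE whatever NA stands for
and_result <- FALSE & NA   # FALSE whatever NA stands for
pow_result <- NA ^ 0       # 1: R returns 1 for anything to the power 0
mul_result <- NA * 0       # NA: arithmetic with a missing value stays missing

# When the answer does depend on the unknown value, NA propagates
or_unknown  <- NA | FALSE
and_unknown <- TRUE & NA

c(or_result, and_result, or_unknown, and_unknown)
```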
</p> </p>
...@@ -300,54 +303,55 @@ NA * 0 # here we have a true computation ...@@ -300,54 +303,55 @@ NA * 0 # here we have a true computation
`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order. `arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
```{r arrange_ymd, include=TRUE} ```{r arrange_ymd, include=TRUE}
arrange(flights, year, month, day) arrange(flights, distance, dep_delay)
``` ```
<div class="pencadre">
Use `desc()` to reorder by a column in descending order:
</div>
<details><summary>Solution</summary> You can use `desc()` to reorder by a column in descending order:
<p>
```{r arrange_desc, include=TRUE} ```{r arrange_desc, include=TRUE}
arrange(flights, desc(dep_delay)) arrange(flights, distance, desc(dep_delay))
``` ```
</p>
</details>
## Missing values ## Missing values
Missing values are always sorted at the end: Missing values are always sorted at the end:
```{r arrange_NA, include=TRUE} ```{r arrange_NA, include=TRUE}
arrange(tibble(x = c(5, 2, NA)), x) df <- tibble( x = c("A","B","C"),
arrange(tibble(x = c(5, 2, NA)), desc(x)) y = c(1, NA, 3)
)
df
arrange(df, y)
arrange(df, desc(y))
``` ```
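To see that `desc()` reverses the known values but leaves the missing one at the end, the row order can be read from the `x` column. This is a small sketch, assuming dplyr is installed; `toy` is a hypothetical tibble.

```r
library(dplyr)

# Hypothetical toy data with one missing value in y
toy <- tibble(
  x = c("A", "B", "C"),
  y = c(1, NA, 3)
)

asc_order  <- arrange(toy, y)$x        # "A" "C" "B"
desc_order <- arrange(toy, desc(y))$x  # "C" "A" "B"

asc_order
desc_order
```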
## Challenges ## Challenges
<div class="pencadre"> <div class="pencadre">
- Find the most delayed flight. - Find the most delayed flight at arrival (`arr_delay`).
- Find the flight that left earliest. - Find the flight that left earliest (`dep_delay`).
- How could you arrange all missing values to the start ? - How could you arrange all missing values to the start in the `df` tibble ?
</div> </div>
<details><summary>Solution</summary> <details><summary>Solution</summary>
<p> <p>
Find the most delayed flight. Find the most delayed flight at arrival
```{r chalange_arrange_desc_a, include=TRUE} ```{r chalange_arrange_desc_a, include=TRUE}
arrange(flights, desc(dep_delay)) arrange(flights, desc(arr_delay))
``` ```
Find the flight that left earliest. Find the flight that left earliest.
```{r chalange_arrange_desc_b, include=TRUE} ```{r chalange_arrange_desc_b, include=TRUE}
arrange(flights, dep_delay) arrange(flights, dep_delay)
``` ```
How could you arrange all missing values to the start How could you arrange all missing values to the start in the `df` tibble ?
```{r chalange_arrange_desc_c, include=TRUE} ```{r chalange_arrange_desc_c, include=TRUE}
arrange(tibble(x = c(5, 2, NA)), desc(is.na(x))) arrange(df, desc(is.na(y)))
``` ```
</p> </p>
</details> </details>
...@@ -358,47 +362,75 @@ arrange(tibble(x = c(5, 2, NA)), desc(is.na(x))) ...@@ -358,47 +362,75 @@ arrange(tibble(x = c(5, 2, NA)), desc(is.na(x)))
`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. `select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
You can select by column names You can select by column names
```{r select_ymd_a, include=TRUE} ```{r select_ymd_a, include=TRUE}
select(flights, year, month, day) select(flights, year, month, day)
``` ```
By defining a range of columns By defining a range of columns
```{r select_ymd_b, include=TRUE} ```{r select_ymd_b, include=TRUE}
select(flights, year:day) select(flights, year:day)
``` ```
Or you can do a negative (`-`) to remove columns. Or, you can do a negative (`-`) to remove columns.
```{r select_ymd_c, include=TRUE} ```{r select_ymd_c, include=TRUE}
select(flights, -(year:day)) select(flights, -(year:day))
``` ```
And, you can also rename column names on the fly.
```{r select_ymd_d, include=TRUE}
select(flights, Y = year, M = month, D = day)
```
## Helper functions ## Helper functions
There are a number of helper functions you can use within `select()`: There are a number of helper functions you can use within `select()`:
- `starts_with("abc")`: matches names that begin with `"abc"`. - `starts_with("abc")`: matches column names that begin with `"abc"`.
- `ends_with("xyz")`: matches names that end with `"xyz"`. - `ends_with("xyz")`: matches column names that end with `"xyz"`.
- `contains("ijk")`: matches names that contain `"ijk"`. - `contains("ijk")`: matches column names that contain `"ijk"`.
- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`. - `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
- `where(test_function)`: select columns for which the result is TRUE.
See `?select` for more details. See `?select` for more details.
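The helpers listed above can be tried on a small table before applying them to `flights`. This is a minimal sketch, assuming dplyr 1.0.0 or later is installed; the tibble `toy` and its column names are invented to exercise each helper.

```r
library(dplyr)

# Hypothetical one-row tibble; the column names are chosen to match the helpers
toy <- tibble(
  dep_time = 517, dep_delay = 2, arr_time = 830,
  carrier = "UA", x1 = 1, x2 = 2, x3 = 3
)

names(select(toy, starts_with("dep")))   # "dep_time" "dep_delay"
names(select(toy, ends_with("time")))    # "dep_time" "arr_time"
names(select(toy, contains("_")))        # "dep_time" "dep_delay" "arr_time"
names(select(toy, num_range("x", 1:3)))  # "x1" "x2" "x3"
names(select(toy, where(is.character)))  # "carrier"
```

Wrapping each call in `names()` prints only the selected column names, which keeps the comparison between helpers short.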
## Challenges ## Challenges
<div class="pencadre"> <div class="pencadre">
<p>
- Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`. - Brainstorm as many ways as possible to select only `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`. You can associate several selections arguments with `|` , `&` and `!`.
<details><summary>Solution</summary>
The simplest way to start:
```{r challenge_select_a1_simple}
df_dep_arr <- select(flights, dep_time, dep_delay, arr_time, arr_delay)
colnames(df_dep_arr)
```
</p>
</div>
<details><summary>Other solutions</summary>
<p> <p>
```{r challenge_select_a, eval=FALSE} ```{r challenge_select_a1, eval=FALSE}
select(flights, contains("time") | contains("delay")) select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, contains("_") & !starts_with("sched") & !starts_with("time")) select(flights, starts_with("dep"), starts_with("arr") )
select(flights, starts_with("dep") | starts_with("arr") )
select(flights, matches("^(dep|arr)") )
select(flights, dep_time : arr_delay & !starts_with("sched"))
``` ```
</p> </p>
</details> </details>
- What does the `one_of()` function do? Why might it be helpful in conjunction with this vector? - What does the `any_of()` function do?
- Why might it be helpful in conjunction with this vector? What is the difference with `all_of()` (hint : add "toto" to vars) ?
```{r select_one_of, eval=T, message=F, cache=T} ```{r select_one_of, eval=T, message=F, cache=T}
vars <- c("year", "month", "day", "dep_delay", "arr_delay") vars <- c("year", "month", "day", "dep_delay", "arr_delay")
...@@ -408,12 +440,40 @@ vars <- c("year", "month", "day", "dep_delay", "arr_delay") ...@@ -408,12 +440,40 @@ vars <- c("year", "month", "day", "dep_delay", "arr_delay")
<p> <p>
```{r challenge_select_b, eval=FALSE} ```{r challenge_select_b, eval=FALSE}
select(flights, one_of(vars)) select(flights, any_of(vars))
select(flights, all_of(vars))
```
From the help message (`?all_of()`) :
- all_of() is for strict selection. If any of the variables in the character vector is missing, an error is thrown.
- any_of() doesn't check for missing variables. It is especially useful with negative selections, when you would like to make sure a variable is removed.
```{r challenge_select_b2, eval=FALSE}
vars <- c(vars, "toto")
select(flights, any_of(vars))
select(flights, all_of(vars))
```
</p>
</details>
- Select all columns wich contain character values ? numeric values ?
</p>
</div>
<details><summary>Solution</summary>
<p>
```{r challenge_select_e1, eval=FALSE}
select(flights, where(is.character))
select(flights, where(is.numeric))
``` ```
</p> </p>
</details> </details>
- Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default? - Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
```{r select_contains, eval=F, message=F, cache=T} ```{r select_contains, eval=F, message=F, cache=T}
select(flights, contains("TIME")) select(flights, contains("TIME"))
``` ```
......