diff --git a/session_6/img/overview_joins.png b/session_6/img/overview_joins.png
new file mode 100644
index 0000000000000000000000000000000000000000..c10e7cc0773fbf24d0c7fdd98a421feac170fd8b
Binary files /dev/null and b/session_6/img/overview_joins.png differ
diff --git a/session_6/img/overview_set.png b/session_6/img/overview_set.png
new file mode 100644
index 0000000000000000000000000000000000000000..d0f8132cd6459852b27df865f23ef101de02439b
Binary files /dev/null and b/session_6/img/overview_set.png differ
diff --git a/session_6/img/pivot_longer.png b/session_6/img/pivot_longer.png
new file mode 100644
index 0000000000000000000000000000000000000000..79fa32f7655e2d016ac6500802a001482a75fb25
Binary files /dev/null and b/session_6/img/pivot_longer.png differ
diff --git a/session_6/img/pivot_wider.png b/session_6/img/pivot_wider.png
new file mode 100644
index 0000000000000000000000000000000000000000..518b82ce92fdb3ed7c7fadf0d5e474e9be48e9e5
Binary files /dev/null and b/session_6/img/pivot_wider.png differ
diff --git a/session_6/session_6.Rmd b/session_6/session_6.Rmd
index 52bcaad2278e57a6f9634005796379bfbf458714..e6dd9ccbad5ee42746f33dfe2f4143193dd68fc3 100644
--- a/session_6/session_6.Rmd
+++ b/session_6/session_6.Rmd
@@ -54,13 +54,18 @@ library(tidyverse)
 </p>
 </details>
 
-For this practical we are going to use the `table` dataset which demonstrate multiple ways to layout the same tabular data.
+For this practical we are going to use the `table` set of datasets, which demonstrate multiple ways to lay out the same tabular data.
 
 <div class="pencadre">
-Use the help to know more about this dataset
+Use the help to learn more about the `table1` dataset
 </div>
 
 <details><summary>Solution</summary>
+
+```{r}
+?table1
+```
+
 <p>
 `table1`, `table2`, `table3`, `table4a`, `table4b`, and `table5` all display the number of TB (Tuberculosis) cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000.
 The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout.
@@ -72,6 +77,41 @@ The data is a subset of the data contained in the World Health Organization Glob
 
 ## pivot longer
 
+```{r, echo=FALSE, out.width='100%'}
+knitr::include_graphics('img/pivot_longer.png')
+```
+
+```{r, eval = F}
+wide_example <- tibble(X1 = c("A","B"),
+                       X2 = c(1,2),
+                       X3 = c(0.1,0.2),
+                       X4 = c(10,20))
+```
+
+If you have a wide dataset, such as `wide_example`, that you want to make longer, you will use the `pivot_longer()` function.
+
+You have to specify the names of the columns you want to pivot into longer format (X2,X3,X4):
+
+```{r, eval = F}
+wide_example %>%
+  pivot_longer(c(X2,X3,X4))
+```
+
+... or the reverse selection (-X1):
+
+```{r, eval = F}
+wide_example %>% pivot_longer(-X1)
+```
+
+You can specify the names of the two output columns that will hold the former column names and their values (by default, they are `name` and `value`):
+
+```{r, eval = F}
+long_example <- wide_example %>%
+  pivot_longer(-X1, names_to = "V1", values_to = "V2")
+```
+
+### Exercise
+
 <div class="pencadre">
 Visualize the `table4a` dataset (you can use the `View()` function).
 
@@ -109,6 +149,22 @@ table4a %>%
 
 ## pivot wider
 
+```{r, echo=FALSE, out.width='100%'}
+knitr::include_graphics('img/pivot_wider.png')
+```
+
+If you have a long dataset that you want to make wider, you will use the `pivot_wider()` function.
+
+You have to specify which column contains the names of the output columns (`names_from`), and which column contains the cell values (`values_from`).
+
+```{r, eval = F}
+long_example %>% pivot_wider(names_from = V1,
+                             values_from = V2)
+```
+
+
+### Exercise
+
 <div class="pencadre">
 Visualize the `table2` dataset Is the data **tidy** ? How would you transform this dataset to make it **tidy** ?
 (you can now make also make a guess from the name of the subsection)
@@ -132,7 +188,9 @@ table2 %>%
 
 ## Relational data
 
-Sometime the information can be split between different table
+To avoid having one huge table and to save space, information is often split between different tables.
+
+In our `flights` dataset, information about the `carrier` or the `airports` (origin and dest) is saved in separate tables (`airlines`, `airports`).
 
 ```{r airlines, eval=T, echo = T}
 library(nycflights13)
@@ -144,27 +202,40 @@ flights2 <- flights %>%
   select(year:day, hour, origin, dest, tailnum, carrier)
 ```
 
-## Relational data
+## Relational schema
+
+The relationships between tables can be seen in a relational graph. The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation.
 
 ```{r airlines_dag, echo=FALSE, out.width='100%'}
 knitr::include_graphics('img/relational-nycflights.png')
 ```
 
-## joints
+## Joins
+
+If you have to combine data from 2 tables into a new table, you will use joins.
+
+There are several types of joins, depending on what you want to get.
 
 ```{r joints, echo=FALSE, out.width='100%'}
 knitr::include_graphics('img/join-venn.png')
 ```
 
-## `inner_joint()`
+Small concrete examples:
+
+```{r , echo=FALSE, out.width='100%'}
+knitr::include_graphics('img/overview_joins.png')
+```
+
+### `inner_join()`
 
-Matches pairs of observations whenever their keys are equal
+keeps observations in `x` AND `y`
 
 ```{r inner_joint, eval=T}
 flights2 %>%
   inner_join(airlines)
 ```
-## `left_joint()`
+
+### `left_join()`
 
 keeps all observations in `x`
 
@@ -173,7 +244,7 @@ flights2 %>%
   left_join(airlines)
 ```
 
-## `right_joint()`
+### `right_join()`
 
 keeps all observations in `y`
 
@@ -182,7 +253,7 @@ flights2 %>%
   right_join(airlines)
 ```
 
-## `full_joint()`
+### `full_join()`
 
 keeps all observations in `x` and `y`
 
@@ -195,29 +266,41 @@ flights2 %>%
 
 The default, `by = NULL`, uses all variables that appear in both tables, the so called natural join.
 
-```{r left_join_weather, eval=T}
+```{r , eval=T}
 flights2 %>%
   left_join(weather)
 ```
 
-## Defining the key columns
-
-The default, `by = NULL`, uses all variables that appear in both tables, the so called natural join.
+If the two tables contain columns with the same names but corresponding to different things (such as `year` in `flights2` and `planes`), you have to manually define the key or the keys.
 
-```{r left_join_tailnum, eval=T, echo = T}
+```{r , eval=T, echo = T}
 flights2 %>%
   left_join(planes, by = "tailnum")
 ```
 
-## Defining the key columns
-
-A named character vector: `by = c("a" = "b")`. This will match variable `a` in table `x` to variable `b` in table `y`.
+If you want to join on columns that have different names in the two tables, you must specify the correspondence with a named character vector: `by = c("a" = "b")`. This will match variable `a` in table `x` to variable `b` in table `y`.
 
-```{r left_join_airport, eval=T, echo = T}
+```{r , eval=T, echo = T}
 flights2 %>%
   left_join(airports, c("dest" = "faa"))
 ```
 
+If 2 columns have identical names in the input tables but are not used in the join, they are automatically renamed with the suffixes `.x` and `.y` because all column names must be different in the output table.
+
+```{r , eval=T, echo = T}
+flights2 %>%
+  left_join(airports, c("dest" = "faa")) %>%
+  left_join(airports, c("origin" = "faa"))
+```
+
+You can change the suffixes using the `suffix` argument:
+
+```{r , eval=T, echo = T}
+flights2 %>%
+  left_join(airports, by = c("dest" = "faa")) %>%
+  left_join(airports, by = c("origin" = "faa"), suffix = c(".dest",".origin"))
+```
+
 ## Filtering joins
 
 Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
@@ -225,9 +308,6 @@ Filtering joins match observations in the same way as mutating joins, but affect
 - `semi_join(x, y)` keeps all observations in `x` that have a match in `y`.
 - `anti_join(x, y)` drops all observations in `x` that have a match in `y`.
 
-
-## Filtering joins
-
 ```{r top_dest, eval=T, echo = T}
 top_dest <- flights %>%
   count(dest, sort = TRUE) %>%
@@ -244,4 +324,8 @@ These expect the x and y inputs to have the same variables, and treat the observ
 - `union(x, y)`: return unique observations in `x` and `y`.
 - `setdiff(x, y)`: return observations in `x`, but not in `y`.
 
+```{r , echo=FALSE, out.width='100%'}
+knitr::include_graphics('img/overview_set.png')
+```
+
 ## See you in [R.7: String & RegExp](http://perso.ens-lyon.fr/laurent.modolo/R/session_7/)