Verified Commit 07632cf0 authored by Laurent Modolo's avatar Laurent Modolo

Merge remote-tracking branch 'origin/master'

parents 878f2d8f 7153452e
session_6/img/overview_joins.png

50.5 KiB

session_6/img/overview_set.png

11.5 KiB

session_6/img/pivot_longer.png

21.1 KiB

session_6/img/pivot_wider.png

21.7 KiB

......@@ -54,13 +54,18 @@ library(tidyverse)
</p>
</details>
For this practical we are going to use the `table` dataset, which demonstrates multiple ways to lay out the same tabular data.
For this practical we are going to use the `table` set of datasets, which demonstrates multiple ways to lay out the same tabular data.
<div class="pencadre">
Use the help to know more about this dataset
Use the help to learn more about the `table1` dataset
</div>
<details><summary>Solution</summary>
```{r}
?table1
```
<p>
`table1`, `table2`, `table3`, `table4a`, `table4b`, and `table5` all display the number of TB (Tuberculosis) cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout.
......@@ -72,6 +77,41 @@ The data is a subset of the data contained in the World Health Organization Glob
## pivot longer
```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/pivot_longer.png')
```
```{r, eval = F}
wide_example <- tibble(X1 = c("A","B"),
X2 = c(1,2),
X3 = c(0.1,0.2),
X4 = c(10,20))
```
If you have a wide dataset, such as `wide_example`, that you want to make longer, you will use the `pivot_longer()` function.
You have to specify the names of the columns you want to pivot into longer format (X2,X3,X4):
```{r, eval = F}
wide_example %>%
pivot_longer(c(X2,X3,X4))
```
... or the reverse selection (-X1):
```{r, eval = F}
wide_example %>% pivot_longer(-X1)
```
You can specify the names of the columns that will hold the tidy data (by default, they are `name` and `value`):
```{r, eval = F}
long_example <- wide_example %>%
pivot_longer(-X1, names_to = "V1", values_to = "V2")
```
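As a minimal sketch of the point above, the following compares the default output column names of `pivot_longer()` with the custom ones. It rebuilds `wide_example` so that it runs on its own; `default_long` is a name introduced here for illustration only.

```r
library(tidyverse)

# Same toy table as in the practical
wide_example <- tibble(X1 = c("A", "B"),
                       X2 = c(1, 2),
                       X3 = c(0.1, 0.2),
                       X4 = c(10, 20))

# Without names_to / values_to, the new columns are called `name` and `value`
default_long <- wide_example %>%
  pivot_longer(-X1)
names(default_long)

# With names_to / values_to, we choose the column names ourselves
long_example <- wide_example %>%
  pivot_longer(-X1, names_to = "V1", values_to = "V2")
names(long_example)

# Each of the 2 input rows yields 3 output rows (one per pivoted column)
nrow(long_example)
```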
### Exercise
<div class="pencadre">
Visualize the `table4a` dataset (you can use the `View()` function).
......@@ -109,6 +149,22 @@ table4a %>%
## pivot wider
```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/pivot_wider.png')
```
If you have a long dataset that you want to make wider, you will use the `pivot_wider()` function.
You have to specify which column contains the names of the output columns (`names_from`), and which column contains the cell values (`values_from`).
```{r, eval = F}
long_example %>% pivot_wider(names_from = V1,
values_from = V2)
```
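To illustrate that `pivot_wider()` undoes `pivot_longer()`, here is a small self-contained sketch. The long table is typed in by hand so the block does not depend on earlier chunks; `wide_again` is a name introduced here for illustration only.

```r
library(tidyverse)

# Long version of the toy table: one row per (X1, V1) pair
long_example <- tibble(X1 = rep(c("A", "B"), each = 3),
                       V1 = rep(c("X2", "X3", "X4"), times = 2),
                       V2 = c(1, 0.1, 10, 2, 0.2, 20))

# The values of V1 become column names, the values of V2 fill the cells
wide_again <- long_example %>%
  pivot_wider(names_from = V1,
              values_from = V2)
wide_again
```

The result should have one row per value of `X1` and one column per distinct value of `V1`.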
### Exercise
<div class="pencadre">
Visualize the `table2` dataset.
Is the data **tidy**? How would you transform this dataset to make it **tidy**? (You can now also make a guess from the name of the subsection.)
......@@ -132,7 +188,9 @@ table2 %>%
## Relational data
Sometimes the information can be split between different tables.
To avoid having a huge table and to save space, information is often split between different tables.
In our `flights` dataset, information about the `carrier` or the `airports` (origin and dest) is saved in separate tables (`airlines`, `airports`).
```{r airlines, eval=T, echo = T}
library(nycflights13)
......@@ -144,27 +202,40 @@ flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
```
## Relational data
## Relational schema
The relationships between tables can be seen in a relational graph. The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation.
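One way to check that a variable really is a key is to count how often each value occurs and look for counts above one. The sketch below uses two hypothetical toy tables (`planes_toy` and `flights_toy`, not part of `nycflights13`) so that it runs on its own.

```r
library(tidyverse)

# Hypothetical toy tables, for illustration only
planes_toy <- tibble(tailnum = c("N1", "N2", "N3"),
                     seats = c(100, 150, 200))
flights_toy <- tibble(tailnum = c("N1", "N1", "N2"),
                      dest = c("LAX", "ORD", "LAX"))

# tailnum is a key of planes_toy: no value appears more than once
dup_planes <- planes_toy %>%
  count(tailnum) %>%
  filter(n > 1)
nrow(dup_planes)

# tailnum is not a key of flights_toy: the same plane flies several times
dup_flights <- flights_toy %>%
  count(tailnum) %>%
  filter(n > 1)
nrow(dup_flights)
```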
```{r airlines_dag, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/relational-nycflights.png')
```
## joins
## Joins
If you have to combine data from 2 tables into a new table, you will use `joins`.
There are several types of joins depending on what you want to get.
```{r joints, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/join-venn.png')
```
## `inner_join()`
Small concrete examples:
```{r , echo=FALSE, out.width='100%'}
knitr::include_graphics('img/overview_joins.png')
```
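The four mutating joins can also be compared on two tiny tables. This is a sketch with made-up tables `x` and `y` (the column names `key`, `val_x`, and `val_y` are chosen here for illustration) that share the keys 1 and 2 but not 3 and 4.

```r
library(tidyverse)

# Toy tables: keys 1 and 2 are shared, 3 is only in x, 4 is only in y
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

inner <- inner_join(x, y, by = "key")  # only keys present in both tables
left  <- left_join(x, y, by = "key")   # all rows of x, NA in val_y for key 3
right <- right_join(x, y, by = "key")  # all rows of y, NA in val_x for key 4
full  <- full_join(x, y, by = "key")   # all keys from both tables

# Number of rows kept by each join
c(inner = nrow(inner), left = nrow(left), right = nrow(right), full = nrow(full))
```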
### `inner_join()`
Matches pairs of observations whenever their keys are equal
keeps observations in `x` AND `y`
```{r inner_joint, eval=T}
flights2 %>%
inner_join(airlines)
```
## `left_join()`
### `left_join()`
keeps all observations in `x`
......@@ -173,7 +244,7 @@ flights2 %>%
left_join(airlines)
```
## `right_join()`
### `right_join()`
keeps all observations in `y`
......@@ -182,7 +253,7 @@ flights2 %>%
right_join(airlines)
```
## `full_join()`
### `full_join()`
keeps all observations in `x` and `y`
......@@ -195,29 +266,41 @@ flights2 %>%
The default, `by = NULL`, uses all variables that appear in both tables, the so-called natural join.
```{r left_join_weather, eval=T}
```{r , eval=T}
flights2 %>%
left_join(weather)
```
## Defining the key columns
The default, `by = NULL`, uses all variables that appear in both tables, the so-called natural join.
If the two tables contain columns with the same names but corresponding to different things (such as `year` in `flights2` and `planes`) you have to manually define the key or the keys.
```{r left_join_tailnum, eval=T, echo = T}
```{r , eval=T, echo = T}
flights2 %>%
left_join(planes, by = "tailnum")
```
## Defining the key columns
A named character vector: `by = c("a" = "b")`. This will match variable `a` in table `x` to variable `b` in table `y`.
If you want to join by data that are in two columns with different names, you must specify the correspondence with a named character vector: `by = c("a" = "b")`. This will match variable `a` in table `x` to variable `b` in table `y`.
```{r left_join_airport, eval=T, echo = T}
```{r , eval=T, echo = T}
flights2 %>%
left_join(airports, c("dest" = "faa"))
```
If 2 columns have identical names in the input tables but are not used in the join, they are automatically renamed with the suffixes `.x` and `.y`, because all column names must be different in the output table.
```{r , eval=T, echo = T}
flights2 %>%
left_join(airports, c("dest" = "faa")) %>%
left_join(airports, c("origin" = "faa"))
```
You can change the suffixes using the `suffix` option:
```{r , eval=T, echo = T}
flights2 %>%
left_join(airports, by = c("dest" = "faa")) %>%
left_join(airports, by = c("origin" = "faa"), suffix = c(".dest",".origin"))
```
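The same pattern can be followed step by step on toy data. In this sketch, `flights_toy` and `airports_toy` are hypothetical stand-ins for `flights2` and `airports`, with only the columns needed to show the renaming.

```r
library(tidyverse)

# Hypothetical miniature versions of flights2 and airports
flights_toy <- tibble(origin = c("JFK", "LGA"),
                      dest = c("LGA", "JFK"))
airports_toy <- tibble(faa = c("JFK", "LGA"),
                       name = c("John F Kennedy Intl", "La Guardia"))

joined <- flights_toy %>%
  # first join: `dest` in x is matched to `faa` in y, adds a `name` column
  left_join(airports_toy, by = c("dest" = "faa")) %>%
  # second join: `name` now exists on both sides, so the suffixes are applied
  left_join(airports_toy, by = c("origin" = "faa"),
            suffix = c(".dest", ".origin"))

names(joined)
```

The first element of `suffix` is applied to the column coming from `x` (here the destination name added by the first join), the second to the column coming from `y`.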
## Filtering joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
......@@ -225,9 +308,6 @@ Filtering joins match observations in the same way as mutating joins, but affect
- `semi_join(x, y)` keeps all observations in `x` that have a match in `y`.
- `anti_join(x, y)` drops all observations in `x` that have a match in `y`.
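The two filtering joins can be sketched on small made-up tables (`x` and `y`, with illustrative column names). Note that no column from `y` is added to the result, only rows of `x` are kept or dropped.

```r
library(tidyverse)

# Toy tables: keys 1 and 2 have a match in y, key 3 does not
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# Rows of x with a match in y
kept <- semi_join(x, y, by = "key")
kept

# Rows of x without a match in y
dropped <- anti_join(x, y, by = "key")
dropped
```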
## Filtering joins
```{r top_dest, eval=T, echo = T}
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
......@@ -244,4 +324,8 @@ These expect the x and y inputs to have the same variables, and treat the observ
- `union(x, y)`: return unique observations in `x` and `y`.
- `setdiff(x, y)`: return observations in `x`, but not in `y`.
```{r , echo=FALSE, out.width='100%'}
knitr::include_graphics('img/overview_set.png')
```
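The set operations can be tried on two small tables that have the same variables. This is a sketch with made-up tables `df1` and `df2`; it also uses `intersect()`, the third `dplyr` set operation, which returns the observations found in both inputs.

```r
library(tidyverse)

# Two toy tables with the same variables; only the row (1, 1) is shared
df1 <- tibble(x = c(1, 2), y = c(1, 1))
df2 <- tibble(x = c(1, 1), y = c(1, 2))

both    <- intersect(df1, df2)  # rows present in df1 and in df2
all_obs <- union(df1, df2)      # unique rows from df1 and df2
only_1  <- setdiff(df1, df2)    # rows in df1 but not in df2
only_2  <- setdiff(df2, df1)    # rows in df2 but not in df1

all_obs
```

As `only_1` and `only_2` show, `setdiff()` is not symmetric: the order of the arguments matters.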
## See you in [R.7: String & RegExp](http://perso.ens-lyon.fr/laurent.modolo/R/session_7/)