Commit 7153452e authored by Carine Rey

add examples in merging data section

parent 46e6ce39
session_6/img/overview_joins.png

50.5 KiB

session_6/img/overview_set.png

11.5 KiB

......@@ -188,7 +188,9 @@ table2 %>%
## Relational data
Sometimes the information can be split between different tables.
To avoid having a huge table and to save space, information is often split between different tables.
In our `flights` dataset, information about the `carrier` and the `airports` (origin and dest) is saved in separate tables (`airlines`, `airports`).
```{r airlines, eval=T, echo = T}
library(nycflights13)
......@@ -200,27 +202,40 @@ flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
```
## Relational data
## Relational schema
The relationships between tables can be seen in a relational graph. The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation.
```{r airlines_dag, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/relational-nycflights.png')
```
## joins
## Joins
If you have to combine data from 2 tables into a new table, you will use joins.
There are several types of joins, depending on what you want to get.
```{r joints, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/join-venn.png')
```
## `inner_join()`
Small concrete examples:
```{r , echo=FALSE, out.width='100%'}
knitr::include_graphics('img/overview_joins.png')
```
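The same kind of small example can also be sketched in code. The tables `x` and `y` below are hypothetical toy data (they are not part of `nycflights13`), chosen only so that the effect of each join is easy to read off the row counts.

```r
library(dplyr)

# Hypothetical toy tables: keys 1 and 2 are shared, 3 is only in x, 4 only in y
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

# inner_join() keeps only the keys found in both tables (1 and 2)
inner <- inner_join(x, y, by = "key")

# left_join() keeps every row of x; key 3 has no match, so val_y is NA there
left <- left_join(x, y, by = "key")

# right_join() keeps every row of y; key 4 has no match, so val_x is NA there
right <- right_join(x, y, by = "key")

# full_join() keeps every key from both tables (1, 2, 3 and 4)
full <- full_join(x, y, by = "key")
```

Comparing `nrow()` of the four results is a quick way to check which observations each join kept.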
### `inner_join()`
Matches pairs of observations whenever their keys are equal.
Keeps observations in `x` AND `y`.
```{r inner_joint, eval=T}
flights2 %>%
inner_join(airlines)
```
## `left_join()`
### `left_join()`
keeps all observations in `x`
......@@ -229,7 +244,7 @@ flights2 %>%
left_join(airlines)
```
## `right_join()`
### `right_join()`
keeps all observations in `y`
......@@ -238,7 +253,7 @@ flights2 %>%
right_join(airlines)
```
## `full_join()`
### `full_join()`
keeps all observations in `x` and `y`
......@@ -251,29 +266,41 @@ flights2 %>%
The default, `by = NULL`, uses all variables that appear in both tables, the so-called natural join.
```{r left_join_weather, eval=T}
```{r , eval=T}
flights2 %>%
left_join(weather)
```
## Defining the key columns
If the two tables contain columns with the same names but corresponding to different things (such as `year` in `flights2` and `planes`) you have to manually define the key or the keys.
The default, `by = NULL`, uses all variables that appear in both tables, the so-called natural join.
```{r left_join_tailnum, eval=T, echo = T}
```{r , eval=T, echo = T}
flights2 %>%
left_join(planes, by = "tailnum")
```
## Defining the key columns
A named character vector: `by = c("a" = "b")`. This will match variable `a` in table `x` to variable `b` in table `y`.
If you want to join by data that are in two columns with different names, you must specify the correspondence with a named character vector: `by = c("a" = "b")`. This will match variable `a` in table `x` to variable `b` in table `y`.
```{r left_join_airport, eval=T, echo = T}
```{r , eval=T, echo = T}
flights2 %>%
left_join(airports, c("dest" = "faa"))
```
If 2 columns have identical names in the input tables but are not used in the join, they are automatically renamed with the suffixes `.x` and `.y`, because all column names must be different in the output table.
```{r , eval=T, echo = T}
flights2 %>%
left_join(airports, c("dest" = "faa")) %>%
left_join(airports, c("origin" = "faa"))
```
You can change the suffixes with the `suffix` argument:
```{r , eval=T, echo = T}
flights2 %>%
left_join(airports, by = c("dest" = "faa")) %>%
left_join(airports, by = c("origin" = "faa"), suffix = c(".dest",".origin"))
```
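The renaming is easier to see on a small table. The sketch below uses hypothetical toy tables `trips` and `places` as stand-ins for `flights2` and `airports`; only the `name` column is duplicated by the second join, so only that column receives a suffix.

```r
library(dplyr)

# Hypothetical toy tables standing in for flights2 and airports
trips <- tibble(origin = c("AAA", "BBB"), dest = c("BBB", "CCC"))
places <- tibble(code = c("AAA", "BBB", "CCC"),
                 name = c("Alpha", "Beta", "Gamma"))

joined <- trips %>%
  # first join: adds a single `name` column describing the destination
  left_join(places, by = c("dest" = "code")) %>%
  # second join: `name` now exists on both sides, so the suffixes are applied
  left_join(places, by = c("origin" = "code"), suffix = c(".dest", ".origin"))

names(joined)
```

The key columns (`origin`, `dest`) are left untouched; only the clashing non-key column is renamed.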
## Filtering joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
......@@ -281,9 +308,6 @@ Filtering joins match observations in the same way as mutating joins, but affect
- `semi_join(x, y)` keeps all observations in `x` that have a match in `y`.
- `anti_join(x, y)` drops all observations in `x` that have a match in `y`.
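As a minimal sketch of the two filtering joins, assuming the hypothetical toy tables `x` and `y` defined below (not part of `nycflights13`):

```r
library(dplyr)

# Hypothetical toy tables: keys 1 and 2 are shared, 3 is only in x
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

# semi_join() keeps the rows of x that have a match in y (keys 1 and 2)
kept <- semi_join(x, y, by = "key")

# anti_join() keeps the rows of x that have no match in y (key 3)
dropped <- anti_join(x, y, by = "key")
```

In both cases the result has exactly the columns of `x`: no variable from `y` is added.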
## Filtering joins
```{r top_dest, eval=T, echo = T}
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
......@@ -300,4 +324,8 @@ These expect the x and y inputs to have the same variables, and treat the observ
- `union(x, y)`: return unique observations in `x` and `y`.
- `setdiff(x, y)`: return observations in `x`, but not in `y`.
```{r , echo=FALSE, out.width='100%'}
knitr::include_graphics('img/overview_set.png')
```
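The set operations can be tried on two small tables with the same variables. This is only a sketch: `x` and `y` are hypothetical toy data, and it relies on the data frame methods that `dplyr` provides for `intersect()`, `union()` and `setdiff()`.

```r
library(dplyr)

# Hypothetical toy tables with identical columns; the row (2, "v") is shared
x <- tibble(a = c(1, 2), b = c("u", "v"))
y <- tibble(a = c(2, 3), b = c("v", "w"))

both <- intersect(x, y)   # observations present in both x and y
all_rows <- union(x, y)   # unique observations from x and y
only_x <- setdiff(x, y)   # observations in x but not in y
```

Each observation is compared as a whole row, so two rows match only if all their values are equal.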
## See you in [R.7: String & RegExp](http://perso.ens-lyon.fr/laurent.modolo/R/session_7/)