`summarise()` collapses a data frame to a single row:
```{r load_data, eval=T, message=FALSE, cache=T}
library(nycflights13)
library(tidyverse)
flights %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))
```
## The power of `summarise()` with `group_by()`
Pairing `summarise()` with `group_by()` changes the unit of analysis from the complete dataset to individual groups. When you then use the `dplyr` verbs on a grouped data frame, they are automatically applied “by group”.
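As a minimal sketch of grouping before summarising, the chunk below computes the mean departure delay per day rather than for the whole dataset. The name `daily_delay` is only illustrative, and the chunk loads `dplyr` directly (it is part of the tidyverse) so that it can run on its own.

```{r grouped_summary, eval=T, message=FALSE, cache=T}
library(nycflights13)
library(dplyr)

# Group by calendar day, then summarise: one row per group
# instead of one row for the whole data frame.
daily_delay <- flights %>%
  group_by(year, month, day) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))

daily_delay
```

The result has one row for each combination of `year`, `month` and `day` present in `flights`.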
Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value.
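A small base-R illustration of this rule, using a made-up vector `x` rather than the flights data: a single `NA` makes the mean `NA`, unless missing values are dropped with `na.rm = TRUE`.

```{r na_propagation, eval=T, message=FALSE, cache=T}
# Hypothetical input with one missing value
x <- c(1, 2, NA)

with_na    <- mean(x)                # NA: the missing value propagates
without_na <- mean(x, na.rm = TRUE)  # 1.5: NA is removed before averaging

with_na
without_na
```

This is why the earlier examples pass `na.rm = TRUE` to `mean()`: `dep_delay` is missing for some flights.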
## Counts
Whenever you do any aggregation, it’s always a good idea to include a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
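One way this might look in practice is sketched below: alongside the mean arrival delay per destination, the summary keeps the group size from `n()` and, as an extra check that is an addition here, the number of non-missing delays via `sum(!is.na(x))`. The names `by_dest`, `n` and `n_non_missing` are illustrative.

```{r count_with_summary, eval=T, message=FALSE, cache=T}
library(nycflights13)
library(dplyr)

by_dest <- flights %>%
  group_by(dest) %>%
  summarise(
    n             = n(),                    # rows in each group
    n_non_missing = sum(!is.na(arr_delay)), # rows that contribute to the mean
    delay         = mean(arr_delay, na.rm = TRUE)
  ) %>%
  arrange(n)                                # smallest groups first

by_dest
```

Sorting by `n` puts the destinations with the fewest flights at the top, where an average delay is least trustworthy.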
If you need to remove grouping and return to operations on ungrouped data, use `ungroup()`.
```{r ungroup, eval=T, message=FALSE, cache=T}
flights %>%
  group_by(year, month, day) %>%
  ungroup() %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))
```
## Grouping challenges
- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (`strftime(x, '%A')` gives you the name of the weekday from a POSIXct date)
- Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)
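The `strftime()` hint in the first challenge can be tried on its own first. This is a small sketch with a made-up timestamp `x`, not part of the flights data. The time zone is passed explicitly because `strftime()` otherwise formats in the current time zone, and the weekday name it returns depends on the session locale.

```{r weekday_name, eval=T, message=FALSE, cache=T}
# Hypothetical POSIXct timestamp, fixed to UTC
x <- as.POSIXct("2013-01-01 12:00:00", tz = "UTC")

# "%A" asks for the full weekday name
day_name <- strftime(x, "%A", tz = "UTC")
day_name
```

The same call can be used inside `mutate()` to add a weekday column before grouping.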