Compare revisions

06d9eb16
--- a/session_7/session_7.Rmd
+++ b/session_7/session_7.Rmd
--- a/session_8/session_8.Rmd
+++ b/session_8/session_8.Rmd
+---
+title: "R.8: Factors"
+author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
+date: "2022"
+---
+
+```{r include=FALSE}
+library(fontawesome)
+
+if("conflicted" %in% .packages())
+    conflicted::conflicts_prefer(dplyr::filter)
+```
+
+```{r setup, include=FALSE}
+rm(list=ls())
+knitr::opts_chunk$set(echo = TRUE)
+knitr::opts_chunk$set(comment = NA)
+```
+
+## Introduction
+
+In this session, you will learn more about the factor type in R.
+Factors can be very useful, but you have to be mindful of the implicit conversions from simple vectors to factors!
+They are the source of a lot of pain for R programmers.
+
+<div class="pencadre">
+As usual, we will need the `tidyverse` library.
+</div>
+
+<details><summary>Solution</summary>
+<p>
+```{r load_data, eval=T, message=F}
+library(tidyverse)
+```
+</p>
+</details>
+
+## Creating factors
+
+Imagine that you have a variable that records month:
+
+```{r declare_month, eval=T, cache=T}
+x1 <- c("Dec", "Apr", "Jan", "Mar")
+```
+
+Using a string to record this variable has two problems:
+
+1. There are only twelve possible months, and there’s nothing saving you from typos:
+
+```{r declare_month2, eval=T, cache=T}
+x2 <- c("Dec", "Apr", "Jam", "Mar")
+```
+
+2. It doesn’t sort in a useful way:
+
+```{r sort_month, eval=T, cache=T}
+sort(x1)
+```
+
+You can fix both of these problems with a factor.
+
+```{r sort_month_factor, eval=T, cache=T}
+month_levels <- c(
+  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
+  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
+)
+y1 <- factor(x1, levels = month_levels)
+y1
+sort(y1)
+```
+
+Any value not in the set of levels is converted to `NA`. `factor()` does this silently, whereas `parse_factor()` from `readr` also raises a warning:
+
+```{r sort_month_factor2, eval=T, cache=T}
+y2 <- parse_factor(x2, levels = month_levels)
+y2
+```
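+
+For comparison, here is a minimal sketch of the same conversion with base `factor()`. It uses the built-in constant `month.abb` in place of `month_levels`, and the name `y2_silent` is only illustrative. The typo should again become `NA`, but without any warning, which makes the mistake easier to miss:
+
+```{r sort_month_factor2_silent, eval=T, cache=T}
+# month.abb is a base R constant holding "Jan", "Feb", ..., "Dec"
+x2 <- c("Dec", "Apr", "Jam", "Mar")
+# "Jam" is not a valid level, so it becomes NA without any message
+y2_silent <- factor(x2, levels = month.abb)
+y2_silent
+# counting the missing values is one way to catch such typos
+sum(is.na(y2_silent))
+```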
+
+Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data.
+
+```{r inorder_month_factor, eval=T, cache=T}
+f2 <- x1 %>% factor() %>% fct_inorder()
+f2
+levels(f2)
+```
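+
+A similar result can be obtained when the factor is created, by passing the unique values as levels. This is only a sketch of that alternative, and `f1` is an illustrative name:
+
+```{r inorder_month_factor_unique, eval=T, cache=T}
+x1 <- c("Dec", "Apr", "Jan", "Mar")
+# unique() keeps the values in order of first appearance
+f1 <- factor(x1, levels = unique(x1))
+levels(f1)
+```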
+
+## General Social Survey
+
+```{r race_count, eval=T, cache=T}
+gss_cat %>%
+  count(race)
+```
+
+By default, `ggplot2` will drop levels that don’t have any values. You can force them to display with:
+
+```{r race_plot, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+ggplot(gss_cat, aes(x = race)) +
+  geom_bar() +
+  scale_x_discrete(drop = FALSE)
+```
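+
+To check which levels are empty before plotting, one option is to list the levels and keep the zero counts in the table. The sketch below assumes the `gss_cat` data from `forcats` and the `.drop` argument of `dplyr::count()`; the name `race_counts` is only illustrative:
+
+```{r race_count_all, eval=T, cache=T, message=F}
+library(tidyverse)
+# all the levels defined for the factor, including unused ones
+levels(gss_cat$race)
+# .drop = FALSE keeps the levels with zero observations in the count
+race_counts <- gss_cat %>%
+  count(race, .drop = FALSE)
+race_counts
+```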
+
+## Modifying factor order
+
+It’s often useful to change the order of the factor levels in a visualisation.
+
+```{r tv_hour, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+relig_summary <- gss_cat %>%
+  group_by(relig) %>%
+  summarise(
+    age = mean(age, na.rm = TRUE),
+    tvhours = mean(tvhours, na.rm = TRUE),
+    n = n()
+  )
+ggplot(relig_summary, aes(x = tvhours, y = relig)) + geom_point()
+```
+
+It is difficult to interpret this plot because there’s no overall pattern. We can improve it by reordering the levels of `relig` using `fct_reorder()`. `fct_reorder()` takes three arguments:
+
+- `.f`, the factor whose levels you want to modify.
+- `.x`, a numeric vector that you want to use to reorder the levels.
+- Optionally, `.fun`, a function that’s used if there are multiple values of `.x` for each value of `.f`. The default value is `median`.
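+
+The effect of the summary function is easier to see on a toy example. The sketch below uses made-up vectors `f` and `x` (not part of the survey data); note that in `forcats` the argument is spelled `.fun`:
+
+```{r fct_reorder_fun, eval=T, cache=T}
+library(forcats)
+# toy data: two values of x for each level of f
+f <- factor(c("a", "a", "b", "b", "c", "c"))
+x <- c(5, 9, 1, 3, 2, 20)
+# default summary (median): a = 7, b = 2, c = 11
+levels(fct_reorder(f, x))
+# with the minimum instead: a = 5, b = 1, c = 2
+levels(fct_reorder(f, x, .fun = min))
+```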
+
+```{r tv_hour_order, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
+  geom_point()
+```
+
+As you start making more complicated transformations, I’d recommend moving them out of `aes()` and into a separate `mutate()` step. For example, you could rewrite the plot above as:
+
+```{r tv_hour_order_mutate, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+relig_summary %>%
+  mutate(relig = fct_reorder(relig, tvhours)) %>%
+  ggplot(aes(x = tvhours, y = relig)) +
+    geom_point()
+```
+
+## `fct_reorder2()`
+
+Another type of reordering is useful when you are colouring the lines on a plot. `fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values. This makes the plot easier to read because the line colours line up with the legend.
+
+```{r fct_reorder2, eval=T, plot=T}
+by_age <- gss_cat %>%
+  filter(!is.na(age)) %>%
+  count(age, marital) %>%
+  group_by(age) %>%
+  mutate(prop = n / sum(n))
+```
+
+```{r fct_reorder2a, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+ggplot(by_age, aes(x = age, y = prop, colour = marital)) +
+  geom_line(na.rm = TRUE)
+```
+
+```{r fct_reorder2b, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+ggplot(by_age, aes(x = age, y = prop, colour = fct_reorder2(marital, age, prop))) +
+  geom_line() +
+  labs(colour = "marital")
+```
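+
+To check what `fct_reorder2()` does to the levels themselves, here is a minimal sketch on made-up vectors `g`, `x` and `y` (illustrative names, not part of the survey data). With the default settings, the levels should be sorted by decreasing `y` at the largest `x`:
+
+```{r fct_reorder2_toy, eval=T, cache=T}
+library(forcats)
+# toy data: three groups observed at x = 1 and x = 2
+g <- factor(c("a", "a", "b", "b", "c", "c"))
+x <- c(1, 2, 1, 2, 1, 2)
+y <- c(1, 3, 2, 9, 4, 5)
+# y at the largest x: a = 3, b = 9, c = 5
+g_ordered <- fct_reorder2(g, x, y)
+levels(g_ordered)
+```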
+
+## Materials
+
+There is a lot of material online for R, and more particularly on the `tidyverse` and `RStudio`.
+
+You can find cheat sheets for all the packages of the `tidyverse` on this page:
+[https://www.rstudio.com/resources/cheatsheets/](https://www.rstudio.com/resources/cheatsheets/)
+
+The `RStudio` websites are also a good place to learn more about R and the meta-package maintained by the `RStudio` community:
+
+- [https://www.rstudio.com/resources/webinars/](https://www.rstudio.com/resources/webinars/)
+- [https://www.rstudio.com/products/rpackages/](https://www.rstudio.com/products/rpackages/)
+
+For example [rmarkdown](https://rmarkdown.rstudio.com/) is a great way to turn your analyses into high quality documents, reports, presentations and dashboards:
+
+ - A comprehensive guide: [https://bookdown.org/yihui/rmarkdown/](https://bookdown.org/yihui/rmarkdown/)
+ - The cheatsheet [https://raw.githubusercontent.com/rstudio/cheatsheets/main/rmarkdown-2.0.pdf](https://raw.githubusercontent.com/rstudio/cheatsheets/main/rmarkdown-2.0.pdf)
+
+In addition, most packages provide **vignette**s on how to perform an analysis from scratch. On the [bioconductor.org](http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html) website (specialised in R packages for biologists), you will find direct links to each package's vignettes.
+
+Finally, don't forget to search the web for your problems or errors in R. Websites like [stackoverflow](https://stackoverflow.com/) contain high-quality and well-curated answers.
\ No newline at end of file
--- a/session_n+1/tp.R
+++ b/session_n+1/tp.R
--- a/session_n+1/tp.md
+++ b/session_n+1/tp.md
--- a/session_n/img/rmarkdownflow.png
+++ b/session_n/img/rmarkdownflow.png
--- a/session_n/slides.Rmd
+++ b/session_n/slides.Rmd
--- a/session_n/slides_example.Rmd
+++ b/session_n/slides_example.Rmd
--- a/session_n/tp.R
+++ b/session_n/tp.R
--- a/session_n/tp.md
+++ b/session_n/tp.md
--- a/src/Dockerfile
+++ b/src/Dockerfile
--- a/src/create_docker_to_deploy_the_course.sh
+++ b/src/create_docker_to_deploy_the_course.sh
--- a/src/create_users_from_mail.sh
+++ b/src/create_users_from_mail.sh
+#! /usr/bin/bash
+
+# USAGE
+# wget -qO - http://perso.ens-lyon.fr/laurent.modolo/R/create_users_from_mail.sh | tr -d '\r' | bash -s usertest@mail.fr usertest2@mail.f
+
+# iterate over the e-mail addresses given as arguments
+for USERMAIL in "$@"
+do
+  # the user name is the part of the address before the "@"
+  USERNAME=$(echo "${USERMAIL}" | sed -E 's/(.*)@.*/\1/')
+  adduser "${USERNAME}" --gecos 'First Last,RoomNumber,WorkPhone,HomePhone' --disabled-password --force-badname > /dev/null
+  # random password from 10 bytes of base64-encoded data
+  PASSWD=$(openssl rand -base64 10)
+  echo "${USERNAME}:${PASSWD}" | chpasswd > /dev/null
+  echo "======================================================================="
+  echo "${USERMAIL}:"
+  echo "${USERNAME}"
+  echo "${PASSWD}"
+done
--- a/src/create_users_from_user_list_csv.sh
+++ b/src/create_users_from_user_list_csv.sh
--- a/src/create_users_from_user_pwd_list.sh
+++ b/src/create_users_from_user_pwd_list.sh
--- a/web/1_a
+++ b/web/1_a
--- a/web/1_b
+++ b/web/1_b
--- a/web/1_c
+++ b/web/1_c
--- a/web/1_d
+++ b/web/1_d
--- a/web/1_e
+++ b/web/1_e
--- a/web/2_a
+++ b/web/2_a