diff --git a/Practical_c.Rmd b/Practical_c.Rmd index 95b53ca97f6e1bf10ea927edf52a8ddaa1736d2c..896792d2ebd9ef1867865ba685ac55dd1d071a92 100644 --- a/Practical_c.Rmd +++ b/Practical_c.Rmd @@ -38,6 +38,9 @@ library(pbapply) # nice lapply if (!requireNamespace("performance")) install.packages("performance") library(performance) # to evaluate model +if (!requireNamespace("palmerpenguins")) + install.packages("palmerpenguins") +library(palmerpenguins) # to evaluate model ``` ```{r setup, include=FALSE , echo=FALSE} @@ -60,6 +63,13 @@ klippy::klippy( tooltip_success = 'Copied !') ``` +::: {.hidden} +$$ + \def\H{{\mathcal{H}}} + \def\PP{{\mathbb{P}}} +$$ +::: + <!-- ## template {-} --> <!-- <div class="pencadre"> --> @@ -110,142 +120,14 @@ Throughout this practical, we will explore the following subjects: We will need the following packages ```{r echo=T, message=F} -library(tidyverse) # to manipulate data and make plot -``` - -## Loading the data - -In this data, 64 different yeast strains are considered. - -> Note: all strains were generated from an admixture between two original strains but we will not investigate this point here. - -For each strain, we have genotype data, containing the [SNPs](https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism) along the yeast genome. Each SNP is encoded with a `0` or `1` value corresponding to the number of derivative allele present at the corresponding locus. - -> Note: Here, the yeast were sequenced during their haplotype phase, therefore the possible values for each genotype are only `0` and `1`. - -For each strain, we also have measures of different morphological traits for hundred of cells during 3 different stages of the cell cycle (called `"A"`, `"A2B"` and `"C"`). - -Since the mutation is rate is very low in yeast, we can consider that all cells from the same strain share the same genotype across their entire genome. - -We will use data stored in the `practical_c.RData` file. - -```{r, include=F} -# load data -data_file <- "practical_c.Rdata" -data_list <- load(data_file) -``` - -```{r, eval=F} -data_list <- load(url("https://lbmc.gitbiopages.ens-lyon.fr/hub/formations/ens_m1_ml/practical_c.Rdata"), verbose = T) -``` - -It contains the following objects: -```{r} -print(data_list) # list the loaded objects -``` - -- `gt_data`: a data frame with 2820 SNPs in rows and the corresponding genotype for the 64 strains in columns, along with other informations regarding the SNPs such as their position on the genome, their identification, etc. More details with: - -```{r, eval=F} -str(gt_data) -``` - -- `morpho_data`: a data frame with 45215 cells in rows and the corresponding morphological trait measures in columns, along with the corresponding cell identification, strain identification and cell cycle, etc. More details with: - -```{r, eval=F} -str(morpho_data) -``` - -- `yeast_data`: a data frame containing the morphological data (see `morpho_data`) and the corresponding genotype of all SNPs (in additional columns) for each cell (in rows). More details with: - -```{r, eval=F} -str(yeast_data) -``` - -- `yeast_av_data`: a data frame containing the average (over all cells) of the morphological measures for each strain and each cell cycle phase with the corresponding genotype (same columns as `yeast_data`). More details with: - -```{r, eval=F} -str(yeast_av_data) -``` - -- `strain_id`: a data frame containing the identification of the different strains (including other not present in `morpho_data` and `gt_data`) - -## Data description - -We will first conduct a descriptive analysis of the data. During the course of this practical, we will first consider the morphological measure called `A101` and two SNPs, namely `YAL069W_1` and `YDL200C_427`. - -And we will focus on the morphological data averaged by strain and cell cycle. The idea here is to avoid dealing with the measure heterogeneity across all cells within each strain. - -### Morphological data {-} - -Generally, when the dimension of the data is not too large (otherwise see [below](#dimension-reduction) and afterward), a first exploratory step in data analysis (for quantitative variables) is to consider descriptive statistics, such as position parameters (mean/average, empirical quantiles including the median), dispersion parameters (empirical variance and empirical standard deviation), possibly depending on groups or classes defined by a qualitative variable. - - -<div class="pencadre"> -Which graphical representation of the data does give an insight about the distribution (including the position and dispersion) of a variable ? -</div> - -<details><summary>Solution</summary> -<p> - -For instance, a box-plot (and derivative) - ---- - -</p> -</details> - -<div class="pencadre"> -Construct a boxplot of the value of the variable `A101` depending on the SNPs `YAL069W_1` or `YDL200C_427`. How to account for the cell cycle in the representation? -</div> - -<details><summary>Solution</summary> -<p> - -```{r} -ggplot(yeast_av_data, aes(factor(YAL069W_1), A101)) + geom_boxplot() + - facet_wrap(~cell_cycle, labeller = "label_both") + - theme_bw() -``` - -```{r} -ggplot(yeast_av_data, aes(factor(YDL200C_427), A101)) + geom_boxplot() + - facet_wrap(~cell_cycle, labeller = "label_both") + - theme_bw() +library(tidyverse) # to manipulate data and make plot +library(performance) # to evaluate model ``` --- -</p> -</details> - - -> Note: depending on the context and the data, other representations like empirical histogram or empirical density representation of the quantitative variable of interest depending on the factor of interest can be used. - -<div class="pencadre"> -What can you infer about the possible relationship between the morphological trait quantified by the variable `A101` and the SNPs `YAL069W_1` or `YDL200C_427` from the boxplot? -</div> - -<details><summary>Solution</summary> -<p> - -For all cell cycle phases, the value of the variable `A101` seems to depend on the SNP `YAL069W_1`. Indeed, the value of the variable `A101` is globally higher in presence of the derivative allele. - -On contrary, for cell cycle phases `"A1B"` and `"C"`, the value of the variable `A101` does not seem to depend on the SNP `YAL069W_1`. -</p> -</details> - ---- - ## Parenthesis about statistical hypothesis testing and p-values -::: {.hidden} -$$ - \def\H{{\mathcal{H}}} - \def\PP{{\mathbb{P}}} -$$ -::: - ### Reminder about statistical testing #### Building the test @@ -804,20 +686,425 @@ Decreasing $\alpha$ to reduce the type I error decreases the power of the test. - confirm a detected effect with additional experiments/studies - the more (independent) studies, the lower risk of incorrect conclusion +--- + +## Loading the data + +In this data, 64 different yeast strains are considered. + +> Note: all strains were generated from an admixture between two original strains but we will not investigate this point here. + +For each strain, we have genotype data, containing the [SNPs](https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism) along the yeast genome. Each SNP is encoded with a `0` or `1` value corresponding to the number of derivative allele present at the corresponding locus. + +> Note: Here, the yeast were sequenced during their haplotype phase, therefore the possible values for each genotype are only `0` and `1`. + +For each strain, we also have measures of different morphological traits for hundred of cells during 3 different stages of the cell cycle (called `"A"`, `"A2B"` and `"C"`). + +Since the mutation is rate is very low in yeast, we can consider that all cells from the same strain share the same genotype across their entire genome. + +We will use data stored in the `practical_c.RData` file. + +```{r, include=F} +# load data +data_file <- "practical_c.Rdata" +data_list <- load(data_file) +``` + +```{r, eval=F} +data_list <- load(url("https://lbmc.gitbiopages.ens-lyon.fr/hub/formations/ens_m1_ml/practical_c.Rdata"), verbose = T) +``` + +It contains the following objects: +```{r} +print(data_list) # list the loaded objects +``` + +- `gt_data`: a data frame with 2820 SNPs in rows and the corresponding genotypes for the 64 strains in columns, along with other informations regarding the SNPs such as their position on the genome, their identification, etc. More details with: + +```{r, eval=F} +str(gt_data) +``` + +- `morpho_data`: a data frame with 45215 cells in rows and the corresponding morphological trait measures in columns, along with the corresponding cell identification, strain identification and cell cycle, etc. More details with: + +```{r, eval=F} +str(morpho_data) +``` + +- `yeast_data`: a data frame containing the morphological data (see `morpho_data`) and the corresponding genotypes of all SNPs (in additional columns) for each cell (in rows). More details with: + +```{r, eval=F} +str(yeast_data) +``` + +- `yeast_av_data`: a data frame containing the average (over all cells) of the morphological measures for each strain and each cell cycle phase with the corresponding genotype (same columns as `yeast_data`). More details with: + +```{r, eval=F} +str(yeast_av_data) +``` + +- `strain_id`: a data frame containing the identification of the different strains (including other not present in `morpho_data` and `gt_data`) + +## Data description + +We will first conduct a descriptive analysis of the data. During the course of this practical, we will first consider the morphological measure called `A101` and the genotype associated to two SNPs, namely `YAL069W_1` and `YDL200C_427`. + +Generally, when the dimension of the data is not too large (otherwise see [below](#dimension-reduction) and afterward), a first exploratory step in data analysis (for quantitative variables) is to consider descriptive statistics, such as position parameters (mean/average, empirical quantiles including the median), dispersion parameters (empirical variance and empirical standard deviation), possibly depending on groups or classes defined by a qualitative variables. + + +<div class="pencadre"> +Which graphical representation of the data does give an insight about the distribution (including the position and dispersion) of a variable ? +</div> + +<details><summary>Solution</summary> +<p> + +For instance, a box-plot (and derivative) + +--- + +</p> +</details> + +**IMPORTANT:** first we will focus on the **morphological data averaged by strain and cell cycle**, i.e. `yeast_av_data`. The idea here is to avoid dealing with the measure heterogeneity across all cells within each strain (c.f. [later](#full-data-analysis)). + +<div class="pencadre"> +Construct a boxplot of the value of the variable `A101` depending on the genotype associated to the SNPs `YAL069W_1` or `YDL200C_427` (separately). What can you say about this representation? +</div> + +<details><summary>Solution</summary> +<p> + +```{r} +ggplot(yeast_av_data, aes(factor(YAL069W_1), A101)) + geom_boxplot() + + theme_bw() +``` + +```{r} +ggplot(yeast_av_data, aes(factor(YDL200C_427), A101)) + geom_boxplot() + + theme_bw() +``` + +--- + +</p> +</details> + +<div class="pencadre"> +How to account for the cell cycle in the previous representations? Is it important? +</div> + +<details><summary>Solution</summary> +<p> + +```{r} +ggplot(yeast_av_data, aes(factor(YAL069W_1), A101)) + geom_boxplot() + + facet_wrap(~cell_cycle, labeller = "label_both") + + theme_bw() +``` + +```{r} +ggplot(yeast_av_data, aes(factor(YDL200C_427), A101)) + geom_boxplot() + + facet_wrap(~cell_cycle, labeller = "label_both") + + theme_bw() +``` + +--- + +</p> +</details> + +<div class="pencadre"> +What can you infer about the possible relationship between the morphological trait quantified by the variable `A101` and the genotype associated to the SNPs `YAL069W_1` or `YDL200C_427` from the boxplot? +</div> + +<details><summary>Solution</summary> +<p> + +For all cell cycle phases, the value of the variable `A101` seems to depend on the genotype associated to the SNP `YAL069W_1`. Indeed, the value of the variable `A101` is globally higher in presence of the derivative allele. + +On contrary, for cell cycle phases `"A1B"` and `"C"`, the value of the variable `A101` does not seem to depend on the genotype associated to the SNP `YAL069W_1`. + +--- + +</p> +</details> + + +> Note: depending on the context and the data, other representations like empirical histogram or empirical density representation of quantitative variables of interest depending on factors (qualitative variables) of interest can be used. + +--- + +## Parenthesis on Simpon's paradox and confounding factors + +In the [`palmerpenguins`](https://allisonhorst.github.io/palmerpenguins/) dataset, we have different morphological measurements for numerous individuals of penguins of different species. We want to investigate a possible link between the bill depth vs the bill length for these penguins. + +Reminder about the anatomy of a penguin: + +<figure><img src="./img/culmen_depth.png" alt="Trulli" style="width:100%"> +<figcaption align = "center"> +Microscopy picture of yeast cells ([credit](https://allisonhorst.github.io/palmerpenguins/)) +</figcaption></figure> + +```{r, include=FALSE} +library(palmerpenguins) +data("penguins") +``` + +Here is a [scatter plot](https://en.wikipedia.org/wiki/Scatter_plot) representing the bill depth vs the bill length for all individuals in the dataset. + +```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.align = "center"} +ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm)) + geom_point() + + theme_bw() +``` + +<div class="pencadre"> +What can you say about this representation? Can you infer any relationship between bill depth and length in penguins? +</div> + +<details><summary>Solution</summary> +<p> + +There seems to be a negative relationship between bill depth and length in penguins. As the length increases, the depth decreases. + +```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.align = "center"} +ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm)) + geom_point() + + geom_smooth(method='lm', formula= y~x, se=FALSE) + + theme_bw() +``` + +--- + +</p> +</details> + +<div class="pencadre"> +Could we question this result? +</div> + +<details><summary>Solution</summary> +<p> + +Here is the same representation discriminated by species: + +```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.align = "center"} +ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm, group = species, col = species)) + geom_point() + + geom_smooth(method='lm', formula= y~x, se=FALSE) + + theme_bw() +``` + +We obtained opposite conclusions : in all species, there seems to be a positive relationship between bill depth and length. As the length increases, the depth also increases. + +--- + +</p> +</details> + +The *species* factor is called a [**confounding factors**](https://en.wikipedia.org/wiki/Confounding): a variable (potentially unknown) that influences both variables resulting in a spurious association. + +In this particular case, the change of trend when accounting for the groups of individuals is called the [Simpon's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox). + +Here is another example from the following publication: + +> [2] Appleton, David & French, Joyce & Vanderpump, Mark. (1996). Ignoring a Covariate: An Example of Simpson's Paradox. American Statistician - AMER STATIST. 50. 340-341. 10.1080/00031305.1996.10473563. + +After a first survey in the 1970s, a follow-up study in 1992-94 (in Whickham a mixed urban and rural district near Newcastle upon Tyne, United Kingdom) investigated (among other things) the 20-year after survival of subjects included in the original study. + +```{r, include=FALSE} +# data from [2] +smoking_raw_data <- rbind( + data.frame( + age = c("18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"), + smoker_status = "non_smoker", + dead = c(1, 5, 7, 12, 40, 101, 64), + alive = c(61, 152, 114, 66, 81, 28, 0) + ), + data.frame( + age = c("18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"), + smoker_status = "smoker", + dead = c(2, 3, 14, 27, 51, 29, 13), + alive = c(53, 121, 95, 103, 64, 7, 0) + ) +) + +# reformat data +smoking_data <- smoking_raw_data %>% + # compute suvival rate + mutate(survival_rate = alive/(dead+alive)) + +# aggregated count (without accounting for age) +smoking_agg_data <- smoking_data %>% group_by(smoker_status) %>% + summarise(dead = sum(dead), alive = sum(alive)) %>% ungroup() %>% + # compute suvival rate + mutate(survival_rate = alive/(dead+alive)) +``` + +Here are the results (dataset from [2]) regarding the survival of the women in the study depending on their smoking status ("smoker" or "non smoker"): + +```{r, echo=FALSE, message=FALSE, warning=FALSE} +knitr::kable(smoking_agg_data %>% mutate(survival_rate=str_c(round(100*survival_rate, digits = 3), "%"))) +``` + +```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.align = "center"} +ggplot(smoking_agg_data, aes(x=smoker_status, y=survival_rate, fill = smoker_status)) + + geom_bar(stat = "identity", position = "dodge") + + scale_y_continuous(labels = scales::percent_format(accuracy = 1)) + + theme_bw() +``` + +<div class="pencadre"> +What can you say about these results ? +</div> + +<details><summary>Solution</summary> +<p> + +The survival rate is larger in the "smoker" sub-group than in the "not smoker" sub-group. Surprising? + +--- + +</p> +</details> + +<div class="pencadre"> +What could explain this surprising result? +</div> + +<details><summary>Solution</summary> +<p> + +**A confounding variable**: the age of the person in the original study. + +There is a sampling bias in this dataset. The number of older non-smoking women (thus with a higher mortality risk) is very high compared to the other groups, hence decreasing the global survival rate in the non-smoking group. + +```{r, echo=FALSE, message=FALSE, warning=FALSE} +knitr::kable(smoking_data %>% mutate(survival_rate=str_c(round(100*survival_rate, digits = 3), "%"))) +``` + +```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.align = "center"} +ggplot(smoking_data, aes(x=age, y=survival_rate, fill = smoker_status)) + + geom_bar(stat = "identity", position = "dodge") + + scale_y_continuous(labels = scales::percent_format(accuracy = 1)) + + theme_bw() +``` + +--- + +</p> +</details> + + +> **Note:** in the two previous examples, we did not focus on quantifying any statistical significance of the different effects that we discussed (that would require using statistical models). The point was to give you the intuition about potential issues and misleading results caused by any forgotten confounding factors and the Simpson's paradox in an analysis. + +<div class="pencadre"> +How to avoid drawing false conclusions because of confounding factors? +</div> + +<details><summary>Solution</summary> +<p> + +- Use randomization to build your experiment, like [randomized controlled trial](https://en.wikipedia.org/wiki/Randomized_controlled_trial), possibly with [double-blinding](https://en.wikipedia.org/wiki/Blinded_experiment), to remove the potential effect of confounding variables (should be used for any serious drug trial), or at least control the potential sampling bias caused by confounding factors (like [case-control studies](https://en.wikipedia.org/wiki/Case%E2%80%93control_study), [cohort studies](https://en.wikipedia.org/wiki/Cohort_study), [stratification](https://en.wikipedia.org/wiki/Stratified_sampling)). + +2. If not possible (it is not always possible depending on the design of the experiment and/or the object of the study), measure and log various metadata regarding your subjects/individuals so that you will be able to account for the potential effect of confounding variables in your analysis (c.f. [later](#one-factor-anova)). + +It generally requires a certain level of technical expertise/knowledge in the considered subject to be able to identify potential confounding factors before the experiments (so that you can monitor and log the corresponding quantities during your experiment). + +More details [here](https://en.wikipedia.org/wiki/Confounding#Decreasing_the_potential_for_confounding). + +--- + +</p> +</details> + + +--- + ## Linear model and analysis of variance -We will now use a linear model, and especially an analysis of variance (ANOVA), to test the association between the morphological trait `A101` and the SNPs `YAL069W_1` or `YDL200C_427` (separately). +We will now use a linear model, and especially an analysis of variance (ANOVA), to test the association between the morphological trait `A101` and the genotype associated to the SNPs `YAL069W_1` or `YDL200C_427` (separately). + +### One-factor ANOVA + +<div class="pencadre"> +Write the linear model for the ANOVA regarding the morphological trait `A101` depending on genotype associated to the SNP `YAL069W_1`? +</div> + +<details><summary>Solution</summary> +<p> + +Notations: + +- $i=1,2$ is the genotype associated to the SNP `YAL069W_1` +- $y_{ir}$ is the value of the morphological trait `A101` in the $r^\text{th}$ individual with genotype $i$. The associated random variable is $Y_{ir}$ + +Model: + +\[ +Y_{ir} = \mu_i + E_{ir} +\] + +Assumptions: + +- we assume that $E_{ir} \sim \mathcal{N}(0, \sigma^2)$ +- we assume that $Y_{ir}$ are independent (hence that $E_{ir}$ are independent) + +Equivalent formulation: + +\[ +Y_{ir} \sim \mathcal{N}(\mu_i, \sigma^2) +\] + +--- + +</p> +</details> + +<!-- <div class="pencadre"> --> +<!-- Question ? --> +<!-- </div> --> + +<!-- <details><summary>Solution</summary> --> +<!-- <p> --> + +<!-- Solution --> + +<!-- </p> --> +<!-- </details> --> -### 1-factor ANOVA -We will first consider only a single cell cycle phase (e.g. the `"C"` phase) to avoid dealing with the measure heterogeneity across cell cycles. +<!-- <div class="pencadre"> --> +<!-- Question ? --> +<!-- </div> --> + +<!-- <details><summary>Solution</summary> --> +<!-- <p> --> + +<!-- Solution --> + +<!-- </p> --> +<!-- </details> --> + + +<!-- <div class="pencadre"> --> +<!-- Question ? --> +<!-- </div> --> + +<!-- <details><summary>Solution</summary> --> +<!-- <p> --> + +<!-- Solution --> + +<!-- </p> --> +<!-- </details> --> + +**IMPORTANT**: We will first consider only a single cell cycle phase (e.g. the `"C"` phase) to avoid dealing with the measure heterogeneity across cell cycles (c.f. [later](#two-factor-anova)) ```{r} yeast_av_subdata <- yeast_av_data %>% filter(cell_cycle == "C") ``` <div class="pencadre"> -Conduct a 1 factor ANOVA between the considered variable and each of the considered SNPs. +Conduct a one-factor ANOVA between the considered output variable (`A101`) and each of the considered SNPs (`YAL069W_1` or `YDL200C_427`). **Hint**: use the functions `lm()` and `anova()` </div> <details><summary>Solution</summary> @@ -852,16 +1139,36 @@ Check the normality of the residuals and the homoskedasticity (constant variance Between `A101` and the SNP `YAL069W_1`: ```{r} +# linear model mod1 <- lm(A101 ~ factor(YAL069W_1), data = yeast_av_subdata) + +# residuals +residuals(mod1) + # qq-plot plot(mod1, which = 2) + +# check normality +performance::check_normality(mod1) +# check homoskedasticity +performance::check_heteroskedasticity(mod1) ``` Between `A101` and the SNP `YAL069W_1`: ```{r} +# linear model mod2 <- lm(A101 ~ factor(YDL200C_427), data = yeast_av_subdata) + +# residuals +residuals(mod2) + # qq-plot plot(mod2, which = 2) + +# check normality +performance::check_normality(mod2) +# check homooskedasticity +performance::check_heteroskedasticity(mod2) ``` --- @@ -870,14 +1177,49 @@ plot(mod2, which = 2) </details> + <div class="pencadre"> -Do you find a significant effect of the respective SNPs on the considered morphological trait? Comment the results in relation with your intuition from the analyis of the descriptive statistics [above](#data-description)? +Compute the value of the coefficients in the ANOVA model regarding the morphological trait `A101` depending on genotype associated to the SNP `YAL069W_1` that was defined in the first question? This model is different from the results given by the `lm` function? How so? </div> <details><summary>Solution</summary> <p> -Solution +Between `A101` and the SNP `YAL069W_1`: + +```{r} +# manual computation +yeast_av_subdata %>% group_by(YAL069W_1) %>% summarise(mu=mean(A101)) + +# automatic computation +lm(A101 ~ factor(YAL069W_1), data = yeast_av_subdata) +``` + +Model considered in R: + +\[ +Y_{ir} = \mu + \nu_i + E_{ir} +\] + +with: + +- $\mu$ is the intercept +- $\sum_i \nu_i = 0$ (think $\nu_i = \mu_i - \mu$), hence you get only $\nu_1$ + +Matrix notation: + +\[ +Y = \mu + X \times B + E +\] + +where: +- $Y$ is the vector containing the values for the output variable +- $X = [x_{ij}]$ is the $\{0,1\}$ indicator that individual $j$ has genotype $i$ +- $B = [\nu_i]$ is the vector of genotype coefficients +- $E$ is the vector of residuals + +Then $\mu$ and $B$ are estimated by a **least square linear regression** (see [here](https://en.wikipedia.org/wiki/Simple_linear_regression) [here](https://en.wikipedia.org/wiki/Linear_regression) and [here](https://en.wikipedia.org/wiki/Least_squares) for more details and references). + --- @@ -886,53 +1228,45 @@ Solution <div class="pencadre"> -Compute the value of the coefficients in each model by hand and compare your results to the results of the `lm` function? +**Interpretation of the ANOVA results:** do you find a significant effect of the respective SNPs on the considered morphological trait? Comment the results in relation with your intuition from the analysis of the descriptive statistics [above](#data-description)? </div> <details><summary>Solution</summary> <p> -Between `A101` and the SNP `YAL069W_1`: - -```{r} -anova(lm(A101 ~ factor(YAL069W_1), data = yeast_av_subdata)) -``` - -Between `A101` and the SNP `YAL069W_1`: -```{r} -anova(lm(A101 ~ factor(YDL200C_427), data = yeast_av_subdata)) -``` +After verifying the normality and homoskedasticity of the residuals (**if either is not verified, we cannot use the results from the ANOVA significance test because it assumes a Gaussian model**), we find a significant effect of SNP `YAL069W_1` and a non-significant effect of SNP `YDL200C_427` onto the morphological trait `A101` (when focusing on the `C` cell cycle phase), which confirms our intuition from the graphical representation. --- </p> </details> -### 2-factor ANOVA -<!-- Between `A101` and the SNP `YAL069W_1`: --> +### Two-factor ANOVA -<!-- ```{r} --> -<!-- anova(lm(A101 ~ factor(YAL069W_1) + factor(cell_cycle), data = yeast_av_data)) --> -<!-- ``` --> +Between `A101` and the SNP `YAL069W_1`: -<!-- Between `A101` and the SNP `YAL069W_1`: --> -<!-- ```{r} --> -<!-- anova(lm(A101 ~ factor(YDL200C_427) + factor(cell_cycle), data = yeast_av_data)) --> -<!-- ``` --> +```{r} +anova(lm(A101 ~ factor(YAL069W_1) + factor(cell_cycle), data = yeast_av_data)) +``` + +Between `A101` and the SNP `YAL069W_1`: +```{r} +anova(lm(A101 ~ factor(YDL200C_427) + factor(cell_cycle), data = yeast_av_data)) +``` -<!-- Interaction? --> +Interaction? -<!-- Between `A101` and the SNP `YAL069W_1`: --> +Between `A101` and the SNP `YAL069W_1`: -<!-- ```{r} --> -<!-- anova(lm(A101 ~ factor(YAL069W_1) * factor(cell_cycle), data = yeast_av_data)) --> -<!-- ``` --> +```{r} +anova(lm(A101 ~ factor(YAL069W_1) * factor(cell_cycle), data = yeast_av_data)) +``` -<!-- Between `A101` and the SNP `YAL069W_1`: --> -<!-- ```{r} --> -<!-- anova(lm(A101 ~ factor(YDL200C_427) * factor(cell_cycle), data = yeast_av_data)) --> -<!-- ``` --> +Between `A101` and the SNP `YAL069W_1`: +```{r} +anova(lm(A101 ~ factor(YDL200C_427) * factor(cell_cycle), data = yeast_av_data)) +``` <!-- <div class="pencadre"> --> @@ -974,4 +1308,10 @@ PCA vs Correspondence analysis --- +## Full data analysis + + + + + [The .Rmd file corresponding to this page is available here under the AGPL3 Licence](https://lbmc.gitbiopages.ens-lyon.fr/hub/formations/ens_m1_ml/Practical_b.Rmd) diff --git a/practical_c.Rdata b/practical_c.Rdata index 449043c229a56a2622c2616d2c96b54c08638bfe..15b54915d987d0eb1ee9a693fa4b68d4a5b158f2 100644 Binary files a/practical_c.Rdata and b/practical_c.Rdata differ diff --git a/practical_c_prepa.R b/practical_c_prepa.R index d3dd54de60f0641fe9d476327f26fed41db471fc..15ac09f932d0f8ad8937fc5b288d900798e49291 100644 --- a/practical_c_prepa.R +++ b/practical_c_prepa.R @@ -232,9 +232,9 @@ morpho_data <- morpho_data %>% #### additional preprocessing #### -morpho_data <- morpho_data %>% - # apply log1p transform to morphological trait measures - mutate(across(matches("[ABC][0-9]+\\-?[0-9]*"), log1p)) +# morpho_data <- morpho_data %>% +# # apply log1p transform to morphological trait measures +# mutate(across(matches("[ABC][0-9]+\\-?[0-9]*"), log1p)) #### merge morpho and genotype data ####