diff --git a/Practical_c.Rmd b/Practical_c.Rmd
index c3ac99d0108d284cfdfbf6308ac81f2fe083bb58..88c985dea6fa9a71e16304a2bc9013b78d32ba1d 100644
--- a/Practical_c.Rmd
+++ b/Practical_c.Rmd
@@ -138,9 +138,7 @@ Detail the general framework to build a statistical test?
 - Compute the observed value $t$ of a random variable, called a statistic, $T$ using measured quantities on a sample of observations.
 - Requirements: know the probability distribution of $T$ assuming $\H_0$ is **true**
-- Verify if the value $t$ is "probable" according to the distibution of $T$ under $\H_0$
-
----
+- Verify if the value $t$ is "probable" according to the distribution of $T$ under $\H_0$.
 
 </p>
 </details>
@@ -152,10 +150,8 @@ What are the possible answers of a statistical test?
 <details><summary>Solution</summary>
 <p>
 
-- possible answers: "$\H_0$ is false with a given risk" (reject $\H_0$, the test result is significant) vs "the result is not significant" (given this sample, we don't know if $\H_0$ is true or false)
-- the answer is **never**: ~~$\H_0$ is true~~
-
----
+- possible answers: "$\H_0$ is false with a given risk" (reject $\H_0$, the test result is significant) vs "the result is not significant" (given this sample, we don't know if $\H_0$ is true or false).
+- the answer is **never**: ~~$\H_0$ is true~~.
 
 </p>
 </details>
@@ -240,8 +236,6 @@ ggplot(data.frame(x = c(0, 10)), aes(x)) +
            size = 3, colour = "blue")
 ```
 
----
-
 </p>
 </details>
@@ -267,9 +261,7 @@ The p-value is a conditional probability assuming that $\H_0$ is true.
 Reasoning:
 
 - assuming $\H_0$ $\to$ the statistic $T$ follows a given probability distribution
-- "Inductive contraposition": p-value is small $\to$ observed value $t$ for the statistics $T$ is unlikely considering its probability distribution under $\H_0$ $\to$ it is unlikely that $T$ follows this distribution $\to$ it is unlikely that $\H_0$ is true
-
----
+- "Inductive contraposition": p-value is small $\to$ observed value $t$ for the statistic $T$ is unlikely considering its probability distribution under $\H_0$ $\to$ it is unlikely that $T$ follows this distribution $\to$ it is unlikely that $\H_0$ is true.
 
 </p>
 </details>
@@ -282,11 +274,9 @@ What are type I and type II errors?
 <details><summary>Solution</summary>
 <p>
 
-- Type I error: reject $\H_0$ conditionally to the fact that $\H_0$ is true
-
-- Type II error: not reject $\H_0$ conditionally to the fact that $\H_0$ is false
+- Type I error: reject $\H_0$ conditionally to the fact that $\H_0$ is true.
 
----
+- Type II error: do not reject $\H_0$ conditionally to the fact that $\H_0$ is false.
 
 </p>
 </details>
@@ -305,8 +295,6 @@ What are type I and type II risks?
 
 - Power $= 1 - \beta = \PP(\text{reject } \H_0\ \vert\ \H_0 \text{ is false})$
 
----
-
 </p>
 </details>
@@ -330,8 +318,6 @@ The type 1 risk is a conditional probability assuming that $\H_0$ is true.
 
 The type 1 risk $\alpha$ is generally chosen (e.g. $\alpha = 5\%$). It is important to evaluate the power of the test, which cannot be done in general (it requires either to design a test where the distribution of the statistic $T$ is known assuming that the alternative hypothesis $\H_1$ is true, or to evaluate the power using simulations where the truth about $\H_0$ and $\H_1$ is known).
 
----
-
 </p>
 </details>
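
A minimal sketch of this simulation idea, assuming a two-sample t-test and a true mean difference `delta` (the values of `delta`, `n` and `n_sim` are arbitrary): generate data under a known $\H_1$, apply the test repeatedly, and estimate the power as the proportion of rejections.

```r
# Sketch: estimating the power of a two-sample t-test by simulation,
# assuming a true mean difference `delta` between the two groups (so H0 is false).
set.seed(42)
simulate_power <- function(alpha, delta = 1, n = 20, n_sim = 1000) {
  p_values <- replicate(n_sim, {
    x <- rnorm(n, mean = 0)      # group 1, generated under the known truth
    y <- rnorm(n, mean = delta)  # group 2, shifted by delta
    t.test(x, y)$p.value
  })
  mean(p_values <= alpha)        # proportion of (correct) rejections = estimated power
}

simulate_power(alpha = 0.05)
```

Evaluating `simulate_power()` over a grid of `alpha` values gives the kind of power-versus-$\alpha$ curve computed further down.
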
@@ -471,8 +457,6 @@ ggplot(data.frame(x = c(0, 10)), aes(x)) +
            size = 3, colour = "blue")
 ```
 
----
-
 </p>
 </details>
@@ -555,8 +539,6 @@ In a non negligible number of samples, the null hypothesis was rejected (p-value
 
 However, in the majority of the studies, the null hypothesis is correctly not rejected.
 
----
-
 </p>
 </details>
@@ -589,8 +571,6 @@ In a non negligible number of samples, the null hypothesis was not rejected (p-v
 
 However, in the majority of the studies, the null hypothesis is correctly rejected.
 
----
-
 </p>
 </details>
@@ -640,8 +620,6 @@ power_values <- sapply(
 )
 ```
 
----
-
 </p>
 </details>
@@ -657,8 +635,6 @@ ggplot(data.frame(alpha=alpha_values, power=power_values)) +
   theme_bw()
 ```
 
----
-
 </p>
 </details>
@@ -671,8 +647,6 @@ What can you say about this representation?
 
 Decreasing $\alpha$ to reduce the type I error decreases the power of the test.
 
----
-
 </p>
 </details>
@@ -756,8 +730,6 @@ Which graphical representation of the data does give an insight about the distri
 
 For instance, a box-plot (and derivatives).
 
----
-
 </p>
 </details>
@@ -780,8 +752,6 @@ ggplot(yeast_av_data, aes(factor(YDL200C_427), A101)) +
   geom_boxplot() + theme_bw()
 ```
 
----
-
 </p>
 </details>
@@ -804,8 +774,6 @@ ggplot(yeast_av_data, aes(factor(YDL200C_427), A101)) +
   geom_boxplot() + theme_bw()
 ```
 
----
-
 </p>
 </details>
@@ -820,8 +788,6 @@ For all cell cycle phases, the value of the variable `A101` seems to depend on t
 
 On the contrary, for cell cycle phases `"A1B"` and `"C"`, the value of the variable `A101` does not seem to depend on the genotype associated to the SNP `YAL069W_1`.
 
----
-
 </p>
 </details>
@@ -868,8 +834,6 @@ ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm)) +
   geom_point() + theme_bw()
 ```
 
----
-
 </p>
 </details>
@@ -890,8 +854,6 @@ ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm, group = species, col = s
 
 We obtained opposite conclusions: in all species, there seems to be a positive relationship between bill depth and length. As the length increases, the depth also increases.
 
----
-
 </p>
 </details>
@@ -956,8 +918,6 @@ What can you say about these results ?
 
 The survival rate is larger in the "smoker" sub-group than in the "not smoker" sub-group. Surprising?
 
----
-
 </p>
 </details>
@@ -983,8 +943,6 @@ ggplot(smoking_data, aes(x=age, y=survival_rate, fill = smoker_status)) +
   theme_bw()
 ```
 
----
-
 </p>
 </details>
@@ -1006,8 +964,6 @@ It generally requires a certain level of technical expertise/knowledge in the co
 
 More details [here](https://en.wikipedia.org/wiki/Confounding#Decreasing_the_potential_for_confounding).
 
----
-
 </p>
 </details>
@@ -1049,8 +1005,6 @@ Equivalent formulation:
 Y_{jr} \sim \mathcal{N}(\mu_j, \sigma^2)
 \]
 
----
-
 </p>
 </details>
@@ -1078,8 +1032,6 @@ Between `A101` and the SNP `YAL069W_1`:
 anova(lm(A101 ~ factor(YDL200C_427), data = yeast_av_subdata))
 ```
 
----
-
 </p>
 </details>
@@ -1140,8 +1092,6 @@ performance::check_normality(mod2)
 performance::check_heteroskedasticity(mod2)
 ```
 
----
-
 </p>
 </details>
@@ -1197,8 +1147,6 @@ Conditional expectation:
 
 Then $\mu$ and $B$ are estimated by a **least squares linear regression** (see [here](https://en.wikipedia.org/wiki/Simple_linear_regression), [here](https://en.wikipedia.org/wiki/Linear_regression) and [here](https://en.wikipedia.org/wiki/Least_squares) for more details and references).
 
----
-
 </p>
 </details>
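
A minimal sketch of what "estimated by a least squares linear regression" means, on made-up data (the true values $\mu = 2$ and $B = 3$ are assumptions): the normal equations reproduce the coefficients returned by `lm()`.

```r
# Sketch: least squares "by hand" versus lm(), on simulated data.
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)                 # assumed toy model: mu = 2, B = 3

X <- cbind(1, x)                           # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'y, the least squares solution
beta_hat

coef(lm(y ~ x))                            # lm() returns the same estimates
```
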
@@ -1212,8 +1160,6 @@ After verifying the normality and homoskedasticity of the residuals (**if it is not verified, we cannot use the results from the ANOVA significance test because it assumes a Gaussian model**), we find a significant effect of SNP `YAL069W_1` and a non-significant effect of SNP `YDL200C_427` onto the morphological trait `A101` (when focusing on the `C` cell cycle phase), which confirms our intuition from the graphical representation.
 
----
-
 </p>
 </details>
@@ -1255,9 +1201,7 @@ Y_{jkr} = \mu + \beta_j + \alpha_k + \gamma_{jk} + \varepsilon_{jkr}
 
 where:
 
-- $\beta_j$ is the coefficient associated to the interaction between genotype $j$ and cell cycle phase $k$, with the constraints: $\sum_j \gamma_{jk} = 0$ and $\sum_k \gamma_{jk} = 0$
-
----
+- $\gamma_{jk}$ is the coefficient associated to the interaction between genotype $j$ and cell cycle phase $k$, with the constraints: $\sum_j \gamma_{jk} = 0$ and $\sum_k \gamma_{jk} = 0$.
 
 </p>
 </details>
@@ -1299,16 +1243,11 @@ anova(lm(A101 ~ factor(YDL200C_427) * cell_cycle, data = yeast_av_data))
 
 > **Note:** in a linear formula in R, `factor1 * factor2` is equivalent to `factor1 + factor2 + factor1:factor2`. The `:` is used to explicitly specify the interaction between factors.
 
----
-
 </p>
 </details>
 
-
-
-
 <div class="pencadre">
 After doing regular verification for an ANOVA model, interpret the previous results?
 </div>
@@ -1393,8 +1332,6 @@ In practice, you need to choose if you are going to illicitly use these p-values
 
 > - When comparing multiple models, e.g. with or without interaction, a simple rule is to start by training with the "richest" model (with all factors and interactions) and then train a simpler model where we remove factors and/or interactions that are not significant (if we can use the p-values).
 > - There exist automatic methods, called [**model selection**](https://en.wikipedia.org/wiki/Model_selection) approaches, that allow to automatically compare and choose the "best" model among a set of candidate models (these types of methods and the comparison criteria they are based on are beyond the scope of the course).
 
----
-
 </p>
 </details>
@@ -1425,8 +1362,6 @@ What can you say about the heatmap ?
 
 We observe a lot of variability in the genotypes along all SNPs between the different strains. Some SNPs seem to be less variable than others. It appears quite difficult to extract information from this representation. We would need to summarize the information contained in the genotype table to get an overview of this information.
 
----
-
 </p>
 </details>
@@ -1440,9 +1375,6 @@ How could we visualize both the information contained in the SNP data and the mo
 
 We can use a dimension reduction approach, like PCA.
-
----
-
 </p>
 </details>
@@ -1511,8 +1443,6 @@ fviz_pca_ind(
 )
 ```
 
----
-
 </p>
 </details>
@@ -1526,8 +1456,6 @@ What can you say about this representation?
 
 The structure of the SNP data extracted by the first two components of PCA does not allow to discriminate between the different levels of the morphological trait `A101`. The variability in the data explained by the first two components is not very high though.
 
----
-
 </p>
 </details>
@@ -1567,8 +1495,6 @@ Is the representation obtained by MCA better than the one obtained with PCA?
 
 Not better; maybe we could try some supervised dimension reduction approaches (cf. next note) or some non-linear dimension reduction approaches (cf. previous practical subject).
 
----
-
 </p>
 </details>
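
A minimal MCA sketch along these lines, assuming a hypothetical data frame `snp_data` that holds the SNP genotypes as columns (the calls mirror the `factoextra` functions used for the PCA):

```r
# Sketch: multiple correspondence analysis on categorical SNP genotypes.
library(FactoMineR)
library(factoextra)

snp_factors <- as.data.frame(lapply(snp_data, factor))  # MCA expects factor columns
res_mca <- MCA(snp_factors, graph = FALSE)

fviz_screeplot(res_mca)                # variance explained by the first dimensions
fviz_mca_ind(res_mca, geom = "point")  # strains projected on dimensions 1 and 2
```
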
@@ -1595,8 +1521,6 @@ We have a non-negligible risk to wrongly reject the null hypothesis for many of
 
 Thus, we have to use a p-value correction (or adjustment) procedure adapted to the case of multiple testing.
 
----
-
 </p>
 </details>
@@ -1627,8 +1551,6 @@ What should we verify?
 
 In theory, we should verify the Gaussian and homoskedasticity assumptions for the residuals in each model (i.e. `r nrow(test_result)` models). In practice, it can be cumbersome... (but it would be a flaw in the analysis).
 
----
-
 </p>
 </details>
@@ -1666,8 +1588,6 @@ sum(test_result$p_values <= 0.05)
 
 **But what about the p-value correction?**
 
----
-
 </p>
 </details>
@@ -1690,8 +1610,6 @@ test_result <- test_result %>%
 ```
 
----
-
 </p>
 </details>
@@ -1733,8 +1651,6 @@ ggplot(test_result) +
   geom_line(aes(x=p_values, y=fdr_adj_p_values)) +
   theme_bw()
 ```
 
----
-
 </p>
 </details>
@@ -1774,8 +1690,6 @@ test_result %>%
   head(50)
 ```
 
----
-
 </p>
 </details>
@@ -1814,15 +1728,13 @@ You could also try to consider linear model integrating multiple SNPs (instead o
 </div>
 
-<details><summary>Solution</summary>
-<p>
-
-Write me!
+<!-- <details><summary>Solution</summary> -->
+<!-- <p> -->
 
----
+<!-- Write me! -->
 
-</p>
-</details>
+<!-- </p> -->
+<!-- </details> -->
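
Purely as an illustration of this hint (the solution itself is left as `Write me!`), a model including several SNPs at once might look like the following; the column names are reused from earlier in the practical and the particular set of SNPs is an assumption.

```r
# Sketch: a linear model with several SNPs (and the cell cycle phase) at once.
multi_snp_mod <- lm(
  A101 ~ factor(YAL069W_1) + factor(YDL200C_427) + cell_cycle,
  data = yeast_av_data
)
anova(multi_snp_mod)

# The usual checks still apply before interpreting the p-values.
performance::check_normality(multi_snp_mod)
performance::check_heteroskedasticity(multi_snp_mod)
```
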