## Reminder about statistical hypothesis testing and p-values

### Statistical testing

#### Building the test
<div class="pencadre">
What are the possible answers of a statistical test?
</div>
<details><summary>Solution</summary>
<p>

- possible answers: "$\H_0$ is false with a given risk" (reject $\H_0$, the test result is significant) vs "the result is not significant" (given this sample, we don't know if $\H_0$ is true or false).
- the answer is **never**: ~~$\H_0$ is true~~.

</p>
</details>
What is the p-value?
The p-value is the conditional probability, assuming $\H_0$, that the statistic $T$ is at least as extreme as the observed value $t$.
For a **bilateral test**:
$$
p\text{-value} = \PP(T < -t \ \text{or}\ T > t) = \PP(\vert T\vert > \vert t\vert) = 1 - \PP(-t \leq T \leq t)
$$
```{r pval_bilat, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.align="center"}
# minimal sketch: standard Normal density with the two rejection tails
# |T| > t shaded (t = 1.7 is an arbitrary illustrative value)
t_obs <- 1.7
ggplot(data.frame(x = c(-3, 3)), aes(x)) +
    stat_function(fun = dnorm) +
    stat_function(fun = dnorm, xlim = c(-3, -t_obs), geom = "area", fill = "red", alpha = 0.5) +
    stat_function(fun = dnorm, xlim = c(t_obs, 3), geom = "area", fill = "red", alpha = 0.5) +
    theme_bw()
```
For a (right) **unilateral test**:
$$
p\text{-value} = \PP(T > t) = 1 - \PP(T \leq t)
$$
```{r pval_unilat, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.align="center"}
# minimal sketch: right-tail rejection region T > t shaded
# (the chi-squared density with df = 3 and t = 6 are arbitrary
# illustrative choices)
t_obs <- 6
ggplot(data.frame(x = c(0, 10)), aes(x)) +
    stat_function(fun = dchisq, args = list(df = 3)) +
    stat_function(fun = dchisq, args = list(df = 3), xlim = c(t_obs, 10), geom = "area", fill = "red", alpha = 0.5) +
    theme_bw()
```
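In R, once the observed statistic $t$ is known, both p-values can be computed directly from the cumulative distribution function of the null distribution of $T$. A minimal sketch, assuming a standard Normal null distribution (an illustrative assumption, not necessarily the distribution used above):

```{r pval_computation}
# observed value of the test statistic (illustrative value)
t_obs <- 1.7
# bilateral test: p-value = P(|T| > |t|) under a standard Normal null
2 * pnorm(-abs(t_obs))
# right unilateral test: p-value = P(T > t) = 1 - P(T <= t)
1 - pnorm(t_obs)
```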
<!-- <details><summary>Solution</summary> -->
<!-- <p> -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \PP() \\ -->
<!-- = \PP() -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- </p> -->
<!-- </details> -->
The data are generated from the simulation of a random Gaussian variable of mean…
Since we are working with simulations, we can repeat the experiment as much as we want, i.e. generate multiple samples of observed values for the considered variables.
> **Note:** this is only possible because we are working with simulated data. When analysing real experimental data, it is generally very complicated or costly (or even impossible) to repeat the experiment numerous times in order to generate multiple independent samples of observed values.
In the T-test, we test the hypotheses "$\H_0$: $\mu = \mu_0$" versus "$\H_1$: $\mu \ne \mu_0$", where $\mu$ is the (unknown) population mean of the variable and $\mu_0$ a given value. In the following, we will choose $\mu_0=0$ and test "$\H_0$: $\mu = 0$" versus "$\H_1$: $\mu \ne 0$".
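As an illustration, here is a minimal sketch of how such a test can be run in R on one simulated sample (the sample size and standard deviation are arbitrary illustrative choices, not necessarily the ones used in this practical):

```{r ttest_sketch}
# simulate one sample of a Gaussian variable with true mean 0 (H0 is true)
smp <- rnorm(50, mean = 0, sd = 1)
# one-sample T-test of "H0: mu = 0" versus "H1: mu != 0"
res <- t.test(smp, mu = 0)
res$p.value
```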
To estimate $\beta$, we need to repeat the same experiment multiple times and estimate…
A simple way is to generate data where $\H_0$ is known to be false, i.e. $\PP(\H_0 \text{ is false})=1$.
We have (cf. [here](https://en.wikipedia.org/wiki/Conditional_probability)):
$$
\begin{aligned}
& \text{power}\\
& = \PP(\text{reject } \H_0\ \vert\ \H_0 \text{ is false})\\
& = \frac{\PP(\text{reject } \H_0\ \ \text{and}\ \H_0 \text{ is false})}{\PP(\H_0 \text{ is false})}
\end{aligned}
$$
In this case, we then have $\text{power} = \PP(\text{reject } \H_0\ \text{and}\ \H_0 \text{ is false})$, and we can estimate this probability by counting the number of times we reject $\H_0$ among the repetitions of the experiment, depending on the type I risk $\alpha$ that we choose.
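A minimal sketch of this estimation by simulation (the true mean, sample size and number of repetitions are arbitrary illustrative choices):

```{r power_sketch}
# "H0: mu = 0" is false by construction: data are simulated with true mean 0.5
n_rep <- 1000  # number of repetitions of the experiment
alpha <- 0.05  # chosen type I risk
reject <- replicate(n_rep, {
    smp <- rnorm(50, mean = 0.5, sd = 1)
    t.test(smp, mu = 0)$p.value < alpha
})
# estimated power = proportion of experiments where H0 is rejected
mean(reject)
```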
Decreasing $\alpha$ to reduce the type I error decreases the power of the test.

---
In this dataset, 64 different yeast strains are considered.
> **Note:** all strains were generated from an admixture between two original strains, but we will not investigate this point here.
For each strain, we have genotype data, containing the [SNPs](https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism) along the yeast genome. Each SNP is encoded with a `0` or `1` value corresponding to the number of derived alleles present at the corresponding locus.
> **Note:** here, the yeast strains were sequenced during their haploid phase, therefore the possible values for each genotype are only `0` and `1`.
For each strain, we also have measures of different morphological traits for hundreds of cells during 3 different stages of the cell cycle (called `"A"`, `"A2B"` and `"C"`).
On the contrary, for cell cycle phases `"A2B"` and `"C"`, the value of the variable…
</details>
> **Note:** depending on the context and the data, other representations, such as empirical histograms or empirical density plots of the quantitative variables of interest depending on factors (qualitative variables) of interest, can be used.
---
Model:
$$
Y_{jr} = \mu_j + \varepsilon_{jr}
$$
Assumptions:
Equivalent formulation:
$$
Y_{jr} \sim \mathcal{N}(\mu_j, \sigma^2)
$$
</p>
</details>
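To make the model concrete, here is a minimal simulation sketch of such a one-factor design and the corresponding ANOVA fit (the number of groups, group means and noise level are arbitrary illustrative choices):

```{r anova1_sketch}
# one-factor design: 3 groups, 20 replicates per group
mu_j <- c(1, 1.5, 3)                   # group means mu_j
grp <- factor(rep(1:3, each = 20))     # group factor
y <- mu_j[grp] + rnorm(60, sd = 0.5)   # Y_jr = mu_j + epsilon_jr
# one-factor ANOVA fit
summary(aov(y ~ grp))
```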
```r
lm(A101 ~ factor(YAL069W_1), data = yeast_av_subdata)
```
**Model considered in R** (with the previous notations):
$$
Y_{jr} = \mu + \beta_j + \varepsilon_{jr}
$$
with:
**Corresponding matrix notation** (which requires reindexing $Y$ and $E$):
$$
Y = \mu + X \times B + E
$$
where:
Conditional expectation:
$$
\EE(Y_i\ \vert\ X_{ij} = x_{ij}) = \mu + \sum_{j} x_{ij}\ \beta_j
$$
Then $\mu$ and $B$ are estimated by a **least squares linear regression** (see [here](https://en.wikipedia.org/wiki/Simple_linear_regression), [here](https://en.wikipedia.org/wiki/Linear_regression) and [here](https://en.wikipedia.org/wiki/Least_squares) for more details and references).
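The design matrix $X$ that R builds for such a model can be inspected with `model.matrix()`; a minimal sketch with an arbitrary 3-level factor:

```{r design_matrix_sketch}
# first column = intercept (mu), remaining columns = group indicators
grp <- factor(rep(c("a", "b", "c"), each = 2))
model.matrix(~ grp)
```

By default, R uses so-called treatment contrasts: the first level of the factor is taken as the reference (its $\beta_j$ is constrained to $0$) and the intercept plays the role of $\mu$.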
Two-factor ANOVA without interaction:
$$
Y_{jkr} = \mu + \beta_j + \alpha_k + \varepsilon_{jkr}
$$
where:
Two-factor ANOVA with interaction:
$$
Y_{jkr} = \mu + \beta_j + \alpha_k + \gamma_{jk} + \varepsilon_{jkr}
$$
where:
<!-- </details> -->
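In R, a two-factor ANOVA without and with interaction can be fitted with two different formula notations; a minimal sketch with hypothetical factors `f1` and `f2` (simulated data, not variables from the yeast dataset):

```{r anova2_sketch}
# balanced two-factor design: 2 x 3 levels, 10 replicates per combination
f1 <- factor(rep(c("a", "b"), each = 30))
f2 <- factor(rep(rep(c("x", "y", "z"), each = 10), times = 2))
y <- (f1 == "b") * 0.5 + (f2 == "z") * 1 + rnorm(60)
# without interaction: Y ~ f1 + f2
mod_add <- lm(y ~ f1 + f2)
# with interaction: Y ~ f1 * f2 (equivalent to f1 + f2 + f1:f2)
mod_int <- lm(y ~ f1 * f2)
anova(mod_int)
```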
## Appendix (optional): Simpson's paradox and confounding factors
In the [`palmerpenguins`](https://allisonhorst.github.io/palmerpenguins/) dataset, we have different morphological measurements for numerous individual penguins of different species. We want to investigate a possible link between bill depth and bill length in these penguins.
Reminder about the anatomy of a penguin:
<figure><img src="./img/culmen_depth.png" alt="penguin bill anatomy" style="width:100%">
<figcaption align = "center">
Illustration of the bill (culmen) length and depth of a penguin ([credit](https://allisonhorst.github.io/palmerpenguins/))
</figcaption></figure>
```{r, include=FALSE}
library(palmerpenguins)
data("penguins")
```
Here is a [scatter plot](https://en.wikipedia.org/wiki/Scatter_plot) representing the bill depth vs the bill length for all individuals in the dataset.
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.align = "center"}
ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm)) + geom_point() +
theme_bw()
```
<div class="pencadre">
What can you say about this representation? Can you infer any relationship between bill depth and length in penguins?
</div>
<details><summary>Solution</summary>
<p>
There seems to be a negative relationship between bill depth and length in penguins. As the length increases, the depth decreases.
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.align = "center"}
ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm)) + geom_point() +
geom_smooth(method='lm', formula= y~x, se=FALSE) +
theme_bw()
```
</p>
</details>
<div class="pencadre">
Could we question this result?
</div>
<details><summary>Solution</summary>
<p>
Here is the same representation stratified by species:
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.align = "center"}
ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm, group = species, col = species)) + geom_point() +
geom_smooth(method='lm', formula= y~x, se=FALSE) +
theme_bw()
```
We obtain the opposite conclusion: in all species, there seems to be a positive relationship between bill depth and length. As the length increases, the depth also increases.
</p>
</details>
The *species* factor is called a [**confounding factor**](https://en.wikipedia.org/wiki/Confounding): a variable (potentially unknown) that influences both studied variables, resulting in a spurious association.
In this particular case, the reversal of the trend when accounting for the groups of individuals is called [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox).
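Simpson's paradox is easy to reproduce on simulated data; here is a minimal sketch (with arbitrary group offsets) where the relationship is positive within each group but negative when the groups are pooled:

```{r simpson_sketch}
# two groups with a positive x-y relationship within each group,
# but with group offsets that reverse the pooled trend
grp <- rep(c("g1", "g2"), each = 100)
x <- rnorm(200, mean = ifelse(grp == "g1", 0, 4))
y <- 0.5 * x + ifelse(grp == "g1", 5, 0) + rnorm(200, sd = 0.5)
coef(lm(y ~ x))[2]                         # pooled slope: negative
coef(lm(y ~ x, subset = grp == "g1"))[2]   # within-group slope: positive
```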
Here is another example from the following publication:
> [2] Appleton, D. R., French, J. M., & Vanderpump, M. P. J. (1996). Ignoring a Covariate: An Example of Simpson's Paradox. *The American Statistician*, 50(4), 340-341. doi:10.1080/00031305.1996.10473563
After a first survey in the 1970s, a follow-up study in 1992-94 (in Whickham, a mixed urban and rural district near Newcastle upon Tyne, United Kingdom) investigated (among other things) the 20-year survival of the subjects included in the original study.
```{r, include=FALSE}
# data from [2]
smoking_raw_data <- rbind(
data.frame(
age = c("18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"),
smoker_status = "non_smoker",
dead = c(1, 5, 7, 12, 40, 101, 64),
alive = c(61, 152, 114, 66, 81, 28, 0)
),
data.frame(
age = c("18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"),
smoker_status = "smoker",
dead = c(2, 3, 14, 27, 51, 29, 13),
alive = c(53, 121, 95, 103, 64, 7, 0)
)
)
# reformat data
smoking_data <- smoking_raw_data %>%
# compute survival rate
mutate(survival_rate = alive/(dead+alive))
# aggregated count (without accounting for age)
smoking_agg_data <- smoking_data %>% group_by(smoker_status) %>%
summarise(dead = sum(dead), alive = sum(alive)) %>% ungroup() %>%
# compute survival rate
mutate(survival_rate = alive/(dead+alive))
```
Here are the results (dataset from [2]) regarding the survival of the women in the study depending on their smoking status ("smoker" or "non-smoker"):
```{r, echo=FALSE, message=FALSE, warning=FALSE}
knitr::kable(smoking_agg_data %>% mutate(survival_rate=str_c(round(100*survival_rate, digits = 3), "%")))
```
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.align = "center"}
ggplot(smoking_agg_data, aes(x=smoker_status, y=survival_rate, fill = smoker_status)) +
geom_bar(stat = "identity", position = "dodge") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
theme_bw()
```
<div class="pencadre">
What can you say about these results?
</div>
<details><summary>Solution</summary>
<p>
The survival rate is larger in the "smoker" sub-group than in the "non-smoker" sub-group. Surprising?
</p>
</details>
<div class="pencadre">
What could explain this surprising result?
</div>
<details><summary>Solution</summary>
<p>
**A confounding variable**: the age of the subjects in the original study.
There is a sampling bias in this dataset. The number of older non-smoking women (thus with a higher mortality risk) is very high compared to the other groups, hence decreasing the global survival rate in the non-smoking group.
```{r, echo=FALSE, message=FALSE, warning=FALSE}
knitr::kable(smoking_data %>% mutate(survival_rate=str_c(round(100*survival_rate, digits = 3), "%")))
```
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.align = "center"}
ggplot(smoking_data, aes(x=age, y=survival_rate, fill = smoker_status)) +
geom_bar(stat = "identity", position = "dodge") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
theme_bw()
```
</p>
</details>
> **Note:** in the two previous examples, we did not focus on quantifying any statistical significance of the different effects that we discussed (that would require using statistical models). The point was to give you the intuition about potential issues and misleading results caused by any forgotten confounding factors and the Simpson's paradox in an analysis.
<div class="pencadre">
How to avoid drawing false conclusions because of confounding factors?
</div>
<details><summary>Solution</summary>
<p>
- Use randomization to build your experiment, such as a [randomized controlled trial](https://en.wikipedia.org/wiki/Randomized_controlled_trial), possibly with [double-blinding](https://en.wikipedia.org/wiki/Blinded_experiment), to remove the potential effect of confounding variables (this should be used for any serious drug trial), or at least control the potential sampling bias caused by confounding factors (like [case-control studies](https://en.wikipedia.org/wiki/Case%E2%80%93control_study), [cohort studies](https://en.wikipedia.org/wiki/Cohort_study), [stratification](https://en.wikipedia.org/wiki/Stratified_sampling)).
- If not possible (it is not always possible, depending on the design of the experiment and/or the object of the study), measure and log various metadata regarding your subjects/individuals so that you will be able to account for the potential effect of confounding variables in your analysis (cf. [later](#one-factor-anova)).
It generally requires a certain level of technical expertise/knowledge in the considered subject to be able to identify potential confounding factors before the experiments (so that you can monitor and log the corresponding quantities during your experiment).
More details [here](https://en.wikipedia.org/wiki/Confounding#Decreasing_the_potential_for_confounding).
</p>
</details>
---