Newer
Older
---
title: "Simulation tutorial"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Simulation tutorial}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(HTRfit)
```
In this documentation, we will use HTRfit to simulate various experimental design involving multiple experimental condition. To do so, we will first need to estimate plausible parameter based on publicly available data.
## Public dataset for parameters inference
In this section, we utilize publicly available RNAseq data from the public project to estimate parameters for simulating experimental designs using HTRfit. The dataset comprises count data for 30 genotypes across 6428 genes with 2-4 replicates per genotype.
```{r loading_pub_data, warning = FALSE, message = FALSE}
## -- public data embedded in HTRfit
pub_data_fn <- system.file("extdata", "pub_data_rna.tsv", package = "HTRfit")
pub_data <- read.table(file = pub_data_fn, header = TRUE)
pub_metadata_fn <- system.file("extdata", "pub_metadata.tsv", package = "HTRfit")
pub_metadata <- read.table(file = pub_metadata_fn, header = TRUE)
```
The public project offers comprehensive count data spanning various genotypes and genes. Key statistics about the dataset include:
```{r metrics_pub, echo = FALSE, warning = FALSE, message = FALSE}
## -- 6428 genes & 111 samples
n_genes <- dim(pub_data)[1]
n_samples <- dim(pub_data)[2]
## -- number of genotype
n_genotypes <- length(unique(pub_metadata$genotype))
## -- mean sequencing depth : ~ 7e6 reads/sample
sequencing_depth <- mean(colSums(pub_data))
dt2display <- data.frame(genes = n_genes, samples = n_samples, genotypes = n_genotypes, "averageReadsSample" = sequencing_depth)
colnames(dt2display) <- c("genes", 'samples', 'genotypes', 'average reads/sample')
rownames(dt2display) <- "GSE175398"
kableExtra::kable(dt2display) %>%
kableExtra::kable_styling(full_width = F, position = "center")
```
##### Model fitting on GSE175398 project RNAseq dataset
We fit a generalized linear model (GLM) to the data to estimate plausible parameter distributions. The GLM allows us to estimate coefficients necessary for simulating experimental conditions. The `tidy_pub_fit` object contains HTRfit's estimates for each coefficient and gene.
```{r fit_pub_data, eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
# -- prepare data
## apply + 1 to each counts
## remove genes i with all(k_ij) < 10
data2fit = prepareData2fit(countMatrix = pub_data,
metadata = pub_metadata,
normalization = 'MRN',
response_name = "kij",
formula = kij ~ genotype,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1)
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- extract results for best genes
tidy_pub_fit <- tidy_results(l_tmb[l_genes])
## get parameters sd ... +> we obtained a sd(genotype) = 0.2462256
## PUB_BASAL_EXPR <- ...
## DISP_PUB <- ...
```
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
```{r load_fit_pub_data, include = FALSE}
## -- hidded in vignette
## -- allows to not evaluate previous chunk (a bit long)
PUB_BASAL_EXPR_fn <- system.file("extdata", "pub_basal_expr.tsv", package = "HTRfit")
PUB_BASAL_EXPR <- read.table(file = PUB_BASAL_EXPR_fn, header = TRUE)$x
DISP_PUB_fn <- system.file("extdata", "pub_dispersion.tsv", package = "HTRfit")
DISP_PUB <- read.table(file = DISP_PUB_fn, header = TRUE)$x
```
To replicate the experimental setup of the [GSE175398](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE175398) project, we initialize 30 levels of genotypes using the init_variable() function. The estimated intercept obtained from the prior analysis of publicly available data serves as the basal expression level for each gene in our simulated dataset, ensuring consistency with the basal expression levels observed in the public data. Moreover, we reuse the gene dispersion observed in the public data for our simulation. To simplify the example, only 100 genes are simulated. Consequently, the sequencing depth is adjusted to accommodate 100 genes by dividing the sequencing depth observed in the public data by 64.28 (6428/100).
```{r mimic_pub, warning = FALSE, message = FALSE}
## -- design to simulate
N_GENES <- 100
MIN_REPLICATES <- 2
MAX_REPLICATES <- 4
SEQ_DEPTH <- mean(colSums(pub_data))/N_GENES
## -- init effects to simulate
input_var_list <- init_variable( name = "genotype", sd = 0.2462256, level = 30)
## -- simulate RNAseq data to mimic public data
mock_data <- mock_rnaseq(input_var_list,
n_genes = N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = PUB_BASAL_EXPR,
sequencing_depth = SEQ_DEPTH,
dispersion = DISP_PUB )
```
##### Fit model to simulated RNAseq data
We fit a model to the simulated RNAseq data using the `fitModelParallel()` function. Our model will fit the 30 coefficients of genotypes we simulated.
```{r fit_model_mimic_pub, warning = FALSE, message = FALSE }
## -- prepare data simulated to fit
## apply + 1 to each counts
## remove genes i with all(k_ij) < 10
data2fit = prepareData2fit(countMatrix = mock_data$counts,
metadata = mock_data$metadata,
normalization = 'MRN',
response_name = "kij",
## -- fit a model to the simulated data
l_tmb <- fitModelParallel(
formula = kij ~ genotype,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5,
eval.max=1e5)))
```
##### Evaluate fit obtained from simulated data
Next, we evaluate the fitting results obtained from the simulated RNAseq data by comparing the expected results computed by `evaluation_report()` with those obtained with `fitModelParallel()`.
```{r evaluation_mimic_pub, warning = FALSE, message = FALSE}
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
list_gene = l_genes,
mock_obj = mock_data,
coeff_threshold = ln_FC_threshold,
alt_hypothesis = alternative_hypothesis )
```
##### Explore evaluation results - Identity plot
The `evaluation_report()` function produces an identity plot, offering a visual tool to juxtapose the simulated effects (actual effects) with the model-inferred effects. This graphical representation simplifies assessing how well the model aligns with the values of the simulated effects, enabling a visual analysis of the model's goodness of fit to the simulated data. For a more quantitative evaluation of inference quality, the R-squared (R^2) and RMSE metrics can be employed.
```{r id_params_mimic_pub , warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
## -- Model params
resSimu$identity$params
```
The Receiver Operating Characteristic (ROC) curve is a valuable tool for assessing the performance of classification models, particularly in the context of identifying differentially expressed genes. It provides a graphical representation of the model's ability to distinguish between genes that are differentially expressed and those that are not, by varying the `coeff_threshold` and the `alt_hypothesis` parameters.
```{r roc_mimic_pub, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=5}
## -- ROC curve
resSimu$roc$params
```
##### Explore evaluation results - Precision-recall curve
The precision-recall curve (PR curve) illustrates the relationship between precision and recall at various classification thresholds. It is particularly valuable in the context of imbalanced data, where one class is significantly more prevalent than the other. Unlike the ROC curve, which can be influenced by class distribution, the PR curve focuses on the model's ability to correctly identify examples of the minority class while maintaining high precision.
```{r pr_curve_mimic_pub, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=5}
## -- precision recall by params
resSimu$precision_recall$params
```
##### Explore evaluation results - classification metrics
The area under the ROC curve (AUC) provides a single metric that summarizes the model's overall performance in distinguishing between differentially expressed and non-differentially expressed genes. A higher AUC indicates better model performance. In the case of the ROC curve, this AUC should be compared to the value of 0.5, representing a random classifier. On the other hand, for the PR curve, we compute the pr_AUC_random, serving as a baseline for comparison.
```{r perf_mimic_pub2, eval=FALSE}
## -- actual/estimate comparison
resSimu$performances$byparams[3, c('description', 'R2', 'RMSE')]
```
```{r perf_mimic_pub2_display, echo=FALSE, fig.align='center', fig.height=4, fig.width=7, message=FALSE, ,warning=FALSE}
## -- actual/estimate comparison
dt2display <- resSimu$performances$byparams[3, c('R2', 'RMSE')]
kableExtra::kable(dt2display, row.names = FALSE ) %>%
kableExtra::kable_styling(full_width = F, position = "center")
```
The classification metrics are computed for each fixed parameters (excluding the intercept when `skip_eval_intercept = TRUE`).
```{r perf_mimic_pub1, eval=FALSE}
## -- classification metrics
resSimu$performances$byparams[3, c('description', 'pr_AUC',
'pr_randm_AUC', 'pr_performance_ratio',
'roc_AUC', 'roc_randm_AUC')]
```
```{r perf_mimic_pub1_display, echo=FALSE, fig.align='center', fig.height=4, fig.width=7, message=FALSE, ,warning=FALSE}
## -- classification metrics
dt2display <- resSimu$performances$byparams[3, c('description', 'pr_AUC', 'pr_randm_AUC',
'pr_performance_ratio','roc_AUC', 'roc_randm_AUC')]
rownames(dt2display) <- NULL
kableExtra::kable(dt2display, row.names = FALSE) %>%
kableExtra::kable_styling(full_width = F, position = "center")
## Estimating `sd(genotype)` with random effects
In this section, we delve into the process of estimating the standard deviation of genotype effects (`sd(genotype)`) using random effects in HTRfit. By treating genotype as a random effect (`kij ~ (1 | genotype)`), HTRfit provides estimates for both the `(Intercept)` and `sd_(Intercept)` for each gene.
```{r fit_model_randm_mimic_pub, warning = FALSE, message = FALSE }
## fit genotype as random effect
l_tmb <- fitModelParallel(
formula = kij ~ (1 | genotype),
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5,
eval.max=1e5)))
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
list_gene = l_genes,
mock_obj = mock_data,
coeff_threshold = ln_FC_threshold,
alt_hypothesis = alternative_hypothesis )
```
##### Explore evaluation results - Violin plot
To evaluate the consistency between estimated and actual values of the random effect, we employ a violin plot. This graphical representation showcases how well the `sd_(Intercept)` estimates align with the simulated `sd(genotype)` value (0.2462256) across genes.
```{r id_params_randm_mimic_pub , warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=5}
## -- violin
resSimu$violin_rand_params + ggplot2::ylim(c(0, 0.5))
```
##### Strategies for improving `sd(genotype)` estimation
One approach to enhance the estimation of `sd(genotype)` involves increasing the number of genotype levels. By expanding the number of genotype levels in the simulated data, the model becomes more adept at capturing the inherent variability in genotype effects.
Notice that all other parameters are conserved as before (sequencing depth, replicates number, number of genes, basal gene expression). Only the number of genotype levels is increased to 300 instead of 30.
```{r id_params_randm_mimic_pub, warning=FALSE, message=FALSE}
## -- increase number of level to 300
input_var_list <- init_variable( name = "genotype", sd = 0.2462256, level = 300)
## -- simulate RNAseq data
mock_data <- mock_rnaseq(input_var_list,
n_genes = N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = PUB_BASAL_EXPR,
sequencing_depth = SEQ_DEPTH,
dispersion = DISP_PUB )
## -- prepare data
## apply + 1 to each counts
## remove genes i with all(k_ij) < 10
data2fit = prepareData2fit(countMatrix = mock_data$counts,
metadata = mock_data$metadata,
normalization = 'MRN',
response_name = "kij",
## fit random effect
l_tmb <- fitModelParallel(
formula = kij ~ (1 | genotype),
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5,
eval.max=1e5)))
## -- eval params for Wald test on fixed
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
list_gene = l_genes,
mock_obj = mock_data,
coeff_threshold = ln_FC_threshold,
alt_hypothesis = alternative_hypothesis )
```
Following the adjustment in genotype levels, we reevaluate the model's performance using the `evaluation_report()` function. The resulting violin plot demonstrates the improvement in estimating `sd(genotype)`, as evidenced by the reduction in RMSE.
```{r id_params , warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=5}
## -- Model params
resSimu$violin_rand_params + ggplot2::ylim(c(0, 0.5))
```
It's important to note that since we only fit random effects, classification metrics are not available in this case.
In this section, we demonstrate how HTRfit can simulate multiple main effects by "piping" (`%>%`) the `init_variable()` function. To illustrate, we introduce a second main effect, `environment`, with two levels. This allows us to explore the combined effects of genotype and environment on gene expression.
To maintain consistency with the public dataset from the public project, all effects are initialized around the observed value of 0.2462256. This ensures that the simulated data closely mirrors real-world observations. Additionally, gene dispersion, basal expression, and sequencing depth observed in the public data are preserved in our simulation. For simplicity, we simulate data for only 100 genes, adjusting the sequencing depth accordingly.
```{r simu_mainX2, warning = FALSE, message = FALSE}
N_GENES <- 100
MIN_REPLICATES <- 4
MAX_REPLICATES <- 4
SEQ_DEPTH <- mean(colSums(pub_data))/N_GENES
## -- init design
input_var_list <- init_variable( name = "genotype", sd = 0.2462256, level = 30) %>%
init_variable( name = "environment", sd = 0.2462256, level = 2 )
## -- simulate RNAseq data
mock_data <- mock_rnaseq(input_var_list,
n_genes = N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = PUB_BASAL_EXPR,
sequencing_depth = SEQ_DEPTH,
dispersion = DISP_PUB )
```
We fit a model to the simulated RNAseq data, incorporating both genotype and environment as main effects. Subsequently, we evaluate the model's performance using metrics and graphical representations available in HTRfit.
```{r simu_mainX2_fit, warning = FALSE, message = FALSE}
## -- prepare data
## apply + 1 to each counts
## remove genes i with all(k_ij) < 10
data2fit = prepareData2fit(countMatrix = mock_data$counts,
metadata = mock_data$metadata,
normalization = 'MRN',
response_name = "kij",
## fit complex model
l_tmb <- fitModelParallel(
formula = kij ~ genotype + environment,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5,
eval.max=1e5)))
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
list_gene = l_genes,
mock_obj = mock_data,
coeff_threshold = ln_FC_threshold,
alt_hypothesis = alternative_hypothesis )
```
##### Explore evaluation results - Identity plot
The identity plot provides a visual tool to assess how well the model aligns with the values of the simulated effects. It enables a straightforward analysis of the model's goodness of fit to the simulated data.
```{r id_params_mainX2 , warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
## -- Model params
resSimu$identity$params
```
The Receiver Operating Characteristic (ROC) curve evaluates the model's ability to distinguish between differentially expressed genes. By varying the classification threshold, this curve offers insights into the model's discriminatory power.
```{r roc_mainX2, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
## -- ROC curve
resSimu$roc$params
```
##### Explore evaluation results - Precision-recall curve
The precision-recall curve illustrates the trade-off between precision and recall at different classification thresholds. Particularly useful for imbalanced datasets, it highlights the model's ability to correctly identify examples of the minority class while maintaining high precision.
```{r pr_curve_mainX2, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
## -- precision recall by params
resSimu$precision_recall$params
```
##### Explore evaluation results - Metrics summary
Finally, we summarize the model's performance using various metrics, including R-squared (R²) and root mean square error (RMSE) for regression tasks, as well as area under the curve (AUC) for classification tasks.
```{r perf_mainX2, eval=FALSE}
## -- actual/estimate comparison
resSimu$performances$byparams[3:4, c('description', 'R2', 'RMSE')]
```
```{r perf_mainX2_display, echo=FALSE, fig.align='center', fig.height=4, fig.width=7, message=FALSE, ,warning=FALSE}
## -- actual/estimate comparison
dt2display <- resSimu$performances$byparams[3:4, c('R2', 'RMSE')]
rownames(dt2display) <- NULL
kableExtra::kable(dt2display, row.names = FALSE) %>%
kableExtra::kable_styling(full_width = F, position = "center")
```
```{r perf_mainX2_class, eval=FALSE}
## -- classification metrics
resSimu$performances$byparams[3:4, c('description', 'pr_AUC',
'pr_randm_AUC', 'pr_performance_ratio',
'roc_AUC', 'roc_randm_AUC')]
```
```{r perf_mainX2_class_display, echo=FALSE, fig.align='center', fig.height=4, fig.width=7, message=FALSE, ,warning=FALSE}
## -- classification metrics
dt2display <- resSimu$performances$byparams[3:4, c('description', 'pr_AUC', 'pr_randm_AUC',
'pr_performance_ratio','roc_AUC', 'roc_randm_AUC')]
rownames(dt2display) <- NULL
kableExtra::kable(dt2display, row.names = FALSE) %>%
kableExtra::kable_styling(full_width = F, position = "center")
## Estimating `environment` and `sd(genotype)`
In this section, we utilize a mixed model approach to estimate the fixed effect of `environment` and the variability of genotype using HTRfit. By specifying the model formula as `kij ~ environment + (1 | genotype)`, we obtain estimates for the fixed effect of `environment` for each gene, along with an estimate of the variability of genotype (referred to as `sd(Intercept)` by the model).
The mixed model incorporates both the fixed effect of `environment` and the random effect of `genotype`, allowing us to capture the influence of environmental factors on gene expression while accounting for genotype variability. We fit this model to the simulated RNAseq data and evaluate its performance using various metrics.
```{r fit_mainX2_mixed, warning = FALSE, message = FALSE}
## fit complex model
l_tmb <- fitModelParallel(
formula = kij ~ environment + ( 1 | genotype ) ,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5,
eval.max=1e5)))
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
list_gene = l_genes,
mock_obj = mock_data,
coeff_threshold = ln_FC_threshold,
alt_hypothesis = alternative_hypothesis )
```
Given the inclusion of a fixed effect (`environment`), the ROC curve provides insights into the model's ability to discriminate between differentially expressed genes based on environmental factors. By varying the classification threshold, we assess the model's discriminatory power in distinguishing between gene expression patterns.
```{r roc_mainX2_mixed, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=5}
## -- ROC curve
resSimu$roc$params
```
##### Explore evaluation results - Precision-recall curve
Similarly, the precision-recall curve evaluates the model's performance in identifying differentially expressed genes across various classification thresholds. This curve offers valuable insights, particularly in scenarios with imbalanced datasets, by highlighting the trade-off between precision and recall.
```{r pr_curve_mainX2_mixed, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=5}
## -- precision recall by params
resSimu$precision_recall$params
```
##### Explore evaluation results - Violin plot
We also visualize the distribution of random parameters estimation. The violin plot provides a graphical representation of proximity between the actual `sd(genotype)` values and their corresponding estimations provided by the model.
```{r violin_mainX2_mixed , warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=5}
## -- Model params
resSimu$violin_rand_params + ggplot2::ylim(c(0, 0.5))
```
##### Explore evaluation results - Metrics summary
Finally, we summarize the model's performance using key metrics, including R-squared (R²) and root mean square error (RMSE) for regression tasks, as well as area under the curve (AUC) for classification tasks.
```{r perf_mainX2_mixed, eval=FALSE}
## -- actual/estimate comparison
resSimu$performances$byparams[3:4, c('description', 'R2', 'RMSE')]
```
```{r perf_mainX2_mixed_display, echo=FALSE, fig.align='center', fig.height=4, fig.width=7, message=FALSE, ,warning=FALSE}
## -- actual/estimate comparison
dt2display <- resSimu$performances$byparams[3:4, c('description','R2', 'RMSE')]
kableExtra::kable(dt2display, row.names = FALSE) %>%
kableExtra::kable_styling(full_width = F, position = "center")
```
```{r perf_mainX2_mixed_class, eval=FALSE}
## -- classification metrics
resSimu$performances$byparams[3, c('description', 'pr_AUC',
'pr_randm_AUC', 'pr_performance_ratio',
'roc_AUC', 'roc_randm_AUC')]
```
```{r perf_mainX2_mixed_class_display, echo=FALSE, fig.align='center', fig.height=4, fig.width=7, message=FALSE, ,warning=FALSE}
## -- classification metrics
dt2display <- resSimu$performances$byparams[3, c('description', 'pr_AUC', 'pr_randm_AUC',
'pr_performance_ratio','roc_AUC', 'roc_randm_AUC')]
kableExtra::kable(dt2display, row.names = FALSE) %>%
kableExtra::kable_styling(full_width = F, position = "center")
```
## Simulating interactions with HTRfit
HTRfit offers the capability to initialize and simulate complex experimental designs involving multiple variables with varying numbers of levels, as well as interactions between previously initialized main effects. To illustrate this capability, let's simulate an RNAseq experiment with 30 genotype levels and 2 environmental conditions. Using the `add_interaction()` function, interactions between each level can also be simulated.
To maintain consistency with observed effects in previous studies, all effects in our simulation are initialized around 0.2462256, as observed in the public data from the public project. Gene dispersion, basal expression, and sequencing depth observed in the public data are conserved in our simulation. For simplicity, only 100 genes are simulated, and consequently, the sequencing depth is adjusted to accommodate 100 genes.
```{r complex_design, warning = FALSE, message = FALSE}
N_GENES <- 100
MIN_REPLICATES <- 4
MAX_REPLICATES <- 4
SEQ_DEPTH <- mean(colSums(pub_data))/N_GENES
## -- init design
input_var_list <- init_variable( name = "genotype", sd = 0.2462256, level = 30) %>%
init_variable( name = "environment", sd = 0.2462256, level = 2 ) %>%
add_interaction(between_var = c("genotype", "environment"), sd = 0.2462256)
## -- simulate RNAseq data
mock_data <- mock_rnaseq(input_var_list,
n_genes = N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = PUB_BASAL_EXPR,
sequencing_depth = SEQ_DEPTH,
dispersion = DISP_PUB )
## -- prepare data
## apply + 1 to each counts
## remove genes i with all(k_ij) < 10
data2fit = prepareData2fit(countMatrix = mock_data$counts,
metadata = mock_data$metadata,
normalization = 'MRN',
response_name = "kij",
## fit complex model
l_tmb <- fitModelParallel(
formula = kij ~ genotype + environment + genotype:environment,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5,
eval.max=1e5)))
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
list_gene = l_genes,
mock_obj = mock_data,
coeff_threshold = ln_FC_threshold,
alt_hypothesis = alternative_hypothesis )
```
##### Identity Plot, ROC Curve, and Precision-Recall Curve
These plots provide visual assessments of how well the model aligns with the values of the simulated effects and its ability to distinguish between differentially expressed genes. Each curve represents a different aspect of the model's performance, including its goodness of fit and classification accuracy.
```{r id_params_complex_design , warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
## -- Model params
resSimu$identity$params
## -- ROC curve
resSimu$roc$params
## -- precision recall by params
resSimu$precision_recall$params
```
The following table summarizes the model's performance metrics, including R-squared (R2), root mean squared error (RMSE), precision-recall area under the curve (PR AUC), and receiver operating characteristic area under the curve (ROC AUC).
```{r perf_complex_design, eval=FALSE}
resSimu$performances$byparams[3:5, c('description', 'R2', 'RMSE')]
```
```{r perf_complex_design_display, echo=FALSE, fig.align='center', fig.height=4, fig.width=7, message=FALSE, ,warning=FALSE}
## -- precision-recall curve
kableExtra::kable(resSimu$performances$byparams[3:5, c('description', 'R2', 'RMSE')], row.names = FALSE) %>%
kableExtra::kable_styling(full_width = F, position = "center")
resSimu$performances$byparams[3:5, c('description', 'pr_AUC',
'pr_randm_AUC', 'pr_performance_ratio',
'roc_AUC', 'roc_randm_AUC')]
```
```{r perf_complex_design_class_display, echo=FALSE, fig.align='center', fig.height=4, fig.width=7, message=FALSE, ,warning=FALSE}
## -- precision-recall curve
kableExtra::kable(resSimu$performances$byparams[3:5, c('description', 'pr_AUC', 'pr_randm_AUC',
'pr_performance_ratio','roc_AUC', 'roc_randm_AUC')], row.names = FALSE) %>%
kableExtra::kable_styling(full_width = F, position = "center")
## Mixed model for estimating `sd(genotype:env)`
By employing the following mixed model `kij ~ environment + ( environment | genotype )`, HTRfit provides estimates for a fixed effect the `environment` for each gene. Additionally, we obtain in results an estimate of the `sd(genotype)` (referred to as `sd_(Intercept)` by the model), and the `sd(genotype:environment)` (referred to as `sd_(environment)` by the model).
```{r complex_mixed_model, warning = FALSE, message = FALSE }
## fit complex model
l_tmb <- fitModelParallel(
formula = kij ~ environment + ( environment | genotype ) ,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5,
eval.max=1e5)))
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
list_gene = l_genes,
mock_obj = mock_data,
coeff_threshold = ln_FC_threshold,
alt_hypothesis = alternative_hypothesis )
```
Both the ROC curve and the precision-recall curve provide insights into the classification performance of the model, particularly regarding its ability to distinguish between differentially expressed genes and non-differentially expressed genes.
ROC curve
```{r roc_complex_mixed_model, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=5}
## -- ROC curve
resSimu$roc$params
## -- precision recall by params
resSimu$precision_recall$params
```
The violin plot visualizes the distribution of random parameter estimates, providing insights into the variability of the `sd(genotype:environment)` (referred to as `sd_(environment)` by the model) , and `sd(genotype)` (referred to as `sd_(Intercept)` by the model) estimates compared to the actual values.
You can notice here that Additionally, `cor__(Intercept).env` is returned by the model. This random parameter refer to the correlation between the basal expression of the gene and the environment effect. As we did not observed such correlation in public data, this metric cannot bet set with HTRfit and is alway set to 0.
Additionally, the model returns `cor__(Intercept).env`, which represents the correlation between the basal expression of the gene and the environmental effect. However, as no such correlation was observed in the public data, this metric cannot be set in HTRfit simulation and is always expected to be 0. The violin plot allows to measure the proximity between the actual and inferred correlations
```{r id_params_complex_mixed_model , warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
## -- Model params
resSimu$violin_rand_params + ggplot2::ylim(c(0, 0.5))
```
The following table summarizes the model's performance metrics, including R-squared (R2), root mean squared error (RMSE), precision-recall area under the curve (PR AUC), and receiver operating characteristic area under the curve (ROC AUC).
```{r perf_complex_mixed_model, eval=FALSE}
## -- actual/estimate comparison
resSimu$performances$byparams[c(2,4,5,6), c('description', 'R2', 'RMSE')]
```
```{r perf_complex_mixed_model_display, echo=FALSE, fig.align='center', fig.height=4, fig.width=7, message=FALSE, ,warning=FALSE}
## -- actual/estimate comparison
dt2display <- resSimu$performances$byparams[c(2,4,5,6), c('description','R2', 'RMSE')]
kableExtra::kable(dt2display, row.names = FALSE) %>%
kableExtra::kable_styling(full_width = F, position = "center")
```
```{r perf_complex_mixed_model_class, eval=FALSE}
## -- classification metrics
resSimu$performances$byparams[4, c('description', 'pr_AUC',
'pr_randm_AUC', 'pr_performance_ratio',
'roc_AUC', 'roc_randm_AUC')]
```
```{r perf_complex_mixed_model_class_display, echo=FALSE, fig.align='center', fig.height=4, fig.width=7, message=FALSE, ,warning=FALSE}
## -- classification metrics
dt2display <- resSimu$performances$byparams[4, c('description', 'pr_AUC', 'pr_randm_AUC',
'pr_performance_ratio','roc_AUC', 'roc_randm_AUC')]
kableExtra::kable(dt2display, row.names = FALSE) %>%
kableExtra::kable_styling(full_width = F, position = "center")
```
## Simulating triple interaction with HTRfit
HTRfit's capability extends to simulating triple interactions to evaluate the model's performance in capturing such complex relationships. To demonstrate this, we introduce a categorical variable, `age`, assuming that the transcriptome evolves with the age of organisms. We initialize this variable with three levels.
```{r design_mainX3, warning = FALSE, message = FALSE}
N_GENES <- 100
MIN_REPLICATES <- 4
MAX_REPLICATES <- 4
SEQ_DEPTH <- mean(colSums(pub_data))/N_GENES
## -- init design
input_var_list <- init_variable( name = "genotype", sd = 0.2462256, level = 30) %>%
init_variable( name = "env", sd = 0.2462256, level = 2 ) %>%
init_variable( name = "age", sd = 0.2462256, level = 3 ) %>%
add_interaction(between_var = c("genotype", "env"), sd = 0.2462256) %>%
add_interaction(between_var = c("genotype", "age"), sd = 0.2462256) %>%
add_interaction(between_var = c("env", "age"), sd = 0.2462256) %>%
add_interaction(between_var = c("env", "genotype", "age"), sd = 0.2462256)
## -- simulate RNAseq data
mock_data <- mock_rnaseq(input_var_list,
n_genes = N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = PUB_BASAL_EXPR,
sequencing_depth = SEQ_DEPTH,
dispersion = DISP_PUB )
## -- prepare data
## apply + 1 to each counts
## remove genes i with all(k_ij) < 10
data2fit = prepareData2fit(countMatrix = mock_data$counts,
metadata = mock_data$metadata,
normalization = 'MRN',
response_name = "kij",
model2fit <- kij ~ genotype + env + age +
genotype:env + genotype:age + env:age +
genotype:age:env
## fit complex model
l_tmb <- fitModelParallel(
formula = model2fit,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5,
eval.max=1e5)))
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
list_gene = l_genes,
mock_obj = mock_data,
coeff_threshold = ln_FC_threshold,
alt_hypothesis = alternative_hypothesis )
```
The metrics returned by `evaluation_report()` indicate that capturing triple interactions is significantly more challenging than main and simple interaction effects (between two variables). The metrics comparing actual and inferred values are consistently lower for triple interactions, as well as the classification metrics (AUC)
```{r perf_mainX3, eval=FALSE}
## -- actual/estimate comparison
resSimu$performances$byparams[c(2,4,6,5,7,8,9), c('description','R2', 'RMSE')]
```
```{r perf_mainX3_display, echo=FALSE, fig.align='center', fig.height=4, fig.width=7, message=FALSE, ,warning=FALSE}
## -- actual/estimate comparison
dt2display <- resSimu$performances$byparams[c(2,4,6,5,7,8,9), c('description','R2', 'RMSE')]
kableExtra::kable(dt2display, row.names = FALSE) %>%
kableExtra::kable_styling(full_width = F, position = "center")
```
```{r perf_mainX3_class, eval=FALSE}
## -- classification metrics
resSimu$performances$byparams[c(2,4,6,5,7,8,9), c('description', 'pr_AUC',
'pr_randm_AUC', 'pr_performance_ratio',
'roc_AUC', 'roc_randm_AUC')]
```
```{r perf_mainX3_class_display, echo=FALSE, fig.align='center', fig.height=4, fig.width=7, message=FALSE, ,warning=FALSE}
## -- classification metrics
dt2display <- resSimu$performances$byparams[c(2,4,6,5,7,8,9), c('description', 'pr_AUC', 'pr_randm_AUC',
'pr_performance_ratio','roc_AUC', 'roc_randm_AUC')]
kableExtra::kable(dt2display, row.names = FALSE) %>%
kableExtra::kable_styling(full_width = F, position = "center")
##### Improving triple interaction estimation
To enhance the estimation of `genotype:age:env` triple interaction, one strategy is to increase the number of replicates in our simulation. In this scenario, we use 25 replicates for each experimental condition.
```{r design_mainX3_rep, warning = FALSE, message = FALSE}
MIN_REPLICATES <- 25
MAX_REPLICATES <- 25
## -- simulate RNAseq data
mock_data <- mock_rnaseq(input_var_list,
n_genes = N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = PUB_BASAL_EXPR,
sequencing_depth = SEQ_DEPTH,
dispersion = DISP_PUB )
## -- prepare data
## apply + 1 to each counts
## remove genes i with all(k_ij) < 10
data2fit = prepareData2fit(countMatrix = mock_data$counts,
metadata = mock_data$metadata,
normalization = 'MRN',
response_name = "kij",
model2fit <- kij ~ genotype + env + age +
genotype:env + genotype:age + env:age +
genotype:age:env
## fit complex model
l_tmb <- fitModelParallel(
formula = model2fit,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5,
eval.max=1e5)))
## -- eval params for Wald test
ln_FC_threshold <- log(1.25) ## 25% of expression change
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
list_gene = l_genes,
mock_obj = mock_data,
coeff_threshold = ln_FC_threshold,
alt_hypothesis = alternative_hypothesis )
```
We observe a significant improvement in performance across all parameters, including the `genotype:age:env` parameter. The inference of `genotype:age:env` is closer to the actual values (R2, RMSE), and the model's ability to determine whether a parameter is differentially expressed or not (with a 25% expression change) is enhanced (pr_AUC, AUC).
```{r perf_mainX3_rep, eval=FALSE}
## -- actual/estimate comparison
resSimu$performances$byparams[c(2,4,6,5,7,8,9), c('description','R2', 'RMSE')]
```
```{r perf_mainX3_rep_display, echo=FALSE, fig.align='center', fig.height=4, fig.width=7, message=FALSE, ,warning=FALSE}
## -- actual/estimate comparison
dt2display <- resSimu$performances$byparams[c(2,4,6,5,7,8,9), c('description','R2', 'RMSE')]
kableExtra::kable(dt2display, row.names = FALSE) %>%
kableExtra::kable_styling(full_width = F, position = "center")
```
```{r perf_mainX3_rep_class, eval=FALSE}
## -- classification metrics
resSimu$performances$byparams[c(2,4,6,5,7,8,9), c('description', 'pr_AUC',
'pr_randm_AUC', 'roc_AUC',
'roc_randm_AUC')]
```
```{r perf_mainX3_rep_class_display, echo=FALSE, fig.align='center', fig.height=4, fig.width=7, message=FALSE, ,warning=FALSE}
## -- classification metrics
dt2display <- resSimu$performances$byparams[c(2,4,6,5,7,8,9), c('description', 'pr_AUC', 'pr_randm_AUC','roc_AUC', 'roc_randm_AUC')]
kableExtra::kable(dt2display, row.names = FALSE) %>%
kableExtra::kable_styling(full_width = F, position = "center")
Currently, HTRfit's evaluation does not include a comparison between a simulation and a fit capable of capturing the `sd(genotype:age:env)` parameter. However, feel free to implement this feature.
## Advance user
The flexibility of HTRfit simulation framework in term of input parameters allows for iteratively monitoring the effect of an input parameters on the model fit.
##### Fixed model
To reproduce following graphs, follow this instructions
##### Mixed model
To reproduce following graphs, follow this instructions
##### Fixed model
To reproduce following graphs, follow this instructions
##### Mixed model
To reproduce following graphs, follow this instructions
##### Monitoring number of levels on random effect accuracy