In this setup, we will use HTRfit to simulate an experimental design involving multiple genotypes and environmental conditions. To do so, we will first need to estimate plausible parameter distributions based on publicly available data.
### Loading public dataset
Counts data from the [SRP217588](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135473) project are embedded in HTRfit package. These data were processed by an RNAseq pipeline using Kallisto to generate count data. This dataset includes count data for 4 distinct environments, 2 genotypes in 6000 genes, providing a valuable resource for estimating distribution parameters.
pub_data <- read.table(file = "inst/extdata/rna_pub_data.tsv", h = T)
rownames(pub_data) <- pub_data$gene_id
pub_data$gene_id <- NULL
pub_metadata <- read.table(file = "inst/extdata/rna_pub_metadata.tsv", h = T)
```
### Fit a model on a public dataset
To estimate plausible parameter distributions, we will fit a generalized linear model (GLM) to the publicly available data. This GLM will allow us to estimate plausible coefficients betaG, betaE, and betaGxE for 6000 genes. The `tidy_pub_fit` object contains the estimate obtain by HTRfit for each coefficient and each genes.
Now let's examine the distributions of the estimated parameters. Means, standard deviation and correlation between these parameters are inferred using the following lines :
To reproduce the design of the public [SRP217588](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135473) project, 2 levels of genotype and 4 levels of environment are initialized using `init_variable()`. By using `add_interaction()`, interaction effects between genotype and environment (betaGxE) are set. For all of them, mean and standard deviation are set concordantly to values obtained from the public dataset coefficient distribution (betaG, betaE, betaGxE).
By using the estimated intercept obtained from the previous analysis of publicly available data as the basal expression level for each gene in the simulated dataset, we ensure that the basal expression levels for our simulated genes are consistent with those in the publicly available data.
## Fit and evaluate a model on simulated RNAseq data
### Fit model
We fit a model to the simulated RNAseq data using the `fitModelParallel()` function. Our model includes terms for genotype, environment, and the interaction between genotype and environment.
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, eval.max=1e5)))
```
### Evaluate fit
Next, we evaluate the fitting results obtained from the simulated RNAseq data by comparing the expected results computed by `evaluation_report()`, with those obtained with `fitModelParallel()` : `l_tmb`.
The `evaluation_report()` function produces an identity plot, offering a visual tool to juxtapose the simulated effects (actual effects) with the model-inferred effects. This graphical representation simplifies assessing how well the model aligns with the values of the simulated effects, enabling a visual analysis of the model's goodness of fit to the simulated data. For a more quantitative evaluation of inference quality, the R-squared (R2) metric can be employed.
The dispersion plot, generated by the `evaluation_report()` function, offers a visual comparison of the dispersion parameters used in the simulation \(dispersion_i\) with those estimated by the model. This graphical representation provides an intuitive way to assess the alignment between the simulated dispersion values and the model-inferred values, enabling a visual evaluation of how well the model captures the underlying data characteristics.
The Receiver Operating Characteristic (ROC) curve is a valuable tool for assessing the performance of classification models, particularly in the context of identifying differentially expressed genes. It provides a graphical representation of the model's ability to distinguish between genes that are differentially expressed and those that are not, by varying the `coeff_threshold` and the `alt_hypothesis` parameters.
The precision-recall curve (PR curve) illustrates the relationship between precision and recall at various classification thresholds. It is particularly valuable in the context of imbalanced data, where one class is significantly more prevalent than the other. Unlike the ROC curve, which can be influenced by class distribution, the PR curve focuses on the model's ability to correctly identify examples of the minority class while maintaining high precision.
The area under the ROC curve (AUC) provides a single metric that summarizes the model's overall performance in distinguishing between differentially expressed and non-differentially expressed genes. A higher AUC indicates better model performance. In the case of the ROC curve, this AUC should be compared to the value of 0.5, representing a random classifier. On the other hand, for the PR curve, we compute the pr_AUC_random, serving as a baseline for comparison.
In addition to evaluating model performance through the AUC for both ROC and PR curves, we provide access to key classification metrics, including Accuracy, Precision, Recall (or Sensitivity), and Specificity. These metrics offer a comprehensive view of the model's classification capabilities. These metrics are computed for each parameter (excluding the intercept when skip_eval_intercept = TRUE), providing detailed insights into individual parameter contributions. Furthermore, an aggregated assessment is performed, considering all parameters (except the intercept by default), offering a perspective on the model's overall classification effectiveness.
## Preparing and optimizing a futur experimental design
Let's consider a scenario where a team is strategizing a new experiment encompassing 100 genotypes and 4 environments. To enhance the precision of their analysis, the team is conducting a power analysis through HTRfit. For this purpose, they are utilizing the experimental design template, `input_var_list`, which draws insights from the publicly available [SRP217588](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135473) dataset. They are adjusting the design levels to align with the specifics of their upcoming study involving 200 genotypes and 4 environments.
#### Fit model with fixed effect : ` ~ genotype + environment + genotype:environment`
We fit a model to the simulated RNAseq data using the `fitModelParallel()` function. Our model includes terms for genotype, environment, and the interaction between genotype and environment. They are all consider as fixed effects.
```{r fit200G, warning = FALSE, message = FALSE}
## -- prepare data & fit a model with mixed effect
#### Fit model with mixed effect : ` ~environment + ( environment | genotype )`
The inclusion of a large number of genotypes in the experimental design allows for the incorporation of random effects in the analysis. Utilizing HTRfit, the team has opted to assess the effectiveness of a model in which the `environment` is defined as a mixed effect (combining fixed and random components), and `genotype` is considered a random effect. The fitting of this mixed model is executed using the `fitModelParallel()` function.