<!-- WARNING - This vignette is generated by {fusen} from /dev/flat_full.Rmd: do not edit by hand -->
<!-- Run this 'development' chunk -->
<!-- Store every call to library() that you need to explore your functions -->
<!--
You need to run the 'description' chunk in the '0-dev_history.Rmd' file before continuing your code there.
If it is the first time you use {fusen}, after 'description', you can directly run the last chunk of the present file with inflate() inside.
-->
# High-Throughput RNA-seq model fit
In the realm of RNAseq analysis, various key experimental parameters play a crucial role in influencing the statistical power to detect expression changes. Parameters such as sequencing depth, the number of replicates, and others are expected to impact this statistical power. To help with the selection of optimal values for these experimental parameters, we introduce a comprehensive statistical framework known as **HTRfit**, underpinned by computational simulation. **HTRfit** serves as a versatile tool, not only for simulation but also for conducting differential expression analysis. It facilitates this analysis by fitting Generalized Linear Models (GLMs) with multiple variables, which could encompass genotypes, environmental factors, and more. These GLMs are highly adaptable, allowing the incorporation of fixed effects, mixed effects, and interactions between variables.
# Initializing variables for simulation
The `init_variable()` function is used for defining the variables in your experimental design. Sizes of effects for each variable and interaction term can be defined in two different ways: 1) The user can manually sets values of all levels of a variable, in which case the effects are necessarily considered fixed in the model; 2) The effects can be randomly picked in a normal distribution with mean and standard deviation defined by the user, in which case the user can decide whether the effects are considered fixed or random in the model.
## Manually init a single variable
The `init_variable()` function allows for precise control over the variables in your experimental design.
In this example, we manually initialize **varA** with specific size effects (mu) for three levels.
In this section, you will explore how to generate RNAseq data based on the previously defined input variables. The `mock_rnaseq()` function enables you to manage parameters in your RNAseq design, including number of genes, minimum and maximum number of replicates within your experimental setup, sequencing depth, basal expression of each gene, and dispersion of gene expression used for simulating counts.
## -- simulate RNAseq data based on list_var, minimum input required
## -- number of replicate randomly defined between MIN_REP and MAX_REP
mock_data <- mock_rnaseq(list_var, N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES)
## -- simulate RNAseq data based on list_var, minimum input required
## -- Same number of replicates between conditions
mock_data <- mock_rnaseq(list_var, N_GENES,
min_replicates = MAX_REPLICATES,
max_replicates = MAX_REPLICATES)
```
## Scaling genes counts based on sequencing depth
Sequencing depth is a critical parameter affecting the statistical power of an RNAseq analysis. With the `sequencing_depth` option in the `mock_rnaseq()` function, you have the ability to control this parameter.
SEQ_DEPTH = c(100000, 5000000, 10000000)## -- Possible number of reads/sample
SEQ_DEPTH = 10000000 ## -- all samples have same number of reads
mock_data <- mock_rnaseq(list_var, N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
sequencing_depth = SEQ_DEPTH)
```
## Set dispersion of gene expression
The dispersion parameter ($dispersion_i$), characterizes the relationship between the variance of the observed read counts and its mean value. In simple terms, it quantifies how much we expect observed counts to deviate from the mean value. You can specify the dispersion for individual genes using the dispersion parameter.
DISP = runif(N_GENES, 0, 1000) ## -- Dispersion can vary between genes
mock_data <- mock_rnaseq(list_var, N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
dispersion = DISP )
```
## Set basal gene expression
The basal gene expression parameter, defined using the basal_expression option, allows you to control the fundamental baseline of expression level of genes. When a single value is specified, the basal expression of all genes is the same and the effects of variables defined in list_var are applied from this basal expression level. When several values are specified, the basal expression of each gene is picked randomly among these values (with replacement). The effects of variables are then applied from the basal expression of each gene. This represents the fact that expression levels can vary among genes even when not considering the effects of the defined variables (for example, some genes are constitutively more highly expressed than others because they have stronger basal promoters).
In this modeling framework, counts denoted as $K_{ij}$ for gene i and sample j are generated using a negative binomial distribution. The negative binomial distribution considers a fitted mean $\mu_{ij}$ and a gene-specific dispersion parameter $dispersion_i$.
The fitted mean $\mu_{ij}$ is determined by a parameter, $q_{ij}$, which is proportionally related to the sum of all effects specified using `init_variable()` or `add_interaction()`. If basal gene expressions are provided, the $\mu_{ij}$ values are scaled accordingly using the gene-specific basal expression value ($bexpr_i$).
Furthermore, the coefficients $\beta_i$ represent the natural logarithm fold changes for gene i across each column of the model matrix X. The dispersion parameter $dispersion_i$ plays a crucial role in defining the relationship between the variance of observed counts and their mean value. In simpler terms, it quantifies how far we expect observed counts to deviate from the mean value for each genes.
# Fitting models
## Prepare data for fitting
The `prepareData2fit()` function serves the purpose of converting the counts matrix and sample metadata into a dataframe that is compatible with downstream **HTRfit** functions designed for model fitting. This function also includes an option to perform median ratio normalization on the data counts.
The `fitModelParallel()` function enables independent model fitting for each gene. The number of threads used for this process can be controlled by the `n.cores` parameter.
**HTRfit** uses the **glmmTMB** functions for model fitting algorithms. This choice allows for the utilization of random effects within your formula design. For further details on how to specify your model, please refer to the [mixed model documentation](https://rdrr.io/cran/glmmTMB/man/glmmTMBControl.html).
The function provides precise control over model settings for fitting optimization, including options for specifying the [model family](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/family) and [model control setting](https://rdrr.io/cran/glmmTMB/man/glmmTMBControl.html). By default, a Gaussian family model is fitted, but for RNA-seq data, it is highly recommended to specify `family = glmmTMB::nbinom2(link = "log")`.
## Extracts a tidy result table from a list tmb object
The tidy_results function extracts a data frame containing estimates of ln(fold changes), standard errors, test statistics, p-values, and adjusted p-values for fixed effects. Additionally, it provides access to correlation terms and standard deviations for random effects, offering a detailed view of HTRfit modeling results.
Visualizing fit metrics is essential for evaluating your models. Here, we show you how to generate various plots to assess the quality of your models. You can explore all metrics or focus on specific aspects like dispersion and log-likelihood.
Utilizing the `anovaParallel()` function enables you to perform model selection by assessing the significance of the fixed effects. You can also include additional parameters like type. For more details, refer to [car::Anova](https://rdrr.io/cran/car/man/Anova.html).
l_anova <- anovaParallel(list_tmb = l_tmb, type = "III" )
```
# Simulation evaluation report
In this section, we delve into the evaluation of your simulation results. The `evaluation_report()` function provide valuable insights into the performance of your simulated data and models.
The `evaluation_report()` function produces an identity plot, offering a visual tool to juxtapose the simulated effects (actual effects) with the model-inferred effects. This graphical representation simplifies assessing how well the model aligns with the values of the simulated effects, enabling a visual analysis of the model's goodness of fit to the simulated data. For a more quantitative evaluation of inference quality, the R-squared (R2) metric can be employed.
The dispersion plot, generated by the `evaluation_report()` function, offers a visual comparison of the dispersion parameters used in the simulation \(\alpha_i\) with those estimated by the model. This graphical representation provides an intuitive way to assess the alignment between the simulated dispersion values and the model-inferred values, enabling a visual evaluation of how well the model captures the underlying data characteristics.
## ROC curve
The Receiver Operating Characteristic (ROC) curve is a valuable tool for assessing the performance of classification models, particularly in the context of identifying differentially expressed genes. It provides a graphical representation of the model's ability to distinguish between genes that are differentially expressed and those that are not, by varying the `coeff_threshold` and the `alt_hypothesis` parameters.
The precision-recall curve (PR curve) illustrates the relationship between precision and recall at various classification thresholds. It is particularly valuable in the context of imbalanced data, where one class is significantly more prevalent than the other. Unlike the ROC curve, which can be influenced by class distribution, the PR curve focuses on the model's ability to correctly identify examples of the minority class while maintaining high precision.
The area under the ROC curve (AUC) provides a single metric that summarizes the model's overall performance in distinguishing between differentially expressed and non-differentially expressed genes. A higher AUC indicates better model performance. In the case of the ROC curve, this AUC should be compared to the value of 0.5, representing a random classifier. On the other hand, for the PR curve, we compute the pr_AUC_random, serving as a baseline for comparison.
In addition to evaluating model performance through the AUC for both ROC and PR curves, we provide access to key classification metrics, including Accuracy, Precision, Recall (or Sensitivity), and Specificity. These metrics offer a comprehensive view of the model's classification capabilities. These metrics are computed for each parameter (excluding the intercept when skip_eval_intercept = TRUE), providing detailed insights into individual parameter contributions. Furthermore, an aggregated assessment is performed, considering all parameters (except the intercept by default), offering a perspective on the model's overall classification effectiveness.
**HTRfit** offers a wrapper for **DESeq2** outputs. This functionality allows users to seamlessly integrate the results obtained from **DESeq2** into the **HTRfit** analysis pipeline. By doing so, you can readily compare the performance of **HTRfit** with **DESeq2** on your RNAseq data. This comparative analysis aids in determining which tool performs better for your specific research goals and dataset
In this section, we showcase the assessment of model performance on a subset of genes. Specifically, we focus on evaluating genes with low expression levels, identified by their basal expression ($bexpr_i$) initialized below 0 during the simulation.
**HTRfit** offers a versatile simulation framework capable of fitting various types of models involving mixed effects, thanks to its implementation of **glmmTMB**. By combining the functionalities of `init_variable()` and `add_interaction()`, **HTRfit** enables the simulation of complex experimental designs. As of now, HTRfit supports the evaluation of *Type I* mixed models. In this context, *Type I* models are defined as those that follow the structure:
- `~ varA + (1 | varB)` : where `varA` is defined as fixed effect and `varB` as random effect
- `~ varA + (varA | varB)`: where `varA` is defined as mixed effect (fixed + random) and `varB` as random effect.
Models not conforming to this specific form cannot be evaluated using **HTRfit's** current implementation. Nonetheless, you are welcome to extend its capabilities by implementing support for additional model types.