README.md

# High-Throughput RNA-seq model fit

## Why use HTRfit

HTRfit provides a robust statistical framework that allows you to investigate the essential experimental parameters influencing your ability to detect expression changes. Whether you're examining sequencing depth, the number of replicates, or other critical factors, HTRfit's computational simulation is your go-to solution.

Furthermore, by enabling the inclusion of fixed effects, mixed effects, and interactions in your RNAseq data analysis, HTRfit provides the flexibility needed to conduct your differential expression analysis effectively. HTRfit is particularly adapted for the analysis of large number of samples, or highly multiplexed experiments.


- [Installation](#installation)
- [CRAN packages dependencies](#cran-packages-dependencies)
- [Docker](#docker)
- [HTRfit simulation workflow](#htrfit-simulation-workflow)
- [Getting started](#getting-started)

## Installation

#### method A:  

To install the latest version of HTRfit, run the following in your R console :
```
if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
remotes::install_git("https://gitbio.ens-lyon.fr/aduvermy/HTRfit")
```

#### method B:

You also have the option to download a release directly from the [HTRfit release page](https://gitbio.ens-lyon.fr/aduvermy/HTRfit/-/tags). Once you've downloaded the release, simply launch following command.

```
## -- Example using the HTRfit-v2.0.0 release
install.packages('HTRfit-v2.0.0.tar.gz', repos = NULL, type='source')

```

When dependencies are met, installation should take a few minutes.


## CRAN packages dependencies

The following depandencies are required:

```
## -- required
install.packages(c('parallel', 'data.table', 'ggplot2', 'gridExtra', 'glmmTMB', 'magrittr', 'MASS', 'reshape2', 'rlang', 'stats', 'utils', 'BiocManager', 'car'))
BiocManager::install('S4Vectors', update = FALSE)
## -- optional 
BiocManager::install('DESeq2', update = FALSE)
```

## Docker

We have developed [Docker images](https://hub.docker.com/r/ruanad/htrfit/tags) to simplify the package's utilization.

```
docker pull ruanad/htrfit:v2.0.0-beta
docker run -it --rm ruanad/htrfit:v2.0.0-beta
```


## Biosphere virtual machine

A straightforward way to use **HTRfit** is to run it on a Virtual Machine (VM) through [Biosphere](https://biosphere.france-bioinformatique.fr/catalogue/). We recommend utilizing a VM that includes RStudio for an integrated development environment (IDE) experience. Biosphere VM resources can also be scaled according to your simulation needs.  
**HTRfit** can be installed using the [method A](#method-a).


## HTRfit simulation workflow

In the realm of RNAseq analysis, various key experimental parameters play a crucial role in influencing the statistical power to detect expression changes. Parameters such as sequencing depth, the number of replicates, and others are expected to impact statistical power. To navigate the selection of optimal values for these experimental parameters, we introduce a comprehensive statistical framework known as **HTRfit**, underpinned by computational simulation. Moreover, **HTRfit** offers seamless compatibility with DESeq2 outputs, facilitating a comprehensive evaluation of RNAseq analysis. 


<div id="bg"  align="center">
  <img src="./vignettes/figs/htrfit_workflow.png" width="500" height="300">
</div> 


## Getting started

[Download the vignette](https://gitbio.ens-lyon.fr/aduvermy/HTRfit/-/raw/master/vignettes/HTRfit.html?ref_type=heads&inline=false) for more in-depth information about how to use HTRfit.


### Init a design and simulate RNAseq data

```
library('HTRfit')
## -- init a design 
input_var_list <- init_variable( name = "varA", mu = 0, sd = 0.29, level = 2000) %>%
                  init_variable( name = "varB", mu = 0.27, sd = 0.6, level = 2) %>%
                    add_interaction( between_var = c("varA", "varB"), mu = 0.44, sd = 0.89)
## -- simulate RNAseq data 
mock_data <- mock_rnaseq(input_var_list, 
                         n_genes = 6000,
                         min_replicates  = 4,
                         max_replicates = 4 )
```


### Fit your model

```
## -- prepare data & fit a model with mixed effect
data2fit = prepareData2fit(countMatrix = mock_data$counts, 
                           metadata =  mock_data$metadata, 
                           normalization = F,   
                           response_name = "kij")
l_tmb <- fitModelParallel(formula = kij ~ varB + (varB | varA),
                          data = data2fit, 
                          group_by = "geneID",
                          family = glmmTMB::nbinom2(link = "log"), 
                          n.cores = 8)
```

The `fitModelParallel()` function in **HTRfit** provides a powerful way to fit models independently for each gene. This allows for efficient parallelization of the modeling process by specifying the `n.cores` option. However, it's essential to note that as more cores are utilized, there is a corresponding increase in the required RAM. This is because the data necessary for fitting the model needs to be loaded into memory. Our simulations have demonstrated significant time savings when employing more cores. For instance, using 25 cores was nearly three times faster for processing 6,000 genes and 2,000 experimental conditions (2000 levels in varA, 2 levels in varB - 8000 samples). However, using 50 cores yielded minimal time savings but had a noticeable impact on RAM consumption. Therefore, users must carefully balance computation speed and memory usage when selecting the number of cores. 


### Diagnostic metrics

The `diagnostic_plot()` function allows to plot a diagnostic plot of AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), logLik (log-likelihood), deviance, df.resid (residual degrees of freedom), and dispersion. These metrics provide insights into how well the model fits the data and help in comparing different models. By examining these metrics, users can quickly identify any anomalies or potential issues in the fitting process. 

```
## -- plot all metrics
p <- diagnostic_plot(list_tmb = l_tmb)
```

<div id="bg"  align="center">
  <img src="./vignettes/figs/diagnostic_plot.png" width="600" height="360">
</div> 

The diagnostic metrics show that the fit is not as good for all genes. 

### Evaluation 

```
## -- evaluation
resSimu <- evaluation_report(list_tmb = l_tmb,
                            mock_obj = mock_data,
                            coeff_threshold = 0.27, 
                            alt_hypothesis = "greater")

```

The identity plot, generated by the `evaluation_report()` function, provides a visual means to compare the effects used in the simulation (actual effects) with those inferred by the model. This graphical representation facilitates the assessment of the correspondence between the values of the simulated effects and those estimated by the model, allowing for a visual analysis of the model’s goodness of fit to the simulated data. For a quantitative evaluation of inference quality, the R-squared (R2) metric can be employed.

The evaluation of model performance in distinguishing between differentially expressed and non-differentially expressed genes relies on the area under the ROC curve (AUC), a comprehensive metric offering a singular summary of the model's overall effectiveness. A higher AUC is indicative of superior model performance. It is noteworthy that we not only calculate the AUC for the ROC curve but also extend this assessment to the Precision-Recall (PR) curve. In the case of the ROC curve, this AUC should be compared to the value of 0.5, representing a random classifier. On the other hand, for the PR curve, we compute the pr_AUC_random, serving as a baseline for comparison. This dual evaluation approach ensures a thorough understanding of the model's discrimination capabilities under different scenarios, providing valuable insights into its robustness and reliability.

In addition to evaluating model performance through the AUC for both ROC and PR curves, we provide access to key classification metrics, including Accuracy, Precision, Recall (or Sensitivity), and Specificity. These metrics offer a comprehensive view of the model's classification capabilities.

All metrics are computed for each parameter (excluding the intercept when skip_eval_intercept = TRUE), providing detailed insights into individual parameter contributions. Furthermore, an aggregated assessment is performed, considering all parameters (except the intercept by default), offering a perspective on the model's overall classification effectiveness.

### Comparing performances of HTRfit & DESeq2

**HTRfit** offers a wrapper for **DESeq2** outputs. This functionality allows users to seamlessly integrate the results obtained from **DESeq2** into the **HTRfit** analysis pipeline. By doing so, you can readily compare the performance of **HTRfit** with **DESeq2** on your RNAseq data. This comparative analysis aids in determining which tool performs better for your specific research goals and dataset.


```{r}
## -- init a design 
input_var_list <- init_variable( name = "varA", mu = 0, sd = 0.29, level = 2) %>%
                  init_variable( name = "varB", mu = 0.27, sd = 0.6, level = 2) %>%
                    add_interaction( between_var = c("varA", "varB"), mu = 0.44, sd = 0.89)
N_GENES = 1000
MIN_REPLICATES = 4
MAX_REPLICATES = 4
BASAL_EXPR = 2
SEQ_DEPTH = 1e+7

## -- simulate RNAseq data 
mock_data <- mock_rnaseq(input_var_list, 
                         n_genes = N_GENES,
                         min_replicates  = MIN_REPLICATES,
                         max_replicates = MAX_REPLICATES,
                         basal_expression = BASAL_EXPR,
                         sequencing_depth = SEQ_DEPTH )

## -- prepare data & fit a model with HTRfit
data2fit = prepareData2fit(countMatrix = mock_data$counts, 
                           metadata =  mock_data$metadata, 
                           normalization = F)
l_tmb <- fitModelParallel(formula = kij ~ varB + (varB | varA),
                          data = data2fit, 
                          group_by = "geneID",
                          family = glmmTMB::nbinom2(link = "log"), 
                          n.cores = 8)


## -- DESeq2
library(DESeq2)
dds <- DESeq2::DESeqDataSetFromMatrix(
          countData = round(mock_data$counts),
          colData =  mock_data$metadata,
          design = ~ varA + varB  + varA:varB )
dds <- DESeq2::DESeq(dds, quiet = TRUE)


## -- get simulation/fit evaluation
resSimu <- evaluation_report(list_tmb = l_tmb,
                            dds_obj = dds,
                            mock_obj = mock_data, 
                            coeff_threshold = 0.4, 
                            alt_hypothesis = "greaterAbs")
```


*** ADD plot ***