# High-Throughput RNA-seq model fit

## Why use HTRfit

HTRfit provides a robust statistical framework that allows you to investigate the essential experimental parameters influencing your ability to detect expression changes. Whether you're examining sequencing depth, the number of replicates, or other critical factors, HTRfit's computational simulation is your go-to solution.

Furthermore, by enabling the inclusion of fixed effects, mixed effects, and interactions in your RNAseq data analysis, HTRfit provides the flexibility needed to conduct your differential expression analysis effectively.


- [Installation](#installation)
- [CRAN packages dependencies](#cran-packages-dependencies)
- [Docker](#docker)
- [Biosphere virtual machine](#biosphere-virtual-machine)
- [HTRfit simulation workflow](#htrfit-simulation-workflow)
- [Getting started](#getting-started)


## Installation

#### method A:  

To install the latest version of HTRfit, run the following in your R console:
```
if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
remotes::install_git("https://gitbio.ens-lyon.fr/aduvermy/HTRfit")
```

#### method B:

You also have the option to download a release directly from the [HTRfit release page](https://gitbio.ens-lyon.fr/aduvermy/HTRfit/-/releases). Once you've downloaded the release, simply untar the archive. After that, open your R console and execute the following command, where HTRfit-v1.0.0 should be replaced with the path to the untarred folder:

```
## -- Example using the HTRfit-v1.0.0 release
install.packages('/HTRfit-v1.0.0', repos = NULL, type='source')

```

When dependencies are met, installation should take a few minutes.


## CRAN packages dependencies

The following dependencies are required:

```
## -- required ('parallel', 'stats' and 'utils' ship with base R, so they need no install)
install.packages(c('car', 'data.table', 'ggplot2', 'gridExtra', 'glmmTMB',
 'magrittr', 'MASS', 'plotROC', 'reshape2', 'rlang', 'BiocManager'))
BiocManager::install('S4Vectors', update = FALSE)
## -- optional 
BiocManager::install('DESeq2', update = FALSE)
```
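
Before installing anything, it can be useful to see which of these packages are still missing. The snippet below is a minimal sketch using base R only; the variable names (`required_pkgs`, `missing_pkgs`) are illustrative and not part of HTRfit.

```r
## -- required packages from the list above (base packages omitted: they ship with R)
required_pkgs <- c('car', 'data.table', 'ggplot2', 'gridExtra', 'glmmTMB', 'magrittr',
                   'MASS', 'plotROC', 'reshape2', 'rlang', 'S4Vectors')

## -- compare against the packages visible in the current library paths
installed_pkgs <- rownames(utils::installed.packages())
missing_pkgs <- setdiff(required_pkgs, installed_pkgs)

if (length(missing_pkgs) == 0) {
  message("All required packages are installed.")
} else {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```

Note that `S4Vectors` comes from Bioconductor, so if it is reported as missing it should be installed with `BiocManager::install()` rather than `install.packages()`.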

## Docker

We provide [Docker images](https://hub.docker.com/repository/docker/ruanad/htrfit/general) to make the package easier to use. For development with the Docker container, we recommend Visual Studio Code (VSCode) with the DevContainer extension, which provides a convenient, isolated environment for development and testing.

1. Install VSCode.
2. Install Docker on your system and the Docker extension in VSCode.
3. Launch the HTRfit container directly from VSCode.
4. Install the DevContainer extension for VSCode.
5. Launch a remote window connected to the running Docker container.
6. Install the R extension for VSCode.
7. Enjoy HTRfit!


## Biosphere virtual machine

A straightforward way to use **HTRfit** is to run it on a Virtual Machine (VM) through [Biosphere](https://biosphere.france-bioinformatique.fr/catalogue/). We recommend utilizing a VM that includes RStudio for an integrated development environment (IDE) experience. Biosphere VM resources can also be scaled according to your simulation needs.  
**HTRfit** can be installed using [method A](#method-a).


## HTRfit simulation workflow

In RNAseq analysis, key experimental parameters such as sequencing depth and the number of replicates strongly influence the statistical power to detect expression changes. To help select suitable values for these parameters, **HTRfit** provides a statistical framework underpinned by computational simulation. **HTRfit** is also compatible with DESeq2 outputs, facilitating a comprehensive evaluation of RNAseq analysis.


<div id="bg"  align="center">
  <img src="./vignettes/figs/htrfit_workflow.png" width="500" height="300">
</div> 


## Getting started

[Download the vignette](https://gitbio.ens-lyon.fr/aduvermy/HTRfit/-/raw/master/vignettes/HTRfit.html?ref_type=heads&inline=false) for more in-depth information.


### Init a design and simulate RNAseq data

```
library('HTRfit')
## -- init a design 
input_var_list <- init_variable( name = "varA", mu = 0, sd = 0.29, level = 2000) %>%
                  init_variable( name = "varB", mu = 0.27, sd = 0.6, level = 2) %>%
                    add_interaction( between_var = c("varA", "varB"), mu = 0.44, sd = 0.89)
## -- simulate RNAseq data 
mock_data <- mock_rnaseq(input_var_list, 
                         n_genes = 6000,
                         min_replicates  = 4,
                         max_replicates = 4 )
```


The simulation process in HTRfit has been optimized to generate RNAseq counts for 30,000 genes and 4,000 experimental conditions (2000 levels in varA, 2 levels in varB), each replicated 4 times, resulting in a total of 16,000 samples, in less than 5 minutes. However, the object generated by the framework under these conditions can consume a significant amount of RAM, approximately 50 GB. For an equivalent simulation with 6,000 genes, less than a minute and 10 GB of RAM are required.


<div id="bg"  align="center">
  <img src="./vignettes/figs/simulation_step.png" width="500" height="200">
</div> 
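
The sample count quoted above follows directly from the design: the number of conditions is the product of the factor levels, and the number of samples is the number of conditions times the number of replicates. The sketch below reproduces that arithmetic and adds a rough RAM guess. The linear scaling with the number of genes is an assumption interpolated from the two figures given above, not a documented property of HTRfit, and `estimate_ram_gb()` is a hypothetical helper.

```r
## -- design size: conditions = product of levels, samples = conditions x replicates
n_levels <- c(varA = 2000, varB = 2)
n_replicates <- 4
n_conditions <- prod(n_levels)             # 4000
n_samples <- n_conditions * n_replicates   # 16000

## -- rough RAM guess, ASSUMING memory scales linearly with the number of genes
## -- (anchored on the figure quoted above: about 50 GB for 30,000 genes)
gb_per_gene <- 50 / 30000
estimate_ram_gb <- function(n_genes) n_genes * gb_per_gene
estimate_ram_gb(6000)                      # about 10, matching the 6,000-gene figure
```

Actual memory use depends on the machine and the design, so treat the estimate as an order of magnitude only.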


### Fit your model

```
## -- prepare data & fit a model with mixed effect
data2fit <- prepareData2fit(countMatrix = mock_data$counts,
                            metadata = mock_data$metadata,
                            normalization = FALSE)
l_tmb <- fitModelParallel(formula = kij ~ varB + (varB | varA),
                          data = data2fit, 
                          group_by = "geneID",
                          family = glmmTMB::nbinom2(link = "log"), 
                          n.cores = 8)
```

The `fitModelParallel()` function in **HTRfit** fits a model independently for each gene, so the work can be parallelized across the number of cores set with the `n.cores` option. Note that RAM requirements increase with the number of cores, because the data needed to fit the model must be loaded into memory. Our simulations showed significant time savings when employing more cores: using 25 cores was nearly three times faster for processing 6,000 genes and 2,000 experimental conditions (2000 levels in varA, 2 levels in varB - 8000 samples). Using 50 cores, however, yielded minimal additional time savings while noticeably increasing RAM consumption. Users must therefore balance computation speed against memory usage when selecting the number of cores. The graph below can help identify that trade-off.


<div id="bg"  align="center">
  <img src="./vignettes/figs/fit_step.png" width="836" height="220">
</div> 
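
One simple way to choose a value for `n.cores` is to cap the requested number by what the machine actually offers. The snippet below is a minimal sketch using `parallel::detectCores()`; the choice to leave one core free and the variable names are illustrative assumptions, not an HTRfit recommendation.

```r
## -- pick a number of cores, capped by what the machine offers
available_cores <- parallel::detectCores()
if (is.na(available_cores)) available_cores <- 1L   # detectCores() can return NA
requested_cores <- 8L

## -- leave one core free for the rest of the system, but never go below 1
n_cores <- max(1L, min(requested_cores, available_cores - 1L))
n_cores
```

The resulting value can then be passed to `fitModelParallel()` as `n.cores = n_cores`. Remember that available RAM, not only the core count, may be the limiting factor.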

The output object generated by `fitModelParallel()` can also be large. In simulations involving 6,000 genes and 2,000 experimental conditions (equivalent to 8,000 samples), it reached approximately 10 GB of RAM. Users therefore need to ensure that their computing environment has enough available RAM to hold it.
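
To check how much memory an object occupies in your own session, base R provides `utils::object.size()`. The helper below is a small sketch; `object_size_label()` is a hypothetical name, and the toy vector only stands in for a real fit result.

```r
## -- helper: human-readable in-memory size of an R object
object_size_label <- function(x) format(utils::object.size(x), units = "auto")

## -- toy object standing in for a fit result (one million doubles, about 8 MB)
toy_object <- numeric(1e6)
object_size_label(toy_object)
```

After fitting, the same helper can be applied to the result, for example `object_size_label(l_tmb)`.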

### Diagnostic metrics

The `metrics_plot()` function draws a diagnostic plot of AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), logLik (log-likelihood), deviance, df.resid (residual degrees of freedom), and dispersion. These metrics provide insights into how well the model fits the data and help in comparing different models. By examining these metrics, users can quickly identify any anomalies or potential issues in the fitting process.

```
## -- plot all metrics
p <- metrics_plot(list_tmb = l_tmb)
```

<div id="bg"  align="center">
  <img src="./vignettes/figs/diagnostic_plot.png" width="600" height="360">
</div> 
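
To make the link between these metrics concrete, the sketch below recomputes AIC and BIC from the log-likelihood on a toy model. It uses a plain Poisson GLM from base R rather than the negative binomial models fitted by HTRfit, purely to keep the example self-contained; the data are simulated and all names are illustrative.

```r
## -- toy example with base R: recompute AIC and BIC from the log-likelihood
set.seed(1)
toy <- data.frame(x = rep(c(0, 1), each = 20))
toy$y <- rpois(40, lambda = exp(1 + 0.5 * toy$x))
fit <- glm(y ~ x, family = poisson(link = "log"), data = toy)

ll <- logLik(fit)
k <- attr(ll, "df")    # number of estimated parameters
n <- nobs(fit)         # number of observations

aic_manual <- 2 * k - 2 * as.numeric(ll)
bic_manual <- k * log(n) - 2 * as.numeric(ll)

all.equal(aic_manual, AIC(fit))   # TRUE
all.equal(bic_manual, BIC(fit))   # TRUE
```

Lower AIC or BIC indicates a better balance between fit and complexity, with BIC penalizing additional parameters more heavily once `log(n)` exceeds 2.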


### Evaluation

```
## -- evaluation
resSimu <- simulationReport(mock_data, 
                            list_tmb = l_tmb,
                            coeff_threshold = 0.27, 
                            alt_hypothesis = "greater")

```

The identity plot, generated by the `simulationReport()` function, provides a visual means to compare the effects used in the simulation (actual effects) with those inferred by the model. This graphical representation facilitates the assessment of the correspondence between the values of the simulated effects and those estimated by the model, allowing for a visual analysis of the model’s goodness of fit to the simulated data.

The dispersion plot, generated by the `simulationReport()` function, offers a visual comparison of the dispersion parameters used in the simulation $\alpha_i$ with those estimated by the model. This graphical representation provides an intuitive way to assess the alignment between the simulated dispersion values and the model-inferred values, enabling a visual evaluation of how well the model captures the underlying data characteristics.
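
As a reminder of what a dispersion parameter controls, the sketch below simulates negative binomial counts in base R and compares the observed variance with the expected one. It assumes the common parameterization where the variance equals `mu + alpha * mu^2`, which corresponds to `size = 1 / alpha` in `rnbinom()`; whether HTRfit defines $\alpha_i$ exactly this way is an assumption here, so check the vignette before relying on it.

```r
## -- ASSUMPTION: variance = mu + alpha * mu^2, i.e. size = 1 / alpha in rnbinom()
set.seed(42)
mu <- 100
alpha <- 0.5
counts <- rnbinom(n = 1e5, mu = mu, size = 1 / alpha)

expected_var <- mu + alpha * mu^2   # 5100
observed_var <- var(counts)
c(expected = expected_var, observed = observed_var)
```

With a large number of draws, the observed variance should land close to the expected value; a larger `alpha` means more overdispersion relative to a Poisson model.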

The Receiver Operating Characteristic (ROC) curve is a valuable tool for assessing the performance of classification models, particularly in the context of identifying differentially expressed genes. It provides a graphical representation of the model’s ability to distinguish between genes that are differentially expressed and those that are not, by varying the `coeff_threshold` and the `alt_hypothesis` parameters. The area under the ROC curve (AUC) provides a single metric that summarizes the model’s overall performance in distinguishing between differentially expressed and non-differentially expressed genes. A higher AUC indicates better model performance.


<div id="bg"  align="center">
  <img src="./vignettes/figs/evaluation.png" width="680" height="400">
</div>
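
For readers less familiar with ROC curves, the sketch below builds one by hand in base R and computes the AUC with the rank-sum (Mann-Whitney) identity. The labels and scores are simulated toy values, not HTRfit output, and the code is a generic illustration rather than a description of how `simulationReport()` works internally.

```r
## -- toy ground truth and scores (hypothetical, not HTRfit output)
set.seed(7)
is_de <- rep(c(TRUE, FALSE), times = c(200, 800))    # 200 truly DE genes, 800 not
score <- rnorm(1000, mean = ifelse(is_de, 2, 0))     # higher score = more evidence of DE

## -- ROC points: sweep the decision threshold over all observed scores
thresholds <- sort(unique(score), decreasing = TRUE)
tpr <- vapply(thresholds, function(t) mean(score[is_de] >= t), numeric(1))
fpr <- vapply(thresholds, function(t) mean(score[!is_de] >= t), numeric(1))

## -- AUC from the rank-sum (Mann-Whitney) identity
n_pos <- sum(is_de)
n_neg <- sum(!is_de)
auc <- (sum(rank(score)[is_de]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc
```

The curve can be drawn with `plot(fpr, tpr, type = "l")`. An AUC of 0.5 corresponds to random guessing and 1 to perfect separation.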