
# High-Throughput RNA-seq model fit

## Why use HTRfit

HTRfit provides a statistical framework for investigating the experimental parameters that influence your ability to detect expression changes. Whether you are examining sequencing depth, the number of replicates, or other design factors, HTRfit's computational simulation lets you evaluate their effect.

Furthermore, by supporting fixed effects, mixed effects, and interactions in your RNAseq data analysis, HTRfit provides the flexibility needed to carry out your differential expression analysis.


- [Installation](#installation)
- [CRAN packages dependencies](#cran-packages-dependencies)
- [Docker](#docker)
- [Biosphere virtual machine](#biosphere-virtual-machine)
- [HTRfit simulation workflow](#htrfit-simulation-workflow)
- [Getting started](#getting-started)


## Installation

#### method A:  

To install the latest version of HTRfit, run the following in your R console:
```
if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
remotes::install_git("https://gitbio.ens-lyon.fr/aduvermy/HTRfit")
```

#### method B:

You also have the option to download a release directly from the [HTRfit release page](https://gitbio.ens-lyon.fr/aduvermy/HTRfit/-/releases). Once you've downloaded the release, simply untar the archive. After that, open your R console and execute the following command, where HTRfit-v1.0.0 should be replaced with the path to the untarred folder:

```
## -- Example using the HTRfit-v1.0.0 release
install.packages('/HTRfit-v1.0.0', repos = NULL, type='source')

```

When dependencies are met, installation should take a few minutes.


## CRAN packages dependencies

The following dependencies are required (DESeq2 is optional):

```
## -- required ('parallel', 'stats' and 'utils' ship with base R and need no installation)
install.packages(c('car', 'data.table', 'ggplot2', 'gridExtra', 'glmmTMB',
 'magrittr', 'MASS', 'plotROC', 'reshape2', 'rlang', 'BiocManager'))
BiocManager::install('S4Vectors', update = FALSE)
## -- optional 
BiocManager::install('DESeq2', update = FALSE)
```
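Before installing HTRfit, it can help to check which dependencies are already present. The snippet below is a minimal sketch using only base R; the package names are copied from the install commands above, and the variable names (`required`, `missing_pkgs`) are illustrative.

```r
# Package names copied from the install commands above; base packages
# (parallel, stats, utils) ship with R and are left out of the check.
required <- c("car", "data.table", "ggplot2", "gridExtra", "glmmTMB",
              "magrittr", "MASS", "plotROC", "reshape2", "rlang",
              "S4Vectors")

# requireNamespace() returns FALSE (instead of raising an error)
# when a package is not installed
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!is_installed]

if (length(missing_pkgs) == 0) {
  message("All required dependencies are installed.")
} else {
  message("Missing: ", paste(missing_pkgs, collapse = ", "))
}
```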

## Docker

We provide [Docker images](https://hub.docker.com/repository/docker/ruanad/htrfit/general) to simplify use of the package. For an optimal development and coding experience with the Docker container, we recommend using Visual Studio Code (VSCode) along with the DevContainer extension. This setup provides a convenient and isolated environment for development and testing.

1. Install VSCode.
2. Install Docker on your system and on VSCode.
3. Launch the HTRfit container directly from VSCode.
4. Install the DevContainer extension for VSCode.
5. Launch a remote window connected to the running Docker container.
6. Install the R extension for VSCode.
7. Enjoy HTRfit!


## Biosphere virtual machine

A straightforward way to use **HTRfit** is to run it on a Virtual Machine (VM) through [Biosphere](https://biosphere.france-bioinformatique.fr/catalogue/). We recommend using a VM that includes RStudio for an integrated development environment (IDE) experience. Biosphere VM resources can also be scaled according to your simulation needs.  
**HTRfit** can be installed using [method A](#method-a).


## HTRfit simulation workflow

Several experimental parameters, such as sequencing depth and the number of replicates, strongly influence the statistical power to detect expression changes in RNAseq analysis. To guide the choice of these parameters, **HTRfit** provides a statistical framework underpinned by computational simulation. **HTRfit** is also compatible with DESeq2 outputs, which facilitates a comprehensive evaluation of RNAseq analysis.


<div id="bg"  align="center">
  <img src="./vignettes/figs/htrfit_workflow.png" width="500" height="300">
</div> 


## Getting started


### Init a design and simulate RNAseq data

```
library('HTRfit')
## -- init a design 
input_var_list <- init_variable( name = "varA", mu = 0, sd = 0.29, level = 2000) %>%
                  init_variable( name = "varB", mu = 0.27, sd = 0.6, level = 2) %>%
                    add_interaction( between_var = c("varA", "varB"), mu = 0.44, sd = 0.89)
## -- simulate RNAseq data 
mock_data <- mock_rnaseq(input_var_list, 
                         n_genes = 6000,
                         min_replicates  = 4,
                         max_replicates = 4 )
```


HTRfit can simulate RNAseq counts for 30,000 genes and 4,000 experimental conditions (2,000 levels in varA, 2 levels in varB), each replicated 4 times, for a total of 16,000 samples, in less than 5 minutes. However, the object generated under these conditions consumes a significant amount of RAM, approximately 50 GB. An equivalent simulation with 6,000 genes requires less than a minute and 10 GB of RAM.
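The sample count quoted above follows directly from the design. The sketch below reproduces that arithmetic in base R and estimates the size of the raw count matrix alone. The 4-byte integer storage is an assumption made for illustration; the RAM figures reported above cover the whole simulation object, so this estimate is only a lower bound.

```r
# Design from the example above: 2000 levels in varA, 2 levels in varB
n_levels     <- c(varA = 2000, varB = 2)
n_replicates <- 4
n_genes      <- 6000

n_conditions <- prod(n_levels)               # 4000 conditions
n_samples    <- n_conditions * n_replicates  # 16000 samples

# Assumption: counts stored as 4-byte integers in a dense genes x samples matrix
bytes_per_count <- 4
matrix_gb <- n_genes * n_samples * bytes_per_count / 1e9

cat(n_conditions, "conditions,", n_samples, "samples\n")
cat("Raw count matrix alone:", round(matrix_gb, 2), "GB\n")
```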


<div id="bg"  align="center">
  <img src="./vignettes/figs/simulation_step.png" width="500" height="200">
</div> 


### Fit your model

```
## -- prepare data & fit a model with mixed effect
data2fit <- prepareData2fit(countMatrix = mock_data$counts, 
                            metadata =  mock_data$metadata, 
                            normalization = FALSE)
l_tmb <- fitModelParallel(formula = kij ~ varB + (varB | varA),
                          data = data2fit, 
                          group_by = "geneID",
                          family = glmmTMB::nbinom2(link = "log"), 
                          n.cores = 8)
```

The `fitModelParallel()` function fits a model independently for each gene, so the work can be parallelized across cores with the `n.cores` option. RAM requirements grow with the number of cores, because the data needed for each fit must be loaded into memory.

In our simulations, using more cores saved significant time: with 25 cores, fitting was nearly three times faster for 6,000 genes and 2,000 experimental conditions (2000 levels in varA, 2 levels in varB - 8000 samples). Going to 50 cores saved little additional time but noticeably increased RAM consumption. Choose the number of cores by balancing computation speed against memory usage; the graph below shows this trade-off.


<div id="bg"  align="center">
  <img src="./vignettes/figs/fit_step.png" width="836" height="220">
</div> 
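The per-gene pattern described above can be sketched in base R. This is a conceptual illustration only, not HTRfit's implementation: it swaps in a Poisson GLM for the negative binomial mixed model so that no extra package is needed, it uses `parallel::mclapply()` as one possible way to distribute the fits, and the names `toy`, `per_gene` and `fit_one` are hypothetical.

```r
library(parallel)

set.seed(1)
# Toy long-format data: 3 genes x 2 levels of varB x 4 replicates
toy <- expand.grid(geneID = paste0("gene", 1:3),
                   varB = c("B1", "B2"),
                   replicate = 1:4,
                   stringsAsFactors = FALSE)
toy$kij <- rpois(nrow(toy), lambda = ifelse(toy$varB == "B2", 40, 20))

# One data frame per gene; each one is fitted independently of the others
per_gene <- split(toy, toy$geneID)
fit_one  <- function(d) stats::glm(kij ~ varB, data = d, family = stats::poisson())

# Cap the worker count at what the machine offers; detectCores() can return NA,
# and mclapply() only supports a single core on Windows
available <- detectCores()
n_cores <- if (.Platform$OS.type == "windows" || is.na(available)) {
  1L
} else {
  max(1L, min(2L, available - 1L))
}

fits <- mclapply(per_gene, fit_one, mc.cores = n_cores)

# Estimated varB effect (log scale) for each gene
sapply(fits, function(f) coef(f)[["varBB2"]])
```

Because every gene is fitted on its own subset, the fits share no state, which is what makes the workload easy to spread over several cores.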

The output object of `fitModelParallel()` can also be large. In simulations involving 6,000 genes and 2,000 experimental conditions (equivalent to 8,000 samples), it occupies approximately 10 GB of RAM. Make sure your computing environment has enough available memory to hold it.
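One way to keep an eye on memory is to measure an object with base R's `object.size()`. The sketch below uses a stand-in list (`res`) and a hypothetical RAM budget (`budget_gb`), both of which are assumptions for illustration; in practice you would pass the object returned by the fitting step.

```r
# Stand-in for a large result object (hypothetical); in practice, pass the
# list returned by the fitting step instead
res <- lapply(1:100, function(i) rnorm(1e4))

# Size of the object in bytes, plus a human-readable summary
size_bytes <- as.numeric(object.size(res))
cat("Object size:", format(object.size(res), units = "auto"), "\n")

# Hypothetical RAM budget, for illustration only
budget_gb <- 16
if (size_bytes / 1e9 > budget_gb) {
  warning("Result object exceeds the RAM budget")
}
```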

### Evaluation

```
## -- evaluation
resSimu <- simulationReport(mock_data, 
                            list_tmb = l_tmb,
                            coeff_threshold = 0.27, 
                            alt_hypothesis = "greater")

```
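One plausible reading of `coeff_threshold` and `alt_hypothesis = "greater"` is a one-sided Wald test of whether a coefficient exceeds the threshold. The sketch below illustrates that test in base R. It is an assumption made for illustration, not a description of the internals of `simulationReport()`; the function `wald_greater()` and the example values are hypothetical.

```r
# One-sided Wald test of H1: beta > threshold (assumed reading of the arguments)
wald_greater <- function(estimate, std_error, threshold = 0) {
  z <- (estimate - threshold) / std_error
  data.frame(statistic = z, p_value = pnorm(z, lower.tail = FALSE))
}

# Hypothetical estimates and standard errors for three coefficients
wald_res <- wald_greater(estimate  = c(0.27, 0.60, 0.10),
                         std_error = c(0.10, 0.10, 0.10),
                         threshold = 0.27)
print(wald_res)
```

Under this reading, an estimate equal to the threshold gives a statistic of zero, and only estimates well above the threshold relative to their standard error yield small p-values.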