Skip to content
Snippets Groups Projects
README.md 10.4 KiB
Newer Older
Arnaud Duvermy's avatar
Arnaud Duvermy committed
# High-Throughput RNA-seq model fit

Arnaud Duvermy's avatar
Arnaud Duvermy committed
## Why use HTRfit

HTRfit provides a robust statistical framework that allows you to investigate the essential experimental parameters influencing your ability to detect expression changes. Whether you're examining sequencing depth, the number of replicates, or other critical factors, HTRfit's computational simulation is your go-to solution.

fduveau's avatar
fduveau committed
Furthermore, by enabling the inclusion of fixed effects, mixed effects, and interactions in your RNAseq data analysis, HTRfit provides the flexibility needed to conduct your differential expression analysis effectively.
Arnaud Duvermy's avatar
Arnaud Duvermy committed

Arnaud Duvermy's avatar
Arnaud Duvermy committed

Arnaud Duvermy's avatar
Arnaud Duvermy committed
- [Installation](#installation)
- [CRAN packages dependencies](#cran-packages-dependencies)
- [Docker](#docker)
- [HTRfit simulation workflow](#htrfit-simulation-workflow)
- [Getting started](#getting-started)

Arnaud Duvermy's avatar
Arnaud Duvermy committed

Arnaud Duvermy's avatar
Arnaud Duvermy committed
## Installation

Arnaud Duvermy's avatar
Arnaud Duvermy committed
#### method A:  
Arnaud Duvermy's avatar
Arnaud Duvermy committed

Arnaud Duvermy's avatar
Arnaud Duvermy committed
To install the latest version of HTRfit, run the following in your R console :
```
if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
Arnaud Duvermy's avatar
Arnaud Duvermy committed
remotes::install_git("https://gitbio.ens-lyon.fr/aduvermy/HTRfit")
Arnaud Duvermy's avatar
Arnaud Duvermy committed
```

Arnaud Duvermy's avatar
Arnaud Duvermy committed
#### method B:
Arnaud Duvermy's avatar
Arnaud Duvermy committed

Arnaud Duvermy's avatar
Arnaud Duvermy committed
You also have the option to download a release directly from the [HTRfit release page](https://gitbio.ens-lyon.fr/aduvermy/HTRfit/-/tags). Once you've downloaded the release, simply launch following command.
Arnaud Duvermy's avatar
Arnaud Duvermy committed

```
Arnaud Duvermy's avatar
Arnaud Duvermy committed
## -- Example using the HTRfit-v2.0.0 release
install.packages('HTRfit-v2.0.0.tar.gz', repos = NULL, type='source')
Arnaud Duvermy's avatar
Arnaud Duvermy committed

Arnaud Duvermy's avatar
Arnaud Duvermy committed
```

When dependencies are met, installation should take a few minutes.
Arnaud Duvermy's avatar
Arnaud Duvermy committed


## CRAN packages dependencies

Arnaud Duvermy's avatar
Arnaud Duvermy committed
The following depandencies are required:
Arnaud Duvermy's avatar
Arnaud Duvermy committed

Arnaud Duvermy's avatar
Arnaud Duvermy committed
```
Arnaud Duvermy's avatar
Arnaud Duvermy committed
## -- required
Arnaud Duvermy's avatar
Arnaud Duvermy committed
install.packages(c('car', 'parallel', 'data.table', 'ggplot2', 'gridExtra', 'glmmTMB',
 'magrittr', 'MASS', 'plotROC', 'reshape2', 'rlang', 'stats', 'utils', 'BiocManager'))
Arnaud Duvermy's avatar
Arnaud Duvermy committed
BiocManager::install('S4Vectors', update = FALSE)
## -- optional 
BiocManager::install('DESeq2', update = FALSE)
Arnaud Duvermy's avatar
Arnaud Duvermy committed
```

Arnaud Duvermy's avatar
Arnaud Duvermy committed
## Docker

Arnaud Duvermy's avatar
Arnaud Duvermy committed
We have developed [Docker images](https://hub.docker.com/r/ruanad/htrfit/tags) to simplify the package's utilization.
Arnaud Duvermy's avatar
Arnaud Duvermy committed

Arnaud Duvermy's avatar
Arnaud Duvermy committed
```
docker pull ruanad/htrfit:v2.0.0-beta
docker run -it --rm ruanad/htrfit:v2.0.0-beta
```
Arnaud Duvermy's avatar
Arnaud Duvermy committed

Arnaud Duvermy's avatar
Arnaud Duvermy committed

Arnaud Duvermy's avatar
Arnaud Duvermy committed
## Biosphere virtual machine

A straightforward way to use **HTRfit** is to run it on a Virtual Machine (VM) through [Biosphere](https://biosphere.france-bioinformatique.fr/catalogue/). We recommend utilizing a VM that includes RStudio for an integrated development environment (IDE) experience. Biosphere VM resources can also be scaled according to your simulation needs.  
Arnaud Duvermy's avatar
Arnaud Duvermy committed
**HTRfit** can be installed using the [method A](#method-a).
Arnaud Duvermy's avatar
Arnaud Duvermy committed
## HTRfit simulation workflow
Arnaud Duvermy's avatar
Arnaud Duvermy committed

fduveau's avatar
fduveau committed
In the realm of RNAseq analysis, various key experimental parameters play a crucial role in influencing the statistical power to detect expression changes. Parameters such as sequencing depth, the number of replicates, and others are expected to impact statistical power. To navigate the selection of optimal values for these experimental parameters, we introduce a comprehensive statistical framework known as **HTRfit**, underpinned by computational simulation. Moreover, **HTRfit** offers seamless compatibility with DESeq2 outputs, facilitating a comprehensive evaluation of RNAseq analysis. 
Arnaud Duvermy's avatar
Arnaud Duvermy committed
<div id="bg"  align="center">
Arnaud Duvermy's avatar
Arnaud Duvermy committed
  <img src="./vignettes/figs/htrfit_workflow.png" width="500" height="300">
Arnaud Duvermy's avatar
Arnaud Duvermy committed
</div> 
Arnaud Duvermy's avatar
Arnaud Duvermy committed


Arnaud Duvermy's avatar
Arnaud Duvermy committed

## Getting started
Arnaud Duvermy's avatar
Arnaud Duvermy committed

fduveau's avatar
fduveau committed
[Download the vignette](https://gitbio.ens-lyon.fr/aduvermy/HTRfit/-/raw/master/vignettes/HTRfit.html?ref_type=heads&inline=false) for more in-depth information about how to use HTRfit.
Arnaud Duvermy's avatar
Arnaud Duvermy committed


### Init a design and simulate RNAseq data

Arnaud Duvermy's avatar
Arnaud Duvermy committed
```
Arnaud Duvermy's avatar
Arnaud Duvermy committed
library('HTRfit')
Arnaud Duvermy's avatar
Arnaud Duvermy committed
## -- init a design 
input_var_list <- init_variable( name = "varA", mu = 0, sd = 0.29, level = 2000) %>%
Arnaud Duvermy's avatar
Arnaud Duvermy committed
                  init_variable( name = "varB", mu = 0.27, sd = 0.6, level = 2) %>%
                    add_interaction( between_var = c("varA", "varB"), mu = 0.44, sd = 0.89)
## -- simulate RNAseq data 
mock_data <- mock_rnaseq(input_var_list, 
Arnaud Duvermy's avatar
Arnaud Duvermy committed
                         n_genes = 6000,
                         min_replicates  = 4,
                         max_replicates = 4 )
```



### Fit your model

```
Arnaud Duvermy's avatar
Arnaud Duvermy committed
## -- prepare data & fit a model with mixed effect
data2fit = prepareData2fit(countMatrix = mock_data$counts, 
                           metadata =  mock_data$metadata, 
                           normalization = F)
l_tmb <- fitModelParallel(formula = kij ~ varB + (varB | varA),
                          data = data2fit, 
                          group_by = "geneID",
                          family = glmmTMB::nbinom2(link = "log"), 
Arnaud Duvermy's avatar
Arnaud Duvermy committed
                          n.cores = 8)
Arnaud Duvermy's avatar
Arnaud Duvermy committed
The `fitModelParallel()` function in **HTRfit** provides a powerful way to fit models independently for each gene. This allows for efficient parallelization of the modeling process by specifying the `n.cores` option. However, it's essential to note that as more cores are utilized, there is a corresponding increase in the required RAM. This is because the data necessary for fitting the model needs to be loaded into memory. Our simulations have demonstrated significant time savings when employing more cores. For instance, using 25 cores was nearly three times faster for processing 6,000 genes and 2,000 experimental conditions (2000 levels in varA, 2 levels in varB - 8000 samples). However, using 50 cores yielded minimal time savings but had a noticeable impact on RAM consumption. Therefore, users must carefully balance computation speed and memory usage when selecting the number of cores. 
Arnaud Duvermy's avatar
Arnaud Duvermy committed
### Diagnostic metrics

fduveau's avatar
fduveau committed
The `diagnostic_plot()` function allows to plot a diagnostic plot of AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), logLik (log-likelihood), deviance, df.resid (residual degrees of freedom), and dispersion. These metrics provide insights into how well the model fits the data and help in comparing different models. By examining these metrics, users can quickly identify any anomalies or potential issues in the fitting process.
Arnaud Duvermy's avatar
Arnaud Duvermy committed

```
## -- plot all metrics
Arnaud Duvermy's avatar
Arnaud Duvermy committed
p <- diagnostic_plot(list_tmb = l_tmb)
Arnaud Duvermy's avatar
Arnaud Duvermy committed
```

<div id="bg"  align="center">
  <img src="./vignettes/figs/diagnostic_plot.png" width="600" height="360">
</div> 

Arnaud Duvermy's avatar
Arnaud Duvermy committed
### Evaluation 
Arnaud Duvermy's avatar
Arnaud Duvermy committed
## -- evaluation
Arnaud Duvermy's avatar
Arnaud Duvermy committed
resSimu <- evaluation_report(list_tmb = l_tmb,
                            mock_obj = mock_data,
Arnaud Duvermy's avatar
Arnaud Duvermy committed
                            coeff_threshold = 0.27, 
                            alt_hypothesis = "greater")

Arnaud Duvermy's avatar
Arnaud Duvermy committed
```

Arnaud Duvermy's avatar
Arnaud Duvermy committed
The identity plot, generated by the `evaluation_report()` function, provides a visual means to compare the effects used in the simulation (actual effects) with those inferred by the model. This graphical representation facilitates the assessment of the correspondence between the values of the simulated effects and those estimated by the model, allowing for a visual analysis of the model’s goodness of fit to the simulated data.
Arnaud Duvermy's avatar
Arnaud Duvermy committed

Arnaud Duvermy's avatar
Arnaud Duvermy committed
The dispersion plot, generated by the `evaluation_report()` function, offers a visual comparison of the dispersion parameters used in the simulation $\alpha_i$ with those estimated by the model. This graphical representation provides an intuitive way to assess the alignment between the simulated dispersion values and the model-inferred values, enabling a visual evaluation of how well the model captures the underlying data characteristics.
Arnaud Duvermy's avatar
Arnaud Duvermy committed

fduveau's avatar
fduveau committed
The evaluation of model performance in distinguishing between differentially expressed and non-differentially expressed genes relies on the area under the ROC curve (AUC), a comprehensive metric offering a singular summary of the model's overall effectiveness. A higher AUC is indicative of superior model performance. It is noteworthy that we not only calculate the AUC for the ROC curve but also extend this assessment to the Precision-Recall (PR) curve. In the case of the ROC curve, this AUC should be compared to the value of 0.5, representing a random classifier. On the other hand, for the PR curve, we compute the pr_AUC_random, serving as a baseline for comparison. This dual evaluation approach ensures a thorough understanding of the model's discrimination capabilities under different scenarios, providing valuable insights into its robustness and reliability.
Arnaud Duvermy's avatar
Arnaud Duvermy committed

Arnaud Duvermy's avatar
Arnaud Duvermy committed
In addition to evaluating model performance through the AUC for both ROC and PR curves, we provide access to key classification metrics, including Accuracy, Precision, Recall (or Sensitivity), and Specificity. These metrics offer a comprehensive view of the model's classification capabilities.
Arnaud Duvermy's avatar
Arnaud Duvermy committed

Arnaud Duvermy's avatar
Arnaud Duvermy committed

fduveau's avatar
fduveau committed
### Comparing performances of HTRfit & DESeq2
Arnaud Duvermy's avatar
Arnaud Duvermy committed

fduveau's avatar
fduveau committed
**HTRfit** offers a wrapper for **DESeq2** outputs. This functionality allows users to seamlessly integrate the results obtained from **DESeq2** into the **HTRfit** analysis pipeline. By doing so, you can readily compare the performance of **HTRfit** with **DESeq2** on your RNAseq data. This comparative analysis aids in determining which tool performs better for your specific research goals and dataset.
Arnaud Duvermy's avatar
Arnaud Duvermy committed



```{r}
## -- init a design 
input_var_list <- init_variable( name = "varA", mu = 0, sd = 0.29, level = 2) %>%
                  init_variable( name = "varB", mu = 0.27, sd = 0.6, level = 2) %>%
                    add_interaction( between_var = c("varA", "varB"), mu = 0.44, sd = 0.89)
N_GENES = 1000
MIN_REPLICATES = 4
MAX_REPLICATES = 4
BASAL_EXPR = 2
SEQ_DEPTH = 1e+7

## -- simulate RNAseq data 
mock_data <- mock_rnaseq(input_var_list, 
                         n_genes = N_GENES,
                         min_replicates  = MIN_REPLICATES,
                         max_replicates = MAX_REPLICATES,
                         basal_expression = BASAL_EXPR,
                         sequencing_depth = SEQ_DEPTH )

## -- prepare data & fit a model with HTRfit
data2fit = prepareData2fit(countMatrix = mock_data$counts, 
                           metadata =  mock_data$metadata, 
                           normalization = F)
l_tmb <- fitModelParallel(formula = kij ~ varB + (varB | varA),
                          data = data2fit, 
                          group_by = "geneID",
                          family = glmmTMB::nbinom2(link = "log"), 
                          n.cores = 8)


## -- DESeq2
library(DESeq2)
dds <- DESeq2::DESeqDataSetFromMatrix(
          countData = count_matrix,
          colData = metaData,
          design = ~ varA + varB  + varA:varB )
dds <- DESeq2::DESeq(dds, quiet = TRUE)


## -- get simulation/fit evaluation
resSimu <- evaluation_report(list_tmb = l_tmb,
                            dds_obj = dds,
                            mock_data, 
                            coeff_threshold = 0.4, 
                            alt_hypothesis = "greaterAbs")
```


*** ADD plot ***