-
Arnaud Duvermy authoredArnaud Duvermy authored
- High-Throughput RNA-seq model fit
- Why use HTRfit
- Installation
- method A:
- method B:
- CRAN packages dependencies
- Docker
- Biosphere virtual machine
- HTRfit simulation workflow
- Getting started
- Init a design and simulate RNAseq data
- Fit your model
- Diagnostic metrics
- Extracts a tidy result table from a list tmb object
- Evaluation
- Comparing performances of HTRfit & DESeq2
High-Throughput RNA-seq model fit
Why use HTRfit
HTRfit provides a robust statistical framework that allows you to investigate the essential experimental parameters influencing your ability to detect expression changes. Whether you're examining sequencing depth, the number of replicates, or other critical factors, HTRfit's computational simulation is your go-to solution.
Furthermore, by enabling the inclusion of fixed effects, mixed effects, and interactions in your RNAseq data analysis, HTRfit provides the flexibility needed to conduct your differential expression analysis effectively. HTRfit is particularly adapted for the analysis of large number of samples, or highly multiplexed experiments.
Installation
method A:
To install the latest version of HTRfit, run the following in your R console :
if (!requireNamespace("remotes", quietly = TRUE))
install.packages("remotes")
remotes::install_git("https://gitbio.ens-lyon.fr/aduvermy/HTRfit")
method B:
You also have the option to download a release directly from the HTRfit release page. Once you've downloaded the release, simply launch following command.
## -- Example using the HTRfit-v2.0.0 release
install.packages('HTRfit-v2.0.0.tar.gz', repos = NULL, type='source')
When dependencies are met, installation should take a few minutes.
CRAN packages dependencies
The following depandencies are required:
## -- required
install.packages(c('parallel', 'data.table', 'ggplot2', 'gridExtra', 'glmmTMB', 'magrittr', 'MASS', 'reshape2', 'rlang', 'stats', 'utils', 'BiocManager', 'car'))
BiocManager::install('S4Vectors', update = FALSE)
## -- optional
BiocManager::install('DESeq2', update = FALSE)
Docker
We have developed Docker images to simplify the package's utilization.
docker pull ruanad/htrfit:v2.0.0-beta
docker run -it --rm ruanad/htrfit:v2.0.0-beta
HTRfit allows building graphs to visualize results. Inside Docker, displaying a window from the R terminal can be tricky and requires specific settings before running the container. Following commands worked for us:
# Prepare target env
CONTAINER_DISPLAY="0"
CONTAINER_HOSTNAME="htrfit_user"
# Create a directory for the socket
mkdir -p display/socket
touch display/Xauthority
# Get the DISPLAY slot
DISPLAY_NUMBER=$(echo $DISPLAY | cut -d. -f1 | cut -d: -f2)
# Extract current authentication cookie
AUTH_COOKIE=$(xauth list | grep "^$(hostname)/unix:${DISPLAY_NUMBER} " | awk '{print $3}')
# Create the new X Authority file
xauth -f display/Xauthority add ${CONTAINER_HOSTNAME}/unix:${CONTAINER_DISPLAY} MIT-MAGIC-COOKIE-1 ${AUTH_COOKIE}
# Proxy with the :0 DISPLAY
socat UNIX-LISTEN:display/socket/X${CONTAINER_DISPLAY},fork TCP4:localhost:60${DISPLAY_NUMBER} &
# Launch the container
docker run -it --rm \
-e DISPLAY=:${CONTAINER_DISPLAY} \
-e XAUTHORITY=/tmp/.Xauthority \
-v ${PWD}/display/socket:/tmp/.X11-unix \
-v ${PWD}/display/Xauthority:/tmp/.Xauthority \
--hostname ${CONTAINER_HOSTNAME} \
ruanad/htrfit:v2.0.0-beta
## inspired by : https://blog.yadutaf.fr/2017/09/10/running-a-graphical-app-in-a-docker-container-on-a-remote-server/
Biosphere virtual machine
A straightforward way to use HTRfit is to run it on a Virtual Machine (VM) through Biosphere. We recommend utilizing a VM that includes RStudio for an integrated development environment (IDE) experience. Biosphere VM resources can also be scaled according to your simulation needs.
HTRfit can be installed using the method A.
HTRfit simulation workflow
In the realm of RNAseq analysis, various key experimental parameters play a crucial role in influencing the statistical power to detect expression changes. Parameters such as sequencing depth, the number of replicates, and others are expected to impact statistical power. To navigate the selection of optimal values for these experimental parameters, we introduce a comprehensive statistical framework known as HTRfit, underpinned by computational simulation. Moreover, HTRfit offers seamless compatibility with DESeq2 outputs, facilitating a comprehensive evaluation of RNAseq analysis.
Getting started
Download the vignette for more in-depth information about how to use HTRfit.
Init a design and simulate RNAseq data
library('HTRfit')
## -- init a design
input_var_list <- init_variable( name = "varA", mu = 0, sd = 0.29, level = 2000) %>%
init_variable( name = "varB", mu = 0.27, sd = 0.6, level = 2) %>%
add_interaction( between_var = c("varA", "varB"), mu = 0.44, sd = 0.89)
## -- simulate RNAseq data
mock_data <- mock_rnaseq(input_var_list,
n_genes = 6000,
min_replicates = 4,
max_replicates = 4 )
Fit your model
## -- prepare data & fit a model with mixed effect
data2fit = prepareData2fit(countMatrix = mock_data$counts,
metadata = mock_data$metadata,
normalization = F,
response_name = "kij")
l_tmb <- fitModelParallel(formula = kij ~ varB + (varB | varA),
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 8)
The fitModelParallel()
function in HTRfit provides a powerful way to fit models independently for each gene. This allows for efficient parallelization of the modeling process by specifying the n.cores
option. However, it's essential to note that as more cores are utilized, there is a corresponding increase in the required RAM. This is because the data necessary for fitting the model needs to be loaded into memory. Our simulations have demonstrated significant time savings when employing more cores. For instance, using 25 cores was nearly three times faster for processing 6,000 genes and 2,000 experimental conditions (2000 levels in varA, 2 levels in varB - 8000 samples). However, using 50 cores yielded minimal time savings but had a noticeable impact on RAM consumption. Therefore, users must carefully balance computation speed and memory usage when selecting the number of cores.
Diagnostic metrics
The diagnostic_plot()
function allows to plot a diagnostic plot of AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), logLik (log-likelihood), deviance, df.resid (residual degrees of freedom), and dispersion. These metrics provide insights into how well the model fits the data and help in comparing different models. By examining these metrics, users can quickly identify any anomalies or potential issues in the fitting process.
## -- plot all metrics
p <- diagnostic_plot(list_tmb = l_tmb)
The diagnostic metrics show that the fit is not as good for all genes.
Extracts a tidy result table from a list tmb object
The tidy_results function extracts a dataframe containing estimates of ln(fold changes), standard errors, test statistics, p-values, and adjusted p-values for fixed effects. Additionally, it provides access to correlation terms and standard deviations for random effects, offering a detailed view of HTRfit modeling results.
## -- get tidy results
tidy_res <- tidy_results(l_tmb, coeff_threshold = 0.1, alternative_hypothesis = "greater")
Evaluation
## -- evaluation
resSimu <- evaluation_report(list_tmb = l_tmb,
mock_obj = mock_data,
coeff_threshold = 0.27,
alt_hypothesis = "greater")
The identity plot, generated by the evaluation_report()
function, provides a visual means to compare the effects used in the simulation (actual effects) with those inferred by the model. This graphical representation facilitates the assessment of the correspondence between the values of the simulated effects and those estimated by the model, allowing for a visual analysis of the model’s goodness of fit to the simulated data. For a quantitative evaluation of inference quality, the R-squared (R2) metric can be employed.
The evaluation of model performance in distinguishing between differentially expressed and non-differentially expressed genes relies on the area under the ROC curve (AUC), a comprehensive metric offering a singular summary of the model's overall effectiveness. A higher AUC is indicative of superior model performance. It is noteworthy that we not only calculate the AUC for the ROC curve but also extend this assessment to the Precision-Recall (PR) curve. In the case of the ROC curve, this AUC should be compared to the value of 0.5, representing a random classifier. On the other hand, for the PR curve, we compute the pr_AUC_random, serving as a baseline for comparison. This dual evaluation approach ensures a thorough understanding of the model's discrimination capabilities under different scenarios, providing valuable insights into its robustness and reliability.
In addition to evaluating model performance through the AUC for both ROC and PR curves, we provide access to key classification metrics, including Accuracy, Precision, Recall (or Sensitivity), and Specificity. These metrics offer a comprehensive view of the model's classification capabilities.
All metrics are computed for each parameter (excluding the intercept when skip_eval_intercept = TRUE), providing detailed insights into individual parameter contributions. Furthermore, an aggregated assessment is performed, considering all parameters (except the intercept by default), offering a perspective on the model's overall classification effectiveness.
Comparing performances of HTRfit & DESeq2
HTRfit offers a wrapper for DESeq2 outputs. This functionality allows users to seamlessly integrate the results obtained from DESeq2 into the HTRfit analysis pipeline. By doing so, you can readily compare the performance of HTRfit with DESeq2 on your RNAseq data. This comparative analysis aids in determining which tool performs better for your specific research goals and dataset.
## -- init a design
input_var_list <- init_variable( name = "varA", mu = 0, sd = 0.29, level = 2) %>%
init_variable( name = "varB", mu = 0.27, sd = 0.6, level = 2) %>%
add_interaction( between_var = c("varA", "varB"), mu = 0.44, sd = 0.89)
N_GENES = 1000
MIN_REPLICATES = 4
MAX_REPLICATES = 4
BASAL_EXPR = 2
SEQ_DEPTH = 1e+7
## -- simulate RNAseq data
mock_data <- mock_rnaseq(input_var_list,
n_genes = N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = BASAL_EXPR,
sequencing_depth = SEQ_DEPTH )
## -- prepare data & fit a model with HTRfit
data2fit = prepareData2fit(countMatrix = mock_data$counts,
metadata = mock_data$metadata,
normalization = F)
l_tmb <- fitModelParallel(formula = kij ~ varA + varB + varA:varB,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 8)
## -- DESeq2
library(DESeq2)
dds <- DESeq2::DESeqDataSetFromMatrix(
countData = round(mock_data$counts),
colData = mock_data$metadata,
design = ~ varA + varB + varA:varB )
dds <- DESeq2::DESeq(dds, quiet = TRUE)
## -- get simulation/fit evaluation
resSimu <- evaluation_report(list_tmb = l_tmb,
dds = dds,
mock_obj = mock_data,
coeff_threshold = 0.4,
alt_hypothesis = "greaterAbs")
*** ADD plot ***