Skip to content
Snippets Groups Projects
Commit dffa5912 authored by Arnaud Duvermy's avatar Arnaud Duvermy
Browse files

rm vignettes temporary

Former-commit-id: d00216303674b79b032d664ed42133be8d1f56fc
Former-commit-id: 58153b8018d6b6f5d8dffc93a37eb7862d922e8f
Former-commit-id: 2f8239590ec92f3eef17ff52a5de2e9df6e7a3c6
parent 37fc3633
No related branches found
No related tags found
No related merge requests found
---
title: "Theory behind HTRfit"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Theory behind HTRfit}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "../man/figures/"
)
```
In the realm of RNAseq analysis, various key experimental parameters play a crucial role in influencing the statistical power to detect expression changes. Parameters such as sequencing depth, the number of replicates, and others are expected to impact statistical power.
To navigate the selection of optimal values for these experimental parameters, we introduce a comprehensive statistical framework known as HTRfit, underpinned by computational simulation. Moreover, HTRfit offers seamless compatibility with DESeq2 outputs, facilitating a comprehensive evaluation of RNAseq analysis.
# HTRfit simulation workflow
<div id="bg" align="center">
<img src="../reference/figures/htrfit_workflow.png" width="500" height="300">
</div>
In this modeling framework, counts denoted as $K_{ij}$ for gene i and sample j are generated using a negative binomial distribution. The negative binomial distribution considers a fitted mean $\mu_{ij}$ and a gene-specific dispersion parameter $dispersion_i$.
The fitted mean $\mu_{ij}$ is determined by a parameter, $q_{ij}$, which is proportionally related to the sum of all effects specified using `init_variable()` or `add_interaction()`. If basal gene expressions are provided, the $q_{ij}$ values are scaled accordingly using the gene-specific basal expression value ($bexpr_i$).
Furthermore, the coefficients $\beta_i$ represent the natural logarithm fold changes for gene i across each column of the model matrix X. The dispersion parameter $dispersion_i$ plays a crucial role in defining the relationship between the variance of observed counts and their mean value. In simpler terms, it quantifies how far we expect observed counts to deviate from the mean value for each genes.
In addition, HTRfit allows for sequencing depth control using a scalar value specific to each sample ($s_j$) applied on the $\mu_{ij}$ value.
\ No newline at end of file
---
title: "Simulation tutorial"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Simulation tutorial}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(HTRfit)
```
## Initializing design
In this setup, we will use HTRfit to simulate an experimental design involving multiple genotypes and environmental conditions. To do so, we will first need to estimate plausible parameter distributions based on publicly available data.
### Loading public dataset
Counts data from the [SRP217588](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135473) project are embedded in HTRfit package. These data were processed by an RNAseq pipeline using Kallisto to generate count data. This dataset includes count data for 4 distinct environments, 2 genotypes in 6000 genes, providing a valuable resource for estimating distribution parameters.
```{r loading_pub_data, warning = FALSE, message = FALSE}
## -- public data embedded in HTRfit
pub_data_fn <- system.file("extdata", "rna_pub_data.tsv", package = "HTRfit")
pub_data <- read.table(file = pub_data_fn, header = TRUE)
rownames(pub_data) <- pub_data$gene_id
pub_data$gene_id <- NULL
pub_metadata_fn <- system.file("extdata", "rna_pub_metadata.tsv", package = "HTRfit")
pub_metadata <- read.table(file = pub_metadata_fn, header = TRUE)
## -- 6713 genes & 40 samples
dim(pub_data)
```
### Fit a model on a public dataset
To estimate plausible parameter distributions, we will fit a generalized linear model (GLM) to the publicly available data. This GLM will allow us to estimate plausible coefficients betaG, betaE, and betaGxE for 6000 genes. The `tidy_pub_fit` object contains the estimate obtain by HTRfit for each coefficient and each genes.
```{r fit_pub_data, eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
# -- prepare data & fit a model
data2fit = prepareData2fit(countMatrix = pub_data,
metadata = pub_metadata,
normalization = 'MRN',
response_name = "kij",
transform = "x+1", ## apply + 1 to each counts
row_threshold = 10) ## remove genes i with all(k_ij) < 10
l_tmb <- fitModelParallel(
formula = kij ~ genotype + environment + genotype:environment,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1)
## -- Select best fit base on AIC results
l_genes <- identifyTopFit(l_tmb)
## -- extract results for best genes
tidy_pub_fit <- tidy_results(l_tmb[l_genes])
```
### Estimating parameter distributions
#### Beta coefficients parameters
Now let's examine the distributions of the estimated parameters. Means, standard deviation and correlation between these parameters are inferred using the following lines :
```{r load_fit_pub_data, include = FALSE}
## -- hidded in vignette
## -- allows to not evaluate previous chunk (a bit long)
tidy_pub_fn <- system.file("extdata", "tidy_pub_results.tsv", package = "HTRfit")
tidy_pub_fit <- read.table(file = tidy_pub_fn, header = TRUE)
dispersion_pub_fn <- system.file("extdata", "pub_dispersion.tsv", package = "HTRfit")
dispersion_pub <- read.table(file = dispersion_pub_fn, header = TRUE)$x
```
```{r beta_pub_data, warning = FALSE, message = FALSE}
# -- Obtain the parameters
all_beta_pub <- reshape2::dcast(tidy_pub_fit, formula = ID ~ term, value.var = "estimate")
columns_env <- c("environmentenv2","environmentenv3", "environmentenv4" )
beta_env <- rowMeans(all_beta_pub[, columns_env])
columns_interaction <- c("genotype2:environmentenv2","genotype2:environmentenv3", "genotype2:environmentenv4" )
beta_interaction <- rowMeans( all_beta_pub[ , columns_interaction] )
# -- Group the parameters
grouped_beta_pub <- data.frame( betaG = all_beta_pub$genotype2,
betaE = beta_env,
betaGxE = beta_interaction )
# -- Obtain statistics for the parameters
beta0_pub = all_beta_pub$`(Intercept)`
## -- public effect standard deviation
sapply(grouped_beta_pub, sd)
quantile(beta0_pub, prob=c(.33,.66,1))
hist(beta0_pub)
```
#### Dispersion parameter
We also save the dispersion parameter obtain from the public dataset.
```{r pub_dispersion, warning = FALSE, message = FALSE, eval = FALSE}
dispersion_pub <- glance_tmb(l_tmb)$dispersion
```
## Mimic the public RNAseq data with HTRfit
To reproduce the design of the public [SRP217588](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135473) project, 2 levels of genotype and 4 levels of environment are initialized using `init_variable()`. By using `add_interaction()`, interaction effects between genotype and environment (betaGxE) are set. For all of them, mean and standard deviation are set concordantly to values obtained from the public dataset coefficient distribution (betaG, betaE, betaGxE).
By using the estimated intercept obtained from the previous analysis of publicly available data as the basal expression level for each gene in the simulated dataset, we ensure that the basal expression levels for our simulated genes are consistent with those in the publicly available data.
```{r design_init, warning = FALSE, message = FALSE}
## -- design to simulate
N_GENES <- 100
MIN_REPLICATES <- 5
MAX_REPLICATES <- 5
SEQ_DEPTH <- 5e6
## -- init effects to simulate
input_var_list <- init_variable( name = "genotype", mu = 0, sd = 2.18, level = 2) %>%
init_variable( name = "environment", mu = 0, sd = 0.57, level = 4 ) %>%
add_interaction(between_var = c("genotype", "environment"), mu = 0 , sd = 1.018)
## -- simulate RNAseq data
mock_data <- mock_rnaseq(input_var_list,
n_genes = N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = beta0_pub,
sequencing_depth = SEQ_DEPTH,
dispersion = dispersion_pub,
normal_distr = 'univariate' ) ## for fix effect
```
## Fit and evaluate a model on simulated RNAseq data
### Fit model
We fit a model to the simulated RNAseq data using the `fitModelParallel()` function. Our model includes terms for genotype, environment, and the interaction between genotype and environment.
```{r fit_model, warning = FALSE, message = FALSE }
## -- prepare data & fit a model with mixed effect
data2fit = prepareData2fit(countMatrix = mock_data$counts,
metadata = mock_data$metadata,
normalization = 'MRN',
response_name = "kij",
transform = "x+1", ## apply + 1 to each counts
row_threshold = 10) ## remove genes i with all(k_ij) < 10
l_tmb <- fitModelParallel(
formula = kij ~ genotype + environment + genotype:environment,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, eval.max=1e5)))
```
### Evaluate fit
Next, we evaluate the fitting results obtained from the simulated RNAseq data by comparing the expected results computed by `evaluation_report()`, with those obtained with `fitModelParallel()` : `l_tmb`.
```{r evaluation, warning = FALSE, message = FALSE}
## -- eval params for Wald test
ln_FC_threshold <- 0.67
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
list_gene = l_genes,
mock_obj = mock_data,
coeff_threshold = ln_FC_threshold,
alt_hypothesis = alternative_hypothesis )
```
## Explore evaluation results
### Identity plot
The `evaluation_report()` function produces an identity plot, offering a visual tool to juxtapose the simulated effects (actual effects) with the model-inferred effects. This graphical representation simplifies assessing how well the model aligns with the values of the simulated effects, enabling a visual analysis of the model's goodness of fit to the simulated data. For a more quantitative evaluation of inference quality, the R-squared (R2) and RMSE metrics can be employed.
### Model parameters
```{r id_params , warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
## -- Model params
resSimu$identity$params
```
### Dispersion parameter
The dispersion plot, generated by the `evaluation_report()` function, offers a visual comparison of the dispersion parameters used in the simulation \(dispersion_i\) with those estimated by the model. This graphical representation provides an intuitive way to assess the alignment between the simulated dispersion values and the model-inferred values, enabling a visual evaluation of how well the model captures the underlying data characteristics.
```{r id_disp, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=5}
## -- Dispersion params
resSimu$identity$dispersion
```
### ROC curve
The Receiver Operating Characteristic (ROC) curve is a valuable tool for assessing the performance of classification models, particularly in the context of identifying differentially expressed genes. It provides a graphical representation of the model's ability to distinguish between genes that are differentially expressed and those that are not, by varying the `coeff_threshold` and the `alt_hypothesis` parameters.
```{r roc, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
## -- ROC curve
resSimu$roc$params
```
### Precision-recall curve
The precision-recall curve (PR curve) illustrates the relationship between precision and recall at various classification thresholds. It is particularly valuable in the context of imbalanced data, where one class is significantly more prevalent than the other. Unlike the ROC curve, which can be influenced by class distribution, the PR curve focuses on the model's ability to correctly identify examples of the minority class while maintaining high precision.
```{r pr_curve, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
## -- precision recall by params
resSimu$precision_recall$params
```
### AUC and classification performance metrics
The area under the ROC curve (AUC) provides a single metric that summarizes the model's overall performance in distinguishing between differentially expressed and non-differentially expressed genes. A higher AUC indicates better model performance. In the case of the ROC curve, this AUC should be compared to the value of 0.5, representing a random classifier. On the other hand, for the PR curve, we compute the pr_AUC_random, serving as a baseline for comparison.
In addition to evaluating model performance through the AUC for both ROC and PR curves, we provide access to key classification metrics, including Accuracy, Precision, Recall (or Sensitivity), and Specificity. These metrics offer a comprehensive view of the model's classification capabilities. These metrics are computed for each parameter (excluding the intercept when skip_eval_intercept = TRUE), providing detailed insights into individual parameter contributions. Furthermore, an aggregated assessment is performed, considering all parameters (except the intercept by default), offering a perspective on the model's overall classification effectiveness.
```{r perf, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
## -- precision-recall curve
resSimu$performances$byparams
```
## Mimic the public RNAseq data with HTRfit - increasing GxE interaction
To initialized GxE interaction to 0, set `sd = 0`.
```{r design_init_+GxE, warning = FALSE, message = FALSE}
## -- design to simulate
N_GENES <- 100
MIN_REPLICATES <- 5
MAX_REPLICATES <- 5
SEQ_DEPTH <- 5e6
## -- init effects to simulate
input_var_list <- init_variable( name = "genotype", mu = 0, sd = 2.18, level = 2) %>%
init_variable( name = "environment", mu = 0, sd = 0.57, level = 4 ) %>%
add_interaction(between_var = c("genotype", "environment"), mu = 0 , sd = 3)
## -- simulate RNAseq data
mock_data <- mock_rnaseq(input_var_list,
n_genes = N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = beta0_pub,
sequencing_depth = SEQ_DEPTH,
dispersion = dispersion_pub,
normal_distr = 'univariate' ) ## for fix effect
```
### Fit and evaluate a model on simulated RNAseq data
We fit a model to the simulated RNAseq data using the `fitModelParallel()` function. Our model includes terms for genotype, environment, and the interaction between genotype and environment (GxE).
```{r fit_model_+GxE, warning = FALSE, message = FALSE }
## -- prepare data & fit a model with mixed effect
data2fit = prepareData2fit(countMatrix = mock_data$counts,
metadata = mock_data$metadata,
normalization = 'MRN',
response_name = "kij",
transform = "x+1", ## apply + 1 to each counts
row_threshold = 10) ## remove genes i with all(k_ij) < 10
l_tmb <- fitModelParallel(
formula = kij ~ genotype + environment + genotype:environment,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, eval.max=1e5)))
```
### Evaluate fit
Next, we evaluate the fitting results obtained from the simulated RNAseq data by comparing the expected results computed by `evaluation_report()`, with those obtained with `fitModelParallel()` : `l_tmb`.
```{r evaluation_+GxE, warning = FALSE, message = FALSE}
## -- eval params for Wald test
ln_FC_threshold <- 0.67
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
list_gene = l_genes,
mock_obj = mock_data,
coeff_threshold = ln_FC_threshold,
alt_hypothesis = alternative_hypothesis)
```
The identity plot allows direct appreciation of the absence of GxE interaction.
```{r id_params_+GxE , warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
## -- Model params
resSimu$identity$params
```
## Mimic the public RNAseq data with HTRfit - without GxE interaction
To initialized GxE interaction to 0, set `sd = 0`.
```{r design_init_0GxE, warning = FALSE, message = FALSE}
## -- design to simulate
N_GENES <- 100
MIN_REPLICATES <- 5
MAX_REPLICATES <- 5
SEQ_DEPTH <- 5e6
## -- init effects to simulate
input_var_list <- init_variable( name = "genotype", mu = 0, sd = 2.18, level = 2) %>%
init_variable( name = "environment", mu = 0, sd = 0.57, level = 4 ) %>%
add_interaction(between_var = c("genotype", "environment"), mu = 0 , sd = 0) ## sd = 0
## -- simulate RNAseq data
mock_data <- mock_rnaseq(input_var_list,
n_genes = N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = beta0_pub,
sequencing_depth = SEQ_DEPTH,
dispersion = dispersion_pub,
normal_distr = 'univariate' ) ## for fix effect
```
### Fit and evaluate a model on simulated RNAseq data
We fit a model to the simulated RNAseq data using the `fitModelParallel()` function. Our model includes terms for genotype, environment, and the interaction between genotype and environment.
```{r fit_model_0GxE, warning = FALSE, message = FALSE }
## -- prepare data & fit a model with mixed effect
data2fit = prepareData2fit(countMatrix = mock_data$counts,
metadata = mock_data$metadata,
normalization = 'MRN',
response_name = "kij",
transform = "x+1", ## apply + 1 to each counts
row_threshold = 10) ## remove genes i with all(k_ij) < 10
l_tmb <- fitModelParallel(
formula = kij ~ genotype + environment + genotype:environment,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, eval.max=1e5)))
```
Next, we evaluate the fitting results obtained from the simulated RNAseq data by comparing the expected results computed by `evaluation_report()`, with those obtained with `fitModelParallel()` : `l_tmb`.
```{r evaluation_0GxE, warning = FALSE, message = FALSE}
## -- eval params for Wald test
ln_FC_threshold <- 0.67
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
list_gene = l_genes,
mock_obj = mock_data,
coeff_threshold = ln_FC_threshold,
alt_hypothesis = alternative_hypothesis)
```
The identity plot allows direct appreciation of the absence of GxE interaction.
```{r id_params_0GxE , warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
## -- Model params
resSimu$identity$params +
ggplot2::theme(axis.text.x = ggplot2::element_text(size = 7, angle = 45,
vjust = 0.9, hjust=1))
```
\ No newline at end of file
---
title: "RNAseq analysis"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{RNAseq analysis}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(HTRfit)
```
In RNAseq, we employ Generalized Linear Models (GLM) to unravel how genes respond to various experimental conditions. These models assist in deciphering the specific impacts of experimental variables on gene expression.HTRfit can be utilized to analyze such RNAseq data, providing a robust framework for exploring and interpreting the intricate relationships between genes and experimental conditions.
## Input data
HTRfit analysis necessitates a count matrix and sample metadata, in the form of dataframes.
Notice that gene_id have to be specified as rownames of `count_matrix`.
```{r create_data , include = FALSE}
## -- hided in vignette
## -- simulate small example to prevent excessively lengthy vignette construction
list_var <- init_variable( name = "genotype", mu = 3, sd = 0.2, level = 2) %>%
init_variable( name = "environment", mu = 2, sd = 0.43, level = 2) %>%
add_interaction( between_var = c("genotype", "environment"), mu = 0.44, sd = 0.2)
N_GENES = 30
MIN_REPLICATES = 4
MAX_REPLICATES = 4
BASAL_EXPR = 3
mock_data <- mock_rnaseq(list_var, N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = BASAL_EXPR)
########################
## -- data from simulation or real data
count_matrix <- mock_data$counts
metaData <- mock_data$metadata
##############################
```
```{r display_input }
## -- gene count matrix
count_matrix[1:4, 1:2]
## -- samples metadata
head(metaData)
```
## Prepare data for fitting
The `prepareData2fit()` function serves the purpose of converting the counts matrix and sample metadata into a dataframe that is compatible with downstream **HTRfit** functions designed for model fitting. This function also includes an option to perform median ratio normalization and transformation on the data counts.
```{r example-prepareData, warning = FALSE, message = FALSE}
## -- convert counts matrix and samples metadatas in a data frame for fitting
data2fit = prepareData2fit(
countMatrix = count_matrix,
metadata = metaData,
normalization = NULL,
response_name = "kij")
## -- median ratio normalization
data2fit = prepareData2fit(
countMatrix = count_matrix,
metadata = metaData,
normalization = 'MRN',
response_name = "kij")
## -- apply + 1 transformation on each counts
data2fit = prepareData2fit(
countMatrix = count_matrix,
metadata = metaData,
normalization = 'MRN',
transform = "x+1",
response_name = "kij")
```
## Fit model from your data
The `fitModelParallel()` function enables independent model fitting for each gene. The number of threads used for this process can be controlled by the `n.cores` parameter.
```{r example-fitModelParallel, warning = FALSE, message = FALSE}
l_tmb <- fitModelParallel(
formula = kij ~ genotype + environment + genotype:environment,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1)
```
## Use mixed effect in your model
HTRfit uses the **glmmTMB** functions for model fitting algorithms. This choice allows for the utilization of random effects within your formula design. For further details on how to specify your model, please refer to the [mixed model documentation](https://rdrr.io/cran/glmmTMB/man/glmmTMBControl.html).
```{r example-fitModelParallel_mixed, warning = FALSE, message = FALSE}
l_tmb <- fitModelParallel(
formula = kij ~ genotype + ( 1 | environment ),
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1)
```
## Additional settings
The function provides precise control over model settings for fitting optimization, including options for specifying the [model family](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/family) and [model control setting](https://rdrr.io/cran/glmmTMB/man/glmmTMBControl.html). By default, a Gaussian family model is fitted, but for RNA-seq data, it is highly recommended to specify `family = glmmTMB::nbinom2(link = "log")`.
```{r example-fitModelParallel_addSet, warning = FALSE, message = FALSE}
l_tmb <- fitModelParallel(
formula = kij ~ genotype + environment + genotype:environment,
data = data2fit,
group_by = "geneID",
n.cores = 1,
family = glmmTMB::nbinom2(link = "log"),
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5,
eval.max=1e5)))
```
## Extracts a tidy result table from a list tmb object
The tidy_results function extracts a data frame containing estimates of ln(fold changes), standard errors, test statistics, p-values, and adjusted p-values for fixed effects. Additionally, it provides access to correlation terms and standard deviations for random effects, offering a detailed view of HTRfit modeling results.
```{r example-tidyRes, warning = FALSE, message = FALSE}
## -- get tidy results
my_tidy_res <- tidy_results(l_tmb, coeff_threshold = 0.1,
alternative_hypothesis = "greaterAbs")
## -- head
my_tidy_res[1:3,]
```
## Update fit
The `updateParallel()` function updates and re-fits a model for each gene. It offers options similar to those in `fitModelParallel()`. In addition, it is possible to modify the reference level of the categorical variable used in your model in order to use different contrast.
```{r example-update, warning = FALSE, message = FALSE}
## -- update your fit modifying the model family
l_tmb <- updateParallel(
formula = kij ~ genotype + environment + genotype:environment,
list_tmb = l_tmb ,
family = gaussian(),
n.cores = 1)
## -- update fit using additional model control settings
l_tmb <- updateParallel(
formula = kij ~ genotype + environment + genotype:environment ,
list_tmb = l_tmb ,
family = gaussian(),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e3,
eval.max=1e3)))
## -- update your model formula and your family model
l_tmb <- updateParallel(
formula = kij ~ genotype + environment ,
list_tmb = l_tmb ,
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1)
## -- modif reference levels
## -- genotype reference = "genotype2"
## -- environment reference = "environment2"
l_tmb <- updateParallel(
formula = kij ~ genotype + environment ,
list_tmb = l_tmb ,
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
reference_labels = c("genotype2", "environment2"))
```
#### Struture of list tmb object
```{r example-str_obj_l_tmb, warning = FALSE, message = FALSE}
str(l_tmb$gene1, max.level = 1)
```
## Plot fit metrics
Visualizing fit metrics is essential for evaluating your models. Here, we show you how to generate various plots to assess the quality of your models. You can explore all metrics or focus on specific aspects like dispersion and log-likelihood.
```{r example-plotMetrics, warning = FALSE, message = FALSE, fig.align = 'center', fig.height = 4, fig.width = 8}
## -- plot all metrics
diagnostic_plot(list_tmb = l_tmb)
```
```{r example-plotMetricsFocus, warning = FALSE, message = FALSE, fig.align = 'center', fig.height = 3, fig.width = 8}
## -- Focus on metrics
diagnostic_plot(list_tmb = l_tmb, focus = c("dispersion", "logLik"))
```
## Select best fit based on diagnostic metrics
The identifyTopFit function allows for selecting genes that best fit models, based on various metrics such as AIC, BIC, LogLik, deviance, dispersion . This function provides the flexibility to employ multiple filtering methods to identify genes most relevant for further analysis. We implemented a filtering method based on the Median Absolute Deviation (MAD). For example, using the MAD method with a tolerance threshold of 3, the function identifies genes whose metric values are greater ("top") or lower ("low") than `median(metric) - 3 * MAD(metric)`. The keep option allows choosing whether to retain genes with metric values above or below the MAD threshold.
```{r example-identifytopfit, warning = FALSE, message = FALSE}
identifyTopFit(list_tmb = l_tmb, metric = "AIC",
filter_method = "mad", keep = "top",
sort = TRUE, decreasing = TRUE,
mad_tolerance = 3)
```
## Anova to select the best model
Utilizing the `anovaParallel()` function enables you to perform model selection by assessing the significance of the fixed effects. You can also include additional parameters like type. For more details, refer to [car::Anova](https://rdrr.io/cran/car/man/Anova.html).
```{r example-anova, warning = FALSE, message = FALSE}
## -- update your fit modifying the model family
l_anova <- anovaParallel(list_tmb = l_tmb)
## -- additional settings
l_anova <- anovaParallel(list_tmb = l_tmb, type = "III" )
```
---
title: "Benchmarking HTRfit and DESeq2"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Benchmarking HTRfit and DESeq2}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup, warning = FALSE, message = FALSE, results='hide'}
library(HTRfit)
library(DESeq2)
```
HTRfit offers a wrapper for DESeq2 outputs. This functionality allows users to seamlessly integrate the results obtained from DESeq2 into the HTRfit evaluation pipeline. By doing so, you can readily compare the performance of HTRfit with DESeq2 on your RNAseq data. This comparative analysis aids in determining which tool performs better for your specific research goals and dataset
## Simulation
```{r design_init, warning = FALSE, message = FALSE}
## -- init a design
list_var <- init_variable( name = "genotype", mu = 0, sd = 0.29, level = 2) %>%
init_variable( name = "environment", mu = 0.27, sd = 0.6, level = 4) %>%
add_interaction( between_var = c("genotype", "environment"), mu = 0.44, sd = 0.89)
N_GENES <- 100
MIN_REPLICATES <- 4
MAX_REPLICATES <- 4
SEQ_DEPTH <- 5e6
## -- simulate RNAseq data
mock_data <- mock_rnaseq(list_var,
n_genes = N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = 2,
sequencing_depth = SEQ_DEPTH)
```
## Fit models
```{r data2fit, warning = FALSE, message = FALSE, results = 'hide'}
## -- data from simulation or real data
count_matrix <- mock_data$counts
metaData <- mock_data$metadata
```
#### HTRfit
```{r prepareData_and_fit, warning = FALSE, message = FALSE}
## -- convert counts matrix and samples metadatas in a data frame for fitting
data2fit = prepareData2fit(countMatrix = count_matrix,
metadata = metaData,
normalization = 'MRN',
response_name = "kij")
l_tmb <- fitModelParallel(
formula = kij ~ genotype + environment + genotype:environment,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1)
```
#### DESeq2
```{r fit_dds, warning = FALSE, message = FALSE, results = 'hide'}
## -- DESeq2
dds <- DESeq2::DESeqDataSetFromMatrix(
countData = count_matrix,
colData = metaData,
design = ~ genotype + environment + genotype:environment )
dds <- DESeq2::DESeq(dds, quiet = TRUE)
```
## Evaluation
```{r example-ddsComparison, warning = FALSE, message = FALSE}
## -- get simulation/fit evaluation
resSimu <- evaluation_report(list_tmb = l_tmb,
dds = dds,
mock_obj = mock_data,
coeff_threshold = 0.4,
alt_hypothesis = "greaterAbs")
```
```{r example-outputResSimu_id, warning = FALSE, message = FALSE, fig.align = 'center', fig.height = 4, fig.width = 5}
## -- Model params
resSimu$identity$params
## -- dispersion
resSimu$identity$dispersion
```
```{r example-outputResSimu_metric, warning = FALSE, message = FALSE, fig.align = 'center', fig.height = 4, fig.width = 7}
## -- precision-recall curve
resSimu$precision_recall$params
## -- ROC curve
resSimu$roc$params
## -- Performances metrics
resSimu$performances
```
---
title: "Object - list var"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Object - list var}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(HTRfit)
```
The `init_variable()` function is used for defining the variables in your experimental design. Sizes of effects for each variable and interaction term can be defined in two different ways: 1) The user can manually sets values of all levels of a variable, in which case the effects are necessarily considered fixed in the model; 2) The effects can be randomly picked in a normal distribution with mean and standard deviation defined by the user, in which case the user can decide whether the effects are considered fixed or random in the model.
It is crucial to highlight that the choice of these parameters plays a critical role in the simulation process. For optimal results, we recommend basing these decisions on real data, as outlined in the [Simulation tutorial](articles/02-tutorial.html)
## Manually init a single variable
The `init_variable()` function allows for precise control over the variables in your experimental design.
In this example, we manually initialize **varA** with specific size effects (mu) for three levels.
```{r example-init_variable_man, warning = FALSE, message = FALSE}
list_var <- init_variable( name = "varA", mu = c(0.2, 4, -3), level = 3)
```
## Randomly init a single variable
Alternatively, you can randomly initialize **varA** by specifying a mean (mu) and standard deviation (sd).
This introduces variability into **varA**, making it either a fixed or random effect in your design.
```{r example-init_variable_rand, warning = FALSE, message = FALSE}
list_var <- init_variable( name = "varA", mu = 10, sd = 0.2, level = 5)
```
## Randomly init several variables
You can also initialize multiple variables, such as **varA** and **varB**, with random values.
This flexibility allows you to create diverse experimental designs.
```{r example-init_variable_mult, warning = FALSE, message = FALSE}
list_var <- init_variable( name = "varA", mu = 10, sd = 0.2, level = 5) %>%
init_variable( name = "varB", mu = -3, sd = 0.34, level = 2)
```
## Add interaction between variable
Similarly to `init_variable()`, `add_interaction()` allows to define an interaction between variable.
In this example, we initialize **varA** and **varB**, and create an interaction between **varA**, and **varB** using `add_interaction()`.
```{r example-add_interaction, warning = FALSE, message = FALSE}
list_var <- init_variable( name = "varA", mu = 3, sd = 0.2, level = 2) %>%
init_variable( name = "varB", mu = 2, sd = 0.43, level = 2) %>%
add_interaction( between_var = c("varA", "varB"), mu = 0.44, sd = 0.2)
```
## Complex design
Interactions can involve a maximum of three variables, such as **varA**, **varB**, and **varC**.
```{r example-add_interaction_complex, eval = FALSE, message = FALSE, warning = FALSE, include = TRUE}
list_var <- init_variable( name = "varA", mu = 5, sd = 0.2, level = 2) %>%
init_variable( name = "varB", mu = 1, sd = 0.78, level = 2) %>%
init_variable( name = "varC", mu = c(2, 3), sd = NA, level = 2) %>%
add_interaction( between_var = c("varA", "varC"), mu = 0.44, sd = 0.2) %>%
add_interaction( between_var = c("varA", "varB"), mu = 0.43, sd = 0.37) %>%
add_interaction( between_var = c("varB", "varC"), mu = -0.33, sd = 0.12) %>%
add_interaction( between_var = c("varA", "varB" ,"varC"), mu = 0.87, sd = 0.18)
```
## Control correlation between effects
HTRfit offers a convenient way to control the correlation between variables during simulation.
The following code demonstrates how variables, initialized with <ins> random values using mu and sd </ins>, can be further refined by setting the correlation between these variables. If the `set_correlation()` function is not used, the default correlation value is 0.
```{r example-set_corr, warning = FALSE, message = FALSE}
list_var <- init_variable( name = "varA", mu = 3, sd = 0.2, level = 2) %>%
init_variable( name = "varB", mu = c(1,3), level = 2) %>%
add_interaction( between_var = c("varA", "varB"), mu = 0.44, sd = 0.2) %>%
set_correlation(between_var = c("varA", "varA:varB"), corr = -0.99)
## -- simulate RNAseq data
mock_data <- mock_rnaseq(list_var,
n_genes = 100,
min_replicates = 4,
max_replicates = 4)
```
Observe the -0.99 correlation in the `mock_obj`:
```{r simu_corr_var}
plot(mock_data$groundTruth$effects$varA,
mock_data$groundTruth$effects$`varA:varB`,
xlab = "varA", ylab= "varA:varB")
```
## Structure of list_var object
The list_var object collected all the information needed to generate a [`mock_rnaseq` object](06-mock_rnaseq_object.html).
```{r example-str_obj_init, warning = FALSE, message = FALSE}
str(list_var)
```
---
title: "Object - Mock rnaseq"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Object - Mock rnaseq }
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(HTRfit)
```
In this section, you will explore how to generate RNAseq data based on the [`list var`](05-list_var_object.html) object. The `mock_rnaseq()` function enables you to manage parameters in your RNAseq design, including number of genes, minimum and maximum number of replicates within your experimental setup, sequencing depth, basal expression of each gene, and dispersion of gene expression used for simulating counts.
## Minimal example
```{r example-mock_rnaseq_min, warning = FALSE, message = FALSE}
list_var <- init_variable( name = "varA", mu = 3, sd = 0.2, level = 2) %>%
init_variable( name = "varB", mu = 2, sd = 0.43, level = 2) %>%
add_interaction( between_var = c("varA", "varB"), mu = 0.44, sd = 0.2)
## -- Required parameters
N_GENES = 6000
MIN_REPLICATES = 2
MAX_REPLICATES = 10
########################
## -- simulate RNAseq data based on list_var, minimum input required
## -- number of replicate randomly defined between MIN_REP and MAX_REP
mock_data <- mock_rnaseq(list_var, N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES)
## -- simulate RNAseq data based on list_var, minimum input required
## -- Same number of replicates between conditions
mock_data <- mock_rnaseq(list_var, N_GENES,
min_replicates = MAX_REPLICATES,
max_replicates = MAX_REPLICATES)
```
## Scaling genes counts based on sequencing depth
Sequencing depth is a critical parameter affecting the statistical power of an RNAseq analysis. With the `sequencing_depth` option in the `mock_rnaseq()` function, you have the ability to control this parameter.
```{r example-mock_rnaseq_seqDepth, warning = FALSE, message = FALSE}
## -- Required parameters
N_GENES = 6000
MIN_REPLICATES = 2
MAX_REPLICATES = 10
########################
SEQ_DEPTH = c(100000, 5000000, 10000000)## -- Possible number of reads/sample
SEQ_DEPTH = 10000000 ## -- all samples have same number of reads
mock_data <- mock_rnaseq(list_var, N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
sequencing_depth = SEQ_DEPTH)
```
## Set dispersion of gene expression
The dispersion parameter ($dispersion_i$), characterizes the relationship between the variance of the observed read counts and its mean value. In simple terms, it quantifies how much we expect observed counts to deviate from the mean value. You can specify the dispersion for individual genes using the dispersion parameter.
```{r example-mock_rnaseq_disp, warning = FALSE, message = FALSE}
## -- Required parameters
N_GENES = 6000
MIN_REPLICATES = 2
MAX_REPLICATES = 4
########################
DISP = 0.1 ## -- Same dispersion for each genes
DISP = 1000 ## -- Same dispersion for each genes
DISP = runif(N_GENES, 0, 1000) ## -- Dispersion can vary between genes
mock_data <- mock_rnaseq(list_var, N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
dispersion = DISP )
```
## Set basal gene expression
The basal gene expression parameter, defined using the basal_expression option, allows you to control the fundamental baseline of expression level of genes. When a single value is specified, the basal expression of all genes is the same and the effects of variables defined in list_var are applied from this basal expression level. When several values are specified, the basal expression of each gene is picked randomly among these values (with replacement). The effects of variables are then applied from the basal expression of each gene. This represents the fact that expression levels can vary among genes even when not considering the effects of the defined variables (for example, some genes are constitutively more highly expressed than others because they have stronger basal promoters).
```{r example-mock_rnaseq_bexpr, warning = FALSE, message = FALSE}
## -- Required parameters
N_GENES = 6000
MIN_REPLICATES = 10
MAX_REPLICATES = 10
########################
BASAL_EXPR = -3 ## -- Value can be negative to simulate low expressed gene
BASAL_EXPR = 2 ## -- Same basal gene expression for the N_GENES
BASAL_EXPR = c( -3, -1, 2, 8, 9, 10 ) ## -- Basal expression can vary between genes
mock_data <- mock_rnaseq(list_var, N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = BASAL_EXPR)
```
## Mock RNAseq object
```{r example-str_obj_mock, warning = FALSE, message = FALSE}
str(mock_data, max.level = 1)
```
---
title: "Not only RNAseq"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Not only RNAseq}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(HTRfit)
```
In the realm of RNAseq analysis, it's widely acknowledged that counts follow a Negative Binomial distribution. Hence, we recommend employing `family = glmmTMB::nbinom2(link = "log")` in the parameters of fitModelParallel() to align with this distribution.
In RNAseq it is well known that counts have a Negative binomial distribution that's why we recommand using `family = glmmTMB::nbinom2(link = "log")` in `fitModelParallel()` parameters.
```{r distrib_rnaseq, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=4 }
pub_data_fn <- system.file("extdata", "rna_pub_data.tsv", package = "HTRfit")
pub_data <- read.table(file = pub_data_fn, header = TRUE)
plot(density(unlist(log10(round(pub_data[,-1])))), main = "Density of log RNAseq counts ")
```
While HTRfit is optimized for RNA-seq data, it's worth noting that the model family can be customized to accommodate other data types. In a different context, the model family can be adapted according to the nature of your response data.
To illustrate, consider an attempt to model Iris sepal length based on sepal width using the iris dataframe. Evaluating the distribution shape of the variable to be modeled (Sepal.Width), we observe a normal distribution.
```{r distrib_iris, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=5}
data("iris")
plot(density(unlist(iris$Sepal.Width)))
```
Given this observation, we opt to fit a model for each species using a Gaussian family model (default). Details about model family [here](https://search.r-project.org/CRAN/refmans/glmmTMB/html/nbinom2.html).
```{r example-fitModelParallel_nonRNA, warning = FALSE, message = FALSE}
l_tmb_iris <- fitModelParallel(
formula = Sepal.Width ~ Sepal.Length ,
data = iris,
group_by = "Species",
family = stats::gaussian(),
n.cores = 1)
```
---
title: "About evaluation metrics"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{About evaluation metrics}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## Root Mean Square Error (RMSE)
RMSE measures the average deviation between predicted values and actual values. It is calculated as follows:
\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \]
## R-squared (R²)
R² measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is calculated as follows:
\[ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \]
## Precision-Recall Area Under Curve (PR-AUC)
PR-AUC measures the area under the precision-recall curve. It is often used in binary classification tasks where the class distribution is imbalanced.
## Performance Ratio
performance ratio is calculated as the ratio of PR-AUC to the PR-AUC of a random classifier (`pr_randm_AUC`).
It provides a measure of how well the model performs compared to a random baseline.
\[ performanceRatio = \frac{PR_{AUC}}{PR_{randomAUC}} \]
## Receiver Operating Characteristic Area Under Curve (ROC AUC)
ROC AUC measures the area under the receiver operating characteristic curve. It evaluates the classifier's ability to distinguish between classes.
## Confusion Matrix
| | Predicted Negative | Predicted Positive |
|:---------------:|:------------------:|:------------------:|
| Actual Negative | TN | FP |
| Actual Positive | FN | TP |
## Accuracy
Accuracy measures the proportion of correct predictions out of the total predictions made by the model. It is calculated as follows:
\[
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
\]
## Specificity
Specificity measures the proportion of true negatives out of all actual negatives. It is calculated as follows:
\[
\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}
\]
## Recall (Sensitivity)
Recall, also known as sensitivity, measures the proportion of true positives out of all actual positives. It indicates the model's ability to correctly identify positive instances.
\[ \text{Recall} = \text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}} \]
## Precision
Precision measures the proportion of true positives out of all predicted positives. It indicates the model's ability to avoid false positives.
\[ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \]
---
title: "Frequently Asked Questions"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Frequently Asked Questions}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## Why mu has no effect within simulation
some element of response
## Difference between norm_distrib = 'univariate' & = 'multivariate'
some element of response
## Why use `transform = 'x+1'` with `prepareData2Fit()`
some element of response. Zero Nightmare
## Why use `row_threshold = 10` with `prepareData2Fit()`
some element of response
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment