-
Arnaud Duvermy authored
Former-commit-id: e71fe318fcd796de54c295447f19d5bdc959f7b5 Former-commit-id: 117d7915a687a02adb8e8d80dcd9461f91d601b3 Former-commit-id: 0138c69a91388d4e339ec873112d430c9046f159
Arnaud Duvermy authoredFormer-commit-id: e71fe318fcd796de54c295447f19d5bdc959f7b5 Former-commit-id: 117d7915a687a02adb8e8d80dcd9461f91d601b3 Former-commit-id: 0138c69a91388d4e339ec873112d430c9046f159
- Initializing design
- Loading public dataset
- Fit a model on a public dataset
- Estimating parameter distributions
- Beta coefficients parameters
- Dispersion parameter
- Mimic the public RNAseq data with HTRfit
- Fit and evaluate a model on simulated RNAseq data
- Fit model
- Evaluate fit
- Explore evaluation results
- Identity plot
- Model parameters
- Dispersion parameter
- ROC curve
- Precision-recall curve
- AUC and classification performance metrics
- Mimic the public RNAseq data with HTRfit - inreasing GxE interaction
- Fit and evaluate a model on simulated RNAseq data
- Evaluate fit
- Mimic the public RNAseq data with HTRfit - without GxE interaction
- Fit and evaluate a model on simulated RNAseq data
title: "Simulation tutorial"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Simulation tutorial}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
library(HTRfit)
library(reshape2)
Initializing design
In this setup, we will use HTRfit to simulate an experimental design involving multiple genotypes and environmental conditions. To do so, we will first need to estimate plausible parameter distributions based on publicly available data.
Loading public dataset
Counts data from the SRP217588 project are embedded in HTRfit package. These data were processed by an RNAseq pipeline using Kallisto to generate count data. This dataset includes count data for 4 distinct environments, 2 genotypes in 6000 genes, providing a valuable resource for estimating distribution parameters.
## -- public data embedded in HTRfit
pub_data_fn <- system.file("extdata", "rna_pub_data.tsv", package = "HTRfit")
pub_data <- read.table(file = pub_data_fn, header = TRUE)
rownames(pub_data) <- pub_data$gene_id
pub_data$gene_id <- NULL
pub_metadata_fn <- system.file("extdata", "rna_pub_metadata.tsv", package = "HTRfit")
pub_metadata <- read.table(file = pub_metadata_fn, header = TRUE)
## -- 6713 genes & 40 samples
dim(pub_data)
Fit a model on a public dataset
To estimate plausible parameter distributions, we will fit a generalized linear model (GLM) to the publicly available data. This GLM will allow us to estimate plausible coefficients betaG, betaE, and betaGxE for 6000 genes. The tidy_pub_fit
object contains the estimate obtain by HTRfit for each coefficient and each genes.
# -- prepare data & fit a model
data2fit = prepareData2fit(countMatrix = pub_data,
metadata = pub_metadata,
normalization = 'MRN',
response_name = "kij")
l_tmb <- fitModelParallel(
formula = kij ~ genotype + environment + genotype:environment,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 8)
## -- Select best fit base on AIC results
l_genes <- identifyTopFit(l_tmb)
## -- extract results for best genes
tidy_pub_fit <- tidy_results(l_tmb[l_genes])
Estimating parameter distributions
Beta coefficients parameters
Now let's examine the distributions of the estimated parameters. Means, standard deviation and correlation between these parameters are inferred using the following lines :
## -- hidded in vignette
## -- allows to not evaluate previous chunk (a bit long)
tidy_pub_fn <- system.file("extdata", "tidy_pub_results.tsv", package = "HTRfit")
tidy_pub_fit <- read.table(file = tidy_pub_fn, header = TRUE)
dispersion_pub_fn <- system.file("extdata", "pub_dispersion.tsv", package = "HTRfit")
dispersion_pub <- read.table(file = dispersion_pub_fn, header = TRUE)$x
# -- Obtain the parameters
all_beta_pub <- reshape2::dcast(tidy_pub_fit, formula = ID ~ term, value.var = "estimate")
columns_env <- c("environmentenv2","environmentenv3", "environmentenv4" )
beta_env <- rowMeans(all_beta_pub[, columns_env])
columns_interaction <- c("genotype2:environmentenv2","genotype2:environmentenv3", "genotype2:environmentenv4" )
beta_interaction <- rowMeans( all_beta_pub[ , columns_interaction] )
# -- Group the parameters
grouped_beta_pub <- data.frame( betaG = all_beta_pub$genotype2,
betaE = beta_env,
betaGxE = beta_interaction )
# -- Obtain statistics for the parameters
beta0_pub = all_beta_pub$`(Intercept)`
## -- public effect standard deviation
sapply(grouped_beta_pub, sd)
Dispersion parameter
We also save the dispersion parameter obtain from the public dataset.
dispersion_pub <- glance_tmb(l_tmb)$dispersion
Mimic the public RNAseq data with HTRfit
To reproduce the design of the public SRP217588 project, 2 levels of genotype and 4 levels of environment are initialized using init_variable()
. By using add_interaction()
, interaction effects between genotype and environment (betaGxE) are set. For all of them, mean and standard deviation are set concordantly to values obtained from the public dataset coefficient distribution (betaG, betaE, betaGxE).
By using the estimated intercept obtained from the previous analysis of publicly available data as the basal expression level for each gene in the simulated dataset, we ensure that the basal expression levels for our simulated genes are consistent with those in the publicly available data.
## -- design to simulate
N_GENES <- 100
MIN_REPLICATES <- 5
MAX_REPLICATES <- 5
SEQ_DEPTH <- 5e6
## -- init effects to simulate
input_var_list <- init_variable( name = "genotype", mu = 0, sd = 2.18, level = 2) %>%
init_variable( name = "environment", mu = 0, sd = 0.57, level = 4 ) %>%
add_interaction(between_var = c("genotype", "environment"), mu = 0 , sd = 1.018)
## -- simulate RNAseq data
mock_data <- mock_rnaseq(input_var_list,
n_genes = N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = beta0_pub,
sequencing_depth = SEQ_DEPTH,
dispersion = dispersion_pub,
normal_distr = 'univariate' ) ## for fix effect
Fit and evaluate a model on simulated RNAseq data
Fit model
We fit a model to the simulated RNAseq data using the fitModelParallel()
function. Our model includes terms for genotype, environment, and the interaction between genotype and environment.
## -- prepare data & fit a model with mixed effect
data2fit = prepareData2fit(countMatrix = mock_data$counts,
metadata = mock_data$metadata,
normalization = 'MRN',
response_name = "kij")
l_tmb <- fitModelParallel(
formula = kij ~ genotype + environment + genotype:environment,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, eval.max=1e5)))
Evaluate fit
Next, we evaluate the fitting results obtained from the simulated RNAseq data by comparing the expected results computed by evaluation_report()
, with those obtained with fitModelParallel()
: l_tmb
.
## -- eval params for Wald test
ln_FC_threshold <- 0.67
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
list_genes = l_genes
mock_obj = mock_data,
coeff_threshold = ln_FC_threshold,
alt_hypothesis = alternative_hypothesis )
Explore evaluation results
Identity plot
The evaluation_report()
function produces an identity plot, offering a visual tool to juxtapose the simulated effects (actual effects) with the model-inferred effects. This graphical representation simplifies assessing how well the model aligns with the values of the simulated effects, enabling a visual analysis of the model's goodness of fit to the simulated data. For a more quantitative evaluation of inference quality, the R-squared (R2) and RMSE metrics can be employed.
Model parameters
## -- Model params
resSimu$identity$params
Dispersion parameter
The dispersion plot, generated by the evaluation_report()
function, offers a visual comparison of the dispersion parameters used in the simulation (dispersion_i) with those estimated by the model. This graphical representation provides an intuitive way to assess the alignment between the simulated dispersion values and the model-inferred values, enabling a visual evaluation of how well the model captures the underlying data characteristics.
## -- Dispersion params
resSimu$identity$dispersion
ROC curve
The Receiver Operating Characteristic (ROC) curve is a valuable tool for assessing the performance of classification models, particularly in the context of identifying differentially expressed genes. It provides a graphical representation of the model's ability to distinguish between genes that are differentially expressed and those that are not, by varying the coeff_threshold
and the alt_hypothesis
parameters.
## -- ROC curve
resSimu$roc$params
Precision-recall curve
The precision-recall curve (PR curve) illustrates the relationship between precision and recall at various classification thresholds. It is particularly valuable in the context of imbalanced data, where one class is significantly more prevalent than the other. Unlike the ROC curve, which can be influenced by class distribution, the PR curve focuses on the model's ability to correctly identify examples of the minority class while maintaining high precision.
## -- precision recall by params
resSimu$precision_recall$params
AUC and classification performance metrics
The area under the ROC curve (AUC) provides a single metric that summarizes the model's overall performance in distinguishing between differentially expressed and non-differentially expressed genes. A higher AUC indicates better model performance. In the case of the ROC curve, this AUC should be compared to the value of 0.5, representing a random classifier. On the other hand, for the PR curve, we compute the pr_AUC_random, serving as a baseline for comparison. In addition to evaluating model performance through the AUC for both ROC and PR curves, we provide access to key classification metrics, including Accuracy, Precision, Recall (or Sensitivity), and Specificity. These metrics offer a comprehensive view of the model's classification capabilities. These metrics are computed for each parameter (excluding the intercept when skip_eval_intercept = TRUE), providing detailed insights into individual parameter contributions. Furthermore, an aggregated assessment is performed, considering all parameters (except the intercept by default), offering a perspective on the model's overall classification effectiveness.
## -- precision-recall curve
resSimu$performances$byparams
Mimic the public RNAseq data with HTRfit - inreasing GxE interaction
To initialized GxE interaction to 0, set sd = 0
.
## -- design to simulate
N_GENES <- 100
MIN_REPLICATES <- 5
MAX_REPLICATES <- 5
SEQ_DEPTH <- 5e6
## -- init effects to simulate
input_var_list <- init_variable( name = "genotype", mu = 0, sd = 2.18, level = 2) %>%
init_variable( name = "environment", mu = 0, sd = 0.57, level = 4 ) %>%
add_interaction(between_var = c("genotype", "environment"), mu = 0 , sd = 3)
## -- simulate RNAseq data
mock_data <- mock_rnaseq(input_var_list,
n_genes = N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = beta0_pub,
sequencing_depth = SEQ_DEPTH,
dispersion = dispersion_pub,
normal_distr = 'univariate' ) ## for fix effect
Fit and evaluate a model on simulated RNAseq data
We fit a model to the simulated RNAseq data using the fitModelParallel()
function. Our model includes terms for genotype, environment, and the interaction between genotype and environment.
## -- prepare data & fit a model with mixed effect
data2fit = prepareData2fit(countMatrix = mock_data$counts,
metadata = mock_data$metadata,
normalization = 'MRN',
response_name = "kij")
l_tmb <- fitModelParallel(
formula = kij ~ genotype + environment + genotype:environment,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, eval.max=1e5)))
Evaluate fit
Next, we evaluate the fitting results obtained from the simulated RNAseq data by comparing the expected results computed by evaluation_report()
, with those obtained with fitModelParallel()
: l_tmb
.
## -- eval params for Wald test
ln_FC_threshold <- 0.67
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
list_genes = l_genes
mock_obj = mock_data,
coeff_threshold = ln_FC_threshold,
alt_hypothesis = alternative_hypothesis)
The identity plot allows direct appreciation of the absence of GxE interaction.
## -- Model params
resSimu$identity$params
Mimic the public RNAseq data with HTRfit - without GxE interaction
To initialized GxE interaction to 0, set sd = 0
.
## -- design to simulate
N_GENES <- 100
MIN_REPLICATES <- 5
MAX_REPLICATES <- 5
SEQ_DEPTH <- 5e6
## -- init effects to simulate
input_var_list <- init_variable( name = "genotype", mu = 0, sd = 2.18, level = 2) %>%
init_variable( name = "environment", mu = 0, sd = 0.57, level = 4 ) %>%
add_interaction(between_var = c("genotype", "environment"), mu = 0 , sd = 0) ## sd = 0
## -- simulate RNAseq data
mock_data <- mock_rnaseq(input_var_list,
n_genes = N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = beta0_pub,
sequencing_depth = SEQ_DEPTH,
dispersion = dispersion_pub,
normal_distr = 'univariate' ) ## for fix effect
Fit and evaluate a model on simulated RNAseq data
We fit a model to the simulated RNAseq data using the fitModelParallel()
function. Our model includes terms for genotype, environment, and the interaction between genotype and environment.
## -- prepare data & fit a model with mixed effect
data2fit = prepareData2fit(countMatrix = mock_data$counts,
metadata = mock_data$metadata,
normalization = 'MRN',
response_name = "kij")
l_tmb <- fitModelParallel(
formula = kij ~ genotype + environment + genotype:environment,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, eval.max=1e5)))
Next, we evaluate the fitting results obtained from the simulated RNAseq data by comparing the expected results computed by evaluation_report()
, with those obtained with fitModelParallel()
: l_tmb
.
## -- eval params for Wald test
ln_FC_threshold <- 0.67
alternative_hypothesis <- "greaterAbs"
## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)
## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
list_genes = l_genes
mock_obj = mock_data,
coeff_threshold = ln_FC_threshold,
alt_hypothesis = alternative_hypothesis)
The identity plot allows direct appreciation of the absence of GxE interaction.
## -- Model params
resSimu$identity$params