Skip to content
Snippets Groups Projects
  • Arnaud Duvermy's avatar
    9cb65372
    update tutorial · 9cb65372
    Arnaud Duvermy authored
    Former-commit-id: e71fe318fcd796de54c295447f19d5bdc959f7b5
    Former-commit-id: 117d7915a687a02adb8e8d80dcd9461f91d601b3
    Former-commit-id: 0138c69a91388d4e339ec873112d430c9046f159
    9cb65372
    History
    update tutorial
    Arnaud Duvermy authored
    Former-commit-id: e71fe318fcd796de54c295447f19d5bdc959f7b5
    Former-commit-id: 117d7915a687a02adb8e8d80dcd9461f91d601b3
    Former-commit-id: 0138c69a91388d4e339ec873112d430c9046f159
title: "Simulation tutorial"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Simulation tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(HTRfit)
library(reshape2)

Initializing design

In this setup, we will use HTRfit to simulate an experimental design involving multiple genotypes and environmental conditions. To do so, we will first need to estimate plausible parameter distributions based on publicly available data.

Loading public dataset

Counts data from the SRP217588 project are embedded in HTRfit package. These data were processed by an RNAseq pipeline using Kallisto to generate count data. This dataset includes count data for 4 distinct environments, 2 genotypes in 6000 genes, providing a valuable resource for estimating distribution parameters.

## -- public data embedded in HTRfit
pub_data_fn <- system.file("extdata", "rna_pub_data.tsv", package = "HTRfit") 
pub_data <- read.table(file = pub_data_fn, header = TRUE)
rownames(pub_data) <- pub_data$gene_id
pub_data$gene_id <- NULL
pub_metadata_fn <- system.file("extdata", "rna_pub_metadata.tsv", package = "HTRfit") 
pub_metadata <- read.table(file = pub_metadata_fn,  header = TRUE)

## -- 6713 genes & 40 samples
dim(pub_data)

Fit a model on a public dataset

To estimate plausible parameter distributions, we will fit a generalized linear model (GLM) to the publicly available data. This GLM will allow us to estimate plausible coefficients betaG, betaE, and betaGxE for 6000 genes. The tidy_pub_fit object contains the estimate obtain by HTRfit for each coefficient and each genes.

# -- prepare data &  fit a model 
data2fit = prepareData2fit(countMatrix = pub_data, 
                           metadata = pub_metadata, 
                           normalization = 'MRN',   
                           response_name = "kij")
l_tmb <- fitModelParallel(
            formula = kij ~ genotype + environment + genotype:environment,
            data = data2fit, 
            group_by = "geneID",
            family = glmmTMB::nbinom2(link = "log"), 
            n.cores = 8)

## -- Select best fit base on AIC results
l_genes <- identifyTopFit(l_tmb)
## -- extract results for best genes
tidy_pub_fit <- tidy_results(l_tmb[l_genes])

Estimating parameter distributions

Beta coefficients parameters

Now let's examine the distributions of the estimated parameters. Means, standard deviation and correlation between these parameters are inferred using the following lines :

## -- hidded in vignette
## -- allows to not evaluate previous chunk (a bit long)
tidy_pub_fn <- system.file("extdata", "tidy_pub_results.tsv", package = "HTRfit") 
tidy_pub_fit <- read.table(file = tidy_pub_fn,  header = TRUE)
dispersion_pub_fn <-  system.file("extdata", "pub_dispersion.tsv", package = "HTRfit")
dispersion_pub <- read.table(file = dispersion_pub_fn,  header = TRUE)$x
# -- Obtain the parameters
all_beta_pub <- reshape2::dcast(tidy_pub_fit, formula = ID ~ term, value.var = "estimate")
columns_env <-  c("environmentenv2","environmentenv3", "environmentenv4" )
beta_env <- rowMeans(all_beta_pub[, columns_env])
columns_interaction <-  c("genotype2:environmentenv2","genotype2:environmentenv3", "genotype2:environmentenv4" )
beta_interaction <- rowMeans( all_beta_pub[ , columns_interaction] )
# -- Group the parameters
grouped_beta_pub <- data.frame( betaG = all_beta_pub$genotype2, 
                                betaE = beta_env, 
                                betaGxE = beta_interaction )

# -- Obtain statistics for the parameters
beta0_pub = all_beta_pub$`(Intercept)`
## -- public effect standard deviation 
sapply(grouped_beta_pub, sd)

Dispersion parameter

We also save the dispersion parameter obtain from the public dataset.

dispersion_pub  <- glance_tmb(l_tmb)$dispersion

Mimic the public RNAseq data with HTRfit

To reproduce the design of the public SRP217588 project, 2 levels of genotype and 4 levels of environment are initialized using init_variable(). By using add_interaction(), interaction effects between genotype and environment (betaGxE) are set. For all of them, mean and standard deviation are set concordantly to values obtained from the public dataset coefficient distribution (betaG, betaE, betaGxE). By using the estimated intercept obtained from the previous analysis of publicly available data as the basal expression level for each gene in the simulated dataset, we ensure that the basal expression levels for our simulated genes are consistent with those in the publicly available data.

## -- design to simulate
N_GENES <- 100 
MIN_REPLICATES <- 5
MAX_REPLICATES <- 5
SEQ_DEPTH <- 5e6

## -- init effects to simulate
input_var_list <- init_variable( name = "genotype", mu = 0, sd = 2.18, level = 2) %>%
                            init_variable( name = "environment", mu = 0, sd = 0.57, level = 4 ) %>%
                            add_interaction(between_var = c("genotype", "environment"), mu = 0 , sd = 1.018)

## -- simulate RNAseq data 
mock_data <- mock_rnaseq(input_var_list, 
                         n_genes = N_GENES,
                         min_replicates  = MIN_REPLICATES,
                         max_replicates = MAX_REPLICATES,
                         basal_expression = beta0_pub,
                         sequencing_depth = SEQ_DEPTH,
                         dispersion = dispersion_pub,
                         normal_distr = 'univariate' ) ## for fix effect

Fit and evaluate a model on simulated RNAseq data

Fit model

We fit a model to the simulated RNAseq data using the fitModelParallel() function. Our model includes terms for genotype, environment, and the interaction between genotype and environment.


## -- prepare data & fit a model with mixed effect
data2fit = prepareData2fit(countMatrix = mock_data$counts, 
                           metadata =  mock_data$metadata, 
                           normalization = 'MRN',   
                           response_name = "kij")
l_tmb <- fitModelParallel(
          formula = kij ~ genotype + environment + genotype:environment,
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 1, 
          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, eval.max=1e5)))

Evaluate fit

Next, we evaluate the fitting results obtained from the simulated RNAseq data by comparing the expected results computed by evaluation_report(), with those obtained with fitModelParallel() : l_tmb.


## -- eval params for Wald test
ln_FC_threshold <- 0.67
alternative_hypothesis <- "greaterAbs"

## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)

## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
                              list_genes = l_genes
                              mock_obj = mock_data,
                              coeff_threshold = ln_FC_threshold, 
                              alt_hypothesis = alternative_hypothesis )

Explore evaluation results

Identity plot

The evaluation_report() function produces an identity plot, offering a visual tool to juxtapose the simulated effects (actual effects) with the model-inferred effects. This graphical representation simplifies assessing how well the model aligns with the values of the simulated effects, enabling a visual analysis of the model's goodness of fit to the simulated data. For a more quantitative evaluation of inference quality, the R-squared (R2) and RMSE metrics can be employed.

Model parameters

## -- Model params
resSimu$identity$params

Dispersion parameter

The dispersion plot, generated by the evaluation_report() function, offers a visual comparison of the dispersion parameters used in the simulation (dispersion_i) with those estimated by the model. This graphical representation provides an intuitive way to assess the alignment between the simulated dispersion values and the model-inferred values, enabling a visual evaluation of how well the model captures the underlying data characteristics.

## -- Dispersion params
resSimu$identity$dispersion

ROC curve

The Receiver Operating Characteristic (ROC) curve is a valuable tool for assessing the performance of classification models, particularly in the context of identifying differentially expressed genes. It provides a graphical representation of the model's ability to distinguish between genes that are differentially expressed and those that are not, by varying the coeff_threshold and the alt_hypothesis parameters.

## -- ROC curve
resSimu$roc$params

Precision-recall curve

The precision-recall curve (PR curve) illustrates the relationship between precision and recall at various classification thresholds. It is particularly valuable in the context of imbalanced data, where one class is significantly more prevalent than the other. Unlike the ROC curve, which can be influenced by class distribution, the PR curve focuses on the model's ability to correctly identify examples of the minority class while maintaining high precision.

## -- precision recall by params
resSimu$precision_recall$params

AUC and classification performance metrics

The area under the ROC curve (AUC) provides a single metric that summarizes the model's overall performance in distinguishing between differentially expressed and non-differentially expressed genes. A higher AUC indicates better model performance. In the case of the ROC curve, this AUC should be compared to the value of 0.5, representing a random classifier. On the other hand, for the PR curve, we compute the pr_AUC_random, serving as a baseline for comparison. In addition to evaluating model performance through the AUC for both ROC and PR curves, we provide access to key classification metrics, including Accuracy, Precision, Recall (or Sensitivity), and Specificity. These metrics offer a comprehensive view of the model's classification capabilities. These metrics are computed for each parameter (excluding the intercept when skip_eval_intercept = TRUE), providing detailed insights into individual parameter contributions. Furthermore, an aggregated assessment is performed, considering all parameters (except the intercept by default), offering a perspective on the model's overall classification effectiveness.

## -- precision-recall curve
resSimu$performances$byparams

Mimic the public RNAseq data with HTRfit - inreasing GxE interaction

To initialized GxE interaction to 0, set sd = 0.

## -- design to simulate
N_GENES <- 100 
MIN_REPLICATES <- 5
MAX_REPLICATES <- 5
SEQ_DEPTH <- 5e6

## -- init effects to simulate
input_var_list <- init_variable( name = "genotype", mu = 0, sd = 2.18, level = 2) %>%
                  init_variable( name = "environment", mu = 0, sd = 0.57, level = 4 ) %>%
                  add_interaction(between_var = c("genotype", "environment"), mu = 0 , sd = 3)

## -- simulate RNAseq data 
mock_data <- mock_rnaseq(input_var_list, 
                         n_genes = N_GENES,
                         min_replicates  = MIN_REPLICATES,
                         max_replicates = MAX_REPLICATES,
                         basal_expression = beta0_pub,
                         sequencing_depth = SEQ_DEPTH,
                         dispersion = dispersion_pub,
                         normal_distr = 'univariate' ) ## for fix effect

Fit and evaluate a model on simulated RNAseq data

We fit a model to the simulated RNAseq data using the fitModelParallel() function. Our model includes terms for genotype, environment, and the interaction between genotype and environment.


## -- prepare data & fit a model with mixed effect
data2fit = prepareData2fit(countMatrix = mock_data$counts, 
                           metadata =  mock_data$metadata, 
                           normalization = 'MRN',   
                           response_name = "kij")
l_tmb <- fitModelParallel(
          formula = kij ~ genotype + environment + genotype:environment,
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 1, 
          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, eval.max=1e5)))

Evaluate fit

Next, we evaluate the fitting results obtained from the simulated RNAseq data by comparing the expected results computed by evaluation_report(), with those obtained with fitModelParallel() : l_tmb.


## -- eval params for Wald test
ln_FC_threshold <- 0.67
alternative_hypothesis <- "greaterAbs"

## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)

## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
                              list_genes = l_genes
                              mock_obj = mock_data,
                              coeff_threshold = ln_FC_threshold, 
                              alt_hypothesis = alternative_hypothesis)

The identity plot allows direct appreciation of the absence of GxE interaction.

## -- Model params
resSimu$identity$params

Mimic the public RNAseq data with HTRfit - without GxE interaction

To initialized GxE interaction to 0, set sd = 0.

## -- design to simulate
N_GENES <- 100 
MIN_REPLICATES <- 5
MAX_REPLICATES <- 5
SEQ_DEPTH <- 5e6

## -- init effects to simulate
input_var_list <- init_variable( name = "genotype", mu = 0, sd = 2.18, level = 2) %>%
                  init_variable( name = "environment", mu = 0, sd = 0.57, level = 4 ) %>%
                  add_interaction(between_var = c("genotype", "environment"), mu = 0 , sd = 0) ## sd = 0

## -- simulate RNAseq data 
mock_data <- mock_rnaseq(input_var_list, 
                         n_genes = N_GENES,
                         min_replicates  = MIN_REPLICATES,
                         max_replicates = MAX_REPLICATES,
                         basal_expression = beta0_pub,
                         sequencing_depth = SEQ_DEPTH,
                         dispersion = dispersion_pub,
                         normal_distr = 'univariate' ) ## for fix effect

Fit and evaluate a model on simulated RNAseq data

We fit a model to the simulated RNAseq data using the fitModelParallel() function. Our model includes terms for genotype, environment, and the interaction between genotype and environment.


## -- prepare data & fit a model with mixed effect
data2fit = prepareData2fit(countMatrix = mock_data$counts, 
                           metadata =  mock_data$metadata, 
                           normalization = 'MRN',   
                           response_name = "kij")
l_tmb <- fitModelParallel(
          formula = kij ~ genotype + environment + genotype:environment,
          data = data2fit, 
          group_by = "geneID",
          family = glmmTMB::nbinom2(link = "log"), 
          n.cores = 1, 
          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, eval.max=1e5)))

Next, we evaluate the fitting results obtained from the simulated RNAseq data by comparing the expected results computed by evaluation_report(), with those obtained with fitModelParallel() : l_tmb.


## -- eval params for Wald test
ln_FC_threshold <- 0.67
alternative_hypothesis <- "greaterAbs"

## -- remove genes with very low AIC
l_genes <- identifyTopFit(l_tmb)

## -- evaluation
resSimu <- evaluation_report( list_tmb = l_tmb,
                              list_genes = l_genes
                              mock_obj = mock_data,
                              coeff_threshold = ln_FC_threshold, 
                              alt_hypothesis = alternative_hypothesis)

The identity plot allows direct appreciation of the absence of GxE interaction.

## -- Model params
resSimu$identity$params