update vignettes

1fbeda08 · Arnaud Duvermy · 02cac2b2 · 1fbeda08 · 1fbeda08
Commit 1fbeda08 authored 1 year ago by Arnaud Duvermy
--- a/vignettes/HTRfit_simulation_pipeline.Rmd
+++ b/vignettes/HTRfit_simulation_pipeline.Rmd
+---
+title: "HTRfit simulation pipeline"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{HTRfit simulation pipeline}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+## Requirement
+```{r setup, warning = FALSE, message = FALSE}
+library(HTRfit)
+```
+## Initializing design
+In this setup, we will use HTRfit to simulate an experimental design involving multiple genotypes and environmental conditions. To do so, we will first need to estimate plausible parameter distributions based on publicly available data.
+### Loading public dataset
+Counts data from the [SRP217588](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135473) project are embedded in HTRfit package. These data were processed by an RNAseq pipeline using Kallisto to generate count data. This dataset includes count data for 4 distinct environments, 2 genotypes in 6000 genes, providing a valuable resource for estimating distribution parameters.
+```{r loading_pub_data, warning = FALSE, message = FALSE}
+pub_data <- read.table(file = "inst/extdata/rna_pub_data.tsv", h = T)
+rownames(pub_data) <- pub_data$gene_id
+pub_data$gene_id <- NULL
+pub_metadata <-  read.table(file = "inst/extdata/rna_pub_metadata.tsv", h = T)
+```
+### Fit a model on a public dataset
+To estimate plausible parameter distributions, we will fit a generalized linear  model (GLM) to the publicly available data. This GLM will allow us to estimate plausible coefficients betaG, betaE, and betaGxE for 6000 genes. The `tidy_pub_fit` object contains the estimate obtain by HTRfit for each coefficient and each genes.
+```{r fit_pub_data, warning = FALSE, message = FALSE}
+## -- prepare data & fit a model 
+data2fit = prepareData2fit(countMatrix = pub_data, 
+                           metadata = pub_metadata, 
+                           normalization = F,   
+                           response_name = "kij")
+l_tmb <- fitModelParallel(formula = kij ~ genotype + environment + genotype:environment,
+                          data = data2fit, 
+                          group_by = "geneID",
+                          family = glmmTMB::nbinom2(link = "log"), 
+                          n.cores = 1)
+## -- extract results
+tidy_pub_fit <- tidy_results(l_tmb)
+```
+### Estimating parameter distributions
+#### Beta coefficients parameters
+Now let's examine the distributions of the estimated parameters. Means, standard deviation and correlation between these parameters are inferred using the following lines :
+```{r fit_pub_data, warning = FALSE, message = FALSE}
+# -- Obtain the parameters
+all_beta_pub <- dcast(tidy_pub_fit, formula = ID ~ term, value.var = "estimate")
+columns_env <-  c("environmentenv2","environmentenv3", "environmentenv4" )
+beta_env <- rowMeans(all_beta_pub[, columns_env])
+columns_interaction <-  c("genotype2:environmentenv2","genotype2:environmentenv3", "genotype2:environmentenv4" )
+beta_interaction <- rowMeans( all_beta_pub[ , columns_interaction] )
+# -- Group the parameters
+grouped_beta_pub <- data.frame( betaG = all_beta_pub$genotype2, 
+                                betaE = beta_env, 
+                                betaGxE = beta_interaction )
+# -- Obtain statistics for the parameters
+beta0_pub = all_beta_pub$`(Intercept)`
+colMeans(grouped_beta_pub)
+sapply(grouped_beta_pub, sd)
+cor(grouped_beta_pub)
+```
+#### Dispersion parameter
+We also save the dispersion parameter obtain from the public dataset.
+```{r pub_dispersion, warning = FALSE, message = FALSE }
+dispersion_pub  <- glance_tmb(l_tmb)$dispersion
+```
+## Mimic the public RNAseq data with HTRfit
+To reproduce the design of the public [SRP217588](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135473) project, 2 levels of genotype and 4 levels of environment are initialized using `init_variable()`. By using `add_interaction()`, interaction effects between genotype and environment (betaGxE) are set.  For all of them, mean and standard deviation are set concordantly to values obtained from the public dataset coefficient distribution (betaG, betaE, betaGxE).
+By using the estimated intercept obtained from the previous analysis of publicly available data as the basal expression level for each gene in the simulated dataset, we ensure that the basal expression levels for our simulated genes are consistent with those in the publicly available data.
+```{r design_init, warning = FALSE, message = FALSE}
+## -- init a design 
+input_var_list <- init_variable( name = "genotype", mu = 0, sd = 0.29, level = 2) %>%
+                  init_variable( name = "environment", mu = 0.27, sd = 0.6, level = 4) %>%
+                    add_interaction( between_var = c("genotype", "environment"), mu = 0.44, sd = 0.89) %>% 
+                    set_correlation(between_var = c("genotype", "environment"), corr = 0.4224420) %>% 
+                    set_correlation(between_var = c("genotype", "genotype:environment"), corr = -0.4543833) %>%
+                    set_correlation(between_var = c("environment", "genotype:environment"), corr = -0.3095052)
+N_GENES <- 100
+MIN_REPLICATES <- 4
+MAX_REPLICATES <- 4
+SEQ_DEPTH <- 5e6
+## -- simulate RNAseq data 
+mock_data <- mock_rnaseq(input_var_list, 
+                         n_genes = N_GENES,
+                         min_replicates  = MIN_REPLICATES,
+                         max_replicates = MAX_REPLICATES,
+                         basal_expression = beta0_pub,
+                         sequencing_depth = SEQ_DEPTH,
+                         dispersion = dispersion_pub)
+```
+## Fit and evaluate a model on simulated RNAseq data
+### Fit model
+We fit a model to the simulated RNAseq data using the `fitModelParallel()` function. Our model includes terms for genotype, environment, and the interaction between genotype and environment.
+```{r fit_model, warning = FALSE, message = FALSE }
+## -- prepare data & fit a model with mixed effect
+data2fit = prepareData2fit(countMatrix = mock_data$counts, 
+                           metadata =  mock_data$metadata, 
+                           normalization = T,   
+                           response_name = "kij")
+l_tmb <- fitModelParallel(formula = kij ~ genotype + environment + genotype:environment,
+                          data = data2fit, 
+                          group_by = "geneID",
+                          family = glmmTMB::nbinom2(link = "log"), 
+                          n.cores = 1, 
+                          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, eval.max=1e5)))
+```
+### Evaluate fit
+Next, we evaluate the fitting results obtained from the simulated RNAseq data by comparing the expected results computed by `evaluation_report()`, with those obtained with  `fitModelParallel()` : `l_tmb`. 
+```{r evaluation, warning = FALSE, message = FALSE}
+## -- evaluation
+resSimu <- evaluation_report(list_tmb = l_tmb,
+                             dds = NULL,
+                            mock_obj = mock_data,
+                            coeff_threshold = 0.67, 
+                            alt_hypothesis = "greaterAbs")
+```
+#### Identity plot
+The `evaluation_report()` function produces an identity plot, offering a visual tool to juxtapose the simulated effects (actual effects) with the model-inferred effects. This graphical representation simplifies assessing how well the model aligns with the values of the simulated effects, enabling a visual analysis of the model's goodness of fit to the simulated data. For a more quantitative evaluation of inference quality, the R-squared (R2) metric can be employed.
+#### Model parameters
+```{r id_params , warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=5}
+## -- Model params
+resSimu$identity$params
+```
+#### Dispersion parameter 
+The dispersion plot, generated by the `evaluation_report()` function, offers a visual comparison of the dispersion parameters used in the simulation \(dispersion_i\) with those estimated by the model. This graphical representation provides an intuitive way to assess the alignment between the simulated dispersion values and the model-inferred values, enabling a visual evaluation of how well the model captures the underlying data characteristics.
+```{r id_disp, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=5}
+## -- Dispersion params
+resSimu$identity$dispersion
+```
+#### ROC curve
+The Receiver Operating Characteristic (ROC) curve is a valuable tool for assessing the performance of classification models, particularly in the context of identifying differentially expressed genes. It provides a graphical representation of the model's ability to distinguish between genes that are differentially expressed and those that are not, by varying the `coeff_threshold` and the `alt_hypothesis` parameters.
+```{r roc, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
+## -- ROC curve
+resSimu$roc$params
+```
+#### Precision-recall curve
+The precision-recall curve (PR curve) illustrates the relationship between precision and recall at various classification thresholds. It is particularly valuable in the context of imbalanced data, where one class is significantly more prevalent than the other. Unlike the ROC curve, which can be influenced by class distribution, the PR curve focuses on the model's ability to correctly identify examples of the minority class while maintaining high precision.
+```{r pr_curve, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
+## -- precision recall by params
+resSimu$precision_recall$params
+```
+#### AUC and classification performance metrics 
+The area under the ROC curve (AUC) provides a single metric that summarizes the model's overall performance in distinguishing between differentially expressed and non-differentially expressed genes. A higher AUC indicates better model performance.  In the case of the ROC curve, this AUC should be compared to the value of 0.5, representing a random classifier. On the other hand, for the PR curve, we compute the pr_AUC_random, serving as a baseline for comparison. 
+In addition to evaluating model performance through the AUC for both ROC and PR curves, we provide access to key classification metrics, including Accuracy, Precision, Recall (or Sensitivity), and Specificity. These metrics offer a comprehensive view of the model's classification capabilities. These metrics are computed for each parameter (excluding the intercept when skip_eval_intercept = TRUE), providing detailed insights into individual parameter contributions. Furthermore, an aggregated assessment is performed, considering all parameters (except the intercept by default), offering a perspective on the model's overall classification effectiveness.
+```{r perf, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
+## -- precision-recall curve
+resSimu$performances$byparams
+```
+## Preparing and optimizing a futur experimental design
+Let's consider a scenario where a team is strategizing a new experiment encompassing 100 genotypes and 4 environments. To enhance the precision of their analysis, the team is conducting a power analysis through HTRfit. For this purpose, they are utilizing the experimental design template, `input_var_list`, which draws insights from the publicly available [SRP217588](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE135473) dataset. They are adjusting the design levels to align with the specifics of their upcoming study involving 200 genotypes and 4 environments.
+#### Simulation
+```{r design_init_200, warning = FALSE, message = FALSE}
+## -- init a design 
+input_var_list <- init_variable( name = "genotype", mu = 0, sd = 0.29, level = 100) %>%
+                  init_variable( name = "environment", mu = 0.27, sd = 0.6, level = 4) %>%
+                    add_interaction( between_var = c("genotype", "environment"), mu = 0.44, sd = 0.89) %>% 
+                    set_correlation(between_var = c("genotype", "environment"), corr = 0.4224420) %>% 
+                    set_correlation(between_var = c("genotype", "genotype:environment"), corr = -0.4543833) %>%
+                    set_correlation(between_var = c("environment", "genotype:environment"), corr = -0.3095052)
+N_GENES <- 50
+MIN_REPLICATES <- 4
+MAX_REPLICATES <- 4
+SEQ_DEPTH <- 5e6
+## -- simulate RNAseq data 
+mock_data <- mock_rnaseq(input_var_list, 
+                         n_genes = N_GENES,
+                         min_replicates  = MIN_REPLICATES,
+                         max_replicates = MAX_REPLICATES,
+                         basal_expression = beta0_pub,
+                         sequencing_depth = SEQ_DEPTH,
+                         dispersion = dispersion_pub)
+```
+#### Fit model with fixed effect : ` ~ genotype + environment + genotype:environment`
+We fit a model to the simulated RNAseq data using the `fitModelParallel()` function. Our model includes terms for genotype, environment, and the interaction between genotype and environment. They are all consider as fixed effects.
+```{r fit200G, warning = FALSE, message = FALSE}
+## -- prepare data & fit a model with mixed effect
+data2fit = prepareData2fit(countMatrix = mock_data$counts, 
+                           metadata =  mock_data$metadata, 
+                           normalization = T,   
+                           response_name = "kij")
+l_tmb <- fitModelParallel(formula = kij ~ genotype + environment + genotype:environment,
+                          data = data2fit, 
+                          group_by = "geneID",
+                          family = glmmTMB::nbinom2(link = "log"), 
+                          n.cores = 6, 
+                          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, eval.max=1e5)))
+```
+#### Evaluate fit with fixed effects: ` ~ genotype + environment + genotype:environment`
+```{r evaluation200G, warning = FALSE, message = FALSE}
+## -- evaluation
+resSimu <- evaluation_report(list_tmb = l_tmb,
+                             dds = NULL,
+                            mock_obj = mock_data,
+                            coeff_threshold = 0.67, 
+                            alt_hypothesis = "greaterAbs")
+```
+```{r pr200G, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
+## -- precision-recall curve
+resSimu$precision_recall$params
+```
+```{r roc200G, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=7}
+## -- precision-recall curve
+resSimu$roc$params
+```
+```{r perf200G, warning = FALSE, message = FALSE}
+## -- precision-recall curve
+resSimu$performances$byparams
+```
+#### Fit model with mixed effect : ` ~environment + ( environment | genotype )`
+The inclusion of a large number of genotypes in the experimental design allows for the incorporation of random effects in the analysis. Utilizing HTRfit, the team has opted to assess the effectiveness of a model in which the `environment` is defined as a mixed effect (combining fixed and random components), and `genotype` is considered a random effect. The fitting of this mixed model is executed using the `fitModelParallel()` function.
+```{r mixed_fit_200G, warning = FALSE, message = FALSE}
+## -- prepare data & fit a model with mixed effect
+data2fit = prepareData2fit(countMatrix = mock_data$counts, 
+                           metadata =  mock_data$metadata, 
+                           normalization = T,   
+                           response_name = "kij")
+l_tmb <- fitModelParallel(formula = kij  ~environment + ( environment | genotype ),
+                          data = data2fit, 
+                          group_by = "geneID",
+                          family = glmmTMB::nbinom2(link = "log"), 
+                          n.cores = 6, 
+                          control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5, eval.max=1e5)))
+```
+#### Evaluate fit with fixed effects: ` ~ genotype + environment + genotype:environment`
+```{r  mixed_evaluation_200G, warning = FALSE, message = FALSE}
+## -- evaluation
+resSimu <- evaluation_report(list_tmb = l_tmb,
+                             dds = NULL,
+                            mock_obj = mock_data,
+                            coeff_threshold = 0.08, 
+                            alt_hypothesis = "greaterAbs")
+```
+```{r mixed_id200G, warning = FALSE, message = FALSE, fig.align='center', fig.height=5, fig.width=7}
+## -- precision-recall curve
+resSimu$identity$params
+```
+```{r mixed_roc200G, warning = FALSE, message = FALSE, fig.align='center', fig.height=4, fig.width=5}
+## -- precision-recall curve
+resSimu$roc$params
+```
+```{r mixed_pr200G, warning = FALSE, message = FALSE}
+## -- precision-recall curve
+resSimu$precision_recall$params
+```
+```{r mixed_perf200G, warning = FALSE, message = FALSE}
+## -- precision-recall curve
+resSimu$performances$byparams
+```
--- a/vignettes/test_vignette.Rmd
+++ b/vignettes/test_vignette.Rmd
@@ -20,4 +20,6 @@ print("oui")
 ```
 [test_link](htrfit.html)
\ No newline at end of file
+[test_link2](HTRfit_simulation_pipeline.html)
\ No newline at end of file