Skip to content
Snippets Groups Projects
Select Git revision
  • b47e4e791657a22ddbc491800be02cf033526a1c
  • master default protected
  • v2.1.1
  • v2.1.0
4 results

rnaseq_analysis.Rmd

Blame
  • title: "RNAseq analysis"
    output: rmarkdown::html_vignette
    vignette: >
      %\VignetteIndexEntry{RNAseq analysis}
      %\VignetteEngine{knitr::rmarkdown}
      %\VignetteEncoding{UTF-8}
    knitr::opts_chunk$set(
      collapse = TRUE,
      comment = "#>"
    )
    library(HTRfit)

    In RNAseq, we employ Generalized Linear Models (GLM) to unravel how genes respond to various experimental conditions. These models assist in deciphering the specific impacts of experimental variables on gene expression.HTRfit can be utilized to analyze such RNAseq data, providing a robust framework for exploring and interpreting the intricate relationships between genes and experimental conditions.

    Input data

    HTRfit analysis necessitates a count matrix and sample metadata, in the form of dataframes. Notice that gene_id have to be specified as rownames of count_matrix.

    ## -- hided in vignette
    ## -- simulate small example to prevent excessively lengthy vignette construction
    list_var <- init_variable( name = "genotype", mu = 3, sd = 0.2, level = 2) %>%
                init_variable( name = "environment", mu = 2, sd = 0.43, level = 2) %>%
                add_interaction( between_var = c("genotype", "environment"), mu = 0.44, sd = 0.2)
    N_GENES = 30 
    MIN_REPLICATES = 4
    MAX_REPLICATES = 4
    BASAL_EXPR = 3
    mock_data <- mock_rnaseq(list_var, N_GENES,
                             min_replicates  = MIN_REPLICATES,
                             max_replicates = MAX_REPLICATES,
                             basal_expression = BASAL_EXPR)
    ########################
    
    ## -- data from simulation or real data
    count_matrix <- mock_data$counts
    metaData <- mock_data$metadata
    ##############################
    ## -- gene count matrix
    count_matrix[1:4, 1:2]
    ## -- samples metadata
    head(metaData)

    Prepare data for fitting

    The prepareData2fit() function serves the purpose of converting the counts matrix and sample metadata into a dataframe that is compatible with downstream HTRfit functions designed for model fitting. This function also includes an option to perform median ratio normalization on the data counts.

    ## -- convert counts matrix and samples metadatas in a data frame for fitting
    data2fit = prepareData2fit(
                 countMatrix = count_matrix, 
                 metadata =  metaData, 
                 normalization = F,
                 response_name = "kij")
    
    
    ## -- median ratio normalization
    data2fit = prepareData2fit(
                 countMatrix = count_matrix, 
                 metadata =  metaData, 
                 normalization = T, 
                 response_name = "kij")

    Fit model from your data

    The fitModelParallel() function enables independent model fitting for each gene. The number of threads used for this process can be controlled by the n.cores parameter.

    l_tmb <- fitModelParallel(
              formula = kij ~ genotype + environment  + genotype:environment,
              data = data2fit, 
              group_by = "geneID",
              family = glmmTMB::nbinom2(link = "log"), 
              n.cores = 1)

    Use mixed effect in your model

    HTRfit uses the glmmTMB functions for model fitting algorithms. This choice allows for the utilization of random effects within your formula design. For further details on how to specify your model, please refer to the mixed model documentation.

    l_tmb <- fitModelParallel(
              formula = kij ~ genotype + ( 1 | environment ),
              data = data2fit, 
              group_by = "geneID",
              family = glmmTMB::nbinom2(link = "log"), 
              n.cores = 1)

    Additional settings

    The function provides precise control over model settings for fitting optimization, including options for specifying the model family and model control setting. By default, a Gaussian family model is fitted, but for RNA-seq data, it is highly recommended to specify family = glmmTMB::nbinom2(link = "log").

    l_tmb <- fitModelParallel(
              formula = kij ~ genotype + environment  + genotype:environment,
              data = data2fit, 
              group_by = "geneID",
              n.cores = 1, 
              family = glmmTMB::nbinom2(link = "log"),
              control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5,
                                                             eval.max=1e5)))

    Extracts a tidy result table from a list tmb object

    The tidy_results function extracts a data frame containing estimates of ln(fold changes), standard errors, test statistics, p-values, and adjusted p-values for fixed effects. Additionally, it provides access to correlation terms and standard deviations for random effects, offering a detailed view of HTRfit modeling results.

    ## -- get tidy results
    my_tidy_res <- tidy_results(l_tmb, coeff_threshold = 0.1, 
                                alternative_hypothesis = "greaterAbs")
    ## -- head
    my_tidy_res[1:3,]

    Update fit

    The updateParallel() function updates and re-fits a model for each gene. It offers options similar to those in fitModelParallel().

    ## -- update your fit modifying the model family
    l_tmb <- updateParallel(
              formula =  kij ~ genotype + environment  + genotype:environment,
              list_tmb = l_tmb ,
              family = gaussian(), 
              n.cores = 1)
    
    ## -- update fit using additional model control settings
    l_tmb <- updateParallel(
              formula =  kij ~ genotype + environment  + genotype:environment ,
              list_tmb = l_tmb ,
              family = gaussian(), 
              n.cores = 1,
              control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e3,
                                                             eval.max=1e3)))
    
    
    ## -- update your model formula and your family model
    l_tmb <- updateParallel(
                formula =   kij ~ genotype + environment  ,
                list_tmb = l_tmb ,
                family = glmmTMB::nbinom2(link = "log"), 
                n.cores = 1)

    Struture of list tmb object

    str(l_tmb$gene1, max.level = 1)

    Plot fit metrics

    Visualizing fit metrics is essential for evaluating your models. Here, we show you how to generate various plots to assess the quality of your models. You can explore all metrics or focus on specific aspects like dispersion and log-likelihood.

    ## -- plot all metrics
    diagnostic_plot(list_tmb = l_tmb)
    ## -- Focus on metrics
    diagnostic_plot(list_tmb = l_tmb, focus = c("dispersion", "logLik"))

    Anova to select the best model

    Utilizing the anovaParallel() function enables you to perform model selection by assessing the significance of the fixed effects. You can also include additional parameters like type. For more details, refer to car::Anova.

    ## -- update your fit modifying the model family
    l_anova <- anovaParallel(list_tmb = l_tmb)
    
    ## -- additional settings
    l_anova <- anovaParallel(list_tmb = l_tmb, type = "III" )