Skip to content
Snippets Groups Projects
Select Git revision
  • cpm_norm
  • master default protected
  • v1.0
3 results

deseq2-wrapper

  • Clone with SSH
  • Clone with HTTP
  • Fontrodona Nicolas's avatar
    nfontrod authored
    f27b7bde
    History

    DESeq2_wrapper

    Description

    This is an R package dedicated to perform Differential expression analysis from HTSeq_count results file

    Dependencies

    Depends:

    • R (>= 4.1.2), Imports:
    • DESeq2 (>= 1.34.0),
    • argparser (>= 0.7.1),
    • RColorBrewer (>= 1.1-2),
    • dplyr (>= 1.0.7),
    • ggplot2 (>= 3.3.5),
    • ggrepel (>= 0.9.1),
    • gplots (>= 3.1.3),
    • pheatmap (>= 1.0.12),
    • stringr (>= 1.4.0),
    • viridis (>= 0.6.2)

    The pandoc package must be installed. On ubuntu run:

    sudo apt-get install pandoc

    Installation

    To install this package, the devtools package must be installed.

    Run in R the following command to install the package:

    > library(devtools)
    > install_gitlab("LBMC/regards/deseq2-wrapper", host = "https://gitbio.ens-lyon.fr", quiet = FALSE)

    Alternatively you can use this procedure

    mkdir deseq2-wrapper
    git clone http://gitbio.ens-lyon.fr/LBMC/regards/deseq2-wrapper.git deseq2-wrapper
    cd deseq2-wrapper
    R

    Then

    library(devtools)
    install(".")

    Usage

    With a command line interface (CLI) script

    First, you must create an R file containing only the following code:

    # my_R_file.R
    
    #!/bin/Rscript
    library(DESeqwrapper)
    cli_run_deseq2()

    Then you can type the following commands to see if everything works:

    $ Rscript my_R_file.R --help
    
    usage: my_R_file.R [--] [--help] [--opts OPTS] [--output OUTPUT]
           [--filter FILTER] [--gene_expression_threshold
           GENE_EXPRESSION_THRESHOLD] [--basemean_threshold
           BASEMEAN_THRESHOLD] [--lfc_threshold LFC_THRESHOLD] design path
           formula contrasts
    
    Wrapper to perform DESeq2 enrichment analysis
    
    positional arguments:
      design                           A file containing the design of
                                       interest
      path                             The folder where the tsv file are
                                       stored (default .)
      formula                          The design formula for DESeq2
      contrasts                        The comparisons to perform in the
                                       form of deseq2 contrasts : each
                                       comparisons must be composed of
                                       three feature separed by a '-'. The
                                       first feature corresponds to a
                                       column given in the design file an
                                       present in the --formula, the two
                                       other are the names of the fold
                                       changes for the numerator, and the
                                       names of the fold changes for the
                                       denominator
    
    flags:
      -h, --help                       show this help message and exit
    
    optional arguments:
      -x, --opts                       RDS file containing argument values
      -o, --output                     folder were the results will be
                                       created [default: .]
      -f, --filter                     A file containg the genes id to keep
                                       [default: ]
      -g, --gene_expression_threshold  A threshold to keep genes with at
                                       least this average expression across
                                       all samples prior to deseq2 analysis
                                       [default: 2]
      -b, --basemean_threshold         A threshold to keep significant
                                       genes with at least this baseMean
                                       (after deseq2 analysis) [default: 0]
      -l, --lfc_threshold              A threshold to keep significant
                                       genes with at least this log2fc
                                       [default: 0]
      -c, --gene_names                 Optional: A file containing the
                                       correspondance between geneIDs and
                                       gene namesThe first colomn contains
                                       geneIDs and the header is 'gene'The
                                       second colomn contains corresponding
                                       gene names and the header is
                                       'gene_name [default: ]
      -n, --norm                       A two column dataframe containing
                                       thecolumns sample and size_factor.
                                       Sample column must correspond to the
                                       samples name given in design file.
                                       Size_factorcorrespond to the scaling
                                       value to applyto count data in
                                       deseq2 [default: ]

    The design parameter must point to a file with the following structure:

    prefix sample condition
    276_DMSO_DMSO_0 276_DMSO_DMSO_0 DMSO
    277_DMSO_DMSO_0 277_DMSO_DMSO_0 DMSO
    278_DMSO_DMSO_0 278_DMSO_DMSO_0 DMSO
    276_BRAF_DMSO_0 276_BRAF_DMSO_0 BRAF
    277_BRAF_DMSO_0 277_BRAF_DMSO_0 BRAF
    278_BRAF_DMSO_0 278_BRAF_DMSO_0 BRAF
    1. The prefix column must contain a prefix indentifying a unique file containing counts inside the folder identified by path
    2. The sample column must contain the name of the sample
    3. The condition column, the name of the condition. Note that unlike the two previous columns, this column can be renamed (e.g condition can be rename to treatment)

    The columns must be tab-separated.

    Note that you can add as many other columns as you want to include every covariate you might need.

    The formulaparameter can be: "~ condition". You can add any covariate as you want is they are present inside the design table. For example " ~ condition + cov1 + cov2".

    The gene_names parameter must have the following structure:

    gene gene_name
    ENSG00000223972 DDX11L1
    ENSG00000227232 WASH7P
    ENSG00000278267 MIR6859-1
    1. The genecolumns must contain the ENSEMBL gene_IDs.
    2. The gene_name column must contain the name of the gene corresponding to ENSEMBL gene_ID.

    Finally the norm parameter is an optional parameter that can be provided to use another normalization than the DESeq2 method. In most cases you won't need it. This parameter takes a 2 columns file with the following format:

    sample  size_factor
    276_DMSO_DMSO_0 1
    277_DMSO_DMSO_0 2

    Note that the header must be provided.

    • The sample column must corresponds to samples also defined in the sample column of the design file.
    • The size_factor column correspond to a factor used to normalize sample in DESeq2. They will be used in the sizeFactors function of DESeq2.

    With the function run_deseq2

    You can directly use the function run_deseq2 with R:

    > library(DESeqwrapper)
    > ?run_deseq2 #display the help of run_deseq2 function
    > # example usage
    > design <- "path_to/design.txt"
    > (design_table <- read.table(design, sep="\t", h = T))
                          count_files          sample condition
    1 276_DMSO_DMSO_0_no_spike-in.tsv 276_DMSO_DMSO_0      DMSO
    2 277_DMSO_DMSO_0_no_spike-in.tsv 277_DMSO_DMSO_0      DMSO
    3 278_DMSO_DMSO_0_no_spike-in.tsv 278_DMSO_DMSO_0      DMSO
    4 276_BRAF_DMSO_0_no_spike-in.tsv 276_BRAF_DMSO_0      BRAF
    5 277_BRAF_DMSO_0_no_spike-in.tsv 277_BRAF_DMSO_0      BRAF
    6 278_BRAF_DMSO_0_no_spike-in.tsv 278_BRAF_DMSO_0      BRAF
    > path <- "path/to/htseq_files"
    > my_formula <- "~ condition"
    > filter_list <- c("gene1", "...", "genen") # A vector containing the gene to keep. It can be equal to NULL (default value) to keep all genes
    > output_folder <- "results/"
    > my_contrasts <- list(c("condition", "BRAF", "DMSO"), #contrasts: a list containing the column used for the comparison in the design and the test and the control condition.
    + c("condition", "DMSO", "BRAF")) # you can do more that one differential expression analysis this way: (i.e BRAF vs DMSO and DMSO vs BRAF here)
    > res <- run_deseq2(design_table, path, my_contrasts,
    +             my_formula, filter_list,
    +             min_expression = 2, output_folder = output_folder,
    +             basemean_threshold = 0, lfc_threshold = 0)
    > # Note that lfc_threshold, basemean_threshold and min_expression corresponds to the parameters lfc_threshold, basemean_threshold, gene_expression_threshold of the cli program (see above) respectively.
    > # res is a list containing the name of each comparison defined in my_contrasts
    > str(res, max.level=1) # 2 comparisons in my_contrast, so we have a list of 2 elements
    List of 1
     $ BRAF_vs_DMSO:List of 4
     $ DMSO_vs_BRAF:List of 4
    > str(res$BRAF_vs_DMSO, max.level=1)
     $ data      :'data.frame':     1 obs. of  4 variables:
     $ dds       :Formal class 'DESeqDataSet' [package "DESeq2"] with 8 slots
     $ results   :'data.frame':     11893 obs. of  7 variables:
     $ de_results:'data.frame':     1620 obs. of  7 variables:
    • data: A dataframe indicating The number of differentially expressed genes and the number of up and down regulated genes
    • dds The dds object corresponding to a DESeqDataSet object obtained after running the DESeq function.
    • results A dataframe containing the results of DEseQ2 analysis but for all gene
    • de_results A dataframe containing only differentially expressed genes

    output

    Here a description of the results file you get by running the package with a test and a control conditions named respectively TEST and CTRL:

    result_folder
    ├── DE_plots_TEST_vs_CTRL.pdf
    ├── DE_statistics.txt
    ├── Differential_expression_TEST_vs_CTRL_lfct_0.585.txt
    ├── Differential_expression_TEST_vs_CTRL_lfct_0.585_baseMt_10_sig.txt
    ├── plots_general_view.pdf
    ├── readcounts_norm_[N]genes.csv
    └── readcounts_raw_[N]genes.csv
    1. readcounts_norm_[N]genes.csv: Raw read counts matrix for each sample. The [N]corresponds to the number of genes kept after the filter on mean count expression (--gene_expression_threshold) parameter and the selection of gene of interest if a file is given to the package with the --filter parameter
    2. readcounts_norm_[N]genes.csv: Same thing as above but with counts norm by DESeq2
    3. plots_general_view.pdf: Contains figures describing the experiment:
      1. The total raw read counts for each sample before DESeq2 normalisation and after the filters defined with the --filter and --gene_expression_threshold
      2. The total normalized read counts for each sample before DESeq2 normalisation and after the filters defined with the --filter and --gene_expression_threshold
      3. PCA plots of the sample
      4. A heatmap displaying the similarity between samples
      5. A heatmap representing the log10(norm count + 1) of every gene for evert condition
      6. The per-gene dispersion estimates in relation to the fitted mean-dispersion
    4. PCA_plot.html The PCA plot in html format (Usefull to get the name of each sample )
    5. DE_plots_TEST_vs_CTRL.pdf Distribution of corrected p-values, Volcano plot and MA plot for the chosen comparison
    6. volcano_plot_BRAF_vs_DMSO.html An html volcano plot: useful to get the name of genes in volcano plot
    7. ma_plot_BRAF_vs_DMSO.html An html MA plot: useful to get the name of genes in MA plot
    8. Differential_expression_TEST_vs_CTRL_lfct_0.585.txt DESeq2 result files indicating the log(2fold change) and the p-value of each gene between the TEST and the CTRL condition . Differential_expression_TEST_vs_CTRL_lfct_0.585_baseMt_[M]_sig.txt Same table as before but with only the significant genes (padj <= 0.05> and with a basemean greater than M (M is defined with the --basemean_threshold parameter))
    9. DE_statistics.txt A file recapitulating the number total, down and up-regulated genes for every comparison given with the --contrastsarguments