Skip to content
Snippets Groups Projects
Select Git revision
  • f86c33b1405e20e70c52cf6a77e9ffd223724f93
  • master default protected
  • cpm_norm
  • v1.0
4 results

deseq2-wrapper

DESeq2_wrapper

Description

This is an R package dedicated to perform Differential expression analysis from HTSeq_count results file

Dependencies

Depends:

  • R (>= 4.1.2), Imports:
  • DESeq2 (>= 1.34.0),
  • argparser (>= 0.7.1),
  • RColorBrewer (>= 1.1-2),
  • dplyr (>= 1.0.7),
  • ggplot2 (>= 3.3.5),
  • ggrepel (>= 0.9.1),
  • gplots (>= 3.1.3),
  • pheatmap (>= 1.0.12),
  • stringr (>= 1.4.0),
  • viridis (>= 0.6.2)

The pandoc package must be installed. On ubuntu run:

sudo apt-get install pandoc

Installation

To install this package, the devtools package must be installed.

Run in R the following command to install the package:

> library(devtools)
> install_gitlab("LBMC/regards/deseq2-wrapper", host = "https://gitbio.ens-lyon.fr", quiet = FALSE)

Alternatively you can use this procedure

mkdir deseq2-wrapper
git clone http://gitbio.ens-lyon.fr/LBMC/regards/deseq2-wrapper.git deseq2-wrapper
cd deseq2-wrapper
R

Then

library(devtools)
install(".")

Usage

With a command line interface (CLI) script

First, you must create an R file containing only the following code:

# my_R_file.R

#!/bin/Rscript
library(DESeqwrapper)
cli_run_deseq2()

Then you can type the following commands to see if everything works:

$ Rscript my_R_file.R --help

usage: my_R_file.R [--] [--help] [--opts OPTS] [--output OUTPUT]
       [--filter FILTER] [--gene_expression_threshold
       GENE_EXPRESSION_THRESHOLD] [--basemean_threshold
       BASEMEAN_THRESHOLD] [--lfc_threshold LFC_THRESHOLD] design path
       formula contrasts

Wrapper to perform DESeq2 enrichment analysis

positional arguments:
  design                           A file containing the design of
                                   interest
  path                             The folder where the tsv file are
                                   stored (default .)
  formula                          The design formula for DESeq2
  contrasts                        The comparisons to perform in the
                                   form of deseq2 contrasts : each
                                   comparisons must be composed of
                                   three feature separed by a '-'. The
                                   first feature corresponds to a
                                   column given in the design file an
                                   present in the --formula, the two
                                   other are the names of the fold
                                   changes for the numerator, and the
                                   names of the fold changes for the
                                   denominator

flags:
  -h, --help                       show this help message and exit

optional arguments:
  -x, --opts                       RDS file containing argument values
  -o, --output                     folder were the results will be
                                   created [default: .]
  -f, --filter                     A file containg the genes id to keep
                                   [default: ]
  -g, --gene_expression_threshold  A threshold to keep genes with at
                                   least this average expression across
                                   all samples prior to deseq2 analysis
                                   [default: 2]
  -b, --basemean_threshold         A threshold to keep significant
                                   genes with at least this baseMean
                                   (after deseq2 analysis) [default: 0]
  -l, --lfc_threshold              A threshold to keep significant
                                   genes with at least this log2fc
                                   [default: 0]
  -c, --gene_names                 Optional: A file containing the
                                   correspondance between geneIDs and
                                   gene namesThe first colomn contains
                                   geneIDs and the header is 'gene'The
                                   second colomn contains corresponding
                                   gene names and the header is
                                   'gene_name [default: ]

The design parameter must point to a file with the following structure:

prefix sample condition
276_DMSO_DMSO_0 276_DMSO_DMSO_0 DMSO
277_DMSO_DMSO_0 277_DMSO_DMSO_0 DMSO
278_DMSO_DMSO_0 278_DMSO_DMSO_0 DMSO
276_BRAF_DMSO_0 276_BRAF_DMSO_0 BRAF
277_BRAF_DMSO_0 277_BRAF_DMSO_0 BRAF
278_BRAF_DMSO_0 278_BRAF_DMSO_0 BRAF
  1. The prefix column must contain a prefix indentifying a unique file containing counts inside the folder identified by path
  2. The sample column must contain the name of the sample
  3. The condition column, the name of the condition. Note that unlike the two previous columns, this column can be renamed (e.g condition can be rename to treatment)

The columns must be tab-separated.

Note that you can add as many other columns as you want to include every covariate you might need.

The formulaparameter can be: "~ condition". You can add any covariate as you want is they are present inside the design table. For example " ~ condition + cov1 + cov2".

The gene_names parameter must have the following structure:

gene gene_name
ENSG00000223972 DDX11L1
ENSG00000227232 WASH7P
ENSG00000278267 MIR6859-1
  1. The genecolumns must contain the ENSEMBL gene_IDs.
  2. The gene_name column must contain the name of the gene corresponding to ENSEMBL gene_ID.

With the function run_deseq2

You can directly use the function run_deseq2 with R:

> library(DESeqwrapper)
> ?run_deseq2 #display the help of run_deseq2 function
> # example usage
> design <- "path_to/design.txt"
> (design_table <- read.table(design, sep="\t", h = T))
                      count_files          sample condition
1 276_DMSO_DMSO_0_no_spike-in.tsv 276_DMSO_DMSO_0      DMSO
2 277_DMSO_DMSO_0_no_spike-in.tsv 277_DMSO_DMSO_0      DMSO
3 278_DMSO_DMSO_0_no_spike-in.tsv 278_DMSO_DMSO_0      DMSO
4 276_BRAF_DMSO_0_no_spike-in.tsv 276_BRAF_DMSO_0      BRAF
5 277_BRAF_DMSO_0_no_spike-in.tsv 277_BRAF_DMSO_0      BRAF
6 278_BRAF_DMSO_0_no_spike-in.tsv 278_BRAF_DMSO_0      BRAF
> path <- "path/to/htseq_files"
> my_formula <- "~ condition"
> filter_list <- c("gene1", "...", "genen") # A vector containing the gene to keep. It can be equal to NULL (default value) to keep all genes
> output_folder <- "results/"
> my_contrasts <- list(c("condition", "BRAF", "DMSO"), #contrasts: a list containing the column used for the comparison in the design and the test and the control condition.
+ c("condition", "DMSO", "BRAF")) # you can do more that one differential expression analysis this way: (i.e BRAF vs DMSO and DMSO vs BRAF here)
> res <- run_deseq2(design_table, path, my_contrasts,
+             my_formula, filter_list,
+             min_expression = 2, output_folder = output_folder,
+             basemean_threshold = 0, lfc_threshold = 0)
> # Note that lfc_threshold, basemean_threshold and min_expression corresponds to the parameters lfc_threshold, basemean_threshold, gene_expression_threshold of the cli program (see above) respectively.
> # res is a list containing the name of each comparison defined in my_contrasts
> str(res, max.level=1) # 2 comparisons in my_contrast, so we have a list of 2 elements
List of 1
 $ BRAF_vs_DMSO:List of 4
 $ DMSO_vs_BRAF:List of 4
> str(res$BRAF_vs_DMSO, max.level=1)
 $ data      :'data.frame':     1 obs. of  4 variables:
 $ dds       :Formal class 'DESeqDataSet' [package "DESeq2"] with 8 slots
 $ results   :'data.frame':     11893 obs. of  7 variables:
 $ de_results:'data.frame':     1620 obs. of  7 variables:
  • data: A dataframe indicating The number of differentially expressed genes and the number of up and down regulated genes
  • dds The dds object corresponding to a DESeqDataSet object obtained after running the DESeq function.
  • results A dataframe containing the results of DEseQ2 analysis but for all gene
  • de_results A dataframe containing only differentially expressed genes

output

Here a description of the results file you get by running the package with a test and a control conditions named respectively TEST and CTRL:

result_folder
├── DE_plots_TEST_vs_CTRL.pdf
├── DE_statistics.txt
├── Differential_expression_TEST_vs_CTRL_lfct_0.585.txt
├── Differential_expression_TEST_vs_CTRL_lfct_0.585_baseMt_10_sig.txt
├── plots_general_view.pdf
├── readcounts_norm_[N]genes.csv
└── readcounts_raw_[N]genes.csv
  1. readcounts_norm_[N]genes.csv: Raw read counts matrix for each sample. The [N]corresponds to the number of genes kept after the filter on mean count expression (--gene_expression_threshold) parameter and the selection of gene of interest if a file is given to the package with the --filter parameter
  2. readcounts_norm_[N]genes.csv: Same thing as above but with counts norm by DESeq2
  3. plots_general_view.pdf: Contains figures describing the experiment:
    1. The total raw read counts for each sample before DESeq2 normalisation and after the filters defined with the --filter and --gene_expression_threshold
    2. The total normalized read counts for each sample before DESeq2 normalisation and after the filters defined with the --filter and --gene_expression_threshold
    3. PCA plots of the sample
    4. A heatmap displaying the similarity between samples
    5. A heatmap representing the log10(norm count + 1) of every gene for evert condition
    6. The per-gene dispersion estimates in relation to the fitted mean-dispersion
  4. PCA_plot.html The PCA plot in html format (Usefull to get the name of each sample )
  5. DE_plots_TEST_vs_CTRL.pdf Distribution of corrected p-values, Volcano plot and MA plot for the chosen comparison
  6. volcano_plot_BRAF_vs_DMSO.html An html volcano plot: useful to get the name of genes in volcano plot
  7. ma_plot_BRAF_vs_DMSO.html An html MA plot: useful to get the name of genes in MA plot
  8. Differential_expression_TEST_vs_CTRL_lfct_0.585.txt DESeq2 result files indicating the log(2fold change) and the p-value of each gene between the TEST and the CTRL condition . Differential_expression_TEST_vs_CTRL_lfct_0.585_baseMt_[M]_sig.txt Same table as before but with only the significant genes (padj <= 0.05> and with a basemean greater than M (M is defined with the --basemean_threshold parameter))
  9. DE_statistics.txt A file recapitulating the number total, down and up-regulated genes for every comparison given with the --contrastsarguments