DESeq2_wrapper
Description
This is an R package dedicated to perform Differential expression analysis from HTSeq_count results file
Dependencies
Depends:
- R (>= 4.1.2), Imports:
- DESeq2 (>= 1.34.0),
- argparser (>= 0.7.1),
- RColorBrewer (>= 1.1-2),
- dplyr (>= 1.0.7),
- ggplot2 (>= 3.3.5),
- ggrepel (>= 0.9.1),
- gplots (>= 3.1.3),
- pheatmap (>= 1.0.12),
- stringr (>= 1.4.0),
- viridis (>= 0.6.2)
The pandoc package must be installed. On ubuntu run:
sudo apt-get install pandoc
Installation
To install this package, the devtools
package must be installed.
Run in R the following command to install the package:
> library(devtools)
> install_gitlab("LBMC/regards/deseq2-wrapper", host = "https://gitbio.ens-lyon.fr", quiet = FALSE)
Alternatively you can use this procedure
mkdir deseq2-wrapper
git clone http://gitbio.ens-lyon.fr/LBMC/regards/deseq2-wrapper.git deseq2-wrapper
cd deseq2-wrapper
R
Then
library(devtools)
install(".")
Usage
With a command line interface (CLI) script
First, you must create an R file containing only the following code:
# my_R_file.R
#!/bin/Rscript
library(DESeqwrapper)
cli_run_deseq2()
Then you can type the following commands to see if everything works:
$ Rscript my_R_file.R --help
usage: my_R_file.R [--] [--help] [--opts OPTS] [--output OUTPUT]
[--filter FILTER] [--gene_expression_threshold
GENE_EXPRESSION_THRESHOLD] [--basemean_threshold
BASEMEAN_THRESHOLD] [--lfc_threshold LFC_THRESHOLD] design path
formula contrasts
Wrapper to perform DESeq2 enrichment analysis
positional arguments:
design A file containing the design of
interest
path The folder where the tsv file are
stored (default .)
formula The design formula for DESeq2
contrasts The comparisons to perform in the
form of deseq2 contrasts : each
comparisons must be composed of
three feature separed by a '-'. The
first feature corresponds to a
column given in the design file an
present in the --formula, the two
other are the names of the fold
changes for the numerator, and the
names of the fold changes for the
denominator
flags:
-h, --help show this help message and exit
optional arguments:
-x, --opts RDS file containing argument values
-o, --output folder were the results will be
created [default: .]
-f, --filter A file containg the genes id to keep
[default: ]
-g, --gene_expression_threshold A threshold to keep genes with at
least this average expression across
all samples prior to deseq2 analysis
[default: 2]
-b, --basemean_threshold A threshold to keep significant
genes with at least this baseMean
(after deseq2 analysis) [default: 0]
-l, --lfc_threshold A threshold to keep significant
genes with at least this log2fc
[default: 0]
-c, --gene_names Optional: A file containing the
correspondance between geneIDs and
gene namesThe first colomn contains
geneIDs and the header is 'gene'The
second colomn contains corresponding
gene names and the header is
'gene_name [default: ]
-n, --norm A two column dataframe containing
thecolumns sample and size_factor.
Sample column must correspond to the
samples name given in design file.
Size_factorcorrespond to the scaling
value to applyto count data in
deseq2 [default: ]
The design parameter must point to a file with the following structure:
prefix | sample | condition |
---|---|---|
276_DMSO_DMSO_0 | 276_DMSO_DMSO_0 | DMSO |
277_DMSO_DMSO_0 | 277_DMSO_DMSO_0 | DMSO |
278_DMSO_DMSO_0 | 278_DMSO_DMSO_0 | DMSO |
276_BRAF_DMSO_0 | 276_BRAF_DMSO_0 | BRAF |
277_BRAF_DMSO_0 | 277_BRAF_DMSO_0 | BRAF |
278_BRAF_DMSO_0 | 278_BRAF_DMSO_0 | BRAF |
- The
prefix
column must contain a prefix indentifying a unique file containing counts inside the folder identified bypath
- The
sample
column must contain the name of the sample - The
condition
column, the name of the condition. Note that unlike the two previous columns, this column can be renamed (e.g condition can be rename to treatment)
The columns must be tab-separated
.
Note that you can add as many other columns as you want to include every covariate you might need.
The formula
parameter can be: "~ condition". You can add any covariate as you want is they are present inside the design
table. For example " ~ condition + cov1 + cov2".
The gene_names parameter must have the following structure:
gene | gene_name |
---|---|
ENSG00000223972 | DDX11L1 |
ENSG00000227232 | WASH7P |
ENSG00000278267 | MIR6859-1 |
- The
gene
columns must contain the ENSEMBL gene_IDs. - The
gene_name
column must contain the name of the gene corresponding to ENSEMBL gene_ID.
Finally the norm
parameter is an optional parameter that can be provided to use another normalization than the DESeq2 method. In most cases you won't need it. This parameter takes a 2 columns file with the following format:
sample size_factor
276_DMSO_DMSO_0 1
277_DMSO_DMSO_0 2
Note that the header must be provided.
- The
sample
column must corresponds to samples also defined in the sample column of the design file. - The
size_factor
column correspond to a factor used to normalize sample in DESeq2. They will be used in the sizeFactors function of DESeq2.
With the function run_deseq2
You can directly use the function run_deseq2 with R:
> library(DESeqwrapper)
> ?run_deseq2 #display the help of run_deseq2 function
> # example usage
> design <- "path_to/design.txt"
> (design_table <- read.table(design, sep="\t", h = T))
count_files sample condition
1 276_DMSO_DMSO_0_no_spike-in.tsv 276_DMSO_DMSO_0 DMSO
2 277_DMSO_DMSO_0_no_spike-in.tsv 277_DMSO_DMSO_0 DMSO
3 278_DMSO_DMSO_0_no_spike-in.tsv 278_DMSO_DMSO_0 DMSO
4 276_BRAF_DMSO_0_no_spike-in.tsv 276_BRAF_DMSO_0 BRAF
5 277_BRAF_DMSO_0_no_spike-in.tsv 277_BRAF_DMSO_0 BRAF
6 278_BRAF_DMSO_0_no_spike-in.tsv 278_BRAF_DMSO_0 BRAF
> path <- "path/to/htseq_files"
> my_formula <- "~ condition"
> filter_list <- c("gene1", "...", "genen") # A vector containing the gene to keep. It can be equal to NULL (default value) to keep all genes
> output_folder <- "results/"
> my_contrasts <- list(c("condition", "BRAF", "DMSO"), #contrasts: a list containing the column used for the comparison in the design and the test and the control condition.
+ c("condition", "DMSO", "BRAF")) # you can do more that one differential expression analysis this way: (i.e BRAF vs DMSO and DMSO vs BRAF here)
> res <- run_deseq2(design_table, path, my_contrasts,
+ my_formula, filter_list,
+ min_expression = 2, output_folder = output_folder,
+ basemean_threshold = 0, lfc_threshold = 0)
> # Note that lfc_threshold, basemean_threshold and min_expression corresponds to the parameters lfc_threshold, basemean_threshold, gene_expression_threshold of the cli program (see above) respectively.
> # res is a list containing the name of each comparison defined in my_contrasts
> str(res, max.level=1) # 2 comparisons in my_contrast, so we have a list of 2 elements
List of 1
$ BRAF_vs_DMSO:List of 4
$ DMSO_vs_BRAF:List of 4
> str(res$BRAF_vs_DMSO, max.level=1)
$ data :'data.frame': 1 obs. of 4 variables:
$ dds :Formal class 'DESeqDataSet' [package "DESeq2"] with 8 slots
$ results :'data.frame': 11893 obs. of 7 variables:
$ de_results:'data.frame': 1620 obs. of 7 variables:
-
data
: A dataframe indicating The number of differentially expressed genes and the number of up and down regulated genes -
dds
The dds object corresponding to a DESeqDataSet object obtained after running the DESeq function. -
results
A dataframe containing the results of DEseQ2 analysis but for all gene -
de_results
A dataframe containing only differentially expressed genes
output
Here a description of the results file you get by running the package with a test and a control conditions named respectively TEST
and CTRL
:
result_folder
├── DE_plots_TEST_vs_CTRL.pdf
├── DE_statistics.txt
├── Differential_expression_TEST_vs_CTRL_lfct_0.585.txt
├── Differential_expression_TEST_vs_CTRL_lfct_0.585_baseMt_10_sig.txt
├── plots_general_view.pdf
├── readcounts_norm_[N]genes.csv
└── readcounts_raw_[N]genes.csv
-
readcounts_norm_[N]genes.csv
: Raw read counts matrix for each sample. The[N]
corresponds to the number of genes kept after the filter on mean count expression (--gene_expression_threshold
) parameter and the selection of gene of interest if a file is given to the package with the--filter
parameter -
readcounts_norm_[N]genes.csv
: Same thing as above but with counts norm by DESeq2 -
plots_general_view.pdf
: Contains figures describing the experiment:- The total raw read counts for each sample before DESeq2 normalisation and after the filters defined with the
--filter
and--gene_expression_threshold
- The total normalized read counts for each sample before DESeq2 normalisation and after the filters defined with the
--filter
and--gene_expression_threshold
- PCA plots of the sample
- A heatmap displaying the similarity between samples
- A heatmap representing the log10(norm count + 1) of every gene for evert condition
- The per-gene dispersion estimates in relation to the fitted mean-dispersion
- The total raw read counts for each sample before DESeq2 normalisation and after the filters defined with the
-
PCA_plot.html
The PCA plot in html format (Usefull to get the name of each sample ) -
DE_plots_TEST_vs_CTRL.pdf
Distribution of corrected p-values, Volcano plot and MA plot for the chosen comparison -
volcano_plot_BRAF_vs_DMSO.html
An html volcano plot: useful to get the name of genes in volcano plot -
ma_plot_BRAF_vs_DMSO.html
An html MA plot: useful to get the name of genes in MA plot -
Differential_expression_TEST_vs_CTRL_lfct_0.585.txt
DESeq2 result files indicating the log(2fold change) and the p-value of each gene between the TEST and the CTRL condition .Differential_expression_TEST_vs_CTRL_lfct_0.585_baseMt_[M]_sig.txt
Same table as before but with only the significant genes (padj <= 0.05> and with a basemean greater than M (M is defined with the--basemean_threshold
parameter)) -
DE_statistics.txt
A file recapitulating the number total, down and up-regulated genes for every comparison given with the--contrasts
arguments