TopGO_wrapper
Description
This R package performs GO enrichment analysis on a DESeq2 differential expression file.
Dependencies Installation
- Depends:
  - R (>= 4.1.2)
- Imports:
  - topGO (>= 2.46.0)
  - org.Hs.eg.db (>= 3.14.0)
  - argparser (>= 0.7.1)
  - forcats (>= 0.5.1)
  - readr (>= 2.1.2)
  - dplyr (>= 1.0.7)
  - ggplot2 (>= 3.3.5)
Installation with conda
To create a conda environment with all the required dependencies, use the topgo-env.yml file provided in this repository. In the project folder that contains this file, run the following command:
conda env create -f topgo-env.yml
Installation
To install this package, the devtools package must be installed.
Run the following commands in R to install the package:
library(devtools)
install_gitlab("LBMC/regards/topgo-wrapper", host = "https://gitbio.ens-lyon.fr", quiet = FALSE)
Other method
mkdir topgo_wrapper
git clone http://gitbio.ens-lyon.fr/LBMC/regards/topgo-wrapper.git topgo_wrapper
cd topgo_wrapper
R
Then, in the R session:
library(devtools)
install(".")
Limitations
For now, it only works with human datasets and only performs GO enrichment using the 'classic' topGO algorithm and the 'fisher' statistic.
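As a rough illustration of what the 'classic' algorithm combined with the 'fisher' statistic amounts to, each GO term is tested on its own with a one-sided Fisher exact test on a 2x2 table of gene counts. The sketch below uses base R only and does not call this package; every count in it is a made-up value chosen for illustration.

```r
# Hypothetical counts (assumptions for illustration, not package output)
universe_size <- 20000  # genes in the gene universe
n_significant <- 500    # differentially expressed genes
n_annotated   <- 986    # genes annotated to the GO term
n_sig_in_term <- 67     # differentially expressed genes annotated to the term

# 2x2 table, filled column by column:
# rows = gene in term / not in term, columns = significant / not significant
counts <- matrix(
  c(n_sig_in_term,
    n_significant - n_sig_in_term,
    n_annotated - n_sig_in_term,
    universe_size - n_annotated - n_significant + n_sig_in_term),
  nrow = 2,
  dimnames = list(term = c("in", "out"), significant = c("yes", "no"))
)

# One-sided test for over-representation of the term among significant genes
p_fisher <- fisher.test(counts, alternative = "greater")$p.value

# Same p-value written as a hypergeometric upper tail
p_hyper <- phyper(n_sig_in_term - 1, n_annotated,
                  universe_size - n_annotated, n_significant,
                  lower.tail = FALSE)

print(p_fisher)
```

Because every term is tested independently, the relationships between parent and child GO terms are not taken into account.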
Usage
With a command line interface (CLI) script
First, you must create an R file (for example named my_R_file.R) containing only the following code:
#!/bin/Rscript
library('TopGOwrapper')
library('topGO')
library('org.Hs.eg.db')
cli_run_topgo()
Then you can type the following commands to see if everything works:
$ Rscript my_R_file.R --help
...
usage: test.R [--] [--help] [--opts OPTS] [--de_file DE_FILE] [--id ID]
[--output OUTPUT] [--top TOP] [--alpha ALPHA]
[--log2fc_threshold LOG2FC_THRESHOLD] [--basemean_threshold
BASEMEAN_THRESHOLD]
Wrapper to perform TopGO enrichment analysis For now, it only work on
human genes, with the fisher enrichment method. Moreover, all genes in
that files are used as the gene universe
flags:
-h, --help show this help message and exit
optional arguments:
-x, --opts RDS file containing argument values
-d, --de_file A file containing deseq2 enrichment
analysis.All genes must be defined in this
file eventthose not differentially
expressed
-i, --id The id identifying the genes in de_file. It
can take the following values: 'entrez',
'genbank', 'alias', 'ensembl', 'symbol',
'genename', 'unigene'. Defaults to 'symbol'
[default: symbol]
-o, --output folder were the results will be created
[default: .]
-t, --top The number of top go term to display
[default: 20]
-a, --alpha The padj threshold in de_file below which
genes are considered as differentially
expressed defaults to 0.05 [default: 0.05]
-l, --log2fc_threshold The log2fc threshold in de_file above
which( in absolute value) genes are
considered as differentially expressed,
defaults to 0 [default: 0]
-b, --basemean_threshold The basemean threshold in de_file below
which genes cannot be considered as
differentially expressed, defaults to 0
[default: 0]
To run the topGO wrapper, either the --de_file parameter or both the --genes and --background parameters must be set.
The --de_file parameter must point to a file with the following structure:
gene | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
---|---|---|---|---|---|---|
A1BG | 240.956914340076 | 0.133226932617053 | 0.328043296485508 | 0 | 1 | 1 |
A2M | 9636.70629697928 | -0.595284877763812 | 0.502037280842549 | -0.0204862829042333 | 0.983655454438162 | 1 |
A4GALT | 402.374065197262 | -1.0931216879495 | 0.347107578223515 | -1.46387379540962 | 0.143228434481562 | 0.976692858834766 |
AAAS | 226.084731795302 | 2.97777448404702 | 1.26846531308784 | 1.88635389502473 | 0.0592472810840546 | 0.549734319535217 |
- The gene column must contain the ID of the gene. You can specify the type of ID with the --id parameter.
- The baseMean column corresponds to the mean gene expression across samples.
- The log2FoldChange column corresponds to the log2 fold change of expression between conditions.
- The padj column corresponds to the adjusted p-values.
Note that only the gene, baseMean, log2FoldChange and padj columns are required.
The columns must be tab-separated.
Note that the gene column can correspond to the row names of the table.
WARNING: This input file must contain ALL genes, whether they are differentially expressed or not, because all genes defined in this file are used as the gene universe.
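Before running the wrapper, it can help to check that the input table has the expected shape. The following is a minimal base R sketch, not part of the package: it writes a small example table (values abridged from the table above) to a temporary file, reads it back as tab-separated text, and checks that the four required columns are present. The variable names are illustrative.

```r
# Write a small tab-separated example table to a temporary file
de_path <- tempfile(fileext = ".txt")
writeLines(c(
  "gene\tbaseMean\tlog2FoldChange\tlfcSE\tstat\tpvalue\tpadj",
  "A1BG\t240.957\t0.133\t0.328\t0\t1\t1",
  "A2M\t9636.706\t-0.595\t0.502\t-0.0205\t0.984\t1",
  "A4GALT\t402.374\t-1.093\t0.347\t-1.464\t0.143\t0.977",
  "AAAS\t226.085\t2.978\t1.268\t1.886\t0.0592\t0.550"
), de_path)

# Read the table back; read.delim expects tab-separated columns by default
de <- read.delim(de_path, stringsAsFactors = FALSE)

# Columns that the README lists as required
required <- c("gene", "baseMean", "log2FoldChange", "padj")
missing_cols <- setdiff(required, colnames(de))
if (length(missing_cols) > 0) {
  stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
}

# Every row of this table is a member of the gene universe
cat("Genes in the universe:", nrow(de), "\n")
```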
The --genes and --background parameters each take a file containing a list of gene IDs, one per line (the IDs in the file should correspond to the ID type given with the --id parameter), in the form of:
SRSF1
SRSF2
...
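Such lists can be produced from any table of genes. The sketch below is one possible way to do it in base R, assuming one gene ID per line as in the example above. The data frame, the 0.05 cutoff and the file names genes.txt and background.txt are all hypothetical choices made for illustration.

```r
# Hypothetical differential expression results (made-up values)
de <- data.frame(
  gene = c("SRSF1", "SRSF2", "A1BG", "A2M"),
  padj = c(0.001, 0.2, 1, 0.03),
  stringsAsFactors = FALSE
)

out_dir <- tempdir()
genes_file <- file.path(out_dir, "genes.txt")            # hypothetical name
background_file <- file.path(out_dir, "background.txt")  # hypothetical name

# Genes of interest: here, those with an adjusted p-value of at most 0.05
genes <- de$gene[!is.na(de$padj) & de$padj <= 0.05]

# One ID per line; the background holds every gene that was tested
writeLines(genes, genes_file)
writeLines(de$gene, background_file)

print(readLines(genes_file))
```

The background list should include the genes of interest as well, since it plays the role of the gene universe.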
Output
Here is a description of the result files you get by running the package:
result_folder
├── [REG]_genes_[GOTYPE]_a[ALPHA]_lfc[LFCT]_b[BT]_top[NUM].pdf
└── [REG]_genes_[GOTYPE]_a[ALPHA]_lfc[LFCT]_b[BT]_top[NUM].txt
Where:
- REG: Can be equal to down_regulated, up_regulated or all_de. When REG is
  - down_regulated: Only consider the down-regulated (padj <= ALPHA & log2FoldChange <= -LFCT & baseMean >= BT) genes...
  - up_regulated: Only consider the up-regulated (padj <= ALPHA & log2FoldChange >= LFCT & baseMean >= BT) genes...
  - all_de: Consider all differentially expressed (padj <= ALPHA & abs(log2FoldChange) >= LFCT & baseMean >= BT) genes...

  ...to perform the GO enrichment analysis against all genes defined in the input file given with the --de_file parameter
- GOTYPE: The GO term type considered in the result file:
  - BP: Biological Process
  - MF: Molecular Function
  - CC: Cellular Component
- ALPHA: consider genes having a padj <= ALPHA in the input file as significant
- LFCT: consider genes having a
  - log2FoldChange <= -LFCT (for down-regulated genes)...
  - log2FoldChange >= LFCT (for up-regulated genes)...
  - abs(log2FoldChange) >= LFCT (for differentially expressed genes)...

  ...in the input file as differentially expressed
- BT: The basemean threshold (given by the -b, --basemean_threshold parameter) below which a gene cannot be taken as significant in the topGO enrichment analysis
- NUM: The number of top enriched GO terms to display
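The selection rules above can be sketched in base R as follows. This is not the package's own code. It assumes that LFCT is a non-negative threshold applied to the absolute log2 fold change, as the help text of --log2fc_threshold describes, so down-regulated genes are taken as log2FoldChange <= -LFCT. It also assumes that genes with a missing padj are never selected. The function name select_genes and the example values are hypothetical.

```r
# Hypothetical helper: return the IDs of the genes kept for a given REG value
select_genes <- function(de,
                         reg = c("all_de", "up_regulated", "down_regulated"),
                         alpha = 0.05, lfct = 0, bt = 0) {
  reg <- match.arg(reg)
  # Shared filters: adjusted p-value and mean expression (NA padj is dropped)
  keep <- !is.na(de$padj) & de$padj <= alpha & de$baseMean >= bt
  # Direction-specific filter on the log2 fold change
  lfc_ok <- switch(reg,
    up_regulated   = de$log2FoldChange >= lfct,
    down_regulated = de$log2FoldChange <= -lfct,
    all_de         = abs(de$log2FoldChange) >= lfct
  )
  de$gene[keep & lfc_ok]
}

# Made-up example table
de <- data.frame(
  gene = c("G1", "G2", "G3", "G4"),
  baseMean = c(100, 50, 5, 200),
  log2FoldChange = c(2, -1.5, 3, 0.2),
  padj = c(0.01, 0.02, 0.001, NA),
  stringsAsFactors = FALSE
)

up   <- select_genes(de, "up_regulated", lfct = 1, bt = 10)
down <- select_genes(de, "down_regulated", lfct = 1, bt = 10)
both <- select_genes(de, "all_de", lfct = 1, bt = 10)
print(list(up = up, down = down, all_de = both))
```

In this example G3 is excluded by the baseMean threshold and G4 by its missing padj, whatever the value of REG.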
The PDF file is a figure displaying the top enriched GO terms, and the text file contains the same information as a tab-delimited table with this structure:
GO.ID | Term | Annotated | Significant | fish | pvalue |
---|---|---|---|---|---|
GO:0051240 | positive regulation of multicellular org... | 986 | 67 | 4.6e-10 | 4.5978e-10 |
GO:0010941 | regulation of cell death | 1206 | 76 | 8.2e-10 | 8.2101e-10 |
GO:0042127 | regulation of cell population proliferat... | 1171 | 74 | 1.3e-09 | 1.2998e-09 |
Where:
- fish: rounded p-value of the GO term enrichment analysis using the Fisher method
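The text file can be loaded back into R for further filtering or plotting. The sketch below assumes the file is tab-delimited with the header shown above. It recreates the first two rows of the example table in a temporary file instead of reading a real result file, and the ratio column is an illustrative addition, not something the package writes.

```r
# Recreate two rows of the example result table in a temporary file
res_path <- tempfile(fileext = ".txt")
writeLines(c(
  "GO.ID\tTerm\tAnnotated\tSignificant\tfish\tpvalue",
  "GO:0051240\tpositive regulation of multicellular org...\t986\t67\t4.6e-10\t4.5978e-10",
  "GO:0010941\tregulation of cell death\t1206\t76\t8.2e-10\t8.2101e-10"
), res_path)

# Read the tab-delimited table
res <- read.delim(res_path, stringsAsFactors = FALSE)

# Fraction of the annotated genes that are significant, for each term
res$ratio <- res$Significant / res$Annotated

# Sort terms by increasing p-value
res <- res[order(res$pvalue), ]
print(res[, c("GO.ID", "Term", "pvalue", "ratio")])
```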
Here is the kind of figure this tool produces