TopGO_wrapper

Description

This is an R package for performing GO enrichment analysis from a DESeq2 differential expression file.

Dependencies Installation

  • Depends:
    • R (>= 4.1.2)
  • Imports:
    • topGO (>= 2.46.0)
    • org.Hs.eg.db (>= 3.14.0)
    • argparser (>= 0.7.1)
    • forcats (>= 0.5.1)
    • readr (>= 2.1.2)
    • dplyr (>= 1.0.7)
    • ggplot2 (>= 3.3.5)
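
As a quick sanity check before installing, the minimum versions listed above can be compared against what is currently installed. This is a minimal sketch: `check_deps` is a hypothetical helper written for this example and is not part of the package.

```r
# Minimum versions copied from the Imports list above.
required <- c(
  topGO = "2.46.0",
  org.Hs.eg.db = "3.14.0",
  argparser = "0.7.1",
  forcats = "0.5.1",
  readr = "2.1.2",
  dplyr = "1.0.7",
  ggplot2 = "3.3.5"
)

# Hypothetical helper (not part of TopGOwrapper): reports, for each package,
# whether it is installed and whether its version meets the minimum.
check_deps <- function(required) {
  pkgs <- names(required)
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  ok <- vapply(pkgs, function(p) {
    installed[[p]] && utils::packageVersion(p) >= required[[p]]
  }, logical(1))
  data.frame(
    package = pkgs,
    minimum = unname(required),
    installed = unname(installed),
    ok = unname(ok)
  )
}

print(check_deps(required))
```

Any row with `ok` equal to `FALSE` points to a package that is missing or older than the listed minimum.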

Installation with conda

To create a conda environment with all the required dependencies, use the topgo-env.yml file provided in this repository. From the project folder containing this file, run the following command:

conda env create -f topgo-env.yml

Installation

To install this package, the devtools package must be installed.

Run in R the following command to install the package:

library(devtools)
install_gitlab("LBMC/regards/topgo-wrapper", host = "https://gitbio.ens-lyon.fr", quiet = FALSE)

Other method

mkdir topgo_wrapper
git clone http://gitbio.ens-lyon.fr/LBMC/regards/topgo-wrapper.git topgo_wrapper
cd topgo_wrapper
R

Then

library(devtools)
install(".")

Limitations

For now, it only works with human datasets and only performs GO enrichment using the 'classic' topGO algorithm and the 'fisher' statistic.

Usage

With a command line interface (CLI) script

First, you must create an R file (for example named my_R_file.R) containing only the following code:

#!/bin/Rscript

library('TopGOwrapper')
library('topGO')
library('org.Hs.eg.db')
cli_run_topgo()

Then you can type the following commands to see if everything works:

$ Rscript my_R_file.R --help
...
usage: test.R [--] [--help] [--opts OPTS] [--de_file DE_FILE] [--id ID]
       [--output OUTPUT] [--top TOP] [--alpha ALPHA]
       [--log2fc_threshold LOG2FC_THRESHOLD] [--basemean_threshold
       BASEMEAN_THRESHOLD]

Wrapper to perform TopGO enrichment analysis For now, it only work on
human genes, with the fisher enrichment method. Moreover, all genes in
that files are used as the gene universe

flags:
  -h, --help                show this help message and exit

optional arguments:
  -x, --opts                RDS file containing argument values
  -d, --de_file             A file containing deseq2 enrichment
                            analysis.All genes must be defined in this
                            file eventthose not differentially
                            expressed
  -i, --id                  The id identifying the genes in de_file. It
                            can take the following values: 'entrez',
                            'genbank', 'alias', 'ensembl', 'symbol',
                            'genename', 'unigene'. Defaults to 'symbol'
                            [default: symbol]
  -o, --output              folder were the results will be created
                            [default: .]
  -t, --top                 The number of top go term to display
                            [default: 20]
  -a, --alpha               The padj threshold in de_file below which
                            genes are considered as differentially
                            expressed defaults to 0.05 [default: 0.05]
  -l, --log2fc_threshold    The log2fc threshold in de_file above
                            which( in absolute value) genes are
                            considered as differentially expressed,
                            defaults to 0 [default: 0]
  -b, --basemean_threshold  The basemean threshold in de_file below
                            which genes cannot be considered as
                            differentially expressed, defaults to 0
                            [default: 0]

To run the topGO wrapper, either the --de_file parameter or both the --genes and --background parameters must be set.

The de_file parameter must correspond to a file with the following structure:

gene baseMean log2FoldChange lfcSE stat pvalue padj
A1BG 240.956914340076 0.133226932617053 0.328043296485508 0 1 1
A2M 9636.70629697928 -0.595284877763812 0.502037280842549 -0.0204862829042333 0.983655454438162 1
A4GALT 402.374065197262 -1.0931216879495 0.347107578223515 -1.46387379540962 0.143228434481562 0.976692858834766
AAAS 226.084731795302 2.97777448404702 1.26846531308784 1.88635389502473 0.0592472810840546 0.549734319535217
  1. The gene column must contain the ID of the gene. You can specify the type of ID with the --id parameter.
  2. The baseMean column corresponds to the mean gene expression across samples.
  3. The log2FoldChange column corresponds to the log2 fold change of expression between conditions.
  4. The padj column corresponds to the adjusted p-values.

Note that only the gene, baseMean, log2FoldChange and padj columns are required.

The columns must be tab-separated.

Note that the gene column can correspond to the row names of the table.

** WARNING: This input file must contain ALL genes whether they are differentially expressed or not. Indeed, all genes defined in this file are used as the gene universe. **
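
The following sketch shows one way such a file could be written from R. The data frame is a toy stand-in built from the first rows of the example above, and the file name `de_file.txt` is an assumption made for this example. With real data, the table would typically come from something like `as.data.frame(DESeq2::results(dds))`, where gene IDs are stored as row names.

```r
# Toy stand-in for a DESeq2 results table, with gene IDs as row names.
res <- data.frame(
  baseMean = c(240.956914340076, 9636.70629697928),
  log2FoldChange = c(0.133226932617053, -0.595284877763812),
  padj = c(1, 1),
  row.names = c("A1BG", "A2M")
)

# Move the row names into an explicit 'gene' column next to the
# three other required columns.
de_table <- data.frame(gene = rownames(res), res)

# Hypothetical output path; tab-separated, no quotes, no row names.
de_path <- "de_file.txt"
write.table(de_table, file = de_path, sep = "\t",
            quote = FALSE, row.names = FALSE)
```

Remember to write every tested gene, not only the significant ones, since the whole file serves as the gene universe.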

The --genes and --background parameters each take a file containing a list of gene IDs (the IDs in the file must match the ID type given with the --id parameter), one per line:

SRSF1
SRSF2
...
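
A minimal sketch for producing these two files from R is shown below. The gene sets and the file names `genes.txt` and `background.txt` are assumptions made for this example. It also assumes the background contains the genes of interest, as is usual for a gene universe.

```r
# Hypothetical gene sets: genes of interest and the background (universe).
genes <- c("SRSF1", "SRSF2")
background <- c("SRSF1", "SRSF2", "A1BG", "A2M", "A4GALT", "AAAS")

# One ID per line, no header. Both file names are hypothetical.
writeLines(genes, "genes.txt")
writeLines(background, "background.txt")
```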

Output

Here is a description of the result files you get by running the package:

result_folder
├── [REG]_genes_[GOTYPE]_a[ALPHA]_lfc[LFCT]_b[BT]_top[NUM].pdf
└── [REG]_genes_[GOTYPE]_a[ALPHA]_lfc[LFCT]_b[BT]_top[NUM].txt

Where:

  • REG: Can be equal to down_regulated, up_regulated or all_de. When REG is
    • down_regulated: Only consider the down-regulated (padj <= ALPHA & log2FoldChange <= LFCT & baseMean >= BT) genes...
    • up_regulated: Only consider the up-regulated (padj <= ALPHA & log2FoldChange >= LFCT & baseMean >= BT) genes...
    • all_de: Consider all differentially expressed (padj <= ALPHA & abs(log2FoldChange) >= LFCT & baseMean >= BT) genes...

...to perform the GO enrichment analysis against all genes defined in the input file given with the --de_file parameter

  • GOTYPE: The GO term type considered in the result file:
    • BP: Biological Process
    • MF: Molecular Function
    • CC: Cellular Component
  • ALPHA: consider genes having a padj <= ALPHA in the input file as significant
  • LFCT: consider genes having a
    • log2FoldChange <= LFCT (for down-regulated genes)...
    • log2FoldChange >= LFCT (for up-regulated genes)...
    • abs(log2FoldChange) >= LFCT (for differentially expressed genes)...

... in the input file as differentially expressed

  • BT: The baseMean threshold (given by the -b, --basemean_threshold parameter) below which a gene cannot be taken as significant in the topGO enrichment analysis

  • NUM: The top number of enriched go term to display
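
To make the REG definitions concrete, here is a small sketch of the gene selection in base R. `select_genes` is a hypothetical helper written for this example and is not the package's actual code. It assumes the log2FoldChange threshold is applied in absolute value, as the CLI help states, so down-regulated genes are taken as log2FoldChange <= -LFCT. With the default LFCT of 0, this coincides with the conditions listed above. Excluding genes with a missing padj is also an assumption.

```r
# Hypothetical helper (not part of TopGOwrapper) illustrating REG selection.
select_genes <- function(de, reg, alpha = 0.05, lfct = 0, bt = 0) {
  # Assumption: genes with a missing padj are never selected.
  keep <- !is.na(de$padj) & de$padj <= alpha & de$baseMean >= bt
  lfc_ok <- switch(reg,
    # Assumption: the threshold is symmetric around 0.
    down_regulated = de$log2FoldChange <= -lfct,
    up_regulated = de$log2FoldChange >= lfct,
    all_de = abs(de$log2FoldChange) >= lfct,
    stop("unknown REG: ", reg)
  )
  de$gene[keep & lfc_ok]
}

# Toy table with made-up values, for illustration only.
de <- data.frame(
  gene = c("g1", "g2", "g3", "g4"),
  baseMean = c(100, 50, 5, 200),
  log2FoldChange = c(2, -1.5, 3, 0.1),
  padj = c(0.01, 0.02, 0.001, NA)
)

up <- select_genes(de, "up_regulated", lfct = 1, bt = 10)
down <- select_genes(de, "down_regulated", lfct = 1, bt = 10)
all_de <- select_genes(de, "all_de", lfct = 1, bt = 10)
```

In this toy table, g3 is dropped by the baseMean threshold and g4 by its missing padj, whichever REG is used.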

The PDF file is a figure displaying the top enriched GO terms, and the text file contains the same information as a tab-delimited table with this structure:

GO.ID Term Annotated Significant fish pvalue
GO:0051240 positive regulation of multicellular org... 986 67 4.6e-10 4.5978e-10
GO:0010941 regulation of cell death 1206 76 8.2e-10 8.2101e-10
GO:0042127 regulation of cell population proliferat... 1171 74 1.3e-09 1.2998e-09

Where:

  • fish: rounded p-value of the GO term enrichment analysis using the Fisher method
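
The result table can be loaded back into R for further filtering. The sketch below assumes the file is tab-separated with a header line. It reads a toy copy of the first two rows of the example above from a string; with a real run, the path of the .txt result file would be passed to `read.delim` instead of `text =`. The `ratio` column is an addition made for this example and is not produced by the package.

```r
# Toy copy of the first rows of the example table, tab-separated.
txt <- paste(
  "GO.ID\tTerm\tAnnotated\tSignificant\tfish\tpvalue",
  "GO:0051240\tpositive regulation of multicellular org...\t986\t67\t4.6e-10\t4.5978e-10",
  "GO:0010941\tregulation of cell death\t1206\t76\t8.2e-10\t8.2101e-10",
  sep = "\n"
)

go <- read.delim(text = txt, stringsAsFactors = FALSE)

# Fraction of the annotated genes of each term found among the selected genes.
go$ratio <- go$Significant / go$Annotated

# Terms ordered by increasing p-value.
go_sorted <- go[order(go$pvalue), c("GO.ID", "Term", "ratio", "pvalue")]
print(go_sorted)
```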

Here is the kind of figure this tool produces:

[Figure: output]