Commit 0ca79bea authored by nfontrod

README.md: update readme

parent c215a595
......@@ -183,6 +183,8 @@ The following table display the different parameters available for this program:
The result is the same as the output of the `Permutations tool`.
# Distributions software
## Description
......@@ -373,9 +375,96 @@ The description of parameters also applies to this section.
To display the help of the program on the PSMN, you can enter the following command:
```console
$ singularity exec -C -B $PWD:/mnt /Xnfs/abc/singularity/lbmc-exon_enrichment-latest.img python3 -m script.src.descriptions -h
$ singularity exec -C -B $PWD:/mnt /Xnfs/abc/singularity/lbmc-exon_enrichment-latest.img python3 -m script.src.distributions -h
```
# Distributions gtf software
## Description
This tool does exactly the same thing as the Distributions software, but it allows you to do it with the **genome of your choice**.
In this program you must provide a multi-fasta file containing the genome of interest along with a gtf file containing annotations for this genome.
## Prerequisites
### Without singularity image
You must have `python 3.8` installed on your system along with `R >= 3.5`.
It also requires the following python modules:
* lazyparser==0.2.0
* numpy==1.19.2
* pandas==1.0.3
* statsmodels==0.11.1
* scipy==1.3.3
* seaborn==0.11.1
* matplotlib==3.1.2
* rpy2==3.3.3
* pyfaidx==0.6.4
And the following R packages:
* DHARMa==0.3.1
* emmeans==1.4.4
* glmmTMB==1.0.1
### Singularity image
The singularity image is available on the PSMN at `/Xnfs/abc/singularity/lbmc-exon_enrichment-latest.img` and already contains all the required dependencies.
## Usage
### Without singularity image
To display the help of the program, just enter:
```console
$ python3 -m src.distributions_gtf -h
```
The following tables display the different parameters available for this program:
| Required arguments | Description |
|----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -l, --list_file LIST[STR] | A list of files containing feature ids defined in the gtf supplied with the `--gtf` parameter. Note: for genes, the id is the `gene_id` field in the attributes column of the gtf; for exons, it is the `gene_id` field joined to the `exon_number` field with an underscore, as `<gene_id>_<exon_number>` (for example, exon number 3 of the gene with gene_id 'SRSF3' has the id SRSF3_3). |
| -L, --list_group LIST[STR] | A list of names to give to the list of features files given with the --list_file parameter. |
| -G, --gtf STR | The gtf file containing annotation for the supplied genome |
| -g, --genome STR | A multi-fasta file containing the genome of interest |
| -F, --ft_type STR | The kind of feature of interest ('gene' or 'exon' (exon not implemented yet)) |
| -r, --region STR | The region of interest ('gene' for the entire gene, 'exon' for the concatenation of the exon sequences or CDS sequences) |
| Optional arguments | Description |
|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -h, --help | show this help message and exit |
| -C, --cpnt_type STR | The type of component of interest |
| -t, --test STR | The kind of test to perform: enter 'default' to perform a linear model test for distributions of nucleotide frequencies, or a beta regression for other kinds of distributions. Enter 'wilcoxon' to perform a Wilcoxon test. (default 'default') |
| -o, --output STR | Folder where the results will be created (default '.') |
| -f, --figure LIST[STR] | The list of components to display in a figure, ALL to create a figure for every component (default ALL) |
| -p, --percentile FLOAT | percentile of data displayed in the figure. Values above it are not displayed. (default 98) |
| -c, --colors LIST[STR] | List of colors to give to the datasets given with the `--list_file` parameter. It can be left empty to use the default colors; otherwise it must have the same length as the `--list_group` parameter (default []) |
| -m, --merge_violin BOOL | True to merge violins of two different input lists together. Note that for more than two lists, this parameter will be set to False automatically. (default False) |
| -s, --skip_parse BOOL | Skip the gtf deduplication step if it has already been performed by a previous analysis and the deduplicated gtf is given in input with the `--gtf` parameter (deduplicated gtf: no overlapping CDS, one CDS associated with each exon per gene, and not per transcript). |
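As an illustration of the id convention described for `--list_file`, here is a minimal Python sketch that derives a feature id from the attributes column of a gtf line. The names `parse_attributes` and `feature_id` are hypothetical and are not part of the program; the sketch also assumes attributes written as `key "value";` pairs (quotes optional), which may not cover every gtf flavour.

```python
import re

# Matches `key "value";` pairs, with or without quotes around the value.
# Assumption: this is how the attributes column of the gtf is formatted.
ATTRIBUTE = re.compile(r'(\w+) "?([^";]+)"?;')


def parse_attributes(column: str) -> dict:
    """Turn the attributes column of a gtf line into a dictionary."""
    return dict(ATTRIBUTE.findall(column))


def feature_id(column: str, ft_type: str) -> str:
    """Build the id expected in the list files (hypothetical helper)."""
    attributes = parse_attributes(column)
    if ft_type == "gene":
        return attributes["gene_id"]
    # Exon ids join gene_id and exon_number with an underscore.
    return f'{attributes["gene_id"]}_{attributes["exon_number"]}'
```

For example, `feature_id('gene_id "SRSF3"; exon_number "3";', "exon")` gives `SRSF3_3`, which matches the example in the table above.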
### With singularity image
The description of parameters also applies to this section.
To display the help of the program on the PSMN, you can enter the following command:
```console
$ singularity exec -C -B $PWD --pwd $PWD /Xnfs/abc/singularity/lbmc-exon_enrichment-latest.img python3 -m script.src.distributions_gtf -h
```
## Description of the result
See the `Description of the result` section of the Distributions software.
# Stretches Distribution software
......@@ -383,7 +472,7 @@ $ singularity exec -C -B $PWD:/mnt /Xnfs/abc/singularity/lbmc-exon_enrichment-la
The goal of this tool is to compare the enrichments/impoverishments in some stretches of components (nucleotides, codons, amino acids) between several lists of genes given as input. The tool compares the distribution of the number of stretches of each component separately between each pair of input files.
The stretches are defined by two number `A`and `B`that must be given through the parameters `stretche_size`. Example `--stretch_size A B` (see below). To find a stretch in a CDS sequence, a sliding window of size `B` and the step of 1 will be used (given by the second value). Each time that this window contains at least `A` of the same component (i.e `A`adenine for example) then the number of stretch for this component is increased by one, and so on.
The stretches are defined by two numbers `A` and `B` that must be given through the `--stretch_size` parameter, for example `--stretch_size A B` (see below). To find a stretch in a CDS sequence, a sliding window of size `B` (the second value) with a step of 1 is used. Each time this window contains at least `A` occurrences of the same component (e.g. `A` adenines), the number of stretches for this component is increased by one.
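The sliding-window rule described above can be sketched in Python as follows. This is only a literal reading of the paragraph for nucleotide sequences, not the program's actual implementation: the function name `count_stretches` and its arguments are assumptions made for illustration, and the real tool may handle codons, amino acids or overlapping windows differently.

```python
from collections import Counter


def count_stretches(sequence: str, min_count: int, window_size: int) -> Counter:
    """Count, for each component, the windows of `window_size` characters
    (step of 1) that contain at least `min_count` copies of it.

    `min_count` plays the role of A and `window_size` the role of B.
    """
    stretches = Counter()
    # A sequence shorter than the window yields an empty range, hence no stretch.
    for start in range(len(sequence) - window_size + 1):
        window = sequence[start:start + window_size]
        for component, occurrences in Counter(window).items():
            if occurrences >= min_count:
                stretches[component] += 1
    return stretches
```

With `A = 3` and `B = 4`, the sequence `AAAACG` has three windows (`AAAA`, `AAAC`, `AACG`); the first two contain at least three adenines, so `count_stretches("AAAACG", 3, 4)` counts 2 stretches for `A` and none for the other nucleotides.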
By default, the comparison is done with a Poisson regression model if the component type of interest corresponds to nucleotides. If the component type of interest is not nucleotides, then a **negative binomial regression model** is used to compare the distributions of stretches across input files. You can also perform a Wilcoxon test instead of these two tests by setting the `--test` parameter to 'wilcoxon'.
......@@ -509,5 +598,5 @@ The description of parameters also applies to this section.
To display the help of the program on the PSMN, you can enter the following command:
```console
$ singularity exec -C -B $PWD:/mnt /Xnfs/abc/singularity/lbmc-exon_enrichment-latest.img python3 -m script.src.descriptions -h
$ singularity exec -C -B $PWD:/mnt /Xnfs/abc/singularity/lbmc-exon_enrichment-latest.img python3 -m script.src.stretch_distribution -h
```