README.md 45.3 KB
Newer Older
nfontrod's avatar
nfontrod committed
1
2
# Enrichment analysis

3
This tool contains to main softwares:
nfontrod's avatar
nfontrod committed
4
5
6
* permutations: Allow to test the enrichment/impoverishment in codon/encoded amino acids/nucleotides/etc. from a given file (containing fasterDB id of exons or genes) with a permutation test on many control test sets. Those test sets have the same size than the number of ids in the input files
* distributions: Allow to test the enrichment/impoverishment in codon/encoded amino acids/nucleotides/etc. across many input files containing fasterDB ids of exons or genes. The tests performed here can be a wilcoxon test, a beta regression test or a simple linear model test.
* stretch_distribution :  Allow to test the enrichment/impoverishment in stretches of codon/encoded amino acides and nucleotides across many input files containing fasterDB ids of genes (only). The test performed here can be a wilcoxon test (if chosen by the user) or a negative binomial/poisson regression test.
7
8
9
10
11

Those tools share the same singularity image in the PSMN

# Permutations software

nfontrod's avatar
nfontrod committed
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
## Description

The goal of this tool is to identify enrichment/impoverishment in nucleotides, di-nucelotides, tri-nucleotides, codon, amino acids in a list of exons or genes given in input. To perform this enrichment analysis, the tool uses a permutation test.

The permutation test uses the following procedure: it randomly samples many sets of gene/exons of the same size as the set of exon/genes of interest. Note that the control sets can't contain any genes/exons present in the interest set of gene/exons.


## Prerequisites

### Without singularity image

This program was tested with `python3.8` although version 3.6 to 3.9 are believed to be working.

It also requires the following modules:

* lazyparser==0.2.0
28
* numpy==1.19.2
nfontrod's avatar
nfontrod committed
29
30
31
* pandas==1.0.3
* statsmodels==0.11.1
* tqdm==4.47.0
32
33
34
35
* loguru==0.5.3
* seaborn==0.11.1
* matplotlib==3.1.2

nfontrod's avatar
nfontrod committed
36
37
38
39
40
41
42

Finally, you must have the ChIA-pet database in your `data` folder named `chia_pet_database.db`. You can build this database by following the instruction displayed [here](https://gitbio.ens-lyon.fr/LBMC/regards/chia-pet_network).
A file name `data/exon_orf.bed` is also required

The first thing you have to did before launching the program is to run the command:

```console
nfontrod's avatar
nfontrod committed
43
$ python3 -m src.permutations.compute_orf_length
nfontrod's avatar
nfontrod committed
44
45
46
47
48
49
50
51
52
```
### Singularity image

The singularity image is available in the PSMN at `/Xnfs/abc/singularity/lbmc-exon_enrichment-latest.img` and contains already all the required dependencies.

## Usage

### Without singularity images 

nfontrod's avatar
nfontrod committed
53
To display the help of the program, just enter
nfontrod's avatar
nfontrod committed
54
55

```console
nfontrod's avatar
nfontrod committed
56
$ python3 -m src.permutations -h
nfontrod's avatar
nfontrod committed
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
```

The following table display the different parameters available for this program:

| Required parameters 	| Description                                                                                                                                                                                                                                  	|
|:---------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| -e / --exon_file    	| A file containing a list of FasterDB exons or genes id. An exon file in Farline format can also be passed in input. This file must contains 3 columns seprated by tabulation. The gene symbol the position of exons and its genomic interval 	|
| -g / --group_name   	| A name to identify the list of genes/exons given                                                                                                                                                                                             	|
| -f / --feature      	| The kind of feature of interest ('gene' or 'exon')                                                                                                                                                                                           	|


| Optional parameters   | Description                                                                                                                                                                                                                                       |
|:---------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| -h / --help           | Show the help message and exit                                                                                                                                                                                                                    |
| -r / --region         | The region of gene analysed. Used only if `--feature` is gene. It can be 'gene' (to compute the frequency on the entire gene), exon (to compute the frequency only on exons) or intron (to compute the frequnecy only on intron). Default 'gene'. |
| -c / --component_type | The type of component for which we want to compute the enrichment. It can be 'nt' (for nucleotides), 'dnt' (for di-nucleotides), tnt (for tri-nucleotides), codon, aa (for amino acids). Default 'nt')                                            |
| -i / --iteration      | The number of iteration used to make permutation test (default 10000)                                                                                                                                                                             |
| -o / --output         | Folder where the result will be created                                                                                                                                                                                                           |
nfontrod's avatar
nfontrod committed
75
76
| -C / --ctrl           | The type of control to use (`ALL` to use all genes/exons as control, `RBP` to use only RBP, `ACE` to use alternative coding exons or `CCE` to use constitutive coding exons (note that the two last options can't be use on genes)). Note that you can also enter a file containing ids of FasterDB genes or exons to use this list as a control |
| -F / --figures        | A list of components for which you want to display the density figures of their mean content in the control sets and in the test set. If you want to display the figure for every component, juste enter ALL. If you want t odisplay this figure for a combination of component, you can enter ``cp1+cp2`` (for example to display to content of both A and T nucleotide you can enter ``A+T``). Finally to display a figure for every single nucleotide and two combination of three nucleotides for example, enter ``ALL cp1+cp2+cp3 cp4+cp1+cp2`` 
nfontrod's avatar
nfontrod committed
77
| -s / --shared_ft      | True to have shared features between different lists of features (control, test), False else. (default False) |
nfontrod's avatar
nfontrod committed
78
| -c / --colors LIST    | List of colors corresponding to the curve color, and the test sample color. (default (darkgrey, indianred)) |
nfontrod's avatar
nfontrod committed
79

nfontrod's avatar
nfontrod committed
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108

## Description of the result tables

Example for an enrichment analysis performed on amino acid

| aa | mean_ctrl          | mean_regulated | ratio | pval    | padj    | regulation |
|:---:|:------------------:|:--------------:|:-----:|:-------:|:-------:|:----------:|
| A  | 6.797893764019312  | 5.59422        | 0.823 | <0.0001 | <0.002  |  -         |
| C  | 2.241274318201287  | 1.53997        | 0.687 | <0.0001 | <0.002  |  -         |
| D  | 4.8635546642849325 | 4.95645        | 1.019 | 0.1552  | 0.30355 |  .         |
| E  | 7.116334633811131  | 10.52365       | 1.479 | <0.0001 | <0.002  |  +         |

The column:

* `aa` : contains the list of every amino acid of interest. It we use another parameter for `--component_type` for example `nt`, then, the column name will also be `nt`.
* `mean_ctrl` : Indicates the mean frequency of the component in the sets of control exons/genes
* `mean_regulated` : Indicates the mean frequency of the component in the set of exons/genes given in input
* `ratio` : The ratio of the column `mean_regulated` / `mean_ctrl`.
* `pval` : permutation test's p-value
* `p-adj` : The corrected p-value (FDR method)
* `regulation`: ' + ' if the interest set of exons/genes is enriched in the input list for a given component, ' - ' for impoverishment, ' . ' for no enrichment/impoverishment.

### With singularity image

The description of parameters also applies to this section.

To launch display the help of the program in the psmn, you can enter the following command:

```console
109
$ singularity exec -C -B $PWD:/mnt /Xnfs/abc/singularity/lbmc-exon_enrichment-latest.img python3 -m script.src.permutations -h
nfontrod's avatar
nfontrod committed
110
111
112
113
```

To run an enrichment analysis:
```
114
singularity exec -C -B $PWD:/mnt /Xnfs/abc/singularity/lbmc-exon_enrichment-latest.img python3 -m script.src.permutations --exon_file /mnt/test_file.txt -g Test_tra2 -f exon -c aa -o /mnt/
nfontrod's avatar
nfontrod committed
115
116
```

nfontrod's avatar
nfontrod committed
117
* The `singularity exec` part mean that we're going to execute the command after the name of the image: here `python3 -m script.src.permutations --exon_file /mnt/data/input_TRA2A_B-union.txt -g Test_tra2 -f exon -c aa -o /mnt/result/`
nfontrod's avatar
nfontrod committed
118
119
120
121
* The `$PWD:/mnt` means that we are mounting on the singularity container under the `/mnt` directory the current working directory.
* The `--exon_file /mnt/test.txt` mean that we have a file named `test.txt` directly on the working directory that is mounted under `/mnt` in the container. That why we can use `/mnt/test.txt` to reach the file.
* The `-o /mnt/result/` means that the results will be created direcly under the working directory.

nfontrod's avatar
nfontrod committed
122
123
124

## Build a singularity image:

nfontrod's avatar
nfontrod committed
125
To run this a make a working image, make sure you have all the files under the `data` folder described on the `prerequisites` section and you have launched `python3 -m src.permutations.compute_orf_length` 
nfontrod's avatar
nfontrod committed
126
127
```console
$ sudo singularity build ./bin/lbmc-exon_enrichment-latest.img Singularity
128
129
```

nfontrod's avatar
nfontrod committed
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
# Permutation_bed software

## Description

This tool does exactly the same thing as the tool `Permutations` but with different input files: It uses two bed files, one containing genomic intervals of interest and the over intervals used as control. It also takes a genome file in input to be able to infer the genomic sequences defined in the bed files.

## Prerequisites

This tool was written with python 3.8 and requires the following library to work 

* lazyparser==0.2.0
* numpy==1.19.2
* pandas==1.0.3
* statsmodels==0.11.1
* tqdm==4.47.0
* loguru==0.5.3
* seaborn==0.11.1
* matplotlib==3.1.2
* biopython == 1.75
* regex == 2.5.74

151
**Bedtools must also be installed on your computer and be available in your path.**
nfontrod's avatar
nfontrod committed
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185

## Usage


To display the help of the program, just enter

```console
$ python3 -m src.permutation_bed -h
```

The following table display the different parameters available for this program:

| Required arguments | Description                                                         |
|--------------------|---------------------------------------------------------------------|
| -e / --exon_bed    | A file containing the genomic intervals of interest                 |
| -g / --genome      | A file containing the genome used to infer sequences from bed files |
| -G / --group_name  | The name of the test list of exons/genes                            |
| -ct / --ctrl_bed   | A bed file containing genomic intervals used as background          |


| Optional arguments    | Description                                                                       |
|-----------------------|-----------------------------------------------------------------------------------|
| -h / --help           | show this help message and exit                                                   |
| -C / --component_type | The type of component to analyse; It can be 'nt', 'codon' or 'aa'. (default 'nt') |
| -i / --iteration      | The number of iteration to make (default 1000)                                    |
| -o / --output         | Folder where the result will be created                                           |
| -f / --figures        | The list of components to display in a figure                                     |
| -c / --colors         | List of colors corresponding to the curve color and the test sample color.        |


## Result

The result is the same as the output of the `Permutations tool`.

nfontrod's avatar
nfontrod committed
186
187


188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
# Distributions software

## Description

The goal of this tool is to compare the relative enrichments/impoverishments in some components (nucleotides, di-nucelotides, tri-nucleotides, codon, amino acids) from many lists of exons or genes given in input. The tool will compare the frequency distribution of each component separatly between each couple of input files. 

By default, the comparison is done with a simple **linear model** if the component type of interest corresponds to nucleotides. If the component type of interest is not nucleotides then a **beta regression model** is used to compare the frequency distributions across input files. You can also perform a wilcoxon test rather than those two previous test by using the parameter `--test` set to 'wilcoxon' !


## Prerequisites

### Without singularity image

You must have `python 3.8` installed on your system along with `R>= 3.5`

It also requires the following python modules:

* lazyparser==0.2.0
* numpy==1.19.2
* pandas==1.0.3
* statsmodels==0.11.1
* scipy==1.3.3
* seaborn==0.11.1
* matplotlib==3.1.2
* rpy2==3.3.3

And the following R packages: 

* DHARMa==0.3.1)
* emmeans==1.4.4
* glmmTMB==1.0.1

Finally, you must have the ChIA-pet database in your `data` folder named `chia_pet_database.db`. You can build this database by following the instruction displayed [here](https://gitbio.ens-lyon.fr/LBMC/regards/chia-pet_network).
A file name `data/exon_orf.bed` is also required

The first thing you have to did before launching the program is to run the command:

```console
$ python3 -m src.permutations.compute_orf_length
```

### Singularity image

The singularity image is available in the PSMN at `/Xnfs/abc/singularity/lbmc-exon_enrichment-latest.img` and contains already all the required dependencies.

## Usage

### Without singularity images 

To display the help of the program, juste enter

```console
$ python3 -m src.distributions -h
```

The following table display the different parameters available for this program:

nfontrod's avatar
nfontrod committed
245
246
247

| Required arguments | Description                                                                                                                                                   |
|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
nfontrod's avatar
nfontrod committed
248
| -l / --list_files  | A list of file containing FasterDB ids of the chosen feature (exon / gene). Each element in the list mus be separated by a space (eg. -l file1.txt file2.txt). An exon file in Farline format can also be passed in input. |
nfontrod's avatar
nfontrod committed
249
250
251
252
| -L / --list_name   | The name of the list of exons/genes located in the input files given with the `--list_files` parameter                                                                                                                 |
| -f / --feature     | The list of feature contained in the file given in input to the program                                                                                       |


253
254
255
| Optional parameters       | Description                                                                                                                                                                                                                                        |
|---------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -h / --help               | show this help message and exit                                                                                                                                                                                                                    |
nfontrod's avatar
nfontrod committed
256
| -r / --region             | The region of gene analysed. Used only if `--feature` is gene. It can be 'gene' (to compute the frequency on the entire gene), exon (to compute the frequency only on exons) or intron (to compute the frequnecy only on intron). Default 'gene'.                                                                                                                                                                                                                  |
nfontrod's avatar
nfontrod committed
257
| -C / --component_type     | The type of component to study (default nt) (possible choice: 'codon', 'aa' (for amino acid), 'nt' (for nucleotide), 'dnt' (for di-nucleotide), 'tnt' (for tri-nucleotide))                                                                                                                                                                                                                |
258
259
| -o / --output             | Folder were the result folder will be created  (default .)                                                                                                                                                                                                  |
| -t / --test               | The kind of test to perform: enter default to perform a linear model test for distribution of frequencies in nucleotide or a beta regression for other kind of distributions. enter 'wilcoxon' to to perform a wilcoxon tests. (Default "default") |
nfontrod's avatar
nfontrod committed
260
| -F / --figure             | A list of components for which you want to display the distribution figures for each input files given. If you want to display the figure for every component, juste enter ALL. If you want to display this figure for a combination of component, you can enter ``cp1+cp2`` (for example to display to content of both A and T nucleotide you can enter ``A+T``). Finally to display a figure for every single nucleotide and two combination of three nucleotides for example, enter ``ALL cp1+cp2+cp3 cp4+cp1+cp2`` (default ALL)
nfontrod's avatar
nfontrod committed
261
| -s / --shared_ft          | `True` to have shared features between different lists of features, `False` else. (default False) |
nfontrod's avatar
nfontrod committed
262
| -p / --percentile         | percentile of data displayed in the figure. Values above it are not displayed. (default 98) |
nfontrod's avatar
nfontrod committed
263
264
| -c / --colors             | List of colors to give to the datasets given with the parameter ` --list_files`. The color can given with a color name (e.g 'red') or in an hexadecimal format. It can be left empty to use default color otherwise it must have the size of --list_name parameter (default []) |
| -m / --merge_violin BOOL  | True to merge violins of two different input lists together. Note that for more than two lists, this parameter will be set to False automatically. (default False) |
nfontrod's avatar
nfontrod committed
265

266
267
268
269
270

## Description of the result

Executing this tool will produce a folder located in the directory specified with the `--output option`. 

nfontrod's avatar
nfontrod committed
271
This folder will be named `distribution_analysis_[CPNT_TYPE]_[NAMES]_[TEST]` where:
272
273
* [NAMES]: corresponds to the names given with the `--list_name` parameter concatenated with '-'.
* [TEST]: corresponds to the value given with the `--test` parameter
nfontrod's avatar
nfontrod committed
274
* [CPNT_TYPE]: correspond to the type of component analysed. It is given with the `--component_type` parameter and can be : 'aa' or 'nt' or 'codon' or 'dnt' or 'tnt'.
275
276
277
278

It will have the following structure:

```shell
nfontrod's avatar
nfontrod committed
279
distribution_analysis_[CPNT_TYPE]_[NAMES]_[TEST]
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
├── diags
│   ├── [CPNT]_dignostics.png
│   └── ...
├── figures
│   ├── density_[CPNT_TYPE]_[CPNT]_[NAMES].pdf
│   └── ...
├── stats
│   ├── [CPNT_TYPE]_[CPNT]_stat.txt
│   └── ...
└── violin
    ├── violin_[CPNT_TYPE]_[CPNT]_[NAMES].pdf
    └── ...
```

Where:
* [CPNT_TYPE]: correspond to the type of component analysed. It is given with the `--component_type` parameter and can be : 'aa' or 'nt' or 'codon' or 'dnt' or 'tnt'.
* [CPNT]: correspond to the component studied in the file: it can be A, T, G or C if [CPNT_TYPE] is 'nt'
* [NAMES]: corresponds to the names given with the `--list_name` parameter concatenated with '-'.


**Example of files for  the component type 'nt (CPNT_TYPE)' and the component (CPNT) A** 

The folder `figures` will contain density figures, here is an example: 

![density](doc/density.png)

Here we can see the density distribution of the frequency of the nucleotide A across three list of genes named BARF_up, CTRL and BRAF_down respectively. Note that only 2% of the largest values are removed from the figure to avoid displaying some rare feature (gene or exon) with a very high frequency for a component.

The folder `violin` contains violin plots figure that are representing the same things but in the form of a violin plot. Example below

![violin](doc/violin.png)

The p-values are directly displayed in this figure but can also be recovered by opening the corresponding stat file in `stat` folder

For example the content of the file `stat/nt_A_stat.txt` in thi example will be:

```
Comparison	Estimate	Std.Error	z.ratio	p.value
BRAF_down - BRAF_up	0.8392530860283574	0.10072939293145826	8.331759594733493	9.084859753372143e-09
BRAF_down - CTRL	-0.36756873238154725	0.07804985689092483	-4.709409434218783	7.477269624578398e-06
BRAF_up - CTRL	-1.2068218184099047	0.07667619336639034	-15.739198379909322	9.084827778949034e-09
```

The column:
* Comparison indicates the name of the two files (containing a set of fasterDB id) being compared
* Estimate indicates the difference in mean between the two files. A negative value indicate that the first file in Comparison column is impoverished in a component
* p.value indicates the p.value of the comparison


Note that the `diags` folder is only created when `--test` is `default`. It contains figures like this:
This folder contains very important data, indeed if the diagnostics is bad, the p-value indicated by the test is not reliable.

Here is an example of a "good diagnostic": 
![good](doc/good_diag.png)

And an example of a "bad diagnostic"

![good](doc/bad_diag.png)

For the diagnostic to be good, the black line in the left figure should mathc as much as posible the red line. In the left figure the red dotted line must match as much as possible the solid red line.

341
It also creates a recap violin plot figure (with the name `recap_[CPNT_TYPE]_violin.pdf`) that looks like this: (this figure was produced by the command: `python3 -m src.distributions -l data/BRAF_DOWN_1.5.txt data/BRAF_UP_1.5.txt data/ALL_19000.txt -L BRAF_DOWN BRAF_UP ALL -f gene -r gene -c nt -o result/ -F ALL A+T
nfontrod's avatar
nfontrod committed
342
343
344
`)

![recap](doc/recap.png)
345

346

nfontrod's avatar
nfontrod committed
347
Moreover, it creates a tabulated file named `mean-median-sd_of_[CPNT_TYPE]_in_[NAMES].txt` that contain the mean, the median and the standard error of the frequencies of every component in each group of features given in input.
348

nfontrod's avatar
nfontrod committed
349
The p-values displayed in those figures are corrected with the Benjamini-Hotchberg procedure. Note that a blue (or red respectively) p-values means that the average values of the frequencies in the test list of features is impoverished (enriched respectively) compared to the control list of feature.
nfontrod's avatar
nfontrod committed
350
351
352

Finally, it also creates this kind of figure:

nfontrod's avatar
nfontrod committed
353
![recap](doc/rel_freq.png)
nfontrod's avatar
nfontrod committed
354

nfontrod's avatar
nfontrod committed
355
If you give two lists of feature in input (or more) in the program (with the parameter `--list_files`) then the last list of feature is used as a control to compute the relative frequency of all components in the first list(s) given. The relative frequency is calculated as follows: 
nfontrod's avatar
nfontrod committed
356
357
358
359
360

rel_freq = (avg(CPNT_test) - avg(CPNT_ctrl)) / avg(CPNT_ctrl) x 100.

Where:

nfontrod's avatar
nfontrod committed
361
* rel_freq: Is the relative frequency
nfontrod's avatar
nfontrod committed
362
363
364
365
366
367
* CPNT_test corresponds to the frequencies of a given component (a nucleotide, di-nucleotide, codon or amino acid) of interest (for example A, AA, AAA, etc..) for every feature (exons or genes) inside the first list(s) given to the program.
* CPNT_ctrl corresponds to the frequencies of a given component (a nucleotide, di-nucleotide, codon or amino acid) of interest (for example A, AA, AAA, etc..) for every feature (exons or genes) inside the last list given to the program.
* avg: Is a function to compute the mean values of a vector of values

The p-values are the same as those displayed in the summary violin plot figure (shown above).

nfontrod's avatar
nfontrod committed
368
Note that a blue (or red respectively) p-values means that the average values of the frequencies in the test list of features is impoverished (enriched respectively) compared to the control list of feature.
nfontrod's avatar
nfontrod committed
369
The results here may seem weird, especially for the `A amino acid`, because the p-values in this example were computed using a linear model which takes the size of the feature into account, and this figure doesn't ! It is the same thing when a beta regression model is built but not if a wilcoxon test is performed (if `--test wilcoxon` is set).
370

371
372
373
374
375
376
377
### With singularity image

The description of parameters also applies to this section.

To launch display the help of the program in the psmn, you can enter the following command:

```console
nfontrod's avatar
nfontrod committed
378
$ singularity exec -C -B $PWD:/mnt /Xnfs/abc/singularity/lbmc-exon_enrichment-latest.img python3 -m script.src.distributions -h
379
```
nfontrod's avatar
nfontrod committed
380

nfontrod's avatar
nfontrod committed
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
# Distributions gtf software

## Description

This tool does exactly the same thing as the Distributions software, but it allows you to do it with the **genome of your choice**. 

In this program you must provide a multi-fasta file containing the genome of interest along with a gtf file containing annotations for this genome.


## Prerequisites

### Without singularity image

You must have `python 3.8` installed on your system along with `R>= 3.5`

It also requires the following python modules:

* lazyparser==0.2.0
* numpy==1.19.2
* pandas==1.0.3
* statsmodels==0.11.1
* scipy==1.3.3
* seaborn==0.11.1
* matplotlib==3.1.2
* rpy2==3.3.3
* pyfaidx==0.6.4

And the following R packages: 

* DHARMa==0.3.1)
* emmeans==1.4.4
* glmmTMB==1.0.1

### Singularity image

The singularity image is available in the PSMN at `/Xnfs/abc/singularity/lbmc-exon_enrichment-latest.img` and contains already all the required dependencies.

## Usage

### Without singularity image

To display the help of the program, juste enter

```console
$ python3 -m src.distributions_gtf -h
```

The following table display the different parameters available for this program:

| Required arguments         | Description                                                                                                                                                                                                                                                                                                                                                                                                                    |
|----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -l, --list_file LIST[STR]  | A list of files containing features id defined in the supplied gtf (with the `--gtf` parameter). Note: For gene it uses the field 'gene_id' in the attributes' column of a gtf and for exons it uses the field `gene_id` with the field `exon_number` to form id with the following structure: "gene_id"_"exon_number": (example for an exon of the gene (gene_id) 'SRSF3' with the number 3 (exon_number): its id is SRSF3_3) |
| -L, --list_group LIST[STR] | A list of names to give to the list of features files given with the --list_file parameter.                                                                                                                                                                                                                                                                                                                                    |
| -G, --gtf STR              | The gtf file containing annotation for the supplied genome                                                                                                                                                                                                                                                                                                                                                                     |
| -g, --genome STR           | A multi-fasta file containing the genome of interest                                                                                                                                                                                                                                                                                                                                                                           |
| -F, --ft_type STR          | The kind of feature of interest ('gene' or 'exon' (exon not implemented yet))                                                                                                                                                                                                                                                                                                                                                  |
| -r, --region STR           | he region of interest ('gene' for the entire gene, 'exon' for the concatenation exon's sequence or CDS sequences)                                                                                                                                                                                                                                                                                                              |


| Optional arguments      | Description                                                                                                                                                                                                                                                             |
|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -h, --help              | show this help message and exit                                                                                                                                                                                                                                         |
| -C, --cpnt_type STR     | The type of component of interest                                                                                                                                                                                                                                       |
| -t, --test STR          | The kind of test to perform: enter default to perform a linear model test for distribution of frequencies in nucleotide or a beta regression for other kind of distributions. enter 'wilcoxon' to to perform a wilcoxon tests. (Default "default")                      |
| -o, --output STR        | Folder where the results will be created (default '.')                                                                                                                                                                                                                  |
| -f, --figure LIST[STR]  | The list of component to display in a figure, ALL to create a figure for every component (default ALL)                                                                                                                                                                  |
| -p, --percentile FLOAT  | percentile of data displayed in the figure. Values above it are not displayed. (default 98)                                                                                                                                                                             |
| -c, --colors LIST[STR]  | List of colors to give to the datasets given with the parameter ` --list_files`. It can be left empty to use default color otherwise it must have the size of --list_name parameter (default [])                                                                        |
| -m, --merge_violin BOOL | True to merge violins of two different input lists together. Note that for more than two lists, this parameter will be set to False automatically. (default False)                                                                                                      |
| -s, --skip_parse BOOL   | Skip the gtf deduplication step if it has already been performed by a previous analysis and the deduplicated gtf is given in input with the `--gtf` parameter. (deduplicated gtf: no overlapping CDS, one CDS associated with an exon by gene (and not by transcripts). |


### With singularity image

The description of parameters also applies to this section.

To launch display the help of the program in the psmn, you can enter the following command:

```console
$ singularity exec -C -B $PWD --pwd $PWD /Xnfs/abc/singularity/lbmc-exon_enrichment-latest.img python3 -m script.src.distributions_gtf -h
```


## Description of the result

See section `Description of the result` of the distribution software program.

nfontrod's avatar
nfontrod committed
468
469
470
471
472
473
474

# Stretches Distribution software

## Description

The goal of this tool is to compare the enrichments/impoverishments in some stretches of components (nucleotides, codon, amino acids) from many lists of genes given in input. The tool will compare the distribution of the number of stretches of each component separately between each couple of input files. 

nfontrod's avatar
nfontrod committed
475
The stretches are defined by two number `A`and `B` that must be given through the parameters `stretche_size`. Example `--stretch_size A B` (see below). To find a stretch in a CDS sequence, a sliding window of size `B` and the step of 1 will be used (given by the second value). Each time that this window contains at least `A` of the same component (i.e `A`adenine for example) then the number of stretch for this component is increased by one, and so on.
nfontrod's avatar
nfontrod committed
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504

By default, the comparison is done with poisson regression model if the component type of interest corresponds to nucleotides. If the component type of interest is not nucleotides then a **negative binomial regression model** is used to compare the distributions of stretches across input files. You can also perform a wilcoxon test rather than those two previous test by using the parameter `--test` set to 'wilcoxon' !


## Prerequisites

### Without singularity image

You must have `python 3.8` installed on your system along with `R>= 3.5`

It also requires the following python modules:

* lazyparser==0.2.0
* numpy==1.19.2
* pandas==1.0.3
* statsmodels==0.11.1
* seaborn==0.11.1
* matplotlib==3.1.2
* rpy2==3.3.3
* biopython==1.75
* tqdm==4.47.0
* loguru==0.5.3

And the following R packages: 

* DHARMa==0.3.1
* emmeans==1.4.4
* MASS==7.3-51.5

nfontrod's avatar
nfontrod committed
505
Finally, you must have a file named `data/transcrits_fasterDB.csv` in the`data` folder. This file must contain:
nfontrod's avatar
nfontrod committed
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
* a column `gene_id` indicating the FasterDB it of a gene
* A column `gene_name`indicating the gene symbol of the gene
* A column `sequence`indicating the CDS sequence of the gene.
A gene must have only one line in this file.


### Singularity image

The singularity image is available in the PSMN at `/Xnfs/abc/singularity/lbmc-exon_enrichment-latest.img` and already contains all the required dependencies.

## Usage

### Without singularity images 

To display the help of the program, juste enter

```console
$ python3 -m src.stretch_distribution -h
```

The following table display the different parameters available for this program:


| Required arguments | Description                                                                                                                           |
|--------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| -l / --list_files  | A list of files containing FasterDB ids of genes. Each element in the list must be separated by a space (eg. -l file1.txt file2.txt). |
| -L / --list_name   | The name of the list of exons/genes located in the input files given with the `--list_files` parameter (in the same order)            |


| Optional parameters          | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -h / --help                  | show this help message and exit                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| -C / --component_type        | The type of component to study (default nt) (possible choice: 'codon', 'aa' (for amino acid), 'nt' (for nucleotide))                                                                                                                                                                                                                                                                                                                                                                                                              |
| -o / --output                | Folder were the result folder will be created  (default .)                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| -t / --test                  | The kind of test to perform: enter default to perform a poisson regression test for distribution of stretches of a given nucleotide or a negative binomial regression for other kind of distributions. enter 'wilcoxon' to to perform a wilcoxon tests. (Default "default")                                                                                                                                                                                                                                                       |
| -S, --stretch_size LIST[INT] | A list of 2 integers named repesectively `A`, `B`. `A` correspond to the number of component that must be find in a window of size `B` to define a stretch. `A` < `B` (default 3 6)                                                                                                                                                                                                                                                                                                                                               |
| -F / --figure                | A list of components for which you want to display the distribution figures for each input files given. If you want to display the figure for every component, juste enter ALL. If you want to display a figure for a combination of component, you can enter ``cp1+cp2`` (for example to display to content of both A and T nucleotide you can enter ``A+T``). Finally to display a figure for every single nucleotide and two combination of three nucleotides for example, enter ``ALL cp1+cp2+cp3 cp4+cp1+cp2`` (default ALL) | 
| -s / --shared_ft             | `True` to have shared features between different lists of features, `False` else. (default False)                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| -p / --percentile            | percentile of data displayed in the figure. Values above it are not displayed. (default 98)                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| -c / --colors                | List of colors to give to the datasets given with the parameter ` --list_files`. The color can given with a color name (e.g 'red') or in an hexadecimal format. It can be left empty to use default color otherwise it must have the size of --list_name parameter (default [])                                                                                                                                                                                                                                                   |
| -m / --merge_violin BOOL     | True to merge violins of two different input lists together. Note that for more than two lists, this parameter will be set to False automatically. (default False)                                                                                                                                                                                                                                                                                                                                                                |


## Description of the result

Executing this tool will produce a folder located in the directory specified with the `--output option`. 

This folder will be named `stretch_distribution_analysis_[CPNT_TYPE]_[NAMES]_[TEST]` where:
* [NAMES]: corresponds to the names given with the `--list_name` parameter concatenated with '-'.
* [TEST]: corresponds to the value given with the `--test` parameter
* [CPNT_TYPE]: correspond to the type of component analysed. It is given with the `--component_type` parameter and can be : 'aa' or 'nt' or 'codon'.

It will have the following structure:

```shell
distribution_analysis_[CPNT_TYPE]_[NAMES]_[TEST]
├── diags
│   ├── [CPNT]_dignostics.png
│   └── ...
├── figures
│   ├── density_[CPNT_TYPE]_[CPNT]_[NAMES].pdf
│   └── ...
├── stats
│   ├── [CPNT_TYPE]_[CPNT]_stat.txt
│   └── ...
└── violin
    ├── violin_[CPNT_TYPE]_[CPNT]_[NAMES].pdf
    └── ...
```

**Example of files for  the component type 'aa (CPNT_TYPE)' and the component (CPNT) A** 

The folder `figures` will contain density figures of the number of stretch for a given component, here is an example: 

![density](doc/stretch_density_A.png)

Here we can see the density distribution of the number of stretch in the nucleotide A across two lists of genes named BRAF_DOWN_riboRNAPeak and BRAF_DOWN respectively.

The folder `violin` contains violin plots figure that are representing the same things but in the form of a violin plot. The figures produced here are the same as those produced by the distributions figure. (see section `Description of the result` of the Distribution module, section of violin.)

The p-values are directly displayed in this figure but can also be recovered by opening the corresponding stat file in `stat` folder. An example a files in stat folder is also given in the section `Description of the result` of the Distributions module.

Note that the `diags` folder is only created when `--test` is `default`. It contains the same figure as those displayed in the section `Description of the result` of the Distributions module.
This folder contains very important data, indeed if the diagnostics is bad, the p-value indicated by the test is not reliable.


It also creates a recap violin plot figure and a figure of relative number of stretches: see the two last figures of the section `Description of the result` of the Distributions module.

### With singularity image

The description of parameters also applies to this section.

To launch display the help of the program in the psmn, you can enter the following command:

```console
nfontrod's avatar
nfontrod committed
601
$ singularity exec -C -B $PWD:/mnt /Xnfs/abc/singularity/lbmc-exon_enrichment-latest.img python3 -m script.src.stretch_distribution -h
nfontrod's avatar
nfontrod committed
602
```