Ribosome profiling pipelines
Note: all commands must be executed at the root of the project directory.
Download nextflow
These pipelines run with nextflow. You must download it and place it in the bin
folder. To get it, run the following commands:
curl -s https://get.nextflow.io | bash
mv nextflow bin/nextflow
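Before launching a pipeline, it can help to confirm that the binary ended up where the later commands expect it. The following is a minimal sketch; the helper name `check_nextflow` is illustrative and not part of the project.

```shell
# check_nextflow is a hypothetical helper, not part of the project.
# It reports whether the given path (default: bin/nextflow) is an executable file.
check_nextflow() {
    local nf="${1:-bin/nextflow}"
    if [ -x "$nf" ] && [ ! -d "$nf" ]; then
        echo "found: $nf"
    else
        echo "missing or not executable: $nf" >&2
        return 1
    fi
}

# Usage, from the root of the project directory:
#   check_nextflow && ./bin/nextflow -version
```

If the file exists but is reported as not executable, `chmod +x bin/nextflow` should fix it.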
Building images
The different steps of the pipelines use Singularity or Docker images, so at least one of them must be installed on your computer.
Note: Singularity is already installed on the PSMN.
To build the Singularity images that cannot yet be downloaded from Docker Hub, run the following command:
sudo bash src/build_images/build_singularity_images.sh
To build the Docker images that cannot yet be downloaded from Docker Hub, run the following command:
bash src/build_images/build_docker_images.sh
Note: if you are using the PSMN, the images have already been built. They are located in the folder /scratch/Bio/singularity.
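Since either engine can be used, a small helper can pick whichever is installed. This is only a sketch: `pick_engine` is a hypothetical name, and preferring Singularity over Docker is an arbitrary choice made here, not a project rule.

```shell
# pick_engine is a hypothetical helper: it prints the first container engine
# found on PATH (Singularity first, then Docker) and fails if neither is installed.
pick_engine() {
    local engine
    for engine in singularity docker; do
        if command -v "$engine" >/dev/null 2>&1; then
            echo "$engine"
            return 0
        fi
    done
    echo "neither singularity nor docker found on PATH" >&2
    return 1
}

# Usage: the result can be passed to scripts that take the engine name
# as their first argument, for example:
#   bash src/modules/download_input/download_genome.sh "$(pick_engine)"
```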
Download a reference genome
To download all coding genes from hg38 using Singularity, type the following command:
bash src/modules/download_input/download_genome.sh singularity
To do the same thing with Docker, type:
bash src/modules/download_input/download_genome.sh docker
This command creates a folder named data/GCF_000001405.39_GRCh38.p13 that contains several subfolders and files. Below is a description of each file in those subfolders.
- genome:
  - GCF_000001405.39_GRCh38.p13_gene.fa: The sequence of every coding gene in hg38, in fasta format.
  - GCF_000001405.39_GRCh38.p13_gene.fa.fai: The index of the previous fasta file.
  - GCF_000001405.39_GRCh38.p13_genomic.fna.gz: The complete hg38 genome in a gzipped fasta file.
- filter:
  - contaminant_rna.fa: Fasta file containing rRNA, ncRNA and tRNA sequences. This file is used to remove reads mapping to those sequences.
- transcriptome:
  - GCF_000001405.39_GRCh38.p13_rna_from_genomic.fna.gz: A compressed fasta file containing the transcripts of every gene in hg38. This file is used to build the file contaminant_rna.fa.
- annotation:
  - GCF_000001405.39_GRCh38.p13_annotation.bed: A file containing the exons, CDS, start codon and stop codon of every gene in GCF_000001405.39_GRCh38.p13_gene.fa.
  - GCF_000001405.39_GRCh38.p13_gene_size.txt: A file containing the size of every gene located in GCF_000001405.39_GRCh38.p13_gene.fa.
  - GCF_000001405.39_GRCh38.p13_genomic.gtf.gz: The complete GTF of hg38.
  - stats.txt: A file that indicates how many CDS, exons and genes were conserved during the build of the annotation.
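To confirm that the download produced the layout described above, the four subfolders can be checked in one pass. This is a minimal sketch: `check_assembly_dir` is a hypothetical helper, and it only checks that the folders exist, not that the files inside them are complete.

```shell
# check_assembly_dir is a hypothetical helper: it verifies that the four
# subfolders described above exist under the given assembly folder.
check_assembly_dir() {
    local dir="$1" sub missing=0
    for sub in genome filter transcriptome annotation; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing subfolder: $dir/$sub" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, from the root of the project directory:
#   check_assembly_dir data/GCF_000001405.39_GRCh38.p13 && echo "all subfolders present"
```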
Note: You can build an annotation for another genome by adding an NCBI URL as the last positional parameter of the script src/modules/download_input/download_genome.sh. For example, to build the hg38.p13 genome, type:
bash src/modules/download_input/download_genome.sh singularity https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.39_GRCh38.p13
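In the hg38 example, the output folder has the same name as the last component of the NCBI URL. Assuming the script follows this rule for other genomes too (this is not confirmed here), the expected folder can be derived from the URL. `assembly_dir` is a hypothetical helper.

```shell
# assembly_dir is a hypothetical helper. It ASSUMES the output folder is named
# after the last path component of the NCBI URL, as in the hg38 example above.
assembly_dir() {
    local url="${1%/}"        # drop a trailing slash, if any
    echo "data/${url##*/}"    # keep only the last path component
}

url="https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.39_GRCh38.p13"
assembly_dir "$url"
# prints: data/GCF_000001405.39_GRCh38.p13
```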
Mapping to the reference genome
The pipeline src/pipelines/RibosomeProfiling.nf can be used to remove rRNA, tRNA and other ncRNA reads and align the remaining reads to the reference genome using hisat2.
To run the pipeline, use the following command:
./bin/nextflow src/pipelines/RibosomeProfiling.nf \
-c src/pipelines/RibosomeProfiling.config \
-profile <PROFILE> \
--paired_end <BOOL> \
--genome <GENOME_FILE> \
--fastq <FASTQ> \
--rrna_fasta <NCRNA_SEQUENCE> \
--cutadapt <PARAM> \
--urqt <PARAM> \
--bowtie_seedlen <SEEDLEN> \
--folder <FOLDER> \
--skip_trimming <SKIP_TRIMMING> \
-resume
Everything inside <> must be replaced by your own values. Below is a detailed description of every parameter used to launch this pipeline:
parameter | Description |
---|---|
profile | The nextflow profile to use. It can be 'docker' or 'singularity' to run on a workstation, or 'psmn' to run on the PSMN |
paired_end | A boolean: true if the input data are paired-end, false otherwise |
genome | A file containing the sequences of each gene in a reference genome |
fastq | The fastq files containing sequenced reads |
rrna_fasta | A fasta file containing sequences such as rRNA, tRNA or other ncRNA to remove reads mapping to those sequences from the analysis |
cutadapt | Custom cutadapt parameters to trim reads and remove adapters |
urqt | Custom urqt parameters to trim reads and remove adapters |
bowtie_seedlen | Custom bowtie_seedlen parameter used to map single-end rRNA reads in order to remove them (option only used when --paired_end is false) |
folder | Folder under the results/ folder of the project where the results will be stored |
skip_trimming | A boolean. If true, read trimming is skipped (therefore parameters given with --urqt won't be used). If false, read quality trimming is performed |
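
Two of the inputs described in the table appear to match files produced by the genome download step. The sketch below prints the launch command with those two paths filled in and leaves every other placeholder untouched. Using these files for --genome and --rrna_fasta is an assumption based on their descriptions, and `mapping_command` is a hypothetical helper that only prints the command without running it.

```shell
# Paths taken from the download section of this document. Treating them as the
# right inputs for --genome and --rrna_fasta is an assumption.
ASSEMBLY="data/GCF_000001405.39_GRCh38.p13"
GENOME="$ASSEMBLY/genome/GCF_000001405.39_GRCh38.p13_gene.fa"
RRNA_FASTA="$ASSEMBLY/filter/contaminant_rna.fa"

# mapping_command is a hypothetical helper. It prints the launch command
# without running it; the <...> placeholders must still be replaced by hand.
mapping_command() {
    local profile="$1"
    cat <<EOF
./bin/nextflow src/pipelines/RibosomeProfiling.nf \\
  -c src/pipelines/RibosomeProfiling.config \\
  -profile $profile \\
  --paired_end <BOOL> \\
  --genome $GENOME \\
  --fastq <FASTQ> \\
  --rrna_fasta $RRNA_FASTA \\
  --cutadapt <PARAM> \\
  --urqt <PARAM> \\
  --bowtie_seedlen <SEEDLEN> \\
  --folder <FOLDER> \\
  --skip_trimming <SKIP_TRIMMING> \\
  -resume
EOF
}

mapping_command singularity
```

Printing the command first makes it easy to review the paths before committing to a long run.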
Periodicity analysis
To check whether your Ribo-seq data display a good periodicity, the reads must first be mapped to a reference genome (see previous section). Then, you can use the script src/pipelines/periodicity.nf
to perform a periodicity analysis on your data.
The script displays, around start or stop codons, the density of the first mapped position of the reads or of the first mapped position of the P-sites in the reads.
To launch this pipeline, you can type the following command:
./bin/nextflow src/pipelines/periodicity.nf \
-params-file src/pipelines/periodicity.yml \
-profile <PROFILE> \
-config src/pipelines/periodicity.config
Note that <PROFILE> can be either docker, singularity or psmn.
The only file you have to modify is periodicity.yml. Every variable you can change is described in this file.
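A wrong profile name is an easy mistake to make, so a wrapper script can check the value before launching. This is a minimal sketch that accepts only the three profile names mentioned in this document; `valid_profile` is a hypothetical helper.

```shell
# valid_profile is a hypothetical helper: it succeeds only for the three
# profile names listed in this document.
valid_profile() {
    case "$1" in
        docker|singularity|psmn)
            return 0 ;;
        *)
            echo "unknown profile: $1 (expected docker, singularity or psmn)" >&2
            return 1 ;;
    esac
}

# Usage:
#   valid_profile "$PROFILE" && ./bin/nextflow src/pipelines/periodicity.nf \
#     -params-file src/pipelines/periodicity.yml \
#     -profile "$PROFILE" \
#     -config src/pipelines/periodicity.config
```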
Peak calling analysis
To perform a peak calling analysis, the reads must have already been mapped to a reference genome (see the Mapping to the reference genome section). Then you can use the script src/pipelines/peak_calling.nf
to perform peak calling on ribosome profiling data. This script locates coverage peaks for every gene and then analyses their enrichment in codons and/or amino acids for each sample. It also looks for coverage peaks shared between a control and the other samples.
To launch this pipeline, you can type the following command:
./bin/nextflow src/pipelines/peak_calling.nf \
-params-file src/pipelines/peak_calling.yml \
-profile <PROFILE> \
-config src/pipelines/RibosomeProfiling.config
The only file to modify is src/pipelines/peak_calling.yml. Every variable you can change is described in this file.
Enrichment analysis
This pipeline is an alternative to the Peak calling pipeline. To launch it, the reads must have already been mapped to a reference genome (see the Mapping to the reference genome section) and the periodicity must have been analysed (see Periodicity analysis). Then you can use the script src/pipelines/enrichment.nf
to see whether some codons or encoded amino acids are enriched or depleted at the estimated location of P-sites in ribosome footprint reads.
To launch this pipeline, you can type the following command:
./bin/nextflow src/pipelines/enrichment.nf \
-params-file src/pipelines/enrichment.yml \
-profile <PROFILE> \
-config src/pipelines/enrichment.config
The only file to modify is src/pipelines/enrichment.yml. Every variable you can change is described in this file.
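The three analysis pipelines are launched in the same way and differ only in their script, parameter file and config file. The sketch below collects the pairings shown in the commands of this document in one place. `launch_command` is a hypothetical helper that prints the command without running it.

```shell
# launch_command is a hypothetical helper. It prints (does not run) the launch
# command for one analysis pipeline, using the file pairings shown in the
# commands of this document.
launch_command() {
    local pipeline="$1" profile="$2" config
    case "$pipeline" in
        periodicity|enrichment)
            config="src/pipelines/$pipeline.config" ;;
        peak_calling)
            # The peak calling command in this document uses the mapping config.
            config="src/pipelines/RibosomeProfiling.config" ;;
        *)
            echo "unknown pipeline: $pipeline" >&2
            return 1 ;;
    esac
    echo "./bin/nextflow src/pipelines/$pipeline.nf" \
         "-params-file src/pipelines/$pipeline.yml" \
         "-profile $profile" \
         "-config $config"
}

launch_command enrichment singularity
# prints: ./bin/nextflow src/pipelines/enrichment.nf -params-file src/pipelines/enrichment.yml -profile singularity -config src/pipelines/enrichment.config
```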