Ribosome profiling pipelines

Note: all commands must be executed at the root of the project directory.

Download nextflow

These pipelines run with nextflow. You must download it and place it in the bin folder. To do so, run the following commands:

curl -s https://get.nextflow.io | bash
mv nextflow bin/nextflow
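
As a quick sanity check after the download, the sketch below tests that a launcher file exists and is executable. The helper name check_executable is hypothetical and not part of this project; it only illustrates the check.

```shell
# check_executable PATH (hypothetical helper, not part of the project)
# Succeeds if PATH is a regular, executable file; prints a message either way.
check_executable() {
  local path="$1"
  if [ -f "$path" ] && [ -x "$path" ]; then
    echo "ok: $path is executable"
    return 0
  fi
  echo "missing or not executable: $path" >&2
  return 1
}
```

From the root of the project, this could be used as: check_executable bin/nextflow || chmod +x bin/nextflow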

Building images

The different steps of the pipelines use Singularity or Docker images, so at least one of them must be installed on your computer.

Note: Singularity is already installed on the PSMN

To build Singularity images that cannot yet be downloaded from docker hub, run the following command:

sudo bash src/build_images/build_singularity_images.sh

To build Docker images that cannot yet be downloaded from docker hub, run the following command:

bash src/build_images/build_docker_images.sh

Note: if you are using the PSMN, the images have already been built. They are located inside the folder /scratch/Bio/singularity.

Download a reference genome

To download all coding genes from hg38 using singularity, type the following command:

bash src/modules/download_input/download_genome.sh singularity

To do the same thing but with docker, type

bash src/modules/download_input/download_genome.sh docker

This command will create a folder named data/GCF_000001405.39_GRCh38.p13 that contains several subfolders and files. Below is a description of each file in those subfolders.

  • genome:
    • GCF_000001405.39_GRCh38.p13_gene.fa: This file contains the sequence of every coding gene in hg38 in fasta format.
    • GCF_000001405.39_GRCh38.p13_gene.fa.fai: This file corresponds to the index of the previous fasta file.
    • GCF_000001405.39_GRCh38.p13_genomic.fna.gz: The complete hg38 genome in a gzipped fasta file.
  • filter:
    • contaminant_rna.fa: Fasta file containing rRNA, ncRNA, tRNA sequences. This file is used to remove reads mapping those sequences.
  • transcriptome:
    • GCF_000001405.39_GRCh38.p13_rna_from_genomic.fna.gz: A compressed fasta file containing the transcript of every gene in hg38. This file is used to build the file contaminant_rna.fa.
  • annotation:
    • GCF_000001405.39_GRCh38.p13_annotation.bed: A file containing the exons, CDS, and start and stop codons of every gene in GCF_000001405.39_GRCh38.p13_gene.fa.
    • GCF_000001405.39_GRCh38.p13_gene_size.txt: A file containing the size of every gene located in GCF_000001405.39_GRCh38.p13_gene.fa.
    • GCF_000001405.39_GRCh38.p13_genomic.gtf.gz: The complete GTF of hg38.
    • stats.txt: A file that indicates how many CDS, exons and genes were conserved during the build of the annotation.

Note: You can build an annotation for another genome by adding an NCBI URL as the last positional parameter of the script src/modules/download_input/download_genome.sh. For example, to build the hg38.p13 genome, type bash src/modules/download_input/download_genome.sh singularity https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.39_GRCh38.p13
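
The layout described above can be checked after a download with a small sketch like the following. It rests on two assumptions: the list labels (genome, filter, transcriptome, annotation) are the literal subfolder names, and the output folder is named after the last component of the NCBI URL, as the default case suggests. The function names are hypothetical and not part of the project.

```shell
# assembly_from_url URL: print the last path component of an assembly URL.
# Assumption: the download script names its output folder after this component.
assembly_from_url() {
  basename "$1"
}

# expected_genome_files ASSEMBLY: print the files described above,
# relative to data/<ASSEMBLY>, one per line.
expected_genome_files() {
  local a="$1"
  printf '%s\n' \
    "genome/${a}_gene.fa" \
    "genome/${a}_gene.fa.fai" \
    "genome/${a}_genomic.fna.gz" \
    "filter/contaminant_rna.fa" \
    "transcriptome/${a}_rna_from_genomic.fna.gz" \
    "annotation/${a}_annotation.bed" \
    "annotation/${a}_gene_size.txt" \
    "annotation/${a}_genomic.gtf.gz" \
    "annotation/stats.txt"
}

# check_genome_download DATA_DIR ASSEMBLY: report every expected file that is
# absent; returns non-zero if at least one is missing.
check_genome_download() {
  local data_dir="$1" assembly="$2" rel missing=0
  while IFS= read -r rel; do
    if [ ! -e "$data_dir/$assembly/$rel" ]; then
      echo "missing: $data_dir/$assembly/$rel" >&2
      missing=1
    fi
  done <<EOF
$(expected_genome_files "$assembly")
EOF
  return "$missing"
}
```

For the default genome, this could be called from the project root as: check_genome_download data GCF_000001405.39_GRCh38.p13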

Mapping to the reference genome

The pipeline src/pipelines/RibosomeProfiling.nf can be used to remove rRNA, tRNA and other ncRNA reads and to align the remaining reads on the reference genome using hisat2.

To run the pipeline, use the following command:

./bin/nextflow src/pipelines/RibosomeProfiling.nf \
  -c src/pipelines/RibosomeProfiling.config \
  -profile <PROFILE> \
  --paired_end <BOOL> \
  --genome <GENOME_FILE> \
  --fastq <FASTQ> \
  --rrna_fasta <NCRNA_SEQUENCE> \
  --cutadapt <PARAM> \
  --urqt <PARAM> \
  --bowtie_seedlen <SEEDLEN> \
  --folder <FOLDER> \
  --skip_trimming <SKIP_TRIMMING> \
  -resume

Everything inside <> must be replaced by your own values. Below is a detailed description of every parameter used to launch this pipeline:

| Parameter | Description |
| --- | --- |
| profile | The nextflow profile to use. It can be 'docker' or 'singularity' to run on a workstation, or 'psmn' to run on the PSMN |
| paired_end | A boolean: true if the input data are paired-end, false otherwise |
| genome | A file containing the sequences of each gene in a reference genome |
| fastq | The fastq files containing the sequenced reads |
| rrna_fasta | A fasta file containing sequences such as rRNA, tRNA or other ncRNA; reads mapping to those sequences are removed from the analysis |
| cutadapt | Custom cutadapt parameters to trim reads and remove adapters |
| urqt | Custom urqt parameters to trim reads and remove adapters |
| bowtie_seedlen | Custom bowtie_seedlen parameter used to map single-end rRNA reads in order to remove them (option only used when --paired_end is false) |
| folder | Folder under the results/ folder of the project where the results will be stored |
| skip_trimming | A boolean. If true, read trimming is skipped (therefore the parameters given with --urqt are not used). If false, quality trimming of the reads is performed |
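
Before launching a long run, it can help to validate a few of the parameters above. The sketch below is only an illustration: the function names are hypothetical, and it assumes that the boolean parameters are written as true or false (in any letter case) and that --genome and --rrna_fasta each point to a single existing file. The --fastq value is not checked because its exact form (single file or pattern) is not specified here.

```shell
# is_bool VALUE: succeed if VALUE is "true" or "false", in any letter case.
is_bool() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    true|false) return 0 ;;
    *) return 1 ;;
  esac
}

# preflight PAIRED_END GENOME RRNA_FASTA SKIP_TRIMMING
# Prints one line per problem on stderr; returns non-zero if any was found.
preflight() {
  local paired_end="$1" genome="$2" rrna_fasta="$3" skip_trimming="$4"
  local status=0
  is_bool "$paired_end" || {
    echo "paired_end must be true or false: $paired_end" >&2; status=1; }
  is_bool "$skip_trimming" || {
    echo "skip_trimming must be true or false: $skip_trimming" >&2; status=1; }
  # -s: the file exists and is not empty
  [ -s "$genome" ] || {
    echo "genome file missing or empty: $genome" >&2; status=1; }
  [ -s "$rrna_fasta" ] || {
    echo "rrna_fasta file missing or empty: $rrna_fasta" >&2; status=1; }
  return "$status"
}
```

A call such as preflight false path/to/gene.fa path/to/contaminant_rna.fa false (with your own paths) returns zero only when all four checks pass.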

Periodicity analysis

To check whether your Ribo-seq data show good periodicity, the reads must first be mapped to a reference genome (see previous section). You can then use the script src/pipelines/periodicity.nf to perform a periodicity analysis on your data. The script displays the density, around start or stop codons, of either the first mapped position of the reads or the first mapped position of the P-sites in the reads.

To launch this pipeline, you can type the following command:

./bin/nextflow src/pipelines/periodicity.nf \
-params-file src/pipelines/periodicity.yml \
-profile <PROFILE> \
-config src/pipelines/periodicity.config

Note that <PROFILE> can be either docker, singularity or psmn.

The only file you have to modify is src/pipelines/periodicity.yml. Every variable you can change is described in this file.

Peak calling analysis

To perform a peak calling analysis, the reads must have already been mapped to a reference genome (see Mapping to the reference genome section). Then you can use the script src/pipelines/peak_calling.nf to perform peak calling on ribosome profiling data. This script locates coverage peaks for every gene and then analyses their enrichment in codons and/or amino acids for each sample. It also looks for coverage peaks shared between a control and the other samples.

To launch this pipeline, you can type the following command:

./bin/nextflow src/pipelines/peak_calling.nf \
-params-file src/pipelines/peak_calling.yml \
-profile <PROFILE> \
-config src/pipelines/RibosomeProfiling.config

The only file to modify is src/pipelines/peak_calling.yml. Every variable you can change is described in this file.

Enrichment analysis

This pipeline is an alternative to the Peak calling pipeline. To launch this pipeline, the reads must have already been mapped to a reference genome (see Mapping to the reference genome section) and the periodicity must have been analysed (see Periodicity analysis). Then you can use the script src/pipelines/enrichment.nf to see whether some codons or encoded amino acids are enriched or depleted at the estimated location of P-sites in ribosome footprint reads.

To launch this pipeline, you can type the following command:

./bin/nextflow src/pipelines/enrichment.nf \
-params-file src/pipelines/enrichment.yml \
-profile <PROFILE> \
-config src/pipelines/enrichment.config

The only file to modify is src/pipelines/enrichment.yml. Every variable you can change is described in this file.
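
Since the enrichment pipeline requires the periodicity analysis to have been run first, the two launches can be chained. The wrapper below is a minimal sketch, not part of the project: the function names and the DRY_RUN variable are hypothetical, and it assumes that mapping is already done and that the params files of both pipelines have been filled in. It reuses the two commands shown in this README unchanged.

```shell
# run_or_print CMD [ARGS...]: run the command, or only print it when DRY_RUN=1.
run_or_print() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# run_enrichment_workflow PROFILE
# Launches the periodicity pipeline, then the enrichment pipeline, and stops
# at the first failure. PROFILE is docker, singularity or psmn.
run_enrichment_workflow() {
  local profile="$1" stage
  for stage in periodicity enrichment; do
    run_or_print ./bin/nextflow "src/pipelines/${stage}.nf" \
      -params-file "src/pipelines/${stage}.yml" \
      -profile "$profile" \
      -config "src/pipelines/${stage}.config" || return 1
  done
}
```

Running DRY_RUN=1 run_enrichment_workflow singularity from the project root prints the two commands without executing them, which is a cheap way to review them before a real run.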