
# Building your own pipeline

The goal of this guide is to walk you through the Nextflow pipeline building process. You will learn:

  1. How to use this git repository (LBMC/nextflow) as a template for your project.
  2. The basics of Nextflow, the pipeline manager that we use at the lab.
  3. How to build a simple pipeline for the transcript-level quantification of RNASeq data.
  4. How to run the exact same pipeline on a computing center (PSMN).

This guide assumes that you have followed the Git basics training course.

## Initialize your own project

You are going to build a pipeline for yourself or your team, so the first step is to create your own project.

### Forking

Instead of reinventing the wheel, you can use LBMC/nextflow as a template. To do so, go to the LBMC/nextflow repository and click on the fork button (you need to be logged in).

*(screenshot: the fork button)*

In git, the action of forking means that you are going to make your own copy of a repository. This copy will keep a link with the original LBMC/nextflow project, from which you will be able to pull future updates.

### Project organization

This project (and yours) follows the LBMC guide of good practices.

You are now on the main page of your fork of LBMC/nextflow. You can explore this project; all the code in it is under the CeCILL licence (see the LICENCE file).

The README.md file contains instructions to run your pipeline and test its installation.

The CONTRIBUTING.md file contains guidelines if you want to contribute to the LBMC/nextflow.

The data folder will be the place where you store the raw data for your analysis. The results folder will be the place where you store the results of your analysis.

The content of the data and results folders should never be committed to git.

The doc folder contains the documentation and this guide.

And most interestingly for you, the src folder contains code to wrap tools. This folder contains one visible subdirectory, nf_modules (with some pipeline examples), and other hidden folders and files.

## Nextflow pipeline

A pipeline is a succession of processes. Each process has data input(s) and optional data output(s). Data flows are modeled as channels.

### Processes

Here is an example of a process:

```nextflow
process sample_fasta {
  input:
  file fasta

  output:
  file "sample.fasta", emit: fasta_sample

  script:
"""
head ${fasta} > sample.fasta
"""
}
```

We have the process sample_fasta that takes a fasta file as input and produces a fasta file as output. The process task itself is defined in the script: block, within the triple quotes (""").

```nextflow
input:
file fasta
```

When we zoom in on the input: block, we see that we define a variable fasta of type file. This means that the sample_fasta process is going to receive a stream of fasta file(s). Nextflow is going to stage a file, named after the content of the fasta variable, in the root of the folder where the script: block is executed.

```nextflow
output:
file "sample.fasta", emit: fasta_sample
```

At the end of the script, a file named sample.fasta is expected in the root of the folder where script: is executed, and it will be emitted as fasta_sample.

Using the GitLab WebIDE, create a file src/fasta_sampler.nf.

The first line that you need to add is:

```nextflow
nextflow.enable.dsl=2
```

Then add the sample_fasta process and commit it to your repository.

### Workflow

In Nextflow, process blocks are chained together within a workflow block. For the time being, we only have one process, so workflow may look like an unnecessary complication, but keep in mind that we want to be able to write complex bioinformatics pipelines.

```nextflow
workflow {
  sample_fasta(fasta_file)
}
```

Like process blocks, a workflow can take inputs (here, fasta_file) and transmit this input to processes:

```nextflow
  sample_fasta(fasta_file)
```

The workflow body (the main: block, in named workflows) is where we call our process(es). Add the definition of the workflow to the src/fasta_sampler.nf file and commit it to your repository.
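For reference, the same logic can be written as a named sub-workflow with explicit take: and main: sections. This is a sketch; the workflow name `sample` is illustrative, and the implicit form shown above is all this guide needs:

```nextflow
// Explicit, named form of the workflow above (illustrative sketch).
// take: declares the input channel(s); main: is where processes are called.
workflow sample {
  take:
    fasta_file          // input channel received by the sub-workflow

  main:
    sample_fasta(fasta_file)  // run the process on every item of the channel
}
```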

### Channels

Why bother with channels? In the above example, the advantages of channels are not really clear: we could have just given the fasta file to the workflow. But what if we have many fasta files to process? What if we have sub-processes to run on each of the sampled fasta files? Nextflow can easily deal with these problems with the help of channels.

Channels are streams of items that are emitted by a source and consumed by a process. A process with a channel as input will be run on every item sent through the channel.
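As a minimal illustration (a toy example, not part of the pipeline we are building), a channel of three values makes a process run three times, once per item:

```nextflow
nextflow.enable.dsl=2

// Toy process: say_hello runs once for every item sent through the channel.
process say_hello {
  input:
  val x

  script:
  """
  echo hello ${x}
  """
}

workflow {
  channel.of("a", "b", "c") | say_hello  // three items, three task executions
}
```

In our pipeline, the items flowing through the channel will be fasta files rather than strings.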

```nextflow
channel
  .fromPath( "data/tiny_dataset/fasta/*.fasta" )
  .set { fasta_file }
```

Here we define a channel, fasta_file, that is going to send every fasta file from the folder data/tiny_dataset/fasta/ into the process that takes it as input.

Add the definition of the channel, above the workflow block, to the src/fasta_sampler.nf file and commit it to your repository.
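At this point, the whole src/fasta_sampler.nf file should look like this (assembled from the snippets above):

```nextflow
nextflow.enable.dsl=2

process sample_fasta {
  input:
  file fasta

  output:
  file "sample.fasta", emit: fasta_sample

  script:
"""
head ${fasta} > sample.fasta
"""
}

channel
  .fromPath( "data/tiny_dataset/fasta/*.fasta" )
  .set { fasta_file }

workflow {
  sample_fasta(fasta_file)
}
```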

## Run your pipeline locally

After writing this first pipeline, you may want to test it. To do that, first clone your repository. After following the Git basics training course, you should have an up-to-date ssh configuration to connect to the gitbio.ens-lyon.fr git server.

You can then run the following commands to download your project on your computer:

```sh
git clone git@gitbio.ens-lyon.fr:<usr_name>/nextflow.git
cd nextflow
src/install_nextflow.sh
```

We also need data to run our pipeline:

```sh
cd data
git clone git@gitbio.ens-lyon.fr:LBMC/hub/tiny_dataset.git
cd ..
```

We can run our pipeline with the following command:

```sh
./nextflow src/fasta_sampler.nf
```

### Getting your results

Our pipeline seems to work, but we don't know where the sample.fasta file is. To get results out of a process, we need to tell Nextflow to write them somewhere (we may not need to keep every intermediate file in our results).

To do that, we need to add the following line before the input: section:

```nextflow
publishDir "results/sampling/", mode: 'copy'
```

Every file described in the output: section will be copied from Nextflow's working directory to the folder results/sampling/.
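With this directive in place, the process block becomes (assembled from the snippets above):

```nextflow
process sample_fasta {
  publishDir "results/sampling/", mode: 'copy'  // copy declared outputs here

  input:
  file fasta

  output:
  file "sample.fasta", emit: fasta_sample

  script:
"""
head ${fasta} > sample.fasta
"""
}
```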

Add this to your src/fasta_sampler.nf file with the WebIDE and commit it to your repository. Pull your modifications locally with the command:

```sh
git pull origin master
```

You can run your pipeline again and check the content of the folder results/sampling.

### Fasta everywhere

We ran our pipeline on one fasta file. How would Nextflow handle 100 of them? To test that, we need to duplicate the tiny_v2.fasta file:

```sh
for i in {1..100}
do
  cp data/tiny_dataset/fasta/tiny_v2.fasta data/tiny_dataset/fasta/tiny_v2_${i}.fasta
done
```

You can run your pipeline again and check the content of the folder results/sampling.

Every sample_fasta task writes a file named sample.fasta, so the outputs overwrite each other. We need to make the name of the output file depend on the name of the input file.

```nextflow
  output:
  file "*_sample.fasta", emit: fasta_sample

  script:
"""
head ${fasta} > ${fasta.simpleName}_sample.fasta
"""
```
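Putting these fragments into the process, the whole block now reads:

```nextflow
process sample_fasta {
  publishDir "results/sampling/", mode: 'copy'

  input:
  file fasta

  output:
  file "*_sample.fasta", emit: fasta_sample  // one uniquely named file per input

  script:
"""
head ${fasta} > ${fasta.simpleName}_sample.fasta
"""
}
```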

Add this to your src/fasta_sampler.nf file with the WebIDE and commit it to your repository before pulling your modifications locally. You can run your pipeline again and check the content of the folder results/sampling.

## Build your own RNASeq pipeline

In this section you are going to build your own pipeline for RNASeq analysis from the code available in the src/nf_modules folder.

Open the WebIDE and create a src/RNASeq.nf file.

The first line that we are going to add is:

```nextflow
nextflow.enable.dsl=2
```

### fastp

The first step of the pipeline is to remove any Illumina adaptors left in your read files and to trim your reads by quality.

The LBMC/nextflow template provides you with many tools for which you can find a predefined process block. You can find a list of these tools in the src/nf_modules folder. You can also ask for a new tool by creating an issue for it in the LBMC/nextflow project.

We are going to include the src/nf_modules/fastp/main.nf in our src/RNASeq.nf file:

```nextflow
include { fastp } from "./nf_modules/fastp/main.nf"
```

With this line, we can call the fastp block in our future workflow without having to write it! If we check the content of the file src/nf_modules/fastp/main.nf, we can see that by including fastp, we are including a sub-workflow (we will come back to this object later). This sub-workflow takes a fastq channel, so we need to make one. Note that the path ./nf_modules/fastp/main.nf is relative to the src/RNASeq.nf file, which is why we don't include the src/ part of the path.

```nextflow
channel
  .fromFilePairs( "data/tiny_dataset/fastq/*_R{1,2}.fastq", size: -1)
  .set { fastq_files }
```

The .fromFilePairs() factory creates a channel of pairs of fastq files. Therefore, the items emitted by the fastq_files channel are going to be pairs of fastq files for paired-end data. The option size: -1 allows an arbitrary number of associated files, so we can use the same channel creation for single-end data.
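Each item emitted by .fromFilePairs() is a tuple of a pair identifier and the list of matching files; for a hypothetical tiny_R1.fastq / tiny_R2.fastq pair you would get something like [tiny, [tiny_R1.fastq, tiny_R2.fastq]]. If you want to inspect the items yourself, you can chain a .view() operator into the channel definition (a debugging sketch, not required for the pipeline):

```nextflow
channel
  .fromFilePairs( "data/tiny_dataset/fastq/*_R{1,2}.fastq", size: -1 )
  .view()              // print each emitted item; remove once done debugging
  .set { fastq_files }
```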

We can now add the workflow definition, passing the fastq_files channel to fastp, to our src/RNASeq.nf file:

```nextflow
workflow {
  fastp(fastq_files)
}
```
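Putting everything together, the src/RNASeq.nf file so far reads:

```nextflow
nextflow.enable.dsl=2

include { fastp } from "./nf_modules/fastp/main.nf"

channel
  .fromFilePairs( "data/tiny_dataset/fastq/*_R{1,2}.fastq", size: -1)
  .set { fastq_files }

workflow {
  fastp(fastq_files)
}
```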

You can commit your src/RNASeq.nf file, pull your modifications locally, and run your pipeline with the command:

```sh
./nextflow src/RNASeq.nf
```

What is happening?

### Nextflow -profile

Nextflow gives you the following error: fastp: command not found. You don't have fastp installed on your computer.