title: "TP for experimental biologists"
author: Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)
date: 6 Jun 2018
output:
  pdf_document:
    toc: true
    toc_depth: 3
    number_sections: true
    highlight: tango
    latex_engine: xelatex

The goal of this practical is to learn how to build your own pipeline with nextflow, using the tools already wrapped. To do so, we are going to build a small RNASeq analysis pipeline that runs the following steps:

  • remove Illumina adaptors
  • trim reads by quality
  • build the index of a reference genome
  • estimate the number of RNA fragments mapping to the transcripts of this genome

Initialize your own project

You are going to build a pipeline for you or your team. So the first step is to create your own project.

Forking

Instead of reinventing the wheel, you can use the pipelines/nextflow repository as a template. To do so, go to the pipelines/nextflow repository and click on the fork button.

fork button

In GitLab, forking a repository means making your own copy of it. You can then make modifications in your project and, if they are of interest for the source repository (here pipelines/nextflow), create a merge request. Merge requests are sent to the source repository to ask the maintainers to integrate your modifications.

merge request button

Project organisation

This project (and yours) follows the guide of good practices for the LBMC.

You are now on the main page of your fork of pipelines/nextflow. You can explore this project; all the code in it is under the CeCILL licence (see the LICENCE file).

The README.md file contains instructions to run your pipeline and test its installation.

The CONTRIBUTING.md file contains guidelines to follow if you want to contribute to pipelines/nextflow (making a merge request for example).

The data folder is the place where you store the raw data for your analysis. The results folder is the place where you store the results of your analysis. Note that the content of these two folders should never be saved on git.

The doc folder contains the documentation of this practical course.

And most interestingly for you, the src folder contains code to wrap tools. This folder contains three subdirectories: docker_modules, nf_modules and sge_modules.

docker_modules

The src/docker_modules folder contains the code to wrap tools in Docker. Docker is a framework that allows you to execute software within containers. The docker_modules folder contains one directory per tool, each with one subdirectory per version.

ls -l src/docker_modules/
drwxr-xr-x  3 laurent  _lpoperator   96 May 25 15:42 BEDtools/
drwxr-xr-x  4 laurent  _lpoperator  128 Jun  5 16:14 Bowtie2/
drwxr-xr-x  3 laurent  _lpoperator   96 May 25 15:42 FastQC/
drwxr-xr-x  4 laurent  _lpoperator  128 Jun  5 16:14 HTSeq/

Each tool/version pair has two files:

ls -l src/docker_modules/Bowtie2/2.3.4.1/
-rw-r--r--  1 laurent  _lpoperator  283 Jun  5 15:07 Dockerfile
-rwxr-xr-x  1 laurent  _lpoperator   79 Jun  5 16:18 docker_init.sh*

The Dockerfile is the Docker recipe to create a container with Bowtie2 in its 2.3.4.1 version. The docker_init.sh file is a small script to create the container from this recipe.

By running this script you will be able to easily install tools in different versions on your personal computer and use them in your pipeline. Some of the advantages are:

  • Whatever the computer, the installation and the results will be the same.
  • You can keep containers for old versions of tools and run them on new systems (science = reproducibility).
  • You don't have to bother with tedious installation procedures: somebody else already did the job and wrote a Dockerfile.
  • You can easily keep containers for different versions of the same tool.
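
The tool/version layout described above can be explored with a few lines of shell. The following is only a minimal sketch: the function name list_docker_modules and the toy tree it runs on are hypothetical and are not part of the repository; only the Bowtie2/2.3.4.1 entry comes from the listing above.

```shell
#!/bin/sh
# Sketch: list every tool/version pair in a docker_modules-style tree.
# list_docker_modules is a hypothetical helper, not a script of the repository.
set -e

list_docker_modules() {
  # $1: root folder laid out as <root>/<tool>/<version>/Dockerfile
  for dockerfile in "$1"/*/*/Dockerfile; do
    # Skip the unexpanded pattern when nothing matches
    [ -f "$dockerfile" ] || continue
    version_dir=$(dirname "$dockerfile")
    tool_dir=$(dirname "$version_dir")
    printf '%s %s\n' "$(basename "$tool_dir")" "$(basename "$version_dir")"
  done
}

# Toy tree for the demonstration; the FastQC version number is invented
demo=$(mktemp -d)
mkdir -p "$demo/Bowtie2/2.3.4.1" "$demo/FastQC/0.0.0"
touch "$demo/Bowtie2/2.3.4.1/Dockerfile" "$demo/FastQC/0.0.0/Dockerfile"

# Print one "tool version" pair per line
list_docker_modules "$demo"
```

On your own fork, calling list_docker_modules src/docker_modules from the root of the project should print one line per wrapped tool and version.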

sge_modules

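The src/sge_modules folder is not really there. It's a submodule of the project PSMN/modules. To populate it locally you can use the following commands (git submodule init only registers the submodule, git submodule update fetches its content):

git submodule init
git submodule update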

As with src/docker_modules, the PSMN/modules project describes recipes to install tools and use them. The main difference is that you cannot use Docker on the PSMN. Instead you have to use another framework, Environment Modules, which allows you to load modules for specific tools and versions. The README.md file of the PSMN/modules repository contains all the instructions needed to load the modules maintained by the LBMC and present in the PSMN/modules repository.

nf_modules

The src/nf_modules folder contains templates of nextflow wrappers for the tools available in Docker and SGE. The details of the nextflow wrappers will be presented in the next section. Alongside the .nf and .config files there is a tests folder that contains a tests.sh script to run tests on the tool.

Build your own RNASeq pipeline

In this section you are going to build your own pipeline for RNASeq analysis from the code available in the src/nf_modules folder.

Nextflow pipeline

A pipeline is a succession of processes. Each process has data input(s) and optional data output(s). Data flows are modeled as channels.

Processes

Here is an example of a process:

process sample_fasta {
  input:
    file fasta from fasta_file

  output:
    file "sample.fasta" into fasta_sample

  script:
"""
head ${fasta} > sample.fasta
"""
}

We have the process sample_fasta that takes a fasta_file channel as input and outputs a fasta_sample channel. The process itself is defined in the script: block, within the triple quotes """.

  input:
    file fasta from fasta_file

When we zoom in on the input: block, we see that we define a variable fasta of type file from the fasta_file channel. This means that nextflow is going to make the file available, under the name held in the variable fasta, in the root of the folder where script: is executed.

  output:
    file "sample.fasta" into fasta_sample

At the end of the script, a file named sample.fasta is expected in the root of the folder where script: is executed, and it is sent into the channel fasta_sample.
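
To make the mechanics of the script: block concrete, here is a rough shell sketch of what happens inside the folder where the process is executed. It is an illustration only, not what nextflow runs internally: the temporary folder, the file name input.fasta and the toy sequences are assumptions made for the example.

```shell
#!/bin/sh
# Sketch of what the sample_fasta process does in its execution folder.
# The folder, the file name input.fasta and the sequences are made up.
set -e

workdir=$(mktemp -d)   # stands in for the folder where script: is executed
cd "$workdir"

# Build a toy fasta file with 12 records (24 lines)
i=1
while [ "$i" -le 12 ]; do
  printf '>seq%s\nACGT\n' "$i"
  i=$((i + 1))
done > input.fasta

fasta=input.fasta      # plays the role of the fasta variable of the input: block

# The command of the script: block; head keeps the first 10 lines by default
head "${fasta}" > sample.fasta

# sample.fasta now sits in the root of the folder, where the output: block looks for it
ls sample.fasta
```

Note that head works on lines, not on records, so the sampled file holds the first 10 lines of the input whatever the number of sequences they span.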

Using the Web IDE of GitLab, create a file src/fasta_sampler.nf with this process and commit it to your repository.

Channels

Why bother with channels? In the above example, the advantages of channels are not really clear. We could have just given the fasta file to the process. But what if we have many fasta files to process? What if we have subprocesses to run on each of the sampled fasta files? Nextflow can easily deal with these problems with the help of channels.

Channels are streams of items that are emitted by a source and consumed by a process. A process with a channel as input will be run on every item sent through the channel.

Channel
  .fromPath( "data/tiny_dataset/fasta/*.fasta" )
  .set { fasta_file }

Here we define a channel fasta_file that is going to send every fasta file from the folder data/tiny_dataset/fasta/ into the process that takes it as input.
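
As a loose analogy, a channel feeding a process behaves like a loop that runs the same command once per matching file, each run in its own folder. The sketch below mimics this in plain shell; the temporary tree, the file names a.fasta, b.fasta and c.fasta and the work folder layout are assumptions made for the illustration, and the loop is sequential whereas nextflow is free to run the executions in parallel.

```shell
#!/bin/sh
# Shell analogy of a channel feeding a process: one independent run per file.
# The tree below and the file names are invented for the illustration.
set -e

root=$(mktemp -d)
mkdir -p "$root/data/tiny_dataset/fasta" "$root/work"

# Three toy fasta files with 12 records each
for name in a b c; do
  i=1
  while [ "$i" -le 12 ]; do
    printf '>%s_%s\nACGT\n' "$name" "$i"
    i=$((i + 1))
  done > "$root/data/tiny_dataset/fasta/$name.fasta"
done

# Every file matching the glob is one item "emitted" by the channel
n_runs=0
for fasta in "$root"/data/tiny_dataset/fasta/*.fasta; do
  run_dir="$root/work/$(basename "$fasta" .fasta)"   # one folder per execution
  mkdir -p "$run_dir"
  head "$fasta" > "$run_dir/sample.fasta"            # same command for every item
  n_runs=$((n_runs + 1))
done

echo "process executed $n_runs times"
```

Each execution writes its own sample.fasta without overwriting the others, which is why nextflow also gives every execution of a process a separate folder.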

Add the definition of the channel to the src/fasta_sampler.nf file and commit to your repository.

Run your pipeline locally

git clone -c http.sslVerify=false https://gitlab.biologie.ens-lyon.fr/<usr_name>/nextflow.git
cd nextflow
src/install_nextflow.sh

Create your Docker containers

For this practical, we are going to need the following tools:

  • For Illumina adaptor removal: cutadapt
  • For reads trimming by quality: UrQt
  • For mapping and quantifying reads: Kallisto, RSEM and Bowtie2

To initialize these tools, follow the Installing section of the README.md file.