---
title: "TP for experimental biologists"
author: Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)
date: 6 Jun 2018
output:
pdf_document:
toc: true
toc_depth: 3
number_sections: true
highlight: tango
latex_engine: xelatex
The goal of this practical is to learn how to build your own pipeline with Nextflow, using tools that are already wrapped. For this we are going to build a small RNASeq analysis pipeline that runs the following steps (a sketch of the overall shape follows the list):
- remove Illumina adaptors
- trim reads by quality
- build the index of a reference genome
- estimate the amount of RNA fragments mapping to the transcripts of this genome
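To give a sense of where we are heading, here is a minimal, hypothetical sketch of the shape such a pipeline takes in Nextflow. The process names and the `cp` commands are placeholders for this write-up, not the actual tools wrapped later in this practical:

```Groovy
// Hypothetical outline: a chain of processes connected by channels.
// The `cp` commands stand in for the real tools introduced later.
Channel
  .fromPath( "data/*.fastq" )
  .set { raw_reads }

process remove_adaptors {
  input:
    file reads from raw_reads
  output:
    file "*_cut.fastq" into cut_reads
  script:
"""
cp ${reads} ${reads.baseName}_cut.fastq
"""
}

process trim_reads {
  input:
    file reads from cut_reads
  output:
    file "*_trim.fastq" into trimmed_reads
  script:
"""
cp ${reads} ${reads.baseName}_trim.fastq
"""
}

// ...index building and quantification would follow the same pattern
```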
To do this practical you will need to have Docker installed and running on your computer.
# Initialize your own project
You are going to build a pipeline for you or your team. So the first step is to create your own project.
## Forking
Instead of reinventing the wheel, you can use the LBMC/nextflow repository as a template. To do so easily, go to the LBMC/nextflow repository and click on the fork button (you need to be logged in).
In git, forking means making your own copy of a repository. You can then commit modifications to your project and, if they are of interest to the source repository (here LBMC/nextflow), create a merge request. Merge requests are sent to the source repository to ask its maintainers to integrate your modifications.
## Project organisation
This project (and yours) follows the LBMC guide of good practices.
You are now on the main page of your fork of LBMC/nextflow. You can explore this project; all the code in it is under the CeCILL licence (see the `LICENCE` file).
The `README.md` file contains instructions to run your pipeline and test its installation.
The `CONTRIBUTING.md` file contains guidelines to follow if you want to contribute to LBMC/nextflow (to make a merge request, for example).
The `data` folder will be the place where you store the raw data for your analysis. The `results` folder will be the place where you store the results of your analysis.
The content of the `data` and `results` folders should never be saved on git.
The `doc` folder contains the documentation of this practical course.
And, most interestingly for you, the `src` folder contains the code to wrap tools. This folder contains one visible subdirectory, `nf_modules`, some pipeline examples, and other hidden files.
### `nf_modules`
The `src/nf_modules` folder contains templates of Nextflow wrappers for the tools available in Docker. The details of these Nextflow wrappers will be presented in the next section. Alongside the `.nf` and `.config` files, there is a `tests.sh` script to run tests on the tool.
# Nextflow pipeline
A pipeline is a succession of processes. Each process has data input(s) and optional data output(s). Data flows are modeled as channels.
## Processes
Here is an example of a process:
```Groovy
process sample_fasta {
  input:
    file fasta from fasta_file
  output:
    file "sample.fasta" into fasta_sample
  script:
"""
head ${fasta} > sample.fasta
"""
}
```
We have the process `sample_fasta`, which takes the `fasta_file` channel as input and outputs the `fasta_sample` channel. The process itself is defined in the `script:` block, between the `"""` quotes.
```Groovy
  input:
    file fasta from fasta_file
```
When we zoom in on the `input:` block, we see that we define a variable `fasta` of type `file` coming from the `fasta_file` channel. This means that a file, named after the content of the `fasta` variable, is going to be written at the root of the folder where the `script:` block is executed.
```Groovy
  output:
    file "sample.fasta" into fasta_sample
```
At the end of the script, a file named `sample.fasta` is expected at the root of the folder where the `script:` block was executed, and it is sent into the channel `fasta_sample`.
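As a side note, outputs of a process stay in Nextflow's working directory by default. If you also want a copy of them in the `results` folder described earlier, Nextflow provides the `publishDir` directive. Here is a sketch of the same process with it; the destination path is an arbitrary choice for this example:

```Groovy
process sample_fasta {
  // copy the declared output files into results/sampling/
  publishDir "results/sampling/", mode: 'copy'
  input:
    file fasta from fasta_file
  output:
    file "sample.fasta" into fasta_sample
  script:
"""
head ${fasta} > sample.fasta
"""
}
```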
Using the WebIDE of GitLab, create a file `src/fasta_sampler.nf` with this process and commit it to your repository.
## Channels
Why bother with channels? In the above example, the advantages of channels are not really clear: we could have just given the `fasta` file to the process. But what if we have many fasta files to process? What if we have sub-processes to run on each of the sampled fasta files? Nextflow can easily deal with these problems with the help of channels.
Channels are streams of items that are emitted by a source and consumed by a process. A process with a channel as input will be run on every item sent through the channel.
```Groovy
Channel
  .fromPath( "data/tiny_dataset/fasta/*.fasta" )
  .set { fasta_file }
```
Here we define the channel `fasta_file`, which is going to send every fasta file from the folder `data/tiny_dataset/fasta/` into the process that takes it as input.
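To illustrate the chaining of sub-processes mentioned above, here is a hypothetical second process (not part of this practical's pipeline) that would consume the `fasta_sample` channel; it runs once per sampled file and counts the sequences in it:

```Groovy
// Hypothetical downstream process: runs once for every item
// emitted by the fasta_sample channel
process count_sequences {
  input:
    file sample from fasta_sample
  output:
    file "count.txt" into count_file
  script:
"""
grep -c '>' ${sample} > count.txt
"""
}
```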
Add the `fasta_file` channel definition to the `src/fasta_sampler.nf` file and commit it to your repository.
# Run your pipeline locally
After writing this first pipeline, you may want to test it. To do that, first clone your repository. To make cloning easier, set the visibility level to public on the Settings/General/Permissions page of your project.
You can then run the following commands to download your project on your computer:

```sh
git clone git@gitbio.ens-lyon.fr:<usr_name>/nextflow.git
cd nextflow
```

and then:

```sh
src/install_nextflow.sh
```
We also need data to run our pipeline:

```sh
cd data
git clone git@gitbio.ens-lyon.fr:LBMC/hub/tiny_dataset.git
cd ..
```
We can run our pipeline with the following command:

```sh
./nextflow src/fasta_sampler.nf
```
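One nicety worth knowing: Nextflow caches the result of each process in its working directory, so if you interrupt the run or modify the pipeline you can restart from where it left off with the `-resume` flag:

```sh
./nextflow src/fasta_sampler.nf -resume
```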