---
title: "TP for experimental biologists"
author: Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)
date: 6 Jun 2018
output:
  pdf_document:
    toc: true
    toc_depth: 3
    number_sections: true
    highlight: tango
    latex_engine: xelatex
---
The goal of this practical is to learn how to build your own pipeline with nextflow, using tools that are already wrapped. To do so, we are going to build a small RNASeq analysis pipeline that runs the following steps:
- remove Illumina adaptors
- trim reads by quality
- build the index of a reference genome
- estimate the amount of RNA fragments mapping to the transcripts of this genome
Initialize your own project
You are going to build a pipeline for yourself or your team, so the first step is to create your own project.
Forking
Instead of reinventing the wheel, you can use the pipelines/nextflow repository as a template. To do so, go to the pipelines/nextflow repository and click on the fork button.
Forking a repository means making your own personal copy of it. You can then make modifications in your copy and, if they are of interest to the source repository (here pipelines/nextflow), create a merge request. A merge request asks the maintainers of the source repository to integrate your modifications.
Project organisation
This project (and yours) follows the LBMC guide of good practices.
You are now on the main page of your fork of pipelines/nextflow. You can explore this project. All the code in it is under the CeCILL licence (see the LICENCE file).
The README.md file contains instructions to run your pipeline and test its installation.
The CONTRIBUTING.md file contains guidelines to follow if you want to contribute to pipelines/nextflow (for example, by making a merge request).
The data folder will be the place where you store the raw data for your analysis. The results folder will be the place where you store the results of your analysis. Note that the content of these two folders should never be saved on git.
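One common way to keep the content of these two folders out of git is a .gitignore file. The sketch below is only an illustration under that assumption; the template may already ship an equivalent configuration, so check the existing .gitignore before changing it. It is meant to be run from the root of your fork.

```sh
#!/bin/sh
# Sketch: make git ignore everything inside data/ and results/.
# Assumption: run from the root of your fork; the patterns below are
# illustrative and may already be present in the template's .gitignore.
mkdir -p data results

# Append each pattern only if it is not already listed.
for pattern in 'data/*' 'results/*'; do
    grep -qxF "$pattern" .gitignore 2>/dev/null \
        || printf '%s\n' "$pattern" >> .gitignore
done

# Show the resulting ignore rules.
cat .gitignore
```

Running the script twice does not duplicate the patterns, so it is safe to re-run.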
The doc folder contains the documentation of this practical course.
And most interestingly for you, the src folder contains code to wrap tools. This folder contains three subdirectories: docker_modules, nf_modules and sge_modules.
docker_modules
The src/docker_modules folder contains the code to wrap tools in Docker. Docker is a framework that allows you to execute software within containers. The docker_modules folder contains one directory per tool, each with one subdirectory per version of that tool.
ls -l src/docker_modules/
drwxr-xr-x 3 laurent _lpoperator 96 May 25 15:42 BEDtools/
drwxr-xr-x 4 laurent _lpoperator 128 Jun 5 16:14 Bowtie2/
drwxr-xr-x 3 laurent _lpoperator 96 May 25 15:42 FastQC/
drwxr-xr-x 4 laurent _lpoperator 128 Jun 5 16:14 HTSeq/
Each tool/version pair has two corresponding files:
ls -l src/docker_modules/Bowtie2/2.3.4.1/
-rw-r--r-- 1 laurent _lpoperator 283 Jun 5 15:07 Dockerfile
-rwxr-xr-x 1 laurent _lpoperator 79 Jun 5 16:18 docker_init.sh*
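The two-level layout shown in these listings (tool, then version, then the two files) can be walked with a small loop. The following is a minimal sketch, assuming the layout is exactly root/tool/version/ with a Dockerfile and a docker_init.sh inside; list_docker_modules is a hypothetical helper name, not something provided by the project.

```sh
#!/bin/sh
# Sketch: list every tool/version pair under a docker_modules-style folder
# and report whether the two expected files are present.
# Assumption: layout is <root>/<tool>/<version>/{Dockerfile,docker_init.sh}.
list_docker_modules() {
    root=${1:-src/docker_modules}
    for version_dir in "$root"/*/*/; do
        # Skip the unexpanded pattern when nothing matches.
        [ -d "$version_dir" ] || continue
        version=$(basename "$version_dir")
        tool=$(basename "$(dirname "$version_dir")")
        status=ok
        [ -f "$version_dir/Dockerfile" ] || status=incomplete
        [ -f "$version_dir/docker_init.sh" ] || status=incomplete
        # One tab-separated line per tool/version pair.
        printf '%s\t%s\t%s\n' "$tool" "$version" "$status"
    done
}

list_docker_modules src/docker_modules
```

Each output line gives the tool, its version, and ok or incomplete depending on whether both files were found.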
The Dockerfile is the Docker recipe to create a container containing Bowtie2 in its 2.3.4.1 version. The docker_init.sh file is a small script to create the container from this recipe.
By running this script you will be able to easily install tools in different versions on your personal computer and use them in your pipeline. Some of the advantages are:
- Whatever the computer, the installation and the results will be the same
- You can keep containers for old versions of tools and run them on new systems (science = reproducibility)
- You don’t have to bother with tedious installation procedures: somebody else already did the job and wrote a Dockerfile.
- You can easily keep containers for different versions of the same tool.
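To give an idea of what a script like docker_init.sh might do, here is a minimal sketch that assembles a docker build command for one tool/version pair. It is not the content of the project's script: the helper name docker_build_command and the lower-case tool:version tag convention are assumptions made for illustration. The sketch only prints the command, so you can inspect it before running anything.

```sh
#!/bin/sh
# Sketch: assemble the docker build command for one tool/version pair.
# Assumptions (not taken from the project's docker_init.sh):
#   - the recipe lives in src/docker_modules/<tool>/<version>/
#   - the image is tagged <tool in lower case>:<version>
docker_build_command() {
    tool=$1
    version=$2
    # Docker repository names must be lower case.
    name=$(printf '%s' "$tool" | tr '[:upper:]' '[:lower:]')
    printf 'docker build -t %s:%s src/docker_modules/%s/%s\n' \
        "$name" "$version" "$tool" "$version"
}

# Print the command rather than running it.
docker_build_command Bowtie2 2.3.4.1
```

If Docker is installed, running the printed command from the root of the project builds the container image, and docker images then lists it.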
sge_modules
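The src/sge_modules folder is not stored in this repository itself: it is a git submodule pointing to the PSMN/modules project. To populate it locally, you can use the following commands (init registers the submodule, update fetches its content):
git submodule init
git submodule update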
Like src/docker_modules, the PSMN/modules project describes recipes to install tools and use them. The main difference is that you cannot use Docker on the PSMN. Instead, you have to use another framework, Environment Modules, which allows you to load modules for specific tools and versions.
The README.md file of the PSMN/modules repository contains all the instructions needed to load the modules maintained by the LBMC in that repository.