---
title: "TP for computational biologists"
author: Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)
date: 20 Jun 2018
output:
  pdf_document:
    toc: true
    toc_depth: 3
    number_sections: true
    highlight: tango
    latex_engine: xelatex
---
The goal of this practical is to learn how to wrap tools in Docker or Environment Module to make them available to nextflow, on a personal computer or at the PSMN.
Here we assume that you followed the TP for experimental biologists, and that you know the basics of Docker containers and Environment Module usage. We also assume that you know how to build and use a nextflow pipeline from the template pipelines/nextflow.
For this practical, you can either work with the WebIDE of Gitlab, or locally as described in the git: the basis formation.
# Docker

To run a tool within a Docker container, you need to write a `Dockerfile`.

`Dockerfile`s are found in the pipelines/nextflow project under `src/docker_modules/`. Each `Dockerfile` is paired with a `docker_init.sh` file, like in the following example for `Kallisto` version `0.43.1`:
```sh
$ ls -l src/docker_modules/Kallisto/0.43.1/
total 16K
drwxr-xr-x 2 laurent users 4.0K Jun  5 19:06 ./
drwxr-xr-x 3 laurent users 4.0K Jun  6 09:49 ../
-rw-r--r-- 1 laurent users  587 Jun  5 19:06 Dockerfile
-rwxr-xr-x 1 laurent users   79 Jun  5 19:06 docker_init.sh*
```
## docker_init.sh

The `docker_init.sh` file is a simple `sh` script with the executable bit set (`chmod +x`). By executing this script, the user builds the Docker container for the tool in a specific version. You can use the `docker_init.sh` file of any implemented tool as a template. Remember that the name of the container must be in lower case.
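As an illustration, a minimal `docker_init.sh` might look like the sketch below. The image name and build path follow the directory layout shown above, but the exact content of the real script may differ:

```shell
#!/bin/sh
# Hypothetical docker_init.sh for Kallisto 0.44.0.
# The image tag must be lower case: 'kallisto', not 'Kallisto'.
IMG_NAME="kallisto"
IMG_VERSION="0.44.0"

# Build the container from the Dockerfile in the module directory
# (guarded so the sketch degrades gracefully when docker is absent).
if command -v docker >/dev/null 2>&1; then
  docker build "src/docker_modules/Kallisto/${IMG_VERSION}" \
    -t "${IMG_NAME}:${IMG_VERSION}"
fi
```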
## Dockerfile

The recipe to wrap your tool in a Docker container is written in a `Dockerfile` file.
For `Kallisto` version `0.44.0`, the header of the `Dockerfile` is:
```Dockerfile
FROM ubuntu:18.04
MAINTAINER Laurent Modolo
ENV KALLISTO_VERSION=0.44.0
```
This means that we initialize the container from a bare installation of Ubuntu 18.04. You can check the available Ubuntu versions here, or use other operating systems like Debian, or worse.
Then we declare the maintainer of the container, before declaring an environment variable named `KALLISTO_VERSION`, which contains the version of the wrapped tool. This means that this bash variable will be declared within the container.
You should always declare a variable `TOOLSNAME_VERSION` that contains the version number or commit number of the tool you wrap. Therefore, in simple cases, you only have to modify this line to create a new `Dockerfile` for another version of the tool.
The rest of the `Dockerfile` is a succession of `bash` commands executed as the root user within the container.
When you build your `Dockerfile`, instead of launching the `docker_init.sh` script many times, you can connect to a base container in interactive mode to test your commands:
```sh
docker run -it ubuntu:18.04 bash
KALLISTO_VERSION=0.44.0
```
Each `RUN` block is executed sequentially by `Docker`. If there is an error or a modification in a `RUN` block, only this block and the following ones are re-executed; the earlier blocks are taken from the build cache.
You can learn more about the building of Docker containers here.
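To make this concrete, the body of such a `Dockerfile` could continue with `RUN` blocks along these lines. This is a sketch only: the download URL and build steps are assumptions, not the content of the real recipe:

```Dockerfile
# Install the tools needed to fetch and unpack the release (hypothetical steps).
RUN apt-get update && \
    apt-get install -y wget ca-certificates

# Download a pre-built Kallisto release matching KALLISTO_VERSION
# and put the binary on the PATH.
RUN wget "https://github.com/pachterlab/kallisto/releases/download/v${KALLISTO_VERSION}/kallisto_linux-v${KALLISTO_VERSION}.tar.gz" && \
    tar -xzf "kallisto_linux-v${KALLISTO_VERSION}.tar.gz" && \
    cp "kallisto_linux-v${KALLISTO_VERSION}/kallisto" /usr/local/bin/
```

Keeping the version in `KALLISTO_VERSION` means that, in the simple case, bumping the `ENV` line in the header is the only change needed for a new release.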
# SGE

To easily run tools on the PSMN, you need to build your own Environment Module.
You can read the Contributing guide of the PSMN/modules here.
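For reference, an Environment Module is described by a modulefile written in Tcl. A minimal one might look like the sketch below; the installation prefix and help text are hypothetical, so follow the Contributing guide for the actual conventions:

```tcl
#%Module1.0
## Hypothetical modulefile for Kallisto 0.44.0.

set     version  0.44.0
# Installation prefix on the cluster (hypothetical path).
set     prefix   /applis/modules/Kallisto/$version

proc ModulesHelp { } {
    puts stderr "Kallisto: quantifying abundances of transcripts"
}

# Prevent loading two Kallisto versions at once.
conflict Kallisto

# Make the kallisto binary available on the PATH.
prepend-path PATH $prefix/bin
```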
# Nextflow

The last step to wrap your tool is to make it available in nextflow. For this, you need to create at least 4 files, like the following for `Kallisto` version `0.44.0`:
```sh
$ ls -lR src/nf_modules/Kallisto
src/nf_modules/Kallisto/:
total 12
-rw-r--r-- 1 laurent users  866 Jun 18 17:13 kallisto.config
-rw-r--r-- 1 laurent users 2711 Jun 18 17:13 kallisto.nf
drwxr-xr-x 2 laurent users 4096 Jun 18 17:14 tests/

src/nf_modules/Kallisto/tests:
total 16
-rw-r--r-- 1 laurent users  551 Jun 18 17:14 index.nf
-rw-r--r-- 1 laurent users  901 Jun 18 17:14 mapping_paired.nf
-rw-r--r-- 1 laurent users 1037 Jun 18 17:14 mapping_single.nf
-rwxr-xr-x 1 laurent users  627 Jun 18 17:14 tests.sh*
```
The `kallisto.config` file contains instructions for two profiles: `sge` and `docker`.
The `kallisto.nf` file contains the nextflow processes to use `Kallisto`.
The `tests/tests.sh` script contains a series of nextflow calls on the other `.nf` files of the `tests/` folder. Those tests correspond to the execution of the processes present in the `kallisto.nf` file on the LBMC/tiny_dataset dataset with the `docker` profile. You can read the Running the tests section of the README.md.
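For illustration, a single call inside such a `tests/tests.sh` could look like the following sketch. The exact paths and options are assumptions based on the file layout above, not the content of the real script:

```shell
#!/bin/sh
# Hypothetical test call: run the index.nf test with the docker profile.
CONFIG="src/nf_modules/Kallisto/kallisto.config"
TEST_NF="src/nf_modules/Kallisto/tests/index.nf"

# Guarded so the sketch degrades gracefully when nextflow is absent.
if command -v nextflow >/dev/null 2>&1; then
  nextflow "$TEST_NF" -c "$CONFIG" -profile docker
fi
```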
## kallisto.config

The `.config` file defines the configuration to apply to your processes, conditionally on the value of the `-profile` option. You must define a configuration for at least the `sge` and `docker` profiles.
```Groovy
profiles {
  docker {
    docker.temp = 'auto'
    docker.enabled = true
    process {
    }
  }
  sge {
    process {
    }
  }
}
```
### docker profile

The `docker` profile starts by enabling docker for the whole pipeline. After that, you only have to define the container name for each process.
For example, for `Kallisto`, we have:
```Groovy
process {
  $index_fasta {
    container = "kallisto:0.44.0"
  }
  $mapping_fastq {
    container = "kallisto:0.44.0"
  }
}
```
### sge profile

The `sge` profile defines, for each process, all the information necessary to launch your process on a given queue at the PSMN.
For example, for `Kallisto`, we have:
```Groovy
process {
  $index_fasta {
    beforeScript = "module purge; module load Kallisto/0.44.0"
    executor = "sge"
    cpus = 1
    memory = "5GB"
    time = "6h"
    queueSize = 1000
    pollInterval = '60sec'
    queue = 'h6-E5-2667v4deb128'
    penv = 'openmp8'
  }
  $mapping_fastq {
    beforeScript = "module purge; module load Kallisto/0.44.0"
    executor = "sge"
    cpus = 4
    memory = "5GB"
    time = "6h"
    queueSize = 1000
    pollInterval = '60sec'
    queue = 'h6-E5-2667v4deb128'
    penv = 'openmp8'
  }
}
```
The `beforeScript` command is executed before the main script of the corresponding process; here it loads the matching Environment Module.
## kallisto.nf

The `kallisto.nf` file contains examples of nextflow processes that execute `Kallisto`:

- Each example must be usable as is, to be incorporated in a nextflow pipeline.
- You need to define default values for the parameters passed to the process.
- Input and output must be clearly defined.
- Your process must be usable as a starting process, or as a process retrieving the output of another process.
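As a sketch of what such a process can look like (the channel names, default parameter value, and kallisto options below are assumptions, not the content of the real `kallisto.nf`):

```Groovy
// Hypothetical default value for the fasta parameter.
params.fasta = "data/fasta/*.fasta"

// Starting channel: pick up the fasta files, fail early if none match.
Channel
  .fromPath(params.fasta)
  .ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" }
  .set { fasta_files }

process index_fasta {
  tag "$fasta.baseName"
  input:
    file fasta from fasta_files
  output:
    file "*.index*" into index_files
  script:
"""
kallisto index -k 31 --make-unique -i ${fasta.baseName}.index ${fasta}
"""
}
```

Because the input comes from a channel, `fasta_files` can just as well be fed by the output of an upstream process instead of `Channel.fromPath`, which satisfies the last requirement above.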