pipeline_building_cec.md
- Building the pipeline
- Initialize your own project
- Forking
- Project organization
- Nextflow pipeline
- Build your own RNASeq pipeline
- fastp
- Pipeline -- arguments
- Kallisto
- Bowtie2
- Returning results
- Combining Kallisto and Bowtie2 in the same pipeline
- Run your RNASeq pipeline on the PSMN
- Set your environment
- Run nextflow
Building the pipeline
This guide keeps the major steps of Laurent's documentation and adapts them to our purposes.
Initialize your own project
You are going to build a pipeline for yourself or your team, so the first step is to create your own project.
Forking
Instead of reinventing the wheel, you can use LBMC/nextflow as a template. To do so, go to the LBMC/nextflow repository and click on the fork button (you need to log in).
In git, forking means making your own private copy of a repository. This copy keeps a link with the original LBMC/nextflow project, from which you will be able to:
- get updates from the LBMC/nextflow repository
- propose updates (see contributing guide)
Project organization
This project (and yours) follows the guide of good practices for the LBMC.
You are now on the main page of your fork of LBMC/nextflow. You can explore this project; all the code in it is under the CeCILL licence (see the LICENCE file).
The README.md file contains instructions to run your pipeline and test its installation.
The CONTRIBUTING.md file contains guidelines if you want to contribute to the LBMC/nextflow.
The data folder will be the place where you store the raw data for your analysis. The results folder will be the place where you store the results of your analysis.
The content of the data and results folders should never be saved on git.
The doc folder contains the documentation and this guide.
Most interestingly for you, the src folder contains code to wrap tools. It contains one visible subdirectory, nf_modules, some pipeline examples, and other hidden folders and files.
Nextflow pipeline
A pipeline is a succession of processes. Each process has data input(s) and optional data output(s). Data flows are modeled as channels.
Build your own RNASeq pipeline
In this section you are going to build your own pipeline for RNASeq analysis from the code available in the src/nf_modules folder.
Open Atom and create a src/RNASeq.nf file.
The first line that you are going to add is:
nextflow.enable.dsl=2
fastp
The first step of the pipeline is to remove any Illumina adaptors left in your read files and to trim your reads by quality.
The LBMC/nextflow template provides you with many tools, for which you can find a predefined process block.
You can find a list of these tools in the src/nf_modules folder.
You can also ask for a new tool by creating a new issue for it in the LBMC/nextflow project.
You are going to include the src/nf_modules/fastp/main.nf file in your src/RNASeq.nf file:
include { fastp } from "./nf_modules/fastp/main.nf"
The ./nf_modules/fastp/main.nf path is relative to the src/RNASeq.nf file, which is why you don't include the src/ part of the path.
With this line you can call the fastp block in your future workflow without having to write it!
If you check the content of the file src/nf_modules/fastp/main.nf, you can see that by including fastp you are including a sub-workflow (we will come back to this object later). Sub-workflows can be used like processes.
This sub-workflow takes a fastq channel. You need to make one:
channel
.fromFilePairs( "data/tiny_dataset/fastq/*_R{1,2}.fastq", size: -1)
.set { fastq_files }
The .fromFilePairs() function creates a channel of pairs of fastq files. Therefore, the items emitted by the fastq_files channel are going to be pairs of fastq files for paired-end data.
The option size: -1 allows for arbitrary numbers of associated files. Therefore, you can use the same channel creation for single-end data.
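To make the grouping concrete, here is a rough shell sketch of the idea behind a pattern such as *_R{1,2}.fastq: files that share the same prefix end up under one identifier. It only mimics the grouping in plain shell and is not how Nextflow implements .fromFilePairs(). The pair_ids function and the sampleA/sampleB file names are hypothetical.

```shell
#!/bin/sh
# pair_ids: print one identifier per group of files named <id>_R1.fastq / <id>_R2.fastq.
# Hypothetical helper, for illustration only.
pair_ids() {
    for f in "$1"/*_R[12].fastq; do
        [ -e "$f" ] || continue        # the glob matched nothing
        name=$(basename "$f")
        echo "${name%_R[12].fastq}"    # drop the read suffix, keep the shared prefix
    done | sort -u                     # one line per identifier
}

# Demo on empty placeholder files: sampleA is paired-end, sampleB is single-end.
demo_dir=$(mktemp -d)
touch "$demo_dir/sampleA_R1.fastq" "$demo_dir/sampleA_R2.fastq" "$demo_dir/sampleB_R1.fastq"
pair_ids "$demo_dir"
```

Both sampleA (two files) and sampleB (one file) get an identifier here, which is the kind of mixed input that size: -1 is meant to accept.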
You can now add the workflow definition to your src/RNASeq.nf file, passing the fastq_files channel to fastp:
workflow {
fastp(fastq_files)
}
You can commit your src/RNASeq.nf file, pull your modification locally, and run your pipeline.
Pipeline -- arguments
We have defined the fastq files path within our src/RNASeq.nf file.
But what if we want to share our pipeline with someone who doesn't want to analyze the tiny_dataset but other fastq files?
You can define a variable instead of fixing the path.
params.fastq = "data/fastq/*_{1,2}.fastq"
channel
.fromFilePairs( params.fastq, size: -1)
.set { fastq_files }
Here you declare a variable using params: the variable is named fastq and contains the path of the fastq files to look for. The advantage of using params.fastq is that the option --fastq is now a parameter of your pipeline.
Thus, you can call your pipeline with the --fastq option.
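When you pass a pattern to --fastq, quote it so that your shell hands it over unchanged and the pipeline resolves it itself. The sketch below uses a hypothetical show_args function as a stand-in for the pipeline command, simply to show what a program receives when the pattern is quoted.

```shell
#!/bin/sh
# show_args: print each received argument on its own line.
# Hypothetical stand-in for the real pipeline command.
show_args() {
    for arg in "$@"; do
        printf '%s\n' "$arg"
    done
}

# Quoted: the pattern arrives as one single, unexpanded argument.
show_args --fastq "data/fastq/*_{1,2}.fastq"
```

In a real call, show_args would be replaced by the nextflow command followed by your src/RNASeq.nf file.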
You can commit your src/RNASeq.nf file and pull your modification locally.
You can also add the following line:
log.info "fastq files: ${params.fastq}"
This line simply displays the value of the variable.
Kallisto
Kallisto runs in two steps: the indexing of the reference and the quantification of the transcripts on this index.
You can include two processes with the following syntax:
include { index_fasta; mapping_fastq } from './nf_modules/kallisto/main.nf'
The index_fasta process takes as input a fasta file corresponding to the sequence of transcripts. In our case it is c_elegans.PRJNA13758.WS278.all_transcripts.fa, which contains transcript and pseudogene sequences.
You need to be able to input a fasta_file channel. For this you first need to declare a new variable for the fasta file using params.fasta; the corresponding option is then --fasta.
log.info "fasta file : ${params.fasta}" // to display the value of the variable
channel
.fromPath( params.fasta )
.ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" }
.map { it -> [it.simpleName, it]}
.set { fasta_file }
We introduce 2 new operators:
- .ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" } to throw an error if the path of the file is not right
- .map { it -> [it.simpleName, it]} to transform our channel to a format compatible with the CONTRIBUTING rules. Items in the channel have the following shape [file_id, [file]], like the ones emitted by the .fromFilePairs(..., size: -1) function.
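Assuming simpleName returns the file name stripped of all its extensions (check the Nextflow documentation for your version), the identifier built by .map stops at the first dot of the file name. The shell sketch below reproduces that assumed behaviour with a hypothetical simple_name function.

```shell
#!/bin/sh
# simple_name: hypothetical helper mimicking the assumed behaviour of simpleName.
simple_name() {
    name=$(basename "$1")     # drop the directory part
    echo "${name%%.*}"        # drop everything from the first dot onwards
}

simple_name "data/c_elegans.PRJNA13758.WS278.all_transcripts.fa"   # prints c_elegans
simple_name "data/c_elegans.PRJNA13758.WS278.genomic.fa"           # prints c_elegans
```

Under that assumption, both fasta files used in this guide would get the same identifier, c_elegans, so the identifier alone does not tell the transcript file apart from the genomic one.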
You can add the index_fasta step to your workflow:
workflow {
fastp(fastq_files)
index_fasta(fasta_file)
}
The mapping_fastq process takes as input the outputs of the index_fasta process and of the fastp process, of shape [index_id, [index_file]] and [fastq_id, [fastq_r1_file, fastq_r2_file]] respectively.
The output of a process is accessible through <process_name>.out.
When the process has an emit: <channel_name>, we can access the corresponding channel with <process_name>.out.<channel_name>.
We can add the mapping_fastq step to our workflow:
workflow {
fastp(fastq_files)
index_fasta(fasta_file)
mapping_fastq(index_fasta.out.index.collect(), fastp.out.fastq) //.collect to reuse the same index for each fastq file
}
Commit your work
For single-end RNAseq, Kallisto needs a mean fragment size (-l) together with an sd value (-s). These parameters are available from the library analysis. They have to be added to the kallisto parameters:
include { index_fasta; mapping_fastq } from './nf_modules/kallisto/main.nf' addParams( mapping_fastq: " -l mean_frag_size -s sd_value ")
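The mean_frag_size and sd_value placeholders must be replaced by numbers. If your library analysis gives you a list of fragment lengths, one per line (an assumption; your report may already state the mean and standard deviation directly), a sketch like the following computes both values with awk. The frag_stats name and the three example lengths are made up for illustration.

```shell
#!/bin/sh
# frag_stats: read one fragment length per line on stdin, print "<mean> <sd>".
# Hypothetical helper; uses the sample standard deviation (n - 1 denominator).
frag_stats() {
    LC_ALL=C awk '
        { n++; sum += $1; sumsq += $1 * $1 }
        END {
            if (n < 2) exit 1                      # sd is undefined for fewer than two values
            mean = sum / n
            sd = sqrt((sumsq - n * mean * mean) / (n - 1))
            printf "%.1f %.1f\n", mean, sd
        }'
}

# Made-up lengths, only to show the call
printf '%s\n' 400 450 500 | frag_stats   # prints: 450.0 50.0
```

The two printed numbers are the values you would put after -l and -s respectively.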
Bowtie2
Bowtie2 runs in two steps: the indexing of the reference genome and the mapping of the sequences on this index. It returns bam files that can be viewed with the IGV browser.
You can include two processes with the following syntax:
include { index_fasta; mapping_fastq } from './nf_modules/bowtie2/main.nf'
The index_fasta process takes as input a fasta file corresponding to the sequence of the reference genome. In our case it is c_elegans.PRJNA13758.WS278.genomic.fa
You need to be able to input a fasta_file channel. For this you first need to declare a new variable for the fasta file using params.fasta; the corresponding option is then --fasta.
log.info "fasta file : ${params.fasta}" // to display the value of the variable
channel
.fromPath( params.fasta )
.ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" }
.map { it -> [it.simpleName, it]}
.set { fasta_file }
You can add the index_fasta step to your workflow:
workflow {
fastp(fastq_files)
index_fasta(fasta_file)
}
As for Kallisto, we also add the mapping_fastq step to our workflow, which takes index_fasta.out as input:
workflow {
fastp(fastq_files)
index_fasta(fasta_file)
mapping_fastq(index_fasta.out.index.collect(), fastp.out.fastq)
}
Commit your work
Returning results
By default, none of the processes defined in src/nf_modules use the publishDir instruction.
You can specify their publishDir directory by setting the parameter:
params.<process_name>_out = "path"
where "path" is a path within the results folder.
Therefore you can either:
- call your pipeline with the parameter --mapping_fastq_out "quantification/" for Kallisto or --mapping_fastq_out "alignment/" for Bowtie2
- add the following lines to your src/RNASeq.nf file to get the output of the fastp and mapping_fastq processes:
include { fastp } from './nf_modules/fastp/main.nf' addParams(fastp_out: "fastQC/")
include { index_fasta; mapping_fastq } from './nf_modules/kallisto/main.nf' addParams(mapping_fastq: " -l mean_frag_size -s sd_value ", mapping_fastq_out: "quantification/")
or
include { index_fasta; mapping_fastq } from './nf_modules/bowtie2/main.nf' addParams(mapping_fastq_out: "alignment")
Commit your work
Combining Kallisto and Bowtie2 in the same pipeline
Since Kallisto and Bowtie2 take the same kind of variables as input, you first have to rename the input fasta variables, for example:
log.info "fasta file : ${params.cds_fasta}" // for kallisto, which uses c_elegans.PRJNA13758.WS278.all_transcripts.fa
log.info "fasta file : ${params.genomic_fasta}" // for bowtie2, which uses c_elegans.PRJNA13758.WS278.genomic.fa
The corresponding options will be --cds_fasta and --genomic_fasta.
Then you need to add a channel for each of these variables:
channel
.fromPath( params.cds_fasta )
.ifEmpty { error "Cannot find any fasta files matching: ${params.cds_fasta}" }
.map { it -> [it.simpleName, it]}
.set { cds_fasta_file }
channel
.fromPath( params.genomic_fasta )
.ifEmpty { error "Cannot find any fasta files matching: ${params.genomic_fasta}" }
.map { it -> [it.simpleName, it]}
.set { genomic_fasta_file }
Finally, since kallisto and bowtie2 use processes with the same names, you have to create aliases for both index_fasta and mapping_fastq:
include {
index_fasta as kallisto_index_fasta;
mapping_fastq as kallisto_mapping_fastq
} from './nf_modules/kallisto/main.nf' addParams( mapping_fastq_out: "quantif/", mapping_fastq: " -l 459.8 -s 198.5 ")
include {
index_fasta as bowtie_index_fasta;
mapping_fastq as bowtie_mapping_fastq
} from './nf_modules/bowtie2/main.nf' addParams(mapping_fastq_out: "align/")
You can add the index_fasta and mapping_fastq steps to your workflow:
workflow {
fastp(fastq_files)
kallisto_index_fasta(cds_fasta_file)
kallisto_mapping_fastq(kallisto_index_fasta.out.index.collect(), fastp.out.fastq) // .collect() to reuse the same index for each fastq file
bowtie_index_fasta(genomic_fasta_file)
bowtie_mapping_fastq(bowtie_index_fasta.out.index.collect(), fastp.out.fastq)
}
Commit your work: your pipeline is ready.
Run your RNASeq pipeline on the PSMN
First you need to connect to the PSMN:
login@allo-psmn
Then, once connected to allo-psmn, you can connect to cl6242comp2:
login@cl6242comp2
Set your environment
Create and go to your scratch folder:
mkdir -p /scratch/Bio/<login>
cd /scratch/Bio/<login>
Then you need to clone your pipeline and get the data:
git clone https://gitbio.ens-lyon.fr/<usr_name>/nextflow.git
cd nextflow/data
git clone https://gitbio.ens-lyon.fr/LBMC/hub/tiny_dataset.git
cd ..
Run nextflow
It is better to write a script as a .sh file to launch nextflow. Here is the RNAseq_script for Kallisto:
./nextflow run src/RNAseq_cec.nf \
-profile psmn \
--fastq "data/*.fastq" \
--cds_fasta "data/c_elegans.PRJNA13758.WS278.all_transcripts.fa" \
--genomic_fasta "data/c_elegans.PRJNA13758.WS278.genomic.fa"
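One way to turn the command above into a script is sketched below. It writes the command to a file with a here-document and makes it executable. The sketch writes to a temporary directory so it can be tried safely; in your project you would write to the path you actually use.

```shell
#!/bin/sh
# Write the command to a script file; a temporary directory is used here for the demo.
script_dir=$(mktemp -d)
script="$script_dir/RNAseq_script.sh"

# The quoted 'EOF' keeps the shell from expanding anything inside the here-document.
cat > "$script" <<'EOF'
#!/bin/sh
./nextflow run src/RNAseq_cec.nf \
  -profile psmn \
  --fastq "data/*.fastq" \
  --cds_fasta "data/c_elegans.PRJNA13758.WS278.all_transcripts.fa" \
  --genomic_fasta "data/c_elegans.PRJNA13758.WS278.genomic.fa"
EOF

chmod 744 "$script"   # owner can read, write and execute; others can only read
ls -l "$script"
```

The script only launches nextflow when you run it, so creating it this way has no side effect on the cluster.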
Don't forget to make the script executable using chmod 744 RNAseq_script.sh, then run the nextflow installation script:
src/install_nextflow.sh
Then you can open an attached terminal using screen:
screen -S rnaseq # to create the rnaseq screen
src/RNAseq_script.sh # to launch the script
Hold CTRL+A and then press D to detach the screen.
To reattach it, use:
screen -r rnaseq # to reattach the screen
You just ran your pipeline on the PSMN!