a583b7181506086b11ef4aece8585714d40d3be1 to master · LBMC / nextflow

.gitignore

+6
−0

# SPDX-FileCopyrightText: 2022 Laurent Modolo <laurent.modolo@ens-lyon.fr>

#

# SPDX-License-Identifier: AGPL-3.0-or-later

nextflow

.nextflow.log*

.nextflow/

work/

results

workspace.code-workspace

.gitmodules

+7
−3

[submodule "src/sge_modules"]

	path = src/sge_modules

	url = gitlab_lbmc:PSMN/modules.git

# SPDX-FileCopyrightText: 2022 Laurent Modolo <laurent.modolo@ens-lyon.fr>

#

# SPDX-License-Identifier: AGPL-3.0-or-later

[submodule "src/.docker_modules/hicstuff/3.1.3/hicstuff"]

	path = src/.docker_modules/hicstuff/3.1.3/hicstuff

	url = git@github.com:koszullab/hicstuff.git

.reuse/dep5 0 → 100644

+10
−0

Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/

Upstream-Name: nextflow

Upstream-Contact: Laurent Modolo <laurent.modolo@ens-lyon.fr>

Source: https://gitbio.ens-lyon.fr/LBMC/nextflow

# Sample paragraph, commented out:

#

# Files: src/*

# Copyright: $YEAR $NAME <$CONTACT>

# License: ...

CHANGELOG deleted 100644 → 0

+0
−0

CHANGELOG.md 0 → 100644

+123
−0

<!--

SPDX-FileCopyrightText: 2022 Laurent Modolo <laurent.modolo@ens-lyon.fr>

SPDX-License-Identifier: CC-BY-SA-4.0

-->

# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.4.0] - 2019-11-18

### Added

- Add new tools (star,...)

- conda support at the psmn

### Changed

- configuration simplification

- docker and singularity image download instead of local build

- hidden directories in `src` for project clarity (only `nf_modules` is visible)

### Removed

- conda support at in2p3 with `-profile in2p3_conda`

## [0.3.0] - 2019-05-23

### Added

- Add new tools (umi_tools, fastp,...)

- singularity support at in2p3 with `-profile in2p3`

- conda support at in2p3 with `-profile in2p3_conda`

## [0.2.9] - 2019-03-26

### Added

- Add new tools (fastq, macs2, umitools, ...)

- singularity support

### Changed

- every tool name is now in lowercase in each module section

## [0.2.7] - 2018-10-23

### Added

- Add new tools (BWA, GATK, sambamba, ...)

### Changed

- `sge` profile is now called `psmn` profile to prepare tests in the CCIN2P3

- every `psmn` config file has an updated configuration for mono or 16 cpus queues

- update process naming to follow new nextflow format

## [0.2.6] - 2018-08-23

### Added

- Added `src/training_dataset.nf` to build a small training dataset from NGS data

### Changed

- the structure of `src/nf_modules`: the `tests` folder was removed

## [0.2.5] - 2018-08-22

### Added

- This fine changelog

### Changed

- the structure of `src/nf_modules`: the `tests` folder was removed

## [0.2.4] - 2018-08-02

### Changed

- add `paired_id` variable in the output of every single-end data process to match the paired output

## [0.2.3] - 2018-07-25

### Added

- List of tools available as nextflow, docker or sge module to the `README.md`

## [0.2.2] - 2018-07-23

### Added

- SRA module from cigogne/nextflow-master 52b510e48daa1fb7

## [0.2.1] - 2018-07-23

### Added

- List of tools available as nextflow, docker or sge module

## [0.2.0] - 2018-06-18

### Added

- `doc/TP_computational_biologists.md`

- Kallisto/0.44.0

### Changed

- add `paired_id` variable in the output of every paired data process

- BEDtools: fixes for fasta handling

- UrQt: fix git version in Docker

## [0.1.2] - 2018-06-18

### Added

- `doc/tp_experimental_biologist.md` and Makefile to build the pdf

- tests files for BEDtools

### Changed

- Kallisto: various fixes

- UrQt: improve output and various fixes

### Removed

- `src/nf_test.config` modules have their own `.config`

## [0.1.0] - 2018-05-06

This is the first working version of the repository as a nextflow module repository

CONTRIBUTING.md

+269
−68

<!--

SPDX-FileCopyrightText: 2022 Laurent Modolo <laurent.modolo@ens-lyon.fr>

SPDX-License-Identifier: CC-BY-SA-4.0

-->

# Contributing

When contributing to this repository, please first discuss the change you wish to make via issue,

email, or any other method with the owners of this repository before making a change. 

When contributing to this repository, please first discuss the change you wish to make via issues,

email, or on the [ENS-Bioinfo channel](https://matrix.to/#/#ens-bioinfo:matrix.org) before making a change. 

## Forking

In git, the [action of forking](https://git-scm.com/book/en/v2/GitHub-Contributing-to-a-Project) means that you are going to make your own private copy of a repository. You can then make modifications in your copy and, if they are of interest to the source repository (here [LBMC/nextflow](https://gitbio.ens-lyon.fr/LBMC/nextflow)), create a merge request. Merge requests are sent to the source repository to ask the maintainers to integrate modifications.

![merge request button](./doc/img/merge_request.png)

## Project organization

The `LBMC/nextflow` project is structured as follows:

- all the code is in the `src/` folder

- scripts downloading external tools should download them in the `bin/` folder

- all the documentation (including this file) can be found in the `doc/` folder

- the `data` and `results` folders contain the data and results of your pipelines and are ignored by `git`

## Code structure

The `src/` folder is where we want to save the pipeline (`.nf`) scripts. This folder also contains

- the `src/install_nextflow.sh` to install the nextflow executable at the root of the project.

- some pipeline examples (like the one built during the nf_pratical)

- the `src/nextflow.config` global configuration file which contains the `docker`, `singularity`, `psmn` and `ccin2p3` profiles.

- the `src/nf_modules` folder, which contains per-tool `main.nf` modules with predefined processes that users can import in their projects with the [DSL2](https://www.nextflow.io/docs/latest/dsl2.html)

But also some hidden folders that users don't need to see when building their pipeline:

- the `src/.docker_modules` contains the recipes for the `docker` containers used in the `src/nf_modules/<tool_names>/main.nf` files

- the `src/.singularity_in2p3` and `src/.singularity_psmn` are symbolic links to the shared folder where the singularity images are downloaded on the PSMN and CCIN2P3 

# Proposing a new tool

Each tool named `<tool_name>` must have two dedicated folders:

- [`src/nf_modules/<tool_name>`](./src/nf_modules/fastp/) where users can find `.nf` files to include

- [`src/.docker_modules/<tool_name>/<version_number>`](./src/.docker_modules/fastp/0.20.1/) where we have the [`Dockerfile`](./src/.docker_modules/fastp/0.20.1/Dockerfile) to construct the container used in the `main.nf` file

## `src/nf_modules` guidelines

We are going to take the [`fastp` `nf_module`](./src/nf_modules/fastp/) as an example.

The [`src/nf_modules/<tool_name>`](./src/nf_modules/fastp/) folder should contain a [`main.nf`](./src/nf_modules/fastp/main.nf) file that describes at least one process using `<tool_name>`.

### container information

The first two lines of [`main.nf`](./src/nf_modules/fastp/main.nf) should define two variables:

```Groovy

version = "0.20.1"

container_url = "lbmc/fastp:${version}"

```

We can then use the `container_url` definition in the `container` directive of each `process`.
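
As a quick check of what the two definitions above produce, the following plain Groovy sketch prints the resulting image name. It only illustrates string interpolation and is not part of the module itself.

```Groovy
version = "0.20.1"
container_url = "lbmc/fastp:${version}"

// Groovy substitutes ${version} when the double-quoted string is evaluated.
println container_url
```

Bumping `version` is then enough to point every process of the module at the new container tag.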

In addition to the `container` directive, each `process` should have one of the following `label` values (defined in the `src/nextflow.config` file):

- `big_mem_mono_cpus`

- `big_mem_multi_cpus`

- `small_mem_mono_cpus`

- `small_mem_multi_cpus`

```Groovy

process fastp {

  container = "${container_url}"

  label "big_mem_multi_cpus"

  ...

}

```
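
For orientation, labels like these are typically mapped to resources with `withLabel` selectors in a Nextflow configuration file. The fragment below is only a sketch: the `cpus` and `memory` values are made up for illustration and are not the ones set in `src/nextflow.config`.

```Groovy
// Illustrative values only; check `src/nextflow.config` for the real settings.
process {
  withLabel: big_mem_mono_cpus {
    cpus = 1
    memory = '16 GB'
  }
  withLabel: big_mem_multi_cpus {
    cpus = 4
    memory = '16 GB'
  }
  withLabel: small_mem_mono_cpus {
    cpus = 1
    memory = '2 GB'
  }
  withLabel: small_mem_multi_cpus {
    cpus = 4
    memory = '2 GB'
  }
}
```

Because resources are attached to labels rather than to process names, a new module only has to pick the label that matches its needs.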

### process options

Before each process, you should declare at least two `params.` variables:

- A `params.<process_name>` defaulting to `""` (empty string), which lets users add command-line options to your process without rewriting the process definition

- A `params.<process_name>_out` defaulting to `""` (empty string), which defines the `results/` subfolder where the process output should be copied if the user wants to save it

```Groovy

params.fastp = ""

params.fastp_out = ""

process fastp {

  container = "${container_url}"

  label "big_mem_multi_cpus"

  if (params.fastp_out != "") {

    publishDir "results/${params.fastp_out}", mode: 'copy'

  }

  ...

  script:

"""

fastp --thread ${task.cpus} \

${params.fastp} \

...

"""

}

```

The user can then change the value of these variables:

- from the command line: `--fastp "--trim_head1=10"`

- with the `include` command within their pipeline: `include { fastp } from "nf_modules/fastp/main" addParams(fastp_out: "QC/fastp/")`

- by defining the variable within their pipeline: `params.fastp_out = "QC/fastp/"`

### `input` and `output` format

You should always use `tuple` for input and output channel format with at least:

- a `val` containing variable(s) related to the item

- a `path` for the file(s) that you want to process

For example:

```Groovy

process fastp {

  container = "${container_url}"

  label "big_mem_multi_cpus"

  tag "$file_id"

  if (params.fastp_out != "") {

    publishDir "results/${params.fastp_out}", mode: 'copy'

  }

  input:

  tuple val(file_id), path(reads)

  output:

    tuple val(file_id), path("*.fastq.gz"), emit: fastq

    tuple val(file_id), path("*.html"), emit: html

    tuple val(file_id), path("*.json"), emit: report

...

```

Here `file_id` can be anything from a simple identifier to a list of several variables. In the latter case, the first item of the list should be usable as a file prefix. Keep this in mind if you use `file_id` to define output file names (you can test for it with `file_id instanceof List`).

In some cases, `file_id` may be a Map, which gives cleaner access to its content through explicit keys.

If you want to use information within the `file_id` to name outputs in your `script` section, you can use the following snippet:

```Groovy

  script:

    switch(file_id) {

    case {it instanceof List}:

      file_prefix = file_id[0]

    break

    case {it instanceof Map}:

      file_prefix = file_id.values()[0]

    break

    default:

      file_prefix = file_id

    break

  }

```

You can then use the `file_prefix` variable to build output file names.
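
The same dispatch can be tried outside a pipeline. The sketch below wraps it in a hypothetical helper named `filePrefix` (not part of the repository) and calls it on the three kinds of `file_id` described above; the sample identifiers are invented.

```Groovy
// Hypothetical helper: returns the part of file_id usable as a file prefix.
def filePrefix(fileId) {
  switch (fileId) {
    case List:
      // first item of the list
      return fileId[0]
    case Map:
      // first value of the map, in insertion order
      return fileId.values()[0]
    default:
      // plain identifier
      return fileId
  }
}

println filePrefix("sample_a")
println filePrefix(["sample_a", "treated"])
println filePrefix([id: "sample_a", condition: "treated"])
```

All three calls yield `sample_a`, so output names built from the prefix stay the same whichever form `file_id` takes.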

This also means that channels emitting `path` items should be transformed with at least the following `map` function:

```Groovy

.map { it -> [it.simpleName, it]}

```

For example:

```Groovy

channel

  .fromPath( params.fasta )

  .ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" }

  .map { it -> [it.simpleName, it]}

  .set { fasta_files }

```

The rationale behind taking a `file_id` and emitting the same `file_id` is to facilitate complex channel operations in pipelines without having to rewrite the `process` blocks.
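
As an illustration of this rationale, the pipeline fragment below pairs two outputs of the `fastp` process by `file_id` with the Nextflow `join` operator, which matches items on their first element by default. It is a sketch under assumptions: the include path and the `params.fastq` glob are placeholders to adapt to your project.

```Groovy
nextflow.enable.dsl = 2

// Hypothetical default glob for paired-end fastq files.
params.fastq = "data/*_R{1,2}.fastq.gz"

// Assumed relative path; adjust it to the location of your pipeline script.
include { fastp } from "./nf_modules/fastp/main.nf" addParams(fastp_out: "QC/fastp/")

workflow {
  channel
    .fromFilePairs( params.fastq )
    .set { fastq_files }

  fastp(fastq_files)

  // Both outputs carry the same file_id, so they can be matched per sample
  // without touching the process definition.
  fastp.out.fastq
    .join(fastp.out.report)
    .view()
}
```

Each item printed by `view()` starts with the `file_id`, followed by the trimmed fastq file(s) and the JSON report for that sample.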

### dealing with paired-end and single-end data

When opening fastq files with `channel.fromFilePairs( params.fastq )`, items in the channel have the following shape:

```Groovy

[file_id, [read_1_file, read_2_file]]

```
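
To make this shape concrete, here is a plain Groovy sketch that unpacks one such item into the `file_id` and `reads` parts matched by a `tuple val(file_id), path(reads)` input. The file names are invented for the example.

```Groovy
// Hypothetical item, shaped like the ones emitted by channel.fromFilePairs().
def item = ["sample_a", ["sample_a_R1.fastq.gz", "sample_a_R2.fastq.gz"]]

// Groovy multiple assignment splits the item into its two parts.
def (file_id, reads) = item

println "${file_id}: ${reads.size()} read file(s)"
```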