Merge pull request #143 from nservant/dev

update docs

Commit f593a1b5 (unverified), authored Jan 4, 2023 by Nicolas Servant, committed by GitHub Jan 4, 2023
--- a/docs/output.md
+++ b/docs/output.md
@@ -9,6 +9,7 @@ The directories listed below will be created in the results directory after the
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
+- [From raw data to valid pairs](#from-raw-data-to-valid-pairs)
  - [HiC-Pro](#hicpro)
    - [Reads alignment](#reads-alignment)
    - [Valid pairs detection](#valid-pairs-detection)
@@ -24,14 +25,16 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 - [Export](#exprot) - additional export for compatibility with downstream
  analysis tools and visualization
-## HiC-Pro
+## From raw data to valid pairs
+### HiC-Pro
 The current version is mainly based on the
 [HiC-Pro](https://github.com/nservant/HiC-Pro) pipeline.
 For details about the workflow, see
 [Servant et al. 2015](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0831-x)
-### Reads alignment
+#### Reads alignment
 Using Hi-C data, each read mate has to be independently aligned on the
 reference genome.
@@ -41,17 +44,15 @@ Second, reads spanning the ligation junction are trimmmed from their 3' end,
 and aligned back on the genome.
 Aligned reads for both fragment mates are then paired in a single paired-end
 BAM file.
-Singletons are discarded, and multi-hits are filtered according to the
+Singletons and low quality mapped reads are filtered (`--min_mapq`).
-configuration parameters (`--rm-multi`).
 Note that if the `--dnase` mode is activated, HiC-Pro will skip the second
 mapping step.
 **Output directory: `results/hicpro/mapping`**
 - `*bwt2pairs.bam` - final BAM file with aligned paired data
- `*.pairstat` - mapping statistics
-if `--saveAlignedIntermediates` is specified, additional mapping file results
+if `--save_aligned_intermediates` is specified, additional mapping file results
 are available ;
 - `*.bam` - Aligned reads (R1 and R2) from end-to-end alignment
@@ -66,12 +67,12 @@ Usually, a high fraction of reads is expected to be aligned on the genome
 aligned reads. Those reads are chimeric fragments for which we detect a
 ligation junction. An abnormal level of chimeric reads can reflect a ligation
 issue during the library preparation.
-The fraction of singleton or multi-hits depends on the genome complexity and
+The fraction of singletons or low-quality reads depends on the genome complexity and
 the fraction of unmapped reads. The fraction of singleton is usually close to
 the sum of unmapped R1 and R2 reads, as it is unlikely that both mates from the
 same pair were unmapped.
-### Valid pairs detection with HiC-Pro
+#### Valid pairs detection with HiC-Pro
 Each aligned read can be assigned to one restriction fragment according to the
 reference genome and the digestion protocol.
@@ -91,7 +92,7 @@ Invalid pairs are classified as follow:
 Only valid pairs involving two different restriction fragments are used to
 build the contact maps.
 Duplicated valid pairs associated to PCR artefacts are discarded
-(see `--rm_dup`).
+(see `--keep_dups` to not discard them).
 In case of Hi-C protocols that do not require a restriction enzyme such as
 DNase Hi-C or micro Hi-C, the assignment to a restriction is not possible
@@ -108,12 +109,13 @@ can thus be discarded using the `--min_cis_dist` parameter.
 - `*.FiltPairs` - List of filtered pairs
 - `*RSstat` - Statistics on the number of read pairs falling in each category
-The validPairs are stored using a simple tab-delimited text format ;
+Of note, these results are saved only if `--save_pairs_intermediates` is used.  
+The `validPairs` are stored using a simple tab-delimited text format ;
 ```bash
 read name / chr_reads1 / pos_reads1 / strand_reads1 / chr_reads2 / pos_reads2 /
 strand_reads2 / fragment_size / res frag name R1 / res frag R2 / mapping qual R1
-/ mapping qual R2 [/ allele_specific_tag]
+/ mapping qual R2
 ```
 The ligation efficiency can be assessed using the filtering of valid and
@@ -127,16 +129,16 @@ is skipped. The aligned pairs are therefore directly used to generate the
 contact maps. Filtering short-range contacts (typically <1kb) is
 recommended, as these pairs are likely to be self-ligation products.
-### Duplicates removal
+#### Duplicates removal
-Note that validPairs file are generated per reads chunck.
+Note that `validPairs` files are generated per read chunk (and saved only if
-These files are then merged in the allValidPairs file, and duplicates are
+`--save_pairs_intermediates` is specified).
-removed if the `--rm_dup` parameter is used.
+These files are then merged in the `allValidPairs` file, and duplicates are
+removed (see `--keep_dups` to disable duplicates filtering).
 **Output directory: `results/hicpro/valid_pairs`**
 - `*allValidPairs` - combined valid pairs from all read chunks
- `*mergestat` - statistics about duplicates removal and valid pairs information
 Additional quality controls such as fragment size distribution can be extracted
 from the list of valid interaction products.
@@ -144,11 +146,35 @@ We usually expect to see a distribution centered around 300 pb which correspond
 to the paired-end insert size commonly used.
 The fraction of duplicates is also presented. A high level of duplication
 indicates a poor molecular complexity and a potential PCR bias.
-Finaly, an important metric is to look at the fraction of intra and
+Finally, an important metric is to look at the fraction of intra and
 inter-chromosomal interactions, as well as long range (>20kb) versus short
 range (<20kb) intra-chromosomal interactions.
-### Contact maps
+#### Pairs file
+`.pairs` is a standard tabular format proposed by the 4DN Consortium
+for storing DNA contacts detected in a Hi-C experiment
+(see https://pairtools.readthedocs.io/en/latest/formats.html).
+This format is the entry point of the downstream steps of the pipeline after
+detection of valid pairs.
+**Output directory: `results/hicpro/valid_pairs/pairix`**
+- `*pairix` - compressed and indexed pairs file
+#### Statistics
+Various statistics files are generated all along the data processing.
+All results are available in `results/hicpro/stats`.
+**Output directory: `results/hicpro/stats`**
+- \*mapstat - mapping statistics per read mate
+- \*pairstat - R1/R2 pairing statistics
+- \*RSstat - Statistics on the number of read pairs falling in each category
+- \*mergestat - statistics about duplicates removal and valid pairs information
+#### Contact maps
 Intra- and inter-chromosomal contact maps are built for all specified resolutions.
 The genome is split into bins of equal size. Each valid interaction is
@@ -195,15 +221,16 @@ downstream analysis.
 ## Hi-C contact maps
 Contact maps are usually stored as simple txt (`HiC-Pro`), .hic (`Juicer/Juicebox`) and .(m)cool (`cooler/Higlass`) formats.
-Note that .cool and .hic format are compressed and usually much more efficient that the txt format.  
+The .cool and .hic formats are compressed and indexed, and usually much more efficient than the txt format.  
 In the current workflow, we propose to use the `cooler` format as a standard to build the raw and normalized maps
 after valid pairs detection as it is used by several downstream analysis and visualization tools.
 Raw contact maps are therefore in **`results/contact_maps/raw`** which contains the different maps in `txt` and `cool` formats, at various resolutions.
 Normalized contact maps are stored in **`results/contact_maps/norm`** which contains the different maps in `txt`, `cool`, and `mcool` format.
+The bin coordinates used for all resolutions are available in **`results/contact_maps/bins`**.
 Note that `txt` contact maps generated with `cooler` are identical to those generated by `HiC-Pro`.
-However, differences can be observed on the normalized contact maps as the balancing algorithm is not the same.
+However, differences can be observed on the normalized contact maps as the balancing algorithm is not exactly the same.
 ## Downstream analysis
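
As an illustration of the `validPairs` layout documented in `docs/output.md`, here is a minimal Python sketch that parses one tab-delimited line and sorts pairs into the inter-chromosomal, long-range (>20kb) and short-range intra-chromosomal categories mentioned above. The names `ValidPair`, `parse_valid_pair`, `classify` and `summarize` are hypothetical helpers, not part of HiC-Pro or the pipeline, and the example read lines are made up. Treating a distance of exactly 20kb as short range is an assumption, since the text does not say.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class ValidPair(NamedTuple):
    """One record of a validPairs file (column order taken from the docs)."""
    read_name: str
    chr1: str
    pos1: int
    strand1: str
    chr2: str
    pos2: int
    strand2: str
    fragment_size: str  # kept as text: not assumed to be numeric in every mode
    res_frag1: str
    res_frag2: str
    mapq1: str
    mapq2: str


def parse_valid_pair(line: str) -> ValidPair:
    """Split a tab-delimited line into the 12 documented columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(fields)}")
    f = fields[:12]
    return ValidPair(f[0], f[1], int(f[2]), f[3], f[4], int(f[5]),
                     f[6], f[7], f[8], f[9], f[10], f[11])


def classify(pair: ValidPair, threshold: int = 20_000) -> str:
    """Label a pair as inter-chromosomal, or intra long/short range."""
    if pair.chr1 != pair.chr2:
        return "inter"
    distance = abs(pair.pos2 - pair.pos1)
    # Assumption: a distance of exactly `threshold` counts as short range.
    return "intra_long" if distance > threshold else "intra_short"


def summarize(lines: Iterable[str]) -> Counter:
    """Count pairs per category, skipping empty lines."""
    return Counter(classify(parse_valid_pair(l)) for l in lines if l.strip())


# Made-up records, for illustration only.
EXAMPLE = [
    "r1\tchr1\t1000\t+\tchr1\t51000\t-\t250\tfragA\tfragB\t42\t42\n",
    "r2\tchr1\t1000\t+\tchr1\t1500\t-\t250\tfragA\tfragC\t42\t30\n",
    "r3\tchr1\t1000\t+\tchr2\t1500\t-\t250\tfragA\tfragD\t42\t42\n",
]

if __name__ == "__main__":
    print(summarize(EXAMPLE))
```

In practice the lines would come from an `*allValidPairs` file opened in text mode; the counts are only a rough cross-check and do not replace the `*mergestat` statistics.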

--- a/docs/usage.md
+++ b/docs/usage.md
@@ -27,19 +27,13 @@ CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz
 ### Full samplesheet
-The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 3 columns to match those defined in the table below.
+The `nf-core-hic` pipeline is designed to work only with paired-end data. The samplesheet can have as many columns as you desire; however, there is a strict requirement for the first 3 columns to match those defined in the table below.
-A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where `TREATMENT_REP3` has been sequenced twice.
 ```console
 sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
+SAMPLE_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
-CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz
+SAMPLE_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz
-CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz
+SAMPLE_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz
-TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz,
-TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz,
-TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz,
-TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz,
 ```
 | Column    | Description                                                                                                                                                                            |
@@ -333,11 +327,11 @@ Please note the following requirements:
 If left unspecified, a default pattern is used: `data/*{1,2}.fastq.gz`
-Note that the Hi-C data analysis requires paired-end data.
+Note that the Hi-C data analysis workflow requires paired-end data.
 ## Reference genomes
-The pipeline config files come bundled with paths to the illumina iGenomes reference
+The pipeline config files come bundled with paths to the Illumina iGenomes reference
 index files. If running with docker or AWS, the configuration is set up to use the
 [AWS-iGenomes](https://ewels.github.io/AWS-iGenomes/) resource.
@@ -347,7 +341,7 @@ There are many different species supported in the iGenomes references. To run
 the pipeline, you must specify which to use with the `--genome` flag.
 You can find the keys to specify the genomes in the
-[iGenomes config file](../conf/igenomes.config).
+[iGenomes config file](https://github.com/nf-core/hic/blob/master/conf/igenomes.config).
 ### `--fasta`
@@ -361,7 +355,7 @@ run the pipeline:
 ### `--bwt2_index`
 The bowtie2 indexes are required to align the data with the HiC-Pro workflow. If the
-`--bwt2_index` is not specified, the pipeline will either use the igenome
+`--bwt2_index` is not specified, the pipeline will either use the iGenomes
 bowtie2 indexes (see `--genome` option) or build the indexes on-the-fly
 (see `--fasta` option)
@@ -371,8 +365,8 @@ bowtie2 indexes (see `--genome` option) or build the indexes on-the-fly
 ### `--chromosome_size`
-The Hi-C pipeline will also requires a two-columns text file with the
+The Hi-C pipeline also requires a two-column text file with the
-chromosome name and its size (tab separated).
+chromosome name and the chromosome size (tab-separated).
 If not specified, this file will be automatically created by the pipeline.
 In the latter case, the `--fasta` reference genome has to be specified.
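
To make the expected `--chromosome_size` layout concrete, the following Python sketch derives a two-column, tab-separated table (chromosome name, size) from FASTA text. It is not how the pipeline builds the file; `chromosome_sizes` and `format_sizes` are hypothetical helper names, and using the first whitespace-delimited token of each header as the chromosome name is an assumption.

```python
from typing import Dict, Iterable


def chromosome_sizes(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Sum sequence lengths per FASTA record, keeping record order."""
    sizes: Dict[str, int] = {}
    name = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header without a sequence name")
            # Assumption: the name is the first token of the header line.
            name = header[0]
            sizes[name] = 0
        elif name is not None:
            sizes[name] += len(line)
    return sizes


def format_sizes(sizes: Dict[str, int]) -> str:
    """Render the table as 'name<TAB>size' lines."""
    return "".join(f"{name}\t{size}\n" for name, size in sizes.items())


# Tiny made-up genome, for illustration only.
FASTA = [">chr1 test sequence", "ACGTACGT", "ACGT", ">chr2", "GGCC"]

if __name__ == "__main__":
    print(format_sizes(chromosome_sizes(FASTA)), end="")
```

For a real genome, the lines would be read from the file passed to `--fasta`.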
@@ -396,7 +390,7 @@ In the latter case, the `--fasta` reference genome has to be specified.
 ### `--restriction_fragments`
-Finally, Hi-C experiments based on restriction enzyme digestion requires a BED
+Finally, Hi-C experiments based on restriction enzyme digestion require a BED
 file with coordinates of restriction fragments.
 ```bash
@@ -413,23 +407,23 @@ file with coordinates of restriction fragments.
   (...)
 ```
-If not specified, this file will be automatically created by the pipline.
+If not specified, this file will be automatically created by the pipeline.
 In this case, the `--fasta` reference genome will be used.
-Note that the `digestion` or `--restriction_site` parameter is mandatory to create this file.
+Note that the `--digestion` or `--restriction_site` parameter is mandatory to create this file.
 ## Hi-C specific options
 The following options are defined in the `nextflow.config` file, and can be
 updated either using a custom configuration file (see `-c` option) or using
-command line parameter.
+command line parameters.
 ### HiC-pro mapping
 The reads mapping is currently based on the two-steps strategy implemented in
 the HiC-pro pipeline. The idea is to first align reads from end-to-end.
-Reads that do not aligned are then trimmed at the ligation site, and their 5'
+Reads that do not align are then trimmed at the ligation site, and their 5'
 end is re-aligned to the reference genome.
-Note that the default option are quite stringent, and can be updated according
+Note that the default options are quite stringent, and can be updated according
 to the reads quality or the reference genome.
 #### `--bwt2_opts_end2end`
@@ -475,7 +469,7 @@ Available keywords are 'hindiii', 'dpnii', 'mboi', 'arima'.
 #### `--restriction_site`
 If the restriction enzyme is not available through the `--digestion`
-parameter, you can also defined manually the restriction motif(s) for
+parameter, you can also manually define the restriction motif(s) for
 Hi-C digestion protocol.
 The restriction motif(s) is(are) used to generate the list of restriction fragments.
 The precise cutting site of the restriction enzyme has to be specified using
@@ -498,7 +492,7 @@ that 'N' base are supported.
 Ligation motif after reads ligation. This motif is used for reads trimming and
 depends on the fill in strategy.
-Note that multiple ligation sites can be specified (comma separated) and that
+Note that multiple ligation sites can be specified (comma-separated) and that
 'N' base is interpreted and replaced by 'A','C','G','T'.
 Default: 'AAGCTAGCTT'
@@ -514,11 +508,11 @@ Exemple of the ARIMA kit: GATCGATC,GANTGATC,GANTANTC,GATCANTC
 In DNAse Hi-C mode, all options related to digestion Hi-C
 (see previous section) are ignored.
-In this case, it is highly recommanded to use the `--min_cis_dist` parameter
+In this case, it is highly recommended to use the `--min_cis_dist` parameter
 to remove spurious ligation products.
 ```bash
--dnase'
+--dnase
 ```
 ### HiC-pro processing
@@ -570,7 +564,7 @@ Mainly useful for DNase Hi-C. Default: '0'
 #### `--keep_dups`
-If specified, duplicates reads are not discarded before building contact maps.
+If specified, duplicate reads are not discarded before building contact maps.
 ```bash
 --keep_dups
@@ -594,7 +588,7 @@ framework to build the raw and balanced contact maps in txt and (m)cool formats.
 ### `--bin_size`
-Resolution of contact maps to generate (comma separated).
+Resolution of contact maps to generate (comma-separated).
 Default:'1000000,500000'
 ```bash
@@ -635,7 +629,7 @@ Default: 100
 #### `--ice_filer_low_count_perc`
-Define which pourcentage of bins with low counts should be force to zero.
+Define which percentage of bins with low counts should be forced to zero.
 Default: 0.02
 ```bash
@@ -644,7 +638,7 @@ Default: 0.02
 #### `--ice_filer_high_count_perc`
-Define which pourcentage of bins with low counts should be discarded before
+Define which percentage of bins with high counts should be discarded before
 normalization. Default: 0
 ```bash
@@ -667,7 +661,7 @@ normalization. Default: 0.1
 #### `--res_dist_decay`
 Generates distance vs Hi-C counts plots at a given resolution using `HiCExplorer`.
-Several resolution can be specified (comma separeted). Default: '250000'
+Several resolutions can be specified (comma-separated). Default: '250000'
 ```bash
 --res_dist_decay '[string]'
@@ -679,7 +673,7 @@ Call open/close compartments for each chromosome, using the `cooltools` command.
 #### `--res_compartments`
-Resolution to call the chromosome compartments (comma separated).
+Resolution to call the chromosome compartments (comma-separated).
 Default: '250000'
 ```bash
@@ -692,7 +686,7 @@ Default: '250000'
 TADs calling can be performed using different approaches.
 Currently available options are `insulation` and `hicexplorer`.
-Note that all options can be specified (comma separated).
+Note that all options can be specified (comma-separated).
 Default: 'insulation'
 ```bash
@@ -701,7 +695,7 @@ Default: 'insulation'
 #### `--res_tads`
-Resolution to run the TADs calling analysis (comma separated).
+Resolution to run the TADs calling analysis (comma-separated).
 Default: '40000,20000'
 ```bash
@@ -744,7 +738,7 @@ results folder. Default: false
 ### `--save_interaction_bam`
-If specified, write a BAM file with all classified reads (valid paires,
+If specified, write a BAM file with all classified reads (valid pairs,
 dangling end, self-circle, etc.) and its tags.
 ```bash
@@ -756,7 +750,7 @@ dangling end, self-circle, etc.) and its tags.
 ### `--skip_maps`
 If defined, the workflow stops with the list of valid interactions, and the
-genome-wide maps are not built. Usefult for capture-C analysis. Default: false
+genome-wide maps are not built. Useful for capture-C analysis. Default: false
 ```bash
 --skip_maps
@@ -779,7 +773,7 @@ If defined, cooler files are not generated. Default: false
 --skip_cool
 ```
-### `skip_dist_decay`
+### `--skip_dist_decay`
 Do not run distance decay plots. Default: false
@@ -787,7 +781,7 @@ Do not run distance decay plots. Default: false
 --skip_dist_decay
 ```
-### `skip_compartments`
+### `--skip_compartments`
 Do not call compartments. Default: false
@@ -795,7 +789,7 @@ Do not call compartments. Default: false
 --skip_compartments
 ```
-### `skip_tads`
+### `--skip_tads`
 Do not call TADs. Default: false
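
Since `docs/usage.md` now states that the pipeline only accepts paired-end data, a samplesheet can be checked before launching a run. The Python sketch below is a minimal illustration, assuming only what the docs state: the first three columns must be `sample,fastq_1,fastq_2`, and both FASTQ fields must be filled. `check_samplesheet` is a hypothetical helper, not the pipeline's own validation, and the example rows are shortened, made-up file names.

```python
import csv
import io
from typing import List, Tuple

REQUIRED = ["sample", "fastq_1", "fastq_2"]


def check_samplesheet(text: str) -> List[Tuple[str, str, str]]:
    """Return (sample, fastq_1, fastq_2) rows, or raise ValueError."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None or [h.strip() for h in header[:3]] != REQUIRED:
        raise ValueError(f"first three columns must be {','.join(REQUIRED)}")
    rows = []
    # Data rows start on line 2 of the file.
    for number, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore blank lines
        cells = [c.strip() for c in row[:3]]
        if len(cells) < 3 or not all(cells):
            raise ValueError(f"line {number}: paired-end data needs both FASTQ files")
        rows.append((cells[0], cells[1], cells[2]))
    return rows


# Made-up samplesheets, for illustration only.
PAIRED = (
    "sample,fastq_1,fastq_2\n"
    "SAMPLE_REP1,rep1_R1.fastq.gz,rep1_R2.fastq.gz\n"
    "SAMPLE_REP2,rep2_R1.fastq.gz,rep2_R2.fastq.gz\n"
)
SINGLE_END = "sample,fastq_1,fastq_2\nTREATMENT_REP1,rep1_R1.fastq.gz,\n"

if __name__ == "__main__":
    print(check_samplesheet(PAIRED))
```

Extra columns after the third are ignored, in line with the statement that the samplesheet can have as many columns as desired.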