Merge pull request #155 from jzohren/master

Fixed some spelling mistakes in the Output documentation

Commit 1cc0a18c (unverified), authored by Nicolas Servant 2 years ago, committed by GitHub 2 years ago
--- a/docs/output.md
+++ b/docs/output.md
@@ -23,7 +23,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 - [MultiQC](#multiqc) - aggregate report and quality controls, describing
  results of the whole pipeline
 - [Export](#exprot) - additionnal export for compatibility with downstream
-  analysis tool and visualization
+  analysis tool and visualisation

 ## From raw data to valid pairs

@@ -36,7 +36,7 @@ For details about the workflow, see

 #### Reads alignment

-Using Hi-C data, each reads mate has to be independantly aligned on the
+Using Hi-C data, each reads mate has to be independently aligned on the
 reference genome.
 The current workflow implements a two steps mapping strategy. First, the reads
 are aligned using an end-to-end aligner.
@@ -63,7 +63,7 @@ are available ;
 - `*.mapstat` - mapping statistics per read mate

 Usually, a high fraction of reads is expected to be aligned on the genome
-(80-90%). Among them, we usually observed a few percent (around 10%) of step 2
+(80-90%). Among them, we usually observe a few percent (around 10%) of step 2
 aligned reads. Those reads are chimeric fragments for which we detect a
 ligation junction. An abnormal level of chimeric reads can reflect a ligation
 issue during the library preparation.
@@ -142,9 +142,9 @@ removed (see `--keep_dups` to disable duplicates filtering).

 Additional quality controls such as fragment size distribution can be extracted
 from the list of valid interaction products.
-We usually expect to see a distribution centered around 300 pb which correspond
+We usually expect to see a distribution centered around 300 bp which corresponds
 to the paired-end insert size commonly used.
-The fraction of dplicates is also presented. A high level of duplication
+The fraction of duplicates is also presented. A high level of duplication
 indicates a poor molecular complexity and a potential PCR bias.
 Finally, an important metric is to look at the fraction of intra and
 inter-chromosomal interactions, as well as long range (>20kb) versus short
@@ -176,15 +176,15 @@ All results are available in `results/hicpro/stats`.

 #### Contact maps

-Intra et inter-chromosomal contact maps are build for all specified resolutions.
-The genome is splitted into bins of equal size. Each valid interaction is
+Intra and inter-chromosomal contact maps are built for all specified resolutions.
+The genome is split into bins of equal size. Each valid interaction is
 associated with the genomic bins to generate the raw maps.
 In addition, Hi-C data can contain several sources of biases which has to be
 corrected.
 The HiC-Pro workflow uses the [ìced](https://github.com/hiclib/iced) and
 [Varoquaux and Servant, 2018](http://joss.theoj.org/papers/10.21105/joss.01286)
 python package which proposes a fast implementation of the original ICE
-normalization algorithm (Imakaev et al. 2012), making the assumption of equal
+normalisation algorithm (Imakaev et al. 2012), making the assumption of equal
 visibility of each fragment.

 Importantly, the HiC-Pro maps are generated only if the `--hicpro_maps` option
@@ -221,16 +221,16 @@ downstream analysis.
 ## Hi-C contact maps

 Contact maps are usually stored as simple txt (`HiC-Pro`), .hic (`Juicer/Juicebox`) and .(m)cool (`cooler/Higlass`) formats.
-The .cool and .hic format are compressed and indexed and usually much more efficient that the txt format.  
-In the current workflow, we propose to use the `cooler` format as a standard to build the raw and normalized maps
-after valid pairs detection as it is used by several downstream analysis and visualization tools.
+The .cool and .hic format are compressed and indexed and usually much more efficient than the txt format.  
+In the current workflow, we propose to use the `cooler` format as a standard to build the raw and normalised maps
+after valid pairs detection as it is used by several downstream analysis and visualisation tools.

 Raw contact maps are therefore in **`results/contact_maps/raw`** which contains the different maps in `txt` and `cool` formats, at various resolutions.
-Normalized contact maps are stored in **`results/contact_maps/norm`** which contains the different maps in `txt`, `cool`, and `mcool` format.
+Normalised contact maps are stored in **`results/contact_maps/norm`** which contains the different maps in `txt`, `cool`, and `mcool` format.
 The bin coordinates used for all resolutions are available in **`results/contact_maps/bins`**.

 Note that `txt` contact maps generated with `cooler` are identical to those generated by `HiC-Pro`.
-However, differences can be observed on the normalized contact maps as the balancing algorithm is not exactly the same.
+However, differences can be observed on the normalised contact maps as the balancing algorithm is not exactly the same.

 ## Downstream analysis

@@ -246,23 +246,23 @@ The results generated with the `HiCExplorer hicPlotDistVsCounts` tool (plot and
 ### Compartments calling

 Compartments calling is one of the most common analysis which aims at detecting A (open, active) / B (close, inactive) compartments.
-In the first studies on the subject, the compartments were called at high/medium resolution (1000000 to 250000) which is enough to call A/B comparments.
+In the first studies on the subject, the compartments were called at high/medium resolution (1000000 to 250000) which is enough to call A/B compartments.
 Analysis at higher resolution has shown that these two main types of compartments can be further divided into compartments subtypes.

-Although different methods have been proposed for compartment calling, the standard remains the eigen vector decomposition from the normalized correlation maps.
+Although different methods have been proposed for compartment calling, the standard remains the eigen vector decomposition from the normalised correlation maps.
 Here, we use the implementation available in the [`cooltools`](https://cooltools.readthedocs.io/en/lates) package.

-Results are available in **`results/compartments/`** folder and includes :
+Results are available in **`results/compartments/`** folder and include :

 - `*cis.vecs.tsv`: eigenvectors decomposition along the genome
 - `*cis.lam.txt`: eigenvalues associated with the eigenvectors

 ### TADs calling

-TADs has been described as functional units of the genome.
-While contacts between genes and regulatority elements can occur within a single TADs, contacts between TADs are much less frequent, mainly due to the presence of insulation protein (such as CTCF) at their boundaries. Looking at Hi-C maps, TADs look like triangles around the diagonal. According to the contact map resolutions, TADs appear as hierarchical structures with a median size around 1Mb (in mammals), as well as smaller structures usually called sub-TADs of smaller size.
+TADs have been described as functional units of the genome.
+While contacts between genes and regulatority elements can occur within a single TAD, contacts between TADs are much less frequent, mainly due to the presence of an insulation protein (such as CTCF) at their boundaries. Looking at Hi-C maps, TADs look like triangles around the diagonal. According to the contact map resolutions, TADs appear as hierarchical structures with a median size around 1Mb (in mammals), as well as smaller structures usually called sub-TADs of smaller size.

-TADs calling remains a challenging task, and even if many methods have been proposed in the last decade, little overlap have been found between their results.
+TADs calling remains a challenging task, and even if many methods have been proposed in the last decade, little overlap has been found between their results.

 Currently, the pipeline proposes two approaches :

@@ -283,7 +283,7 @@ Usually, TADs results are presented as simple BED files, or bigWig files, with t

 </details>

-[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
+[MultiQC](http://multiqc.info) is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

 Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.