diff --git a/README.md b/README.md
index 5005fbe967c401ac77c58afdd004d1aff07cd27c..74b25285e92da232d5c6368f2140a4db070e0fe3 100644
--- a/README.md
+++ b/README.md
@@ -8,44 +8,74 @@ SPDX-License-Identifier: CC0-1.0

 This is a reuse of [Microsplit](https://pypi.org/project/microsplit/) structure.

 Hi-Classifier is a command-line tool designed to identify, classify, and manage paired reads in BAM files derived from Hi‑C experiments. It follows the logic and structure of the [Parasplit](https://pypi.org/project/parasplit/) tool but is tailored for Micro-C data. The tool reads alignment files (SAM, BAM, or CRAM) using `pysam` and identifies events of soft-clipping or hard-clipping.

+---
+
+## Why another Hi-C parser?
+
+The usual valid/invalid convention ignores >30% of reads.
+Some categories of these ignored reads (dangling, self-circle, uncut sites) fluctuate with digestion and compaction.
+Hi-Classifier is a fast, **site-aware** classifier able to:
+
+* label every pair (cis/trans-up/down, dangling, self-circle, re-joined, other)
+* count categories **per restriction site**
+* split a BAM into category-specific BAMs for downstream QC
+* stream-process paired BAMs (R1/R2), potentially within ≤8 GB RAM
+
+---
+
 ## Features

-- **Parallel Processing**: Microsplit utilizes parallel processing to enhance performance and efficiency.
-- **Error Margin Handling**: Adds a fixed number of base pairs to new fragments to account for potential over-mapping by Bowtie2, ensuring more accurate downstream analysis.
-- **Output Paired Reads**: Outputs both end-to-end aligned pairs and newly generated fragment pairs.
+- **Parallel Processing**: Hi-Classifier utilizes parallel processing to enhance performance and efficiency.
+- **Error Margin Handling**:
+- **Output Paired Reads**:

-## Installation
-Hi-Classifier is available on PyPI and can be installed using pip:
+## Quick start

 ```bash
-pip install hi-classifier
+hi-classifier \
+  -1 sample_R1.bam -2 sample_R2.bam \
+  -f genome.fa --enzyme dpnII \
+  -o out/prefix --num_threads 8
 ```

-## Usage
+Outputs

-Before using Hi-Classifier, you need to perform an initial alignment of reads using mapper to obtain explicit BAM files. Below is an example of how to use Microsplit from the command line:
+```
+prefix_counts.tsv        # matrix [site, class]
+prefix_classified.bam    # BAM with CAT tag (optional split per class)
+```

-```bash
-microsplit --bam_for_file path/to/forward.bam \
-           --bam_rev_file path/to/reverse.bam \
-           --output_forward path/to/output_forward.fastq.gz \
-           --output_reverse path/to/output_reverse.fastq.gz \
-           --num_threads 8
+## CLI
+
+```text
+--bam_for_file   BAM of Read 1 (required)
+--bam_rev_file   BAM of Read 2 (required)
+--fasta_ref      Reference genome (required)
+--enzyme         dpnII | mbol | arima
+--len_max        discard pairs > L [1500]
+--mapq_score     minimum MAPQ [1]
+--tolerance      ±bp for dangling [1]
+--num_threads    default 6
 ```

-### Command-Line Arguments
-- `--bam_for_file`: Path to the forward BAM file.
-- `--bam_rev_file`: Path to the reverse BAM file.
-- `--output_forward`: Path to the output forward FastQ file.
-- `--output_reverse`: Path to the output reverse FastQ file.
-- `--num_threads`: Total number of threads for parallel processing.
+## Performance
+
+* Less than 4 hours on 20 cores to process the data from HiConfidence (https://doi.org/10.1093/bib/bbad044)

-## Methodology
+## Class logic (default rules)

-TO BE COMPLETED
+| Class        | Orientation       | Distance | Same frag? | Potential origin            |
+| ------------ | ----------------- | -------- | ---------- | --------------------------- |
+| Valid        | any               | > len    | no         | bona fide contacts          |
+| Dangling     | ↘↙ (convergent)   | 0        | yes        | failed ligation             |
+| Self-circle  | ↗↖ (divergent)    | ≤ 1 site | yes        | self-circularization        |
+| Re-joined    | ↘↙                | adjacent | no         | immediate re-ligation       |
+| Cis-Up/Down… | mixed             | > len    | no         | four intrachromosomal bins  |
+| Trans-…      | mixed             | N/A      | no         | interchromosomal            |
+| Other        | –                 | –        | –          | everything else             |

 ## License