diff --git a/doc/2023_12_15_presentation.qmd b/doc/2023_12_15_presentation.qmd new file mode 100644 index 0000000000000000000000000000000000000000..43e3b04a048aff43633e4d62c030135e58f59901 --- /dev/null +++ b/doc/2023_12_15_presentation.qmd @@ -0,0 +1,623 @@ +--- +title: "Kmer diff" +author: "laurent modolo" +format: + revealjs: + transition: none + format: + highlight-style: monokai + theme: white + footer: "laurent.modolo@ens-lyon.fr" + slide-number: c/t + fontsize: 22pt +revealjs-plugins: + - pointer +--- + +## Background + +The Delattre team studies *Mesorhabditis* worms, some species of which present +atypical reproduction mechanisms. + +In a previous paper: *Males as somatic investment in a parthenogenetic nematode* [DOI: 10.1126/science.aau0099](https://doi.org/10.1126/science.aau0099), +we characterized contigs of a *de novo* genome assembly of *M. belari* as + + +:::: {.columns} +::: {.column width="40%"} +- <span style="color:blue;">Autosomal</span> chromosomes +- <span style="color:red;">X</span> chromosome +- <span style="color:green;">Y</span> chromosome +::: + +::: {.column width="60%"} +{width=350px} +::: + +:::: + +## Goal + +From raw sequencing data of male and female individuals, we want to identify $k$-mers corresponding to: + +- <span style="color:blue;">Autosomal</span> chromosomes +- <span style="color:red;">X</span> chromosome, if present +- <span style="color:green;">Y</span> chromosome, if present + +<div style="text-align: center">**We can study the chromosomal system without an assembly**</div> + +## Phylogeny + +:::: {.columns} +::: {.column width="40%"} + +::: + +::: {.column width="60%"} +- *M. belari* +- *M. monhystera* +- *M. longespiculosa* +- *M. spiculigera* +::: +:::: + +## Kmer-diff + +:::: {.columns} +::: {.column width="60%"} + +A nextflow pipeline to analyze the $k$-mer content of fastq files + +1. preprocess the fastq files +2. count the $k$-mers of each file +3. 
merge the counts to get a table of male and female $k$-mer counts +4. test the sexual models +5. identify the <span style="color:blue;">A</span>,<span style="color:red;">X</span> and <span style="color:green;">Y</span> $k$-mers + +::: + +::: {.column width="40%"} + +{width=80%} + +::: +:::: + +## Kmer-diff + +:::: {.columns} +::: {.column width="60%"} + +A nextflow pipeline to analyze the $k$-mer content of fastq files + +1. **preprocess the fastq files** +2. count the $k$-mers of each file +3. merge the counts to get a table of male and female $k$-mer counts +4. test the sexual models +5. identify the <span style="color:blue;">A</span>,<span style="color:red;">X</span> and <span style="color:green;">Y</span> $k$-mers + +::: + +::: {.column width="40%"} + +{width=80%} + +::: +:::: + +## Count problem + +:::: {.columns} +::: {.column width="75%"} + + +::: +::: {.column width="25%"} + +- The coverage is not the same between the male and the female + +::: +:::: + +## Preprocessing + +**preprocess the fastq files** + +Important for the clustering analysis: + +- **subsample each file to have the same number of reads between male and female** + +Important for the $k$-mer counting: + +- **split the fastq files into manageable size files ($10^6$ reads per file)** + + +## Kmer-diff + +:::: {.columns} +::: {.column width="60%"} + +A nextflow pipeline to analyze the $k$-mer content of fastq files + +1. preprocess the fastq files +2. **count the $k$-mers of each file** +3. merge the counts to get a table of male and female $k$-mer counts +4. test the sexual models +5. 
identify the <span style="color:blue;">A</span>,<span style="color:red;">X</span> and <span style="color:green;">Y</span> $k$-mers + +::: + +::: {.column width="40%"} + +{width=80%} + +::: +:::: + +## $k$-mer count + +[fastkmers](https://github.com/angelovangel/fastkmers) + +``` +fastkmers -k 12 file.fastq > file.csv +``` + +Run a **sliding window** of size $12$ with a step of $1$ along the reads, counting all the occurrences of each $k$-mer + +We have the letters: $A,C,T,G$ and $N$ + +<div style="text-align: center">$5^{12} = 244,140,625$ possible $k$-mers</div> + +We split the fastq files into $\sim$ $1400$ subfiles of $10^6$ reads. + +## Merging the $k$-mers + +$\sim$ $1400$ splits of $10^6$ reads $\rightarrow$ $1400$ csv files + +:::: {.columns} +::: {.column width="50%"} + +- Large number of $k$-mers +- Unordered +- Not every $k$-mer is present in every file + +Build a **suffix tree** of the $k$-mers to store them +::: + +::: {.column width="50%"} + + + +::: +:::: + +## Merging the $k$-mers + +[mergekmer](https://gitbio.ens-lyon.fr/LBMC/Delattre/mergekmer) a small Rust program that builds a **suffix tree** of the $k$-mers + +:::: {.columns} +::: {.column width="50%"} + +``` +merge fastkmers output + +Usage: mergekmer [OPTIONS] --output <OUTPUT> + +Options: + -c, --csv <CSV>... list of csv files + -o, --output <OUTPUT> merged csv file + -c, --collate collate csv file + -h, --help Print help + -V, --version Print version +``` + +Each leaf of the tree contains the count of its $k$-mer + +The tree traversal is easy to compute with a recursive function + +**We can merge all the count files of a given sex and species** + +::: +::: {.column width="50%"} + +::: +:::: + +## Merging the $k$-mers + +:::: {.columns} +::: {.column width="50%"} + +``` +merge fastkmers output + +Usage: mergekmer [OPTIONS] --output <OUTPUT> + +Options: + -c, --csv <CSV>... 
list of csv files + -o, --output <OUTPUT> merged csv file + -c, --collate collate csv file + -h, --help Print help + -V, --version Print version +``` + +In the `--collate` version, each leaf contains a list of the counts of its $k$-mer in the female and in the male of a species. + +**We can fuse the counts of the male and female for each species** + + +::: +::: {.column width="50%"} + +::: +:::: + +## Test the sexual model + +**We have the following possible models** + + +:::: {.columns} +::: {.column width="33%"} + +- <span style="color:red;">X</span><span style="color:green;">Y</span> system + + + +::: +::: {.column width="33%"} + +- <span style="color:red;">X</span>O system + + +::: +::: {.column width="33%"} + +- OO system + + + +::: +:::: + +## Test the sexual model + +:::: {.columns} +::: {.column width="33%"} + +- <span style="color:red;">X</span><span style="color:green;">Y</span> system + + + +::: +::: {.column width="33%"} + +- <span style="color:red;">X</span>O system + + +::: +::: {.column width="33%"} + +- OO system + + + +::: +:::: + +- A cluster with mean male $=$ mean female +- A cluster above the diagonal +- A cluster below the diagonal + + +## Test the sexual model + +:::: {.columns} +::: {.column width="25%"} + +**data** + + + +::: +::: {.column width="25%"} + +**XY model** + + + +::: +::: {.column width="25%"} + +**XO model** + + +::: +::: {.column width="25%"} + +**OO model** + + + +::: +:::: +:::: {.columns} +::: {.column width="50%"} + + +Bayesian information criterion (BIC) + + + +::: +::: {.column width="50%"} + +Loglikelihood + + + +::: +:::: + + +## identify the <span style="color:blue;">A</span>,<span style="color:red;">X</span> and <span style="color:green;">Y</span> $k$-mers: Simple Model + +:::: {.columns} +::: {.column width="50%"} + + + +::: +::: {.column width="50%"} + +- Can be used to compare models +- Not sensitive enough +- Cannot be used to classify individual $k$-mers + +::: +:::: + +## identify the <span 
style="color:blue;">A</span>,<span style="color:red;">X</span> and <span style="color:green;">Y</span> $k$-mers: Bayesian version + + +:::: {.columns} +::: {.column width="75%"} + + +::: +::: {.column width="25%"} + +- Can choose the prior for each cluster (mean and shape) +- Can choose the prior for the proportion between <span style="color:blue;">A</span>,<span style="color:red;">X</span> and <span style="color:green;">Y</span> +- Can choose the weight of each prior compared to the data + +::: +:::: + +## identify the <span style="color:blue;">A</span>,<span style="color:red;">X</span> and <span style="color:green;">Y</span> $k$-mers: Bayesian version + + +:::: {.columns} +::: {.column width="75%"} + + +::: +::: {.column width="25%"} + +- Can choose the prior for each cluster (mean and shape) +- Can choose the prior for the proportion between <span style="color:blue;">A</span>,<span style="color:red;">X</span> and <span style="color:green;">Y</span> +- Can choose the weight of each prior compared to the data + +::: +:::: + +## identify the <span style="color:blue;">A</span>,<span style="color:red;">X</span> and <span style="color:green;">Y</span> $k$-mers: Bayesian version + +:::: {.columns} +::: {.column width="75%"} + + +::: +::: {.column width="25%"} + +- Can choose the prior for each cluster (mean and shape) +- Can choose the prior for the proportion between <span style="color:blue;">A</span>,<span style="color:red;">X</span> and <span style="color:green;">Y</span> +- Can choose the weight of each prior compared to the data + +::: +:::: + + +## Count model + +We don't have nice Gaussian ellipses but 2D count data $(X_m, X_f)$ + +Bi-Poisson distribution + +$$(X_m, X_f) \sim \mathcal{P}(\lambda_1,\lambda_2,\lambda_3)$$ + +:::: {.columns} +::: {.column width="40%"} + +- $X_m = X_1 + Z$ +- $X_f = X_2 + Z$ + +::: +::: {.column width="20%"} + +with +::: +::: {.column width="40%"} + +- $X_1 \sim \mathcal{P}(\lambda_1)$ +- $X_2 \sim \mathcal{P}(\lambda_2)$ +- $Z \sim 
\mathcal{P}(\lambda_3)$ + +::: +:::: + +## Count model +With the EM algorithm we can estimate $Z$, the hidden variable of the model + +**E step:** + +$$z_i = E(Z_i | X_m, X_f, \lambda_1^{(k)}, \lambda_2^{(k)}, \lambda_3^{(k)})$$ +$$z_i = \lambda_3^{(k)} \frac{\mathcal{P}\left(x_{m,i}-1, x_{f,i}-1 | \lambda_1^{(k)}, \lambda_2^{(k)}, \lambda_3^{(k)}\right)}{\mathcal{P}\left(x_{m,i}, x_{f, i} | \lambda_1^{(k)}, \lambda_2^{(k)}, \lambda_3^{(k)}\right)}$$ + +**M step:** + +- $\lambda_1 = \frac{\sum_i (x_{m,i} - z_i)}{n}$ +- $\lambda_2 = \frac{\sum_i (x_{f,i} - z_i)}{n}$ +- $\lambda_3 = \frac{\sum_i z_i}{n}$ + +## Count model + +We can write the <span style="color:red;">X</span><span style="color:green;">Y</span> model as: + +$$P(h = i | (x_m, x_f), \theta)=\frac{\alpha_i f\left((x_m, x_f), \theta_i\right)}{\sum_{\ell \in \{A,X,Y\}} \alpha_{\ell} f\left((x_m, x_f), \theta_{\ell}\right)}$$ + +With the EM algorithm we can estimate $H$ and $Z$, the two hidden variables of the model + +We can also add a Bayesian prior on the $\alpha$'s with a Dirichlet distribution + +## Count data + +:::: {.columns} +::: {.column width="50%"} + + + +::: +::: {.column width="50%"} + +Works well for Poisson data: + +- Mean = Variance of the count +- no overdispersion + +::: +:::: + +## Count data + +:::: {.columns} +::: {.column width="50%"} + +{width=65%} + + + + +::: +::: {.column width="50%"} + + + + + +::: +:::: + +## Count data + +:::: {.columns} +::: {.column width="50%"} + +{width=65%} + + + + +::: +::: {.column width="50%"} + + + +{width=65%} + +::: +:::: + +## Remove the weird $k$-mers + +### mBelari + +:::: {.columns} +::: {.column width="40%"} + + + + + +::: +::: {.column width="60%"} + + + +::: +:::: + +## Remove the weird $k$-mers + +### mLongespiculosa + +:::: {.columns} +::: {.column width="40%"} + + + + +::: +::: {.column width="60%"} + + + +::: +:::: + +## Remove the weird $k$-mers + +### mSpiculigera + +:::: {.columns} +::: {.column width="40%"} + + + + +::: +::: {.column 
width="60%"} + + + +::: +:::: + +## Remove the weird $k$-mers + +### mMonhystera + +:::: {.columns} +::: {.column width="40%"} + + +::: +::: {.column width="60%"} + +::: +:::: + +## Phylogeny + +:::: {.columns} +::: {.column width="40%"} + +::: + +::: {.column width="60%"} +- *M. belari* <span style="color:blue;">A</span>,<span style="color:red;">X</span> and <span style="color:green;">Y</span> +- *M. monhystera* <span style="color:blue;">A</span> and <span style="color:green;">Y</span> +- *M. longespiculosa* <span style="color:blue;">A</span> +- *M. spiculigera* <span style="color:blue;">A</span>,<span style="color:red;">X</span> and <span style="color:green;">Y</span> +::: +:::: \ No newline at end of file diff --git a/doc/img/flowchart.png b/doc/img/flowchart.png index 6c9a8e9912f67f6ad5b930a7cfceeebcdd1481ba..c94c53de562ceec9b4620ee53160f137d6e769df 100644 Binary files a/doc/img/flowchart.png and b/doc/img/flowchart.png differ diff --git a/doc/img/mbelari.png b/doc/img/mbelari.png new file mode 100644 index 0000000000000000000000000000000000000000..dcafed9c603513d825901b1d78aebbdc07d7087f Binary files /dev/null and b/doc/img/mbelari.png differ diff --git a/doc/img/mbelari_G_clust_f.pdf b/doc/img/mbelari_G_clust_f.pdf new file mode 100644 index 0000000000000000000000000000000000000000..f8b63674ef408238e9ba2ef739e6188d3c0c27ad Binary files /dev/null and b/doc/img/mbelari_G_clust_f.pdf differ diff --git a/doc/img/mbelari_G_clust_f.png b/doc/img/mbelari_G_clust_f.png new file mode 100644 index 0000000000000000000000000000000000000000..b8f3fe8088626d501f76f2a2727cb20918d5fc3f Binary files /dev/null and b/doc/img/mbelari_G_clust_f.png differ diff --git a/doc/img/mbelari_G_clust_fm.pdf b/doc/img/mbelari_G_clust_fm.pdf new file mode 100644 index 0000000000000000000000000000000000000000..f8586ff58c52b821ede881ee85fb2a58b4315479 Binary files /dev/null and b/doc/img/mbelari_G_clust_fm.pdf differ diff --git a/doc/img/mbelari_G_clust_fm.png b/doc/img/mbelari_G_clust_fm.png 
new file mode 100644 index 0000000000000000000000000000000000000000..874c4203e04983d0a286a1eda89f04a8b9a81bfe Binary files /dev/null and b/doc/img/mbelari_G_clust_fm.png differ diff --git a/doc/img/mbelari_G_clust_m.pdf b/doc/img/mbelari_G_clust_m.pdf new file mode 100644 index 0000000000000000000000000000000000000000..3e509a273109c874dc20b4240918f04430603d1a Binary files /dev/null and b/doc/img/mbelari_G_clust_m.pdf differ diff --git a/doc/img/mbelari_G_clust_m.png b/doc/img/mbelari_G_clust_m.png new file mode 100644 index 0000000000000000000000000000000000000000..a266221305490fe02799ac357abc02dc6252dbec Binary files /dev/null and b/doc/img/mbelari_G_clust_m.png differ diff --git a/doc/img/mbelari_log1p.png b/doc/img/mbelari_log1p.png new file mode 100644 index 0000000000000000000000000000000000000000..5a2514f12f9e159fee80bafc77b4b36247db96c8 Binary files /dev/null and b/doc/img/mbelari_log1p.png differ diff --git a/doc/img/mlongespiculosa.png b/doc/img/mlongespiculosa.png new file mode 100644 index 0000000000000000000000000000000000000000..577dbd27233140bb5f35d5668a9bba2eea4d647f Binary files /dev/null and b/doc/img/mlongespiculosa.png differ diff --git a/doc/img/mlongespiculosa_G_clust_f.pdf b/doc/img/mlongespiculosa_G_clust_f.pdf new file mode 100644 index 0000000000000000000000000000000000000000..c0613c012cf5b67b89d9eb70c812549db27a1d03 Binary files /dev/null and b/doc/img/mlongespiculosa_G_clust_f.pdf differ diff --git a/doc/img/mlongespiculosa_G_clust_f.png b/doc/img/mlongespiculosa_G_clust_f.png new file mode 100644 index 0000000000000000000000000000000000000000..60343a78ccde7e937dc6ef41b43f9d631a4224f0 Binary files /dev/null and b/doc/img/mlongespiculosa_G_clust_f.png differ diff --git a/doc/img/mlongespiculosa_G_clust_fm.pdf b/doc/img/mlongespiculosa_G_clust_fm.pdf new file mode 100644 index 0000000000000000000000000000000000000000..e842a1442808b3241264ab539016306563420e0e Binary files /dev/null and b/doc/img/mlongespiculosa_G_clust_fm.pdf differ diff 
--git a/doc/img/mlongespiculosa_G_clust_fm.png b/doc/img/mlongespiculosa_G_clust_fm.png new file mode 100644 index 0000000000000000000000000000000000000000..3b1a5b88a787af27ac215844df79d343b1eda416 Binary files /dev/null and b/doc/img/mlongespiculosa_G_clust_fm.png differ diff --git a/doc/img/mlongespiculosa_G_clust_m.pdf b/doc/img/mlongespiculosa_G_clust_m.pdf new file mode 100644 index 0000000000000000000000000000000000000000..e936ba703fd9ad52f77e74e7a4238091b09c7e3a Binary files /dev/null and b/doc/img/mlongespiculosa_G_clust_m.pdf differ diff --git a/doc/img/mlongespiculosa_G_clust_m.png b/doc/img/mlongespiculosa_G_clust_m.png new file mode 100644 index 0000000000000000000000000000000000000000..692ffce567ea35808be4bfd4a6fcde8cefdbd533 Binary files /dev/null and b/doc/img/mlongespiculosa_G_clust_m.png differ diff --git a/doc/img/mlongespiculosa_log1p.png b/doc/img/mlongespiculosa_log1p.png new file mode 100644 index 0000000000000000000000000000000000000000..245bb3ee6ad09271a48ab66bdf73db157936e17b Binary files /dev/null and b/doc/img/mlongespiculosa_log1p.png differ diff --git a/doc/img/mmonhystera.png b/doc/img/mmonhystera.png new file mode 100644 index 0000000000000000000000000000000000000000..97a7afbbb698b82afe08cfa16a55d0c9393d1f78 Binary files /dev/null and b/doc/img/mmonhystera.png differ diff --git a/doc/img/mmonhystera_G_clust_f.pdf b/doc/img/mmonhystera_G_clust_f.pdf new file mode 100644 index 0000000000000000000000000000000000000000..bc2a6298a23bf4eabf1b3fb772109dd818bbbfc1 Binary files /dev/null and b/doc/img/mmonhystera_G_clust_f.pdf differ diff --git a/doc/img/mmonhystera_G_clust_f.png b/doc/img/mmonhystera_G_clust_f.png new file mode 100644 index 0000000000000000000000000000000000000000..06f78d77835ec9a1886222c9ce2f03b03e90b2ec Binary files /dev/null and b/doc/img/mmonhystera_G_clust_f.png differ diff --git a/doc/img/mmonhystera_G_clust_fm.pdf b/doc/img/mmonhystera_G_clust_fm.pdf new file mode 100644 index 
0000000000000000000000000000000000000000..2055d332ab7a1d805cdbee20e371368b7a5e0858 Binary files /dev/null and b/doc/img/mmonhystera_G_clust_fm.pdf differ diff --git a/doc/img/mmonhystera_G_clust_fm.png b/doc/img/mmonhystera_G_clust_fm.png new file mode 100644 index 0000000000000000000000000000000000000000..ff5b7bb0d84de7a9c7e1f6385f78707e6be43a89 Binary files /dev/null and b/doc/img/mmonhystera_G_clust_fm.png differ diff --git a/doc/img/mmonhystera_G_clust_m.pdf b/doc/img/mmonhystera_G_clust_m.pdf new file mode 100644 index 0000000000000000000000000000000000000000..d753698874c1612bf9e2aa86a01f7bb488f91e9c Binary files /dev/null and b/doc/img/mmonhystera_G_clust_m.pdf differ diff --git a/doc/img/mmonhystera_G_clust_m.png b/doc/img/mmonhystera_G_clust_m.png new file mode 100644 index 0000000000000000000000000000000000000000..8de1cb0a92240ef97654216dbfe3d26f9a767494 Binary files /dev/null and b/doc/img/mmonhystera_G_clust_m.png differ diff --git a/doc/img/mmonhystera_log1p.png b/doc/img/mmonhystera_log1p.png new file mode 100644 index 0000000000000000000000000000000000000000..2804b61728f12f2fc8f58141b8766efd65a611e3 Binary files /dev/null and b/doc/img/mmonhystera_log1p.png differ diff --git a/doc/img/mspiculigera.png b/doc/img/mspiculigera.png new file mode 100644 index 0000000000000000000000000000000000000000..ca4b7afbc2655f155bedaca0bd686fc57ea023e6 Binary files /dev/null and b/doc/img/mspiculigera.png differ diff --git a/doc/img/mspiculigera_G_clust_f.pdf b/doc/img/mspiculigera_G_clust_f.pdf new file mode 100644 index 0000000000000000000000000000000000000000..611a6e6a8aed8c23fb1e4df1045572a476902f80 Binary files /dev/null and b/doc/img/mspiculigera_G_clust_f.pdf differ diff --git a/doc/img/mspiculigera_G_clust_f.png b/doc/img/mspiculigera_G_clust_f.png new file mode 100644 index 0000000000000000000000000000000000000000..03952465d6493475f2589f5cdbbb7ca1ab8dfab6 Binary files /dev/null and b/doc/img/mspiculigera_G_clust_f.png differ diff --git 
a/doc/img/mspiculigera_G_clust_fm.pdf b/doc/img/mspiculigera_G_clust_fm.pdf new file mode 100644 index 0000000000000000000000000000000000000000..f0dbd61d8abb1b483c5a0a0ab26eeb0fb37640d9 Binary files /dev/null and b/doc/img/mspiculigera_G_clust_fm.pdf differ diff --git a/doc/img/mspiculigera_G_clust_fm.png b/doc/img/mspiculigera_G_clust_fm.png new file mode 100644 index 0000000000000000000000000000000000000000..5aa1fc84388c84c9bc8aef05bc2fb61e22e0538b Binary files /dev/null and b/doc/img/mspiculigera_G_clust_fm.png differ diff --git a/doc/img/mspiculigera_G_clust_m.pdf b/doc/img/mspiculigera_G_clust_m.pdf new file mode 100644 index 0000000000000000000000000000000000000000..906073b151efd6236be145d70c3b0d550a81727c Binary files /dev/null and b/doc/img/mspiculigera_G_clust_m.pdf differ diff --git a/doc/img/mspiculigera_G_clust_m.png b/doc/img/mspiculigera_G_clust_m.png new file mode 100644 index 0000000000000000000000000000000000000000..39d2938244ba43f4f49f9791bcd0e237e9026972 Binary files /dev/null and b/doc/img/mspiculigera_G_clust_m.png differ diff --git a/doc/img/mspiculigera_log1p.png b/doc/img/mspiculigera_log1p.png new file mode 100644 index 0000000000000000000000000000000000000000..dd9ce5a2f8f73c13caaa266535cdf7faf167e639 Binary files /dev/null and b/doc/img/mspiculigera_log1p.png differ diff --git a/doc/img/poisson_clustering_XY.png b/doc/img/poisson_clustering_XY.png new file mode 100644 index 0000000000000000000000000000000000000000..9688793ef48908d30549cf046468d4cc04ec2e04 Binary files /dev/null and b/doc/img/poisson_clustering_XY.png differ