Kmer diff

Laurent Modolo

Background

The Delattre team studies Mesorhabditis worms, some species of which have atypical reproduction mechanisms.

In a previous paper (Males as somatic investment in a parthenogenetic nematode, DOI: 10.1126/science.aau0099), we characterized the contigs of a de novo genome assembly of M. belari as

  • Autosomal chromosomes
  • X chromosome
  • Y chromosome

Goal

From raw sequencing data of male and female individuals, we want to identify the \(k\)-mers corresponding to:

  • Autosomal chromosomes
  • X chromosome, if present
  • Y chromosome, if present

We can study the chromosomal system without an assembly

Phylogeny

Kmer-diff

A Nextflow pipeline to analyse the \(k\)-mer content of fastq files

  1. preprocess the fastq files
  2. count the \(k\)-mers of each file
  3. merge the counts to get a table of male and female \(k\)-mer counts
  4. test the sexual models
  5. identify the A, X and Y \(k\)-mers

Kmer-diff

Preprocessing

preprocess the fastq files

Important for the clustering analysis:

  • subsample to have the same number of reads between male and female

Important for the \(k\)-mer counting:

  • split the fastq files into files of manageable size (\(10^6\) reads per file)
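The two preprocessing steps above can be sketched in Python as follows. This is only an illustration, not the pipeline's actual code: the function names (`read_fastq`, `subsample`, `equalize`, `split_reads`) are hypothetical, and the sketch assumes plain 4-line fastq records that fit in memory.

```python
import random
from itertools import islice

def read_fastq(lines):
    """Yield reads as 4-line tuples (header, sequence, separator, quality)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if len(record) < 4:  # end of input (or truncated last record)
            return
        yield record

def subsample(reads, n, seed=0):
    """Draw n reads at random without replacement, keeping the input order."""
    reads = list(reads)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(reads)), min(n, len(reads))))
    return [reads[i] for i in keep]

def equalize(male, female, seed=0):
    """Subsample both sexes down to the size of the smaller read set."""
    n = min(len(male), len(female))
    return subsample(male, n, seed), subsample(female, n, seed)

def split_reads(reads, chunk_size=10**6):
    """Yield successive lists of at most chunk_size reads."""
    it = iter(reads)
    while chunk := list(islice(it, chunk_size)):
        yield chunk
```

Each chunk produced by `split_reads` would then be written to its own file and counted independently.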

\(k\)-mer count

fastkmers

fastkmers -k 12 file.fastq > file.csv

Runs a sliding window of size \(12\) with a step of \(1\) along the reads, counting all the occurrences of each \(k\)-mer
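The sliding-window count can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the fastkmers implementation; the function name `count_kmers` is hypothetical and reads are assumed to be plain strings.

```python
from collections import Counter

def count_kmers(reads, k=12):
    """Count every k-mer seen by a window of size k moved by steps of 1."""
    counts = Counter()
    for seq in reads:
        # A read of length n contains n - k + 1 windows (none if n < k).
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts
```

For example, `count_kmers(["ACGTAC"], k=3)` sees the four windows ACG, CGT, GTA and TAC, once each.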

We have \(5\) letters: \(A,C,T,G\) and \(N\)

\(5^{12} = 244,140,625\) possible \(k\)-mers

We split the fastq files into \(\sim\) \(1400\) subfiles of \(10^6\) reads.

Merging the \(k\)-mers

\(\sim\) \(1400\) splits of \(10^6\) reads \(\rightarrow\) \(1400\) csv files

  • Large number of \(k\)-mers
  • Unordered
  • Not the same \(k\)-mers are present in every file

mergekmer: a small Rust program that builds a suffix tree of the \(k\)-mers

merge fastkmers output

Usage: mergekmer [OPTIONS] --output <OUTPUT>

Options:
  -c, --csv <CSV>...     list of csv files
  -o, --output <OUTPUT>  merged csv file
  -c, --collate          collate csv file
  -h, --help             Print help
  -V, --version          Print version

Each leaf of the tree contains the count of the corresponding \(k\)-mer

The tree traversal is easy to compute with a recursive function

We can merge the counts of a given sex for each species
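The tree-based merge can be illustrated with a short Python sketch. It uses a plain prefix tree (trie) keyed by the bases of each \(k\)-mer, which may differ from the structure mergekmer actually builds; the names `Node`, `insert`, `merge` and `traverse` are hypothetical, and each input table is assumed to be a mapping from \(k\)-mer to count.

```python
class Node:
    """One tree node; only nodes at depth k (leaves) carry a count."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0

def insert(root, kmer, count):
    """Walk down one base at a time and add count to the leaf."""
    node = root
    for base in kmer:
        node = node.children.setdefault(base, Node())
    node.count += count

def merge(tables):
    """Build one tree from several {kmer: count} tables."""
    root = Node()
    for table in tables:
        for kmer, count in table.items():
            insert(root, kmer, count)
    return root

def traverse(node, prefix=""):
    """Recursively yield (kmer, count) pairs in alphabetical order."""
    if not node.children:
        if prefix:  # skip the root of an empty tree
            yield prefix, node.count
        return
    for base in sorted(node.children):
        yield from traverse(node.children[base], prefix + base)
```

Because shared prefixes are stored once and the traversal visits children in sorted order, the merged output is ordered even though the inputs are not.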

In the --collate version, each leaf contains a list of the counts of the \(k\)-mer in the female and in the male of a species

We can fuse the counts of the male and female for each species
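The collated table can be pictured with the following Python sketch. It is an illustration only: it uses a dictionary instead of a tree, the function name `collate` is hypothetical, and the column order (female first, then male) is an assumption.

```python
def collate(female, male):
    """Return {kmer: [female_count, male_count]}, using 0 for absent k-mers."""
    table = {}
    for column, counts in enumerate((female, male)):
        for kmer, count in counts.items():
            # Create the [0, 0] row on first sight, then fill the right column.
            table.setdefault(kmer, [0, 0])[column] += count
    return table
```

A \(k\)-mer seen in only one sex keeps a zero in the other column, which is the signal later used to separate sex-linked from autosomal \(k\)-mers.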

Test the sexual model

We have the following possible models

  • XY system

  • XO system

  • OO system

In the plot of male against female \(k\)-mer counts, we look for:

  • A cluster with mean male \(=\) mean female
  • A cluster above the diagonal
  • A cluster below the diagonal

Test the sexual model

[Figure: the data and the fitted XY, XO and OO models, compared with the Bayesian information criterion (BIC) and the log-likelihood]
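As a reminder of how the two criteria relate, the BIC penalises the log-likelihood by the number of free parameters: \(\mathrm{BIC} = p \ln(n) - 2 \ln(L)\), and the model with the lowest BIC is preferred. The Python sketch below shows this comparison; the function names and the shape of the `fits` argument are hypothetical, and the log-likelihood and parameter count of each model are assumed to come from a fit done elsewhere.

```python
import math

def bic(log_likelihood, n_parameters, n_observations):
    """Bayesian information criterion: p * ln(n) - 2 * ln(L)."""
    return n_parameters * math.log(n_observations) - 2.0 * log_likelihood

def best_model(fits, n_observations):
    """fits maps a model name to (log_likelihood, n_parameters).

    Returns the name with the lowest BIC and the BIC of every model.
    """
    scores = {
        name: bic(ll, p, n_observations) for name, (ll, p) in fits.items()
    }
    return min(scores, key=scores.get), scores
```

A model with more clusters always reaches a log-likelihood at least as high, so the penalty term is what allows the XY, XO and OO fits to be compared fairly.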

Identify the A, X and Y \(k\)-mers

Model comparison (BIC, log-likelihood):

  • Can be used to compare models
  • Not sensitive enough
  • Cannot be used to classify individual \(k\)-mers

Identify the A, X and Y \(k\)-mers

  • Can choose the prior for each cluster (mean and shape)
  • Can choose the prior for the proportion between A,X and Y
  • Can choose the weight of each prior compared to the data
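To make the role of the prior weight concrete, here is a small Python sketch for the proportions between A, X and Y. It assumes a Dirichlet prior on the proportions, which the slides do not state, and the function name and arguments are hypothetical. The prior weight acts as a number of pseudo-observations added to the observed cluster sizes.

```python
def posterior_proportions(cluster_sizes, prior_proportions, prior_weight):
    """Posterior mean of the mixture proportions under a Dirichlet prior.

    cluster_sizes: number of k-mers currently assigned to each cluster.
    prior_proportions: expected share of each cluster (sums to 1).
    prior_weight: total pseudo-count given to the prior.
    """
    alpha = [prior_weight * p for p in prior_proportions]
    total = sum(cluster_sizes) + sum(alpha)
    return [(n + a) / total for n, a in zip(cluster_sizes, alpha)]
```

With a weight of \(0\) the result is simply the observed proportions; as the weight grows relative to the number of \(k\)-mers, the result moves towards the prior proportions.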

Count problem

  • The coverage is not the same between the male and the female

Kmer-diff

A Nextflow pipeline to analyse the \(k\)-mer content of fastq files

  1. preprocess the fastq files
    • subsample to have the same number of reads between male and female
  2. count the \(k\)-mers of each file
  3. merge the counts to get a table of male and female \(k\)-mer counts
  4. test the sexual models
  5. identify the A, X and Y \(k\)-mers