Impossibility of phylogeny reconstruction from $k$-mer counts

2020 
We consider phylogeny estimation under a two-state model of sequence evolution by site substitution on a tree. In the asymptotic regime where the sequence lengths tend to infinity, we show that for any fixed $k$ no statistically consistent phylogeny estimation is possible from $k$-mer counts of the leaf sequences alone. Formally, we establish that the joint leaf distributions of $k$-mer counts on two distinct trees have total variation distance bounded away from $1$ as the sequence length tends to infinity. That is, the two distributions cannot be distinguished with probability going to one in that asymptotic regime. Our results are information-theoretic: they imply an impossibility result for any reconstruction method using only $k$-mer counts at the leaves.
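To make concrete what data a method restricted to $k$-mer counts would see, here is a minimal sketch that tallies overlapping $k$-mers in a two-state sequence. The function name `kmer_counts`, the encoding of the two states as the characters '0' and '1', and the example leaf sequences are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq: str, k: int) -> dict[str, int]:
    """Count overlapping k-mers in a sequence over the assumed alphabet {'0', '1'}."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Slide a window of width k over the sequence and tally each substring.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Report every possible k-mer, including those that never occur.
    kmers = ("".join(p) for p in product("01", repeat=k))
    return {kmer: counts.get(kmer, 0) for kmer in kmers}


# Hypothetical leaf sequences, used only for illustration.
leaves = {"a": "0110100", "b": "0110110"}
profiles = {name: kmer_counts(seq, 2) for name, seq in leaves.items()}
```

The count vector discards the positions of the $k$-mers, so distinct sequences can share the same profile: for example, `kmer_counts("01", 1)` and `kmer_counts("10", 1)` are equal.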