What do we gain when tolerating loss? The information bottleneck, lossy compression, and detecting horizontal gene transfer

2021 
Most microbes have the capacity to acquire genetic material from their environment. Recombination of foreign DNA yields genomes that are, at least in part, incongruent with the vertical history of their species. Dominant approaches for detecting such horizontal gene transfer (HGT) and recombination are phylogenetic, requiring a painstaking series of analyses including sequence-based clustering, alignment, and phylogenetic tree reconstruction. Given the breakneck pace of genome sequencing, these traditional pan-genomic methods do not scale. Here we propose an alignment-free and tree-free technique based on the sequential information bottleneck (SIB), an optimization procedure designed to extract some portion of relevant information from one random variable conditioned on another. In our case, the joint probability distribution tabulates occurrence counts of k-mers with respect to their genomes of origin (the relevance information), with the expectation that HGT and recombination will create a strong signal that distinguishes certain sets of co-occurring k-mers. The technique is conceptualized as a rate-distortion problem: we measure distortion in the relevance information as k-mers are compressed into clusters based on their co-occurrence in the source genomes. This approach is similar to topic mining in the Natural Language Processing (NLP) literature. The result is model-free, unsupervised compression of k-mers into genomic topics that trace tracts of shared genome sequence, whether vertically or horizontally acquired. We examine the performance of SIB on simulated data and on the known large-scale recombination event that formed the Staphylococcus aureus ST239 clade. We use this technique to detect recombined regions and recover the vertically inherited core genome with a fraction of the computing power required by current phylogenetic methods.
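To make the rate-distortion framing concrete, the following is a minimal sketch and not the authors' implementation. It tabulates k-mer counts per genome into a joint distribution, then measures how much information about genome of origin survives when k-mers are merged into clusters. The toy sequences, the function names, and the clustering rule (grouping k-mers by the set of genomes they occur in) are all assumptions made for illustration; the actual SIB procedure instead reassigns k-mers to clusters iteratively to optimize its objective.

```python
from collections import Counter
from math import log2

def kmer_counts(seq, k):
    """Count overlapping k-mers in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def joint_distribution(genomes, k):
    """Return p(kmer, genome) as a dict keyed by (kmer, genome_name)."""
    counts = {}
    for name, seq in genomes.items():
        for kmer, n in kmer_counts(seq, k).items():
            counts[(kmer, name)] = n
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution keyed by (x, y)."""
    px, py = Counter(), Counter()
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def compress(joint, assignment):
    """Merge k-mers into clusters: p(t, g) sums p(kmer, g) over k-mers in t."""
    merged = Counter()
    for (kmer, g), p in joint.items():
        merged[(assignment[kmer], g)] += p
    return dict(merged)

def presence_pattern_clusters(joint):
    """Illustrative stand-in for SIB: cluster k-mers by the set of genomes
    in which they occur (hypothetical rule, not the paper's optimizer)."""
    present = {}
    for (kmer, g) in joint:
        present.setdefault(kmer, set()).add(g)
    return {kmer: frozenset(gs) for kmer, gs in present.items()}

# Toy genomes (made up): g1 and g2 share a prefix, g3 is unrelated.
genomes = {
    "g1": "ACGTACGTTT",
    "g2": "ACGTACGAAA",
    "g3": "GGGCCCGGGC",
}

joint = joint_distribution(genomes, k=3)
i_full = mutual_information(joint)            # I(K;G) before compression

topics = presence_pattern_clusters(joint)
i_topics = mutual_information(compress(joint, topics))   # I(T;G)

one_cluster = {kmer: "all" for kmer in topics}
i_one = mutual_information(compress(joint, one_cluster)) # everything merged

# Distortion here is the relevance information lost by compressing.
distortion = i_full - i_topics
print(f"I(K;G)={i_full:.3f} bits, I(T;G)={i_topics:.3f} bits, "
      f"lost={distortion:.3f} bits, clusters={len(set(topics.values()))}")
```

Merging k-mers can never increase the information they carry about genome of origin, so the loss is non-negative, and collapsing everything into a single cluster drives the retained information to zero. A bottleneck-style method looks for a clustering with few clusters that keeps this loss small.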