Distributed representations of protein domains and genomes and their compositionality

2019 
Learning algorithms have at their disposal an ever-growing number of metagenomes for biomining and the study of microbial functions. We propose a novel representation of function called nanotext that scales to very large data sets while capturing precise functional relationships. These relationships are learned from a corpus of 32 thousand genome assemblies with 145 million protein domains. We treat the protein domains in a genome like words in a document, assuming that protein domains in a similar context have similar "meaning". The Word2Vec embedding algorithm distributes this meaning over a vector of numbers. These vectors not only encode function but can also be used to predict complex genomic features and phenotypes. We apply nanotext to data from the Tara Oceans expedition to predict plausible culture media and growth temperatures for microorganisms from their metagenome-assembled genomes (MAGs) alone. nanotext is freely released under a BSD licence (https://github.com/phiweger/nanotext).
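The "domains as words, genomes as documents" idea can be illustrated with a minimal sketch. This is not the nanotext API and it does not run Word2Vec: as a simplifying assumption, it substitutes plain co-occurrence counts for learned embeddings, so that the example needs only the Python standard library. The genome names, domain identifiers, domain order, and the helper functions `context_counts`, `cosine`, and `genome_vector` are all hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy corpus: each "genome" is an ordered list of Pfam-style
# domain identifiers. IDs and order are made up for illustration only.
genomes = {
    "genome_a": ["PF00001", "PF00002", "PF00003", "PF00004"],
    "genome_b": ["PF00001", "PF00005", "PF00003", "PF00004"],
    "genome_c": ["PF00006", "PF00007", "PF00008", "PF00009"],
}


def context_counts(corpus, window=1):
    """For every domain, count the domains seen within `window` positions."""
    counts = defaultdict(Counter)
    for domains in corpus.values():
        for i, target in enumerate(domains):
            lo = max(0, i - window)
            hi = min(len(domains), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][domains[j]] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def genome_vector(domains, counts):
    """Compose a genome vector by summing the vectors of its domains."""
    total = Counter()
    for domain in domains:
        total.update(counts[domain])
    return total


counts = context_counts(genomes)

# PF00002 and PF00005 never co-occur, but they share the same neighbours
# (PF00001 and PF00003), so their context vectors point the same way.
sim_shared_context = cosine(counts["PF00002"], counts["PF00005"])
sim_unrelated = cosine(counts["PF00002"], counts["PF00007"])

vec_a = genome_vector(genomes["genome_a"], counts)
vec_b = genome_vector(genomes["genome_b"], counts)
vec_c = genome_vector(genomes["genome_c"], counts)

print("domain similarity, shared context:", sim_shared_context)
print("domain similarity, unrelated:", sim_unrelated)
print("genome_a vs genome_b:", cosine(vec_a, vec_b))
print("genome_a vs genome_c:", cosine(vec_a, vec_c))
```

In this toy setting, two domains that appear in the same neighbourhood receive similar vectors even though they never occur together, which is the distributional assumption the abstract describes. A trained embedding such as Word2Vec would replace the sparse counts with dense, low-dimensional vectors, but composing a genome representation from its domain vectors would follow the same pattern.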