Chapter 3. Chemical Topic Modeling – An Unsupervised Approach Originating from Text-mining to Organize Chemical Data

2020 
In the era of big data, one of the key challenges is converting data into knowledge through organization and search. This also holds in chemistry, where novel technologies such as DNA-encoded libraries, peptide libraries and new in silico enumeration methods produce immense numbers of molecules and related data. Handling these extremely large sets of molecules is complex and requires compromises that often come at the expense of interpretability. In this chapter we introduce and discuss an alternative approach called “chemical topic modeling”, which has been adopted from the text-mining field. This probabilistic framework offers an intuitive and meaningful way to organize large data sets. On the ChEMBL database (v23), an extremely heterogeneous set of more than 1.6 million molecules, the method has proven its efficacy and robustness: a 100-topic model provided interesting topics like “proteins”, “DNA” or “steroids”. These rather general, yet sensible and human-interpretable topics can provide the basis for further investigation.