Artificial intelligence (AI) is undergoing a revolution thanks to breakthroughs of machine learning algorithms in computer vision, speech recognition, natural language processing and generative modelling. Recent work on publicly available pharmaceutical data has shown that AI methods are highly promising for drug target prediction. However, the quality of public data may differ from that of industry data owing to different labs reporting measurements, different measurement techniques, fewer samples, and less diverse and specialized assays. As part of a European-funded project (ExCAPE) that brought together expertise from the pharmaceutical industry, machine learning, and high-performance computing, we investigated how well machine learning models obtained from public data can be transferred to internal pharmaceutical industry data. Our results show that machine learning models trained on public data can indeed maintain their predictive power to a large degree when applied to industry data. Moreover, we observed that deep learning models outperformed comparable models trained with other machine learning algorithms when applied to internal pharmaceutical company datasets. To our knowledge, this is the first large-scale study to evaluate the potential of machine learning, and especially deep learning, directly in industry-scale settings and to investigate the transferability of target prediction models learned from public data to industrial bioactivity prediction pipelines.
A significant challenge in high‐throughput screening (HTS) campaigns is the identification of assay technology interference compounds. A Compound Interfering with an Assay Technology (CIAT) gives false readouts in many assays. CIATs are often considered viable hits and investigated in follow‐up studies, thus impeding research and wasting resources. In this study, we developed a machine‐learning (ML) model to predict CIATs for three assay technologies. The model was trained on known CIATs and non‐CIATs (NCIATs) identified in artefact assays and described by their 2D structural descriptors. Common methods for identifying CIATs are based on statistical analysis of historical primary screening data and do not consider experimental assays that identify CIATs. Our results show successful prediction of CIATs for existing and novel compounds and provide a complementary and wider set of predicted CIATs compared to BSF, a published structure‐independent model, and to the PAINS substructural filters. Our analysis is an example of how well‐curated datasets can provide powerful predictive models despite their relatively small size.
Machine learning models that predict the bioactivity of chemical compounds are now standard tools for cheminformaticians and computational medicinal chemists. Multi-task and federated learning are promising machine learning approaches that allow privacy-preserving usage of large amounts of data from diverse sources, which is crucial for achieving good generalization and high-performance results. Using large, real-world data sets from six pharmaceutical companies, here we investigate different strategies for averaging weighted task loss functions to train multi-task bioactivity classification models. The weighting strategies should be suitable for federated learning and ensure that learning efforts are well distributed even if data are diverse. Comparing several approaches using weights that depend on the number of sub-tasks per assay, task size, and class balance, respectively, we find that a simple sub-task weighting approach leads to robust model performance for all investigated data sets and is especially suited for federated learning.
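As a minimal sketch of what averaging weighted task losses could look like, the snippet below assumes that sub-task weighting means each task receives the weight 1 / (number of sub-tasks of its assay), so that every assay contributes the same total weight. This reading, the function names, and the toy loss values are illustrative assumptions, not the scheme used in the study.

```python
from collections import Counter


def subtask_weights(task_to_assay):
    """Assumed scheme: weight of a task = 1 / (number of sub-tasks of its assay)."""
    counts = Counter(task_to_assay.values())
    return {task: 1.0 / counts[assay] for task, assay in task_to_assay.items()}


def weighted_mean_loss(task_losses, weights):
    """Weighted average of per-task losses, normalised by the total weight."""
    total_weight = sum(weights[task] for task in task_losses)
    weighted_sum = sum(weights[task] * loss for task, loss in task_losses.items())
    return weighted_sum / total_weight


# Hypothetical example: assay A is split into four sub-tasks, assay B into one.
task_to_assay = {"A_1": "A", "A_2": "A", "A_3": "A", "A_4": "A", "B_1": "B"}
task_losses = {"A_1": 0.2, "A_2": 0.2, "A_3": 0.2, "A_4": 0.2, "B_1": 0.8}

weights = subtask_weights(task_to_assay)
plain_mean = sum(task_losses.values()) / len(task_losses)
weighted_mean = weighted_mean_loss(task_losses, weights)
```

In this toy case the unweighted mean is dominated by assay A simply because it has more sub-tasks, whereas the weighted mean treats both assays equally.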
This study aims to improve upon existing activity prediction methods by augmenting chemical structure fingerprints with bioactivity-based fingerprints derived from high-throughput screening (HTS) data (HTSFPs). The HTSFPs were generated from HTS data obtained from PubChem and combined with an ECFP4 structural fingerprint. The combined experimental and structural fingerprint (CESFP) was benchmarked against the individual ECFP4 and HTSFP fingerprints. Results showed that the CESFP has improved predictive performance as well as scaffold hopping capability. The CESFP identified unique compounds compared to both the ECFP4 and the HTSFP fingerprint, indicating synergistic effects between the two fingerprints. A feature importance analysis showed that a small subset of the HTSFP features contributes most to the overall performance of the CESFP. This combined approach allows for activity prediction of compounds with only sparse HTSFPs due to the supporting effect from the structural fingerprint.
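The idea of a combined fingerprint can be sketched as a simple concatenation of a structural bit vector and an HTS outcome vector. The encoding below (1 for active, -1 for inactive, 0 for untested) and the function name are assumptions chosen for illustration; the abstract does not state how HTSFP values are encoded.

```python
def combine_fingerprints(ecfp_on_bits, n_ecfp_bits, hts_outcomes, n_assays):
    """Concatenate a structural bit vector with an HTS outcome vector.

    ecfp_on_bits: set of indices that are set in the structural fingerprint.
    hts_outcomes: dict mapping assay index to 1 (active) or -1 (inactive);
                  assays in which the compound was not tested are encoded as 0
                  (an assumed convention).
    """
    structural = [1 if i in ecfp_on_bits else 0 for i in range(n_ecfp_bits)]
    experimental = [hts_outcomes.get(a, 0) for a in range(n_assays)]
    return structural + experimental


# Toy compound tested in two of four assays.
combined = combine_fingerprints({1, 4}, 8, {0: 1, 2: -1}, 4)

# Toy compound with no HTS data: only the structural part carries information.
sparse = combine_fingerprints({1, 4}, 8, {}, 4)
```

The second call illustrates the sparse case: the experimental part is all zeros, but the structural part is still available to a downstream model.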
Natural products are made by nature through interaction with biosynthetic enzymes. They also exert their effect as drugs by interaction with proteins. To address the question "Do biosynthetic enzymes and therapeutic targets share common mechanisms for the molecular recognition of natural products?", we compared the active site of five flavonoid biosynthetic enzymes to 8077 ligandable binding sites in the Protein Data Bank using two three-dimensional-based methods (SiteAlign and Shaper). Virtual screenings efficiently retrieved known flavonoid targets, in particular protein kinases. A consistent performance obtained for variable site descriptions (presence/absence of water, variable boundaries, or small structural changes) indicated that the methods are robust and thus well suited for the identification of potential target proteins of natural products. Finally, our results suggested that flavonoid binding is not primarily driven by shape, but rather by the recognition of common anchoring points.
Drug repurposing has become an important branch of drug discovery. Several computational approaches that help to uncover new repurposing opportunities and aid the discovery process have been put forward, or adapted from previous applications. A number of successful examples are now available. Overall, future developments will greatly benefit from integration of different methods, approaches and disciplines. Steps forward in this direction are expected to help to clarify, and therefore to rationally predict, new drug-target, target-disease, and ultimately drug-disease associations.
We question the level of detail required in protein 3D representation to detect site similarity, which is relevant for polypharmacology prediction. We modified the in-house program SiteAlign to replace generic pharmacophoric descriptors of cavity-lining amino acids by descriptors accounting for solvent exposure. Benchmarking the novel, atom-based method (SiteAlign2) revealed no global improvement of performance. However, in the rare cases of no sequence or global structure similarities between the compared proteins, SiteAlign2 was more successful if backbone atoms were key determinants of ligand binding. SiteAlign suits the comparison of binding sites for close or distant homologs. SiteAlign2 provides a better insight into the physical model of site similarity between nonhomologs, but at the expense of an increased sensitivity to atomic coordinates.
With the increase in applications of machine learning methods in drug design and related fields, the challenge of designing sound test sets becomes more and more prominent. The goal is a realistic split of chemical structures (compounds) between training, validation and test sets such that the performance on the test set is meaningful for inferring the performance in a prospective application. This challenge is interesting and relevant in its own right, but it is even more complex in a federated machine learning approach, where multiple partners jointly train a model under privacy-preserving conditions and chemical structures must not be shared between the participating parties. In this work we discuss three methods which provide a splitting of the data set and are applicable in a federated privacy-preserving setting, namely: (a) locality-sensitive hashing (LSH), (b) sphere exclusion clustering, and (c) scaffold-based binning (scaffold network). To evaluate these splitting methods we consider the following quality criteria: bias in prediction performance, label and data imbalance, and distance of the test set compounds to the training set; we compare the methods to a random splitting. The main findings of the paper are: (a) both sphere exclusion clustering and scaffold-based binning result in high-quality splitting of the data sets, and (b) in terms of compute costs, sphere exclusion clustering is very expensive in a federated privacy-preserving setting.
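To illustrate why locality-sensitive hashing fits a privacy-preserving setting, the sketch below assigns each compound to a fold using only a fixed subset of fingerprint bit positions. If all partners agree on the same positions and hash, compounds that share those bits land in the same fold at every partner without any structures being exchanged. The bit positions, fold count, and function name are hypothetical, and the choice of which bits to use is not taken from the paper.

```python
import hashlib


def lsh_fold(on_bits, key_positions, n_folds):
    """Assign a fingerprint (set of on-bit indices) to a fold.

    Only the bits at key_positions enter the hash, so fingerprints that agree
    on those positions always receive the same fold.
    """
    key = "".join("1" if p in on_bits else "0" for p in key_positions)
    digest = hashlib.sha256(key.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % n_folds


# Hypothetical shared configuration, agreed by all partners in advance.
key_positions = [3, 17, 42, 101]
n_folds = 5

fp_a = {3, 42, 200, 512}
fp_b = {3, 42, 201, 777}  # differs from fp_a only outside the key positions
fp_c = {17, 101, 200}

fold_a = lsh_fold(fp_a, key_positions, n_folds)
fold_b = lsh_fold(fp_b, key_positions, n_folds)
fold_c = lsh_fold(fp_c, key_positions, n_folds)
```

Because the assignment is a deterministic function of the fingerprint alone, each partner can compute it locally; the trade-off is that fold sizes are not controlled directly.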
To apply federated learning to drug discovery we developed a novel platform in the context of the European Innovative Medicines Initiative (IMI) project MELLODDY (grant n°831472), which comprised 10 pharmaceutical companies, academic research labs, large industrial companies and startups. The MELLODDY platform was the first industry-scale platform to enable the creation of a global federated model for drug discovery without sharing the confidential data sets of the individual partners. The federated model was trained on the platform by aggregating the gradients of all contributing partners in a cryptographically secure way following each training iteration. The platform was deployed on an Amazon Web Services (AWS) multi-account architecture running Kubernetes clusters in private subnets. Organisationally, the roles of the different partners were codified as different rights and permissions on the platform and administered in a decentralized way. The MELLODDY platform generated new scientific discoveries, which are described in a companion paper.
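The abstract does not specify the aggregation protocol. As one textbook illustration of how gradients can be summed without revealing individual contributions, the sketch below uses pairwise additive masking: every pair of partners shares a random mask that one adds and the other subtracts, so the masks cancel in the sum. All names, the modulus, and the fixed-point integer encoding are assumptions; this is not a description of the MELLODDY implementation.

```python
import random

MODULUS = 2**61 - 1  # assumed modulus for integer (fixed-point) gradient values


def mask_updates(updates, rng):
    """Return masked copies of the partners' updates.

    For each pair (i, j) with i < j and each coordinate, a random mask is added
    to partner i's value and subtracted from partner j's value. In a real
    protocol each mask would be derived from a secret shared by that pair only;
    a single random generator is used here purely for illustration.
    """
    n_partners = len(updates)
    dim = len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n_partners):
        for j in range(i + 1, n_partners):
            for k in range(dim):
                mask = rng.randrange(MODULUS)
                masked[i][k] = (masked[i][k] + mask) % MODULUS
                masked[j][k] = (masked[j][k] - mask) % MODULUS
    return masked


def aggregate(masked):
    """Sum the masked updates coordinate-wise; the pairwise masks cancel."""
    dim = len(masked[0])
    return [sum(u[k] for u in masked) % MODULUS for k in range(dim)]


# Toy non-negative integer updates from three partners.
updates = [[5, 12, 7], [1, 0, 9], [3, 3, 3]]
masked = mask_updates(updates, random.Random(0))
total = aggregate(masked)
```

The aggregator only ever sees the masked vectors, yet `total` equals the coordinate-wise sum of the original updates.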