The notion of word embedding plays a fundamental role in natural language processing (NLP). However, pre-training word embeddings for a very large vocabulary is computationally challenging for most existing methods. In this work, we show that with merely a small fraction of contexts (Q-contexts) that are typical of the whole corpus, together with their mutual information with words, one can construct high-quality word embeddings with negligible error. Mutual information between contexts and words can be encoded canonically as a sampling state, so Q-contexts can be constructed quickly. Furthermore, we present an efficient and effective method, WEQ, which extracts word embeddings directly from these typical contexts. In practical scenarios, our algorithm runs 11$\sim$13 times faster than well-established methods. Compared with well-known methods such as matrix factorization, word2vec, GloVe, and fastText, our method achieves comparable performance on a variety of downstream NLP tasks while maintaining run-time and resource advantages over all these baselines.
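The link between word-context mutual information and matrix factorization that this abstract builds on can be illustrated with a minimal sketch: compute a positive pointwise mutual information (PPMI) matrix from co-occurrence counts and factorize it with a truncated SVD. The toy counts below are invented for illustration; this is the classic PPMI+SVD baseline, not the paper's WEQ algorithm or its Q-context sampling.

```python
import numpy as np

# Toy word-context co-occurrence counts (rows: words, cols: contexts).
counts = np.array([
    [4.0, 1.0, 0.0],
    [1.0, 3.0, 2.0],
    [0.0, 2.0, 5.0],
])

def ppmi(counts):
    """Positive pointwise mutual information of a co-occurrence matrix."""
    total = counts.sum()
    p_wc = counts / total                      # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)      # word marginals
    p_c = p_wc.sum(axis=0, keepdims=True)      # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0               # zero counts -> zero PMI
    return np.maximum(pmi, 0.0)

# Factorize the PPMI matrix; word vectors come from the left singular vectors.
m = ppmi(counts)
u, s, _ = np.linalg.svd(m, full_matrices=False)
dim = 2
word_vectors = u[:, :dim] * np.sqrt(s[:dim])
```

Selecting only a few typical context columns before factorizing, as the abstract describes, shrinks the matrix that must be decomposed, which is where the reported speed-up comes from.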
Despite the remarkable progress of machine learning (ML) techniques in chemistry, modeling the optoelectronic properties of long conjugated oligomers and polymers with ML remains challenging due to the difficulty of obtaining sufficient training data. Here we use transfer learning to address this data scarcity by pre-training graph neural networks on data from short oligomers. With only a few hundred training data points, we achieve an average error of about 0.1 eV for the excited-state energies of oligothiophenes against TDDFT calculations. We show that the success of our transfer learning approach relies on the relative locality of low-lying electronic excitations in long conjugated oligomers. Finally, we demonstrate the transferability of our approach by modeling the lowest-lying excited-state energies of poly(3-hexylthiophene) (P3HT) in its single-crystal and solution phases using transfer learning models trained on data for gas-phase oligothiophenes. The predicted excited-state energy distributions agree quantitatively with TDDFT calculations and capture important qualitative features observed in experimental absorption spectra.
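The pre-train/fine-tune pattern described here can be sketched generically: train a network on an abundant proxy dataset, then continue training the same weights on a small target dataset. The sketch below uses a plain MLP on synthetic descriptors (not the paper's graph neural networks or TDDFT data); the feature dimensions, sample counts, and the 0.2 "systematic shift" between short and long oligomers are all invented for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
w = rng.normal(size=8)

# Synthetic stand-ins: abundant "short oligomer" data and a scarce
# "long oligomer" set sharing the underlying trend plus a small shift.
X_short = rng.normal(size=(500, 8))
y_short = X_short @ w + 0.05 * rng.normal(size=500)
X_long = rng.normal(size=(40, 8))
y_long = X_long @ w + 0.2 + 0.05 * rng.normal(size=40)

# Pre-train on the abundant proxy data.
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500,
                     warm_start=True, random_state=0)
model.fit(X_short, y_short)

# Fine-tune: with warm_start=True, fit() continues from the current weights
# instead of re-initializing, so the scarce target set only adjusts them.
model.fit(X_long, y_long)
preds = model.predict(X_long)
```

The locality argument in the abstract explains why this works: if low-lying excitations are dominated by local structure, the representation learned from short chains transfers to long ones with only a small correction.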
We investigate and discuss the viability of exciting graphene plasmons through the aloof scattering of free electrons and through inelastic electron tunneling. Excitation efficiencies may be larger than those for metal plasmons.
Metalloproteins play essential roles in various biological processes ranging from reaction catalysis to free radical scavenging, and they are also pertinent to numerous pathologies including cancer, HIV infection, and inflammation.
Element tuning of targeted materials and obtaining the optimal synthesis recipe are major goals for many materials scientists. However, this is often limited by conventional trial-and-error procedures, which are time-consuming and labor-intensive. In this work, fine element tuning of the halide double perovskite Cs$_2$Na$_x$Ag$_{1-x}$In$_y$Bi$_{1-y}$Cl$_6$ is conducted through a data-driven investigation combining high-throughput experiments with machine learning (ML). A positive correlation between the more accessible R value in emission RGB values (the intensities of the red/green/blue primary colors) and photoluminescence intensity is revealed, and over a thousand R values of Cs$_2$Na$_x$Ag$_{1-x}$In$_y$Bi$_{1-y}$Cl$_6$ crystals synthesized with different additives and element compositions are collected. More importantly, the volume ratios of Na$^+$/Ag$^+$ (V$_{\mathrm{Na}}$:V$_{\mathrm{Ag}}$) and Bi$^{3+}$/In$^{3+}$ (V$_{\mathrm{Bi}}$:V$_{\mathrm{In}}$) are correlated with the corresponding R values through ML, and the synergistic regulation of the two ion pairs is revealed. A possible correlation between R and XRD is also proposed. Finally, different emission intensities of LED beads coated with Cs$_2$Na$_x$Ag$_{1-x}$In$_y$Bi$_{1-y}$Cl$_6$ synthesized using parameters obtained from ML are demonstrated, with an emission enhancement of ≈50 times between the brightest and dimmest LEDs. This work illustrates that data-driven investigation helps guide material synthesis and can significantly reduce the workload for developing novel materials, especially those with complex compositions.
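The core ML step described here, learning a map from precursor volume ratios to the emission R value and then searching that map for a bright recipe, can be sketched as follows. Everything below is hypothetical: the abstract does not specify the model class, and the synthetic Gaussian "response surface" and its peak location merely stand in for real measurements.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Hypothetical training table: two precursor volume ratios (Na:Ag, Bi:In)
# as features, measured red-channel intensity R as the target.
ratios = rng.uniform(0.0, 1.0, size=(200, 2))
R = 255.0 * np.exp(-((ratios[:, 0] - 0.4) ** 2 +
                     (ratios[:, 1] - 0.6) ** 2) / 0.05)
R += rng.normal(scale=5.0, size=200)          # measurement noise

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(ratios, R)

# Search a grid of candidate ratios for the predicted brightest recipe.
grid = np.stack(np.meshgrid(np.linspace(0, 1, 50),
                            np.linspace(0, 1, 50)), axis=-1).reshape(-1, 2)
best = grid[np.argmax(model.predict(grid))]
```

Because each R measurement is far cheaper than a photoluminescence spectrum, this kind of surrogate model can screen thousands of candidate compositions before any new synthesis is attempted.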
Designing molecules with desirable physicochemical properties and functionalities is a long-standing challenge in chemistry, materials science, and drug discovery. Recently, machine learning-based generative models have emerged as promising approaches for \emph{de novo} molecule design. However, most existing methods lack unified modeling of 2D topology and 3D geometry and fail to effectively learn the structure-property relationship for molecule design. Here we present MolCode, a roto-translation equivariant generative framework for \underline{Mol}ecular graph-structure \underline{Co-de}sign. In MolCode, 3D geometric information empowers molecular 2D graph generation, which in turn helps guide the prediction of molecular 3D structure. Extensive experiments show that MolCode outperforms previous methods on a series of challenging tasks including \emph{de novo} molecule design, targeted molecule discovery, and structure-based drug design. In particular, MolCode not only consistently generates valid (99.95$\%$ validity) and diverse (98.75$\%$ uniqueness) molecular graphs/structures with desirable properties, but also generates drug-like molecules with high affinity to target proteins (61.8$\%$ high-affinity ratio), demonstrating MolCode's potential applications in material design and drug discovery. Our investigation reveals that 2D topology and 3D geometry contain intrinsically complementary information for molecule design, and provides new insights into machine learning-based molecule representation and generation.
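The roto-translation requirement named above can be made concrete with a small check: descriptions of a 3D structure that are built from interatomic distances are unchanged by any rigid motion of the molecule. This is a minimal property test on toy coordinates, not MolCode's architecture, which enforces equivariance inside the network rather than via hand-built invariant features.

```python
import numpy as np

def pairwise_distances(coords):
    """All atom-atom distances; a roto-translation invariant description."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def random_rototranslation(coords, rng):
    """Apply a random rigid motion (rotation + translation) to coordinates."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))        # fix column signs so Q is unique
    if np.linalg.det(q) < 0:        # ensure a proper rotation (det = +1)
        q[:, 0] *= -1
    return coords @ q.T + rng.normal(size=3)

rng = np.random.default_rng(0)
coords = rng.normal(size=(5, 3))    # toy 5-atom geometry
moved = random_rototranslation(coords, rng)
same = np.allclose(pairwise_distances(coords), pairwise_distances(moved))
```

A generative model whose outputs transform consistently under such motions does not waste capacity learning that rotated copies of a molecule are the same molecule.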
Proteins govern most biological functions essential for life, but achieving controllable protein discovery and optimization remains challenging. Recently, machine learning-assisted protein editing (MLPE) has shown promise in accelerating optimization cycles and reducing experimental workloads. However, current methods struggle with the vast combinatorial space of potential protein edits and cannot explicitly conduct protein editing from biotext instructions, limiting their interactivity with human feedback. To fill these gaps, we propose ProtET, a novel method for efficient CLIP-informed protein editing through multi-modality learning. Our approach comprises two stages: in the pre-training stage, contrastive learning aligns protein and biotext representations, each encoded by a large language model (LLM). During the protein editing stage, the fused features of the editing instruction text and the original protein sequence serve as the final editing condition for generating the target protein sequence. Comprehensive experiments demonstrate the superiority of ProtET in editing proteins to enhance human-expected functionality across multiple attribute domains, including enzyme catalytic activity, protein stability, and antibody binding specificity. ProtET improves on state-of-the-art results by a large margin, achieving significant stability improvements of 16.67% and 16.90%. This capability positions ProtET to advance real-world artificial protein editing, potentially addressing unmet academic, industrial, and clinical needs.
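The CLIP-style contrastive pre-training stage can be sketched with its standard loss: a symmetric InfoNCE objective in which matched protein/text pairs form the diagonal of a similarity matrix. The embeddings below are random placeholders, not outputs of ProtET's LLM encoders, and the temperature value is the common CLIP default rather than anything the abstract specifies.

```python
import numpy as np

def clip_style_loss(protein_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; matched pairs sit on the diagonal."""
    p = protein_emb / np.linalg.norm(protein_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = p @ t.T / temperature      # cosine similarities, scaled
    n = logits.shape[0]

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)              # stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(n), np.arange(n)].mean()

    # Average the protein-to-text and text-to-protein directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
protein = rng.normal(size=(8, 16))
text = rng.normal(size=(8, 16))
aligned = clip_style_loss(protein, protein)   # perfectly matched pairs
random = clip_style_loss(protein, text)       # unrelated pairs
```

Minimizing this loss pulls each protein embedding toward its paired biotext embedding and away from the other texts in the batch, which is what lets a later editing stage condition sequence generation on free-text instructions.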