Large language models (LLMs) demonstrate a remarkable ability to comprehend, reason, and generate text by following natural language instructions. However, LLM development has focused primarily on high-resource languages such as English, limiting their applicability and the research conducted in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage of pre-training to 60% in the final stage. Furthermore, we propose a multilingual self-instruct method that automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, covering multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the instruction data and the multilingual benchmark, are available at: \url{https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation}.
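The curriculum strategy above can be illustrated with a minimal data-mixing sketch. This is not the authors' implementation; the linear ramp and the helper names (`non_english_ratio`, `sample_language`) are assumptions made for illustration, showing only how a sampler could shift the non-English share from 30% to 60% over training.

```python
import random

def non_english_ratio(step: int, total_steps: int,
                      start: float = 0.30, end: float = 0.60) -> float:
    """Target share of non-English data, ramped linearly over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def sample_language(step: int, total_steps: int, rng: random.Random) -> str:
    """Pick which language pool the next training example is drawn from."""
    if rng.random() < non_english_ratio(step, total_steps):
        return "non_english"
    return "english"

rng = random.Random(0)
# Early in training ~30% of draws are non-English; near the end, ~60%.
```

Any monotone schedule (stepwise, cosine, etc.) could replace the linear ramp; the key design point is that easier, higher-resource data dominates early and the multilingual share grows later.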
Identifying the purpose of citations plays an important role in evaluating the impact of the literature. The different types of citation intents are imbalanced in the data, which harms the performance of classification models. To alleviate this problem, we adapt the bilateral-branch network, originally proposed in the computer vision domain, to this natural language processing task by constructing shared and non-shared encoder layers with a pre-trained language model and a word attention layer, respectively. In addition, to learn rich representations by leveraging auxiliary information, we propose a multi-task bilateral-branch network. To integrate the multi-task model with the bilateral-branch network, and since one advantage of multi-task learning is using more data and information to learn better representations, we attach the networks of the auxiliary tasks to the representation learning branch of the bilateral-branch network. Experimental results show that our model outperforms other models used for citation intent classification.
Concerns that Large Language Models (LLMs) memorize and disclose private information, particularly Personally Identifiable Information (PII), have become prominent within the community. Many efforts have been made to mitigate these privacy risks. However, the mechanism through which LLMs memorize PII remains poorly understood. To bridge this gap, we introduce a pioneering method for pinpointing PII-sensitive neurons (privacy neurons) within LLMs. Our method employs learnable binary weight masks, trained adversarially, to localize the specific neurons that account for the memorization of PII in LLMs. Our investigations reveal that PII is memorized by a small subset of neurons across all layers, which exhibit the property of PII specificity. Furthermore, we validate the method's potential for PII risk mitigation by deactivating the localized privacy neurons. Both quantitative and qualitative experiments demonstrate the effectiveness of our neuron localization algorithm.
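The mask-then-deactivate idea can be sketched in a few lines. This is a toy NumPy illustration, not the paper's algorithm: the logits, threshold, and helper names (`binarize`, `apply_neuron_mask`) are assumptions, and the adversarial training that produces the mask logits is omitted. It shows only the final step: turning a learned real-valued mask into a hard selection and zeroing the selected neurons.

```python
import numpy as np

def binarize(mask_logits: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Threshold the learnable real-valued mask into a hard 0/1 selection."""
    return (mask_logits > threshold).astype(int)

def apply_neuron_mask(activations: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Deactivate (zero out) the neurons the binary mask marks as privacy neurons."""
    return activations * (1 - mask)

# Hypothetical logits for one layer; positive values flag privacy neurons.
logits = np.array([-2.1, 3.4, -0.7, 1.9])
mask = binarize(logits)
acts = np.array([0.5, 2.0, -1.2, 0.9])
masked = apply_neuron_mask(acts, mask)  # flagged neurons are silenced
```

Deactivating the masked neurons and re-running extraction prompts is then a direct way to test whether the localized subset really carries the memorized PII.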
In computer-assisted orthodontics, three-dimensional tooth models are required for many medical treatments. Tooth segmentation from cone-beam computed tomography (CBCT) images is a crucial step in constructing these models. However, CBCT image quality problems, such as metal artifacts and blurring caused by the imaging equipment and patients' dental conditions, make segmentation difficult. In this paper, we propose ToothSegNet, a new framework that familiarizes the segmentation model with generated degraded images during training. ToothSegNet merges information from the high- and low-quality images produced by the designed degradation simulation module using channel-wise cross fusion to reduce the semantic gap between the encoder and decoder, and refines the predicted tooth shape through a structural constraint loss. Experimental results show that ToothSegNet produces more precise segmentations and outperforms state-of-the-art medical image segmentation methods.
Text generation tasks require that the generated text exhibit a certain diversity while remaining relevant. Traditional Seq2Seq models usually use cross entropy as the objective function, which demands that outputs stay strictly consistent with the ground-truth texts and easily leads to a lack of variability in the generated text. In this paper, we propose a novel framework, TransVAE, which applies a Variational Auto-Encoder (VAE) to improve the Seq2Seq architecture. We design a Translator module that transforms the latent space of the source input into that of the target output, thereby enhancing the diversity of generated texts and supporting semi-supervised learning. Moreover, we add attention and copy mechanisms to TransVAE to balance relevance and diversity. Extensive experiments are carried out on three string transduction tasks: dialogue generation, machine translation, and text summarization. The results verify the effectiveness of our method.
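The VAE machinery behind this kind of model can be sketched briefly. This is a generic NumPy illustration, not TransVAE itself: `reparameterize` is the standard VAE reparameterization trick, and `translate_latent` is a hypothetical stand-in (a single linear map) for a learned Translator network mapping a source latent code to the target latent space.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu: np.ndarray, log_var: np.ndarray) -> np.ndarray:
    """Sample z = mu + sigma * eps; randomness lives in eps, so gradients
    can flow through mu and log_var during training."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def translate_latent(z_src: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Hypothetical Translator: map a source-side latent code into the
    target-side latent space (a real model would use a learned network)."""
    return W @ z_src + b
```

Sampling different `eps` values for the same input yields different latent codes, and hence diverse decodings, which is the mechanism the abstract credits for increased output variability.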
Links between issue reports and the code commits that fix them can greatly reduce the maintenance costs of a software project. More often than not, however, these links are missing and thus cannot be fully utilized by developers. Current practices in issue-commit link recovery extract text features and code features, in terms of textual similarity, from issue reports and commit logs to train their models. These approaches are limited because semantic information can be lost. Furthermore, few of them consider the effect of the source code files related to a commit on issue-commit link recovery, let alone the semantics of the code context. To tackle these problems, we propose constructing a code knowledge graph of a code repository and generating embeddings of source code files to capture the semantics of the code context. We also use embeddings to capture the semantics of issue- and commit-related text. We then use these embeddings to compute semantic similarity and code similarity with a deep learning approach before training an SVM binary classifier with additional features. Evaluations on real-world projects show that our approach, DeepLink, outperforms the state-of-the-art method.
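The similarity-feature step can be sketched as follows. This is an illustrative NumPy sketch, not DeepLink's code: the feature layout and the helper names (`cosine_similarity`, `link_features`) are assumptions. It shows only how embedding pairs for a candidate issue-commit link could be reduced to scalar similarity features for a downstream SVM.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two embedding vectors, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def link_features(issue_text: np.ndarray, commit_text: np.ndarray,
                  issue_code: np.ndarray, commit_code: np.ndarray) -> list:
    """Feature vector for a candidate issue-commit pair: one text-side and
    one code-side similarity (a real system would add further features)."""
    return [cosine_similarity(issue_text, commit_text),
            cosine_similarity(issue_code, commit_code)]
```

The resulting low-dimensional feature vectors are what a binary classifier such as an SVM would consume, with a label indicating whether the issue and commit are truly linked.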