Previous cross-lingual transfer methods are restricted to orthographic representation learning via textual scripts. This limitation hampers cross-lingual transfer and biases it towards languages sharing similar well-known scripts. To narrow the gap between languages with different writing scripts, we propose PhoneXL, a framework incorporating phonemic transcriptions as an additional linguistic modality beyond the traditional orthographic transcriptions for cross-lingual transfer. In particular, we propose unsupervised alignment objectives to capture (1) local one-to-one alignment between the two different modalities, (2) alignment via multi-modality contexts to leverage information from additional modalities, and (3) alignment via multilingual contexts where additional bilingual dictionaries are incorporated. We also release the first phonemic-orthographic alignment dataset on two token-level tasks (Named Entity Recognition and Part-of-Speech Tagging) among the understudied but interconnected Chinese-Japanese-Korean-Vietnamese (CJKV) languages. Our pilot study reveals that phonemic transcription provides essential information beyond the orthography to enhance cross-lingual transfer and bridge the gap among CJKV languages, leading to consistent improvements on cross-lingual token-level tasks over orthographic-based multilingual PLMs.
Bipolar disorder is a highly heritable psychiatric disorder that features episodes of mania and depression. We performed the largest genome-wide association study to date, including 20,352 cases and 31,358 controls of European descent, with follow-up analysis of 822 sentinel variants at loci with P < 1×10⁻⁴ in an independent sample of 9,412 cases and 137,760 controls. In the combined analysis, 30 loci reached genome-wide significant evidence for association, of which 20 were novel. These significant loci contain genes encoding ion channels and neurotransmitter transporters (CACNA1C, GRIN2A, SCN2A, SLC4A1), synaptic components (RIMS1, ANK3), immune and energy metabolism components. Bipolar disorder type I (depressive and manic episodes; ~73% of our cases) is strongly genetically correlated with schizophrenia whereas bipolar disorder type II (depressive and hypomanic episodes; ~17% of our cases) is more strongly correlated with major depressive disorder. These findings address key clinical questions and provide potential new biological mechanisms for bipolar disorder.
Bipolar disorder (BD) is a serious mental illness with substantial common variant heritability. However, the role of rare coding variation in BD is not well established. We examined the protein-coding (exonic) sequences of 3,987 unrelated individuals with BD and 5,322 controls of predominantly European ancestry across four cohorts from the Bipolar Sequencing Consortium (BSC). We assessed the burden of rare, protein-altering, single nucleotide variants classified as pathogenic or likely pathogenic (P-LP) both exome-wide and within several groups of genes with phenotypic or biologic plausibility in BD. While we observed an increased burden of rare coding P-LP variants within 165 genes identified as BD GWAS regions in 3,987 BD cases (meta-analysis OR = 1.9, 95% CI = 1.3–2.8, one-sided p = 6.0 × 10⁻⁴), this enrichment did not replicate in an additional 9,929 BD cases and 14,018 controls (OR = 0.9, one-sided p = 0.70). Although BD shares common variant heritability with schizophrenia (SCZ), in the BSC sample we did not observe a significant enrichment of P-LP variants in SCZ GWAS genes, in two classes of neuronal synaptic genes (RBFOX2 and FMRP) associated with SCZ, or in loss-of-function intolerant genes. In this study, the largest analysis of exonic variation in BD, individuals with BD do not carry a replicable enrichment of rare P-LP variants across the exome or in any of several groups of genes with biologic plausibility. Moreover, despite a strong shared susceptibility between BD and SCZ through common genetic variation, we do not observe an association between BD risk and rare P-LP coding variants in genes known to modulate risk for SCZ.
Transcriptomic imputation approaches offer an opportunity to test associations between disease and gene expression in otherwise inaccessible tissues, such as brain, by combining eQTL reference panels with large-scale genotype data. These genic associations could elucidate signals in complex GWAS loci and may disentangle the role of different tissues in disease development. Here, we use the largest eQTL reference panel for the dorso-lateral pre-frontal cortex (DLPFC), collected by the CommonMind Consortium, to create a set of gene expression predictors and demonstrate their utility. We applied these predictors to 40,299 schizophrenia cases and 65,264 matched controls, constituting the largest transcriptomic imputation study of schizophrenia to date. We also computed predicted gene expression levels for 12 additional brain regions, using publicly available predictor models from GTEx. We identified 413 genic associations across 13 brain regions. Stepwise conditioning across the genes and tissues identified 71 associated genes (67 outside the MHC), with the majority of associations found in the DLPFC, and of which 14/67 genes did not fall within previously genome-wide significant loci. We identified 36 significantly enriched pathways, including hexosaminidase-A deficiency, and multiple pathways associated with porphyric disorders. We investigated developmental expression patterns for all 67 non-MHC associated genes using BRAINSPAN, and identified groups of genes expressed specifically pre-natally or post-natally.
Few-shot Intent Detection is challenging due to the scarcity of available annotated utterances. Although recent works demonstrate that multi-level matching plays an important role in transferring learned knowledge from seen training classes to novel testing classes, they rely on a static similarity measure and overly fine-grained matching components. These limitations inhibit generalizing capability towards Generalized Few-shot Learning settings where both seen and novel classes are co-existent. In this paper, we propose a novel Semantic Matching and Aggregation Network where semantic components are distilled from utterances via multi-head self-attention with additional dynamic regularization constraints. These semantic components capture high-level information, resulting in more effective matching between instances. Our multi-perspective matching method provides a comprehensive matching measure to enhance representations of both labeled and unlabeled instances. We also propose a more challenging evaluation setting that considers classification on the joint all-class label space. Extensive experimental results demonstrate the effectiveness of our method. Our code and data are publicly available.
Recent advanced methods in Natural Language Understanding for Task-oriented Dialogue (TOD) Systems (e.g., intent detection and slot filling) require a large amount of annotated data to achieve competitive performance. In reality, token-level annotations (slot labels) are time-consuming and difficult to acquire. In this work, we study the Slot Induction (SI) task, whose objective is to induce slot boundaries without explicit knowledge of token-level slot annotations. We propose leveraging Unsupervised Pre-trained Language Model (PLM) Probing and Contrastive Learning mechanisms to exploit (1) unsupervised semantic knowledge extracted from the PLM, and (2) additional sentence-level intent label signals available from TOD. Our approach is shown to be effective on the SI task and capable of bridging the gap with token-level supervised models on two NLU benchmark datasets. When generalized to emerging intents, our SI objectives also provide enhanced slot label representations, leading to improved performance on the Slot Filling task.