    Abstract:
Large language models (LLMs) demonstrate a remarkable ability to comprehend, reason, and generate text following natural language instructions. However, LLM development has focused primarily on high-resource languages such as English, limiting their applicability and the surrounding research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens and available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training. Further, we propose a multilingual self-instruct method that automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, covering multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the instruction data and multilingual benchmark, are available at: \url{https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation}.
    Keywords:
    Polyglot
    Benchmark (surveying)
    Limiting
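The abstract states only the endpoints of the curriculum schedule (30% non-English early, 60% late). A minimal sketch of one way such a schedule could drive data sampling, assuming a simple linear ramp between the two stages (the function names and toy corpora below are hypothetical, not PolyLM's code):

```python
import random

def non_english_share(step: int, total_steps: int,
                      start: float = 0.30, end: float = 0.60) -> float:
    """Ramp the non-English proportion over pre-training.

    The abstract gives only the endpoints (30% -> 60%); the linear
    interpolation between them is an assumption for illustration.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def sample_document(step: int, total_steps: int,
                    english_corpus: list, non_english_corpus: list) -> str:
    """Pick the source corpus for one document according to the schedule."""
    if random.random() < non_english_share(step, total_steps):
        return random.choice(non_english_corpus)
    return random.choice(english_corpus)

# Toy usage with placeholder corpora.
en = ["an English document"] * 10
multi = ["a non-English document"] * 10
for step in (0, 500, 1000):
    print(step, f"{non_english_share(step, 1000):.0%} non-English",
          sample_document(step, 1000, en, multi))
```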
Named entities are inherently multilingual, and annotations in any given language may be limited. This motivates us to consider polyglot named-entity recognition (NER), where one model is trained using annotated data drawn from more than one language. However, a straightforward implementation of this simple idea does not always work in practice: naive training of NER models using annotated data drawn from multiple languages consistently underperforms models trained on monolingual data alone, despite having access to more training data. The starting point of this paper is a simple solution to this problem, in which polyglot models are fine-tuned on monolingual data to consistently and significantly outperform their monolingual counterparts. To explain this phenomenon, we explore the sources of multilingual transfer in polyglot NER models and examine the weight structure of polyglot models compared to their monolingual counterparts. We find that polyglot models efficiently share many parameters across languages and that fine-tuning may utilize a large number of those parameters.
    Polyglot
    Training set
    Named Entity Recognition
    Transfer of learning
    Citations (0)
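The two findings above suggest a simple experiment: train one tagger on pooled multilingual data, fine-tune a copy per language, and count how many of the shared parameters the fine-tuning actually moves. A minimal sketch of that procedure with a toy linear tagger standing in for a real NER architecture (the data, threshold, and all names below are placeholders, not the authors' setup):

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins: random "token features" and tag ids for each language.
FEAT, TAGS = 32, 5
def toy_corpus(n: int):
    return torch.randn(n, FEAT), torch.randint(0, TAGS, (n,))

corpora = {lang: toy_corpus(200) for lang in ["en", "de", "nl"]}

def train(model: nn.Module, data, epochs: int = 50, lr: float = 0.05):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    x, y = data
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

# 1) Polyglot training: one model on the concatenation of all languages.
polyglot = nn.Linear(FEAT, TAGS)
pooled = (torch.cat([c[0] for c in corpora.values()]),
          torch.cat([c[1] for c in corpora.values()]))
train(polyglot, pooled)

# 2) Monolingual fine-tuning: continue from the polyglot weights for one
#    language, then measure what fraction of shared parameters moved.
for lang, data in corpora.items():
    tuned = copy.deepcopy(polyglot)
    train(tuned, data, epochs=20)
    before = torch.cat([p.flatten() for p in polyglot.parameters()])
    after = torch.cat([p.flatten() for p in tuned.parameters()])
    moved = ((before - after).abs() > 1e-3).float().mean().item()
    print(f"{lang}: {moved:.0%} of shared parameters changed in fine-tuning")
```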
The Complutensian Polyglot. Notes and Queries, Volume s3-III, Issue 54, 10 January 1863, Page 21. https://doi.org/10.1093/nq/s3-III.54.21a
    Polyglot
    Citations (0)
# RP_polyglot_developers

Replication package for the paper "Do Polyglot Systems Have Polyglot Developers?"

## Contents of the Replication Package

This package contains all the material necessary to replicate the study. It also contains supplementary data to give better insight into the results.

- **Scripts/**
  - `top_contributor_byCommits.py` - the script used to generate the columns "Author" and "Commits"
  - `getGroupName.py` - auxiliary script used to match aliases
  - `top_contributor_byChangedLines.py` - the script used to generate the columns "AddedLines" and "ChangedLines"
  - `top_contributor_byChangedJavaComments.py` - the script used to generate the columns "Added_Java_Lines", "Deleted_Java_Lines", "Added_Java_Comment_Lines", and "Deleted_Java_Comment_Lines"
  - `javaparser-core-3.24.4.jar` - the javaparser package
  - `top_contributor_byChangedPythonComments.py` - the script used to generate the columns "Added_Python_Lines", "Deleted_Python_Lines", "Added_Python_Comment_Lines", and "Deleted_Python_Comment_Lines" for Python projects
  - `parsePythonComments.py` - auxiliary script for parsing Python comments with `ast` in Python 2
  - `contributor_languages.py` - the script used to generate the programming-language columns beginning with "Changed Java Files Overall"
- **csv/** - all data is contained in CSV files
  - ***Java/*** - all projects with Java as the core language
  - ***Python/*** - all projects with Python as the core language
    Polyglot
    Citations (0)
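As a usage note, the per-contributor columns named in the README above can be analysed directly with pandas. A minimal sketch, assuming one CSV per project under the `csv/` folders described there (the file path is hypothetical; only the column names come from the README):

```python
import pandas as pd

# Hypothetical path; the README describes csv/Java/ and csv/Python/ folders
# but does not name the individual project files.
df = pd.read_csv("csv/Java/some_project.csv")

# Top contributors by commit count, i.e. the columns that
# top_contributor_byCommits.py generates ("Author", "Commits").
print(df.sort_values("Commits", ascending=False).head(5)[["Author", "Commits"]])

# Each contributor's share of line churn, from the
# top_contributor_byChangedLines.py output ("AddedLines", "ChangedLines").
df["ChurnShare"] = df["ChangedLines"] / df["ChangedLines"].sum()
print(df[["Author", "ChurnShare"]].round(3))
```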
One of the most famous Swedish-language poets, Edith Södergran (1892-1923), was also one of the most multilingual writers in Northern Europe. She had knowledge of at least seven languages and wrote in five, yet published only in Swedish. Södergran's childhood in multilingual Saint Petersburg, her education in the German-language school St. Petrischule, and her travels in Europe formed a polyglot globetrotter and world citizen whose linguistic and cultural competences have only begun to be appreciated in recent years; Södergran's multilingualism has, however, not been researched in depth. This study discusses the multilingualism in the life of this versatile writer and tries to reconstruct her multilingual biography from fragmentary archive material. It also provides some examples of how her poetry reflects the multidimensional and multilingual world she lived in, and asks whether her multilingualism was really hidden or has simply been overlooked.
    Polyglot
    Multilingual Education
    Citations (0)
Journal Article: Complutensian Polyglot Bible. Joseph Rix (St. Neots). Notes and Queries, Volume s2-VI, Issue 142, 18 September 1858, Page 233. https://doi.org/10.1093/nq/s2-VI.142.233d
    Polyglot
    Citations (0)
In the few articles that have been written about ‘la Muse Belgique’ (the Belgian Muse) Marie-Caroline Murray over the past 250 years, she is invariably labelled a polyglot. The question arises, however, to what extent Murray effectively internalised this multilingualism, and to what extent she actively deployed it in her writings. An initial analysis of her personal correspondence shows that her exceptional erudition was accompanied by a predominantly passive knowledge of these languages, and that this was determined mainly by the changing networks in which Murray moved throughout her life. This article examines the role of these networks in Marie-Caroline Murray’s perception of multilingualism by systematically mapping them. This staged ‘deconstruction’ of her multilingualism shows that, through these cosmopolitan contacts, Murray repeatedly outdid herself in making original translations (Camões), but that this did not lead to an internalisation of multilingualism, as is revealed by Murray’s later descriptions of literary agency.
    Polyglot
    Deconstruction (building)
Journal Article: Swinburne as Polyglot Author. G. W. E. R. Notes and Queries, Volume s11-IX, Issue 217, 21 February 1914, Page 157. https://doi.org/10.1093/nq/s11-IX.217.157
    Polyglot
    Citations (0)