Lexical analysis in the service of cliodynamics: artificial intelligence processing of the Google Ngram database

2017 
Cliodynamics is a fairly recent research field that treats history as an object of scientific study. Thanks to its transdisciplinary nature, cliodynamics tries to explain dynamic historical processes, such as the rise or collapse of empires or civilizations, economic cycles, population booms and fashions, through mathematical modeling, data mining, econometrics or cultural sociology. "Big data" aggregating historical, archaeological or economic information is the material that feeds these quantitative models. The field can also include empirical analysis that uses historical data to validate the assumptions and predictions of dynamic models. Cliodynamics is part of the cliometrics approach, or "new economic history", which studies history through econometrics.

Objectives

On the one hand, we designed a robust lexical analysis method able to deal with a very large series of dated corpora whose content evolves over time (big data), with the challenge of identifying societal evolutions and major historical periods from a cliodynamics perspective. On the other hand, we examined what can be learned from the Google Books Ngram database, which records the annual number of occurrences of each word in the scanned publications available through the Google Books search engine. It is assumed that this database covers about 20% of all books ever published in the major languages. We focused our study on English-language books published in the United States and Great Britain. The objective was to identify how word frequencies evolved from 1860 to 2008.

Method: Principles

The first step was to build a dictionary of the most commonly used English words, disregarding two-way terms, prepositions, articles and pronouns. This dictionary contains 1592 words covering many aspects of social and cultural life, with terms related to politics, religion, arts and sciences, industry, objects, family and sentiments.
In a second step, the percentage representation of each word of the dictionary was determined for each year, after loading the very large Google Books Ngram (1-gram) database into PostgreSQL. Some words, such as "king" or "queen", are strongly represented in the 19th century, reflecting the reign and power of royalty in Europe, but their use declined in the 20th century. Word frequencies in books evolve constantly over time.

The third step was to perform a centered and standardized principal component analysis (PCA) on the table giving the representation of each word, in %, for each year from 1860 to 2008. The years were then clustered using a neural network (a Kohonen map, an artificial intelligence technique). The results show 8 distinct historical periods, organized along 3 major tendencies in discourse: Humanist versus Scientific; Chaos versus Organization; Individualist versus Collectivist.
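
The second and third steps can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it makes several assumptions: the counts are made-up toy numbers for four words and six years, the percentage is taken relative to the dictionary total for each year (the abstract does not say which denominator was used), the PCA is computed by SVD with NumPy, and the Kohonen map is a tiny one-dimensional map with hypothetical parameters (`n_units`, `n_iter`, `lr0`, `sigma0`).

```python
import numpy as np

def yearly_percentages(counts):
    """Share (in %) of each word within the dictionary total of its year.
    Assumption: the denominator is the sum over dictionary words only."""
    counts = np.asarray(counts, dtype=float)
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)

def standardized_pca(table, n_components=2):
    """Centered, standardized PCA via SVD; rows are years, columns are words.
    Assumes no column is constant (non-zero standard deviation)."""
    z = (table - table.mean(axis=0)) / table.std(axis=0, ddof=1)
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # year coordinates
    explained = (s ** 2) / np.sum(s ** 2)             # variance ratios
    return scores, explained[:n_components]

def train_som(data, n_units=3, n_iter=500, lr0=0.5, sigma0=1.0, seed=0):
    """Very small 1-D Kohonen map (hypothetical settings, for illustration)."""
    rng = np.random.default_rng(seed)
    weights = data[rng.choice(len(data), n_units, replace=False)].copy()
    grid = np.arange(n_units)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1e-3     # shrinking neighbourhood
        x = data[rng.integers(len(data))]        # one random year
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        h = np.exp(-((grid - bmu) ** 2) / (2.0 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights

def assign_units(data, weights):
    """Index of the closest map unit for each year."""
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Toy, invented counts (NOT Ngram data): columns = king, queen, science, computer
years = [1860, 1890, 1920, 1950, 1980, 2008]
counts = np.array([
    [90, 80, 20, 1],
    [85, 75, 30, 2],
    [60, 50, 60, 10],
    [40, 35, 90, 40],
    [30, 25, 100, 90],
    [20, 20, 110, 150],
], dtype=float)

pct = yearly_percentages(counts)
scores, ratio = standardized_pca(pct, n_components=2)
weights = train_som(scores, n_units=3)
labels = assign_units(scores, weights)
for year, label in zip(years, labels):
    print(year, "-> map unit", int(label))
```

In this sketch the years are clustered on their PCA coordinates rather than on the raw percentage table; the abstract does not specify which input the Kohonen map received, so that choice is also an assumption.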