Automatic Cluster Analysis of Texts in Simplified German

2019 
Text simplification is the process of reducing the lexical and syntactic complexity of a text while preserving most of the original information content [Saggion, 2017, 1]. This process aims at making texts accessible to everyone, including persons with low literacy skills, cognitive or learning disabilities, aphasia or dementia, among others. Because of the heterogeneity of the target users, simplified German as an instance of simplified language has been conceptualised at multiple complexity levels [Bredel and Maas, 2016; Bock, 2014; Kellermann, 2014]. However, to date neither guidelines nor evidence support this claim. In this master's thesis, I present an approach to automatically analyse existing texts in simplified German, with the goal of investigating evidence of multiple complexity levels. This approach was tested with two different corpora in simplified German. The first task in my analysis is to address a key question in text simplification research, namely the identification of complexity structures of given texts. This includes the creation of a feature framework reflecting the linguistic and structural characteristics of texts in simplified German. The second task is to cluster documents by exploring various unsupervised algorithms and combinations of the previously extracted features. In the third task, the output of the cluster analysis is validated to assess its robustness; finally, the clustering results are linguistically interpreted to identify feature behaviours. The results show that clustering techniques are able to discriminate among texts in simplified German, suggesting that some groups of texts share a high degree of linguistic similarity. This thesis emphasises the necessity of exploring not only linguistic features but also structural and layout characteristics of simplified language in order to meet the requirements of the various target users.
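
The three tasks described above (feature extraction, unsupervised clustering, validation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the three surface features (mean sentence length, mean word length, type-token ratio) stand in for the much richer feature framework, plain k-means stands in for the "various unsupervised algorithms", the silhouette coefficient stands in for the validation step, and the four short German documents are invented toy examples rather than corpus data.

```python
import math
import re
from statistics import mean, pstdev


def extract_features(text):
    """Return three illustrative surface features for one document (assumed, not the thesis's set)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    return [
        len(words) / len(sentences),   # mean sentence length in words
        mean(len(w) for w in words),   # mean word length in characters
        len(set(words)) / len(words),  # type-token ratio
    ]


def standardise(rows):
    """Z-score each feature column so that no single feature dominates the distance."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]


def kmeans(points, k, iterations=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(
                [mean(col) for col in zip(*members)] if members else centroids[j]
            )
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient in [-1, 1]; assumes at least two non-empty clusters."""
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = mean(same)
        b = min(
            mean(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / (max(a, b) or 1.0))
    return mean(scores)


# Invented toy documents: two with short, simple sentences, two with long, complex ones.
docs = [
    "Das ist ein Haus. Das Haus ist groß. Wir wohnen hier.",
    "Der Hund ist lieb. Er spielt gern. Er mag den Ball.",
    "Die Bundesregierung beschließt umfangreiche Gesetzesänderungen, welche die "
    "Verwaltungsverfahren grundlegend vereinfachen sollen.",
    "Aufgrund der demografischen Entwicklung erwarten Sachverständige erhebliche "
    "Auswirkungen auf die sozialen Sicherungssysteme.",
]

points = standardise([extract_features(d) for d in docs])
labels = kmeans(points, k=2)
score = silhouette(points, labels)
print(labels, round(score, 2))
```

On real corpora one would likely swap in an established library and compare several algorithms and feature subsets, as the abstract describes; the sketch only shows how the three tasks connect.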