Towards Cultural-Scale Models of Full Text

2015 
This technical report consists of two components: an administrative report for the HathiTrust Research Center (HTRC) Advanced Collaborative Support (ACS) program and a research report on the variance of topic models trained over random samples of books in the HathiTrust. Cultural-scale models of full-text documents are prone to over-interpretation by researchers who make unintentionally strong socio-linguistic claims without recognizing that even large digital libraries are merely samples of all the books ever produced. In this study, we test the sensitivity of topic models to the sampling process by taking random samples of books in the HathiTrust Digital Library within different Library of Congress Classification (LCC) areas. For each classification area, we train several topic models over the entire class with different random seeds, generating a set of spanning models. We then train topic models on random samples of books from the classification area, generating a set of sample models. Finally, we align topics from the sample models to the spanning models and measure the alignment distance and topic overlap. We find that sample models trained on large samples typically have an alignment distance that falls within the range of alignment distances observed between the spanning models themselves. Unsurprisingly, alignment distance decreases as sample size increases. We also find that topic overlap increases with sample size. However, how these measures vary with sample size differs by field and by the number of topics. We speculate that these measures could be used to identify classes that have a common "canon" discussed across all books in the area, as indicated by high topic overlap and low alignment distance even at small sample sizes.
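
The alignment step described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes topic-word matrices such as those produced by gensim's LdaModel.get_topics(), uses Jensen-Shannon distance as the pairwise topic distance, the Hungarian algorithm for the one-to-one topic alignment, and Jaccard similarity of each topic's top words as the topic-overlap measure; none of these specific choices are stated in the report.

```python
# A minimal sketch, NOT the authors' implementation. Assumed choices:
# topic-word matrices as produced by e.g. gensim's LdaModel.get_topics(),
# Jensen-Shannon distance between topic-word distributions as the pairwise
# topic distance, Hungarian matching for the one-to-one alignment, and
# Jaccard similarity of each topic's top words as "topic overlap".
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import jensenshannon


def align_models(sample_topics, spanning_topics, top_n=20):
    """Align sample-model topics to spanning-model topics.

    Both arguments are (num_topics x vocab_size) arrays whose rows are
    topic-word probability distributions. Returns the mean alignment
    distance over matched pairs and the mean top-word overlap.
    """
    # Pairwise Jensen-Shannon distance between every sample/spanning topic pair.
    dist = np.array([[jensenshannon(p, q) for q in spanning_topics]
                     for p in sample_topics])

    # One-to-one matching that minimizes total distance (Hungarian algorithm).
    rows, cols = linear_sum_assignment(dist)
    alignment_distance = float(dist[rows, cols].mean())

    # Topic overlap: Jaccard similarity of the top-n word indices per matched pair.
    overlaps = []
    for i, j in zip(rows, cols):
        top_sample = set(np.argsort(sample_topics[i])[-top_n:])
        top_span = set(np.argsort(spanning_topics[j])[-top_n:])
        overlaps.append(len(top_sample & top_span) / len(top_sample | top_span))
    topic_overlap = float(np.mean(overlaps))

    return alignment_distance, topic_overlap
```

In the setup described above, such a comparison would be repeated for each pairing of a sample model with a spanning model within an LCC class, then summarized by sample size and number of topics.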