On a Distribution Representing Sentence-length in written Prose

1974 
SUMMARY A new model for representing sentence-length distributions is suggested in equation (8), which is a special case of equation (2), with parameter y = -2 known a priori. Eight known sentence-length frequency counts taken from English, Greek and Latin prose were all satisfactorily described by distribution (8). For these eight fits, the average probability P(χ²) was 0.50. A ninth observed distribution, taken from a Latin text of unknown authorship, failed the χ² test applied to the fit of the data to the model in equation (8). This corroborates Yule's (1939) conclusion that it is highly unlikely that de Gerson could have written De Imitatione Christi. It is further conjectured that the last-mentioned observed frequency distribution could be well represented by the more general model in equation (2), with a parameter y much smaller than -2.
THE first substantial investigation on sentence-length as a statistical tool to be used in deciding disputed authorship was published by Yule in 1939. Simple statistical indices such as the average number of words per sentence and the standard deviation of sentence-lengths were employed. Yule did not suggest a particular mathematical distribution model. Later (Yule, 1944) he explored word-frequency of an author in addition to sentence-length. Although Yule mentions the negative binomial in that book, he discards this distribution model as totally inadequate for representation of word frequencies and sentence-lengths. Williams (1940, 1970) suggests and uses the lognormal distribution as a model for sentence-length. To verify lognormality, Williams plots the observed cumulative percentage frequencies of sentence-lengths on log-probability paper in the hope that these plots will approach a straight line. No χ² tests are given for any of Williams's examples.
Wake (1957), who discusses sentence-lengths in works of Greek authors, also makes use of the lognormal distribution by superimposing the observed histograms of the logarithms of sentence-lengths over the "expected" normal distributions. No χ² tests are given. The authorship of Greek prose is again investigated by Morton (1965), who works with distribution-free statistics such as the mean, the median, the quartiles and the deciles. Mosteller and Wallace (1963), in their study of the authorship of the Federalist papers, came to the conclusion that the mean and standard deviation of sentence-length were of no help in solving disputed authorship. In their particular research Mosteller and Wallace found the mean and standard deviation of sentence-length to be virtually identical for Madison and Hamilton. It can be shown, however, that two