A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books

Yoav Goldberg, Jon Orwant

2013

We created a dataset of syntactic-ngrams (counted dependency-tree fragments) based on a corpus of 3.5 million English books. The dataset includes over 10 billion distinct items covering a wide range of syntactic configurations. It also includes temporal information, facilitating new kinds of research into lexical semantics over time. This paper describes the dataset, the syntactic representation, and the kinds of information provided.

Keywords:

Information retrieval
Syntax
Natural language processing
Lexical semantics
Computer science
Artificial intelligence
Temporal information
