Linking Linguistic Resources: time aligned corpus and dictionary

2002 
We present preliminary results in linking computerized, multimedia speech documents of the LACITO Archive project to a computerized dictionary. The speech documents are time-aligned with recordings and have a structure defined by an XML DTD, which has been presented elsewhere (Michailovsky 2001, Jacobson and Michailovsky 2000). Over 70 of these documents, including the sound, may be consulted on the LACITO Archive Project site (http://www.lacito.archivage.vjf.cnrs.fr). The dictionary we will start with is a Limbu-English bilingual dictionary originally developed in a plain-ASCII structured format for use with Robert Hsu's LEXWARE suite of modules for lexicography, and recently converted to a TEI-inspired XML format. Limbu is a Tibeto-Burman language of Eastern Nepal. A basic design philosophy of the LACITO Archive project has been to keep the markup of speech documents simple, or at least to allow for simple markup. This makes it easy for researchers to mark up large amounts of text, perhaps reserving more detailed markup for a few demonstration texts or texts of particular interest. To compensate, we would like to be able to link items in running text to dictionary entries, which in our view is where much of the detail belongs, although these, too, may start out simply. The dictionary entries supply lexical information that is not in the text markup, and which should not have to be repeated every time a word occurs in a corpus. Further, we would like to see how far we can get with automatic linking (that is, without having to hand-lemmatize text items), even if it means that not all items link correctly and unambiguously to dictionary entries. A background assumption is that many linguists will simultaneously be working on texts and dictionaries.
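The idea of automatic linking without hand-lemmatization can be illustrated with a minimal sketch. It is not the project's implementation: the element names (`dictionary`, `entry`, `form`, `gloss`), the entry identifiers, and the word forms are hypothetical placeholders, not the actual LACITO DTD or Limbu dictionary content. Each text item is looked up by its surface form only, so some items link to one entry, some to several, and some to none.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical TEI-like dictionary fragment. Element names, ids, and forms
# are illustrative placeholders, not the real dictionary markup or data.
DICTIONARY_XML = """
<dictionary>
  <entry id="e1"><form>ba</form><gloss>first sense</gloss></entry>
  <entry id="e2"><form>ba</form><gloss>second sense</gloss></entry>
  <entry id="e3"><form>tok</form><gloss>third sense</gloss></entry>
</dictionary>
"""


def build_index(xml_text):
    """Map each lowercased headword form to the ids of entries that carry it."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)
    for entry in root.iter("entry"):
        form = entry.findtext("form")
        if form:
            index[form.strip().lower()].append(entry.get("id"))
    return index


def link_words(words, index):
    """Link each text item to dictionary entries by surface form alone.

    No lemmatization is attempted, so an item may come back as
    'unique', 'ambiguous' (several candidate entries), or 'unlinked'.
    """
    links = []
    for word in words:
        ids = index.get(word.lower(), [])
        if not ids:
            status = "unlinked"
        elif len(ids) == 1:
            status = "unique"
        else:
            status = "ambiguous"
        links.append((word, list(ids), status))
    return links


if __name__ == "__main__":
    index = build_index(DICTIONARY_XML)
    for word, ids, status in link_words(["tok", "ba", "xyz"], index):
        print(word, ids, status)
```

Keeping the ambiguous and unlinked cases explicit, rather than forcing a single guess, matches the stated willingness to accept that not all items will link correctly and unambiguously.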