Exploiting Coreference Annotations for Text-to-Hypertext Conversion.

2004 
This paper describes an annotation scheme for coreference developed within the application context of text-to-hypertext conversion. In this context, coreference is used (1) to generate document-internal and cross-document hyperlinks, and (2) to resolve anaphoric expressions in order to achieve cohesive closedness in hypertext nodes. We argue that, for the purpose of cross-document linking, the annotation of coreference relations must be separated from the annotation of anaphoric relations. To meet this requirement, we developed a knowledge-based annotation scheme that relates referential expressions in the text to entities in a knowledge representation, which is modeled using XML Topic Maps.

1. Project Framework

Converting linear text documents into documents that can be published in a hypertext environment is a complex task. It requires conversion software on the technical side as well as conversion strategies and methods on the conceptual side. In the project HyTex¹, which is the framework of the approach discussed in this paper, we concentrate on principles and strategies for handling conceptual problems of text-to-hypertext conversion such as:

• Segmentation: What are the criteria for segmenting documents into text segments to be used as hypertext nodes?
• Reorganization: What are the guidelines for generating “cohesive closedness” in hypertext nodes, i.e. what kinds of transformations are necessary to unchain text segments from their linkage to the reading path of the sequential document, so that they may be integrated into different user-selected pathways?
• Linking: What are the guidelines and principles for reconnecting the nodes via hyperlinks?

Using XML as the technical basis for hypertext modeling and viewing, the project develops strategies and methods which (semi-)automatically create hypertext layers and views based on text-grammatical annotations.
By storing the hypertext as additional document layers, our approach preserves the structure and content of the original text documents and thus gives the reader the choice between sequential and selective reading modes. The general aim of the project is to support selective hypertext readers in finding coherent pathways through the document network, and thus to make selective reading and browsing more efficient and more convenient than would be possible with print media. Feasibility and performance of the methodology are tested and evaluated using a German text corpus comprising documents that deal with two subject domains, namely “text technology” and “hypertext research” (Lenz & Storrer, 2002).

¹ The acronym “HyTex” is spelt out as Hypertextualisierung auf textgrammatischer Grundlage (‘Hypertext conversion on a textgrammatical basis’). The project was launched in November 2001 as part of the research group Texttechnologische Informationsmodellierung (‘Text-technological information modelling’), cf. http://www.text-technology.de. For more information on the HyTex project see http://www.hytex.info.

The central idea of the conversion approach in HyTex is to base strategies for segmentation, reorganization and linking on information coming from two levels:

• On the document level, we explicitly mark up the text-grammatical structures and relations between text segments, e.g. coreference relations, semantics of connectives, text-deictic expressions, and expressions indicating topic handling.
• On the domain knowledge level, we represent the main concepts of the subject domain and their interrelations, using the WordNet model (Fellbaum, 1998) as the conceptual basis and XML Topic Maps (XTM, 2001) as the technical basis (Beiswenger & Storrer & Runte, 2004; Lenz & Birkenhage & Maas, 2004).

A dynamic-adaptive component that processes usage logs has been considered but has not been put into practice during the current phase of the project.
In a later stage, this document usage level would supply information about the hypertext nodes a user has already visited, and thereby about the prior knowledge that user already has.

2. Annotation of Coreference Phenomena

The focus of this paper is on an annotation scheme for coreference phenomena. This scheme serves two purposes in our approach: firstly, generating document-internal and cross-document hyperlinks, cf. 2.3., and secondly, resolving anaphoric expressions in order to achieve cohesive closedness in hypertext nodes, cf. 2.4. Focusing on these two tasks, we now discuss how a proper annotation of the relations of coreference and anaphora can be exploited for text-to-hypertext conversion in the above-mentioned framework. We argue that existing annotation schemes need to be extended in order to meet this task. As a result, we propose a new annotation scheme that encodes coreference as a relation between the document level and the domain knowledge level. This makes it possible to strictly separate the annotation of coreference from the annotation of anaphoric relations. Furthermore, the paper describes how the presented scheme can be employed to annotate the sequentially organized documents contained in the HyTex text corpus.

2.1. Existing Annotation Schemes

Existing annotation formats such as the proposal of the Text Encoding Initiative (TEI), the task definition of the Message Understanding Conferences (MUC) and the annotation guidelines published by the project Multilevel Annotation Tools Engineering (MATE) treat coreference as one specific type of a generalized anaphoric relation. Arguing from a semantic viewpoint, van Deemter & Kibble (2001) point out some fundamental problems with this general practice, using the MUC annotation exercise as an example. They argue that anaphora and coreference are in fact two different things.
Coreference constitutes an equivalence relation; anaphoric relations, by contrast, are irreflexive, nonsymmetrical, and nontransitive. Although anaphoric and coreferential relations can coincide, it is not generally the case that all coreferential relations are anaphoric, nor are all anaphoric relations coreferential. For instance, nonreferring NPs can enter anaphoric relations and thus should not be marked as coreferential, cf. (1a). Moreover, the notion of coreference may not be applied to bound anaphora, cf. (1b), or to intensional contexts, cf. (1c).²

(1) a. Whenever a solution emerged, we embraced it.
    b. Every TV network reported its profits.
    c. Henry Higgins, who was formerly sales director of Sudsy Soaps, became president of Dreamy Detergents.

² The examples are taken from (van Deemter & Kibble, 2001).

In addition to these linguistic arguments, there are practical reasons that militate in favor of annotating coreference and anaphora separately. Investigating a hypertext base, one will encounter cases where two items are coreferential without being anaphoric. For example, two mentions of the same entity, e.g. a person, in two different hypertext documents are coreferential, but will not stand in an anaphoric relation. This type of coreference is also termed ‘cross-document coreference’; cf. (Baldwin & Bagga, 1998; Mitkov, 2003; Holler-Feldhaus, in press). Thus, marking coreference on the document level only fails to take into account that coreference is established with regard to entities in a real or mental world. To account for the coreference phenomena observable in a hypertext base, a coreference scheme is needed which does not presuppose that coreferential items are always anaphoric as well.

2.2. Knowledge-based Annotation of Coreference

As mentioned before, HyTex implements a two-level architecture.
The first level comprises the manually annotated documents of the hyperbase, whereas the second level represents the domain knowledge, which is modelled as an XML Topic Map, the so-called TermNet. Exploiting this architecture, we suggest regarding coreference as a relation between expressions occurring in a document and entries of the TermNet. Two expressions occurring in (possibly different) documents of the hyperbase are identified as coreferential if they point to the same term of the domain knowledge model. Since coreference is analyzed as a relation between items of the document level and items of the domain knowledge level, we no longer presume that two expressions stand in an anaphoric relation if they are marked as coreferential.

We now show how this basic idea is realized in the definition of our annotation scheme. First of all, relevant terms are marked as nominal discourse entities by means of a dedicated tag. For this element two attributes are defined: deID, whose value enumerates all discourse entities, and deType, which specifies the semantic type of the respective entity.³ For annotating coreference, a link element as given in (2) is introduced. This element describes the relation between a text item, given by its deIDref value, and a referential anchor in the topic-map based TermNet, represented by the value of the tmIDRef attribute.

(2)

The term ‘link’ in sentence (3a), taken from our text corpus, is for example annotated as shown in (3b).

(3) a. Link L verknüpft A mit B im Hinblick auf C.
       ‘Link L connects A with B with regard to C.’
    b. Link verknüpft A mit B im Hinblick auf C.

In addition to the link element, a further element is introduced into the annotation scheme, cf. (4). This element is used to annotate document-internal anaphoric relations. It bears three attributes: relType, phorIDRef and antecedentIDRef. As you can see from

³ In principle, it is possible to mark entities different from nominals such as abstract objects by using the developed
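
The knowledge-based notion of coreference introduced in Section 2.2 can be illustrated with a small Python sketch. It is not the HyTex implementation. It only assumes that each annotated discourse entity carries a deID and is linked to a TermNet topic via a tmIDRef value, as described above. The class name, the document identifiers, the surface forms and the topic IDs ("topic-link", "topic-node") are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mention:
    """One annotated discourse entity (hypothetical in-memory form)."""
    doc: str        # hypothetical document identifier
    de_id: str      # value of the deID attribute
    text: str       # surface form of the expression
    tm_id_ref: str  # value of tmIDRef: the TermNet topic pointed to


def coreferential(a: Mention, b: Mention) -> bool:
    """Two mentions corefer iff they point to the same TermNet topic."""
    return a.tm_id_ref == b.tm_id_ref


def coreference_classes(mentions):
    """Group mentions into classes, one class per TermNet topic."""
    classes = defaultdict(list)
    for m in mentions:
        classes[m.tm_id_ref].append(m)
    return dict(classes)


# Hypothetical sample data: two documents, two topics.
mentions = [
    Mention("doc1", "de1", "Link", "topic-link"),
    Mention("doc1", "de2", "Knoten", "topic-node"),
    Mention("doc2", "de1", "Hyperlink", "topic-link"),
]

classes = coreference_classes(mentions)

# Cross-document coreference: same topic, different documents.
# No anaphoric relation between the two mentions is assumed.
cross_doc_pairs = [
    (a, b)
    for group in classes.values()
    for a, b in combinations(group, 2)
    if a.doc != b.doc
]
```

Because coreference is derived from the shared topic alone, it is an equivalence relation by construction, and the pairs in cross_doc_pairs are candidates for cross-document hyperlinks. Document-internal anaphoric relations would be recorded separately, for instance as records carrying relType, phorIDRef and antecedentIDRef values.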