Advantages, Opportunities and Limits of Empirical Evaluations: Evaluating Adaptive Systems

2002 
While empirical evaluations are a common research method in some areas of Artificial Intelligence (AI), others still neglect this approach. This article outlines both the opportunities and the limits of empirical evaluations of AI techniques, exemplified by the evaluation of adaptive systems. Using the so-called layered evaluation approach, we demonstrate that empirical evaluations are able to identify errors in AI systems that would otherwise remain undiscovered. To encourage new evaluations, we have implemented an online database of studies concerned with the empirical evaluation of adaptive systems (EASy-D).

1 Advantages: Why Evaluations are needed

Some areas of AI apply empirical methods regularly. For example, planning and search algorithms are benchmarked in standard domains, and machine learning algorithms are usually tested with real data sets. In applied areas such as user modeling, however, empirical studies are rare: only a quarter of the articles published in User Modeling and User-Adapted Interaction (UMUAI) report significant empirical evaluations [4], and many of these include only a simple study with a small sample size, often without any statistical methods. On the other hand, estimating the effectiveness, the efficiency, and the usability of a system that applies AI techniques in real-world scenarios absolutely requires empirical research. User modeling techniques in particular, as they are based on human-computer interaction, require empirical evaluation; otherwise, as we demonstrate in this paper, certain types of errors will remain undiscovered. Undoubtedly, verification, formal correctness, and testing are important methods of software engineering. We argue, however, that empirical evaluation, seen as an important complement, can improve AI techniques considerably. Moreover, the empirical approach is an important way both to justify the effort spent and to provide evidence of the usefulness of an approach.

2 Opportunities: What we may learn from Empirical Evaluations

According to Cohen [5], empirical methods for AI should answer three basic research questions: How will a change in the agent's structure affect its behavior given a task and an environment? How will a change in an agent's task affect its behavior in a particular environment? How will a change in an agent's environment affect its behavior on a particular task? These questions may be answered by a combination of four kinds of empirical studies: exploratory studies that yield causal hypotheses; assessment studies that establish baselines, ranges, and benchmarks; manipulation experiments that test hypotheses about causal influences; and observation experiments (or quasi-experiments) that disclose effects of factors on measured variables without random assignment of treatments [5, 9].

These general, goal-defining questions have to be specified for each area of AI. As an illustrative example, we outline the opportunities of empirical evaluations for adaptive systems and user modeling; similar results can be obtained for other AI systems. The evaluation of adaptive systems can be seen as a layered process in which each evaluation layer is a prerequisite for the subsequent layers (see KI-Lexikon). Three such approaches have been proposed [3, 12, 16] that differ mainly in the granularity of the layers. Here we outline the four evaluation layers introduced by Weibelzahl [16, 17].
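To make the layer structure concrete before walking through it, the following sketch shows how such a layered adaptive system might be organised in code. It is a minimal illustration only; all class, method, and cue names are hypothetical and not taken from [16, 17].

```python
# Minimal sketch of the four layers of an adaptive system.
# All names are hypothetical; real systems differ widely in detail.

class AdaptiveSystem:
    def acquire_input(self, interaction_log):
        """Layer 1: observe the user and register events or behavior cues."""
        return [e for e in interaction_log if e.get("type") == "cue"]

    def infer_user_properties(self, cues):
        """Layer 2: infer abstract user properties from the input data."""
        # Naive illustrative rule: frequent help requests -> low domain knowledge.
        help_requests = sum(1 for c in cues if c.get("name") == "help_request")
        return {"domain_knowledge": "low" if help_requests > 3 else "high"}

    def decide_adaptation(self, user_model):
        """Layer 3: decide what and how to adapt."""
        verbose = user_model["domain_knowledge"] == "low"
        return {"explanations": "verbose" if verbose else "terse"}

    def present(self, decision):
        """Layer 4: present the adapted interface to the user."""
        print(f"Rendering UI with {decision['explanations']} explanations")


system = AdaptiveSystem()
log = [{"type": "cue", "name": "help_request"}] * 5
system.present(system.decide_adaptation(
    system.infer_user_properties(system.acquire_input(log))))
```

Because each layer is a separate step, each can be evaluated against its own ground truth, which is exactly what the layered approach demands.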
Figure 1 shows the four layers: During interaction, the adaptive system observes the user and registers certain events or behavior cues (1). Based on these input data, abstract user properties are inferred (2). Finally, the system decides what and how to adapt (3) and presents the adapted interface to the user (4). Each layer has to be evaluated to guarantee adaptation success.

2.1 Evaluation of Reliability and Validity of Input Data

The first layer evaluates the reliability and the external validity of the input data (see KI-Lexikon). Unreliable input data would result in misadaptations. For example, Spooner and Edwards [13] tried to identify typical errors of dyslexic authors in order to improve a spell-checking system. In an exploratory study they evaluated the stability of certain errors.
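For layer 1, one simple question is whether a registered behavior cue is stable enough to adapt to at all. The following sketch estimates test-retest reliability as the correlation of error counts of the same authors across two sessions; the counts are invented for illustration and are not data from [13].

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation, here read as a test-retest reliability estimate."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical counts of one error type for the same eight authors,
# logged in two separate writing sessions (illustration only).
session_1 = [12, 7, 15, 3, 9, 11, 5, 8]
session_2 = [10, 8, 14, 4, 9, 12, 6, 7]

print(f"test-retest reliability r = {pearson_r(session_1, session_2):.2f}")
```

A high correlation suggests a stable, adaptable error pattern; a low one indicates that adapting to this cue would risk the misadaptations described above.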