Empirical evaluation of tools for hairy requirements engineering tasks

2021 
A hairy requirements engineering (RE) task involving natural language (NL) documents is (1) a non-algorithmic task to find all relevant answers in a set of documents, that is (2) not inherently difficult for NL-understanding humans on a small scale, but is (3) unmanageable at a large scale. In performing a hairy RE task, humans need more help finding all the relevant answers than they do in recognizing that an answer is irrelevant. Therefore, a hairy RE task demands the assistance of a tool that focuses more on achieving high recall, i.e., finding all relevant answers, than on achieving high precision, i.e., finding only relevant answers. As close to 100% recall as possible is needed, particularly when the task is applied to the development of a high-dependability system. In this case, a hairy-RE-task tool that falls short of close to 100% recall may even be useless, because to find the missing information, a human has to do the entire task manually anyway. On the other hand, too much imprecision, i.e., too many irrelevant answers in the tool's output, means that manually vetting the output to eliminate the irrelevant answers may be too burdensome.

The reality is that all that can realistically be expected and validated is that the recall of a hairy-RE-task tool is higher than the recall of a human doing the task manually. Therefore, the evaluation of any hairy-RE-task tool must consider the context in which the tool is used, and it must compare the performance of a human applying the tool to do the task with the performance of a human doing the task entirely manually, in the same context. The context of a hairy-RE-task tool includes the characteristics of the documents being subjected to the task and the purposes of subjecting the documents to the task. Traditionally, however, many hairy-RE-task tools have been evaluated by considering only (1) how high their precision is, or (2) how high their F-measure is, which weights recall and precision equally, ignoring the context, and possibly leading to incorrect, often underestimated, conclusions about how effective a tool is.

To evaluate a hairy-RE-task tool, this article offers an empirical procedure that takes into account not only (1) the performance of the tool, but also (2) the context in which the task is being done, (3) the performance of humans doing the task manually, and (4) the performance of those vetting the tool's output. The empirical procedure uses (I) on the one hand, (1) the recall and precision of the tool, (2) a contextually weighted F-measure for the tool, (3) a new measure called summarization of the tool, and (4) the time required for vetting the tool's output, and (II) on the other hand, (1) the recall and precision achievable by, and (2) the time required by, a human doing the task manually.

The use of the procedure is shown for a variety of different contexts, including that of successive attempts to improve the recall of an imagined hairy-RE-task tool. The procedure is shown to be context dependent, in that the actual next step followed in any context depends on the values that have emerged in previous steps. Any recommendation for a hairy-RE-task tool to achieve close to 100% recall comes with caveats and may be required only in specific high-dependability contexts. Appendix C applies parts of this procedure to several hairy-RE-task tools reported in the literature, using data published about them.
The surprising finding is that some of the previously evaluated tools are actually better than they were thought to be when they were evaluated using mainly precision or an unweighted F-measure.
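To make the measures concrete, the following is a minimal Python sketch of the kinds of quantities the procedure compares. It assumes that the contextually weighted F-measure is the standard F_beta, in which beta states how many times more heavily recall is weighted than precision, and it approximates summarization as the fraction of the documents' items that a vetter is spared from examining. That approximation, the class and function names, and all numbers are illustrative assumptions, not values or definitions taken from the article.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one attempt at a hairy RE task, by a tool or by an unaided human."""
    true_positives: int    # relevant answers found
    false_positives: int   # irrelevant answers returned
    false_negatives: int   # relevant answers missed
    minutes_spent: float   # vetting time for a tool-assisted run, full task time for a manual run

    @property
    def recall(self) -> float:
        relevant = self.true_positives + self.false_negatives
        return self.true_positives / relevant if relevant else 0.0

    @property
    def precision(self) -> float:
        returned = self.true_positives + self.false_positives
        return self.true_positives / returned if returned else 0.0


def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted F-measure: beta > 1 weights recall beta times as heavily as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def summarization(answers_returned: int, items_in_documents: int) -> float:
    """Fraction of the documents' items the vetter need not examine (assumed definition)."""
    return 1.0 - answers_returned / items_in_documents


# Hypothetical numbers, purely for illustration of the comparison.
tool = TaskResult(true_positives=92, false_positives=240, false_negatives=8, minutes_spent=180)
manual = TaskResult(true_positives=75, false_positives=5, false_negatives=25, minutes_spent=960)

beta = 5  # context dependent: a high-dependability context weights recall heavily
print(f"tool:   R={tool.recall:.2f} P={tool.precision:.2f} "
      f"F_{beta}={f_beta(tool.precision, tool.recall, beta):.2f} "
      f"summarization={summarization(tool.true_positives + tool.false_positives, 4000):.2%} "
      f"vetting time={tool.minutes_spent} min")
print(f"manual: R={manual.recall:.2f} P={manual.precision:.2f} "
      f"task time={manual.minutes_spent} min")
```

With these made-up numbers the tool-assisted run has higher recall and far lower human time than the manual run, even though its raw precision is poor; that is the kind of context-dependent trade-off the procedure is designed to expose.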