Similarity to a single set.

Lee Naish

2016

Identifying patterns and associations in data is fundamental to discovery in science. This work investigates a very simple instance of the problem, where each data point consists of a vector of binary attributes, and attributes are treated equally. For example, each data point may correspond to a person and the attributes may be their sex, whether they smoke cigarettes, whether they have been diagnosed with lung cancer, etc. Measuring similarity of attributes in the data is equivalent to measuring similarity of sets, since an attribute can be mapped to the set of data points which have the attribute. Furthermore, there is one identified base set (or attribute) and only similarity to that set is considered; the other sets are just ranked according to how similar they are to the base set. For example, if the base set is lung cancer sufferers, the set of smokers may well be high in the ranking. Identifying set similarity or correlation has many uses and is often the first step in determining causality. Set similarity is also the basis for comparing binary classifiers, such as diagnostic tests, for any data set. More than a hundred set similarity measures have been proposed in the literature, but there is very little understanding of how best to choose a similarity measure for a given domain. This work discusses numerous properties that similarity measures can have, weakening some previously proposed definitions so they are no longer incompatible, and identifying important forms of symmetry which have not previously been considered. It defines ordering relations over similarity measures and shows how some properties of a domain can be used to help choose a similarity measure which will perform well for that domain.
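
The mapping from attributes to sets, and the ranking of sets against a single base set, can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the Jaccard coefficient stands in for one of the many possible similarity measures (the abstract does not single one out), the function names `jaccard`, `attribute_sets` and `rank_by_similarity` are hypothetical, and the five records are made-up toy data.

```python
def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| (an assumed example measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def attribute_sets(records):
    """Map each binary attribute to the set of data point indices that have it."""
    sets = {}
    for idx, record in enumerate(records):
        for name, value in record.items():
            members = sets.setdefault(name, set())
            if value:
                members.add(idx)
    return sets


def rank_by_similarity(records, base, measure=jaccard):
    """Rank every non-base attribute by similarity of its set to the base set."""
    sets = attribute_sets(records)
    base_set = sets[base]
    scored = [
        (name, measure(base_set, members))
        for name, members in sets.items()
        if name != base
    ]
    # Most similar first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data, invented purely for illustration.
records = [
    {"smoker": True, "male": True, "lung_cancer": True},
    {"smoker": True, "male": False, "lung_cancer": True},
    {"smoker": True, "male": True, "lung_cancer": False},
    {"smoker": False, "male": True, "lung_cancer": False},
    {"smoker": False, "male": False, "lung_cancer": False},
]

ranking = rank_by_similarity(records, "lung_cancer")
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

On this toy data the set of smokers ranks above the set of males. Passing a different function as `measure` changes only the scoring step, which is the kind of choice the paper sets out to inform.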
