New Algorithms for Monotone Classification

2021 
In \emph{monotone classification}, the input is a set $P$ of $n$ points in $d$-dimensional space, where each point carries a label 0 or 1. A point $p$ \emph{dominates} another point $q$ if the coordinate of $p$ is at least that of $q$ on every dimension. A \emph{monotone classifier} is a function $h$ mapping each $d$-dimensional point to $\{0, 1\}$, subject to the condition that $h(p) \ge h(q)$ holds whenever $p$ dominates $q$. The classifier $h$ \emph{mis-classifies} a point $p \in P$ if $h(p)$ differs from the label of $p$. The \emph{error} of $h$ is the number of points in $P$ mis-classified by $h$. The objective is to find a monotone classifier with a small error. The problem is fundamental to numerous database applications in entity matching, record linkage, and duplicate detection. This paper studies two variants of the problem.

In the first, \emph{active} version, all the labels are hidden at the beginning; an algorithm must pay a unit cost to \emph{probe} (i.e., reveal) the label of a point in $P$. We prove that $\Omega(n)$ probes are necessary to find an optimal classifier even in one-dimensional space ($d = 1$). On the other hand, given an arbitrary $\epsilon > 0$, we show how to obtain (with high probability) a monotone classifier whose error is worse than the optimum by at most a factor of $1 + \epsilon$, while probing $\tilde{O}(w/\epsilon^2)$ labels, where $w$ is the dominance width of $P$ and $\tilde{O}(\cdot)$ hides a polylogarithmic factor. For constant $\epsilon$, the probing cost matches an existing lower bound up to an $\tilde{O}(1)$ factor.

In the second, \emph{passive} version, the point labels in $P$ are explicitly given; the goal is to minimize CPU computation in finding an optimal classifier. We show that the problem can be settled in time polynomial in both $d$ and $n$.
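To make the definitions concrete, here is a minimal Python sketch (illustrative only, not from the paper; all names are our own): `dominates` implements the coordinate-wise dominance test, `is_monotone` verifies the monotonicity constraint over a point set, and `error` counts mis-classified points.

```python
from typing import Callable, Sequence

Point = Sequence[float]

def dominates(p: Point, q: Point) -> bool:
    """True if p's coordinate is at least q's on every dimension."""
    return all(pi >= qi for pi, qi in zip(p, q))

def is_monotone(h: Callable[[Point], int], points: Sequence[Point]) -> bool:
    """Check that h(p) >= h(q) holds whenever p dominates q (over the given points)."""
    return all(h(p) >= h(q) for p in points for q in points if dominates(p, q))

def error(h: Callable[[Point], int],
          points: Sequence[Point],
          labels: Sequence[int]) -> int:
    """Number of points in P that h mis-classifies."""
    return sum(1 for p, y in zip(points, labels) if h(p) != y)
```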
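The active version's cost model can be phrased as a label oracle that charges one unit per reveal. The class below is a hypothetical sketch of that interface (the name `LabelOracle` is ours, not the paper's); an active algorithm would call `probe` as sparingly as possible, and `cost` tallies how many labels it has paid for.

```python
class LabelOracle:
    """Active-version model: labels start hidden; each probe costs one unit."""

    def __init__(self, labels: Sequence[int]):
        self._labels = list(labels)  # hidden from the algorithm
        self.cost = 0                # total probing cost paid so far

    def probe(self, i: int) -> int:
        """Reveal the label of the i-th point, paying a unit cost."""
        self.cost += 1
        return self._labels[i]
```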
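For intuition about the passive version, note that in one dimension ($d = 1$) a monotone classifier restricted to $P$ behaves like a threshold rule: $h(x) = 1$ iff $x \ge t$. An optimal 1-D classifier can therefore be found by exhaustive search over candidate thresholds. The sketch below is this brute-force baseline only; it is not the paper's polynomial-time algorithm for general $d$.

```python
def best_threshold_1d(xs: Sequence[float],
                      labels: Sequence[int]) -> tuple[float, int]:
    """Find the 1-D monotone classifier (threshold rule) with minimum error.

    O(n^2) for clarity; sorting plus prefix sums would give O(n log n).
    """
    # Candidate thresholds: -inf (classify all as 1), each distinct
    # coordinate, and +inf (classify all as 0).
    candidates = [float("-inf")] + sorted(set(xs)) + [float("inf")]
    best_t, best_err = float("inf"), len(xs) + 1
    for t in candidates:
        err = sum(1 for x, y in zip(xs, labels) if (1 if x >= t else 0) != y)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```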