Nonparametric approaches to statistical matching

2006 
Imputation of missing items is common practice in applied statistics. One of the most common approaches is hot deck: donors are selected from the respondents of a data set, and the donor values are imputed to the unobserved items. This approach aims at obtaining a completed, synthetic data set that is much easier to analyze. A natural question is: to what extent is the completed data set reliable? It can be used for any kind of analysis if the mechanism that generates imputations coincides with the random mechanism that generates the “true” observations. The discrepancy between these two processes is the matching noise (cf. Paass, 1985). This paper studies the matching noise of the distance hot deck procedure, as well as of some nonparametric alternatives, in the special case of statistical matching (cf. Rässler, 2002).

Let $A$ and $B$ be two samples of size $n_A$ and $n_B$, respectively, of independent and identically distributed (i.i.d.) records generated from a $(P + Q)$-dimensional random variable (r.v.) $(X, Z)$ with joint density function (d.f.) $f(x, z)$. From now on, $T_c^C$ denotes the $c$th observation of the variate $T$ in the sample $C$ ($C = A, B$; $c = 1, \dots, n_C$; $T = X, Z$). The r.v. $Z$ is not observed in $A$, and is imputed using $B$ as a set of donors via hot deck procedures. Distance hot deck consists in selecting, for each $a = 1, \dots, n_A$, the donor $b_1(a) \in B$ such that $d(x_a^A, x_{b_1(a)}^B) = \min_{b \in B} d(x_a^A, x_b^B)$, where $d(\cdot, \cdot)$ is a distance function. In the sequel we confine ourselves to Euclidean distances $d(x_a^A, x_b^B) = \{(x_b^B - x_a^A)' D (x_b^B - x_a^A)\}^{1/2}$, $D$ being a positive definite matrix. The completed $A$ consists of the records $(x_a^A, z_{b_1(a)}^B)$. As remarked in Chen and Shao (2001), distance hot deck provides asymptotically unbiased and consistent estimators for population means as well as for quantiles. No attempt to define the corresponding matching noise has been made.
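The donor selection rule of distance hot deck can be illustrated with a minimal sketch. The function name `distance_hot_deck`, the toy arrays, and the default choice of $D$ as the identity matrix are assumptions made for illustration only; they do not come from the paper. The sketch minimises the quadratic form in $D$ directly, since a monotone transformation of the distance (such as a square root) does not change which donor is nearest.

```python
import numpy as np

def distance_hot_deck(x_A, x_B, z_B, D=None):
    """Impute Z for each record of A from its nearest donor in B (illustrative sketch)."""
    x_A = np.asarray(x_A, dtype=float)
    x_B = np.asarray(x_B, dtype=float)
    z_B = np.asarray(z_B, dtype=float)
    P = x_A.shape[1]
    # Assumption: D defaults to the identity, i.e. the plain Euclidean case.
    D = np.eye(P) if D is None else np.asarray(D, dtype=float)
    # diff[a, b, :] holds x_b^B - x_a^A for every recipient/donor pair.
    diff = x_B[None, :, :] - x_A[:, None, :]
    # Quadratic form (x_b - x_a)' D (x_b - x_a) for every pair (a, b).
    q = np.einsum("abp,pq,abq->ab", diff, D, diff)
    # b_1(a): index of the donor minimising the distance for each a.
    donors = np.argmin(q, axis=1)
    # Completed records pair x_a^A with the donor's z value.
    return donors, z_B[donors]

# Toy data (hypothetical): two recipients in A, three donors in B.
x_A = np.array([[0.0, 0.0], [5.0, 5.0]])
x_B = np.array([[4.0, 6.0], [0.5, -0.5], [10.0, 10.0]])
z_B = np.array([[1.0], [2.0], [3.0]])

donors, z_imputed = distance_hot_deck(x_A, x_B, z_B)
print(donors)      # donor index chosen for each record of A
print(z_imputed)   # imputed Z values for the completed A
```

The pairwise computation uses memory of order $n_A \times n_B \times P$, which is adequate for a small illustration; larger samples would call for a tree-based nearest-neighbour search instead.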
Distance hot deck is nonparametric because it does not require any specific parametric assumption on the d.f. of (X, Z). Furthermore, it corresponds to using the k nearest neighbour (kNN) method with k = 1. It can easily be generalized to k > 1.
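One possible reading of the kNN generalization is sketched below. The text does not say how the imputed value is obtained from the k nearest donors, so drawing one donor uniformly at random among them is an assumption of this sketch, as are the function name `knn_hot_deck`, the toy data, and the use of the unweighted Euclidean distance. With k = 1 the sketch reduces to selecting the single nearest donor.

```python
import numpy as np

def knn_hot_deck(x_A, x_B, z_B, k=1, rng=None):
    """Impute Z in A from one donor drawn among the k nearest records of B (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    x_A = np.asarray(x_A, dtype=float)
    x_B = np.asarray(x_B, dtype=float)
    z_B = np.asarray(z_B, dtype=float)
    n_A = x_A.shape[0]
    # Squared Euclidean distance between every recipient a and donor b.
    d2 = ((x_B[None, :, :] - x_A[:, None, :]) ** 2).sum(axis=2)
    # Indices of the k nearest donors for each recipient, closest first.
    nearest = np.argsort(d2, axis=1)[:, :k]
    # Assumption: pick one of the k candidates uniformly at random.
    pick = rng.integers(0, k, size=n_A)
    donors = nearest[np.arange(n_A), pick]
    return donors, z_B[donors]

# Toy data (hypothetical): two recipients in A, three donors in B.
x_A = np.array([[0.0, 0.0], [5.0, 5.0]])
x_B = np.array([[4.0, 6.0], [0.5, -0.5], [10.0, 10.0]])
z_B = np.array([[1.0], [2.0], [3.0]])

donors_k1, z_k1 = knn_hot_deck(x_A, x_B, z_B, k=1)
donors_k2, z_k2 = knn_hot_deck(x_A, x_B, z_B, k=2, rng=np.random.default_rng(0))
```

Other choices, such as averaging the z values of the k nearest donors, are equally compatible with the text; the random draw is used here only because it keeps every imputed value an observed donor value.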