Extracting Aggregate Answer Statistics for Integration

2015 
Aggregate queries in integration contexts often do not have one "true" answer; there can be multiple correct answers for the same aggregate query. This is due to duplicate or overlapping data points, possibly with different values, across the data sources. Depending on which combination of data sources is used to answer the query, different answers can be generated. Representing the answer to an aggregate query as an answer distribution, instead of a single scalar value, therefore allows users to better understand the range of possible answers. This work provides a suite of methods for extracting statistics that convey meaningful information about aggregate query answers in heterogeneous integration settings. We focus on the following challenges: (1) determining which statistics best represent an answer's distribution; and (2) efficiently computing the desired statistics. Our solution includes the following answer statistics: (1) a set of point estimates with confidence intervals; (2) a high coverage interval that reveals "hot areas" in a distribution; and (3) a stability score that measures the impact of source dynamics. We optimize the extraction of this statistical information by minimizing the sampling load and applying fast approximate algorithms. We verify the effectiveness and efficiency of our methods with empirical studies using real-life data sets and scaled synthetic data sets.
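To make the idea of an answer distribution concrete, the following is a minimal Python sketch, not the paper's method. It assumes three hypothetical sources with overlapping keys, resolves duplicates by letting the first source in a randomly sampled order win, and then computes two of the statistics named above: a mean with a normal-approximation confidence interval, and the shortest interval covering a chosen fraction of the sampled answers. All names (`SOURCES`, `answer_for`, `sample_answers`, `mean_with_ci`, `shortest_coverage_interval`) and the toy values are assumptions made for illustration. The stability score is omitted because the abstract does not define it.

```python
import math
import random
import statistics

# Hypothetical toy data: each source maps an entity key to a reported value.
# Keys overlap across sources, and the overlapping values disagree.
SOURCES = {
    "A": {"x": 10.0, "y": 20.0},
    "B": {"y": 22.0, "z": 30.0},
    "C": {"x": 11.0, "z": 29.0},
}


def answer_for(order):
    """SUM over entities; for duplicates, the first source in `order` wins."""
    merged = {}
    for name in order:
        for key, value in SOURCES[name].items():
            merged.setdefault(key, value)
    return sum(merged.values())


def sample_answers(n_samples, rng):
    """Sample random ordered source combinations and collect their answers."""
    names = sorted(SOURCES)
    answers = []
    for _ in range(n_samples):
        k = rng.randint(1, len(names))   # how many sources to use
        order = rng.sample(names, k)     # which ones, in which priority order
        answers.append(answer_for(order))
    return answers


def mean_with_ci(answers, z=1.96):
    """Sample mean with a normal-approximation confidence interval."""
    m = statistics.fmean(answers)
    half = z * statistics.stdev(answers) / math.sqrt(len(answers))
    return m, (m - half, m + half)


def shortest_coverage_interval(answers, coverage=0.8):
    """Shortest interval containing at least `coverage` of the samples."""
    xs = sorted(answers)
    n = len(xs)
    k = max(1, math.ceil(coverage * n))
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]


if __name__ == "__main__":
    rng = random.Random(0)
    answers = sample_answers(500, rng)
    print("mean and CI:", mean_with_ci(answers))
    print("80% coverage interval:", shortest_coverage_interval(answers))
```

Uniform random sampling is used here only for simplicity; the abstract states that the actual methods minimize the sampling load and use fast approximate algorithms, which this sketch does not attempt.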