Attributing value in a data pooling setting for predictive modeling

2017 
The rapid growth of data sources comes with numerous challenges, one of which is determining their value. When building prediction models from different data sources, it is useful to know how much each feature has contributed to a specific prediction. This gives an idea of how the benefits created by the prediction model could be divided over the features responsible for it. The goal of this paper is to define, solve and evaluate a data attribution scheme for predictive modeling that is “fair”, where fairness is defined using concepts from game theory. We use two methods from different research fields to distribute the value both on an instance level and ultimately on a feature level: the (approximate) Shapley value and an explanation approach for high-dimensional data. Using a high-dimensional and sparse data set consisting of website visits for each user, we show that (i) the proposed methods make it possible to create a fair value distribution among a very large number of data sources (websites in this case) in a prediction model, and (ii) they explain twice as many instances for a given number of features as compared to just looking at the high-coefficient features. Interestingly, (iii) although the proposed methods come from different sources and motivations, the two methods provide strikingly similar rankings of important features and divisions of the revenues.
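To illustrate what an approximate Shapley value computes, here is a minimal sketch of the generic permutation-sampling estimator. It is not taken from the paper, and the abstract does not say which approximation the authors use. The function `approx_shapley`, the toy value function `toy_value`, and the website names and weights are all hypothetical.

```python
import random

def approx_shapley(players, value, n_permutations=2000, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled orderings of the players (features)."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(n_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        prev = value(coalition)
        for p in order:
            coalition.add(p)
            cur = value(coalition)
            totals[p] += cur - prev  # marginal contribution of p in this ordering
            prev = cur
    return {p: t / n_permutations for p, t in totals.items()}

# Hypothetical toy game: each website adds its own weight, and
# site_a and site_b together yield an extra bonus of 2.0.
WEIGHTS = {"site_a": 3.0, "site_b": 1.0, "site_c": 0.0}

def toy_value(coalition):
    total = sum(WEIGHTS[p] for p in coalition)
    if "site_a" in coalition and "site_b" in coalition:
        total += 2.0
    return total

phi = approx_shapley(list(WEIGHTS), toy_value)
print(phi)
```

In this toy game the exact Shapley values are 4, 2 and 0: the bonus is split equally between `site_a` and `site_b`, and `site_c` receives nothing. The estimates always sum to the value of the full coalition, which is the efficiency property that makes the division of revenues "fair" in the game-theoretic sense.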