Record linkage for farm-level data analytics: Comparison of deterministic, stochastic and machine learning methods

2019 
Abstract: The advent of big data in agriculture has increased the need to extract useful information from large data collections. This knowledge is critical for optimizing production systems while also addressing prevailing issues such as sustainability. One of the first, yet crucial, steps in data analytics is integration. Integrated data from different sources can provide enhanced insight, as the sources may hold complementary information on the same entity. In the absence of a “unique universal identifier” to link entities (e.g. farms) across databases, it is necessary to rely on their recorded attributes (e.g. farm name, owner). We propose a fully automated framework to match farms across different datasets (i.e. farm matching) in a big data context. To assess performance, we used information on Brazilian beef cattle farms from two large datasets: 44,566 farms that made purchases at an animal nutrition company, and 32,776 that processed cattle at a meat packing company. Geographical search space reduction was implemented to reduce the number of comparisons evaluated. To compare attributes between farm pairs, we contrasted two edit-based approaches, the Levenshtein and Jaro-Winkler metrics. We also compared deterministic, stochastic (probabilistic), and machine learning (ML) approaches for classifying farm pairs as match or non-match. These techniques have been used in other record linkage domains. The deterministic approach requires all attributes to match exactly. The probabilistic approaches tested were Epi-Weights (CR) and Fellegi-Sunter (FS). The unsupervised ML approaches were k-means and bagged clustering (BC). The supervised methods were recursive partitioning trees, bagging of decision trees, bootstrap-based classification trees, stochastic boosting, support vector machines (SVM), single-layer neural networks and logistic regression.
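To make the comparison step concrete, the following is a minimal sketch, not the authors' implementation. It shows geographic blocking, a Levenshtein-based similarity, and the contrast between exact (deterministic) matching and a thresholded string comparison. The field names (`farm`, `owner`, `municipality`), the sample records, and the choice of municipality as the blocking key are all assumptions made for illustration.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Levenshtein distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def candidate_pairs(left, right, block_key):
    """Yield only the pairs that share the same blocking value."""
    blocks = defaultdict(list)
    for rec in right:
        blocks[rec[block_key]].append(rec)
    for rec_a in left:
        for rec_b in blocks.get(rec_a[block_key], []):
            yield rec_a, rec_b


def deterministic_match(rec_a, rec_b, fields):
    """Deterministic rule: every compared attribute must match exactly."""
    return all(rec_a[f] == rec_b[f] for f in fields)


# Hypothetical records, invented for this example only.
nutrition = [
    {"farm": "FAZENDA BOA VISTA", "owner": "JOAO SILVA", "municipality": "CUIABA"},
]
packing = [
    {"farm": "FAZ BOA VISTA", "owner": "JOAO SILVA", "municipality": "CUIABA"},
    {"farm": "FAZENDA BOA VISTA", "owner": "MARIA SOUZA", "municipality": "SORRISO"},
]

pairs = list(candidate_pairs(nutrition, packing, "municipality"))
for rec_a, rec_b in pairs:
    exact = deterministic_match(rec_a, rec_b, ["farm", "owner"])
    farm_sim = similarity(rec_a["farm"], rec_b["farm"])
    print(exact, round(farm_sim, 3))
```

In this toy case blocking discards the pair from a different municipality before any string comparison is made, and the abbreviated farm name defeats the exact rule while still scoring a high similarity. Per-attribute similarities of this kind are the sort of input that the probabilistic and ML classifiers would then consume.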
Labels were produced by specialist review for both a training set of 295,012 comparison pairs and a testing set of 32,780. All techniques were evaluated in terms of testing set quality (accuracy, precision, sensitivity, and specificity), completeness (number of matches) and efficiency (run-time). ML approaches outperformed deterministic matching, which was in turn superior to the probabilistic methods. Within ML approaches, supervised methods outperformed unsupervised ones (except for BC). Levenshtein was the best-performing string metric. The best classification method in terms of quality and completeness was SVM (accuracy = 99.9%, precision = 91.1%, sensitivity = 97.3%, specificity = 99.9%), followed by BC (accuracy = 99.9%, precision = 90.8%, sensitivity = 93.2%, specificity = 99.9%). Results indicate that SVM is suitable for farm matching when training labels are available, and BC when they are not.
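The four quality measures named above follow from a confusion matrix over the labelled comparison pairs. The sketch below uses the standard definitions; the counts are invented for illustration and are not taken from the study.

```python
def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def quality_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard classification quality measures for match / non-match labels."""
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": _ratio(tp, tp + fp),    # predicted matches that are true matches
        "sensitivity": _ratio(tp, tp + fn),  # true matches that were found
        "specificity": _ratio(tn, tn + fp),  # true non-matches that were rejected
    }


# Hypothetical counts: non-matches vastly outnumber matches,
# as is typical for comparison pairs in record linkage.
metrics = quality_metrics(tp=90, fp=10, tn=9890, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because non-matching pairs dominate, accuracy and specificity stay close to 1 even when precision and sensitivity are noticeably lower, which is why the latter two separate the methods more clearly.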