Exploiting partially-labeled data in learning predictive clustering trees for multi-target regression: A case study of water quality assessment in Ireland

2020 
Abstract Providing clean drinking water for society is an important priority and a growing challenge in many regions. Increasing food production to feed a growing world population can lead to intensification of agricultural management systems and land use (sometimes appropriate, sometimes not), which may drastically increase the potential for pollution of surface and ground water bodies. To meet drinking water quality standards, the concentration of nutrients in the water must be below the limit thresholds set by the European Environment Agency. Consequently, predictive models of water pollution are crucially needed. To reduce nutrient (nitrogen and phosphorus) loss to water, existing data from monitoring programs, consisting of pressure (management) and pathway (environmental) variables, can be used. In this study, we use advanced data mining techniques to predict water quality and nutrient enrichment, using data from the national river water quality monitoring network across Ireland, based on environmental pressure (soil fertilization and grass growing season) and pathway (soil drainage characteristics, net rainfall and rainfall intensity) variables. More precisely, we use predictive clustering trees for multi-target regression (MTR) and random forest ensembles thereof to predict the values of three continuous target variables: biological water quality, nitrogen concentration and phosphorus concentration. The three targets can be dealt with independently, i.e., by building a separate local model for each target, or jointly, by building a global model that predicts all three targets simultaneously. An additional complexity in this kind of data is that not all of the examples in the dataset are completely labeled, i.e., some of the values of their target variables can be missing. In classical supervised machine learning algorithms, such ‘incomplete’ samples are discarded.
In our experiments, we propose to use methods that exploit all of the available data instead of discarding parts of them. To this end, we use predictive clustering trees, which can handle unlabeled and partially labeled data directly. We compare the performance of such trees to the performance of supervised regression trees, which use only complete data and are unable to exploit incomplete data. We build both single-target (i.e., local) and multi-target (i.e., global) models. Our results reveal that better performance can be achieved if incomplete data are exploited by predictive clustering trees rather than discarded. Moreover, global models are more practical for domain experts: they can be easily interpreted, as they predict all the targets simultaneously, while overfitting less and maintaining or even improving on the performance of local models.
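To make the idea of exploiting partially labeled examples concrete, the following is a minimal illustrative sketch, not the implementation used in the study. It assumes a variance-reduction split heuristic in which each target's variance is computed only over the examples where that target is known, so that an example with one missing target still contributes through its other targets. The names `known_variance`, `target_variances` and `split_score`, the toy data, and the simple size-based weighting of the two branches are all assumptions made for illustration.

```python
from statistics import pvariance


def known_variance(values):
    """Population variance of the known (non-None) values; 0.0 if fewer than two."""
    known = [v for v in values if v is not None]
    return pvariance(known) if len(known) > 1 else 0.0


def target_variances(rows):
    """Per-target variance over a list of target tuples, skipping missing values."""
    if not rows:
        return []
    return [known_variance(column) for column in zip(*rows)]


def split_score(x, Y, threshold):
    """Variance reduction of the split x <= threshold, averaged over all targets.

    Each target's variance is normalised by its variance in the parent node,
    so that targets measured on different scales contribute equally.
    """
    parent = target_variances(Y)
    scale = [v if v > 0 else 1.0 for v in parent]  # guard against division by zero

    def normalised(rows):
        variances = target_variances(rows)
        return sum(v / s for v, s in zip(variances, scale)) / len(scale)

    left = [y for xi, y in zip(x, Y) if xi <= threshold]
    right = [y for xi, y in zip(x, Y) if xi > threshold]
    n = len(Y)
    children = sum(len(part) / n * normalised(part) for part in (left, right) if part)
    return normalised(Y) - children


# Toy data: one descriptive variable and two targets; None marks a missing target value.
x = [1.0, 2.0, 3.0, 4.0]
Y = [(1.0, None), (1.0, 2.0), (5.0, None), (5.0, 6.0)]

# Score computed from all four examples, including the partially labeled ones.
score_all = split_score(x, Y, 2.5)

# A learner that discards incomplete examples would keep only these rows.
complete = [(xi, y) for xi, y in zip(x, Y) if None not in y]
```

In this toy case, a learner that discards incomplete examples is left with two of the four rows, whereas the score above is computed from all four. Fitting the same heuristic on a single target at a time would correspond to the local setting, and averaging over all targets, as here, to the global setting.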