Abstract The COVID-19 pandemic led to unparalleled pressure on healthcare services. Improved healthcare planning in relation to diseases affecting the respiratory system has consequently become a key concern. We investigated the value of integrating sales of non-prescription medications commonly bought for managing respiratory symptoms to improve forecasting of weekly registered deaths from respiratory disease at local levels across England, using over 2 billion transactions logged by a UK high street retailer from March 2016 to March 2020. We report the results from the novel AI explainability variable importance tool Model Class Reliance implemented on the PADRUS model. PADRUS is a machine learning model optimised to predict registered deaths from respiratory disease in 314 local authority areas across England through the integration of shopping sales data, with a focus on purchases of non-prescription medications. We found strong evidence that models incorporating sales data significantly outperform models that solely use variables traditionally associated with respiratory disease (e.g. sociodemographics and weather data). Accuracy gains are highest (increases in R² of between 0.09 and 0.11) in periods of maximum risk to the general public. Results demonstrate the potential to utilise sales data to monitor population health with information at a high level of geographic granularity.
Introduction & Background: The COVID-19 pandemic led to unparalleled pressure on healthcare services, highlighting the need for improved healthcare planning for respiratory disease outbreaks. With rapid virus diversification, and correspondingly rapid shifts in symptom expression, there is often a complete lack of representative clinical testing data available to modellers. This is especially true at the onset of outbreaks, when traditional epidemiological and statistical approaches that rely on case data ‘ground truths’ are extremely challenging to apply. In this abstract we preview the results of two novel studies that investigate how digital footprint data - in the form of over-the-counter medication sales - might serve as a predictive proxy for underlying and often hidden disease incidence, and the extent to which such data might improve mortality rate forecasting at local area levels.
Objectives & Approach: Over 2 billion transactions logged by a UK high-street health retailer were collated across English local authorities (n=314), generating weekly variables corresponding to a range of health purchase behaviours (e.g. cough mixture / pain-relief sales) in each authority. These purchase data were additionally linked to a set of independent variables describing each local authority’s (1) weekly environment (e.g. weather, temperature, pollution), (2) socio-demographics (e.g. age distributions, deprivation levels, population densities) and (3) available local test case data. Machine learning regression models were then deployed to investigate the ability of each of these variable sets to underpin predictions of weekly registered deaths in the 314 authorities that were due to COVID-19 between Apr 2020 - Dec 2021 (Study 1) or to general respiratory disease between Mar 2016 - Mar 2020 (Study 2). All models were rigorously tested out-of-sample via walk-forward cross-validation, and across a range of forecast windows.
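The walk-forward testing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the studies' implementation: the function name `walk_forward_splits`, its parameters, and the expanding training window are all introduced here for illustration. The property it shows is that every training week precedes every test week, separated by a gap that matches the forecast window.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_weeks: int, min_train: int, horizon: int = 1, test_size: int = 1
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_weeks, test_weeks) index lists in chronological order.

    Assumption: an expanding window. Training always stops `horizon` weeks
    before the first test week, so the model never sees data from the
    period it is asked to forecast.
    """
    first_test = min_train + horizon - 1
    for start in range(first_test, n_weeks - test_size + 1, test_size):
        train_end = start - horizon + 1  # exclusive upper bound
        yield list(range(train_end)), list(range(start, start + test_size))


if __name__ == "__main__":
    # Hypothetical setting: 8 weeks of data, forecasting 3 weeks ahead.
    for train, test in walk_forward_splits(8, min_train=4, horizon=3):
        print("train weeks", train, "-> test weeks", test)
```

Repeating the loop for several values of `horizon` gives one out-of-sample score per forecast window, which is one way to compare windows on equal terms.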
Relevance to Digital Footprints: Epidemics such as COVID-19 are recognised as being driven as much by behavioural factors as by clinical ones. Indicators of infection rates may be revealed in purchasing and self-medication logs, where rich data exist: in 2022 UK citizens were reported to generate >1 billion prescriptions; consume ~6,300 tonnes of paracetamol; and spend £572m on cough, cold and sore throat treatments. Applying the digital footprint data logs generated by such activities may reveal hidden disease incidence and risk to vulnerable communities, without reliance on prohibitively expensive testing infrastructures.
Results: Models incorporating digital footprint sales data significantly outperformed models that used only variables traditionally associated with respiratory disease (e.g. sociodemographics, weather, or case data). In Study 1, XGBoost models optimally predicted the number of COVID deaths 21 days in advance (R² = 0.71***), significantly outperforming models based on official COVID case data alone at local-area levels (R² = 0.44**). For the pre-COVID period, where registered deaths show a far stronger seasonal pattern, models optimally predicted registered respiratory deaths 17 days in advance (R² = 0.78***), with the highest accuracy gains over models without digital footprint data (increases in R² of between 0.09 and 0.11) occurring in periods of maximum risk to the general public (winter periods).
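For readers unfamiliar with the accuracy measure, the sketch below computes R² and the gain between two models. It assumes the standard definition of the coefficient of determination (one minus the ratio of residual to total sum of squares); the abstract does not state which variant was used. The function name `r_squared` and all numbers in the usage section are hypothetical.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical weekly death counts and two sets of forecasts.
    deaths = [12.0, 15.0, 22.0, 30.0, 18.0]
    baseline = [14.0, 14.0, 19.0, 24.0, 21.0]    # e.g. without sales data
    with_sales = [12.5, 15.5, 21.0, 28.0, 18.5]  # e.g. with sales data
    gain = r_squared(deaths, with_sales) - r_squared(deaths, baseline)
    print(f"R^2 gain from adding sales features: {gain:.2f}")
```

An R² gain of 0.09 to 0.11, as reported above, means the model with sales data explains roughly ten percentage points more of the variance in weekly deaths than the model without it.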
Conclusions & Implications: Over-the-counter medication purchases related to the management of respiratory illness are correlated with registered deaths 17-21 days later. Results demonstrate the potential for sales data to support early-warning population health mechanisms at local area levels, and the need for ongoing research into their application to support health planning.
Introduction & Background: Previous studies have found that shopping data could increase the predictive accuracy of disease surveillance systems and illuminate behavioural responses in the self-management of symptoms of disease. Yet accessing individual sales datasets for linkage to health datasets is challenging, and the recruitment of appropriate sample sizes for medical research has been limited.
Objectives & Approach
Objectives: Collect and link individual health data to individual shopping data to investigate COVID-19. Assess the feasibility of scaling up this method, and use the collected data to investigate the use of loyalty card data in machine learning (ML) models for disease.
Methods: Based on recommendations on the public’s preferences for data donation, a new protocol was designed for collecting, linking and analysing shopping and health data. Participants were asked to use the Tesco Clubcard website data portability function to share their loyalty card data and to complete an online health survey. An exploratory data analysis was conducted on the linked dataset. Participants were recruited online (18/01/2022 to 04/02/2022) with a recruitment target of 200.
Relevance to Digital Footprints: The collection and analysis of individual transactional sales data for health research.
Results: 197 participants shared their Tesco Clubcard and health survey data. Tesco Clubcard data contained 893,414 transactions of 65,310 uniquely named items purchased from 2015 to 2022. Average transactions per participant were 4,653 (SD 5,256) and the average timeframe recorded was 5 years, 6 months and 30 days (SD 836 days). A total of 6,993 medication sales were recorded, accounting for 1% of sales; 81% (159/197) of participants bought medications, and the average was 44 (SD 68) medications per individual. Most participants (196/197) shared their health status in the survey, and 94% (81/86) of those on medication shared the medication names. Participants reported donating their data to do good (79%, 155/197), help the NHS (77%, 152/197), be socially responsible (74%, 144/197) and because data was secure and anonymised (78%, 153/197).
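Descriptive figures of this kind can be derived from a flat transaction log, as sketched below. This is a minimal illustration, not the study's code: the record layout `(participant_id, is_medication)`, the function name `medication_summary`, and the upstream step that flags an item as a medication are all assumptions. Averaging over buyers only is also an assumption, chosen because the reported mean of 44 is consistent with 6,993 sales spread over the 159 buyers.

```python
from collections import Counter
from statistics import mean, stdev
from typing import Dict, Iterable, List, Tuple


def medication_summary(
    transactions: Iterable[Tuple[str, bool]], participants: List[str]
) -> Dict[str, float]:
    """Summarise medication purchases per participant.

    Each transaction is (participant_id, is_medication). The flag is assumed
    to come from a separate product classification step.
    """
    counts = Counter(pid for pid, is_med in transactions if is_med)
    buyers = [counts[p] for p in participants if counts[p] > 0]
    return {
        "share_buying": len(buyers) / len(participants),
        "mean_per_buyer": mean(buyers) if buyers else 0.0,
        "sd_per_buyer": stdev(buyers) if len(buyers) > 1 else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical toy log: four participants, two of whom buy medication.
    log = [("a", True), ("a", True), ("a", False), ("b", True), ("c", False)]
    print(medication_summary(log, ["a", "b", "c", "d"]))
```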
Conclusions & Implications: Using this new protocol, which enables convenient data sharing with transparent data safeguards, the public were willing to share both their shopping and health data for research into COVID-19. To apply robust ML analysis, particularly to explore self-medication at an individual level, either recruitment must be significantly scaled up to collect data from enough individuals with high sales and regular shopping frequency, or new ML techniques must be developed to address the sparseness of key health-related purchasing events in loyalty card data. The study suggests public readiness to share shopping data for health research, but investment is needed for large-scale data collection and AI application.
Abstract Background: A growing number of studies show the potential of loyalty card data for use in health research. However, research into public perceptions of using this data is limited. This study aimed to investigate public attitudes towards donating loyalty card data for academic health research, and the safeguards the public would want to see implemented. The way in which participant attitudes varied according to whether loyalty card data would be used for cancer or COVID-19 research was also examined. Methods: Participants (N=40) were recruited via Prolific Academic to take part in semi-structured telephone interviews, with questions focused on data sharing related to either COVID-19 or ovarian/bowel cancer as the proposed health condition to be researched. Content analysis was used to identify sub-themes corresponding to the two a priori themes, attitudes and safeguards. Results: Participant attitudes were found to fall into two categories, rational and emotional. Within the rational category, most participants were in favour of sharing loyalty card data. Support of health research was seen as an important reason to donate such data, with loyalty card logs being considered as already within the public domain. With increased understanding of the research purpose, participants expressed higher willingness to donate data. Within the emotional category, participants shared fears about revealing location information and of third parties obtaining their data. With regard to safeguards, participants described the importance of anonymisation and the level of data detail; the control, convenience and choice they desired in sharing data; and the need for transparency and data security. The change in hypothetical purpose of the data sharing, from COVID-19 to cancer research, had no impact on participants’ decision to donate, although it did affect their understanding of how loyalty card data could be used.
Conclusions: Based on interviews with the public, this study contributes recommendations for those researchers and the wider policy community seeking to obtain loyalty card data for health research. Whilst participants were largely in favour of donating loyalty card data for academic health research, information, choice and appropriate safeguards are all exposed as prerequisites upon which decisions are made.
Introduction & Background: Chronic pain is considered a priority in healthcare and a threat to well-being across the globe. It is thus crucial to accurately measure national levels of pain conditions and their impacts on workplace productivity and well-being. Chronic pain has traditionally been studied in isolation, with either self-reported survey data or standalone shopping records. The former are limited in scale and can be marred by response biases, while the latter lack ‘ground truths’: what research teams can usually measure is the purchase patterns of pain relief products, but neither the severity nor the types of pain conditions.
Objectives & Approach: Data donation tools offer a novel approach to studying chronic pain by linking the two aspects and establishing statistical relationships between medicine consumption and the multiple facets of pain experience. In a survey, we asked participants (N = 953) to share their loyalty card data with us, which is made possible by the data portability tool provided by Tesco (the largest supermarket chain in the United Kingdom) under the General Data Protection Regulation (GDPR). Based on questions adopted from popular inventories used in health research (e.g., EQ5D Health States, ONS4 Well-being, WEMWBS scales), we also asked participants to report the details of their pain conditions, hours of employment, and both general and mental health states. This allowed us to associate chronic pain - both subjective and objective (i.e., reflected by medicine consumption) - with its economic and personal consequences. Data collection was conducted via research panel providers, so the sample should approximate national representativeness.
Relevance to Digital Footprints: This work links digital footprints data donated by individuals to self-reported survey data, and develops an infrastructure for these data to be collected and safely stored.
Conclusions & Implications: One key value of this project is to pioneer a measure of chronic pain that can be applied, in future analyses, to transactional records of much larger scale. Our research team has access to an array of different digital footprints data, including longitudinal transactional data provided by a major pharmacy chain (~20 million customers and ~429 million baskets). To associate these data with regional workplace productivity measures and well-being data released by the Office for National Statistics, a metric must be defined to extract the prevalence of chronic pain from shopping data; this metric is informed by the patterns found in the data donation project.
Shopping data can be analyzed using machine learning techniques to study population health. It is unknown if the use of such methods can successfully investigate prediagnosis purchases linked to self-medication of symptoms of ovarian cancer.
The aims of this study were to gain new domain knowledge from women's experiences, understand how women's shopping behavior relates to their pathway to the diagnosis of ovarian cancer, and inform research on computational analysis of shopping data for population health.
A web-based survey on individuals' shopping patterns prior to an ovarian cancer diagnosis was analyzed to identify key knowledge about health care purchases. Logistic regression and random forest models were employed to statistically examine how products linked to potential symptoms related to presentation to health care and timing of diagnosis.
Of the 101 women surveyed with ovarian cancer, 58.4% (59/101) bought nonprescription health care products prior to diagnosis, for periods ranging up to more than a year, including pain relief and abdominal products. General practitioner advice was the primary reason for the purchases (23/59, 39%), with 51% (30/59) occurring because a participant's doctor believed their health problems were due to a condition other than ovarian cancer. Associations were shown between purchases made because a participant's doctor believed their health problems were due to a condition other than ovarian cancer and the following variables: health problems for longer than a year prior to diagnosis (odds ratio [OR] 7.33, 95% CI 1.58-33.97), buying health care products for more than 6 months to a year (OR 3.82, 95% CI 1.04-13.98) or for more than a year (OR 7.64, 95% CI 1.38-42.33), and the number of health care product types purchased (OR 1.54, 95% CI 1.13-2.11).
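To make the odds ratios and confidence intervals above easier to read, the sketch below computes an odds ratio with a 95% Wald (Woolf) interval from a 2x2 table. This is a swapped-in technique for illustration: the study reports odds ratios from logistic regression, which may be adjusted for other variables, so the results are not directly comparable. The function name `odds_ratio_ci` and the counts in the usage section are hypothetical.

```python
import math
from typing import Tuple


def odds_ratio_ci(
    a: int, b: int, c: int, d: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """Odds ratio and Wald (Woolf) confidence interval for a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this estimator")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts, not taken from the survey.
    or_, lo, hi = odds_ratio_ci(10, 5, 5, 10)
    print(f"OR {or_:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```

With small cell counts the interval becomes very wide, which is the same pattern seen in the intervals reported above for a sample of 59 purchasers.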
Purchasing patterns are shown to be potentially predictive of a participant's doctor thinking their health problems were due to some condition other than ovarian cancer, with nested cross-validation of random forest classification models achieving an overall in-sample accuracy score of 89.1% and an out-of-sample score of 70.1%.
Women in the survey were 7 times more likely to have had a duration of more than a year of health problems prior to a diagnosis of ovarian cancer if they were self-medicating based on advice from a doctor rather than having made the decision to self-medicate independently. Predictive modelling indicates that women in such situations, who are self-medicating because their doctor believes their health problems may be due to a condition other than ovarian cancer, exhibit distinct shopping behaviors that may be identifiable within purchasing data. Through exploratory research combining women sharing their behaviors prior to diagnosis and computational analysis of these data, this study demonstrates that women's shopping data could potentially be useful for early ovarian cancer detection.
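Nested cross-validation, as used for the accuracy scores above, separates hyperparameter selection (inner loop) from performance estimation (outer loop). The sketch below shows the structure only. To stay dependency-free it swaps the study's random forest for a toy one-dimensional k-nearest-neighbour classifier; all function names, the fold scheme, and the candidate values are assumptions introduced here.

```python
import random
from typing import List, Sequence, Tuple

Row = Tuple[float, int]  # (feature value, binary label)


def k_folds(rows: Sequence[Row], k: int) -> List[List[Row]]:
    """Deal rows into k folds in round-robin order."""
    return [list(rows[i::k]) for i in range(k)]


def knn_predict(train: Sequence[Row], x: float, k: int) -> int:
    """Majority vote among the k training rows closest to x."""
    nearest = sorted(train, key=lambda r: abs(r[0] - x))[:k]
    return int(2 * sum(label for _, label in nearest) > len(nearest))


def accuracy(train: Sequence[Row], test: Sequence[Row], k: int) -> float:
    """Share of test rows classified correctly using the training rows."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def cv_score(rows: Sequence[Row], k: int, n_folds: int) -> float:
    """Inner loop: mean validation accuracy for one hyperparameter value."""
    folds = k_folds(rows, n_folds)
    scores = []
    for i, val in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(train, val, k))
    return sum(scores) / len(scores)


def nested_cv(
    rows: Sequence[Row],
    candidates: Sequence[int],
    outer_k: int = 5,
    inner_k: int = 3,
    seed: int = 0,
) -> float:
    """Outer loop: score each held-out fold with the k chosen on the rest."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    folds = k_folds(shuffled, outer_k)
    outer_scores = []
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        # The held-out fold plays no part in choosing the hyperparameter.
        best_k = max(candidates, key=lambda k: cv_score(train, k, inner_k))
        outer_scores.append(accuracy(train, test, best_k))
    return sum(outer_scores) / len(outer_scores)
```

Because the held-out fold is never used for tuning, the outer score is a fairer estimate of out-of-sample accuracy, which is why a gap such as 89.1% in-sample versus 70.1% out-of-sample can be observed and reported.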
Anthocyanins are a class of polyphenols that have received widespread recent attention due to their potential health benefits. However, estimating the dietary intake of anthocyanins at a population level is a challenging task, due to the difficulty of scaling dietary surveys. Further, there is limited evidence as to who regularly consumes anthocyanins, and how consumption varies temporally, spatially, or culturally according to levels of socioeconomic deprivation. Leveraging a massive retail loyalty card dataset in the UK, we pair two years of real-world purchasing data for 619,524 regular shoppers and 207 million shopping baskets with anthocyanin estimates drawn from polyphenol databases. We subsequently analyse relative deprivation levels of the neighbourhoods in which shoppers reside, illustrating how anthocyanin intake varies according to affluence. Results indicate that deprivation is strongly linked with both lower total intake of anthocyanins and lower breadth of dietary sources for them, potentially aggravating the incidence of diet-related diseases in the poorest sections of society.
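The pairing of purchases with anthocyanin estimates can be sketched as a lookup followed by aggregation per deprivation group. This is an illustration under stated assumptions, not the study's pipeline: the record layout, the function name `intake_by_group`, the grouping of shoppers into numbered deprivation groups, and every content value shown are hypothetical placeholders rather than figures from any polyphenol database.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def intake_by_group(
    purchases: Iterable[Tuple[str, str, float]],
    shopper_group: Dict[str, int],
    content_mg_per_100g: Dict[str, float],
) -> Dict[int, float]:
    """Mean anthocyanin purchased (mg) per shopper in each deprivation group.

    Each purchase is (shopper_id, product_name, grams). Products missing
    from the content table are treated as containing no anthocyanins.
    """
    per_shopper: Dict[str, float] = defaultdict(float)
    for shopper, product, grams in purchases:
        per_shopper[shopper] += content_mg_per_100g.get(product, 0.0) * grams / 100.0

    totals: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for shopper, group in shopper_group.items():
        # Shoppers with no matching purchases still count, with zero intake.
        totals[group] += per_shopper.get(shopper, 0.0)
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


if __name__ == "__main__":
    # Placeholder content values (mg per 100 g), for illustration only.
    content = {"blueberries": 100.0, "red cabbage": 50.0}
    baskets = [
        ("s1", "blueberries", 200.0),
        ("s1", "bread", 800.0),
        ("s2", "red cabbage", 100.0),
    ]
    groups = {"s1": 1, "s2": 1, "s3": 5}
    print(intake_by_group(baskets, groups, content))
```

Counting the distinct anthocyanin-containing products per shopper in the same loop would give a simple measure of the breadth of dietary sources mentioned above.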