Exploration of Large-Scale SPARQL Query Collections: Finding Structure and Regularity for Optimizing Database Systems

2020 
After the World Wide Web successfully penetrated the lives of people everywhere, it gave rise to the Semantic Web. Whereas the World Wide Web was designed for use by humans, the Semantic Web is meant to make it easier for machines to process data. To this end, data is modelled as an ontology rather than stored in classical relational databases. The work presented here studies large-scale collections of queries for semantic databases. Specifically, more than half a billion queries are investigated. The World Wide Web Consortium (W3C) specification Resource Description Framework (RDF) became the prominent standard for modelling semantic data. As the corresponding query language, the W3C developed the SPARQL Protocol and RDF Query Language (SPARQL, a recursive acronym). Various large-scale public databases offer semantic data for querying. These public endpoints log their usage for various purposes, and the logs can offer insight into the actual usage of data and features in SPARQL. We investigate two primary sources of queries: a diverse collection mostly obtained from USEWOD, and publicly available query logs from Wikidata. The diverse collection consists mostly of logs from DBpedia, but it also includes sources such as LinkedGeoData, OpenBioMed, and BioPortal. The goal of this study is to organize the data in the logs so that trends and insights into the nature of the queries can be identified. These can then be used to derive future directions for optimizing database systems that handle linked data, and for the technology surrounding this topic. The questions guiding the research therefore come from topics such as query evaluation, query optimization, tuning, and benchmarking. It turns out that quite a few observations can be made and that several interesting conclusions can be drawn. For instance, a very large number of queries are extremely simple. 
The shapes of most queries, even the more complex ones, can be described by a shape that has favorable properties for efficient evaluation. Furthermore, queries originating from humans differ from machine-generated queries. In this work, several novel approaches are taken, such as the analysis of query shapes, the temporal analysis of logs, and the investigation of query similarity based on structure. The results are entirely reproducible; the accompanying software is available under an open-source license and can be used to explore logs in addition to analyzing them.