There are thousands of data repositories on the Web, providing access to millions of datasets. National and regional governments, scientific publishers and consortia, commercial data providers, and others publish data for fields ranging from social science to life science to high-energy physics to climate science and more. Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others' work, and providing data journalists with easier access to information and its provenance. In this paper, we discuss Google Dataset Search, a dataset-discovery tool that provides search capabilities over potentially all datasets published on the Web. The approach relies on an open ecosystem, where dataset owners and providers publish semantically enhanced metadata on their own sites. We then aggregate, normalize, and reconcile this metadata, providing a search engine that lets users find datasets in the "long tail" of the Web. In this paper, we discuss both social and technical challenges in building this type of tool, and the lessons that we learned from this experience.
Scientists, governments, and companies increasingly publish datasets on the Web. Google's Dataset Search extracts dataset metadata -- expressed using schema.org and similar vocabularies -- from Web pages in order to make datasets discoverable. Since we started work on Dataset Search in 2016, the number of datasets described in schema.org has grown from about 500K to almost 30M. Thus, this corpus has become a valuable snapshot of data on the Web. To the best of our knowledge, this corpus is the largest and most diverse of its kind. We analyze this corpus and discuss where the datasets come from, what topics they cover, what form they take, and what people searching for datasets are interested in. Based on this analysis, we identify gaps and possible future work to help make data more discoverable.
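To make the schema.org extraction concrete, the sketch below (in Python, standard library only) shows how Dataset markup embedded as JSON-LD in a Web page might be pulled out and reduced to a normalized record. This illustrates the general idea rather than the production pipeline described in the paper; the output field names and the error handling are our own assumptions.

    # Minimal sketch: extract schema.org/Dataset records from JSON-LD blocks in a page.
    # Not the production pipeline; field selection and error handling are illustrative.
    import json
    from html.parser import HTMLParser

    class JsonLdCollector(HTMLParser):
        """Collects the text of <script type="application/ld+json"> elements."""
        def __init__(self):
            super().__init__()
            self._in_jsonld = False
            self.blocks = []

        def handle_starttag(self, tag, attrs):
            if tag == "script" and ("type", "application/ld+json") in attrs:
                self._in_jsonld = True

        def handle_endtag(self, tag):
            if tag == "script":
                self._in_jsonld = False

        def handle_data(self, data):
            if self._in_jsonld:
                self.blocks.append(data)

    def extract_datasets(html_text):
        """Return a normalized record for every schema.org Dataset found in the page."""
        collector = JsonLdCollector()
        collector.feed(html_text)
        records = []
        for block in collector.blocks:
            try:
                doc = json.loads(block)
            except ValueError:
                continue  # malformed markup is common in the long tail; skip it
            for item in (doc if isinstance(doc, list) else [doc]):
                if isinstance(item, dict) and item.get("@type") == "Dataset":
                    records.append({
                        "name": item.get("name"),
                        "description": item.get("description"),
                        "url": item.get("url"),
                        "keywords": item.get("keywords"),
                    })
        return records

In a full pipeline, records extracted this way would still need to be aggregated across pages, normalized, and reconciled (for example, de-duplicating the same dataset published by several mirrors) before indexing.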
As more ontologies become publicly available, finding the "right" ontologies becomes much harder. In this paper, we address the problem of ontology search: finding a collection of ontologies from an ontology repository that are relevant to the user's query. In particular, we look at the case when users search for ontologies relevant to a particular topic (e.g., an ontology about anatomy). The ontologies that are most relevant to such a query often do not have the query term in the names of their concepts (e.g., the Foundational Model of Anatomy ontology does not have the term "anatomy" in any of its concepts' names). Thus, we present a new ontology-search technique that helps users in these types of searches. When looking for ontologies on a particular topic (e.g., anatomy), we retrieve from the Web a collection of terms that represent the given domain (e.g., terms such as body, brain, and skin for anatomy). We then use these terms to expand the user query. We evaluate our algorithm on queries for topics in the biomedical domain against a repository of biomedical ontologies, using results obtained from experts in the biomedical-ontology domain as the gold standard. Our experiments demonstrate that using our method for query expansion improves retrieval results by 113% compared to tools that search only for the user's query terms and consider only class and property names (such as Swoogle). We show a 43% improvement for the case where not only class and property names but also property values are taken into account.
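The following Python sketch illustrates the query-expansion idea at a high level. The get_domain_terms argument stands in for the Web-based harvesting of domain-representative terms described above and is a hypothetical helper supplied by the caller; the scoring is a simple label-matching count, not the exact ranking used in the paper.

    # Illustrative sketch of topic-based query expansion for ontology search.
    # get_domain_terms is a hypothetical callable, e.g. lambda q: ["body", "brain", "skin"].

    def expand_query(query, get_domain_terms, max_terms=20):
        """Expand a topic query (e.g. 'anatomy') with domain-representative terms."""
        return [query] + list(get_domain_terms(query))[:max_terms]

    def score_ontology(expanded_query, labels):
        """Count how many expansion terms appear among the ontology's class and
        property names (property values can be folded into `labels` as well)."""
        text = " ".join(label.lower() for label in labels)
        return sum(1 for term in expanded_query if term.lower() in text)

    def rank_ontologies(query, ontologies, get_domain_terms):
        """Rank ontologies, given as a dict of name -> list of labels."""
        expanded = expand_query(query, get_domain_terms)
        return sorted(ontologies.items(),
                      key=lambda kv: score_ontology(expanded, kv[1]),
                      reverse=True)

For example, rank_ontologies("anatomy", repo, lambda q: ["body", "brain", "skin"]) would place an anatomy ontology whose labels mention these terms above unrelated ontologies, even if none of its concept names contains the word "anatomy" itself.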
topics in this research. We talked about types of heterogeneity between ontologies, surveyed various mapping representations, classified methods for discovering mappings both between ontology concepts and between data, and discussed various tasks where mappings are used. In this extended abstract of our talk, we provide an annotated bibliography for this area of research, giving readers brief pointers to representative papers on each of the topics mentioned above. We did not attempt to compile a comprehensive bibliography, and the list in this abstract is therefore necessarily incomplete. Rather, we tried to sketch a map of the field, with specific references to help interested readers in their exploration of the work to date.
Semantic technologies provide flexible and scalable solutions for mastering and making sense of an increasingly vast and complex data landscape. However, while this potential has been acknowledged for various application scenarios and domains, and a number of success stories exist, it is equally clear that the development and deployment of semantic technologies will always remain reliant on human input and intervention. This is due to the very nature of some of the tasks associated with the semantic data management life cycle, which are known for their knowledge-intensive and/or context-specific character; examples range from conceptual modeling in almost any flavor to labeling resources (in different languages), describing their content in ontological terms, or recognizing similar concepts and entities. For this reason, the Semantic Web community has always looked into applying the latest theories, methods, and tools from CSCW (Computer Supported Cooperative Work), participatory design, Web 2.0, social computing, and, more recently, crowdsourcing to find ways to engage users and encourage their involvement in the execution of technical tasks. Existing approaches include the use of wikis as semantic content authoring environments and the use of folksonomies to create formal ontologies, as well as human-computation approaches such as games with a purpose and micro-tasks.
This document provides a summary of Dagstuhl Seminar 14282: Crowdsourcing and the Semantic Web, which in July 2014 brought together researchers from the emerging scientific community at the intersection of crowdsourcing and Semantic Web technologies. We collect the position statements written by the participants of the seminar, which played a central role in the discussions about the evolution of our research field.