Keys for XML. Peter Buneman (University of Pennsylvania), Susan Davidson (University of Pennsylvania), Wenfei Fan (Temple University), Carmem Hara (Universidade Federal do Paraná, Brazil), Wang-Chiew Tan (University of Pennsylvania). WWW '01: Proceedings of the 10th International Conference on World Wide Web, May 2001, pages 201–210. https://doi.org/10.1145/371920.371984. Published online: 1 April 2001.
Archiving is important for scientific data, where it is necessary to record all past versions of a database in order to verify findings based upon a specific version. Much scientific data is held in a hierarchical format and has a key structure that provides a canonical identification for each element of the hierarchy. In this article, we exploit these properties to develop an archiving technique that is both efficient in its use of space and preserves the continuity of elements through versions of the database, something that is not provided by traditional minimum-edit-distance diff approaches. The approach also uses timestamps: all versions of the data are merged into one hierarchy, and an element appearing in multiple versions is stored only once, along with a timestamp. By identifying the semantic continuity of elements and merging them into one data structure, our technique is capable of providing meaningful change descriptions, and the archive allows us to easily answer certain temporal queries, such as retrieving any specific version from the archive and finding the history of an element. This is in contrast with approaches that store a sequence of deltas, where such operations may require undoing a large number of changes or significant reasoning with the deltas. A suite of experiments also demonstrates that our archive does not incur any significant space overhead when contrasted with diff approaches. Another useful property of our approach is that we use the XML format to represent hierarchical data, and the resulting archive is also in XML; hence, XML tools can be applied directly to our archive. In particular, we apply an XML compressor to our archive, and our experiments show that our compressed archive outperforms compressed diff-based repositories in space efficiency. We also show how we can extend our archiving tool to an external-memory archiver for higher scalability, and describe various index structures that can further improve the efficiency of some temporal queries on our archive.
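To illustrate the idea of key-based merging with timestamps, here is a minimal Python sketch, assuming the hierarchical data is given as nested dictionaries keyed by element keys and versions are numbered; the function and field names (`merge_version`, `timestamps`) are illustrative and not taken from the actual archiver.

```python
# Illustrative sketch of key-based archiving: all versions are merged into
# one tree, and each element records the versions in which it appears.
# Elements are identified by their key (here: the dict key along the path),
# not by position, so continuity across versions is preserved.

def merge_version(archive, version_data, version_no):
    """Merge one version (a nested dict of key -> subtree or leaf value) into
    the archive. The archive mirrors the hierarchy; each node keeps a
    'timestamps' set of version numbers and, for leaves, per-version values."""
    for key, value in version_data.items():
        node = archive.setdefault(key, {"timestamps": set(), "children": {}, "values": {}})
        node["timestamps"].add(version_no)
        if isinstance(value, dict):        # internal element: recurse on its children
            merge_version(node["children"], value, version_no)
        else:                              # leaf: record the value seen in this version
            node["values"][version_no] = value

def retrieve_version(archive, version_no):
    """Reconstruct a single version from the merged archive."""
    result = {}
    for key, node in archive.items():
        if version_no not in node["timestamps"]:
            continue
        if node["children"]:
            result[key] = retrieve_version(node["children"], version_no)
        else:
            result[key] = node["values"][version_no]
    return result

# Example: two versions of a tiny hierarchical database.
v1 = {"gene:abc": {"name": "ABC", "length": 120}}
v2 = {"gene:abc": {"name": "ABC", "length": 125}, "gene:xyz": {"name": "XYZ"}}

archive = {}
merge_version(archive, v1, 1)
merge_version(archive, v2, 2)
assert retrieve_version(archive, 1) == v1
assert retrieve_version(archive, 2) == v2
```

Because each element is stored once with its set of timestamps, retrieving a version or tracing the history of a single element is a direct traversal rather than a replay of deltas.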
The Diversity, Equity and Inclusion (DEI) initiative started as the Diversity/Inclusion initiative in 2020 [4]. The current report summarizes our activities in 2022. Our responsibility as a community is to ensure that attendees of DB conferences feel included, irrespective of their scientific perspective and personal background. One of the first steps was to establish the role of the DEI chairs at DB conferences, with the DEI team dedicated to providing leadership to help our community achieve this goal. In this leadership role, the DEI team is advising DEI chairs at DB conferences, serving as a memory of DEI events at conferences, building an agreed-upon vision, and committing to working together to devise a set of measures for achieving DEI. This is pursued via actions led by our core members (Figure 1) and liaisons of individual executive bodies (Figure 2): REACH OUT collects data and experiences from our community; INCLUDE monitors and recommends inclusion efforts; ORGANIZE focuses on in-conference organization efforts, such as adopting a code of conduct; INFORM communicates through various channels; SUPPORT coordinates DEI support from executive bodies and sponsors; SCOUT collates DEI efforts from other communities; and COORDINATE manages all actions. Two new actions were added: MEDIA preserves and disseminates the digital media produced by DEI@DB events, and ETHICS establishes and promotes ethics guidelines for publications in our community.
A schema mapping is a high-level declarative specification of the relationship between two schemas; it specifies how data structured under one schema, called the source schema, is to be converted into data structured under a possibly different schema, called the target schema. Schema mappings are fundamental components for both data exchange and data integration. To date, a language for specifying (or programming) schema mappings exists. However, developmental support for programming schema mappings is still lacking. In particular, a tool for debugging schema mappings has not yet been developed. In this paper, we propose to build a debugger for understanding and exploring schema mappings. We present a primary feature of our debugger, called routes, which describe the relationship between source and target data with the schema mapping. We present two algorithms for computing all routes, or one route, for selected target data. Both algorithms execute in time polynomial in the size of the input. In computing all routes, our algorithm produces a concise representation that factors out the common steps in the routes; furthermore, every minimal route for the selected data can, essentially, be found in this representation. Our second algorithm is able to produce one route quickly, if one exists, and alternative routes as needed. We demonstrate the feasibility of our route algorithms through a set of experimental results on both synthetic and real datasets.
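As an illustration of what a route captures, the following hedged Python sketch finds, for a selected target tuple, one mapping rule and one combination of source tuples that derive it; the rule encoding and the names (`find_one_route`, `m1`) are assumptions made for this sketch, not the debugger's actual representation or algorithm.

```python
# Illustrative sketch of computing one "route" for a selected target tuple:
# find a mapping rule and a combination of source tuples that account for it.
# The rule representation (source atoms, target atom, shared variables) is a
# simplified stand-in for a source-to-target dependency.

from itertools import product

def _bind(variables, tup, binding):
    """Extend binding with variable -> value pairs; fail on a conflict."""
    for var, val in zip(variables, tup):
        if var in binding and binding[var] != val:
            return False
        binding[var] = val
    return True

def find_one_route(source, rules, target_tuple):
    """Return (rule name, variable binding, contributing source tuples) that
    derives target_tuple, or None if no rule explains it."""
    relation, values = target_tuple
    for rule in rules:
        t_rel, t_vars = rule["target"]
        if t_rel != relation or len(t_vars) != len(values):
            continue
        # try every combination of source tuples for the rule's source atoms
        candidates = [source.get(s_rel, []) for s_rel, _ in rule["source"]]
        for combo in product(*candidates):
            binding = dict(zip(t_vars, values))   # fix the selected target values
            if all(_bind(s_vars, tup, binding)
                   for (_, s_vars), tup in zip(rule["source"], combo)):
                return rule["name"], binding, list(zip((r for r, _ in rule["source"]), combo))
    return None

# Example: Emp(name, dept) and Dept(dept, mgr) imply Office(name, mgr).
rules = [{"name": "m1",
          "source": [("Emp", ("n", "d")), ("Dept", ("d", "m"))],
          "target": ("Office", ("n", "m"))}]
source = {"Emp": [("alice", "cs"), ("bob", "ee")],
          "Dept": [("cs", "carol"), ("ee", "dave")]}
print(find_one_route(source, rules, ("Office", ("alice", "carol"))))
# ('m1', {'n': 'alice', 'm': 'carol', 'd': 'cs'},
#  [('Emp', ('alice', 'cs')), ('Dept', ('cs', 'carol'))])
```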
The Web is teeming with rich structured information in the form of HTML tables, which provides us with the opportunity to build a knowledge repository by integrating these tables. An essential problem of web data integration is to discover semantic correspondences between web table columns, and schema matching is a popular means to determine the semantic correspondences. However, conventional schema matching techniques are not always effective for web table matching due to the incompleteness in web tables. In this paper, we propose a two-pronged approach for web table matching that effectively addresses these difficulties. First, we propose a concept-based approach that maps each column of a web table to the best concept, in a well-developed knowledge base, that represents it. This approach overcomes the problem that the values of two web table columns may be disjoint, even though the columns are related, due to incompleteness in the column values. Second, we develop a hybrid machine-crowdsourcing framework that leverages human intelligence to discern the concepts for "difficult" columns. Our overall framework assigns the most "beneficial" column-to-concept matching tasks to the crowd under a given budget and utilizes the crowdsourcing result to help our algorithm infer the best matches for the remaining columns. We validate the effectiveness of our framework through an extensive experimental study over two real-world web table data sets. The results show that our two-pronged approach outperforms existing schema matching techniques at only a low cost for crowdsourcing.
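The concept-based step can be pictured with the small sketch below, which scores each knowledge-base concept by how well its known instances cover a column's non-empty values and flags low-margin columns as "difficult" candidates for the crowd; the tiny knowledge base, the coverage score, and the margin threshold are assumptions for illustration, not the system's actual scoring.

```python
# Illustrative sketch: map a web table column to the knowledge-base concept
# whose instances best cover the column's values; columns where the best
# concept barely beats the runner-up are flagged for crowdsourcing.

def match_column(column_values, knowledge_base):
    """Return concepts ranked by the fraction of non-empty column values
    they contain (best first)."""
    values = {v.strip().lower() for v in column_values if v}   # skip empty cells
    scores = {concept: (len(values & instances) / len(values)) if values else 0.0
              for concept, instances in knowledge_base.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def is_difficult(best_score, runner_up_score, threshold=0.2):
    """A column is 'difficult' when the best concept barely beats the runner-up."""
    return best_score - runner_up_score < threshold

knowledge_base = {
    "Country": {"france", "japan", "brazil", "canada"},
    "City":    {"paris", "tokyo", "rio de janeiro", "toronto"},
}
column = ["Paris", "Tokyo", "", "Toronto"]      # incomplete column with an empty cell

ranked = match_column(column, knowledge_base)
(best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
print(best, best_score)                          # City 1.0
print(is_difficult(best_score, runner_up_score)) # False: clear winner, no crowd task needed
```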
The 16th International Conference on Database Theory (ICDT 2013) was held in Genoa, Italy, March 18--22, 2013. Originally biennial, the ICDT conference has been held annually and jointly with EDBT (Extending Database Technology) since 2009.
In recent years, data examples have been at the core of several different approaches to schema-mapping design. In particular, Gottlob and Senellart introduced a framework for schema-mapping discovery from a single data example, in which the derivation of a schema mapping is cast as an optimization problem. Our goal is to refine and study this framework in more depth. Among other results, we design a polynomial-time log(n)-approximation algorithm for computing optimal schema mappings from a given set of data examples (where n is the combined size of the given data examples) for a restricted class of schema mappings; moreover, we show that this approximation ratio cannot be improved. In addition to the complexity-theoretic results, we implemented the aforementioned log(n)-approximation algorithm and carried out an experimental evaluation in a real-world mapping scenario.
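Approximation guarantees of this log(n) form typically arise from a greedy covering strategy. The sketch below only illustrates that generic idea (repeatedly pick the candidate rule with the best coverage-per-cost over the still-unexplained target facts) with made-up candidate rules and costs; it is not the paper's algorithm or its cost function.

```python
# Generic greedy-cover sketch behind many log(n)-approximation guarantees:
# repeatedly choose the candidate (here, a candidate mapping rule) that
# explains the most still-unexplained target facts per unit of cost.

def greedy_cover(target_facts, candidates):
    """candidates: list of (rule_name, cost, facts_explained). Returns the
    chosen rules and whatever facts remained unexplained."""
    uncovered = set(target_facts)
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda c: len(c[2] & uncovered) / c[1])
        gain = best[2] & uncovered
        if not gain:                    # nothing left can be explained
            break
        chosen.append(best[0])
        uncovered -= gain
    return chosen, uncovered

facts = {"t1", "t2", "t3", "t4"}
candidates = [("copy_R", 1, {"t1", "t2"}),
              ("join_RS", 2, {"t2", "t3", "t4"}),
              ("copy_S", 1, {"t4"})]
print(greedy_cover(facts, candidates))
# (['copy_R', 'join_RS'], set()) -- copy_R has the best coverage-per-cost first,
# then join_RS explains the remaining facts t3 and t4.
```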
Fraud detection rules, written by domain experts, are often employed by financial companies to enhance their machine learning-based mechanisms for accurate detection of fraudulent transactions. Accurate rule writing is a challenging task on which domain experts spend significant effort and time. A key observation is that much of this difficulty originates from the fact that experts typically work as "lone rangers" or in isolated groups to define the rules, or work on detecting fraud in one context in isolation from fraud that occurs in another context. However, in practice there is a lot of commonality in what different experts are trying to achieve. In this demo, we present the GOLDRUSH system, which facilitates knowledge sharing via effective adaptation of fraud detection rules from one context to another. GOLDRUSH abstracts the possible semantic interpretations of each of the conditions in the rules in one context and adapts them to the target context. Efficient algorithms are used to identify the most effective rule adaptations w.r.t. a given cost-benefit metric. We showcase GOLDRUSH through a reenactment of a real-life fraud detection event. Our demonstration will engage the VLDB'18 audience, allowing them to play the role of experts collaborating in the fight against financial fraud.
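As a rough picture of cost-benefit selection, the sketch below replays candidate adaptations of an imported rule over labeled transactions in the target context and keeps the one with the best benefit (fraud amount caught) minus cost (a penalty per falsely flagged transaction); the rule encoding, the penalty value, and the data are assumptions for illustration, not GOLDRUSH's actual metric or adaptation procedure.

```python
# Illustrative sketch of choosing a rule adaptation by a cost-benefit metric:
# each candidate adaptation is evaluated over the target context's labeled
# transactions, and the highest-scoring one wins.

FALSE_POSITIVE_PENALTY = 50.0   # assumed cost of inspecting a legitimate transaction

def evaluate(rule, transactions):
    """rule: predicate over a transaction dict. Returns benefit minus cost."""
    benefit = sum(t["amount"] for t in transactions if rule(t) and t["is_fraud"])
    cost = FALSE_POSITIVE_PENALTY * sum(1 for t in transactions if rule(t) and not t["is_fraud"])
    return benefit - cost

def best_adaptation(adaptations, transactions):
    """adaptations: dict of name -> rule predicate. Pick the highest-scoring one."""
    return max(adaptations.items(), key=lambda kv: evaluate(kv[1], transactions))

# Two candidate adaptations of an "unusually large transfer" rule,
# differing only in the threshold chosen for the target context.
adaptations = {
    "threshold_1000": lambda t: t["amount"] > 1000,
    "threshold_5000": lambda t: t["amount"] > 5000,
}
transactions = [
    {"amount": 1200, "is_fraud": False},
    {"amount": 6000, "is_fraud": True},
    {"amount": 2000, "is_fraud": True},
    {"amount": 7000, "is_fraud": False},
]
name, rule = best_adaptation(adaptations, transactions)
print(name, evaluate(rule, transactions))   # threshold_1000 7900.0
```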
The Web is a major resource of both factual and subjective information. While there are significant efforts to organize factual information into knowledge bases, there is much less work on organizing opinions, which are abundant in subjective data, into a structured format.
Detecting the semantic types of data columns in relational tables is important for various data preparation and information retrieval tasks such as data cleaning, schema matching, data discovery, and semantic search. However, existing detection approaches either perform poorly with dirty data, support only a limited number of semantic types, fail to incorporate the table context of columns, or rely on large sample sizes for training data. We introduce Sato, a hybrid machine learning model to automatically detect the semantic types of columns in tables, exploiting the signals from the table context as well as the column values. Sato combines a deep learning model trained on a large-scale table corpus with topic modeling and structured prediction to achieve support-weighted and macro average F1 scores of 0.925 and 0.735, respectively, exceeding the state-of-the-art performance by a significant margin. We extensively analyze the overall and per-type performance of Sato, discussing how individual modeling components, as well as feature categories, contribute to its performance.
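The combination of column-value signals with table context can be sketched as follows: featurize each column from its values, append a table-level context vector shared by all columns of the table, and hand the result to a downstream classifier. The toy features and the averaged "topic" vector below are placeholders; Sato itself uses a deep model trained on a large table corpus, topic modeling, and structured prediction over all columns jointly.

```python
# Schematic of combining per-column value signals with table-level context:
# each column's feature vector is concatenated with a context vector shared
# by every column in the same table.

import numpy as np

def column_features(values):
    """Toy per-column features: mean cell length, digit ratio, distinct-value ratio."""
    strs = [str(v) for v in values]
    mean_len = np.mean([len(s) for s in strs])
    digit_ratio = np.mean([any(c.isdigit() for c in s) for s in strs])
    distinct_ratio = len(set(strs)) / len(strs)
    return np.array([mean_len, digit_ratio, distinct_ratio])

def table_context(columns):
    """Toy table-level 'topic' vector: the average of the per-column features,
    shared by every column of the table (a stand-in for topic modeling)."""
    return np.mean([column_features(c) for c in columns], axis=0)

def featurize_table(columns):
    ctx = table_context(columns)
    return [np.concatenate([column_features(c), ctx]) for c in columns]

table = [["Alice", "Bob", "Carol"],     # e.g. a 'name' column
         ["1982", "1990", "1975"]]      # e.g. a 'year' column
for col, feats in zip(table, featurize_table(table)):
    print(col[0], feats.round(2))
# A downstream classifier would map each combined feature vector to a semantic
# type; the shared context vector is what lets sibling columns inform each other.
```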