Database management systems are becoming available for semistructured data. However, these tools cannot be used on many real-world data sources (e.g., most web sites) in their native form. Often, wrappers are needed to extract information and organize it into a graph structure that makes explicit the concepts users want to query and update. This paper presents a new approach to wrapper generation that exploits linguistic knowledge. The approach produces a more fine-grained parse of sources with natural language text than previous efforts. The resulting graph-structured databases answer queries that could not be formulated in databases produced by wrappers generated with prior approaches. In addition, our approach may be more robust in the face of slight variations in word choice and order. We discuss a prototype implementation, lessons learned to date, evaluation issues, and future research directions.
In this demonstration, we exhibit a new type of provenance system, one that is not tied to any particular domain, closed-world system or use. The PLUS provenance system was inspired by government requirements to enable provenance capture, storage and use across multi-organizational systems. PLUS is general enough to interact across open-world distributed systems, often without administrative access to those underlying distributed systems. It captures and stores provenance, permits user annotations, and provides tools for analyzing the provenance on the basis of those annotations. Due to the need to share provenance across many organizations, much attention has been paid to provenance access and security. We highlight all of these features via a demonstration using an Emergency Preparedness and Response (EP&R) scenario.
The MITRE Corporation provides technical assistance, system engineering, and acquisition support to large organizations, especially U.S. Government agencies. We help our customers to plan complex systems based on emerging technologies, and to implement systems based on commercial-off-the-shelf products. In MITRE's research program, instead of emphasizing concerns of DBMS or CASE vendors, our research emphasizes the issues of organizations that need to use such products. For example, we favor areas where we can build over commercial products, rather than changing their internals. Data management at MITRE goes beyond research, to include technology transition, system engineering, product evaluation, prototypes, tutorials, advice on customers' strategic directions, and participation in standards efforts. We use prototyping to illustrate potential improvements in customer systems, to understand vendors' capabilities, or both. There are close connections with efforts in object management, real-time systems, reengineering, artificial intelligence, and security. This paper emphasizes the research efforts, grouped into five major themes: information integration, security and privacy, active and responsive systems, metrics, and digital libraries. For each theme, we list the major questions being explored, and identify projects and contacts for further information.
Lineage stores often contain sensitive information that needs protection from unauthorized access. We build on prior work for security and privacy of lineage information, focusing on complex conditions and scalable administration. We use Attribute-Based Access Control (ABAC) to express conditions based on many attributes, instead of roles. We then make administration and management more scalable, instead of managing large, monolithic access predicates for each object. To do so, we first support modular traceability and maintainability for separate concerns (e.g. security, legally mandated privacy, organizationally mandated privacy). We then provide constructs to manage authority when multiple administrators must collaborate. We show that these security techniques are needed for easy lineage security administration.
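The idea of keeping each concern in its own small attribute-based rule, rather than in one monolithic predicate per object, can be illustrated with a minimal sketch. It is not the system described in the abstract: the rule functions, the attribute names (`clearance`, `classification`, `contains_pii`, `pii_trained`, `org`, `owner_org`, `shared`), and the "all concerns must allow" combination are assumptions chosen for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Attrs = Dict[str, Any]
Rule = Callable[[Attrs, Attrs], bool]

# Each concern lives in its own rule so it can be maintained and traced
# separately. All attribute names below are hypothetical.

def security_rule(user: Attrs, node: Attrs) -> bool:
    # Allow only if the user's clearance covers the node's classification.
    return user.get("clearance", 0) >= node.get("classification", 0)

def legal_privacy_rule(user: Attrs, node: Attrs) -> bool:
    # Nodes holding personal data require a user with privacy training.
    return (not node.get("contains_pii", False)) or user.get("pii_trained", False)

def org_privacy_rule(user: Attrs, node: Attrs) -> bool:
    # Nodes are visible inside the owning organization, or when marked shared.
    return node.get("owner_org") == user.get("org") or node.get("shared", False)

POLICY: Dict[str, Rule] = {
    "security": security_rule,
    "legal_privacy": legal_privacy_rule,
    "org_privacy": org_privacy_rule,
}

def evaluate(user: Attrs, node: Attrs,
             policy: Dict[str, Rule] = POLICY) -> Tuple[bool, List[str]]:
    """Return (allowed, names of the concerns that denied access).

    Assumption: access requires every concern to allow it.
    """
    denied = [name for name, rule in policy.items() if not rule(user, node)]
    return (not denied, denied)

analyst = {"clearance": 2, "org": "A", "pii_trained": False}
lineage_node = {"classification": 1, "contains_pii": True, "owner_org": "A"}
allowed, denied_by = evaluate(analyst, lineage_node)
print(allowed, denied_by)
```

Because `evaluate` reports which concern denied the request, a denial can be traced back to the module (and, under a suitable assignment of authority, the administrator) responsible for it.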
This demonstration presents Galaxy, a schema manager that facilitates easy and correct data sharing among autonomous but related, evolving data sources. Galaxy reduces heterogeneity by helping database developers identify, reuse, customize, and advertise related schema components. The central idea is that as schemata are customized, Galaxy maintains a derivation graph, and exploits it for data exchange, discovery, and multi-database query over the "galaxy" of related data sources. Using a set of schemata from the biomedical domain, we demonstrate how Galaxy facilitates schema and data sharing.
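As a rough illustration of what a derivation graph might record, the sketch below tracks which schema each customized schema was derived from and finds the closest shared ancestor of two schemata, which is one plausible basis for exchanging data between them. The class, its methods, the single-parent restriction, and the schema names are all hypothetical and are not taken from Galaxy itself.

```python
from typing import Dict, List, Optional

class DerivationGraph:
    """Toy derivation graph: each schema has at most one parent (an assumption)."""

    def __init__(self) -> None:
        self._parent: Dict[str, Optional[str]] = {}

    def add_root(self, schema: str) -> None:
        # A root schema is one that was not derived from any other.
        self._parent[schema] = None

    def derive(self, parent: str, child: str) -> None:
        # Record that `child` was created by customizing `parent`.
        if parent not in self._parent:
            raise KeyError(f"unknown parent schema: {parent}")
        self._parent[child] = parent

    def lineage(self, schema: str) -> List[str]:
        # Walk from the schema up to its root, nearest ancestor first.
        path: List[str] = []
        node: Optional[str] = schema
        while node is not None:
            path.append(node)
            node = self._parent[node]
        return path

    def common_ancestor(self, a: str, b: str) -> Optional[str]:
        # The closest schema from which both `a` and `b` were derived.
        ancestors_of_a = set(self.lineage(a))
        for node in self.lineage(b):
            if node in ancestors_of_a:
                return node
        return None

g = DerivationGraph()
g.add_root("patient_core")
g.derive("patient_core", "clinic_patient")
g.derive("patient_core", "lab_patient")
g.derive("lab_patient", "genomics_patient")
print(g.common_ancestor("clinic_patient", "genomics_patient"))
```

In this toy setting, two sources whose schemata share an ancestor could exchange at least the components defined in that ancestor; a real schema manager would also need to track how each derivation changed those components.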
Welcome to the Twelfth ACM International Conference on Information and Knowledge Management (CIKM 2003)! The Organizing Committee, as well as the sponsors of CIKM 2003, join me in the desire that this conference will be the opportunity for you to learn, to grow, to share knowledge and skills about information and knowledge management. In twelve years, the world of information and knowledge management has changed greatly. However, the need for information scientists and technologists and practitioners to meet face-to-face and exchange ideas, as well as to welcome new members into our scientific community, is still here. To the extent that we are successful in these goals, CIKM 2003 will have served its purpose.
The data resources in a large enterprise typically exist as many separate islands of data. Each is maintained by a distinct community for its purposes, and is largely unusable by others. It is common to see whole data archipelagos comprised of thousands of separate resources [Ston00]. We would, of course, prefer to see one single integrated data resource usable by all. This is the “grand vision” of data integration: discovery of and access to all data, with multiple sources properly combined, delivered in a form that each consumer can interpret. Decentralized organizations such as the US Air Force might accept for now a slightly less ambitious dream – the ability to establish a connection between islands, a way to obtain any desired information from any other source.
The Extensible Markup Language (XML) is receiving much attention as a likely successor to HTML for expressing much of the Web’s content. In addition, XML can benefit databases and data sharing by providing a common format in which to express data structure and content. But like many new technologies, XML has raised unrealistic expectations. We give a brief overview of XML and offer opinions to help separate the benefits from the hype. In some areas, XML promises to provide significant and revolutionary improvements, such as by increasing the availability of database outputs across diverse types of systems, and by extending data management to include semi-structured data. This paper will first describe the limitations of current Web technologies for data sharing, and how XML addresses them. Next, it assesses the impact of XML on data management for both well structured and more loosely structured data. The longest section outlines the challenges of data interoperability and then describes which of these challenges XML does (and does not) address. While some of the benefits of XML are already becoming apparent, others will require years of development of new database technologies and associated standards.