Relational database management system support for sparse data sets

2006 
This dissertation describes techniques for optimizing physical storage of sparse data sets that are stored in a Relational Database Management System (RDBMS). "Sparse" data, in which relations have many attributes that are null for most tuples, presents a challenge for an RDBMS. The traditional RDBMS approach to physical storage assumes that rows of tables have few null values, and that most tables will have tens of columns, and never more than a few hundred columns. Sparse data sets challenge both assumptions by having hundreds to thousands of columns in a table and rows with mostly null values. In the space of physical storage optimization, we consider new techniques for physical record storage, providing indexing support over the data, and vertically partitioning the data. We analyze a row storage format called interpreted storage and show that it is a better than the traditional positional storage format for sparse data. For indexing, we show that it is feasible to index most, if not all, attributes in a sparse data set and efficiently maintain all of the indexes by utilizing sparse indexes, which do not index null values. We continue our study by analyzing real-world sparse data sets and identify that the data contains frequently co-occur attributes within the rows of the data. We recognize that vertical partitioning can exploit these patterns to store a sparse data set in smaller tables. The small tables provide faster scan times over groups of co-occurring attributes than scanning the entire data set. We isolate the tradeoffs in choosing groups of attributes for vertical partitions, and suggest, a method that uses k-nearest neighbors for mining the patterns in the data. We integrated all of our techniques into the open-source database management system PostgreSQL. The integration illustrates the practicality and viability of implementing our techniques in other systems. We conclude with an overview of our results and directions for future work.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    0
    References
    0
    Citations
    NaN
    KQI
    []