Clustering header categories extracted from web tables

George Nagy, David W. Embley, Mukkai S. Krishnamoorthy, Sharad C. Seth

2015

Revealing related content among heterogeneous web tables is part of our long-term objective of formulating queries over multiple sources of information. Two hundred HTML tables from institutional web sites are segmented, and each table cell is classified according to the fundamental indexing property of row and column headers. The categories that correspond to the multi-dimensional data cube view of a table are extracted by factoring the (often multi-row or multi-column) headers. To reveal commonalities between tables from diverse sources, the Jaccard distances between pairs of category headers (and also between table titles) are computed. We show how about one third of our heterogeneous collection can be clustered into a dozen groups that exhibit table-title and header similarities that can be exploited for queries.
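As a rough illustration of the distance computation mentioned above, the following sketch computes the Jaccard distance between two category headers. The tokenization (lowercased, whitespace-separated words treated as a set) and the sample header strings are assumptions made for this example; the abstract does not say how headers are turned into sets.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B|.

    Two empty sets are treated as identical (distance 0.0); this
    convention is an assumption of the sketch, not taken from the paper.
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def tokens(header: str) -> set:
    """Hypothetical tokenizer: lowercase words split on whitespace."""
    return set(header.lower().split())


# Made-up category headers, used only to exercise the function.
h1 = tokens("Year Population Male Female")
h2 = tokens("Year Population Urban Rural")

# The headers share 2 of 6 distinct words, so the distance is 1 - 2/6.
print(round(jaccard_distance(h1, h2), 3))
```

A distance of 0 means the two headers use exactly the same words and a distance of 1 means they share none, so pairs with small distances are the natural candidates for grouping into a cluster.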

Keywords:

Jaccard index
Header
Table (information)
Data cube
Artificial intelligence
Information retrieval
Search engine indexing
Cluster analysis
Pattern recognition
Computer science
Information integration
Segmentation

