Mining Massive Relational Databases

Geoff Hulten, Pedro M. Domingos, Yeuhi Abe

2003

There is a large and growing mismatch between the size of the relational data sets available for mining and the amount of data our relational learning systems can process. In particular, most relational learning systems can operate on data sets containing thousands to tens of thousands of objects, while many real-world data sets grow at a rate of millions of objects a day. In this paper we explore the challenges that prevent relational learning systems from operating on massive data sets, and develop a learning system that overcomes some of them. Our system uses sampling, is efficient with disk accesses, and is able to learn from an order of magnitude more relational data than existing algorithms. We evaluate our system by using it to mine a collection of massive Web crawls, each containing millions of pages.

Keywords:

Relational model
Change data capture
Semi-structured data
Relational database
Statistical relational learning
Database
Database model
Object-relational impedance mismatch
Database design
Data mining
Computer science
Information retrieval
Entity–relationship model
