Scalable Analysis of Open Data Graphs

Andrei Stoica, Michael Valdron, Ken Pu

2019

We studied Open Data as a connected graph. Each data package is treated as a vertex, and we examined the similarity graphs induced by several different similarity measures. We analyzed each resulting similarity graph using different metrics to estimate its quality and informativeness. To cope with the size of the open data graph (over 6 billion edges), graph construction and analysis were carried out on a distributed computation framework, Apache Spark. The algorithms were implemented using Spark's resilient distributed dataset (RDD) algebra and executed on the Google Cloud Platform (GCP).
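
As a small illustration of the idea of a similarity graph over data packages, the following is a minimal sketch in plain Python, not the authors' Spark implementation. The abstract does not name the similarity measures used, so Jaccard similarity over sets of column names is an assumption made here, as are the package names, the threshold, and the helper functions `jaccard`, `similarity_graph`, and `degrees`.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_graph(packages, threshold):
    """Return weighted edges (u, v, score) for pairs scoring >= threshold.

    `packages` maps a package name (vertex) to a set of features.
    The feature choice (column names) is an assumption for illustration.
    """
    edges = []
    # Compare every unordered pair of packages exactly once.
    for (u, fu), (v, fv) in combinations(sorted(packages.items()), 2):
        score = jaccard(fu, fv)
        if score >= threshold:
            edges.append((u, v, score))
    return edges


def degrees(edges):
    """Count how many edges touch each vertex (a simple graph metric)."""
    deg = {}
    for u, v, _ in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


# Hypothetical packages, described by their column names.
packages = {
    "pkg_a": {"city", "year", "population"},
    "pkg_b": {"city", "year", "budget"},
    "pkg_c": {"species", "habitat"},
}
edges = similarity_graph(packages, threshold=0.4)
```

The all-pairs loop is quadratic in the number of packages, which is why a graph with billions of edges calls for a distributed framework; in that setting the pairwise comparison would typically be expressed as a distributed join or Cartesian product followed by a filter.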

Keywords:

Open data
Graph
Scalability
Computer science
Data mining
Cloud computing
Computation
Spark (mathematics)
Vertex (geometry)
Computer cluster
Connectivity
Distributed database
