SparkR: Scaling R Programs with Spark

2016 
R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited, as the R runtime is single-threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large-scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation, and present some of the key details of our implementation.
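As an illustration of the DataFrame API the abstract refers to, the following is a minimal sketch of an interactive SparkR session, assuming a Spark 2.x-style SparkR installation; the application name and the use of R's built-in faithful data set are illustrative choices, not examples taken from the paper.

library(SparkR)

# Start (or connect to) a Spark session from the R shell
sparkR.session(appName = "sparkr-example")

# Convert a local R data.frame into a distributed SparkDataFrame
df <- createDataFrame(faithful)

# DataFrame operations such as groupBy/summarize are executed by
# Spark's distributed engine rather than in the local R process
counts <- summarize(groupBy(df, df$waiting),
                    count = n(df$waiting))

# collect() brings the (small) aggregated result back into R
head(collect(counts))

sparkR.session.stop()

The point of the sketch is that the R user writes familiar data-frame-style operations, while the actual computation is planned and executed by Spark across the cluster; only the final, aggregated result is collected into the local R session.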