A course on big data analytics

Joshua Eckroth

2018

Abstract: This report details a course on big data analytics designed for undergraduate junior and senior computer science students. The course focuses heavily on projects and on writing code for big data processing. It is designed to help students learn parallel and distributed computing frameworks and techniques commonly used in industry. The curriculum is a progression of projects requiring increasingly sophisticated big data processing, ranging from data preprocessing with Linux tools, to distributed processing with Hadoop MapReduce and Spark, to database queries with Hive and Google’s BigQuery. We discuss hardware infrastructure and experimentally evaluate the cost/benefit of an on-premises server versus Amazon’s Elastic MapReduce. Finally, we showcase outcomes of our course in terms of student engagement and anonymous student feedback.

Keywords:

Big data
Data science
Curriculum
Distributed computing
Data pre-processing
Student engagement
Computer science
Undergraduate education
Apache Spark
Big data processing
Cloud computing
