CS 7311 - Data-Driven Computational Methods and Infrastructure

Course Description:

This course covers computational and statistical methods for using large-scale data sets (‘big data’) to answer scientific and business questions. It focuses on framing research questions, understanding how data can answer them, and using modern software tools such as Spark and Hadoop for scalable data storage, processing, and analysis.

Prerequisite:

Consent of the instructor.

Course Objectives:

Upon completing this course, students will be able to:
• Formulate concrete research questions to address business or scientific objectives
• Identify or collect data to answer research questions
• Design tools to process, clean, and organize data for subsequent analysis
• Create and run data processing and analysis pipelines to compute statistical results over large-scale data sets using modern high-performance computing infrastructure such as Apache Spark
• Present results clearly using data visualizations and written prose
• Interpret analysis results and identify their implications for business or scientific questions
• Determine the appropriate data processing technology to support a desired analysis method

Course Notes:

New course effective Fall 2017. Available only to computer science majors.