DA 231o Data Engineering at Scale 3:1 (August 2022)

Course Instructor: Yogesh Simmhan, CDS

Course description: This four-credit course will be offered every year in the August-December term as a core course of the Data Science and Business Analytics (DSBA) M.Tech (Online) programme. It is intended as an introductory graduate-level (200-series) course. It will motivate the need for Big Data processing, the use of distributed systems to design scalable storage and processing systems, and programming and algorithm design on such platforms to implement large-scale data science applications. The course will have several assignments and mini projects. Lectures will be delivered over video conference and recorded. Students will have access to computing resources on campus or on the Cloud to complete the assignments.

Syllabus

Module 1: Introduction to Big Data storage systems. Motivation for data engineering at scale. Architecture of Google File System/HDFS.
Module 2: Introduction to Big Data processing systems. Overview of distributed systems, Cloud computing, strong and weak scaling. Architecture and internals of Apache Spark. Programming using Spark.
Module 3: Introduction to relational and NoSQL databases. ACID, BASE and CAP Theorem. Architecture of Dynamo/Cassandra. Overview of Data lakes.
Module 4: Introduction to streaming and linked data processing. Architecture and programming of distributed streaming systems like Kafka, Storm and/or Spark Streaming. Architecture and programming of distributed graph processing systems like Pregel/Giraph.
Module 5: Machine learning at large scales. Designing ML pipelines. Distributed and federated learning.
Module 6: Topics on Big Data and IoT, Cloud Computing, Ethics, etc.

Textbooks / References

  1. Hadoop: The Definitive Guide, by Tom White
  2. Learning Spark, by Holden Karau, Patrick Wendell, Matei Zaharia, Andy Konwinski
  3. Selected readings from the literature

Prerequisites: Basics of programming, data structures, algorithms, computer systems.

Grading:

  • Homeworks and Quizzes 60%
  • Project 20%
  • Final exam 20%

The course will follow a continuous evaluation philosophy. Students will take a short quiz after each module, held during class and proctored over video conference. There will be programming assignments and/or mini projects for different modules. A final exam will be held to evaluate students' overall learning in the course. This may be a common proctored exam across all courses. Optionally, it may be replaced by a take-home exam.