E0 259-O Data Analytics 3:1 (Summer 2023)

Course Instructors: Rajesh Sundaresan (ECE), Ramesh Hariharan (CSA), and Vikram Srinivasan

Course description: Data Analytics has assumed increasing importance in recent times. Several industries are now built around the use of data for decision making. Several research areas too, genomics and neuroscience being notable examples, increasingly rely on large-scale data generation rather than small-scale experimentation to generate initial hypotheses. This brings about a need for data analytics. This course will develop modern statistical tools and modeling techniques through hands-on data analysis in a variety of application domains.

The course will illustrate the principles of hands-on data analytics through six case studies. For each topic, we will introduce a scientific question and discuss why it should be addressed. Next, we will present the available data and how it was collected. We will then discuss models, provide analyses, and finally touch upon how to address the scientific question using the analyses.

We will cover a subset of the following case studies:

  1. Astronomy: From Tycho Brahe's observations to the conclusion that Mars moves in an elliptical orbit.
  2. Sports: The Duckworth-Lewis-Stern method for setting targets in shortened limited overs cricket matches.
  3. COVID-19: Serological surveys.
  4. Visual Neuroscience: Neural correlates predict search difficulty.
  5. Genomics: Understanding the causes of cancer.
  6. Genomics: The basis for red-green colour blindness.
  7. Biology: Effects of smoking.
  8. Networks: Community detection.

Syllabus

Data sets from astronomy, genomics, neuroscience, sports, biology, epidemiology, and networks will be analysed to answer specific scientific questions. Statistical tools and modeling techniques will be introduced as needed to analyse the data and eventually address the scientific question. Specific data sets will vary across offerings. Example topics are the following: Tycho Brahe's data on Mars and Kepler's analysis of its orbit (astronomy), the Duckworth-Lewis-Stern method for setting targets in shortened limited overs cricket matches (cricket), retinoblastoma and causes of cancer (genomics), the basis for red-green colour blindness (genomics), serological surveys for COVID-19 (epidemiology), effects of smoking (biology), and community detection (networks).

Textbooks / References

  1. B. Efron and T. Hastie, Computer Age Statistical Inference, Cambridge University Press, 2016.

Prerequisites:

  1. Random Processes (E2 202A) OR Probability and Statistics (E0 232) OR equivalent
  2. Linear Algebra (E1 NEW) OR Matrix Theory

Grading:

  • Assignments 40%
  • Midterm exam 30%
  • Final exam 30%