DA 227o Data Mining 3:1 (August 2022)

Course Instructor: Parthasarathy Ramachandran MS

Course description: This is a four-credit elective course offered in the August-December term. It is intended as an introduction to various data mining techniques. The course will cover linear models, linear model selection and regularization, market basket analysis, classification, and clustering. It will also give participants the opportunity to implement some of the algorithms discussed in the course using the MapReduce framework.

Topics

  • Introduction to statistical learning, Bias-Variance trade-off
  • Linear regression, model estimation and assessing the accuracy of the model, Quantitative vs qualitative predictors
  • Linear model selection, Shrinkage methods – Ridge regression and Lasso
  • Market basket analysis, Apriori algorithm, FP-tree construction and projection, association rule interestingness measures
  • Classification, Logistic regression, Discriminant analysis, Decision trees – ID3 and C4.5, Bagging, Boosting and Random forests, Naïve Bayes, SVM
  • Clustering, K-Means, Mixture models and Expectation Maximization
  • Recommendation systems – Content based systems and collaborative filtering
  • Mining social networks – clustering social network graphs, communities
  • Factor analysis and Principal component analysis
  • Error estimation, Resampling methods, k-fold cross-validation and bootstrap

Textbooks / References

  1. An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
  2. Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and Vipin Kumar
  3. Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman

Prerequisites: Basic programming experience, probability and statistics, Linear algebra

Grading:

  • Assignments: 30
  • Final project: 20
  • Midterm: 25
  • Final: 25