Data Centric Computing
Instructors: Manoj Gupta and Mayank Singh
Assessment
50% : Exams
30% : Lab Exams
20% : Lab Assignments and Attendance
Syllabus
Algorithms for processing data: Understanding time complexity; asymptotics (big-O notation and examples).
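As a rough illustration of what big-O notation describes, the sketch below counts the steps taken by two search strategies on sorted data of growing size. The function names and input sizes are illustrative assumptions, not course material; the point is only that one count grows in proportion to n while the other grows roughly like log2(n).

```python
def linear_search_steps(items, target):
    """Count the elements inspected by a left-to-right scan: O(n) in the worst case."""
    steps = 0
    for value in items:
        steps += 1
        if value == target:
            break
    return steps


def binary_search_steps(items, target):
    """Count the probes made by binary search on sorted input: O(log n)."""
    lo, hi = 0, len(items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            break
        if items[mid] < target:
            lo = mid + 1   # target can only be in the right half
        else:
            hi = mid - 1   # target can only be in the left half
    return steps


# Worst case for the linear scan: the target is the last element.
for n in (1_000, 100_000):
    data = list(range(n))
    print(n, linear_search_steps(data, n - 1), binary_search_steps(data, n - 1))
```

Multiplying the input size by 100 multiplies the linear step count by 100, while the binary search count increases by only a handful of probes.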
Storing sequential data: arrays and linked lists, data structures for storing tabular data.
Organizing data and finding items: Searching (linear/binary search). Sorting: sorting almost-sorted data (bubble sort, insertion sort),
fast sorting in the worst case (heapsort), using randomization to sort faster (quicksort), sorting without random access, e.g. linked lists (mergesort).
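To show why insertion sort suits almost-sorted data, here is a minimal sketch that sorts a list and also counts how many element shifts were needed. The function name and the two sample lists are made up for illustration.

```python
def insertion_sort(items):
    """Return a sorted copy of items and the number of element shifts performed."""
    result = list(items)
    shifts = 0
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # Move larger elements one slot right until current fits.
        while j >= 0 and result[j] > current:
            result[j + 1] = result[j]
            shifts += 1
            j -= 1
        result[j + 1] = current
    return result, shifts


almost_sorted = [1, 2, 4, 3, 5, 6, 8, 7]   # only two adjacent pairs out of order
reversed_data = [8, 7, 6, 5, 4, 3, 2, 1]   # every pair out of order

print(insertion_sort(almost_sorted))
print(insertion_sort(reversed_data))
```

The shift count equals the number of out-of-order pairs in the input, so the work is close to linear when the data is nearly sorted and quadratic when it is fully reversed.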
Data Analysis: Use of spreadsheets for data analysis; Python libraries for data manipulation, e.g. Pandas;
introduction to SQL. Answering queries over tabular data: map-filter-reduce, joins and unions. Using vectorized
operations for efficiency. Techniques: Data structuring (tidy data) and cleaning (dealing with multiple labels for the same
data, missing data). Summarizing data (averages, variance, moments), functions of tables (loading, cleaning, normalizing).
Introduction to data visualization using matplotlib; plotting various statistics: categorical distributions, numerical distributions, overlaid graphs.
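The map-filter-reduce pattern for answering queries over tabular data, mentioned in the data analysis topics, can be sketched with the Python standard library alone. The table, column names, and the query (average marks of CS students, skipping missing values) are hypothetical.

```python
from functools import reduce

# A tiny hypothetical table: one dict per row, None marks a missing value.
rows = [
    {"name": "A", "dept": "CS", "marks": 82},
    {"name": "B", "dept": "EE", "marks": 74},
    {"name": "C", "dept": "CS", "marks": 91},
    {"name": "D", "dept": "CS", "marks": None},
]

# Filter: keep CS rows that actually have a marks value.
cs_rows = filter(lambda r: r["dept"] == "CS" and r["marks"] is not None, rows)

# Map: project each remaining row down to the column of interest.
marks = map(lambda r: r["marks"], cs_rows)

# Reduce: fold the values into a running (sum, count) pair.
total, count = reduce(lambda acc, m: (acc[0] + m, acc[1] + 1), marks, (0, 0))

average = total / count
print(average)
```

Libraries such as Pandas express the same query with column-wise (vectorized) operations instead of per-row lambdas, which is typically much faster on large tables.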
Deriving Conclusions: Anomaly detection techniques in data: tests, e.g. the GRIM test and Benford’s law. Statistical traps: correlation vs. causation,
base-rate fallacy / prosecutor’s fallacy, Simpson’s paradox, data censoring, the Will Rogers effect, lead-time bias, and length-time bias.
Means versus medians. Importance of higher moments. Misleading visualizations.
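As one concrete instance of the anomaly detection tests listed above, the GRIM test asks whether a reported mean is arithmetically possible for a given sample size, assuming the underlying observations are integers (for example, survey ratings). The sketch below is a simplified version; the function name is hypothetical, and it assumes a non-negative mean passed as a string so that the number of reported decimals is preserved.

```python
import math


def grim_consistent(reported_mean: str, n: int) -> bool:
    """Return True if some integer total over n observations rounds to reported_mean."""
    decimals = len(reported_mean.split(".")[1]) if "." in reported_mean else 0
    approx_total = float(reported_mean) * n
    # Try the integer totals closest to mean * n, with one extra on each side.
    for total in range(math.floor(approx_total) - 1, math.ceil(approx_total) + 2):
        if f"{total / n:.{decimals}f}" == reported_mean:
            return True
    return False


print(grim_consistent("5.18", 28))  # a total of 145 gives 5.1786, which rounds to 5.18
print(grim_consistent("5.19", 28))  # no integer total over 28 rounds to 5.19
```

A failed check does not prove misconduct; it only flags that the reported mean and sample size cannot both be right for integer data.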
Making Predictions: Linear regression; basics of classification (train, test, validation); creating features; naive Bayes classification,
logistic regression and its interpretation; introduction to clustering (k-means). Confidence in predictions: prediction intervals, confidence intervals.
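For the linear regression topic, a minimal sketch of fitting a straight line to one feature by ordinary least squares is shown here. The function name and the sample points are illustrative assumptions; in practice a library would normally be used.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)

# Predict for an unseen x using the fitted line.
print(slope * 4 + intercept)
```

The fitted slope is interpreted as the expected change in y per unit change in x; with noisy real data the fit would not be exact, which is where the prediction and confidence intervals in the syllabus come in.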