EC2023E: Foundations of Data Science
ANUP APREM
Big Data Phenomenon
• We are collecting and storing data at an unprecedented
rate.
• Examples:
• YouTube, Facebook, MOOCs, news sites.
• Credit card transactions and Amazon purchases.
• Transportation data (Google Maps, Waze, Uber)
• Gene expression data and protein interaction assays.
• Maps and satellite data.
• Large Hadron Collider and sky surveys.
• Phone call records and speech recognition results.
• Video game worlds and user actions.
Data Science
• What to do with all this data?
• Too much data to search through manually
• But there is valuable information in the data
• How can we use it for fun, profit, and/or the greater good?
• The process of extracting information from raw data is called data analysis.
• Why data analytics?
• Understand mechanism of data generation
• Forecast response ➔ Machine learning model
• Data visualization: Communicate hidden information to business.
Data Science Pipeline
Classic Example: Consumer churn in wireless market
• Churn: Customers who switch from one wireless provider to another at the
end of their contract.
• Incentives: Offered to a particular customer to stay in the contract
• Data extraction? History, Frequent callers
• Data cleaning? Missing address, age, occupation
• Data exploration & visualization? Difference in churn between
males and females, based on occupation, address
• Predictive model: Can we build a ML model that predicts whether
churn occurs?
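The cleaning, exploration, and prediction steps above can be sketched in Pandas. This is a minimal illustration only: the records are synthetic toy data invented for the example, the column names are hypothetical, and the "model" is a single threshold rule standing in for a real ML model.

```python
import pandas as pd

# Synthetic toy records, invented purely for illustration (not real customer data).
customers = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4, 5, 6],
        "gender": ["F", "M", "F", "M", "F", "M"],
        "age": [34, None, 45, 29, None, 52],
        "monthly_minutes": [320, 150, 410, 90, 275, 60],
        "churned": [0, 1, 0, 1, 1, 0],
    }
)

# Data cleaning: fill missing ages with the median of the known ages.
median_age = customers["age"].median()
customers["age"] = customers["age"].fillna(median_age)

# Data exploration: fraction of customers who churned, per gender.
churn_by_gender = customers.groupby("gender")["churned"].mean()

# Stand-in for a predictive model: flag low-usage customers as likely churners.
# A real project would fit and validate a proper ML model instead of this rule.
threshold = customers["monthly_minutes"].median()
customers["predicted_churn"] = (customers["monthly_minutes"] < threshold).astype(int)
accuracy = (customers["predicted_churn"] == customers["churned"]).mean()

print(churn_by_gender)
print(f"Toy rule accuracy: {accuracy:.2f}")
```

With only six rows the numbers carry no meaning; the point is the order of the steps (clean, explore, then predict).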
Python and Data Science
• Generic Programming and Scripting
• Large Libraries for data analytics and predictive modelling
• Numpy, Pandas, Matplotlib
• Interface to databases
• SQL, NoSQL
• API programming, Web scraping
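As one illustration of the database interface, the sketch below uses Python's built-in sqlite3 module with an in-memory SQL database. The table name, column names, and rows are hypothetical and chosen only for the example; other SQL or NoSQL systems need their own client libraries.

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")

# Hypothetical rows, invented for illustration.
rows = [("asha", 250.0), ("ravi", 120.5), ("asha", 99.5)]
conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
conn.commit()

# Aggregate in SQL, then pull the result into a plain Python dict.
query = (
    "SELECT customer, SUM(amount) FROM purchases "
    "GROUP BY customer ORDER BY customer"
)
totals = dict(conn.execute(query).fetchall())
conn.close()

print(totals)
```

The `?` placeholders let the driver handle quoting, which is safer than building SQL strings by hand.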
Data Science and India
• NITI Aayog recognizes data science as a technology that can address
India's needs across sectors such as education and health.
• According to Coursera's 2021 Global Skills Report, India ranks 66th globally
in data science, with an estimated 58% skill gap.
• Reports of 1 lakh (100,000) job vacancies in the data analytics domain.
• Increased focus on data analytics in healthcare, agriculture,
personalized education, smart cities and transportation
• This Course: Knowledge of data science and tools in Python
Data Science – Skills
• Python and Data Science (This course)
• Foundations of Machine Learning (EC2011E)
• Deep Learning (EC3057E)
• Artificial Intelligence (EC3051E)
• Application areas: Computer Vision (EC3055E), Autonomous
Intelligent Systems (EC3052E), Reinforcement Learning (EC3059E)
• Domain Knowledge
Course Outline
• Core Database concepts for Data science
• Structured, Unstructured, Big Data
• Data Cleaning (Pandas)
• Data Pre-processing (Pandas)
• Data Visualization – Grammar of Graphics (Matplotlib, Seaborn)
• Exploratory data analysis
• Statistics for data science
Course Schedule
• Credit: 2-0-2-3
• Lectures (NLHC 301) (I reserve the right to use any 2 out of 3)
• Mon 5-6pm
• Thu 5-6pm
• Wed 8-9am (Used in lieu)
• Lab (IC Lab, ECED Block II)
• TA1, TA2, TB1, TB2 (20 students per slot): please enter your choice in the
spreadsheet on Eduserver
• Individual lab (10 computers in IC Lab)
• Google Classroom for lab and submission
• Evaluation: 75% lab assessment, 25%
Course Evaluation
Evaluation                         Contribution (%)
Midterm                            25
Labs (Individual)                  30
Course Project (including viva)    20
End Exam                           25
Course Project: Identify a suitable data problem, obtain the dataset, create a
database and perform data visualization on the problem
Proposal due: One week after Midterm
Course Project due: Last but one week of class (one week for evaluation/viva)
Acknowledgement
• Course developed in 2021 through a British Council Going Global
Exploratory Grant in partnership with Oxford Brookes University, UK