http://www.mmds.org
Introduction to Data Science
Yongfeng Zhang, Rutgers University
Not only massive data, but also massive models and massive computation, e.g., Large Language Models.
Source: Scaling Language Model Training to a Trillion Parameters Using Megatron (NVIDIA)
Data contains value and knowledge
But to extract the knowledge, data needs to be:
▪ Stored (Database Systems)
▪ Managed (Data Management for Data Science)
▪ And ANALYZED: this class (Data Science and Data Mining)
Given lots of data, discover patterns and models that are:
▪ Valid: hold on new data with some certainty
▪ Useful: should help us to learn something from data
▪ Unexpected: non-obvious to us
▪ Understandable: humans should be able to interpret the pattern
Data Mining ≈ Big Data (from 2012)
Big Data + Predictive Analytics ≈ Data Science
We can Observe it!
Tycho Brahe (1546-1601), Danish astronomer
▪ Skilled at astronomical observation
▪ Observed and recorded a lot of data about how planets circle around the sun
▪ Data: (Time, Position) records, e.g., 1, (a, b); 2, (c, d); 3, (e, f); …

We can Predict it!
Johannes Kepler (1571-1630), German astronomer, student of Tycho Brahe
▪ Analyzed Tycho's data and discovered a rule hidden in the data
▪ "Kepler's laws of planetary motion": τ²/r³ = K
  (τ: period of circling around the sun, r: radius)
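To make Kepler's "rule hidden in the data" concrete, here is a minimal Python sketch (not from the slides) that computes K = τ²/r³ from a few well-known orbital periods (in years) and mean orbital radii (in astronomical units); in these units K comes out close to 1 for every planet.

```python
# Minimal sketch: recovering Kepler's third law, tau^2 / r^3 = K, from data.
# Periods are in years, orbital radii (semi-major axes) in astronomical units.
planets = {
    "Earth":   (1.00,  1.000),
    "Mars":    (1.88,  1.524),
    "Jupiter": (11.86, 5.203),
    "Saturn":  (29.46, 9.537),
}

for name, (tau, r) in planets.items():
    K = tau**2 / r**3
    print(f"{name:8s}  K = {K:.3f}")   # each K is approximately 1.0
```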
We Understand it! We know Why!
Isaac Newton (1643-1727), English mathematician, physicist, astronomer, theologian, and author
▪ Proposed Newton's law of universal gravitation + differential calculus
▪ Naturally derives Kepler's laws of planetary motion: τ²/r³ = K holds because of universal gravitation
Tycho Brahe (1546-1601): Data Collection. Almost automatic.
Johannes Kepler (1571-1630): Data Analytics (what). Many available methods; the main part of this course.
Isaac Newton (1643-1727): Data Interpretation (why). Still needs much exploration; we will touch on some topics.

[1] Z. Li, J. Ji, Y. Zhang. "From Kepler to Newton: Explainable AI for Science Discovery." arXiv:2111.12210.
Massive data:
- Search engine logs
- E-commerce transactions
- Social network data
- Impossible to manually analyze the data and derive conclusions from it

Use massive computing facilities and develop advanced algorithms to analyze the data and make predictions.
Advanced algorithms: machine learning, data mining, etc.
Descriptive methods
▪ Find human-interpretable patterns that describe the data
▪ Example: Clustering (as sketched below)
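As a concrete illustration of a descriptive method, here is a minimal clustering sketch (not from the slides) that groups a handful of made-up 2-D points with scikit-learn's KMeans; the labels and centroids are the human-interpretable summary of the data.

```python
# Minimal clustering sketch (assumes scikit-learn and NumPy are installed).
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points: two loose groups around (0, 0) and (5, 5).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])

# Descriptive output: each point gets a cluster label, and each cluster
# is summarized by its centroid -- a compact, interpretable description.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("labels:   ", kmeans.labels_)
print("centroids:", kmeans.cluster_centers_)
```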
Predictive methods
▪ Use some variables to predict unknown or future values of other variables
▪ Example: Recommender systems, Computational Advertising, etc.
This class overlaps with machine learning, statistics, artificial intelligence, and databases, but puts more stress on:
▪ Scalability (big data and big models)
▪ Algorithms
▪ Computing architectures
▪ Automation for handling large data
Data Mining sits at the intersection of Statistics, Machine Learning, and Database systems.
Usage, Quality, Context, Streaming, Scalability
We will learn to analyze different types of data:
▪ Data is high dimensional (dim. reduction)
▪ Data is a graph (social and web link graph analytics)
▪ Data is infinite/never-ending (streaming data processing)
▪ Data is labeled (supervised) or unlabeled (unsupervised)
We will learn to use different models of computation:
▪ Single machine in-memory
▪ Streams and online algorithms
▪ Map-Reduce, Hadoop, Spark, PyTorch, Hugging Face (a small Spark sketch follows below)
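To give a feel for the Map-Reduce style of computation, here is a minimal PySpark word-count sketch (not part of the course materials; the input file "corpus.txt" is a placeholder).

```python
# Minimal Map-Reduce-style word count in PySpark (assumes pyspark is installed;
# "corpus.txt" is a hypothetical input file).
from pyspark import SparkContext

sc = SparkContext(appName="WordCountSketch")

counts = (
    sc.textFile("corpus.txt")                  # one RDD element per line
      .flatMap(lambda line: line.split())      # map: line -> words
      .map(lambda word: (word, 1))             # map: word -> (word, 1)
      .reduceByKey(lambda a, b: a + b)         # reduce: sum counts per word
)

print(counts.take(10))                         # a few (word, count) pairs
sc.stop()
```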
We will learn to solve real-world problems:
▪ Recommender systems (dimensionality reduction, latent factor models, Collaborative Filtering, Collaborative Reasoning, Large Language Models, Amazon)
▪ Market Basket Analysis (frequent itemset mining, the beer and diaper example, Walmart)
▪ Spam detection (web graph analysis, PageRank, Google)
▪ Social network analysis (Facebook)
We will learn various "tools":
▪ Linear algebra (SVD, LFM, Community analysis; see the SVD sketch below)
▪ Optimization (stochastic gradient descent)
▪ Dynamic programming (frequent itemsets)
▪ Hashing (LSH, Bloom filters)
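As a taste of the linear-algebra toolbox, here is a minimal NumPy sketch (with a made-up ratings matrix) that uses a truncated SVD to build a rank-2 approximation, the basic idea behind latent factor models for recommendation.

```python
# Minimal truncated-SVD sketch with NumPy; the ratings matrix is made up.
import numpy as np

# Rows = users, columns = items, entries = ratings.
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

# Full SVD, then keep only the top-k singular values/vectors.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-2 approximation

print(np.round(R_approx, 2))   # close to R, but expressed via 2 latent factors
```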
▪ High dim. data: Locality sensitive hashing, Clustering, Dimension reduction
▪ Graph data: PageRank, TrustRank, Spam Detection, Community Detection, Graph Representation Learning, Graph Neural Networks
▪ Infinite data: Filtering data streams, Web advertising, Queries on streams
▪ Machine learning: SVM, SVD, SGD, Decision Trees, kNN, Perceptron, Neural Network, Neural Reasoning, Large Language Models
▪ Apps: Recommender systems, Search Engine, Social Networks
Instructor:
Yongfeng Zhang, [email protected]
TA:
Xi Zhu (Section 1), Email: [email protected]
Wujiang Xu (Section 2), Email: [email protected]
Devanshi Patel (Section 3), Email: [email protected]
Sai Samhith Thatikonda (Section 4), Email: [email protected]
Graders:
Satyam Saini (Section 1&2), Email: [email protected]
Sasank Chindirala (Section 3&4), Email: [email protected]
Data Structures (CS 112)
▪ Arrays, lists, sets, maps, queues, linked lists, etc.
Basic probability
▪ Moments, typical distributions, MLE, …
Programming
▪ Your choice, but C++/Java/Python will be very useful
Infrastructure (optional)
▪ Linux / Hadoop / Spark / PyTorch / Jupyter / Hugging Face
Fundamental Algorithms (CS 344, optional but preferred)
▪ Linear Algebra, basic data structures
We provide some background, but the class will be fast-paced.
Course website:
▪ Canvas: Please join our Canvas homepage
▪ Lecture slides, homework, solutions, readings
Textbook: (LRU) Mining of Massive Datasets by J. Leskovec, A. Rajaraman, J. D. Ullman
Free online: http://www.mmds.org
Slides, lecture notes, homework answers, etc. can be found under Files.
Course Website
▪ Homework assignments will also be released and submitted on Canvas.
Be sure to enable your course notifications!
Canvas Chat:
▪ You may post questions and participate in discussions in the Canvas Chat room.
▪ Use Canvas for technical questions and public communication with the course staff.
For private questions, please email me or the TA.
We will post course announcements on Canvas (make sure you check it regularly).
4 homework assignments: 40 points
▪ Theoretical and programming questions (10 points each)
▪ Assignments take lots of time. Start early!
Time and Submission
▪ Homework assignments are posted on Fridays, and you have 2 weeks to finish (due at 11:59pm two Fridays later).
▪ Homework can be submitted on Canvas.
▪ The late policy will be shown on the homework assignment: a 90% discount factor for each day late.
Midterm: 20 points
▪ Wednesday, Oct 30: 12-hour take-home exam (no class on that day)
▪ Detailed instructions will be provided on Canvas.
Final Project: 40 points
▪ Complete as a team of at most 3 students. The team can either choose from a set of provided project topics (to be provided on Canvas later this semester) or propose their own project topic (subject to approval from the instructor), write the project report as a "mini-paper," make slides for the project, and submit the paper, slides, code, and data. (40%)
Bonus: 10 points
▪ We will select 20 teams; each can give a 15-min presentation to the class, and each member of the selected teams will get 10 bonus credits.
▪ The final two weeks are presentation days.
It's going to be fun and hard work. ☺
3 To-do items for you:
▪ Register on Canvas
▪ Begin to think about your team for the class project
Additional details/instructions can be found in the course syllabus (posted on Canvas).