http://www.mmds.org
Introduction to Data Science
Yongfeng Zhang, Rutgers University
Not only massive data, but also massive models and massive computation, e.g., Large Language Models.
Source: Scaling Language Model Training to a Trillion Parameters Using Megatron (NVIDIA)
Data contains value and knowledge
But to extract the knowledge, data needs to be:
▪ Stored (Database Systems)
▪ Managed (Data Management for Data Science)
▪ And ANALYZED: this class (Data Science and Data Mining)
Given lots of data, discover patterns and models that are:
▪ Valid: hold on new data with some certainty
▪ Useful: should help us to learn something from data
▪ Unexpected: non-obvious to us
▪ Understandable: humans should be able to interpret the pattern
Data Mining ≈ Big Data (from 2012)
Big Data + Predictive Analytics ≈ Data Science
We can Observe it!
Tycho Brahe (1546-1601), Danish astronomer
▪ Skilled at astronomical observation
▪ Observed and recorded a lot of data about how planets circle around the sun
▪ Data: (Time, Position) records, e.g., 1, (a, b); 2, (c, d); 3, (e, f); …

We can Predict it!
Johannes Kepler (1571-1630), German astronomer, student of Tycho Brahe
▪ Analyzed Tycho's data and discovered a rule hidden in the data
▪ "Kepler's laws of planetary motion": τ²/r³ = K
  (τ: period of circling around the sun, r: radius)
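To make Kepler's "rule hidden in the data" concrete, here is a minimal Python sketch (not from the slides) that computes K = τ²/r³ from a few well-known orbital periods (in years) and mean orbital radii (in astronomical units); in these units K comes out close to 1 for every planet.

```python
# Minimal sketch: recovering Kepler's third law, tau^2 / r^3 = K, from data.
# Periods are in years, orbital radii (semi-major axes) in astronomical units.
planets = {
    "Earth":   (1.00,  1.000),
    "Mars":    (1.88,  1.524),
    "Jupiter": (11.86, 5.203),
    "Saturn":  (29.46, 9.537),
}

for name, (tau, r) in planets.items():
    K = tau**2 / r**3
    print(f"{name:8s}  K = {K:.3f}")   # each K is approximately 1.0
```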
We Understand it! We know Why!
Isaac Newton (1643-1727), English mathematician, physicist, astronomer, theologian, and author
▪ Proposed Newton's law of universal gravitation + differential calculus
▪ Naturally derives Kepler's laws of planetary motion: τ²/r³ = K holds because of universal gravitation
Tycho Brahe (1546-1601): Data Collection. Almost automatic.
Johannes Kepler (1571-1630): Data Analytics (what). Many available methods; the main part of this course.
Isaac Newton (1643-1727): Data Interpretation (why). Still needs much exploration; we will touch on some topics.

[1] Z. Li, J. Ji, Y. Zhang. "From Kepler to Newton: Explainable AI for Science Discovery." arXiv:2111.12210.
Massive data:
- Search engine logs
- E-commerce transactions
- Social network data
- Impossible to manually analyze the data and derive conclusions from it

Use massive computing facilities and develop advanced algorithms to analyze the data and make predictions.
Advanced algorithms: machine learning, data mining, etc.
Descriptive methods
▪ Find human-interpretable patterns that describe the data
▪ Example: Clustering (as sketched below)
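As a concrete illustration of a descriptive method, here is a minimal clustering sketch (not from the slides) that groups a handful of made-up 2-D points with scikit-learn's KMeans; the labels and centroids are the human-interpretable summary of the data.

```python
# Minimal clustering sketch (assumes scikit-learn and NumPy are installed).
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points: two loose groups around (0, 0) and (5, 5).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])

# Descriptive output: each point gets a cluster label, and each cluster
# is summarized by its centroid -- a compact, interpretable description.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("labels:   ", kmeans.labels_)
print("centroids:", kmeans.cluster_centers_)
```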
Predictive methods
▪ Use some variables to predict unknown or future values of other variables
▪ Example: Recommender systems, Computational Advertising, etc.
This class overlaps with machine learning, statistics, artificial intelligence, and databases, but puts more stress on:
▪ Scalability (big data and big models)
▪ Algorithms
▪ Computing architectures
▪ Automation for handling large data
Data Mining sits at the intersection of Statistics, Machine Learning, and Database systems.
Usage, Quality, Context, Streaming, Scalability
We will learn to analyze different types of data:
▪ Data is high dimensional (dim. reduction)
▪ Data is a graph (social and web link graph analytics)
▪ Data is infinite/never-ending (streaming data processing)
▪ Data is labeled (supervised) or unlabeled (unsupervised)
We will learn to use different models of computation:
▪ Single machine in-memory
▪ Streams and online algorithms
▪ Map-Reduce, Hadoop, Spark, PyTorch, Hugging Face (a small Spark sketch follows below)
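To give a feel for the Map-Reduce style of computation, here is a minimal PySpark word-count sketch (not part of the course materials; the input file "corpus.txt" is a placeholder).

```python
# Minimal Map-Reduce-style word count in PySpark (assumes pyspark is installed;
# "corpus.txt" is a hypothetical input file).
from pyspark import SparkContext

sc = SparkContext(appName="WordCountSketch")

counts = (
    sc.textFile("corpus.txt")                  # one RDD element per line
      .flatMap(lambda line: line.split())      # map: line -> words
      .map(lambda word: (word, 1))             # map: word -> (word, 1)
      .reduceByKey(lambda a, b: a + b)         # reduce: sum counts per word
)

print(counts.take(10))                         # a few (word, count) pairs
sc.stop()
```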
We will learn to solve real-world problems:
▪ Recommender systems (dimensionality reduction, latent factor models, Collaborative Filtering, Collaborative Reasoning, Large Language Models, Amazon)
▪ Market Basket Analysis (frequent itemset mining, the beer and diaper example, Walmart)
▪ Spam detection (web graph analysis, PageRank, Google)
▪ Social network analysis (Facebook)
We will learn various "tools":
▪ Linear algebra (SVD, LFM, Community analysis; see the SVD sketch below)
▪ Optimization (stochastic gradient descent)
▪ Dynamic programming (frequent itemsets)
▪ Hashing (LSH, Bloom filters)
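As a taste of the linear-algebra toolbox, here is a minimal NumPy sketch (with a made-up ratings matrix) that uses a truncated SVD to build a rank-2 approximation, the basic idea behind latent factor models for recommendation.

```python
# Minimal truncated-SVD sketch with NumPy; the ratings matrix is made up.
import numpy as np

# Rows = users, columns = items, entries = ratings.
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

# Full SVD, then keep only the top-k singular values/vectors.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-2 approximation

print(np.round(R_approx, 2))   # close to R, but expressed via 2 latent factors
```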
▪ High dim. data: Locality sensitive hashing, Clustering, Dimension reduction
▪ Graph data: PageRank, TrustRank, Spam Detection, Community Detection, Graph Representation Learning, Graph Neural Networks
▪ Infinite data: Filtering data streams, Web advertising, Queries on streams
▪ Machine learning: SVM, SVD, SGD, Decision Trees, kNN, Perceptron, Neural Network, Neural Reasoning, Large Language Models
▪ Apps: Recommender systems, Search Engine, Social Networks
Instructor:
Yongfeng Zhang, [email protected]
TA:
Xi Zhu (Section 1), Email: [email protected]
Wujiang Xu (Section 2), Email: [email protected]
Devanshi Patel (Section 3), Email: [email protected]
Sai Samhith Thatikonda (Section 4), Email: [email protected]
Graders:
Satyam Saini (Section 1&2), Email: [email protected]
Sasank Chindirala (Section 3&4), Email: [email protected]
Data Structures (CS 112)
▪ Arrays, lists, sets, maps, queues, linked lists, etc.
Basic probability
▪ Moments, typical distributions, MLE, …
Programming
▪ Your choice, but C++/Java/Python will be very useful
Infrastructure (optional)
▪ Linux / Hadoop / Spark / PyTorch / Jupyter / Hugging Face
Fundamental Algorithms (CS 344, optional but preferred)
▪ Linear Algebra, basic data structures
We provide some background, but the class will be fast-paced.
Course website:
▪ Canvas: Please join our Canvas homepage
▪ Lecture slides, homework, solutions, readings
Textbook: (LRU) Mining of Massive Datasets by J. Leskovec, A. Rajaraman, J. D. Ullman
Free online: http://www.mmds.org
Slides, lecture notes, homework answers, etc. can be found under Files.
Course Website
▪ Homework assignments will also be released and submitted on Canvas.
Be sure to enable your course notifications!
Canvas Chat:
▪ You may post questions and participate in discussions in the Canvas Chat room.
▪ Use Canvas for technical questions and public communication with the course staff.
For private questions, please email me or the TA.
We will post course announcements on Canvas (make sure you check it regularly).
4 homework assignments: 40 points
▪ Theoretical and programming questions (10 points each)
▪ Assignments take lots of time. Start early!
Time and Submission
▪ Homework assignments are posted on Fridays, and you have 2 weeks to finish (due at 11:59pm two Fridays later).
▪ Homework can be submitted on Canvas.
▪ The late policy will be shown on the homework assignment: a 90% discount factor for each day late.
Midterm: 20 points
▪ Wednesday, Oct 30: 12-hour take-home exam (no class on that day)
▪ Detailed instructions will be provided on Canvas.
Final Project: 40 points
▪ Complete as a team of at most 3 students. The team can either choose from a set of provided project topics (to be provided on Canvas later this semester) or propose their own project topic (subject to approval from the instructor), write the project report as a "mini-paper," make slides for the project, and submit the paper, slides, code, and data. (40%)
Bonus: 10 points
▪ We will select 20 teams; each can give a 15-min presentation to the class, and each member of the selected teams will get 10 bonus credits.
▪ The final two weeks are presentation days.
It's going to be fun and hard work. ☺
3 To-do items for you:
▪ Register on Canvas
▪ Begin to think about your team for the class project
Additional details/instructions can be found in the course syllabus (posted on Canvas).