Chap1-Overview of Data Science
Why Data Science
DATA = $$$
Course material management
We will store the course materials in several places.
1. The course material folder in K-drive, a CSE server.
All of the materials will be stored in this folder:
K:\Courses\2024-Fall\STA591-lee1c\22453055\_Materials_
Size of Big Data
One zettabyte (ZB) is roughly equal to the digital information created
by every man, woman, and child on earth ‘tweeting’ continuously
for 100 years.
[Figure: scale of data sizes. Byte: one peanut. Kilobyte: cup of
peanuts. 1960 - 1985: early computing machine. Larger units shown:
exabytes, zettabytes, yottabytes.]
A self-driving car generates 1 gigabyte per second.
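To give a rough sense of these sizes, the sketch below converts between decimal (SI) byte units and works out how much data a source producing 1 gigabyte per second would generate in one day. The helper name `to_bytes` and the choice of decimal rather than binary prefixes are assumptions made for illustration.

```python
# Decimal (SI) byte units; binary units (KiB, MiB, ...) would use powers of 1024.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a quantity in the given decimal unit to bytes."""
    return value * 10 ** (3 * UNITS.index(unit))

# A source generating 1 GB per second, running for a full day.
seconds_per_day = 24 * 60 * 60
bytes_per_day = to_bytes(1, "GB") * seconds_per_day

print(bytes_per_day / to_bytes(1, "TB"))  # terabytes per day: 86.4
# How many such days of output would add up to one zettabyte?
print(to_bytes(1, "ZB") / bytes_per_day)
```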
Data Volume
• Data volumes are increasing due to use of the
following:
– social media (Facebook, Twitter (X), Instagram)
– machines talking to machines
– improvements in the manufacturing process (quality
control)
– automated tracking devices
– streaming data feeds
– gaming
– web browsing, messaging, cloud
Data Velocity
Data Visualization
Visualization is critical in today’s world. Using
charts and graphs to visualize large amounts of
complex data is much more effective in
conveying meaning than spreadsheets and
reports chock-full of numbers and formulas.
Data Value
Data Science
Data Science: A multidisciplinary evolving science
“The New Breed of Professionals” –
Data Scientists
“Data Scientists use powerful computers and sophisticated models to
seek for meaningful patterns and insights in vast troves of data.”
Data Scientist Skills
Mathematics and Statistics, Computer Science, Domain Knowledge,
Communication and Visualization
[Figure: data scientist skills diagram. Additional labels: Machine
Learning; Software; Research; Scores and Insights; Articles and
Best Practices.]
(https://bigdata-madesimple.com/a-deep-dive-into-nosql-a-complete-list-of-nosql-databases/ )
Some Commonly used Data Mining Techniques
There are two major categories of data mining
techniques:
Unsupervised techniques: The set of techniques for exploring the
hidden structure and relations among input variables (features,
independent variables). It is often also called exploratory analysis.
There is no target (or dependent variable) to be modeled.
A well-known story: when people go to the grocery store to buy baby
diapers, what other item are they most likely to purchase?
The answer is beer.
This is just a fact found in the data. Explaining the causes or
factors behind it requires other statistical methods, such as
experimental designs.
10/27/2024
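The diapers-and-beer pattern is usually quantified with association measures. The sketch below computes support, confidence, and lift for the rule "diapers → beer". The baskets are hypothetical and made up purely for illustration, and the function names are assumptions rather than part of any particular tool.

```python
# Hypothetical shopping baskets, made up purely for illustration.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "chips"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline frequency."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"diapers", "beer"}))                   # 0.5
print(round(confidence({"diapers"}, {"beer"}), 3))    # 0.75
print(round(lift({"diapers"}, {"beer"}), 3))          # 1.125
```

A lift above 1 says the two items appear together more often than independence would predict. It describes co-occurrence only and says nothing about cause.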
Unsupervised Techniques:
Cluster Analysis
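As one illustration of cluster analysis, the sketch below implements plain k-means in pure Python. The slide does not name a specific algorithm, so the choice of k-means, the data points, and the starting centers are all assumptions for illustration.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain k-means: alternate assignment and center-update steps."""
    for _ in range(iterations):
        # Assignment step: index of the nearest center for each point.
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers

# Hypothetical 2-D data with two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

No target variable is used: the groups come only from distances among the inputs.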
Unsupervised Techniques:
Principal Component Analysis (PCA)
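A minimal sketch of PCA for two variables follows. It uses the closed-form eigen-decomposition of the 2x2 sample covariance matrix. The function name `pca_2d` and the data are hypothetical, and a real project would normally use a statistics package rather than this hand-rolled version.

```python
import math

def pca_2d(data):
    """Principal components of 2-D data via the 2x2 covariance matrix."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Sample covariance matrix entries [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = (half_trace + root, half_trace - root)
    # Direction of the first component (angle of the leading eigenvector).
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    first_component = (math.cos(angle), math.sin(angle))
    # Share of total variance captured by the first component.
    explained = eigenvalues[0] / (eigenvalues[0] + eigenvalues[1])
    return first_component, eigenvalues, explained

# Hypothetical data in which the two variables move almost together.
data = [(2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1), (6.0, 5.9)]
first_component, eigenvalues, explained = pca_2d(data)
print(first_component, explained)
```

For this data the first component points close to the diagonal and captures nearly all of the variance, so the two inputs could be summarized by one derived variable.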
Supervised techniques –
predictive modeling
Supervised techniques are typically called predictive modeling;
regression models are one example. These techniques must
have at least one target (dependent, response) variable, which
is modeled and predicted from a set of input
(independent) variables.
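As a minimal sketch of predictive modeling, the code below fits a least-squares line to one input and one target, then predicts the target for a new input. The data and the function name `fit_line` are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
def fit_line(x, y):
    """Ordinary least squares for one input: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: the target follows y = 1 + 2x exactly.
inputs = [1, 2, 3, 4, 5]
target = [3, 5, 7, 9, 11]

intercept, slope = fit_line(inputs, target)
print(intercept, slope)          # 1.0 2.0
print(intercept + slope * 6)     # prediction for a new input: 13.0
```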
Supervised techniques are often classified
into two general types, depending on the
type of target:
• Classification: The target is a categorical variable, such as
Yes/No (binary target), Low/Medium/High (ordinal target), or
different brands of shoes (nominal target). The purpose is to
build a model that classifies each case into one of the
categories.
• Regression: The target is on an interval scale, such as median
income or profit. The purpose is to build a predictive model
that predicts the interval target from a set of
inputs.
Common Supervised Techniques
Each of these modeling techniques can be used for
classification or regression modeling, depending on the
type of Target:
• Decision Trees,
• Bagging, Boosting, Random Forest,
• Generalized Regression,
• Logistic Regression (only for classification),
• Neural Networks,
• Discriminant Analysis (only for classification),
• Support Vector Machine,
• Memory Based Reasoning,
• Reinforcement Learning,
• Deep Learning techniques
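To show how one technique can serve both kinds of target, the sketch below implements a small nearest-neighbor predictor, one simple form of memory-based reasoning. It takes a majority vote for a categorical target and an average for an interval target. The data, the function name `knn_predict`, and the `target_type` parameter are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3, target_type="categorical"):
    """Predict from the k training cases closest to `query`."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    neighbors = [train_y[i] for i in nearest]
    if target_type == "interval":
        # Regression: average the neighbors' target values.
        return sum(neighbors) / len(neighbors)
    # Classification: majority vote among the neighbors.
    return Counter(neighbors).most_common(1)[0][0]

# Hypothetical training cases with two inputs each.
train_x = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
bought = ["No", "No", "No", "Yes", "Yes", "Yes"]     # categorical target
income = [30.0, 32.0, 31.0, 80.0, 82.0, 81.0]        # interval target

print(knn_predict(train_x, bought, (2, 2)))                          # No
print(knn_predict(train_x, income, (2, 2), target_type="interval"))  # 31.0
```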
Other Useful Modeling Techniques
• Other regression methods:
Ridge regression,
Least Absolute Shrinkage and
Selection Operator (LASSO),
Least Angle Regression
(LARS), Spline regression, etc.
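For reference, ridge regression and LASSO differ only in the penalty added to the least-squares criterion. The standard formulations are written out below. The notation (n observations, p inputs, tuning parameter λ ≥ 0) is an assumption, since the slide introduces no symbols.

```latex
\begin{align*}
\hat{\beta}^{\text{ridge}} &= \arg\min_{\beta_0,\,\beta}\;
  \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
  + \lambda \sum_{j=1}^{p} \beta_j^{2} \\
\hat{\beta}^{\text{lasso}} &= \arg\min_{\beta_0,\,\beta}\;
  \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
  + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert
\end{align*}
```

The squared penalty shrinks coefficients toward zero, while the absolute-value penalty can set some of them exactly to zero, which is why LASSO also acts as a variable selection method.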
Statistical Learning, Machine Learning, Deep
Learning, Artificial Intelligence
• Statistical learning and machine learning are often used interchangeably.
Both take a data-driven approach to building models and finding
patterns using computer algorithms and statistical techniques. The slight
distinction is that statistical learning puts additional emphasis on
inference, while machine learning focuses on using algorithms for
prediction.
• Deep learning is a specific subset of machine learning that uses structured,
complex, multi-layer neural networks to learn patterns and make
predictions from huge amounts of data.
• AI is a general term for any computing technique that mimics human
intelligence. It starts by applying historical data to make predictions and
find patterns, then continuously acquires new data and adjusts its
rules to reach better and more efficient decisions.
• Machine learning is a subset of AI that focuses on developing computing
algorithms that can learn and adapt, refining themselves to draw
conclusions and make decisions.
When raw data are not numbers
• Text Mining (if data is text)
• Image Mining (if data is image)
• Sentiment analysis (a specific type of text
analysis, for example of real-time social
media data)
Both supervised and unsupervised techniques
can be applied to these non-traditional data,
depending on the project of interest.
The sequence of topics studied in this class
We start with Supervised Techniques – Predictive modeling
(about ten weeks)
Discuss some unique features of modeling with a large number of
variables and a large number of observations
Introduce SAS Enterprise Miner
Data cleaning and exploratory analysis
Data input, Data Visualizations
Variable selections and Variable transformation
Missing data imputation
Supervised modeling techniques
Supervised models when target variable is interval scale:
Multiple linear regression, Regression Tree, Neural network
Supervised models when target variable is categorical:
Logistic regression, Decision Tree, Neural network
Model selection: assessment, comparison, integration
Model Deployment: prediction, model reporting, model packaging
Supervised Techniques: The Basic Steps of
Conducting a Predictive Modeling Project
1. Define the problem
2. Build data mining database
3. Explore data
4. Prepare data for modeling
5. Build model
6. Evaluate model
7. Deploy model and results
8. Take action
9. Measure the results
[Figure: project workflow diagram.]
Data Engineering: Raw Data; Data Extraction; Combining, Processing data
Data Analytics: Data Exploration, manipulation; Variable selection,
Transformation; Model Building, Comparison, Deployment
Data Partition: Training: 50%, Validation: 30%, Test: 20%
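The 50% / 30% / 20% data partition can be sketched as a shuffled split. The function name `partition`, the fixed seed, and the stand-in data are assumptions for illustration. Tools such as SAS Enterprise Miner provide their own partition step.

```python
import random

def partition(rows, fractions=(0.5, 0.3, 0.2), seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains, about 20%
    return train, valid, test

rows = list(range(100))  # stand-in for 100 observations
train, valid, test = partition(rows)
print(len(train), len(valid), len(test))  # 50 30 20
```

The training set is used to fit models, the validation set to compare and tune them, and the test set to estimate how the chosen model performs on data it has not seen.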
Successful data science Project
Four keys to a successful project:
1) A precise formulation of a quantifiable problem you are
trying to solve. A focused problem usually results in the best
payoff.
2) Using the right data and proper analytics techniques.
Select proper internal sources of data, and look for useful
external data. This covers data integration, data cleaning, data
manipulation, variable selection, and transformation, and
often takes 70% to 80% of your time and effort.
3) Applying proper analytics techniques. A practitioner
does not need to know the detailed theories or algorithms
behind them. However, it is critical to know the assumptions
and the pros and cons of the techniques you are applying. The
process is iterative.
4) Taking the proper business action. Without proper action,
the results will not benefit the business.
Data Science Project Management Model
Domain knowledge Expert
•Link to Business Strategy
•Data Knowledge
•Provide Corporate Data
•Prioritize Need
•Establish Requirements
•Interpret Results
•Monitor Results
•Identify Actionable Events
PROJECT TEAM
Analytics Expertise
•Identify data to extract
•Summarize and Analyze
•Create Models
•Quality Control
•Discover and Explore
Information Tech
•Tools & IT Skills
•Project Management
•External Data
•Integrate Data
•Security & Confidentiality
Data Science Process
SAS SEMMA Process
SPSS Modeling Process
5 A model
Assess: Evaluate the previous deployment of a project, if any. Define
the project, and assess the scope of the project and resources.