
Chapter 1:

Overview of Data Science

Why Data Science
DATA = $$$

Course material management
We have several places in which we will store the course materials.
1. The course material folder in K-drive, a CSE server.
All of the materials will be stored in this folder:
K:\Courses\2024-Fall\STA591-lee1c\22453055\_Materials_

Browsing through the folder, you will see several folders and files:


• Course Material
• Review ANOVA …
• S591
• A few files, including syllabus.
Inside the Course Material folder, you will see the course materials,
which I will continue to update weekly.
Course Material Management
2. Inside the folder S591, you see two folders:
• S591_Data: all data sets from the textbook, organized based
on the chapter # of the textbook. In this folder, you will also see
a folder called Class Data, containing some additional data that
may be used in class but is not from the textbook.
• S591_prj
After you have chosen your computer station, copy the entire
folder and paste it into your U-drive.
We will access the data and SAS/EM projects from the U-drive
folder. The S591_prj folder is still empty; it will be used to save all of
our SAS/EM projects later in the class.
You should also create a folder on your U-drive later to save your
SAS/EM projects for your homework problems, e.g., an S591_Hw folder
inside the S591 folder.
The Growth of Global Data Worldwide
https://explodingtopics.com/blog/data-generated-per-day

Global data generated in 2023 was about 120 ZB (one ZB = 10^12 GB),
expected to increase to 181 ZB in 2025 and 660 ZB by 2030.
Size of Big Data
One ZB is roughly equal to the digital information created by
every man, woman, and child on earth ‘tweeting’ continuously
for 100 years.
• Byte: one peanut (1960-1985: early computing machines)
• Kilobyte: a cup of peanuts
• Megabyte: 8 bags of peanuts (1985-1992: desktop era)
• Gigabyte: 3 semi trucks of peanuts
• Terabyte: 2 container ships of peanuts (1992-2005: Internet era)
• Petabyte: peanuts blanket Manhattan
• Exabyte: peanuts blanket the West Coast states (2005-present: Big Data era)
• Zettabyte: peanuts fill the Pacific Ocean
• Yottabyte: an earth-sized ball of peanuts (future)
Big Data and Analytics
The first use of the term “Big Data” in an academic paper:
Visually Exploring Gigabyte Datasets in Realtime (ACM).
The term was coined by Roger Mougalas in 2005.

• Facebook’s databases ingested 4,000 terabytes of new data
per day in 2023.
• A self-driving car generates 1 gigabyte per second.
• 90% of the data in the world today has been created in
the last two years alone.
• 80% of data captured today is unstructured.
Characteristics of Big Data

[Diagram: the “V” characteristics of big data]
• Volume: data measured in exabytes, zettabytes, and yottabytes.
• Velocity: the speed at which data arrives; a self-driving car
generates 1 gigabyte per second.
• Variety: different data sources: internal, external, structured,
unstructured.
Data Volume
• Data volumes are increasing due to use of the
following:
– social media (Facebook, Twitter (X), Instagram)
– machines talking to machines
– improvements in the manufacturing process (quality
control)
– automated tracking devices
– streaming data feeds
– gaming
– web browsing, messaging, cloud
Data Velocity

Velocity represents the speed at which data is processed and
becomes accessible.
– business processes that are more automated
– mergers and acquisitions
– more use of social media
– more use of self-service applications
– integration of business applications
Data Variety
Variety describes one of the biggest challenges of big data.
Organizing the data in a meaningful way is no simple task
when the data itself changes rapidly.
– structured data
– unstructured data
• business applications
• unstructured text documents
(articles, blogs, and so on)
• emails
• digital images
• video and audio clips
– streaming data
• stock ticker data
• RFID tag data
• sensor data
Data Variability
Variability is different from variety. A coffee shop may offer
six different blends of coffee, but if you get the same blend
every day and it tastes different every day, that is variability.
The same is true of data. If the meaning constantly changes,
it can significantly impact your data homogenization.
– The flow of data changes over time (seasonality, peak response,
social media trends, and so on).
– Data values change over time. How much history do you keep?
– Data values are different across data sources.
– Data is stored in different formats.
– Data standards change across time.
What was “valid” five years ago
might not be “valid” today.
Data Veracity

Veracity ensures the data is accurate, which requires processes to
keep insufficient data from accumulating in your systems.
- Missing data
- Different names for the same variable from different data sources
- Different cultures may use different approaches to present the
same data
Data Visualization
Visualization is critical in today’s world. Using
charts and graphs to visualize large amounts of
complex data is much more effective in
conveying meaning than spreadsheets and
reports chock-full of numbers and formulas.

Data Value

Just collecting a huge amount of data is of no use unless you can
extract value from it. Value can be defined in many different ways,
depending on the project and its purpose. In the end, the business
world asks for Return on Investment (ROI).
Data Science

“The ability to take data—to be able to understand it, to process it,
to extract value from it, to visualize it, to communicate it—that’s
going to be a hugely important skill.” (Hal Varian, Google’s Chief
Economist, NYT, 2009)

“Data Science refers to an emerging area of work concerned with the
collection, preparation, analysis, visualization, management and
preservation of large collections of information using advanced
computational technology, modern predictive modeling and/or
optimization techniques.”
Data Science: A multidisciplinary evolving science

“The New Breed of Professionals” –
Data Scientists
“Data scientists use powerful computers and sophisticated models to
seek meaningful patterns and insights in vast troves of data.”
Data Scientist Skills

[Diagram: data scientist skill set]
A data scientist combines Computer Science, Mathematics and Statistics,
and Domain Knowledge, tied together by Machine Learning, Software,
Research, and Communication and Visualization, and delivers scores and
insights, reports, papers and techniques, and articles and best practices.
Data Scientist Skills

Mathematics and Statistics:
• Design of Experiments
• Descriptive Statistics
• Statistical Inference
• Supervised Modeling (Regression, Decision Tree, Forest, Gradient
Boosting, Neural Networks, Support Vector Machine, Factorization
Machine, Ensemble Models, Two-Stage Models)
• Unsupervised Modeling (K-Means, Self-Organizing Maps, Variable
Clustering, Principal Components, Association Rules, Sequence,
Association, Path Analysis, Link Analysis)
• Optimization
• Forecasting
• Econometrics
• Text Mining

Computer Science:
• Programming Language
• Statistical Package
• Scripting Language
• Mathematical Package
• Machine Learning Package
• Deep Learning Package
• Data Cleansing
• Data Preparation
• Visualization Tools
• Databases (SQL, NoSQL, Graph)
• Parallel Database and Parallel Query
• Distributed Computing
• Hadoop and Hive
• MapReduce
• Cloud Computing
• Graphical Processing

Domain Knowledge:
• Business Knowledge
• Data Curiosity
• Analytical Approach
• Problem Solver
• Proactive
• Strategic
• Creative
• Innovative
• Collaborative

Communication and Visualization:
• Engagement with Business and Management Levels
• Translation of Insights into Business Decisions and Actions
• Visual Presentation Expertise
• Data Visualization Tools Skills
• Storytelling Capabilities
Very Brief History of Data Science
• John Tukey (1962): “The Future of Data Analysis” argued that
statistics must “take on the characteristics of science rather than
those of mathematics”, and that “data analysis is intrinsically an
empirical science”.
• Jeff Wu (1997) called for Statistics to be renamed Data Science.
• William Cleveland (2001) published a monograph, “Data Science:
An Action Plan for Expanding the Technical Areas of Statistics”.
• 2002: The International Council for Science’s Committee on Data for
Science and Technology started the Data Science Journal.
• 2015: NSF-sponsored Workshop on Data Science Education.
• 2017: The American Statistical Association: Guidelines for
Undergraduate Data Science Curriculum.
• 2018: The National Academies of Sciences, Engineering, and Medicine:
Data Science for Undergraduates: Opportunities and Options.
The Evolvement of Knowledge Discovery
The evolvement of tools for knowledge discovery:

• Artificial Intelligence and applications: AI is a complete system
relying on many complex subsystems of machine and statistical learning
techniques with the goal of mimicking human intelligence. It has the
capability of learning, reasoning, and self-correction over time to
improve the designated functions.
• Analytics/Optimization/Machine Learning: What if (scenarios).
Use simulation to simulate and optimize ‘what if’ scenarios under
different possible future conditions.
• Analytics/Visualization/Machine Learning: What is (insight).
Use all possible data sources to look for patterns or
unexpected/unknown trends, signals, etc.
• OLAP: Extracting information to check hypotheses of interest.
• Data Retrieval: Extracting interesting information from a database
based on pre-determined interest.
• Query/Reporting: Simple tables and summaries.
A Few Words about Databases
Many data analysts and practitioners tend to think of datasets as big
rectangular arrays:
Row = record, case; Column = variable, field.
Databases can be organized much more efficiently than a large flat file.
Different databases exist for different types of tasks:
Operational vs. data warehouse
Flat file vs. relational structure
Operational Database
• Operational databases consist of data from day-to-day operations,
such as order entry, transactions, and individual accounts.
• The data used in data mining is usually collected for different
reasons, from a business operations point of view.
• Data records are usually short, atomic, isolated transactions.
• Contain only current data.
• Designed for fast throughput of transactions.
• Must be consistent and reliable.
• Different operational databases within the same organization
may not be able to talk to each other properly.
Data Warehouse
• A data warehouse is a system combining data from many different
sources into one platform. It often contains a large # of variables
and cases and provides data for a wide variety of projects. Each
project often needs only a small # of variables directly related to
the purpose of the project.
• A data warehouse is a “subject-oriented, integrated, time-varying,
non-volatile collection of data that is used primarily in
organizational decision making.”
• Analytics techniques can be built on the data warehouse platform to
perform automated, real-time data science analytics projects, such
as predictive modeling.
Data Warehouse
• Is designed for decision support.
• Often consolidates data from many sources.
• Data covers a longer time period than an operational database.
• Much larger than a typical operational database.
• Allows complex and ad-hoc queries that access many records
across the warehouse.
• Data in a data warehouse is consistent.
• Data in a data warehouse can be separated and combined by
means of possible measures in the business.
• The quality of a data warehouse is a driver of business
reengineering.
Data Warehouse vs. Operational Database
• An operational database focuses on individual account transactions,
while a data warehouse consolidates data from various operational
databases as well as some data sources from outside the organization.
• Users of an operational database almost always deal with one account
at a time for business operational purposes, while a data warehouse
often consists of a long-term sequence of data records.
• An operational database is driven by consistency and reliability,
while a data warehouse is driven by analytics that emphasizes
automated data updates and knowledge discovery for decision making.
SQL-based Relational Database
• Within the database, fields are designed based on business
operations as a relational database, not for analytic processing.
• Individual operational databases may not be able to talk to
each other, especially when the sampling units are very different
from database to database.
• For small or moderate amounts of data, SQL databases continue to
be popular in many businesses and industries.
• It is not a good form for large data mining projects, since
searching across many tables and looking for relationships
among many variables will be very slow.
NoSQL Cloud Database
• Due to the extremely large size of data and the diverse types of data
structure, SQL databases are no longer efficient for managing many
unstructured databases.
• Many different NoSQL database systems have been created in the recent
decade. Some common NoSQL database systems are:
• Hadoop/HBase
• Cassandra
• Hypertable
• Accumulo
• Amazon SimpleDB
• Cloud Data
• HPCC (High-Performance Computing Cluster)
• Azure Table Storage
• Oracle NoSQL Database
• And many others.
Some of them are for more general purposes; others are designed to
manage specific types of unstructured data.

(https://bigdata-madesimple.com/a-deep-dive-into-nosql-a-complete-list-of-nosql-databases/ )
Some Commonly used Data Mining Techniques
There are two major categories of data mining
techniques:
Unsupervised techniques: the set of techniques for exploring the
hidden structure and relations of the input variables (features,
independent variables). They are often also called exploratory
analysis. There is no target (or dependent) variable to be modeled.

Common unsupervised techniques:
• Clustering
• Association, Sequence
• Visualization techniques
• Principal Component Analysis
• Path Analysis, Kohonen Mapping
• Factor Analysis
• Multidimensional Scaling
Unsupervised Techniques:
Association and Link Analysis,
Market Basket Analysis
Association and link analysis (market basket analysis) looks for
interesting relations between variables. For example,
– When people buy item A, what is the chance that they will also
buy item B?
– When people buy item A, what is the chance that they will also
buy item B on their next shopping trip?

The classic story: when people go to the grocery store to buy baby
diapers, what is the most likely item they will also purchase?
The answer is beer.
This is just a pattern inside the data; uncovering the causes or
factors behind it requires other statistical methods, such as
experimental designs. A small illustrative sketch follows below.
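
As a minimal sketch (in Python, with made-up items and transactions; not
part of the course materials), the support and confidence behind a rule
A => B can be computed directly from a list of transactions:

# Toy market-basket data: each transaction is a set of purchased items.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Estimated P(consequent is bought | antecedent is bought).
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"diapers", "beer"}, transactions))       # 0.5
print(confidence({"diapers"}, {"beer"}, transactions))  # about 0.67

Association-rule tools (such as the Association node in SAS/EM) report
these same measures, plus lift, over millions of transactions.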
Unsupervised Techniques:
Cluster Analysis

Clustering techniques can be applied to:
• Group n cases into k groups of cases (called clusters) using a
set of input variables.
• Group p input variables into m similar groups of input variables,
as a dimension reduction technique: grouping p variables into m new
‘combined’ variables, where p >> m. Similar variables are grouped
into a cluster.
A minimal illustration of clustering cases follows below.
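
As an illustration outside the course software (the class uses SAS
Enterprise Miner for clustering), a minimal k-means sketch in Python
with scikit-learn on simulated data:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two simulated clouds of cases measured on two input variables.
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # approximate centers of the two clouds
print(kmeans.labels_[:5])       # cluster assignment of the first 5 cases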
Unsupervised Techniques:
Principal Component Analysis (PCA)

PCA is a dimension reduction technique that converts a set of m
correlated variables into k linearly uncorrelated new variables,
where m >> k. An orthogonal transformation is applied to create the
k orthogonal principal components, where the first principal
component explains the most variation in the original data, and so on.
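
A minimal PCA sketch in Python with scikit-learn (assumed here for
illustration only), turning four correlated inputs into orthogonal
components:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
factor = rng.normal(size=(200, 1))
# Four correlated variables driven by one common underlying factor.
X = np.hstack([factor + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)])

X_std = StandardScaler().fit_transform(X)  # PCA is sensitive to scale
pca = PCA(n_components=2).fit(X_std)
print(pca.explained_variance_ratio_)       # first component explains most variation
scores = pca.transform(X_std)              # the new, uncorrelated variables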
Supervised techniques –
predictive modeling
Supervised techniques are typically called predictive modeling, as in
regression models and others. These techniques must have at least one
target (dependent, response) variable, which will be modeled and
predicted based on a set of input (independent) variables.
The supervised techniques are often classified into two general types,
depending on the nature of the target:
• Classification: The target is a categorical variable, such as
Yes/No (binary target), Low/Medium/High (ordinal target), or
different brands of shoes (nominal target). The purpose is to build
a model to classify each case into one of the categories.
• Regression: The target is interval scale, such as median income,
profit, etc. The purpose is to build predictive models to predict
the interval target using a set of inputs.
A minimal sketch of the two settings follows below.
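
As an illustration in Python with scikit-learn (an assumption for this
sketch; in class these models are built in SAS/EM), using simulated
inputs:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # three input variables

# Classification: a binary (Yes/No) target.
y_class = (X[:, 0] + rng.normal(size=100) > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)
print(clf.predict_proba(X[:2]))  # predicted class probabilities

# Regression: an interval-scale target.
y_interval = 2 * X[:, 1] + rng.normal(size=100)
reg = LinearRegression().fit(X, y_interval)
print(reg.predict(X[:2]))        # numeric predictions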
Common Supervised Techniques
Each of these modeling techniques can be used for
classification or regression modeling, depending on the
type of target:
• Decision Trees
• Bagging, Boosting, Random Forest
• Generalized Regression
• Logistic Regression (only for classification)
• Neural Networks
• Discriminant Analysis (only for classification)
• Support Vector Machines
• Memory-Based Reasoning
• Reinforcement Learning
• Deep Learning techniques
Other Useful modeling Techniques
• Other regression methods: ridge regression, Least Absolute
Shrinkage and Selection Operator (LASSO), Least Angle Regression
(LARS), spline regression, etc. (a small sketch follows below)
• Time Series Analysis: if observations are time dependent.
• Spatial Analysis: if the observations are space dependent.
• Survival Analysis: if the target is a survival time.
• Multivariate modeling techniques: when there is more than one
target to be modeled.
• Spatial modeling techniques: methods that incorporate spatial
dependence.
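
For a flavor of the shrinkage methods listed above, a minimal
Python/scikit-learn sketch on simulated data (an illustration, not from
the course materials):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)  # only two inputs matter

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # can set coefficients exactly to zero
print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))      # irrelevant inputs typically end up at 0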
Deep Learning Techniques
Deep learning is a technique using neural networks as
the basis to learn from a very large amount of data.
Some common deep learning techniques are:
– Convolutional Neural Networks (CNNs)
– Long Short-Term Memory (LSTM)
– Generative Adversarial Networks (GANs)
– Transformer Networks
– Autoencoders and different variations
– Deep Belief Networks (DBNs)
– Deep Q-Networks (DQNs)
– Graph Neural Networks (GNNs)
Some of these deep learning techniques will be discussed in STA 691.

Statistical Learning, Machine Learning, Deep
Learning, Artificial Intelligence
• Statistical learning and machine learning are often used interchangeably.
They both involve using a data-driven approach to build models and find
patterns using computer algorithms and statistical techniques. The slight
distinction is that statistical learning puts additional emphasis on
inference, while machine learning focuses on using algorithms for
prediction.
• Deep learning is a specific subset of machine learning that uses
complex multi-layer neural networks to learn patterns and make
predictions from huge amounts of data.
• AI is a general term for any computing technique that mimics human
intelligence. It starts with applying historical data to make predictions
and find patterns, then, by continuously acquiring new data and adjusting
rules, uses the information to reach better and more efficient decisions.
• Machine learning is a subset of AI focused on developing computing
algorithms that can learn and adapt, refining the algorithms to draw
conclusions and make decisions.
When raw data are not numbers
• Text mining (if the data is text)
• Image mining (if the data is images)
• Sentiment analysis (a specific analysis for
real-time social media data, and so on)
Both supervised and unsupervised techniques
can be applied to these non-traditional data,
depending on the project of interest.
The sequence of topics studied in this class
We start with supervised techniques – predictive modeling
(about ten weeks):
 Discuss some unique features of modeling involving a large # of
variables and a large # of observations
 Introduce SAS Enterprise Miner
 Data cleaning and exploratory analysis
 Data input, data visualizations
 Variable selection and variable transformation
 Missing data imputation
 Supervised modeling techniques
 Supervised models when the target variable is interval scale:
 Multiple linear regression, regression tree, neural network
 Supervised models when the target variable is categorical:
 Logistic regression, decision tree, neural network
 Model selection: assessment, comparison, integration
 Model deployment: prediction, model reporting, model packaging
The sequence of topics studied in this class

Week 10 to Week 15: Unsupervised techniques
 Dimension reduction techniques:
 Principal Component Analysis
 Partial Least Squares techniques
 Variable cluster analysis
 Cluster analysis - clustering cases
 K-nearest neighbor (KNN) techniques
 Hierarchical clustering techniques
Supervised Techniques: The Basic Steps of
Conducting a Predictive Modeling Project
1. Define the problem
2. Build the data mining database
3. Explore the data
4. Prepare the data for modeling
5. Build the model
6. Evaluate the model
7. Deploy the model and results
8. Take action
9. Measure the results

It is extremely important to keep in mind that the process is not
sequential. It is iterative.
Modeling Methodology

[Diagram: modeling methodology, from raw data (data engineering) up to
model deployment (data analytics)]
Data engineering: raw data; combining and processing data; data
extraction; data partition (training 50%, validation 30%, test 20%).
Data analytics: data exploration and manipulation; variable selection
and transformation; model building, comparison, and deployment.
A sketch of the data partition step follows below.
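
A minimal sketch of the 50/30/20 partition in Python with scikit-learn
(assumed here; in class the partition is done inside SAS/EM), applied in
two stages to made-up data:

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Stage 1: split off 50% of the cases for training.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, random_state=0)
# Stage 2: split the rest into validation (30% overall) and test (20% overall).
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, train_size=0.6, random_state=0)

print(len(X_train), len(X_valid), len(X_test))  # 500 300 200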
Successful data science Project
Four keys to a successful project:
1) A precise formulation of a quantifiable problem you are trying to
solve. A focused problem usually results in the best payoff.
2) Using the right data and proper analytics techniques. Select proper
internal sources of data, and look for useful external data. Data
integration, data cleaning, data manipulation, variable selection, and
transformation often take 70% to 80% of your time and effort.
3) Applying proper analytics techniques. A practitioner does not need
to know the detailed theories or algorithms behind them. However, it
is critical to know the assumptions and the pros and cons of the
techniques you are applying. The process is iterative.
4) Taking the proper business action. Without proper action, the
results will not benefit the business.
Data Science Project Management Model

[Diagram: the project team combines three areas of expertise]
• Domain knowledge expert: link to business strategy; data knowledge;
provide corporate data; prioritize needs; establish requirements;
interpret results; monitor results; identify actionable events.
• Analytics expertise: identify data to extract; summarize and analyze;
create models; quality control; discover and explore.
• Information technology: tools & IT skills; project management;
external data; integrate data; security & confidentiality.
Data Science Process

SAS SEMMA Process

We will apply this methodology (Sample, Explore, Modify, Model, Assess)
for conducting data science projects in this class.
Cross-Industry Standard Process for Data Mining
(CRISP-DM)

[Diagram: the CRISP-DM cycle of business understanding, data
understanding, data preparation, modeling, evaluation, and deployment]
SPSS Modeling Process (the 5 A model)

• Assess: Evaluate the previous deployment of a project, if any.
Define the project, and assess its scope and resources.
• Access: Plan the project flow; acquire, clean, and manipulate data.
• Analyze: Choose appropriate methods to analyze the data.
• Act: Deploy the model and take action to apply the results to
business scenarios. Evaluate the return on investment.
• Automate: Automate the use of the model in real time and
interactively. Allows for what-if scenarios.
