9/22/2020
INTRODUCTION TO DATA MINING
UNIT # 2
FALL 2020 Sajjad Haider 1
ACKNOWLEDGEMENT
A few slides in this presentation are taken from material provided by
Han and Kamber (Data Mining: Concepts and Techniques) and
Tan, Steinbach and Kumar (Introduction to Data Mining)
RECENT TRENDS
DATA SCIENCE AND THE ART OF PERSUASION (HBR 2019)
Despite the success stories, many companies aren’t getting the value they
could from data science.
Four of the top seven “barriers faced at work”:
lack of management/financial support
lack of clear questions to answer
results not used by decision makers and
explaining data science to others
DO YOUR DATA SCIENTISTS KNOW THE ‘WHY’ BEHIND
THEIR WORK? (HBR 2019)
Data science, broadly defined, has been around for a long time. But the failure
rates of big data projects in general and AI projects in particular remain
disturbingly high.
The following were found to be the two most important reasons:
Many data scientists are much more interested in pursuing their crafts — namely, finding
interesting nuggets buried in data — than they are in solving business problems.
From the company’s perspective, the talent is rare and protecting data scientists from the
chaos of everyday work just makes sense. But doing so increases the distance between
data scientists and the company’s most important problems and opportunities.
WHY DATA SCIENCE TEAMS NEED GENERALISTS, NOT
SPECIALISTS (HBR 2019)
A division of labor in data science projects that resembles a pin factory
assembly line, where “One [person] draws out the wire, another straights
it, a third cuts it, a fourth points it, a fifth grinds it,” doesn’t work well.
Algorithmic products and services such as recommendation systems, style
preference classification, seasonal trend detection, and more can’t be
designed up-front.
With data science, you learn as you go, not before you go.
WHAT’S THE BEST APPROACH TO DATA ANALYTICS? (HBR
2020)
Organizations’ approaches generally fall into one of five scenarios:
We’re here to help — do you have any problems to solve? (typically fail)
Boil the ocean. (typically fail)
Let a thousand flowers bloom. (work partially)
Three years and $10 million from now, it’s going to be great. (work partially)
Start with high-leverage business problems. (best approach)
POPULAR APPLICATIONS OF DATA MINING
ANALYTICS
Two major types are: Descriptive and Predictive Analytics
Descriptive Analytics (Unsupervised Machine Learning)
what happened and why did it happen
Referred to as “unsupervised learning” in machine learning
Predictive Analytics (Supervised Machine Learning)
what will happen
Referred to as “supervised learning” in machine learning
POPULAR APPLICATIONS OF DATA MINING
Clustering: Grouping items by similarity
Association rules: Discovering relationships between items
Regression: Determining the relationship between the outcome and the input variables
Text analytics: Analyzing text data to find trending terms, sentiment analysis, document classification, etc.
Classification: Assigning a label/class to records
CLASSIFICATION EXAMPLE
Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set -> Learn Classifier -> Model; the Model is then applied to the Test Set.
CLASSIFICATION: DEFINITION
Given a collection of records (training set):
Each record contains a set of attributes, one of which is the class.
Find a model for the class attribute as a function of the values of the other
attributes.
Goal: previously unseen records should be assigned a class as accurately
as possible.
A test set is used to determine the accuracy of the model. Usually, the given data
set is divided into training and test sets, with training set used to build the model
and test set used to validate it.
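The train/test idea can be sketched in a few lines of Python. This is only an illustration, not the method used in the course: it learns a one-attribute majority rule (often called OneR) from the labeled example records, using only the Refund and Marital Status columns, and the split into the first seven and last three records is an arbitrary assumption. All function and variable names are hypothetical.

```python
from collections import Counter, defaultdict

# Labeled records from the example table: (refund, marital_status, cheat).
RECORDS = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

def learn_one_rule(train, attr_index):
    """Map each value of one attribute to the majority class seen in training."""
    counts = defaultdict(Counter)
    for record in train:
        counts[record[attr_index]][record[-1]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def accuracy(rule, attr_index, data, default):
    """Fraction of records whose predicted class equals the true class."""
    correct = sum(rule.get(r[attr_index], default) == r[-1] for r in data)
    return correct / len(data)

# Assumed split: first 7 records build the model, last 3 validate it.
train, test = RECORDS[:7], RECORDS[7:]

# Fallback class for attribute values never seen in training.
default = Counter(r[-1] for r in train).most_common(1)[0][0]

# Pick the attribute whose rule fits the training set best.
best_attr = max(
    range(2),
    key=lambda i: accuracy(learn_one_rule(train, i), i, train, default),
)
model = learn_one_rule(train, best_attr)
test_accuracy = accuracy(model, best_attr, test, default)
print("attribute index:", best_attr, "model:", model, "test accuracy:", test_accuracy)
```

With so few records the test accuracy is not meaningful in itself; the point is that the model is built from one subset and scored on records it has not seen.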
CLASSIFICATION: APPLICATION 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new
cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided otherwise. This {buy,
don’t buy} decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction related information
about all such customers.
Use this information as input attributes to learn a classifier model.
CLASSIFICATION: APPLICATION 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and information about the account holder as attributes.
When the customer buys, what the customer buys, how often the customer pays on time, etc.
Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.
CLASSIFICATION: APPLICATION 3
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to be lost to a competitor.
Approach:
Use detailed records of transactions with each past and present customer to find
attributes.
How often the customer calls, where they call, what time of day they call most, their financial
status, marital status, etc.
Label the customers as loyal or disloyal.
Find a model for loyalty.
REGRESSION
Predict the value of a given continuous-valued variable based on the values
of other variables, assuming a linear or nonlinear model of dependency.
Applications:
Predicting sales amounts of a new product based on advertising expenditure.
Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
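As a minimal sketch of the linear case, the snippet below fits a straight line by ordinary least squares in plain Python. The advertising and sales figures are made-up illustration values, not data from the slides, and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales amount.
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 5.0 + intercept  # prediction for an unseen expenditure
print(slope, intercept, predicted_sales)
```

A nonlinear dependency would need a different model form; this sketch covers only the single-variable linear case.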
CLUSTERING DEFINITION
Given a set of data points, each having a set of attributes, and a similarity
measure among them, find clusters such that
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
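One way to make the definition concrete is a small k-means style loop built on Euclidean distance. The slides do not name a specific algorithm here, so k-means is an assumption chosen for illustration; the points and the initial centroids are made up, and the function names are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, centroids, iterations=10):
    """Repeatedly assign points to the nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical 2-D data with two visually separate groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters, centroids = kmeans(POINTS, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(clusters)
```

The result depends on the starting centroids and on the number of clusters, both of which are fixed by hand in this sketch.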
ILLUSTRATING CLUSTERING
Intracluster distances are minimized; intercluster distances are maximized.
[Figure: Euclidean distance based clustering in 3-D space.]
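The contrast between intracluster and intercluster distances can be checked numerically. The sketch below uses two made-up clusters of 3-D points and compares the average pairwise distance inside a cluster with the average distance across clusters; the function names and the data are hypothetical.

```python
import math
from itertools import combinations, product

def mean_intra_cluster_distance(cluster):
    """Average Euclidean distance over all pairs of points within one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_cluster_distance(cluster_a, cluster_b):
    """Average Euclidean distance over all pairs drawn from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

# Hypothetical 3-D clusters.
cluster_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
cluster_b = [(10.0, 10.0, 10.0), (11.0, 10.0, 10.0), (10.0, 11.0, 10.0)]

intra_a = mean_intra_cluster_distance(cluster_a)
inter_ab = mean_inter_cluster_distance(cluster_a, cluster_b)
print(intra_a, inter_ab)
```

For a good clustering the first number should be clearly smaller than the second. `math.dist` requires Python 3.8 or later.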
CLUSTERING: APPLICATION 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset may
conceivably be selected as a market target to be reached with a distinct
marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle related
information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in same cluster vs.
those from different clusters.
CLUSTERING: APPLICATION 2
Document Clustering:
Goal: To find groups of documents that are similar to each other based on the
important terms appearing in them.
Approach: Identify frequently occurring terms in each document. Form a
similarity measure based on the frequencies of different terms. Use it to cluster.
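A common choice for such a frequency-based similarity measure is cosine similarity over term counts. The slide does not prescribe a specific measure, so this is an assumption for illustration; the three sample documents are made up, tokenization is a plain whitespace split, and the function names are hypothetical.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents.
doc_a = term_frequencies("data mining finds patterns in data")
doc_b = term_frequencies("mining data for patterns")
doc_c = term_frequencies("the cat sat on the mat")

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Documents sharing many terms score close to 1 and documents with no terms in common score 0, so the pairwise scores can feed any clustering procedure that accepts a similarity measure.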
ASSOCIATION RULE DISCOVERY: DEFINITION
Given a set of records, each of which contains some number of items from a given collection:
Produce dependency rules which will predict the occurrence of an item based on occurrences
of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Most popular Application: Recommender Systems
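The strength of such rules is usually summarized by support (the fraction of transactions containing all items in the rule) and confidence (the fraction of transactions containing the antecedent that also contain the consequent). These two measures are standard but are not defined on this slide, so treat them as an added assumption. The sketch below computes both for the two discovered rules, using the five transactions from the example; the function names are hypothetical.

```python
# The five transactions from the example table.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    antecedent_support = support(antecedent, transactions)
    if antecedent_support == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = support(set(antecedent) | set(consequent), transactions)
    return both / antecedent_support

rule_1 = (support({"Milk", "Coke"}, TRANSACTIONS),
          confidence({"Milk"}, {"Coke"}, TRANSACTIONS))
rule_2 = (support({"Diaper", "Milk", "Beer"}, TRANSACTIONS),
          confidence({"Diaper", "Milk"}, {"Beer"}, TRANSACTIONS))
print("{Milk} --> {Coke}:", rule_1)
print("{Diaper, Milk} --> {Beer}:", rule_2)
```

Rule discovery algorithms typically keep only rules whose support and confidence exceed user-chosen thresholds; this sketch only scores rules that are already given.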
ASSOCIATION RULE DISCOVERY: APPLICATION 1
Marketing and Sales Promotion:
Let the rule discovered be
{Bagels, … } --> {Potato Chips}
Potato Chips as consequent => Can be used to determine what should be done
to boost its sales.
Bagels in the antecedent => Can be used to see which products would be affected
if the store discontinues selling bagels.
Bagels in antecedent and Potato chips in consequent => Can be used to see what
products should be sold with Bagels to promote sale of Potato chips!
ASSOCIATION RULE DISCOVERY: APPLICATION 2
Supermarket shelf management.
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with barcode scanners to find
dependencies among items.
A classic rule --
If a customer buys diapers and milk, then they are very likely to buy beer.
So, don’t be surprised if you find six-packs stacked next to diapers!
ASSOCIATION RULE DISCOVERY: APPLICATION 3
Inventory Management:
Goal: A consumer appliance repair company wants to anticipate the nature
of repairs on its consumer products and keep the service vehicles equipped
with the right parts to reduce the number of visits to consumer households.
Approach: Process the data on tools and parts required in previous repairs
at different consumer locations and discover the co-occurrence patterns.
TEXT ANALYTICS
Text analytics is the process of analyzing unstructured text, extracting
relevant information, and transforming it into useful insight.
Applications:
Sentiment analysis
Tag cloud
Topic modeling
Machine translation
DATA MINING VS ALLIED FIELDS
STATISTICS VS. MACHINE LEARNING
Data mining has its origins in various disciplines, of which the two most
important are statistics and machine learning.
Statistics has its roots in mathematics, and therefore, there has been an
emphasis on mathematical rigor, a desire to establish that something is sensible
on theoretical grounds before testing it in practice.
In contrast, the machine learning community has its origin very much in
computer practice. This has led to a practical orientation, a willingness to test
something out to see how well it performs, without waiting for a formal proof
of effectiveness.
STATISTICS VS. MACHINE LEARNING (CONT’D)
Modern statistics is entirely driven by the notion of a model. This is a
postulated structure, or an approximation to a structure, which could
have led to the data.
In place of the statistical emphasis on models, machine learning tends to
emphasize algorithms.
DATA MINING AND MACHINE LEARNING
Data mining as a process includes data understanding, data preparation
and data modeling; while machine learning takes the processed data as
input and performs predictions by applying algorithms.
Thus, data mining requires the involvement of human beings to clean and
prepare the data and to understand the patterns.
In machine learning, by contrast, human effort is involved only in defining an
algorithm, after which the algorithm takes over operations.
DM AND ML (CONT’D)
Nevertheless, it is worth pointing out some of the differences to give
perspective.
Speaking generally, because Machine Learning is concerned with many
types of performance improvement, it includes subfields such as robotics
and computer vision that are not part of Data Mining.
Machine Learning also is concerned with issues of agency and cognition—
how will an intelligent agent use learned knowledge to reason and act in
its environment—which are not concerns of Data Mining.
BI VS. DATA ANALYTICS
Past: Business Intelligence (BI) focuses on using a
consistent set of metrics to measure past
performance and inform business planning.
Future: Data Analytics refers to a combination of analytical
and machine learning techniques used for drawing
inferences and insight out of data.
BI ANSWERS FOR FRAUD DETECTION
How many cases were investigated last month?
What was the success rate in collecting debts?
How much revenue was recovered through collections?
What was the close rate of cases in the past month? Past quarter? Past
year?
For debts that were closed out, how many days did it take on average to
close them out?
PREDICTIVE ANALYTICS FOR FRAUD DETECTION
What is the likelihood that the transaction is fraudulent?
What is the likelihood the invoice is fraudulent or warrants further
investigation?
Which characteristics of the transaction are most related to or most
predictive of fraud?
What is the expected amount of fraud?
Historically, which demographic and historic purchase patterns were most
related to fraud?
BI ANSWERS FOR CUSTOMER ANALYTICS
Which regions/states/ZIPs had the highest response rates?
Which products had the highest/lowest click-through rates?
How many repeat purchasers were there last month?
How many new subscriptions to the loyalty program were there?
How many visits to the store/website did a person have?
PREDICTIVE ANALYTICS FOR CUSTOMER ANALYTICS
What is the likelihood an e-mail will be opened?
What is the likelihood a customer will click through a link in an e-mail?
Which product is a customer more likely to purchase if given the choice?
How many e-mails should the customer receive to maximize the
likelihood of a purchase?
What is the likelihood that a product will sell out if it is put on sale?