Introduction To Data Mining Unit 2

This document provides an introduction to data mining and summarizes several key articles on data science trends. It discusses common challenges with data science projects, including lack of business focus and disconnects between data scientists and decision-makers. Additionally, it argues that data science teams are more effective when composed of generalists rather than specialists. The document also outlines popular applications of data mining like classification, clustering, association rule mining, and analytics. Classification is explained in more detail with examples in direct marketing, fraud detection, and customer churn prediction.

9/22/2020

INTRODUCTION TO DATA MINING


UNIT # 2

FALL 2020 Sajjad Haider 1

ACKNOWLEDGEMENT

• A few slides in this presentation are taken from material provided by
  - Han and Kamber (Data Mining: Concepts and Techniques) and
  - Tan, Steinbach and Kumar (Introduction to Data Mining)

RECENT TRENDS


DATA SCIENCE AND THE ART OF PERSUASION (HBR 2019)

• Despite the success stories, many companies aren’t getting the value they could from data science.
• Four of the top seven “barriers faced at work”:
  - lack of management/financial support
  - lack of clear questions to answer
  - results not used by decision makers
  - explaining data science to others

DO YOUR DATA SCIENTISTS KNOW THE ‘WHY’ BEHIND THEIR WORK? (HBR 2019)

• Data science, broadly defined, has been around for a long time. But the failure rates of big data projects in general and AI projects in particular remain disturbingly high.
• The following were found to be the two most important reasons:
  - Many data scientists are much more interested in pursuing their craft, namely finding interesting nuggets buried in data, than they are in solving business problems.
  - From the company’s perspective, the talent is rare, and protecting data scientists from the chaos of everyday work just makes sense. But doing so increases the distance between data scientists and the company’s most important problems and opportunities.

WHY DATA SCIENCE TEAMS NEED GENERALISTS, NOT SPECIALISTS (HBR 2019)

• A division of labor in data science projects that resembles a pin factory assembly line, where “One [person] draws out the wire, another straights it, a third cuts it, a fourth points it, a fifth grinds it,” doesn’t work well.
• Algorithmic products and services like recommendation systems, style preference classification, seasonal trend detection, and more can’t be designed up-front.
• With data science, you learn as you go, not before you go.

WHAT’S THE BEST APPROACH TO DATA ANALYTICS? (HBR 2020)

• Organizations’ approaches generally fall into one of five scenarios:
  - We’re here to help. Do you have any problems to solve? (typically fails)
  - Boil the ocean. (typically fails)
  - Let a thousand flowers bloom. (works partially)
  - Three years and $10 million from now, it’s going to be great. (works partially)
  - Start with high-leverage business problems. (best approach)

POPULAR APPLICATIONS OF DATA MINING


ANALYTICS

• Two major types are: Descriptive and Predictive Analytics
• Descriptive Analytics (Unsupervised Machine Learning)
  - what happened and why did it happen
  - Referred to as “unsupervised learning” in machine learning
• Predictive Analytics (Supervised Machine Learning)
  - what will happen
  - Referred to as “supervised learning” in machine learning

POPULAR APPLICATIONS OF DATA MINING

• Clustering: grouping items by similarity
• Association rules: discovering relationships between items
• Regression: determining the relationship between the outcome and the input variables
• Text analytics: analyzing text data to find trending terms, sentiment analysis, document classification, etc.
• Classification: assigning a label/class to records

CLASSIFICATION EXAMPLE

Training set (used to learn the classifier):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set (class unknown):

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: the training set is used to learn a classifier; the resulting model is applied to the test set.]

CLASSIFICATION: DEFINITION

• Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
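
The train/test procedure in the definition can be sketched in plain Python. This is only an illustration under stated assumptions: the records are the ten labeled rows from the classification example slide, the 7/3 split is arbitrary, and `learn_one_rule` is a hypothetical one-attribute rule learner standing in for whatever classifier is actually used in the course.

```python
from collections import Counter, defaultdict

# Labeled records from the CLASSIFICATION EXAMPLE slide:
# (refund, marital_status, taxable_income_in_K, cheat)
records = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Assumption: hold out the last three records as the test set.
train, test = records[:7], records[7:]


def learn_one_rule(rows, attr_index):
    """Map each value of one attribute to its most common class in `rows`."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[attr_index]][row[-1]] += 1
    # Fallback class for attribute values never seen during training.
    default = Counter(row[-1] for row in rows).most_common(1)[0][0]
    rule = {value: counts.most_common(1)[0][0]
            for value, counts in by_value.items()}
    return rule, default


def predict(model, row, attr_index):
    rule, default = model
    return rule.get(row[attr_index], default)


def accuracy(model, rows, attr_index):
    """Fraction of rows whose predicted class matches the true class."""
    hits = sum(predict(model, row, attr_index) == row[-1] for row in rows)
    return hits / len(rows)


MARITAL = 1  # index of the attribute the rule is built on
model = learn_one_rule(train, MARITAL)
test_accuracy = accuracy(model, test, MARITAL)
print(model[0], test_accuracy)
```

On this split the rule gets only one of the three held-out records right, which is exactly the kind of weakness a separate test set is meant to expose.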

CLASSIFICATION: APPLICATION 1

• Direct Marketing
  - Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  - Approach:
    - Use the data for a similar product introduced before.
    - We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
    - Collect various demographic, lifestyle, and company-interaction related information about all such customers.
    - Use this information as input attributes to learn a classifier model.

CLASSIFICATION: APPLICATION 2

• Fraud Detection
  - Goal: Predict fraudulent cases in credit card transactions.
  - Approach:
    - Use credit card transactions and the information on the account holder as attributes.
      - When does a customer buy, what do they buy, how often do they pay on time, etc.
    - Label past transactions as fraudulent or fair. This forms the class attribute.
    - Learn a model for the class of the transactions.
    - Use this model to detect fraud by observing credit card transactions on an account.

CLASSIFICATION: APPLICATION 3

• Customer Attrition/Churn:
  - Goal: To predict whether a customer is likely to be lost to a competitor.
  - Approach:
    - Use detailed records of transactions with each past and present customer to find attributes.
      - How often the customer calls, where they call, what time of day they call most, their financial status, marital status, etc.
    - Label the customers as loyal or disloyal.
    - Find a model for loyalty.

REGRESSION

• Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
• Applications:
  - Predicting sales amounts of a new product based on advertising expenditure.
  - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
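
For the linear case, a minimal sketch of fitting and using such a model with ordinary least squares follows. The data points are hypothetical (constructed so that sales = 2 * spend + 1 exactly) and `fit_line` is an illustrative helper, not something taken from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data (not from the slides): advertising spend vs. sales.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 5.0 + intercept  # prediction for an unseen spend of 5.0
print(slope, intercept, predicted)
```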

CLUSTERING DEFINITION

• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.
• Similarity Measures:
  - Euclidean distance if attributes are continuous.
  - Other problem-specific measures.
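
As a rough sketch of how Euclidean distance drives a clustering, the code below uses k-means. The slide does not name an algorithm, so k-means is an assumption here, as are the sample points and the starting centroids.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Attribute-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids


# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
clusters, centroids = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
print(centroids)
```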

ILLUSTRATING CLUSTERING

[Figure: Euclidean distance based clustering in 3-D space. Intracluster distances are minimized; intercluster distances are maximized.]
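
The two quantities named in the figure can be computed directly. The sketch below uses hypothetical 3-D points that are already split into two clusters, and measures each quantity as an average pairwise Euclidean distance, which is one of several reasonable definitions.

```python
import math
from itertools import combinations, product


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_intracluster_distance(cluster):
    """Average distance over all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)


def mean_intercluster_distance(cluster_a, cluster_b):
    """Average distance over all pairs with one point from each cluster."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)


# Hypothetical 3-D points, already grouped into two clusters.
cluster_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
cluster_b = [(10.0, 10.0, 10.0), (11.0, 10.0, 10.0), (10.0, 11.0, 10.0)]

intra_a = mean_intracluster_distance(cluster_a)
inter_ab = mean_intercluster_distance(cluster_a, cluster_b)
print(intra_a, inter_ab)
```

A good clustering keeps the first number small relative to the second.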

CLUSTERING: APPLICATION 1

• Market Segmentation:
  - Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  - Approach:
    - Collect different attributes of customers based on their geographical and lifestyle related information.
    - Find clusters of similar customers.
    - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

CLUSTERING: APPLICATION 2

• Document Clustering:
  - Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
  - Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
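
One possible similarity measure built from term frequencies is cosine similarity. The slide does not prescribe a specific measure, so this choice is an assumption, and the three mini-documents are made up for illustration.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase word occurs in a document."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b)


# Hypothetical mini-documents, for illustration only.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining data for patterns",
    "d3": "the match ended in a draw",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
print(sim_12, sim_13)
```

Documents d1 and d2 share most of their terms and score much higher than d1 and d3, so a clustering step built on this measure would tend to group them together.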

ASSOCIATION RULE DISCOVERY: DEFINITION

• Given a set of records, each of which contains some number of items from a given collection;
  - Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
• Most popular application: Recommender Systems

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
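
The strength of the two discovered rules can be checked against the five transactions. Support and confidence are the standard measures for this, although the slide itself does not name them; the function names below are illustrative.

```python
# Transactions from the slide's TID/Items table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(lhs | rhs),
          "confidence:", round(confidence(lhs, rhs), 3))
```

For these baskets, {Milk} --> {Coke} holds in 3 of 5 transactions and in 3 of the 4 that contain Milk; {Diaper, Milk} --> {Beer} holds in 2 of 5 transactions and in 2 of the 3 that contain both Diaper and Milk.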

ASSOCIATION RULE DISCOVERY: APPLICATION 1

• Marketing and Sales Promotion:
  - Let the rule discovered be {Bagels, … } --> {Potato Chips}
  - Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
  - Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
  - Bagels in the antecedent and Potato Chips in the consequent => Can be used to see what products should be sold with Bagels to promote the sale of Potato Chips!

ASSOCIATION RULE DISCOVERY: APPLICATION 2

• Supermarket shelf management.
  - Goal: To identify items that are bought together by sufficiently many customers.
  - Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
  - A classic rule:
    - If a customer buys diapers and milk, then they are very likely to buy beer.
    - So, don’t be surprised if you find six-packs stacked next to diapers!
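
A minimal sketch of the goal, finding items bought together by sufficiently many customers, is to count co-occurring pairs and keep those above a threshold. The baskets reuse the five transactions from the association rule definition slide as sample point-of-sale data, and `MIN_COUNT` is an assumed threshold.

```python
from collections import Counter
from itertools import combinations

# Sample point-of-sale baskets (the five transactions from the
# ASSOCIATION RULE DISCOVERY: DEFINITION slide).
baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

MIN_COUNT = 3  # assumed threshold for "sufficiently many customers"

# Count every unordered pair of items that appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= MIN_COUNT}
print(frequent_pairs)
```

Only two pairs, Coke with Milk and Diaper with Milk, reach the threshold in this sample; a lower threshold would admit more candidates for shelf placement.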

ASSOCIATION RULE DISCOVERY: APPLICATION 3

• Inventory Management:
  - Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
  - Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

TEXT ANALYTICS

• Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into useful insight.
• Applications:
  - Sentiment analysis
  - Tag cloud
  - Topic modeling
  - Machine translation
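
Of the listed applications, a tag cloud is the simplest to sketch: it is driven by word counts after common words are dropped. The stop-word list and the review text below are hypothetical, and `tag_cloud_counts` is an illustrative helper.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop-word list; real ones are much longer.
STOP_WORDS = {"the", "a", "is", "and", "of", "to", "it", "this", "was"}


def tag_cloud_counts(text, top_n=3):
    """Return the `top_n` most frequent non-stop-words with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Hypothetical review text, for illustration only.
review = "The battery is great. Great screen, great price, and the battery lasts."
top_terms = tag_cloud_counts(review, top_n=2)
print(top_terms)
```

A rendering step would then draw each term at a size proportional to its count.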

DATA MINING VS ALLIED FIELDS


STATISTICS VS. MACHINE LEARNING

• Data mining has its origins in various disciplines, of which the two most important are statistics and machine learning.
• Statistics has its roots in mathematics, and therefore, there has been an emphasis on mathematical rigor, a desire to establish that something is sensible on theoretical grounds before testing it in practice.
• In contrast, the machine learning community has its origin very much in computer practice. This has led to a practical orientation, a willingness to test something out to see how well it performs, without waiting for a formal proof of effectiveness.

STATISTICS VS. MACHINE LEARNING (CONT’D)

• Modern statistics is entirely driven by the notion of a model. This is a postulated structure, or an approximation to a structure, which could have led to the data.
• In place of the statistical emphasis on models, machine learning tends to emphasize algorithms.

DATA MINING AND MACHINE LEARNING

• Data mining as a process includes data understanding, data preparation and data modeling, while machine learning takes the processed data as input and makes predictions by applying algorithms.
• Thus, data mining requires the involvement of human beings to clean and prepare the data and to understand the patterns.
• In machine learning, by contrast, human effort is involved only in defining an algorithm, after which the algorithm takes over operations.

DM AND ML (CONT’D)

• Nevertheless, it is worth pointing out some of the differences to give perspective.
• Speaking generally, because Machine Learning is concerned with many types of performance improvement, it includes subfields such as robotics and computer vision that are not part of Data Mining.
• Machine Learning also is concerned with issues of agency and cognition (how will an intelligent agent use learned knowledge to reason and act in its environment), which are not concerns of Data Mining.

BI VS. DATA ANALYTICS

• Business Intelligence (BI) focuses on using a consistent set of metrics to measure past performance and inform business planning.
• Data Analytics refers to a combination of analytical and machine learning techniques used for drawing inferences and insight out of data.

[Figure: axis running from Past to Future]

BI ANSWERS FOR FRAUD DETECTION

• How many cases were investigated last month?
• What was the success rate in collecting debts?
• How much revenue was recovered through collections?
• What was the close rate of cases in the past month? Past quarter? Past year?
• For debts that were closed out, how many days did it take on average to close them out?
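
BI questions of this kind are answered by aggregating records of what already happened. The sketch below computes a few such metrics; the case records and their field layout are hypothetical.

```python
from datetime import date

# Hypothetical case records (not from the slides):
# (opened, closed or None if still open, amount_recovered)
cases = [
    (date(2020, 8, 3), date(2020, 8, 13), 500.0),
    (date(2020, 8, 10), date(2020, 8, 30), 0.0),
    (date(2020, 8, 17), None, 0.0),
    (date(2020, 8, 24), date(2020, 9, 8), 1200.0),
]

closed = [c for c in cases if c[1] is not None]

cases_investigated = len(cases)
close_rate = len(closed) / len(cases)
revenue_recovered = sum(c[2] for c in cases)
# Average number of days between opening and closing, closed cases only.
avg_days_to_close = sum((c[1] - c[0]).days for c in closed) / len(closed)

print(cases_investigated, close_rate, revenue_recovered, avg_days_to_close)
```

Every number here describes the past; nothing in the computation says anything about a transaction that has not yet been investigated.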

PREDICTIVE ANALYTICS FOR FRAUD DETECTION

• What is the likelihood that the transaction is fraudulent?
• What is the likelihood the invoice is fraudulent or warrants further investigation?
• Which characteristics of the transaction are most related to or most predictive of fraud?
• What is the expected amount of fraud?
• Historically, which demographic and historic purchase patterns were most related to fraud?
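
Predictive questions, by contrast, attach a likelihood to cases that have not been labeled yet. As a rough sketch, the code below estimates the fraud likelihood per channel from a hypothetical labeled history and uses it to compute an expected fraud amount. The per-channel frequency is a crude stand-in for a trained model, and all data and names are assumptions.

```python
from collections import defaultdict

# Hypothetical labeled history: (channel, was_fraud)
history = [
    ("online", True), ("online", False), ("online", False), ("online", False),
    ("in_store", False), ("in_store", False), ("in_store", False),
    ("in_store", False), ("in_store", False), ("in_store", True),
]

# Estimate P(fraud | channel) as the observed fraud rate per channel.
totals = defaultdict(int)
frauds = defaultdict(int)
for channel, was_fraud in history:
    totals[channel] += 1
    frauds[channel] += int(was_fraud)
fraud_rate = {ch: frauds[ch] / totals[ch] for ch in totals}

# New, unlabeled transactions: (channel, amount)
incoming = [("online", 200.0), ("in_store", 600.0)]

# Expected fraud amount = sum over transactions of P(fraud) * amount.
expected_fraud = sum(fraud_rate[ch] * amount for ch, amount in incoming)
print(fraud_rate, expected_fraud)
```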

BI ANSWERS FOR CUSTOMER ANALYTICS

• Which regions/states/ZIPs had the highest response rates?
• Which products had the highest/lowest click-through rates?
• How many repeat purchasers were there last month?
• How many new subscriptions to the loyalty program were there?
• How many visits to the store/website did a person have?

PREDICTIVE ANALYTICS FOR CUSTOMER ANALYTICS

• What is the likelihood an e-mail will be opened?
• What is the likelihood a customer will click through a link in an e-mail?
• Which product is a customer more likely to purchase if given the choice?
• How many e-mails should the customer receive to maximize the likelihood of a purchase?
• What is the likelihood a product will sell out if it is put on sale?