Topic 1c: Tasks and Techniques of Data Mining
Ts. Dr. Tuan Norhafizah Tuan Zakaria
Objectives
• To introduce Data Mining and its relationship with data and knowledge
• To discuss the history, evolution and motivation of Data Mining
• To discuss Data Mining techniques, tasks, applications and some major issues
Knowledge Discovery in Databases
DM: Tasks and Techniques
(Diagram: Data mining, branching into Tasks and Techniques)
Tasks
• Classification
• Clustering
• Association Rules
• Prediction
• Sequential Analysis
• Deviation analysis
• Similarity analysis
• Trend analysis
Techniques
• Decision Trees
• Association Rule
• k-means
• Neural Networks
• Naïve Bayes
• k-nearest neighbor
• Statistical Method
Classification: Definition
Given a collection of records (training set)
• Each record contains a set of attributes, one of the attributes is the class.
Find a model for class attribute as a function of the values of other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
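The train/test procedure described above can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: the records are made-up toy data, `split_records`, `fit_majority_class` and `accuracy` are hypothetical helper names, and the majority-class "model" is a trivial stand-in for a real classifier.

```python
import random
from collections import Counter

def split_records(records, test_fraction=0.3, seed=42):
    """Shuffle the labelled records and divide them into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_majority_class(training_set):
    """Build a trivial model that always predicts the most common training class."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    hits = sum(model(attributes) == label for attributes, label in test_set)
    return hits / len(test_set)

# Toy data (assumption): each record is (attributes, class).
records = [((income,), "Yes" if income >= 80 else "No") for income in range(40, 140, 5)]

training_set, test_set = split_records(records)
model = fit_majority_class(training_set)  # built from the training set only
print(f"{len(training_set)} training records, {len(test_set)} test records")
print(f"accuracy on the test set: {accuracy(model, test_set):.2f}")
```

The point of the sketch is the separation of roles: the model never sees the test records while it is being built, so the measured accuracy estimates how it would behave on previously unseen records.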
Classification Example
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training Set (used to learn the classifier model)
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set (class to be assigned by the model)
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
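As one possible way to turn the example into code, the sketch below applies a 1-nearest-neighbor classifier (one of the techniques listed earlier) to the ten training records and the six test records of the example. The slides do not say which classifier the figure uses, so the choice of technique, the mixed distance function, and the names `distance` and `predict` are all assumptions made for illustration.

```python
# Training records from the example: (refund, marital_status, income_in_K, cheat)
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test records from the example: the class is unknown.
TEST = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

INCOMES = [row[2] for row in TRAINING]
INCOME_RANGE = max(INCOMES) - min(INCOMES)

def distance(a, b):
    """Assumed mixed distance: one unit per categorical mismatch plus the scaled income gap."""
    return (a[0] != b[0]) + (a[1] != b[1]) + abs(a[2] - b[2]) / INCOME_RANGE

def predict(record):
    """Assign the class of the closest training record (1-nearest neighbor)."""
    nearest = min(TRAINING, key=lambda row: distance(record, row))
    return nearest[3]

predictions = [predict(record) for record in TEST]
for record, label in zip(TEST, predictions):
    print(record, "->", label)
```

With this particular distance every test record ends up closest to a "No" training record; a different distance or a different technique (for example a decision tree) could label some records differently.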
Classification: Direct Marketing
Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
Approach:
• Identify which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction related information: type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.
Classification: Customer Attrition/Churn
Goal: To predict whether a customer is likely to be lost to a
competitor.
Approach:
• Use detailed records of transactions with past and present customers: how often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
Clustering
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
• Data points in one cluster are more similar to one another.
• Data points in separate clusters are less similar to one another.
Similarity Measures:
• Euclidean Distance if attributes are continuous.
• Other Problem-specific Measures.
Clustering: Euclidean Distance Based Clustering in 3-D space.
• Intracluster distances are minimized
• Intercluster distances are maximized
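A small sketch of Euclidean-distance-based clustering in 3-D, using k-means from the techniques list. The points and the starting centroids are made up for illustration, and the helper names (`euclidean`, `assign`, `mean_point`, `k_means`) are hypothetical; a real application would normally use a library implementation and several random restarts.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign(points, centroids):
    """Put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def mean_point(cluster):
    """Coordinate-wise mean of a non-empty cluster."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        updated = [mean_point(c) if c else old for c, old in zip(clusters, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Made-up 3-D points forming two visibly separate groups (assumption).
points = [(1, 1, 1), (1, 2, 1), (2, 1, 2), (8, 8, 8), (9, 8, 9), (8, 9, 8)]
centroids, clusters = k_means(points, centroids=[points[0], points[-1]])

# Compare a within-cluster distance with the between-cluster distance.
intra = max(euclidean(p, c) for c, cluster in zip(centroids, clusters) for p in cluster)
inter = euclidean(centroids[0], centroids[1])
print(f"largest point-to-centroid distance: {intra:.2f}")
print(f"distance between centroids: {inter:.2f}")
```

For a good clustering the first number is small relative to the second, which is the "intracluster minimized, intercluster maximized" idea in numeric form.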
Clustering: Market Segmentation
Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market target
to be reached with a distinct marketing mix.
Approach:
• Collect different attributes of customers based on their geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
Clustering: Market Segmentation
• Segment 1: high duration but low number of generated calls and moderate number of sent and received SMS.
• Segment 2: moderate duration of generated calls and moderate to high data usage.
• Segment 3: high duration of off-net calls, high number of generated calls, and moderate to low of both duration of generated calls and data usage.
• Segment 4: very low call duration, high sent and received SMS, and high data usage.
• Segment 5: very low data usage, low duration of generated calls, and high number of received calls with respect to the number of generated calls.
• Segment 6: relatively high duration of international calls.
Clustering: Document Clustering
Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
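The slides do not name a specific similarity measure, so the sketch below assumes one common choice: cosine similarity between term-frequency vectors. The three documents, the tiny stop-word list, and the function names are made up for illustration.

```python
import math
from collections import Counter

# Tiny stop-word list (assumption) so that only content terms are counted.
STOP_WORDS = {"a", "in", "of", "the"}

def term_frequencies(text):
    """Count how often each non-stop-word term occurs in a document."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 if no shared terms)."""
    dot = sum(tf_a[t] * tf_b[t] for t in set(tf_a) & set(tf_b))
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up documents (assumption).
documents = [
    "data mining finds patterns in data",
    "clustering groups similar data points",
    "the football match ended in a draw",
]
vectors = [term_frequencies(d) for d in documents]

for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        print(i, j, round(cosine_similarity(vectors[i], vectors[j]), 3))
```

A clustering algorithm would then group documents whose pairwise similarity is high; here the two data-mining sentences share the term "data", while the football sentence shares no content term with either.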
Association Rule Discovery
• Given a set of records each of which contain some number of items from a given collection;
• Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
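The strength of such rules is usually quantified with support and confidence. These two measures are standard in association rule mining but are not defined on the slide itself, so the sketch below is an added illustration; it evaluates the two discovered rules against the five transactions listed above, with hypothetical function names.

```python
# The five market-basket transactions from the example.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for transaction in TRANSACTIONS if itemset <= transaction)

def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(TRANSACTIONS)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for antecedent, consequent in rules:
    print(sorted(antecedent), "-->", sorted(consequent),
          f"support={support(antecedent, consequent):.2f}",
          f"confidence={confidence(antecedent, consequent):.2f}")
```

For {Milk} --> {Coke}, three of the five transactions contain both items (support 0.60) and three of the four transactions with Milk also contain Coke (confidence 0.75).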
Association Rule Discovery: Marketing & Sales Promotion
• Let the rule discovered be
{Bagels, … } --> {Potato Chips}
• Potato Chips as consequent can be used to
determine what should be done to boost its sales.
• Bagels in the antecedent can be used to see which
products would be affected if the store discontinues
selling bagels.
• Bagels in antecedent and Potato chips in consequent
can be used to see what products should be sold
with Bagels to promote sale of Potato chips!
Association Rule Discovery: Supermarket Shelf Management
Goal: To identify items that are bought together by sufficiently many customers.
Approach:
• Process the point-of-sale data collected with barcode scanners to find dependencies among items.
A classic rule
• If a customer buys diapers and milk, then he is very likely to buy root beer.
• So, don’t be surprised if you find six-packs of root beer stacked next to diapers!
Retail Analytics
https://www.digitalnewsasia.com/download/tapwaycasestudy.pdf
Regression
Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Greatly studied in statistics and machine learning fields.
Examples:
• Predicting sales amounts of new product based on advertising expenditure.
• Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
• Time series prediction of stock market indices.
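As an illustration of the linear case, the sketch below fits a straight line by ordinary least squares to made-up figures for advertising expenditure and sales. The numbers, their units, and the function name `fit_line` are assumptions; only the idea of predicting a continuous value from another variable comes from the slide.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data (assumption): advertising spend and resulting sales, arbitrary units.
advertising = [1, 2, 3, 4, 5]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 6 + intercept  # predicted sales for a spend of 6 units
print(f"sales = {slope:.2f} * advertising + {intercept:.2f}")
print(f"predicted sales at advertising = 6: {predicted:.2f}")
```

A nonlinear model of dependency would replace the straight line with another function family, but the workflow of fitting on known pairs and then predicting for a new input stays the same.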
Deviation Analysis
Discovering most significant changes in data from previously measured or normative
values
• Usually categorized separately from other data mining tasks
Deviations are often infrequent
Modifications of classification, clustering, time series analysis can be used as a means
to achieve the goal
Outlier detection in statistics
Deviation Analysis: Anomaly Detection
Detect significant deviations from normal behavior.
Applications:
• Credit card fraud detection
• Network intrusion detection
Typical network traffic at University level may reach over 100 million connections per day
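One simple statistical way to flag deviations from normal behavior is a z-score test against a baseline of previously measured values. The sketch below is an assumed example, not a method prescribed by the slides: the baseline amounts, the new observations, the threshold of 3 standard deviations, and the name `find_deviations` are all illustrative.

```python
import statistics

def find_deviations(baseline, observations, threshold=3.0):
    """Return observations lying more than `threshold` standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # sample standard deviation of normal behavior
    return [x for x in observations if abs(x - mean) / spread > threshold]

# Made-up data (assumption): typical card transaction amounts, then new transactions.
baseline = [20, 22, 19, 21, 23, 20]
observations = [21, 250, 18]

print(find_deviations(baseline, observations))
```

Because deviations are infrequent, the threshold trades missed anomalies against false alarms; at traffic volumes like the one quoted above, even a tiny false-alarm rate produces a large number of flagged events.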
Deviation Analysis: Fraud Detection
• Identify employee accounts at financial institutions that have excess numbers of credit memos. Excess credit memos can indicate diversion of funds into employee accounts.
• Compare employee home addresses, social security numbers, telephone numbers and bank routing and account numbers to those of vendors from vendor master file. This test can reveal bogus or improperly selected vendor accounts.
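Both tests can be sketched as simple record comparisons. Everything in the example below is hypothetical: the field names, the employee and vendor records, the credit memo list, and the limit of 2 memos are invented for illustration and do not come from any real system.

```python
from collections import Counter

# Fields compared between employee records and the vendor master file (assumed names).
MATCH_FIELDS = ("address", "phone", "bank_account")

def shared_details(employees, vendors):
    """Return (employee, vendor, matching fields) for every pair that shares a detail."""
    hits = []
    for emp in employees:
        for ven in vendors:
            same = [f for f in MATCH_FIELDS if emp.get(f) and emp.get(f) == ven.get(f)]
            if same:
                hits.append((emp["name"], ven["name"], same))
    return hits

def excess_credit_memos(memo_accounts, limit):
    """Return accounts that received more credit memos than the chosen limit."""
    counts = Counter(memo_accounts)
    return sorted(account for account, n in counts.items() if n > limit)

# Hypothetical records (assumption).
employees = [
    {"name": "Employee A", "address": "12 Jalan Satu", "phone": "555-0101", "bank_account": "ACC-001"},
    {"name": "Employee B", "address": "7 Jalan Dua", "phone": "555-0102", "bank_account": "ACC-002"},
]
vendors = [
    {"name": "Vendor X", "address": "99 Jalan Tiga", "phone": "555-0199", "bank_account": "ACC-002"},
    {"name": "Vendor Y", "address": "3 Jalan Empat", "phone": "555-0150", "bank_account": "ACC-777"},
]
memo_accounts = ["ACC-001", "ACC-002", "ACC-002", "ACC-002", "ACC-001"]

print(shared_details(employees, vendors))
print(excess_credit_memos(memo_accounts, limit=2))
```

A match is only a lead for an auditor to follow up, not proof of fraud; real data would also need normalization (address formats, phone prefixes) before exact comparison is meaningful.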
Deviation Analysis: Fraud Detection
https://www.insurancebusinessmag.com/asia/news/breaking-news/malaysias-antifraud-system-operational-by-october-74933.aspx
Profiteering Cases
https://www.freemalaysiatoday.com/category/nation/2018/08/25/yes-keep-receipts-to-fight-profiteering-say-retailers/
Yes, keep receipts to fight profiteering, say retailers
Robin Augustin - August 25, 2018 8:00 AM
http://english.astroawani.com/malaysia-news/gst-1-256-profiteering-cases-detected-1-115-notices-issued-till-june-5-61853