Unit-4
Introduction to Data Mining
Data Mining is an information extraction activity
whose goal is to discover hidden facts
contained in large
databases.
Data Mining Models and Tasks
BASIC TASKS
Classification : Classification is a data mining technique
used to predict the group (class) membership of data
instances.
For example, you may wish to use classification to
predict whether the weather on a particular day will be
“sunny”, “rainy” or “cloudy”. Popular classification
techniques include decision trees and neural networks.
Classification
Given old data about customers and payments, predict
a new applicant’s loan eligibility.
[Figure: data on previous customers (age, salary, profession,
location, customer type) is fed to a classifier, which produces
decision rules such as Salary > 5 L and Prof. = Exec; the rules
label a new applicant’s data as good or bad.]
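The loan example can be sketched as a tiny rule-based classifier. The two rules (Salary > 5 L, Prof. = Exec) come from the slide above; how they combine, the field names, and the sample applicants are assumptions made for illustration. A real classifier would learn its rules from the previous customers' data rather than have them written by hand.

```python
# Hypothetical rule-based classifier for the loan example.
# Assumption: an applicant is "good" if either rule fires.
SALARY_THRESHOLD_LAKH = 5  # "Salary > 5 L" from the slide


def classify_applicant(applicant):
    """Return 'good' or 'bad' for a dict with 'salary_lakh' and 'profession'."""
    if applicant["salary_lakh"] > SALARY_THRESHOLD_LAKH:
        return "good"
    if applicant["profession"] == "Exec":
        return "good"
    return "bad"


# Made-up applicants, used only to exercise the rules.
new_applicants = [
    {"name": "A", "salary_lakh": 7.5, "profession": "Clerk"},
    {"name": "B", "salary_lakh": 3.0, "profession": "Exec"},
    {"name": "C", "salary_lakh": 2.0, "profession": "Clerk"},
]
labels = {a["name"]: classify_applicant(a) for a in new_applicants}
print(labels)
```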
DATA MINING TASKS (contd.)
Regression : Regression is used to predict a numeric value for an
individual on the basis of information gained from a previous
sample of similar individuals.
Example:
A person wants to plan his savings for the future. The prediction is
based on his current savings and several past values. He uses a
linear regression formula to predict his future savings.
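A minimal sketch of the savings example, assuming a straight-line fit by ordinary least squares. The yearly figures are made up and deliberately lie on a perfect line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical past savings (in thousands) for years 1 to 5.
years = [1, 2, 3, 4, 5]
savings = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(years, savings)
predicted_year_6 = slope * 6 + intercept
print(slope, intercept, predicted_year_6)
```

Real savings data would not fall exactly on a line, so the prediction would carry some error.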
DATA MINING TASKS (contd.)
Clustering : Clustering is a data mining technique used to place
data elements into related groups without advance knowledge
of the group definitions.
Example : A department store chain creates special catalogues
targeted to various types of customer groups based on
attributes such as income, location, etc.
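The department store example can be sketched with a very small one-dimensional k-means on customer income. K-means is only one possible clustering method and is an assumption here, as are the income values and the way the starting centers are chosen.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny 1-D k-means. Starting centers are spread over the sorted data."""
    ordered = sorted(values)
    step = (len(ordered) - 1) / max(k - 1, 1)
    centers = [ordered[round(i * step)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assign every value to its nearest center.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)
        ]
    return centers, groups


# Hypothetical customer incomes; no group labels are given in advance.
incomes = [20, 22, 25, 80, 85, 90]
centers, groups = kmeans_1d(incomes, k=2)
print(centers, groups)
```

The two groups that come out could then each receive their own catalogue.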
DATA MINING TASKS (contd.)
Pattern mining is a data mining method that involves
finding existing patterns in data. In this context, “patterns”
often means association rules. The original motivation for
searching for association rules came from the desire to analyze
supermarket transaction data, that is, to examine customer
behavior in terms of the purchased products.
For example, an association rule “cold drink ⇒ potato chips
(80%)” states that four out of five customers who bought
cold drink also bought potato chips.
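The 80% figure in such a rule is its confidence: the fraction of transactions containing the left-hand item that also contain the right-hand item. A minimal sketch, with made-up transactions chosen so that the rule comes out at 80%:

```python
def confidence(transactions, antecedent, consequent):
    """confidence(A => B) = count(A and B) / count(A)."""
    with_a = [t for t in transactions if antecedent in t]
    if not with_a:
        return 0.0  # Assumption: report 0 when A never occurs.
    with_both = [t for t in with_a if consequent in t]
    return len(with_both) / len(with_a)


# Hypothetical supermarket baskets.
transactions = [
    {"cold drink", "potato chips"},
    {"cold drink", "potato chips", "bread"},
    {"cold drink", "potato chips"},
    {"cold drink", "potato chips", "butter"},
    {"cold drink", "bread"},
    {"bread", "butter"},
]
rule_confidence = confidence(transactions, "cold drink", "potato chips")
print(rule_confidence)
```

Five baskets contain cold drink and four of those also contain potato chips, which gives 4/5.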
DATA MINING TASKS (contd.)
Summarization maps data into subsets with associated
simple descriptions (Characterization or Generalization).
Example: GATE score
Link Analysis uncovers relationships among data.
Association Rules
Sequential Analysis determines sequential patterns.
Data Mining Application: Marketing
Sales Analysis
• associations between product sales:
bread and butter
toothpaste and toothbrush
Customer Profiling
• data mining can tell you what types of customers
buy what products
Identifying Customer Requirements
• identify the best products for different customers
• use prediction to find what factors will attract
new customers
Data Mining Application:
Fraud Detection
• Association Rule Mining can detect a group of people who
stage accidents to collect on insurance
• a data-mining application can be used to detect suspicious
money transactions
• data mining can be used to help commercial lending
decisions and to prevent fraud
Data Preprocessing
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or
names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., was rating “1, 2, 3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
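The three kinds of dirty data can be illustrated with a small checker that flags the slide's own examples. The field names, the month/day/year date format, and the reference date used to compute age are assumptions made for this sketch.

```python
from datetime import date, datetime

# Assumption: ages are checked against this reference date.
AS_OF = date(2015, 8, 10)


def find_problems(record, as_of=AS_OF):
    """Return a list of data-quality problems found in one record."""
    problems = []
    # Incomplete: attribute value is missing or blank.
    if not record.get("occupation", "").strip():
        problems.append("incomplete: occupation is missing")
    # Noisy: value is impossible for the attribute.
    if record.get("salary", 0) < 0:
        problems.append("noisy: salary is negative")
    # Inconsistent: stated age disagrees with the birthday.
    born = datetime.strptime(record["birthday"], "%m/%d/%Y").date()
    had_birthday = (as_of.month, as_of.day) >= (born.month, born.day)
    computed_age = as_of.year - born.year - (0 if had_birthday else 1)
    if computed_age != record["age"]:
        problems.append("inconsistent: age does not match birthday")
    return problems


record = {"occupation": " ", "salary": -10, "age": 42, "birthday": "03/07/1997"}
problems = find_problems(record)
print(problems)
```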
Data Mining: Concepts and Techniques
Why Is Data Dirty?
Incomplete data may come from
“Not applicable” data value when collected
Different considerations between the time when the data was collected and when it
is analyzed.
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Data Mining: Concepts and Techniques August 10, 2015
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation make up the majority
of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality
Properties of a well-accepted multidimensional
view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same or
similar analytical results
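Two of the tasks above, filling in missing values (data cleaning) and normalization (data transformation), can be sketched as follows. Using the mean as the fill value and min-max scaling to the range 0 to 1 are assumptions; other fill strategies and normalization schemes exist. The salary values are made up.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the smallest is 0.0 and the largest 1.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # All values equal: nothing to scale.
    return [(v - low) / (high - low) for v in values]


# Hypothetical salary column with one missing value.
salaries = [30.0, None, 50.0, 70.0]
cleaned = fill_missing_with_mean(salaries)
normalized = min_max_normalize(cleaned)
print(cleaned, normalized)
```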
Forms of Data Preprocessing
KDD Process
The KDD process
"KDD is the nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable patterns in
data".
Steps: The process operates on the following basic steps:
(i) identifying the goal from the user's point of view (based on
the relevant knowledge about the domain),
(ii) creating a target data set,
(iii) data preprocessing,
(iv) data reduction and projection,
(v) matching the goals of the KDD process to a particular data
mining method,
(vi) exploratory analysis,
(vii) data mining,
(viii) interpreting mined patterns,
(ix) acting on the discovered knowledge.
These steps can be divided into three tasks:
the preprocessing of data (steps i - vi),
the mining of data (step vii) and
the postprocessing of data (steps viii - ix).
The domain knowledge helps the process to focus on the
research content.
Fig. : The KDD Process
KDD Process Ex: Web Log
Selection:
Select log data (dates and locations) to use
Preprocessing:
Remove identifying URLs
Remove error logs
Transformation:
Sessionize (sort and group)
Data Mining:
Identify and count patterns
Construct data structure
Interpretation/Evaluation:
Identify and display frequently accessed sequences.
Potential User Applications:
Cache prediction
Personalization
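The web log steps above can be sketched end to end on a toy log. The log layout, the 30-minute session gap, and the choice of consecutive page pairs as the "pattern" are all assumptions for illustration. Only error removal is shown for the preprocessing step; removing identifying URLs is omitted.

```python
from collections import Counter

# Hypothetical log entries: (user, timestamp in seconds, url, HTTP status).
log = [
    ("u1", 0, "/home", 200),
    ("u1", 30, "/products", 200),
    ("u2", 10, "/home", 200),
    ("u1", 60, "/missing", 404),
    ("u2", 50, "/products", 200),
    ("u1", 5000, "/home", 200),
    ("u1", 5020, "/cart", 200),
]

SESSION_GAP = 1800  # Assumption: 30 minutes of inactivity ends a session.


def sessionize(entries, gap=SESSION_GAP):
    """Sort by user and time, then group each user's hits into sessions."""
    sessions = []
    last_seen = {}  # user -> (last timestamp, that user's current session)
    for user, ts, url, _status in sorted(entries, key=lambda e: (e[0], e[1])):
        previous = last_seen.get(user)
        if previous is None or ts - previous[0] > gap:
            session = []
            sessions.append(session)
        else:
            session = previous[1]
        session.append(url)
        last_seen[user] = (ts, session)
    return sessions


# Preprocessing: remove error logs (status 400 and above).
clean = [entry for entry in log if entry[3] < 400]
# Transformation: sessionize.
sessions = sessionize(clean)
# Data mining: count consecutive page pairs within each session.
pair_counts = Counter(
    (s[i], s[i + 1]) for s in sessions for i in range(len(s) - 1)
)
# Interpretation: report the most frequently accessed sequence.
most_common_pair, count = pair_counts.most_common(1)[0]
print(most_common_pair, count)
```

A cache could then prefetch the second page of a frequent pair whenever the first one is requested.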
KDD Issues
Human Interaction
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality
KDD Issues (contd.)
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application