Introduction to Data Mining
Table of Contents
Data-Mining Applications
A Strategy for Data Mining: CRISP-DM
Stages and tasks in CRISP-DM
Life Cycle of a Data Mining Project
Skills Needed for Data Mining
Objectives
At the end of this module you should be able to:
List two applications of data mining
Explain the stages of the CRISP-DM process model
Describe what makes data-mining projects succeed and the
reasons why projects fail
Describe the skills needed for data mining
Data-Mining Applications (1 of 2)
Reduce churn (reduce the number of customers who
cancel their policies, subscriptions, or accounts)
Reduce costs by better targeting customers in direct
mail campaigns
Reduce costs in a manufacturing process by
preventing machine failures
Reduce the incidence of heart attacks among people
with cardiac disease
Data-Mining Applications (2 of 2)
Better target customers by classifying customers into
groups with distinct usage or need patterns
Reduce costs by preventing fraudulent credit-card
activity, or detecting fraud in an earlier stage
Increase revenues by increasing the number of
products sold by cross-selling
Increase revenues by showing a visitor the best next
page on a website
A Strategy for Data Mining: CRISP-DM
A data-mining project can become complicated
quickly
A model is needed that guides you through the
critical issues
Recommendation: use the Cross-Industry Standard
Process for Data Mining (CRISP-DM)
Stages in CRISP-DM
1 Business Understanding
2 Data Understanding
3 Data Preparation
4 Modeling
5 Evaluation
6 Deployment
Stage 1: Business Understanding
Task | Sub task 1 | Sub task 2 | Sub task 3
Determine business objectives | Background | Business objectives | Business success criteria
Assess situation | Inventory of resources | Risks and contingencies | Terminology
Determine data-mining objectives | Data-mining success criteria
Produce project plan | Write a project plan | Initial assessment of tools and techniques
Stage 2: Data Understanding
Task | Sub task 1
Collect initial data | Initial data-collection report
Describe data | Data-description report
Explore data | Data-exploration report
Verify data quality | Data-quality report
Stage 3: Data Preparation
Task | Sub task 1 | Sub task 2
Select data | Rationale for inclusion and exclusion
Clean data | Data-cleaning report
Construct data | Derived attributes
Format data and combine datasets | Set the unit of analysis | Integrate data
Stage 4: Modeling
Task | Sub task 1 | Sub task 2
Select modeling techniques | Modeling assumptions
Generate test design | Test design
Build model | Set model parameters | Model descriptions
Assess model | Model assessment | Revise model parameters
Stage 5: Evaluation
Task | Sub task 1 | Sub task 2
Evaluate results | Assessment of data-mining results with respect to business success criteria | Approve models
Review process | Review of process
Determine next steps | List of possible actions | Decision
Stage 6: Deployment
Task | Sub task 1 | Sub task 2
Plan deployment | Deployment plan
Maintenance | Maintenance plan
Produce final report | Final report | Final presentation
Review project | Documentation
The Life Cycle of a Data-Mining Project
The stages influence each other in a non-linear way
Data mining is an ongoing endeavor
Data-Mining Success (1 of 4)
Measures of success:
the initial assessment will be directly tied to the
predictive accuracy
in the long run, the success of a data-mining effort is
measured by concrete business factors, such as reduced
costs or increased revenues
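As an illustration of the initial assessment, predictive accuracy can be computed as the share of cases a model labels correctly. The sketch below is a minimal example only: the function name accuracy and the churn labels are hypothetical and do not come from any particular tool or dataset.

```python
def accuracy(actual, predicted):
    """Fraction of cases where the predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one case")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical holdout data: 1 = customer churned, 0 = customer stayed.
actual_churn = [1, 0, 0, 1, 0, 1, 0, 0]
predicted_churn = [1, 0, 1, 1, 0, 0, 0, 0]

print(accuracy(actual_churn, predicted_churn))  # 0.75
```

Six of the eight hypothetical cases are predicted correctly, which gives 0.75. Accuracy alone says nothing about which kind of error was made, which is why the longer-term business measures matter.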
Data-Mining Success (2 of 4)
Monitoring:
after deployment, collect data to assess the model’s
success
Data-Mining Success (3 of 4)
Cost of errors:
there will always be errors, sometimes with high cost
if no cost estimates are possible beforehand, then try to
gather this information afterwards, for future use
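When cost estimates are available, the cost of errors can be worked out by counting each type of error and multiplying by its cost. The following is a sketch under stated assumptions: the function total_error_cost, the fraud labels, and the two cost figures (5 and 200) are invented for illustration.

```python
def total_error_cost(actual, predicted, cost_false_positive, cost_false_negative):
    """Sum the cost of all misclassifications, with a separate cost per error type."""
    # False positive: the model flags a case that was actually negative.
    false_positives = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    # False negative: the model misses a case that was actually positive.
    false_negatives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return false_positives * cost_false_positive + false_negatives * cost_false_negative


# Hypothetical fraud data: 1 = fraudulent transaction, 0 = legitimate transaction.
actual = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1]
predicted = [0, 1, 1, 0, 0, 1, 0, 0, 1, 1]

# Assumed costs: 5 to investigate a legitimate transaction, 200 for a missed fraud.
cost = total_error_cost(actual, predicted, 5, 200)
print(cost)  # 210
```

Here two false alarms cost 10 in total and one missed fraud costs 200, so a single rare error dominates the total. This is the situation the slide warns about: errors are unavoidable, and some are far more expensive than others.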
Data-Mining Success (4 of 4)
Other measures of project success:
seek other measures to determine success from a
business perspective
bring successes to the attention of colleagues and
management early on in the project, so that tracking
systems or reports can be developed
Data-Mining Failure (1 of 4)
Bad data:
no data mining algorithm will be able to compensate for
large amounts of error in the data
never scrimp on the time spent on data preparation and
cleaning
Data-Mining Failure (2 of 4)
Organizational resistance:
difficulties implementing a solution are still part of the
whole data-mining effort
to address resistance, educate and convince others about
the potential benefits of the solution
consider implementation in only a portion of the
organization
Data-Mining Failure (3 of 4)
Results that cannot be deployed:
predictive factors can be outside the organization's control,
or cannot legally be used in marketing or in making decisions
Data-Mining Failure (4 of 4)
Cause and effect:
you must be certain that inputs/predictors in a model
occur before the output
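One simple way to check this condition is to compare the date on which each predictor was measured with the date of the outcome. The sketch below assumes that such dates are recorded; the function predictors_after_outcome and the field names are hypothetical.

```python
from datetime import date


def predictors_after_outcome(predictor_dates, outcome_date):
    """Return the names of predictors measured on or after the outcome date."""
    return sorted(
        name
        for name, measured in predictor_dates.items()
        if measured >= outcome_date
    )


# Hypothetical churn record: when each predictor was measured.
predictor_dates = {
    "monthly_usage": date(2023, 3, 31),
    "support_calls": date(2023, 4, 15),
    "cancellation_reason": date(2023, 5, 2),
}
outcome_date = date(2023, 5, 1)  # assumed date on which the customer cancelled

print(predictors_after_outcome(predictor_dates, outcome_date))
# ['cancellation_reason']
```

A predictor flagged this way is only known once the outcome has happened, so it cannot be used to predict that outcome and should be left out of the model.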
Skills Needed for Data Mining (1 of 4)
Understanding the business:
asking the right data-mining question requires
knowledge of the specific business area and organization
evaluating a data-mining solution needs a business
perspective
Skills Needed for Data Mining (2 of 4)
Database knowledge:
the database administrator plays an important role:
Which data tables or files are available?
How are they linked?
How are the fields coded?
What are reasonable data values?
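Once the reasonable range for a field is known, a quick check can flag records that fall outside it. This is a minimal sketch: the function out_of_range, the customer records, and the age limits of 18 and 110 are assumptions made for illustration, not rules from any real database.

```python
def out_of_range(records, field, low, high):
    """Return the records whose value for `field` is missing or outside [low, high]."""
    flagged = []
    for record in records:
        value = record.get(field)
        if value is None or not (low <= value <= high):
            flagged.append(record)
    return flagged


# Hypothetical customer table with some implausible ages.
customers = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -1},
    {"id": 3, "age": None},
    {"id": 4, "age": 212},
    {"id": 5, "age": 67},
]

bad = out_of_range(customers, "age", 18, 110)
print([r["id"] for r in bad])  # [2, 3, 4]
```

Values such as -1 are often codes for "unknown" rather than real measurements, which is one reason the question of how fields are coded belongs with the database administrator.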
Skills Needed for Data Mining (3 of 4)
Knowledge of data-mining techniques:
best tools for situation
fine-tuning techniques
assess effects of data on outcome
identify anomalies
Skills Needed for Data Mining (4 of 4)
Teamwork combining multiple competencies,
such as:
business domain knowledge
database knowledge
data-mining algorithms
project management