Data mining
We will cover
• Data mining concepts and applications
• Process
• Methods
• Data mining myths and blunders
Meaning
• The process of digging through data to discover
hidden connections and predict future trends
• Data mining is the process of finding patterns
and correlations within large data sets to
predict outcomes. Using a broad range of
techniques, businesses can use this information
to increase revenues, cut costs, improve
customer relationships, reduce risks and more.
Example of Data Mining
• Grocery stores are well-known users of data
mining techniques. Many supermarkets offer
free loyalty cards to customers that give them
access to reduced prices not available to non-
members. The cards make it easy for stores to
track who is buying what, when they are buying it
and at what price. After analyzing the data, stores
can then use it to offer customers coupons
targeted to their buying habits and decide when to
put items on sale or when to sell them at full price.
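The loyalty-card analysis described above can be sketched as a small market-basket computation. This is a minimal illustration, not how any particular retailer does it: the baskets are invented, and `pair_rules` and `min_support` are hypothetical names chosen for this example.

```python
from collections import Counter
from itertools import combinations

# Invented baskets, one set of items per loyalty-card transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
    {"milk", "bread"},
]

def pair_rules(baskets, min_support=0.5):
    """Return (if_item, then_item, support, confidence) for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        # Confidence: of the baskets with the first item, how many hold the second.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Strongest rules first; ties broken alphabetically for a stable order.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))

rules = pair_rules(baskets)
for if_item, then_item, support, confidence in rules:
    print(f"{if_item} -> {then_item}: support={support:.2f}, confidence={confidence:.2f}")
```

A rule such as "butter -> bread" with high confidence is the kind of pattern a store might use to target a bread coupon at butter buyers.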
Applications
Data Mining Applications
• Customer Relationship Management
– Maximize return on marketing campaigns
– Improve customer retention (churn analysis)
– Maximize customer value (cross-, up-selling)
– Identify and treat most valued customers
• Banking and Other Financial
– Automate the loan application process
– Detect fraudulent transactions
– Maximize customer value (cross-, up-selling)
– Optimize cash reserves with forecasting
Data Mining Applications (cont.)
• Retailing and Logistics
– Optimize inventory levels at different locations
– Improve the store layout and sales promotions
– Optimize logistics by predicting seasonal effects
– Minimize losses due to limited shelf life
• Manufacturing and Maintenance
– Predict/prevent machinery failures
– Identify anomalies in production systems to optimize the
use of manufacturing capacity
– Discover novel patterns to improve product quality
Data Mining Applications (cont.)
• Brokerage and Securities Trading
– Predict changes on certain bond prices
– Forecast the direction of stock fluctuations
– Assess the effect of events on market movements
– Identify and prevent fraudulent activities in trading
• Insurance
– Forecast claim costs for better business planning
– Determine optimal rate plans
– Optimize marketing to specific customers
– Identify and prevent fraudulent claim activities
Data Mining Applications (cont.)
Highly popular application areas for data mining:
• Computer hardware and software
• Science and engineering
• Government and defense
• Homeland security and law enforcement
• Travel industry
• Healthcare
• Medicine
• Entertainment industry
• Sports
• Etc.
Process
Data Mining Process
• A manifestation of best practices
• A systematic way to conduct DM projects
• Different groups have different versions
• Most common standard processes:
– CRISP-DM (Cross-Industry Standard Process for
Data Mining)
– SEMMA (Sample, Explore, Modify, Model, and
Assess)
– KDD (Knowledge Discovery in Databases)
Data Mining Process: CRISP-DM
[Figure: the CRISP-DM cycle, arranged around the data sources: (1) Business Understanding, (2) Data Understanding, (3) Data Preparation, (4) Model Building, (5) Testing and Evaluation, (6) Deployment]
Data Mining Process: CRISP-DM (cont.)
Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Preparation (!)
Step 4: Model Building
Step 5: Testing and Evaluation
Step 6: Deployment
• Steps 1 to 3 account for ~85% of total project time
• The process is highly repetitive and
experimental (DM: art versus science?)
Data Preparation – A Critical DM Task
Real-world Data
• Data Consolidation
– Collect data
– Select data
– Integrate data
• Data Cleaning
– Impute missing values
– Reduce noise in data
– Eliminate inconsistencies
• Data Transformation
– Normalize data
– Discretize/aggregate data
– Construct new attributes
• Data Reduction
– Reduce number of variables
– Reduce number of cases
– Balance skewed data
Well-formed Data
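Three of the preparation tasks listed above (imputing missing values, normalizing, and discretizing) can be sketched with the Python standard library. This is a minimal illustration under simple assumptions: mean imputation and min-max scaling are only one choice among many, and the helper names, the sample ages, and the bin edges are invented for this example.

```python
from statistics import mean

def impute_mean(values):
    """Data cleaning: replace missing entries (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, edges, labels):
    """Data transformation: map each value to the label of the first bin whose
    upper edge exceeds it; values beyond every edge get the last label."""
    result = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v < edge:
                result.append(label)
                break
        else:
            result.append(labels[-1])
    return result

ages = [20, None, 40, 60]                 # invented column with one missing value
filled = impute_mean(ages)                # [20, 40, 40, 60]
scaled = min_max_normalize(filled)        # [0.0, 0.5, 0.5, 1.0]
bands = discretize(filled, edges=[30, 50], labels=["young", "middle", "senior"])
print(filled, scaled, bands)
```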
Data Mining Process: SEMMA
[Figure: the SEMMA cycle: Sample, Explore, Modify, Model, Assess]
Sample: Generate a representative sample of the data
Explore: Visualization and basic description of the data
Modify: Select variables, transform variable representations
Model: Use a variety of statistical and machine learning
models
Assess: Evaluate the accuracy and usefulness of the models
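The five SEMMA steps above can be walked through on a toy data set. This is only a sketch of the sequence, not the SAS tooling that SEMMA is associated with: the customer records are invented, the derived `monthly_spend` attribute is hypothetical, and the "model" is a single threshold rule standing in for a real statistical or machine learning model.

```python
import random
from statistics import mean

# Invented records: (monthly_visits, avg_basket_value, churned)
records = [
    (12, 40.0, False), (1, 15.0, True), (8, 55.0, False), (2, 20.0, True),
    (10, 35.0, False), (3, 10.0, True), (9, 60.0, False), (1, 25.0, True),
]

# Sample: draw a subset for modelling and hold the rest out for assessment.
rng = random.Random(0)  # fixed seed so the run is repeatable
sample = rng.sample(records, k=6)
holdout = [r for r in records if r not in sample]

# Explore: a basic description of the sampled data.
print("mean visits in sample:", mean(r[0] for r in sample))

# Modify: construct a new attribute from the existing ones.
def monthly_spend(record):
    visits, basket_value, _ = record
    return visits * basket_value

# Model: a one-threshold rule placed midway between the two class means.
churn_mean = mean(monthly_spend(r) for r in sample if r[2])
stay_mean = mean(monthly_spend(r) for r in sample if not r[2])
threshold = (churn_mean + stay_mean) / 2

def predict_churn(record):
    return monthly_spend(record) < threshold

# Assess: accuracy on the records that were not used to fit the rule.
accuracy = mean(predict_churn(r) == r[2] for r in holdout)
print("holdout accuracy:", accuracy)
```

The toy classes are cleanly separated, so the rule scores perfectly here; real data rarely behaves this way, which is why the Assess step matters.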
Data Mining Myths and Blunders
Data Mining Myths
• Data mining …
– provides instant solutions/predictions
– is not yet viable for business applications
– requires a separate, dedicated database
– can only be done by those with advanced degrees
– is only for large firms that have lots of customer
data
– is just another name for good old statistics
Common Data Mining Mistakes
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is
and what it really can/cannot do
3. Not leaving sufficient time for data acquisition,
selection and preparation
4. Looking only at aggregated results and not at
individual records/predictions
5. Being sloppy about keeping track of the data
mining procedure and results
Common Data Mining Mistakes (cont.)
6. Ignoring suspicious (good or bad) findings and
quickly moving on
7. Running mining algorithms repeatedly and blindly,
without thinking about the next stage
8. Naively believing everything you are told about
the data
9. Naively believing everything you are told about
your own data mining analysis
10. Measuring your results differently from the way
your sponsor measures them
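Mistake 4 in the list above (looking only at aggregated results) can be illustrated with a small, invented fraud example: a model that never flags fraud still reports high overall accuracy. The labels and predictions below are made up for this sketch.

```python
# Invented ground truth: 2 fraudulent transactions out of 100.
actual = [True] * 2 + [False] * 98
# A lazy "model" that predicts "not fraud" for every transaction.
predicted = [False] * 100

# Aggregated view: overall accuracy looks excellent.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Individual view: check each truly fraudulent record.
fraud_cases = [i for i, is_fraud in enumerate(actual) if is_fraud]
caught = [i for i in fraud_cases if predicted[i]]
recall = len(caught) / len(fraud_cases)

print(f"accuracy={accuracy:.2f}, fraud recall={recall:.2f}")
```

The aggregate figure hides the fact that every fraudulent record was missed, which only shows up when individual predictions are inspected.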