Data Mining for Business
Intelligence
Data Mining Concepts and
Definitions
Why Data Mining?
More intense competition at the global
scale
Recognition of the value in data sources
Availability of quality data on customers,
vendors, transactions, Web, etc.
Consolidation and integration of data
repositories into data warehouses
The exponential increase in data
processing and storage capabilities; and
decrease in cost
Movement toward conversion of
Definition of Data Mining
The nontrivial process of identifying
valid, novel, potentially useful, and
ultimately understandable patterns in
data stored in structured databases
- Fayyad et al. (1996)
Keywords in this definition: Process,
nontrivial, valid, novel, potentially useful,
understandable
Data mining: a misnomer?
Other names: knowledge extraction,
pattern analysis, knowledge discovery,
information harvesting, pattern
Data Mining at the Intersection
of Many Disciplines
[Figure: data mining shown at the intersection of
pattern recognition, machine learning, mathematical
modeling, databases, and management science &
information systems]
Data Mining
Characteristics/Objectives
Source of data for DM is often a
consolidated data warehouse (not
always!).
DM environment is usually a client-server
or a Web-based information systems
architecture.
Data is the most critical ingredient for
DM which may include soft/unstructured
data.
The miner is often an end user.
Striking it rich requires creative thinking.
Data in Data Mining
Data: a collection of facts usually obtained as
the result of experiences, observations, or
experiments
Data may consist of numbers, words, and
images
Data: lowest level of abstraction (from which
information and knowledge are derived)
[Figure: taxonomy of data. Data divides into
Categorical (Nominal, Ordinal) and Numerical
(Interval, Ratio)]
- DM with different data types?
- Other data types?
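The four measurement levels named above differ in which summaries are meaningful. The following is a minimal sketch, assuming small hypothetical samples (the variable names and values are illustrative, not from the slides): mode for nominal data, median for ordinal data, mean for interval data, and ratios for ratio data.

```python
from statistics import mean, mode

# Hypothetical sample values, invented for illustration only.
colors = ["red", "blue", "red", "green"]                  # nominal: labels, no order
sizes = ["small", "large", "medium", "small", "medium"]   # ordinal: ordered labels
size_rank = {"small": 0, "medium": 1, "large": 2}         # assumed ordering
temps_c = [18.5, 21.0, 19.5]                              # interval: differences meaningful
incomes = [30000, 45000, 60000]                           # ratio: true zero, ratios meaningful

# Nominal: only counting is meaningful, so summarize with the mode.
most_common_color = mode(colors)

# Ordinal: order is meaningful, so the median is a valid summary.
median_size = sorted(sizes, key=size_rank.get)[len(sizes) // 2]

# Interval: differences are meaningful, so the mean is a valid summary.
average_temp = mean(temps_c)

# Ratio: a true zero exists, so "twice as much" is a valid statement.
income_ratio = incomes[2] / incomes[0]

print(most_common_color, median_size, round(average_temp, 2), income_ratio)
```

A mining tool that treats an ordinal code as a ratio value (for example, averaging size codes) can produce patterns that have no real interpretation, which is one reason the data type matters before modeling.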
What Does DM Do? How Does it
Work?
DM extracts patterns from data
Pattern? A mathematical (numeric and/or
symbolic) relationship among data items
Types of patterns
Association: (beer & diapers in a market-basket
analysis)
Prediction: Predicts future occurrences based on the
past (Super Bowl winner, temperature on a specific day)
Cluster: (segmentation based on demographics or past
purchase behavior)
Sequential (or time series) relationships: existing
bank customer with checking account will open savings
account within a year
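To make the association pattern concrete, here is a minimal sketch of the two measures commonly used to score a rule such as "beer implies diapers": support and confidence. The transactions below are hypothetical toy data, and the function names are illustrative rather than taken from any particular tool.

```python
# Hypothetical toy transactions; item names are illustrative only.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

# Rule under test: {beer} -> {diapers}
rule_support = support({"beer", "diapers"}, transactions)
rule_confidence = confidence({"beer"}, {"diapers"}, transactions)
print(rule_support, rule_confidence)
```

In this toy data, two of five baskets contain both items and three contain beer, so the rule has support 2/5 and confidence 2/3.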
A Taxonomy for Data Mining
Tasks
Data Mining Task     Learning Method   Popular Algorithms
Prediction           Supervised        Classification and Regression Trees, ANN, SVM, Genetic Algorithms
  Classification     Supervised        Decision trees, ANN/MLP, SVM, Rough sets, Genetic Algorithms
  Regression         Supervised        Linear/Nonlinear Regression, Regression trees, ANN/MLP, SVM
Association          Unsupervised      Apriori, OneR, ZeroR, Eclat
  Link analysis      Unsupervised      Expectation Maximization, Apriori Algorithm, Graph-based Matching
  Sequence analysis  Unsupervised      Apriori Algorithm, FP-Growth technique
Clustering           Unsupervised      K-means, ANN/SOM
  Outlier analysis   Unsupervised      K-means, Expectation Maximization (EM)
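As an illustration of the unsupervised side of the taxonomy, the following is a minimal sketch of K-means, which the table lists for clustering. It assumes 2-D points and a deliberately naive initialization (the first k points become the starting centroids); the data and the function name `kmeans` are hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (centroids, assignments)."""
    # Naive initialization (an assumption of this sketch): first k points.
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignments

# Hypothetical data: two visibly separated groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

No class labels are supplied: the groups emerge from distances alone, which is what distinguishes this from the supervised tasks in the table. Production tools typically use better initialization and a convergence test instead of a fixed iteration count.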
Other Data Mining Tasks
These are in addition to the primary
DM tasks (prediction, association,
clustering)
Time-series forecasting
Part of sequence or link analysis?
Visualization
Another data mining task?
Types of DM
Hypothesis-driven data mining
Discovery-driven data mining
Data Mining Applications
Customer Relationship Management
Maximize return on marketing campaigns
Improve customer retention (churn analysis)
Maximize customer value (cross- or up-
selling)
Identify and treat most valued customers
Banking & Other Financial
Automate the loan application process
Detecting fraudulent transactions
Maximize customer value (cross- and up-
selling)
Optimizing cash reserves with forecasting
Data Mining Applications (cont.)
Retailing and Logistics
Optimize inventory levels at different
locations
Improve the store layout and sales
promotions
Optimize logistics by predicting seasonal
effects
Minimize losses due to limited shelf life
Manufacturing and Maintenance
Predict/prevent machinery failures
Identify anomalies in production systems to
optimize manufacturing capacity
Data Mining Applications (cont.)
Brokerage and Securities Trading
Predict changes on certain bond prices
Forecast the direction of stock fluctuations
Assess the effect of events on market
movements
Identify and prevent fraudulent activities in
trading
Insurance
Forecast claim costs for better business
planning
Determine optimal rate plans
Optimize marketing to specific customers
Data Mining Applications (cont.)
Computer hardware and software
Science and engineering
Government and defense
Homeland security and law enforcement
Travel industry
Healthcare
Medicine
(Slide callout: highly popular application areas for
data mining)
Entertainment industry
Sports
Etc.
Data Mining Methods:
Classification
Most frequently used DM method
Part of the machine-learning family
Employ supervised learning
Learn from past data, classify new
data
The output variable is categorical
(nominal or ordinal) in nature
Classification versus regression?
Classification versus clustering?
Classification Techniques
Decision tree analysis
Statistical analysis
Neural networks
Support vector machines
Case-based reasoning
Bayesian classifiers
Genetic algorithms
Rough sets
Decision Trees
Employs the divide and conquer method
Recursively divides a training set until
each division consists of examples from
one class
A general algorithm for decision tree building:
1. Create a root node and assign all of the
training data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each
value of the split. Split the data into
mutually exclusive subsets along the lines
of the specific split.
4. Repeat steps 2 and 3 for each and
every leaf node until the stopping criteria
are reached.
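The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production algorithm: it assumes categorical attributes, uses information gain (entropy reduction) as the "best splitting attribute" criterion, which the slides do not specify, and stops when a node is pure or no attributes remain. The loan data and all names are hypothetical.

```python
import math
from collections import Counter

def entropy(rows, target):
    """Shannon entropy of the target labels in rows."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def best_attribute(rows, attributes, target):
    """Step 2: pick the attribute with the highest information gain."""
    def gain(attr):
        remainder = 0.0
        for value in {r[attr] for r in rows}:
            subset = [r for r in rows if r[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset, target)
        return entropy(rows, target) - remainder
    return max(attributes, key=gain)

def build_tree(rows, attributes, target):
    """Steps 1-4: recursively split until a stopping criterion holds."""
    labels = [r[target] for r in rows]
    # Stopping criteria (assumed): pure node, or no attributes left.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != attr]
    # Step 3: one branch per value, with mutually exclusive subsets.
    branches = {}
    for value in sorted({r[attr] for r in rows}):
        subset = [r for r in rows if r[attr] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {attr: branches}

def classify(tree, row):
    """Follow branches to a leaf; raises KeyError for unseen values."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree

# Hypothetical training data for a loan decision.
rows = [
    {"income": "high", "credit": "good", "approve": "yes"},
    {"income": "high", "credit": "bad", "approve": "yes"},
    {"income": "high", "credit": "bad", "approve": "yes"},
    {"income": "low", "credit": "good", "approve": "yes"},
    {"income": "low", "credit": "bad", "approve": "no"},
    {"income": "low", "credit": "bad", "approve": "no"},
]
tree = build_tree(rows, ["income", "credit"], "approve")
prediction = classify(tree, {"income": "low", "credit": "good"})
print(tree, prediction)
```

On this toy data the root splits on income, because that split leaves the "high" branch pure; only the "low" branch needs a second split on credit.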
Data Mining Software
Commercial
IBM SPSS Modeler (formerly Clementine)
SAS – Enterprise Miner
IBM – Intelligent Miner
StatSoft – Statistica Data Miner
… many more
Free and/or Open Source
RapidMiner
Weka
… many more
[Figure: bar chart of data mining software, axis 0 to 120,
series "Total (w/ others)" and "Alone". Tools shown: SPSS PASW
Modeler (formerly Clementine), RapidMiner, SAS / SAS Enterprise
Miner, Microsoft Excel, Your own code, Weka (now Pentaho), KXEN,
MATLAB, Other commercial tools, KNIME, Microsoft SQL Server,
Other free tools, Zementis, Oracle DM, Statsoft Statistica,
Salford CART, Mars, other, Orange, Angoss, C4.5, C5.0, See5,
Bayesia, Insightful Miner/S-Plus (now TIBCO), Megaputer,
Viscovery, Clario Analytics, Miner3D, Thinkanalytics.
Source: KDNuggets.com, May 2009]
Data Mining Myths
Data mining …
provides instant solutions/predictions.
is not yet viable for business
applications.
requires a separate, dedicated
database.
can only be done by those with
advanced degrees.
is only for large firms that have lots of
customer data.
is another name for good-old
Common Data Mining Blunders
1. Selecting the wrong problem for data
mining
2. Ignoring what your sponsor thinks data
mining is and what it really can/cannot
do
3. Not leaving sufficient time for data
acquisition, selection and preparation
4. Looking only at aggregated results and
not at individual records/predictions
5. Being sloppy about keeping track of the
data mining procedure and results
Common Data Mining Mistakes
6. Ignoring suspicious (good or bad)
findings and quickly moving on
7. Running mining algorithms repeatedly
and blindly, without thinking about the
next stage
8. Naively believing everything you are
told about the data
9. Naively believing everything you are
told about your own data mining
analysis
10. Measuring your results differently from