DSS chapter 5, Data mining
Why data mining?
- More intense competition at the global scale (differentiation)
- Recognition of the value in data sources that are continuously growing.
- Availability of quality data on customers, vendors, web, transactions
- Exponential increase in data processing and storage capabilities, decrease in cost
- to gain a better understanding of customers and own operations, and to solve complex org
problems
- Generally, data mining is a way to develop intelligence from data that an org collects,
organizes, stores and analyzes.
Data Mining: process of discovering patterns, relationships and insights within large datasets,
typically through computational algorithms and statistical techniques (BIDA)
- Involves extracting valuable knowledge and actionable info from vast amounts of
data, which may be structured, semi-structured or unstructured
- Data mining is primarily concerned with knowledge discovery.
- Aims to find patterns, trends, associations and anomalies within data.
Data mining importance: by uncovering patterns, trends and relationships within data, data
mining enables orgs to make informed decisions, optimize processes and gain a competitive
advantage.
Data mining (Fayyad et al 1996) : the nontrivial process of identifying valid, novel (new),
potentially useful and ultimately understandable patterns in data stored in structured databases.
• Data mining is positioned at the intersection of many disciplines.
• Data mining and AI are closely related fields; they often intersect and complement each other.
Data: a collection of facts usually obtained as the result of experiences, observations or
experiments.
- May consist of numbers, words, images, voice recordings etc.
- Data is the lowest level of abstraction (from which info and knowledge are derived)
Structured data is what data mining algorithms use. It can be classified as:
1. Categorical, ex: race, gender, age group, and educational level. Can be subdivided
into:
a. Nominal Data: simple codes assigned to objects as labels (tags) which are not
measurements (single, married, divorced, yes/no)
b. Ordinal data: labels that also represent the rank order among them (high, medium,
low)
2. Numeric, ex: age, number of children, total household income, temperature, miles. Can
be subdivided into interval or ratio.
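The difference between nominal and ordinal data matters when values are turned into numbers for an algorithm. The sketch below is only an illustration: the attribute values come from the examples above, while the function names and the choice of encodings are assumptions.

```python
# Nominal values are labels only, so one-hot encoding avoids implying any order.
def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

# Ordinal values carry a rank order, so mapping them to ranks preserves it.
def ordinal_rank(value, ordered_levels):
    return ordered_levels.index(value)

marital_status = ["single", "married", "divorced"]   # nominal
risk_levels = ["low", "medium", "high"]              # ordinal, lowest first

print(one_hot("married", marital_status))   # [0, 1, 0]
print(ordinal_rank("high", risk_levels))    # 2
```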
• Data mining extracts patterns from data:
- A data mining pattern is a recurring and meaningful observation or structure within a
dataset that is not immediately apparent but can be identified through statistical,
mathematical, or computational techniques.
- These patterns are derived from the data through various techniques and algorithms and
can provide valuable insights into the underlying information
Types of Patterns
1. Association: find the commonly co-occurring groupings of things (ex: toothpaste and
toothbrush)
2. Cluster (segmentation): identify natural groupings of things based on their known
characteristics (ex: demographics, similar things together)
3. Prediction: tell the nature of future occurrences of certain events based on what has
happened in the past (2 methods: classification and regression)
4. Sequential relationships: discover logical sequence of actions or events (ex : symptoms,
receives diagnosis, starts medication, follows up with doctor)
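An association pattern such as the toothpaste and toothbrush example can be sketched by counting how often pairs of items appear in the same transaction. The transactions below are made up for illustration, and `pair_support` is a hypothetical helper name; real association mining usually relies on more elaborate algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each is the set of items bought together.
transactions = [
    {"toothpaste", "toothbrush", "soap"},
    {"toothpaste", "toothbrush"},
    {"soap", "shampoo"},
    {"toothpaste", "toothbrush", "shampoo"},
]

def pair_support(transactions):
    """Return the fraction of transactions that contain each item pair."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives every pair one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}

support = pair_support(transactions)
best_pair = max(support, key=support.get)
print(best_pair, support[best_pair])   # ('toothbrush', 'toothpaste') 0.75
```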
Regression analysis: a powerful statistical method that allows you to examine the relationship
between two or more variables of interest. It is similar to classification, except that we forecast a
number instead of a category.
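As a minimal sketch of forecasting a number, the code below fits a straight line to one input variable with ordinary least squares. The spend and sales figures are invented, and `fit_line` is a hypothetical name; regression in practice often involves several input variables.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on y = 2 + 3x.
spend = [1, 2, 3, 4]
sales = [5, 8, 11, 14]

intercept, slope = fit_line(spend, sales)
forecast = intercept + slope * 5     # forecast for an unseen input
print(intercept, slope, forecast)    # 2.0 3.0 17.0
```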
Data mining techniques:
1. Time-series forecasting: part of sequence (trend) or link analysis; extrapolates future values from past ones.
2. Visualization: to gain a clearer understanding of underlying relationships, easier and
faster.
Types of data mining
1. Hypothesis-driven: data mining using the right-sized data (through surveys)
2. Discovery-driven: data mining (without preconceived hypotheses) using as much data as
possible.
Data mining applications
• Customer relationship mgt (CRM):
- maximize return on marketing campaigns (customer profiling)
- improve customer retention (churn analysis/customer attrition)
- maximize customer value (cross-selling, up-selling)
• Banking and other financial:
- Automate the loan application process by predicting defaulters
- Detecting fraudulent transactions
- Maximize customer value
- Optimize cash reserves
• Retailing and logistics
- Optimize inventory levels at different locations based on predicted sales volumes
- Improve the store layout and sales promotions (with market-basket analysis)
Market-basket analysis: finds associations between pairs of products purchased together to identify
patterns of co-occurrence.
- Optimize logistics by predicting seasonal effects
- Minimize losses due to limited shelf life (by analyzing sensory and RFID data)
• Manufacturing and Maintenance
- Predict/prevent machinery failure (condition-based maintenance)
- Identify anomalies (irregularities)
- Discover novel patterns to improve product quality
• Brokerage and securities trading
- Predict changes in certain bond prices
- Forecast the direction of stock fluctuations
- Assess the effect of events on market movements
- Identify and prevent fraudulent activities in trading
• Insurance
- Forecast claim costs for better business planning
- Determine optimal rate plans
- Optimize marketing
- Identify and prevent fraudulent claim activities
Data mining processes
(a systematic way to conduct data mining projects)
- Based on best practices, several processes are proposed to maximize the chances of
success in conducting data mining projects
Processes: can be workflows or simple step-by-step approaches
Most common standard processes (methodology)
- CRISP-DM: Cross-industry standard process for data mining
- SEMMA: sample, explore, modify, model and assess
- KDD: knowledge discovery in databases
- The data mining process is iterative, and adjustments may be made at various stages
based on the insights gained and the performance of the models.
Dirty data: incomplete, missing, duplicate, inaccurate, inconsistent
Data cleaning: a crucial step in data analysis involving refining, correcting and preparing raw
data for meaningful insights and accurate decision making.
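The kinds of dirty data listed above can be illustrated with a small cleaning pass. This is a sketch under assumptions: the records are invented, the helper name `clean` is hypothetical, and filling missing values with the mean is only one of several possible strategies.

```python
# Hypothetical raw records: (customer name, age); None marks a missing value.
raw = [
    ("Alice ", 34),
    ("alice", 34),     # duplicate once the spelling is made consistent
    ("Bob", None),     # missing age
    ("Carol", 29),
    ("Dave", -5),      # inaccurate: an age cannot be negative
]

def clean(records):
    # 1. Make inconsistent spellings consistent.
    normalized = [(name.strip().title(), age) for name, age in records]
    # 2. Treat impossible values as missing.
    normalized = [(n, a if a is not None and a >= 0 else None)
                  for n, a in normalized]
    # 3. Drop duplicates while keeping the original order.
    unique = list(dict.fromkeys(normalized))
    # 4. Fill missing ages with the mean of the known ages (an assumption).
    known = [a for _, a in unique if a is not None]
    mean_age = sum(known) / len(known)
    return [(n, a if a is not None else mean_age) for n, a in unique]

cleaned = clean(raw)
print(cleaned)
```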
• Main purpose of data transformation: to improve the quality and structure of data, making
it easier for data mining algorithms to uncover patterns, insights and relationships.
- Proper transformation can lead to better model performance, accuracy, and more actionable
insights.
Data Mining Process: CRISP-DM
- Standardized process (methodology). Most popular.
- Proposed in the mid-1990s by a European consortium of companies
- Nonproprietary (free) standard methodology.
Process (steps):
Step 1. Business understanding
Step 2. Data understanding
Step 3. Data preparation
Step 4. Model building
Step 5. Testing and evaluation
Step 6. Deployment
- Steps 1 to 3 together account for about 85% of total project time; data preparation is usually the most time-consuming step
- The process is highly repetitive and experimental
(Data preparation) Normalization: usually involves adjusting values to a common scale so
that different variables can be compared accurately and effectively in analysis.
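One common way to bring variables onto a shared scale is min-max normalization, sketched below. The notes do not name a specific method, so this choice is an assumption, as are the sample values and the function name.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20000, 50000, 80000]   # hypothetical household incomes
ages = [20, 35, 50]               # hypothetical ages

print(min_max_normalize(incomes))   # [0.0, 0.5, 1.0]
print(min_max_normalize(ages))      # [0.0, 0.5, 1.0]
```

After rescaling, income and age both lie between 0 and 1, so neither dominates a comparison simply because of its units.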
Data Mining Process: SEMMA
- Sample: begin with a statistically representative sample of the data
- Explore: apply exploratory statistical and visualization techniques
- Modify: select & transform the most significant predictive variables
- Model: model the variables to predict outcomes
- Assess: confirm a model's accuracy
• CRISP-DM and SEMMA are driven by a highly iterative experimentation cycle
Data mining = explaining the past (by data exploration) and predicting the future by means of
data analysis.
Data Mining Process: Classification (most frequently used data mining method)
- Classification is a predictive data mining method/technique that assigns items in a data
set to target classes, grouping records into a class based on their characteristics
- The goal of classification is to accurately predict the target class for each case in the data.
For example, a classification model could be used to identify loan applications as low,
medium or high credit risks, or the weather as sunny, rainy or cloudy
- It’s part of the machine-learning family; learns patterns from past data.
Used for: spam filtering, language detection, search of similar documents, recognition and fraud
detection.
- If what is being predicted is a class label (sunny, rainy or cloudy), the prediction problem is
called classification
- If it is a numeric value (temperature 20°C), the prediction problem is called regression
Classification terminology:
Row = example or instance
Column = attribute
Output attribute = the one we want to determine/predict
Input attribute = everything else
Nominal attribute = values that are “names” of categories
Numeric attribute = values that are numbers
• Estimation methodologies for Classification
Simple split (or holdout or test sample estimation): split the data into 2 mutually exclusive sets,
one for training and one for testing.
- Main criticism: it assumes that the data in the 2 subsets are of the same kind (same
properties/characteristics)
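A simple split can be sketched as a shuffle followed by a cut. The one-third test fraction, the fixed seed and the function name are assumptions chosen for illustration; the notes do not specify them.

```python
import random

def simple_split(records, test_fraction=1/3, seed=42):
    """Shuffle the records, then cut them into disjoint training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(30))        # stand-ins for 30 labeled examples
train, test = simple_split(records)
print(len(train), len(test))     # 20 10
```

Shuffling before the cut reduces the risk named in the criticism above, but it cannot guarantee that both subsets share the same properties.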
Classification techniques/algorithms
- Decision tree analysis (machine-learning technique): the most popular classification technique in
the data mining arena.
- Statistical analysis
- Neural networks
- Support vector machines
- Rough sets
- Case-based reasoning (CBR)
- Bayesian classifiers
- Genetic algorithms
Classification technique – ANN:
ANN (artificial neural network): a type of artificial intelligence that imitates some functions of the human mind. It has a
natural tendency to store experiential knowledge.
- Can learn by example
- The use of ANN in the solution of a task initially involves a learning phase, which is
when the network extracts the patterns, thereby creating a specific representation of the
problem.
- Described as one of the best techniques to model the stock market (does not contain
standard formulas and may be easily adapted to market changes)
ANN applications:
1. Speech to text transcription
2. Handwriting recognition
3. Weather prediction
4. Facial recognition
5. Chatbots
6. Stock market prediction
7. Delivery route planning and optimization
Classification techniques – Decision Trees
- employs the divide and conquer method
- repeatedly divides a training set until each division consists of examples from one class
A general algorithm for decision tree building:
1. Create a root node and assign all the training data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of the split. Split the data into mutually exclusive
subsets along the lines of the specific split.
4. Repeat steps 2 and 3 for every leaf node until the stopping criteria are reached.
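The four steps above can be sketched as a short recursive function for categorical attributes. Several details are assumptions not stated in the notes: Gini impurity as the measure of the "best" split, the two stopping criteria, and the made-up loan data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Step 2: pick the attribute whose split has the lowest weighted impurity."""
    def weighted_impurity(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return min(attributes, key=weighted_impurity)

def build_tree(rows, labels, attributes):
    # Step 1 is the first call: the root receives all the training data.
    # Stopping criteria (assumed): the node is pure, or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    node = {"attribute": attr, "branches": {}}
    # Step 3: one branch per value, each with a mutually exclusive subset.
    for value in sorted(set(row[attr] for row in rows)):
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        # Step 4: repeat the procedure on each new node.
        node["branches"][value] = build_tree(sub_rows, sub_labels, remaining)
    return node

def classify(tree, row):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree

# Hypothetical loan applications.
rows = [
    {"income": "high", "owns_home": "yes"},
    {"income": "high", "owns_home": "no"},
    {"income": "low", "owns_home": "yes"},
    {"income": "low", "owns_home": "no"},
]
labels = ["low risk", "low risk", "low risk", "high risk"]

tree = build_tree(rows, labels, ["income", "owns_home"])
print(classify(tree, {"income": "low", "owns_home": "no"}))   # high risk
```

This sketch only handles attribute values seen during training; production tools also deal with numeric attributes, unseen values and pruning.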
Cluster Analysis (aka segmentation) :
- Finding groups of objects such that the objects in a group are similar or related to one
another and different from or unrelated to the objects in other groups.
- Used for automatic identification of natural groupings of things, based on their known
characteristics, e.g., demographics, shapes, colors, photos, etc.
▪ Part of the machine-learning family.
▪ Employs unsupervised learning.
▪ Learns the clusters of things from past data, then assigns new instances (data).
▪ There is NO output variable.
▪ Also known as Segmentation
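To illustrate learning clusters from past data and then assigning a new instance, the sketch below uses k-means. The notes do not name a clustering algorithm, so k-means is an assumption, as are the sample points, the starting centroids and the function names.

```python
def nearest_index(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: (point[0] - centroids[i][0]) ** 2
                           + (point[1] - centroids[i][1]) ** 2)

def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_index(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical past data with two visible groups; no output variable is given.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])

print(clusters[0])                         # [(1, 1), (1, 2), (2, 1)]
print(nearest_index((2, 2), centroids))    # 0: the new instance joins the first cluster
```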
Data mining software: used for the process of identifying patterns, analyzing data, and transforming
unstructured data into structured and valuable information that can be used to make informed
business decisions.
- Data mining software allows the organization to analyze data from a wide range of
databases and detect patterns
Data mining tools: main aim is to find data, extract data, refine data, distribute the information
and monetize it.
- Data mining is important because it extracts insights from data, whether it's structured or
unstructured.
- Structured data refers to data that has been organized into columns and rows for efficient
modification.
Data mining = an interdisciplinary science that combines computer science and
mathematical algorithms carried out by a machine.
Data mining: a powerful analytical tool that enables business executives to advance from
describing the nature of the past to predicting the future. → increase revenue, reduce expenses,
identify fraud, and locate business opportunities, offering a whole new realm of competitive
advantage.
Data Mining Myths