DSS Chapter 5

Uploaded by

kkindamughrabi04

DSS Chapter 5: Data Mining

Why data mining?

- More intense competition at the global scale (differentiation)
- Recognition of the value in data sources that are continuously growing
- Availability of quality data on customers, vendors, web, transactions
- Exponential increase in data processing and storage capabilities, and decrease in cost
- The need to gain a better understanding of customers and own operations, and to solve complex org problems
- Generally, data mining is a way to develop intelligence from data that an org collects, organizes, stores and analyzes.

Data Mining: process of discovering patterns, relationships and insights within large datasets,
typically through computational algorithms and statistical techniques (BIDA)

- Involves extracting valuable knowledge and actionable info from vast amounts of
data, which may be structured, semi-structured or unstructured
- Data mining is primarily concerned with knowledge discovery.
- Aims to find patterns, trends, associations and anomalies within data.

Data mining importance: by uncovering patterns, trends and relationships within data, data
mining enables orgs to make informed decisions, optimize processes and gain a competitive
advantage.

Data mining (Fayyad et al., 1996): the nontrivial process of identifying valid, novel (new),
potentially useful and ultimately understandable patterns in data stored in structured databases.

• Data mining is positioned at the intersection of many disciplines.

• Data mining and AI are closely related fields; they often intersect and complement each other.

Data: a collection of facts usually obtained as the result of experiences, observations or
experiments.

- May consist of numbers, words, images, voice recordings etc.
- Data is the lowest level of abstraction (from which info and knowledge are derived)

Structured data is what data mining algorithms use. It can be classified as:

1. Categorical, ex: race, gender, age group, and educational group. Can be subdivided
into:
a. Nominal Data: simple codes assigned to objects as labels (tags) which are not
measurements (single, married, divorced, yes/no)
b. Ordinal data: labels that also represent the rank order among them (high, medium,
low)

2. Numeric, ex: age, number of children, total household income, temperature, miles. Can
be subdivided into interval or ratio.

• Data mining extracts patterns from data:

- A data mining pattern is a recurring and meaningful observation or structure within a
dataset that is not immediately apparent but can be identified through statistical,
mathematical, or computational techniques.
- These patterns are derived from the data through various techniques and algorithms and
can provide valuable insights into the underlying information.

Types of Patterns

1. Association: find the commonly co-occurring groupings of things (ex: toothpaste and
toothbrush)
2. Cluster (segmentation): identify natural groupings of things based on their known
characteristics (ex: demographics, similar things together)
3. Prediction: tell the nature of future occurrences of certain events based on what has
happened in the past (2 methods: classification and regression)
4. Sequential relationships: discover the logical sequence of actions or events (ex: patient
reports symptoms, receives diagnosis, starts medication, follows up with doctor)

Regression analysis: a powerful statistical method for examining the relationship
between two or more variables of interest. It is basically classification where we forecast a
number instead of a category.
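
To make "forecast a number" concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares. The function name `fit_line` and the experience/salary figures are made up for illustration; they are not from the chapter.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: years of experience vs. salary (in thousands)
years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

slope, intercept = fit_line(years, salary)
# Forecast a number (salary) for an unseen input (6 years)
predicted = intercept + slope * 6
print(slope, intercept, predicted)
```

The output is a numeric value rather than a class label, which is what separates regression from classification.
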
Data mining techniques:

1. Time-series forecasting: part of sequence (trend) or link analysis; past values are
extrapolated to estimate future ones.
2. Visualization: used to gain a clearer understanding of underlying relationships more
easily and quickly.

Types of data mining

1. Hypothesis-driven: data mining that tests a stated hypothesis using right-sized data
(ex: collected through surveys)
2. Discovery-driven: data mining without preconceived hypotheses, using as much data as
possible.

Data mining applications

• Customer relationship management (CRM):

- Maximize return on marketing campaigns (customer profiling)
- Improve customer retention (churn analysis/customer attrition)
- Maximize customer value (cross-selling, up-selling)

• Banking and other financial services:

- Automate the loan application process by predicting defaulters
- Detect fraudulent transactions
- Maximize customer value
- Optimize cash reserves

• Retailing and logistics

- Optimize inventory levels at different locations based on predicted sales volumes
- Improve the store layout and sales promotions (with market-basket analysis)
Market-basket analysis: finds associations between pairs of products purchased together
to identify patterns of co-occurrence.
- Optimize logistics by predicting seasonal effects
- Minimize losses due to limited shelf life (by analyzing sensory and RFID data)
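
Market-basket analysis can be sketched by counting how often pairs of products appear in the same basket. The baskets below are invented, and the two measures used (support and confidence) are the usual association-rule measures rather than something these notes define, so treat this as an illustrative assumption.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is one customer's basket
baskets = [
    {"toothpaste", "toothbrush", "floss"},
    {"toothpaste", "toothbrush"},
    {"toothpaste", "soap"},
    {"toothbrush", "soap"},
    {"toothpaste", "toothbrush", "soap"},
]


def pair_support(baskets):
    """Fraction of baskets that contain each pair of products."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair a single canonical ordering
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: count / len(baskets) for pair, count in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Among baskets containing `antecedent`, the share that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(1 for b in with_antecedent if consequent in b) / len(with_antecedent)


support = pair_support(baskets)
brush_paste = support[("toothbrush", "toothpaste")]
paste_to_brush = confidence(baskets, "toothpaste", "toothbrush")
print(brush_paste, paste_to_brush)
```

A pair with high support and high confidence is a candidate for shelf placement or a joint promotion.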

• Manufacturing and Maintenance

- Predict/prevent machinery failure (condition-based maintenance)
- Identify anomalies (irregularities)
- Discover novel patterns to improve product quality

• Brokerage and securities trading

- Predict changes in certain bond prices


- Forecast the direction of stock fluctuations
- Assess the effect of events on market movements
- Identify and prevent fraudulent activities in trading

• Insurance

- Forecast claim costs for better business planning


- Determine optimal rate plans
- Optimize marketing
- Identify and prevent fraudulent claim activities

Data mining processes

(a systematic way to conduct data mining projects)

- Based on best practices, several processes are proposed to maximize the chances of
success in conducting data mining projects

Processes: can be workflows or simple step-by-step approaches

Most common standard processes (methodology)

- CRISP-DM: Cross-industry standard process for data mining


- SEMMA: sample, explore, modify, model and assess
- KDD: knowledge discovery in databases

- The data mining process is iterative, and adjustments may be made at various stages
based on the insights gained and the performance of the models.

Dirty data: incomplete, missing, duplicate, inaccurate, inconsistent

Data cleaning: a crucial step in data analysis involving refining, correcting and preparing raw
data for meaningful insights and accurate decision making.
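
A minimal sketch of data cleaning on a handful of dirty records. The field names, the sample rows and the `clean` helper are hypothetical, and real projects usually need domain-specific rules on top of this.

```python
def clean(records, required):
    """Normalize text, drop incomplete rows, and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Inconsistent text: trim whitespace and unify letter case
        rec = {k: v.strip().lower() if isinstance(v, str) else v
               for k, v in rec.items()}
        # Incomplete/missing: skip rows lacking any required field
        if any(rec.get(field) in (None, "") for field in required):
            continue
        # Duplicates: two rows with the same required fields count as one
        key = tuple(rec[field] for field in required)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


raw = [
    {"name": "Alice ", "city": "Amman"},
    {"name": "alice", "city": "amman"},  # duplicate once normalized
    {"name": "Bob", "city": None},       # missing value
    {"name": "Carol", "city": "Irbid"},
]

cleaned = clean(raw, required=["name", "city"])
print(cleaned)
```

Dropping incomplete rows is only one option; filling in missing values is another, and the right choice depends on the data.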

• Main purpose of data transformation: to improve the quality and structure of data, making
it easier for data mining algorithms to uncover patterns, insights and relationships.

- Proper transformation can lead to better model performance, accuracy, and more actionable
insights.

Data Mining Process: CRISP-DM

- Standardized process (methodology). Most popular.
- Proposed in the mid-1990s by a European consortium of companies
- Nonproprietary (free) standard methodology.

Process (steps):

Step 1. Business understanding

Step 2. Data understanding

Step 3. Data preparation

Step 4. Model building

Step 5. Testing and evaluation

Step 6. Deployment

- Steps 1 to 3 together account for about 85% of total project time

- The process is highly repetitive and experimental

Normalization (data preparation): usually involves adjusting values to a common scale so
that different variables can be compared accurately and effectively in analysis.
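
One common way to bring variables onto a shared scale is min-max normalization, which maps each value into the range 0 to 1. The notes do not say which normalization method is meant, so this choice and the sample numbers are assumptions.

```python
def min_max(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical variables measured on very different scales
ages = [20, 30, 40, 60]
incomes = [20000, 50000, 80000, 120000]

scaled_ages = min_max(ages)
scaled_incomes = min_max(incomes)
print(scaled_ages)
print(scaled_incomes)
```

After scaling, age and income both lie between 0 and 1, so neither dominates a distance-based method simply because its raw numbers are larger.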

Data Mining Process: SEMMA

- Sample: begin with a statistically representative sample of the data
- Explore: apply exploratory statistical and visualization techniques
- Modify: select and transform the most significant predictive variables
- Model: model the variables to predict outcomes
- Assess: confirm a model's accuracy

• CRISP-DM and SEMMA are driven by a highly iterative experimentation cycle

Data mining = explaining the past (by data exploration) and predicting the future by means of
data analysis.
Data Mining Process: Classification (most frequently used data mining method)

- Classification is a predictive data mining method/technique that assigns the items in a
data set to target classes, grouping records into a class based on their characteristics.
- The goal of classification is to accurately predict the target class for each case in the data.
For example, a classification model could be used to identify loan applications as low,
medium or high credit risks, or to label the weather as sunny, rainy or cloudy.
- It’s part of the machine-learning family; learns patterns from past data.

Used for: spam filtering, language detection, search of similar documents, recognition, and fraud
detection.

- If what is being predicted is a class label (sunny, rainy or cloudy), the prediction problem
is called classification.
- If it is a numeric value (temperature of 20°C), the prediction problem is called regression.

Classification terminology:

Row = example or instance

Column = attribute

Output attribute = the one we want to determine/predict

Input attribute = everything else

Nominal attribute = values that are “names” of categories

Numeric attribute = values that are numbers

• Estimation methodologies for Classification

Simple split (or holdout or test sample estimation): split the data into 2 mutually exclusive sets,
one for training and one for testing.

- Main criticism: it assumes that the data in the 2 subsets are of the same kind (same
properties/characteristics)
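
A minimal sketch of the simple split: shuffle the rows, hold part of them out for testing, train on the rest, then measure accuracy on the held-out part. The 30% test share, the fixed seed, the synthetic rows and the one-threshold "model" are all illustrative assumptions.

```python
import random


def simple_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and cut it into two mutually exclusive sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def train_threshold(train_rows):
    """Toy classifier: learn one cut point between the two classes."""
    lows = [features[0] for features, label in train_rows if label == "low"]
    highs = [features[0] for features, label in train_rows if label == "high"]
    cut = (max(lows) + min(highs)) / 2
    return lambda features: "high" if features[0] >= cut else "low"


def accuracy(predict, test_rows):
    """Share of held-out rows whose predicted label matches the true label."""
    correct = sum(1 for features, label in test_rows if predict(features) == label)
    return correct / len(test_rows)


# Hypothetical data set: one numeric input attribute and a class label
rows = [((x,), "high" if x >= 50 else "low") for x in range(100)]

train, test = simple_split(rows)
model = train_threshold(train)
acc = accuracy(model, test)
print(len(train), len(test), acc)
```

Because the split is random, a single holdout estimate depends on which rows land in the test set, which is the weakness the criticism above points at.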

Classification techniques/algorithms

- Decision tree analysis (machine-learning technique): the most popular classification
technique in the data mining arena
- Statistical analysis
- Neural networks
- Support vector machines
- Rough sets
- Case-based reasoning (CBR)
- Bayesian classifiers
- Genetic algorithms

Classification technique – ANN:

ANN (artificial neural network): a type of artificial intelligence that imitates some functions
of the human mind. It has a natural tendency to store experiential knowledge.

- Can learn by example

- The use of ANN in the solution of a task initially involves a learning phase, which is
when the network extracts the patterns, thereby creating a specific representation of the
problem.
- Described as one of the best techniques to model the stock market (does not contain
standard formulas and may be easily adapted to market changes)
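
The learning phase can be illustrated with the smallest possible network: one artificial neuron trained with the perceptron rule. This is a deliberately tiny stand-in for a real ANN; the AND-gate examples, the learning rate and the epoch count are assumptions chosen for illustration.

```python
def train_perceptron(examples, epochs=20, lr=1.0):
    """Learn weights for a single neuron from (inputs, target) examples."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for (x1, x2), target in examples:
            # Forward pass: weighted sum followed by a step activation
            output = 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0
            # Learning: nudge the weights in proportion to the error
            error = target - output
            weights[0] += lr * error * x1
            weights[1] += lr * error * x2
            bias += lr * error
    return weights, bias


def predict(weights, bias, x1, x2):
    return 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0


# Learning by example: the logical AND of two binary inputs
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(examples)
outputs = [predict(weights, bias, x1, x2) for (x1, x2), _ in examples]
print(weights, bias, outputs)
```

No formula for AND is written into the code; the weights that reproduce it are extracted from the examples, which is the sense in which the network "learns by example".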

ANN applications:

1. Speech-to-text transcription
2. Handwriting recognition
3. Weather prediction
4. Facial recognition
5. Chatbots
6. Stock market prediction
7. Delivery route planning and optimization


Classification techniques – Decision Trees

- Employs the divide-and-conquer method

- Repeatedly divides a training set until each division consists of examples from one class

A general algorithm for decision tree building:

1. Create a root node and assign all the training data to it.

2. Select the best splitting attribute.

3. Add a branch to the root node for each value of the split. Split the data into mutually exclusive
subsets along the lines of the specific split.

4. Repeat steps 2 and 3 for every leaf node until the stopping criterion is reached.
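
The four steps above can be sketched as a short recursive function. The notes do not say how the "best" splitting attribute is chosen; this sketch assumes the attribute whose split leaves the lowest weighted entropy (the information-gain idea), and the loan data is invented.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Impurity of the class labels in a set of (features, label) rows."""
    counts = Counter(label for _, label in rows)
    return -sum((c / len(rows)) * log2(c / len(rows)) for c in counts.values())


def partition(rows, attr):
    """Split rows into mutually exclusive subsets, one per attribute value."""
    groups = {}
    for features, label in rows:
        groups.setdefault(features[attr], []).append((features, label))
    return groups


def build_tree(rows, attributes):
    labels = [label for _, label in rows]
    # Stopping criterion: pure node or no attributes left -> majority-class leaf
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: pick the attribute with the lowest weighted entropy after the split
    def remainder(attr):
        return sum(len(g) / len(rows) * entropy(g)
                   for g in partition(rows, attr).values())
    best = min(attributes, key=remainder)
    remaining = [a for a in attributes if a != best]
    # Steps 3 and 4: one branch per value, then recurse into each subset
    return {best: {value: build_tree(subset, remaining)
                   for value, subset in partition(rows, best).items()}}


def classify(tree, features):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][features[attr]]
    return tree


# Hypothetical training set (step 1: all of it starts at the root)
rows = [
    ({"income": "high", "debt": "low"}, "low risk"),
    ({"income": "high", "debt": "high"}, "low risk"),
    ({"income": "low", "debt": "low"}, "low risk"),
    ({"income": "low", "debt": "high"}, "high risk"),
]

tree = build_tree(rows, ["income", "debt"])
print(tree)
print(classify(tree, {"income": "low", "debt": "high"}))
```

This sketch only handles categorical attributes and raises a `KeyError` for attribute values it never saw during training.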

Cluster Analysis (aka segmentation) :

- Finding groups of objects such that the objects in one group are similar or related to one
another and different from or unrelated to the objects in other groups.
- Used for automatic identification of natural groupings of things, based on their known
characteristics, e.g., demographics, shapes, colors, photos, etc.
▪ Part of the machine-learning family.
▪ Employs unsupervised learning.
▪ Learns the clusters of things from past data, then assigns new instances
(data) to them.
▪ There is NO output variable.
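
A minimal sketch of clustering with k-means. The notes do not name an algorithm, so k-means is an assumed (but common) choice; the 2-D points, the value of k, and the use of the first k points as starting centroids are also illustrative.

```python
def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def assign(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))


def k_means(points, k, iterations=10):
    centroids = list(points[:k])  # simple deterministic start
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[assign(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, clusters


# Hypothetical unlabeled data: there is no output variable
points = [(1, 1), (8, 8), (1, 2), (2, 1), (9, 8), (8, 9)]

centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)

# A new instance is assigned to the nearest learned cluster
new_cluster = assign((0, 0), centroids)
print(new_cluster)
```

The groups come purely from how close the points are to each other; no class labels are supplied, which is why this is unsupervised learning.
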

Data mining software: software for identifying patterns, analyzing data, and transforming
unstructured data into structured and valuable information that can be used to make informed
business decisions.

- Data mining software allows the organization to analyze data from a wide range of
databases and detect patterns

Data mining tools: the main aim is to find data, extract data, refine data, distribute the
information and monetize it.

- Data mining is important because it extracts insights from data, whether it's structured or
unstructured.

- Structured data refers to data that has been organized into columns and rows for efficient
modification.

Data mining = an interdisciplinary science that combines computer science and
mathematical algorithms carried out by a machine.

Data mining: a powerful analytical tool that enables business executives to advance from
describing the nature of the past to predicting the future. → increase revenue, reduce expenses,
identify fraud, and locate business opportunities, offering a whole new realm of competitive
advantage.

Data Mining Myths
