Overview
Machine Learning for
Business Analytics Using
RapidMiner
Shmueli, Bruce, Deokar & Patel
© Galit Shmueli, Peter Bruce, Amit Deokar, and Nitin Patel 2023
Core Ideas in Machine
Learning
●Classification
●Prediction
●Association Rules & Recommenders
●Data & Dimension Reduction
●Data Exploration
●Visualization
Paradigms for Machine Learning
(variations)
●SEMMA (from SAS)
• Sample
• Explore
• Modify
• Model
• Assess
●CRISP-DM (SPSS/IBM)
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment
Supervised Learning
●Goal: Predict a single “target” or “outcome”
variable
●Training data, where target value is known
●“Score” (apply the model to) data where value is not known
●Methods: Classification and Prediction
Unsupervised Learning
●Goal: Segment data into meaningful groups; detect patterns
●There is no target (outcome) variable to
predict or classify
●Methods: Association rules, collaborative
filters, data reduction & exploration,
visualization
Supervised: Classification
●Goal: Predict categorical target (outcome)
variable
●Examples: Purchase/no purchase, fraud/no
fraud, creditworthy/not creditworthy…
●Each row is a case (customer, tax return,
applicant)
●Each column is a variable
●Target variable is often binary (yes/no)
Supervised: Prediction
(Estimation)
●Goal: Predict numerical target (outcome)
variable
●Examples: sales, revenue, performance
●As in classification:
●Each row is a case (customer, tax return,
applicant)
●Each column is a variable
●Taken together, classification and
prediction constitute “predictive analytics”
Unsupervised: Association
Rules
●Goal: Produce rules that define “what goes
with what” in transactions
●Example: “If X was purchased, Y was also
purchased”
●Rows are transactions
●Used in recommender systems – “Our
records show you bought X, you may also
like Y”
●Also called “affinity analysis”
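A minimal sketch of mining such rules, assuming the mlxtend library is available; the tiny transaction table below is invented for illustration:

```python
# Hypothetical toy example: rows are transactions, columns flag purchased items
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules  # assumed installed

transactions = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 0, 0]],
    columns=["bread", "butter", "jam"],
).astype(bool)

# Find itemsets appearing in at least half the transactions, then derive rules
frequent = apriori(transactions, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```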
Unsupervised: Data
Reduction
●Distillation of complex/large data into
simpler/smaller data
●Reducing the number of variables/columns
(e.g., principal components)
●Reducing the number of records/rows (e.g.,
clustering)
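A minimal sketch of both reductions, using scikit-learn on synthetic stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))            # 100 records, 10 numeric variables

# Fewer columns: project onto the first 2 principal components
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)                    # (100, 2)

# Fewer rows: summarize the 100 records by 5 cluster centers
centers = KMeans(n_clusters=5, n_init=10, random_state=1).fit(X).cluster_centers_
print(centers.shape)                      # (5, 10)
```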
The Process of Machine
Learning
Steps in Machine Learning
1. Define/understand purpose
2. Obtain data (may involve random
sampling)
3. Explore, clean, pre-process data
4. Reduce the data; if supervised learning,
partition it
5. Specify task (classification, clustering,
etc.)
6. Choose the techniques (regression, CART,
neural networks, etc.)
7. Iterative implementation and “tuning”
8. Assess results – compare models
Obtaining Data: Sampling
●Machine learning typically deals with huge
databases
●For piloting/prototyping, algorithms and
models are typically applied to a sample
from a larger dataset (easier to handle)
●Once you develop and select a final model,
you use it to “score” (predict values or
classes for) the records in the larger
database
●Also called “inference”
Rare Event Oversampling
●Often the event of interest is rare
●Examples: response to mailing, fraud in
taxes, …
●Sampling may yield too few “interesting”
cases to effectively train a model
●A popular solution: oversample the rare
cases (equivalent to undersampling the
dominant cases) to obtain a more balanced
training set
●Later, need to adjust results for the
oversampling
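A minimal oversampling sketch in pandas; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical training data where fraud (the event of interest) is rare
df = pd.DataFrame({"amount": [10, 15, 12, 95, 11, 14],
                   "fraud":  [0, 0, 0, 1, 0, 0]})

rare = df[df["fraud"] == 1]
common = df[df["fraud"] == 0]

# Duplicate rare cases (sampling with replacement) to balance the classes
rare_up = rare.sample(n=len(common), replace=True, random_state=1)
balanced = pd.concat([common, rare_up]).sample(frac=1, random_state=1)
print(balanced["fraud"].value_counts())   # now 5 of each class
```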
Types of Variables (Features)
●Determine the types of pre-processing needed and the algorithms that can be used
●Main distinction: Categorical vs. numeric
●Numeric
●Continuous
●Integer
●Categorical
●Ordered (low, medium, high)
●Unordered (male, female)
Variable handling
●Numeric
●Most algorithms can handle numeric data
●May occasionally need to “bin” into
categories
●Categorical
●Naïve Bayes can use as-is
●In most other algorithms, must create n or n-1 binary dummies
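A minimal dummy-coding sketch with pandas; the column is hypothetical (loosely styled after the West Roxbury REMODEL variable):

```python
import pandas as pd

df = pd.DataFrame({"remodel": ["None", "Recent", "Old", "None"]})

# n dummies (one column per category) vs. n-1 dummies (first category dropped)
dummies_n = pd.get_dummies(df["remodel"])
dummies_n_minus_1 = pd.get_dummies(df["remodel"], drop_first=True)
print(dummies_n_minus_1)
```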
Data Pre-processing in RM - West Roxbury
data
Detecting Outliers
⚫An outlier is an observation that is
“extreme”, being distant from the rest of
the data (definition of “distant” is
deliberately vague)
⚫Outliers can have disproportionate influence on models (a problem if the outlier is spurious)
⚫An important step in data pre-processing is
detecting outliers
⚫Once an outlier is detected, domain knowledge is required to determine whether it is an error or truly extreme.
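A minimal sketch of one common detection approach, flagging values whose z-score exceeds 3; the threshold of 3 is a convention, not a rule from this book:

```python
import numpy as np

# Synthetic data: 100 typical values plus one extreme record
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=14, scale=1, size=100), 55.0)

z = (values - values.mean()) / values.std()   # standardize
print(values[np.abs(z) > 3])                  # flags the 55.0 record
```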
Detecting Outliers
⚫In some contexts, finding outliers is the
purpose of the ML exercise (e.g. airport
security screening). This is called “anomaly
detection”.
Handling Missing Data
⚫Most algorithms will not process records
with missing values. Default is to drop
those records.
⚫Solution 1: Omission
⚫ If a small number of records have missing values,
can omit them
⚫ If many records are missing values on a small set of
variables, can drop those variables (or use proxies)
⚫ If many records/variables have missing values,
omission is not practical
⚫Solution 2: Imputation
⚫ Replace missing values with reasonable substitutes
⚫ Lets you keep the record and use the rest of its (non-
missing) information
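A minimal sketch of both solutions in pandas, on a hypothetical numeric column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"rooms": [6, 7, np.nan, 5, np.nan, 8]})

df_omitted = df.dropna()                            # Solution 1: drop records
df_imputed = df.fillna(df.mean(numeric_only=True))  # Solution 2: mean imputation
print(df_imputed)
```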
Normalizing (Standardizing)
Data
⚫Used in some techniques when variables with
the largest scales would dominate and skew
results
⚫Puts all variables on same scale
⚫Normalizing function: Subtract mean and
divide by standard deviation
⚫Alternative function: scale to 0-1 by
subtracting minimum and dividing by the
range
⚫ Useful when the data contain both dummies and numeric variables
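A minimal sketch of both functions on a small numeric array:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

z_scored = (x - x.mean()) / x.std()             # subtract mean, divide by std dev
min_max = (x - x.min()) / (x.max() - x.min())   # rescale to the 0-1 range
print(z_scored, min_max)
```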
The Problem of Overfitting
⚫Statistical models can produce highly
complex explanations of relationships
between variables
⚫The “fit” may be excellent
⚫When used with new data, highly complex models do not perform as well.
100% fit – not useful for new
data
Overfitting (cont.)
Causes:
⚫ Too many predictors
⚫ A model with too many parameters
⚫ Trying many different models
Consequence: Deployed model will not work
as well as expected with completely new
data.
Partitioning the Data
Problem: How well will our model
perform with new data?
Solution: Separate data into two parts
⚫Training partition to develop the
model
⚫Validation partition (sometimes called
test) to implement the model and
evaluate its performance on “new”
data
Addresses the issue of overfitting
Holdout Partition
⚫ When a model is developed on training
data, it can overfit the training data
(hence need to assess on validation)
⚫ Assessing multiple models on same
validation data can overfit validation
data
⚫ Some methods use the validation data to
choose a parameter. This too can lead to
overfitting the validation data
⚫ Solution: final selected model is applied
to a holdout partition (sometimes
called test) to give unbiased estimate of
its performance on new data
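A minimal sketch of a 60/20/20 training/validation/holdout partition with scikit-learn; the proportions are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)

# Split off 20% as holdout first, then 25% of the remaining 80% as validation
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1)   # 0.25 * 80% = 20%
print(len(X_train), len(X_valid), len(X_holdout))     # 30 10 10
```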
Cross Validation
● Repeated partitioning = cross-validation (“cv”)
● k-fold cross validation, e.g. k=5
○ For each fold, set aside ⅕ of data as
validation
○ Use full remainder as training
○ The validation folds are non-overlapping
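A minimal 5-fold cross-validation sketch with scikit-learn's KFold:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 records

# Each record lands in exactly one (non-overlapping) validation fold
kf = KFold(n_splits=5, shuffle=True, random_state=1)
for fold, (train_idx, valid_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: validate on records {valid_idx}")
```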
Partitioning the Data
(60/40 split)
Example – Linear Regression
West Roxbury Housing Data
Predictive Modeling Process in RM
Training data: a few predictions and summary metrics
Validation data: a few predictions and summary metrics
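A minimal sketch of the same fit-then-score workflow with scikit-learn's linear regression, on synthetic stand-in data rather than the actual West Roxbury table:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: 3 numeric predictors, 1 target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([50.0, 20.0, -5.0]) + rng.normal(scale=10, size=200)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)            # 60/40 split, as above

model = LinearRegression().fit(X_train, y_train)    # fit on training partition
print(model.predict(X_valid[:5]))                   # a few validation predictions
```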
Error metrics
Error = actual – predicted
ME = Mean error
RMSE = Root-mean-squared error = Square
root of average squared error
MAE = Mean absolute error
MPE = Mean percentage error
MAPE = Mean absolute percentage error
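A minimal sketch computing these metrics with NumPy (MPE and MAPE assume nonzero actual values):

```python
import numpy as np

actual = np.array([100.0, 150.0, 200.0])
predicted = np.array([110.0, 140.0, 195.0])

err = actual - predicted                     # error = actual - predicted
me = err.mean()                              # ME
rmse = np.sqrt((err ** 2).mean())            # RMSE
mae = np.abs(err).mean()                     # MAE
mpe = 100 * (err / actual).mean()            # MPE, in percent
mape = 100 * np.abs(err / actual).mean()     # MAPE, in percent
print(me, rmse, mae, mpe, mape)
```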
AI Engineering, ML-Ops
⚫ In this book, focus is on developing/testing
the models that predict, classify, cluster,
recommend, forecast.
⚫ Developing/testing (prototyping) is the
initial stage; after selecting a model we
usually need to deploy it in a pipeline that
feeds data to it and generates actions.
⚫ Deployment phase = AI Engineering or, more specifically, ML-Ops
AI Engineering, ML-Operations (ML-
Ops)
Dozens of Tools
The “infrastructure” layer provides base computing capability,
memory, and networking
AI Engineering, ML-Ops
Security = admin, permissions, access rules
Monitoring = ingests logs, issues alerts
Automation = bring up, configure, tear down tools &
infrastructure
Resource Management = oversight, checking for resource
exhaustion
AI Engineering, ML-Ops
Tools for testing and debugging
AI Engineering, ML-Ops
Data store (data warehouse or data lake); also Analytic Base Table
(ABT) with derivatives more suited to analysis
AI Engineering, ML-Ops
Tools to create the ABT derivatives that are in the Data Collection layer.
AI Engineering, ML-Ops
Models: main focus of this book
Model Monitoring: ties into “Monitoring” component below
AI Engineering, ML-Ops
Delivery is how user views the system (text file, spreadsheet, interface
with Tableau or Power BI, …)
Summary
⚫ Machine Learning consists of supervised methods
(Classification & Prediction) and unsupervised methods
(Association Rules, Data Reduction, Data Exploration &
Visualization)
⚫ Before algorithms can be applied, data must be explored
and pre-processed
⚫ To evaluate performance and to avoid overfitting, data
partitioning is used
⚫ Models are fit to the training partition and assessed on
the validation and holdout partitions
⚫ Machine Learning methods are usually applied to a
sample from a large database, and then the best model
is used to score the entire database
⚫ Once a model is developed, AI Engineering (ML-Ops)
skills and tools are required to deploy it