Overview
Machine Learning for
Business Analytics Using
RapidMiner
Shmueli, Bruce, Deokar & Patel
© Galit Shmueli, Peter Bruce, Amit Deokar, and Nitin Patel 2023
Core Ideas in Machine
Learning
●Classification
●Prediction
●Association Rules & Recommenders
●Data & Dimension Reduction
●Data Exploration
●Visualization
Paradigms for Machine Learning
(variations)
●SEMMA (from SAS)
• Sample
• Explore
• Modify
• Model
• Assess
●CRISP-DM (SPSS/IBM)
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment
Supervised Learning
●Goal: Predict a single “target” or “outcome”
variable
●Training data, where target value is known
●“Score” (apply the model to) data where value is not known
●Methods: Classification and Prediction
Unsupervised Learning
●Goal: Segment data into meaningful groups; detect patterns
●There is no target (outcome) variable to
predict or classify
●Methods: Association rules, collaborative
filters, data reduction & exploration,
visualization
Supervised: Classification
●Goal: Predict categorical target (outcome)
variable
●Examples: Purchase/no purchase, fraud/no
fraud, creditworthy/not creditworthy…
●Each row is a case (customer, tax return,
applicant)
●Each column is a variable
●Target variable is often binary (yes/no)
Supervised: Prediction
(Estimation)
●Goal: Predict numerical target (outcome)
variable
●Examples: sales, revenue, performance
●As in classification:
●Each row is a case (customer, tax return,
applicant)
●Each column is a variable
●Taken together, classification and
prediction constitute “predictive analytics”
Unsupervised: Association
Rules
●Goal: Produce rules that define “what goes
with what” in transactions
●Example: “If X was purchased, Y was also
purchased”
●Rows are transactions
●Used in recommender systems – “Our
records show you bought X, you may also
like Y”
●Also called “affinity analysis”
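A minimal sketch of mining such rules, assuming the mlxtend library is available; the tiny transaction table below is invented for illustration:

```python
# Hypothetical toy example: rows are transactions, columns flag purchased items
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules  # assumed installed

transactions = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 0, 0]],
    columns=["bread", "butter", "jam"],
).astype(bool)

# Find itemsets appearing in at least half the transactions, then derive rules
frequent = apriori(transactions, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```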
Unsupervised: Data
Reduction
●Distillation of complex/large data into
simpler/smaller data
●Reducing the number of variables/columns
(e.g., principal components)
●Reducing the number of records/rows (e.g.,
clustering)
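A minimal sketch of both reductions, using scikit-learn on synthetic stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))            # 100 records, 10 numeric variables

# Fewer columns: project onto the first 2 principal components
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)                    # (100, 2)

# Fewer rows: summarize the 100 records by 5 cluster centers
centers = KMeans(n_clusters=5, n_init=10, random_state=1).fit(X).cluster_centers_
print(centers.shape)                      # (5, 10)
```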
The Process of Machine
Learning
Steps in Machine Learning
1. Define/understand purpose
2. Obtain data (may involve random
sampling)
3. Explore, clean, pre-process data
4. Reduce the data; if supervised learning,
partition it
5. Specify task (classification, clustering,
etc.)
6. Choose the techniques (regression, CART,
neural networks, etc.)
7. Iterative implementation and “tuning”
8. Assess results – compare models
Obtaining Data: Sampling
●Machine learning typically deals with huge
databases
●For piloting/prototyping, algorithms and
models are typically applied to a sample
from a larger dataset (easier to handle)
●Once you develop and select a final model,
you use it to “score” (predict values or
classes for) the records in the larger
database
●Also called “inference”
Rare Event Oversampling
●Often the event of interest is rare
●Examples: response to mailing, fraud in
taxes, …
●Sampling may yield too few “interesting”
cases to effectively train a model
●A popular solution: oversample the rare
cases (equivalent to undersampling the
dominant cases) to obtain a more balanced
training set
●Later, need to adjust results for the
oversampling
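A minimal oversampling sketch in pandas; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical training data where fraud (the event of interest) is rare
df = pd.DataFrame({"amount": [10, 15, 12, 95, 11, 14],
                   "fraud":  [0, 0, 0, 1, 0, 0]})

rare = df[df["fraud"] == 1]
common = df[df["fraud"] == 0]

# Duplicate rare cases (sampling with replacement) to balance the classes
rare_up = rare.sample(n=len(common), replace=True, random_state=1)
balanced = pd.concat([common, rare_up]).sample(frac=1, random_state=1)
print(balanced["fraud"].value_counts())   # now 5 of each class
```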
Types of Variables (Features)
●Determine the types of pre-processing needed and the algorithms that can be used
●Main distinction: Categorical vs. numeric
●Numeric
●Continuous
●Integer
●Categorical
●Ordered (low, medium, high)
●Unordered (male, female)
Variable handling
●Numeric
●Most algorithms can handle numeric data
●May occasionally need to “bin” into
categories
●Categorical
●Naïve Bayes can use as-is
●In most other algorithms, must create n or n-1 binary dummies
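A minimal dummy-coding sketch with pandas; the column is hypothetical (loosely styled after the West Roxbury REMODEL variable):

```python
import pandas as pd

df = pd.DataFrame({"remodel": ["None", "Recent", "Old", "None"]})

# n dummies (one column per category) vs. n-1 dummies (first category dropped)
dummies_n = pd.get_dummies(df["remodel"])
dummies_n_minus_1 = pd.get_dummies(df["remodel"], drop_first=True)
print(dummies_n_minus_1)
```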
Data Pre-processing in RM - West Roxbury
data
Detecting Outliers
⚫An outlier is an observation that is
“extreme”, being distant from the rest of
the data (definition of “distant” is
deliberately vague)
⚫Outliers can have disproportionate influence on models (a problem if the outlier is spurious)
⚫An important step in data pre-processing is
detecting outliers
⚫Once an outlier is detected, domain knowledge is required to determine whether it is an error or truly extreme.
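A minimal sketch of one common detection approach, flagging values whose z-score exceeds 3; the threshold of 3 is a convention, not a rule from this book:

```python
import numpy as np

# Synthetic data: 100 typical values plus one extreme record
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=14, scale=1, size=100), 55.0)

z = (values - values.mean()) / values.std()   # standardize
print(values[np.abs(z) > 3])                  # flags the 55.0 record
```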
Detecting Outliers
⚫In some contexts, finding outliers is the
purpose of the ML exercise (e.g. airport
security screening). This is called “anomaly
detection”.
Handling Missing Data
⚫Most algorithms will not process records
with missing values. Default is to drop
those records.
⚫Solution 1: Omission
⚫ If a small number of records have missing values,
can omit them
⚫ If many records are missing values on a small set of
variables, can drop those variables (or use proxies)
⚫ If many records/variables have missing values,
omission is not practical
⚫Solution 2: Imputation
⚫ Replace missing values with reasonable substitutes
⚫ Lets you keep the record and use the rest of its (non-
missing) information
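A minimal sketch of both solutions in pandas, on a hypothetical numeric column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"rooms": [6, 7, np.nan, 5, np.nan, 8]})

df_omitted = df.dropna()                            # Solution 1: drop records
df_imputed = df.fillna(df.mean(numeric_only=True))  # Solution 2: mean imputation
print(df_imputed)
```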
Normalizing (Standardizing)
Data
⚫Used in some techniques when variables with
the largest scales would dominate and skew
results
⚫Puts all variables on same scale
⚫Normalizing function: Subtract mean and
divide by standard deviation
⚫Alternative function: scale to 0-1 by
subtracting minimum and dividing by the
range
⚫ Useful when the data contain both dummies and numeric variables
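A minimal sketch of both functions on a small numeric array:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

z_scored = (x - x.mean()) / x.std()             # subtract mean, divide by std dev
min_max = (x - x.min()) / (x.max() - x.min())   # rescale to the 0-1 range
print(z_scored, min_max)
```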
The Problem of Overfitting
⚫Statistical models can produce highly
complex explanations of relationships
between variables
⚫The “fit” may be excellent
⚫When used with new data, highly complex models do not perform as well.
100% fit – not useful for new
data
Overfitting (cont.)
Causes:
⚫ Too many predictors
⚫ A model with too many parameters
⚫ Trying many different models
Consequence: Deployed model will not work
as well as expected with completely new
data.
Partitioning the Data
Problem: How well will our model
perform with new data?
Solution: Separate data into two parts
⚫Training partition to develop the
model
⚫Validation partition (sometimes called
test) to implement the model and
evaluate its performance on “new”
data
Addresses the issue of overfitting
Holdout Partition
⚫ When a model is developed on training
data, it can overfit the training data
(hence need to assess on validation)
⚫ Assessing multiple models on same
validation data can overfit validation
data
⚫ Some methods use the validation data to
choose a parameter. This too can lead to
overfitting the validation data
⚫ Solution: final selected model is applied
to a holdout partition (sometimes
called test) to give unbiased estimate of
its performance on new data
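A minimal sketch of a 60/20/20 training/validation/holdout partition with scikit-learn; the proportions are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)

# Split off 20% as holdout first, then 25% of the remaining 80% as validation
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1)   # 0.25 * 80% = 20%
print(len(X_train), len(X_valid), len(X_holdout))     # 30 10 10
```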
Cross Validation
● Repeated partitioning = cross-validation (“cv”)
● k-fold cross validation, e.g. k=5
○ For each fold, set aside ⅕ of data as
validation
○ Use full remainder as training
○ The validation folds are non-overlapping
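A minimal 5-fold cross-validation sketch with scikit-learn's KFold:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 records

# Each record lands in exactly one (non-overlapping) validation fold
kf = KFold(n_splits=5, shuffle=True, random_state=1)
for fold, (train_idx, valid_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: validate on records {valid_idx}")
```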
Partitioning the Data
(60/40 split)
Example – Linear Regression
West Roxbury Housing Data
Predictive Modeling Process in RM
Training data: a few predictions and summary metrics
Validation data: a few predictions and summary metrics
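A minimal sketch of the same fit-then-score workflow with scikit-learn's linear regression, on synthetic stand-in data rather than the actual West Roxbury table:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: 3 numeric predictors, 1 target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([50.0, 20.0, -5.0]) + rng.normal(scale=10, size=200)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)            # 60/40 split, as above

model = LinearRegression().fit(X_train, y_train)    # fit on training partition
print(model.predict(X_valid[:5]))                   # a few validation predictions
```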
Error metrics
Error = actual – predicted
ME = Mean error
RMSE = Root-mean-squared error = Square
root of average squared error
MAE = Mean absolute error
MPE = Mean percentage error
MAPE = Mean absolute percentage error
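A minimal sketch computing these metrics with NumPy (MPE and MAPE assume nonzero actual values):

```python
import numpy as np

actual = np.array([100.0, 150.0, 200.0])
predicted = np.array([110.0, 140.0, 195.0])

err = actual - predicted                     # error = actual - predicted
me = err.mean()                              # ME
rmse = np.sqrt((err ** 2).mean())            # RMSE
mae = np.abs(err).mean()                     # MAE
mpe = 100 * (err / actual).mean()            # MPE, in percent
mape = 100 * np.abs(err / actual).mean()     # MAPE, in percent
print(me, rmse, mae, mpe, mape)
```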
AI Engineering, ML-Ops
⚫ In this book, focus is on developing/testing
the models that predict, classify, cluster,
recommend, forecast.
⚫ Developing/testing (prototyping) is the
initial stage; after selecting a model we
usually need to deploy it in a pipeline that
feeds data to it and generates actions.
⚫ Deployment phase = AI Engineering or, more specifically, ML-Ops
AI Engineering, ML-Operations (ML-
Ops)
Dozens of Tools
The “infrastructure” layer provides base computing capability,
memory, and networking
AI Engineering, ML-Ops
Security = admin, permissions, access rules
Monitoring = ingests logs, issues alerts
Automation = bring up, configure, tear down tools &
infrastructure
Resource Management = oversight, checking for resource
exhaustion
AI Engineering, ML-Ops
Tools for testing and debugging
AI Engineering, ML-Ops
Data store (data warehouse or data lake); also Analytic Base Table
(ABT) with derivatives more suited to analysis
AI Engineering, ML-Ops
Tools to create the ABT derivatives that are in the Data Collection layer.
AI Engineering, ML-Ops
Models: main focus of this book
Model Monitoring: ties into “Monitoring” component below
AI Engineering, ML-Ops
Delivery is how user views the system (text file, spreadsheet, interface
with Tableau or Power BI, …)
Summary
⚫ Machine Learning consists of supervised methods
(Classification & Prediction) and unsupervised methods
(Association Rules, Data Reduction, Data Exploration &
Visualization)
⚫ Before algorithms can be applied, data must be explored
and pre-processed
⚫ To evaluate performance and to avoid overfitting, data
partitioning is used
⚫ Models are fit to the training partition and assessed on
the validation and holdout partitions
⚫ Machine Learning methods are usually applied to a
sample from a large database, and then the best model
is used to score the entire database
⚫ Once a model is developed, AI Engineering (ML-Ops)
skills and tools are required to deploy it