Recap of Machine Learning Concepts
• Essential Libraries and Tools:
• Python: the lingua franca for data science
• Scikit-Learn: open-source ML project, http://scikit-learn.org
• Jupyter Notebook: browser-based interactive development environment (IDE) for running code
• NumPy: for scientific computing
• Pandas: for data wrangling and analysis
• Matplotlib and Seaborn: for plotting
Introduction: What is ML? • Types of ML systems • Applications and challenges
Data: Data collection • Data pre-processing • Feature engineering
Algorithms: Supervised models • Unsupervised models • Model boosting, stacking, ensembling
Training ML Models for Production: Problem framing • Training best practices • Model validation
Deployment & Monitoring
ML is good for:
◼ Solutions requiring a long list of rules
◼ Solutions requiring extensive fine-tuning
◼ Complex problems unsolvable by traditional methods, e.g., perceptive problems such as image recognition
◼ Fluctuating environments, e.g., data changes, problem changes
◼ Dealing with large, complex data
◼ Observable but unstudied phenomena, e.g., computer network logs
Recap: Types of ML Systems
Machine Learning
• Supervised
  • Classification: Binary, Multi-Class, Multi-Label
  • Regression
• Unsupervised
  • Clustering: Divisive, Agglomerative
  • Association
• Semi-Supervised
• Reinforcement Learning
The Machine Learning Lifecycle
Business Goal / Problem Definition → Data Collection & Preparation → Data Preprocessing & Feature Engineering → Model Training → Model Evaluation → Model Deployment → Model Serving → Model Monitoring → Model Maintenance → (back to the business goal)
[The lifecycle diagram is repeated for each role, highlighting the stages each one owns: Business Analyst, Product Manager, Data Engineer, Data Scientist, Software Engineer, and MLOps.]
Customer Churn Example: Problem Formulation
Problem: Declining sales
Why? A) Customers buying fewer items, or B) fewer customers buying? If B, are there patterns? Loyal customers? Some regions? Some months? Etc.
What is churn?
Hypothesis: We can identify customers likely to churn before they leave. This is a classification problem (ANN or DT?).
Understand the potential value, e.g., if we can convince 5% not to leave, the financial impact is X.
Other considerations:
- What tools and data do I have?
- If the model will be used as a batch system, the scores will be generated monthly (weekly? daily? real time?) and sent to Y.
- What is the activation incentive (an offer?), and how will success be measured in production? Feedback loop?
EDA – Know Your Data
In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task.
EDA Tools/Techniques
◼ Data set shape and feature types: df.shape, df.dtypes
◼ Eye-balling of data: explore oddities by looking at column names etc.; df.head()
◼ Univariate analysis: understand distributions, outliers, missing values, variance, unique values etc.; df.describe(), box plots, CDF/PDF plots, violin plots
◼ Bivariate analysis: understand the relationship between two variables, e.g., age vs. target; box plots, pair plots
◼ Multivariate analysis: understand interactions between multiple variables; correlation matrix, pair plots, 3D plots etc.
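A minimal EDA sketch with pandas and seaborn (the file name and the 'age' and 'target' columns are illustrative assumptions):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")  # hypothetical dataset

# Shape and feature types
print(df.shape)
print(df.dtypes)

# Eye-ball the first rows
print(df.head())

# Univariate: summary statistics for numeric columns
print(df.describe())

# Bivariate: distribution of 'age' by target class
sns.boxplot(data=df, x="target", y="age")
plt.show()

# Multivariate: correlation matrix of numeric features
print(df.corr(numeric_only=True))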
Clean your data
Missing data
Important: Understand why data is missing.
Missing completely at random (MCAR)
Missing data are randomly distributed across the variable and unrelated to other variables (no patterns observed; every record has the same probability of being missing).
Missing at random (MAR)
There might be systematic differences between missing and observed records, but these are completely accounted for by other observed variables (e.g., more data is missing for males vs. females, but the probability of missing is the same within each group). The term 'random' is a bit of a misnomer.
Missing not at random (MNAR)
Missing data systematically differ from the observed values; the missingness is related to the variable itself, e.g., not stating my preference for brand X because I don't like it.
Handling missing values
◼ Remove records with missing data
◼ Leave as-is
◼ Impute
Substitution: fixed-value imputation (mean/mode/median/'unknown')
Fast and easy, but tends to be inaccurate because it ignores other features/correlations and the overall data structure; only suitable for MCAR; may be sensitive to noise, e.g., outliers.
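A quick sketch of fixed-value imputation with scikit-learn (the column name and values are illustrative):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"income": [3000.0, np.nan, 5200.0, 4100.0, np.nan]})

# Replace missing values with the column median
imputer = SimpleImputer(strategy="median")
df["income"] = imputer.fit_transform(df[["income"]]).ravel()
print(df)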
Feature Engineering
Sidebar: What is Cardinality?
The number of unique elements in a set:
X1 = {4, 6, 7}, cardinality = 3
X2 = {9, 2, 7, 3, 1}, cardinality = 5
Counties dataset:
Cust  County
1     002
2     010
3     030
4     002
5     006
6     047
County cardinality = 47 (47 possible county codes in the full dataset)
Converting Categorical Features to Numeric
One-Hot Encoding (dummy coding)

   Marital     Single  Married  Divorced  Unknown
1  Single      1       0        0         0
2  Married     0       1        0         0
3  Divorced    0       0        1         0
4  Unknown     0       0        0         1

- Very simple
- But can create an explosion of features if cardinality is high
- Is not target-led
(Pro-tip: can use an aggregation approach to reduce cardinality, e.g., 'Nairobi', 'Kiambu', 'Nakuru', 'Other')
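A one-hot encoding sketch with pandas (values mirror the table above):

import pandas as pd

df = pd.DataFrame({"marital": ["Single", "Married", "Divorced", "Unknown"]})

# Each category becomes its own 0/1 indicator column
dummies = pd.get_dummies(df["marital"], prefix="marital")
print(dummies)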
Label Encoding
Mon 1 -Alsosimple
Tue 2 -works better for ordered categories -
Wed 3
but may mislead algorithm on scale
Thur 4
and distance
Fri 5
Sat 6
–Is not target-led.
Sun 7
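A label-encoding sketch; here OrdinalEncoder (rather than LabelEncoder, which is meant for targets) is used so the day order can be stated explicitly, mapping Mon..Sun to 1..7 as above:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"day": ["Mon", "Wed", "Sun", "Tue"]})

# Supply the category order explicitly; codes start at 0, so add 1
order = [["Mon", "Tue", "Wed", "Thur", "Fri", "Sat", "Sun"]]
encoder = OrdinalEncoder(categories=order)
df["day_code"] = encoder.fit_transform(df[["day"]]).ravel().astype(int) + 1
print(df)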
Clustering Approach to Reduce Cardinality
For high-cardinality features, build similarity clusters and then perform one-hot encoding or proportion representation. Example: clustering US state codes into fewer categories.
Feature Enrichment
◼ Feature splitting: "Management, tertiary" → "Management" + "tertiary"
◼ Date extraction: 1 May 2022 → Sunday; 1 May 2022, 01:00 hrs → hour of the week (from 1 to 168)
◼ Combining features (domain-led): household income ÷ number of kids → income per kid. Can apply simple additions, subtractions, polynomials etc. to various features to extract a different dimension.
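A sketch of these enrichment steps in pandas (column names are illustrative, and the hour-of-week formula assumes the convention Monday 00:00 = hour 1):

import pandas as pd

df = pd.DataFrame({
    "job_edu": ["Management, tertiary"],
    "date": pd.to_datetime(["2022-05-01 01:00"]),
    "household_income": [90000],
    "num_kids": [3],
})

# Feature splitting: one text column into two
df[["job", "education"]] = df["job_edu"].str.split(", ", expand=True)

# Date extraction: day name and hour of the week (1 to 168)
df["day_name"] = df["date"].dt.day_name()
df["hour_of_week"] = df["date"].dt.dayofweek * 24 + df["date"].dt.hour + 1

# Combining features (domain-led)
df["income_per_kid"] = df["household_income"] / df["num_kids"]
print(df)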
Feature Scaling
Bring your features to the same or similar range of values or distributions.
◼ Normalization (min-max scaling): constrain data into a range (typically [0, 1]).
Caution: if the min and max are outliers, the feature will be squeezed into a very small range. In this case consider robust scaling: x' = (x − median) / inter-quartile range.
from sklearn.preprocessing import MinMaxScaler
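A minimal sketch with illustrative values, showing both min-max scaling and the robust alternative (scikit-learn's RobustScaler uses the median and IQR):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])

# Min-max scaling maps the feature into [0, 1];
# note how the outlier 100 squeezes the other values near 0
print(MinMaxScaler().fit_transform(X))

# Robust scaling ((x - median) / IQR) is less sensitive to the outlier
print(RobustScaler().fit_transform(X))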
Feature Scaling
◼ Standardization: rescale data to achieve the properties of a standard normal distribution (µ = 0, σ = 1).
from sklearn.preprocessing import StandardScaler
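A standardization sketch (values are illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[160.0], [170.0], [175.0], [185.0]])

# Rescale to mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(), X_std.std())  # approximately 0 and 1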
Power Transformation for Changing Distribution
Techniques used for converting a skewed distribution to a normal (or less-skewed) distribution. Many machine learning algorithms prefer or perform better when numerical variables have a Gaussian probability distribution.
◼ Log Transform: changes the distribution and takes care of extreme values.
◼ Box-Cox Transform:
Y' = (Y^λ − 1) / λ  when λ ≠ 0
Y' = ln Y           when λ = 0
from sklearn.preprocessing import PowerTransformer
PowerTransformer(method='box-cox')
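A Box-Cox sketch with scikit-learn (Box-Cox requires strictly positive values; the data here is synthetic and illustrative):

import numpy as np
from sklearn.preprocessing import PowerTransformer

# Right-skewed, strictly positive data
X = np.random.default_rng(0).lognormal(size=(500, 1))

pt = PowerTransformer(method="box-cox")
X_t = pt.fit_transform(X)
print(pt.lambdas_)  # the fitted lambda for the feature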
◼ Quantile Transform
A non-parametric data transformation technique to transform your numerical data to follow a certain distribution (e.g., the Gaussian distribution or the uniform distribution).
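A quantile-transform sketch mapping skewed synthetic data onto an approximately Gaussian output:

import numpy as np
from sklearn.preprocessing import QuantileTransformer

X = np.random.default_rng(1).exponential(size=(500, 1))

# Map the empirical quantiles onto a Gaussian output distribution
qt = QuantileTransformer(output_distribution="normal", n_quantiles=100)
print(qt.fit_transform(X)[:5])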
Related technique: Feature Discretization (binning), which replaces numerical values with bin numbers. E.g., the number of times a customer defaulted on a loan (0, 1, 2, 3, 4) can be binned into 0 (never) or 1 (has defaulted one or more times).
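The default-count binning above is a one-liner in pandas (column names are illustrative):

import pandas as pd

df = pd.DataFrame({"n_defaults": [0, 1, 2, 3, 4, 0]})

# 0 -> never defaulted, 1 -> defaulted one or more times
df["has_defaulted"] = (df["n_defaults"] > 0).astype(int)
print(df)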
Linear Models
Linear Separability
[Figure: linearly separable vs. not linearly separable point sets]
Linear separability implies that if there are two classes then there will be a point, line, plane, or hyperplane that splits the input features in such a way that all points of one class are in one half-space and the second class is in the other half-space.
Linear Models for Classification
Suppose we are looking at college admissions data: Y = 0 (not admitted) or Y = 1 (admitted).
Fit a linear regression model:
Ypred = β0 + β1·x1 + ... + βp·xp
[Figure: admitted/not-admitted points vs. exam score, with a fitted regression line]
Linear regression is not suitable for classification. Reasons:
• It underfits the data
• Predictions are not constrained between 0 and 1
• Outliers can have a large negative impact on the model
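A tiny demonstration of the second point, on made-up admissions data: a least-squares fit on 0/1 labels happily predicts outside [0, 1]:

import numpy as np
from sklearn.linear_model import LinearRegression

scores = np.array([[35], [45], [55], [65], [75], [85], [95]])
admitted = np.array([0, 0, 0, 1, 1, 1, 1])

lin = LinearRegression().fit(scores, admitted)
print(lin.predict([[20], [99]]))  # one output falls below 0, the other exceeds 1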
Logistic Regression
Y takes on the value 1 with success probability p, and takes on the value 0 with failure probability (1 − p).
We can use an appropriate link function (the logit) to produce a linearized model:
ln(p / (1 − p)) = β0 + β1·x1
[Figure: sigmoidal curve fit to the admissions data, with a cutoff separating admitted from not admitted along the exam score axis]
Logistic regression fits a sigmoidal curve and constrains the probability to between 0 and 1.
Types of Logistic Regression:
1. Binary Logistic Regression (2 outcomes)
2. Multinomial Logistic Regression (one can also binarize the target and do a 'one vs. all' approach)
3. Ordinal Logistic Regression (aka ordered logit, e.g., Olympic medals)
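A binary logistic regression sketch on the same made-up admissions data, with predicted probabilities constrained to (0, 1):

import numpy as np
from sklearn.linear_model import LogisticRegression

scores = np.array([[35], [45], [55], [65], [75], [85], [95]])
admitted = np.array([0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(scores, admitted)
print(clf.predict_proba([[20], [99]])[:, 1])  # probabilities stay in (0, 1)
print(clf.predict([[60]]))  # class label using the default 0.5 cutoff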