Recap of Machine Learning Concepts
• Essential Libraries and Tools:
• Python: the lingua franca for data science
• Scikit-Learn: open-source ML project, http://scikit-learn.org
• Jupyter Notebook: browser-based interactive development environment (IDE) for running code
• NumPy: for scientific computing
• Pandas: for data wrangling and analysis
• Matplotlib and Seaborn: for plotting
Introduction: What is ML? • Types of ML systems • Applications and challenges
Data: Data collection • Data pre-processing • Feature engineering
Algorithms: Supervised models • Unsupervised models • Model boosting, stacking, ensembling
Training ML Models for Production: Problem framing • Training best practices • Model validation
Deployment & Monitoring
ML is good for:
◼ Solutions requiring a long list of rules
◼ Solutions requiring extensive fine-tuning
◼ Complex problems unsolvable by traditional methods, e.g., perceptive problems such as image recognition
◼ Fluctuating environments, e.g., data changes, problem changes
◼ Dealing with large, complex data
◼ Observable but unstudied phenomena, e.g., computer network logs
Recap: Types of ML Systems
Machine Learning
• Supervised
  • Classification: Binary, Multi-Class, Multi-Label
  • Regression
• Unsupervised
  • Clustering: Divisive, Agglomerative
  • Association
• Semi-Supervised
• Reinforcement Learning
The Machine Learning Lifecycle
Business Goal / Problem Definition → Data Collection & Preparation → Data Preprocessing & Feature Engineering → Model Training → Model Evaluation → Model Deployment → Model Serving → Model Monitoring → Model Maintenance → (back to the business goal)
[The lifecycle diagram is repeated for each role, highlighting the stages each one owns: Business Analyst, Product Manager, Data Engineer, Data Scientist, Software Engineer, and MLOps.]
Customer Churn Example: Problem Formulation
Problem: Declining sales
Why? A) Customers buying fewer items, or B) fewer customers buying? If B, are there patterns? Loyal customers? Some regions? Some months? Etc.
What is churn?
Hypothesis: We can identify customers likely to churn before they leave. This is a classification problem (ANN or DT?).
Understand the potential value, e.g., if we can convince 5% not to leave, the financial impact is X.
Other considerations:
- What tools and data do I have?
- If the model will be used as a batch system, the scores will be generated monthly (weekly? daily? real time?) and sent to Y.
- What is the activation incentive (an offer?), and how will success be measured in production? Feedback loop?
EDA – Know Your Data
In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task.
EDA Tools/Techniques
◼ Data set shape and feature types: df.shape, df.dtypes
◼ Eye-balling of data: explore oddities by looking at column names etc.; df.head()
◼ Univariate analysis: understand distributions, outliers, missing values, variance, unique values etc.; df.describe(), box plots, CDF/PDF plots, violin plots
◼ Bivariate analysis: understand the relationship between two variables, e.g., age vs. target; box plots, pair plots
◼ Multivariate analysis: understand interactions between multiple variables; correlation matrix, pair plots, 3D plots etc.
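A minimal EDA sketch with pandas and seaborn (the file name and the 'age' and 'target' columns are illustrative assumptions):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")  # hypothetical dataset

# Shape and feature types
print(df.shape)
print(df.dtypes)

# Eye-ball the first rows
print(df.head())

# Univariate: summary statistics for numeric columns
print(df.describe())

# Bivariate: distribution of 'age' by target class
sns.boxplot(data=df, x="target", y="age")
plt.show()

# Multivariate: correlation matrix of numeric features
print(df.corr(numeric_only=True))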
Clean your data
Missing data
Important: Understand why data is missing.
Missing completely at random (MCAR)
Missing data are randomly distributed across the variable and unrelated to other variables (no patterns observed; every record has the same probability of being missing).
Missing at random (MAR)
There might be systematic differences between missing and observed records, but these are completely accounted for by other observed variables (e.g., more data is missing for males vs. females, but the probability of missing is the same within each group). The term 'random' is a bit of a misnomer.
Missing not at random (MNAR)
Missing data systematically differ from the observed values; the missingness is related to the variable itself, e.g., not stating my preference for brand X because I don't like it.
Handling missing values
◼ Remove records with missing data
◼ Leave as-is
◼ Impute
Substitution: fixed-value imputation (mean/mode/median/'unknown')
Fast and easy, but tends to be inaccurate because it ignores other features/correlations and the overall data structure; only suitable for MCAR; may be sensitive to noise, e.g., outliers.
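A quick sketch of fixed-value imputation with scikit-learn (the column name and values are illustrative):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"income": [3000.0, np.nan, 5200.0, 4100.0, np.nan]})

# Replace missing values with the column median
imputer = SimpleImputer(strategy="median")
df["income"] = imputer.fit_transform(df[["income"]]).ravel()
print(df)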
Feature Engineering
Sidebar: What is Cardinality?
The number of unique elements in a set:
X1 = {4, 6, 7}, cardinality = 3
X2 = {9, 2, 7, 3, 1}, cardinality = 5
Counties dataset:
Cust  County
1     002
2     010
3     030
4     002
5     006
6     047
County cardinality = 47 (47 possible county codes in the full dataset)
Converting Categorical Features to Numeric
One-Hot Encoding (dummy coding)

   Marital     Single  Married  Divorced  Unknown
1  Single      1       0        0         0
2  Married     0       1        0         0
3  Divorced    0       0        1         0
4  Unknown     0       0        0         1

- Very simple
- But can create an explosion of features if cardinality is high
- Is not target-led
(Pro-tip: can use an aggregation approach to reduce cardinality, e.g., 'Nairobi', 'Kiambu', 'Nakuru', 'Other')
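A one-hot encoding sketch with pandas (values mirror the table above):

import pandas as pd

df = pd.DataFrame({"marital": ["Single", "Married", "Divorced", "Unknown"]})

# Each category becomes its own 0/1 indicator column
dummies = pd.get_dummies(df["marital"], prefix="marital")
print(dummies)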
Label Encoding
Mon 1 -Alsosimple
Tue 2 -works better for ordered categories -
Wed 3
but may mislead algorithm on scale
Thur 4
and distance
Fri 5
Sat 6
–Is not target-led.
Sun 7
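A label-encoding sketch; here OrdinalEncoder (rather than LabelEncoder, which is meant for targets) is used so the day order can be stated explicitly, mapping Mon..Sun to 1..7 as above:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"day": ["Mon", "Wed", "Sun", "Tue"]})

# Supply the category order explicitly; codes start at 0, so add 1
order = [["Mon", "Tue", "Wed", "Thur", "Fri", "Sat", "Sun"]]
encoder = OrdinalEncoder(categories=order)
df["day_code"] = encoder.fit_transform(df[["day"]]).ravel().astype(int) + 1
print(df)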
Clustering Approach to Reduce Cardinality
For high-cardinality features, build similarity clusters and then perform one-hot encoding or proportion representation. Example: clustering US state codes into fewer categories.
Feature Enrichment
◼ Feature splitting: "Management, tertiary" → "Management" + "tertiary"
◼ Date extraction: 1 May 2022 → Sunday; 1 May 2022, 01:00 hrs → hour of the week (from 1 to 168)
◼ Combining features (domain-led): household income ÷ number of kids → income per kid. Can apply simple additions, subtractions, polynomials etc. to various features to extract a different dimension.
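A sketch of these enrichment steps in pandas (column names are illustrative, and the hour-of-week formula assumes the convention Monday 00:00 = hour 1):

import pandas as pd

df = pd.DataFrame({
    "job_edu": ["Management, tertiary"],
    "date": pd.to_datetime(["2022-05-01 01:00"]),
    "household_income": [90000],
    "num_kids": [3],
})

# Feature splitting: one text column into two
df[["job", "education"]] = df["job_edu"].str.split(", ", expand=True)

# Date extraction: day name and hour of the week (1 to 168)
df["day_name"] = df["date"].dt.day_name()
df["hour_of_week"] = df["date"].dt.dayofweek * 24 + df["date"].dt.hour + 1

# Combining features (domain-led)
df["income_per_kid"] = df["household_income"] / df["num_kids"]
print(df)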
Feature Scaling
Bring your features to the same or similar range of values or distributions.
◼ Normalization (min-max scaling): constrain data into a range (typically [0, 1]).
Caution: if the min and max are outliers, the feature will be squeezed into a very small range. In this case consider robust scaling: x' = (x − median) / inter-quartile range.
from sklearn.preprocessing import MinMaxScaler
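A minimal sketch with illustrative values, showing both min-max scaling and the robust alternative (scikit-learn's RobustScaler uses the median and IQR):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])

# Min-max scaling maps the feature into [0, 1];
# note how the outlier 100 squeezes the other values near 0
print(MinMaxScaler().fit_transform(X))

# Robust scaling ((x - median) / IQR) is less sensitive to the outlier
print(RobustScaler().fit_transform(X))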
Feature Scaling
◼ Standardization: rescale data to achieve the properties of a standard normal distribution (µ = 0, σ = 1).
from sklearn.preprocessing import StandardScaler
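A standardization sketch (values are illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[160.0], [170.0], [175.0], [185.0]])

# Rescale to mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(), X_std.std())  # approximately 0 and 1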
Power Transformation for Changing Distribution
Techniques used for converting a skewed distribution to a normal (or less-skewed) distribution. Many machine learning algorithms prefer or perform better when numerical variables have a Gaussian probability distribution.
◼ Log Transform: changes the distribution and takes care of extreme values.
◼ Box-Cox Transform:
Y' = (Y^λ − 1) / λ  when λ ≠ 0
Y' = ln Y           when λ = 0
from sklearn.preprocessing import PowerTransformer
PowerTransformer(method='box-cox')
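A Box-Cox sketch with scikit-learn (Box-Cox requires strictly positive values; the data here is synthetic and illustrative):

import numpy as np
from sklearn.preprocessing import PowerTransformer

# Right-skewed, strictly positive data
X = np.random.default_rng(0).lognormal(size=(500, 1))

pt = PowerTransformer(method="box-cox")
X_t = pt.fit_transform(X)
print(pt.lambdas_)  # the fitted lambda for the feature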
◼ Quantile Transform
A non-parametric data transformation technique to transform your numerical data to follow a certain distribution (e.g., the Gaussian distribution or the uniform distribution).
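A quantile-transform sketch mapping skewed synthetic data onto an approximately Gaussian output:

import numpy as np
from sklearn.preprocessing import QuantileTransformer

X = np.random.default_rng(1).exponential(size=(500, 1))

# Map the empirical quantiles onto a Gaussian output distribution
qt = QuantileTransformer(output_distribution="normal", n_quantiles=100)
print(qt.fit_transform(X)[:5])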
Related technique: Feature Discretization (binning), which replaces numerical values with bin numbers. E.g., the number of times a customer defaulted on a loan (0, 1, 2, 3, 4) can be binned into 0 (never) or 1 (has defaulted one or more times).
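The default-count binning above is a one-liner in pandas (column names are illustrative):

import pandas as pd

df = pd.DataFrame({"n_defaults": [0, 1, 2, 3, 4, 0]})

# 0 -> never defaulted, 1 -> defaulted one or more times
df["has_defaulted"] = (df["n_defaults"] > 0).astype(int)
print(df)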
Linear Models
Linear Separability
[Figure: linearly separable vs. not linearly separable point sets]
Linear separability implies that if there are two classes then there will be a point, line, plane, or hyperplane that splits the input features in such a way that all points of one class are in one half-space and the second class is in the other half-space.
Linear Models for Classification
Suppose we are looking at college admissions data: Y = 0 (not admitted) or Y = 1 (admitted).
Fit a linear regression model:
Ypred = β0 + β1·x1 + ... + βp·xp
[Figure: admitted/not-admitted points vs. exam score, with a fitted regression line]
Linear regression is not suitable for classification. Reasons:
• It underfits the data
• Predictions are not constrained between 0 and 1
• Outliers can have a large negative impact on the model
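A tiny demonstration of the second point, on made-up admissions data: a least-squares fit on 0/1 labels happily predicts outside [0, 1]:

import numpy as np
from sklearn.linear_model import LinearRegression

scores = np.array([[35], [45], [55], [65], [75], [85], [95]])
admitted = np.array([0, 0, 0, 1, 1, 1, 1])

lin = LinearRegression().fit(scores, admitted)
print(lin.predict([[20], [99]]))  # one output falls below 0, the other exceeds 1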
Logistic Regression
Y takes on the value 1 with success probability p, and takes on the value 0 with failure probability (1 − p).
We can use an appropriate link function (the logit) to produce a linearized model:
ln(p / (1 − p)) = β0 + β1·x1
[Figure: sigmoidal curve fit to the admissions data, with a cutoff separating admitted from not admitted along the exam score axis]
Logistic regression fits a sigmoidal curve and constrains the probability to between 0 and 1.
Types of Logistic Regression:
1. Binary Logistic Regression (2 outcomes)
2. Multinomial Logistic Regression (one can also binarize the target and do a 'one vs. all' approach)
3. Ordinal Logistic Regression (aka ordered logit, e.g., Olympic medals)
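A binary logistic regression sketch on the same made-up admissions data, with predicted probabilities constrained to (0, 1):

import numpy as np
from sklearn.linear_model import LogisticRegression

scores = np.array([[35], [45], [55], [65], [75], [85], [95]])
admitted = np.array([0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(scores, admitted)
print(clf.predict_proba([[20], [99]])[:, 1])  # probabilities stay in (0, 1)
print(clf.predict([[60]]))  # class label using the default 0.5 cutoff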