Lecture 2: ML Pipeline
CSC 484 / 584, DA 515
Fall 2024
REF: Chapter 2: End-to-End ML
Ch2. An Example: End to End
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select models and train them.
6. Fine-tune your models.
7. Present your solution.
8. Launch, monitor, and maintain your system.
ML pipeline example
# Create the pipeline (note: it does not exactly match the 8 steps above)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
pipeline = make_pipeline(StandardScaler(),
                         PCA(n_components=8),
                         RandomForestClassifier(criterion='gini', n_estimators=50,
                                                max_depth=2, random_state=1))
# Fit the pipeline
pipeline.fit(X_train, y_train)
# Evaluate the model
print('Model Accuracy: %.3f' % pipeline.score(X_test, y_test))
Questions you might ask:
What does the dataset include?
District median price along with population, income, ...
What are the business objectives? (summary, visualization, prediction, ...)
What do we have currently? Complex hand-written rules (low accuracy)
What kind of problem is it?
Supervised / Unsupervised
Classification / Regression
Which algorithm (linear, polynomial, neural network, kernel, tree, k-NN)?
Example: California Housing Prices (1990)
Given California housing prices (by district),
train a model to predict a district's median housing price.
Load in the dataset
Demo Code:
Create an isolated environment: DA515
Install scikit-learn: pip install scikit-learn
Download the dataset (check the book code)
For us, the data is saved on disk in the same folder as your code:
import pandas as pd
housing = pd.read_csv("CA_housing.csv")
In my case, the data is saved in the subfolder "datasets":
housing = pd.read_csv("./datasets/CA_housing.csv")
Take a Quick Look at the Data Structure
The first 5 rows:
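A minimal sketch of the quick look (this mirrors the book's code; the output table itself is omitted here):
# show the first 5 rows of the DataFrame
print(housing.head())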
Explore the dataset
Check the number of rows and columns
Check data types (int, float, categorical, ...)
Check missing values (column total_bedrooms: 207 rows missing)
Check duplicated data
Check statistics (max, min, mean, ...)
A sketch of these checks is shown below.
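A minimal sketch of the exploration with pandas (assuming housing was loaded as above):
# number of rows and columns
print(housing.shape)
# data types and non-null counts per column
housing.info()
# missing values per column (total_bedrooms has 207 missing values here)
print(housing.isnull().sum())
# duplicated rows
print(housing.duplicated().sum())
# summary statistics: count, mean, std, min, max, quartiles
print(housing.describe())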
Missing values
In this example, we do not have many missing values.
We will fix the missing values later.
Categorical data
# check the categorical data
ocean_proximity    20640 non-null    object
housing["ocean_proximity"].value_counts()
<1H OCEAN     9034
INLAND        6496
NEAR OCEAN    2628
NEAR BAY      2270
ISLAND           5
There are 5 categories, and the counts differ.
Plotting
# install matplotlib
! pip install matplotlib
# import matplotlib to memory
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()
.hist() output: histograms of each numerical attribute (figure)
Observations
The data distributions vary; outliers need to be removed.
These attributes have very different scales.
Finally, many histograms are tail-heavy: they extend much farther to the right of the median than to the left.
For Machine Learning
Data pre-processing:
Missing values
Categorical data
Feature selection/engineering
Data scaling
Data sampling:
Using the stratify parameter: it makes the split so that the proportion of values in each subset matches the proportion of values in the column passed to stratify (see the sketch below).
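A minimal sketch of a stratified split, following the book's income-category idea (the income_cat bins are the book's choice, not part of the raw data):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# bin median_income into 5 income categories to stratify on
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])
# each split keeps the same proportion of income categories as the full data
train_set, test_set = train_test_split(housing, test_size=0.2,
                                       stratify=housing["income_cat"],
                                       random_state=42)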
For missing values 1/2
You can:
1. Get rid of the corresponding districts.
2. Get rid of the whole attribute.
3. Set the values to some value (zero, the mean, the median, etc.).
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median, inplace=True)
4. Find the closest neighbors and use their average (see the sketch below).
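Option 4 can be done with Scikit-Learn's KNNImputer; a minimal sketch (it works on numeric columns only, so the categorical column is excluded):
from sklearn.impute import KNNImputer

housing_num = housing.select_dtypes(include="number")
knn_imputer = KNNImputer(n_neighbors=5)
# each missing total_bedrooms value is replaced by the average of its
# 5 nearest neighbors (measured on the other numeric features)
housing_num_filled = knn_imputer.fit_transform(housing_num)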
Use SimpleImputer to fill in 2/2
You can use Scikit-Learn's SimpleImputer:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
imputer.fit(housing[["total_bedrooms"]])
housing["total_bedrooms"] = imputer.transform(housing[["total_bedrooms"]])
Text and Categorical Attributes
Check the first 10 samples of the categorical "ocean_proximity" data:
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)
Computers cannot deal with text data directly, so the categories must be encoded as numbers.
Use value_counts() (or SQL DISTINCT)
# count all distinct values
housing_cat.ocean_proximity.value_counts()
<1H OCEAN     7276
INLAND        5263
NEAR OCEAN    2124
NEAR BAY      1847
ISLAND           2
There are 5 categories, and the counts differ.
Convert: category text => ordinal numbers
Scikit-Learn provides OrdinalEncoder, which represents the 5 categories
['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
with the numbers [0, 1, 2, 3, 4].
Don't use it here: ocean_proximity is not ordinal data (the categories have no natural order). A sketch is shown below.
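A minimal sketch of what OrdinalEncoder would do, shown only to illustrate the problem:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
print(ordinal_encoder.categories_)  # ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
print(housing_cat_encoded[:5])      # a column of floats in {0, 1, 2, 3, 4}
# problem: a model would treat INLAND (1) as "closer" to <1H OCEAN (0)
# than to NEAR OCEAN (4), which is meaningless for nominal categories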
DATA TYPES
Variable types:
Continuous
Discrete:
Ordinal: can be ordered, such as A, B, C, D or Mon, Tue, ...
Nominal: no order, such as blue, red, ... or banana, apple, orange, ...
Text -> encode to numerical values
Image -> use pixel values
Voice -> convert to text
...
Correct Encoding: One-Hot encoding
There are 5 categories in total.
Encode each sample as a list [x1, x2, x3, x4, x5], where xi is 1 for yes and 0 for no.
1-hot example (see the sketch below):
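A small illustration (a hypothetical five-row sample, one row per category):
import pandas as pd

sample = pd.DataFrame({"ocean_proximity": ["<1H OCEAN", "INLAND", "ISLAND",
                                           "NEAR BAY", "NEAR OCEAN"]})
print(pd.get_dummies(sample).astype(int))
# each row has a single 1 in its category's column and 0 elsewhere,
# e.g. INLAND -> [0, 1, 0, 0, 0]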
Problem: TOO MANY VARIABLES (one-hot encoding adds one column per category)
Use the one-hot encoder:
# Use pd.get_dummies()
housing_cat_1hot = pd.get_dummies(housing_cat).astype(int)
# then merge it with the numerical attributes
housing = housing.join(housing_cat_1hot)
Now we have 14 features.
Feature selection/engineering
Feature engineering: create new features that make more sense.
For example, rooms_per_household is more informative than total_rooms.
Feature selection: keep only the important, relevant features.
There are several different methods: recursive elimination, importance ranking, PCA, etc.
Here we focus on correlation.
Experiments with attribute combinations
What you really want is the number of rooms per household:
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
Matrix of Correlations
Standard correlation coefficient of various datasets (source: Wikipedia; public domain image)
Feature Selection
Correlations with median_house_value:
# Looking for correlations
# for ML, keep only the important features
(This assumes a linear relation; see the sketch below.)
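A minimal sketch of the correlation check (numeric_only=True is needed because ocean_proximity is text; this assumes a recent pandas version):
# pairwise correlations among the numeric attributes
corr_matrix = housing.corr(numeric_only=True)
# how strongly each attribute correlates (linearly) with the target
print(corr_matrix["median_house_value"].sort_values(ascending=False))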
Separate X and y: features and target
Separate the independent variables X from the dependent variable y:
# training features X
X = housing.drop("median_house_value", axis=1)
# label y
y = housing["median_house_value"]
Splitting and Scaling
Which one needs to be done first?
The book's code does the splitting first, with random sampling. This is generally fine if your dataset is large enough (especially relative to the number of attributes).
I prefer to do the splitting later.
Feature Scaling
Now all the data are numerical.
----------------------------------------------------------------------------
Scaling is very important:
For distance computation
For optimization
Two ways (a sketch is shown below):
Normalization: (x - x_min) / (x_max - x_min) => [0, 1]
Standardization: (x - mu) / sigma => N(0, 1)
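A minimal sketch of both scalers (assuming X_train and X_test come from the train_test_split shown a few slides later; fit on the training set only to avoid leakage):
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# normalization: rescale each feature to [0, 1]
min_max_scaler = MinMaxScaler()
X_train_norm = min_max_scaler.fit_transform(X_train)

# standardization: zero mean and unit variance per feature
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)  # reuse the training-set statistics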
Why do data scaling?
Lesson of the widow's mite:
"This poor widow put in more than all the other contributors to the treasury. For they have all contributed from their surplus wealth, but she, from her poverty, has contributed all she had, her whole livelihood." (Wikipedia)
Feature Scaling (figure source: http://cs231n.github.io/neural-networks-2/)
y = b + w1*x1 + w2*x2
[Figure: the same data plotted on the x1 and x2 axes before and after scaling]
Make different features have the same scale.
Feature Selection
Add combined features
Remove less relevant features:
Feature Scaling: y = b + w1*x1 + w2*x2
[Figure: contours of the loss L over (w1, w2). With unscaled inputs (x1 in 1, 2, ... and x2 in 100, 200, ...) the contours are elongated; after scaling both inputs to similar ranges the contours are much rounder, which makes optimization easier.]
Data Splitting: 80 vs 20
Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways. The simplest function is train_test_split():
# random sampling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
Now visualize the training data
ML models
Regression: for a continuous target (y)
Linear or polynomial regression
Tree
Random Forest
K-NN (homework)
SVM
ANN
Kernel methods
...
Regression Evaluation Metrics (1/3)
1. Mean Squared Error (MSE) or Root Mean Squared Error (RMSE):
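The standard definitions, with $y_i$ the observed and $\hat{y}_i$ the predicted value over $n$ samples:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$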
Regression Evaluation Metrics (2/3)
2. Mean Absolute Error (MAE):
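The standard definition:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$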
Regression Evaluation Metrics (3/3)
3. R Squared / Adjusted R Squared:
For simple linear regression, r² is used instead of R².
R² quantifies the degree of linear correlation between Y_obs and Y_pred, i.e., it assesses the goodness of fit.
https://en.wikipedia.org/wiki/Coefficient_of_determination
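The standard definitions (with $\bar{y}$ the mean of the observed values and $p$ the number of predictors):
$$R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}, \qquad R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1}$$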
Limitation of using R squared
R-squared is not valid for nonlinear regression:
https://statisticsbyjim.com/regression/r-squared-invalid-nonlinear-regression/
If you use R-squared for nonlinear models, their study indicates you will experience the following problems:
R-squared is consistently high for both excellent and appalling models.
R-squared will not always rise for better models.
If you use R-squared to pick the best model, it leads to the proper model only 28-43% of the time.
More info: https://en.wikipedia.org/wiki/Coefficient_of_determination
3 Steps of ML
Import a model
Train: fit(X, y)
Evaluate:
MSE (Mean Squared Error)
RMSE (Root Mean Squared Error)
Cross-validation
A sketch of the three steps is shown below.
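A minimal sketch, using LinearRegression as a stand-in model (assumes X_train, X_test, y_train, y_test from the earlier split):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# 1. import a model
lin_reg = LinearRegression()
# 2. train: fit(X, y)
lin_reg.fit(X_train, y_train)
# 3. evaluate: RMSE on the test set
predictions = lin_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("Test RMSE: %.1f" % rmse)

# cross-validation on the training set (scores are negative MSE by convention)
scores = cross_val_score(lin_reg, X_train, y_train,
                         scoring="neg_mean_squared_error", cv=10)
print("CV RMSE mean: %.1f" % np.sqrt(-scores).mean())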
Grid Search: searching for the best values among user-defined hyperparameter candidates
Example (Random Forest): 12 combinations + 6 combinations in total (see the sketch below)
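A minimal sketch of the grid; these candidate values follow the book's Random Forest example (3×4 = 12 combinations, plus 2×3 = 6 with bootstrap=False):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]
forest_reg = RandomForestRegressor(random_state=42)
# tries every combination with 5-fold cross-validation
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)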
Short Summary
After hyperparameter tuning, compare the RMSEs.
We also need to avoid overfitting.
Feature selection can be done differently, e.g. using the feature importances from the Random Forest (a sketch is shown below):
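A minimal sketch, assuming grid_search from the previous slide and X_train as a DataFrame:
feature_importances = grid_search.best_estimator_.feature_importances_
# rank the features from most to least important
for name, score in sorted(zip(X_train.columns, feature_importances),
                          key=lambda pair: pair[1], reverse=True):
    print("%-30s %.3f" % (name, score))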
Final Pipeline
Example Bayesian Algorithm
Finally
Evaluate your system on the test set.
Launch, monitor, and maintain your system.
Homework: K-NN for Appraisal
Data: California Housing (Chapter 2)
You cannot use the scikit-learn library.
K-NN: K nearest neighbors:
• Lazy algorithm
• No training phase
• No distribution assumption
• Based on feature similarity
• Used in classification by majority vote
• For regression, predict the average price of the neighbors (a scalar); see the sketch below
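A minimal sketch of the idea (NumPy only, no scikit-learn; assumes X_train and y_train are arrays and the features are already scaled):
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    # Euclidean distance from the query point to every training point
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # regression: return the average price of the k neighbors
    return y_train[nearest].mean()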
FYI: Real data sources
UCI Data Repository
http://archive.ics.uci.edu/ml/index.php
Kaggle
https://www.kaggle.com/datasets
Google datasets
https://cloud.google.com/public-datasets/
Government (Agriculture/Commerce/Education/FDA…. )
https://catalog.data.gov/dataset
END
• Read book Chapter 2
• Practice the code
• Do your homework 1