Lecture 2: ML Pipeline
CSC 484 / 584, DA 515
Fall 2024
REF: Chapter 2: End-to-End ML
Ch2. An Example: End to End
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select models and train them.
6. Fine-tune your models.
7. Present your solution.
8. Launch, monitor, and maintain your system.
ML pipeline example
# Create the pipeline (note: it does not exactly match the 8 steps above)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
pipeline = make_pipeline(StandardScaler(),
                         PCA(n_components=8),
                         RandomForestClassifier(criterion='gini', n_estimators=50,
                                                max_depth=2, random_state=1))
# Fit the pipeline
pipeline.fit(X_train, y_train)
# Evaluate the model
print('Model Accuracy: %.3f' % pipeline.score(X_test, y_test))
Questions you might ask:
What does the dataset include?
District median price along with population, income, ...
What are the business objectives? (summary, visualization, prediction, ...)
What do we have currently? Complex hand-written rules (low accuracy)
What kind of problem is it?
Supervised / Unsupervised
Classification / Regression
Which algorithm (linear, polynomial, neural network, kernel, tree, k-NN)?
Example: California Housing Prices (1990)
Given California housing prices (by district),
train a model to predict a district's median housing price.
Load in the dataset
Demo Code:
Create an isolated environment: DA515
Install scikit-learn: pip install scikit-learn
Download the dataset (check the book code)
For us, the data is saved on disk in the same folder as your code:
import pandas as pd
housing = pd.read_csv("CA_housing.csv")
In my case, the data is saved in the subfolder "datasets":
housing = pd.read_csv("./datasets/CA_housing.csv")
Take a Quick Look at the Data Structure
The first 5 rows:
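A minimal sketch of the quick look (this mirrors the book's code; the output table itself is omitted here):
# show the first 5 rows of the DataFrame
print(housing.head())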
Explore the dataset
Check the number of rows and columns
Check data types (int, float, categorical, ...)
Check missing values (column total_bedrooms: 207 rows missing)
Check duplicated data
Check statistics (max, min, mean, ...)
A sketch of these checks is shown below.
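A minimal sketch of the exploration with pandas (assuming housing was loaded as above):
# number of rows and columns
print(housing.shape)
# data types and non-null counts per column
housing.info()
# missing values per column (total_bedrooms has 207 missing values here)
print(housing.isnull().sum())
# duplicated rows
print(housing.duplicated().sum())
# summary statistics: count, mean, std, min, max, quartiles
print(housing.describe())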
Missing values
In this example, we do not have many missing values.
We will fix the missing values later.
Categorical data
# check the categorical data
ocean_proximity    20640 non-null    object
housing["ocean_proximity"].value_counts()
<1H OCEAN     9034
INLAND        6496
NEAR OCEAN    2628
NEAR BAY      2270
ISLAND           5
There are 5 categories, and the counts differ.
Plotting
# install matplotlib
! pip install matplotlib
# import matplotlib to memory
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()
.hist() output: histograms of each numerical attribute (figure)
Observations
The data distributions vary; outliers need to be removed.
These attributes have very different scales.
Finally, many histograms are tail-heavy: they extend much farther to the right of the median than to the left.
For Machine Learning
Data pre-processing:
Missing values
Categorical data
Feature selection/engineering
Data scaling
Data sampling:
Using the stratify parameter: it makes the split so that the proportion of values in each subset matches the proportion of values in the column passed to stratify (see the sketch below).
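A minimal sketch of a stratified split, following the book's income-category idea (the income_cat bins are the book's choice, not part of the raw data):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# bin median_income into 5 income categories to stratify on
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])
# each split keeps the same proportion of income categories as the full data
train_set, test_set = train_test_split(housing, test_size=0.2,
                                       stratify=housing["income_cat"],
                                       random_state=42)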
For missing values 1/2
You can:
1. Get rid of the corresponding districts.
2. Get rid of the whole attribute.
3. Set the values to some value (zero, the mean, the median, etc.).
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median, inplace=True)
4. Find the closest neighbors and use their average (see the sketch below).
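Option 4 can be done with Scikit-Learn's KNNImputer; a minimal sketch (it works on numeric columns only, so the categorical column is excluded):
from sklearn.impute import KNNImputer

housing_num = housing.select_dtypes(include="number")
knn_imputer = KNNImputer(n_neighbors=5)
# each missing total_bedrooms value is replaced by the average of its
# 5 nearest neighbors (measured on the other numeric features)
housing_num_filled = knn_imputer.fit_transform(housing_num)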
Use SimpleImputer to fill in 2/2
You can use Scikit-Learn's SimpleImputer:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
imputer.fit(housing[["total_bedrooms"]])
housing["total_bedrooms"] = imputer.transform(housing[["total_bedrooms"]])
Text and Categorical Attributes
Check the first 10 samples of the categorical "ocean_proximity" data:
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)
Computers cannot deal with text data directly, so the categories must be encoded as numbers.
Use value_counts() (or SQL DISTINCT)
# count all distinct values
housing_cat.ocean_proximity.value_counts()
<1H OCEAN     7276
INLAND        5263
NEAR OCEAN    2124
NEAR BAY      1847
ISLAND           2
There are 5 categories, and the counts differ.
Convert: category text => ordinal numbers
Scikit-Learn provides OrdinalEncoder, which represents the 5 categories
['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
with the numbers [0, 1, 2, 3, 4].
Don't use it here: ocean_proximity is not ordinal data (the categories have no natural order). A sketch is shown below.
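A minimal sketch of what OrdinalEncoder would do, shown only to illustrate the problem:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
print(ordinal_encoder.categories_)  # ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
print(housing_cat_encoded[:5])      # a column of floats in {0, 1, 2, 3, 4}
# problem: a model would treat INLAND (1) as "closer" to <1H OCEAN (0)
# than to NEAR OCEAN (4), which is meaningless for nominal categories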
DATA TYPES
Variable types:
Continuous
Discrete:
Ordinal: can be ordered, such as A, B, C, D or Mon, Tue, ...
Nominal: no order, such as blue, red, ... or banana, apple, orange, ...
Text -> encode to numerical values
Image -> use pixel values
Voice -> convert to text
...
Correct Encoding: One-Hot encoding
There are 5 categories in total.
Encode each sample as a list [x1, x2, x3, x4, x5], where xi is 1 for yes and 0 for no.
1-hot example (see the sketch below):
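A small illustration (a hypothetical five-row sample, one row per category):
import pandas as pd

sample = pd.DataFrame({"ocean_proximity": ["<1H OCEAN", "INLAND", "ISLAND",
                                           "NEAR BAY", "NEAR OCEAN"]})
print(pd.get_dummies(sample).astype(int))
# each row has a single 1 in its category's column and 0 elsewhere,
# e.g. INLAND -> [0, 1, 0, 0, 0]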
Problem: TOO MANY VARIABLES (one-hot encoding adds one column per category)
Use the one-hot encoder:
# Use pd.get_dummies()
housing_cat_1hot = pd.get_dummies(housing_cat).astype(int)
# then merge it with the numerical attributes
housing = housing.join(housing_cat_1hot)
Now we have 14 features.
Feature selection/engineering
Feature engineering: create new features that make more sense.
For example, rooms_per_household is more informative than total_rooms.
Feature selection: keep only the important, relevant features.
There are several different methods: recursive elimination, importance ranking, PCA, etc.
Here we focus on correlation.
Experiments with attribute combinations
What you really want is the number of rooms per household:
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
Matrix of Correlations
Standard correlation coefficient of various datasets (source: Wikipedia; public domain image)
Feature Selection
Correlations with median_house_value:
# Looking for correlations
# for ML, keep only the important features
(This assumes a linear relation; see the sketch below.)
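A minimal sketch of the correlation check (numeric_only=True is needed because ocean_proximity is text; this assumes a recent pandas version):
# pairwise correlations among the numeric attributes
corr_matrix = housing.corr(numeric_only=True)
# how strongly each attribute correlates (linearly) with the target
print(corr_matrix["median_house_value"].sort_values(ascending=False))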
Separate X and y: features and target
Separate the independent variables X from the dependent variable y:
# training features X
X = housing.drop("median_house_value", axis=1)
# label y
y = housing["median_house_value"]
Splitting and Scaling
Which one needs to be done first?
The book's code does the splitting first, with random sampling. This is generally fine if your dataset is large enough (especially relative to the number of attributes).
I prefer to do the splitting later.
Feature Scaling
Now all the data are numerical.
----------------------------------------------------------------------------
Scaling is very important:
For distance computation
For optimization
Two ways (a sketch is shown below):
Normalization: (x - x_min) / (x_max - x_min) => [0, 1]
Standardization: (x - mu) / sigma => N(0, 1)
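A minimal sketch of both scalers (assuming X_train and X_test come from the train_test_split shown a few slides later; fit on the training set only to avoid leakage):
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# normalization: rescale each feature to [0, 1]
min_max_scaler = MinMaxScaler()
X_train_norm = min_max_scaler.fit_transform(X_train)

# standardization: zero mean and unit variance per feature
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)  # reuse the training-set statistics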
Why do data scaling?
Lesson of the widow's mite:
"This poor widow put in more than all the other contributors to the treasury. For they have all contributed from their surplus wealth, but she, from her poverty, has contributed all she had, her whole livelihood." (Wikipedia)
Feature Scaling (figure source: http://cs231n.github.io/neural-networks-2/)
y = b + w1*x1 + w2*x2
[Figure: the same data plotted on the x1 and x2 axes before and after scaling]
Make different features have the same scale.
Feature Selection
Add combined features
Remove less relevant features:
Feature Scaling: y = b + w1*x1 + w2*x2
[Figure: contours of the loss L over (w1, w2). With unscaled inputs (x1 in 1, 2, ... and x2 in 100, 200, ...) the contours are elongated; after scaling both inputs to similar ranges the contours are much rounder, which makes optimization easier.]
Data Splitting: 80 vs 20
Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways. The simplest function is train_test_split():
# random sampling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
Now visualize the training data
ML models
Regression: for a continuous target (y)
Linear or polynomial regression
Tree
Random Forest
K-NN (homework)
SVM
ANN
Kernel methods
...
Regression Evaluation Metrics (1/3)
1. Mean Squared Error (MSE) or Root Mean Squared Error (RMSE):
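The standard definitions, with $y_i$ the observed and $\hat{y}_i$ the predicted value over $n$ samples:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$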
Regression Evaluation Metrics (2/3)
2. Mean Absolute Error (MAE):
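The standard definition:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$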
Regression Evaluation Metrics (3/3)
3. R Squared / Adjusted R Squared:
For simple linear regression, r² is used instead of R².
R² quantifies the degree of linear correlation between Y_obs and Y_pred, i.e., it assesses the goodness of fit.
https://en.wikipedia.org/wiki/Coefficient_of_determination
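The standard definitions (with $\bar{y}$ the mean of the observed values and $p$ the number of predictors):
$$R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}, \qquad R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1}$$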
Limitation of using R squared
R-squared is not valid for nonlinear regression:
https://statisticsbyjim.com/regression/r-squared-invalid-nonlinear-regression/
If you use R-squared for nonlinear models, their study indicates you will experience the following problems:
R-squared is consistently high for both excellent and appalling models.
R-squared will not always rise for better models.
If you use R-squared to pick the best model, it leads to the proper model only 28-43% of the time.
More info: https://en.wikipedia.org/wiki/Coefficient_of_determination
3 Steps of ML
Import a model
Train: fit(X, y)
Evaluate:
MSE (Mean Squared Error)
RMSE (Root Mean Squared Error)
Cross-validation
A sketch of the three steps is shown below.
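A minimal sketch, using LinearRegression as a stand-in model (assumes X_train, X_test, y_train, y_test from the earlier split):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# 1. import a model
lin_reg = LinearRegression()
# 2. train: fit(X, y)
lin_reg.fit(X_train, y_train)
# 3. evaluate: RMSE on the test set
predictions = lin_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("Test RMSE: %.1f" % rmse)

# cross-validation on the training set (scores are negative MSE by convention)
scores = cross_val_score(lin_reg, X_train, y_train,
                         scoring="neg_mean_squared_error", cv=10)
print("CV RMSE mean: %.1f" % np.sqrt(-scores).mean())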
Grid Search: searching for the best values among user-defined hyperparameter candidates
Example (Random Forest): 12 combinations + 6 combinations in total (see the sketch below)
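A minimal sketch of the grid; these candidate values follow the book's Random Forest example (3×4 = 12 combinations, plus 2×3 = 6 with bootstrap=False):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]
forest_reg = RandomForestRegressor(random_state=42)
# tries every combination with 5-fold cross-validation
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)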
Short Summary
After hyperparameter tuning, compare the RMSEs.
We also need to avoid overfitting.
Feature selection can be done differently, e.g. using the feature importances from the Random Forest (a sketch is shown below):
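A minimal sketch, assuming grid_search from the previous slide and X_train as a DataFrame:
feature_importances = grid_search.best_estimator_.feature_importances_
# rank the features from most to least important
for name, score in sorted(zip(X_train.columns, feature_importances),
                          key=lambda pair: pair[1], reverse=True):
    print("%-30s %.3f" % (name, score))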
Final Pipeline
Example Bayesian Algorithm
Finally
Evaluate your system on the test set.
Launch, monitor, and maintain your system.
Homework: K-NN for Appraisal
Data: California Housing (Chapter 2)
You cannot use the scikit-learn library.
K-NN: K nearest neighbors:
• Lazy algorithm
• No training phase
• No distribution assumption
• Based on feature similarity
• Used in classification by majority vote
• For regression, predict the average price of the neighbors (a scalar); see the sketch below
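A minimal sketch of the idea (NumPy only, no scikit-learn; assumes X_train and y_train are arrays and the features are already scaled):
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    # Euclidean distance from the query point to every training point
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # regression: return the average price of the k neighbors
    return y_train[nearest].mean()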
FYI: Real data sources
UCI Data Repository
http://archive.ics.uci.edu/ml/index.php
Kaggle
https://www.kaggle.com/datasets
Google datasets
https://cloud.google.com/public-datasets/
Government (Agriculture/Commerce/Education/FDA…. )
https://catalog.data.gov/dataset
END
• Read book Chapter 2
• Practice the code
• Do your homework 1