Introduction to Machine Learning
Theory and Practice
David R. Pugh
Instructional Assistant Professor, KAUST
Director, SDAIA-KAUST AI
• 5+ years teaching applied machine learning and deep learning at KAUST.
• 2+ years as the director of SDAIA-KAUST AI, where I work to match applied AI problems of interest to SDAIA with AI solutions developed at KAUST.
• 15+ years of experience with the core data science Python stack: NumPy, SciPy, Pandas, Matplotlib, NetworkX, Jupyter, Scikit-Learn, PyTorch, etc.
Agenda
Introduction to Machine Learning: Theory and Practice
09:00 - 09:05 Welcome and Opening Remarks Prof. David Pugh
09:05 - 10:30 The Machine Learning Landscape Prof. David Pugh
10:30 - 10:45 Break
10:45 - 12:00 Classification and Regression Prof. David Pugh
12:00 - 13:00 Lunch
13:00 - 14:30 Linear Regression with NumPy Prof. David Pugh + TAs
14:30 - 14:45 Break
14:45 - 16:00 Introduction to Scikit-Learn Prof. David Pugh + TAs
References
• Slides closely follow Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron.
• Another great reference is Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka.
• The official Scikit-Learn documentation is also fantastic.
The ML Landscape
Prof. David R. Pugh
What is the difference between AI and ML?
AI is the broad goal of building systems that behave intelligently; ML is the subfield of AI that pursues this goal by learning from data rather than by following explicitly programmed rules.
What is ML?
• ML is the science (and art) of programming computers so they can learn from data (Geron, 2019).
• [ML is the] field of study that gives computers the ability to learn without being explicitly programmed (Samuel, 1959).
• A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E (Mitchell, 1997). For a spam filter, T is flagging spam, E is a corpus of emails labelled spam or ham, and P could be classification accuracy.
Why is ML so popular right now?
Stanford's Coursera machine learning course had more than 100,000 students expressing interest in its first year.
1. The field has matured, both in terms of identity and in terms of methods and tools.
2. There is an abundance of data available.
3. There is an abundance of computation to run methods.
4. There have been impressive results, increasing acceptance, respect, and competition.
Resources + Ingredients + Tools + Desire = Popularity
Based on: http://machinelearningmastery.com/machine-learning-is-popular/?__s=yq1qzcnf67sfiuzmnvjf
Traditional approach is model/rules based...
...ML approach is data-driven!
ML adapts to change!
ML can help humans learn!
Types of ML systems
• Supervised vs unsupervised
• Semi-supervised vs self-supervised
• Batch (offline) vs incremental (online)
• Instance-based vs model-based
Supervised learning
Classification (predicting discrete class labels) vs regression (predicting continuous values)
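A minimal sketch of the two supervised tasks, assuming scikit-learn is installed; the synthetic datasets and model choices are illustrative, not part of the slides:

    from sklearn.datasets import make_classification, make_regression
    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Classification: the targets are discrete class labels.
    X_clf, y_clf = make_classification(n_samples=200, random_state=42)
    clf = LogisticRegression(max_iter=1000).fit(X_clf, y_clf)
    print(clf.predict(X_clf[:3]))  # e.g., three 0/1 labels

    # Regression: the targets are continuous values.
    X_reg, y_reg = make_regression(n_samples=200, random_state=42)
    reg = LinearRegression().fit(X_reg, y_reg)
    print(reg.predict(X_reg[:3]))  # three real-valued predictions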
Other forms of supervised learning
Semi-supervised learning (a few labelled examples plus many unlabelled ones) vs self-supervised learning (labels generated automatically from the data itself)
Unsupervised learning
Clustering (grouping similar instances) vs data visualization (e.g., projecting high-dimensional data down to 2D)
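As a rough illustration (assuming scikit-learn; the blob data is synthetic), clustering discovers group structure without ever seeing labels:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # labels discarded
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(kmeans.labels_[:10])  # cluster assignments learned from X alone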
Reinforcement Learning
An agent learns by interacting with an environment: it observes states, takes actions, and receives rewards, learning a policy that maximizes cumulative reward over time.
Batch (offline) vs incremental (online) learning
Batch (offline) learning: train on the full dataset at once, then deploy the model as-is. Incremental (online) learning: keep training as new data arrives, one instance or mini-batch at a time.
Out-of-core learning: incremental learning applied to datasets too large to fit in memory; the data is loaded and trained on chunk by chunk.
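A minimal sketch of out-of-core (incremental) training, assuming scikit-learn; real code would stream the chunks from disk, here they are synthesized:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier()      # linear model trained with stochastic gradient descent
    classes = np.array([0, 1])   # partial_fit must see the full set of classes up front
    rng = np.random.default_rng(42)
    for _ in range(10):          # pretend each chunk was loaded from disk
        X_chunk = rng.normal(size=(100, 5))
        y_chunk = (X_chunk[:, 0] > 0).astype(int)
        model.partial_fit(X_chunk, y_chunk, classes=classes)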
Instance-based vs model-based learning
Instance-based learning: memorize the training examples and generalize by comparing new instances to them. Model-based learning: fit a parameterized model to the training data and predict from the model alone.
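A small sketch of the contrast, assuming scikit-learn; the dataset is synthetic and the two models are just representative examples of each style:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=200, random_state=42)

    # Instance-based: keeps the training data and compares new points to it.
    knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
    # Model-based: fits parameters, then predicts from the model alone.
    lin = LogisticRegression(max_iter=1000).fit(X, y)
    print(knn.predict(X[:3]), lin.predict(X[:3]))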
Main Challenges of Applying ML
• Insufficient quantity of training data
• Non-representative training data
• Poor quality data
• Irrelevant features
• Overfitting the training data
• Underfitting the training data
Insufficient quantity of training data
• The more data for training, the better!
• It can take a lot of data for most ML algorithms to work.
• "Simple" problems often require O(10k) samples.
• "Complex" problems often require O(1m) samples.
Non-representative training data
• Training data needs to be representative of new data for the model to generalize.
• Sampling noise: not enough data => training data not representative by chance.
• Sampling bias: poor sampling technique => training data not representative (biased).
Poor quality training data
• Data can be full of errors, outliers, and noise (e.g., due to poor-quality measurements).
• Dirty data => hard for any algorithm to detect patterns.
• A significant amount of your time will be spent cleaning data.
• Data types? Do you have numeric features? Ordinal features? Categorical features?
• Look for outliers in your data: remove them? Fix them manually?
• Look for missing data: remove it? Impute values?
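For the missing-data question, a minimal sketch using scikit-learn's SimpleImputer (the tiny array is purely illustrative):

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [7.0, np.nan]])
    imputer = SimpleImputer(strategy="median")  # fill missing values with column medians
    print(imputer.fit_transform(X))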
Irrelevant features
Garbage in => garbage out!
• Learning requires sufficient relevant features (and not too many irrelevant ones!).
• Developing a good set of features for training is a critical part of any ML project.
• A significant amount of your time will be spent doing feature engineering.
Feature engineering is often critical to success:
• Feature selection: selecting the "best" subset of features for training.
• Feature extraction: combining existing features to produce new ones.
• Creating new features from new data.
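A minimal feature-selection sketch, assuming scikit-learn; the synthetic dataset has 20 features, of which only 5 are informative:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = make_classification(n_samples=300, n_features=20,
                               n_informative=5, random_state=42)
    selector = SelectKBest(score_func=f_classif, k=5)  # keep the 5 "best" features
    X_selected = selector.fit_transform(X, y)
    print(X_selected.shape)  # (300, 5)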
Overfitting the training data
What is overfitting?
• Overfitting is when a model performs well on training data but poorly on new data.
• If the model is complex or training data is limited, the model will detect spurious patterns.
• Constraining a complex model to make it simpler is called regularization.
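As a sketch of regularization (assuming scikit-learn; the degree-15 polynomial and the alpha value are illustrative choices), the Ridge penalty constrains the same complex model that would otherwise chase the noise:

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(42)
    X = rng.uniform(-3, 3, size=(30, 1))
    y = 0.5 * X.ravel() ** 2 + rng.normal(scale=1.0, size=30)

    # Unconstrained degree-15 polynomial: prone to fitting the noise.
    overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression()).fit(X, y)
    # Same polynomial with a Ridge penalty (alpha): simpler, smoother fit.
    regularized = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0)).fit(X, y)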
Underfitting the training data
What is underfitting?
• Underfitting is when a model is too simple to learn the underlying structure of the data.
• Linear models will often underfit (but are often a good place to start).
How to reduce underfitting?
• Select a more complex model (more parameters).
• Feed better features to the model (feature engineering).
• Reduce the constraints on the model (reduce regularization).
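A small sketch of the first remedy, assuming scikit-learn: a plain linear model underfits quadratic data, while adding polynomial features gives it enough capacity:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(100, 1))
    y = X.ravel() ** 2 + rng.normal(scale=0.5, size=100)

    linear = LinearRegression().fit(X, y)  # too simple for a quadratic pattern
    quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
    print(linear.score(X, y), quadratic.score(X, y))  # quadratic R^2 should be much higher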
Validation and Testing
Why measure generalization error?
• The only way to know if your model is good is to measure its performance on new data!
• Split your data into train and test sets: error on the test set is an estimate of the generalization error.
• Low training error, high generalization error => overfitting!
Some train-test split heuristics:
• For datasets smaller than O(100k) samples, take 80% for training and hold out 20% for testing.
• For larger datasets, O(1m) samples, hold out 1-10% of the dataset for testing.
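A minimal sketch of the 80/20 heuristic with scikit-learn (synthetic data; accuracy plays the role of the performance measure):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)  # hold out 20% for testing

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("train accuracy:", model.score(X_train, y_train))
    print("test accuracy: ", model.score(X_test, y_test))  # generalization estimate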
Model Selection
• Often need to tune hyperparameters to find a good model within a particular class of models.
• How? Split the training data into a (smaller) training set and a validation set.
• Compare tuned models using the validation set; reserve the test set for the final estimate of generalization error.
• Validation set too small => might select a "bad" model by mistake.
• Validation set too large => training set too small!
• Cross-validation: create lots of small validation sets, evaluate the model on each validation set, and measure average performance across validation sets.
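A sketch of hyperparameter tuning with cross-validation, assuming scikit-learn; the grid over the regularization strength C is illustrative:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=500, random_state=42)

    # 5-fold CV: each candidate C is scored on 5 small validation sets, then averaged.
    search = GridSearchCV(LogisticRegression(max_iter=1000),
                          param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                          cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)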
Model selection process: tune candidate models on the validation set, retrain the best one on the full training set, then evaluate it once on the test set to estimate generalization error.
Thanks!