Introduction to data science
Overview of Machine Learning
Machine Learning Approaches
◼ Classical learning
◼ Ensemble learning
◼ Reinforcement learning
◼ Neural nets and deep learning
Machine Learning Approaches
◼ Classical learning
  ◼ Supervised learning
  ◼ Unsupervised learning
  ◼ Semi-supervised learning
Machine Learning Approaches
◼ Ensemble learning
  ◼ Boosting
  ◼ Bagging
  ◼ Stacking
Machine Learning Approaches
◼ Reinforcement learning
  ◼ Genetic Algorithm (GA)
  ◼ Q-Learning
  ◼ …
Machine Learning Approaches
◼ Neural nets (NN) and deep learning
  ◼ Back Propagation
  ◼ Feed forward NN
  ◼ Convolutional NN
  ◼ Recurrent NN
  ◼ …
Supervised vs. Unsupervised Learning
◼ Supervised learning (classification)
◼ Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
◼ New data is classified based on the training set
◼ Unsupervised learning (clustering)
◼ The class labels of the training data are unknown
◼ Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
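A minimal sketch contrasting the two settings, assuming scikit-learn and its built-in Iris data; the k-NN classifier and k-means clustering used here are illustrative choices, not prescribed by the slides.

# Supervised vs. unsupervised learning on the same feature matrix (illustrative).
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised (classification): the training data is accompanied by class labels y.
clf = KNeighborsClassifier().fit(X, y)
print("Predicted class for first sample:", clf.predict(X[:1]))

# Unsupervised (clustering): class labels are unknown; aim to find clusters in the data.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assigned to first sample:", km.labels_[0])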
Supervised Learning: Classification vs. Prediction
◼ Classification
◼ predicts categorical class labels (discrete or nominal)
◼ constructs a model from the training set and the values (class labels) of a
classifying attribute, and uses it to classify new data
◼ Prediction (Regression)
◼ models continuous-valued functions, i.e., predicts
unknown or missing values
◼ Typical applications
◼ Credit approval
◼ Target marketing
◼ Medical diagnosis
◼ Fraud detection
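A minimal sketch of the distinction, assuming scikit-learn and synthetic data; decision trees are used only as an example model.

# Classification predicts a categorical label; prediction (regression) models a continuous value.
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: discrete/nominal class labels (e.g., credit approved: yes/no).
Xc, yc = make_classification(n_samples=200, n_features=4, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(Xc, yc)
print("Class label:", clf.predict(Xc[:1]))          # discrete output, e.g. [0] or [1]

# Regression: continuous-valued target (e.g., predicting a missing numeric value).
Xr, yr = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
reg = DecisionTreeRegressor(random_state=0).fit(Xr, yr)
print("Continuous value:", reg.predict(Xr[:1]))     # real-valued output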
Supervised Learning: Drawbacks
◼ Supervised learning requires human expertise: Expert
annotators play an invaluable role in guiding your model’s
training, but they can be difficult to recruit.
◼ Supervised learning is labor-intensive: You’ll need to
have a big enough team with relevant expertise to accurately
label large datasets.
◼ Supervised learning is time-intensive: In addition to top
talent, you’ll need the bandwidth to accurately annotate the
dataset so that your model is capable of producing
predictable outcomes.
Classification: A Two-Step Process
◼ Model construction: describing a set of predetermined classes
◼ Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
◼ The set of tuples used for model construction is the training set
◼ The model is represented as classification rules, decision trees,
or mathematical formulae
◼ Model usage: for classifying future or unknown objects
◼ Estimate accuracy of the model
◼ The known label of test sample is compared with the
classified result from the model
◼ Accuracy rate is the percentage of test set samples that are
correctly classified by the model
◼ Test set is independent of training set; otherwise overfitting will occur
◼ If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
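A minimal sketch of the two-step process, assuming scikit-learn and its built-in Iris data; the dataset and the decision-tree classifier are illustrative.

# Step 1: model construction from a training set.
# Step 2: model usage on an independent test set to estimate accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
# Keep the test set independent of the training set, otherwise overfitting goes undetected.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)   # construction
y_pred = model.predict(X_test)                                          # usage

# Accuracy rate: percentage of test-set samples correctly classified by the model.
print("Test-set accuracy:", accuracy_score(y_test, y_pred))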
Process (1): Model Construction
Training data and a classification algorithm produce the classifier (model).

Training Data:
NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Learned classifier (model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
The classifier is applied to test data to estimate accuracy, then used on unseen data.

Testing Data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
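A hypothetical reproduction of the tenure example in code; the numeric encoding of RANK and the choice of a decision tree are assumptions made for illustration.

# Build a classifier from the training table, then classify the unseen tuple (Jeff, Professor, 4).
from sklearn.tree import DecisionTreeClassifier

rank_code = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}   # illustrative encoding

train = [("Mike", "Assistant Prof", 3, "no"), ("Mary", "Assistant Prof", 7, "yes"),
         ("Bill", "Professor", 2, "yes"),     ("Jim", "Associate Prof", 7, "yes"),
         ("Dave", "Assistant Prof", 6, "no"), ("Anne", "Associate Prof", 3, "no")]

X_train = [[rank_code[rank], years] for _, rank, years, _ in train]
y_train = [tenured for _, _, _, tenured in train]

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Unseen data: (Jeff, Professor, 4) -> should come out 'yes', matching the learned rule
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
print(model.predict([[rank_code["Professor"], 4]]))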
Machine learning in data mining
Issues regarding classification and prediction
Issues: Data Preparation
◼ Data cleaning
◼ Preprocess data in order to reduce noise and handle
missing values
◼ Relevance analysis (feature selection)
◼ Remove the irrelevant or redundant attributes
◼ Data transformation
◼ Generalize and/or normalize data
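A minimal preprocessing sketch, assuming pandas and scikit-learn; the column names, the median fill, and the [0, 1] normalization are hypothetical choices.

# Data cleaning, relevance analysis, and transformation on a tiny hypothetical table.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"age": [25, None, 47, 35],              # hypothetical attributes
                   "income": [50_000, 62_000, None, 58_000],
                   "customer_id": [1, 2, 3, 4]})           # irrelevant for learning

# Data cleaning: handle missing values (here, fill with each column's median).
df = df.fillna(df.median(numeric_only=True))

# Relevance analysis (feature selection): remove irrelevant or redundant attributes.
df = df.drop(columns=["customer_id"])

# Data transformation: normalize the remaining attributes to [0, 1].
df[df.columns] = MinMaxScaler().fit_transform(df)
print(df)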
Issues: Evaluating Classification Methods
◼ Accuracy
◼ classifier accuracy: predicting class label
◼ predictor accuracy: guessing value of predicted attributes
◼ Speed
◼ time to construct the model (training time)
◼ time to use the model (classification/prediction time)
◼ Robustness: handling noise and missing values
◼ Scalability: efficiency in disk-resident databases
◼ Interpretability
◼ understanding and insight provided by the model
◼ Other measures, e.g., goodness of rules, such as decision
tree size or compactness of classification rules
Issues: Evaluating Classification Methods
                 Actual class: +                      Actual class: –
Predicted +      True Positive (TP)                   False Positive (FP), Type I error
Predicted –      False Negative (FN), Type II error   True Negative (TN)
Issues: Evaluating Classification Methods
◼ Miss Detection Rate (false negative rate) = FN / (TP + FN)
◼ False Alarm Rate (false positive rate) = FP / (FP + TN)
Issues: Evaluating Classification Methods
◼ Accuracy = (TP + TN) / (TP + TN + FP + FN)
◼ Precision = TP / (TP + FP)
◼ Recall = TP / (TP + FN)
◼ F1-Score = 2 × Precision × Recall / (Precision + Recall)
Issues: Evaluating Classification Methods
Example: Given a confusion matrix, calculate Accuracy, Precision, Recall, and F1-Score (see the sketch below).
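A minimal sketch computing the four metrics from confusion-matrix counts; the TP/FP/FN/TN values below are hypothetical.

# Accuracy, Precision, Recall, and F1-Score from hypothetical confusion-matrix counts.
TP, FP, FN, TN = 40, 10, 5, 45   # hypothetical counts

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1_score  = 2 * precision * recall / (precision + recall)

print(f"Accuracy={accuracy:.3f}  Precision={precision:.3f}  "
      f"Recall={recall:.3f}  F1={f1_score:.3f}")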
Issues: Evaluating Regression Methods
Mean Squared Error (MSE): MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
Mean Absolute Error (MAE): MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
Root Mean Square Error (RMSE): RMSE = √[ (1/n) Σᵢ (yᵢ − ŷᵢ)² ]
where: yᵢ are the actual values and ŷᵢ are the predicted values
Issues: Evaluating Regression Methods
Mean Absolute Percentage Error (MAPE): MAPE = (100%/n) Σᵢ |(yᵢ − ŷᵢ) / yᵢ|
R² (R-squared): R² = 1 − SSR/SST
where: yᵢ are the actual values, ŷᵢ are the predicted values,
SSR = Σᵢ (yᵢ − ŷᵢ)² is the sum of squared residuals, and SST = Σᵢ (yᵢ − ȳ)² is the total sum of squares
Issues: Evaluating Regression Methods
Example: Given a set of actual and predicted values, calculate MSE, MAE, RMSE, and R² (see the sketch below).
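A minimal sketch of the regression measures with NumPy; the actual and predicted values are hypothetical.

# MSE, MAE, RMSE, MAPE, and R^2 on hypothetical actual/predicted values.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical actual values
y_pred = np.array([2.5, 5.0, 3.0, 8.0])   # hypothetical predicted values

mse  = np.mean((y_true - y_pred) ** 2)
mae  = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(mse)
mape = 100 * np.mean(np.abs((y_true - y_pred) / y_true))
r2   = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # 1 - SSR/SST

print(f"MSE={mse:.3f} MAE={mae:.3f} RMSE={rmse:.3f} MAPE={mape:.1f}% R2={r2:.3f}")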
Issues: Overfitting and underfitting
▪ Underfitting happens when a model is too simple to capture the underlying patterns in the data
→ Poor performance on both the training and test sets
▪ Overfitting occurs when a model is too complex and memorizes the training data, including its noise
→ Good performance on the training set but poor performance on the test set
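A minimal sketch of the effect, assuming scikit-learn: polynomials of increasing degree are fit to noisy data and training error is compared with test error (the degrees and the synthetic data are illustrative).

# Compare training vs. test error as model complexity (polynomial degree) grows.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 60).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 4, 15):   # too simple (underfit), reasonable, too complex (overfit)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))
    test_err  = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")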
Other machine learning models
▪ Ensemble learning: combines multiple base models (e.g., via bagging, boosting, or stacking) so that the combined model performs better than any individual model (see the sketch below).
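A minimal ensemble-learning sketch, assuming scikit-learn; the random forest (bagging of decision trees) and the Iris data are illustrative choices.

# An ensemble aggregates many base learners; here, bagged decision trees (a random forest).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print("Cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean())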