Lecture 5 – Part 2
Feature Engineering
Feature engineering
• "Feature engineering is the process of transforming
raw data into features that better represent the
underlying problem to the predictive models, resulting
in improved model accuracy on unseen data." – Jason
Brownlee
Feature engineering
• “Coming up with features is difficult, time-consuming,
requires expert knowledge. 'Applied machine learning'
is basically feature engineering.” – Andrew Ng
The dream ...
[Diagram: Raw data → Dataset → Model → Task]
… The Reality
[Diagram: Raw data → ? → Features → ? → ML-ready dataset → Model → Task]
Feature engineering toolbox
• Just kidding :)
Variable data types
Numerical variables
Binarization
• Counts can quickly accumulate without bound
• Convert them into binary values (0, 1) to indicate presence
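A minimal sketch with scikit-learn's Binarizer (the counts and the threshold are illustrative, not from the slides):
>>> import numpy as np
>>> from sklearn.preprocessing import Binarizer
>>> counts = np.array([[ 0., 3., 0.],
...                    [12., 0., 1.]])
>>> Binarizer(threshold=0.0).fit_transform(counts)  # values > 0 become 1
array([[0., 1., 0.],
       [1., 0., 1.]])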
Quantization or Binning
• Group the counts into bins
• Maps a continuous number to a discrete one
• Bin size
• Fixed-width binning
• E.g. age groups (see the sketch below):
• 0–12 years old
• 13–17 years old
• 18–24 years old
• 25–34 years old
• Adaptive-width binning
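A minimal sketch of fixed-width binning with pandas (the ages are made up; the bin edges follow the example above):
import pandas as pd
ages = pd.Series([5, 13, 19, 30, 16])
# right-closed intervals: (0, 12], (12, 17], (17, 24], (24, 34]
age_group = pd.cut(ages, bins=[0, 12, 17, 24, 34],
                   labels=['0-12', '13-17', '18-24', '25-34'])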
Equal Width Binning
• Divides the continuous variable into several categories (bins) of the same width
• Pros
• easy to compute
• Cons
• if the counts have large gaps, many bins will be empty (no data)
Adaptive-width binning
• Equal frequency binning
• Quantiles: values that divide the data into equal portions
(continuous intervals with equal probabilities)
• Some q-quantiles have special names
• The only 2-quantile is called the median
• The 4-quantiles are called quartiles → Q
• The 6-quantiles are called sextiles → S
• The 8-quantiles are called octiles
• The 10-quantiles are called deciles → D
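A minimal sketch of equal-frequency (quartile) binning with pandas (the counts are illustrative):
import pandas as pd
counts = pd.Series([1, 2, 3, 4, 6, 9, 12, 40, 100, 250])
# each bin receives (roughly) the same number of observations
quartile_bin = pd.qcut(counts, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])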
Example: quartiles
Log Transformation
• Original number: x
• Transformed number: x' = log10(x)
• Back-transformed number: x = 10^x'
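A minimal numpy sketch (the values are illustrative):
import numpy as np
x = np.array([1., 10., 100., 1000.])
x_log = np.log10(x)        # transformed: [0., 1., 2., 3.]
x_back = 10 ** x_log       # back-transformed: the original values
# np.log1p(x) = log(1 + x) is a common variant when x can be 0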
Box-Cox transformation
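A minimal sketch with scipy (the data is simulated; Box-Cox requires strictly positive input and scipy estimates the lambda parameter by maximum likelihood):
import numpy as np
from scipy import stats
x = np.random.exponential(scale=2.0, size=1000) + 1e-6   # skewed, strictly positive data
x_boxcox, fitted_lambda = stats.boxcox(x)   # lambda = 0 reduces to the log transform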
Feature Scaling (Normalization)
• Models that are smooth functions of the input, such as linear regression and logistic regression, are affected by the scale of the input
• Feature scaling or normalization changes the scale of
the features
Min-max scaling
● Squeezes (or stretches) all values into the range [0, 1], adding robustness to very small standard deviations and preserving zeros in sparse data.
>>> import numpy as np
>>> from sklearn import preprocessing
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])
Standard (Z) Scaling
After Standardization, a feature has mean of 0 and variance of 1 (assumption of
many learning algorithms)
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
>>> X_scaled.mean(axis=0)
array([ 0., 0., 0.])
>>> X_scaled.std(axis=0)
array([ 1., 1., 1.])
Standardization with scikit-learn
l2 Normalization
• also known as the Euclidean norm
• measures the length of the vector in coordinate space
• scale the values so that if they were all squared and
summed, the value would be 1
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer
path = r'./pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values
data_normalizer = Normalizer(norm='l2').fit(array)
data_normalized = data_normalizer.transform(array)
Categorical Variables
Categorical Features
• Nearly always need some treatment to be suitable for
models
• High cardinality can create very sparse data
• Difficult to impute missing values
• Examples
• Platform: [“desktop”, “tablet”, “mobile”]
• Document_ID or User_ID: [121545, 64845, 121545]
Label Encoding
• Transform categorical variables into numerical variables by assigning a numerical value to each category
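A minimal sketch with scikit-learn's LabelEncoder, reusing the platform example above (note that LabelEncoder is intended for target labels; OrdinalEncoder is the feature-side equivalent):
>>> from sklearn.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>> le.fit_transform(['desktop', 'tablet', 'mobile', 'desktop'])
array([0, 2, 1, 0])
>>> le.classes_
array(['desktop', 'mobile', 'tablet'], dtype='<U7')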
LabelCount encoding
• Rank categorical variables by count in train set
• Useful for both linear and non-linear algorithms (e.g. decision trees)
• Not sensitive to outliers
• Won’t give the same encoding to different variables
Ordinal encoding
• Transform a categorical variable into a numerical variable while ensuring the ordinal nature of the categories is preserved
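A minimal pandas sketch (the categories and their order are assumed for illustration):
import pandas as pd
sizes = pd.Series(['small', 'large', 'medium', 'small'])
order = {'small': 1, 'medium': 2, 'large': 3}   # explicit order preserved in the encoding
sizes_encoded = sizes.map(order)                # 1, 3, 2, 1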
Frequency encoding
• Transform a categorical variable into a numerical variable by considering the frequency distribution of its categories
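A minimal pandas sketch (the platform values reuse the earlier example):
import pandas as pd
platform = pd.Series(['desktop', 'mobile', 'desktop', 'tablet', 'desktop'])
freq = platform.value_counts(normalize=True)   # desktop 0.6, mobile 0.2, tablet 0.2
platform_encoded = platform.map(freq)          # each category replaced by its relative frequency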
One hot encoding
• Creates k different columns, one per category; the column of the observed category is set to 1 and the remaining columns are 0
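A minimal pandas sketch (scikit-learn's OneHotEncoder is the usual choice inside a pipeline):
import pandas as pd
platform = pd.Series(['desktop', 'tablet', 'mobile'])
one_hot = pd.get_dummies(platform)   # three indicator columns: desktop, mobile, tablet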
Target Mean encoding
• One of the best techniques
• Replace the categorical variable with the mean of its corresponding target variable
• Steps for mean encoding
• For each category:
• calculate the aggregated sum of the target (= a)
• calculate the aggregated total count (= b)
• numerical value for that category = a / b
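A minimal pandas sketch (the data is made up; in practice the means should be computed on the train set only, e.g. with out-of-fold estimates and smoothing, to avoid target leakage):
import pandas as pd
df = pd.DataFrame({'platform': ['desktop', 'mobile', 'desktop', 'tablet'],
                   'clicked':  [1, 0, 0, 1]})
means = df.groupby('platform')['clicked'].mean()   # a / b per category
df['platform_te'] = df['platform'].map(means)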
Feature Hashing
• Dealing with Large Categorical Variables
Some large categorical features from Outbrain Click Prediction competition
Feature hashing [2]
• Hashes categorical values into fixed-length vectors
• Lower sparsity and higher compression compared to one-hot encoding
• Deals with new and rare categorical values (e.g. new user agents)
• May introduce collisions
100 hashed columns
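A minimal sketch with scikit-learn's FeatureHasher (100 columns as in the figure; the user-agent strings are illustrative):
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=100, input_type='string')
hashed = hasher.transform([['user_agent=Mozilla/5.0'], ['user_agent=SomeNewAgent']])
# hashed is a sparse matrix with 100 columns; unseen values still map to some column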
Bin-counting
• Instead of using the value of the categorical variable as
the feature, we compute the association statistics
between that value and the target that we wish to
predict
• Useful for both linear and non-linear algorithms
• May give collisions (same encoding for different
categories)
• Be careful about leakage
• Strategies
• Counts
• Average CTR (click-through rate): P(click | ad) = ad_clicks / ad_views
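A minimal pandas sketch of the CTR statistic (the click log is made up):
import pandas as pd
df = pd.DataFrame({'ad_id': [1, 1, 1, 2, 2],
                   'click': [1, 0, 1, 0, 0]})
stats = df.groupby('ad_id')['click'].agg(views='count', clicks='sum')
stats['ctr'] = stats['clicks'] / stats['views']   # P(click | ad) = ad_clicks / ad_views
df['ad_ctr'] = df['ad_id'].map(stats['ctr'])      # compute on the train set only to avoid leakage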
Text Data
Natural Language Processing
• Cleaning
• Lowercasing
• Convert accented characters
• Removing non-alphanumeric
• Repairing
• Tokenizing
• Encode punctuation marks
• Tokenize
• N-Grams
• Skip-grams
• Char-grams
• Affixes
• Removing
• Stopwords
• Rare words
• Common words
• Roots
• Spelling correction
• Chop
• Stem
• Lemmatize
• Enrich
• Entity Insertion / Extraction
• Parse Trees
• Reading Level
Text vectorization
• Represent each document as a feature vector in the
vector space, where each position represents a word
(token) and the contained value is its relevance in the
document.
• BoW (Bag of words)
• TF-IDF (Term Frequency - Inverse Document Frequency)
• Embeddings (e.g. Word2Vec, GloVe)
• Topic models (e.g. LDA)
Document Term Matrix - Bag of Words
Bag-of-Words
• Input
• “Customer reviews build something known as social proof, a
phenomenon that states people are influenced by those
around them. This might include friends and family, industry
experts and influencers, or even internet strangers.”
• Output
• a text document is converted into a “flat” vector of counts
• doesn’t contain any of the original textual structures
• “John is quicker than Mary” and “Mary is quicker than John” have the same vectors
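A minimal sketch with scikit-learn's CountVectorizer, using the two sentences above:
from sklearn.feature_extraction.text import CountVectorizer
docs = ['John is quicker than Mary', 'Mary is quicker than John']
vec = CountVectorizer()
X = vec.fit_transform(docs)          # document-term matrix of token counts
print(vec.get_feature_names_out())   # ['is' 'john' 'mary' 'quicker' 'than']
print(X.toarray())                   # both rows are [1 1 1 1 1]: word order is lost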
Bag-of-n-Grams
• A natural extension of bag-of-words (a word is essentially a unigram)
• bag-of-n-grams representation can be more
informative
• n-grams retain more of the original sequence structure
• Cons
• bag-of-n-grams is a much bigger and sparser feature space
From Words to n-Grams to Phrases
• Tokenization is the process of splitting a string of text into a list of tokens.
• Chunking a sentence refers to dividing it into parts such as word groups and verb groups.
Document frequency
• Rare terms are more informative than frequent terms
• Recall stop words
• Consider a term in the query that is rare in the
collection (e.g., arachnocentric)
• A document containing this term is very likely to be
relevant to the query arachnocentric
• → We want a high weight for rare terms like arachnocentric.
Tf-idf
• The term frequency tft,d of term t in document d is
defined as the number of times that t occurs in d.
• dft is the document frequency of t: the number of
documents that contain t
• dft is an inverse measure of the informativeness of t
• dft ≤ N
• We define the idf (inverse document frequency) of t by
idft = log10(N / dft)
• We use log(N / dft) instead of N / dft to “dampen” the effect of idf.
Tf-idf
• The tf-idf weight of a term is the product of its tf weight
and its idf weight.
wt,d = log(1 + tft,d) × log10(N / dft)
• Best known weighting scheme in information retrieval
• Note: the “-” in tf-idf is a hyphen, not a minus sign!
• Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a
document
• Increases with the rarity of the term in the collection
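A minimal sketch with scikit-learn's TfidfVectorizer (note that scikit-learn uses a smoothed natural-log idf and l2-normalizes the rows, so the weights differ slightly from the log10 formula above):
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ['the cat sat on the mat',
        'the dog sat on the log',
        'cats and dogs say nothing']
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)   # each row is an l2-normalized tf-idf vector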
Filtering for Cleaner Features
• Stopwords
• weeding out common words that make for vacuous features
• Frequency-Based Filtering
• filtering out corpus-specific common words as well as general-
purpose stopwords
• Rare words
• Depending on the task, one might also need to filter out rare
words.
• These might be truly obscure words, or misspellings of
common words.
• Stemming
• An NLP task that tries to chop each word down to its basic
linguistic word stem form
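A minimal sketch of stopword filtering and stemming, using scikit-learn's built-in English stopword list and NLTK's PorterStemmer (the tokens are illustrative):
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import PorterStemmer

tokens = ['the', 'cats', 'are', 'running', 'fast']
filtered = [t for t in tokens if t not in ENGLISH_STOP_WORDS]   # drop generic stopwords
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]   # e.g. 'cats' -> 'cat', 'running' -> 'run'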
Word representation: embedding the context
• Attempt to encode similarity inside the word vectors
• Built on top of the following great idea
• “You shall know a word by the company it keeps” (J. R. Firth
1957)
During his presidency, Trump ordered a travel ban on citizens
controversial or false. Trump was elected president in a surprise victory over
1971, renamed it to The Trump Organization, and expanded it into Manhattan.
coordination between the Trump campaign and the Russian government in its election
interference.
These context words describe the meaning of “Trump”
Word embedding
• Each word is encoded in a dense vector (Low
dimension)
• Able to capture the semantics
• Similar words ~ Similar vectors
University = [0.13, 0.67, -0.34, 0.76, -0.21, -0.11, -0.45, 0.87, 0.44]
How to learn word embeddings
• The famous approach: Word2vec (Mikolov et al., 2013)
• Unsupervised learning
• Large-scale dataset
• Lower computation cost
• High quality word vectors
© Mikolov et al., 2013
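A minimal sketch with gensim (the toy corpus and hyperparameters are illustrative; gensim ≥ 4 uses vector_size, older versions used size):
from gensim.models import Word2Vec
sentences = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
             ['the', 'dog', 'sat', 'on', 'the', 'log']]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)  # sg=1: skip-gram
vector = model.wv['cat']                 # 50-dimensional dense vector
similar = model.wv.most_similar('cat')   # nearest words in the embedding space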
BERT sentence embedding
• Feng, Fangxiaoyu, et al. "Language-agnostic BERT
Sentence Embedding." arXiv preprint
arXiv:2007.01852 (2020).
Feature selection
Interaction Features
• A simple pairwise interaction feature is the product of
two features
• A simple linear model
• y = w1x1 + w2x2 + ... + wnxn
• An easy way to extend the linear model is to include
combinations of pairs of input features
• y = w1x1 + w2x2 + ... + wnxn + w1,1x1x1 + w1,2x1x2 + w1,3x1x3 + ...
Polynomial Features
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
[2, 3],
[4, 5]])
>>> poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=True)
>>> poly.fit_transform(X)
array([[ 1., 0., 1., 0., 0., 1.],
[ 1., 2., 3., 4., 6., 9.],
[ 1., 4., 5., 16., 20., 25.]])
Polynomial features with scikit-learn
Feature Selection
• Objective
• Prune away non-useful features in order to reduce the complexity of the resulting model
• Advantages
• Training a machine learning algorithm faster.
• Reducing the complexity of a model and making it easier
to interpret.
• Building a sensible model with better prediction power.
• Reducing overfitting by selecting the right set of features.
Wrapper methods
• The feature selection process is based on a specific
machine learning algorithm
• Exhaustive search evaluates all the possible combinations of features against the evaluation criterion
• Random search methods randomly generate a subset
of features
• Computationally intensive since for each subset a new
model needs to be trained
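A minimal sketch of a wrapper-style selector, scikit-learn's recursive feature elimination (the dataset and the number of kept features are illustrative):
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
# repeatedly fits the model and drops the weakest features
selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
selector.fit(X, y)
mask = selector.support_   # boolean mask of the selected features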
Embedded Methods
• Perform feature selection during the model training
• Decision tree
• select a feature in each recursive step of the tree growth
process and divide the sample set into smaller subsets
• The more child nodes in a subset are in the same class, the
more informative the features are
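A minimal sketch of embedded selection with scikit-learn's SelectFromModel on top of a tree ensemble (dataset and threshold are illustrative):
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
selector = SelectFromModel(forest, prefit=True, threshold='median')
X_selected = selector.transform(X)   # keeps features with importance above the median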
Method comparison
“More data beats clever algorithms,
but better data beats more data.”
– Peter Norvig
A diverse set of features and models leads to different results!
Outbrain Click Prediction
Towards Automated Feature
Engineering
Deep Learning....
Thank you
for your
attention!!!
References
• Scikit-learn - Preprocessing data
• Spark ML - Feature extraction