Lecture 5 – Part 2
Feature Engineering
Feature engineering
• "Feature engineering is the process of transforming
raw data into features that better represent the
underlying problem to the predictive models, resulting
in improved model accuracy on unseen data." – Jason
Brownlee
Feature engineering
• “Coming up with features is difficult, time-consuming,
requires expert knowledge. 'Applied machine learning'
is basically feature engineering.” – Andrew Ng
The dream ...
[Diagram: Raw data → Dataset → Model → Task]
… The Reality
[Diagram: Raw data → ? → Features → ? → ML-ready dataset → Model → Task]
Feature engineering toolbox
• Just kidding :)
Variable data types
Numerical variables
Binarization
• Counts can quickly accumulate without bound
• Convert them into binary values (0, 1) to indicate presence
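A minimal sketch with scikit-learn's Binarizer (the counts and the threshold are illustrative, not from the slides):
>>> import numpy as np
>>> from sklearn.preprocessing import Binarizer
>>> counts = np.array([[ 0., 3., 0.],
...                    [12., 0., 1.]])
>>> Binarizer(threshold=0.0).fit_transform(counts)  # values > 0 become 1
array([[0., 1., 0.],
       [1., 0., 1.]])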
Quantization or Binning
• Group the counts into bins
• Maps a continuous number to a discrete one
• Bin size
• Fixed-width binning
• E.g. age groups (see the sketch below):
• 0–12 years old
• 13–17 years old
• 18–24 years old
• 25–34 years old
• Adaptive-width binning
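A minimal sketch of fixed-width binning with pandas (the ages are made up; the bin edges follow the example above):
import pandas as pd
ages = pd.Series([5, 13, 19, 30, 16])
# right-closed intervals: (0, 12], (12, 17], (17, 24], (24, 34]
age_group = pd.cut(ages, bins=[0, 12, 17, 24, 34],
                   labels=['0-12', '13-17', '18-24', '25-34'])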
Equal Width Binning
• Divides the continuous variable into several categories (bins) of the same width
• Pros
• easy to compute
• Cons
• if the counts have large gaps, many bins will be empty (no data)
Adaptive-width binning
• Equal frequency binning
• Quantiles: values that divide the data into equal portions
(continuous intervals with equal probabilities)
• Some q-quantiles have special names
• The only 2-quantile is called the median
• The 4-quantiles are called quartiles → Q
• The 6-quantiles are called sextiles → S
• The 8-quantiles are called octiles
• The 10-quantiles are called deciles → D
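A minimal sketch of equal-frequency (quartile) binning with pandas (the counts are illustrative):
import pandas as pd
counts = pd.Series([1, 2, 3, 4, 6, 9, 12, 40, 100, 250])
# each bin receives (roughly) the same number of observations
quartile_bin = pd.qcut(counts, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])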
Example: quartiles
Log Transformation
• Original number: x
• Transformed number: x' = log10(x)
• Back-transformed number: x = 10^x'
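A minimal numpy sketch (the values are illustrative):
import numpy as np
x = np.array([1., 10., 100., 1000.])
x_log = np.log10(x)        # transformed: [0., 1., 2., 3.]
x_back = 10 ** x_log       # back-transformed: the original values
# np.log1p(x) = log(1 + x) is a common variant when x can be 0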
Box-Cox transformation
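A minimal sketch with scipy (the data is simulated; Box-Cox requires strictly positive input and scipy estimates the lambda parameter by maximum likelihood):
import numpy as np
from scipy import stats
x = np.random.exponential(scale=2.0, size=1000) + 1e-6   # skewed, strictly positive data
x_boxcox, fitted_lambda = stats.boxcox(x)   # lambda = 0 reduces to the log transform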
Feature Scaling (Normalization)
• Models that are smooth functions of the input, such as linear regression and logistic regression, are affected by the scale of the input
• Feature scaling or normalization changes the scale of
the features
Min-max scaling
● Squeezes (or stretches) all values into the range [0, 1], adding robustness to very small standard deviations and preserving zeros in sparse data.
>>> import numpy as np
>>> from sklearn import preprocessing
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])
Standard (Z) Scaling
After Standardization, a feature has mean of 0 and variance of 1 (assumption of
many learning algorithms)
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
>>> X_scaled.mean(axis=0)
array([ 0., 0., 0.])
>>> X_scaled.std(axis=0)
array([ 1., 1., 1.])
Standardization with scikit-learn
l2 Normalization
• also known as the Euclidean norm
• measures the length of the vector in coordinate space
• scale the values so that if they were all squared and
summed, the value would be 1
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer
path = r'./pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values
data_normalizer = Normalizer(norm='l2').fit(array)
data_normalized = data_normalizer.transform(array)
Categorical Variables
Categorical Features
• Nearly always need some treatment to be suitable for
models
• High cardinality can create very sparse data
• Difficult to impute missing values
• Examples
• Platform: [“desktop”, “tablet”, “mobile”]
• Document_ID or User_ID: [121545, 64845, 121545]
Label Encoding
• Transform categorical variables into numerical variables by assigning a numerical value to each category
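A minimal sketch with scikit-learn's LabelEncoder, reusing the platform example above (note that LabelEncoder is intended for target labels; OrdinalEncoder is the feature-side equivalent):
>>> from sklearn.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>> le.fit_transform(['desktop', 'tablet', 'mobile', 'desktop'])
array([0, 2, 1, 0])
>>> le.classes_
array(['desktop', 'mobile', 'tablet'], dtype='<U7')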
LabelCount encoding
• Rank categorical variables by count in train set
• Useful for both linear and non-linear algorithms (e.g. decision trees)
• Not sensitive to outliers
• Won’t give the same encoding to different variables
Ordinal encoding
• Transform a categorical variable into a numerical variable while ensuring the ordinal nature of the categories is preserved
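A minimal pandas sketch (the categories and their order are assumed for illustration):
import pandas as pd
sizes = pd.Series(['small', 'large', 'medium', 'small'])
order = {'small': 1, 'medium': 2, 'large': 3}   # explicit order preserved in the encoding
sizes_encoded = sizes.map(order)                # 1, 3, 2, 1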
Frequency encoding
• Transform a categorical variable into a numerical variable by considering the frequency distribution of its categories
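A minimal pandas sketch (the platform values reuse the earlier example):
import pandas as pd
platform = pd.Series(['desktop', 'mobile', 'desktop', 'tablet', 'desktop'])
freq = platform.value_counts(normalize=True)   # desktop 0.6, mobile 0.2, tablet 0.2
platform_encoded = platform.map(freq)          # each category replaced by its relative frequency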
One hot encoding
• Creates k different columns, one per category; the column of the observed category is set to 1 and the remaining columns are 0
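A minimal pandas sketch (scikit-learn's OneHotEncoder is the usual choice inside a pipeline):
import pandas as pd
platform = pd.Series(['desktop', 'tablet', 'mobile'])
one_hot = pd.get_dummies(platform)   # three indicator columns: desktop, mobile, tablet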
Target Mean encoding
• One of the best techniques
• Replace the categorical variable with the mean of its corresponding target variable
• Steps for mean encoding
• For each category:
• calculate the aggregated sum of the target (= a)
• calculate the aggregated total count (= b)
• numerical value for that category = a / b
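A minimal pandas sketch (the data is made up; in practice the means should be computed on the train set only, e.g. with out-of-fold estimates and smoothing, to avoid target leakage):
import pandas as pd
df = pd.DataFrame({'platform': ['desktop', 'mobile', 'desktop', 'tablet'],
                   'clicked':  [1, 0, 0, 1]})
means = df.groupby('platform')['clicked'].mean()   # a / b per category
df['platform_te'] = df['platform'].map(means)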
Feature Hashing
• Dealing with Large Categorical Variables
Some large categorical features from Outbrain Click Prediction competition
Feature hashing [2]
• Hashes categorical values into fixed-length vectors
• Lower sparsity and higher compression compared to one-hot encoding
• Deals with new and rare categorical values (e.g. new user agents)
• May introduce collisions
100 hashed columns
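A minimal sketch with scikit-learn's FeatureHasher (100 columns as in the figure; the user-agent strings are illustrative):
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=100, input_type='string')
hashed = hasher.transform([['user_agent=Mozilla/5.0'], ['user_agent=SomeNewAgent']])
# hashed is a sparse matrix with 100 columns; unseen values still map to some column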
Bin-counting
• Instead of using the value of the categorical variable as
the feature, we compute the association statistics
between that value and the target that we wish to
predict
• Useful for both linear and non-linear algorithms
• May give collisions (same encoding for different
categories)
• Be careful about leakage
• Strategies
• Counts
• Average CTR (click-through rate): P(click | ad) = ad_clicks / ad_views
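A minimal pandas sketch of the CTR statistic (the click log is made up):
import pandas as pd
df = pd.DataFrame({'ad_id': [1, 1, 1, 2, 2],
                   'click': [1, 0, 1, 0, 0]})
stats = df.groupby('ad_id')['click'].agg(views='count', clicks='sum')
stats['ctr'] = stats['clicks'] / stats['views']   # P(click | ad) = ad_clicks / ad_views
df['ad_ctr'] = df['ad_id'].map(stats['ctr'])      # compute on the train set only to avoid leakage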
Text Data
Natural Language Processing
• Cleaning
• Lowercasing
• Convert accented characters
• Removing non-alphanumeric
• Repairing
• Tokenizing
• Encode punctuation marks
• Tokenize
• N-Grams
• Skip-grams
• Char-grams
• Affixes
• Removing
• Stopwords
• Rare words
• Common words
• Roots
• Spelling correction
• Chop
• Stem
• Lemmatize
• Enrich
• Entity Insertion / Extraction
• Parse Trees
• Reading Level
Text vectorization
• Represent each document as a feature vector in the
vector space, where each position represents a word
(token) and the contained value is its relevance in the
document.
• BoW (Bag of words)
• TF-IDF (Term Frequency - Inverse Document Frequency)
• Embeddings (e.g. Word2Vec, GloVe)
• Topic models (e.g. LDA)
Document Term Matrix - Bag of Words
Bag-of-Words
• Input
• “Customer reviews build something known as social proof, a
phenomenon that states people are influenced by those
around them. This might include friends and family, industry
experts and influencers, or even internet strangers.”
• Output
• a text document is converted into a “flat” vector of counts
• doesn’t contain any of the original textual structures
• “John is quicker than Mary” and “Mary is quicker than John” have the same vectors
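A minimal sketch with scikit-learn's CountVectorizer, using the two sentences above:
from sklearn.feature_extraction.text import CountVectorizer
docs = ['John is quicker than Mary', 'Mary is quicker than John']
vec = CountVectorizer()
X = vec.fit_transform(docs)          # document-term matrix of token counts
print(vec.get_feature_names_out())   # ['is' 'john' 'mary' 'quicker' 'than']
print(X.toarray())                   # both rows are [1 1 1 1 1]: word order is lost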
Bag-of-n-Grams
• A natural extension of bag-of-words (a word is essentially a unigram)
• bag-of-n-grams representation can be more
informative
• n-grams retain more of the original sequence structure
• Cons
• bag-of-n-grams is a much bigger and sparser feature space
From Words to n-Grams to Phrases
• Tokenization is the process of splitting a string of text into a list of tokens.
• Chunking a sentence refers to dividing it into parts such as word groups and verb groups.
Document frequency
• Rare terms are more informative than frequent terms
• Recall stop words
• Consider a term in the query that is rare in the
collection (e.g., arachnocentric)
• A document containing this term is very likely to be
relevant to the query arachnocentric
• → We want a high weight for rare terms like arachnocentric.
Tf-idf
• The term frequency tft,d of term t in document d is
defined as the number of times that t occurs in d.
• dft is the document frequency of t: the number of
documents that contain t
• dft is an inverse measure of the informativeness of t
• dft ≤ N
• We define the idf (inverse document frequency) of t by
idft = log10(N / dft)
• We use log(N / dft) instead of N / dft to “dampen” the effect of idf.
Tf-idf
• The tf-idf weight of a term is the product of its tf weight
and its idf weight.
wt,d = log(1 + tft,d) × log10(N / dft)
• Best known weighting scheme in information retrieval
• Note: the “-” in tf-idf is a hyphen, not a minus sign!
• Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a
document
• Increases with the rarity of the term in the collection
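A minimal sketch with scikit-learn's TfidfVectorizer (note that scikit-learn uses a smoothed natural-log idf and l2-normalizes the rows, so the weights differ slightly from the log10 formula above):
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ['the cat sat on the mat',
        'the dog sat on the log',
        'cats and dogs say nothing']
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)   # each row is an l2-normalized tf-idf vector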
Filtering for Cleaner Features
• Stopwords
• weeding out common words that make for vacuous features
• Frequency-Based Filtering
• filtering out corpus-specific common words as well as general-
purpose stopwords
• Rare words
• Depending on the task, one might also need to filter out rare
words.
• These might be truly obscure words, or misspellings of
common words.
• Stemming
• An NLP task that tries to chop each word down to its basic
linguistic word stem form
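A minimal sketch of stopword filtering and stemming, using scikit-learn's built-in English stopword list and NLTK's PorterStemmer (the tokens are illustrative):
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import PorterStemmer

tokens = ['the', 'cats', 'are', 'running', 'fast']
filtered = [t for t in tokens if t not in ENGLISH_STOP_WORDS]   # drop generic stopwords
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]   # e.g. 'cats' -> 'cat', 'running' -> 'run'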
Word representation: embedding the context
• Attempt to encode similarity inside the word vectors
• Built on top of the following great idea
• “You shall know a word by the company it keeps” (J. R. Firth
1957)
During his presidency, Trump ordered a travel ban on citizens
controversial or false. Trump was elected president in a surprise victory over
1971, renamed it to The Trump Organization, and expanded it into Manhattan.
coordination between the Trump campaign and the Russian government in its election
interference.
These context words describe the meaning of “Trump”
Word embedding
• Each word is encoded in a dense vector (Low
dimension)
• Able to capture the semantics
• Similar words ~ Similar vectors
University = [0.13, 0.67, -0.34, 0.76, -0.21, -0.11, -0.45, 0.87, 0.44]
How to learn word embeddings
• The famous approach: Word2vec (Mikolov et al., 2013)
• Unsupervised learning
• Large-scale dataset
• Lower computation cost
• High quality word vectors
© Mikolov et al., 2013
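A minimal sketch with gensim (the toy corpus and hyperparameters are illustrative; gensim ≥ 4 uses vector_size, older versions used size):
from gensim.models import Word2Vec
sentences = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
             ['the', 'dog', 'sat', 'on', 'the', 'log']]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)  # sg=1: skip-gram
vector = model.wv['cat']                 # 50-dimensional dense vector
similar = model.wv.most_similar('cat')   # nearest words in the embedding space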
BERT sentence embedding
• Feng, Fangxiaoyu, et al. "Language-agnostic BERT
Sentence Embedding." arXiv preprint
arXiv:2007.01852 (2020).
Feature selection
Interaction Features
• A simple pairwise interaction feature is the product of
two features
• A simple linear model
• y = w1x1 + w2x2 + ... + wnxn
• An easy way to extend the linear model is to include
combinations of pairs of input features
• y = w1x1 + w2x2 + ... + wnxn + w1,1x1x1 + w1,2x1x2 + w1,3x1x3 + ...
Polynomial Features
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
[2, 3],
[4, 5]])
>>> poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=True)
>>> poly.fit_transform(X)
array([[ 1., 0., 1., 0., 0., 1.],
[ 1., 2., 3., 4., 6., 9.],
[ 1., 4., 5., 16., 20., 25.]])
Polynomial features with scikit-learn
Feature Selection
• Objective
• Prune away non-useful features in order to reduce the complexity of the resulting model
• Advantages
• Training a machine learning algorithm faster.
• Reducing the complexity of a model and making it easier
to interpret.
• Building a sensible model with better prediction power.
• Reducing overfitting by selecting the right set of features.
Wrapper methods
• The feature selection process is based on a specific
machine learning algorithm
• Exhaustive search evaluates all the possible combinations of features against the evaluation criterion
• Random search methods randomly generate a subset
of features
• Computationally intensive since for each subset a new
model needs to be trained
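A minimal sketch of a wrapper-style selector, scikit-learn's recursive feature elimination (the dataset and the number of kept features are illustrative):
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
# repeatedly fits the model and drops the weakest features
selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
selector.fit(X, y)
mask = selector.support_   # boolean mask of the selected features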
Embedded Methods
• Perform feature selection during the model training
• Decision tree
• select a feature in each recursive step of the tree growth
process and divide the sample set into smaller subsets
• The more child nodes in a subset are in the same class, the
more informative the features are
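A minimal sketch of embedded selection with scikit-learn's SelectFromModel on top of a tree ensemble (dataset and threshold are illustrative):
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
selector = SelectFromModel(forest, prefit=True, threshold='median')
X_selected = selector.transform(X)   # keeps features with importance above the median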
Method comparison
“More data beats clever algorithms,
but better data beats more data.”
– Peter Norvig
A diverse set of features and models leads to different results!
Outbrain Click Prediction
Towards Automated Feature
Engineering
Deep Learning....
Thank you
for your
attention!!!
References
• Scikit-learn - Preprocessing data
• Spark ML - Feature extraction