
MODULE 3

Feature Generation and Feature Selection
Topics Covered
• Extracting Meaning from Data
• Motivating application: user (customer) retention.
• Feature Generation
• Feature Selection algorithms.
• Filters; Wrappers; Decision Trees; Random Forests.
• Recommendation Systems:
• Building a User-Facing Data Product
• Algorithmic ingredients of a Recommendation Engine
• Dimensionality Reduction
• Singular Value Decomposition
• Principal Component Analysis
• Exercise: build your own recommendation system.
Feature Selection
• The idea of feature selection is identifying the subset of data
or transformed data that you want to put into your model.
• Feature selection is not only useful for winning
competitions—it's an important part of building statistical
models and algorithms in general.
• Just because you have data doesn't mean it all has to go into
the model.
• For example, it's possible you have many redundancies or
correlated variables in your raw data, and so you don't want
to include all those variables in your model.
• Similarly, you might want to construct new variables by
transforming the variables with a logarithm, say, or turning a
continuous variable into a binary variable, before feeding
them into the model.
Features, Explanatory Variables, Predictors
• Different branches of academia use different terms to describe
the same thing.
• Statisticians say "explanatory variables," "independent
variables," or "predictors" when they're describing the subset
of data that is the input to a model.
• Computer scientists say "features."
• We are getting bigger and bigger datasets, but that’s not always helpful.
• If the number of features is larger than the number of observations, or if
we have a sparsity problem, then large isn’t necessarily good.
• And if the huge dataset just makes the data hard to manipulate for
computational reasons (e.g., it can't all fit on one computer, so the data
needs to be sharded across multiple machines) without improving our
signal, then that's a net negative.
• To improve the performance of your predictive models, you want to
improve your feature selection process
Example: User Retention
• Suppose you have an app you designed, called Chasing Dragons, and
users pay a monthly subscription fee to use it.
• The more users you have, the more money you make.
• Suppose you realize that only 10% of new users ever come back after
the first month.
• So you have two options to increase your revenue:
• find a way to increase the retention rate of existing users, or acquire
new users.
• Generally it costs less to keep an existing customer around than to
market and advertise to new users.
• Let's choose to focus on the user retention situation by building a
model that predicts whether or not a new user will come back next
month based on their behavior this month.
• You could build such a model in order to understand your
retention situation, but the focus here is instead on building an
algorithm that is highly accurate at predicting.
• You might want to use this model to give a free month to users
who you predict need the extra incentive to stick around
• A good, crude, simple model you could start out with would be logistic
regression
• This would give you the probability that the user returns in their second
month, conditional on their activities in the first month.
• Record each user’s behavior for the first 30 days after sign-up.
• You could log every action the user took with timestamps: user clicked the
button that said “level 6” at 5:22 a.m., user slew a dragon at 5:23 a.m.,
user got 22 points at 5:24 a.m., user was shown an ad for deodorant at 5:25
a.m.
• This would be the data collection phase.
• Any action the user could take gets recorded
• Notice that some users might have thousands of such actions, and other
users might have only a few.
• These would all be stored in time-stamped event logs.
• You'd then need to process these logs down to a dataset with rows and
columns, where each row is a user and each column is a feature.
• At this point, you shouldn’t be selective; you’re in the feature generation
phase.
• So your data science team (game designers, software engineers,
statisticians, and marketing folks) might sit down and brainstorm features.
Here are some examples
• Number of days the user visited in the first month
• Amount of time until second visit
• Number of points on day j for j=1, . . .,30 (this would be 30 separate
features)
• Total number of points in first month (sum of the other features)
• Did user fill out Chasing Dragons profile (binary 1 or 0)
• Age and gender of user
• Screen size of device
• Use your imagination and come up with as many features as possible.
• Notice there are redundancies and correlations between these features.
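To make the feature generation phase concrete, here is a minimal sketch (in Python with pandas) of turning a hypothetical time-stamped event log into a per-user feature table. The column names, event names, and data below are illustrative assumptions, not the actual Chasing Dragons schema.

```python
import pandas as pd

# Hypothetical event log: one row per logged user action in the first 30 days.
# Column names (user_id, timestamp, action, points) are illustrative only.
events = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2024-01-01 05:22", "2024-01-01 05:23", "2024-01-03 09:00",
        "2024-01-02 10:00", "2024-01-15 11:30"]),
    "action":    ["click_level_6", "slay_dragon", "slay_dragon",
                  "fill_profile", "slay_dragon"],
    "points":    [0, 22, 15, 0, 30],
})

# Aggregate the raw log into one row per user (the feature generation step).
features = events.groupby("user_id").agg(
    days_visited=("timestamp", lambda t: t.dt.date.nunique()),
    total_points=("points", "sum"),
    n_actions=("action", "count"),
    filled_profile=("action", lambda a: int((a == "fill_profile").any())),
)
print(features)
```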
Feature Generation or Feature Extraction
• This process we just went through of brainstorming a list of features for
Chasing Dragons is the process of feature generation or feature
extraction.
• This process is as much of an art as a science.
• It’s good to have a domain expert around for this process, but it’s also
good to use your imagination.
• We can generate tons of features through logging.
• Contrast this with other contexts like surveys, where
you're lucky if you can get a survey respondent to answer 20 questions.
• And when you can capture a lot of data, not all of it might actually be
useful information.
You can think of information as falling into
the following buckets
Relevant and useful, but it’s impossible to capture it.
There’s a lot of information that you’re not capturing about users—
how much free time do they actually have?
What other apps have they downloaded?
Are they unemployed?
Do they suffer from insomnia?
Do they have an addictive personality?
Do they have nightmares about dragons?
Relevant and useful, possible to log it, and you did
• It’s great that you chose to log it, but just because you chose to log it
doesn’t mean you know that it’s relevant or useful,
• so that’s what you’d like your feature selection process to discover.
Relevant and useful, possible to log it, but you didn’t
• It could be that you didn’t think to record whether users uploaded a
photo of themselves to their profile, and this action is highly predictive
of their likelihood to return.
• You’re human, so sometimes you’ll end up leaving out really
important stuff
Not relevant or useful, but you don’t know that and log it.
• This is what feature selection is all about—you’ve logged it, but you
don’t actually need it and you’d like to be able to know that
• Use logistic regression for your game retention prediction.
• Let c_i = 1 if user i returns to use Chasing Dragons any time in the
subsequent month.
• Again this is crude—you could choose the subsequent week or
subsequent two months.
• It doesn't matter. You just first want to get a working model, and then
you can refine it.
• Ultimately you want your logistic regression to be of the form:
logit(P(c_i = 1 | x_i)) = α + β^T · x_i
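As a minimal sketch of this setup, the code below fits a logistic regression with scikit-learn on synthetic stand-in data; the feature matrix and labels are made up and only mirror the c_i / x_i notation above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch: X holds one row per user with the brainstormed first-month features
# (days visited, total points, filled profile, ...); y is c_i, i.e. 1 if the
# user returned in the second month, 0 otherwise. All data here is synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # stand-in for the engineered features
y = (rng.random(500) < 0.3).astype(int)  # stand-in for observed retention

model = LogisticRegression()
model.fit(X, y)

# Probability each user returns next month, conditional on this month's behavior.
p_return = model.predict_proba(X)[:, 1]
print(model.intercept_, model.coef_, p_return[:5])
```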
Filters
•Filters order possible features with respect to a ranking based
on a metric or statistic, such as correlation with the outcome
variable.
• This is sometimes good on a first pass over the space of
features, because it takes into account the predictive
power of each individual feature.
•However, the problem with filters is that you get correlated
features.
• In other words, the filter doesn’t care about redundancy.
• And by treating the features as independent, you’re not
taking into account possible interactions
• On the one hand, two redundant features can be more powerful when
they are both used;
• On the other hand, something that appears useless alone could actually
help when combined with another possibly useless-looking feature that
an interaction would capture.
• For each feature, run a linear regression with only that feature as a
predictor.
• Each time, note either the p-value or R-squared, and rank the features
by lowest p-value or highest R-squared (see the sketch below).
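A minimal sketch of this filter approach, assuming synthetic data and using statsmodels OLS for the one-feature-at-a-time regressions:

```python
import numpy as np
import statsmodels.api as sm

# Filter sketch: regress the outcome on each feature alone, record the p-value
# and R-squared, and rank features by them. Data and signal are made up.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=200)

ranking = []
for j in range(X.shape[1]):
    xj = sm.add_constant(X[:, j])          # intercept + a single feature
    fit = sm.OLS(y, xj).fit()
    ranking.append((j, fit.pvalues[1], fit.rsquared))

# Lowest p-value (equivalently, highest R-squared) first.
ranking.sort(key=lambda r: r[1])
for j, p, r2 in ranking:
    print(f"feature {j}: p-value={p:.3g}, R^2={r2:.3f}")
```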
Wrappers
• Wrapper feature selection tries to find subsets of features, of some fixed size,
that will do the trick.
• As anyone who has studied combinations and permutations knows, the number of
possible size-k subsets of n things, the binomial coefficient "n choose k,"
grows exponentially.
• So there's an opportunity for overfitting by doing this.
• There are two aspects to wrappers that you need to consider:
1) selecting an algorithm to use to select features and
2) deciding on a selection criterion or filter to decide that your set of features is
“good.”
Selecting an algorithm
• A set of algorithms fall under the category of stepwise
regression: a method for feature selection that involves
selecting features, according to some selection criterion, by
either adding or subtracting features to a regression model in
a systematic way.
There are three primary methods of stepwise regression:
•forward selection,
• backward elimination
•combined approach (forward and backward).
Forward selection
• In forward selection you start with a regression model with no
features, and gradually add one feature at a time according to which
feature improves the model the most based on a selection criterion.
• Build all possible regression models with a single predictor.
• Pick the best.
• Now try all possible models that include that best predictor and a
second predictor. Pick the best of those.
• You keep adding one feature at a time, and you stop when your
selection criterion no longer improves, but instead gets worse.
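Below is one possible sketch of forward selection, using cross-validated R² as the selection criterion; the data, the scoring choice, and the stopping rule are illustrative assumptions rather than the only way to do it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Forward selection sketch: greedily add the single feature that most improves
# a cross-validated R^2 score, and stop when no addition helps.
# The data is synthetic; only features 0 and 3 actually matter here.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(size=300)

selected, best_score = [], -np.inf
while True:
    candidates = [j for j in range(X.shape[1]) if j not in selected]
    if not candidates:
        break
    scores = {
        j: cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5).mean()
        for j in candidates
    }
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best_score:   # selection criterion no longer improves
        break
    selected.append(j_best)
    best_score = scores[j_best]

print("selected features:", selected, "cross-validated R^2:", round(best_score, 3))
```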
Backward elimination
•In backward elimination you start with a regression model
that includes all the features, and you gradually remove one
feature at a time according to the feature whose removal
makes the biggest improvement in the selection criterion.
• You stop removing features when removing the feature
makes the selection criterion get worse
Combined approach
• Most subset methods are capturing some flavor of minimum
redundancy-maximum-relevance.
• for example, you could have a greedy algorithm that starts with the
best feature, takes a few more highly ranked, removes the worst, and
so on.
• This is a hybrid approach with a filter method.
Selection criterion
• Different selection criterion might produce wildly different models, and it’s
part of your job to decide what to optimize for and why
R-squared
R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²
• It can be interpreted as the proportion of variance explained by your model.
• R-squared is a statistical measure in a regression model that determines the
proportion of variance in the dependent variable that can be explained by
the independent variables.
• In other words, R-squared shows how well the data fit the regression model
(the goodness of fit).
p-values
• In the context of regression, where you're trying to estimate
coefficients (the βs), thinking in terms of p-values works as follows.
• Make the assumption of a null hypothesis under which the βs are zero.
• For any given β, the p-value captures the probability of observing the data
that you observed, and obtaining the test-statistic (in this case the
estimated β) that you got under the null hypothesis.
• Specifically, if you have a low p-value, it is highly unlikely that you
would observe such a test-statistic if the null hypothesis actually held.
• This translates to meaning that (with some confidence) the coefficient is
highly likely to be non-zero.
AIC (Akaike Information Criterion)
• AIC = 2k − 2 ln(L)
• where k is the number of parameters in the model and ln(L) is the
maximized value of the log likelihood. The goal is to minimize AIC.
BIC (Bayesian Information Criterion)
• BIC = k·ln(n) − 2 ln(L)
• where k is the number of parameters in the model, n is the number of
observations (data points, or users), and ln(L) is the maximized value of the
log likelihood. The goal is to minimize BIC.
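A small sketch computing these two criteria directly from their formulas, given a model's maximized log-likelihood; the numbers used are made up for illustration.

```python
import numpy as np

def aic(log_likelihood: float, k: int) -> float:
    """AIC = 2k - 2 ln(L); smaller is better."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood: float, k: int, n: int) -> float:
    """BIC = k ln(n) - 2 ln(L); smaller is better."""
    return k * np.log(n) - 2 * log_likelihood

# Illustrative numbers only: a model with 4 parameters, 1000 observations,
# and a maximized log-likelihood of -520.
print(aic(-520.0, k=4), bic(-520.0, k=4, n=1000))
```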
Entropy (discussed below in the context of decision trees)
Embedded Methods: Decision Trees
• Decision trees have an intuitive appeal because outside the context of
data science in our every day lives, we can think of breaking big
decisions down into a series of questions
Consider, for example, a student deciding how to spend an evening.
This decision is actually dependent on a bunch of factors:
• whether or not there are any parties or deadlines,
• how lazy the student is feeling,
• and what they care about most (parties).
• The interpretability of decision trees is one of the best features about
them.
• In the context of a data problem, a decision tree is a classification
algorithm.
• For the Chasing Dragons example, you want to classify users as “Yes, going
to come back next month” or “No, not going to come back next month.”
• This isn’t really a decision in the colloquial sense, so don’t let that throw
you.
• You know that the class of any given user is dependent on many factors
(number of dragons the user slew, their age, how many hours they already
played the game).
• But how do you construct decision trees from data, and what mathematical
properties can you expect them to have?
• You want this tree to be based on data and not just on what you feel like.
• Choosing a feature to pick at each step is like playing the game 20
Questions really well.
• You take whatever the most informative thing is first.
• Let’s formalize that—we need a notion of “informative.”
• Assume we break compound questions into multiple yes-or-no questions,
and we denote the answers by “0” or “1.”
• Given a random variable X, we denote by p(X =1) and p(X=0) the
probability that X is true or false, respectively
Entropy
• To quantify what the most "informative" feature is, we define entropy
(effectively a measure of how mixed up something is) for X as follows:
• H(X) = −p(X=1) log₂(p(X=1)) − p(X=0) log₂(p(X=0))
• Note that when p(X=1) = 0 or p(X=0) = 0, the entropy vanishes, consistent
with the fact that lim_{t→0} t·log(t) = 0.
• If either option has probability zero, the entropy is 0.
• Moreover, because p(X=1) = 1 − p(X=0), the entropy is symmetric about
0.5 and maximized at 0.5, which we can easily confirm using a bit of
calculus.
• entropy is a measurement of how mixed up something is.
• So, for example, if X denotes the event of a baby being born a boy,
we’d expect it to be true or false with probability close to 1/2, which
corresponds to high entropy,
• The bag of babies from which we are selecting a baby is highly mixed.
• But if X denotes the event of a rainfall in a desert, then it’s low
entropy. In other words, the bag of day-long weather events is not
highly mixed in deserts.
• Using this concept of entropy, we will be thinking of X as the target of
our model.
• So, X could be the event that someone buys something on our site.
• We’d like to know which attribute of the user will tell us the most
information about this event X.
• We will define the information gain, denoted IG(X, a), for a given
attribute a, as the entropy we lose if we know the value of that
attribute: IG(X, a) = H(X) − H(X | a)
• To compute this we need to define H(X | a).
• For any actual value a₀ of the attribute a, we can compute the specific
conditional entropy H(X | a=a₀) as:
• H(X | a=a₀) = −p(X=1 | a=a₀) log₂(p(X=1 | a=a₀)) − p(X=0 | a=a₀) log₂(p(X=0 | a=a₀))
• and then we can put it all together, for all possible values of a, to get
the conditional entropy:
• H(X | a) = Σ_{aᵢ} p(a=aᵢ) · H(X | a=aᵢ)
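A minimal sketch of these formulas in code, computing entropy and information gain for a binary target and a binary attribute on made-up data:

```python
import numpy as np
import pandas as pd

def entropy(y: pd.Series) -> float:
    """H(X) = -sum_v p(X=v) log2 p(X=v), with 0*log(0) treated as 0."""
    p = y.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def information_gain(y: pd.Series, a: pd.Series) -> float:
    """IG(X, a) = H(X) - sum_{a0} p(a=a0) * H(X | a=a0)."""
    h_conditional = sum(
        (len(y_sub) / len(y)) * entropy(y_sub)
        for _, y_sub in y.groupby(a)
    )
    return entropy(y) - h_conditional

# Toy Chasing Dragons-style data (made up): does "filled profile" predict return?
returned = pd.Series([1, 1, 0, 0, 1, 0, 1, 0])
filled_profile = pd.Series([1, 1, 0, 0, 1, 0, 0, 0])
print(information_gain(returned, filled_profile))
```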
Decision Tree Algorithm
• Algorithm to decide which attribute to split each node on:
• select the attribute that maximizes information gain.
• Prune the tree to avoid overfitting.

Handling continuous variables in decision trees
• A threshold is used to convert a continuous variable into a binary predictor.
• For example, partition the values into "less than 10" or "at least 10".
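As an illustration (not part of the original module), scikit-learn's decision tree picks such thresholds on continuous features automatically while maximizing information gain (criterion="entropy"); the feature names and data below are invented.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Sketch: the tree learns a split such as "points_day_1 <= 9.5" on its own,
# effectively turning the continuous feature into a binary predictor.
rng = np.random.default_rng(3)
X = rng.integers(0, 40, size=(200, 2))   # e.g. points on day 1, dragons slain
y = (X[:, 0] >= 10).astype(int)          # users with >= 10 points tend to return

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=["points_day_1", "dragons_slain"]))
```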
Random Forests
• Random forests are supervised algorithms that generalize decision trees with
bagging, otherwise known as bootstrap aggregating.
• A bootstrap sample is a sample with replacement, which means we
might sample the same data point more than once.
• We usually take the sample size to be 80% of the size of the entire
(training) dataset, but of course this parameter can be adjusted
depending on circumstances.
• This is technically a third hyperparameter of our random forest
algorithm.
To construct a random forest, you construct N decision trees as
follows:
• For each tree, take a bootstrap sample of your data, and for each node
you randomly select F features, say 5 out of the 100 total features.
• Then you use your entropy-information-gain engine as described in
the previous section to decide which among those features you will
split your tree on at each stage
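A sketch of that recipe with scikit-learn's RandomForestClassifier, mapping N, F, and the bootstrap sample size onto its parameters; the data is synthetic and the parameter values are only examples.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the per-user feature table.
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + X[:, 5] > 0).astype(int)

forest = RandomForestClassifier(
    n_estimators=100,   # N decision trees
    max_features=5,     # F features randomly considered at each split
    bootstrap=True,
    max_samples=0.8,    # bootstrap sample size as a fraction of the training data
    random_state=0,     # (max_samples needs a reasonably recent scikit-learn)
)
forest.fit(X, y)
print("training accuracy:", forest.score(X, y))
```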
Dimensionality Reduction
• Many real problems in machine learning involve datasets with
thousands or even millions of features.
• Training models on such datasets can be computationally demanding, and
interpreting the resulting solutions can be even more challenging.
• As the number of features increases, the data points become more
sparse and distance metrics become less informative: the
distances between points are less pronounced, making it difficult to
distinguish close points from distant ones.
• This is known as the curse of dimensionality.
• Sparser data also makes models harder to train and more prone to
overfitting, capturing noise rather than the underlying patterns.
• This leads to poor generalization to new, unseen data.
• Dimensionality reduction is used in data science and machine learning to
reduce the number of variables or features in a dataset while retaining as
much of the original information as possible.
• This technique is useful for simplifying complex datasets, improving
computational efficiency, and helping with data visualization.
Singular Value Decomposition (SVD)
• Given an m×n matrix X of rank k, it is a theorem from linear algebra
that we can always decompose it into the product of three matrices as
follows:
• X = U S V^T
• where U is m×k, S is k×k, and V is n×k (so V^T is k×n).
• The columns of U and V are pairwise orthogonal, and S is diagonal.
• Note that the standard statement of SVD is slightly more involved: it has
U and V both square unitary matrices, and the middle "diagonal"
matrix rectangular.
• Let’s apply the preceding matrix decomposition to our situation.
• Consider original dataset X , which has users’ ratings of items.
• We have m users, n items, and k would be the rank of X, and
consequently would also be an upper bound on the number d of latent
variables we decide to care about—
• We choose d, whereas m, n, and k are determined by our training dataset.
• d is the tuning parameter.
• Each row of U corresponds to a user, whereas V has a row for each item.
• The values along the diagonal of the square matrix S are called the
“singular values.”
• They measure the importance of each latent variable—the most important
latent variable has the biggest singular value.
Properties of SVD
• The columns of U and V are orthogonal to each other, so the columns can be
reordered according to the singular values.
• The dimensions are then ordered by importance, from highest to lowest.
• Cut off the least important parts of S, U, and V, keeping only the top d
dimensions; d is chosen carefully and is much smaller than k. After this
truncation we no longer have the original X exactly.
• Fill the empty cells of X with the average rating for that item and compute the SVD
• not with 0, since 0 is reserved for some meaning.
X = U S V^T
• Predict a rating by looking up the (user, item) entry of the reconstructed X.
• SVD is computationally expensive.
• The missing-data problem from KNN is still present.
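A small numpy sketch of this item-mean fill plus truncated SVD reconstruction on a tiny made-up ratings matrix:

```python
import numpy as np

# Sketch: fill missing ratings with each item's mean, take a truncated SVD
# keeping the top d singular values, and read predicted ratings off the
# reconstruction. The tiny ratings matrix is invented; 0 marks "no rating"
# only as a placeholder before filling and never enters the SVD itself.
R = np.array([
    [5, 4, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

mask = R > 0
item_means = (R * mask).sum(axis=0) / mask.sum(axis=0)
X = np.where(mask, R, item_means)            # fill empty cells with item averages

U, s, Vt = np.linalg.svd(X, full_matrices=False)
d = 2                                        # number of latent variables to keep
X_hat = U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]

print("predicted rating of user 1 for item 2:", round(X_hat[1, 2], 2))
```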
Principal Component Analysis

• The PCA reduces the number of features in a dataset while keeping


most of the useful information by finding the axes that account for the
largest variance in the dataset.
• Those axes are called the principal components.
• Since PCA aims to find a low-dimensional representation of a dataset
while keeping a great share of the variance instead of performing
predictions, It is considered an unsupervised learning algorithm.
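A minimal PCA sketch with scikit-learn on random data, keeping the two components that capture the most variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 10-dimensional data; PCA projects it onto the 2 axes
# (principal components) along which the data varies the most.
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 10))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (300, 2)
print(pca.explained_variance_ratio_)    # share of variance kept per component
```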
• Find U and V such that X = U · V^T.
• Your optimization problem is that you want to minimize the discrepancy
between the actual X and your approximation to X via U and V, measured
via the squared error:
• argmin_{U,V} Σ_{i,j} (x_{i,j} − u_i · v_j)²
• Here u_i denotes the row of U corresponding to user i, and similarly v_j
denotes the row of V corresponding to item j.
• Items can include metadata information (so the age vectors of all the users
will be a row in V).
• Then the dot product u_i · v_j is the predicted preference of user i for item j,
and you want that to be as close as possible to the actual preference x_{i,j}.
• Find the best choices of U and V that minimize the squared differences
between prediction and observation on everything you actually know.
• Choose a parameter d, defined as how many latent features you want to
use.
• U will have a row for each user and a column for each latent feature,
and the matrix V will have a row for each item and a column for each
latent feature.
• d is typically about 100 or fewer; making it much larger mainly increases
complexity.
The resulting latent features will be uncorrelated.
Proof sketch:
1. Find U and V with U·V^T = X such that the squared error is minimized.
2. Among those, find U and V whose entries are small.
3. You can modify U by any invertible d×d matrix G and modify V^T by the
inverse of G without changing the product:
U · V^T = (U·G) · (G⁻¹ · V^T) = X
so G can always be chosen to make the resulting latent features uncorrelated.
Algorithm
• Pick a random V
• Optimize U while V is fixed.
• Optimize V while U is fixed.
• Keep doing the preceding two steps until you’re not changing very
much at all.
• To be precise, you choose an ϵ and if your coefficients are each
changing by less than ϵ, then you declare your algorithm “converged.”
Fix V and Update U
• For user i, you want to find: argmin_{u_i} Σ_{j∈P_i} (p_{i,j} − u_i · v_j)²
• where v_j is fixed and P_i is the set of items user i has expressed preferences
for. In other words, you just care about this user for now.
• Set u_i = (V_{*,i}^T V_{*,i})⁻¹ V_{*,i}^T P_{*,i}
• where V_{*,i} is the subset of V for which you have preferences coming
from user i. Taking the inverse is easy because it's d×d, which is small.
• When you fix U and optimize V, it's analogous—you only ever have
to consider the users that rated that movie, which may be pretty large
for popular movies but on average isn't; but even so, you're only ever
inverting a d×d matrix.
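A compact numpy sketch of the alternating updates; for brevity it assumes a fully observed X (a real recommender would restrict each update to the rated entries, as in the u_i formula above), and it runs a fixed number of iterations instead of the ε-based convergence check.

```python
import numpy as np

# Alternating least squares sketch for X ≈ U V^T on synthetic data.
rng = np.random.default_rng(6)
m, n, d = 30, 20, 3
X = rng.normal(size=(m, d)) @ rng.normal(size=(d, n))   # a rank-d "ratings" matrix

U = rng.normal(size=(m, d))
V = rng.normal(size=(n, d))                              # start with a random V

for _ in range(50):
    # Fix V, update U: each row u_i solves a small d x d least-squares problem.
    U = X @ V @ np.linalg.inv(V.T @ V)
    # Fix U, update V: the symmetric step for each item.
    V = X.T @ U @ np.linalg.inv(U.T @ U)

print("reconstruction error:", np.linalg.norm(X - U @ V.T))
```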
