In this project, I use Principal Component Analysis (PCA) to compress 100 unlabeled, sparse features into a more manageable number for classifying buyers of Ed Sheeran’s latest album.
Table of Contents
- Table of Contents
- Project Overview
- Data Overview
- PCA Overview
- Data Preparation
  - Split Out Data For Modeling
  - Feature Scaling
- Fitting PCA
- Analysis Of Explained Variance
- Applying The PCA Solution
- Classification Model
  - Training The Classifier
  - Classification Performance
- Application
- Growth & Next Steps
Project Overview
Context
The client is looking to promote Ed Sheeran’s new album and be as efficient as possible with their marketing budget by targeting customers based on their music listening habits.
As a proof-of-concept they would like to have a classification model for customers who purchased Ed’s last album based upon a small sample of listening data they have acquired for some of their customers at that time.
If this can be done successfully, they will look to purchase up-to-date listening data, apply the model, and use the predicted probabilities to promote to customers who are most likely to purchase.
The sample data is short but wide. It contains only 356 customers, but for each customer there are 100 columns representing the percentage of historical listening time allocated to each of 100 artists. On top of this, the columns do not identify the artists themselves, instead being labeled artist1, artist2, etc.
This data will need to be compressed into something more manageable for classification!
Actions
Firstly, I needed to bring in the required data - both the historical listening sample and the flag showing which customers purchased Ed Sheeran’s last album. I made sure to split my data into a training set and a test set for classification purposes. For PCA, I also scaled the data so that all features exist on the same scale.
Then, I applied PCA without specifying the number of components, allowing me to examine and plot the percentage of explained variance for every number of components. Based on this, I decided to limit my dataset to the number of components that make up 75% of the variance of the initial feature set, rather than a specific number of components. I applied this rule to both my training set (using fit_transform) and my test set (using transform only).
With this new, compressed dataset, I applied a Random Forest Classifier to predict the sales of the album, and I assessed the predictive performance.
Results
Based upon an analysis of variance vs. components - I made a decision to keep 75% of the variance of the initial feature set, which meant the number of features was reduced from 100 down to 24.
Using these 24 components, I trained a Random Forest Classifier which was able to predict customers who would purchase Ed Sheeran’s last album with a Classification Accuracy of 93%.
Growth/Next Steps
I only tested one type of classifier here (Random Forest) - it would be worthwhile testing others. I also only used the default classifier hyperparameters - I would want to optimize these.
Here, I selected 24 components based upon the fact this accounted for 75% of the variance of the initial feature set. I would instead look to search for the optimal number of components to use based on classification accuracy.
Github Repo
View the Github Repo here: Album Sales Prediction Github Repo
Data Overview
The dataset contains only 356 customers, but 102 columns.
In the code below, we:
- Import the required python packages & libraries
- Import the data from the database
- Drop the ID column for each customer
- Shuffle the dataset
- Analyze the class balance between album buyers, and non album buyers
# import required packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
# load the data for modeling
df_albums = pd.read_csv('../data/Album_Sales_Data/sample_data_pca.csv')
# drop the user_id column
df_albums = df_albums.drop('user_id', axis=1)
# shuffle the data
df_albums = shuffle(df_albums, random_state=42)
# drop missing values since only a few are present
df_albums = df_albums.dropna(how='any')
# analyze the class imbalance
df_albums['purchased_album'].value_counts(normalize=True)
The last step in the above code shows that 53% of customers in the sample purchased Ed’s last album, and 47% did not. Since the classes are fairly balanced, Classification Accuracy can be used for assessing the performance of the classification model later on.
After these steps, the dataset looks like the below sample (not all columns shown):
| purchased_album | artist1 | artist2 | artist3 | artist4 | artist5 | artist6 | artist7 | … |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.0278 | 0 | 0 | 0 | 0 | 0.0036 | 0.0002 | … |
| 1 | 0 | 0 | 0.0367 | 0.0053 | 0 | 0 | 0.0367 | … |
| 1 | 0.0184 | 0 | 0 | 0 | 0 | 0 | 0 | … |
| 0 | 0.0017 | 0.0226 | 0 | 0 | 0 | 0 | 0 | … |
| 1 | 0.0002 | 0 | 0 | 0 | 0 | 0 | 0 | … |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
| 1 | 0.0042 | 0 | 0 | 0 | 0 | 0 | 0 | … |
| 0 | 0 | 0 | 0.0002 | 0 | 0 | 0 | 0 | … |
| 1 | 0 | 0 | 0 | 0 | 0.1759 | 0 | 0 | … |
| 1 | 0.0001 | 0 | 0.0001 | 0 | 0 | 0 | 0 | … |
| 1 | 0 | 0 | 0 | 0.0555 | 0 | 0.0003 | 0 | … |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … |
The data is at the customer level. There is a binary column showing whether the customer purchased the prior album, followed by 100 columns containing the percentage of historical listening time allocated to each artist. The names of the artists are not known.
From the above sample, I can also see the sparsity of the data: customers do not listen to all artists, and therefore many of the values are 0.
PCA Overview
Principal Component Analysis (PCA) is often used as a Dimensionality Reduction technique that can reduce a large set of variables down to a smaller set, that still contains most of the original information.
In other words, PCA takes a high number of dimensions, or variables, and boils them down into a much smaller number of new variables - each of which is called a principal component. These new components are somewhat abstract - they are blends of the original features that the PCA algorithm found to be correlated. By blending the original variables rather than just removing them, the hope is to keep much of the key information that was held in the original feature set.
Dimensionality reduction techniques like PCA are mainly used to simplify the space in which to analyze. Attempting to apply the k-means clustering algorithm (for example) across hundreds or thousands of features can be computationally expensive; PCA reduces this vastly while maintaining much of the key information contained in the data. But PCA doesn’t only have applications within the realm of unsupervised learning; it can just as easily be applied to the set of input variables in a supervised learning approach - and that is how I will apply it to solve this problem!
In supervised learning, Feature Selection is typically used to remove variables that are not deemed important in predicting the output. PCA is often used in a similar way, although in this case I will not explicitly remove variables - I will simply create a smaller number of new ones that contain much of the information held in the original set.
Business consideration of PCA: It is much more difficult to interpret the outputs of a predictive model that is based upon component values rather than the original variables.
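To make the mechanics concrete before applying PCA to the real data, below is a minimal, self-contained sketch on synthetic data (the toy dataset, component count, and variable names are illustrative assumptions, not part of this project):
# minimal PCA sketch on synthetic data (purely illustrative - values are made up)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# build 200 rows of 10 features: 9 correlated columns plus 1 unrelated column
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 3))
X_toy = np.hstack([base + rng.normal(scale=0.1, size=(200, 3)) for _ in range(3)])
X_toy = np.hstack([X_toy, rng.normal(size=(200, 1))])
# standardize, then compress the 10 columns down to 3 principal components
X_scaled = StandardScaler().fit_transform(X_toy)
pca_toy = PCA(n_components=3)
X_compressed = pca_toy.fit_transform(X_scaled)
print(X_compressed.shape)                       # (200, 3)
print(pca_toy.explained_variance_ratio_.sum())  # proportion of variance retained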
Data Preparation
Split Out Data For Modeling
In the next code block I do two things: first, I split my data into an X object which contains only the predictor variables, and a y object that contains only the dependent variable.
Once I have done this, I split the data into training and test sets to ensure I can fairly validate the accuracy of the predictions on data that was not used in training. In this case, I have allocated 80% of the data for training, and the remaining 20% for validation. I make sure to add in the stratify parameter to ensure that both my training and test sets have the same proportion of customers who did, and did not, purchase the album. Thus I can be more confident in my assessment of predictive performance.
# split data into X and y objects for modeling
X = df_albums.drop(["purchased_album"], axis = 1)
y = df_albums["purchased_album"]
# split out training & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
Feature Scaling
Feature Scaling is extremely important when applying PCA - without it, variables measured on larger scales would dominate the principal components, so scaling ensures the algorithm can fairly “judge” the correlations between the variables and effectively create the principal components. The general consensus is to apply Standardization rather than Normalization.
The below code uses the in-built StandardScaler functionality from scikit-learn to apply Standardization to all of my variables.
In the code, I use fit_transform for the training set, but only transform for the test set. This means the standardization logic will learn its “rules” from the training data only, and then apply them to the test data. This is important in order to avoid data leakage, where information from the test set would otherwise influence the preprocessing - meaning I couldn’t fully trust the model performance metrics!
# standardize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Fitting PCA
I firstly apply PCA to my training set without limiting the algorithm to any particular number of components; in other words, I’m not explicitly reducing the feature space at this point.
Allowing all components to be created here allows me to examine & plot the percentage of explained variance for each, and assess which solution might work best for my task.
In the code below the PCA object is instantiated, and then fit to the training set.
# apply PCA
pca = PCA(n_components=None, random_state=42)
pca.fit(X_train)
Analysis Of Explained Variance
There is no right or wrong number of components to use - this is something that I need to decide based upon the scenario I’m working in. I know I want to reduce the number of features, but I need to trade this off with the amount of information I lose.
In the following code, I extract this information from the prior step where I fit the PCA object to the training data. I extract the variance for each component, and then do the same for the cumulative variance. These are plotted in the next step.
# view the explained variance ratio
explained_variance = pca.explained_variance_ratio_
# create the cumulative sum of the explained variance
cum_sum_explained_variance = explained_variance.cumsum()
In the following code, I create two plots - one for the variance of each principal component, and one for the cumulative variance.
# plot the variance explained by each component, and the cumulative explained variance
num_vars_list = list(range(1, len(explained_variance) + 1))
plt.figure(figsize=(15, 10))
plt.subplot(2, 1, 1)
plt.bar(num_vars_list, explained_variance)
plt.title('Variance across Principal Components')
plt.xlabel('Number of Components')
plt.ylabel('% Variance')
plt.subplot(2, 1, 2)
plt.plot(num_vars_list, cum_sum_explained_variance)
plt.title('Cumulative Variance across Principal Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative % Variance')
plt.tight_layout()
plt.show()
As I can see in the top plot, PCA works in a way where the first component holds the most variance, and each subsequent component holds less and less.
The second plot shows this as a cumulative measure - and I can see how many components I would need to retain in order to keep any given amount of variance from the original feature set.

Based upon the cumulative plot above, I can keep 75% of the variance from the original feature set with only around 25 components; in other words, with only a quarter of the original number of features, around three-quarters of the information is kept.
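As a quick cross-check on the plot, the same threshold can be read off programmatically from the cumulative array computed earlier (a small sketch, assuming numpy is imported as np - this is not part of the original workflow):
# cross-check: smallest number of components whose cumulative variance reaches 75%
import numpy as np
n_components_75 = int(np.argmax(cum_sum_explained_variance >= 0.75)) + 1
print(n_components_75)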
Applying The PCA Solution
After running the analysis of variance by component, the PCA solution can be applied.
In the code below, the PCA object is re-instantiated - this time passing n_components=0.75, so that scikit-learn keeps however many components are needed to retain 75% of the initial variance.
I then apply this solution to both the training set (using fit_transform) and the test set (using transform only).
Finally, based on this 75% threshold, I confirm how many components remain.
# apply PCA with 75% of the variance
pca = PCA(n_components=0.75, random_state=42)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
# view the number of components
print(pca.n_components_)
It turns out the read from the chart was almost exactly right: 24 principal components retain 75% of the variance from my initial feature set.
The X_train and X_test objects now contain 24 columns, each representing one of the principal components, which results in the X_train below:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | … |
|---|---|---|---|---|---|---|---|
| -0.402194 | -0.756999 | 0.219247 | -0.0995449 | 0.0527621 | 0.0968236 | -0.0500932 | … |
| -0.360072 | -1.13108 | 0.403249 | -0.573797 | -0.18079 | -0.305604 | -1.33653 | … |
| 10.6929 | -0.866574 | 0.711987 | 0.168807 | -0.333284 | 0.558677 | 0.861932 | … |
| -0.47788 | -0.688505 | 0.0876652 | -0.0656084 | -0.0842425 | 1.06402 | 0.309337 | … |
| -0.258285 | -0.738503 | 0.158456 | -0.0864722 | -0.0696632 | 1.79555 | 0.583046 | … |
| -0.440366 | -0.564226 | 0.0734247 | -0.0372701 | -0.0331369 | 0.204862 | 0.188869 | … |
| -0.56328 | -1.22408 | 1.05047 | -0.931397 | -0.353803 | -0.565929 | -2.4482 | … |
| -0.282545 | -0.379863 | 0.302378 | -0.0382711 | 0.133327 | 0.135512 | 0.131 | … |
| -0.460647 | -0.610939 | 0.085221 | -0.0560837 | 0.00254932 | 0.534791 | 0.251593 | … |
| … | … | … | … | … | … | … | … |
Here, column “0” represents the first component, column “1” represents the second component, and so on. These are the input variables that will be fed into the classification model to predict which customers purchased Ed Sheeran’s last album!
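If it helps readability, the component scores can optionally be wrapped back into labeled DataFrames before modeling (an optional sketch; the “PC1”, “PC2”, … naming is my own convention, not from the project):
# optional: wrap the component scores in labeled DataFrames for easier inspection
component_names = [f"PC{i+1}" for i in range(pca.n_components_)]
X_train_pca_df = pd.DataFrame(X_train, columns=component_names)
X_test_pca_df = pd.DataFrame(X_test, columns=component_names)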
Classification Model
Training The Classifier
To start with, I simply apply a Random Forest Classifier to see whether it is possible to predict album purchases based upon the set of 24 components.
In the code below, the Random Forest is instantiated using the default parameters and then fit to the training data.
# train Random Forest Classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
Classification Performance
In the code below, the trained classifier predicts on the test set, and I run a simple analysis of classification accuracy, comparing predictions vs. actuals.
# run accuracy score because balanced classes
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.3f}')
The result is a Classification Accuracy of 93% - in other words, using a classifier trained on 24 principal components, the model correctly predicts whether a test set customer purchased Ed Sheeran’s last album 93% of the time.
Application
Based upon this proof-of-concept, I can go back to the client and recommend that they purchase some up-to-date listening data. With that data, they can apply the same scaling and PCA solution, create the components, and predict which customers are likely to buy Ed’s next album - tailoring their marketing to those with the highest predicted probabilities.
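As a rough sketch of how that scoring step might look (the new_listening_df variable is hypothetical, and is assumed to contain the same 100 artist columns prepared in the same way as the training data):
# hypothetical scoring sketch - new_listening_df is an assumed DataFrame with the
# same 100 artist columns as the training data
new_scaled = scaler.transform(new_listening_df)             # apply the training scaling rules
new_components = pca.transform(new_scaled)                  # apply the 24-component PCA solution
purchase_probs = clf.predict_proba(new_components)[:, 1]    # probability of purchasing
# rank customers so the marketing budget targets the most likely buyers first
ranked_customers = (
    new_listening_df
    .assign(purchase_probability=purchase_probs)
    .sort_values("purchase_probability", ascending=False)
)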
Growth & Next Steps
Only one classifier (Random Forest) was tested here; it would be worthwhile testing others. Only the default classifier hyperparameters were used, and these could be optimized to improve performance.
Here, 24 components were selected because they accounted for 75% of the variance of the initial feature set. A natural next step would be to search for the number of components that maximizes classification accuracy, as sketched below.
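One way that search could be set up is with a scikit-learn Pipeline and GridSearchCV, so the scaling, PCA, and classifier are tuned together via cross-validation (a sketch only - the grid values below are illustrative assumptions, not tuned choices):
# illustrative sketch: searching the number of components (and a couple of Random
# Forest hyperparameters) by cross-validated accuracy
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(random_state=42)),
    ("clf", RandomForestClassifier(random_state=42)),
])
param_grid = {
    "pca__n_components": [10, 20, 24, 30, 40, 50],
    "clf__n_estimators": [100, 500],
    "clf__max_depth": [None, 4, 8],
}
# X and y are the unscaled objects created in the Data Preparation section
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)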