Thanks to visit codestin.com
Credit goes to www.scribd.com

0% found this document useful (0 votes)
27 views5 pages

PI Analysis

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
27 views5 pages

PI Analysis

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 5

Popularity Index Analysis

Marc Schmieder

2/28/2022

1. Exploration

1.1 Dataset attributes

This document summarizes a first draft from the Spotify Popularity Index (PI) analysis. In the current
state, 331 data points from the community were provided, whereas each data point equals one song at a
given time. The data points were anonymized by creating an artist/song ID. Overall, 31 variables (columns)
were provided or engineered through existing data. The column names are

## [1] "ArtistSongId"
## [2] "PopularityIndex"
## [3] "Timestamp"
## [4] "DaysSinceRelease"
## [5] "ReleaseDate"
## [6] "StreamsLast28Days"
## [7] "ListenersLast28Days"
## [8] "SavesLast28Days"
## [9] "StreamsAllTime"
## [10] "ListenersAllTime"
## [11] "NumberOfBlogsThatCoveredTheSong"
## [12] "NumberOfPlaylistsAllTime"
## [13] "EmailAddress"
## [14] "CurrentSpotifyFollowers"
## [15] "PopularityIndexSource"
## [16] "StreamsLast7Days"
## [17] "ListenersLast7Days"
## [18] "SavesLast7Days"
## [19] "DiscoverWeeklyStreamsLast28Days"
## [20] "DiscoverWeeklyStreamsLast7Days"
## [21] "ReleaseRadarStreamsLast28Days"
## [22] "ReleaseRadarStreamsLast7Days"
## [23] "NumberOfBlogsThatCoveredTheSong_num"
## [24] "StreamsLast28Days_PerListener"
## [25] "SavesLast28Days_PerListener"
## [26] "StreamsAllTime_PerListener"
## [27] "NumberOfPlaylistsAllTime_PerListener"
## [28] "StreamsLast7Days_PerListener"
## [29] "SavesLast7Days_PerListener"
## [30] "PopularityIndexSource_Bin"
## [31] "ReleasePassed21Days"

1
As to be seen later in the analysis, the variables

c("StreamsLast28Days", "ListenersLast28Days",
"StreamsLast7Days", "ListenersLast7Days")

## [1] "StreamsLast28Days" "ListenersLast28Days" "StreamsLast7Days"


## [4] "ListenersLast7Days"

are the most important ones for the later conducted model, predicting the Popularity Index

1.2 Mean values for different PI

As certain PI numbers are criterias for getting into algorithmic playlists, the following table shows the mean
values in the data for PI indices of 20-22, 30-32 and 40-42. With that information, an artist can estimate
what streams/listeners/saves they need to achieve those PI’s.

## Lower Upper StreamsLast28Days ListenersLast28Days SavesLast28Days


## 1: 20 22 2503.057 993.8529 375.5143
## 2: 30 32 9217.862 4097.1724 447.2759
## 3: 40 42 32185.889 13334.5556 1020.4444
## StreamsLast7Days ListenersLast7Days SavesLast7Days
## 1: 776.500 384.750 67.15000
## 2: 2348.966 1334.759 85.89655
## 3: 10506.778 5801.778 343.00000

1.3 Variables plottet vs PI

## Warning: Removed 126 rows containing missing values (geom_point).

2
Scaled variable vs PI

Variable StreamsLast28Days ListenersLast28Days StreamsLast7Days ListenersLast7Days

59

52
between 0 and 1 scaled values

46

39

33

26

20

13

0.00 0.11 0.22 0.33 0.44 0.56 0.67 0.78 0.89 1.00

between 0 and 1 scaled values


The grafik shows the between 0 and 1 scaled values of the Streams last 28/7 days and listeners last 28/7
days. It can be seen that the Popularity index is dependent in a (not perfect) quadratic function from those
variables.

2. Model
This is a regression problem where any model can predict continuous numbers, but the true values can only
be of integer type. As an algorithm, the XGboost regression tree was chosen, a (still) state of the art machine
learning algorithm that builds on tree boosting.
The model was trained on 263 obversations (75 percent) and fitted on 68 (25 percent) observations (never
seen by the model). The test set of 68 data points was sampled at random, but taking into account an equal
distribution between new songs (<21 days) and older songs.

2.1 Model performance

## [1] 1.524063

The mean absolute deviation (mad) of 1.52 states that the predictions of the model deviate in mean 1.52
from the true PI. For example if the true PI for one song is 32, we can expect, that the model will predict a
value that is around 33.5 or 30.5. So the model is working well but not perfect. There can be several reasons
for that the prediction of the model is not perfect.

1. Quality of data: false numbers entered or false PI entered

3
2. Time-delay of PI (updates every few days, also in Spotify for artist it is not updated in real time)
3. Rounding of PI: Internally the formula from Spotify probably results in a continues number (e.g. 24.32)
but is then rounded for the public display. This is sort of imperfect information that can bias the model.
4. There are factors/variables that contribute to the formula that are not publicly accessible/not in the
community driven dataset.
5. The algorithm itself lacks potential (model tuning could increase performance or choosing another
model like random forest or a neural net)

I personally suspect that it is a combination of 1-3 and think that 4 is rather unlikely.

2.3 Variable importance

Variable Importance in XGBoost model


StreamsLast28Days
ListenersLast28Days
ListenersAllTime
StreamsAllTime
SavesLast28Days_PerListener
StreamsLast7Days
DaysSinceRelease
StreamsAllTime_PerListener
CurrentSpotifyFollowers
ListenersLast7Days
Variable

StreamsLast7Days_PerListener
NumberOfPlaylistsAllTime_PerListener
SavesLast28Days
DiscoverWeeklyStreamsLast28Days
ReleaseRadarStreamsLast28Days
NumberOfPlaylistsAllTime
SavesLast7Days
StreamsLast28Days_PerListener
SavesLast7Days_PerListener
NumberOfBlogsThatCoveredTheSong_num
DiscoverWeeklyStreamsLast7Days
ReleaseRadarStreamsLast7Days
PopularityIndexSource_Bin
0.0 0.2 0.4 0.6 0.8
Gain in XGBoost model
The grafik shows the variable importance of the computed model. That is, how much each variable
contributes to the model. It is seen that streams last 28 days and listeners last 28 days are by far
the most important variables. It should be noted that other variables that are highly correlate with
StreamsLast28Days and ListenersLast28Days are important too, but the XGBoost model only needs one
representor for those sets of variables. It can be observed, that the number of saves do appear very late in
the variable importance ranking.

2.3 Graph of predictions

Lets take a look at the predictions and true values of the test data set.

4
Evaluation Plot: Comparison of predictions and true values
Model: XGboost

Type of point: actuals pred_xg

45

40

35
Popularity Index

30

25

20

15

10

0
0 5 10 15 20 25 30 35 40 45 50 55 60 65

Data point in unseen data

The grafik shows the prediction of the XGBoost model vs the actual values. Most observations were predicted
with decent accuracy. Only the predictions of some points show a serious deviation from the actual values.
It could be of merit to look into those data points to see which reasons 1-5 from 2.2 do apply.

You might also like