SpotHitPy: A Study For ML-Based Song Hit Prediction Using Spotify

Ioannis Dimolitsas, Spyridon Kantarelis and Afroditi Fouka
School of Electrical and Computer Engineering,
National Technical University of Athens, Athens, Greece
Email: [email protected], [email protected], [email protected]

arXiv:2301.07978v1 [cs.SD] 19 Jan 2023

Abstract—In this study, we approach the Hit Song Prediction problem, which aims to predict which songs will become Billboard hits. We gathered a dataset of nearly 18,500 hit and non-hit songs and extracted their audio features using the Spotify Web API. We tested four machine-learning models on our dataset and were able to predict the Billboard success of a song with approximately 86% accuracy. The most successful algorithms were Random Forest and Support Vector Machine.

Index Terms—Machine Learning, Music Information Retrieval, Hit Song Science, Binary Classification, Data Mining, Data Collection, Feature Selection

TABLE I: Dataset's Feature Specifications

Feature           Specification
id                the song's unique Spotify track ID
artist            the song's artist name
popularity        a value between 0 and 100, with 100 being the most popular
explicit          a boolean value indicating whether a track has explicit lyrics
album type        the type of the album: one of "album", "single", or "compilation"
danceability      a value from 0.0 to 1.0 describing how suitable a track is for dancing
energy            a value from 0.0 to 1.0 that represents a perceptual measure of intensity and activity
key               the musical key the track is in
loudness          the overall loudness of a track in decibels (dB)
mode              indicates the modality (major=1 or minor=0) of a track
speechiness       a value from 0.0 to 1.0 describing the amount of spoken words present in the track
acousticness      a value from 0.0 to 1.0 predicting whether the track is acoustic
instrumentalness  a value from 0.0 to 1.0 predicting whether the track is instrumental or contains vocals
liveness          a value from 0.0 to 1.0 that describes the presence of an audience in the track
valence           a value from 0.0 to 1.0 describing the musical positiveness conveyed by a track
tempo             the overall estimated tempo of a track in beats per minute (BPM)
duration_ms       the duration of the track in milliseconds
time_signature    an estimated overall time signature of a track

I. INTRODUCTION

Hit Song Science (HSS) is an active research topic in Music Information Retrieval (MIR). Its main focus is to predict whether a song will become a hit or not. Hit prediction is therefore useful to musicians, labels, and music vendors, since popular songs reach a large audience. Hit songs help labels and music vendors increase their profits, and help artists share their message with a broad audience. Our approach relies on the assumption that hit songs are similar with respect to their audio features. We gathered a dataset of hit and non-hit songs and their features using the Spotify Web API (1), and we then apply machine learning methods and algorithms to predict whether a song is a hit or not.

The rest of the paper is structured as follows: Section II describes related work on the HSS topic. Section III describes our approach to gathering the dataset. Section IV describes the machine learning methods and algorithms we used. In Section V we perform an extensive evaluation of the proposed framework, and in Section VI we conclude the paper.

II. RELATED WORK

As mentioned, Hit Song Prediction is an active topic in MIR. Raza and Nanath [1] concluded that there is no magic formula yet that can predict, before release, whether a song will be a hit. Various approaches have been introduced. Li-Chia Yang et al. [2] applied state-of-the-art deep learning techniques to the audio-based hit song prediction problem. Zangerle et al. [3] proposed a deep neural network that combines low- and high-level audio features of songs and distinguishes between the two to account for their particularities. Elena Georgieva et al. [4] used five machine-learning algorithms and managed to predict the Billboard success of a song with approximately 75% accuracy. Middlebrook and Sheik [5] tested four machine-learning models, achieving 88% accuracy in predicting Billboard success.

III. DATASET AND FEATURES

In this section, we discuss the data collection and processing. In particular, we describe the tools and methods used for data collection, together with the process of selecting and extracting the features of the sample. Furthermore, the techniques used for dataset preparation, such as normalization and augmentation, are described in detail below.

1 https://developer.spotify.com/documentation/web-api/
A. Acquiring the Data

Initially, we acquired the data that is necessary for the dataset preparation: the Billboard top 100 hits for every year from 2011 until 2021. This data is retrieved through the Billboard API using the Python library billboard.py (2). In particular, we construct a collection of 1000 hit songs (songs that appeared in some year's Billboard top 100), where each one is identified by its title and the corresponding artist, or artists in several cases. Afterwards, based on this collection, we perform HTTP requests to the Spotify API, using the spotipy (3) Python library, in order to retrieve information related to the 1000 Billboard hits. In response to each request, the Spotify API provided a set of 10 related objects in JSON format. The "track id", the "artist id" and the song's "popularity" were included in every object, among other basic song features. The object with the highest popularity was kept for the construction of the dataset.

Afterwards, we populate the dataset with random songs from Spotify. To achieve this, we generate queries of random characters as track titles. These queries were then posted as HTTP requests to the Spotify API. Each response message contained 50 songs, of which we randomly chose 10. In total, 2000 of these requests were posted, leading to 20,000 tracks collected from Spotify. As with the hit songs, the basic features of the non-hits came included in the response message as a JSON object.

Considering the aforementioned data, we proceeded to enrich the respective tracks with the corresponding features. To this end, the audio features functionality of spotipy has been utilized. For each acquired track object, and based on the id feature, which is unique for every object, we perform the corresponding API call to retrieve further features. Table I summarizes the features extracted from Spotify for each track, together with their specifications. The collected data forms the basis for the dataset that will be used by the classifiers. However, further processing is required to create a proper dataset for this purpose.

B. Dataset Preparation

To begin with, we performed a data cleanup, in order to take into account only unique tracks from Spotify. Track objects with no data for specific features were also discarded. After this initial process, a sample of approximately 18,000 Spotify songs remained, 861 of which were Billboard Top 100 hits sometime between 2011 and 2021.

Fig. 1: Unbalanced and Balanced Dataset

Obviously, this sample is unbalanced, as the classes "hit" and "non-hit" are not equally represented. Class imbalance can affect classification predictions when it is not managed properly; thus, balancing the dataset is mandatory before proceeding. One way to fight this issue is to generate new samples for the under-represented class. The most naive strategy is to generate new samples by randomly sampling, with replacement, from the currently available samples. The RandomOverSampler of the imbalanced-learn companion package to scikit-learn (4) offers such a scheme, and we utilized this technique.

Another issue that needs to be addressed is that the audio features Spotify provides are not on the same scale. As Table I shows, some features have boolean values, some have decibel values, some have a value between 0 and 100, and some between 0.0 and 1.0. Using scaling methods from the scikit-learn Python library, we scaled all numerical and boolean values to the range between 0 and 1.0.

TABLE II: PCA and Feature Coefficients

Principal Component  Most-Variance Feature  Coefficient
PC1                  explicit                0.728131
PC2                  mode                   -0.729190
PC3                  key                    -0.900311
PC4                  acousticness            0.651091
PC5                  valence                -0.810128
PC6                  danceability           -0.546361
PC7                  popularity             -0.713593
PC8                  tempo                  -0.834668
PC9                  instrumentalness        0.711965
PC10                 liveness                0.604598
PC11                 duration_ms             0.044111
PC12                 energy                 -0.454230
PC13                 loudness               -0.139564
PC14                 speechiness             0.239800
PC15                 time_signature         -0.050520

After standardization, we applied Principal Component Analysis (PCA) in order to identify the features that carry the most information for our classification problem, while reducing the dataset's dimensionality. It is critical to perform standardization prior to PCA, because PCA is quite sensitive to the variances of the initial variables: if there are large differences between the ranges of the initial variables, those with larger ranges will dominate those with smaller ranges, leading to biased results. Principal components are new variables constructed as linear combinations of the initial variables.

2 https://github.com/guoguo12/billboard-charts
3 https://spotipy.readthedocs.io/en/2.19.0/
4 https://scikit-learn.org/stable/
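The oversampling and scaling steps described above can be sketched in a few lines. This is a minimal, standard-library-only illustration of what RandomOverSampler and a min-max scaler do; the helper names (`oversample`, `min_max_scale`) and the toy loudness values are made up for the example, and the real pipeline used the library implementations directly.

```python
import random

def oversample(rows, labels, minority=1, seed=42):
    """Randomly duplicate minority-class rows (sampling with replacement)
    until both classes have the same number of samples."""
    rng = random.Random(seed)
    majority_n = sum(1 for y in labels if y != minority)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    extra = [rng.choice(minority_rows)
             for _ in range(majority_n - len(minority_rows))]
    return rows + extra, labels + [minority] * len(extra)

def min_max_scale(column):
    """Map a numeric feature (e.g. loudness in dB, popularity 0-100)
    onto the [0, 1] range."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

# Toy example: four "non-hit" (0) tracks and one "hit" (1) track,
# with a single made-up loudness feature per track.
rows = [[-7.2], [-5.1], [-12.4], [-9.0], [-3.3]]
labels = [0, 0, 0, 0, 1]

balanced_rows, balanced_labels = oversample(rows, labels)
scaled = min_max_scale([r[0] for r in rows])
```

After the call, both classes contain four samples, and every scaled feature lies between 0.0 and 1.0, which is the property the classifiers in Section IV rely on.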
In our 15-dimensional dataset, 15 principal components emerge, as shown in Table II, which also lists the most-variance feature of each component. The principal components are constructed in such a way that they are uncorrelated and most of the information within the initial features is compressed into the first components. Figure 2 illustrates the cumulative explained variance against the number of principal components. As it shows, the first 10 components contribute approximately 98% of the variance, so the rest (11-15) can be ignored.

Fig. 2: PCA Cumulative Explained Variance

IV. METHODS

In this section, the machine learning methods used in this paper are presented: (i) Random Forest (RF), (ii) Support Vector Machine (SVM), (iii) Logistic Regression (LR), (iv) k-Nearest Neighbors (kNN).

A. Random Forest

Random Forest (RF) models [6] are among the most popular classification methods and aim to overcome the main problem of decision trees: over-fitting. In more detail, they involve a set of generalized classification trees which are trained on different aspects of the dataset and on randomly selected data. Their purpose is to decrease the overall variance on the remaining training data. In other words, this method tries to produce homogeneous subsets of the primary dataset by binary splitting it sequentially. The advantages of RF models are their high accuracy, simplicity in training, and robustness against outliers. Among their drawbacks is the fact that the delivered function is often discrete rather than smooth. After training, the prediction for an unseen sample y is obtained by averaging the predictions of the B individual classification trees g_b on y:

    f = (1/B) Σ_{b=1}^{B} g_b(y)    (1)

B. Support Vector Machine

The support-vector machine (SVM) [7] was initially created to solve classification problems. Its purpose is to find the optimal hyper-plane that divides the data into two recognizable and well-defined classes. When the training data are sparse, the SVM can effectively improve its quality of fit, but the computational demands increase at the same time. The Gaussian Radial Basis Function (RBF) is commonly used as the algorithm's kernel:

    K(x, y) = e^{−g‖x−y‖²},    (2)

where g is a kernel parameter.

C. Logistic Regression

Logistic Regression (LR) [8] is a commonly used classification algorithm when the target variable is categorical. This method aims to model the relationship between the features and the desired output. Generally, there exist two types of logistic regression problems: binary and multi-class. In its basic form, the logistic regression model uses a logistic function to model a binary dependent variable. This function maps the log-odds to probabilities, and the final labels are "0" and "1". The logistic function is described by the equation

    P(x) = 1 / (1 + e^{−(x−m)/s})    (3)

where m is a location parameter and s is a scale parameter.

D. k-Nearest Neighbors

The k-Nearest Neighbors (kNN) algorithm [9] is a non-parametric classification algorithm for supervised problems. It is applied to labeled data, which are then categorized into classes. The algorithm is commonly used as a classifier and is widely known for its simplicity and effectiveness; nevertheless, it can also be used for regression. In classification, the algorithm uses a metric such as the Hamming distance to determine the class to which an object belongs, while in regression it usually uses the Euclidean distance. Finally, the value of the parameter k is important for performance, since it affects the boundaries between the classes. Generally, there is no strict rule for choosing it, and it depends on the data. In a binary classification problem, the output can be calculated as the class with the highest frequency among the k most similar instances. For example, the probability of class 0 is estimated as follows:

    P(class = 0) = count(class = 0) / (count(class = 0) + count(class = 1))    (4)

V. EXPERIMENTS/RESULTS/DISCUSSION

In this section we present the training process alongside an extensive evaluation of the machine learning models described in Section IV.

Firstly, we created the train, validation, and test sets from the final version of the dataset, which contains scaled data with a balanced number of samples per class.
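Equations (2)-(4) above can be checked with a few lines of Python. This is an illustrative sketch only; the function names are ours, and the experiments themselves used library implementations rather than this code.

```python
import math

def rbf_kernel(x, y, g=1.0):
    """Gaussian RBF kernel of Eq. (2): exp(-g * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-g * sq_dist)

def logistic(x, m=0.0, s=1.0):
    """Logistic function of Eq. (3) with location m and scale s."""
    return 1.0 / (1.0 + math.exp(-(x - m) / s))

def knn_class0_probability(neighbor_labels):
    """Eq. (4): fraction of the k nearest neighbors labeled 0."""
    count0 = neighbor_labels.count(0)
    count1 = neighbor_labels.count(1)
    return count0 / (count0 + count1)
```

As sanity checks: the kernel of a point with itself is 1 (the exponent vanishes), the logistic curve passes through 0.5 at x = m, and a neighborhood of five labels with three zeros yields P(class = 0) = 0.6.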
Fig. 3: Confusion Matrices for Validation Set. (a) Random Forest Classifier, (b) Optimized kNN Classifier, (c) Optimized SVM Classifier.

Fig. 4: Confusion Matrices for Test Set. (a) Random Forest Classifier, (b) Optimized kNN Classifier, (c) Optimized SVM Classifier.

TABLE III: Model Metrics on Validation Set

Classifier                        Accuracy  Precision  Recall
Support Vector Machine            0.73      0.70       0.83
Logistic Regression               0.66      0.64       0.76
Random Forest                     0.86      0.82       0.94
k-Nearest Neighbors (n=25)        0.76      0.71       0.90
Support Vector Machine - opt      0.83      0.79       0.92
k-Nearest Neighbors (n=25) - opt  0.80      0.71       0.90

The train and validation sets were produced by splitting our balanced dataset in a 7:3 ratio, so approximately 25,000 music tracks were used to train the discussed models. The test set was created by combining the 861 unique hit songs with 861 randomly chosen unique non-hit songs.

For the evaluation, we use three different metrics: accuracy, precision, and recall (5).

Accuracy describes how the model performs across all classes. It is calculated as the ratio of the number of correct predictions to the total number of predictions:

    Accuracy = (TruePositive + TrueNegative) / (Total number of predictions)

Precision is calculated as the ratio of the number of positive samples correctly classified to the total number of samples classified as positive (either correctly or incorrectly):

    Precision = TruePositive / (TruePositive + FalsePositive)

Recall is calculated as the ratio of the number of positive samples correctly classified as positive to the total number of positive samples:

    Recall = TruePositive / (TruePositive + FalseNegative)

The results on our validation set are shown in Table III. We observe that the best-performing classifiers are the Random Forest, the optimized SVM, and the optimized kNN. In an effort to achieve high performance with the kNN algorithm, we used cross-validation and found that the optimal number of neighbors is 25. Regarding the SVM model, the regularization parameter is set to 10, with an RBF kernel. Confusion matrices for the performance of these classifiers on our validation set are shown in Figure 3.

5 https://blog.paperspace.com/deep-learning-metrics-precision-recall-accuracy/
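The three metrics reduce to simple ratios over the confusion-matrix counts, and can be sketched as follows. The counts used in the example are hypothetical and chosen for round numbers; they are not the paper's results.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Correctly predicted positives over all predicted positives."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Correctly predicted positives over all actual positives."""
    return tp / (tp + fn)

# Hypothetical confusion counts for a hit/non-hit classifier:
# 75 hits and 75 non-hits classified correctly, 25 of each missed.
tp, tn, fp, fn = 75, 75, 25, 25
```

With these counts, all three metrics come out to 0.75, which illustrates why a table like Table III reports the metrics separately: precision and recall diverge exactly when the false-positive and false-negative counts differ.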
In order to achieve better results, we used pipeline methods for the SVM and kNN classifiers. As Table III shows, using pipelines increased all metrics of these two algorithms by an average of 9%.

Aiming to evaluate the adaptability of our methods, we tested the three best-performing classifiers on our test set. Result metrics are shown in Figure 5, while confusion matrices are shown in Figure 4. We see that Random Forest outperforms SVM and kNN in precision, while the other two models display higher recall values and better accuracy.

Fig. 5: Model Metrics for Test Set

Considering that music trends change constantly throughout the years [10], the Hit Song Prediction problem is highly complex. In our study we used hit songs from the last decade in order to focus on contemporary music trends. The high recall values of our models mean that we have a low number of false negatives. This is because the audio features collected from Spotify offer an interpretation of how a song sounds, without regard to the music trends of the year it was released. Thus, models trained on these features make more accurate predictions about a song being a non-hit than about a song becoming a hit. Improving our models' precision, meaning that our models will better predict whether a song will become a hit, requires implementing features that take the current music trends into consideration.

Therefore, the results can be summarized in these questions:

(i) Would the #1 hit song of 2011 be a hit if it was released in 2021?

(ii) Does predicting that the #1 hit song of 2011 would be a hit in 2021 imply over-fitting?

(iii) Could a non-hit song of 2011 be a hit in 2021 if its features matched the dominant features of hit songs of 2021?

VI. CONCLUSION & FUTURE WORK

Our study showed that the highest-performing ML algorithm for Hit Song Prediction was Random Forest, which achieved 86% accuracy. Random Forest achieved high precision on both the validation and test sets, making it suitable for the Hit Song Prediction problem. The SVM and kNN algorithms showed higher accuracy on our test set, indicating that they may be more effective in more extended experiments and should therefore be further explored and optimized.

Furthermore, exploring and gathering more metadata about songs would be useful for selecting better features. Having access to information about a song's melody and harmony should give a better understanding of its structure, leading to better results. Additionally, features like mood or emotion could also be considered during training.

Driven by the questions discussed in Section V, exploring parameters and defining features that express music trends is another task that will be useful in approaching the Hit Song Prediction problem.

ACKNOWLEDGMENT

This study was carried out in the context of the machine learning course of the Data Science and Machine Learning master's degree program of the School of Electrical and Computer Engineering of the National Technical University of Athens. We thank Giannis Delimpaltadakis (Postdoctoral Researcher at Eindhoven University of Technology) for useful discussions and wish him the best for the next stage of his academic career.

REFERENCES

[1] A. H. Raza and K. Nanath, "Predicting a hit song with machine learning: Is there an apriori secret formula?" in 2020 International Conference on Data Science, Artificial Intelligence, and Business Analytics (DATABIA), 2020, pp. 111-116.
[2] L.-C. Yang, S.-Y. Chou, J.-Y. Liu, Y.-H. Yang, and Y.-A. Chen, "Revisiting the problem of audio-based hit song prediction using convolutional neural networks," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 621-625.
[3] E. Zangerle, M. Vötter, R. Huber, and Y.-H. Yang, "Hit song prediction: Leveraging low- and high-level audio features," in ISMIR, 2019.
[4] E. Georgieva, M. Şuta, and N. S. Burton, "HitPredict: Predicting hit songs using Spotify data," Stanford Computer Science 229: Machine Learning, 2018.
[5] K. Middlebrook and K. Sheik, "Song hit prediction: Predicting Billboard hits using Spotify data," arXiv, vol. abs/1908.08609, 2019.
[6] T. K. Ho, "Random decision forests," in Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1. IEEE, 1995, pp. 278-282.
[7] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[8] D. R. Cox, "The regression analysis of binary sequences," Journal of the Royal Statistical Society: Series B (Methodological), vol. 20, no. 2, pp. 215-232, 1958.
[9] K. Taunk, S. De, S. Verma, and A. Swetapadma, "A brief review of nearest neighbor algorithm for learning and classification," in 2019 International Conference on Intelligent Computing and Control Systems (ICCS), 2019, pp. 1255-1260.
[10] M. Interiano, K. Kazemi, L. Wang, J. Yang, Z. Yu, and N. Komarova, "Musical trends and predictability of success in contemporary songs in and out of the top charts," Royal Society Open Science, vol. 5, p. 171274, May 2018.