
MASTER’S THESIS 2022

Generating Music using AI
(Swedish title: Generera Musik med AI)

Ebba Rickard
[email protected]

Degree project (Examensarbete) in Computer Science (Datavetenskap)
ISSN 1650-2884
LU-CS-EX: 2022-42

DEPARTMENT OF COMPUTER SCIENCE
LTH | LUND UNIVERSITY

June 28, 2022

Master’s thesis work carried out at Axis Communications.

Supervisors: Emma Söderberg, [email protected]


Johan Davidsson, [email protected]
Danny Smith, [email protected]
Examiner: Elin Anna Topp, [email protected]
Abstract

Network speakers are used in public spaces for public announcements and
sometimes to play background music. To play music, a special commercial music
licence is required, specifically made for use in public environments. However,
customers have to pay licence fees and educate themselves on the topic of copy-
right licensing in order to do this according to current regulations. In this thesis
we explored the possibilities of generating alternative, licence-free background
music using machine learning methods. We surveyed the field for existing mod-
els and data sets, and carried out interviews with musicians to identify music
quality characteristics that could be used as evaluation metrics.
We chose to tune and compare the transformer model GPT-3 and the Long
Short-Term Memory (LSTM) model Performance RNN. The music was evalu-
ated using the COSIATEC algorithm to find recurrent patterns as well as using a
custom metric based on Tymoczko’s theories of tonality. Experiments were car-
ried out investigating the impact of learning rate, training data characteristics
and generation parameters. Both GPT-3 and Performance RNN performed well
at generating long-term structure in music, but the training time and accuracy
differed depending on the chosen data set. To add to the findings in this thesis
it would be interesting to investigate the correlation between human perception
of the music and the scores obtained in this report. It would also be of interest to
further investigate the impact of the training data characteristics, such as genre
and melodic content.

Keywords: MSc, Generative Model, Music generation, Transformer, Encoder, Decoder, VAE, Evaluation metric, Music quality
Acknowledgements

I want to take the opportunity to thank my supervisor Emma Söderberg, for the commitment,
engagement and encouragement put into guiding me in my work. I could not have had a
better supervisor.
A big thank you also to my supervisors at Axis, Johan Davidsson and Danny Smith, for
the warm welcome at Axis as well as the important help throughout the entire thesis process.

Contents

1 Introduction
    1.1 Project context
    1.2 Aim
    1.3 Methodology
    1.4 Scope and limitations
    1.5 Previous work
    1.6 Report outline

2 Background
    2.1 Machine learning basics
        2.1.1 Artificial Neural networks - ANN
        2.1.2 Optimization aspects
        2.1.3 Transfer learning
    2.2 Generative models
        2.2.1 Autoencoder - AE
        2.2.2 Transformers
        2.2.3 Long Short-Term Memory - LSTM
        2.2.4 Examples of generative models
    2.3 Music Theory
        2.3.1 Fundamentals
    2.4 Digital music representation
        2.4.1 MIDI (Musical Instrument Digital Interface)
        2.4.2 Raw audio
        2.4.3 Text representation

3 Method
    3.1 Literature review
    3.2 Interviews
    3.3 Metrics
    3.4 Music generation
    3.5 Training data
        3.5.1 ABC format parsing in GPT-3

4 Results
    4.1 Interview results
    4.2 Dataset evaluation results
    4.3 Pretrained model results
    4.4 Magenta: Performance RNN
        4.4.1 Learning rate
        4.4.2 Seeding sequence
    4.5 GPT-3
        4.5.1 Prompt
        4.5.2 Learning rate
        4.5.3 Epochs
        4.5.4 Transposed training data
        4.5.5 Temperature

5 Discussion
    5.1 Model performances
        5.1.1 Data set homogeneity
        5.1.2 Learning rate
        5.1.3 Epochs
        5.1.4 Temperature
        5.1.5 Data format parsing
    5.2 Metric
        5.2.1 Note density
        5.2.2 Empty beat ratio
        5.2.3 Consonance
        5.2.4 Centricity
        5.2.5 TECs
        5.2.6 Macroharmony
        5.2.7 Groove
    5.3 Legal aspects

6 Conclusion

References

A
    A.1 Interview documents
        A.1.1 Consent to be interviewed for research
        A.1.2 Interview Protocol

Chapter 1
Introduction

The use of Artificial Intelligence (AI) to generate music has progressed rapidly
during the last decades [11][31]. In generative modelling an AI model is trained to generate
new data such as images, text or audio. This thesis deals with the generation of score-based
music.

1.1 Project context


This master thesis project is carried out at Axis Communications, a company providing net-
work connected security solutions. They are best known for their security cameras, but Axis
also provides network connected speakers for commercial use. The speakers are used in pub-
lic spaces such as malls, train stations, airports, etc, to give public announcements or in some
cases to play background music. In this thesis AI generated music is researched because it can
be licence free. This is further discussed in Section 1.4. Music generated using AI can there-
fore act as an alternative to a commercial licence when playing music in a public space. Axis
wants to explore the possibility of providing licence-free music that their customers could
use freely.

1.2 Aim
The aim of this thesis is to explore ways of generating music using machine learning which
is suitable for use as background music. There has been great success in the field of generat-
ing music using AI in recent years, with models producing convincing piano melodies [23][31] and
even music with realistic vocals [11]. In the context of this thesis, the music in question will
be played in public environments. The focus will therefore be on generating music that is
suitable as background music in public spaces. More specifically, this thesis will explore the
possibilities in music generation by investigating existing methods and models, testing out
well-supported choices of models, finding objective ways of evaluating and comparing the
methods, and ultimately to find a method that works for the specific context at Axis. The
work in this thesis has been driven by the following research questions:

RQ1 Which already developed models in the state-of-the-art can be utilised to achieve music
generation?

RQ2 What are the demands on the quality of background music in public spaces?

RQ3 How can the quality of generated music be measured?

RQ4 To what extent can the existing generative models of background music generate back-
ground music suitable for public spaces?

1.3 Methodology
The work in this thesis was carried out using a mixed methodology approach [25] consisting
of reviewing the state of the art, conducting semi-structured interviews and carrying out
experiments. The results of the experiments were evaluated from a quantitative perspective.

1.4 Scope and limitations


The end-results of this project are envisioned to be used in a commercial context. This re-
quires a concern for legal issues in choosing a training data set and using pretrained models.
Many AI models are publicly available under a Creative Commons licence that does not al-
ways allow commercial use of the software. Using copyrighted music for training could also be
problematic for legal reasons. However, today there is plenty of music that has fallen into
the public domain and can be used freely. Therefore copyright-free music will be used for
training as much as possible.
The result of this thesis is also limited by the available computing power in combination
with time constraints, as is the case with many machine learning projects. The most successful genera-
tive models have been trained on millions of data samples over a long period of time on powerful
hardware, so this particular setup will not be able to reach the same results. Focus
will instead be put on the comparison of different models, both pretrained and retrained.
Because of the breadth of the field of generative AI modelling, the models chosen for
further investigation were limited to only transformer or Long Short-Term Memory (LSTM)
models.
There are other potential applications for AI generated music that will not be covered in
this thesis, for example as a tool for musicians in the composing process or an unbiased way
of finishing unfinished work by late composers [12][8].

1.5 Previous work


This work is partially based on the work by Gonsalves [14] in which GPT-3 was retrained
using the OpenEWLD data set in ABC style music notation.


1.6 Report outline


The content of the report is structured as follows:

Chapter 1 - Introduction: Introduces the subject, context and aim of the thesis. The scope
and its limitations are introduced together with the research questions.

Chapter 2 - Background: Contains the necessary background theory on machine learn-
ing, music, digital processing of music and music evaluation metrics. The background
is intended to give all the necessary information to be able to fully comprehend the
method and results.

Chapter 3 - Method: This chapter includes a detailed description of how the work was con-
ducted to address the research questions.

Chapter 4 - Results: The results of the experiments carried out. This section also contains a
brief evaluation of the results.

Chapter 5 - Discussion: A detailed analysis of the results and the experiment setup.

Chapter 6 - Conclusion: This section contains a conclusion answering the posed research
questions.

Chapter 2
Background

This chapter describes the background theory for this thesis. In Section 2.1 fundamental
machine learning theory is introduced, Section 2.2 handles some generative modelling ar-
chitectures, Section 2.3 gives an introduction to music theory and finally some details about
processing music digitally are presented in Section 2.4.

2.1 Machine learning basics


Machine learning is a subfield of Artificial Intelligence. Machine learning algorithms mimic
the human learning process by observing large amounts of data and learning the patterns in
them. In general terms, machine learning aims to implement algorithms that are able to learn
from experience. This means that the system does not need to be explicitly programmed,
which enables complex tasks to be solved with less complex code. There are many different
types of machine learning models and many ways of categorizing them. In this project it is
especially important to distinguish between discriminative and generative models.

Discriminative models A discriminative model is capable of classifying data. More


specifically this means finding the probability that data with a certain set of characteristics
belong to a specific category, a class. As an example, a good discriminative model could be
able to classify the contents of an image or the genre of a song. Discriminative models are
used in practice in face recognition, self-driving cars, medicine and many other fields.

Generative models A generative machine learning model can classify data, just like
a discriminative model, but with the additional capability of being able to generate new
data. If a discriminative model predicts how likely data with a certain set of characteristics
belong to a certain label, the generative model can predict the probability of a certain set of
characteristics, given the label. This means that when a generative machine learning model is
given the instruction to generate a picture of a dog, it generates it based on the characteristics
it learned through classifying images of animals. In practice it may for example find the
probability of a dog having fur and four legs, and then generates the image based on these
findings.

2.1.1 Artificial Neural networks - ANN


An artificial neural network is a machine learning algorithm that models the neurons and
synapses of the brain. The neural network architecture can be seen in Figure 2.1. A neural
network consists of multiple node layers, where each node receives input from the nodes
in the previous layer and passes on its output to the next layer. Based on this, the network
can be modelled as a directed graph which is why the neurons are sometimes referred to
as nodes. In each node the input is evaluated by the node’s activation function to calculate
the node output, as shown in an example in Figure 2.2. The standard activation function
adds weights to the inputs, sums the weighted values and adds a bias. The weights of the
activation function are updated each time the network has been propagated through, the
network "learns" the weights. The training process can either be supervised or unsupervised.
In supervised learning, the training data is labelled. The label, i.e the desired output, is
compared to the output of the network. Depending on how close the network is to making
the correct prediction the weights are adjusted accordingly.
In unsupervised learning in neural networks, the training data is unlabelled. The network
does not know anything about the data, and instead learns patterns in it. For example the
network can group cats and dogs into two different groups based on their characteristics, but
it does not know what it is categorizing. In unsupervised generative models it is possible to
generate random pictures that could end up being a picture of a dog, but one could not ask
the model to explicitly generate a picture of a dog since it does not know what it modelled.

Feed-forward Neural Network


The feed-forward neural network is the simplest type of ANN. It does not form any cycles,
which means that the information is sent in only one direction in the graph.

Recurrent Neural Network - RNN


In a recurrent neural network the outputs from the nodes are sent back as input to the node
or into previous node layers. An advantage of RNNs is that they can predict sequential data,
which the standard feed-forward neural network cannot.

2.1.2 Optimization aspects


Learning rate
The learning rate controls how fast the neural network learns. It does this by determining
how much the weights should be updated in each training iteration. If the learning rate is too
small the training process gets very long, but if the learning rate is too big the model could
converge too quickly and not give good enough results.

Figure 2.1: An example of the Artificial Neural Network (ANN) architecture. The ANN consists of an input layer, N hidden layers and an output layer. The hidden layers can have different sizes, which affects the performance of the model. The input layer size corresponds to the size of the input data sample. The output layer size corresponds to the desired output sample size, or the number of possible classes in the case of a classification application.

y = Σ_i w_i x_i

Figure 2.2: The standard activation function and how it evaluates the input to determine the output: the inputs x_1, x_2, x_3 are weighted, summed and passed on as the output y. A bias term can also be added to the output.


Loss
An ANN learns by minimizing the loss function. The loss function measures the difference between
the true desired output and the network output. A common loss function to use for se-
quential data is the cross entropy loss function [13]. Cross entropy is defined in Formula 2.1.
p(x_i) denotes the original probability distribution and q(x_i) is the predicted distribution, the
output of the model.

H(p, q) = − Σ_{x_i} p(x_i) log q(x_i)    (2.1)
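
As a minimal illustration (not taken from the thesis implementation), the cross entropy between a one-hot target distribution p and a predicted distribution q can be computed as follows:

    import numpy as np

    def cross_entropy(p, q, eps=1e-12):
        """Cross entropy H(p, q) between a target distribution p
        and a predicted distribution q over the same classes."""
        q = np.clip(q, eps, 1.0)      # avoid taking the log of zero
        return -np.sum(p * np.log(q))

    p = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot target: the true class is index 2
    q = np.array([0.1, 0.2, 0.6, 0.1])   # model output after softmax
    print(cross_entropy(p, q))           # approximately 0.51, i.e. -log(0.6)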

Batch size
The batch size defines the number of data samples used in one training iteration before up-
dating the weights. A batch size close to the size of the data set can lead to the model being
bad at generalizing to data it has not seen before. The optimal batch size to use is different for
every model and data set.

Temperature
The temperature is a parameter used when generating data samples. It determines the confi-
dence of the model: a low temperature favours the most probable results and therefore
generates data with less diversity. A high temperature instead includes less probable
samples, resulting in more variation in the data but also more mistakes.
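
For illustration only (this sketch is not part of the thesis experiments), temperature sampling from a model's raw output scores (logits) can be written as:

    import numpy as np

    def sample_with_temperature(logits, temperature=1.0):
        """Sample an index from raw logits scaled by the temperature.
        A low temperature gives near-greedy sampling, a high temperature more diversity."""
        scaled = np.asarray(logits) / temperature
        probs = np.exp(scaled - np.max(scaled))   # numerically stable softmax
        probs /= probs.sum()
        return np.random.choice(len(probs), p=probs)

    logits = [2.0, 1.0, 0.1]
    print(sample_with_temperature(logits, temperature=0.75))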

2.1.3 Transfer learning


Fine-tuning an already trained model to fit a specific problem is called transfer learning.
Training a new model from scratch is time- and resource-consuming and generally only
available to big companies and organizations with enough funds [24]. In the case of music
generation, there are many models that have been trained on classical music. Even though
the model is genre-specific, it will have learned basic, low-level musical patterns that can
be useful when generating music in other genres. Transfer learning could for example mean
that the majority of the learned weights of the trained models are kept, while some of the
last layers are retrained on new data. This enables high accuracy without having to train the
entire model.
There are extensively trained models publicly available whose results are almost impossi-
ble to replicate with one GPU even if large time resources were available. The transparency
and mutability of the models vary, where the models are sometimes available without insight
into the source code and therefore have limited room for customizing. In some models all of
the trained weights are available, and can therefore be fine-tuned through transfer learning
to tailor the model to a specific problem.
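
As a minimal sketch of this idea (the model and layer sizes below are arbitrary examples, not any of the models used later in this thesis), freezing most of a pretrained network and retraining only its last layer could look like this in PyTorch:

    import torch
    import torch.nn as nn

    # A stand-in for a pretrained network: a feature extractor followed by an output head.
    model = nn.Sequential(
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 10),                # output head to be retrained on new data
    )

    # Freeze every parameter except those of the last layer.
    for param in model.parameters():
        param.requires_grad = False
    for param in model[-1].parameters():
        param.requires_grad = True

    # Only the unfrozen parameters are handed to the optimizer.
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-3)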


2.2 Generative models


This section describes some common architectures used for data generation. Focus is put on
models that perform well with sequential data such as audio.

2.2.1 Autoencoder - AE
An autoencoder is a type of neural network that learns how to represent data in a meaningful
way using less information than the original representation. The autoencoder architecture
consists of an encoder network and a decoder network. The data is encoded and then repre-
sented in the latent space, z, which is why these types of architectures are sometimes called
deep latent variable models. After the data have been encoded, the aim is to reproduce the
original data by decoding the encoded representation. Figure 2.3 shows the autoencoder ar-
chitecture in more detail. The figure visualizes how the input is encoded in the latent space
by using an example from the MNIST data set [10]. In the latent space the image is clearly
represented using less information, in this case fewer image pixels. It can then be decoded
to reproduce the original input. Because of the shape of the network architecture, the latent
space is often called a bottleneck.
The encoder and decoder of the autoencoder architecture can be used together or sepa-
rately as different parts of a generative model. By encoding data using an encoder, one can
benefit from the data being represented with less information by using this as training data.
This can make the training process faster, since a big part of what makes training slow is the
information-heavy data. After training, the output can be decoded by a decoder in order
to get a sample with the sought-after detail [4].
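
A minimal sketch of the architecture (illustrative only; the layer sizes are arbitrary and this is not one of the models evaluated in the thesis) could look as follows in PyTorch:

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        """Compress 784-dimensional inputs (e.g. flattened MNIST images)
        into a 32-dimensional latent space and reconstruct them."""
        def __init__(self, input_dim=784, latent_dim=32):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(input_dim, 128), nn.ReLU(),
                nn.Linear(128, latent_dim),
            )
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 128), nn.ReLU(),
                nn.Linear(128, input_dim), nn.Sigmoid(),
            )

        def forward(self, x):
            z = self.encoder(x)            # latent (bottleneck) representation
            return self.decoder(z)         # reconstruction of the input

    model = Autoencoder()
    x = torch.rand(16, 784)                          # a batch of 16 dummy samples
    loss = nn.functional.mse_loss(model(x), x)       # reconstruction loss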

Variational Autoencoder - VAE The variational autoencoder (VAE) is similar


to the autoencoder, but represents the input data as a normal distribution in the latent space.
The practical difference from the regular autoencoder is an additional layer where the com-
pressed data in the latent space are normalized. The distribution is called the true distribution
of the data.
The additional normalizing layer means that the representation in the latent space is
continuous, which makes sampling from the latent space more accurate than with the regular
autoencoder. A drawback of VAEs is that they tend to have difficulties with sequential data
and modelling long-term structure [23]. Furthermore, the information loss that in many ways
is the big advantage of the model is also a compromise for quality as the generated data is
often noisier than the original data [4].

2.2.2 Transformers
The transformer architecture was introduced by Vaswani et al. in 2017 [29]. The transformer
architecture has an encoder-decoder structure and an inherent attention mechanism that
enables it to perform very well with sequential data with long-term structure. Thanks to the
attention mechanism it does not rely on recurrence, which makes it faster to train than other
encoder-decoder models.


Figure 2.3: The autoencoder architecture consisting of an encoder, a decoder and the latent space representation, or bottleneck. In this example the autoencoder encodes an image from the MNIST dataset [10] and decodes it into a noisier reproduction.


The attention function is presented in Equation 2.2 and takes queries (Q), keys (K) and
values (V) as input and maps them to an output value. dk is the dimension of the K and Q
vectors and this factor scales the dot-product to avoid vanishing gradient problems, which
is a common issue with neural networks where the gradient of the loss function becomes too
small resulting in an undertrained network. Q, K and V are all derived from the same input
when referring to self-attention.

attention(Q, K, V) = softmax(QK^T / √d_k) V    (2.2)
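
As an illustrative sketch (not taken from the thesis or from any particular library), scaled dot-product self-attention can be written in a few lines of NumPy:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)    # similarity between queries and keys
        return softmax(scores) @ V         # attention-weighted sum of the values

    # Self-attention: Q, K and V are all derived from the same input sequence.
    x = np.random.rand(5, 8)               # 5 tokens with dimension 8
    print(attention(x, x, x).shape)        # (5, 8)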

Attention Attention is the ability to identify context. The correct understanding of a


word is important in a sequence completion task, both for language and music. The following
two sentences illustrate this in a language example by showing two very different meanings
of the word cross:

The animal didn’t cross the street because it was too tired.
You don’t want to cross them.

The meaning of a word in a sentence is deduced by its surrounding context, and can
even depend on tokens from further away in the sequence than in the direct proximity of the
word. Figure 2.4 shows how the word it relates to the other words in the sentence. It depends
the most on the words the animal. If there were instead multiple animals not crossing the
street, it would have been replaced by they. The word it depends directly on some words, and
contextually on others. The same theory applies to music, which also has structural semantics
similar to language.
The transformer architecture can use the whole previously generated sequence to predict
the next. In comparison, an RNN forgets tokens from further back in the sequence and only
takes recent words, or as in the context of this thesis, music notes, into account.

Figure 2.4: Illustration of the attention mechanism. Attention deduces how the words in a sentence inherently relate to each other on the basis of context.

2.2.3 Long Short-Term Memory - LSTM


The LSTM model architecture is a type of Recurrent Neural Network that was developed in
response to the common vanishing gradient problem with regular RNNs [17]. An LSTM has
a feedback loop from previous layers in the network which is why it is said to "remember"
previous values. Compared to the regular RNN network the LSTM remembers values further
back in the network, which is why it performs better with sequential data, and also avoids
the vanishing gradient problem.
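
As a minimal sketch (illustrative only, not the Performance RNN architecture discussed later), an LSTM that predicts the next token of a sequence could be set up as follows in PyTorch:

    import torch
    import torch.nn as nn

    class NextTokenLSTM(nn.Module):
        """Toy LSTM mapping a sequence of token ids to logits for the next token."""
        def __init__(self, vocab_size=128, embed_dim=32, hidden_dim=64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, vocab_size)

        def forward(self, tokens):
            x = self.embed(tokens)             # (batch, seq_len, embed_dim)
            out, _ = self.lstm(x)              # the hidden state carries the "memory"
            return self.head(out[:, -1, :])    # logits for the next token

    model = NextTokenLSTM()
    tokens = torch.randint(0, 128, (4, 16))    # a batch of 4 sequences of length 16
    print(model(tokens).shape)                 # torch.Size([4, 128])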

2.2.4 Examples of generative models


The following models are examples of the architectures described in Section 2.2.1, Section
2.2.2 and Section 2.2.3 and are used to investigate music generation in this thesis.

MusicVAE is a variational autoencoder model developed by the Magenta Tensorflow team in


2018 [23]. It uses a two-dimensional LSTM encoder and a hierarchical LSTM decoder.

LSTM networks are introduced in Section 2.2.3. Input data of dimension 2048 was en-
coded into 512 latent dimensions. This model has freely available trained weights and
was trained on a MIDI dataset created by the Magenta team themselves by collecting
1.5 million songs from the internet. Instead of directly decoding from the latent space,
decoding is done bar-wise by using a conductor for each bar and decoding from these
separately, with the aim of improving performance with sequential data.

GPT-3. GPT stands for Generative Pre-trained Transformer and GPT-3 is the third genera-
tion GPT model. GPT-3 was at the time of its release 10 times bigger than any other
sparse NLP model with its 175 billion parameters [7]. GPT-3 was trained on 45 TB of
text data collected from the internet, and is designed for use in NLP tasks, in which it
achieves high accuracy. This model can be applied to music generation by representing
music in a text format. By fine-tuning the pre-trained GPT-3 model it can be used to
generate music while still taking advantage of the sense of context GPT-3 already has.

MuseMorphose is a transformer-based variational autoencoder model developed by Taiwan


AI Labs in 2021 [31]. The developers had the aim to create a model that allowed the
user to have an impact on the rhythmic intensity and polyphony of music. The model
can generate music from scratch, but can also take a song as input and recreate
it based on the conditioning parameters. MuseMorphose was trained on 10 626 songs
from the Remi pop data set.

Performance RNN is an open source LSTM model for music generation from Magenta Ten-
sorflow [26]. The model focuses on generating music with expressive timing and dy-
namics, and successfully generates cohesive music with local long-term structure. It is
trained on the Yamaha e-Piano competition dataset containing 1400 midi files.

2.3 Music Theory


Music theory [6] defines a set of principles that music we like to listen to tends to follow.¹ A
big part of music theory is rooted in the physics of sound as well as in complex mathematics,
but an equally big part is more intangible and comes from human intuition and creativity.
A general convention is that popular music follows music theory rules, but in practice a lot
of great music also breaks them. Deviating from the convention requires knowledge of how
much and in what way to do this. This is a much more difficult modelling problem than just
sticking to the clear rules of music theory, which is why the objective in this thesis is that the
generated music should follow conventional music theory.

2.3.1 Fundamentals
In this section some fundamental music theory is presented. It is later used in the evaluation
of music and is an important aspect in finding a quantitative metric. The theory is explained
based on the piano clavier, seen in Figure 2.5.
¹ Note that when referring to music theory in this report, we generally mean western music theory.

Figure 2.5: An octave on a piano. The octave is repeated multiple times and the 12 tones in the octave can be played in different pitches.


Pitch The frequency of a sound wave corresponds to its pitch. The higher the pitch, the
higher the frequency.

Tones Western music is based on 12 different tones: C, C#/Db, D, D#/Eb, E, F, F#/Gb, G,


G#/Ab, A, A#/Bb and B. The notes with # or b in front of them are called sharps and
flats respectively and correspond to the black keys on a piano. The notes without #
or b correspond to the white keys. The distance between two adjacent notes is a fixed
interval called a semitone.

Octave Doubling the frequency of a tone raises it by an octave. For example, 220 Hz, 440 Hz
and 880 Hz all correspond to the note A but in different octaves. We say that they
are harmonics of the same tone. This means that the same set of tones are repeated in
octaves spanning over the entire piano in Figure 2.5.

Harmony Harmony is one or more notes played at the same time. Three or more notes
form a chord. Playing two or more tones simultaneously can either sound dissonant or
consonant depending on the note distance between the notes. Dissonant sounds feel
unresolved and can be stressful or uncomfortable to listen to. Consonant sounds are the
opposite and sound pleasant. Consonance and dissonance can be used in alternation
to create tension and emotion in music.

Keys Western music usually follows a key. A key defines a set of eight tones spaced at fixed
intervals from each other so that no dissonances appear. The
major and minor keys are most commonly used in western music but there are also
other modes. The eight white keys from one C to the next make up the C major key, but the major
key can be transposed to have any note as the base. Transposing a key does not change the melodic
content, only the pitches.
Every major scale has a relative key in minor. This key contains the exact same tones
but has another tone as the base. The base tones of the two relative keys are spaced
three semitones from each other, e.g. the relative key of C major is A minor.
Some notes in the key have a greater attraction to the base tone than others. These are
the fourth and fifth tones. In many popular songs the chords corresponding to these
specific notes are used more than the others in the key.

Repetition An important part of music is the rhythm and a consistent beat, as well as rep-
etition to some extent. Music with too much or too little repetition can be
uncomfortable to listen to.

Tonality Tonality is the principle of arranging music around a central note, a tone. More
specifically it defines a set of rules for the specific relationship between chords, notes and keys
that regulate much of modern and older music [1]. We can call music tonal or atonal. In the
following sections the characteristics of tonal music are presented.

Criteria for tonal music In this project tonality is of interest because essentially
all music we listen to is considered to be tonal, and it is therefore desirable that the music we
generate also is tonal. Tonality is also a widely discussed concept that has been theorized
in such a way that it could act as support for a potential evaluation metric.² Dmitri Tymoczko
proposes a set of criteria [28] that define the characteristics of tonal music. The five proposed
criteria that a tonal musical piece contains are as follows:

• Conjunct melodic motion. Melodies tend to move by short distances from note to
note. To illustrate, the distance between C and D is shorter than the distance between
C and G in terms of note distances.

• Acoustic consonance. Consonant harmonies are preferred to dissonant harmonies and
tend to be used at points of musical stability.

• Harmonic consistency. The harmonies in a passage of music tend to be structurally
similar to one another over time.

• Limited macroharmony. Tymoczko uses the term “macroharmony” to refer to the total
collection of notes heard over moderate spans of musical time. Tonal music tends to
use relatively small macroharmonies, often involving five to eight notes.

• Centricity. Over moderate spans of musical time, one note is heard as being more
prominent than the others, appearing more frequently and serving as a goal of musical
motion.

² Here’s an attempt: https://github.com/sebasgverde/music-geometry-eval

2.4 Digital music representation


Music can be represented digitally in different formats. The different file formats can con-
tain large amounts of information, or be more stripped-down and potentially lack detailed
information. This impacts the final sound and modifiability of the sound file.

2.4.1 MIDI (Musical Instrument Digital Interface)


MIDI is a common file format to use because of its portability and widespread use. It is
a score-based representation, meaning that a MIDI file contains instructions for how the
music should sound. It then has to be parsed and converted into a raw audio file format to be
listenable. An example of a MIDI file visualization can be seen in Figure 2.6. MIDI contains
information about the velocity, pitch, vibrato, instrument and length of each note. It also
contains information about the entire track such as key, tempo and instrument.
The MIDI file format is generally of small size and is flexible since it is easy to change
instruments and the general composition of the piece. In the machine learning context this
means that music generated in a MIDI format can be post-processed using a music processing
software. A drawback of MIDI is that each instrument has to be generated separately. An-
other drawback is that the MIDI format has a fixed grid of possible note onsets, so onsets
cannot fall outside of this grid, in contrast to a continuous set
of note onsets. The limited onset resolution can result in information being lost when con-
verting between file formats, because notes are snapped to the closest available onset. The onset
resolution is determined by the number of pulses per quarter note (PPQN) which is included
in every MIDI file.
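
For illustration, the note-level information in a MIDI file can be inspected with a library such as pretty_midi (one of several options, and not necessarily the tooling used in this thesis; the file name below is a hypothetical example):

    import pretty_midi

    midi = pretty_midi.PrettyMIDI("example.mid")

    print(midi.resolution)              # pulses per quarter note (PPQN)
    print(midi.get_tempo_changes())     # tempo information for the track

    for instrument in midi.instruments:
        for note in instrument.notes[:5]:
            # Each note carries pitch, velocity and start/end times in seconds.
            print(note.pitch, note.velocity, note.start, note.end)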

Figure 2.6: An example of a MIDI file. Pitch is shown on the x-axis and time is shown on the y-axis. The velocity of each tone is represented by the colour of each bar in this particular visualization. The note length corresponds to the bar lengths of each note.

2.4.2 Raw audio


Raw audio formats such as wav and aiff contain the actual sound waves of the music. This
means that they are many times more information heavy than MIDI. This is a limitation in
that it demands a lot more computational power, but it also has more possibilities because
of the added information. In comparison with language, MIDI would be a text string while
raw audio is a sound file of the text being read out loud. With raw audio even singing can be
learnt, and the model can learn to replicate the personal singing voice of a famous artist by
completely copying the frequency spectra of their voice [11].


2.4.3 Text representation


Music can also be represented using text. This means that natural language processing models
can be used for training, since these were trained on text.

MusicXML is an XML-based text representation of music. It contains information about


the key and song title and presents each note, its duration, pitch and position in the
song. This notation does not contain a lot of information compared to for example
raw audio, but has a lot of extra XML syntax elements, which is why it gets very long even for
short pieces of music.

ABC notation ABC notation contains a header and a part containing the musical contents
of the song. The header contains information about the key, time, default note length
and song name. ABC notation uses letter notation, a-g and z to denote the note values
and rests. Other elements are used to mark note lengths, chords and sharps and flats.
The notes are described in their designated order, which is why the position of each note
does not have to be written out explicitly.
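
For illustration (this example is not taken from the thesis data), a minimal ABC tune containing a one-octave C major scale could look like this:

    X:1
    T:Example tune
    M:4/4
    L:1/4
    K:C
    C D E F | G A B c |

Here X is the reference number, T the title, M the metre, L the default note length and K the key, followed by the note letters with bar lines marked by |.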

Figure 2.7 shows the difference in length between MusicXML and ABC notation for the same
piece. MusicXML is much wordier without giving any extra information about the piece. See
Section 3.5.1 for more details about the formats used in this report.


Figure 2.7: Comparison between MusicXML and ABC notation of a simple song only containing a whole middle C note.

Chapter 3
Method

The purpose of this chapter is to describe the detailed approach used to address the following
research questions (also listed in Section 1.2):

RQ1 Which already developed models in the state-of-the-art can be utilised to achieve music
generation?
RQ2 What are the demands on the quality of background music in public spaces?
RQ3 How can the quality of generated music be measured?
RQ4 To what extent can the existing generative models of background music generate back-
ground music suitable for public spaces?

To address RQ1 a review of the state-of-the-art was done through a semi-structured literature
review. Investigating music quality in RQ3 was done through interviews with professional
musicians. The results from the interviews in combination with a semi-structured literature
review addressed RQ2. Finally, experiments were carried out to address RQ4 where models
found in RQ1 were evaluated using metrics from RQ3.

3.1 Literature review


To find relevant studies in order to answer RQ1, a semi-structured literature search was con-
ducted [19]. We chose the database Google scholar to conduct the search. Google scholar con-
tains papers from many different databases and publishers which is why it is a good choice in
order to avoid bias in favour of any specific publisher. To find relevant papers, a set of search
strings were specified.
The articles were chosen based on a set of criteria that specify the relevance for the spe-
cific topic.


Search string RQ1: music generation OR generative model OR transformer music

Table 3.1: Inclusion and exclusion criteria for RQ1.

Inclusion criteria (number of articles chosen):
1. The paper presents a generative model architecture that is foundational research in the field (4)
2. The paper presents a novel generative model architecture (10)
3. Secondary and tertiary studies that review or map the current state-of-the-art in generative AI modelling (1)

Exclusion criteria (number of articles excluded):
1. The model implementation is not publicly available
2. The generative model type is not transformer- or LSTM-based (2)
3. The paper deals with generation of raw audio (2)

Papers found: 11

In addition to the semi-structured literature review, articles were also found through
citation analysis, or snowballing [19], which is a reason for its semi-structured nature. This
was done by tracing the references backwards in some of the articles found through the literature
review. Snowballing was used when searching for papers about music evaluation criteria for
RQ3 by investigating what metrics had been used in evaluating the models in the articles
from RQ1.

3.2 Interviews
Four semi-structured interviews [25] were carried out with professional musicians. They had
all composed music of their own and had a rigorous musical education. The aim of the inter-
views was to attempt to define the more abstract aspects of what makes music sound good,
and later use this knowledge either to create or choose a mathematical evaluation method.
The analysis was carried out by finding patterns and common themes in the interviews. The
interview protocol and consent form are included in Appendix A.1.

3.3 Metrics
The interviews together with the literature review resulted in a set of important, measurable
characteristics of music. Most of the metrics do not have a specified reference value, but
instead need to be compared to a reference piece or distribution. To find reference points for
the generated music, metrics were computed for the training data sets, which were used as a
representation of human-made music. This shows how the scores vary and can give a hint of
what could be a lower limit score or an interval to stay within. The properties of the music
were evaluated using the following metrics (an illustrative code sketch for a few of them is
given after the list):


Consonance. This metric is calculated as the key consistency of a song, which is measured
by registering to what extent the notes that are played in a sequence belong to the
same key. The consonance is measured between 0-1 for each key in major and minor,
meaning that a song in C major should get a consonance score of 1 for C major and 0
for every other key. A high consonance score is therefore considered as an indicator of
good quality.

Centricity The harmonies in a musical piece should be structurally similar, meaning that
some notes and chords are more common than others. This can be measured by col-
lecting the occurrence of each note in a histogram, whereafter the statistical entropy
of the histogram is measured. Entropy is high if all notes are common, and low if a
couple of notes dominate. It is therefore assumed that a song with low entropy follows
tonality theory better [30].

Macroharmony The total number of used notes in the song is defined as the macroharmony.
The maximum macroharmony is 12, i.e. notes in different octaves count as the same
note. Tymoczko's theory of tonal music (presented in Section 2.3.1) says that tonal
music should have a macroharmony of 5-8 notes [28].

Groove Grooving pattern similarity, or consistency of rhythm. We limit the generated music
by saying that the rhythm in bars close to each other in time should not differ too much
but be somewhat consistent. For music in general this is not always true, but to branch
out when it comes to rhythm requires more detailed specification of how this should
be done [30].

Empty beat ratio The music should not have completely silent parts in the middle of the
sequence, which is why a low empty beat ratio is aimed for.

Note density The note density measures the ratio of note onsets, i.e the number of note
onsets divided by the total length of the sequence. Previous research [5] [16] shows
that a low song tempo can have a calming effect on the listener. Lowering the tempo
makes the note density lower for a sequence length fixed in time, which then results
in note density correlating with tempo. With this reasoning the aimed at note density
should not be too high.

Compression ratio and TECs The structure induction algorithm (SIA) by Meredith [21] can
be used to calculate repetition and patterns in long-term structure. SIA identifies
translational equivalence classes (TECs) by representing each note as a point and ap-
plying a pattern recognizing algorithm to it. From this a visual representation of the
musical patterns can be obtained. SIA finds all patterns in a musical passage, including
ones that are not of interest from a musicological perspective. COSIATEC by Mered-
ith [21] is an improved implementation of SIA and finds actual musical patterns more
accurately by filtering out the most relevant patterns based on coverage, compactness
and compression ratio. The compression ratio measures the total number of points
that make up a pattern's total number of occurrences.
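
As an illustration only (this is not the thesis implementation, and the exact definitions used there may differ), a few of these metrics can be sketched for a MIDI file using pretty_midi and NumPy; the file name is a hypothetical example:

    import numpy as np
    import pretty_midi

    def pitch_class_histogram(midi_path):
        """Count how often each of the 12 pitch classes occurs in a MIDI file."""
        midi = pretty_midi.PrettyMIDI(midi_path)
        counts = np.zeros(12)
        for instrument in midi.instruments:
            for note in instrument.notes:
                counts[note.pitch % 12] += 1
        return counts

    def macroharmony(counts):
        """Number of distinct pitch classes used (at most 12)."""
        return int(np.count_nonzero(counts))

    def centricity_entropy(counts):
        """Entropy of the pitch-class histogram: low entropy means that a few
        notes dominate, which is taken as an indicator of centricity."""
        probs = counts / counts.sum()
        probs = probs[probs > 0]
        return float(-np.sum(probs * np.log2(probs)))

    counts = pitch_class_histogram("song.mid")
    print(macroharmony(counts), centricity_entropy(counts))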


Table 3.2: Details about type, training data and licence for the models chosen for evaluation.

GPT-3: Transformer. Pretrained on 45 TB of text data (Common Crawl dataset). OpenAI licence (off-the-shelf licence; if purchased, commercial use is permitted).
Performance RNN: LSTM-based RNN. Pretrained on the Yamaha e-Piano Competition dataset (∼1400 songs). Apache 2.0 (commercial use permitted).
MusicVAE: VAE. Pretrained on 1.5 million MIDI files from the web [23]. Apache 2.0 (commercial use permitted).
MuseMorphose: Transformer-VAE. Pretrained on the Remi Pop dataset (10 626 songs). MIT (commercial use permitted).

3.4 Music generation


Four models were chosen for evaluation, presented in Table 3.2. The models chosen for eval-
uation are GPT-3, Magenta MusicVAE, Magenta Performance RNN, and MuseMorphose. Pre-
trained weights for music generation were available for all models except GPT-3 since it is
an NLP model. The models were first run using these pretrained weights, and then retrained
using new datasets. In all iterations the training dataset as well as the generated music was
evaluated for comparison. When evaluating the training data the full dataset was evaluated,
whereas for the generated music 100 samples from each model iteration were used for evalu-
ation.
The models were chosen based on model type, documented performance, as well as cus-
tomizability. All the models perform well on sequential data with long-term structure,
which gives better results when generating music. No raw audio models were tested be-
cause of the extensive training time and resources needed. All the chosen models were either
trained on MIDI or text representations, which facilitated training the models on the same
data set.
After evaluating the generated music based on the pretrained weights, GPT-3 and Per-
formance RNN were chosen for retraining and optimization. This was based on the results
presented in Chapter 4 as well as the fact that we wanted to explore the possibilities of mu-
sic generation with GPT-3. The two models were also of two different types, LSTM and
Transformer.

3.5 Training data


Four data sets were found that could potentially be used for training. These were all evalu-
ated using the same metrics used for the generated music. The four potential data sets are
presented in Table 3.3.


Table 3.3: The datasets used for training.

EMOPIA [18]: 1087 songs. MIDI and REMI. Not in the public domain. Piano music collected from YouTube, licenced under a Creative Commons attribution licence.
MAESTRO: 1276 songs. MIDI and wav. Public domain. Classical music in the public domain.
OpenEWLD [27]: 502 songs. MusicXML. Public domain. Derived from EWLD to only contain music in the public domain.
Trumpet Kings of the Swing Era: 177 songs. Multitrack MIDI. Public domain. Small dataset but contains multiple instruments.

3.5.1 ABC format parsing in GPT-3


Since GPT-3 is a Natural Language Processing model, music needs to be represented in a
text format to be able to act as input in the model. We chose to use ABC format, because it
describes music well using few characters, and there are good libraries available for converting
ABC notation to MIDI and vice versa. The conversion is not lossless though, as can be seen
in Figure 3.1. Figure 3.1a shows an original midi file. In Figure 3.1b the original MIDI file
has been converted to ABC notation and back to MIDI using EasyABC [20] and music21 [9]
APIs. In Figure 3.1c the original midi file has been converted to XML format and then ABC
using music21 and xml2abc [3]. This was done without a direct function for translating MIDI
to ABC, which is why the file first had to be translated into XML and then to ABC before it
was recreated as MIDI.
The two recreated MIDI files both look different from the original, but the version in
Figure 3.1b lies closer to the original. It contains the same notes but different note lengths,
and it sounds reasonably similar when listening. The version in Figure 3.1c on the other hand,
has errors in the translation of the notes which gives a dissonant result and makes it sound
very different from the original.
For the purpose of training, the parsing method in Figure 3.1b was chosen based on the
fact that it contains the same melodic content as the original.
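
As an illustrative sketch of this kind of conversion (the file names are hypothetical, and the exact pipeline in the thesis used EasyABC and xml2abc as described above), music21 can parse MIDI and write MusicXML, which an external tool such as xml2abc [3] can then turn into ABC; music21 can also parse ABC text directly and write it back to MIDI:

    from music21 import converter

    # MIDI -> MusicXML (the MusicXML file can then be converted to ABC with xml2abc).
    score = converter.parse("song.mid")
    score.write("musicxml", fp="song.musicxml")

    # ABC text -> MIDI, e.g. to make generated output listenable.
    abc_tune = "X:1\nT:Scale\nM:4/4\nL:1/4\nK:C\nC D E F | G A B c |"
    tune = converter.parse(abc_tune, format="abc")
    tune.write("midi", fp="example.mid")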


(a) The original midi file.

(b) Recreated using EasyABC and Music21.

(c) Recreated using xml2abc and Music21.

Figure 3.1: A song from the EMOPIA dataset (Q1_2Z9Sjl131jA_4.mid) shown both in its original version as well as after parsing to ABC format and back to MIDI using EasyABC.

Chapter 4
Results

In this chapter the results of the interviews and experiments are presented. Section 4.1
presents an analysis of the interviews, Section 4.2 presents the evaluations of the data sets,
Section 4.3 presents the evaluations of the pretrained models, while Sections 4.4 and 4.5 show
the results from retraining Performance RNN and GPT-3.

4.1 Interview results


Quotes have been translated as closely as possible from Swedish to English. The interviewees
will be referred to as P1, P2, P3 and P4. The interview protocol is included in Appendix A.1.

Rhythm and structure All four interviewees were asked the question "What would
make a song really uncomfortable to listen to?", to try to find specific music characteristics that
could be quantifiable. In discussing this, all of the interviewees highlighted the importance
of rhythm and structure. A structural element that P1 and P4 mentioned is the feeling of
"coming home", in a song, referring to a notion of being centered around a note from which
the melodies branch, but then return to create the feeling of "coming home". When speaking
of rhythm, P4 also mentioned that even though music is largely based on repetition, a negative
characteristic of music could be that it is being too repetitive.

"music benefits from being organized, even unexpected things, if done with the right
timing, can result in something organized and structured that we enjoy listening to", P3

"there should be a right amount of repetition, for example if something that sounds off
is played over and over, some kind of repetition is created and makes it not sound off
anymore. It is always a balancing act. It can also be looked at at different levels - if
there is much repetion on the small scale, but none on a big scale, it is still not quite
there", P4


Timbre and sound P3 and P4 mentioned that harsh and too loud sounds would be
uncomfortable to listen to. P2 also touched on this subject, and said that "there’s a purely
sonical aspect of music, meaning that the actual sounds we hear should appeal to us in order for us to
like it". Timbre or sound as an evaluation parameter would primarily be applicable for models
generating raw audio, but MIDI also specifies the velocity, i.e the volume, of each tone.

Background music characteristics P1 said that "since the context is quite general,
the music should be calm and act calming to the listener and not induce stress or anger". P2 contrasted
this by saying that the music could also act as a piece of public art whose purpose is to surprise
the listener. In this case the music would not necessarily have to be calming, but could be.
P2 also thought that the music should suit the specific space. P4 mentioned ambient music
when talking about background music, which generally means music with a slow tempo and
lack of structure, commonly used for relaxation and meditation. This goes in line with the
statement of P1, that background music in a public space can benefit from being calming.

4.2 Dataset evaluation results


The results from evaluating the datasets based on the chosen metrics are presented in Table
4.1. Since OpenEWLD is not in a MIDI format, we were not able to evaluate it based on
compression ratio and TECs. The other data sets that were not chosen for training were not
evaluated on this either.
The data sets OpenEWLD and EMOPIA are chosen for training data since their charac-
teristics are more distinctive, which hypothetically could make it easier to deduce if a model
has learnt the training data set properly. OpenEWLD and EMOPIA have a higher conso-
nance than the other two, which means that they contain notes in the same key (see Section
2.3.1) and they have a more limited macroharmony close to the optimal range for tonal music
of 5-8 notes. They also have low centricity, which indicates that they have a clearer melodic
structure. The mean consonances of the data sets are also visualized in Figure 4.1.

Table 4.1: The mean and standard deviations of the chosen metrics
evaluated on the chosen data sets. All of the datasets achieve a rela-
tively high consonance.
Data set Size Consonance Macroharmony Centricity Note Density
EMOPIA 1087 0.9572 ± 0.0603 8.4855 ± 1.5621 2.7073 ± 0.2406 0.8554 ± 0.4492
MAESTRO 1276 0.7919 ± 0.0788 11.9898 ± 0.115 3.3451 ± 0.1464 1.2942 ± 0.4459
OpenEWLD 502 0.9413 ± 0.0586 8.8805 ± 1.7428 2.7779 ± 0.2557 0.2139 ± 0.0813
TKSE 177 0.8423 ± 0.0539 11.9714 ± 0.225 3.255 ± 0.1475 0.3955 ± 0.1133

Data set Size Groove Empty beat ratio Compression ratio TECs
EMOPIA 1087 0.9841 ± 0.079 0.0053 ± 0.0157 1.3023 ± 0.1376 27.40 ± 14.34
MAESTRO 1276 0.9761 ± 0.0084 0.0288 ± 0.0218 - -
OpenEWLD 502 0.9308 ± 0.0271 0.0183 ± 0.0286 - -
TKSE 177 0.9183 ± 0.0308 0.0089 ± 0.0232 - -


Figure 4.1: Consonance in the four datasets EMOPIA, MAESTRO, OpenEWLD and TKSE.

4.3 Pretrained model results


The four chosen generative models were run with pretrained weights and a generation tem-
perature of 0.75. The results of evaluating 100 songs are presented in Table 4.2.

Table 4.2: The resulting evaluations of the pretrained models.

Model Consonance Macroharmony Centricity Note density
Performance RNN (performance_with_dynamics) 0.8782 ± 0.069 10.59 ± 1.613 2.8478 ± 0.3188 1.3172 ± 0.6796
Music VAE (hierdec_mel_16bar) 0.9376 ± 0.0729 8.21 ± 1.7905 2.5432 ± 0.3511 0.296 ± 0.1111
GPT-3 - - - -
MuseMorphose 0.9756 ± 0.0325 8.2 ± 1.3084 2.7118 ± 0.1515 0.7935 ± 0.2596

Model Groove Empty beat ratio Compression ratio TECs
Performance RNN (performance_with_dynamics) 0.9588 ± 0.0214 0.0604 ± 0.1113 1.4457 ± 0.1699 28.8 ± 14.95
Music VAE (hierdec_mel_16bar) 0.9893 ± 0.004 0.0379 ± 0.0619 1.4781 ± 0.1541 8.06 ± 2.37
GPT-3 - - - -
MuseMorphose 0.9927 ± 0.0023 0.0001 ± 0.0014 1.7478 ± 0.1521 17.38 ± 4.99

Evaluating the pretrained models gives an indication of their respective potential. The
metric results do not differ that much between the models, but when it comes to modelling
long term structure they perform differently.
Music VAE generated sparse samples, some of which were completely empty. The note density was much lower than for the other models, which in combination with the low number of TECs, i.e. low pattern repetition, led us to discard this model.
The lack of melodic structure could also be heard when listening to the samples.


MuseMorphose performed better and showed great potential, but was discarded because
of a lack of documentation of how to retrain and run the model.
Performance RNN and GPT-3 were investigated further through retraining and param-
eter tuning. Performance RNN was chosen because it showed clear patterns and long-term
structure in the generated music, and because it is easily available and is highly tuneable.
GPT-3 was chosen even though it did not have any pre-trained weights for music generation.
Since it is a language-based model that has achieved great results modelling long-term structure in other applications, it was of interest to try it for music generation as well.

4.4 Magenta: Performance RNN


Performance RNN comes in different versions where the performance_with_dynamics model
was chosen since it takes musical dynamics into account but is still fairly close to the base model, which is what we were interested in investigating.
The song generation command takes a seeding sequence as an input parameter, based on
which the model then generates a completion. Every song generated in the same function call
will therefore start with this sequence. The starting sequence is represented as a list of note
values corresponding to their MIDI numbers. The MIDI number 60 corresponds to middle
C, and -2 corresponds to a pause.
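
To make the representation concrete, the sketch below (illustrative only, not the code used in the thesis) shows how a seeding sequence such as {C, -, E, -, D} maps to the list of MIDI numbers handed to the generation command.

# Illustrative sketch: symbolic seeding sequence to MIDI-number list.
# 60 is middle C and -2 denotes a pause.
NOTE_TO_MIDI = {"C": 60, "D": 62, "E": 64}

def seed_to_midi(seed):
    return [NOTE_TO_MIDI[tone] if tone != "-" else -2 for tone in seed]

print(seed_to_midi(["C", "-", "E", "-", "D"]))  # [60, -2, 64, -2, 62]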

4.4.1 Learning rate


The model was iteratively retrained on a subset of 500 songs from EMOPIA while varying
the learning rate. The learning rate was decreased from 10−2 to 10−5 by factors of 10. A
batch size of 16 and the seeding sequence {C} was used for song generation. The resulting loss
and accuracy for training is presented in Figure 4.2 and the generated music evaluations are
presented in Table 4.3.
In Table 4.3 the music generated by the model trained using a learning rate of 0.001
received a much higher consonance score than the music generated in the other runs. It also
had a lower centricity. When listening to samples from the different runs, using learning rate
0.001 gave better sounding results with better melodic structure.
The model was trained for this number of iterations due to time constraints, and it could
have been trained further to clearly see which model performs the best over time. In this at-
tempt, good scores in consonance, centricity and compression ratio clearly correlate with high accuracy, as can be seen for the training with learning rate 0.001. This trained model version
was therefore used in the further attempts in Section 4.4.2.

4.4.2 Seeding sequence


The impact of the seeding sequence was investigated by varying the seeding sequence for the
same trained model. Training was done using a subset of 500 songs from EMOPIA, a batch
size 16, learning rate 0.001 and temperature 0.75.
The results are presented in Table 4.4. The performance is similar on all metrics apart from note density, which is further visualized in Figure 4.3. A sparser seeding sequence resulted in a sparser sample, which could be a direct effect of the sparseness of the seed itself.



Figure 4.2: Loss and accuracy as a function of training step for re-
training Performance RNN using different learning rates.

Table 4.3: The metric evaluations on music generated by Performance RNN for different learning rates.

Learning rate Consonance Macroharmony Centricity Note Density


0.00001 0.7759 ± 0.0671 11.99 ± 0.0995 3.3349 ± 0.1484 1.0429 ± 0.2909
0.0001 0.8361 ± 0.0864 11.78 ± 0.6258 3.1174 ± 0.2405 1.6938 ± 0.3352
0.001 0.9057 ± 0.0678 11.0 ± 1.1576 2.7623 ± 0.3059 1.8005 ± 0.394
0.01 0.6378 ± 0.0183 12.0 ± 0.0 3.5299 ± 0.0193 1.18 ± 0.1241

Learning rate Groove Empty beat ratio Compression ratio TECs


0.00001 0.9764 ± 0.0033 0.0005 ± 0.0028 1.3214 ± 0.0931 28.01 ± 4.68
0.0001 0.9641 ± 0.0065 0.0182 ± 0.0387 1.4561 ± 0.0452 37.35 ± 6.80
0.001 0.9646 ± 0.0067 0.0041 ± 0.0142 1.5934 ± 0.0991 33.47 ± 7.49
0.01 0.9771 ± 0.0023 0.0002 ± 0.0016 1.3591 ± 0.0280 31.55 ± 3.1571

Table 4.4: The metric evaluations on music generated by Performance RNN trained on EMOPIA for different seeding sequences.
The seeding sequence is given as input at the sampling stage and
acts as a starter prompt based on which the model then generates a
continuation.
Seeding sequence Consonance Macroharmony Centricity Note Density
{C} 0.8537 ± 0.0814 11.66 ± 0.696 3.047 ± 0.2534 1.9185 ± 0.45
{C, E, D} 0.8828 ± 0.0756 11.37 ± 0.8906 2.9868 ± 0.2712 1.7175 ± 0.5146
{C, - , E, -, D} 0.8926 ± 0.0766 11.15 ± 1.1347 2.9239 ± 0.2786 1.6671 ± 0.5565
{C, E, - , -, D, C} 0.8871 ± 0.0729 11.28 ± 1.0685 2.9894 ± 0.258 1.6728 ± 0.4938
{C, -, E, - , - , -, D, -, C, -} 0.8799 ± 0.0709 11.22 ± 1.1277 3.0268 ± 0.2371 1.4514 ± 0.6241

Seeding sequence Groove Empty beat ratio Compression ratio TECs


{C} 0.9617 ± 0.0103 0.011 ± 0.0235 1.5847 ± 0.0906 36.9 ± 8.95
{C, E, D} 0.963 ± 0.0112 0.0161 ± 0.0412 1.5292 ± 0.0891 36.02 ± 9.79
{C, - , E, -, D} 0.9603 ± 0.0126 0.0107 ± 0.028 1.4930 ± 0.0940 37.02 ± 10.88
{C, E, - , -, D, C} 0.9594 ± 0.0118 0.0108 ± 0.0265 1.4848 ± 0.099 37.81 ± 10.24
{C, -, E, - , - , -, D, -, C, -} 0.9667 ± 0.0138 0.0123 ± 0.03 1.4472 ± 0.1163 32.59 ± 13.05


Figure 4.3: The means and standard deviations of the note density
metric evaluated on songs generated with Performance RNN using
different seeding sequences.

4.5 GPT-3
GPT-3 was trained on two different datasets. It was trained on a subset of 500 songs from
EMOPIA, which contains songs in different keys. It was also trained on a subset of 500 songs
from OpenEWLD which only contains songs in one key. OpenEWLD comes in a MusicXML
format which is why it also contains information about song title and artist, which EMOPIA
does not. Therefore this information can be used as song prompts in GPT-3, which is de-
scribed in more detail in Section 4.5.1.

4.5.1 Prompt
GPT-3 takes a prompt as an input based on which it generates a completion. In the typical use
case where it generates text, the generated response depends highly on the input prompt. The
training data are arranged in these prompt-completion pairs from which the model learns.
When instead training to generate music, what the input prompt should contain is not as
evident. We attempted two setups to investigate the impact the prompt has on the result.
As in the text generation use case, the same prompt should be able to give different completions, e.g. "How are you?" could be completed with multiple different answers: "Great, how are you?", "Not too well actually.", "Good.", etc. Therefore the song prompts do not necessarily have to be unique, since many different completions can be plausible for one prompt.


OpenEWLD
In the first setup the model was trained on the OpenEWLD dataset which contains informa-
tion about artist and song titles for each song. The artist and song title were used as unique
prompts for each song input to the model. Each song in this dataset was transposed to the
same key before being used as input in training.
# OpenEWLD example prompt:
"X: 1 $ T: A song about trees $ C: The oaks $ <song >"

EMOPIA
In the second setup the model was trained on the EMOPIA dataset which does not contain
any information about artist or title. Instead the ABC tune header was used as input. The
ABC header contains song-specific information regarding time and key, which has an impact
on the output of the song. These prompts are therefore not unique for each song. This was
investigated both with the data set transposed to the same key and with the songs kept in their original keys.
# EMOPIA example prompt:
"X: 1 $ M: 3/4 $ L: 1/16 $ K: Em $ <song >"

4.5.2 Learning rate


The impact of the learning rate multiplier (lrm) in GPT-3 was investigated by training the
model for 7 epochs for the lrm 0.02, 0.08, 0.14 and 0.20. The default learning rate is multiplied
by the lrm, meaning that a smaller lrm gives a smaller resulting learning rate. The loss and
accuracy are presented in Figure 4.4.

Figure 4.4: The loss and accuracy for GPT-3 trained on a subset of
EMOPIA for 7 epochs.
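
A fine-tuning run with an explicit learning rate multiplier could be started roughly as sketched below, assuming the 2022-era (pre-v1.0) OpenAI Python client. The base model name is a placeholder, and the hyperparameter values correspond to one of the tested configurations.

import openai  # assumes the 2022-era (pre-v1.0) OpenAI Python client

# Upload the prompt-completion data and start a fine-tune; "curie" is a
# placeholder base model, and the default learning rate is scaled by the
# learning_rate_multiplier as described above.
training_file = openai.File.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
openai.FineTune.create(
    training_file=training_file.id,
    model="curie",
    n_epochs=7,
    learning_rate_multiplier=0.2,
)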

Training should be carried out for more epochs in order to be able to draw a conclusion about the optimal learning rate. As expected, a lower learning rate implies slower learning. The
highest learning rate with an lrm of 0.2 gave a relatively high accuracy already after 7 epochs.
This would be interesting to investigate further.


4.5.3 Epochs
The impact of the training length was investigated by using a high number of training epochs.
A batch size of 32 and a learning rate multiplier of 0.2 were used. The resulting loss and accuracy are presented in Figure 4.5, and the evaluation of the generated songs is presented in comparison with the training data in Table 4.5. The evaluation scores of the generated music end up close to the scores of the training data, achieving a high consonance, limited macroharmony and low centricity. The generated music does not contain as many TECs, i.e. repeated patterns,
as the training data, but since the standard deviation of the number of TECs in the training
data is high this does not have to mean that the generated music lacks structure. Listening
to the music also confirms that it contains melodic structure.

Figure 4.5: The loss and accuracy for GPT-3 trained on a subset of
EMOPIA for 25 epochs.

Table 4.5: Evaluation of EMOPIA and of GPT-3 trained on EMOPIA for 25 epochs.
Data set Consonance Macroharmony Centricity Note Density
EMOPIA 0.9572 ± 0.0603 8.4855 ± 1.5621 2.7073 ± 0.2406 0.8554 ± 0.4492
GPT-3, trained on 0.998 ± 0.0074 6.6566 ± 0.9657 2.3309 ± 0.3972 1.3088 ± 0.8328
EMOPIA, 25 epochs

Data set Groove Empty beat ratio Compression ratio TECs


EMOPIA 0.9841 ± 0.079 0.0053 ± 0.0157 1.3023 ± 0.1376 27.40 ± 14.34
GPT-3, trained on 0.9966 ± 0.0014 0.0 ± 0.0 1.9304 ± 0.2381 16.37 ± 4.44
EMOPIA, 25 epochs

The loss and accuracy for a training iteration using OpenEWLD is shown in Figure 4.6 for
comparison. The same training parameters were used and the only difference is the training
data. Here the model reaches a higher accuracy than in Figure 4.5 already after 3 epochs. This implies that the data set impacts the training results.


Figure 4.6: The loss and accuracy for GPT-3 trained on a subset of
OpenEWLD for 4 epochs.

4.5.4 Transposed training data


To investigate the impact of homogeneity in the training data, GPT-3 was trained on both the
unchanged EMOPIA dataset as well as the same data set where every song was transposed
into the same key. Figure 4.7 shows the key content of the unaltered EMOPIA dataset. The
EMOPIA dataset was divided into major and minor scales and thereafter transposed to ei-
ther C minor or C major. The model was then trained on the minor and major data sets
separately. A temperature of 0.75 was used for sample generation. The biggest difference in
evaluation results in Table 4.6 can be seen in note density, which is much lower for the training iterations using the homogeneous data sets. A difference can also be seen between the smaller data set, EMOPIA Major, and the two others, in that it achieves a higher macroharmony and a higher number of TECs.
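
Transposition to a common tonic can be sketched with music21 (reference [9]); whether the thesis pipeline used exactly this approach is not claimed. The detected mode decides whether the target becomes C major or C minor.

from music21 import converter, interval, pitch

def transpose_to_c(path):
    # Parse a song, detect its key, and transpose it so that the tonic is C.
    # The returned mode ("major" or "minor") can be used to split the data set.
    score = converter.parse(path)
    key = score.analyze("key")
    shift = interval.Interval(key.tonic, pitch.Pitch("C"))
    return score.transpose(shift), key.mode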

Figure 4.7: How the songs in the EMOPIA dataset are distributed
over keys.


Table 4.6: Metric evaluations for GPT-3 trained on regular EMOPIA and transposed EMOPIA.
Data set Size Consonance Macroharmony Centricity Note density
Standard EMOPIA 500 0.9962 ± 0.0191 5.9588 ± 1.399 2.0365 ± 0.5558 1.6951 ± 4.2858
EMOPIA Major 405 0.9943 ± 0.0153 6.45 ± 1.5058 2.0663 ± 0.4744 0.8717 ± 0.508
EMOPIA Minor 500 0.9981 ± 0.0078 5.8774 ± 1.3716 1.9658 ± 0.6256 0.9882 ± 0.5715

Data set Size Groove Empty beat ratio Compression ratio TECs
Standard EMOPIA 500 0.9963 ± 0.0012 0.0 ± 0.0 2.2459 ± 0.5366 13.52 ± 4.86
EMOPIA Major 405 0.9961 ± 0.0012 0.0 ± 0.0 2.103 ± 0.4628 17.84 ± 5.78
EMOPIA Minor 500 0.9958 ± 0.0014 0.0 ± 0.0 2.2509 ± 0.54 13.41 ± 4.81

Table 4.7: Metric evaluations for GPT-3 for varying generation tem-
peratures.
Data set Temperature Consonance Macroharmony Centricity Note density
OpenEWLD 0.5 0.9141 ± 0.0273 10.09 ± 0.8258 2.9736 ± 0.0644 0.379 ± 0.0473
OpenEWLD 0.75 0.9478 ± 0.0369 9.54 ± 1.4793 2.8809 ± 0.1604 0.4412 ± 0.1557
OpenEWLD 1 0.9362 ± 0.0489 10.03 ± 1.5777 2.9387 ± 0.1804 0.494 ± 0.1639
EMOPIA 0.5 0.9979 ± 0.0124 5.6289 ± 1.3646 1.8907 ± 0.5681 1.5119 ± 1.9672
EMOPIA 0.75 0.9954 ± 0.012 6.8021 ± 0.9199 2.3056 ± 0.4132 1.2118 ± 0.6679
EMOPIA 1 0.988 ± 0.0238 7.5747 ± 1.2561 2.5225 ± 0.3829 1.0842 ± 0.6821

Data set Temperature Groove Empty beat ratio Compression ratio TECs
OpenEWLD 0.5 0.9992 ± 0.0001 0.0 ± 0.0 2.9972 ± 0.5308 7.02 ± 1.56
OpenEWLD 0.75 0.9985 ± 0.0006 0.0055 ± 0.0133 3.2550 ± 1.3071 7.85 ± 2.77
OpenEWLD 1 0.9983 ± 0.0007 0.0047 ± 0.0136 3.1388 ± 1.0121 9.06 ± 4.00
EMOPIA 0.5 0.9964 ± 0.0011 0.0 ± 0.0 2.3899 ± 0.6735 13.19 ± 4.62
EMOPIA 0.75 0.9967 ± 0.0011 0.0 ± 0.0 1.9278 ± 0.2485 15.80 ± 4.32
EMOPIA 1 0.9969 ± 0.0011 0.0003 ± 0.0023 1.7051 ± 0.1655 18.08 ± 4.61

4.5.5 Temperature
In Table 4.7 the results from varying the temperature are presented. The same temperature
variations were made for both the model trained on the OpenEWLD and the EMOPIA data
sets. The same prompt was used for generating music from both models:
# Prompt:
"X: 1 $ M: 4/4 $ L: 1/4 $ K: C $ <song >"



Figure 4.8: The consonance in two different sample sets generated using GPT-3 retrained on OpenEWLD. Temperatures of 0.5 and 1 were used, respectively. The consonance of the two generated data sets showcases that a low temperature gives low diversity: the same sample is generated multiple times and the whole data set contains only 3 unique samples.


Chapter 5
Discussion

In our experiments, two sequential generative models, GPT-3 and Performance RNN, were evaluated on two different data sets, OpenEWLD and EMOPIA. In this chapter the experiment setup and the results of the experiments are analysed. Conclusions are also drawn in relation to the research questions.

5.1 Model performances


This section discusses different aspects of the training results and the differences between
the two models.

5.1.1 Data set homogeneity


We found that the key of the data set had an impact on the note density as can be seen in
Section 4.5.4 and Table 4.6. Transposing the dataset to the same key gave a note density closer
to the one of the training data set. This indicates that conforming the data set to only contain
the same key makes it easier for the model to learn the patterns in it. The note density was still
much higher in the generated music than in the training data, which was seen consistently
throughout all experiments. It is not certain what causes this or if it is an indicator of the
music sounding bad. It can also be said that when the author listened to the music with
varying note densities, there is no particular note density that sounds better than the other
quality wise, but there is a difference in how stress-inducing the music sounds. This would be
interesting to investigate further with listener-studies to be able to draw conclusions about
a potential correlation between the calming effect and note density.
Differences could also be seen between the training runs with the bigger and the smaller
data sets. Since the EMOPIA dataset naturally contains fewer songs in major keys than in
minor keys training was done on different data set sizes. The model that was trained on a
smaller data set had a larger macroharmony score as well as more TECs, i.e more repeated


patterns. This could stem from the model being slightly less trained than the other two, but could also be a result of melodic differences present in these particular subsets of EMOPIA. Repeating this attempt using more data could show more clearly what impact this has on the results, but already with these small data set sizes the pattern metrics indicate that the model trained on the smaller data set had not learnt the structural patterns as well as the other models.

5.1.2 Learning rate


Both of the attempts at varying the learning rate for the two models, in Section 4.4.1 and Section 4.5.2, would have to be run for more epochs in order to find the optimal learning rate value. In this case, the optimal learning rate for a particular running time was found, which was helpful in achieving the best results for the running times available in this project, but the found learning rates do not have to be the best for longer running times.

5.1.3 Epochs
Training GPT-3 with EMOPIA and OpenEWLD achieved very different accuracies as seen
in Figure 4.6 and Figure 4.5, and training on OpenEWLD achieved a higher accuracy than
EMOPIA with a fifth of the training time. As discussed in the previous section, the optimal
learning rate for longer training times does not have to correspond to the learning rate used,
which could have limited the quality of the results. Since the same learning rate was used in
training with both datasets, this indicates that training is highly dependent on the contents
of the data set and that parameter tuning is required for every data set individually.

5.1.4 Temperature
Investigating the impact the temperature had on GPT-3 showed that the evaluation metrics
are not in themselves a clear indicator of a good generated data set when we are specifically
interested in a diverse data set. In Figure 4.8, both generated data sets have a high conso-
nance, i.e how much of the song is in the same key, and contain songs that sound good to
the author when listening, but if the model is to be used to generate varied samples to play
in a public space the results in Figure 4.8b are preferred over the results in Figure 4.8a. For
low temperature the model generated almost the same song over and over. This follows the expected behaviour of a low temperature, which inherently lowers the diversity and makes the model generate only the samples with the highest probabilities, i.e. the same output over and over if the prompt is the same. In the example shown in Figure 4.8 the songs were generated using
the same prompt, which is why an alternative idea to achieve more diverse samples could be
to vary the prompts in a smart way by for example randomizing song titles as prompts.
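
The mechanism behind this is the standard temperature-scaled softmax at the sampling step, sketched below as a generic illustration rather than the internals of either model.

import numpy as np

def sample_with_temperature(logits, temperature=0.75, rng=np.random.default_rng()):
    # Lower temperature sharpens the distribution so the most probable token
    # dominates; higher temperature flattens it and increases diversity.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))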

5.1.5 Data format parsing


When retraining GPT-3, the MIDI files in the training data set were parsed from MIDI to
ABC notation. This conversion between different file formats always comes with some loss
of information due to the nature of the formats. A conversion also happens in Performance


RNN, where the MIDI files are converted into a vector representation to be able to be used
as input in the LSTM network. This has a big impact on the outcome of the training since the
original data set can be assumed to have been altered in some way from parsing. This aspect
is difficult to eliminate completely, but the impact of it could be made as small as possible by
choosing file formats that do not require many parsing steps. Furthermore, the two different
datasets that GPT-3 was retrained on came in different formats, MusicXML and MIDI,
which results in a difference in parsing between the two runs with the same model. It would
therefore be interesting to find a way of measuring the information loss and the extent of the
alteration of the training data in order to include this in the comparison between the two
models as well as the runs with different datasets using the same model.
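
One possible way to quantify such loss, assuming the converted file can be rendered back to a symbolic format that music21 can parse (for example via an external ABC-to-MIDI converter), would be to compare the note content before and after the round trip. This is only a sketch of the idea, not an implementation used in the thesis.

from music21 import converter

def note_set(path):
    # Collect (offset in quarter notes, MIDI pitch) pairs for all notes.
    score = converter.parse(path)
    return {(float(n.offset), n.pitch.midi) for n in score.flatten().notes if n.isNote}

def retention(original_path, round_tripped_path):
    # Fraction of the original note events that survive the format round trip.
    original = note_set(original_path)
    round_tripped = note_set(round_tripped_path)
    return len(original & round_tripped) / len(original)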

5.2 Metric
Discussing music quality and fundamental demands on music in the interviews gave a good understanding of which parameters would be interesting to quantify in an evaluation metric. It also showed some of the diversity in taste and opinions, and the difficulty in
summarizing the broad spectrum of musicality into numbers. Because of the creative and
artistic perspectives on music, it can be a difficult task to classify something as "good" or "bad"
since there is always someone who is going to disagree. The following sections evaluate the different metrics and how well they encapsulate music quality.

5.2.1 Note density


P1 stated in their interview that the music should have a calming effect on the listener. As
mentioned in Section 3.3, previous studies [5] [16] have shown that a low song tempo lowers
the heart rate which makes us relaxed. Therefore the note density of a song could potentially
have a correlation with the calming effect of the music, since the note density is also changed
with tempo. From listening to the music, the note density did not seem to have a correlation
with the quality or musicality of the samples.

5.2.2 Empty beat ratio


This metric showed little variation over the resulting evaluations. Instead of calculating the
mean of a set of songs, this metric could be used to sort out songs that contain too much
silence.
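
A minimal sketch of such a filter, with a purely illustrative threshold:

def filter_songs(songs_with_ratio, max_ratio=0.1):
    # songs_with_ratio: list of (song, empty_beat_ratio) pairs.
    # The threshold is illustrative and would need tuning by listening.
    return [song for song, ratio in songs_with_ratio if ratio <= max_ratio]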

5.2.3 Consonance
The consonance metric is based on what ratio of the notes in the song belong to its most
common key. The most common key is found by parsing through all of the notes and finding
which belong to what key. Since parallel keys consist of the same tones, they get the same
scores. Therefore there is a bias towards the key that is parsed first, which can lead to the
key being misdetermined. This happens when GPT-3 is trained on the OpenEWLD dataset.
All generated songs ideally use the tones in the key C, and therefore some are found to be


in the relative key, A minor, which lowers the consonance metric. By improving the key-finding
algorithm and taking chord progressions and melodic structure into account, this could be
improved.
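
A minimal sketch of the consonance idea, assuming it is computed as the share of notes that fit the best-matching diatonic scale (the exact thesis implementation may differ). Since a major scale and its relative minor contain the same pitch classes, only the twelve major roots need to be checked, which also illustrates the ambiguity discussed above.

MAJOR_STEPS = [0, 2, 4, 5, 7, 9, 11]

def consonance(pitch_classes):
    # pitch_classes: MIDI pitches modulo 12 for all notes in a song.
    best = 0.0
    for root in range(12):
        scale = {(root + step) % 12 for step in MAJOR_STEPS}
        in_key = sum(pc in scale for pc in pitch_classes)
        best = max(best, in_key / len(pitch_classes))
    return best

print(consonance([0, 4, 7, 2, 9, 0, 4]))  # all notes fit C major (or A minor): 1.0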

5.2.4 Centricity
The centricity metric, which measures if some notes are more common than others, could
be an indicator of the music quality. A lower value of this metric means that some notes are
more common, which indicates that the model does not choose notes arbitrarily but has
instead learnt the harmonic patterns that are common in music. The scores varied between
2 and 3, which corresponds to the scores of the data sets, i.e of real music. The training
iterations that achieved higher accuracy also achieved a lower centricity score. For example
this can be seen in Table 4.3, where the model trained with the learning rate that achieved the
highest accuracy also obtained the lowest centricity score. When listening to samples with
different centricity scores, it is difficult to distinguish a difference due to the differences in
music content in the training data sets as well as other quality parameters that varied. This
would therefore be interesting to further correlate with human perception.
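
The behaviour described above, values close to log2(12) ≈ 3.58 when all notes are equally common and lower values when a few notes dominate, is what the Shannon entropy of the pitch-class histogram would give. The sketch below illustrates that idea and is not necessarily the exact definition used in the thesis.

import numpy as np

def pitch_class_entropy(pitch_classes):
    # Shannon entropy (in bits) of the pitch-class histogram; a lower value
    # means that some notes are clearly more common than others.
    counts = np.bincount(np.asarray(pitch_classes) % 12, minlength=12)
    probabilities = counts / counts.sum()
    probabilities = probabilities[probabilities > 0]
    return float(-(probabilities * np.log2(probabilities)).sum())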

5.2.5 TECs
It proved to be difficult to quantify structure and repeated patterns, such as motifs or chord progressions, in a single number. Our attempt was to calculate the number of TECs, but what a "good" number of TECs is is not completely clear, as this can vary immensely between songs depending on, for example, music genre and song length. It could be said, though, that a very low number of TECs, meaning that there are no or very few repeated patterns in the song, would be undesirable. For example, when listening to the music generated by the
pre-trained VAE model MusicVAE (see Section 4.3), the music clearly sounded disorganized
and structure-less, which corresponds to the low number of TECs found in Table 4.2. An
indication of the performance could therefore be given by comparing the number of TECs
to the mean number of TECs of the training data. As seen in Table 4.5 the EMOPIA dataset
varied a lot in the number of TECs with a standard deviation of ∼50% of the mean value
which complicates this approach, since there is no strict interval to compare against.
There is also the additional aspect that potential sources of error here could stem from
the SIA algorithm’s capability of finding the repeated patterns in the song. The COSIATEC
implementation of SIA (see Section 2.3.1) that was used only allows a tone to be part of one
repeated pattern [15], which is why the number of patterns could be underestimated.

5.2.6 Macroharmony
The macroharmony metric indicates how well the model has learnt to arrange the notes. If the macroharmony spans all the possible notes, i.e. all 12 pitch classes, it can be presumed that the music is not staying within one key. In Table 4.4, where the seeding sequence was varied in Performance
RNN, the macroharmony is close to 12, and the resulting music also sounds incoherent and
messy. By contrast, in Table 4.6 and Table 4.7, where temperature and data set key
content were varied the macroharmonies were lower. This indicates that the music content


of the generated songs is not just randomized, and instead follows a clear pattern that the
model learned. This metric is a clear indicator of the model performing well.
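
As a sketch, macroharmony can be thought of as the number of distinct pitch classes in use over a stretch of music; Tymoczko [28] defines it over relatively short spans, and the window size below is purely illustrative rather than the value used in the thesis.

def macroharmony(pitch_classes, window=30):
    # Average number of distinct pitch classes over consecutive windows of
    # notes; the window length of 30 notes is an illustrative choice.
    if len(pitch_classes) <= window:
        return len({pc % 12 for pc in pitch_classes})
    windows = [pitch_classes[i:i + window]
               for i in range(0, len(pitch_classes) - window + 1, window)]
    return sum(len({pc % 12 for pc in w}) for w in windows) / len(windows)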

5.2.7 Groove
The groove metric, measuring the consistency of rhythm throughout the music, scored sim-
ilarly for all the music that was evaluated. Therefore not many conclusions can be drawn
from this metric. It could instead be more useful when evaluating music that includes rhythmic instruments, as rhythmic patterns would then be more easily distinguishable and possibly more error-prone.

5.3 Legal aspects


In the field of AI-generated music, the legal question of copyright infringement arises. There
is a difference between being inspired by someone else’s work and directly copying it, but
where AI falls in this matter is a current topic of discussion where the laws and rules are
not completely clear. Sampling of music, i.e. using part of someone else’s song in your own,
can be an infringement of copyright and therefore illegal [2]. This specific phenomenon of
sampling music has laid ground for well-known court cases in the past [22]. Copyright means
that the song creator owns the exclusive right to distribute, perform and make new copies of
the work in question [2]. Training an AI-model on copyrighted data is generally a grey-area,
where copyright infringement both could happen when using a copyrighted song in training,
or if the generated music ends up sounding too similar to the original training data.
In the first case, the process of training an AI model is not all that different from the
human learning process. Our lives are a constant processing of data, meaning that we learn
about and categorize impressions and influences we experience in our daily lives. Therefore
everything we create is implicitly inspired by our surroundings to some extent, even if we
maybe are not aware of it. So is it labeled copyright infringement, if someone listens to the
Rolling Stones their whole life and then writes a rock song? The line between inspiration and
plagiarism can be blurry for human composers, and the same goes for the world of computer-
made music.
In the second case, deliberately trying to create a song in the style of a specific artist could
be problematic, though still a grey area. If it were to sound identical to a specific song by the
artist, it would most certainly be a possible copyright infringement. This scenario could be
difficult to rule out, and if the AI generates hundreds or thousands of songs, some could
resemble the training data identically. In an autoencoder model a sample could be taken
exactly from a training data point in the latent space, resulting in a song that already exists.
In practice it would be complicated to prove which songs a neural network has been
trained on, since the evidence of it is only the trained weights of the network. Reverse engi-
neering millions of weights to find the initial training data would be practically impossible.


Chapter 6
Conclusion

This thesis researched generative AI models for music generation. By revisiting the research
questions multiple conclusions can be drawn.

RQ1 Which already developed models in the state-of-the-art can be utilised to achieve mu-
sic generation?
By evaluating the training process based on training accuracy and loss, as well as eval-
uating the generated music using the custom metric, we found that transformer and
LSTM models work well for generating music. We also conclude that they perform
well in generating samples with long-term structure.

RQ2 What are the demands on the quality of background music in public spaces?
Through the interviews, a clear set of important characteristics was determined, which proved to be useful later in the work. The interviews also showed the diversity in music
and the difficulty of quantifying such a varying form of media.

RQ3 How can the quality of generated music be measured?


We found a set of metrics, where some clearly correlated with music quality, and where
some did not vary much between music of varying quality. Thus, we conclude that there
are many aspects of accurately modeling the quality of music, and that a quantitative
metric should be correlated with human perception in order to obtain meaningful
limits.

RQ4 To what extent can the existing generative models of background music generate back-
ground music suitable for public spaces?
Both GPT-3 and Performance RNN generated music of acceptable quality. They are
both affordable to train in terms of time, and generate music with clear structural el-
ements and melodic contents that follow the conventions of music to a large extent.
They both implement model architectures, transformer and LSTM, that have an inherent


sense of context, which this work also confirmed.


We found that the data set had an impact on the training time as well as on the pa-
rameter setup needed to achieve good results. Additionally, the parameter setup at
generation also impacted the outcome of the music. Finally we conclude that the gen-
eration of licence free music using AI is a feasible method which gives usable results.

Future work
In future work it would be of interest to correlate human perception of music with the nu-
merical metrics used in this thesis. This would give clear indications of what scores to aim
for in the metrics.

• It would also be interesting to further investigate ways of measuring repetitions and


patterns in music and finding a metric that summarizes the structure in a song in a
meaningful way.

• Since this thesis project was limited by time and computing resources, another im-
provement would be to train the models for longer and evaluate the differences.

• Furthermore, it would be beneficial to find in what ways the data set contents affect
the outcome of training, i.e. what music is easier to train on.

References

[1] Tonality. https://www.britannica.com/art/tonality. (Accessed: 2022-03-15).

[2] What musicians should know about copyright. https://www.copyright.gov/engage/musicians/. (Accessed: 2022-03-15).

[3] xml2abc documentation. https://wim.vree.org/svgParse/xml2abc.html. (Accessed: 2022-04-12).

[4] Dor Bank, Noam Koenigstein, and Raja Giryes. Autoencoders. arXiv preprint
arXiv:2003.05991, 2020.

[5] Luciano Bernardi, Cesare Porta, and Peter Sleight. Cardiovascular, cerebrovascular, and
respiratory changes induced by different types of music in musicians and non-musicians:
the importance of silence. Heart, 92(4):445–452, 2006. Published by: BMJ Publishing
Group Ltd.

[6] B. Boone and M. Schonbrun. Music Theory 101: From keys and scales to rhythm and melody,
an essential primer on the basics of music theory. Adams 101. Adams Media, 2017.

[7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan-
guage models are few-shot learners. Advances in neural information processing systems,
33:1877–1901, 2020. Published by: Morgan Kaufmann Publishers Inc.

[8] Bryan Clark. Check out this beatles-inspired song written entirely by ai. https://thenextweb.com/news/check-out-this-beatles-inspired-song-written-entirely-by-ai, Sep 2016. The Next Web. (Accessed: 2022-03-24).

[9] M.S.A. Cuthbert. music21. https://github.com/cuthbertLab/music21, 2006-2022. (Accessed: 2022-05-14).


[10] Li Deng. The mnist database of handwritten digit images for machine learning research.
IEEE Signal Processing Magazine, 29(6):141–142, 2012.

[11] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and
Ilya Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341,
2020.

[12] Ahmad Elgammal. How artificial intelligence completed beethoven's unfinished tenth symphony. Smithsonian Magazine, Sep 2021.

[13] Sebastian Garcia-Valencia. Cross entropy as objective function for music generative
models. arXiv preprint arXiv:2006.02217, 2020.

[14] Robert A. Gonsalves. Ai-tunes: Creating new songs with artificial intelligence. https://towardsdatascience.com/ai-tunes-creating-new-songs-with-artificial-intelligence-4fb383218146, Oct 2021. Medium. (Accessed: 2022-03-05).

[15] Dorien Herremans and Elaine Chew. Morpheus: generating structured music with con-
strained patterns and tension. IEEE Transactions on Affective Computing, 10(4):510–523,
2017.

[16] Max J Hilz, Peter Stadler, Thomas Gryc, Juliane Nath, Leila Habib-Romstoeck, Brigitte
Stemper, Susanne Buechner, Samuel Wong, and Julia Koehn. Music induces different
cardiac autonomic arousal effects in young and older persons. Autonomic Neuroscience,
183:83–93, 2014. Published by: Elsevier.

[17] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computa-
tion, 9(8):1735–1780, 1997. Published by: MIT Press.

[18] Hsiao-Tzu Hung, Joann Ching, Seungheon Doh, Nabin Kim, Juhan Nam, and Yi-Hsuan
Yang. Emopia: a multi-modal pop piano dataset for emotion recognition and emotion-
based music generation. arXiv preprint arXiv:2108.01374, 2021.

[19] Barbara Kitchenham, O Pearl Brereton, David Budgen, Mark Turner, John Bailey, and
Stephen Linkman. Systematic literature reviews in software engineering–a systematic
literature review. Information and software technology, 51(1):7–15, 2009. Published by:
Elsevier.

[20] N. Liberg. EasyABC. https://github.com/jwdj/EasyABC, 2014-2022.

[21] David Meredith. RecurSIA-RRT: Recursive translatable point-set pattern discovery with
removal of redundant translators. In Joint European Conference on Machine Learning and
Knowledge Discovery in Databases, pages 485–493. Springer, 2019.

[22] Eddie Mullan. Nine most notorious copyright cases in music history. BBC, Jun 2019.

[23] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hier-
archical latent vector model for learning long-term structure in music. In International
conference on machine learning, pages 4364–4373. PMLR, 2018.


[24] Dan Robitzski. Mind-melting ai makes frank sinatra sing "toxic" by britney spears.
Futurism, May 2020.

[25] Colin Robson and Kieran McCartan. Ch 8: Multi-strategy (mixed method) designs. Real
World Research, 4th Edition. John Wiley & Sons, 2015.

[26] Ian Simon and Sageev Oore. Performance rnn: Generating music with expressive tim-
ing and dynamics. https://magenta.tensorflow.org/performance-rnn, 2017.
(Accessed: 2022-04-11).

[27] Federico Simonetta, Filippo Carnovalini, Nicola Orio, and Antonio Rodà. Symbolic
music similarity through a graph-based representation. In Proceedings of the Audio Mostly
2018 on Sound in Immersion and Emotion, AM’18, pages 1–7, New York, NY, USA, 2018.
Published by: Association for Computing Machinery.

[28] Dmitri Tymoczko. A Geometry of Music: Harmony and Counterpoint in the Extended Com-
mon Practice. Oxford University Press, Oxford, 2011.

[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in
neural information processing systems, 30, 2017.

[30] Shih-Lun Wu and Yi-Hsuan Yang. The jazz transformer on the front line: Exploring
the shortcomings of ai-composed music through quantitative measures. arXiv preprint
arXiv:2008.01307, 2020.

[31] Shih-Lun Wu and Yi-Hsuan Yang. Musemorphose: Full-song and fine-grained music
style transfer with one transformer vae. arXiv preprint arXiv:2105.04090, 2021.


Appendix A


A.1 Interview documents


A.1.1 Consent to be interviewed for research

Consent to be Interviewed for Thesis Project

Project: Generating Music using AI


M. Sc. Thesis Student: Ebba Rickard ([email protected])
Supervisors: Emma Söderberg ([email protected]), Johan Davidsson
([email protected]), Danny Smith ([email protected])
Purpose: The purpose of the thesis project is to investigate different machine learning
methods that can be used to generate music.
Participant’s role in the study: Participate in interview about music quality

Responsibility of the interviewer


Before the interview the interviewer will have:
- Explained the purpose of the interview and how any interview answers or personal
information will be handled.
- Explained the rights of the participant, including the right to withdraw at any time.

What will happen to the interview results


The interview will be recorded and kept confidential and only made available to the
interviewer and the supervisors of the project. Parts of the interview answers may be used in
the project report, but under no circumstances will personally identifiable information be
included, unless the participant explicitly wants to not be anonymous.

Rights of the interviewee


I understand the purpose of this interview as explained to me by the interviewer and I agree
that I can withdraw from the interview at any time. If I choose to do so, I understand that any
potential data already collected will be removed. I understand that I can choose to decline to
answer any questions.

I agree that my answers during the interview will be recorded and that this data will be
handled in accordance with the section above.

…………………………………………
Date

………………………………………… ……………………………………..….
Interviewer Name Interviewee Name

………………………………………… ……………………………………..….
Interviewer Signature Interviewee Signature


A.1.2 Interview Protocol

Context

The aim of this master thesis project is to generate music using machine learning models.
Since computers don’t know anything about anything until we teach them, it’s important to
break down the concept of music into its most basic parts in order for the computer to be
able to understand it. Therefore, facts that us humans might think of as obvious also need to
be defined carefully. In addition to this, knowledge about music theory could be used to
automate the evaluation process of the results in a simple way.

Consent
It is up to you if you want to answer a question or not, and the interviewer is always available
to answer questions after the interview. The interview will be recorded in order to simplify the
interview process and the evaluation of the answers. The recording will be shared with the
supervisors of the project, and will be discarded after the project is finished. The answers in
the interview could be included in the report, but in this case they will be completely
anonymous. The final report will be shared with the interview subject.

Interview questions

General

What is your background in composition? Music theory?


Vad har du för bakgrund inom komposition? Musikteori?

If applicable, what was your process the last time you composed a song? Do you have a
common strategy that you use? Do you decide on tempo etc beforehand or does it happen
organically?
Om du har någon, vad var din strategi den senaste gången du komponerade? Har du en
generell strategi när du komponerar musik? Bestämmer du taktart och tempo osv i förväg
eller sker det organiskt?

(In what way does the music theory you know affect the music you create?
På vilket sätt påverkar din teoretiska kunskap den musik du skapar?)

What factors do you think could contribute to making a song really uncomfortable to listen
to? Structure? Sound? (For you but also for the average person, the big mass)
Vilka faktorer tror du skulle kunna bidra till att göra musik riktigt obekväm att lyssna på?
(Både för dig men också för gemene man, den stora massan)

What kind of music do you like to listen to? What do you think it is that makes this music
enjoyable?
Vilken musik brukar du lyssna på? Vad tror du det är som gör den musiken “bra”/bekväm att
lyssna på?

Limit one parameter e.g. melody, can only use one single tone, how would you compose it to
make it musical?


Begränsa en parameter ex melodi, får bara använda en enda ton, hur skulle du komponera
för att göra det musikaliskt?

Kortfattat:
According to you, what theoretical demands are there on music for it to be comfortable to
listen to? (Rhythm, speed, key, melody, chords, singing, what instruments, genre, timbre)
Enligt dig, vilka musikteoretiska krav finns på musik generellt för att den ska vara bekväm att
lyssna på? Taktart? Tempo? Tonart? Melodik, ackordföljder? Sång, vilka instrument? Genre?
Klangfärg?

Project context, background music

What type of music would you want to put in this kind of context? (A very general context
where the listener is unspecified and could be practically anyone)
I kontexten, vad tror du skulle vara bra musik att spela? (En väldigt generell publik miljö med
en ospecificerad lyssnare)

How would you define background music?


Hur skulle du definiera bakgrundsmusik?

What would you consider bad background music?


Vad skulle du tycka är dålig bakgrundsmusik?

If applicable: How is your thought process when composing film music compared to
composing a song with the purpose to be listened to independently? How would you
approach composing background music?
Om relevant: Hur tänker du när du komponerar musik till film jämfört med när du komponerar
en låt som är tänkt att lyssnas på fristående? Hur hade du tänkt om du skulle komponera
bakgrundsmusik?

Generated music
This part of the interview will consist of listening to a couple of AI generated songs and trying
to find the factors that make them sound “fake” or robot-like.

DEPARTMENT OF COMPUTER SCIENCE | LTH, LUND UNIVERSITY | PRESENTED 2022-06-16

MASTER'S THESIS Generating music using AI

STUDENT Ebba Rickard
SUPERVISORS Emma Söderberg (LTH), Johan Davidsson (Axis), Danny Smith (Axis)
EXAMINER Elin Anna Topp (LTH)

Generating music with AI

POPULAR SCIENCE SUMMARY Ebba Rickard

AI in the role of creator of culture is a relatively new phenomenon. This thesis specifically explores the creation of music with the help of AI methods.

To play music in public spaces, current copyright regulations must be followed, which requires holding a commercial licence. As an alternative for shop owners and other actors who play music in public spaces, one possibility could be to offer licence-free music. The music could be generated by AI, since AI-generated music is licence-free if the model is trained on entirely copyright-free music.

The music could be licence-free even if this is not the case, but the legal situation around culture generated by AI is debated, since it is a new and largely untested area. The discussion partly concerns the question of who is the author of the works that are created, but also the question of which works may be used in the training of AI models. The comparison is often made with the human creative process, where musicians listen to copyrighted music and then create their own music, inevitably influenced by the works they have listened to during their lives.

In this thesis we investigated the possibilities of generating alternative, licence-free background music with the help of machine learning methods. Two existing generative AI models were evaluated and retrained. We chose to investigate the Long Short-Term Memory (LSTM) model Performance RNN from Magenta TensorFlow and the transformer model GPT-3 from OpenAI. Both were chosen because their underlying structures are particularly good at modelling sequential data. The former is made to generate music in MIDI format, while the latter is a large language model that can be used to generate music represented in a language-based format. The music was evaluated using a handful of quantitative measures based on music theory, the theory of tonality and the mathematical structure of music.

[Figure: Training data and training parameters are fed to the AI model, which together with the sampling parameters produces the generated music.]

The results of the experiments show that the structural content of the training data had a large impact on how many iterations were required for the models to reach high accuracy. Specifically, it mattered whether or not all music in the training data was in the same key, as well as how high its note density was. In addition, the parameterization at the sampling step mattered for exploiting the full potential of the trained model. Both models succeeded in modelling overall structure in the music.