Generating Music Using AI
Ebba Rickard
[email protected]
ISSN 1650-2884
LU-CS-EX: 2022-42
Network speakers are used in public spaces for public announcements and
sometimes to play background music. To play music, a special commercial music
licence is required, specifically made for use in public environments. However,
customers have to pay licence fees and educate themselves on the topic of copy-
right licensing in order to do this according to current regulations. In this thesis
we explored the possibilities of generating alternative, licence-free background
music using machine learning methods. We surveyed the field for existing mod-
els and data sets, and carried out interviews with musicians to identify music
quality characteristics that could be used as evaluation metrics.
We chose to tune and compare the transformer model GPT-3 and the Long
Short-Term Memory (LSTM) model Performance RNN. The music was evalu-
ated using the COSIATEC algorithm to find recurrent patterns as well as using a
custom metric based on Tymoczko's theories of tonality. Experiments were carried
out investigating the impact of learning rate, training data characteristics
and generation parameters. Both GPT-3 and Performance RNN performed well
at generating long-term structure in music, but the training time and accuracy
differed depending on the chosen data set. To add to the findings in this thesis
it would be interesting to investigate the correlation between human perception
of the music and the scores obtained in this report. It would also be of interest to
further investigate the impact of the training data characteristics, such as genre
and melodic content.
I want to take the opportunity to thank my supervisor Emma Söderberg, for the commitment,
engagement and encouragement put into guiding me in my work. I could not have had a
better supervisor.
A big thank you also to my supervisors at Axis, Johan Davidsson and Danny Smith, for
the warm welcome at Axis as well as the important help throughout the entire thesis process.
Contents
1 Introduction
  1.1 Project context
  1.2 Aim
  1.3 Methodology
  1.4 Scope and limitations
  1.5 Previous work
  1.6 Report outline
2 Background
  2.1 Machine learning basics
    2.1.1 Artificial Neural Networks - ANN
    2.1.2 Optimization aspects
    2.1.3 Transfer learning
  2.2 Generative models
    2.2.1 Autoencoder - AE
    2.2.2 Transformers
    2.2.3 Long Short-Term Memory - LSTM
    2.2.4 Examples of generative models
  2.3 Music Theory
    2.3.1 Fundamentals
  2.4 Digital music representation
    2.4.1 MIDI (Musical Instrument Digital Interface)
    2.4.2 Raw audio
    2.4.3 Text representation
3 Method
  3.1 Literature review
  3.2 Interviews
  3.3 Metrics
  3.4 Music generation
  3.5 Training data
4 Results
  4.1 Interview results
  4.2 Dataset evaluation results
  4.3 Pretrained model results
  4.4 Magenta: Performance RNN
    4.4.1 Learning rate
    4.4.2 Seeding sequence
  4.5 GPT-3
    4.5.1 Prompt
    4.5.2 Learning rate
    4.5.3 Epochs
    4.5.4 Transposed training data
    4.5.5 Temperature
5 Discussion
  5.1 Model performances
    5.1.1 Data set homogeneity
    5.1.2 Learning rate
    5.1.3 Epochs
    5.1.4 Temperature
    5.1.5 Data format parsing
  5.2 Metric
    5.2.1 Note density
    5.2.2 Empty beat ratio
    5.2.3 Consonance
    5.2.4 Centricity
    5.2.5 TECs
    5.2.6 Macroharmony
    5.2.7 Groove
  5.3 Legal aspects
6 Conclusion
References
A
  A.1 Interview documents
    A.1.1 Consent to be interviewed for research
    A.1.2 Interview Protocol
Chapter 1
Introduction
The generation of music with Artificial Intelligence (AI) has seen significant progress
over the last decades [11][31]. In generative modelling, an AI model is trained to generate
new data such as images, text or audio. This thesis deals with the generation of score-based
music.
1.2 Aim
The aim of this thesis is to explore ways of using machine learning to generate music that
is suitable for use as background music. There has been great success in the field of generating
music using AI in recent years, with models producing convincing piano melodies [23][31] and
even music with realistic vocals [11]. In the context of this thesis, the music in question will
be played in public environments. The focus will therefore be on generating music that is
suitable as background music in public spaces. More specifically, this thesis explores the
possibilities of music generation by investigating existing methods and models, testing out
well-supported choices of models, finding objective ways of evaluating and comparing the
methods, and ultimately finding a method that works for the specific context at Axis. The
work in this thesis has been driven by the following research questions:
RQ1 Which already developed models in the state-of-the-art can be utilised to achieve music
generation?
RQ2 What are the demands on the quality of background music in public spaces?
RQ3 How can the quality of generated music be measured?
RQ4 To what extent can existing generative models generate background music suitable for
public spaces?
1.3 Methodology
The work in this thesis was carried out using a mixed methodology approach [25] consisting
of reviewing the state of the art, conducting semi-structured interviews and carrying out
experiments. The results of the experiments were evaluated from a quantitative perspective.
1.6 Report outline
Chapter 1 - Introduction: Introduces the subject, context and aim of the thesis. The scope
and its limitations are introduced together with the research questions.
Chapter 2 - Background: Presents the background theory for the thesis: machine learning
basics, generative models, music theory and digital music representation.
Chapter 3 - Method: This chapter includes a detailed description of how the work was con-
ducted in addressing the research questions.
Chapter 4 - Results: The results of the experiments carried out. This chapter also contains a
brief evaluation of the results.
Chapter 5 - Discussion: A detailed analysis of the results and the experiment setup.
Chapter 6 - Conclusion: This section contains a conclusion answering the posed research
questions.
Chapter 2
Background
This chapter describes the background theory for this thesis. In Section 2.1 fundamental
machine learning theory is introduced, Section 2.2 covers some generative modelling archi-
tectures, Section 2.3 gives an introduction to music theory, and finally some details about
processing music digitally are presented in Section 2.4.
Generative models A generative machine learning model can classify data, just like
a discriminative model, but with the additional capability of being able to generate new
data. Whereas a discriminative model predicts how likely it is that data with a certain set of
characteristics belongs to a certain label, a generative model can predict the probability of a
certain set of characteristics given the label. This means that when a generative machine
learning model is given the instruction to generate a picture of a dog, it generates it based on
the characteristics it learned through classifying images of animals. In practice it may, for
example, find the probability of a dog having fur and four legs, and then generate the image
based on these findings.
2.1 Machine learning basics
(Figure: an artificial neural network with inputs x1, x2 and x3, N hidden layers, and an output y computed from weighted sums Σ wi xi.)
Loss
An ANN learns by minimizing the loss function. The loss function quantifies the difference
between the true desired output and the network output. A common loss function to use for
sequential data is the cross entropy loss function [13]. Cross entropy is defined in Formula 2.1,
where p(xi) denotes the original probability distribution and q(xi) is the predicted distribution,
the output of the model.

H(p, q) = − Σ_{xi} p(xi) log q(xi)    (2.1)
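As an illustration of Formula 2.1, the cross entropy between a true and a predicted distribution over the same set of tokens can be computed in a few lines; the sketch below is only meant to make the formula concrete and is not part of the training code of any of the models used in this thesis.

import numpy as np

# Cross entropy H(p, q) between a true distribution p and a predicted
# distribution q over the same vocabulary (Formula 2.1).
def cross_entropy(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q + eps)))

# The true next token is the third one; the model assigns it probability 0.6.
p = [0.0, 0.0, 1.0, 0.0]
q = [0.1, 0.2, 0.6, 0.1]
print(cross_entropy(p, q))  # -log(0.6), roughly 0.51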
Batch size
The batch size defines the number of data samples used in one training iteration before up-
dating the weights. A batch size close to the size of the data set can lead to the model general-
izing poorly to data it has not seen before. The optimal batch size is different for
every model and data set.
Temperature
The temperature is a parameter used when generating data samples. It determines the confi-
dence of the model: a low temperature favours the most probable results and therefore
generates data with less diversity, while a high temperature admits less probable
samples, resulting in more variation in the data but also more mistakes.
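To make the effect of the temperature concrete, the sketch below rescales a model's output scores before sampling. It is a generic illustration of temperature sampling, not code taken from Performance RNN or GPT-3.

import numpy as np

def sample_with_temperature(logits, temperature, rng):
    # Divide the logits by the temperature before the softmax:
    # a low temperature sharpens the distribution (less diversity, fewer mistakes),
    # a high temperature flattens it (more diversity, more mistakes).
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.5, -1.0]  # unnormalized scores for four possible tokens
print(sample_with_temperature(logits, temperature=0.5, rng=rng))
print(sample_with_temperature(logits, temperature=1.5, rng=rng))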
2.2 Generative models
2.2.1 Autoencoder - AE
An autoencoder is a type of neural network that learns how to represent data in a meaningful
way using less information than the original representation. The autoencoder architecture
consists of an encoder network and a decoder network. The data is encoded and then repre-
sented in the latent space, z, which is why these types of architectures are sometimes called
deep latent variable models. After the data have been encoded, the aim is to reproduce the
original data by decoding the encoded representation. Figure 2.3 shows the autoencoder ar-
chitecture in more detail. The figure visualizes how the input is encoded in the latent space
by using an example from the MNIST data set [10]. In the latent space the image is clearly
represented using less information, in this case fewer image pixels. It can then be decoded
to reproduce the original input. Because of the shape of the network architecture, the latent
space is often called a bottleneck.
The encoder and decoder of the autoencoder architecture can be used together or sepa-
rately as different parts of a generative model. By encoding data using an encoder, one can
benefit from the data being represented with less information by using this as training data.
This can make training faster, since a big part of what makes training take time is
the information-heavy data. After training, the output can be decoded by a decoder in order
to get a sample with the sought-after level of detail [4].
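A minimal sketch of the encoder-decoder structure described above is given below, assuming PyTorch. The layer sizes are arbitrary and only chosen to show the bottleneck shape; they are not the dimensions used by MusicVAE or any other model in this thesis.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compress the input down to the latent space z (the bottleneck).
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        # Decoder: reconstruct the original input from z.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
x = torch.rand(16, 784)           # a batch of flattened 28x28 images
loss = nn.MSELoss()(model(x), x)  # reconstruction loss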
2.2.2 Transformers
The transformer architecture was introduced by Vaswani et al. in 2017 [29]. The transformer
architecture has an encoder-decoder structure and an inherent attention mechanism that
enables it to perform very well on sequential data with long-term structure. Thanks to the
attention mechanism it does not rely on recurrence, which makes it faster to train than other
encoder-decoder models.
(Figure 2.3: the autoencoder architecture, with an encoder, the latent space z, and a decoder.)
The attention function is presented in Equation 2.2. It takes queries (Q), keys (K) and
values (V) as input and maps them to an output value. dk is the dimension of the K and Q
vectors, and dividing by √dk scales the dot product to avoid vanishing gradient problems, a
common issue with neural networks where the gradient of the loss function becomes too
small, resulting in an undertrained network. In self-attention, Q, K and V are all derived from
the same input.

attention(Q, K, V) = softmax(QK^T / √dk) V    (2.2)
The animal didn’t cross the street because it was too tired.
You don’t want to cross them.
The meaning of a word in a sentence is deduced by its surrounding context, and can
even depend on tokens from further away in the sequence than in the direct proximity of the
word. Figure 2.4 shows how the word it relates to the other words in the sentence. It depends
the most on the words the animal. If there were instead multiple animals not crossing the
street, it would have been replaced by they. The word it depends directly on some words, and
contextually on others. The same theory applies to music, which also has structural semantics
similar to language.
The transformer architecture can use the whole previously generated sequence to predict
the next token. In comparison, an RNN forgets tokens from further back in the sequence and only
takes recent words, or, in the context of this thesis, music notes, into account.
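Equation 2.2 can be written out directly; the sketch below uses NumPy with random matrices purely to show the shapes involved in self-attention, and is not the implementation used by GPT-3 or any other transformer discussed here.

import numpy as np

def attention(Q, K, V):
    # Equation 2.2: softmax(Q K^T / sqrt(d_k)) V, with a row-wise softmax.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

seq_len, d_k, d_v = 10, 64, 64
Q = np.random.rand(seq_len, d_k)  # in self-attention Q, K and V are all
K = np.random.rand(seq_len, d_k)  # derived from the same input sequence
V = np.random.rand(seq_len, d_v)
print(attention(Q, K, V).shape)   # (10, 64)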
(Figure 2.4: attention visualization for the sentence "The animal didn't cross the street because it was too tired", where the word "it" attends most strongly to "the animal".)
MusicVAE is a generative model from Magenta built on a variational autoencoder (VAE)
and LSTM networks; LSTM networks are introduced in Section 2.2.3. Input data of
dimension 2048 was encoded into 512 latent dimensions. This model has freely available
trained weights and
was trained on a MIDI dataset created by the Magenta team themselves by collecting
1.5 million songs from the internet. Instead of directly decoding from the latent space,
decoding is done bar-wise by using a conductor for each bar and decoding from these
separately, with the aim of improving performance with sequential data.
GPT-3. GPT stands for Generative Pre-trained Transformer and GPT-3 is the third genera-
tion GPT model. GPT-3 was, at the time of its release, 10 times larger than any previous
non-sparse language model, with its 175 billion parameters [7]. GPT-3 was trained on 45 TB of
text data collected from the internet, and is designed for use in NLP tasks, in which it
achieves high accuracy. This model can be applied to music generation by representing
music in a text format. By fine-tuning the pre-trained GPT-3 model it can be used to
generate music while still taking advantage of the sense of context GPT-3 already has.
Performance RNN is an open source LSTM model for music generation from Magenta Ten-
sorflow [26]. The model focuses on generating music with expressive timing and dy-
namics, and successfully generates cohesive music with local long-term structure. It is
trained on the Yamaha e-Piano Competition dataset containing 1,400 MIDI files.
2.3 Music Theory
2.3.1 Fundamentals
In this section some fundamental music theory is presented. It is later used in the evaluation
of music and is an important aspect in finding a quantitative metric. The theory is explained
with reference to the piano keyboard shown in Figure 2.5.
¹ Note that when referring to music theory in this report, we generally mean Western music theory.
Pitch The frequency of a sound wave corresponds to its pitch. The higher the pitch, the
higher the frequency.
Octave Doubling the frequency of a tone moves it up one octave. For example, 220 Hz, 440 Hz
and 880 Hz all correspond to the note A but in different octaves. We say that they
are harmonics of the same tone. This means that the same set of tones is repeated in
octaves spanning the entire piano in Figure 2.5.
Harmony Harmony is two or more notes played at the same time. Three or more notes
form a chord. Playing two or more tones simultaneously can sound either dissonant or
consonant depending on the intervals between the notes. Dissonant sounds feel
unresolved and can be stressful or uncomfortable to listen to. Consonant sounds are the
opposite and sound pleasant. Consonance and dissonance can be used in alternation
to create tension and emotion in music.
Keys Western music usually follows a key. A key defines a set of eight tones spaced at fixed
intervals from each other so that no dissonances appear. The
major and minor keys are the most commonly used in western music, but there are also
other modes. The eight white keys make up the C major key, but the major key can be
transposed to have any note as the base. Transposing a key does not change the melodic
content, only the pitches.
Every major scale has a relative key in minor. This key contains the exact same tones
but has another tone as the base. The base tones of the two relative keys are spaced
three semitones from each other, e.g. the relative key of C major is A minor.
Some notes in the key have a greater attraction to the base tone than others. These are
the fourth and fifth tones. In many popular songs the chords corresponding to these
specific notes are used more than the others in the key.
Repetition An important part of music is the rhythm and a consistent beat, as well as rep-
etition to some extent. Music with either too much or too little repetition can be
uncomfortable to listen to.
Tonality Tonality is the principle of arranging music around a central note, a tonal center. More
specifically, it defines a set of rules for the relationships between chords, notes and keys
that regulate much of both modern and older music [1]. We can call music tonal or atonal. In the
following sections the characteristics of tonal music are presented.
Criteria for tonal music In this project tonality is of interest because essentially
all music we listen to is considered to be tonal, and it is therefore desirable that the music we
generate is also tonal. Tonality is also a widely discussed concept that has been theorized
(Figure 2.5: a piano keyboard; the white keys F, G, A, B, C, D, E and the black keys F#/G♭, G#/A♭, A#/B♭, C#/D♭, D#/E♭ repeat across the octaves.)
in such a way that it could act as support for a potential evaluation metric. Dmitri Tymoczko
proposes a set of criteria [28] that define the characteristics of tonal music. The five proposed
criteria that a tonal musical piece satisfies are as follows:
• Conjunct melodic motion. Melodies tend to move by short distances from note to
note. To illustrate, the distance between C and D is shorter than the distance between
C and G in terms of note distances.
• Limited macroharmony. Tymoczko uses the term “macroharmony” to refer to the total
collection of notes heard over moderate spans of musical time. Tonal music tends to
use relatively small macroharmonies, often involving five to eight notes.
• Centricity. Over moderate spans of musical time, one note is heard as being more
prominent than the others, appearing more frequently and serving as a goal of musical
motion.
2.4 Digital music representation
ABC notation ABC notation contains a header and a body containing the musical contents
of the song. The header contains information about the key, time signature, default note length
and song name. ABC notation uses letter notation, a-g and z, to denote the note values
and rests. Other elements are used to mark note lengths, chords, sharps and flats.
The notes are written in the order they are played, which is why the position of each note does
not have to be explicitly written out.
Figure 2.7 shows the difference in length between MusicXML and ABC notation for the same
piece. MusicXML is much wordier without giving any extra information about the piece. See
Section 3.5.1 for more details about the formats used in this report.
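As a concrete illustration of the format, a short hypothetical tune in ABC notation could look as follows. The header fields X, T, M, L and K give the index, title, metre, default note length and key, and the body uses the letters together with length modifiers and bar lines.

X: 1
T: Example tune
M: 4/4
L: 1/4
K: C
C D E F | G2 G2 | A G F E | C4 |]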
Chapter 3
Method
The purpose of this chapter is to describe the detailed approach used to address the following
research questions (also listed in Section 1.2):
RQ1 Which already developed models in the state-of-the-art can be utilised to achieve music
generation?
RQ2 What are the demands on the quality of background music in public spaces?
RQ3 How can the quality of generated music be measured?
RQ4 To what extent can existing generative models generate background music suitable for
public spaces?
To address RQ1 a review of the state-of-the-art was done through a semi-structured literature
review. Investigating music quality in RQ3 was done through interviews with professional
musicians. The results from the interviews in combination with a semi-structured literature
review addressed RQ2. Finally, experiments were carried out to address RQ4 where models
found in RQ1 were implemented using metrics from RQ3.
Papers found: 11
In addition to the semi-structured literature review, articles were also found through
citation analysis, or snowballing [19], which is a reason for its semi-structured nature. This
was done by following the references backwards in some of the articles found through the
literature review. Snowballing was used when searching for papers about music evaluation
criteria for RQ3, by investigating which metrics had been used to evaluate the models in the
articles from RQ1.
3.2 Interviews
Four semi-structured interviews [25] were carried out with professional musicians. They had
all composed music of their own and had rigorous musical education. The aim of the inter-
views was to attempt to define the more abstract aspects of what makes music sound good,
and later use this knowledge either to create or choose a mathematical evaluation method.
The analysis was carried out by finding patterns and common themes in the interviews. The
interview protocol and consent form are included in Appendix A.1.
3.3 Metrics
The interviews together with the literature review resulted in a set of important, measurable
characteristics of music. Most of the metrics do not have a specified reference value, but
instead need to be compared to a reference piece or distribution. To find reference points for
the generated music, metrics were computed for the training data sets, which were used as a
representation of human-made music. This shows how the scores vary and can give a hint of
what could be a lower limit score or an interval to stay within. The properties of the music
were evaluated using the following metrics:
Consonance. This metric is calculated as the key consistency of a song, which is measured
by registering to what extent the notes that are played in a sequence belong to the
same key. The consonance is measured between 0 and 1 for each major and minor key,
meaning that a song in C major should get a consonance score of 1 for C major and 0
for every other key. A high consonance score is therefore considered an indicator of
good quality.
Centricity The harmonies in a musical piece should be structurally similar, meaning that
some notes and chords are more common than others. This can be measured by col-
lecting the occurrence of each note in a histogram, whereafter the statistical entropy
of the histogram is measured. Entropy is high if all notes are common, and low if a
couple of notes dominate. It is therefore assumed that a song with low entropy follows
tonality theory better [30].
Macroharmony The total number of distinct notes used in the song is defined as the macro-
harmony. The maximum macroharmony is 12, i.e. notes in different octaves count as the
same note. Tymoczko's theory of tonal music (presented in Section 2.3.1) says that tonal
music should have a macroharmony of 5-8 notes [28]. (A small illustrative computation of
this and some of the other metrics is sketched after this list.)
Groove Grooving pattern similarity, or consistency of rhythm. We limit the generated music
by saying that the rhythm in bars close to each other in time should not differ too much
but be somewhat consistent. For music in general this is not always true, but to branch
out when it comes to rhythm requires more detailed specification of how this should
be done [30].
Empty beat ratio The music should not have completely silent parts in the middle of the
sequence, which is why a low empty beat ratio is aimed for.
Note density The note density measures the ratio of note onsets, i.e the number of note
onsets divided by the total length of the sequence. Previous research [5] [16] shows
that a low song tempo can have a calming effect on the listener. Lowering the tempo
makes the note density lower for a sequence length fixed in time, which then results
in note density correlating with tempo. With this reasoning, the target note density
should not be too high.
Compression ratio and TECs The structure induction algorithm (SIA) by Meredith [21] can
be used to calculate repetition and patterns in long-term structure. SIA identifies
translational equivalence classes (TECs) by representing each note as a point and ap-
plying a pattern recognizing algorithm to it. From this a visual representation of the
musical patterns can be obtained. SIA finds all patterns in a musical passage, including
ones that are not of interest from a musicological perspective. COSIATEC by Mered-
ith [21] is an improved implementation of SIA and finds actual musical patterns more
accurately by filtering out the most relevant patterns based on coverage, compactness
and compression ratio. The compression ratio measures the total number of points
that make up a pattern’s total numer of occurrences.
(Table 3.2: type, training data and licence for the models chosen for evaluation.)
Chapter 4
Results
In this chapter the results of the interviews and experiments are presented. Section 4.1
presents an analysis of the interviews, Section 4.2 presents the evaluations of the data sets,
Section 4.3 presents the evaluations of the pretrained models, while Sections 4.4 and 4.5 show
the results from retraining Performance RNN and GPT-3.
Rhythm and structure All four interviewees were asked the question "What would
make a song really uncomfortable to listen to?", to try to find specific music characteristics that
could be quantifiable. In discussing this, all of the interviewees highlighted the importance
of rhythm and structure. A structural element that P1 and P4 mentioned is the feeling of
"coming home" in a song, referring to a notion of being centered around a note from which
the melodies branch out, but then return to create the feeling of "coming home". When speaking
of rhythm, P4 also mentioned that even though music is largely based on repetition, a negative
characteristic of music could be that it is too repetitive.
"music benefits from being organized, even unexpected things, if done with the right
timing, can result in something organized and structured that we enjoy listening to", P3
"there should be a right amount of repetition, for example if something that sounds off
is played over and over, some kind of repetition is created and makes it not sound off
anymore. It is always a balancing act. It can also be looked at at different levels - if
there is much repetion on the small scale, but none on a big scale, it is still not quite
there", P4
Timbre and sound P3 and P4 mentioned that harsh and too loud sounds would be
uncomfortable to listen to. P2 also touched on this subject, and said that "there’s a purely
sonical aspect of music, meaning that the actual sounds we hear should appeal to us in order for us to
like it". Timbre or sound as an evaluation parameter would primarily be applicable for models
generating raw audio, but MIDI also specifies the velocity, i.e. the volume, of each tone.
Background music characteristics P1 said that "since the context is quite general,
the music should be calm and act calming to the listener and not induce stress or anger". P2 contrasted
this by saying that the music could also act as a piece of public art whose purpose is to surprise
the listener. In this case the music would not necessarily have to be calming, but could be.
P2 also thought that the music should suit the specific space. P4 mentioned ambient music
when talking about background music, which generally means music with a slow tempo and
lack of structure, commonly used for relaxation and meditation. This goes in line with the
statement of P1, that background music in a public space can benefit from being calming.
Table 4.1: The mean and standard deviations of the chosen metrics evaluated on the chosen data sets. All of the datasets achieve a relatively high consonance.
Data set Size Consonance Macroharmony Centricity Note Density
EMOPIA 1087 0.9572 ± 0.0603 8.4855 ± 1.5621 2.7073 ± 0.2406 0.8554 ± 0.4492
MAESTRO 1276 0.7919 ± 0.0788 11.9898 ± 0.115 3.3451 ± 0.1464 1.2942 ± 0.4459
OpenEWLD 502 0.9413 ± 0.0586 8.8805 ± 1.7428 2.7779 ± 0.2557 0.2139 ± 0.0813
TKSE 177 0.8423 ± 0.0539 11.9714 ± 0.225 3.255 ± 0.1475 0.3955 ± 0.1133
Data set Size Groove Empty beat ratio Compression ratio TECs
EMOPIA 1087 0.9841 ± 0.079 0.0053 ± 0.0157 1.3023 ± 0.1376 27.40 ± 14.34
MAESTRO 1276 0.9761 ± 0.0084 0.0288 ± 0.0218 - -
OpenEWLD 502 0.9308 ± 0.0271 0.0183 ± 0.0286 - -
TKSE 177 0.9183 ± 0.0308 0.0089 ± 0.0232 - -
4.3 Pretrained model results
Evaluating the pretrained models gives an indication of their respective potential. The
metric results do not differ that much between the models, but when it comes to modelling
long-term structure they perform differently.
MusicVAE generated sparse samples, some of which were completely empty. Its note den-
sity was much lower than for the other models, which in combination with the low
number of TECs, i.e. low pattern repetition, led us to discard this model.
The lack of melodic structure could also be heard when listening to the samples.
MuseMorphose performed better and showed great potential, but was discarded because
of a lack of documentation of how to retrain and run the model.
Performance RNN and GPT-3 were investigated further through retraining and param-
eter tuning. Performance RNN was chosen because it showed clear patterns and long-term
structure in the generated music, and because it is easily available and is highly tuneable.
GPT-3 was chosen even though it did not have any pre-trained weights for music generation.
Since it is a language-based model that has had great results modelling long-term structure
in other applications, it was of interest to try it on music generation as well.
4.4 Magenta: Performance RNN
(Figure 4.2: loss and accuracy as a function of training step when retraining Performance RNN with different learning rates.)
(Figure 4.3: the means and standard deviations of the note density metric evaluated on songs generated with Performance RNN using different seeding sequences.)
4.5 GPT-3
GPT-3 was trained on two different datasets. It was trained on a subset of 500 songs from
EMOPIA, which contains songs in different keys. It was also trained on a subset of 500 songs
from OpenEWLD which only contains songs in one key. OpenEWLD comes in a MusicXML
format which is why it also contains information about song title and artist, which EMOPIA
does not. Therefore this information can be used as song prompts in GPT-3, which is de-
scribed in more detail in Section 4.5.1.
4.5.1 Prompt
GPT-3 takes a prompt as an input based on which it generates a completion. In the typical use
case where it generates text, the generated response depends highly on the input prompt. The
training data are arranged in these prompt-completion pairs from which the model learns.
When instead training to generate music, what the input prompt should contain is not as
evident. We will attempt two setups to investigate the impact the prompt has on the result.
Compared to the text generation use case, the same prompt should be able to give dif-
ferent completions, e.g. "How are you?" could be completed with multiple different answers,
"Great, how are you?", "Not to well actually.", "Good." etc. Therefore the song prompts do
not necessarily have to be unique, since many different completions can be plausible for one
prompt.
OpenEWLD
In the first setup the model was trained on the OpenEWLD dataset which contains informa-
tion about artist and song titles for each song. The artist and song title were used as unique
prompts for each song input to the model. Each song in this dataset was transposed to the
same key before used as input in training.
# OpenEWLD example prompt:
"X: 1 $ T: A song about trees $ C: The oaks $ <song>"
EMOPIA
In the second setup the model was trained on the EMOPIA dataset which does not contain
any information about artist or title. Instead the ABC tune header was used as input. The
ABC header contains song-specific information regarding time and key, which has an impact
on the output of the song. These prompts are therefore not unique for each song. This was
investigated both for the dataset transposed to the same key as well as kept in their original
keys.
# EMOPIA example prompt:
"X: 1 $ M: 3/4 $ L: 1/16 $ K: Em $ <song>"
4.5.2 Learning rate
(Figure 4.4: the loss and accuracy for GPT-3 trained on a subset of EMOPIA for 7 epochs.)
Training should be carried out for more epochs in order to be able to draw a conclusion
about the optimal learning rate. As expected, a lower learning rate implies slower learning. The
highest learning rate, with a learning rate multiplier (lrm) of 0.2, gave a relatively high accuracy
already after 7 epochs. This would be interesting to investigate further.
4.5.3 Epochs
The impact of the training length was investigated by using a high number of training epochs.
A batch size of 32 and a learning rate multiplier of 0.2 were used. The resulting loss and
accuracy are presented in Figure 4.5 and the evaluation of the generated songs is presented
in comparison with the training data in Table 4.5. The evaluation scores of the generated music
end up close to the scores of the training data, achieving a high consonance, a limited
macroharmony and a low centricity. It does not contain as many TECs, i.e. repeated patterns,
as the training data, but since the standard deviation of the number of TECs in the training
data is high, this does not have to mean that the generated music lacks structure. Listening
to the music also confirms that it contains melodic structure.
(Figure 4.5: the loss and accuracy for GPT-3 trained on a subset of EMOPIA for 25 epochs.)
The loss and accuracy for a training iteration using OpenEWLD are shown in Figure 4.6 for
comparison. The same training parameters were used and the only difference is the training
data. Here the model reaches a higher accuracy than in Figure 4.5 already after 3
epochs. This implies that the data set impacts the training results.
(Figure 4.6: the loss and accuracy for GPT-3 trained on a subset of OpenEWLD for 4 epochs.)
(Figure 4.7: how the songs in the EMOPIA dataset are distributed over keys.)
Data set Size Groove Empty beat ratio Compression ratio TECs
Standard EMOPIA 500 0.9963 ± 0.0012 0.0 ± 0.0 2.2459 ± 0.5366 13.52 ± 4.86
EMOPIA Major 405 0.9961 ± 0.0012 0.0 ± 0.0 2.103 ± 0.4628 17.84 ± 5.78
EMOPIA Minor 500 0.9958 ± 0.0014 0.0 ± 0.0 2.2509 ± 0.54 13.41 ± 4.81
Table 4.7: Metric evaluations for GPT-3 for varying generation temperatures.
Data set Temperature Consonance Macroharmony Centricity Note density
OpenEWLD 0.5 0.9141 ± 0.0273 10.09 ± 0.8258 2.9736 ± 0.0644 0.379 ± 0.0473
OpenEWLD 0.75 0.9478 ± 0.0369 9.54 ± 1.4793 2.8809 ± 0.1604 0.4412 ± 0.1557
OpenEWLD 1 0.9362 ± 0.0489 10.03 ± 1.5777 2.9387 ± 0.1804 0.494 ± 0.1639
EMOPIA 0.5 0.9979 ± 0.0124 5.6289 ± 1.3646 1.8907 ± 0.5681 1.5119 ± 1.9672
EMOPIA 0.75 0.9954 ± 0.012 6.8021 ± 0.9199 2.3056 ± 0.4132 1.2118 ± 0.6679
EMOPIA 1 0.988 ± 0.0238 7.5747 ± 1.2561 2.5225 ± 0.3829 1.0842 ± 0.6821
Data set Temperature Groove Empty beat ratio Compression ratio TECs
OpenEWLD 0.5 0.9992 ± 0.0001 0.0 ± 0.0 2.9972 ± 0.5308 7.02 ± 1.56
OpenEWLD 0.75 0.9985 ± 0.0006 0.0055 ± 0.0133 3.2550 ± 1.3071 7.85 ± 2.77
OpenEWLD 1 0.9983 ± 0.0007 0.0047 ± 0.0136 3.1388 ± 1.0121 9.06 ± 4.00
EMOPIA 0.5 0.9964 ± 0.0011 0.0 ± 0.0 2.3899 ± 0.6735 13.19 ± 4.62
EMOPIA 0.75 0.9967 ± 0.0011 0.0 ± 0.0 1.9278 ± 0.2485 15.80 ± 4.32
EMOPIA 1 0.9969 ± 0.0011 0.0003 ± 0.0023 1.7051 ± 0.1655 18.08 ± 4.61
4.5.5 Temperature
In Table 4.7 the results from varying the temperature are presented. The same temperature
variations were made for both the model trained on the OpenEWLD and the EMOPIA data
sets. The same prompt was used for generating music from both models:
# Prompt:
"X: 1 $ M: 4/4 $ L: 1/4 $ K: C $ <song>"
Chapter 5
Discussion
In our experiments, two sequential generative models, GPT-3 and Performance RNN, were
evaluated on two different data sets, OpenEWLD and EMOPIA. In this chapter the exper-
iment setup and the results of the experiments are analysed. Conclusions are also drawn in
relation to the research questions.
patterns. This could be because the model was slightly less trained than the other two,
but could also be a result of melodic differences that were present in these particular subsets
of EMOPIA. Repeating this attempt using more data could show more clearly what impact this
has on the results, but already with these small data set sizes the
pattern metrics indicate that the model trained on the smaller data set had not learnt the
structural patterns as well as the other models.
5.1.3 Epochs
Training GPT-3 with EMOPIA and OpenEWLD achieved very different accuracies as seen
in Figure 4.6 and Figure 4.5, and training on OpenEWLD achieved a higher accuracy than
EMOPIA with a fifth of the training time. As discussed in the previous section, the optimal
learning rate for longer training times does not have to correspond to the learning rate used,
which could have limited the quality of the results. Since the same learning rate was used in
training with both datasets, this indicates that training is highly dependent on the contents
of the data set and that parameter tuning is required for every data set individually.
5.1.4 Temperature
Investigating the impact the temperature had on GPT-3 showed that the evaluation metrics
are not in themselves a clear indicator of a good generated data set when we are specifically
interested in a diverse data set. In Figure 4.8, both generated data sets have a high conso-
nance, i.e how much of the song is in the same key, and contain songs that sound good to
the author when listening, but if the model is to be used to generate varied samples to play
in a public space the results in Figure 4.8b are preferred over the results in Figure 4.8a. For
low temperature the model generated almost the same song over and over. This follows the
expected behaviour of using a low temperature which inherently lowers the diversity and
makes it only generate the sample with the highest probabilities, i.e. the same over and over
if the prompt is the same. In the example shown in Figure 4.8 the songs were generated using
the same prompt, which is why an alternative idea to achieve more diverse samples could be
to vary the prompts in a smart way by for example randomizing song titles as prompts.
RNN, where the MIDI files are converted into a vector representation in order to be used
as input to the LSTM network. This has a big impact on the outcome of the training, since the
original data set can be assumed to have been altered in some way by the parsing. This aspect
is difficult to eliminate completely, but its impact could be made as small as possible by
choosing file formats that do not require many parsing steps. Furthermore, the two different
datasets that GPT-3 was retrained on came in different formats, MusicXML and MIDI,
which results in a difference in parsing between the two runs with the same model. It would
therefore be interesting to find a way of measuring the information loss and the extent of the
alteration of the training data in order to include this in the comparison between the two
models as well as the runs with different datasets using the same model.
5.2 Metric
Discussing music quality and fundamental demands on music in the interviews gave a good
understanding of which parameters would be interesting to quantify in an evaluation
metric. It also showed some of the diversity in taste and opinion, and the difficulty of
summarizing the broad spectrum of musicality into numbers. Because of the creative and
artistic perspectives on music, it can be a difficult task to classify something as "good" or "bad",
since there is always someone who is going to disagree. The following sections evaluate the
different metrics and how well they encapsulate music quality.
5.2.3 Consonance
The consonance metric is based on what ratio of the notes in the song belong to its most
common key. The most common key is found by parsing through all of the notes and finding
which belong to what key. Since parallel keys consist of the same tones, they get the same
scores. Therefore there is a bias towards the key that is parsed first, which can lead to the
key being misdetermined. This happens when GPT-3 is trained on the OpenEWLD dataset.
All generated songs ideally use the tones in the key C, and therefore some are found to be
in its relative key, A minor, which lowers the consonance metric. By improving the key-finding
algorithm and taking chord progressions and melodic structure into account, this could be
improved.
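The behaviour described above can be illustrated with a simplified version of the key-consistency idea, scoring a melody against every major key by plain pitch-class membership. It is a sketch of the idea only, not the implementation used in this thesis; it also shows why relative keys tie, since they contain the same pitch classes.

MAJOR_STEPS = [0, 2, 4, 5, 7, 9, 11]

def key_consistency(pitch_classes):
    # Ratio of notes that belong to each major key (the relative minor,
    # three semitones below the root, shares the same set and therefore ties).
    scores = {}
    for root in range(12):
        key_set = {(root + step) % 12 for step in MAJOR_STEPS}
        in_key = sum(pc in key_set for pc in pitch_classes)
        scores[root] = in_key / len(pitch_classes)
    return scores

melody = [0, 2, 4, 5, 7, 9, 11, 0]  # pitch classes of a C major scale
scores = key_consistency(melody)
print(max(scores, key=scores.get), max(scores.values()))  # 0 (C), 1.0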
5.2.4 Centricity
The centricity metric, which measures if some notes are more common than others, could
be an indicator of the music quality. A lower value of this metric means that some notes are
more common, which indicates that the model does not choose notes arbitrarily but has
instead learnt the harmonic patterns that are common in music. The scores varied between
2 and 3, which corresponds to the scores of the data sets, i.e of real music. The training
iterations that achieved higher accuracy also achieved a lower centricity score. For example
this can be seen in Table 4.3, where the model trained with the learning rate that achieved the
highest accuracy also obtained the lowest centricity score. When listening to samples with
different centricity scores, it is difficult to distinguish a difference due to the differences in
music content between the training data sets as well as other quality parameters that varied. This
would therefore be interesting to further correlate with human perception.
5.2.5 TECs
It proved difficult to quantify structure and repeatable patterns such as motifs or chord
progressions in a single number. Our attempt was to calculate the number of TECs, but what a
"good" number of TECs is is not completely clear, as this can vary immensely between songs
depending on, for example, music genre and song length. It could be said, though, that a very
low number of TECs, which means that there are no or very few repeated patterns in the
song, would be undesirable. For example, when listening to the music generated by the
pre-trained VAE model MusicVAE (see Section 4.3), the music clearly sounded disorganized
and structure-less, which corresponds to the low number of TECs found in Table 4.2. An
indication of the performance could therefore be given by comparing the number of TECs
to the mean number of TECs of the training data. As seen in Table 4.5 the EMOPIA dataset
varied a lot in the number of TECs with a standard deviation of ∼50% of the mean value
which complicates this by not having a strict interval.
There is also the additional aspect that potential sources of error here could stem from
the SIA algorithm's capability of finding the repeated patterns in the song. The COSIATEC
implementation of SIA (see Section 3.3) that was used only allows a tone to be part of one
repeated pattern [15], which is why the number of patterns could be underestimated.
5.2.6 Macroharmony
The macroharmony metric shows how well the model knows how to arrange the notes. If the
macroharmony uses all the possible notes, i.e. all 12, it can be presumed that the music is not
staying within one key. In Table 4.4 where the seeding sequence was varied in Performance
RNN, the macroharmony is close to 12, and the resulting music also sounds incoherent and
messy. On the contrary, in Table 4.6 and in Table 4.7 where temperature and data set key
content were varied the macroharmonies were lower. This indicates that the music content
of the generated songs is not just randomized, and instead follows a clear pattern that the
model learned. This metric is a clear indicator of the model performing well.
5.2.7 Groove
The groove metric, measuring the consistency of rhythm throughout the music, scored sim-
ilarly for all the music that was evaluated. Therefore not many conclusions can be drawn
from this metric; it could instead be more useful when evaluating music with
rhythmic instruments in it, as rhythmic patterns would be more easily distinguishable and
possibly more error-prone.
Chapter 6
Conclusion
This thesis researched generative AI models for music generation. By revisiting the research
questions multiple conclusions can be drawn.
RQ1 Which already developed models in the state-of-the-art can be utilised to achieve mu-
sic generation?
By evaluating the training process based on training accuracy and loss, as well as eval-
uating the generated music using the custom metric, we found that transformer and
LSTM models work well for generating music. We also conclude that they perform
well in generating samples with long-term structure.
RQ2 What are the demands on the quality of background music in public spaces?
Through the interviews a clear set of important characteristics was determined, which
proved useful later in the work. The interviews also showed the diversity in music
and the difficulty of quantifying such a varied form of media.
RQ4 To what extent can existing generative models generate background music suitable for
public spaces?
Both GPT-3 and Performance RNN generated music of acceptable quality. They are
both affordable to train in terms of time, and generate music with clear structural el-
ements and melodic contents that follow the conventions of music to a large extent.
They both implement model architectures, transformer and LSTM, that have inherent
support for modelling long-term structure in sequential data.
Future work
In future work it would be of interest to correlate human perception of music with the nu-
merical metrics used in this thesis. This would give clear indications of what scores to aim
for in the metrics.
• Since this thesis project was limited by time and computing resources, another im-
provement would be to train the models for longer and evaluate the differences.
• Furthermore, it would be beneficial to find out in what ways the data set contents affect
the outcome of training, i.e. what music is easier to train on.
References
[4] Dor Bank, Noam Koenigstein, and Raja Giryes. Autoencoders. arXiv preprint
arXiv:2003.05991, 2020.
[5] Luciano Bernardi, Cesare Porta, and Peter Sleight. Cardiovascular, cerebrovascular, and
respiratory changes induced by different types of music in musicians and non-musicians:
the importance of silence. Heart, 92(4):445–452, 2006. Published by: BMJ Publishing
Group Ltd.
[6] B. Boone and M. Schonbrun. Music Theory 101: From keys and scales to rhythm and melody,
an essential primer on the basics of music theory. Adams 101. Adams Media, 2017.
[7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan-
guage models are few-shot learners. Advances in neural information processing systems,
33:1877–1901, 2020. Published by: Morgan Kaufmann Publishers Inc.
[10] Li Deng. The MNIST database of handwritten digit images for machine learning research.
IEEE Signal Processing Magazine, 29(6):141–142, 2012.
[11] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and
Ilya Sutskever. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341,
2020.
[12] Ahmad Elgammal. How artificial intelligence completed Beethoven's unfinished tenth
symphony. Smithsonian Magazine, 2021.
[13] Sebastian Garcia-Valencia. Cross entropy as objective function for music generative
models. arXiv preprint arXiv:2006.02217, 2020.
[15] Dorien Herremans and Elaine Chew. Morpheus: generating structured music with con-
strained patterns and tension. IEEE Transactions on Affective Computing, 10(4):510–523,
2017.
[16] Max J Hilz, Peter Stadler, Thomas Gryc, Juliane Nath, Leila Habib-Romstoeck, Brigitte
Stemper, Susanne Buechner, Samuel Wong, and Julia Koehn. Music induces different
cardiac autonomic arousal effects in young and older persons. Autonomic Neuroscience,
183:83–93, 2014. Published by: Elsevier.
[17] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computa-
tion, 9(8):1735–1780, 1997. Published by: MIT Press.
[18] Hsiao-Tzu Hung, Joann Ching, Seungheon Doh, Nabin Kim, Juhan Nam, and Yi-Hsuan
Yang. Emopia: a multi-modal pop piano dataset for emotion recognition and emotion-
based music generation. arXiv preprint arXiv:2108.01374, 2021.
[19] Barbara Kitchenham, O Pearl Brereton, David Budgen, Mark Turner, John Bailey, and
Stephen Linkman. Systematic literature reviews in software engineering–a systematic
literature review. Information and software technology, 51(1):7–15, 2009. Published by:
Elsevier.
[21] David Meredith. RecurSIA-RRT: Recursive translatable point-set pattern discovery with
removal of redundant translators. In Joint European Conference on Machine Learning and
Knowledge Discovery in Databases, pages 485–493. Springer, 2019.
[22] Eddie Mullan. Nine most notorious copyright cases in music history. BBC, Jun 2019.
[23] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hier-
archical latent vector model for learning long-term structure in music. In International
conference on machine learning, pages 4364–4373. PMLR, 2018.
[24] Dan Robitzski. Mind-melting AI makes Frank Sinatra sing "Toxic" by Britney Spears.
Futurism, May 2020.
[25] Colin Robson and Kieran McCartan. Ch 8: Multi-strategy (mixed method) designs. Real
World Research, 4th Edition. John Wiley & Sons, 2015.
[26] Ian Simon and Sageev Oore. Performance RNN: Generating music with expressive tim-
ing and dynamics. https://magenta.tensorflow.org/performance-rnn, 2017.
(Accessed: 2022-04-11).
[27] Federico Simonetta, Filippo Carnovalini, Nicola Orio, and Antonio Rodà. Symbolic
music similarity through a graph-based representation. In Proceedings of the Audio Mostly
2018 on Sound in Immersion and Emotion, AM’18, pages 1–7, New York, NY, USA, 2018.
Published by: Association for Computing Machinery.
[28] Dmitri Tymoczko. A Geometry of Music: Harmony and Counterpoint in the Extended Com-
mon Practice. Oxford University Press, Oxford, 2011.
[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in
neural information processing systems, 30, 2017.
[30] Shih-Lun Wu and Yi-Hsuan Yang. The jazz transformer on the front line: Exploring
the shortcomings of AI-composed music through quantitative measures. arXiv preprint
arXiv:2008.01307, 2020.
[31] Shih-Lun Wu and Yi-Hsuan Yang. MuseMorphose: Full-song and fine-grained music
style transfer with one transformer VAE. arXiv preprint arXiv:2105.04090, 2021.
Appendix A
A.1 Interview documents
I agree that my answers during the interview will be recorded and that this data will be
handled in accordance with the section above.
Date: ........................
Interviewer Name: ........................    Interviewee Name: ........................
Interviewer Signature: ........................    Interviewee Signature: ........................
Context
The aim of this master thesis project is to generate music using machine learning models.
Since computers don’t know anything about anything until we teach them, it’s important to
break down the concept of music into its most basic parts in order for the computer to be
able to understand it. Therefore, facts that we humans might think of as obvious also need to
be defined carefully. In addition to this, knowledge about music theory could be used to
automate the evaluation process of the results in a simple way.
Consent
It is up to you if you want to answer a question or not, and the interviewer is always available
to answer questions after the interview. The interview will be recorded in order to simplify the
interview process and the evaluation of the answers. The recording will be shared with the
supervisors of the project, and will be discarded after the project is finished. The answers in
the interview could be included in the report, but in this case they will be completely
anonymous. The final report will be shared with the interview subject.
Interview questions
General
If applicable, what was your process the last time you composed a song? Do you have a
common strategy that you use? Do you decide on tempo etc beforehand or does it happen
organically?
(In what way does the music theory you know affect the music you create?)
What factors do you think could contribute to making a song really uncomfortable to listen
to? Structure? Sound? (For you but also for the average person, the general public)
What kind of music do you like to listen to? What do you think it is that makes this music
enjoyable?
Limit one parameter e.g. melody, can only use one single tone, how would you compose it to
make it musical?
In short:
According to you, what theoretical demands are there on music for it to be comfortable to
listen to? (Rhythm, speed, key, melody, chords, singing, what instruments, genre, timbre)
What type of music would you want to put in this kind of context? (A very general context
where the listener is unspecified and could be practically anyone)
If applicable: How is your thought process when composing film music compared to
composing a song with the purpose to be listened to independently? How would you
approach composing background music?
Generated music
This part of the interview will consist of listening to a couple of AI generated songs and trying
to find the factors that make them sound “fake” or robot-like.
DEPARTMENT OF COMPUTER SCIENCE | FACULTY OF ENGINEERING (LTH), LUND UNIVERSITY | PRESENTED 2022-06-16
AI in the role of creator of culture is a relatively new phenomenon. This thesis specifically
explores the creation of music using AI methods.
To play music in public spaces, the current copyright regulations must be followed, which
means holding a commercial licence. As an alternative for shop owners and other actors who
play music in public spaces, one possibility could be to offer licence-free music. The music
could be generated by AI, since AI-generated music is licence-free if the model is trained on
entirely copyright-free music. The music could be licence-free even if this is not the case, but
the law around culture generated by AI is debated, because it is a new and largely untested
area. The discussion partly concerns the question of who should be considered the creator.
The first of the two models is made to generate music in MIDI format, while the second is a
large language model that can be used to generate music represented in a text-based format.
The music was evaluated using a handful of quantitative measures based on music theory,
the theory of tonality, and the mathematical structure of music.
(Figure: a pipeline from training data through the AI model, controlled by sampling
parameters, to generated music.)