Pitch Histograms in Audio and Symbolic Music Information Retrieval
George Tzanetakis
(corresponding author)
1. INTRODUCTION
Traditionally, music information retrieval (MIR) has been separated into symbolic MIR, where structured signals
such as MIDI files are used, and audio MIR, where arbitrary unstructured audio signals are used. For symbolic
MIR, melodic information is typically utilized, while for audio MIR, timbral and rhythmic information is typically
used. In this paper, the main focus is the representation of global statistical information about the pitch content of
musical signals in both symbolic and audio form. More specifically, Pitch Histograms are defined and proposed as
a way to represent pitch content information and are evaluated in the context of automatic musical genre
classification.
Given the rapidly increasing importance of digital music distribution, as well as the fact that large web-based
music collections are continuing to grow in size exponentially, it is obvious that the ability to effectively navigate
within these collections is a desirable quality. Hierarchies of musical genres are used to structure on-line music
stores, radio stations as well as private collections of computer users.
Up to now, genre classification for digitally stored music has been performed manually and therefore automatic
classification mechanisms would constitute a valuable addition to existing music information retrieval systems.
One could, for instance, envision an Internet music search engine that searches for a set of specific musical
features (genre being one of them), as specified by the user, within a space of feature-annotated audio files.
Musical content features that are good for genre classification can also be used in other types of analysis, such as
similarity retrieval or summarization. Therefore, genre classification provides a way to evaluate automatically
extracted features that describe musical content. Although the division of music into genres is somewhat subjective
and arbitrary, there exist perceptual criteria related to the timbral, rhythmic and pitch content of music that can be
used to characterize a particular musical genre. In this paper, we focus on pitch content information and propose
Pitch Histograms as a way to represent such information.
Symbolic representations of music such as MIDI files are essentially similar to musical scores and typically
describe the start, duration, volume, and instrument type of every note of a musical piece. Therefore, in the case of
symbolic representation the extraction of statistical information related to the distribution of pitches, namely the
Pitch Histogram, is trivial. On the other hand, extracting pitch information from audio signals is not easy.
Extracting a symbolic representation from an arbitrary audio signal, called “polyphonic transcription”, is still an
open research problem solved only for simple and synthetic “toy” examples. Although the complete pitch
information of an audio signal cannot be extracted reliably, automatic multiple pitch detection algorithms can still
provide sufficiently accurate information to calculate overall statistics about the distribution of pitches in
the form of a Pitch Histogram. In this paper, Pitch Histograms are evaluated in the context of musical genre
classification. The effect of pitch detection errors for the audio case is investigated by comparing genre
classification results for MIDI and audio-from-MIDI signals. For the remainder of the paper it is important to
define the following terms: symbolic, audio-from-MIDI and audio. Symbolic refers to MIDI files, audio-from-
MIDI refers to audio signals generated using a synthesizer playing a MIDI file and audio refers to general audio
signals such as mp3 files found on the web.
This work can be viewed as a bridge connecting audio and symbolic MIR through the use of pitch information for
retrieval and genre classification. Another valuable idea described in this paper is the use of MIDI data as the
ground truth for evaluating audio analysis algorithms applied to audio-from-MIDI data.
The remainder of this paper is structured as follows: A review of related work is provided in Section 2. Section 3
introduces Pitch Histograms and describes their calculation for symbolic and audio data. The evaluation of Pitch
Histograms features in the context of musical genre classification is described in Section 4. Section 5 describes the
implementation of the system and Section 6 contains conclusions and directions for future work.
2. RELATED WORK
Music Information Retrieval (MIR) refers to the process of indexing and searching music collections. MIR systems
can be classified according to various aspects such as the type of queries allowable, the similarity algorithm, and
the representation used to store the collection. Most of the work in MIR has traditionally concentrated on
symbolic representations such as MIDI files. This is due to several factors such as the relative ease of extracting
structured information from symbolic representations as well as their modest performance requirements, at least
compared to MIR performed on audio signals. More recently a variety of MIR techniques for audio signals have
been proposed. This development has been spurred by increases in hardware performance and the development of
new Signal Processing and Machine Learning algorithms.
Symbolic MIR has its roots in dictionaries of musical themes such as Barlow and Morgenstern (1948). Because of its
symbolic nature, it is often influenced by ideas from the field of text information retrieval (Baeza-Yates and
Ribeiro-Neto, 1999). Some examples of modeling symbolic music information as text for retrieval purposes are
described in Downie (1999) and Pickens (2000). In most cases the query to the system consists of a melody or a
melodic contour. These queries can either be entered manually or transcribed from a monophonic audio recording
of the user humming or singing the desired melody. The second approach is called Query-by-humming and some
early examples are Kageyama, Mochizuki and Takashima (1993) and Ghias, Logan, Chamberlin and Smith (1995).
A variety of different methods for calculating melodic similarity are described in Hewlett and Selfridge-Field
(1998). In addition to melodic information, other types of information extracted from symbolic signals can also be
utilized for music retrieval. As an example the production of figured bass and its use for tonality recognition is
described in Barthelemy and Bonardi (2001) and the recognition of Jazz chord sequences is treated in Pachet
(2000). Unlike symbolic MIR which typically focuses on pitch information, audio MIR has traditionally used
features that describe the timbral characteristics of musical textures as well as beat information. Representative
examples of techniques for retrieving music based on audio signals include: retrieval of different performances of the
same orchestral piece based on its long-term energy profile (Foote, 2000), discrimination of music and speech (Logan, 2000)
(Scheirer & Slaney, 1997), classification, segmentation and similarity retrieval of musical audio signals
(Tzanetakis & Cook, 2000), and automatic beat detection algorithms (Scheirer, 1998) (Laroche, 2001).
Although accurate multiple pitch detection on arbitrary audio signals (polyphonic transcription) is an unsolved
problem, it is possible to extract statistical information regarding the overall pitch content of musical signals.
Pitch Histograms are such a representation of pitch content that has been used together with timbral and rhythmic
features for automatic musical genre classification in Tzanetakis and Cook (2002). The idea of Pitch Histograms is
similar to the Pitch Profiles proposed in (Krumhansl, 1990) for the analysis of tonal music in symbolic form. The
original version of this paper first appeared in Tzanetakis, Ermolinskyi and Cook (2002). Pitch Histograms are
further explored and their performance is compared both for symbolic and audio signals in this paper. The goal of
the paper is not to demonstrate that features based on Pitch Histograms are better or more useful in any sense
compared to other existing features but rather to show their value as an additional alternative source of musical
content information. As already mentioned, symbolic MIR and audio MIR traditionally have used different
algorithms and types of information. This work can be viewed as an attempt to bridge these two distinct
approaches.
3. PITCH HISTOGRAMS
Pitch Histograms are global statistical representations of the pitch content of a musical piece. Features calculated
from them can be used for genre classification, similarity retrieval as well as any type of analysis where some
representation of the musical content is required. In the following subsections, Pitch Histograms are defined and
used to extract features for genre classification.
3.1 Pitch Histogram Definition
A Pitch Histogram is, basically, an array of 128 integer values (bins) indexed by MIDI note numbers and showing
the frequency of occurrence of each note in a musical piece. Intuitively, Pitch Histograms should capture at least
some amount of information regarding harmonic features of different musical genres and pieces. One expects, for
instance, that genres with more complex tonal structure (such as Classical music or Jazz) will exhibit a higher
degree of tonal change and therefore have more pronounced peaks in their histograms than genres such as Rock,
Hip-Hop or Electronica music that typically contain simple chord progressions.
Two versions of the histogram are considered: an unfolded (as defined above) and a folded version. In the folded
version, all notes are transposed into a single octave (array of size 12) and mapped to a circle of fifths, so that
adjacent histogram bins are spaced a fifth apart rather than a semitone. More specifically, if n denotes the MIDI
note number (C4 is 60), the folded histogram index is obtained as c = n mod 12, and the mapping to the circle of
fifths is given by c' = (7 x c) mod 12.
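As a minimal illustration of these two mappings (assuming the list of MIDI note numbers has already been extracted, for example from MIDI note-on events), the following sketch builds both the unfolded 128-bin histogram and the 12-bin folded, fifths-ordered version; the function name and the use of NumPy are our own choices, not part of the MARSYAS implementation:

```python
import numpy as np

def pitch_histograms(midi_notes):
    """Build the unfolded (128-bin) and folded (12-bin, circle-of-fifths)
    Pitch Histograms from a sequence of MIDI note numbers (0-127)."""
    unfolded = np.zeros(128, dtype=int)
    folded = np.zeros(12, dtype=int)
    for n in midi_notes:
        unfolded[n] += 1
        c = n % 12               # fold into a single octave (pitch class)
        c_fifths = (7 * c) % 12  # re-index so adjacent bins are a fifth apart
        folded[c_fifths] += 1
    return unfolded, folded

# Example: a C major arpeggio over two octaves (C4 = 60)
unfolded, folded = pitch_histograms([60, 64, 67, 72, 76, 79, 84])
```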
Folding is performed in order to represent pitch class information independently of octave. The mapping to the
circle of fifths is done to make the histogram better suited for expressing tonal music relations; it was found
empirically that the resulting features yield better classification accuracy. As an example, a piece in C major will
have strong peaks at C and G (tonic and dominant) and will be more closely related to a piece in G major (G and D
peaks) than to a piece in C# major. The mapping to the circle of fifths makes the Pitch Histograms of two
harmonically related pieces more similar in shape than when the chromatic ordering is used. It can therefore be
said that the folded version of the histogram contains information regarding the pitch content of the music (a crude
approximation of harmonic information), whereas the unfolded version is useful for determining the pitch range of
the piece. As an example, consider two pieces both mostly in C major, one of which is on average two octaves
higher than the other. The two pieces will have very similar folded histograms; however, their unfolded histograms
will differ, as the higher piece will have more energy in the higher pitch bins of the unfolded Pitch Histogram.
For audio signals, the Pitch Histogram is computed using the multiple pitch detection model of Tolonen and
Karjalainen (2000), in which the signal is split into two frequency channels and a generalized autocorrelation is
computed for each channel. The summary autocorrelation is obtained as

x_2 = IDFT(|DFT(x_low)|^k) + IDFT(|DFT(x_high)|^k)
    = IDFT(|DFT(x_low)|^k + |DFT(x_high)|^k)

where x_low and x_high are the low and high channel signals before the periodicity detection blocks in Figure 2.
The parameter k determines the amount of frequency-domain compression (for normal autocorrelation, k = 2). The
Fast Fourier Transform (FFT) and its inverse (IFFT) are used to speed up the computation of the transforms.
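A possible sketch of this computation, assuming the two band-limited channel signals are already available as NumPy arrays (the channel splitting and the half-wave rectification/low-pass stage of the high channel shown in Figure 2 are omitted), is given below; the default value of k and the zero-padding length are illustrative choices:

```python
import numpy as np

def sacf(x_low, x_high, k=2.0, n_fft=None):
    """Summary autocorrelation of the two channels via frequency-domain compression.

    k = 2 gives the ordinary (power-spectrum) autocorrelation; the multipitch
    model of Tolonen & Karjalainen uses a smaller compression exponent."""
    if n_fft is None:
        n_fft = 2 * max(len(x_low), len(x_high))  # zero padding avoids circular wrap-around
    mag_low = np.abs(np.fft.rfft(x_low, n_fft)) ** k
    mag_high = np.abs(np.fft.rfft(x_high, n_fft)) ** k
    # Because the inverse transform is linear, the two channels can be summed
    # before inversion (the second form of the equation above).
    return np.fft.irfft(mag_low + mag_high, n_fft)
```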
The peaks of the summary autocorrelation function (SACF) (signal x_2 in Figure 2) are relatively good indicators
of potential pitch periods in the analyzed signal. In order to filter out integer multiples of the fundamental period, a
peak pruning technique is used. The original SACF curve is first clipped to positive values; it is then time-scaled by
a factor of two and subtracted from the original clipped SACF, and the result is again clipped to positive values.
That way, repetitive peaks with double the time lag of the basic peak are removed. The
resulting function is called the enhanced summary autocorrelation (ESACF) and its prominent peaks are
accumulated in the Pitch Histogram calculation. More details about the calculation steps of this multiple pitch
detection model, as well as its evaluation and justification can be found in Tolonen & Karjalainen (2000).
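A sketch of this pruning step (clip, time-stretch by a factor of two, subtract, clip again), assuming the SACF is represented as a NumPy array indexed by lag, might look as follows; additional stretch factors beyond two would remove higher-order multiples but are not described here:

```python
import numpy as np

def esacf(sacf_curve, factors=(2,)):
    """Enhanced SACF: prune peaks at integer multiples of the fundamental lag."""
    lags = np.arange(len(sacf_curve))
    enhanced = np.clip(sacf_curve, 0.0, None)            # keep positive values only
    for f in factors:
        # Time-scale the clipped curve by the factor f (a peak at lag T moves to lag f*T)
        stretched = np.interp(lags / f, lags, enhanced)
        enhanced = np.clip(enhanced - stretched, 0.0, None)
    return enhanced
```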
Figure 1. Unfolded Pitch Histograms of two Jazz pieces (left) and two Irish folk songs (right).
The sparseness of the right-hand histograms results from the few chord changes of Irish folk music.
Figure 2. Block diagram of the multiple pitch detection model: the input is split into a high channel (high-pass
above 1 kHz, followed by half-wave rectification and low-pass filtering) and a low channel (low-pass below 1 kHz);
each channel is passed through periodicity detection, and the results are summed into the SACF (x_2), which is
then processed by the SACF enhancer.
The classification results are also summarized in Table 1 in the form of a so-called confusion matrix. Its columns
correspond to the actual genre and its rows to the genre predicted by the classifier. For example, the cell of row 5,
column 3 contains the value 10, meaning that 10% of rock music (column 3) was incorrectly classified as jazz (row 5).
The percentages of correct classifications lie on the main diagonal of the confusion matrix. It can be seen that 39%
of rock was incorrectly classified as Electronica, and the confusion between Electronica and other genres is the
source of several other significant misclassifications. All of this indicates that harmonic content analysis is not
well suited for Electronica music because of its extremely broad nature. Some of its melodic components can be
mistaken for rock, jazz or even classical music, whereas Electronica's main distinguishing feature, namely the
extremely repetitive structure of its percussive and melodic elements, is not reflected in any way in the Pitch
Histogram. It is clear from inspecting the table that certain genres are classified much better based on their pitch
content than others, something which is expected. However, even in the cases of confusion, the results are
significantly better than random and would therefore provide useful information, especially if combined with other
features.
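To make the row/column convention concrete, the following hypothetical sketch shows how such a percentage confusion matrix could be assembled from lists of actual and predicted labels; it is not the evaluation code used in the paper:

```python
import numpy as np

def confusion_matrix_percent(actual, predicted, genres):
    """Confusion matrix with rows = predicted genre, columns = actual genre.

    Each column is normalized so its entries are percentages of the pieces
    whose actual genre corresponds to that column."""
    index = {g: i for i, g in enumerate(genres)}
    counts = np.zeros((len(genres), len(genres)))
    for a, p in zip(actual, predicted):
        counts[index[p], index[a]] += 1
    column_totals = counts.sum(axis=0, keepdims=True)
    return 100.0 * counts / np.maximum(column_totals, 1)  # avoid division by zero
```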
In addition to these results, some representative pair-wise genre classification accuracy results are shown in Figure
4. A 2-genre classifier succeeds in correctly identifying the genre with 80% accuracy on average (1.6 times better
than chance). The classifier correctly distinguishes between Irish Folk music and Jazz with 94% accuracy, which is
the best classification result. The worst pair is Rock and Electronica, as can be expected, since both of these genres
often employ simple and repetitive tonal combinations.
Figure 4. Pair-wise evaluation in MIDI
It will be shown below that other feature-evaluating techniques, such as the analysis of rhythmic features or the
examination of timbral texture can provide additional information for musical genre classification and be more
effective in distinguishing Electronica from other musical genres. This is expected because Electronica is more
characterized by its rhythmic and timbral characteristics than by its pitch content.
An attempt was made to investigate the dynamic properties of the proposed classification technique by studying
the dependence of the algorithm’s accuracy on the time-domain length of the supplied input data. Instead of letting
the algorithm process MIDI files for the full length of 150 seconds, the histogram-constructing routine was
modified to only process the first n-second chunk of the file, where n is a variable quantity. The average
classification accuracy across one hundred files is plotted as a function of n in Figure 5.
The observed dependence of classification accuracy on the input data length is characterized by two pivotal points
on the graph. The first point occurs at around 0.9 seconds, which is when the accuracy improves to approximately
35% from the random 20%. Hence, approximately one second of musical data is needed by our classifier to start
identifying genre-related harmonic properties of the data. The second point occurs at approximately 80 seconds
into the MIDI file, which is when the accuracy curve starts flattening off. The function reaches its absolute peak at
around 240 seconds (4 minutes).
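The experiment can be outlined as a simple loop over truncation lengths. The sketch below assumes two hypothetical helpers, extract_pitch_features (Pitch Histogram features from the first n seconds of a file) and cross_validated_accuracy (classifier training and evaluation), neither of which is specified here; they are passed in as parameters:

```python
def accuracy_vs_length(files, labels, lengths_sec,
                       extract_pitch_features, cross_validated_accuracy):
    """For each truncation length n, build Pitch Histogram features from the
    first n seconds of every file and record the classification accuracy.

    extract_pitch_features(path, max_seconds) -> feature vector   (hypothetical)
    cross_validated_accuracy(features, labels) -> float           (hypothetical)
    """
    curve = []
    for n in lengths_sec:
        features = [extract_pitch_features(f, max_seconds=n) for f in files]
        curve.append((n, cross_validated_accuracy(features, labels)))
    return curve

# Example grid of truncation lengths in seconds, similar in spirit to Figure 5:
# lengths_sec = [0.5, 1, 2, 5, 10, 20, 40, 80, 160, 240]
```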
4.4 Audio generated from MIDI representation
The genre classification results for the audio-from-MIDI representation are shown in Figure 6. Although the results
are not as good as the ones obtained from MIDI data, they are still significantly better than random classification.
More details are provided in Table 2 in the form of a confusion matrix. From Table 2, it can be seen that
Electronica is much harder to classify correctly in this case, probably due to noise in the feature vectors caused by
pitch errors of the multiple-pitch detection algorithm. A comparison of these results with the ones obtained using
the MIDI representation and general audio is provided in the next subsection. We have no reason to believe that
the outcome of the comparison was in any way influenced by the specifics of the MIDI-to-Audio conversion
procedure. Experiments with different software synthesizers for audio-from-MIDI conversion showed no
significant change in the results. The main reason for the decrease in performance is the complexity of multiple
pitch detection in audio signals, even when they are generated from MIDI. Of course, no information from the
original MIDI signal is used for the computation of the Pitch Histogram in the audio-from-MIDI case.
Figure 5. Average classification accuracy as a function of the length of input MIDI data (in seconds)
4.5 Comparison
One of the objectives of the described experiments was to estimate the amount of classification error introduced by
the multi-pitch detection algorithm used for the construction of Pitch Histograms from audio signals. Knowing that
MIDI pitch information (and therefore pitch content feature vectors extracted from MIDI) is fully accurate by
definition it is possible to estimate this amount by comparing the MIDI classification results with those obtained
from the audio-from-MIDI representation. A large discrepancy would indicate that the errors introduced by
multiple-pitch detection algorithm significantly affect the extracted feature vectors.
Figure 7. Classification accuracy comparison
Table 3. Comparison of classification results
                    Multi-pitch Features   Full Feature Set   RND
Audio-from-MIDI     43 ±7%                 75 ±6%             20%
Audio               40 ±6%                 70 ±6%             20%
The results of the comparison are shown in Figure 7. The same data is also provided in Table 3. It can be observed
that there is a decrease in performance between the MIDI and audio-from-MIDI representations. However, despite
the errors, the features computed from audio-from-MIDI still provide significant information for genre
classification. A further smaller decrease in classification accuracy is observed between the audio-from-MIDI and
audio representations. This is probably due to the fact that cleaner multiple pitch detection results can be obtained
from the audio-from-MIDI examples because of the artificial nature of the synthesized signals. The comparison of
the audio-from-MIDI and audio cases is only indicative, as the correspondence is only at the genre level. Basically, it
shows that similar classification results can be obtained for general audio signals as with audio-from-MIDI and
therefore Pitch Histograms are not only applicable to audio-from-MIDI data. The detailed results of the audio
classification (confusion matrix) are not included as no direct comparison can be performed with the results of the
audio-from-MIDI data.
In addition to information regarding pitch or harmonic content, other types of information, such as timbral texture
and rhythmic structure can be utilized to characterize musical genres. The full feature set results shown in Figure 7
and Table 3 refer to the feature set described and used for genre classification in Tzanetakis & Cook (2002). In
addition to the described pitch content features, this feature set contains timbral texture features (Short-Time
Fourier Transform (STFT) based, Mel-Frequency Cepstral Coefficients (MFCC)), as well as features about the
rhythmic structure derived from Beat Histograms calculated using the Discrete Wavelet Transform.
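In practice, combining the feature families amounts to concatenating the per-file vectors before classification. The sketch below shows this, together with an optional per-dimension standardization step that is a common practical choice but is not discussed in the paper:

```python
import numpy as np

def full_feature_vector(pitch_features, timbral_features, rhythm_features):
    """Concatenate pitch-content, timbral-texture and rhythmic feature vectors
    into a single 'full feature set' vector for one piece."""
    return np.concatenate([np.asarray(pitch_features, dtype=float),
                           np.asarray(timbral_features, dtype=float),
                           np.asarray(rhythm_features, dtype=float)])

def standardize(feature_matrix):
    """Optional per-dimension z-score normalization so that no feature family
    dominates the classifier (an assumption of ours, not taken from the paper)."""
    feature_matrix = np.asarray(feature_matrix, dtype=float)
    mean = feature_matrix.mean(axis=0)
    std = feature_matrix.std(axis=0)
    return (feature_matrix - mean) / np.where(std > 0, std, 1.0)
```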
It is interesting to compare this result with the performance of humans in classifying musical genre, which has
been investigated in Perrot & Gjerdingen (1999). It was determined that humans are able to correctly distinguish
between ten genres with 53% accuracy (against 10% chance) after listening to only 250-millisecond audio samples;
listening to three seconds of music yielded 70% accuracy. Although a direct comparison with the results described
here is not possible due to the different number of genres, it is clear that the automatic performance is not far
from human performance.
fuzzy nature of musical genre boundaries.
Figure 8. Three-dimensional time-pitch surface (X axis = time, Y axis = pitch, Z axis = bin amplitude)
5. IMPLEMENTATION
The software used for the audio Pitch Histogram calculation, as well as for the classification and evaluation, is
available as a part of MARSYAS (Tzanetakis & Cook, 2000), a free software framework for rapid development
and evaluation of computer audition applications. The software for the MIDI Pitch Histogram calculation is
available as separate C++ code and will be integrated into MARSYAS in the future. The framework follows a
client-server architecture. The server contains all the pattern recognition, signal processing and numerical
computations and runs on any platform that provides C++ compilation facilities. A client graphical user interface
written in Java controls the server. MARSYAS is available under the GNU General Public License (GPL) at:
http://www.cs.princeton.edu/~gtzan/marsyas.html
In order to experimentally investigate the results and performance of the Pitch Histograms, a set of visualization
interfaces for displaying the time evolution of pitch content information was developed. It is our hope that these
interfaces will provide new insights for the design and development of new features based on the time evolution of
Pitch Histograms.
These tools provide three distinct modes of visualization:
1) Standard Pitch Histogram plots (Figure 1) where the x-axis corresponds to the histogram bin and the y-axis
corresponds to the amplitude. These plots do not show the time evolution of the histogram; they display only the
final result.
2) Three-dimensional pitch-time surfaces (Figure 8) where the evolution of Pitch Histograms is depicted by
appending histograms in time. The axes are discrete time and discrete pitch (folded or unfolded), and the height is
the amplitude of the particular histogram bin at that time and pitch.
3) Projection of the pitch-time surfaces onto a two-dimensional bitmap, with height represented as the grayscale
color value (Figure 9).
These visualization tools are written in C++ and use OpenGL for the 3D graphics rendering.
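As an illustration of the third visualization mode, the following sketch (using Python and matplotlib rather than the C++/OpenGL tools described above) accumulates one folded histogram per analysis window and renders the resulting pitch-time matrix as a grayscale image; the window length and the circle-of-fifths ordering are illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt

def pitch_time_surface(note_events, hop_sec=1.0, folded=True):
    """Accumulate one (folded or unfolded) pitch histogram per time window.

    note_events: iterable of (onset_seconds, midi_note) pairs."""
    events = list(note_events)
    n_frames = int(max(t for t, _ in events) // hop_sec) + 1
    n_bins = 12 if folded else 128
    surface = np.zeros((n_bins, n_frames))
    for t, note in events:
        frame = int(t // hop_sec)
        bin_index = (7 * (note % 12)) % 12 if folded else note
        surface[bin_index, frame] += 1
    return surface

# Rendering the grayscale pitch-time image (mode 3):
# surface = pitch_time_surface(events)
# plt.imshow(surface, aspect='auto', origin='lower', cmap='gray')
# plt.xlabel('time (frames)'); plt.ylabel('pitch bin'); plt.show()
```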
Figure 9. Examples of grayscale pitch-time surfaces: Jazz (top) and Irish Folk music (bottom), X axis = time, Y axis=pitch.
The upper part of Figure 8 shows an ascending chromatic scale of equal-length non-overlapping notes. A snapshot
of the time-pitch surface of an actual music piece is shown in the lower part of Figure 8. Although more difficult to
interpret visually than the simple scale example, one can observe thick slices that in most cases correspond to
chords. By visual inspection of Figure 9, various types of interesting information can be observed. Some examples
are: the higher pitch range of the particular Irish piece (lower part) compared to the Jazz piece (upper part), as well
as its different periodic structure and melodic movement. These observations seem to generalize to the particular
genres and could potentially be used for the extraction of more powerful pitch content features.
7. REFERENCES
[1] Allamanche, E. et al. (2001) Content-based identification of audio material using MPEG-7 Low Level
Description. In Proc. Int. Symposium on Music Information Retrieval (ISMIR), Bloomington, Indiana.
[2] Baeza-Yates, R., & Ribeiro-Neto, B. (1999) Modern Information Retrieval. Harlow: Addison-Wesley.
[3] Barlow, H., & DeRoure, D. (1948). A Dictionary of Musical Themes. New York: Crown.
[4] Barthelemy, J., & Bonardi, A. (2001) Figured Bass and Tonality Recognition. In Proc. Int. Symposium on
Music Information Retrieval (ISMIR), Bloomington, Indiana.
[5] Downie, J. S. (1999) Evaluating a Simple Approach to Music Information Retrieval: Conceiving Melodic N-
grams as Text. Ph.D thesis, University of Western Ontario.
[6] Duda, R., Hart, P., & Stork, D. (2000) Pattern Classification. New York: John Wiley & Sons.
[7] Foote, J. (2000) ARTHUR: Retrieving Orchestral Music by Long-Term Structure. In Proc. Int. Symposium on
Music Information Retrieval (ISMIR), Plymouth, MA.
[8] Ghias, A., Logan, J., Chamberlin, D., & Smith, B.C. (1995) Query by humming: Musical information retrieval
in an audio database. In Proc.of ACM Multimedia, 231-236.
[9] Goto, M. et al. (2002) RWC Music Database: Popular, Classical and Jazz Music Databases. In Proc. Int.
Symposium on Music Information Retrieval (ISMIR), Paris, France.
[10] Hewlett, W.B., & Selfridge-Field, E. (Eds.) (1998) Melodic Similarity: Concepts, Procedures and
Applications. Computing in Musicology, 11.
[11] Kageyama, T., Mochizuki, K., & Takashima, Y. (1993) Melody Retrieval with Humming. In Proc. Int.
Computer Music Conference (ICMC), 349-351.
[12] Krumhansl, C.L. (1990) Cognitive Foundations of Musical Pitch. New York: Oxford University Press.
[13] Laroche, J. (2001) Estimating Tempo, Swing and Beat Locations in Audio Recordings. In Proc. IEEE Int.
Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 135-139, Mohonk, NY.
[14] Logan, B. (2000) Mel Frequency Cepstral Coefficients for Music Modeling. In Proc. Int. Symposium on
Music Information Retrieval (ISMIR), Plymouth, MA.
[15] Pachet, F. (2000) Computer Analysis of Jazz Chord Sequences: Is Solar a Blues? In Miranda, E. (Ed.),
Readings in Music and Artificial Intelligence. Harwood Academic Publishers.
[16] Perrot, D., & Gjerdingen, R. (1999) Scanning the dial: An exploration of factors in the identification of musical
style. In Proc. of the Society for Music Perception and Cognition pp.88, (abstract).
[17] Pickens, J. (2000) A Comparison of Language Modeling and Probabilistic Text Information Retrieval
Approaches to Monophonic Music Retrieval. In Proc. Int. Symposium on Music Information Retrieval
(ISMIR), Plymouth, MA.
[18] Scheirer, E., & Slaney, M. (1997) Construction and Evaluation of a Robust Multifeature Speech/Music
Discriminator. In Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Munich, Germany.
[19] Scheirer, E. (1998) Tempo and Beat Analysis of Acoustic Musical Signals. Journal of the Acoustical Society
of America, 103(1), 588-601.
[20] Tolonen, T., & Karjalainen, M. (2000) A Computationally Efficient Multipitch Analysis Model. IEEE Trans.
On Speech and Audio Processing, 8(6), 708-716.
[21] Tzanetakis, G., & Cook, P. (2000) Audio Information Retrieval (AIR) Tools. In Proc. Int. Symposium on
Music Information Retrieval (ISMIR), Plymouth, MA.
[22] Tzanetakis, G., & Cook, P. (2002) Musical Genre Classification of Audio Signals. IEEE Transactions on
Speech and Audio Processing, 10(5), 293-302.
[23] Tzanetakis, G., Ermolinskyi, A., & Cook, P. (2002) Pitch Histograms in Audio and Symbolic Music
Information Retrieval. In Proc. Int. Conference on Music Information Retrieval (ISMIR), Paris, France, 31-38.
[24] Tzanetakis, G., & Cook, P. (2000) Marsyas: A framework for audio analysis. Organised Sound, 4(3).