
Journal of the Audio Engineering Society, Vol. 64, No. 6, June 2016
DOI: http://dx.doi.org/10.17743/jaes.2016.0007

Categorization of Broadcast Audio Objects in Complex Auditory Scenes

JAMES WOODCOCK,1 WILLIAM J. DAVIES,1
([email protected])
TREVOR J. COX,1 AES Member, AND FRANK MELCHIOR,2 AES Member

1 Acoustics Research Centre, University of Salford, Salford, M5 4WT, United Kingdom
2 BBC R&D, Dock House, MediaCityUK, Salford, M50 2LH, United Kingdom

This paper presents a series of experiments to determine a categorization framework for broadcast audio objects. Object-based audio is becoming an ever more important paradigm for the representation of complex sound scenes. However, there is a lack of knowledge regarding object level perception and cognitive processing of complex broadcast audio scenes. As categorization is a fundamental strategy in reducing cognitive load, knowledge of the categories utilized by listeners in the perception of complex scenes will be beneficial to the development of perceptually based representations and rendering strategies for object-based audio. In this study expert and non-expert listeners took part in a free card sorting task using audio objects from a variety of different types of program material. Hierarchical agglomerative clustering suggests that there are seven general categories, which relate to sounds indicating actions and movement, continuous and transient background sound, clear speech, non-diegetic music and effects, sounds indicating the presence of people, and prominent attention grabbing transient sounds. A three-dimensional perceptual space calculated via multidimensional scaling suggests that these categories vary along dimensions related to the semantic content of the objects, the temporal extent of the objects, and whether the object indicates the presence of people.

0 INTRODUCTION

The aim of the work presented in this paper is to determine a categorization framework for typical broadcast audio objects. The proliferation of new technologies with which broadcast audio is consumed has resulted in a need to shift from the channel-based paradigm traditionally adopted by broadcast media to a more format agnostic approach.

In the transmission of broadcast audio, there are a variety of ways by which a virtual scene can be represented. Channel based representations are directly related to a specific loudspeaker layout, such as stereo and 5.1, and are the most widely used method to represent virtual scenes. In this paper the term "sound scene" refers to a physical pressure field, the "auditory scene" refers to the listener's perception of the sound scene, and the "virtual scene" refers to some virtual representation of the scene that can be transmitted and reproduced. Transformation based representations such as ambisonics utilize spatially orthogonal basis functions, such as spherical harmonics, to represent the virtual scene as a set of transformation coefficients that are then decoded on the reproduction side. Object-based representations store and transmit different elements of the content along with metadata describing the position of each element in time and space for rendering at the reproduction end.

Object-based audio (OBA) is often considered to be the future of spatial audio transmission [1-4] and is the primary focus of this paper. The major advantage of OBA over traditional channel-based approaches is that, as the rendering is done at the receiver end, the virtual scene can be rendered in such a way as to optimize the reconstruction for the given reproduction device and listening environment [5]. The retention of audio objects through the transmission chain opens the potential for object level processing, such as specific rules for how to render different types of objects on different reproduction systems. In this paper the term "audio object" refers to an audio signal with associated metadata, whereas the term "auditory object" refers to the perceptual construct. Although there have been several proposals of how to represent OBA (see, for example, Spatial Audio Object Coding [6], MPEG-4 AudioBIFS [7], and the Audio Scene Description Format [8]), knowledge regarding object level perception of complex broadcast audio scenes is rather limited.

In general, the aim of sound reproduction in the context of broadcast audio is to, as far as possible, faithfully recreate what the content producer has experienced in the production environment. However, the content delivery chain can result in significant degradations of the auditory scene perceived by the listener. For the representation, transmission, and rendering of OBA it is therefore advantageous to understand how complex auditory scenes are cognitively processed by the listener.

Listeners make sense of their acoustic environment by parsing auditory input into auditory objects [9-11], and there is evidence to suggest that subsequent processing of these auditory events occurs at an object, rather than signal, level [12]. The formation of auditory objects (also often called auditory events or auditory streams in the literature) consists in assigning elements of the acoustic input to one or more sources. This process is termed auditory scene analysis and is driven by two processes: (1) a pre-attentive partitioning process based on Gestalt principles [9]; (2) a schema driven process that uses prior knowledge to extract meaning from the acoustic representation [13]. Object categorization is a fundamental process underlying human cognition [14]; without the cognitive process of categorization, people would not be able to cope with the volume of sensory information to which they are constantly subjected. Thus, an understanding of how listeners categorize and assign concepts to auditory objects is central to the understanding of the perception of complex auditory scenes.

Knowledge of general categories for broadcast audio objects will aid the translation of produced virtual sound scenes to the intended listener experience by: (1) providing a perceptually grounded framework for OBA representations, and (2) allowing the investigation of object category specific rendering rules (i.e., what to do when rendering an object of category X for system Y). Subsequent perceptual testing to optimize listener experience for different loudspeaker-based reproduction systems, as well as headphone and binaural reproduction, based on these categories will allow the development of intelligent rendering schemes that will maximize the quality of experience for a given listening situation. For example, experiments could be conducted to determine quantitative rules that can be used when rendering different categories of objects for different loudspeaker layouts. Expert listeners could be given control of a small number of parameters of an object based mix (examples of these parameters might include the level and position of objects of a certain category) and asked to vary these parameters for a variety of different speaker layouts. This knowledge could then be built into a rendering scheme in the form of an additional semantic layer to the signal level manipulations that are carried out to render object based audio content for different loudspeaker layouts; for example, it might be expected that the signal level optimization for objects that carry dialogue would be different to that for diffuse background objects. What isn't currently known is how many categories listeners utilize and the nature of these categories.

Considering how fundamental categorization is to the human experience of the world, there is scant knowledge regarding the categorization of auditory objects. Soundscape research has suggested two generic cognitive categories for auditory events within urban soundscapes: "event sequences," where the source of the sound can easily be identified, and "amorphous sequences," where it cannot [15]. Based on this categorization, it appears that sounds are processed primarily as meaningful events, and where source identification fails sounds are processed according to physical or low level perceptual parameters. This view is backed up by a number of neuro-cognitive studies, which have found that the processing of environmental sounds is dependent on the relationship the sound has to its referent. For example, in a behavioral study by Giordano et al. [16], it was found that the evaluation of sounds produced by non-living objects is biased towards low level acoustic features, whereas the processing of sounds produced by living creatures is biased toward sound independent semantic information. There have been complementary findings in neuro-imaging studies, where object category specific temporal activations relating to non-living action/tool sounds, animal vocalizations, and human vocalizations have been observed (see, for example, Lewis et al. [17]).

Cognitive categories for environmental sounds have been explored by Gygi et al. [18], who found three distinct clusterings of sounds that related to harmonic sounds, discrete impact sounds, and continuous sounds. A study into auditory categories for environmental sounds in complex auditory scenes revealed two main categories relating to the presence or absence of human activity [19], and common categories for auditory events in soundscapes include "natural," "human," and "mechanical" [see Payne et al. [20] for a review]. It is important to note, however, that the cognitive categorization framework is contingent; this means that the categorization framework may change depending upon factors such as location and soundscape [21]. This has potentially important implications for broadcast audio, as the categorization framework will almost certainly change depending on the program material and factors such as the presence of a screen. This suggests that different rendering rules may be needed for different scene types; for example, material with accompanying visuals may require additional object categories to account for objects that appear on and off screen.

As well as investigations into the categorization of individual auditory objects, there has been some investigation into the categorization of complex auditory scenes. Rummukainen et al. [22] investigated the categorization of natural audiovisual scenes using a free card sorting paradigm. Based on the sorting experiment, five categories of scenes were identified that related to calm scenes, still scenes, noisy scenes, vivid scenes, and open scenes. A three dimensional multidimensional scaling solution was calculated (see Sec. 1.6.3), and the dimensions of the resulting perceptual space were found to relate to calmness, openness, and the presence of people.

From the literature detailed in this section, it can be seen that previous studies have focussed on the categorization of isolated auditory objects or the categorization of complex scenes as a whole. Object-based audio presents the opportunity to optimize the reproduction for each individual sound source in a produced scene.


It would therefore be beneficial to have knowledge of how listeners categorize audio objects in broadcast audio scenes. This would allow rendering schemes to have perceptually motivated, category specific optimization rules. Although previous work into the categorization of soundscapes and environmental sounds provides insight into cognitive categories for everyday sounds in isolation, there is a lack of knowledge regarding how listeners categorize objects in complex auditory scenes.

Broadcast sound scenes differ from real world scenes because they have been produced; this implies that some structure has already been imposed on the scene by the content producer. What is not clear is how listeners perceive this structure, and the implications this has on the cognitive categorization scheme used by the listener. This paper reports on a series of exploratory experiments that were conducted with the primary aim of determining general cognitive categories for common broadcast audio objects. The experiments are in the form of case-studies that explore the categorization of audio objects for different types of program material produced in 5.0. The types of material explored are radio drama, live events, nature documentary, feature film, and naturalistic soundscape recordings. The nature documentary and feature film content also include video.

1 METHODS AND MATERIALS

1.1 Ethics

The experiments described in this paper were approved by the University of Salford ethics committee. Participants took part in the experiments voluntarily, and written consent was taken prior to the test session. Participants were told that they were free to withdraw from the experiment at any time without needing to give a reason to the researcher.

1.2 Participants

A total of 21 participants took part in the test. Ten of these participants had practical experience of audio engineering. The remaining 11 participants had neither experience of audio engineering nor formal training in acoustics or audio. Audiograms were not considered necessary, as the aim of the experiments was to investigate the overall experience rather than quantify the effects of lower level features. However, participants were asked if they had normal hearing prior to the experiment. Participants were recruited via an email invitation and through social media, and they were paid for their time.

1.3 Stimuli

In the experiments reported in this paper, five different types of program material were investigated:

1) Radio drama (BBC productions of the "Wizard of Oz" and "Hitchhiker's Guide to the Galaxy: Tertiary Phase")
2) Nature documentary (BBC production of "Life: Challenges of Life")
3) Live events (BBC productions of the Last Night of the Proms, tennis at Wimbledon, and a soccer match)
4) Feature film (Woman in Black)
5) Naturalistic soundfield recordings of urban soundscapes around the city center of Manchester, UK

All of the broadcast program material was available in a 5.0 mix. A number of clips were selected from each of the content types for use in the test. The length of the clips ranged from 33 seconds to 4 minutes 32 seconds, and the clips were cut to be the length of a single scene. This was done so as to provide an ecologically valid set of stimuli; as the aim of the study is to understand how listeners categorize audio objects in typical broadcast audio scenes, it is important that listeners are able to understand the context of each object within the scene. The clips were selected so as to reflect a wide range of scene types from the different types of program material. Eleven clips were used for the radio drama content (15.5 minutes in total), 8 clips were used for the feature film content (14.5 minutes in total), 4 clips were used for the nature documentary content (13.5 minutes in total), 7 clips were used for the live event content (9.2 minutes in total), and 9 clips were used for the naturalistic recordings (11.2 minutes in total). It should be noted that categorization of complex stimuli can be influenced by the length of the stimulus [22]; however, in the case of the present study the length of the scene should not influence the categorization, as it is the objects within the scene that are being categorized, not the scenes themselves.

The radio drama material, nature documentary material, and feature film material are all commercially available; the times of the clips used are detailed in Appendix A. The naturalistic soundfield recordings are available to download at http://dx.doi.org/10.17866/rd.salford2234293.

1.4 Reproduction

Audio was reproduced using Genelec 8030A active loudspeakers arranged in a 5.0 setup in accordance with ITU-R BS.775 [23] in the University of Salford semi-anechoic chamber. The radius of the loudspeaker layout was 1.30 m and the listener was seated in the center of the array. The loudspeakers were adjusted to have equal gains by generating a full scale pink noise signal for each loudspeaker and adjusting the gain of the loudspeaker so that the sound pressure level in the center of the array was equal (85 dBA) for each loudspeaker. The program material was reproduced from 24-bit wav files sampled at 48 kHz via an RME UFX soundcard. The naturalistic soundfield recordings were decoded to 5.0 using the Soundfield Surround Zone VST plugin. The radio drama, feature film, live event, and nature documentary material were reproduced with no modifications to the gain of the original material, and the naturalistic soundscape material was set to a comfortable listening level.

For the program material with associated video content (nature documentary and feature film), the video content was reproduced via a laptop with a 15.6" screen (1366 x 768 resolution). The laptop was positioned on a table in the test room and was approximately 0.8 m from the participant.


1.5 Procedure

Participants were required to complete a sorting task. A large number of variants of the sorting method exist, each of which results in different types of data. Details of the different methods can be found in Coxon [24]. The main differences between variants stem from whether the number of categories is determined by the researcher (fixed sorting) or the participant (free sorting) and whether the meaning of the categories is specified by the researcher (closed sorting) or the participant (open sorting). In the present study, as there were no a priori assumptions made regarding the number of categories or the meaning of the categories, a free and open sorting methodology was used.

For each type of program material, participants were given a set of cards. Each card was labelled with an object. Each set of cards contained all of the identifiable objects within the program material. Each card also contained an identifier to help the participant identify the clip in which the object occurred and the time of the first occurrence of the object in the clip. The objects printed on the cards were identified by a group of five expert listeners prior to the test. The aim of this exercise was to identify as many individual objects in the clips as possible. The expert listeners were given a list of objects for each of the clips, which had previously been identified by the main author of this paper; their task was then to identify any objects missing from the list, or to modify the description of any objects they disagreed with.

For the radio drama material there were 176 cards, for the feature film content there were 142 cards, for the nature documentary content there were 91 cards, for the live event content there were 105 cards, and for the naturalistic urban soundscape recordings there were 110 cards.

Participants were presented with an interface developed in Pure Data with which they could start, stop, rewind, fast forward, and switch between the different clips. The participants were asked to sort the cards into groups on the desk in front of them according to the following criteria: "Please sort the cards into groups such that the sounds in each group serve a similar function or purpose in the composition of the scene."

The participants were instructed that they were required to sort the objects according to their function in the scene and not necessarily according to the similarity of the sounds themselves. If the participants asked for an example they were told the following: "Consider that you were asked to sort the instruments in an orchestra so that the instruments in each group serve a similar function or purpose in the orchestra. You may decide to make a percussion group which contains the timpani, triangle, and snare drum. Although these instruments all have a different sound, they each serve a similar purpose in the orchestra."

Participants were instructed that they could form as many or as few categories as they wished and that the relative positions of the categories on the desk were unimportant. They were asked to use all of the cards for the given type of program material, such that at the end of the test all of the cards from all of the scenes for that type of material had been sorted into categories. This procedure is often referred to in the literature as a free sorting task [24].

Once the participant was happy with their grouping, they were asked to give a short label to each of the categories they had formed, and also to give a rating from 0 to 10 of the importance that category of audio objects had in their overall experience of the scene. Note that, as participants were required to sort all of the objects and some participants were unable to identify some of the objects in the clips, this procedure resulted in a small number of the participants forming a category of sounds they could not identify.

This procedure was carried out for each of the types of program material; therefore, each participant completed five separate card sorts. The participants were told that they were free to make new categories for the different content types. The order in which the different types of program material were presented was randomized for each participant.

After the participant had completed the card sort for each type of content, they were presented with all of the category labels they had generated throughout the entire procedure. The participant was asked to sort the categories into groups that represented the same concept. The aim of this was to investigate commonalities and differences between the categorization structure for the different types of program material.

Participants were allowed 3.5 hours to complete the test, and the participants were given the opportunity of 2 short comfort breaks throughout the test. Due to this time restriction, 4 of the participants did not manage to complete card sorts for all 5 types of content. The data for the tests they did complete are used in the subsequent analyses reported in this paper. It is interesting to note that despite the length of the test, most of the participants stated that they found the process enjoyable and not overly fatiguing.

1.6 Analysis

1.6.1 Data Preparation

For each type of program material, a categorization matrix was formed that took on a value of 1 if an object had been grouped in a given category and a 0 otherwise. A categorization matrix encompassing all of the program material types was formed in the same way based on each participant's sorting of their category labels.

A co-occurrence matrix was generated for each participant for all of the audio objects over all of the different types of program material. This matrix was constructed from pairwise similarities of the objects by assigning pairs of objects that had been grouped in the same category a 1 and pairs of objects that were not grouped in the same category a 0. The individual similarity matrices were averaged over the participant group to generate an average similarity matrix. A graphical representation of the construction of these matrices and the subsequent analysis is shown in Fig. 1.

Fig. 1. Graphical representation of the construction of the data matrices and subsequent analysis for the sorting task for different types of program material. The same procedure was used in the sorting of the category labels generated across the program items.
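By way of illustration, this data preparation can be expressed in a few lines of code. The following Python sketch is not the authors' code, and the object names and sorts in it are hypothetical; it simply shows how a binary categorization matrix and a pairwise co-occurrence matrix could be built from free-sort data and averaged over participants.

import numpy as np

def categorization_matrix(sort, objects):
    # Binary objects-by-categories matrix: entry (i, j) is 1 if object i
    # was placed in category j by this participant, and 0 otherwise.
    M = np.zeros((len(objects), len(sort)), dtype=int)
    index = {obj: i for i, obj in enumerate(objects)}
    for j, category in enumerate(sort):
        for obj in category:
            M[index[obj], j] = 1
    return M

def co_occurrence_matrix(sort, objects):
    # Pairwise similarity: 1 if two objects were grouped in the same
    # category, 0 otherwise.
    M = categorization_matrix(sort, objects)
    return (M @ M.T > 0).astype(int)

# Hypothetical objects and sorts for two participants.
objects = ["footsteps", "dialogue_a", "rain", "door_close"]
sorts = [
    [["footsteps", "door_close"], ["dialogue_a"], ["rain"]],
    [["footsteps"], ["dialogue_a"], ["rain", "door_close"]],
]

# Average the individual similarity matrices over the participant group.
average_similarity = np.mean(
    [co_occurrence_matrix(s, objects) for s in sorts], axis=0
)
print(average_similarity)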

1.6.2 Agglomerative Hierarchical Clustering

The categorization matrices were analyzed using agglomerative hierarchical clustering. Hierarchical clustering is a technique that produces a nested sequence of partitions of a dataset, with a top level cluster that encompasses all objects and a bottom level consisting of each object as an individual. Intermediate levels show the merging of two clusters from the lower level. The results of hierarchical clustering are most often displayed as dendrograms that graphically represent this merging process. In agglomerative clustering, the merging process starts at the bottom level, with all objects as individual clusters. At each subsequent stage, the closest pair of clusters is merged. The dendrograms produced by this analysis can then be cut at different levels to examine the structure of the data; this cut is often made to give the average number of groups formed by the participants [22].

The clustering was conducted using Ward's minimum variance method, which aims to minimize the total within cluster variance, defined by the sum of the squared Euclidean distances within each cluster [25]. The clustering was conducted both row-wise and column-wise on the categorization matrices, thus resulting in two clustering solutions for each of the types of program material: one solution relating to the clustering of the audio objects and the other relating to the clustering of the descriptive labels participants attributed to their groups.
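As a minimal sketch (the paper does not specify the software used), the row-wise and column-wise Ward clusterings could be computed with SciPy as follows; the matrix here is random placeholder data standing in for a real categorization matrix.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 12))  # placeholder objects-by-categories matrix

# Ward's method merges, at each step, the pair of clusters that gives the
# smallest increase in total within-cluster variance (squared Euclidean).
Z_objects = linkage(X, method="ward")   # row-wise: clusters the audio objects
Z_labels = linkage(X.T, method="ward")  # column-wise: clusters the category labels

# Cut the dendrogram to give the median number of groups formed by the
# participants, as described in the text.
median_n_groups = 5
object_clusters = fcluster(Z_objects, t=median_n_groups, criterion="maxclust")
print(object_clusters)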
objects for the radio drama program material). In the fig-
1.6.3 Multidimensional Scaling

The co-occurrence similarity matrix (generated by averaging the individual similarity matrices over the participant group) was analyzed using non-metric multidimensional scaling [26]. Multidimensional scaling is an exploratory data analysis technique the aim of which is to determine a configuration of a group of objects in a low dimensional multidimensional space. The resulting configuration provides a visual representation of pairwise distances or (dis)similarities between objects in the group. This low dimensional representation is assumed to represent a latent perceptual space, with the dimensions representing salient orthogonal perceptual features. Multidimensional scaling has been used extensively in sensory sciences, and for sound perception in areas such as the perception of musical timbre [27, 28], the perception of concert hall quality [29], and product sound quality [30].
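A sketch of this analysis using scikit-learn's non-metric MDS is given below; the implementation choice and the toy similarity matrix are assumptions for illustration, not the authors' code. Scanning solutions in 2 to 9 dimensions and inspecting the stress mirrors the dimensionality check described later in Sec. 2.6.

import numpy as np
from sklearn.manifold import MDS

# Toy symmetric similarity matrix standing in for the averaged
# co-occurrence matrix; dissimilarity = 1 - similarity.
rng = np.random.default_rng(1)
S = rng.random((15, 15))
S = (S + S.T) / 2
np.fill_diagonal(S, 1.0)
D = 1.0 - S

# Fit non-metric MDS for a range of dimensionalities and inspect the
# stress (mds.stress_ is the raw stress; recent scikit-learn versions can
# also report the normalized Stress-1 quoted in the paper).
for n_dim in range(2, 10):
    mds = MDS(n_components=n_dim, metric=False,
              dissimilarity="precomputed", random_state=0)
    mds.fit(D)
    print(n_dim, round(mds.stress_, 3))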


2 RESULTS

The following sections show the results of the hierarchical cluster analysis for the different types of program material investigated in this paper. Due to the number of labels in the clustering solution, it is not possible to reproduce the full clustering solutions in this paper. By way of example, Fig. 2 and Fig. 3 show a truncated version of the full clustering solution (the first cluster of category labels generated by the participants and the first cluster of audio objects for the radio drama program material). In the figures that accompany these results, labels have been assigned summarizing each of the clusters that are formed when cutting the clustering of category labels at a level that results in a number of clusters equal to the median number of clusters formed across participants for that type of program material. The median number of clusters was taken as the starting point to interpret the clustering solutions. Any further interpretable subclusters are also discussed. In the case of the cluster of category labels shown in Fig. 2, this cluster was summarized by the researcher as "Clear speech." The full clustering of audio objects and categories can be found at http://dx.doi.org/10.17866/rd.salford2234293.

Fig. 2. First cluster of category labels for the radio drama program material. PN indicates that the category label was produced by participant N.

Fig. 3. First cluster of audio objects for the radio drama program material.

2.1 Radio Drama

Fig. 4 shows the results of the cluster analysis with respect to the category labels for the radio drama program material. The median number of categories formed for this program material was 5, and as such the dendrogram shown in Fig. 4 has been cut so as to show 5 clusters.

From the clustering of the category labels (from left to right), the first cluster relates to clear speech; participants' category labels for this cluster include "Dialogue in scene (P20)," "Clear speech (P21)," and "Dialogue (P2, P3, and P7)," and objects in this group include the main character voices. The second cluster of category labels relates to sounds that coincide with actions or movement; participants' category labels for this cluster include "Sound of movement (P19)," "Activity sounds (P1)," and "Plot forwarding/vital sounds (P9)," and related objects include footsteps, opening of doors, and clinking of glasses. The third cluster of category labels relates to non-diegetic music and effects (here, "non-diegetic" refers to audio objects that are not implied to be present in the scene); participants' category labels for this cluster include "Musical sounds (P1)," "Music and SFX (not part of scene) (P3)," and "Music (outside scene) (P20)," and related objects include musical instruments along with low frequency rumbling and whooshing sounds. The fourth cluster of category labels relates to both localizable and continuous background sounds; participants' category labels that appear in this cluster include "Background effects (P2)," "Ambient sound (P4)," and "Background noise. Set a location (P13)," and related objects include rain sounds, wind sounds, and birds tweeting. The interpretation of the fifth cluster of category labels is less clear and seems to encompass a number of different, less well defined categories including vocalizations, attention grabbing impact sounds, diffuse atmospheric sounds, and diegetic music. Vocalizations appeared as a well defined cluster of audio objects.

Fig. 4. Dendrogram showing hierarchical agglomerative clustering of category labels for the radio drama program material.

2.2 Feature Film

Fig. 5 shows the results of the cluster analysis with respect to the category labels for the feature film program material. The median number of categories formed for this program material was 5, and as such the dendrogram shown in Fig. 5 has been cut so as to show 5 clusters.

From the clustering of category labels (left to right), the first cluster relates to sounds relating to actions and movement; participants' category labels for this cluster include "Dominant/meaningful event sound (P11)," "Single event sounds (P9)," and "Sounds resulting from human activities (P21)," and objects related to this category include footsteps, impacts of objects on tables, and doors opening. The second cluster of category labels relates to clear speech and dialogue; participants' category labels for this cluster include "Human voice (P14)," "Dialogue (P2)," and "Key information (P18)," and objects related to this category include the main character voices along with vocalizations such as screaming. The third cluster of category labels couldn't be clearly interpreted as a whole; it did however encompass a clear cluster of prominent, attention grabbing sounds that occur off-screen; participants' category labels for this cluster include "Off-screen but significant (P7)," "Things happening out of the scene (P13)," and "Impact sound, loud, distinct (P14)," and related objects include impact sounds from upstairs (off-screen), clattering of cart wheels, and a glass smashing. The fourth cluster of category labels relates to non-diegetic music and effects; participants' category labels for this cluster include "Music (P2)," "Mood defining (usually music) (P6)," and "Music and sound effects (not part of scene) (P3)." The related objects for this cluster could be seen to clearly cluster into non-diegetic music (i.e., strings and synth pads) and effects (i.e., low frequency rumbling and high frequency whispering).


The fifth cluster of category labels relates to both diffuse and localizable background sounds; participants' category labels for this cluster include "Background (P18)," "Diffuse atmos (P20)," and "Scene setting (P5)," and objects related to this cluster include birdsong, wind whistling, and crowd babble. Further inspection of this cluster of audio objects reveals a clear clustering of continuous (i.e., wind and crowd babble) and transient background sounds (i.e., birdsong and horses' hooves).

Fig. 5. Dendrogram showing hierarchical agglomerative clustering of category labels for the feature film program material.

2.3 Nature Documentary

Fig. 6 shows the results of the cluster analysis with respect to the category labels for the nature documentary material. The median number of categories formed for this program material was 5, and as such the dendrogram shown in Fig. 6 has been cut so as to show 5 clusters.

From the clustering of category labels (left to right), the first cluster relates to sounds relating to actions or movement; participants' category labels for this cluster include "Dominant event sound (related with the video) (P11)," "Sounds resulting from animals movements/actions (P21)," and "Sounds directly relating to actions on-screen (P12)," and the audio objects related to this category include animal footsteps, the splash of animals entering water, and the crunch of a venus fly trap closing. Within the cluster of category labels, two clear subcategories can be seen that relate to sounds coinciding with on-screen action and sounds that don't have a visual counterpart. The second cluster of category labels relates to non-diegetic music and effects; participants' category labels for this cluster include "Musical instruments (P16)," "Music/SFX (P7)," and "Non-diegetic music (P9)," and the audio objects related to this cluster include musical instruments and synthesized effects. The third cluster of category labels relates to localizable and diffuse background sounds; participants' category labels for this cluster include "Quieter sounds (P10)," "Envelopment/scene setting (P6)," and "Non-dominant event sound (P11)," and the audio objects related to this category include bird calls, the sound of rustling grass, and the sound of splashing water. The fourth cluster of category labels relates to the narration; participants' category labels for this cluster include "Narrator/Narration (many participants)," "Key information (P18)," and "Dialogue outside scene (P20)." The fifth cluster of category labels presents no clear grouping, but contains a subgroup of prominent animal vocalizations; participants' category labels for this sub-cluster include "Animal noises observable (P3)," "Prominent animal vocalizations (normally on screen) (P20)," and "Important sounds (P16)," and the audio objects related to this cluster include ostrich vocalizations, seal vocalizations, and the sound of a whale blowing.

Fig. 6. Dendrogram showing hierarchical agglomerative clustering of category labels for the nature documentary program material.


PAPERS CATEGORIZATION OF BROADCAST AUDIO OBJECTS IN COMPLEX AUDITORY SCENES

Diffuse atmospheric sounds


Clear speech

Sounds related to

Non−diegetic

Background sounds

Diegetic music
(diffuse and localisable)

Prominent attention grabbing sounds


actions and movement

music and effects

Vocalisations

Sounds related to

Non−diegetic

Background sounds

Prominent animal
Narration
(diffuse and localisable)

vocalisations
actions and movement

music and effects


Fig. 6. Dendrogram showing hierarchical agglomerative clus-
Fig. 4. Dendrogram showing hierarchical agglomerative cluster- tering of category labels for the nature documentary program
ing of category labels for the radio drama program material. material.

rial. The median number of categories formed for this pro-


gram material was 5, and as such the dendrogram shown in
Fig. 7 has been cut so as to show 5 clusters.
From the clustering of category labels (left to right),
the first cluster relates to commentary and clear speech;
participants’ category labels related to this cluster include
“Commentary (P4),” “Key information/narrative (P18),”
and “Verbal description/direction (P18),” and audio ob-
jects related to this category include commentators’ voices,
tennis umpires’ voices, and stadium announcements. The
second cluster of category labels relates to primary event
sounds; participants’ category labels related to this cluster
Sounds related to

Clear speech

Prominent attention

Non−diegetic

Background sounds
grabbing sounds

(diffuse and localisable)


actions and movement

music and effects

include “Primary event sounds (P7),” “Primary (P17),” and


“Target music from the stage/field (P21).” This category
can split into two further categories relating to music where
the focus of the live event is music (related audio objects
include individual musical instruments) and event sounds
for sporting events (related audio objects include ball kicks,
referee’s whistle, and the impact of a tennis ball on a racket).
The interpretation of the third category is less clear, it does
Fig. 5. Dendrogram showing hierarchical agglomerative cluster- however contain a subcategory of impact sounds; partici-
ing of category labels for the feature film program material. pants’ category labels related to this cluster include “Move-
ment/impact (P15),” “Sounds related to actions (P20),” and
“Foreground sound effects (P2).” The fourth cluster of cat-
egory labels is related to the reaction of the crowd to events;
(P16),” and the audio objects related to this cluster include participants’ category labels related to this cluster include
ostrich vocalizations, seal vocalizations, and the sound of a “Crowd noise (P1),” “Crowd reaction (P4),” and “Collec-
whale blowing. tive sounds/vocalizations (P9),” and audio objects related to
this category include applause, crowd cheering, and laugh-
2.4 Live Events ter. The fifth cluster of category labels is related to lo-
Fig. 7 shows the results of the cluster analysis with re- calizable background sounds; participants’ category labels
spect to the category labels for the feature live event mate- related to this cluster include “Non-dominant event sound

J. Audio Eng. Soc., Vol. 64, No. 6, 2016 June 387


2.5 Naturalistic Recordings

Fig. 8 shows the results of the cluster analysis with respect to the category labels for the naturalistic recordings of urban soundscapes. The median number of categories formed for this program material was 5, and as such the dendrogram shown in Fig. 8 has been cut so as to show 5 clusters.

From the clustering of category labels (left to right), the first cluster relates to low amplitude localizable event sounds; participants' category labels related to this cluster include "Non-dominant event sound (P11)," "Low level event sound (P20)," and "Sounds resulting from human activities (P21)," and audio objects related to this category include the rustling of paper, the jangling of coins, and various impact sounds. The second cluster of category labels is related to continuous background sounds; participants' category labels related to this cluster include "Ambient sounds (P9)," "Background filler/bed (P2)," and "Background sounds which indicate the scene (P21)," and audio objects related to this category include unintelligible voices, distant traffic noise, and air conditioning sounds. The interpretation of the third cluster of category labels is less clear. Within this cluster there are a number of clear subgroups; the first is related to music (related audio objects include music in shops), the second is related to vehicle sounds (related audio objects include vehicle acceleration sounds, cars starting, and the clunk of a vehicle passing over a manhole cover), and the third is related to low level impact sounds (related audio objects include various unidentifiable impacts). The fourth cluster of category labels relates to human voice and vocalizations; participants' category labels related to this cluster include "Human voice (P1)," "Presence of people (P15)," and "Human generated sounds/noises/vocalizations (P9)," and audio objects related to this category include voices, laughter, and coughing sounds. The fifth cluster of category labels relates to high amplitude localizable event sounds; participants' category labels related to this cluster include "Dominant and meaningful event sound (P11)," "Louder sounds (P10)," and "High level foreground event sounds (P20)," and audio objects related to this category include doors closing, mobile phone notifications, and the sound of a chair scraping against the floor.

Fig. 8. Dendrogram showing hierarchical agglomerative clustering of category labels for the naturalistic urban soundscape recordings.

2.6 All Material

From the free sort of category labels across all of the program material types, participants formed a median of seven groups. The first of these clusters consists of sounds related to actions and movement. The second and third clusters relate to background sounds, with the second cluster mainly relating to transient background sounds and the third cluster mainly relating to continuous ambient sounds, crowd reaction, and sounds indicating the presence of people.


Some overlap was observed in the category labels for these two groups, with, for example, the category "secondary action sounds" being in the same cluster as diffuse and ambient categories. The fourth of these clusters relates to clear speech and dialogue. The fifth cluster relates to non-diegetic music and effects. The interpretation of the sixth cluster was less clear, but contained a cluster of clear speech that is outside of the scene, music that occurs within the scene, and human vocalizations. The seventh cluster related to prominent transient sounds.

From the sorting of the category labels, a similarity matrix was built using the method described in Sec. 1.6. The data was subject to non-metric multidimensional scaling (MDS). Whereas the hierarchical clustering solutions presented in the preceding section gave a hierarchical view of the categorization structure, MDS provides a different way of interpreting the data by allowing the investigation of the independent perceptual dimensions along which the objects and categories vary. To determine an optimum dimensionality of the scaling, solutions were calculated in 2 to 9 dimensions and the stress was inspected. A three-dimensional solution gives a non-metric stress of 0.12, which suggests a fair fit with the original data [31]. The Pearson's correlation between the original and fitted distances for a three dimensional solution is 0.89 (p < 0.001).

Fig. 9 shows the configuration of audio objects in the three dimensional multidimensional scaling solution. The points in this figure relate to individual audio objects. The groupings have been formed by a hierarchical agglomerative clustering of the dissimilarity matrix; this resulted in a slightly different grouping than the cluster analysis that was conducted on the co-occurrence matrix, with crowd reactions and sounds indicating the presence of people emerging as a cluster and prominent transient sounds being grouped with sounds relating to actions.

Fig. 9. Configuration of audio objects in a three dimensional non-metric multidimensional scaling solution. Ellipses show the clusterings identified in Sec. 2.6. For clarity, objects that fell into the unclear cluster have not been plotted. A color version of this figure is available at http://dx.doi.org/10.17866/rd.salford2234293.

3 DISCUSSION

3.1 Interpretation of Perceptual Dimensions

From the ordering of objects along the dimensions of the multidimensional scaling configuration shown in Fig. 9, some interpretation can be made of the meaning of these dimensions. The first dimension appears to be related to the relationship the object has to its referent; that is to say, whether the object carries semantic information such as clear speech or is related to an action. This can be seen in the progression along the first dimension of object categories from continuous background objects (exemplified by sounds such as low frequency rumbling, birdsong, and distant traffic noise) through to short localizable background sounds, action sounds, vocalizations, and finally dialogue and clear speech. This progression of object categories along the first perceptual dimension parallels findings from neuro-cognitive studies [17, 16] where differences have been found in the processing of non-living action/tool sounds, animal vocalizations, and human vocalizations.


The second dimension appears to be related to the temporal extent of the audio objects. This can be seen in the progression of object categories along this dimension, from music and dialogue at one extreme to transient sounds at the other. This supports the findings of Gygi et al. [18], who derived a perceptual space for the categorization of environmental sounds and found that the second perceptual dimension differentiated between continuous sounds and impact sounds. The interpretation of the third perceptual dimension is less clear, but it appears to relate to the presence of people, with non-diegetic music and effects at one extreme of the dimension and dialogue and crowd reactions appearing at the other extreme. This is consistent with findings in soundscapes research [19] and research into the perception of complex audiovisual scenes [22].

Table 1. Results of a linear regression model relating object position in a 3 dimensional MDS solution to mean object importance. MDS1 is the position of the object on the first perceptual dimension, MDS2 is the position of the object on the second perceptual dimension, and MDS3 is the position of the object on the third perceptual dimension. Numbers in brackets are standard errors.

                          Dependent variable: Importance
    MDS1                  3.42*** (0.103)
    MDS2                  -1.19*** (0.110)
    MDS3                  -1.35*** (0.139)
    Constant              6.60*** (0.023)
    Observations          624
    R2                    0.679
    Adjusted R2           0.677
    Residual Std. Error   0.585 (df = 620)
    F Statistic           435.568*** (df = 3; 620)
    Note: *p<0.1; **p<0.05; ***p<0.01

3.2 Differences between Naive and Expert Listeners

Research into audio quality and the perception of urban soundscapes has revealed differences between listeners who have training in acoustics or audio engineering (so called "expert listeners") and those who don't (so called "naive listeners"). For example, Guastavino [32] [described in Guastavino and Katz [33]] found that the preferred audio reproduction method for urban soundscapes varied depending on whether the listener is a sound engineer, acoustician, or non-expert. Sound engineers were found to give greater precedence to localization and precision of sources, whereas non-expert listeners and acousticians gave greater precedence to presence and spatial distribution of sound. Perceptions of audio quality can also change depending on the experience and role of the listener. For example, Rumsey et al. [34] have investigated the relationships between experienced listener ratings of multichannel audio quality and naive listener preferences. It was found that timbral fidelity, frontal spatial fidelity, and surround spatial fidelity contributed to expert listeners' ratings of basic audio quality; however, only timbral fidelity and surround spatial fidelity contributed significantly to naive listeners' ratings of preference.

To explore if there were any differences in the categorization strategy for expert and naive listeners, data from each of the types of program material were split into two subsets. The first subset contained data from those participants who stated that they had previous practical experience of audio engineering and the second subset contained data from those participants who stated that they had no previous practical experience of audio engineering. Hierarchical agglomerative clustering (Ward method) was conducted for the audio objects on each of these subsets of data for each type of program material. The similarity of the clustering solutions for the two different groups of listeners was then assessed using the Rand Index [35], which is a measure of the agreement between two clustering solutions. The measure takes into account true positive decisions, where two objects have been classified in the same cluster, and true negative decisions, where two objects have been classified in different clusters. The Rand Index is then expressed as a percentage indicating the sum of the true positives and true negatives over the number of all possible pairs of objects in the clustering solution.

Based on the calculated Rand Index between the expert and non-expert clustering solutions, for the radio drama program material 79% of pairs of objects were categorized in the same way, for the feature film material 75%, for the nature documentary material 86%, for the live events material 87%, and for the naturalistic recordings 78%. Faye et al. [36] suggest that free sorting with naive participants leads to similar results as descriptive analysis by an expert panel, and the similarity between the clustering for the expert and non-expert groups appears to support this claim. Differences between the two listener groups included a tendency for the expert listener group to use more technical language such as foley and diegetic. Further, the categorization structure was found to be more homogeneous across the expert listener group, with the non-expert listener group creating more unique categories.
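A direct, unoptimized implementation of this agreement measure is sketched below for illustration; the label vectors are hypothetical and not the study data.

from itertools import combinations

def rand_index(labels_a, labels_b):
    # Fraction of object pairs on which two clustering solutions agree:
    # grouped together in both (true positive) or apart in both (true
    # negative), over all possible pairs.
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

# Hypothetical cluster labels for the same six objects from the expert
# and non-expert clustering solutions.
expert = [1, 1, 2, 2, 3, 3]
non_expert = [1, 1, 2, 3, 3, 3]
print(rand_index(expert, non_expert))  # 0.8, i.e., 80% of pairs agree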


3.3 Importance of Groups

For each object, a mean importance rating was calculated by assigning each object the importance rating given by the participant of the group in which it was included and taking an average of these ratings across participants. A multiple linear regression model was calculated with the positions of the sounds on the three axes calculated in the multidimensional scaling analysis as independent variables and the mean importance of each object as the dependent variable. The results of this model are shown in Table 1. The model was found to be a significant fit and accounted for 68% of the variance in the importance scores (adjusted R2 = 0.68, p < 0.001). A forward-backward stepwise regression resulted in no dropping of variables in the model. This suggests that each of the dimensions is significantly related to the perceived importance of each of the object categories.

Taking the model coefficient for the first perceptual dimension as an example, the interpretation of Table 1 is such that a unit increase in an object's position on the first perceptual dimension corresponds to a 3.42 increase in the object's perceived importance. The sign of the regression coefficients suggests that perceived importance increases as sounds progress along Dimension I and decreases as sounds progress along Dimension II and Dimension III. The first perceptual dimension was found to be related to the semantic information carried by the object. The coefficient for the first perceptual dimension in the model shown in Table 1 therefore suggests that objects carrying semantic information have the greatest weighting on the perceived importance of an object to a scene.
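For illustration, the model in Table 1 can be reproduced in outline as follows. The sketch simulates toy data from the reported coefficients and refits them by ordinary least squares; the numbers generated here are fabricated for the example and are not the study data.

import numpy as np

rng = np.random.default_rng(2)
n = 624  # number of object observations reported in Table 1

# Toy MDS coordinates and importances generated from the reported model.
mds = rng.uniform(-0.6, 0.6, size=(n, 3))
importance = 6.60 + mds @ np.array([3.42, -1.19, -1.35]) + rng.normal(0, 0.6, n)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), mds])
beta, *_ = np.linalg.lstsq(X, importance, rcond=None)
print(beta)  # approximately [6.60, 3.42, -1.19, -1.35]

# Adjusted R^2, as reported alongside the coefficients in Table 1.
resid = importance - X @ beta
r2 = 1 - resid.var() / importance.var()
r2_adj = 1 - (1 - r2) * (n - 1) / (n - X.shape[1])
print(round(r2_adj, 3))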
3.4 Consequences for Object Based Audio

The results presented in this paper provide a framework for the categorization of broadcast audio objects in complex auditory scenes. Considering the median number of clusters produced for each type of program material, the results presented in Sec. 2 suggest that listeners utilize around five categories for each of the types of program material. Overall, there appear to be at least seven unique categories across the program material, suggesting that the categorization structure is somewhat contingent on the type of material.

Object-based audio opens up the possibility of object-level manipulation of audio content, where different categories of object can be subject to different rules and manipulations. This would allow the signal-level manipulation used in the rendering of spatial audio to be optimized on a category-by-category basis. The results presented in this paper provide a perceptual basis for such a categorization framework, ensuring that the categories used are relevant to how listeners parse complex auditory scenes.

Knowledge of the categorization structure will allow the investigation of high-level semantic rules that can be used to optimize the rendering of spatial audio material. For example, sounds relating to actions or movements may be treated differently to continuous background sounds when rendered to different loudspeaker layouts.
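As an illustration, such rules might be expressed as a simple mapping from category to rendering strategy; the category labels follow this paper, but the strategies themselves are hypothetical examples rather than rules derived from the experiments.

# Category names follow the seven categories identified in this paper;
# the associated strategies are illustrative assumptions only.
CATEGORY_RULES = {
    "actions and movement":           "point-source panning, preserve trajectory",
    "continuous background":          "decorrelated diffuse spread",
    "transient background":           "point-source panning, low priority",
    "clear speech":                   "anchor to centre/dialogue position",
    "non-diegetic music and effects": "wide spread outside the dialogue region",
    "presence of people":             "point-source panning near the action",
    "prominent transients":           "point-source panning, protected level",
}

def select_rendering_rule(category: str, layout: str) -> str:
    """Choose a rendering strategy for an object category and target layout;
    unknown categories fall back to plain point-source panning."""
    rule = CATEGORY_RULES.get(category, "point-source panning")
    # A real renderer would further condition the rule on the layout,
    # e.g., collapsing diffuse spreads gracefully for stereo playback.
    return f"{rule} ({layout})"

print(select_rendering_rule("continuous background", "5.1"))
print(select_rendering_rule("clear speech", "stereo"))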
Finally, the categories presented in this paper provide a perceptual basis for future metadata specifications for object-based audio and could provide the basis for future high-level languages for the description of the rendering of spatial audio. In terms of object-based workflows, this may take the form of a metadata field that allows content producers to tag and group different objects in the production according to the categories presented in this paper.
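For example, such a tag might sit alongside conventional positional metadata in an object's description. The record below is a hypothetical sketch; the field names are illustrative assumptions and are not drawn from any existing metadata specification.

# Hypothetical object-level metadata record carrying a category tag.
audio_object = {
    "object_id": "obj_042",
    "label": "door slam",
    "category": "prominent transient",               # perceptual category tag
    "group": "foreground effects",                    # producer-defined group
    "position": {"azimuth": -30.0, "elevation": 0.0}, # degrees
    "importance": 4.2,                                # e.g., rated importance
}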
4 CONCLUSIONS

This paper has presented a series of experiments conducted to determine categories for auditory objects in complex broadcast audio scenes. Twenty-one participants completed free sorting tasks for five types of program material. Hierarchical agglomerative cluster analysis revealed at least seven categories across the different types of program material. These categories relate to sounds indicating actions and movement, continuous and transient background sound, clear speech, non-diegetic music and effects, sounds indicating the presence of people, and prominent attention-grabbing transient sounds. A three-dimensional perceptual space calculated via multidimensional scaling suggests that these categories vary along the dimensions of semantic content, continuous-transient, and presence-absence of people. The position of an audio object on the dimensions of the perceptual space was found to be related to the perceived importance of the object. These results are well supported by findings in environmental psychology, soundscape research, and neuro-cognitive studies, and have applications in psychological research into complex auditory scene perception, multimedia quality-of-experience testing, and the development of object-based audio processing.

5 ACKNOWLEDGMENTS

This work was supported by the EPSRC Programme Grant S3A: Future Spatial Audio for an Immersive Listener Experience at Home (EP/L000539/1) and the BBC as part of the BBC Audio Research Partnership. The authors would like to thank Chris Pike and Steve Marsh from BBC R&D for their help in sourcing the material for the tests. Finally, the authors would like to thank the participants of the listening tests for their time. The experimental data underlying the findings are fully available without restriction; details are available from http://dx.doi.org/10.17866/rd.salford2234293. Due to copyright restrictions, the radio drama, live events, nature documentary, and feature film program material used in the listening experiments is not available from this link. A metadata record of these data can be found at http://dx.doi.org/10.17866/rd.salford2234413.

6 REFERENCES

[1] J. Herre, J. Hilpert, A. Kuntz, and J. Plogsties, “MPEG-H Audio—The New Standard for Universal Spatial/3D Audio Coding,” J. Audio Eng. Soc., vol. 62, pp. 821–830 (2014 Dec.). http://dx.doi.org/10.17743/jaes.2014.0049

[2] S. Spors, H. Wierstorf, A. Raake, F. Melchior, M. Frank, and F. Zotter, “Spatial Sound with Loudspeakers and its Perception: A Review of the Current State,” Proceedings of the IEEE, vol. 101, pp. 1920–1938 (2013 Jul.). http://dx.doi.org/10.1109/JPROC.2013.2264784

[3] R. Oldfield, B. Shirley, and J. Spille, “An Object-Based Audio System for Interactive Broadcasting,” presented at the 137th Convention of the Audio Engineering Society (2014 Oct.), convention paper 9148.

[4] C. Kim, “Object-Based Spatial Audio: Concept, Advantages, and Challenges,” in 3D Future Internet Media (Springer-Verlag, New York, 2014), pp. 79–84.


[5] B. Shirley, R. Oldfield, F. Melchior, and J.-M. Batke, “Platform Independent Audio,” in Media Production, Delivery and Interaction for Platform Independent Systems: Format-Agnostic Media (John Wiley & Sons, Chichester, 2013), pp. 130–165.

[6] J. Herre, H. Purnhagen, J. Koppens, O. Hellmuth, J. Engdegård, J. Hilpert, L. Villemoes, L. Terentiv, C. Falch, A. Hölzer, et al., “MPEG Spatial Audio Object Coding—The ISO/MPEG Standard for Efficient Coding of Interactive Audio Scenes,” J. Audio Eng. Soc., vol. 60, pp. 655–673 (2012 Oct.).

[7] E. D. Scheirer, R. Vaananen, and J. Huopaniemi, “AudioBIFS: Describing Audio Scenes with the MPEG-4 Multimedia Standard,” IEEE Trans. Multimedia, vol. 1, pp. 237–250 (1999 Sept.). http://dx.doi.org/10.1109/6046.784463

[8] M. Geier, J. Ahrens, and S. Spors, “Object-Based Audio Reproduction and the Audio Scene Description Format,” Organ. Sound, vol. 15, pp. 219–227 (2010 Dec.). http://dx.doi.org/10.1017/S1355771810000324

[9] A. S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound (MIT Press, Cambridge, 1994), pp. 1–45.

[10] T. D. Griffiths and J. D. Warren, “What Is an Auditory Object?” Nat. Rev. Neurosci., vol. 5, pp. 887–892 (2004 Nov.). http://dx.doi.org/10.1038/nrn1538

[11] J. K. Bizley and Y. E. Cohen, “The What, Where and How of Auditory-Object Perception,” Nat. Rev. Neurosci., vol. 14, pp. 693–707 (2013 Sept.). http://dx.doi.org/10.1038/nrn3565

[12] N. Ding and J. Z. Simon, “Emergence of Neural Encoding of Auditory Objects While Listening to Competing Speakers,” P. Natl. Acad. Sci. USA, vol. 109, pp. 11854–11859 (2012 Jul.). http://dx.doi.org/10.1073/pnas.1205381109

[13] S. A. Shamma, M. Elhilali, and C. Micheyl, “Temporal Coherence and Attention in Auditory Scene Analysis,” Trends Neurosci., vol. 34, pp. 114–123 (2011 Mar.). http://dx.doi.org/10.1016/j.tins.2010.11.002

[14] E. Rosch, C. B. Mervis, W. D. Gray, D. M. Johnson, and P. Boyes-Braem, “Basic Objects in Natural Categories,” Cognitive Psychol., vol. 8, pp. 382–439 (1976 Jul.). http://dx.doi.org/10.1016/0010-0285(76)90013-X

[15] D. Dubois, C. Guastavino, and M. Raimbault, “A Cognitive Approach to Urban Soundscapes: Using Verbal Data to Access Everyday Life Auditory Categories,” Acta Acust. United Ac., vol. 92, pp. 865–874 (2006 Nov.).

[16] B. L. Giordano, J. McDonnell, and S. McAdams, “Hearing Living Symbols and Nonliving Icons: Category Specificities in the Cognitive Processing of Environmental Sounds,” Brain Cognition, vol. 73, pp. 7–19 (2010 Jun.). http://dx.doi.org/10.1016/j.bandc.2010.01.005

[17] J. W. Lewis, J. A. Brefczynski, R. E. Phinney, J. J. Janik, and E. A. DeYoe, “Distinct Cortical Pathways for Processing Tool versus Animal Sounds,” J. Neurosci., vol. 25, pp. 5148–5158 (2005 May). http://dx.doi.org/10.1523/JNEUROSCI.0419-05.2005

[18] B. Gygi, G. R. Kidd, and C. S. Watson, “Similarity and Categorization of Environmental Sounds,” Percept. Psychophys., vol. 69, pp. 839–855 (2007 Aug.). http://dx.doi.org/10.3758/BF03193921

[19] C. Guastavino, “Categorization of Environmental Sounds,” Can. J. Exp. Psychol., vol. 61, pp. 54–63 (2007 Mar.). http://dx.doi.org/10.1037/cjep2007006

[20] S. Payne, W. Davies, and M. Adams, Research into the Practical and Policy Applications of Soundscape Concepts and Techniques in Urban Areas (Department of Environment, Food and Rural Affairs, London, 2009), pp. 30–35.

[21] W. J. Davies, M. D. Adams, N. S. Bruce, R. Cain, A. Carlyle, P. Cusack, D. A. Hall, K. I. Hume, A. Irwin, P. Jennings, et al., “Perception of Soundscapes: An Interdisciplinary Approach,” Appl. Acoust., vol. 74, pp. 224–231 (2013 Feb.). http://dx.doi.org/10.1016/j.apacoust.2012.05.010

[22] O. Rummukainen, J. Radun, T. Virtanen, and V. Pulkki, “Categorization of Natural Dynamic Audiovisual Scenes,” PLoS One, vol. 9, e95848 (2014 May). http://dx.doi.org/10.1371/journal.pone.0095848

[23] International Telecommunication Union, “ITU-R BS.775-2, Multichannel Stereophonic Sound System with and without Accompanying Picture” (International Telecommunication Union, Geneva, 2006).

[24] A. P. M. Coxon, Sorting Data: Collection and Analysis (Sage Publications, Thousand Oaks, 1999), pp. 1–104.

[25] J. H. Ward, “Hierarchical Grouping to Optimize an Objective Function,” J. Am. Stat. Assoc., vol. 58, pp. 236–244 (1963). http://dx.doi.org/10.1080/01621459.1963.10500845

[26] I. Borg and P. J. Groenen, Modern Multidimensional Scaling: Theory and Applications (Springer-Verlag, New York, 2005), pp. 3–14.

[27] J. M. Grey, “Multidimensional Perceptual Scaling of Musical Timbres,” J. Acoust. Soc. Am., vol. 61, pp. 1270–1277 (1977 May). http://dx.doi.org/10.1121/1.381428

[28] S. McAdams, S. Winsberg, S. Donnadieu, G. De Soete, and J. Krimphoff, “Perceptual Scaling of Synthesized Musical Timbres: Common Dimensions, Specificities, and Latent Subject Classes,” Psychol. Res., vol. 58, pp. 177–192 (1995 Dec.). http://dx.doi.org/10.1007/BF00419633

[29] M. R. Schroeder, D. Gottlob, and K. Siebrasse, “Comparative Study of European Concert Halls: Correlation of Subjective Preference with Geometric and Acoustic Parameters,” J. Acoust. Soc. Am., vol. 56, pp. 1195–1201 (1974 Oct.). http://dx.doi.org/10.1121/1.1903408

[30] E. Parizet, E. Guyader, and V. Nosulenko, “Analysis of Car Door Closing Sound Quality,” Appl. Acoust., vol. 69, pp. 12–22 (2008 Jan.). http://dx.doi.org/10.1016/j.apacoust.2006.09.004

[31] J. B. Kruskal, “Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis,” Psychometrika, vol. 29, pp. 1–27 (1964 Mar.). http://dx.doi.org/10.1007/BF02289565


[32] C. Guastavino, “Etude sémantique et acoustique de la perception des basses fréquences dans l’environnement sonore urbain,” Ph.D. thesis, Université Paris 6 (2003).

[33] C. Guastavino and B. F. Katz, “Perceptual Evaluation of Multi-Dimensional Spatial Audio Reproduction,” J. Acoust. Soc. Am., vol. 116, pp. 1105–1115 (2004 Aug.). http://dx.doi.org/10.1121/1.1763973

[34] F. Rumsey, S. Zielinski, R. Kassier, and S. Bech, “Relationships between Experienced Listener Ratings of Multichannel Audio Quality and Naïve Listener Preferences,” J. Acoust. Soc. Am., vol. 117, pp. 3832–3840 (2005 Jun.). http://dx.doi.org/10.1121/1.1904305

[35] W. M. Rand, “Objective Criteria for the Evaluation of Clustering Methods,” J. Am. Stat. Assoc., vol. 66, pp. 846–850 (1971). http://dx.doi.org/10.1080/01621459.1971.10482356

[36] P. Faye, D. Brémaud, M. D. Daubin, P. Courcoux, A. Giboreau, and H. Nicod, “Perceptive Free Sorting and Verbalization Tasks with Naive Subjects: An Alternative to Descriptive Mappings,” Food Qual. Prefer., vol. 15, pp. 781–791 (2004). http://dx.doi.org/10.1016/j.foodqual.2004.04.009

APPENDIX

The clips for the radio drama material were taken from “The Hitchhiker’s Guide to the Galaxy: Tertiary Phase” (BBC, 2004) and “The Wonderful Wizard of Oz” (BBC, 2009). The clips from “The Hitchhiker’s Guide to the Galaxy: Tertiary Phase” occurred at approximately (min:sec) 05:00–06:30 (Episode 1), 14:40–17:40, 18:20–19:40 (Episode 2), 10:30–13:05, 18:20–20:20, 30:20–33:20 (Episode 3), 20:00 (Episode 4), and 24:50 (Episode 5). The clips from “The Wonderful Wizard of Oz” occurred at approximately 00:00–02:30, 04:15–05:37, 20:00–22:40, and 22:50–25:40.

The clips for the nature documentary material were taken from “Life,” Episode 1, “Challenges of Life” (BBC, 2009). The clips used occurred at approximately 04:30–06:00, 11:19–15:30, 21:53–23:40, and 27:20–29:32.

The clips for the feature film material were taken from “The Woman in Black” (2012). The clips used occurred at approximately 04:40–06:31, 10:49–12:15, 14:35–16:36, 16:36–18:05, 19:51–23:38, 23:38–25:10, 45:43–46:57, and 53:09–54:19.


THE AUTHORS

James Woodcock William J Davies Frank Melchior Trevor J Cox

James Woodcock is a research fellow at the University of Salford. His primary area of research is the perception and cognition of complex sound and vibration. James holds a B.Sc. in audio technology, an M.Sc. by research in product sound quality, and a Ph.D. in the human response to whole body vibration, all from the University of Salford. James is currently working on the EPSRC-funded S3A project. His work on this project mainly focuses on the perception of auditory objects in complex scenes, the listener experience of spatial audio, and intelligent rendering for object-based audio.
•
Bill Davies is professor of acoustics and perception at the University of Salford. He researches human response to complex sound fields in areas such as room acoustics, spatial audio, and urban soundscapes. He led the Positive Soundscape Project, an interdisciplinary effort to develop new ways of evaluating the urban sound environment. Bill also leads work on perception of complex auditory scenes on the S3A project. He edited a special edition of Applied Acoustics on soundscapes and sits on ISO TC43/SC1/WG54 producing standards on soundscape assessment. He is also an Associate Dean in the School of Computing, Science and Engineering at Salford, and Vice-President of the Institute of Acoustics (the UK professional body). Bill holds a B.Sc. in Electroacoustics and a Ph.D. in auditorium acoustics, both from Salford. He is the author of 80 academic publications in journals, conference proceedings, and books.
•
Frank Melchior received the Dipl.-Ing. degree in media technology from the Ilmenau University of Technology, Germany, in 2003 and the Dr.-Ing. degree from Delft University of Technology, The Netherlands, in 2011. Since 2012 he has been leading the audio research group and the BBC Audio Research Partnership at BBC Research and Development. From 2009 to 2012 he was the Chief Technical Officer and Director of Research and Development at IOSONO GmbH, Germany. From 2003 to 2009 he worked as a researcher at the Fraunhofer Institute for Digital Media Technology, Germany. He holds several patents and has authored and co-authored a number of papers in international journals and conference proceedings. His research is currently focused on next-generation audio for broadcast and interdisciplinary innovations for new audience experiences in an IP-based broadcast world of the future. Dr. Melchior is a member of the Audio Engineering Society and the German Acoustical Society, and represents the BBC in the International Telecommunication Union and the European Broadcasting Union.
•
Trevor Cox is Professor of Acoustic Engineering at the University of Salford and a past president of the UK’s Institute of Acoustics (IOA). Trevor’s diffuser designs can be found in rooms around the world. He is co-author of Acoustic Absorbers and Diffusers (3rd edition 8/16). He was awarded the IOA’s Tyndall Medal in 2004. He is currently working on two major audio projects: www.goodrecording.net combines perceptual testing and blind signal processing to detect recording errors in user-generated content, and S3A is investigating future technologies for spatial audio in the home. Trevor was given the IOA award for promoting acoustics to the public in 2009. He has presented shows to 15,000 pupils, including performing at the Royal Albert Hall. Trevor has presented 24 documentaries for BBC radio, including “The Physicist’s Guide to the Orchestra.” For his popular science book Sonic Wonderland (in the USA: The Sound Book), he won an ASA Science Writing Award in 2015. @trevor_cox
