Sound Event Detection in Everyday Environments


Computational Audio Content Analysis in Everyday Environments

This page presents an excerpt from the chapter Sound Event Detection in Everyday Environments, taken from my PhD thesis titled Computational Audio Content Analysis in Everyday Environments.

Thesis cover
Publication

Toni Heittola. Computational Audio Content Analysis in Everyday Environments. PhD thesis, Tampere University, June 2021.

PDF

Computational Audio Content Analysis in Everyday Environments

Abstract

Our everyday environments are full of sounds that play a vital role in providing us with information and allowing us to understand what is happening around us. Humans have formed strong associations between physical events in their environment and the sounds that these events produce. Such associations are described using textual labels, sound events, and they allow us to understand, recognize, and interpret the concepts behind sounds. Examples of such sound events are a dog barking, a person shouting, or a car passing by. This thesis deals with computational methods for audio content analysis of everyday environments. With the increased use of digital audio in our everyday lives, automatic audio content analysis has become an increasingly pursued capability. Content analysis enables an in-depth understanding of what was happening in the environment when the audio was captured, and this further facilitates applications that can accurately react to the events in the environment. The methods proposed in this thesis focus on sound event detection, the task of recognizing and temporally locating sound events within an audio signal, and include aspects related to the development of methods dealing with a large set of sound classes, the detection of multiple sounds, and the evaluation of such methods. The work presented in this thesis focuses on developing methods that allow the detection of multiple overlapping sound events and robust acoustic model training based on mixture audio containing overlapping sounds. Starting with an HMM-based approach for prominent sound event detection, the work advanced by extending it into polyphonic detection using multiple Viterbi iterations or sound source separation. These polyphonic sound event detection systems were based on a collection of generative classifiers producing multiple labels for the same time instance, which doubled or in some cases tripled the detection performance. As an alternative approach, polyphonic detection was implemented using class-wise activity detectors, in which the activity of each event class was detected independently and the class-wise event sequences were merged to produce the polyphonic system output. Polyphonic detection substantially increased the applicability of the methods in everyday environments. For the evaluation of methods, the work proposed a new metric for polyphonic sound event detection that takes the polyphony into account. The new metric, a segment-based F-score, provides rigorous definitions for correct and erroneous detections and is more suitable for comparing polyphonic annotations with polyphonic system output than the previously used metrics; it has since become one of the standard metrics in the research field. Part of this thesis studies sound events as a constituent part of the acoustic scene, based on the contextual information provided by their co-occurrence. This information was used for both sound event detection and acoustic scene classification. In sound event detection, context information was used to identify the acoustic scene and thereby narrow down the selection of possible sound event classes, which allowed the use of context-dependent acoustic models and event priors. This approach provided a moderate yet consistent performance increase across all tested acoustic scene types and enabled the detection system to be easily expanded to new scenes.
In acoustic scene classification, the scenes were identified based on the distinctive, scene-specific sound events detected, with performance comparable to traditional approaches, while the fusion of the two approaches yielded a significant further increase in performance. The thesis also includes significant contributions to the development of tools for open research in the field, such as standardized evaluation protocols and the release of open datasets, benchmark systems, and open-source tools.

Sound Event Detection in Everyday Environments

Detection of sound events is required to gain an understanding of the content of audio recordings from everyday environments. Sound events encountered in our everyday environments often overlap with other sounds in time and frequency, as discussed in the previous chapters. Therefore, polyphonic sound event detection is essential for well-performing audio content analysis in everyday environments. This chapter goes through the work published in [Mesaros2010, Heittola2010, Heittola2011, Heittola2013a, Heittola2013b, Mesaros2016, Mesaros2018]. These publications deal with various aspects of a polyphonic detection system: forming audio datasets, evaluating detection performance, training acoustic models from mixture signals, detecting overlapping sounds, using contextual information, and organizing evaluation campaigns related to sound event detection.

Problem Definition

Sound event detection aims to simultaneously estimate what is happening and when it is happening. In other words, the aim is to automatically find the start and end times of a sound event and to associate a textual class label with it. The detection can be done either by outputting only the most prominent sound event at each time (monophonic detection) or by also outputting other simultaneously active events (polyphonic detection). Examples of both types of detection are shown in Figure 4.1. Monophonic detection captures a fragmented view of the auditory scene: long sound events may be split into smaller events, and quieter events in the background may be masked by louder events and not detected at all. Sound events detected with a monophonic detection scheme might be sufficient for certain applications; however, for general content analysis, polyphonic detection is often required.


Figure 4.1 Illustration of how monophonic and polyphonic sound event detection captures the events in the auditory scene.

The input to the detection system is acoustic features \(\boldsymbol{x}_{t}\) extracted in each time frame \(t\) from the input signal. The aim is to learn an acoustic model able to estimate the presence of predefined sound event classes \(\boldsymbol{y}_{t}\) at each time frame. The model learning is based on learning examples: audio recordings along with annotated sound event activities. The sound event class presence probability at each time frame is given by the posterior probability \(p(\boldsymbol{y}_{t}|\boldsymbol{x}_{t})\). The probabilities are converted into binary class activities per frame, an event roll, and sound event onset and offset timestamps are obtained from runs of consecutive active frames. The output of the system is usually formatted as a list of detected events, an event list, containing a class label with onset and offset timestamps for each event.
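As a concrete illustration of this post-processing step, the following is a minimal sketch (not the implementation used in the thesis) that converts frame-wise class probabilities into an event list. The array layout, the fixed 0.5 threshold, and the 20 ms hop length are illustrative assumptions.

```python
import numpy as np

def probabilities_to_event_list(probs, class_labels, hop_length_s=0.02, threshold=0.5):
    """probs: array of shape (n_frames, n_classes) holding p(y_t | x_t)."""
    event_roll = probs > threshold                      # binary class activity per frame
    event_list = []
    for class_id, label in enumerate(class_labels):
        activity = event_roll[:, class_id].astype(int)
        # Pad with zeros so that events touching the signal edges are closed,
        # then locate frames where the activity switches on (+1) or off (-1).
        change = np.diff(np.concatenate(([0], activity, [0])))
        onsets = np.where(change == 1)[0]
        offsets = np.where(change == -1)[0]
        for onset_frame, offset_frame in zip(onsets, offsets):
            event_list.append({'event_label': label,
                               'onset': onset_frame * hop_length_s,
                               'offset': offset_frame * hop_length_s})
    return sorted(event_list, key=lambda event: event['onset'])
```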

Challenges

The main challenges faced in the development of a robust polyphonic SED system are related to the characteristics of everyday environments and everyday sounds. When the main part of the work included in this thesis was done, additional challenges were related to dataset quality, the amount of examples in the datasets, and the lack of established benchmark datasets and evaluation protocols to support SED system development.

Simultaneously occurring sound events in the auditory scene produce a mixture signal, and the challenge for the SED system is to detect individual sound events from this signal. The detection should focus only on certain sound event classes while being robust against interfering sounds. As the number of sound event classes that can realistically be used in an SED system is much lower than the actual number of sound sources in most natural auditory scenes, some of these overlapping sounds will always be unknown to the SED system and can be considered interfering sounds. In a real use case, the position of the sound-producing source in relation to the capturing microphone cannot be controlled, leading to varying loudness levels of the sound events and, together with overlapping sounds, to challenging signal-to-noise ratios in the captured audio recordings. The variability in the acoustic properties of the environments (e.g., room acoustics or reverberation) further contributes to the diversity of the audio material.

Sound instances assigned the same sound event label often have large intra-class variability. This is due to the variability in the sound-producing mechanisms behind the sound events, and this variability should be taken into account when learning the acoustic models for the sound events in order to produce a well-performing SED system. Ideally, one needs a large set of learning examples to fully cover the variability of the sounds in model learning; in practice, however, it is impossible to collect such a dataset for most use cases. Thus, robust acoustic modeling of sound events with a limited amount of learning examples is one of the major challenges. Sound events occurring in natural everyday environments are connected through the context in which they appear; however, the temporal sequence of these events seldom follows any strict structure. This unstructured nature of the audio presents an extra challenge for SED system design compared to speech recognition or music content retrieval systems, where the analysis can be steered with structural constraints of the target signal.

The development of a robust SED system depends on the dataset quality and the number of examples per sound event class available in the dataset. The collection of data is relatively easy; however, manual annotation is a subjective process, and this poses a challenge for the learning and evaluation of the system. In the annotation stage, the annotator listens to the recordings, manually indicates the onset and offset timestamps of the sound events, and selects an appropriate textual label to describe each sound event. As the amplitude envelopes of the sound events are often quite smooth, clear change points are hard to determine, and this temporal ambiguity leads to some degree of subjectivity in the onset/offset annotations. Furthermore, the selection of the textual label involves the listener's own prior experiences, and if free label selection is allowed, sounds will be labeled in a varying manner across annotators. Even the same annotator might label similar sounds differently depending on the context in which the sound event occurred. These subjective aspects of the annotation process produce noisy reference data, which has to be taken into account in the development and evaluation of SED systems.

Audio Datasets

The work included in this thesis is based on three datasets: Sound Effects 2009, TUT-SED 2009, and TUT-SED 2016. All of these datasets were collected and annotated for sound event detection research in the Audio Research Group at Tampere University. The information about these datasets is summarised in Table 4.1.

Table 4.1 Information about the datasets used in this thesis.
| Dataset | Used in | Event instances | Event classes | Scene classes | Files | Length | Notes |
|---|---|---|---|---|---|---|---|
| Sound Effects 2009 | [Mesaros2010] | 1359 | 61 | 9 | 1359 | 9h 24min | Isolated sounds, proprietary dataset |
| TUT-SED 2009 | [Mesaros2010, Heittola2010, Heittola2011, Heittola2013a, Heittola2013b] | 10040 | 61 | 10 | 103 | 18h 53min | Continuous recordings, proprietary dataset |
| TUT-SED 2016, Development | [Mesaros2016, Mesaros2018] | 954 | 18 | 2 | 22 | 1h 18min | Continuous recordings, open dataset |
| TUT-SED 2016, Evaluation | [Mesaros2018] | 511 | 18 | 2 | 10 | 35min | Continuous recordings, open dataset |

Sound Effects 2009

For the work in [Mesaros2010], a collection of isolated sounds was gathered from a commercial online sound effects sample database (Sound Ideas samples through StockMusic.com). The samples were originally captured for commercial audio-visual productions in a close-microphone setup with relatively minimal background ambiance. Samples were collected from nine general contextual classes: crowd, hallway, household, human, nature, office, outdoors, shop, and vehicles. In total, the dataset comprises 1359 samples belonging to 61 distinct event classes.

TUT-SED 2009

The TUT-SED 2009 dataset was the first sound event dataset consisting of real-life continuous recordings captured in a large number of common everyday environments. The dataset was manually annotated with strong labels. This dataset was used in the majority of the works included in this thesis [Mesaros2010, Heittola2010, Heittola2011, Heittola2013a, Heittola2013b]. The dataset is a proprietary data collection and cannot be shared outside Tampere University. The data was originally collected as part of an industrial project in which public release was never the aim. As a result, permissions for public data release were not requested from the persons present in the recorded scenes. The aim of the data collection was to have a representative collection of audio scenes, and recordings were collected from ten acoustic scenes. Typical office work environments were represented in the data collection by the office and hallway scenes. The street, inside a moving car, and inside a moving bus scenes represented typical urban transportation scenarios, whereas the grocery shop and restaurant scenes represented typical public space scenarios. Leisure-time scenarios were represented by the beach, in the audience of a basketball game, and in the audience of a track and field event scenes.

For each scene type, a single location with multiple recording positions (8-14 positions) was selected, as the aim of the data collection was to see how well material from a tightly focused set of acoustic scenes could be modeled. Each recording was 10-30 minutes long to capture a representative set of events in the scene. In total, the dataset consists of 103 recordings totaling almost 19 hours. The audio was captured using a binaural recording setup in which a person wears in-ear microphones (Soundman OKM II Classic/Studio A3) during the recording. Recordings were stored on a portable digital recorder (Roland Edirol R-09) using a 44.1 kHz sampling rate and 24-bit resolution.

All the recordings were manually annotated by indicating the onset and offset timestamps of events and assigning a descriptive textual label to each sound event. The annotations were done mostly by the same person who did the recordings to ensure annotations that are as detailed as possible: the annotator had some prior knowledge of the auditory scene to help identify the sound sources. A low-quality video was captured alongside the audio to help with the annotation of complex scenes containing a large variety of sound sources (e.g., street environments), allowing the annotator to recall the scene better while annotating. Due to the complexity of the material and the annotation task, the annotator first made a list of active events in the recording and then annotated the temporal activity of these events within the recording. The event labels for the list were freely chosen instead of using a predefined set of global labels. This resulted in a large set of labels, which were manually grouped into 61 distinct event classes after the whole dataset was annotated. On average, there were 2.7 simultaneous sound events active at all times in the recordings. In the grouping process, labels describing the same or very similar sound events were pooled under the same event class, for example, "cheer" and "cheering", or "barcode reader beep" and "card reader beep". Only event classes containing at least 10 examples were taken into account, while rarer events were collected into a single class labeled "unknown". Figure 4.3 illustrates the relative amount of event activity per class for the whole dataset as well as per scene. Each scene class has 14 to 23 active event classes, and many event classes appear in multiple scenes (e.g., speech), while some event classes are highly scene-specific (e.g., the referee whistle in basketball games). For example, "speech" events cover 43.9% of the recorded time in the dataset. Overall, the amount of activity across event classes is not well balanced, as expected for natural everyday environments.


Figure 4.3 Event activity statistics for TUT-SED 2009 dataset. Event activity is presented as percentage of active event time versus overall duration. Upper panel shows overall event activity, while lower panel shows scene-wise event activity.

TUT-SED 2016

The creation of the TUT Sound Events 2016 dataset (TUT-SED 2016) was motivated by the lack of an open dataset with high acoustic variability [Mesaros2016]. The data collection was carried out in 2015-2016 under the European Research Council funded Everysound project, and the recording locations were selected from Finland. To ensure high acoustic variability of the captured audio, each recording was done in a different location: different streets, different homes. Compared to the TUT-SED 2009 dataset, the TUT-SED 2016 dataset has two scene classes (indoor home environments and outdoor residential areas) and larger acoustic variability within the scene classes. The recording setup was similar to TUT-SED 2009: a binaural in-ear microphone setup with the same microphone model and digital recorder, using the same format settings (44.1 kHz, 24-bit). The duration of the recordings was set to 3-5 minutes, considered to be the most likely length someone would record in a real use case. The person recording was required to keep body and head movement to a minimum during the recording to enable the possible use of the spatial information present in binaural recordings. Furthermore, the person was instructed to keep the amount of his/her own speech to a minimum to avoid near-field speech.

The sound events in the recordings were manually annotated with onset and offset timestamps and a freely chosen event label. A noun-verb pair was used as the event label during annotation (e.g., "people; talking" and "car; passing"); nouns were used to characterize the sound source and verbs the sound production mechanism. Recordings and annotations were done by two research assistants; each annotated the material he/she recorded, and both were instructed to annotate all audible sound events in the scene. In the post-processing stage, recordings were annotated for microphone failures and interference noises caused by mobile phones, and this was stored as extra meta information. The sound event classes used in the published dataset were selected based on their frequency in the raw annotations and the number of different recordings they appeared in. Event labels that were semantically similar given the context were mapped together to ensure distinct classes. For example, "car engine; running" and "engine; running" were mapped together, and various impact sounds such as "banging" and "clacking" were grouped under "object impact". This resulted in a total of 18 sound classes, each having a sufficient amount of examples for learning acoustic models.
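This kind of label post-processing can be pictured as a simple lookup from raw noun-verb labels to final event classes; the sketch below is only illustrative, with the raw label forms and the merged class name "engine running" being assumptions rather than the actual mapping used for the dataset.

```python
# Hypothetical mapping from raw noun-verb annotation labels to final event
# classes. The entries are based on the examples mentioned in the text; the
# exact raw label strings and the merged class names are assumptions.
LABEL_MAP = {
    'car engine; running': 'engine running',
    'engine; running': 'engine running',
    'object; banging': 'object impact',
    'object; clacking': 'object impact',
}

def normalize_label(raw_label):
    # Fall back to the raw label when no mapping is defined for it.
    return LABEL_MAP.get(raw_label, raw_label)
```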

The dataset was used in the DCASE Challenge 2016 task on sound event detection in real-life audio [Mesaros2018], and the data was released as two datasets: a development dataset and an evaluation dataset. The development dataset was bundled with a 4-fold cross-validation setup, while the evaluation dataset was originally released without reference annotations, which were added after the evaluation campaign.

During the data collection campaign, a large number of recordings from 10 scene types was collected, and only a small subset was published in the TUT-SED 2016 dataset [TUTSED2016DEV, TUTSED2016EVAL]. Later, more material was annotated in a similar fashion and released as the TUT-SED 2017 dataset [TUTSED2017DEV, TUTSED2017EVAL] for the DCASE challenge 2017 [Mesaros2019b]. This dataset contained recordings from a single scene class (street) and had a relatively small number of target sound event classes (six). The rest of the material was released without sound event annotations as datasets for acoustic scene classification tasks: TUT Acoustic Scenes 2016 (TUT-ASC 2016) [TUTASC2016DEV, TUTASC2016EVAL] for the DCASE challenge 2016 [Mesaros2016] and TUT Acoustic Scenes 2017 (TUT-ASC 2017) [TUTASC2017DEV, TUTASC2017EVAL] for the DCASE challenge 2017 [Mesaros2019b].

Evaluation

The quantitative evaluation of SED system performance is done by comparing the system output with a reference available for the test data. For the datasets used in this thesis, the reference was created by manually annotating the audio material (see the discussion of annotations above) and storing the annotations as a list of sound event instances with an associated textual label and temporal information (onset and offset timestamps). The evaluation takes into account both the label and the temporal information. When dealing with monophonic annotations and monophonic SED systems, the evaluation is straightforward, as the system output at a given time is correct if the predicted event class coincides with the reference class. However, in the case of polyphonic annotations and polyphonic SED systems, the reference can contain multiple active sound events at a given time, and there can be multiple correctly and erroneously detected events at the same time instance. All these cases have to be accounted for in the metric. The evaluation metrics for polyphonic SED can be categorized into segment-based and event-based metrics depending on how the temporal information is handled in the evaluation. This section is based on work published in [Heittola2013a, Heittola2011, Mesaros2016b], and it addresses the research question of how to evaluate sound event detection systems with polyphonic system output.

Segment-Based Metrics

The first segment-based metric for polyphonic sound event detection, called block-wise F-score, was introduced and used in the earlier publications included in this thesis. In [Mesaros2016b], this metric was formalized as the segment-based F-score, and the segment-based error rate (ER) was introduced to complement it. These metrics have since become the standard metrics in the research field and have been used as ranking criteria in many DCASE challenge tasks. In this thesis, both the segment-based F-score and the segment-based ER are used as performance measures. In the segment-based evaluation, the intermediate statistics for the metric are calculated on a fixed time grid, often in one-second segments. An illustrative example of the metric calculation is shown in Figure 4.4.


Figure 4.4 Calculation of two segment-based metrics: F-score and error rate. Comparisons are made at a fixed time-segment level, where both the reference annotations and system output are rounded into the same time resolution. Binary event activities in each segment are then compared and intermediate statistics are calculated.

Segment-based metrics © 2021 by Toni Heittola is licensed under CC BY-NC-ND 4.0

The sound event activity is compared between the reference annotation and the system output in fixed one-second segments. An event is considered correctly detected if both the reference and the system output indicate the event as active within the segment; this case is referred to as a true positive. If the system output indicates an event to be active within the segment but the reference annotation indicates the event to be inactive, the output is counted as a false positive for the segment. Conversely, if the reference indicates the event to be active within the segment and the system output indicates inactivity for the same event class, the output is counted as a false negative. The total counts of true positives, false positives, and false negatives are denoted by \(TP\), \(FP\), and \(FN\).
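As a concrete illustration of this counting, the sketch below accumulates segment-based intermediate statistics, assuming the reference annotation and the system output have already been resampled into binary activity matrices on the fixed segment grid; the data layout is an assumption made for illustration only.

```python
import numpy as np

def segment_intermediate_statistics(reference, output):
    """Count segment-based TP, FP, and FN from binary activity matrices.

    reference, output: arrays of shape (n_segments, n_classes), where a
    non-zero entry means the class is active within that segment.
    """
    reference = reference.astype(bool)
    output = output.astype(bool)
    tp = np.logical_and(reference, output).sum()    # active in both
    fp = np.logical_and(~reference, output).sum()   # active only in the output
    fn = np.logical_and(reference, ~output).sum()   # active only in the reference
    return tp, fp, fn
```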

F-score

The segment-based F-score is calculated by first accumulating the intermediate statistics over the evaluated segments for each class and then summing them across classes to get the overall intermediate statistics (instance-based metric, micro-averaging). The precision \(P\) and recall \(R\) are calculated from the overall statistics as

$$ \label{eq-segment-based-precision-and-recall} P = \frac{{TP}}{{TP}+{FP}}\,\,,\quad R = \frac{{TP}}{{TP}+{FN}} \tag{4.1} $$

and the F-score:

$$ \label{eq-segment-based-fscore} {{F}}=\frac{2\cdot{P}\cdot{R}}{{P}+{R}}=\frac{2\cdot{TP}}{2\cdot{TP}+{FP}+{FN}} \tag{4.2} $$

The calculation process is illustrated in panel (a) of Figure 4.4.

The F-score is a widely known metric and easy to understand, and because of this, it is often the preferred metric for SED evaluation. The magnitude of the F-score is largely determined by the number of true positives, which is dominated by the system performance on the large classes. In this case, it may be preferable to use class-based averaging (macro-averaging) as the overall performance measure, which means calculating the F-score for each class based on the class-wise intermediate statistics and then averaging the class-wise F-scores to get a single value. However, this requires the presence of all classes in the test material to avoid classes with undefined recall (\(TP + FN = 0\)). This calls for extra attention when designing experiments in a train/test setting, especially when using recordings from uncontrolled everyday environments. In this thesis, the segment-based F-score is used with two different segment lengths, denoted as \(F_{seg,1sec}\) for 1-second segments and \(F_{seg,30sec}\) for 30-second segments.
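The difference between the two averaging schemes can be summarized with a short sketch. It assumes per-class intermediate statistics have already been accumulated (for instance with the segment_intermediate_statistics sketch above applied class by class), and skipping classes absent from the test material is one possible way of handling undefined recall, not a prescribed rule.

```python
import numpy as np

def fscore_from_counts(tp, fp, fn):
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator > 0 else 0.0

def micro_average_fscore(class_counts):
    """class_counts: list of (TP, FP, FN) tuples, one per event class."""
    tp, fp, fn = np.sum(class_counts, axis=0)   # overall counts across classes
    return fscore_from_counts(tp, fp, fn)

def macro_average_fscore(class_counts):
    # Classes absent from the test material (TP + FN = 0) have undefined
    # recall and are skipped here before averaging the class-wise F-scores.
    scores = [fscore_from_counts(tp, fp, fn)
              for tp, fp, fn in class_counts if tp + fn > 0]
    return float(np.mean(scores)) if scores else 0.0
```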

Error Rate

Error rate (ER) measures the amount of errors in terms of substitutions (\(S\)), insertions (\(I\)), and deletions (\(D\)), calculated in a segment-by-segment manner. In the metric calculation, true positives, false positives, and false negatives are counted in each segment, and based on these counts, substitutions, insertions, and deletions are calculated per segment. In a segment \(k\), the number of substitution errors \(S\left ( k \right )\) is defined as the number of reference events for which the system outputted an event but with an incorrect event label. Each such case contributes one false positive and one false negative in the segment; substitution errors are counted by pairing false positives and false negatives without designating which erroneous event substitutes which. Once the substitution errors are counted per segment, the remaining false positives are counted as insertion errors \(I\left ( k \right )\) and the remaining false negatives as deletion errors \(D\left ( k \right )\). The insertion errors are attributed to segments having incorrect event activity in the system output, and the deletion errors are attributed to segments having event activity in the reference but not in the system output. This can be formulated as follows:

$$ \begin{aligned} S\left ( k \right ) &= \min \left ( FN\left ( k \right ), FP\left ( k\right ) \right ) \\ D\left ( k \right ) &= \max \left ( 0, FN\left ( k\right ) - FP\left ( k\right )\right ) \\ I\left ( k \right ) &= \max \left ( 0, FP\left ( k\right ) - FN\left ( k\right )\right ) \label{eq-segment-based-er-intermediate}\end{aligned} \tag{4.3} $$

The error rate is then calculated by summing the segment-wise counts of \(S\), \(D\), and \(I\) over all \(K\) evaluated segments and normalizing by the total number of active reference events, with \(N(k)\) being the number of active reference events in segment \(k\) [Poliner2006]:

$$ \label{eq-segment-based-er} ER=\frac{\sum_{k=1}^K{S(k)}+\sum_{k=1}^K{D(k)}+\sum_{k=1}^K{I(k)}}{\sum_{k=1}^K{N(k)}} \tag{4.4} $$

The metric calculation is illustrated in panel (b) of Figure 4.4.
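The following sketch implements Equations 4.3 and 4.4 directly; it assumes the per-segment false negative, false positive, and reference-event counts are already available, and the array-based layout is an illustrative assumption.

```python
import numpy as np

def segment_error_rate(fn_per_segment, fp_per_segment, n_ref_per_segment):
    """All arguments are arrays of length K, one entry per evaluated segment."""
    fn = np.asarray(fn_per_segment)
    fp = np.asarray(fp_per_segment)
    substitutions = np.minimum(fn, fp)       # paired FP/FN within the same segment
    deletions = np.maximum(0, fn - fp)       # remaining false negatives
    insertions = np.maximum(0, fp - fn)      # remaining false positives
    total_reference = np.sum(n_ref_per_segment)
    return (substitutions.sum() + deletions.sum() + insertions.sum()) / total_reference
```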

The total error rate is commonly used to evaluate system performance in speech recognition and speaker diarization, and its parallel use in SED makes the metric more approachable for many researchers. On the other hand, interpretation of the error rate can be difficult, as the value is a score rather than a percentage, and it can exceed 1 if the system makes more errors than correct estimations. An error rate of exactly 1.0 is also trivial to achieve with a system that outputs no active events. Therefore, additional metrics, such as the segment-based F-score, should be used together with ER to get a more comprehensive performance estimate for the system.

Event-Based Metrics

The event-based F-score and error rate are used as metrics in [Mesaros2016]. In these metrics, the system output and the reference annotation are compared in an event-by-event manner: the intermediate statistics (true positives, false positives, and false negatives) are counted based on event instances. In the evaluation process, an event in the system output is regarded as correctly detected (true positive) if it overlaps in time with a reference event having the same label and its onset and offset meet specified conditions. An event in the system output without a correspondence in the reference annotation according to the onset and offset conditions is regarded as a false positive, whereas an event in the reference annotation without a correspondence in the system output is regarded as a false negative.

For a true positive, the positions of the event onset and offset are compared using a temporal collar to allow some tolerance and to set the desired evaluation resolution. The manually created reference annotations have some level of subjectivity in the temporal positions of onsets and offsets (see the discussion of annotations above), and the temporal tolerance can be used to alleviate the effect of this subjectivity in the evaluation. In [Mesaros2016], a collar of 200 ms was used, while a more permissive collar of 500 ms was used, for example, in the DCASE challenge task for rare sound event detection [Mesaros2019b]. The offset condition is set to be more permissive, as the exact offset timestamp is often less important than the onset for a well-performing SED system. The collar size for the offset condition adapts to different event lengths by selecting the maximum of the fixed 200 ms collar and 50% of the current reference event's duration, covering the differences between short and long events. Evaluation of event instances based on these conditions is shown in Figure 4.5.


Figure 4.5 The reference event and events in the system output with the same event labels compared based on onset condition and offset condition.

The evaluation can be done based solely on the onset condition or on both the onset and offset conditions, depending on how the system performance needs to be evaluated. Where event-based metrics are used in this thesis, both conditions are used together.

The event-based F-score is calculated in the same way as the segment-based F-score. The event-based intermediate statistics (\(TP\), \(FP\), and \(FN\)) are counted and summed up to get overall counts. Precision, recall, and F-score are calculated based on Equations 4.1 and 4.2. As with the segment-based F-score, the event-based F-score can be calculated based on total counts (instance-based, micro-average) or based on class-wise performance (class-based, macro-average). The metric calculation is illustrated in Figure 4.6 panel (a). The event-based error rate is defined with respect to the number of reference sound event instances. The substitutions are defined differently than in the segment-based error rate: events with the correct temporal position but an incorrect class label are counted as substitutions, whereas insertions and deletions are assigned to system output and reference events, respectively, that are not accounted for as correct or substituted. The overall metric is calculated from these error counts similarly to the segment-based metric in Equation 4.4. The metric calculation is illustrated in Figure 4.6 panel (b).


Figure 4.6 Calculation of two event-based metrics: F-score and error rate. F-score is calculated based on overall intermediate statistics counts. Error rate counts the total number of errors of different types (substitutions, insertions, and deletions).

Event-based metrics © 2021 by Toni Heittola is licensed under CC BY-NC-ND 4.0
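As an illustration of the matching step, the sketch below counts event-based true positives, false positives, and false negatives using the onset and offset conditions described above. The greedy one-to-one matching and the event-list layout are simplifying assumptions, and the actual evaluation implementation may resolve ambiguous matches differently; the resulting counts can then be fed into the same F-score formula as in the segment-based case.

```python
def event_based_counts(reference_events, output_events, t_collar=0.2, offset_ratio=0.5):
    """Events are dicts with 'event_label', 'onset', and 'offset' in seconds."""
    matched_reference = set()
    tp = 0
    for out_event in output_events:
        for i, ref_event in enumerate(reference_events):
            if i in matched_reference or ref_event['event_label'] != out_event['event_label']:
                continue
            # Onset condition: fixed collar around the reference onset.
            onset_ok = abs(out_event['onset'] - ref_event['onset']) <= t_collar
            # Offset condition: collar adapts to the reference event duration.
            offset_collar = max(t_collar,
                                offset_ratio * (ref_event['offset'] - ref_event['onset']))
            offset_ok = abs(out_event['offset'] - ref_event['offset']) <= offset_collar
            if onset_ok and offset_ok:
                matched_reference.add(i)
                tp += 1
                break
    fp = len(output_events) - tp        # output events without a matching reference
    fn = len(reference_events) - tp     # reference events left unmatched
    return tp, fp, fn
```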

In comparison to the segment-based metrics, the event-based metrics will usually give lower performance values, because it is generally harder to match onsets and offsets than the overall activity of an event. The event-based metrics measure the ability of the system to detect the correct event in the right temporal position, acting as a measure of onset/offset detection capability. Thus, event-based metrics are the recommended choice for applications where the detection of onsets and offsets of sounds is an essential feature.

Legacy Metrics

Earlier works included in this thesis used metrics defined for the CLEAR 2006 and CLEAR 2007 evaluation campaigns [Temko2009a]. These metrics were evaluated only for known non-speech events, and therefore “speech” and “unknown” events were excluded from the calculations. The first metric originating from the CLEAR campaigns was defined as a balanced F-score and denoted by ACC. In the evaluation, an outputted sound event was considered correctly detected if the temporal center of the event lies between the timestamps of a reference event with the same event class, or if there exists at least one reference event with the same event class whose temporal center lies between the timestamps of the outputted event. Conversely, a reference event was considered correctly detected if there was at least one outputted event whose temporal center is situated between the timestamps of the reference sound event from the same event class, or if the temporal center of the reference event lies between the timestamps of at least one outputted event from the same event class. The calculation process of the intermediate statistics for the metric is illustrated in Figure 4.7 panel (a). The metric was defined as

$$ ACC = \frac{2\cdot P\cdot R}{P+R} \label{eq-f-score} \tag{4.5} $$

with the precision \(P\) and the recall \(R\) defined as

$$ P = \frac{N_{sys\_cor}}{N_{sys}}\,\,,\quad R = \frac{N_{ref\_cor}}{N_{ref}} \label{eq-CLEAR-precision-and-recall} \tag{4.6} $$


Figure 4.7 Calculation of intermediate statistics for two legacy metrics: ACC and AEER.

ACC and AEER metrics © 2021 by Toni Heittola is licensed under CC BY-NC-ND 4.0
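A sketch of the ACC computation, using the temporal-center conditions described above and the precision and recall of Equations 4.5 and 4.6; the event-list layout is an illustrative assumption, not the original CLEAR scoring tool.

```python
def _center(event):
    return 0.5 * (event['onset'] + event['offset'])

def _covers(event, time_point):
    return event['onset'] <= time_point <= event['offset']

def _is_correct(event, candidates):
    # Correct if the event's temporal center falls inside some same-class
    # candidate, or some same-class candidate's center falls inside the event.
    same_class = [c for c in candidates if c['event_label'] == event['event_label']]
    return any(_covers(c, _center(event)) or _covers(event, _center(c))
               for c in same_class)

def acc_metric(reference_events, output_events):
    """Events are dicts with 'event_label', 'onset', and 'offset' in seconds."""
    n_sys_cor = sum(_is_correct(e, reference_events) for e in output_events)
    n_ref_cor = sum(_is_correct(e, output_events) for e in reference_events)
    precision = n_sys_cor / len(output_events) if output_events else 0.0
    recall = n_ref_cor / len(reference_events) if reference_events else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```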

The second metric originating from the CLEAR campaigns considered the temporal resolution of the outputted sound events by using a metric adapted from the speaker diarization task. This metric, the acoustic event error rate (AEER), is expressed as a time percentage [Temko2009a]. The metric computes intermediate statistics in adjacent time segments defined by the onsets and offsets of the reference and system output events. In each segment \(seg\), the numbers of reference and system output events are counted (\(N_{ref}\) and \(N_{sys}\)), along with the number of correctly outputted events \(N_{cor}\). The intermediate statistic calculation for AEER is illustrated in Figure 4.7 panel (b). The overall AEER score is calculated as the fraction of the time that is not attributed correctly to a sound event:

$$ AEER = \frac{\sum\limits_{seg}\left \{ \text{dur}(seg)\cdot\left ( \max(N_{\text{ref}},N_{\text{sys}})-N_{\text{cor}} \right )\right \}}{\sum\limits_{seg}\left \{ \text{dur}(seg)\cdot N_{\text{ref}}\right \}}\label{ch4:eq:AEER} \tag{4.7} $$

where \(\text{dur}(seg)\) is the duration of the segment.
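A simplified sketch of the AEER computation follows; the segment boundaries are derived from all reference and system onset/offset timestamps, and counting correctly attributed events as matching class labels active on both sides within a segment is an approximation of the original definition, not the exact CLEAR procedure.

```python
def _active_labels(events, start, end):
    # Labels of events overlapping the segment (start, end).
    return [e['event_label'] for e in events if e['onset'] < end and e['offset'] > start]

def aeer(reference_events, output_events):
    """Events are dicts with 'event_label', 'onset', and 'offset' in seconds."""
    boundaries = sorted({t for e in reference_events + output_events
                         for t in (e['onset'], e['offset'])})
    error_time, reference_time = 0.0, 0.0
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        duration = end - start
        ref_labels = _active_labels(reference_events, start, end)
        sys_labels = _active_labels(output_events, start, end)
        # Correctly attributed events: class labels active in both reference
        # and system output during this segment.
        n_cor = sum(min(ref_labels.count(label), sys_labels.count(label))
                    for label in set(ref_labels))
        error_time += duration * (max(len(ref_labels), len(sys_labels)) - n_cor)
        reference_time += duration * len(ref_labels)
    return error_time / reference_time if reference_time > 0 else 0.0
```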

The ACC metric can be seen as an event-based F-score where the correctness of events is defined using the centers of events instead of their onsets and offsets. Similarly, the AEER metric can be seen as a segment-based metric calculated in non-constant-sized segments. As a result, these metrics measure different aspects of performance at different time scales. This is problematic, as neither of them gives a sufficient performance measure alone, yet their joint use is complicated by their differing definitions. These shortcomings are alleviated by the new metrics (F-score and ER) for segment-based and event-based measurement defined in [Mesaros2016b]. These metrics are defined using the same temporal resolution, and even though individually they still provide only an incomplete view of the system performance, they can easily be used together to get a more complete picture of the performance. Moreover, the evaluation with AEER is based on non-constant segment lengths determined by the combination of reference and system output events, making the evaluation segments different from system to system. The error rate defined in [Mesaros2016b] uses a uniform segment length and simple rules to determine the correctness of the system output per segment, making the metric easier to understand.

References

[Abesser2020]Abeßer, Jakob. 2020. “A Review of Deep Learning Based Methods for Acoustic Scene Classification.” Applied Sciences 10 (6): 2020.
[Atrey2006]Atrey, Pradeep K, Namunu C Maddage, and Mohan S Kankanhalli. 2006. “Audio Based Event Detection for Multimedia Surveillance.” In 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5:V–. IEEE.
[Barchiesi2015]Barchiesi, D., D. Giannoulis, D. Stowell, and M. D. Plumbley. 2015. “Acoustic Scene Classification: Classifying Environments from the Sounds They Produce.” IEEE Signal Processing Magazine 32 (3): 16–34.
[Bello2019]Bello, Juan P., Claudio Silva, Oded Nov, R. Luke Dubois, Anish Arora, Justin Salamon, Charles Mydlarz, and Harish Doraiswamy. 2019. “SONYC: A System for Monitoring, Analyzing, and Mitigating Urban Noise Pollution.” Communications of the ACM 62 (2): 68–77.
[Cakir2017]Çakır, Emre, Giambattista Parascandolo, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen. 2017. “Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection.” Transactions on Audio, Speech and Language Processing 25 (6): 1291–1303.
[Cakir2018]Çakir, E., and T. Virtanen. 2018. “End-to-End Polyphonic Sound Event Detection Using Convolutional Recurrent Neural Networks with Learned Time-Frequency Representation Input.” In 2018 International Joint Conference on Neural Networks (IJCNN), 1–7.
[Chahuara2016]Chahuara, Pedro, Anthony Fleury, François Portet, and Michel Vacher. 2016. “On-Line Human Activity Recognition from Audio and Home Automation Sensors: Comparison of Sequential and Non-Sequential Models in Realistic Smart Homes 1.” Journal of Ambient Intelligence and Smart Environments 8 (4): 399–422.
[Chen2005]Chen, Jianfeng, Alvin Harvey Kam, Jianmin Zhang, Ning Liu, and Louis Shue. 2005. “Bathroom Activity Monitoring Based on Sound.” In International Conference on Pervasive Computing, 47–61. Springer.
[Clavel2005]Clavel, C., T. Ehrette, and G. Richard. 2005. “Events Detection for an Audio-Based Surveillance System.” In IEEE International Conference on Multimedia and Expo, 1306–9.
[Cramer2019]Cramer, J., H. Wu, J. Salamon, and J. P. Bello. 2019. “Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings.” In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3852–56.
[Crocco2016]Crocco, Marco, Marco Cristani, Andrea Trucco, and Vittorio Murino. 2016. “Audio Surveillance: A Systematic Review.” ACM Computing Surveys 48 (4).
[DCASE2016Workshop]Virtanen, Tuomas, Annamaria Mesaros, Toni Heittola, Mark D. Plumbley, Peter Foster, Emmanouil Benetos, and Mathieu Lagrange. 2016. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (Dcase2016). Tampere, Finland: Tampere University of Technology. Department of Signal Processing.
[DCASE2017Workshop]Virtanen, Tuomas, Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Emmanuel Vincent, Emmanouil Benetos, and Benjamin Martinez Elizalde. 2017. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (Dcase2017). Tampere, Finland: Tampere University of Technology. Laboratory of Signal Processing.
[DCASE2018Workshop]Plumbley, Mark D., Christian Kroos, Juan P. Bello, Gaël Richard, Daniel P. W. Ellis, and Annamaria Mesaros. 2018. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (Dcase2018). Tampere, Finland: Tampere University of Technology. Laboratory of Signal Processing.
[DCASE2019Workshop]Mandel, Michael, Justin Salamon, and Daniel P. W. Ellis. 2019. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (Dcase2019). New York, NY, USA: New York University.
[Do2016]Do, H. M., W. Sheng, M. Liu, and Senlin Zhang. 2016. “Context-Aware Sound Event Recognition for Home Service Robots.” In 2016 IEEE International Conference on Automation Science and Engineering (CASE), 739–44.
[Ellis1996]Ellis, D.P. W. 1996. “Prediction-Driven Computational Auditory Scene Analysis.” PhD thesis, MIT Media Laboratory, Cambridge, Massachusetts.
[Eronen2006]Eronen, A. J., V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi. 2006. “Audio-Based Context Recognition.” IEEE Transactions on Audio, Speech, and Language Processing 14 (1): 321–29.
[Fonseca2019]Fonseca, Eduardo, Manoj Plakal, Daniel PW Ellis, Frederic Font, Xavier Favory, and Xavier Serra. 2019. “Learning Sound Event Classifiers from Web Audio with Noisy Labels.” In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 21–25. IEEE.
[Gasc2013]Gasc, A., J. Sueur, F. Jiguet, V. Devictor, P. Grandcolas, C. Burrow, M. Depraetere, and S. Pavoine. 2013. “Assessing Biodiversity with Sound: Do Acoustic Diversity Indices Reflect Phylogenetic and Functional Diversities of Bird Communities?” Ecological Indicators 25: 279–87.
[Geiger2013]Geiger, J. T., B. Schuller, and G. Rigoll. 2013. “Large-Scale Audio Feature Extraction and SVM for Acoustic Scene Classification.” In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 1–4.
[Gemmeke2017]Gemmeke, Jort F, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. “Audio Set: An Ontology and Human-Labeled Dataset for Audio Events.” In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 776–80.
[Heittola2008]Heittola, Toni, and Anssi Klapuri. 2008. “TUT Acoustic Event Detection System 2007.” In Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007, edited by Rainer Stiefelhagen, Rachel Bowers, and Jonathan Fiscus, 364–70. Cham, Switzerland: Springer Verlag.
[Heittola2010]Heittola Toni, Annamaria Mesaros, Antti Eronen, and Tuomas Virtanen, “Audio Context Recognition Using Audio Event Histograms,” in Proceedings of 2010 European Signal Processing Conference, (Aalborg, Denmark), pp. 1272–1276, 2010.
[Heittola2011]Heittola Toni, Annamaria Mesaros, Tuomas Virtanen, and Antti Eronen, “Sound Event Detection in Multisource Environments Using Source Separation,” in Workshop on Machine Listening in Multisource Environments, (Florence, Italy), pp. 36–40, 2011.
[Heittola2013a]Heittola Toni, Annamaria Mesaros, Antti Eronen, and Tuomas Virtanen, “Context-Dependent Sound Event Detection,” in EURASIP Journal on Audio, Speech and Music Processing, Vol. 2013, No. 1, 13 pages, 2013.
[Heittola2013b]Heittola Toni, Annamaria Mesaros, Tuomas Virtanen, and Moncef Gabbouj, “Supervised Model Training for Overlapping Sound Events Based on Unsupervised Source Separation,” in Proceedings of the 35th International Conference on Acoustics, Speech, and Signal Processing, (Vancouver, Canada), pp. 8677–8681, 2013.
[Hershey2017]Hershey, S., S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, et al. 2017. “CNN Architectures for Large-Scale Audio Classification.” In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 131–35.
[Jansen2017]Jansen, A., J. F. Gemmeke, D. P. W. Ellis, X. Liu, W. Lawrence, and D. Freedman. 2017. “Large-Scale Audio Event Discovery in One Million YouTube Videos.” In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 786–90.
[Jung2019]Jung, S., J. Park, and S. Lee. 2019. “Polyphonic Sound Event Detection Using Convolutional Bidirectional LSTM and Synthetic Data-Based Transfer Learning.” In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 885–89.
[Kong2019]Kong, Qiuqiang, Yong Xu, Iwona Sobieraj, Wenwu Wang, and Mark D Plumbley. 2019. “Sound Event Detection and Time–Frequency Segmentation from Weakly Labelled Data.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (4): 777–87.
[Kraft2005]Kraft, F., R. Malkin, T. Schaaf, and A. Waibel. 2005. “Temporal ICA Classification of Acoustic Events in a Kitchen Enviroment.” In Proceedings of ICSLP-Interspeech, 2689–92.
[Kumar2013]Kumar, A., R. M. Hegde, R. Singh, and B. Raj. 2013. “Event Detection in Short Duration Audio Using Gaussian Mixture Model and Random Forest Classifier.” In 21st European Signal Processing Conference (EUSIPCO), 1–5.
[Kumar2016]Kumar, Anurag, and Bhiksha Raj. 2016. “Weakly Supervised Scalable Audio Content Analysis.” In 2016 IEEE International Conference on Multimedia and Expo (ICME), 1–6. IEEE.
[Lafay2017]Lafay, G., E. Benetos, and M. Lagrange. 2017. “Sound Event Detection in Synthetic Audio: Analysis of the DCASE 2016 Task Results.” In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 11–15.
[Mesaros2010]Mesaros Annamaria, Toni Heittola, Antti Eronen, and Tuomas Virtanen, “Acoustic Event Detection in Real Life Recordings,” in Proceedings of 2010 European Signal Processing Conference, (Aalborg, Denmark), pp. 1267–1271, 2010.
[Mesaros2016]Mesaros Annamaria, Toni Heittola, and Tuomas Virtanen, “TUT Database for Acoustic Scene Classification and Sound Event Detection,” in 24th European Signal Processing Conference 2016 (EUSIPCO 2016), pp. 1128–1132, 2016.
[Mesaros2016b]Mesaros, Annamaria, Toni Heittola, and Tuomas Virtanen. 2016. “Metrics for Polyphonic Sound Event Detection.” Applied Sciences 6 (6).
[Mesaros2017a]Mesaros, Annamaria, Toni Heittola, and Tuomas Virtanen. 2017a. “Assessment of Human and Machine Performance in Acoustic Scene Classification: DCASE 2016 Case Study.” In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 319–23.
[Mesaros2017b]Mesaros, A., T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen. 2017. “DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System.” In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (Dcase2017), 85–92.
[Mesaros2018]Mesaros Annamaria, Toni Heittola, Emmanouil Benetos, Peter Foster, Mathieu Lagrange, Tuomas Virtanen, and Mark D. Plumbley, “Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(2):379–393, Feb 2018.
[Mesaros2019a]Mesaros, Annamaria, Toni Heittola, and Tuomas Virtanen. 2019. “Acoustic Scene Classification in DCASE 2019 Challenge: Closed and Open Set Classification and Data Mismatch Setups.” In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (Dcase2019), 164–68.
[Mesaros2019b]Mesaros, Annamaria, Aleksandr Diment, Benjamin Elizalde, Toni Heittola, Emmanuel Vincent, Bhiksha Raj, and Tuomas Virtanen. 2019. “Sound Event Detection in the DCASE 2017 Challenge.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (6): 992–1006.
[Parascandolo2016]Parascandolo, G., H. Huttunen, and T. Virtanen. 2016. “Recurrent Neural Networks for Polyphonic Sound Event Detection in Real Life Recordings.” In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6440–44.
[Peltonen2002]Peltonen, V., J. Tuomi, A. Klapuri, J. Huopaniemi, and T. Sorsa. 2002. “Computational Auditory Scene Recognition.” In 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2:II-1941-II-1944.
[Piczak2015]Piczak, K. J. 2015. “Environmental Sound Classification with Convolutional Neural Networks.” In 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), 1–6.
[Poliner2006]Poliner, Graham E, and Daniel PW Ellis. 2006. “A Discriminative Model for Polyphonic Piano Transcription.” EURASIP Journal on Advances in Signal Processing 2007: 1–9.
[Rakotomamonjy2015]Rakotomamonjy, A., and G. Gasso. 2015. “Histogram of Gradients of Time–Frequency Representations for Audio Scene Classification.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (1): 142–53.
[Roma2013]Roma, G., W. Nogueira, and P. Herrera. 2013. “Recurrence Quantification Analysis Features for Environmental Sound Recognition.” In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 1–4.
[Salamon2015]Salamon, J., and J. P. Bello. 2015. “Unsupervised Feature Learning for Urban Sound Classification.” In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 171–75.
[Salamon2017]Salamon, Justin, and Juan Pablo Bello. 2017. “Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification.” IEEE Signal Processing Letters 24 (3): 279–83.
[Stiefelhagen2006]Stiefelhagen, Rainer, Keni Bernardin, Rachel Bowers, John Garofolo, Djamel Mostefa, and Padmanabhan Soundararajan. 2006. “The CLEAR 2006 Evaluation.” In Proceedings of the 1st International Evaluation Conference on Classification of Events, Activities and Relationships, 1–44. CLEAR’06. Berlin, Heidelberg: Springer-Verlag.
[Stiefelhagen2007]Stiefelhagen, Rainer, Keni Bernardin, Rachel Bowers, R. Travis Rose, Martial Michel, and John S. Garofolo. 2007. “The CLEAR 2007 Evaluation.” In Multimodal Technologies for Perception of Humans, International Evaluation Workshops CLEAR 2007 and RT 2007, edited by Rainer Stiefelhagen, Rachel Bowers, and Jonathan G. Fiscus, 4625:3–34. Lecture Notes in Computer Science. Cham, Switzerland: Springer Verlag.
[Stiefelhagen2008]Stiefelhagen, Rainer, Rachel Bowers, and Jonathan Fiscus. 2008. Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007. Vol. 4625. Cham, Switzerland: Springer Verlag.
[Stowell2015]Stowell, D., D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley. 2015. “Detection and Classification of Acoustic Scenes and Events.” Multimedia, IEEE Transactions on 17 (10): 1733–46.
[Szabo2016]Szabó, Beáta T., Susan L. Denham, and István Winkler. 2016. “Computational Models of Auditory Scene Analysis: A Review.” Frontiers in Neuroscience 10: 524.
[TUTASC2016DEV]Mesaros, Annamaria, Toni Heittola, and Tuomas Virtanen. 2016. “TUT Acoustic Scenes 2016, Development Dataset.” Zenodo. https://doi.org/10.5281/zenodo.45739.
[TUTASC2016EVAL]Mesaros, Annamaria, Toni Heittola, and Tuomas Virtanen. 2016. “TUT Acoustic Scenes 2016, Evaluation Dataset.” Zenodo. https://doi.org/10.5281/zenodo.165995.
[TUTASC2017DEV]Mesaros, Annamaria, Toni Heittola, and Tuomas Virtanen. 2017. “TUT Acoustic Scenes 2017, Development Dataset.” Zenodo. https://doi.org/10.5281/zenodo.400515.
[TUTASC2017EVAL]Mesaros, Annamaria, Toni Heittola, and Tuomas Virtanen. 2017c. “TUT Acoustic Scenes 2017, Evaluation Dataset.” Zenodo. https://doi.org/10.5281/zenodo.1040168.
[TUTSED2016DEV]Mesaros, Annamaria, Toni Heittola, and Tuomas Virtanen. 2016. “TUT Sound Events 2016, Development Dataset.” Zenodo. https://doi.org/10.5281/zenodo.45759
[TUTSED2016EVAL]Mesaros, Annamaria, Toni Heittola, and Tuomas Virtanen. 2017. “TUT Sound Events 2016, Evaluation Dataset.” Zenodo. https://doi.org/10.5281/zenodo.996424
[TUTSED2017DEV]Mesaros, Annamaria, Toni Heittola, and Tuomas Virtanen. 2017. “TUT Sound Events 2017, Development Dataset.” Zenodo. https://doi.org/10.5281/zenodo.814831
[TUTSED2017EVAL]Mesaros, Annamaria, Toni Heittola, and Tuomas Virtanen. 2017. “TUT Sound Events 2017, Evaluation Dataset.” Zenodo. https://doi.org/10.5281/zenodo.1040179
[Takahashi2018]Takahashi, N., M. Gygli, and L. Van Gool. 2018. “AENet: Learning Deep Audio Features for Video Analysis.” IEEE Transactions on Multimedia 20 (3): 513–24.
[Temko2006]Temko, Andrey, Robert Malkin, Christian Zieger, Dušan Macho, Climent Nadeu, and Maurizio Omologo. 2006. “CLEAR Evaluation of Acoustic Event Detection and Classification Systems.” In International Evaluation Workshop on Classification of Events, Activities and Relationships, 311–22. Springer.
[Temko2006a]Temko, Andrey, Robert Malkin, Christian Zieger, Dušan Macho, Climent Nadeu, and Maurizio Omologo. 2006. “CLEAR Evaluation of Acoustic Event Detection and Classification Systems.” In International Evaluation Workshop on Classification of Events, Activities and Relationships, 311–22. Springer.
[Temko2006b]Temko, Andrey, and Climent Nadeu. 2006. “Classification of Acoustic Events Using SVM-Based Clustering Schemes.” Pattern Recognition 39 (4): 682–94.
[Temko2007]Temko, Andrey, Climent Nadeu, and Joan-Isaac Biel. 2007. “Acoustic Event Detection: SVM-Based System and Evaluation Setup in CLEAR’07.” In Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007, edited by Rainer Stiefelhagen, Rachel Bowers, and Jonathan Fiscus, 354–63. Cham, Switzerland: Springer Verlag.
[Temko2009]Temko, A., and C. Nadeu. 2009. “Acoustic Event Detection in a Meeting-Room Environment.” Pattern Recognition Letters 30 (14): 1281–88.
[Temko2009a]Temko, Andrey, Climent Nadeu, Dusan Macho, Robert Malkin, Christian Zieger, and Maurizio Omologo. 2009. “Acoustic Event Detection and Classification.” In Computers in the Human Interaction Loop, edited by Alexander H. Waibel and Rainer Stiefelhagen, 61–73. Springer London.
[Tokozume2017]Tokozume, Y., and T. Harada. 2017. “Learning Environmental Sounds with End-to-End Convolutional Neural Network.” In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2721–25.
[Tran2011]Tran, Huy Dat, and Haizhou Li. 2011. “Probabilistic Distance SVM with Hellinger-Exponential Kernel for Sound Event Classification.” In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2272–75. IEEE.
[Vacher2004]Vacher, M, D Istrate, Laurent Besacier, Jean-François Serignat, and E Castelli. 2004. “Sound Detection and Classification for Medical Telesurvey.” In 2nd Conference on Biomedical Engineering, edited by Calgary ACTA Press, 395–98.
[Valenzise2007]Valenzise, G., L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti. 2007. “Scream and Gunshot Detection and Localization for Audio-Surveillance Systems.” In Proceedings of the 2007 IEEE Conference on Advanced Video and Signal Based Surveillance, 21–26.
[Virtanen2018]Virtanen, Tuomas, Mark D Plumbley, and Dan Ellis. 2018. Computational Analysis of Sound Scenes and Events. Cham, Switzerland: Springer Verlag.
[Xu2018]Xu, Y., Q. Kong, W. Wang, and M. D. Plumbley. 2018. “Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network.” In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 121–25.
[Zhou2007]Zhou, Xi, Xiaodan Zhuang, Ming Liu, Hao Tang, Mark Hasegawa-Johnson, and Thomas Huang. 2007. “HMM-Based Acoustic Event Detection with AdaBoost Feature Selection.” In Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007, edited by Rainer Stiefelhagen, Rachel Bowers, and Jonathan Fiscus, 345–53. Cham, Switzerland: Springer Verlag.
[Zhuang2009]Zhuang, Xiaodan, J. Huang, G. Potamianos, and M. Hasegawa-Johnson. 2009. “Acoustic Fall Detection Using Gaussian Mixture Models and GMM Supervectors.” In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 69–72.
[Zhuang2010]Zhuang, Xiaodan, Xi Zhou, Mark A Hasegawa-Johnson, and Thomas S Huang. 2010. “Real-World Acoustic Event Detection.” Pattern Recognition Letters 31 (12): 1543–51.