Illustrating DCASE


Academic illustrations for understanding DCASE topics

This page features a curated collection of academic illustrations created by Toni Heittola, crafted to visually support and deepen understanding of topics related to DCASE (Detection and Classification of Acoustic Scenes and Events), specifically sound event detection, acoustic scene classification and general sound classification. These visuals are designed to clarify complex concepts and offer intuitive representations for both educational and scientific purposes.

Many of the illustrations have been developed over the years to support DCASE publications and tutorials, and form part of the visual material for my PhD thesis work. All figures are licensed under Creative Commons, allowing free use in presentations and publications. Feel free to use them, but please remember to provide proper attribution.

Analysis Task Descriptions

System input and output characteristics for three analysis systems: acoustic scene classification, audio tagging, and sound event detection.

Acoustic Scene Classification

DCASE2018 Challenge Acoustic Scene Classification

Acoustic Scene Classification with Multiple Devices

DCASE2020 Challenge Acoustic Scene Classification

Audio-Visual Scene Classification

DCASE2021 Challenge Audio-Visual Scene Classification

Sound Event Detection

DCASE2018 Challenge Sound Event Detection

Audio Tagging

DCASE2018 Challenge Audio Tagging

Auditory Perception

An example of auditory perception in an auditory scene with two overlapping sounds.

Auditory perception © 2021 by Toni Heittola is licensed under CC BY-NC-ND 4.0

Annotation process

Annotation with segment-level temporal information and with detailed temporal information.

Annotating onset and offset of different sounds: boundaries of the sound event are not always obvious.

Types of annotations for sound events

Content analysis of environmental audio

General machine learning approach

Examples of sound sources and corresponding sound events in an urban park acoustic scene.

Illustration of how monophonic and polyphonic sound event detection captures the events in the auditory scene.

The basic structure of an audio content analysis system.

Acoustic features

The processing pipeline of acoustic feature extraction.

Processing pipeline © 2021 by Toni Heittola is licensed under CC BY-NC-ND 4.0
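The front end of the pipeline in the figure (frame blocking, windowing, spectral analysis, log compression) can be sketched with NumPy; the frame length, hop size, and Hann window below are illustrative choices, not values prescribed by the figure:

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split a 1-D signal into overlapping frames (frame blocking)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def power_spectrogram(x, frame_len=1024, hop=512):
    """Frame -> window -> FFT -> power spectrum."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)      # one second of noise at an assumed 16 kHz rate
S = power_spectrogram(x)            # (frames, frequency bins)
log_S = np.log(S + 1e-10)           # log compression, commonly applied before mel filtering
```

The resulting time-frequency matrix is the input for the mel filtering and further feature extraction steps shown in the figure.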

Acoustic feature representations.

Mel-scaling (top panel) and mel-scale filterbank with 20 triangular filters (bottom panel).

Mel-scaling © 2021 by Toni Heittola is licensed under CC BY-NC-ND 4.0
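A filterbank like the one pictured can be constructed as follows; this is a minimal sketch using the common 2595·log10(1 + f/700) mel formula, with the FFT size and sample rate as assumed parameters:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_fft=1024, sr=16000):
    """Triangular filters with center frequencies spaced uniformly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                    # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                   # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank()   # 20 triangular filters, as in the bottom panel
```

Applying `fb @ power_spectrum` for each frame yields the mel-band energies used in mel-spectrogram and MFCC features.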

Static and dynamic feature representations.
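Dynamic features are typically obtained as regression (delta) coefficients computed over the static features along time; a minimal sketch, assuming a regression window of ±2 frames:

```python
import numpy as np

def delta(features, width=2):
    """First-order regression (delta) coefficients over time (axis 0)."""
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    return sum(
        n * (padded[width + n:len(features) + width + n]
             - padded[width - n:len(features) + width - n])
        for n in range(1, width + 1)
    ) / denom

static = np.arange(20, dtype=float).reshape(10, 2)   # toy static features
dynamic = delta(static)
full = np.hstack([static, dynamic])                  # static + dynamic representation
```

For a linear ramp the interior delta values equal the per-frame slope, which is a quick sanity check for the regression formula.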

Systems

Overview of the supervised learning process for audio content analysis. The system implements a multi-class, multi-label classification approach for the sound event detection task.

Overview of the recognition process for audio content analysis.

Recognition process © 2021 by Toni Heittola is licensed under CC BY-NC-ND 4.0

Converting class presence probability (middle panel) into sound class activity estimation with onset and offset timestamps (bottom panel).
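This conversion step can be sketched as simple thresholding followed by collecting contiguous active frames into (onset, offset) pairs; the threshold and the 20 ms frame hop below are assumed values:

```python
import numpy as np

def probability_to_events(prob, threshold=0.5, hop=0.02):
    """Binarize class presence probability and collect (onset, offset) times in seconds."""
    active = prob > threshold
    events, onset = [], None
    for t, a in enumerate(active):
        if a and onset is None:          # activity starts
            onset = t
        elif not a and onset is not None:  # activity ends
            events.append((onset * hop, t * hop))
            onset = None
    if onset is not None:                # activity continues to the end
        events.append((onset * hop, len(active) * hop))
    return events

prob = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.1, 0.6, 0.8, 0.2])
events = probability_to_events(prob)
```

In practice the binarized activity is often smoothed (e.g. with a median filter or minimum-duration rule) before the timestamps are collected.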

System architectures

General system architecture

Sound classification (single-label classification)

Sound classification © 2019 by Toni Heittola is licensed under CC BY-NC-ND 4.0

Audio tagging (multi-label classification)

Audio Tagging © 2019 by Toni Heittola is licensed under CC BY-NC-ND 4.0

Sound event detection input and output

Sound event detection

Sound event detection © 2019 by Toni Heittola is licensed under CC BY-NC-ND 4.0

GMM and HMM

Example of a univariate Gaussian mixture model with four components.

Gaussian distribution © 2021 by Toni Heittola is licensed under CC BY-NC-ND 4.0
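A univariate mixture of four Gaussian components can be evaluated directly from its weighted-sum definition; the weights, means, and variances below are arbitrary illustrative values:

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# four components: mixture weights must sum to one
weights = np.array([0.1, 0.3, 0.4, 0.2])
means = np.array([-3.0, 0.0, 2.0, 5.0])
variances = np.array([0.5, 1.0, 0.8, 1.5])

x = np.linspace(-10, 12, 22001)
mixture = sum(w * gaussian_pdf(x, m, v)
              for w, m, v in zip(weights, means, variances))

# sanity check: a mixture density should integrate to ~1
area = mixture.sum() * (x[1] - x[0])
```

Plotting `mixture` over `x` reproduces the kind of multi-modal density shown in the figure.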

Example of a hidden Markov model. The state transition matrix is represented as a graph: nodes represent the states and weighted edges indicate the transition probabilities. Two model topologies are presented in the figure: fully-connected and left-to-right. The dotted transitions have zero probability in the left-to-right topology.
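The two topologies can be written as transition matrices; the probability values below are illustrative, the point is only the zero entries below the diagonal in the left-to-right case:

```python
import numpy as np

# left-to-right topology with three states: transitions only to the same
# or a later state, so entries below the diagonal stay at zero probability
A_left_to_right = np.array([
    [0.7, 0.3, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],
])

# fully-connected topology: every transition has non-zero probability
A_full = np.array([
    [0.6, 0.2, 0.2],
    [0.3, 0.4, 0.3],
    [0.1, 0.3, 0.6],
])
```

Each row is a probability distribution over the next state, so every row must sum to one in both topologies.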

Neural networks

Overview of an artificial neuron (left panel) and the basic structure of a feedforward neural network (right panel).
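Both panels can be sketched in a few lines: a neuron is a weighted sum plus bias passed through an activation, and a feedforward network chains such layers. The sigmoid activation and the layer sizes are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """A single artificial neuron: weighted sum plus bias, then activation."""
    return sigmoid(np.dot(w, x) + b)

def feedforward(x, layers):
    """Basic feedforward network: each layer is a (weights, bias) pair."""
    for W, b in layers:
        x = sigmoid(W @ x + b)
    return x

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),   # 3 inputs -> 4 hidden units
          (rng.standard_normal((2, 4)), np.zeros(2))]   # 4 hidden  -> 2 outputs
y = feedforward(np.array([0.5, -1.0, 2.0]), layers)
```

With the sigmoid activation every output lies between 0 and 1, which is why it is a common choice for per-class presence probabilities.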

CNN

Convolutional neural networks

CNN layers © 2019 by Toni Heittola is licensed under CC BY-NC-ND 4.0

Convolutional neural networks

CNN layers © 2019 by Toni Heittola is licensed under CC BY-NC-ND 4.0

Multi-instance learning

Weakly supervised learning: multi-instance learning

Data augmentation

Audio data augmentation: reusing existing data

Evaluation

Contingency table

Contingency table © 2019 by Toni Heittola is licensed under CC BY-NC-ND 4.0
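The four cells of the contingency table can be counted directly from binary reference and system-output vectors:

```python
import numpy as np

def contingency(reference, estimate):
    """Counts of true positives, false positives, false negatives, and true negatives."""
    ref = np.asarray(reference, dtype=bool)
    est = np.asarray(estimate, dtype=bool)
    return {
        "TP": int(np.sum(ref & est)),     # active in both
        "FP": int(np.sum(~ref & est)),    # only in system output
        "FN": int(np.sum(ref & ~est)),    # only in reference
        "TN": int(np.sum(~ref & ~est)),   # active in neither
    }

table = contingency([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

These four counts are the intermediate statistics from which metrics such as precision, recall, and F-score are derived.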

Segment-Based Metrics

Calculation of two segment-based metrics: F-score and error rate. Comparisons are made at a fixed time-segment level, where both the reference annotations and the system output are rounded to the same time resolution. Binary event activities in each segment are then compared, and intermediate statistics are calculated.

Segment-based metrics © 2021 by Toni Heittola is licensed under CC BY-NC-ND 4.0
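A minimal sketch of both segment-based metrics, assuming binary activity matrices already quantized to a common segment grid; substitutions, deletions, and insertions are derived per segment from the false-negative and false-positive counts, following the common DCASE definitions:

```python
import numpy as np

# binary event activity per segment (rows: segments, columns: event classes)
reference = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
system = np.array([[1, 1], [1, 0], [0, 1], [1, 0]])

# overall intermediate statistics
TP = np.sum((reference == 1) & (system == 1))
FP = np.sum((reference == 0) & (system == 1))
FN = np.sum((reference == 1) & (system == 0))

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f_score = 2 * precision * recall / (precision + recall)

# error rate from per-segment substitutions, deletions, and insertions
fn_seg = np.sum((reference == 1) & (system == 0), axis=1)
fp_seg = np.sum((reference == 0) & (system == 1), axis=1)
S = np.sum(np.minimum(fn_seg, fp_seg))          # substitutions
D = np.sum(np.maximum(0, fn_seg - fp_seg))      # deletions
I = np.sum(np.maximum(0, fp_seg - fn_seg))      # insertions
N = np.sum(reference)                           # active reference events
error_rate = (S + D + I) / N
```

Note that error rate is not bounded by one: a system producing many insertions can exceed it.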

Event-Based Metrics

Reference events and system-output events with the same event label are compared based on the onset condition and the offset condition.

Calculation of two event-based metrics: F-score and error rate. F-score is calculated from the overall intermediate statistic counts. Error rate counts the total number of errors of different types (substitutions, insertions, and deletions).

Event-based metrics © 2021 by Toni Heittola is licensed under CC BY-NC-ND 4.0
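The onset and offset conditions can be sketched for a single pair of events; the 200 ms collar and the half-duration offset tolerance are common DCASE choices, assumed here rather than taken from the figure:

```python
def events_match(ref, est, collar=0.2, offset_ratio=0.5):
    """Check the onset and offset conditions for two (onset, offset) events
    with the same label. Onset must fall within the collar; offset within the
    collar or half the reference event duration, whichever is larger."""
    onset_ok = abs(est[0] - ref[0]) <= collar
    offset_collar = max(collar, offset_ratio * (ref[1] - ref[0]))
    offset_ok = abs(est[1] - ref[1]) <= offset_collar
    return onset_ok and offset_ok

match = events_match((1.0, 3.0), (1.1, 3.8))   # onset close, offset within tolerance
miss = events_match((1.0, 3.0), (1.4, 3.0))    # onset condition fails
```

Each matched pair counts as a true positive; unmatched system events become insertions and unmatched reference events become deletions, from which the event-based F-score and error rate are computed.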

Legacy Metrics

Calculation of intermediate statistics for two legacy metrics: ACC and AEER.

ACC and AEER metrics © 2021 by Toni Heittola is licensed under CC BY-NC-ND 4.0