This page presents a curated collection of academic illustrations created by Toni Heittola to support and deepen understanding of topics related to DCASE (Detection and Classification of Acoustic Scenes and Events): sound event detection, acoustic scene classification, and general sound classification. The visuals are designed to clarify complex concepts and to offer intuitive representations for both educational and scientific purposes.
Many of the illustrations have been developed over the years to support DCASE publications and tutorials, and they form part of the visual material of my PhD thesis. All figures are licensed under Creative Commons and may be used freely in presentations and publications; please remember to provide proper attribution.
Analysis Task Descriptions

System input and output characteristics for three analysis systems: acoustic scene classification, audio tagging, and sound event detection.
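
As a rough sketch of what these differences mean in practice (the type names below are hypothetical, not from any DCASE toolbox): scene classification returns one label per clip, tagging returns a set of labels without timing, and sound event detection returns labels with onset and offset timestamps.

```python
# Hypothetical result types contrasting the three task outputs for one clip.
from dataclasses import dataclass

@dataclass
class SceneResult:            # acoustic scene classification: one label per clip
    scene_label: str          # e.g. "urban_park"

@dataclass
class TaggingResult:          # audio tagging: a set of labels, no timing
    tags: list[str]

@dataclass
class DetectedEvent:          # sound event detection: label plus timestamps
    event_label: str
    onset: float              # seconds from the start of the clip
    offset: float

scene = SceneResult("urban_park")
tags = TaggingResult(["birds_singing", "footsteps"])
events = [DetectedEvent("birds_singing", 0.0, 6.2),
          DetectedEvent("footsteps", 3.1, 4.8)]
```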
Acoustic Scene Classification

DCASE2018 Challenge Acoustic Scene Classification
Acoustic Scene Classification with Multiple Devices

DCASE2020 Challenge Acoustic Scene Classification

Audio-Visual Scene Classification

DCASE2021 Challenge Audio-Visual Scene Classification
Sound Event Detection

DCASE2018 Challenge Sound Event Detection
Audio Tagging

DCASE2018 Challenge Audio Tagging
Auditory Perception

An example of auditory perception in an auditory scene with two overlapping sounds.
Annotation process

Annotation with segment-level temporal information compared to annotation with detailed temporal information.

Annotating the onset and offset of different sounds: the boundaries of a sound event are not always obvious.

Types of annotations for sound events
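
As a rough illustration of the two annotation types (the data layout here is an assumption for illustration, not a DCASE file format):

```python
# Event-level ("strong") annotation: each sound event instance is a
# (onset_s, offset_s, label) tuple; events may overlap in time.
event_annotation = [
    (0.5, 2.3, "car_passing"),
    (1.8, 2.1, "bird_singing"),
]

# Segment-level ("weak") annotation with one-second segments: only the set
# of active classes per segment is recorded, without exact boundaries.
segment_annotation = {
    0: {"car_passing"},                   # segment covering 0-1 s
    1: {"car_passing", "bird_singing"},   # 1-2 s
    2: {"car_passing", "bird_singing"},   # 2-3 s
}
```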
Content analysis of environmental audio

General machine learning approach

Examples of sound sources and corresponding sound events in an urban park acoustic scene.

Illustration of how monophonic and polyphonic sound event detection capture the events in an auditory scene.

The basic structure of an audio content analysis system.
Acoustic features

The processing pipeline of acoustic feature extraction.
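
As a rough sketch of such a pipeline (framing and windowing inside the STFT, magnitude spectrum, mel filterbank, logarithmic compression) using librosa, with illustrative parameter values:

```python
# A minimal log-mel extraction chain; parameters are illustrative.
import numpy as np
import librosa

sr = 44100
y = np.random.randn(sr)  # placeholder one-second signal

spectrum = np.abs(librosa.stft(y, n_fft=2048, hop_length=1024,
                               window="hamming")) ** 2
mel_fb = librosa.filters.mel(sr=sr, n_fft=2048, n_mels=40)
log_mel = np.log(mel_fb @ spectrum + 1e-10)  # (40 mel bands, n_frames)
```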

Acoustic feature representations.

Mel-scaling (top panel) and mel-scale filterbank with 20 triangular filters (bottom panel).
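
For reference, a minimal sketch of the mel-scale mapping and a 20-filter triangular filterbank like the one in the figure, using librosa:

```python
import numpy as np
import librosa

def hz_to_mel(f_hz):
    # widely used HTK-style mel formula; note that librosa defaults to the
    # slightly different Slaney variant (htk=False)
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

mel_fb = librosa.filters.mel(sr=44100, n_fft=2048, n_mels=20)
print(mel_fb.shape)  # (20, 1025): one triangular filter per row
```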

Static and dynamic feature representations.
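
A minimal sketch of combining static features with their dynamic (delta and delta-delta) counterparts, here using MFCCs computed with librosa; parameter values are illustrative:

```python
import numpy as np
import librosa

y = np.random.randn(44100)  # placeholder one-second signal
mfcc = librosa.feature.mfcc(y=y, sr=44100, n_mfcc=20)  # static features
d1 = librosa.feature.delta(mfcc, order=1)              # delta
d2 = librosa.feature.delta(mfcc, order=2)              # delta-delta
features = np.vstack([mfcc, d1, d2])                   # (60, n_frames)
```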
Systems

Overview of the supervised learning process for audio content analysis. The system implements a multi-class, multi-label classification approach for the sound event detection task.
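
A minimal sketch of such a multi-label setup (the architecture and sizes are illustrative assumptions): each class has an independent sigmoid output trained with binary cross-entropy, so several classes can be active in the same frame.

```python
import torch
import torch.nn as nn

n_features, n_classes = 40, 10
model = nn.Sequential(
    nn.Linear(n_features, 128), nn.ReLU(),
    nn.Linear(128, n_classes),        # one output unit per sound class
)
criterion = nn.BCEWithLogitsLoss()    # independent sigmoid per class

x = torch.randn(32, n_features)                   # batch of feature frames
y = torch.randint(0, 2, (32, n_classes)).float()  # frame-level activity targets
loss = criterion(model(x), y)
loss.backward()
```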

Overview of the recognition process for audio content analysis.

Converting class presence probability (middle panel) into sound class activity estimation with onset and offset timestamps (bottom panel).
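
The exact post-processing varies between systems; a minimal sketch of one common approach (fixed threshold, median-filter smoothing, contiguous active runs turned into events; parameter values are illustrative):

```python
import numpy as np
from scipy.ndimage import median_filter

def probabilities_to_events(prob, threshold=0.5, filter_length=7, hop_s=0.02):
    # binarize and smooth the frame-wise class presence probability
    activity = median_filter(prob > threshold, size=filter_length)
    events, onset = [], None
    for i, active in enumerate(activity):
        if active and onset is None:
            onset = i
        elif not active and onset is not None:
            events.append((onset * hop_s, i * hop_s))  # (onset_s, offset_s)
            onset = None
    if onset is not None:
        events.append((onset * hop_s, len(activity) * hop_s))
    return events
```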
System architectures

General system architecture

Sound classification (single label classification)

Audio tagging (multi label classification)

Sound event detection input and output

Sound event detection
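
The output stages of the three architectures above can be contrasted in a few lines (sizes are illustrative): softmax over classes for single-label classification, clip-level sigmoids for tagging, and frame-level sigmoids for detection.

```python
import torch

n_classes, n_frames = 10, 500
clip_logits = torch.randn(n_classes)
frame_logits = torch.randn(n_frames, n_classes)

scene = torch.softmax(clip_logits, dim=0).argmax()   # one label per clip
tags = torch.sigmoid(clip_logits) > 0.5              # label set per clip
activity = torch.sigmoid(frame_logits) > 0.5         # label set per frame
```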
GMM and HMM

Example of a univariate Gaussian mixture model with four components.
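
A minimal sketch of such a mixture density (parameter values are illustrative): the density is a weighted sum of the component densities.

```python
import numpy as np

weights = np.array([0.1, 0.3, 0.4, 0.2])  # mixture weights, sum to one
means = np.array([-3.0, 0.0, 2.0, 5.0])
stds = np.array([0.5, 1.0, 0.7, 1.2])

def gmm_pdf(x):
    # p(x) = sum_i w_i * N(x; mu_i, sigma_i^2)
    comp = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return np.sum(weights * comp)
```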

Example of a hidden Markov model. The state transition matrix is represented as a graph: nodes represent the states and weighted edges indicate the transition probabilities. Two model topologies are shown in the figure: fully connected and left-to-right. The dotted transitions have zero probability in the left-to-right topology.
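
The two topologies can be written directly as transition matrices for a three-state model (the probability values here are illustrative):

```python
import numpy as np

# fully connected: any state can follow any other (rows sum to one)
A_fully_connected = np.array([[0.6, 0.2, 0.2],
                              [0.3, 0.5, 0.2],
                              [0.1, 0.3, 0.6]])

# left-to-right: a state can only repeat or advance; the entries below the
# diagonal (the dotted transitions in the figure) are zero
A_left_to_right = np.array([[0.7, 0.3, 0.0],
                            [0.0, 0.6, 0.4],
                            [0.0, 0.0, 1.0]])
```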
Neural networks

Overview of an artificial neuron (left panel) and the basic structure of a feedforward neural network (right panel).
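
A minimal NumPy sketch of the figure's two panels (sizes are illustrative): a single artificial neuron computes a weighted sum plus bias followed by a nonlinearity, and a feedforward network applies layers of such neurons in sequence.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # single artificial neuron: weighted sum, bias, nonlinear activation
    return sigmoid(np.dot(w, x) + b)

rng = np.random.default_rng(0)
x = rng.normal(size=40)                           # input feature vector
W1, b1 = rng.normal(size=(16, 40)), np.zeros(16)  # hidden layer weights
W2, b2 = rng.normal(size=(10, 16)), np.zeros(10)  # output layer weights

h = sigmoid(W1 @ x + b1)   # hidden layer: 16 neurons applied in parallel
y = sigmoid(W2 @ h + b2)   # output layer: 10 neurons
```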
CNN

Convolutional neural networks
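
A minimal sketch of a typical CNN for audio (layer sizes are assumptions): stacked convolution and pooling over a mel-spectrogram treated as an image, followed by a classifier head.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 10 * 125, 10),  # matches a (40 mel bands, 500 frames) input
)
x = torch.randn(1, 1, 40, 500)     # (batch, channel, mel bands, frames)
print(model(x).shape)              # torch.Size([1, 10])
```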
Multi-instance learning

Weakly supervised learning: multi-instance learning
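
A minimal sketch of the multi-instance idea (shapes are illustrative): frames are instances in a bag (the clip), only the bag label is known, and the clip-level prediction is obtained by pooling frame-level predictions, here with max pooling.

```python
import torch

frame_probs = torch.rand(500, 10)             # instance level: (frames, classes)
clip_probs = frame_probs.max(dim=0).values    # bag level: a class is present
                                              # if it is active in any frame
weak_labels = (torch.rand(10) > 0.7).float()  # clip-level ("weak") annotation
loss = torch.nn.functional.binary_cross_entropy(clip_probs, weak_labels)
```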
Data augmentation

Audio data augmentation: reusing existing data
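
A minimal sketch of common augmentations that reuse existing audio (parameter values are illustrative): time stretching and pitch shifting with librosa, and additive noise.

```python
import numpy as np
import librosa

sr = 44100
y = np.random.randn(sr).astype(np.float32)  # placeholder one-second signal

stretched = librosa.effects.time_stretch(y, rate=1.1)       # 10% faster
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # up two semitones
noisy = y + 0.005 * np.random.randn(len(y)).astype(np.float32)
```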
Evaluation
Segment-Based Metrics

Calculation of two segment-based metrics: F-score and error rate. Comparisons are made at a fixed time-segment level, where both the reference annotations and the system output are rounded to the same time resolution. Binary event activities in each segment are then compared and intermediate statistics are calculated.
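
A minimal sketch of this computation, following the commonly used definitions (as implemented, for example, in the sed_eval toolbox): reference and system output are compared segment by segment as binary activity matrices, and F-score and error rate are computed from the accumulated counts.

```python
import numpy as np

def segment_based_metrics(ref, est):
    """ref, est: binary activity matrices of shape (n_segments, n_classes)."""
    tp = np.sum((ref == 1) & (est == 1), axis=1)
    fp = np.sum((ref == 0) & (est == 1), axis=1)
    fn = np.sum((ref == 1) & (est == 0), axis=1)

    f_score = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())

    s = np.minimum(fn, fp)      # substitutions per segment
    d = np.maximum(0, fn - fp)  # deletions per segment
    i = np.maximum(0, fp - fn)  # insertions per segment
    error_rate = (s.sum() + d.sum() + i.sum()) / ref.sum()
    return f_score, error_rate
```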
Event-Based Metrics

A reference event and the events with the same event label in the system output are compared based on the onset condition and the offset condition.

Calculation of two event-based metrics: F-score and error rate. The F-score is calculated from the overall intermediate statistics, while the error rate counts the total number of errors of different types (substitutions, insertions, and deletions).
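
A simplified sketch of the event-based comparison: an output event counts as correct if it has the same label as a reference event, its onset is within a collar (commonly 200 ms), and its offset is within max(collar, 50% of the reference event length). The greedy one-to-one matching below is a simplification of toolbox implementations such as sed_eval, which also track substitutions for the error rate.

```python
def event_based_f_score(ref_events, est_events, collar=0.2):
    """Events are (onset_s, offset_s, label) tuples."""
    matched, tp = set(), 0
    for onset, offset, label in est_events:
        for j, (r_on, r_off, r_label) in enumerate(ref_events):
            offset_collar = max(collar, 0.5 * (r_off - r_on))
            if (j not in matched and label == r_label
                    and abs(onset - r_on) <= collar             # onset condition
                    and abs(offset - r_off) <= offset_collar):  # offset condition
                matched.add(j)
                tp += 1
                break
    fp = len(est_events) - tp   # output events without a matching reference
    fn = len(ref_events) - tp   # reference events that were missed
    return 2 * tp / (2 * tp + fp + fn)
```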
Legacy Metrics

Calculation of intermediate statistics for two legacy metrics: ACC and AEER.
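
The figure gives the exact definitions; as a rough sketch under commonly used ones (these formulas are assumptions here): ACC as accuracy over binary segment decisions, and AEER, as used in the CLEAR evaluations, built from deletion, insertion, and substitution counts.

```python
def acc(tp, tn, fp, fn):
    # accuracy over binary segment decisions (assumed definition)
    return (tp + tn) / (tp + tn + fp + fn)

def aeer(deletions, insertions, substitutions, n_ref):
    # acoustic event error rate as in the CLEAR evaluations (assumed definition)
    return (deletions + insertions + substitutions) / n_ref
```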