Guiding Research Journeys


Supervision in Audio, Machine Learning, and Sound Analysis

I am actively involved in guiding students at various stages of their academic journey, from Bachelor's theses to doctoral research. My supervision and advisory involvement focuses on topics related to general audio signal processing, machine learning, and computational sound scene analysis.

I enjoy working closely with students, sharing insights, and learning together as we tackle real-world challenges through research. Supervision is more than just guiding a project; it is about collaboration, discovery, and helping students develop into confident and independent researchers. Feel free to browse through the projects I have supervised. If something sparks your interest, don't hesitate to reach out. I am always happy to discuss potential topics and explore how we might work together.

Research Projects on Timeline

Over the course of my academic career, I have been actively involved in a wide range of research projects that have contributed to the completion of academic theses and degrees at various levels. These projects have spanned diverse topics within the fields of machine learning and audio signal processing, often forming the foundation for Master's and doctoral research.

The timeline below provides an overview of selected projects, highlighting the evolution of research themes and the wide range of topics explored over the years.

```mermaid
---
displayMode: normal # compact
config:
  theme: forest
  gantt:
    topPadding: 50
    leftPadding: 8
    rightPadding: 8
    topAxis: false
    numberSectionStyles: 2
    barHeight: 22
    barGap: 6
    fontSize: 12
    sectionFontSize: 16
    # gridLineStartPadding: 0
---
gantt
    todayMarker off
    dateFormat YYYY-MM-DD
    axisFormat %Y

    section Advisory Involvement in Doctoral Projects
    Zero-Shot Learning : active, 2022-06-01, 2025-09-01
    Representation Learning : active, 2021-01-01, 2024-06-01
    Active Learning : active, 2017-01-01, 2020-10-01
    Deep Learning : active, 2015-01-01, 2019-01-31

    section Supervised Research Projects
    Speaker Distortions : done, 2024-06-01, 2025-05-31
    Captioning : done, 2022-08-01, 2023-09-30
    Real-time Sound Event Detection : done, 2020-01-01, 2020-11-30
    Traffic Monitoring : done, 2018-05-01, 2019-08-31
    Real-time Audio Analysis : done, 2014-02-01, 2014-09-30
    Semi-Supervised Learning : done, 2012-06-01, 2013-10-31
    Guitar Transcription : done, 2011-05-01, 2011-12-31
    Speaker Modeling : done, 2008-04-01, 2010-03-31
```

Project Advisory Involvement

In addition to supervising individual theses, I have served in advisory roles for several doctoral research projects. These collaborations have focused on advancing the state of the art in audio classification, sound event detection, active learning, representation learning, and zero-shot learning.

As a doctoral advisor, my role has included guiding research direction, contributing to methodological development, and supporting students in publishing high-quality scientific work. These projects reflect long-term, in-depth engagements that align closely with my research interests and contribute to the broader academic community.

Doctoral-level projects where I have served in an advisory role:

  • Duygu Dogan – Zero-Shot Audio Classification (2022–2025)
  • Shanshan Wang – Audio-Video Feature Representation Learning (2021–2024)
  • Zhao Shuyang – Active Learning for Sound Event Detection (2017–2020)
  • Emre Cakir – Deep Neural Networks for Sound Event Detection (2015–2019)

Zero-Shot Audio Classification

Duygu Dogan
Doctoral project advisor
2022 — 2025

In recent years, the field of Zero-Shot Audio Classification has emerged as a promising approach for enabling machines to recognize sounds they have never encountered before, without the need for labeled training data. I have had the pleasure of collaborating with Duygu Dogan on a research project aimed at advancing the state of the art in this area. The project focuses on developing models that can generalize to novel sound classes by leveraging external semantic information, such as textual or visual embeddings, rather than relying solely on annotated audio datasets.

  • Use of image-based semantic embeddings to bridge the gap between audio and visual modalities, enabling zero-shot classification through cross-modal knowledge transfer.
  • Introduction of a temporal attention mechanism to enhance the model’s ability to detect and differentiate overlapping sounds in multi-label settings, an essential capability for real-world acoustic environments.

Multi-Label Zero-Shot Audio Classification with Temporal Attention

Abstract

Zero-shot learning models are capable of classifying new classes by transferring knowledge from the seen classes using auxiliary information. While most of the existing zero-shot learning methods focused on single-label classification tasks, the present study introduces a method to perform multi-label zero-shot audio classification. To address the challenge of classifying multi-label sounds while generalizing to unseen classes, we adapt temporal attention. The temporal attention mechanism assigns importance weights to different audio segments based on their acoustic and semantic compatibility, thus enabling the model to capture the varying dominance of different sound classes within an audio sample by focusing on the segments most relevant for each class. This leads to more accurate multi-label zero-shot classification than methods employing temporally aggregated acoustic features without weighting, which treat all audio segments equally. We evaluate our approach on a subset of AudioSet against a zero-shot model using uniformly aggregated acoustic features, a zero-rule baseline, and the proposed method in the supervised scenario. Our results show that temporal attention enhances the zero-shot audio classification performance in multi-label scenario.

Keywords

Adaptation models;Attention mechanisms;Accuracy;Zero-shot learning;Event detection;Conferences;Semantics;Focusing;Acoustics;multi-label zero-shot learning;audio classification;audio tagging;temporal attention

Cites: 2 (see at Google Scholar)

PDF
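The weighting idea behind this paper can be sketched in a few lines of Python. This is only an illustrative toy, not the published model: it assumes segment-level acoustic embeddings and class-level semantic embeddings that already live in a shared space, and uses a plain dot product as the compatibility function.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention_scores(segment_emb, class_emb):
    """Class-wise scores from temporally attended segment embeddings.

    segment_emb : (T, D) acoustic embeddings for T audio segments,
                  assumed to be projected into the same space as class_emb.
    class_emb   : (C, D) semantic embeddings of the (possibly unseen) classes.

    Returns (C,) compatibility scores, one per class.
    """
    # Acoustic-semantic compatibility of every segment with every class: (T, C)
    compat = segment_emb @ class_emb.T
    # Attention over time, computed separately for each class: (T, C)
    attn = softmax(compat, axis=0)
    # Class score = attention-weighted average of the segment compatibilities
    return (attn * compat).sum(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    segments = rng.normal(size=(20, 128))   # e.g. 20 short audio segments
    classes = rng.normal(size=(5, 128))     # 5 unseen classes described semantically
    scores = temporal_attention_scores(segments, classes)
    # Multi-label decision: every class above a threshold is considered active
    print(scores > 0.0)
```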

Zero-Shot Audio Classification using Image Embeddings

Abstract

Supervised learning methods can solve the given problem in the presence of a large set of labeled data. However, the acquisition of a dataset covering all the target classes typically requires manual labeling which is expensive and time-consuming. Zero-shot learning models are capable of classifying the unseen concepts by utilizing their semantic information. The present study introduces image embeddings as side information on zero-shot audio classification by using a nonlinear acoustic-semantic projection. We extract the semantic image representations from the Open Images dataset and evaluate the performance of the models on an audio subset of AudioSet using semantic information in different domains; image, audio, and textual. We demonstrate that the image embeddings can be used as semantic information to perform zero-shot audio classification. The experimental results show that the image and textual embeddings display similar performance both individually and together. We additionally calculate the semantic acoustic embeddings from the test samples to provide an upper limit to the performance. The results show that the classification performance is highly sensitive to the semantic relation between test and training classes and textual and image embeddings can reach up to the semantic acoustic embeddings when the seen and unseen classes are semantically similar.

Keywords

zero-shot learning, audio classification, semantic embeddings, image embeddings

Cites: 8 (see at Google Scholar)

PDF
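To make the acoustic-semantic projection idea concrete, here is a minimal, hypothetical sketch in Python. It is not the system from the paper: a small MLP regressor stands in for the nonlinear projection, and random vectors stand in for the image/text class embeddings and the audio features.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins: in the paper's setting the semantic vectors would come
# from image or word embeddings of the class labels; here they are placeholders.
n_seen, n_unseen, feat_dim, sem_dim = 6, 3, 64, 32
seen_class_emb = rng.normal(size=(n_seen, sem_dim))
unseen_class_emb = rng.normal(size=(n_unseen, sem_dim))

# Training data: acoustic features of seen-class clips, paired with the
# semantic embedding of their class (the regression target of the projection).
X_train = rng.normal(size=(600, feat_dim))
y_class = rng.integers(0, n_seen, size=600)
Y_train = seen_class_emb[y_class]

# Nonlinear acoustic-to-semantic projection (a small MLP as a stand-in).
proj = MLPRegressor(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
proj.fit(X_train, Y_train)

def zero_shot_predict(x, class_emb):
    """Project acoustic features and pick the closest class embedding (cosine)."""
    z = proj.predict(x.reshape(1, -1))
    z = z / np.linalg.norm(z)
    c = class_emb / np.linalg.norm(class_emb, axis=1, keepdims=True)
    return int(np.argmax(c @ z.ravel()))

test_clip = rng.normal(size=feat_dim)
print("predicted unseen class index:", zero_shot_predict(test_clip, unseen_class_emb))
```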

Feature Representation Learning Using Audio-Video Data

Shanshan Wang
Doctoral project advisor
2021 — 2024

In collaboration with Shanshan Wang, I have contributed to a research project focused on multimodal representation learning, particularly in the context of audio-visual scene analysis. The project explored how complementary information from audio and video streams can be leveraged to improve the performance of machine learning models in complex urban environments.

  • This work contributes to the growing field of multimodal machine learning, where the integration of audio and visual modalities enables more robust and context-aware systems. The findings have implications for applications such as smart city monitoring, surveillance, and environmental sensing, and have been presented at leading venues including ICASSP and DCASE.
  • A curated dataset of urban acoustic scenes was developed to support audio-visual learning tasks, providing a valuable resource for benchmarking and reproducibility in the field.
  • The collaboration also included a comprehensive analysis of submissions to the DCASE2021 Challenge Task on Audio-Visual Scene Classification, offering insights into effective model architectures and fusion strategies for multimodal learning.
  • The research introduced and evaluated self-supervised learning strategies for audio-video data, with a focus on positive and negative sampling techniques that enhance the quality of learned representations without requiring manual annotations.

Self-supervised Representation Learning on Audio-Video Data

Abstract

Feature representation learning using audio-video data has gained significant attention due to its ability to leverage complementary information from both modalities. Audio and visual signals provide distinct yet correlated perspectives of the same scene, making their joint modeling beneficial for applications such as video understanding, environmental sound analysis, and autonomous perception. By integrating both modalities, models can learn richer and more discriminative feature representations, improving performance in tasks like audio-visual scene classification and cross-modal retrieval. Recently, self-supervised learning (SSL) has emerged as a powerful alternative to supervised approaches, primarily due to its ability to learn meaningful representations without labeled data. SSL exploits intrinsic structures within the data to generate pseudo-labels, enabling models to extract robust and generalizable features from large-scale, unlabeled datasets. This is particularly advantageous in multi-modal learning, where obtaining labeled data is often labor-intensive and costly. This thesis focuses on processing and learning from audio-visual data directly for scene classification, and learning feature representations for other audio classification tasks. The first part of the thesis focuses on a new audio-video dataset of urban scenes that we curated and published, and presents a case study on audio-visual scene analysis. Our findings show that integrating both audio and visual modalities yields significantly better performance than using each modality alone. The work in the second part of the thesis explores self-supervised feature representation learning, focusing on enhancing contrastive learning techniques for multimodal learning. The first direction investigated was spatial alignment between audio and visual modalities, using spatial correspondence as a supervisory signal to improve cross-modal representation learning. Our experiments demonstrate that replacing standard log-mel spectrogram features with the first-order Ambisonics intensity vector significantly improves audio-visual spatial alignment task performance. The second direction investigated was an improved contrastive loss function that strengthens the discriminative power of feature embeddings by introducing an angular margin between positive and negative pairs. Results indicate that applying this loss in both supervised and self-supervised learning settings leads to substantial performance improvements. Finally, the third direction investigated was the sampling techniques in self-supervised learning. We introduced a soft-positive sampling strategy to refine contrastive learning by selecting informative positive pairs while reducing the impact of noisy samples. Experimental results suggest that in self-supervised learning setups with small datasets, features learned through soft-positive sampling outperform those obtained from traditional sampling approaches.

PDF

Positive and Negative Sampling Strategies for Self-Supervised Learning on Audio-Video Data

Abstract

In Self-Supervised Learning (SSL), Audio-Visual Correspondence (AVC) is a popular task to learn deep audio and video features from large unlabeled datasets. The key step in AVC is to randomly sample audio and video clips from the dataset and learn to minimize the feature distance between the positive pairs (corresponding audio-video pair) while maximizing the distance between the negative pairs (non-corresponding audio-video pairs). The learnt features are shown to be effective on various downstream tasks. However, these methods achieve subpar performance when the size of the dataset is rather small. In this paper, we investigate the effect of utilizing class label information in the AVC feature learning task. We modified various positive and negative data sampling techniques of SSL based on class label information to investigate the effect on the feature quality. We propose a new sampling approach which we call soft-positive sampling, where the positive pair for one audio sample is not from the exact corresponding video, but from a video of the same class. Experimental results suggest that when the dataset size is small in SSL setup, features learnt through the soft-positive sampling method significantly outperform those from the traditional SSL sampling approaches. This trend holds in both in-domain and out-of-domain downstream tasks, and even outperforms supervised classification. Finally, experiments show that class label information can easily be obtained using a publicly available classifier network and then can be used to boost the SSL performance without adding extra data annotation burden.

Keywords

Representation learning;Annotations;Conferences;Self-supervised learning;Signal processing;Sampling methods;Market research;self-supervised learning;sampling strategies;soft-positive;audio-video data

Cites: 1 (see at Google Scholar)

PDF
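The sampling idea itself is easy to illustrate. The sketch below is a simplified, hypothetical version: `clips`, `audio`, `video` and `label` are placeholders for pre-extracted clip embeddings and classifier-derived class labels, and the actual pairing and loss used in the paper may differ.

```python
import random
from collections import defaultdict

def sample_pairs(clips, use_soft_positives=True, n_negatives=4, rng=random):
    """Illustrative positive/negative sampling for audio-visual correspondence.

    `clips` is a list of dicts with 'audio', 'video' and 'label' entries.
    Returns (anchor_audio, positive_video, negative_videos) tuples that would
    feed a contrastive loss.
    """
    by_label = defaultdict(list)
    for c in clips:
        by_label[c["label"]].append(c)

    batch = []
    for anchor in clips:
        if use_soft_positives:
            # Soft positive: a video from *some* clip of the same class,
            # not necessarily the exactly corresponding video.
            positive = rng.choice(by_label[anchor["label"]])["video"]
        else:
            # Standard AVC positive: the video of the very same clip.
            positive = anchor["video"]
        # Negatives: videos drawn from clips of other classes.
        others = [c for c in clips if c["label"] != anchor["label"]]
        negatives = [rng.choice(others)["video"] for _ in range(n_negatives)]
        batch.append((anchor["audio"], positive, negatives))
    return batch
```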

Audio-Visual Scene Classification: Analysis of DCASE 2021 Challenge Submissions

Abstract

This paper presents the details of the Audio-Visual Scene Classification task in the DCASE 2021 Challenge (Task 1 Subtask B). The task is concerned with classification using audio and video modalities, using a dataset of synchronized recordings. This task has attracted 43 submissions from 13 different teams around the world. Among all submissions, more than half of the submitted systems have better performance than the baseline. The common techniques among the top systems are the usage of large pretrained models such as ResNet or EfficientNet which are trained for the task-specific problem. Fine-tuning, transfer learning, and data augmentation techniques are also employed to boost the performance. More importantly, multi-modal methods using both audio and video are employed by all the top 5 teams. The best system among all achieved a logloss of 0.195 and accuracy of 93.8%, compared to the baseline system with logloss of 0.662 and accuracy of 77.1%.

Cites: 23 (see at Google Scholar)

PDF

A curated dataset of urban acoustic scenes for audio-visual scene analysis

Abstract

This paper introduces a curated dataset of urban scenes for audio-visual scene analysis which consists of carefully selected and recorded material. The data was recorded in multiple European cities, using the same equipment, in multiple locations for each scene, and is openly available. We also present a case study for audio-visual scene recognition and show that joint modeling of audio and visual modalities brings significant performance gain compared to state of the art uni-modal systems. Our approach obtained an 84.8% accuracy compared to 75.8% for the audio-only and 68.4% for the video-only equivalent systems.

Keywords

Audio-visual data, Scene analysis, Acoustic scene, Pattern recognition, Transfer learning

Cites: 38 (see at Google Scholar)

PDF
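As a minimal illustration of why joint modeling helps, the snippet below sketches a simple late-fusion baseline in Python; it assumes per-clip class probabilities from separately trained audio and video classifiers and is not the system evaluated in the paper.

```python
import numpy as np

def late_fusion(audio_probs, video_probs, w_audio=0.5):
    """Weighted late fusion of per-clip scene probabilities from two modalities.

    audio_probs, video_probs : (n_clips, n_classes) arrays of class probabilities
    produced by separately trained audio and video classifiers (placeholders here).
    """
    fused = w_audio * audio_probs + (1.0 - w_audio) * video_probs
    return fused.argmax(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.dirichlet(np.ones(10), size=8)   # fake audio-branch posteriors
    v = rng.dirichlet(np.ones(10), size=8)   # fake video-branch posteriors
    print(late_fusion(a, v))
```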

Active Learning for Sound Event Classification and Detection

Zhao Shuyang
Doctoral project advisor
2017 — 2020

Between 2017 and 2020, I had the opportunity to serve as an advisor for the doctoral research of Zhao Shuyang, which focused on developing active learning strategies for sound event classification and detection. This collaboration addressed a critical challenge in machine listening: how to efficiently train high-performing models with minimal labeled data, an issue particularly relevant in large-scale real-world audio environments.

  • The research contributed to the growing body of work on data-efficient learning in audio analysis and has influenced subsequent studies in semi-supervised learning, domain adaptation, and interactive machine learning. The proposed methods offer practical value for applications where labeled data is scarce or expensive to obtain, such as urban sound monitoring, wildlife acoustics, and smart city infrastructure.
  • The research introduced clustering-based and committee-based sample selection methods that significantly improved the efficiency of training sound event classifiers by prioritizing the most informative unlabeled samples.
  • A novel active learning framework for polyphonic sound event detection was proposed and validated, demonstrating that model performance could be substantially improved with fewer labeled examples, reducing annotation costs without compromising accuracy.
  • The work also explored heterogeneous data sources for training vocal mode classifiers, highlighting the potential of active learning in scenarios with domain shifts and limited supervision.
  • These methods were evaluated on benchmark datasets and presented at leading conferences, including ICASSP, IWAENC, and WASPAA, culminating in a peer-reviewed journal article in IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Clustering Analysis and Active Learning for Sound Event Detection and Classification

Abstract

The objective of the thesis is to develop techniques that optimize the performances of sound event detection and classification systems at minimal supervision cost. The state-of-the-art sound event detection and classification systems use acoustic models developed using machine learning techniques. The training of acoustic models typically relies on a large amount of labeled audio data. Manually assigning labels to audio data is often the most time-consuming part in a model development process. Unlabeled data is abundant in many practical cases, but the amount of annotations that can be made is limited. Thus, the practical problem is optimizing the accuracies of acoustic models with a limited amount of annotations. In this thesis, we started with the idea of clustering unlabeled audio data. Clustering results can be used to derive propagated labels from a single label assignment; meanwhile, clustering itself does not require labeled data. Based on this idea, an active learning method was proposed and evaluated for sound classification. In the experiments, the proposed active learning method based on k-medoids clustering outperformed reference methods based on random sampling and uncertainty sampling. In order to optimize the sample selection after annotating the k medoids, mismatch-first farthest-traversal was proposed. The active learning performances were further improved according to the experimental results. The active learning method proposed for sound classification was extended to sound event detection. Sound segments were generated based on change point detection within each recording. The sound segments were selected for annotation based on mismatch-first farthest-traversal. During the training of acoustic models, each recording was used as an input of a recurrent convolutional neural network. The training loss was derived from frames corresponding to only annotated segments. In the experiments on a dataset where sound events are rare, the proposed active learning method required annotating only 2% of the training data to achieve similar accuracy, with respect to annotating all the training data. In addition to active learning, we investigated using cluster analysis to group recordings with similar recording conditions. Feature normalization according to cluster statistics was used to bridge the distribution shift due to mismatched recording conditions. The achieved performance clearly outperformed feature normalization based on global statistics and statistics per recording. The proposed active learning methods enable efficient labeling on large-scale audio datasets, potentially saving a large amount of annotation effort in the development of acoustic models. In addition, core ideas behind the proposed methods are generic and they can be extended to other problems such as natural language processing, as is investigated in [8].

Cites: 1 (see at Google Scholar)

PDF

Active Learning for Sound Event Detection

Abstract

This paper proposes an active learning system for sound event detection (SED). It aims at maximizing the accuracy of a learned SED model with limited annotation effort. The proposed system analyzes an initially unlabeled audio dataset, from which it selects sound segments for manual annotation. The candidate segments are generated based on a proposed change point detection approach, and the selection is based on the principle of mismatch-first farthest-traversal. During the training of SED models, recordings are used as training inputs, preserving the long-term context for annotated segments. The proposed system clearly outperforms reference methods in the two datasets used for evaluation (TUT Rare Sound 2017 and TAU Spatial Sound 2019). Training with recordings as context outperforms training with only annotated segments. Mismatch-first farthest-traversal outperforms reference sample selection methods based on random sampling and uncertainty sampling. Remarkably, the required annotation effort can be greatly reduced on the dataset where target sound events are rare: by annotating only 2% of the training data, the achieved SED performance is similar to annotating all the training data.

Cites: 43 (see at Google Scholar)
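The segment selection principle can be sketched compactly. The Python function below is an illustrative approximation, assuming precomputed segment embeddings and predictions from two classifiers; the exact distance measure and bookkeeping in the published system may differ.

```python
import numpy as np

def mismatch_first_farthest_traversal(embeddings, preds_a, preds_b, annotated_idx):
    """Order candidate segments for annotation (illustrative sketch).

    embeddings    : (N, D) segment embeddings
    preds_a/b     : (N,) labels predicted by two different classifiers
    annotated_idx : indices of segments that are already annotated

    Segments where the two classifiers disagree come first; within that group,
    segments farther from the nearest already-annotated segment come earlier.
    """
    annotated = embeddings[list(annotated_idx)]
    # Distance of every segment to its nearest annotated segment
    d = np.linalg.norm(embeddings[:, None, :] - annotated[None, :, :], axis=-1).min(axis=1)
    mismatch = (preds_a != preds_b).astype(int)
    order = sorted(
        (i for i in range(len(embeddings)) if i not in set(annotated_idx)),
        key=lambda i: (-mismatch[i], -d[i]),  # mismatched first, then farthest
    )
    return order
```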

An Active Learning Method Using Clustering and Committee-Based Sample Selection for Sound Event Classification

Abstract

This paper proposes an active learning method to control a labeling process for efficient annotation of acoustic training material, which is used for training sound event classifiers. The proposed method performs K-medoids clustering over an initially unlabeled dataset, and medoids, as local representatives, are presented to an annotator for manual annotation. The annotated label on a medoid propagates to other samples in its cluster for label prediction. After annotating the medoids, the annotation continues to the unexamined sounds with mismatched prediction results from two classifiers, a nearest-neighbor classifier and a model-based classifier, both trained with annotated data. The annotation on the segments with mismatched predictions are ordered by the distance to the nearest annotated sample, farthest first. The evaluation is made on a public environmental sound dataset. The labels obtained through a labeling process controlled by the proposed method are used to train a classifier, using supervised learning. Only 20% of the data needs to be manually annotated with the proposed method, to achieve the accuracy with all the data annotated. In addition, the proposed method clearly outperforms other active learning algorithms proposed for sound event classification through all the experiments, simulating varying fraction of data that is manually labeled.

Keywords

active learning;K-medoids clustering;committee-based sample selection;sound event classification

Cites: 17 (see at Google Scholar)

PDF
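A rough sketch of the clustering-plus-propagation step is given below. Note the simplification: proper k-medoids is approximated here with k-means centroids and their nearest samples, and `annotate` is a stand-in for the human annotator.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

def medoid_based_annotation_round(X, k, annotate):
    """One round of cluster-based active learning (illustrative approximation).

    X        : (N, D) features of unlabeled sound segments
    k        : number of clusters / annotation budget for this round
    annotate : callable index -> label, simulating the human annotator
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Sample closest to each centroid acts as a medoid-like representative.
    medoid_idx = pairwise_distances_argmin(km.cluster_centers_, X)
    propagated = np.empty(len(X), dtype=int)
    for cluster, idx in enumerate(medoid_idx):
        label = annotate(idx)                      # manual label for the representative
        propagated[km.labels_ == cluster] = label  # propagate it to the cluster members
    return medoid_idx, propagated
```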

Learning Vocal Mode Classifiers from Heterogeneous Data Sources

Abstract

This paper targets on a generalized vocal mode classifier (speech/singing) that works on audio data from an arbitrary data source. However, previous studies on sound classification are commonly based on cross-validation using a single dataset, without considering the cases that training and testing data are recorded in mismatched condition. Experiments revealed a big difference between homogeneous recognition scenario and heterogeneous recognition scenario, using a new dataset TUT-vocal-2016. In the homogeneous recognition scenario, the classification accuracy using cross-validation on TUT-vocal-2016 was 95.5%. In heterogeneous recognition scenario, seven existing datasets were used as training material and TUT-vocal-2016 was used for testing, the classification accuracy was only 69.6%. Several feature normalization methods were tested to improve the performance in heterogeneous recognition scenario. The best performance (96.8%) was obtained using the proposed subdataset-wise normalization.

Cites: 1 (see at Google Scholar)

Active Learning for Sound Event Classification by Clustering Unlabeled Data

Abstract

This paper proposes a novel active learning method to save annotation effort when preparing material to train sound event classifiers. K-medoids clustering is performed on unlabeled sound segments, and medoids of clusters are presented to annotators for labeling. The annotated label for a medoid is used to derive predicted labels for other cluster members. The obtained labels are used to build a classifier using supervised training. The accuracy of the resulted classifier is used to evaluate the performance of the proposed method. The evaluation made on a public environmental sound dataset shows that the proposed method outperforms reference methods (random sampling, certainty-based active learning and semi-supervised learning) with all simulated labeling budgets, the number of available labeling responses. Through all the experiments, the proposed method saves 50%-60% labeling budget to achieve the same accuracy, with respect to the best reference method.

Keywords

active learning, sound event classification, K-medoids clustering

Cites: 57 (see at Google Scholar)

Deep Neural Networks for Sound Event Detection

Emre Cakir
Doctoral project advisor
2015 — 2019

I had the opportunity to serve as an advisor for Emre Cakir’s doctoral research, which focused on advancing deep learning methods for sound event detection. The project explored the use of convolutional and recurrent neural networks to model complex acoustic environments, enabling machines to detect and classify overlapping sound events in real-world audio recordings. In this project, Emre worked on developing novel deep learning architectures for machine listening, optimizing training strategies, and evaluating performance on large-scale datasets. The research contributed significantly to the DCASE community and laid the groundwork for several widely cited publications in the field.

  • The CRNN-based approach for sound event detection, introduced in this research, has become a benchmark architecture within the DCASE community and has been widely adopted in both academic and applied contexts. With over 700 citations, the method continues to influence the development of modern machine listening systems. Its impact was further recognized with the IEEE Signal Processing Society Best Paper Award in 2024, underscoring its significance in the field of computational audio analysis.
  • The research demonstrated that multi-label classification frameworks are more effective than combined single-label approaches for modeling overlapping sound events, resulting in significant improvements in detection accuracy.
  • The proposed sound event detection methods were validated on diverse real-world datasets, making this one of the first studies to systematically evaluate performance across a wide range of acoustic scenes and contexts. The approaches achieved state-of-the-art results in polyphonic sound event detection tasks.

Deep Neural Networks for Sound Event Detection

Abstract

The objective of this thesis is to develop novel classification and feature learning techniques for the task of sound event detection (SED) in real-world environments. Throughout their lives, humans experience a consistent learning process on how to assign meanings to sounds. Thanks to this, most of the humans can easily recognize the sound of a thunder, dog bark, door bell, bird singing etc. In this work, we aim to develop systems that can automatically detect the sound events commonly present in our daily lives. Such systems can be utilized in e.g. context-aware devices, acoustic surveillance, bio-acoustical and healthcare monitoring, and smart-home cities. In this thesis, we propose to apply the modern machine learning methods called deep learning for SED. The relationship between the commonly used time-frequency representations for SED (such as mel spectrogram and magnitude spectrogram) and the target sound event labels is highly complex. Deep learning methods such as deep neural networks (DNN) utilize a layered structure of units to extract features from the given sound representation input with increased abstraction at each layer. This increases the network’s capacity to efficiently learn the highly complex relationship between the sound representation and the target sound event labels. We found that the proposed DNN approach performs significantly better than the established classifier techniques for SED such as Gaussian mixture models. In a time-frequency representation of an audio recording, a sound event can often be recognized as a distinct pattern that may exhibit shifts in both dimensions. The intra-class variability of the sound events may cause small shifts in the frequency domain content, and the time domain shift results from the fact that a sound event can occur at any time for a given audio recording. We found that convolutional neural networks (CNN) are useful to learn shift-invariant filters that are essential for robust modeling of sound events. In addition, we show that recurrent neural networks (RNN) are effective in modeling the long-term temporal characteristics of the sound events. Finally, we combine the convolutional and recurrent layers in a single classifier called convolutional recurrent neural networks (CRNN), which emphasizes the benefits of both and provides state-of-the-art results in multiple SED benchmark datasets. Aside from learning the mappings between the time-frequency representations and the sound event labels, we show that deep learning methods can also be utilized to learn a direct mapping between the target labels and a lower level representation such as the magnitude spectrogram or even the raw audio signals. In this thesis, the feature learning capabilities of the deep learning methods and the empirical knowledge on the human auditory perception are proposed to be integrated through the means of layer weight initialization with filterbank coefficients. This results in an optimal, ad-hoc filterbank that is obtained through gradient based optimization of the original coefficients to improve the SED performance.

Cites: 12 (see at Google Scholar)

PDF

Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

Abstract

Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNN) are able to extract higher level features that are invariant to local spectral and temporal variations. Recurrent neural networks (RNNs) are powerful in learning the longer term temporal context in the audio signals. CNNs and RNNs as classifiers have recently shown improved performances over established methods in various sound recognition tasks. We combine these two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it on a polyphonic sound event detection task. We compare the performance of the proposed CRNN method with CNN, RNN, and other established methods, and observe a considerable improvement for four different datasets consisting of everyday sound events.

Cites: 784 (see at Google Scholar)
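For readers unfamiliar with the architecture, a compact Keras sketch of a CRNN for polyphonic sound event detection is shown below. Layer counts and sizes are placeholders rather than the published configuration; the essential ingredients are convolution and pooling along frequency, a recurrent layer over time, and frame-wise sigmoid outputs so that several events can be active at once.

```python
from tensorflow.keras import layers, models

def build_crnn(n_frames=500, n_mels=40, n_classes=6):
    """CRNN for frame-level, multi-label sound event detection (sketch)."""
    inp = layers.Input(shape=(n_frames, n_mels, 1))         # (time, mel, channel)
    x = inp
    for n_filters in (64, 64, 64):
        x = layers.Conv2D(n_filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(1, 2))(x)         # pool only along frequency
    # Collapse the remaining frequency axis, keep the time axis for the RNN
    x = layers.Reshape((n_frames, -1))(x)
    x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
    # Sigmoid outputs: several events may be active in the same frame
    out = layers.TimeDistributed(layers.Dense(n_classes, activation="sigmoid"))(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

model = build_crnn()
model.summary()
```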

Domestic Audio Tagging with Convolutional Neural Networks

Abstract

In this paper, the method used in our submission for DCASE2016 challenge task 4 (domestic audio tagging) is described. The use of convolutional neural networks (CNN) to label the audio signals recorded in a domestic (home) environment is investigated. A relative 23.8% improvement over the Gaussian mixture model (GMM) baseline method is observed over the development dataset for the challenge.

Cites: 31 (see at Google Scholar)

PDF

Multi-Label vs. Combined Single-Label Sound Event Detection With Deep Neural Networks

Abstract

In real-life audio scenes, many sound events from different sources are simultaneously active, which makes the automatic sound event detection challenging. In this paper, we compare two different deep learning methods for the detection of environmental sound events: combined single-label classification and multi-label classification. We investigate the accuracy of both methods on the audio with different levels of polyphony. Multi-label classification achieves an overall 62.8% accuracy, whereas combined single-label classification achieves a very close 61.9% accuracy. The latter approach offers more flexibility on real-world applications by gathering the relevant group of sound events in a single classifier with various combinations.

Cites: 65 (see at Google Scholar)

Polyphonic Sound Event Detection Using Multi Label Deep Neural Networks

Abstract

In this paper, the use of multi label neural networks are proposed for detection of temporally overlapping sound events in realistic environments. Real-life sound recordings typically have many overlapping sound events, making it hard to recognize each event with the standard sound event detection methods. Frame-wise spectral-domain features are used as inputs to train a deep neural network for multi label classification in this work. The model is evaluated with recordings from realistic everyday environments and the obtained overall accuracy is 58.9%. The method is compared against a state-of-the-art method using non-negative matrix factorization as a pre-processing stage and hidden Markov models as a classifier. The proposed method improves the accuracy by 19% percentage points overall.

Cites: 390 (see at Google Scholar)

PDF

Supervised Research Projects

Over the years, I have co-supervised a wide range of research projects at both the Bachelor's and Master's levels. These projects have explored diverse topics in audio signal processing, machine learning, and audio content analysis, often contributing to real-world applications and advancing academic knowledge.

Below are some of the most recent Master's thesis projects I have supervised:

Projects

A complete list of supervised projects and related publications is shown below.

Projects: 19 (Master's theses: 8, projects: 11)

2025

Compensation of Loudspeaker Nonlinearities with Deep Neural Networks

Abstract

Loudspeakers generate sound waves from electrical audio signals. They are inherently nonlinear as their performance varies when using small and high amplitude signals. Small loudspeakers or micro speakers, found in consumer electronics, are particularly sensitive to sound degrading nonlinearities or distortion at high playback volumes. This thesis proposes a compensation method based on deep neural networks (DNNs). A DNN based compensation model is trained to pre-compensate the loudspeaker input to reduce distortion. The compensation model learns to modify the input signal of a DNN based nonlinear loudspeaker model such that the nonlinear model output minimizes error to the output of a linear regression loudspeaker model. Both loudspeaker models are trained on monaural data recorded from a laptop micro speaker. The compensation model successfully reduces nonlinearities in simulation. Practical experiments, where compensated audio is played and recorded from the laptop speaker, show that the amount of nonlinearities is decreased. Informal listening of the recordings suggests that the compensation slightly alters some elements of the loudspeaker output sound.

PDF
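The training setup described in the abstract can be sketched as follows. Everything here is a toy stand-in: small dense networks play the roles of the pretrained nonlinear and linear loudspeaker models, random frames play the role of the recorded audio, and only the compensation model is trainable.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

frame_len = 256  # hypothetical frame length of the monaural input

def small_mlp(name):
    return models.Sequential([
        layers.Input(shape=(frame_len,)),
        layers.Dense(512, activation="tanh"),
        layers.Dense(frame_len),
    ], name=name)

# Stand-ins for the pretrained models described in the abstract: a nonlinear
# DNN loudspeaker model and a linear regression loudspeaker model. In practice
# both would already be trained on recordings of the micro speaker.
nonlinear_speaker = small_mlp("nonlinear_speaker_model")
linear_speaker = models.Sequential(
    [layers.Input(shape=(frame_len,)), layers.Dense(frame_len, use_bias=False)],
    name="linear_speaker_model")
nonlinear_speaker.trainable = False
linear_speaker.trainable = False

# Compensation model: learns to pre-distort the input so that the nonlinear
# speaker model reproduces the output of the (distortion-free) linear model.
compensator = small_mlp("compensation_model")

x_in = layers.Input(shape=(frame_len,))
y_nonlinear = nonlinear_speaker(compensator(x_in))
trainer = models.Model(x_in, y_nonlinear)
trainer.compile(optimizer="adam", loss="mse")

# Target = what the idealized linear loudspeaker would output for the same input.
x = tf.random.normal((1024, frame_len))
trainer.fit(x, linear_speaker(x), epochs=2, batch_size=64, verbose=0)
```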

2023

Enhancing Domain-Specific Automated Audio Captioning: a Study on Adaptation Techniques and Transfer Learning

Abstract

Automated audio captioning is a challenging cross-modal task that takes an audio sample as input to analyze it and generate its caption in natural language as output. The existing datasets for audio captioning such as AudioCaps and Clotho encompass a diverse range of domains, with current proposed systems primarily focusing on generic audio captioning. This thesis delves into the adaptation of generic audio captioning systems to domain-specific contexts, simultaneously aiming to enhance generic audio captioning performance. The adaptation of the generic models to specific domains has been explored using two different techniques: complete fine-tuning of neural model layers and layer-wise fine-tuning within transformers. The process involves initial training with a generic captioning setup, followed by adaptation using domain-specific training data. In generic captioning, the process for training starts with training the model on the AudioCaps dataset followed by fine-tuning it using the Clotho dataset. This is accomplished through the utilization of a transformer-based architecture, which integrates a patchout fast spectrogram transformer (PaSST) for audio embeddings and a BART transformer. Word embeddings are generated using a byte-pair encoding (BPE) tokenizer tailored to the training datasets’ unique words, aligning the vocabulary with the generic captioning task. Experimental adaptation mainly focuses on audio clips related to animals and vehicles. The results demonstrate notable improvements in the performance of the generic and domain adaptation systems. Generic captioning has demonstrated an improvement in SPIDEr scores, increasing from 0.291 during fine-tuning to 0.301 with layer-wise fine-tuning. Specifically, we observed a notable increase in SPIDEr scores, from 0.315 to 0.323 for animal-related audio clips and from 0.298 to 0.308 for vehicle-related audio clips.

PDF
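Layer-wise fine-tuning itself is a small mechanism. The sketch below shows the idea on a generic toy Keras model (not PaSST or BART): freeze everything, then unfreeze only the topmost blocks before fine-tuning on the domain-specific data.

```python
from tensorflow.keras import layers, models

def build_toy_captioner(n_blocks=6, d_model=128):
    """Generic stand-in for a pretrained captioning encoder (not PaSST/BART)."""
    inp = layers.Input(shape=(100, d_model))
    x = inp
    for i in range(n_blocks):
        x = layers.Dense(d_model, activation="relu", name=f"block_{i}")(x)
    out = layers.Dense(d_model, name="output_head")(x)
    return models.Model(inp, out)

def set_layerwise_finetuning(model, n_trainable_blocks):
    """Freeze everything, then unfreeze only the last `n_trainable_blocks` blocks.

    Layer-wise fine-tuning adapts the most task-specific (topmost) layers on the
    domain-specific data while keeping the generic lower layers fixed.
    """
    for layer in model.layers:
        layer.trainable = False
    block_layers = [lyr for lyr in model.layers if lyr.name.startswith("block_")]
    for layer in block_layers[-n_trainable_blocks:]:
        layer.trainable = True
    model.get_layer("output_head").trainable = True
    return model

model = build_toy_captioner()
model = set_layerwise_finetuning(model, n_trainable_blocks=2)
# model.compile(...) and model.fit(domain_specific_data, ...) would follow here.
```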

2020

Real-Time Sound Event Detection With Python

Abstract

Python is a popular programming language for rapid research prototyping in various research fields, owing to the massive repository of well-maintained 3rd party packages, built-in capabilities of the language and strong community. This work investigates the feasibility of Python for the task of performing sound event detection (SED) in real-time, which is important in demonstrating project research results to any interested parties or utilising it for practical purposes such as acoustic health care monitoring, e.g. in attempts to reduce the transmission of the COVID-19 disease. The relevant background theory for detecting sound events based on pre-determined sound recordings is first provided, which is followed by an introduction to the basic concepts that enable performing the same in real-time. Then, Python real-time system designs based on two related approaches are proposed and their feasibility is also evaluated with the help of corresponding reference system implementations. The results acquired with the implementations strongly suggest that Python is indeed very feasible for performing real-time SED, even when using a sophisticated model that possesses 3.7M total parameters.

PDF
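A minimal real-time loop of the kind discussed in the thesis might look like the following, using the `sounddevice` package; the energy-threshold `detect_events` function is only a placeholder for a trained SED model.

```python
import queue

import numpy as np
import sounddevice as sd

SR = 16000
BLOCK = 1024          # samples per audio callback
audio_q = queue.Queue()

def audio_callback(indata, frames, time_info, status):
    """Runs in the audio thread: only copy the block and hand it over."""
    audio_q.put(indata[:, 0].copy())

def detect_events(frame):
    """Placeholder for a trained SED model; returns True on high energy."""
    return float(np.sqrt(np.mean(frame ** 2))) > 0.05

with sd.InputStream(samplerate=SR, channels=1, blocksize=BLOCK,
                    callback=audio_callback):
    print("Listening... Ctrl-C to stop")
    try:
        while True:
            block = audio_q.get()          # blocks until a new frame arrives
            if detect_events(block):
                print("event detected")
    except KeyboardInterrupt:
        pass
```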

Sound Based Classification of Studded Tires: Automatic Tire Classification System

Abstract

The use of studded tires causes rutting of asphalt pavements and generates street dust to the environment. The maintenance of paved roads and cleaning of street dust requires resources and causes health risks. These effects are notable especially in spring time when the snow and ice has melted away from road surfaces. In order to predict these phenomena, the number of vehicles using studded tires should be measured continuously. Previously the estimations about the proportions of winter and summer tires have been created based on figures provided by car service companies that offer tire changing services. Occasional hearing based roadside sample surveys have also been made. Unlike the statistics from car service companies, hearing based data collection methods provide location and time specific information about the use of studded tires. Hearing based data collection is a difficult and labour-consuming task and it has not been applied widely. The purpose of this thesis was to find out if an automatic tire classification system could be implemented to collect data about the use of studded tires. A dataset of in-road audio recordings was exploited in the study. The dataset was collected from two measurement sites by using contact microphones under the road pavement. The measuring points were placed next to automatic traffic measurement stations that are used by Finnish Transport Infrastructure Agency in data collection purposes. Digital signal processing and machine learning was applied in the designing of the tire classification system. A passenger car detector was implemented to restrict the classification only for tires of passenger cars and to determine the exact bypass times of detected vehicles. Feature extraction from the audio data was done according to modeling of the human auditory system. Two versions of the tire classifier were designed, one based on support vector machine and the other on multilayer perceptron. The dataset was annotated by labelling the recordings with the information about the vehicle class and the tire type used in the vehicle. The recordings of passenger cars were used in the training and testing of the classifier-models. The split of data into a training set and test set was done according to recording locations, meaning that data from one location was named as the training set while the remaining data from the other location was used as test set. This way the generalization of the system could be verified as the classifier-models could not learn the recording location-specific factors of the test set during the training. A comparison of the two classifier models was made according to the results of the experiments that were carried out with the test set. The results of the experiments prove that automatic and instant tire classification is possible with the proposed methods. Both the passenger car detector and the tire classifier performed well in the experiments by scoring about 95% test accuracy. The differences between the results of the classifier models were small. The results imply that the system is able to generalize its knowledge from one recording environment to another without being explicitly trained to do so. However, due to the small amount of measurement sites used in the experiments, it is impossible to make reliable conclusions about general adaptivity of the system without further research. In order to improve the performance and reliability of the system, more data from new measurement sites should be collected in the follow-up research.

PDF

2019

Environmental sound recognition and prototype game design

Abstract

This project consists of creating a game using environmental sound recognition. The basic idea of the game is an escape room: the player will have to solve a series of enigmas by finding the right sounds to make in order to get out of the room. We will train a machine learning model to recognize the sounds used in the game. The dataset will consist of objects and human-made sounds. The data will be retrieved from existing datasets or created by us in case we lack available resources. The sound recognizer model will be made in Python and the game with Unity.

Clients

Toni Heittola, and Tuomas Virtanen

Synthetic generation of environmental audio learning examples for neural networks

Abstract

Deep neural network methods need to have a wide range of various training examples in order to train a classifier for predicting different classes. Also, big dataset makes the classifier learn about different conditions which results in better generalization. In audio processing, it is often difficult to find a large dataset. Therefore, data needs to be generated synthetically by mixing audio signals from different sources. In this project, we are going to develop a method using Keras data generator class in which environmental audio sounds are generated for binary classification application. While generating the synthetic sounds, some existing variations in acoustic conditions such as signal-to-noise ratio, acoustic conditions in indoor acoustic scenes, general acoustic conditions in outdoor acoustic scenes along with sound shifts in time and pitch should be considered.

Clients

Toni Heittola, and Tuomas Virtanen

2017

Organizing acoustic scene excerpts into 2D map with t-SNE

Abstract

The aim of this project was to develop a Python program that uses t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize high-dimensional audio scene feature vector data as a 2D map. This can be used to visualize audio scene feature vectors and to see how well the data is separable using the gathered features and the t-SNE method.

Clients

Toni Heittola, and Tuomas Virtanen
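A minimal version of such a map can be produced with scikit-learn and matplotlib, as sketched below; the random placeholder features stand in for the acoustic scene feature vectors used in the project.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder for per-excerpt acoustic feature vectors, e.g. averaged MFCCs.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 60))
scene_labels = rng.integers(0, 5, size=300)

# Embed the high-dimensional features into two dimensions.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], c=scene_labels, cmap="tab10", s=10)
plt.title("Acoustic scene excerpts embedded with t-SNE")
plt.show()
```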

Acoustic scene classification on Android platform

Abstract

The project implemented a neural network classifier on Android. The classifier used TensorFlow as the backend for managing the classification flow. The classifier was trained to classify auditory scenes from extracted features. The client application was implemented using the Kotlin programming language and requires Android 7.1 to operate.

Clients

Toni Heittola, and Tuomas Virtanen

Animal onomatopoeia game

Abstract

The project is part of the course SGN-81006 Signal Processing Innovation Project, and the topic is given to us by our clients, researcher Toni Heittola and associate professor Tuomas Virtanen, working in the Audio Research Group at Tampere University of Technology, Laboratory of Signal Processing. In the project we gathered a small data set of animal onomatopoeias and trained a simple classifier using that data set. The classifier was then used for controlling a simple game where the player guides animals by imitating the sounds they make.

Clients

Toni Heittola, and Tuomas Virtanen

2014

Real-Time Audio Analysis

Abstract

The application areas of audio analysis have been gaining popularity over the last decades because of their support for numerous industrial products. In conventional approaches, audio analysis algorithms, which are based on pattern recognition, often work in a non-real-time situation. Most non-real-time audio analysis systems are designed for rapid development, readability and maintainability of code, and moreover provide cross-platform functionality, efficient audio data analysis, and low latency. Hence, these requirements allow the same overall development cost and portability while notably improving the performance of a system. Generally, these programs are written using poor programming styles or using programming languages not suitable for real-time applications such as Java. This implies a high difficulty in changing the existing source code according to real-time requirements and hard, tedious work in extending it. This forces research to deal with programming problems instead of speech and audio analysis innovations. In addition to dealing with the prior issues, a real-time audio analysis system also provides a platform to test and research audio analysis algorithms. The purpose of this study is to research APIs that offer a low-latency, high efficiency option for developing a real-time audio analysis system. Basic components of pattern recognition are block framing, windowing, and mel-frequency cepstral coefficients (MFCC). The presented program is implemented in real-time using efficient APIs such as PortAudio and LibXtract. The program uses mel-frequency cepstral coefficients (MFCC) to process the small frame size of an audio signal without loss of audio signal power for improving the performance of the audio analysis system. The audio analysis system can also be used in numerous products; it is not only useful for audio content analysis, audio classification, pattern recognition systems and music information retrieval, but also advantageous from a practical engineering viewpoint for real-time input applications such as automatic sound event detection systems. The system also indicates that it is successfully portable to Linux, Ubuntu, and other major platforms for real-time audio input, which is usually restricted in audio analysis systems based on conventional approaches.

Music Video Analysis Using Signal Processing Tools

Abstract

Visual cuts points in music videos are often aligned with the musical beat, and on the higher level with musical structural change points (e.g. chorus-verse). The idea of this study is to investigate this relation more closely by using automatic video cut point detection and automatic musical structure analysis.

Clients

Toni Heittola, Tuomas Virtanen, Joni Kämäräinen, and Katariina Mahkonen

Real-time sound classification system using Python

Abstract

Python has gained wide popularity in the research community in recent years, and a wide range of pattern recognition related toolboxes is already available for it. The aim of this project was to investigate the possibilities of using Python for acoustic pattern recognition and to develop a system capable of real-time sound classification.

Clients

Toni Heittola, and Tuomas Virtanen

Acoustic context recognition using i-vector

Abstract

The aim of this project was to study i-vector approach for audio context recognition.

Clients

Toni Heittola, and Tuomas Virtanen

2013

Semi-supervised musical instrument recognition

Abstract

The application areas of music information retrieval have been gaining popularity over the last decades. Musical instrument recognition is an example of a specific research topic in the field. In this thesis, semi-supervised learning techniques are explored in the context of musical instrument recognition. The conventional approaches employed for musical instrument recognition rely on annotated data, i.e. example recordings of the target instruments with associated information about the target labels in order to perform training. This implies a highly laborious and tedious work of manually annotating the collected training data. The semi-supervised methods enable incorporating additional unannotated data into training. Such data consists of merely the recordings of the instruments and is therefore significantly easier to acquire. Hence, these methods allow keeping the overall development cost at the same level while notably improving the performance of a system. The implemented musical instrument recognition system utilises the mixture model semi-supervised learning scheme in the form of two EM-based algorithms. Furthermore, upgraded versions, namely, the additional labelled data weighting and class-wise retraining, for the improved performance and convergence criteria in terms of the particular classification scenario are proposed. The evaluation is performed on sets consisting of four and ten instruments and yields the overall average recognition accuracy rates of 95.3 and 68.4%, respectively. These correspond to the absolute gains of 6.1 and 9.7% compared to the initial, purely supervised cases. Additional experiments are conducted in terms of the effects of the proposed modifications, as well as the investigation of the optimal relative labelled dataset size. In general, the obtained performance improvement is quite noteworthy, and future research directions suggest to subsequently investigate the behaviour of the implemented algorithms along with the proposed and further extended approaches.

PDF
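The mixture-model idea behind the thesis can be illustrated with a heavily simplified sketch: one diagonal-covariance Gaussian per class, labeled samples with fixed responsibilities and unlabeled samples with EM-estimated ones. This is not the thesis implementation, just the underlying mechanism.

```python
import numpy as np

def semi_supervised_gaussians(X_lab, y_lab, X_unlab, n_classes, n_iter=20):
    """EM for class-conditional Gaussians using labeled and unlabeled data."""
    X = np.vstack([X_lab, X_unlab])
    resp = np.zeros((len(X), n_classes))
    resp[np.arange(len(X_lab)), y_lab] = 1.0          # fixed for labeled data
    # Initialize class means, variances and priors from the labeled subset
    means = np.array([X_lab[y_lab == c].mean(axis=0) for c in range(n_classes)])
    var = np.array([X_lab[y_lab == c].var(axis=0) + 1e-3 for c in range(n_classes)])
    priors = np.full(n_classes, 1.0 / n_classes)

    for _ in range(n_iter):
        # E-step: posterior class responsibilities (updated only for unlabeled data)
        log_lik = -0.5 * (((X[:, None, :] - means) ** 2) / var
                          + np.log(2 * np.pi * var)).sum(axis=2) + np.log(priors)
        post = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        resp[len(X_lab):] = post[len(X_lab):]
        # M-step: responsibility-weighted updates of priors, means and variances
        nk = resp.sum(axis=0)
        priors = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        var = (resp.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-3
    return means, var, priors
```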

Classification of the Sounds of Footsteps and Person Identification

Abstract

The sound of footsteps contains a wide range of information about the person producing them. Humans are quite often using this information to identify persons in situations without visual contact. For example, they can tell how fast a person is walking, what kind of shoes a person is wearing, how tall a person is, or even the mood of a person. The combination of these features will make the sounds of the footsteps characteristic for certain person. The aim of the project is to study the automatic classification of the sound of footsteps and see how reliably one can do automatic identification of persons based on it.

Clients

Toni Heittola, and Tuomas Virtanen

Organizing a Database of Sound Samples

Abstract

In modern sample-based music production, managing large sample libraries intuitively is a challenging problem. The aim of the project was to study various ways to organize a sample library according to the acoustic properties of its samples.

Clients

Toni Heittola, and Tuomas Virtanen

2012

Automatic Guitar Chord Detection

Abstract

Automatic guitar chord detection is a process that attempts to detect a guitar chord from a piece of audio. Generally, automatic chord detection is considered to be a part of a larger problem termed automatic transcription. Although there has been a lot of research in the field of automatic transcription, having a reliable transcription system is still a distant prospect. Chord detection becomes interesting as chords have comparatively stable structure and they completely describe the occurring harmonies in a piece of music. This thesis presents a novel approach for detecting the correctness of musical chords played by guitar. The approach is based on a pattern matching technique applied to a database of chords and their typical mistakes. Mistakes are the versions of a chord where typical playing errors are made. The transient of a chord is skipped and its spectrum is whitened. A certain region of the whitened spectra is chosen as a feature vector. Cosine distance is computed between the extracted features and the data present in a reference chord database. Finally, the system detects the correctness of a played chord based on a k-Nearest Neighbor (k-NN) classifier. The developed system uses two types of spectral whitening techniques: one is based on Linear Predictive Coding (LPC) and the other is based on Phase Transform-beta (PHAT-beta). The average accuracy shown by the LPC based system is 72% while that of PHAT-beta is 82.5%. The system was also evaluated under different noise conditions.

PDF
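The processing chain (whitening, band selection, cosine-distance k-NN) can be sketched as below. The moving-average whitening is only a rough stand-in for the LPC and PHAT-beta whitening used in the thesis, and the reference database here is random placeholder data.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d
from sklearn.neighbors import KNeighborsClassifier

def whitened_spectrum_feature(audio, sr=44100, n_fft=8192, band=(60, 2000)):
    """Magnitude spectrum flattened by a smoothed spectral envelope (sketch)."""
    spec = np.abs(np.fft.rfft(audio, n=n_fft))
    # Divide out a moving-average envelope as a crude whitening step.
    envelope = uniform_filter1d(spec, size=101) + 1e-9
    white = spec / envelope
    # Keep only a fixed frequency band as the feature vector.
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    sel = (freqs >= band[0]) & (freqs <= band[1])
    feat = white[sel]
    return feat / (np.linalg.norm(feat) + 1e-9)

# Reference database: whitened spectra of correctly played chords and of their
# typical mistakes, labeled 1 (correct) or 0 (mistake). Placeholders below.
rng = np.random.default_rng(0)
ref_feats = [whitened_spectrum_feature(rng.normal(size=44100)) for _ in range(40)]
ref_labels = rng.integers(0, 2, size=40)

knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(np.array(ref_feats), ref_labels)
test_feat = whitened_spectrum_feature(rng.normal(size=44100))
print("chord judged correct:", bool(knn.predict([test_feat])[0]))
```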

Classification of Insects Based on Sound

Abstract

Insect-borne diseases kill a million people and destroy tens of billions of euros worth of crops annually. At the same time, beneficial insects pollinate the majority of crop species, and it has been estimated that approximately one third of all food consumed by humans is directly pollinated by bees alone. If we could inexpensively count and classify insects, we could plan interventions more accurately, thus saving lives in the case of insect vectored disease, and growing more food in the case of insect crop pests. The aim of the project is to classify insects based on the sound they produce while flying.

Clients

Toni Heittola, and Tuomas Virtanen

2010

Parameter Adaptation in Nonlinear Loudspeaker Models

Abstract

A loudspeaker is a device that converts an electric input signal to acoustic output. The most common type of loudspeaker is a moving-coil transducer. The behaviour of a moving-coil transducer can be considered to be linear only when the displacement of the coil-diaphragm assembly is small. When the input signal level rises, nonlinearities start to cause audible distortion. In this thesis we examine a microspeaker, a small loudspeaker used in mobile phones. The electro-mechanical process which converts the electrical signal into sound waves is explained. Based on this, we present a continuous-time, linear model of a loudspeaker mounted in a closed box. The model describes the loudspeaker's small-signal behaviour using only a few parameters. We then consider the main sources of nonlinearities and how to model them. Two major sources of nonlinearities are added to the continuous-time model. Then transformations from continuous-time models to discrete-time models are considered. The nonlinear model is converted to discrete-time while taking into account the properties of the microspeaker. The main purpose of this thesis is to study the performance of an algorithm that finds the parameter values of the nonlinear loudspeaker model. The performance of the algorithm is compared to the performance of an earlier algorithm for the linear loudspeaker model. The parameter values are found and changes in them are tracked using an adaptive signal processing method called system identification. The parameter values are updated using the LMS algorithm. Since the discrete-time mechanical model of the microspeaker is based on a recursive filter, an LMS algorithm for recursive filters is presented. We also review previous research related to parameter identification in linear and nonlinear loudspeaker models. Based on the results from the experiments, the studied algorithm is deemed to be still incomplete. Linear parameters adapt in general quickly whereas the nonlinear parameters adapt too slowly and sometimes erroneously. The difference between the output predicted by the nonlinear loudspeaker model and the actual output of the loudspeaker (prediction error) is too high, meaning the parameters do not adapt to their true values. The model is also prone to instability. The algorithm requires further development regarding adaptation speed and prevention of instability. Other development considering initial parameter values and operation during silent moments should also be conducted in the future.
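For readers unfamiliar with LMS adaptation, the toy example below identifies a simple FIR system with the standard LMS update; the recursive, nonlinear loudspeaker model treated in the thesis is considerably more involved.

```python
import numpy as np

def lms_identify(x, d, n_taps=16, mu=0.01):
    """Plain LMS system identification (FIR case, far simpler than the thesis model).

    x : input signal fed to the unknown system (the loudspeaker)
    d : measured output of the unknown system
    Returns the adapted filter coefficients and the prediction error over time.
    """
    w = np.zeros(n_taps)
    err = np.zeros(len(x))
    for n in range(n_taps, len(x)):
        u = x[n - n_taps + 1:n + 1][::-1]   # most recent input samples, newest first
        y = w @ u                           # model's predicted output
        err[n] = d[n] - y                   # prediction error
        w += mu * err[n] * u                # LMS coefficient update
    return w, err

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_h = rng.normal(size=16) * 0.3              # "unknown" system
    x = rng.normal(size=20000)
    d = np.convolve(x, true_h, mode="full")[:len(x)]
    w, err = lms_identify(x, d, n_taps=16, mu=0.01)
    print("final error power:", float(np.mean(err[-1000:] ** 2)))
```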