
Researchers present a self-supervised learning framework for singer identity representations that outperforms speaker verification and wav2vec 2.0 baselines on singing voice, narrowing the gap between speech and singing voice recognition technologies.
Researchers Bernardo Torres, Stefan Lattner, and Gaël Richard have developed a self-supervised learning framework that extracts high-quality singer identity representations. Their paper, "Singer Identity Representation Learning using Self-Supervised Techniques," addresses the gap between speech and singing voice in identity representation learning. It was accepted at ISMIR 2023, held in Milan, Italy.
Traditionally, voice identity models have been trained primarily on speech data, which has led to significant progress in speaker verification and other speech-related tasks. However, these models often struggle with singing voices due to their unique characteristics, such as pitch variations and expressive content. The new framework by Torres et al. leverages self-supervised learning (SSL) techniques specifically tailored for singing voices.
Self-Supervised Learning Techniques: The researchers explored various SSL methods, including contrastive learning and masked prediction, to train the singer identity encoder. These techniques learn representations without explicit singer labels, so training is not limited to annotated recordings.
Data Augmentations: During training, the team applied data augmentations such as pitch shifting, time stretching, and noise addition to ensure that the learned representations are invariant to these variations. This is crucial for capturing the essence of a singer's identity regardless of the specific song or performance.
Architecture: The model architecture consists of a convolutional neural network (CNN) followed by several transformer layers. The CNN extracts low-level features from the raw audio, while the transformers capture higher-level temporal dependencies and context.
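As a rough illustration of the contrastive option mentioned above, the sketch below computes an NT-Xent (InfoNCE) loss over a small batch of paired embeddings. It is not taken from the authors' released code. The function names, the temperature value, and the pairing rule (two augmented segments of the same singer form a positive pair, everything else in the batch is a negative) are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def nt_xent_loss(view_a, view_b, temperature=0.1):
    """NT-Xent loss for a batch of positive pairs (hypothetical helper).

    view_a[i] and view_b[i] are embeddings of two differently augmented
    segments assumed to come from the same singer. All other embeddings
    in the batch serve as negatives for that pair.
    """
    n = len(view_a)
    embeddings = [l2_normalize(v) for v in view_a + view_b]
    total = 0.0
    for i, anchor in enumerate(embeddings):
        positive_index = (i + n) % (2 * n)
        # Similarities to every other embedding, scaled by the temperature.
        logits = [
            dot(anchor, other) / temperature
            for j, other in enumerate(embeddings)
            if j != i
        ]
        positive_logit = dot(anchor, embeddings[positive_index]) / temperature
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - positive_logit
    return total / (2 * n)
```

The loss is low when each embedding sits closer to its augmented counterpart than to any other item in the batch. This is the property that pushes an encoder to ignore pitch, tempo, and noise changes while keeping singers apart.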
The researchers evaluated their framework on multiple tasks and datasets to demonstrate its effectiveness:

Singer Similarity: The model was tested on a task where it had to determine whether two singing segments belong to the same singer. It outperformed both speaker verification models and pre-trained wav2vec 2.0 baselines.
Singer Identification: In this task, the model had to identify the singer from a set of known singers. Again, it surpassed existing methods, showing robust performance even on out-of-domain data.
Out-of-Domain Generalization: The framework's ability to generalize to unseen datasets is particularly noteworthy. It maintained high accuracy on datasets with different genres and recording conditions, highlighting its adaptability.
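The singer similarity task above can be sketched as scoring pairs of embeddings and measuring how well the scores separate same-singer pairs from different-singer pairs. The example below uses cosine similarity and the equal error rate (EER), a common metric for verification tasks. The article does not name the metric or scoring function used in the paper, so both are assumptions here, as are the function names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def equal_error_rate(scores, labels):
    """Approximate EER from pair scores and labels (1 = same singer, 0 = different).

    Every observed score is tried as a decision threshold. The function
    returns the average of the false-acceptance and false-rejection rates
    at the threshold where the two rates are closest.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one same-singer and one different-singer pair")
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(scores)):
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if s >= threshold and y == 0
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if s < threshold and y == 1
        )
        far = false_accepts / negatives
        frr = false_rejects / positives
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

A lower EER means the embeddings separate singers more cleanly. Running the same evaluation on a dataset that was not seen during training gives a simple check of out-of-domain behaviour.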
For practitioners in the field of audio processing and music information retrieval (MIR), this research opens up new possibilities for singing voice applications:
Singing Voice Synthesis: High-quality singer identity representations can be used to create more realistic and expressive synthetic voices.
Music Analysis: The model's ability to generalize across different genres and conditions makes it a valuable tool for music analysis tasks, such as identifying vocalists in large music databases.
Personalization: By accurately capturing singer identity, the framework can enable personalized music experiences, such as recommending songs based on a user's favorite singers.
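For the music analysis use case, one plausible way to identify vocalists in a catalogue is to keep a reference embedding per known singer and rank singers by similarity to a query segment. The sketch below assumes embeddings have already been extracted by some encoder. The catalogue layout, the singer names, and the helper functions are hypothetical and not part of the released code.

```python
import math


def unit(vec):
    """Return the vector scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def centroid(embeddings):
    """Average unit-normalised embeddings into one reference vector per singer."""
    normalised = [unit(e) for e in embeddings]
    dim = len(normalised[0])
    mean = [sum(e[d] for e in normalised) / len(normalised) for d in range(dim)]
    return unit(mean)


def rank_singers(query, catalogue, top_k=3):
    """Rank catalogue singers by cosine similarity to a query embedding.

    catalogue maps a singer name to a list of embeddings from that singer.
    Returns up to top_k (name, similarity) tuples, best match first.
    """
    q = unit(query)
    scored = [
        (name, sum(a * b for a, b in zip(q, centroid(embs))))
        for name, embs in catalogue.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy three-dimensional embeddings, for illustration only.
catalogue = {
    "singer_a": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "singer_b": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
    "singer_c": [[0.0, 0.0, 1.0]],
}
ranked = rank_singers([0.8, 0.2, 0.0], catalogue, top_k=2)
```

The same ranking could feed a recommendation step by looking up songs from the top-ranked singers, although a production system would likely replace the linear scan with an approximate nearest-neighbour index.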
The work by Torres, Lattner, and Richard shows that self-supervised learning combined with data augmentations can produce singer identity representations that outperform speech-trained baselines on singing voice. The authors are releasing their code and trained models to facilitate further research in this area.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
12 January 2024