AudioBERT: Enhancing BERT with Auditory Knowledge for Better Language Understanding

The Engineer

16 Sept 2024

Researchers enhance BERT with auditory insights, bridging the gap between language understanding and sound recognition to create AudioBERT, a model that processes linguistic and audio data more effectively.

Recent advancements in natural language processing (NLP) have primarily focused on text-based data, leading to powerful models like BERT. However, these models often lack elementary knowledge about the auditory world, similar to how they sometimes struggle with visual concepts. In a new paper titled "AudioBERT: Audio Knowledge Augmented Language Model," researchers Hyunjong Ok, Suho Yoo, and Jaeho Lee address this gap by introducing a method to augment BERT with auditory knowledge.

What Changed Technically

The core contribution is AudioBERT, a retrieval-based method that injects audio-related knowledge into BERT only when the input calls for it. Here’s how they achieved this:

AuditoryBench Dataset: They created a new dataset called AuditoryBench, which consists of two tasks designed to evaluate a model's auditory knowledge.
- Task 1: Sound Recognition. Identify a sound from a textual description (e.g., "What sound does a dog make?").
- Task 2: Contextual Sound Understanding. Identify the context in which a sound occurs (e.g., "In what scenario might you hear a siren?").
Retrieval-Based Augmentation: The system detects spans of text that require auditory knowledge and uses them to query a retrieval model over audio-related information. The retrieved knowledge is then injected into BERT.
- Detection and Querying: The parts of the input text that relate to sounds or auditory concepts are identified first and used as retrieval queries.
- Injection and Adaptation: The retrieved audio knowledge is integrated into BERT's input, and low-rank adaptation (LoRA) is applied to fine-tune the model for tasks requiring auditory understanding.
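
The detect, query, inject flow can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the authors' implementation: `AUDIO_KB`, `detect_auditory_spans`, `retrieve`, and `augment` are hypothetical names, the keyword lookup stands in for a learned span detector and retrieval model, and the retrieved knowledge is plain text appended after a `[SEP]` marker rather than whatever representation the paper actually injects.

```python
import re

# Hypothetical toy knowledge base (assumption): maps a sound-related word to a
# short textual stand-in for retrieved audio knowledge.
AUDIO_KB = {
    "siren": "loud, wailing, rising and falling pitch",
    "dog": "barking, short and sharp",
    "thunder": "low, rumbling, booming",
}


def detect_auditory_spans(text):
    """Return the knowledge-base keys that occur as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [key for key in AUDIO_KB if key in words]


def retrieve(span):
    """Look up the stored audio knowledge for a detected span."""
    return AUDIO_KB[span]


def augment(text):
    """Append retrieved knowledge to the input when an auditory span is found.

    Returns (model_input, needs_audio). The boolean could be used to decide
    whether the audio-specific adapter should be active for this input.
    """
    spans = detect_auditory_spans(text)
    if not spans:
        return text, False
    knowledge = "; ".join(f"{span}: {retrieve(span)}" for span in spans)
    return f"{text} [SEP] {knowledge}", True
```

For example, `augment("I heard a siren outside.")` returns the sentence with the siren entry appended and a flag of `True`, while a sentence with no sound-related word comes back unchanged with `False`.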

Why It Matters

This work is significant because it addresses a critical shortcoming in current NLP models. Language models trained on text-only datasets often lack the contextual knowledge needed to understand or generate content that involves sound. By augmenting BERT with auditory knowledge, AudioBERT can:

Improve Multimodal Understanding: Better handle tasks that involve both text and sound, such as generating descriptions of audio clips or understanding textual references to sounds.
Enhance Real-World Applications: Improve the performance of NLP models in real-world scenarios where auditory context is important, such as in virtual assistants, chatbots, and content generation systems.

Implementation Details

The researchers provide detailed implementation notes and benchmarks:

Dataset Construction: AuditoryBench was constructed using a combination of crowd-sourced data and existing sound databases. The dataset includes over 10,000 examples across the two tasks.
Retrieval Model: They used a pre-trained retrieval model to efficiently query audio-related information. This model is trained on a large corpus of text and audio descriptions.
Low-Rank Adaptation (LoRA): LoRA was applied to BERT to fine-tune it with the injected audio knowledge. The technique freezes BERT's existing weights and trains a pair of small low-rank matrices whose product is added to them, allowing efficient adaptation without retraining the entire model.
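
The low-rank update behind this technique is easy to illustrate in plain Python. The sketch below is a generic instance of low-rank adaptation, not the paper's training code: the helper names, the rank `r = 8`, the scaling factor `alpha`, and the toy matrix sizes are all assumptions chosen for illustration. The only BERT-specific number is the hidden size of 768 used in the parameter count.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the rank (rows of A)."""
    r = len(A)
    scale = alpha / r
    delta = matmul(B, A)  # same shape as W
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


# Toy example: W stays frozen, only A and B would be trained.
d_out, d_in, r = 4, 6, 2
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the update starts as a no-op

W_eff = lora_effective_weight(W, A, B, alpha=4)

# Parameter budget for one 768 x 768 projection with an assumed rank of 8.
full_params = 768 * 768
adapter_params = lora_param_count(768, 768, 8)
```

Because `B` starts at zero, `W_eff` equals `W` before any training, so the adapted model initially behaves exactly like the original. With the assumed rank of 8, the adapter for a single 768 x 768 projection trains 12,288 parameters instead of 589,824.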

Experimental Results

The experiments conducted by the researchers demonstrate that AudioBERT outperforms baseline models on the AuditoryBench tasks:

Sound Recognition: AudioBERT achieved an accuracy of 85%, compared to 70% for a standard BERT model.
Contextual Sound Understanding: AudioBERT scored 82%, compared to 65% for the baseline.
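
To put these figures in perspective, the short sketch below computes the absolute gain, relative gain, and error reduction from the accuracies quoted above. The `gains` helper and the `results` dictionary are illustrative names, not part of the paper's code, and the percentages are simply copied from this article.

```python
# Accuracies (in percent) as quoted in this article.
results = {
    "Sound Recognition": {"baseline": 70, "audiobert": 85},
    "Contextual Sound Understanding": {"baseline": 65, "audiobert": 82},
}


def gains(baseline, model):
    """Return (absolute gain in points, relative gain in %, error reduction in %)."""
    absolute = model - baseline
    relative = 100 * absolute / baseline
    error_reduction = 100 * absolute / (100 - baseline)
    return absolute, relative, error_reduction


for task, acc in results.items():
    absolute, relative, error_reduction = gains(acc["baseline"], acc["audiobert"])
    print(
        f"{task}: +{absolute} points, "
        f"{relative:.1f}% relative gain, {error_reduction:.1f}% fewer errors"
    )
```

On these figures, the gains are 15 and 17 percentage points, and the error rate is roughly halved on both tasks.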

Conclusion

AudioBERT is a meaningful step toward language models that carry auditory knowledge. By addressing a limitation of text-only training, the approach opens up possibilities for more robust and context-aware NLP applications. The authors have released the dataset and code on GitHub, making it easier for other researchers to build on this work.