Fine-Tuning LLMs for Audio Processing: A Step-by-Step Guide with MusicCaps

The Engineer

17 Jan 2024 · 3 min read

Explore the process of fine-tuning large language models for audio processing with minimal reliance on third-party libraries, using MusicCaps and PyTorch, in this detailed guide.

Listening with LLMs: A Practical Approach to Multimodal Capabilities

Posted on Dec 31, 2023

In this article, I’ll walk you through the technical journey of fine-tuning a Large Language Model (LLM) to process audio. The goal? To build and host an LLM that can describe human voices accurately. This is part one of a series in which I share my hands-on experience, keeping third-party libraries to a minimum and writing most of the code directly in PyTorch.

Background

Recently, two notable papers have explored ways to give LLMs audio understanding capabilities:

SALMONN: Towards Generic Hearing Abilities for Large Language Models (arXiv:2310.13289)
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models (arXiv:2311.07919)

Both papers leverage an audio encoder to transform sound into embeddings, which are then fed into LLMs along with text embeddings.

Key Points:

SALMONN: Combined OpenAI’s Whisper encoder with the BEATs encoder, pretrained the combined encoder, and used LoRA (Low-Rank Adaptation) for fine-tuning.
Qwen-Audio: Bootstrapped its audio encoder from OpenAI’s Whisper, pretrained, and then fully fine-tuned the LLM.

These papers provided a solid foundation for adapting cross-domain encoders and integrating them with LLMs. Inspired by these advancements, I embarked on building a minimal viable LLM with audio processing capabilities.

Setup

To get started, I needed a robust base LLM and a suitable dataset, and the whole setup had to run locally on my RTX 3090 GPU. Here’s what I chose:

LLM: Mistral OpenOrca (7B parameters) from Hugging Face.
Audio Encoder: OpenAI’s Whisper.
Dataset: MusicCaps from Google.

Steps:

Model Selection:
- Tested and compared several models on Hugging Face.
- Chose Mistral OpenOrca for its balance of size and performance.
Data Preparation:
- The MusicCaps dataset pairs 10-second clips from YouTube videos with text captions written by musicians.
- Wrote a small script to download and preprocess the audio files from YouTube.
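
A download script of this kind might look roughly like the sketch below. The script actually used is not shown in this article, so everything here is illustrative: the column names (`ytid`, `start_s`, `end_s`, `caption`) are an assumption about the MusicCaps CSV layout, the sample row is made up, and `yt-dlp` is assumed as the downloader. The code only builds the commands; running them is left to `subprocess.run`.

```python
import csv
import io

# Made-up sample row; the column names are an assumption about the MusicCaps CSV.
SAMPLE_CSV = """ytid,start_s,end_s,caption
abc123XYZ_0,30,40,A mellow acoustic guitar melody.
"""


def load_clips(csv_text):
    """Parse the caption file into a list of clip records."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "ytid": row["ytid"],
            "start_s": float(row["start_s"]),
            "end_s": float(row["end_s"]),
            "caption": row["caption"],
        }
        for row in rows
    ]


def build_download_command(ytid, out_dir="audio"):
    """Return a yt-dlp command that saves only the audio track as WAV."""
    return [
        "yt-dlp",
        "-x",                      # extract audio only
        "--audio-format", "wav",   # convert the extracted audio to WAV
        "-o", f"{out_dir}/{ytid}.%(ext)s",
        f"https://www.youtube.com/watch?v={ytid}",
    ]


clips = load_clips(SAMPLE_CSV)
commands = [build_download_command(clip["ytid"]) for clip in clips]
```

Keeping the start and end offsets next to each caption makes it straightforward to cut the captioned segment out of the full download later.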

One Mini Step at a Time

1. Environment Setup

Installed PyTorch and other necessary libraries.
Configured the local environment to ensure everything runs smoothly on the RTX 3090.

2. Data Preprocessing

Downloaded the YouTube videos using the script.
Extracted the captioned audio clips from the videos and paired each clip with its MusicCaps caption.
Converted the audio files into a format suitable for processing by Whisper (16 kHz mono audio).
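
As a rough sketch of the conversion step: Whisper works on 16 kHz mono audio in 30-second windows, so a shorter clip is cut out, resampled, and zero-padded. The helper names, file paths, and the choice of `ffmpeg` for cutting and resampling are assumptions for illustration, not the exact code used here.

```python
SAMPLE_RATE = 16_000                     # Whisper expects 16 kHz audio
CHUNK_SECONDS = 30                       # the encoder consumes 30-second windows
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per window


def build_ffmpeg_command(src, dst, start_s, duration_s=10.0):
    """Return an ffmpeg command that cuts a clip and converts it to 16 kHz mono."""
    return [
        "ffmpeg", "-y",
        "-ss", str(start_s),      # seek to the clip start
        "-t", str(duration_s),    # keep only the captioned segment
        "-i", src,
        "-ac", "1",               # downmix to mono
        "-ar", str(SAMPLE_RATE),  # resample to 16 kHz
        dst,
    ]


def pad_or_trim(samples, length=N_SAMPLES):
    """Zero-pad or truncate a list of samples to exactly `length` entries."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


command = build_ffmpeg_command("audio/full.wav", "audio/clip.wav", start_s=30)
ten_second_clip = [0.1] * (10 * SAMPLE_RATE)
padded = pad_or_trim(ten_second_clip)
```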

3. Model Integration

Loaded Mistral OpenOrca and Whisper into PyTorch.
Integrated the audio encoder (Whisper) with the LLM (Mistral OpenOrca).
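
One way to picture the integration is a small adapter that maps audio-encoder features into the LLM's embedding space and places them in front of the text embeddings. This is a minimal sketch with toy dimensions: the `AudioPrefix` class and the single linear projection are assumptions for illustration, and random tensors stand in for the real Whisper and Mistral outputs.

```python
import torch
import torch.nn as nn


class AudioPrefix(nn.Module):
    """Project audio features to the LLM width and prepend them to text embeddings."""

    def __init__(self, audio_dim, llm_dim):
        super().__init__()
        # Hypothetical adapter: a single linear layer bridging the two widths.
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_features, text_embeds):
        # audio_features: (batch, audio_len, audio_dim)
        # text_embeds:    (batch, text_len, llm_dim)
        audio_embeds = self.proj(audio_features)
        # Concatenate along the sequence dimension: audio tokens first, then text.
        return torch.cat([audio_embeds, text_embeds], dim=1)


torch.manual_seed(0)
adapter = AudioPrefix(audio_dim=32, llm_dim=64)  # toy sizes, not the real models
audio_features = torch.randn(2, 5, 32)           # stand-in for encoder output
text_embeds = torch.randn(2, 7, 64)              # stand-in for LLM token embeddings
inputs_embeds = adapter(audio_features, text_embeds)
```

The combined tensor has one sequence of length `audio_len + text_len`, which can then be passed to the LLM in place of token IDs when the model accepts precomputed input embeddings.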

4. Fine-Tuning

Pretrained the combined model on a subset of MusicCaps data.
Used LoRA for fine-tuning to keep the number of trainable parameters, and with it the memory footprint, small.
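
For readers unfamiliar with LoRA, the idea can be sketched in a few lines of PyTorch: the original weight matrix is frozen and a low-rank update `B @ A` is learned next to it. The `LoRALinear` class, the rank, and the scaling value below are illustrative assumptions, not the configuration used for this project.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear and add a trainable low-rank update."""

    def __init__(self, base, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False  # freeze the original weights
        # A starts small and random, B starts at zero, so the wrapped layer
        # initially behaves exactly like the base layer.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        update = (x @ self.lora_a.T) @ self.lora_b.T
        return self.base(x) + self.scale * update


torch.manual_seed(0)
base = nn.Linear(64, 64)
layer = LoRALinear(base, rank=4)
x = torch.randn(3, 64)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
```

Only `lora_a` and `lora_b` receive gradients, so the optimizer state covers a small fraction of the layer's parameters.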

5. Evaluation

Evaluated the model’s performance by generating descriptions for audio clips.
Compared the generated descriptions with ground truth captions to assess accuracy and coherence.
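
The comparison against ground-truth captions can be made slightly more systematic with a simple overlap score. The token-level F1 below is an assumed stand-in metric chosen for illustration; the article does not specify which measure, if any, was used, and captioning work usually relies on richer metrics.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two captions."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both captions.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("slow piano", "A slow piano melody")
```

A score like this only measures word overlap, so it is best read alongside a manual check of whether the generated description is coherent.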

Implementation Notes

Whisper Integration: Whisper’s encoder converts audio into embeddings. These embeddings are then concatenated with the text embeddings from Mistral OpenOrca to form the LLM’s input sequence.
LoRA Fine-Tuning: LoRA allows for efficient fine-tuning by updating only a small number of parameters, making it ideal for local training on limited resources.
Performance Benchmarks:
- Training Time: Approximately 12 hours on an RTX 3090 for a subset of the MusicCaps dataset.
- Model Size: After LoRA fine-tuning, the combined model was around 5% larger than the original model.
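
To get a feel for why LoRA touches so few parameters, the arithmetic for a single weight matrix is short: a rank-r adapter on a d_in x d_out matrix adds r * (d_in + d_out) parameters. The hidden size and rank below are chosen purely for illustration; the total overhead of a real model depends on the rank and on which layers are adapted, so this sketch does not reproduce the size figure quoted above.

```python
def lora_added_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def relative_overhead(d_in, d_out, rank):
    """Added parameters as a fraction of the frozen weight matrix."""
    return lora_added_params(d_in, d_out, rank) / (d_in * d_out)


hidden = 4096  # illustrative hidden size
added = lora_added_params(hidden, hidden, rank=8)
overhead = relative_overhead(hidden, hidden, rank=8)
print(f"{added} extra parameters, {overhead:.4%} of the original matrix")
```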

Conclusion

This project is just the beginning of my journey into multimodal LLMs. By combining state-of-the-art audio and language models, I’ve taken a significant step towards building an LLM that can describe human voices accurately. The next steps involve scaling