
A detailed guide to fine-tuning a large language model for audio processing with minimal reliance on third-party libraries, using MusicCaps and PyTorch.
Posted Dec 31, 2023
In this article, I’ll walk you through the technical journey of fine-tuning a Large Language Model (LLM) to process audio. The goal? To build and host an LLM that can describe human voices accurately. This is part one of a series in which I share my hands-on experience, relying on as few third-party libraries as possible and writing the training code directly in PyTorch.
Recently, two notable papers have explored ways to give LLMs audio understanding capabilities.
Both papers leverage an audio encoder to transform sound into embeddings, which are then fed into LLMs along with text embeddings.
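The general idea of feeding audio embeddings to an LLM alongside text embeddings can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the method of either paper: the `AudioPrefixProjector` class, the `build_inputs` helper, the single linear projection, the choice to place audio before text, and all dimensions are hypothetical. Random tensors stand in for the output of a real audio encoder.

```python
import torch
import torch.nn as nn


class AudioPrefixProjector(nn.Module):
    """Hypothetical adapter: maps audio-encoder features into the LLM embedding space."""

    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # (batch, audio_frames, audio_dim) -> (batch, audio_frames, llm_dim)
        return self.proj(audio_features)


def build_inputs(audio_features, text_token_ids, projector, token_embedding):
    """Concatenate projected audio embeddings with text embeddings along the sequence axis."""
    audio_embeds = projector(audio_features)        # (batch, audio_frames, llm_dim)
    text_embeds = token_embedding(text_token_ids)   # (batch, text_len, llm_dim)
    # Assumption: audio comes first, followed by the text prompt.
    return torch.cat([audio_embeds, text_embeds], dim=1)


torch.manual_seed(0)

# Illustrative sizes only; real encoders and LLMs use their own dimensions.
batch, audio_frames, audio_dim = 2, 50, 128
vocab_size, llm_dim, text_len = 1000, 256, 12

projector = AudioPrefixProjector(audio_dim, llm_dim)
token_embedding = nn.Embedding(vocab_size, llm_dim)  # stand-in for the LLM's embedding table

audio_features = torch.randn(batch, audio_frames, audio_dim)      # stand-in for encoder output
text_token_ids = torch.randint(0, vocab_size, (batch, text_len))  # stand-in for tokenized text

inputs_embeds = build_inputs(audio_features, text_token_ids, projector, token_embedding)
print(inputs_embeds.shape)  # torch.Size([2, 62, 256])
```

The resulting tensor has one row per audio frame plus one per text token, all in the LLM's embedding dimension, so a decoder that accepts precomputed input embeddings can consume it as a single sequence.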
These papers provided a solid foundation for adapting cross-domain encoders and integrating them with LLMs. Inspired by these advancements, I embarked on building a minimal viable LLM with audio processing capabilities.
To get started, I needed a robust base LLM and a suitable dataset, in a combination small enough to train locally on my RTX 3090 GPU. Here’s what I chose:

Model Selection:
Data Preparation:
This project is just the beginning of my journey into multimodal LLMs. By combining state-of-the-art audio and language models, I’ve taken a significant step towards building an LLM that can describe human voices accurately. The next steps involve scaling
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.