TangoFlux: Fast and Faithful Text-to-Audio Generation with Flow Matching and CLAP-Ranked Preference Optimization


The Engineer

6 Jan 2025 · 3 min read

TangoFlux combines Flow Matching with CLAP-Ranked Preference Optimization to generate high-quality audio from text in a few seconds, setting a new bar for speed and fidelity in text-to-audio generation.

TangoFlux, a new Text-to-Audio (TTA) generative model developed by researchers from DeCLaRe Lab at the Singapore University of Technology and Design (SUTD), NVIDIA, and Lambda Labs, is making waves in the audio generation community. This 515M-parameter model can generate up to 30 seconds of 44.1 kHz stereo audio in just 3.7 seconds on a single A40 GPU. Its key innovations are Flow Matching and CLAP-Ranked Preference Optimization (CRPO), which together improve both the quality of the generated audio and how closely it follows the text prompt.
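To put these figures in perspective, the short calculation below derives the real-time factor and the size of the output waveform from the numbers quoted above. It is only arithmetic on the reported figures; the function and constant names are illustrative and not part of any TangoFlux API.

```python
# Figures quoted in the article; everything else is derived from them.
AUDIO_SECONDS = 30.0       # maximum clip length
GENERATION_SECONDS = 3.7   # reported wall-clock time on one A40 GPU
SAMPLE_RATE_HZ = 44_100    # 44.1 kHz
CHANNELS = 2               # stereo


def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    return audio_seconds / generation_seconds


def total_sample_values(audio_seconds: float, sample_rate_hz: int, channels: int) -> int:
    """Number of individual sample values in the finished waveform."""
    return int(audio_seconds * sample_rate_hz) * channels


speedup = real_time_factor(AUDIO_SECONDS, GENERATION_SECONDS)
samples = total_sample_values(AUDIO_SECONDS, SAMPLE_RATE_HZ, CHANNELS)
print(f"{speedup:.1f}x faster than real time, {samples:,} sample values")
```

At the reported speed, a full-length clip is generated roughly eight times faster than it takes to play back.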

What Changed Technically?

Flow Matching

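Flow Matching trains the model to predict a velocity field that transports samples from a simple noise distribution to the distribution of real audio. At inference time, the model starts from noise and integrates this field over a limited number of steps, which is how TangoFlux produces high-fidelity audio with low latency.
Because the paths between noise and data are designed to be close to straight lines, flow-based models typically need fewer sampling steps than comparable diffusion models while still capturing the complex distribution of audio signals, leading to more natural and consistent sound generation.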
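The idea can be sketched in a few lines of plain Python. This is a toy illustration of straight-path flow matching on short vectors, not TangoFlux's training code. The convention that t=0 is noise and t=1 is data, the function names, and the "oracle" velocity function standing in for a trained network are all assumptions made for the example.

```python
import random


def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Velocity of the straight path; it does not depend on t."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predicted_velocity, noise, data):
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(noise, data)
    squared_errors = [(p - v) ** 2 for p, v in zip(predicted_velocity, target)]
    return sum(squared_errors) / len(squared_errors)


def euler_sample(velocity_fn, noise, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
data = [0.5, -1.0, 2.0]                          # stand-in for an audio latent
noise = [random.gauss(0.0, 1.0) for _ in data]   # starting point of the flow


def oracle_velocity(x, t):
    """Hypothetical perfect model: returns the true velocity for this pair."""
    return target_velocity(noise, data)


sample = euler_sample(oracle_velocity, noise, num_steps=50)
print(sample)  # lands on `data` up to floating-point error
```

In a real system, a neural network conditioned on the text prompt replaces `oracle_velocity`, and `flow_matching_loss` is averaged over random noise, data, and time samples during training.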

CLAP-Ranked Preference Optimization (CRPO)

CLAP stands for Contrastive Language-Audio Pretraining, a model that learns to align text and audio representations. CRPO leverages this alignment to generate and optimize preference data.
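The process runs iteratively: for each text prompt, the current model generates several candidate audio clips, and a CLAP model scores how well each clip matches the prompt. The CLAP scores, rather than human annotators, rank the candidates, and the ranking yields preference pairs of a preferred and a rejected clip. This ranked data is then used to fine-tune the TTA model so that it produces audio that is better aligned with the prompt and more contextually appropriate.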
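The ranking step can be sketched as follows. This is a minimal illustration, assuming the highest- and lowest-scoring candidates form the preference pair. The embeddings are hand-written stand-ins for what a CLAP text and audio encoder would return, and `build_preference_pair` is a hypothetical helper, not a function from the TangoFlux codebase.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_preference_pair(text_embedding, candidate_embeddings):
    """Rank candidates by similarity to the prompt; return (winner, loser, scores)."""
    scores = {
        name: cosine_similarity(text_embedding, embedding)
        for name, embedding in candidate_embeddings.items()
    }
    winner = max(scores, key=scores.get)   # best match to the prompt
    loser = min(scores, key=scores.get)    # worst match to the prompt
    return winner, loser, scores


# Toy embeddings (assumed): a real pipeline would obtain these from CLAP.
prompt_embedding = [1.0, 0.0]
candidates = {
    "clip_a": [0.9, 0.1],    # closely matches the prompt
    "clip_b": [0.0, 1.0],    # unrelated to the prompt
    "clip_c": [-1.0, 0.0],   # opposite of the prompt
}

winner, loser, scores = build_preference_pair(prompt_embedding, candidates)
print(winner, loser)
```

Each (winner, loser) pair then serves as one training example for preference optimization, and the cycle repeats with the updated model.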

Why It Matters

Performance Benchmarks

TangoFlux outperforms existing models in both objective and subjective benchmarks.
Objective Metrics: The model achieves state-of-the-art performance on automatic metrics such as Fréchet Distance (FD), KL divergence, and CLAP score, which measure audio quality and how closely the audio matches the text prompt.
Subjective Feedback: Human evaluators consistently prefer the audio generated by TangoFlux over other leading TTA models.

Practical Applications

Content Creation: TangoFlux can significantly speed up the process of generating high-quality audio for podcasts, videos, and music production.
Accessibility: The model's fast generation times make it ideal for real-time applications, such as assistive technologies for the visually impaired or real-time language translation.

Salient Features

Speed: TangoFlux can generate up to 30 seconds of 44.1 kHz stereo audio in about 3.7 seconds on a single A40 GPU.
Quality: The model produces high-fidelity audio that closely matches the intended content, thanks to its flow-based architecture and CRPO framework.

Comparative Samples

| Text Description | Stable Audio Open | TANGO 2 | AudioLDM2 | AudioBox | TangoFlux (Ours) |
| --- | --- | --- | --- | --- | --- |
| Melodic human whistling harmonizing with natural birdsong | [audio sample] | [audio sample] | [audio sample] | [audio sample] | [audio sample] |
| A basketball bounces rhythmically on a court, shoes squeak against the floor, and a referee’s whistle cuts through the air. | [audio sample] | [audio sample] | [audio sample] | [audio sample] | [audio sample] |

Conclusion

TangoFlux represents a significant advancement in text-to-audio generation, combining cutting-edge techniques like Flow Matching and CLAP-Ranked Preference Optimization to produce high-quality, contextually aligned audio at unprecedented speeds. With its open-source code and models, researchers and practitioners can further explore and build upon this groundbreaking work.