Hibiki: A Decoder-Only Model for High-Fidelity Simultaneous Speech-to-Speech Translation

The Engineer

7 Feb 2025

Hibiki, developed by Kyutai, performs speech-to-speech translation in real time, translating chunks of speech as they arrive rather than waiting for the speaker to finish.

Hibiki, a new decoder-only model introduced by researchers at Kyutai, tackles the challenging task of simultaneous speech-to-speech translation. Unlike offline models that wait for the entire source utterance to complete before translating, Hibiki translates speech in real time, chunk by chunk, accumulating just enough context to produce a correct translation. This capability is crucial for applications like live interpretation and real-time communication across languages.

Technical Breakdown

Model Architecture

Decoder-Only Design: Hibiki uses a decoder-only architecture, with no separate encoder that must read the full source utterance before generation starts. Output can therefore be produced while input is still arriving, which suits real-time processing.
Multistream Language Model: The model processes the source speech and target speech streams synchronously. It jointly generates text and audio tokens, enabling both speech-to-text and speech-to-speech translation.
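
To make the multistream idea concrete, here is a minimal Python sketch of a streaming loop. It is an illustration under stated assumptions, not Hibiki's actual implementation: the names Frame, stream_translate, step_fn, and dummy_step are hypothetical, tokens are plain integers, and a stub stands in for the trained model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple

AudioTokens = Tuple[int, ...]


@dataclass(frozen=True)
class Frame:
    """One time step across all streams (hypothetical layout)."""
    source_audio: AudioTokens  # tokens of the incoming speech at this step
    target_text: int           # text token produced at this step
    target_audio: AudioTokens  # audio tokens of the translated speech


# step_fn is an assumed interface: given the history so far and the new
# source tokens, it returns (text_token, target_audio_tokens).
StepFn = Callable[[List[Frame], AudioTokens], Tuple[int, AudioTokens]]


def stream_translate(source_frames: Iterable[AudioTokens],
                     step_fn: StepFn) -> Iterator[Frame]:
    """Consume source audio step by step and emit output at every step."""
    history: List[Frame] = []
    for source_audio in source_frames:
        # The model only sees past frames plus the current source tokens,
        # so it never needs the full utterance before producing output.
        text_token, target_audio = step_fn(history, source_audio)
        frame = Frame(source_audio, text_token, target_audio)
        history.append(frame)
        yield frame


def dummy_step(history: List[Frame],
               source_audio: AudioTokens) -> Tuple[int, AudioTokens]:
    """Stand-in for a trained model: step index as text, echoed audio."""
    return len(history), source_audio
```

Calling list(stream_translate([(1, 2), (3, 4)], dummy_step)) yields one Frame per input step, which is the property that matters here: text and audio outputs are produced in lockstep with the incoming speech.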

Training and Data

Weakly-Supervised Method: To address the challenge of identifying optimal delays for real-time translation, the researchers developed a weakly-supervised method. This method leverages the perplexity scores from an off-the-shelf text translation system to determine the best delay for each word.
Synthetic Data Creation: Using these per-word delays, the researchers build aligned synthetic data in which each translated word appears only after the source context it depends on. This synthetic data is then used to train Hibiki.
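
The following Python sketch shows one plausible way to turn translation-model scores into per-word delays. It is an assumption-laden illustration, not the authors' exact procedure: the rule used here (align each target word with the source position where its log-probability jumps the most) and the names align_target_words, causal_delays, and toy_log_prob are all hypothetical, and a tiny lexicon replaces a real text translation system.

```python
from typing import Callable, List, Sequence

# Assumed scoring interface: log-probability of `word` as the next target
# word, given a source prefix and the target words emitted so far.
LogProbFn = Callable[[Sequence[str], Sequence[str], str], float]


def align_target_words(source: Sequence[str], target: Sequence[str],
                       log_prob: LogProbFn) -> List[int]:
    """For each target word, find the source word that helps it the most."""
    alignment = []
    for i, word in enumerate(target):
        prefix = target[:i]
        # Score the word with 0, 1, ..., len(source) source words visible.
        scores = [log_prob(source[:j], prefix, word)
                  for j in range(len(source) + 1)]
        # gains[j] = improvement from revealing source word j.
        gains = [scores[j + 1] - scores[j] for j in range(len(source))]
        alignment.append(max(range(len(gains)), key=gains.__getitem__))
    return alignment


def causal_delays(alignment: Sequence[int]) -> List[int]:
    """Running maximum: a word cannot be spoken before its predecessors."""
    delays, latest = [], 0
    for position in alignment:
        latest = max(latest, position)
        delays.append(latest)
    return delays


# Toy stand-in for a translation model (hypothetical lexicon).
LEXICON = {"the": "le", "black": "noir", "cat": "chat"}


def toy_log_prob(source_prefix: Sequence[str],
                 target_prefix: Sequence[str], word: str) -> float:
    # Confident only once the matching source word has been heard.
    return -0.1 if LEXICON[word] in source_prefix else -5.0
```

With source ["le", "chat", "noir"] and target ["the", "black", "cat"], "black" aligns to the third source word, so both "black" and "cat" must wait until "noir" has been heard. This reordering case is why a fixed delay would be either too short or needlessly long.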

Inference Process

Adaptive Translation: During inference, Hibiki uses vanilla temperature sampling to perform adaptive simultaneous speech translation. Rather than following a fixed delay, the model adapts how far it lags behind the speaker to the context it has accumulated so far.
Real-Time Compatibility: The simplicity of the decoder-only architecture and the adaptive nature of the inference process make Hibiki compatible with batched translation and even real-time on-device deployment.
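
For readers unfamiliar with the term, temperature sampling divides the model's logits by a temperature before applying softmax and drawing a token. The sketch below is a generic, standard-library illustration of that technique, not code from Hibiki; the function name sample_with_temperature and its parameters are chosen here for illustration.

```python
import math
import random
from typing import Optional, Sequence


def sample_with_temperature(logits: Sequence[float],
                            temperature: float = 1.0,
                            rng: Optional[random.Random] = None) -> int:
    """Draw a token index from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution. Because each step needs only one softmax and one draw, the decoding loop stays simple, which fits the article's point about batched and on-device use.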

Performance Metrics

On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in several key metrics:

Translation Quality: How accurately the meaning of the source speech is carried into the target language.
Speaker Fidelity: How well the translated speech preserves the voice characteristics of the original speaker.
Naturalness: How natural and fluent the generated speech sounds.

Practical Implications

The ability to perform high-fidelity simultaneous speech-to-speech translation opens up a range of applications:

Live Interpretation: Ideal for conferences, meetings, and other live events where real-time translation is essential.
Real-Time Communication: Enhances communication in multilingual environments, such as international businesses or global communities.

Conclusion

Hibiki represents a significant advancement in the field of speech-to-speech translation. Its decoder-only design, multistream processing capabilities, and adaptive inference process make it a powerful tool for real-time translation tasks. The model's performance on French-English translation tasks is particularly noteworthy, demonstrating its potential for practical applications.