FunAudioLLM: Enhancing Voice Interaction with Multilingual Speech and Emotion Recognition

The Engineer

8 Jul 2024

FunAudioLLM combines cutting-edge speech recognition and emotion detection to make voice interactions with AI more natural and expressive, supporting multiple languages and nuanced emotional responses.

FunAudioLLM, a framework developed by Alibaba's Tongyi SpeechTeam, aims to make voice interaction between humans and large language models (LLMs) more natural. It is built around two models: SenseVoice for speech recognition, emotion detection, and audio event recognition; and CosyVoice for speech generation with multilingual support, timbre control, and emotional expression.

Key Features of FunAudioLLM

SenseVoice: Multilingual Speech Recognition and Emotion Detection

High-Precision Recognition: SenseVoice comes in two sizes. SenseVoice-Large supports speech recognition in over 50 languages, while the smaller SenseVoice-Small covers fewer languages but delivers low latency, making it suitable for real-time applications.
Emotion Recognition: The model can accurately detect emotions from spoken words, enhancing the naturalness of voice interactions.
Audio Event Detection: Beyond just recognizing speech, SenseVoice can identify various audio events, which is crucial for context-aware applications.
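
Because recognition, emotion, and audio events come out of one model, downstream code usually has to split a single result into its parts. The sketch below assumes the result arrives as a string of the form `<|lang|><|EMOTION|><|Event|><|flag|>text`, which is the tag layout seen in SenseVoice's public examples; treat the exact layout, the `RichTranscript` type, and the `parse_rich_transcript` helper as illustrative assumptions rather than part of the FunAudioLLM API.

```python
import re
from dataclasses import dataclass

# Assumed tag layout: <|lang|><|EMOTION|><|Event|><|flag|>text
TAG = re.compile(r"<\|([^|]*)\|>")


@dataclass
class RichTranscript:
    language: str
    emotion: str
    event: str
    text: str


def parse_rich_transcript(raw: str) -> RichTranscript:
    """Split a tagged transcript into language, emotion, event, and text."""
    tags = TAG.findall(raw)
    text = TAG.sub("", raw).strip()
    # Pad so that missing tags become empty strings instead of raising.
    language, emotion, event = (tags + ["", "", ""])[:3]
    return RichTranscript(language, emotion, event, text)


# Hypothetical sample string, written by hand for illustration.
sample = "<|en|><|HAPPY|><|Speech|><|withitn|>Thanks, that is great news."
parsed = parse_rich_transcript(sample)
print(parsed.language, parsed.emotion, parsed.event, parsed.text)
```

An application could then route on `parsed.emotion` or `parsed.event`, for example to pick a reply tone or to ignore segments that contain only background music.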

CosyVoice: Advanced Speech Generation

Multilingual Support: CosyVoice generates natural-sounding speech in multiple languages, ensuring a global reach.
Timbre Control: Users can control the timbre of the generated voice, allowing for personalized and diverse outputs.
Emotional Expression: The model can generate speech with specific emotional tones, making interactions more engaging and human-like.
Zero-Shot Generation: CosyVoice supports zero-shot in-context generation: given a short reference recording of a speaker, it can produce new speech in that voice without being fine-tuned on that speaker.
Cross-Lingual Voice Cloning: It can clone voices across different languages, maintaining the speaker's unique characteristics.
Instruction-Following Capabilities: The model can generate speech based on detailed instructions, making it versatile for various applications.

Applications of FunAudioLLM

Speech-to-Speech Translation
- Translate spoken words from one language to another in real time, facilitating global communication.
Emotional Voice Chat
- Create more engaging and natural voice conversations by incorporating emotional cues.
Interactive Podcasts
- Generate dynamic and interactive content for podcasts, enhancing listener engagement.
Expressive Audiobook Narration
- Provide audiobooks with expressive narration, making the listening experience more enjoyable.
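
These applications share one shape: a recognizer turns audio into text, an LLM transforms the text, and a synthesizer turns the result back into audio. Here is a minimal sketch of that chain for speech-to-speech translation. The three stage signatures and the stub functions are hypothetical placeholders, not FunAudioLLM APIs; in a real system SenseVoice, an LLM, and CosyVoice would fill those roles.

```python
from typing import Callable

# Hypothetical stage signatures; none of these are FunAudioLLM APIs.
Recognizer = Callable[[bytes], str]      # audio -> source-language text
Translator = Callable[[str, str], str]   # (text, target language) -> text
Synthesizer = Callable[[str], bytes]     # text -> audio


def speech_to_speech(
    audio: bytes,
    target_lang: str,
    recognize: Recognizer,
    translate: Translator,
    synthesize: Synthesizer,
) -> bytes:
    """Chain recognition, translation, and synthesis into one call."""
    text = recognize(audio)
    translated = translate(text, target_lang)
    return synthesize(translated)


# Stub stages so the sketch runs without any model installed.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


result = speech_to_speech(
    b"hello", "zh", fake_recognize, fake_translate, fake_synthesize
)
print(result)
```

Passing the stages in as callables keeps the pipeline independent of any one model, so the same function could serve a voice chat or narration flow by swapping the middle step.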

Technical Details

SenseVoice

Architecture: SenseVoice uses a transformer-based architecture optimized for real-time processing. It leverages self-attention mechanisms to capture long-range dependencies in audio signals.
Training Data: The model is trained on a diverse dataset of multilingual speech, emotional expressions, and various audio events.
Performance Benchmarks:
- Latency: Less than 100 ms of inference time; the SenseVoice-Small release cites roughly 70 ms to process 10 seconds of audio.
- Recognition Accuracy: Over 95% in noisy environments.
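
A latency figure only means something alongside the length of the audio being processed, so it helps to report the real-time factor (processing time divided by audio duration) as well. The harness below is a rough sketch of how one might measure both. `transcribe` stands for any recognition function and `dummy_transcribe` is a hypothetical stand-in that just sleeps; neither is a SenseVoice API.

```python
import time
from statistics import median
from typing import Callable


def measure_latency(
    transcribe: Callable[[bytes], str],
    audio: bytes,
    audio_seconds: float,
    runs: int = 5,
) -> dict:
    """Time several runs and report median latency and real-time factor."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(audio)
        timings.append(time.perf_counter() - start)
    latency = median(timings)
    return {
        "latency_ms": latency * 1000.0,
        # Values below 1.0 mean faster than real time.
        "rtf": latency / audio_seconds,
    }


def dummy_transcribe(audio: bytes) -> str:
    # Hypothetical stand-in for a real model call.
    time.sleep(0.01)
    return ""


# 10 seconds of 16 kHz, 16-bit mono silence is 320,000 bytes.
stats = measure_latency(dummy_transcribe, b"\x00" * 320_000, audio_seconds=10.0)
print(stats)
```

The median is used rather than the mean so that a slow first call, such as one that includes warm-up, does not dominate the result.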

CosyVoice

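Architecture: CosyVoice pairs a language model, which maps text to supervised semantic speech tokens, with a conditional flow matching model that converts those tokens into mel spectrograms; a neural vocoder then produces the waveform. This design allows it to generate high-quality speech with fine control over timbre and emotion.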
Training Data: The model is trained using a large corpus of multilingual speech data, including diverse emotional expressions.
Performance Benchmarks:
- Naturalness Score: Achieves a MOS (Mean Opinion Score) of 4.5 out of 5 in naturalness tests.
- Zero-Shot Generation Accuracy: Over 90% accuracy in generating speech based on context without prior training.
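
A mean opinion score is the average of listener ratings on a 1 to 5 scale, so a single number such as 4.5 is easier to interpret with a confidence interval next to it. The sketch below shows the arithmetic using a normal approximation. The ratings are made up for illustration and are not CosyVoice evaluation data, and `mos_with_ci` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev


def mos_with_ci(ratings: list, z: float = 1.96) -> tuple:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    score = mean(ratings)
    # Normal approximation; rough for small listener panels.
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return score, half_width


# Made-up listener ratings, for illustration only.
ratings = [5, 4, 5, 4, 4, 5, 5, 4, 5, 4]
mos, half = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {half:.2f}")
```

With only ten ratings the interval is wide, which is why published MOS results normally rely on many listeners and many utterances.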

Open Source and Availability

ModelScope: SenseVoice and CosyVoice
Hugging Face: SenseVoice and CosyVoice
GitHub: The training, inference, and fine-tuning code is available in the FunAudioLLM GitHub repository.

By integrating SenseVoice and CosyVoice with LLMs, FunAudioLLM paves the way for more natural and expressive voice interactions in a wide range of applications. Whether it's translating speech in real time, creating engaging voice chats, or producing dynamic podcasts