Meta Introduces Spirit LM: An Open-Source Multimodal Model for Text and Speech

The Engineer

28 Oct 2024

Meta's Spirit LM combines text and speech in a single open-source model, positioning it against GPT-4o and EVI 2 with a focus on more natural and expressive AI-generated speech.

Meta has unveiled Spirit LM, its first open-source multimodal language model that accepts and produces both text and speech. The release positions Spirit LM as a direct competitor to models like OpenAI's GPT-4o and Hume's EVI 2, which also handle multiple modalities. The model comes from Meta's Fundamental AI Research (FAIR) team and aims to make AI-generated speech more natural and expressive.

Technical Breakdown

What Changed?

Multimodal Integration: Conventional voice systems transcribe spoken input with automatic speech recognition (ASR), pass the transcript to a language model, and then synthesize the reply with text-to-speech (TTS). Spirit LM instead handles both text and speech natively in a single model. This removes the need for separate ASR and TTS stages, where expressive cues in the voice can be lost.
Expressive Speech Generation: By adding pitch and style tokens (the latter capture tone) to its phonetic tokens, Spirit LM can capture and generate more nuanced emotional states in speech, such as excitement or sadness.
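
One way to picture native handling of both modalities is a single token sequence in which text tokens and discrete speech tokens sit side by side, separated by modality markers. The sketch below is illustrative only: the marker strings, the `[Ph…]` unit names, and the `Segment` type are assumptions made for this example, not Spirit LM's actual vocabulary or API.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical marker names, chosen for illustration only.
TEXT_MARKER = "[TEXT]"
SPEECH_MARKER = "[SPEECH]"


@dataclass(frozen=True)
class Segment:
    modality: str      # "text" or "speech"
    tokens: List[str]  # words for text, discrete unit names for speech


def to_single_stream(segments: List[Segment]) -> List[str]:
    """Flatten mixed text/speech segments into one token sequence.

    A modality marker is emitted whenever the modality changes, so one
    language model can read and continue the sequence in either modality.
    """
    stream: List[str] = []
    previous: Optional[str] = None
    for seg in segments:
        if seg.modality not in ("text", "speech"):
            raise ValueError(f"unknown modality: {seg.modality}")
        if seg.modality != previous:
            stream.append(TEXT_MARKER if seg.modality == "text" else SPEECH_MARKER)
            previous = seg.modality
        stream.extend(seg.tokens)
    return stream
```

In a cascaded system the speech segments would first be transcribed to text; here they stay as speech tokens, which is the property that allows expressive information to remain in the sequence.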

Key Features

Phonetic Tokens: Discrete units that represent the spoken content; used to process and generate speech.
Pitch and Style Tokens: Added in the Expressive version to capture pitch and tone, improving the model's ability to convey emotions.
Cross-Modal Tasks: Spirit LM can perform tasks like ASR, TTS, and speech classification, maintaining natural expressiveness in its outputs.
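
Because text and speech share one model, tasks such as ASR and TTS can be viewed as a choice of input and output modality rather than as separate systems. The sketch below shows that framing under stated assumptions: the marker strings, the task table, and `build_prompt` are hypothetical names for illustration, not part of Spirit LM's released code.

```python
from typing import Dict, List, Tuple

# Hypothetical marker names, for illustration only.
MARKERS: Dict[str, str] = {"text": "[TEXT]", "speech": "[SPEECH]"}

# Each task is a choice of (input modality, output modality).
TASK_MODALITIES: Dict[str, Tuple[str, str]] = {
    "asr": ("speech", "text"),                    # speech in, transcript out
    "tts": ("text", "speech"),                    # text in, speech out
    "speech_classification": ("speech", "text"),  # speech in, label out
}


def build_prompt(task: str, input_tokens: List[str]) -> List[str]:
    """Wrap the input in its modality marker and end with the target marker.

    Ending on the target marker signals which modality the model should
    continue in when it generates the rest of the sequence.
    """
    if task not in TASK_MODALITIES:
        raise ValueError(f"unknown task: {task}")
    source, target = TASK_MODALITIES[task]
    return [MARKERS[source], *input_tokens, MARKERS[target]]
```

For example, `build_prompt("tts", ["hello", "world"])` yields a sequence that starts in text and ends on the speech marker, so a model trained on mixed sequences would be expected to continue with speech tokens.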

Model Versions

Meta has released two versions of Spirit LM:

Spirit LM Base:
- Tokens: Phonetic tokens only.
- Use Case: Suitable for applications requiring accurate speech processing and generation without the need for emotional nuances.
Spirit LM Expressive:
- Tokens: Phonetic, pitch, and style tokens (style tokens capture tone).
- Use Case: Ideal for scenarios where capturing and conveying emotions is crucial.
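
The difference between the two versions comes down to which kinds of speech tokens each one models. A minimal sketch of that distinction follows; the `TokenType` enum, the version keys, and the `[Ph…]`, `[Pi…]`, `[St…]` token names are assumptions for illustration and do not reflect the released checkpoints' real identifiers.

```python
from enum import Enum
from typing import Dict, FrozenSet, List, Tuple


class TokenType(Enum):
    PHONETIC = "phonetic"
    PITCH = "pitch"
    STYLE = "style"  # captures tone; some write-ups call these tone tokens


# Which token types each version models, per the description above.
VERSION_TOKEN_TYPES: Dict[str, FrozenSet[TokenType]] = {
    "base": frozenset({TokenType.PHONETIC}),
    "expressive": frozenset({TokenType.PHONETIC, TokenType.PITCH, TokenType.STYLE}),
}


def pick_version(needs_emotion: bool) -> str:
    """Choose a version based on whether emotional nuance matters."""
    return "expressive" if needs_emotion else "base"


def tokens_for_version(
    tagged: List[Tuple[TokenType, str]], version: str
) -> List[str]:
    """Keep only the speech tokens whose type the given version models."""
    allowed = VERSION_TOKEN_TYPES[version]
    return [token for kind, token in tagged if kind in allowed]
```

Given the same tagged speech sequence, the Base view keeps only the phonetic tokens, while the Expressive view also keeps the pitch and style tokens that carry emotional cues.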

Training Data

Both versions of Spirit LM are trained on a combination of text and speech datasets, with the two modalities interleaved at the word level within training sequences. This approach is meant to let the model handle cross-modal tasks while preserving the natural expressiveness of human speech.
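
Meta describes the training mix as interleaving text and speech at word boundaries. The sketch below shows one simple way such a sequence could be built, assuming word-aligned speech units are already available. The function name, the marker strings, and the random switching rule are assumptions for illustration; the actual training recipe may differ.

```python
import random
from typing import List, Optional, Sequence

# Hypothetical marker names, for illustration only.
MARKERS = {"text": "[TEXT]", "speech": "[SPEECH]"}


def interleave_word_level(
    words: Sequence[str],
    units_per_word: Sequence[Sequence[str]],
    switch_prob: float,
    rng: random.Random,
) -> List[str]:
    """Build one training sequence that switches modality at word boundaries.

    words[i] is the text form of word i and units_per_word[i] holds the
    speech units aligned to that word. Each word is emitted in exactly one
    modality, and a marker is inserted whenever the modality changes.
    """
    if len(words) != len(units_per_word):
        raise ValueError("words and units_per_word must be aligned")
    stream: List[str] = []
    modality: Optional[str] = None
    for word, units in zip(words, units_per_word):
        if modality is None:
            next_modality = rng.choice(["text", "speech"])
        elif rng.random() < switch_prob:
            next_modality = "speech" if modality == "text" else "text"
        else:
            next_modality = modality
        if next_modality != modality:
            stream.append(MARKERS[next_modality])
            modality = next_modality
        if modality == "text":
            stream.append(word)
        else:
            stream.extend(units)
    return stream
```

With `switch_prob` set to 0 the whole utterance stays in one modality, and with higher values the sequence alternates more often, which exposes the model to transitions between text and speech in both directions.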

Noncommercial Usage Only

Meta's commitment to open science is commendable, but Spirit LM is currently available only for noncommercial use under the Meta FAIR Noncommercial Research License. This license allows users to:

Use: Utilize the model for research purposes.
Reproduce: Duplicate the model for study and experimentation.
Modify: Make changes to the model to suit specific research needs.
Create Derivatives: Develop new models based on Spirit LM.

However, any distribution of these models or derivatives must adhere to the noncommercial restriction. This limitation might be a hurdle for entrepreneurs and business leaders looking to leverage the technology in commercial products.

Why It Matters

For AI practitioners, Spirit LM is a notable step forward in multimodal language modeling. Handling both text and speech natively opens up new possibilities for more natural and expressive voice experiences. Whether you're working on virtual assistants, educational tools, or any application that benefits from human-like interaction, Spirit LM is worth evaluating, within the limits of its noncommercial license.

Conclusion

Meta’s Spirit LM is a promising addition to the landscape of multimodal language models. Its advanced capabilities in handling text and speech, combined with its open-source nature (albeit for non-commercial use), make it a valuable resource for researchers and developers looking to push the boundaries of AI voice experiences.