Stability AI and Arm Release Stable Audio Open Small for On-Device Text-to-Audio Generation on Smartphones

The Engineer

19 May 2025

Stability AI and Arm unveil Stable Audio Open Small, a lightweight text-to-audio model for smartphones that generates short audio clips quickly with just 341 million parameters, aimed at developers seeking efficient mobile solutions.

May 14, 2025

Stability AI and Arm have announced the release of Stable Audio Open Small, a compact text-to-audio model designed to run entirely on Arm CPUs. This new model is optimized for generating short audio samples quickly and efficiently on mobile devices. Here’s what changed technically and why it matters to developers and practitioners.

Key Changes and Technical Details

Model Size and Performance: Stable Audio Open Small has 341 million parameters, making it significantly smaller than its predecessor, Stable Audio Open, while, according to the companies, maintaining output quality and prompt adherence. It can generate up to 11 seconds of audio on a smartphone in less than 8 seconds.
Optimization for Arm CPUs: The model leverages Arm's KleidiAI software stack, which is designed to optimize AI workloads on Arm processors. The aim is for the model to run efficiently on a wide range of mobile devices, from high-end smartphones to budget models.
Real-World Deployment: By running entirely on-device, Stable Audio Open Small generates audio without relying on cloud services. This is particularly useful for applications requiring low latency and data privacy, such as voice assistants, gaming, and interactive storytelling.
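
To put the 341 million parameter figure in perspective, the storage needed for the weights can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation only: the numeric formats listed are illustrative assumptions, not a statement of how the released weights are stored, and the estimate ignores activations and runtime overhead.

```python
PARAMS = 341_000_000  # parameter count stated in the announcement

# Bytes per parameter for some common storage formats (illustrative assumptions).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_footprint_mb(params: int, bytes_per_param: int) -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return params * bytes_per_param / 1e6


for name, width in BYTES_PER_PARAM.items():
    print(f"{name}: ~{weight_footprint_mb(PARAMS, width):.0f} MB")
```

Under these assumptions the weights alone range from roughly 341 MB to roughly 1.4 GB, which gives a sense of why the parameter count matters on memory-constrained phones.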

Why It Matters

On-Device Capabilities:
- Latency: On-device processing reduces the delay between user input and output, making the experience more seamless.
- Privacy: User data remains on the device, enhancing privacy and security.
- Connectivity Independence: The model can function without an internet connection, so it remains usable offline or on unreliable networks.

Developer Accessibility:
- Open Weights: Stable Audio Open Small is available under the Stability AI Community License, which permits both commercial and non-commercial use subject to the license terms.
- Learning Resources: Arm has created a new Arm Learning Path with hands-on guidance for developers interested in running the model on Arm CPUs.

Performance Benchmarks:
- Generation Time: The model can produce 11 seconds of audio in under 8 seconds, which is faster than playback speed and makes it a candidate for interactive applications.
- Quality: Despite its smaller size, Stable Audio Open Small is reported to maintain audio quality and prompt adherence.
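
The generation-time figure can be expressed as a real-time factor, that is, compute time divided by the duration of the audio produced. The short sketch below works through that arithmetic using only the two numbers quoted above; the function name is hypothetical, and 8 seconds is treated as an upper bound because the announcement says "under 8 seconds".

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Compute time divided by audio duration; below 1.0 is faster than playback."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Figures from the announcement: 11 s of audio in under 8 s (8 s is an upper bound).
GENERATION_SECONDS = 8.0
AUDIO_SECONDS = 11.0

rtf = real_time_factor(GENERATION_SECONDS, AUDIO_SECONDS)
print(f"real-time factor: at most {rtf:.3f}")
print(f"throughput: at least {AUDIO_SECONDS / GENERATION_SECONDS:.3f} s of audio per second of compute")
```

A factor below 1.0 means clips are produced faster than they play back. If a whole clip has to be generated before playback can begin, though, the first sound would still arrive only after the full generation time.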

How It Works

Architecture: The model pairs a text encoding stage, which turns the prompt into a conditioning signal, with an audio synthesis stage that generates the audio from it. The architecture is optimized to run efficiently on Arm CPUs, keeping computational requirements low.
Implementation: Developers can download the model weights from Hugging Face and access the code on GitHub. The provided resources include detailed documentation and sample implementations to help developers get started.
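
As a rough illustration of the download-and-generate workflow, the sketch below uses the `stable-audio-tools` Python package. Several details are assumptions to check against the model card before use: the repository id `stabilityai/stable-audio-open-small`, the `steps` value, and the `sampler_type` name. The helper names `build_conditioning` and `generate_clip` are hypothetical. Note also that this is the PyTorch route on a workstation, not the on-device Arm deployment covered by the Arm Learning Path.

```python
MAX_SECONDS = 11.0  # maximum clip length stated in the announcement
REPO_ID = "stabilityai/stable-audio-open-small"  # assumed Hugging Face repo id


def build_conditioning(prompt: str, seconds: float) -> list[dict]:
    """Build the conditioning list, clamping the duration to the 11 s limit."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return [{"prompt": prompt, "seconds_total": float(min(seconds, MAX_SECONDS))}]


def generate_clip(prompt: str, seconds: float = MAX_SECONDS):
    """Generate one clip; heavy imports are deferred until the function is called."""
    import torch
    from stable_audio_tools import get_pretrained_model
    from stable_audio_tools.inference.generation import generate_diffusion_cond

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Downloads the weights on first use (the repo may require accepting terms).
    model, model_config = get_pretrained_model(REPO_ID)
    model = model.to(device)
    audio = generate_diffusion_cond(
        model,
        steps=8,                  # assumed few-step setting; verify on the model card
        conditioning=build_conditioning(prompt, seconds),
        sample_size=model_config["sample_size"],
        sampler_type="pingpong",  # assumed sampler name; verify on the model card
        device=device,
    )
    # Returns the generated audio tensor and the sample rate from the model config.
    return audio, model_config["sample_rate"]
```

Calling `generate_clip("rain on a tin roof", seconds=8)` would, under these assumptions, return the audio tensor and its sample rate. `build_conditioning` can be exercised on its own without downloading anything.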

Use Cases

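Voice Assistants: Complement voice assistants with contextually relevant non-speech audio, such as notification sounds and ambient cues; the model is geared toward short sound samples rather than synthesized speech.
Gaming: Generate dynamic sound effects and short musical elements on demand for more immersive gaming experiences.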
Interactive Storytelling: Create interactive narratives where the audio content adapts to user choices.

Conclusion

The release of Stable Audio Open Small represents a significant step forward in on-device text-to-audio generation. By combining Arm's hardware optimization with Stability AI's cutting-edge model, developers can now create powerful and efficient audio applications for mobile devices. Whether you’re building a voice assistant, enhancing a game, or creating an interactive story, this model offers the performance and flexibility needed to bring your ideas to life.