
Researchers introduce synthetic annotations to enhance text-to-speech systems, allowing precise control over speech attributes without the need for reference recordings, opening up new creative possibilities.
Dan Lyth, Simon King
Recent text-to-speech (TTS) models can generate remarkably natural-sounding speech. However, controlling speaker identity and style usually relies on reference speech recordings, which limits creative applications. In a new paper, researchers Dan Lyth and Simon King propose a scalable method for labeling aspects of speaker identity, style, and recording conditions using synthetic annotations. This makes it possible to train a TTS model on a large dataset (45k hours), and the approach significantly improves audio fidelity.
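To make the labeling idea concrete, here is a minimal sketch of one way such annotations could be produced. It measures continuous properties of each utterance (mean pitch, speaking rate, signal-to-noise ratio), maps each onto a small set of discrete labels, and fills a template to get a text description. The bin edges, label wording, and the `Utterance`, `to_label`, and `describe` names are hypothetical choices for illustration, not the authors' pipeline; the paper's actual features and description generation may differ.

```python
from bisect import bisect_right
from dataclasses import dataclass

# Hypothetical bin edges and labels. A real pipeline would likely derive edges
# from the training-data distribution (and bin pitch relative to gender).
PITCH_EDGES_HZ = [120.0, 180.0, 240.0]
PITCH_LABELS = ["very low-pitched", "slightly low-pitched", "moderately pitched", "high-pitched"]

RATE_EDGES_PHONES_PER_S = [10.0, 14.0]
RATE_LABELS = ["slowly", "at a moderate pace", "quickly"]

SNR_EDGES_DB = [15.0, 35.0]
SNR_LABELS = ["a noisy", "a fairly clean", "a very clean"]


@dataclass
class Utterance:
    gender: str               # assumed to come from existing metadata or a classifier
    accent: str               # assumed to come from an accent classifier
    mean_pitch_hz: float
    phones_per_second: float
    snr_db: float


def to_label(value: float, edges: list[float], labels: list[str]) -> str:
    """Map a continuous measurement to a discrete label using sorted bin edges."""
    # bisect_right counts the edges <= value, which is the bin index;
    # a value exactly equal to an edge falls into the higher bin.
    return labels[bisect_right(edges, value)]


def describe(u: Utterance) -> str:
    """Turn binned attributes into a simple templated description (hypothetical)."""
    pitch = to_label(u.mean_pitch_hz, PITCH_EDGES_HZ, PITCH_LABELS)
    rate = to_label(u.phones_per_second, RATE_EDGES_PHONES_PER_S, RATE_LABELS)
    snr = to_label(u.snr_db, SNR_EDGES_DB, SNR_LABELS)
    # Pick "a" or "an" before the accent name based on its first letter.
    article = "an" if u.accent[:1].lower() in "aeiou" else "a"
    return (
        f"A {u.gender} speaker with {article} {u.accent} accent "
        f"and a {pitch} voice speaks {rate} in {snr} recording."
    )


if __name__ == "__main__":
    example = Utterance("female", "American", 170.0, 12.0, 40.0)
    print(describe(example))
    # prints: A female speaker with an American accent and a slightly low-pitched
    # voice speaks at a moderate pace in a very clean recording.
```

Because every step is an automatic measurement followed by a lookup, this kind of labeling scales to tens of thousands of hours without human annotators, which is the property the summary above emphasizes.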
The TTS model is a speech language model that leverages the synthetic annotations for conditioning. Key architecture details include:

The model demonstrates high-fidelity speech generation across a wide range of attributes. Audio examples cover descriptions such as an American female with a slightly low-pitched voice, a female voice with an Italian accent, a male voice with an Indian accent, and male voices with Macedonian, American, and Canadian accents.
Original Sources
https://www.text-description-to-speech.com/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
8 February 2024