Enhancing Text-to-Speech with Natural Language Guidance and Synthetic Annotations

The Engineer

8 Feb 2024

Researchers introduce synthetic annotations to enhance text-to-speech systems, allowing precise control over speech attributes without the need for reference recordings, opening up new creative possibilities.

Dan Lyth, Simon King

Paper

Recent text-to-speech (TTS) models generate remarkably natural-sounding speech. However, controlling speaker identity and style usually depends on reference speech recordings, which limits creative applications. In a new paper, Dan Lyth and Simon King propose a scalable method for labeling speaker identity, style, and recording conditions with synthetic annotations. They apply the method to a large dataset (45k hours) to train a TTS model that is controlled through natural language descriptions, and they report significantly improved audio fidelity.

Key Technical Changes

Synthetic Annotations: Instead of relying on human-labeled descriptions, the researchers developed a method to generate synthetic annotations for speaker identity, style, and recording conditions. This enables scaling to large datasets without the bottleneck of manual labeling.
Natural Language Conditioning: The model uses natural language prompts to control speaker attributes, providing an intuitive way to specify desired characteristics like accent, pitch, and pace.
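
As a rough illustration of natural language conditioning, the sketch below assembles a description prompt from a handful of attribute keywords, in the style of the example prompts listed under Results. The `VoiceSpec` class, its fields, and `build_prompt` are hypothetical names invented for this example, and the sentence template is an assumption; the paper's prompts were not necessarily produced this way.

```python
from dataclasses import dataclass


@dataclass
class VoiceSpec:
    """Hypothetical container for the attributes a prompt can describe."""
    gender: str     # e.g. "female"
    accent: str     # e.g. "Italian"
    pitch: str      # e.g. "slightly high-pitched"
    pace: str       # e.g. "fairly quick"
    recording: str  # e.g. "very noisy"


def build_prompt(spec: VoiceSpec) -> str:
    """Turn attribute keywords into a free-text description (illustrative template only)."""
    # Pick "a" or "an" from the first letter of the accent name.
    article = "an" if spec.accent[0].lower() in "aeiou" else "a"
    return (
        f"A {spec.gender} voice with {article} {spec.accent} accent reads a book. "
        f"The recording is {spec.recording}. "
        f"The speaker reads at a {spec.pace} pace with a {spec.pitch} voice."
    )


prompt = build_prompt(
    VoiceSpec("female", "Italian", "slightly high-pitched", "fairly quick", "very noisy")
)
print(prompt)
```

A fixed template like this yields uniform wording. The example prompts in this article vary their phrasing far more, so treat the template only as a way to see which attributes a prompt carries.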

Methodology

Dataset: The model was trained on a 45k-hour dataset that covers a diverse range of accents, prosodic styles, channel conditions, and acoustic environments.
Annotation Generation: Synthetic annotations were generated for each data point, covering aspects such as:
- Speaker identity (e.g., gender, age, accent)
- Style (e.g., pitch, pace, monotony)
- Recording conditions (e.g., noise level, proximity)
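
One plausible way to produce labels like these automatically is to measure each attribute numerically and then bin the value into a descriptive keyword. The sketch below shows that binning step only. Every threshold, label name, and function name in it (`to_label`, `annotate`, the `*_EDGES` and `*_LABELS` constants) is an assumption made for illustration and is not taken from the paper.

```python
from bisect import bisect_right

# All thresholds and label names below are invented for illustration.
PITCH_EDGES_HZ = [100.0, 130.0, 160.0, 190.0, 220.0, 250.0]
PITCH_LABELS = [
    "very low-pitched", "fairly low-pitched", "slightly low-pitched",
    "moderately pitched",
    "slightly high-pitched", "fairly high-pitched", "very high-pitched",
]
SNR_EDGES_DB = [10.0, 20.0, 30.0, 40.0]
NOISE_LABELS = ["very noisy", "fairly noisy", "slightly noisy", "fairly clean", "very clean"]
RATE_EDGES_WPS = [2.0, 2.5, 3.0, 3.5]  # words per second
PACE_LABELS = ["very slowly", "fairly slowly", "at a moderate pace", "fairly quickly", "very quickly"]


def to_label(value, edges, labels):
    """Map a measurement to the label of the bin it falls into.

    `edges` must be sorted ascending and `labels` needs one more entry than `edges`.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("labels must have exactly one more entry than edges")
    return labels[bisect_right(edges, value)]


def annotate(pitch_hz, snr_db, words_per_sec):
    """Convert raw measurements for one utterance into keyword annotations."""
    return {
        "pitch": to_label(pitch_hz, PITCH_EDGES_HZ, PITCH_LABELS),
        "noise": to_label(snr_db, SNR_EDGES_DB, NOISE_LABELS),
        "pace": to_label(words_per_sec, RATE_EDGES_WPS, PACE_LABELS),
    }


labels = annotate(pitch_hz=210.0, snr_db=8.0, words_per_sec=3.6)
print(labels)
```

Because the bins are plain lookups, this kind of labeling scales to every utterance in a large corpus without human annotators, which is the property the article highlights. A real system would likely need speaker-dependent pitch thresholds, since a single set of Hz boundaries does not suit all voices.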

Model Architecture

The TTS model is a speech language model that leverages the synthetic annotations for conditioning. Key architecture details include:

Encoder: Processes the text input and natural language prompts to generate context-aware embeddings.
Decoder: Generates the audio waveform based on the encoder's output and synthetic annotations.
Post-processing: Includes techniques for enhancing audio fidelity, such as spectral normalization and noise reduction.

Results

The model generates high-fidelity speech across a wide range of attributes. Each example below pairs a description prompt with audio from the model, from Audiobox, and from the ground-truth recording:

American Female with a Slightly Low-Pitched Voice:
- Prompt: "An American female with a slightly low-pitched voice reads a book. Her words are captured in an excellent and very close-sounding recording. The speaker reads with a slightly quick pace."
- Audio: [Our model] [Audiobox] [Ground truth]
Female Voice with an Italian Accent:
- Prompt: "A female voice with an Italian accent reads from a book. The recording is very noisy. The speaker reads fairly quickly with a slightly high-pitched and monotone voice."
- Audio: [Our model] [Audiobox] [Ground truth]
Male Voice with an Indian Accent:
- Prompt: "A male voice with an Indian accent reads slowly from a book, his words fairly close-sounding and slightly clean. He speaks in a slightly monotone fashion, but his voice is fairly high-pitched, adding a touch of eagerness to his reading."
- Audio: [Our model] [Audiobox] [Ground truth]
Male Voice with a Macedonian Accent:
- Prompt: "A male voice with a Macedonian accent reads a book aloud. The recording is very close-sounding but slightly noisy. The voice is quite monotone with a fairly low pitch."
- Audio: [Our model] [Audiobox] [Ground truth]
Male Voice with an American Accent:
- Prompt: "A male voice with an American accent reads a book. The recording is very close-sounding and very clean. His voice is slightly monotone, but the excellent recording and his slightly low pitch draw the listener in."
- Audio: [Our model] [Audiobox] [Ground truth]
Male Voice with a Canadian Accent:
- Prompt: "A male voice with a Canadian accent reads a book aloud. The recording is excellent. His delivery is slightly monotone."
- Audio: [Our model] [Audiobox] [Ground truth]
