A Two-Stage Transformer Model for Emotion-Driven Piano Performance Generation

The Engineer

1 Aug 2024

Researchers unveil a two-stage Transformer model that enhances emotional depth in piano performances by separately handling valence through lead sheet composition and arousal via performance attributes like tempo and articulation.

In a recent paper, researchers introduce a two-stage Transformer-based model for generating emotion-driven piano performances. The approach targets a limitation of earlier end-to-end models, which struggled to model emotion accurately. The first stage models valence (how positive or negative the music feels) through lead sheet composition (melody + chord), while the second stage models arousal (how calm or energetic it feels) by adding performance attributes such as articulation, tempo, and velocity.

Key Innovations

Two-Stage Framework: Separates valence and arousal modeling to improve emotional accuracy.
Functional Representation: A new method for encoding symbolic music that considers musical keys and tonality.
Improved Emotion Control: Enables flexible control over arousal levels while maintaining the same lead sheet.

Technical Details

Two-Stage Framework

The model is divided into two stages:

Valence Modeling (Stage 1):
- Input: Target valence (the emotion condition)
- Output: Valence-driven lead sheet (melody + chord)
- Purpose: Capture emotional valence through the lead sheet, focusing on melody and chord interactions.

Arousal Modeling (Stage 2):
- Input: Lead sheet from Stage 1
- Output: Piano performance with arousal-driven attributes (articulation, tempo, velocity)
- Purpose: Introduce performance-level details to control arousal levels, allowing for more nuanced emotional expression.
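
The data flow between the two stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' implementation: both stages are learned Transformers in the paper, whereas the functions below (`compose_lead_sheet`, `perform`) are hypothetical rule-based stand-ins, and the chord progressions, tempo, velocity, and articulation values are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LeadSheetEvent:
    bar: int
    beat: float   # onset inside the bar, in beats
    kind: str     # "melody" or "chord"
    symbol: str   # scale degree (melody) or Roman numeral (chord)


@dataclass
class PerformedNote:
    onset_sec: float
    duration_sec: float
    velocity: int  # MIDI velocity, 1-127
    degree: str


def compose_lead_sheet(valence: str) -> List[LeadSheetEvent]:
    """Stage 1 stand-in (hypothetical): a real system would sample these
    events from a Transformer conditioned on the target valence."""
    if valence == "positive":
        chords = ["I", "V", "vi", "IV"]       # assumed major-key progression
    else:
        chords = ["i", "VI", "III", "VII"]    # assumed minor-key progression
    events = []
    for bar, chord in enumerate(chords):
        events.append(LeadSheetEvent(bar, 0.0, "chord", chord))
        events.append(LeadSheetEvent(bar, 0.0, "melody", "1"))
        events.append(LeadSheetEvent(bar, 2.0, "melody", "3"))
    return events


def perform(lead_sheet: List[LeadSheetEvent], arousal: str,
            beats_per_bar: int = 4) -> List[PerformedNote]:
    """Stage 2 stand-in (hypothetical): fixed rules replace the learned model.
    Only the performance attributes change; the lead sheet is left untouched."""
    if arousal == "high":
        tempo_bpm, velocity, articulation = 140, 100, 0.5   # assumed values
    else:
        tempo_bpm, velocity, articulation = 70, 50, 0.95    # assumed values
    sec_per_beat = 60.0 / tempo_bpm
    notes = []
    for ev in lead_sheet:
        if ev.kind != "melody":
            continue
        onset = (ev.bar * beats_per_bar + ev.beat) * sec_per_beat
        duration = 2.0 * sec_per_beat * articulation  # each note nominally lasts 2 beats
        notes.append(PerformedNote(onset, duration, velocity, ev.symbol))
    return notes


# One lead sheet, two arousal levels.
lead_sheet = compose_lead_sheet("positive")
calm = perform(lead_sheet, "low")
excited = perform(lead_sheet, "high")
```

Because `perform` never modifies the lead sheet, `calm` and `excited` contain the same melody degrees in the same order and differ only in timing, loudness, and note length, which is the kind of separation the two-stage design aims for.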

Functional Representation

The functional representation is an alternative to the popular REMI (REvamped MIDI-derived events) representation. It encodes both melody and chords using Roman numerals relative to the musical key, which helps capture the interactions among notes, chords, and tonality.

Key Features:
- Musical Keys: Recognizes the significance of major-minor tonality in shaping valence perception.
- Roman Numerals: Encodes melody and chords relative to the key, enhancing the representation's ability to capture emotional nuances.
- Conversion Rules: Provides strict one-to-one and optional one-to-either conversions between note pitches and Roman numerals.
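
To make the idea of key-relative encoding concrete, here is a minimal sketch. It is not the paper's tokenizer: the helper names (`pitch_to_degree`, `chord_to_roman`, `encode_melody`) and the label tables are assumptions made for illustration, and the sketch measures every degree from the tonic regardless of mode, which may differ from the conversion rules the researchers define.

```python
# Pitch class (0-11) for each note name used in this example.
PITCH_CLASSES = {
    "C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
    "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10, "Bb": 10, "B": 11,
}

# Semitone distance above the tonic -> label. Both tables are illustrative choices.
DEGREE_LABELS = ["1", "b2", "2", "b3", "3", "4", "#4", "5", "b6", "6", "b7", "7"]
ROMAN_LABELS = ["I", "bII", "II", "bIII", "III", "IV", "#IV", "V", "bVI", "VI", "bVII", "VII"]


def pitch_to_degree(midi_pitch: int, tonic: str) -> str:
    """Map a MIDI pitch to a scale-degree label relative to the key's tonic."""
    offset = (midi_pitch - PITCH_CLASSES[tonic]) % 12
    return DEGREE_LABELS[offset]


def chord_to_roman(root: str, quality: str, tonic: str) -> str:
    """Map a chord root and quality ("maj" or "min") to a Roman numeral.
    Upper case marks major chords, lower case marks minor chords."""
    offset = (PITCH_CLASSES[root] - PITCH_CLASSES[tonic]) % 12
    numeral = ROMAN_LABELS[offset]
    return numeral.lower() if quality == "min" else numeral


def encode_melody(midi_pitches, tonic: str):
    return [pitch_to_degree(p, tonic) for p in midi_pitches]


# The same melodic shape in two keys yields identical tokens.
in_c_major = encode_melody([60, 64, 67], "C")   # C4, E4, G4
in_g_major = encode_melody([67, 71, 74], "G")   # G4, B4, D5
```

Both melodies encode to `["1", "3", "5"]`, and an A minor chord in C major and an E minor chord in G major both become `vi`. An absolute-pitch encoding would give these pairs different tokens, so the model would have to learn the key relationship on its own.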

Experiments and Results

The researchers evaluated both the two-stage framework and the functional representation through listening tests. The reported results show improved emotion modeling on the following measures:

Mean Opinion Score (MOS):
- Valence-Oriented: Higher scores indicate better performance.
- Arousal-Oriented: Lower scores indicate better performance.
- The proposed method outperformed both one-stage and two-stage models using REMI representation.

Confusion Matrices:
- Four-quadrant (4Q) listening tests showed that the functional representation model had a clearer distinction between different emotional quadrants, indicating more accurate emotion control.
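
For readers unfamiliar with 4Q evaluation, the sketch below shows how such a confusion matrix can be tallied. It assumes the common quadrant convention (Q1: positive valence, high arousal; Q2: negative valence, high arousal; Q3: negative valence, low arousal; Q4: positive valence, low arousal). The listener responses are made up purely to exercise the code and are not results from the paper.

```python
QUADRANTS = ["Q1", "Q2", "Q3", "Q4"]


def quadrant(valence_positive: bool, arousal_high: bool) -> str:
    """Map a (valence, arousal) pair to its quadrant label."""
    if arousal_high:
        return "Q1" if valence_positive else "Q2"
    return "Q4" if valence_positive else "Q3"


def confusion_matrix(targets, responses):
    """Rows are the intended quadrant, columns are what listeners perceived."""
    index = {q: i for i, q in enumerate(QUADRANTS)}
    matrix = [[0] * len(QUADRANTS) for _ in QUADRANTS]
    for target, response in zip(targets, responses):
        matrix[index[target]][index[response]] += 1
    return matrix


def accuracy(matrix) -> float:
    """Share of responses that fall on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Invented example data: intended quadrant vs. listener's choice.
targets = ["Q1", "Q1", "Q2", "Q3", "Q3", "Q4", "Q4", "Q2"]
responses = ["Q1", "Q4", "Q2", "Q3", "Q4", "Q4", "Q4", "Q1"]

matrix = confusion_matrix(targets, responses)
score = accuracy(matrix)
```

A model with accurate emotion control concentrates counts on the diagonal. Off-diagonal cells show which confusions remain: a Q1 target heard as Q4 is an arousal error, while a Q2 target heard as Q1 is a valence error.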

Generation Samples

To illustrate the capabilities of the proposed framework, the researchers provided generation samples from three models:

REMI (one): One-stage generation model with REMI representation (baseline).
REMI (two): Two-stage generation model with REMI representation.
Functional (two): Two-stage generation model with functional representation (main proposal).

Same Lead Sheet, Different Arousal Performance

The following examples demonstrate piano performances generated from the same lead sheet but with different arousal levels:

Low Arousal: Softer, slower performance with less dynamic variation.
High Arousal: Louder, faster performance with more dynamic variation.

These samples highlight the model's ability to generate diverse emotional expressions while maintaining the same musical structure.

Conclusion

The two-stage Transformer-based model and the functional representation improve emotion-driven piano performance generation. By separating valence and arousal modeling and encoding music relative to its key, the approach gives more accurate and flexible control over emotional expression in generated performances.