When Words Smile 😀:

Generating Diverse Emotional Facial Expressions from Text

Harbin Institute of Technology (Shenzhen), University of Macau,
Nanyang Technological University, National University of Singapore
EMNLP 2025, Oral

Abstract

Enabling digital humans to express rich emotions has significant applications in dialogue systems, gaming, and other interactive scenarios. While recent advances in talking head synthesis have achieved impressive results in lip synchronization, they tend to overlook the rich and dynamic nature of facial expressions. To fill this critical gap, we introduce an end-to-end text-to-expression model that explicitly focuses on emotional dynamics. Our model learns expressive facial variations in a continuous latent space and generates expressions that are diverse, fluid, and emotionally coherent. To support this task, we introduce EmoAva, a large-scale and high-quality dataset containing 15,000 text–3D expression pairs. Extensive experiments on both existing datasets and EmoAva demonstrate that our method substantially outperforms baselines across multiple evaluation metrics, marking a significant advancement in the field.
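To make the task concrete, the sketch below shows what a single text–3D expression pair might look like when loaded; the field names, the coefficient dimensionality, and the file layout are illustrative assumptions rather than the actual EmoAva schema.

from dataclasses import dataclass
import numpy as np

@dataclass
class TextExpressionPair:
    text: str                # input utterance, e.g. "That'd be great!"
    expressions: np.ndarray  # (T, D) per-frame 3D expression coefficients

def load_pair(text: str, coeff_path: str) -> TextExpressionPair:
    """Pair an utterance with its per-frame expression coefficient sequence."""
    coeffs = np.load(coeff_path)  # hypothetical .npy file of shape (T, D)
    return TextExpressionPair(text=text, expressions=coeffs)

pair = load_pair("I am so dead.", "emoava/sample_0001.npy")
print(pair.expressions.shape)     # e.g. (120, 50) for a 120-frame clip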


Overview


Teaser

Top: The existing pipeline for synthesizing emotional avatars, which can only generate a limited range of expressions that lack diversity. Bottom: The proposed end-to-end system, which directly maps text to facial expression codes and aims to generate diverse, emotionally consistent, and temporally smooth expressions.



Demonstrations


Text: I am so dead.

Text: That'd be great!

Text: What a beautiful story.

Text: What the hell?

Continuous Text-to-Expression Generator



Given an input text, the model autoregressively generates a sequence of expression vectors. The green and pink blocks denote the proposed Expression-wise Attention (EwA) module and the core Conditional Variational Autoregressive Decoder (CVAD) module, respectively.
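For intuition, the following is a minimal PyTorch sketch of such an autoregressive loop: an attention module queries the encoded text once per generated frame, and a conditional-variational step samples a latent code and decodes the next expression vector. The module names (EwA, CVAD), dimensions, and the sampling scheme here are simplified assumptions, not the released implementation.

import torch
import torch.nn as nn

class ExpressionWiseAttention(nn.Module):
    """Attend over text token features once per generated expression frame."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, query, text_feats):
        ctx, _ = self.attn(query, text_feats, text_feats)  # (B, 1, d_model)
        return ctx

class CVADStep(nn.Module):
    """One conditional-variational decoding step: sample z, emit the next expression."""
    def __init__(self, d_model: int, d_expr: int, d_latent: int = 32):
        super().__init__()
        self.prior = nn.Linear(d_model, 2 * d_latent)        # mean and log-variance
        self.decode = nn.Sequential(
            nn.Linear(d_model + d_latent, d_model), nn.ReLU(),
            nn.Linear(d_model, d_expr),
        )

    def forward(self, cond):
        mu, logvar = self.prior(cond).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized sample
        return self.decode(torch.cat([cond, z], dim=-1))

# Autoregressive loop: each frame conditions on the text context and the previous frame.
d_model, d_expr, T = 256, 50, 60
ewa, cvad = ExpressionWiseAttention(d_model), CVADStep(d_model, d_expr)
proj = nn.Linear(d_expr, d_model)
text_feats = torch.randn(1, 12, d_model)    # encoded input text (B, L, d_model)
prev = torch.zeros(1, 1, d_expr)            # neutral start frame
frames = []
for _ in range(T):
    ctx = ewa(proj(prev), text_feats)       # expression-wise attention over text
    prev = cvad(ctx)                        # (B, 1, d_expr) next expression vector
    frames.append(prev)
expressions = torch.cat(frames, dim=1)      # (B, T, d_expr) generated sequence

Sampling a fresh latent code at every step is what lets the same input text yield multiple plausible expression trajectories, which is the source of the diversity emphasized above.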