Expressive Talking Human from Single-Image with Imperfect Priors

Abstract

Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos, and most methods lack precise control over gestures and expressions. To push this boundary, we address the challenge of constructing a whole-body talking avatar from a single image. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. To achieve seamless generalization, we leverage recent pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels. To overcome the dynamic modeling challenge posed by inconsistent and noisy pseudo-videos, we introduce a tightly coupled 3DGS-mesh hybrid avatar representation and apply several key regularizations to mitigate inconsistencies caused by imperfect labels. Extensive experiments on diverse subjects demonstrate that our method enables the creation of a photorealistic, precisely animatable, and expressive whole-body talking avatar from just a single image.

Method Overview

Our method constructs an expressive whole-body talking avatar from a single image. We begin by generating pseudo body and head frames using pre-trained generative models, driven by a collected video dataset with diverse poses. Per-pixel supervision on the input image, perceptual supervision on imperfect pseudo labels, and mesh-related constraints are then applied to guide the 3DGS-mesh coupled avatar representation, ensuring realistic and expressive avatar reconstruction and animation.

Comparison

We compare our work with baseline methods, ExAvatar, ELICIT, MimicMotion and Make-Your-Anchor on diverse tasks like cross-identity motion reenactment, self-driven motion reenactment and upper-body video-driven. Our method achieves accurate and realistic animation with almost all fine details preserved and identity unchanged.

Note that for ExAvatar, since it takes short videos as input, we compare with two versions of it: ExAvatar-40shot, which uses 40-shot images, and ExAvatar-1shot, which uses one-shot images as input. For Make-Your-Anchor, as we find it does not perform well on one-shot input, we only compare it with available video input by fine-tuning it on a short video clip.

More results on NVS and motion reenactment

In this paper, we introduce a novel pipeline for creating expressive talking avatars from a single image. We propose a coupled 3DGS-Mesh avatar representation, incorporating several key constraints and a carefully designed hybrid learning framework that combines information from both the input image and pseudo frames. Experimental results demonstrate that our method outperforms existing techniques, with our one-shot avatar even surpassing state-of-the-art methods that require video input. Considering its simplicity in construction and ability to generate vivid, realistic animations, our method shows significant potential for practical applications of talking avatars across various fields.

BibTeX


    @inproceedings{xiang2025expressive,
      author    = {Jun Xiang and Yudong Guo and Leipeng Hu and Boyang Guo and Yancheng Yuan and Juyong Zhang},
      title     = {Expressive Talking Human from Single-Image with Imperfect Priors},
      booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
      year      = {2025},
    }

Expressive Talking Human from Single-Image
with Imperfect Priors

ICCV 2025

Given a one-shot image (e.g., your favorite photo) as input, our method reconstructs a fully expressive whole-body talking avatar that captures personalized details and supports realistic animation, including vivid body gestures and natural expression changes.

Abstract

Method Overview

Comparison

More results on cross-identity video-driven

More results on NVS and motion reenactment

BibTeX

Expressive Talking Human from Single-Imagewith Imperfect Priors

ICCV 2025

Given a one-shot image (e.g., your favorite photo) as input, our method reconstructs a fully expressive whole-body talking avatar that captures personalized details and supports realistic animation, including vivid body gestures and natural expression changes.

Abstract

Method Overview

Comparison

More results on cross-identity video-driven

More results on NVS and motion reenactment

BibTeX

Expressive Talking Human from Single-Image
with Imperfect Priors