Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos, and most methods lack precise control over gestures and expressions. To push this boundary, we address the challenge of constructing a whole-body talking avatar from a single image. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. To achieve seamless generalization, we leverage recent pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels. To overcome the dynamic modeling challenge posed by inconsistent and noisy pseudo-videos, we introduce a tightly coupled 3DGS-mesh hybrid avatar representation and apply several key regularizations to mitigate inconsistencies caused by imperfect labels. Extensive experiments on diverse subjects demonstrate that our method enables the creation of a photorealistic, precisely animatable, and expressive whole-body talking avatar from just a single image.
Our method constructs an expressive whole-body talking avatar from a single image. We begin by generating pseudo body and head frames using pre-trained generative models, driven by a collected video dataset with diverse poses. Per-pixel supervision on the input image, perceptual supervision on imperfect pseudo labels, and mesh-related constraints are then applied to guide the 3DGS-mesh coupled avatar representation, ensuring realistic and expressive avatar reconstruction and animation.
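A minimal sketch of this hybrid supervision, assuming a PyTorch-style setup; `avatar.render`, the `lpips` metric, the `mesh_regularizers` term, and all loss weights are placeholder names and values for illustration, not the actual implementation:

```python
import torch.nn.functional as F


def hybrid_loss(avatar, input_image, input_pose,
                pseudo_frames, pseudo_poses,
                lpips, mesh_regularizers,
                w_pixel=1.0, w_perc=0.1, w_mesh=0.01):
    """Exact supervision on the real input image, softer perceptual
    supervision on the imperfect pseudo frames, plus mesh constraints.
    Sketch only: interfaces and weights are assumptions."""
    # 1) Per-pixel supervision on the single real input image.
    loss_pixel = F.l1_loss(avatar.render(input_pose), input_image)

    # 2) Perceptual supervision on the generated pseudo frames: these labels
    #    are noisy and inconsistent, so only image-level appearance is matched.
    loss_perc = 0.0
    for frame, pose in zip(pseudo_frames, pseudo_poses):
        rendered = avatar.render(pose)
        loss_perc = loss_perc + lpips(rendered[None], frame[None]).mean()
    loss_perc = loss_perc / max(len(pseudo_frames), 1)

    # 3) Mesh-related constraints that keep the Gaussians tied to the mesh.
    loss_mesh = mesh_regularizers(avatar)

    return w_pixel * loss_pixel + w_perc * loss_perc + w_mesh * loss_mesh
```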
We compare our work with the baseline methods ExAvatar, ELICIT, MimicMotion, and Make-Your-Anchor on diverse tasks, including cross-identity motion reenactment, self-driven motion reenactment, and upper-body video-driven animation. Our method achieves accurate and realistic animation while preserving almost all fine details and keeping the identity unchanged.
Note that since ExAvatar takes short videos as input, we compare with two versions of it: ExAvatar-40shot, which uses 40 input images, and ExAvatar-1shot, which uses a single input image. For Make-Your-Anchor, as we find it does not perform well with one-shot input, we only compare against it with video input by fine-tuning it on a short video clip.
Using the same driving pose, identities with completely different attributes can be driven in the same way, thanks to the SMPL-X model and the 3DGS-mesh coupled representation.
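To make the parameter sharing concrete, the toy sketch below drives several reconstructed identities with the same per-frame SMPL-X parameters; `load_avatar`, `estimate_smplx_params`, and `render` are hypothetical interfaces standing in for whatever tooling is actually used:

```python
def reenact(driving_frames, avatar_checkpoints):
    """Render every avatar under the same per-frame SMPL-X parameters."""
    # Hypothetical helpers: any SMPL-X estimator and avatar loader would do.
    avatars = {ckpt: load_avatar(ckpt) for ckpt in avatar_checkpoints}
    outputs = {ckpt: [] for ckpt in avatar_checkpoints}
    for frame in driving_frames:
        params = estimate_smplx_params(frame)  # body pose, hand pose, expression
        for ckpt, avatar in avatars.items():
            # Identity (shape, appearance) differs per avatar; motion is shared.
            outputs[ckpt].append(avatar.render(params))
    return outputs
```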
In this paper, we introduce a novel pipeline for creating expressive talking avatars from a single image. We propose a coupled 3DGS-mesh avatar representation, incorporating several key constraints and a carefully designed hybrid learning framework that combines information from both the input image and pseudo frames. Experimental results demonstrate that our method outperforms existing techniques, with our one-shot avatar even surpassing state-of-the-art methods that require video input. Considering its simplicity of construction and its ability to generate vivid, realistic animations, our method shows significant potential for practical applications of talking avatars across various fields.