Recently, text-guided digital portrait editing has attracted increasing attention. However, existing methods still struggle to maintain consistency across time, expression, and view, or require specific data prerequisites. To address these challenges, we propose CosAvatar, a high-quality and user-friendly framework for portrait tuning. With only a monocular video and text instructions as input, it produces animatable portraits with both temporal and 3D consistency. Unlike methods that edit directly in the 2D domain, we employ a dynamic NeRF-based 3D portrait representation to model both the head and torso. We alternate between editing the dataset of video frames and updating the underlying 3D portrait until the edited frames reach 3D consistency. Additionally, we integrate semantic portrait priors to enhance the edited results, allowing precise modifications in specified semantic areas. Extensive results demonstrate that our method can not only accurately edit portrait styles or local attributes based on text instructions but also support expressive animation driven by a source video.
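As a rough illustration of how a semantic prior can confine an edit to a target region, the sketch below blends the edited and original frames with a face-parsing mask. This is a minimal sketch under the assumption that such a mask is available; the helper name and blending scheme are illustrative, not the paper's exact mechanism.

```python
import numpy as np

def blend_edit_with_mask(original, edited, semantic_mask):
    """Keep the text-driven edit only inside the target semantic region.

    original, edited: HxWx3 float arrays in [0, 1]
    semantic_mask:    HxW float array in [0, 1], 1 inside the region to edit
                      (e.g. hair or lips from an off-the-shelf face parser)
    """
    mask = semantic_mask[..., None]            # broadcast over RGB channels
    return mask * edited + (1.0 - mask) * original
```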
We train two dynamic neural radiance fields to reconstruct the head and torso regions separately, conditioned on the tracked coefficients and poses. Once the NeRF model is constructed, we use InstructPix2Pix to generate edited results from the text instruction and progressively update the sequence dataset and the NeRF model. Finally, thanks to the NeRF-based portrait representation, we produce an edited sequence that maintains temporal coherence and 3D consistency.
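The alternating edit/update scheme can be sketched as follows. This is a minimal, hypothetical sketch rather than the released implementation: `portrait_nerf`, `dataset`, and `render_frame` are placeholder names standing in for the head/torso NeRFs, the tracked video dataset, and a conditional renderer.

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline

def iterative_dataset_update(portrait_nerf, dataset, render_frame,
                             instruction, num_update_steps=2000):
    """Alternate between editing rendered frames and fitting the NeRF.

    `portrait_nerf`, `dataset`, and `render_frame` are hypothetical
    placeholders; this sketches the general idea, not the paper's code.
    """
    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
        "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
    ).to("cuda")

    for step in range(num_update_steps):
        # Re-render one training view from the current NeRF state.
        idx = step % len(dataset)
        rendered = render_frame(portrait_nerf, dataset.cameras[idx],
                                dataset.flame_coeffs[idx])

        # Edit the rendering according to the text instruction.
        edited = pipe(instruction, image=rendered,
                      image_guidance_scale=1.5,
                      guidance_scale=7.5).images[0]

        # Swap the supervision image so subsequent NeRF updates are pulled
        # toward a 3D-consistent version of the 2D edits.
        dataset.images[idx] = edited

        # A few optimization steps on the partially edited dataset.
        portrait_nerf.train_steps(dataset, n_steps=10)
```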
Our method can handle various text-driven visual editing tasks, including global style editing and attribute editing of local portrait regions. It performs well even under extreme poses and expressions, generating 3D-consistent and temporally coherent results.
We compare CosAvatar with state-of-the-art video editing methods. By incorporating prior information about the head as guidance, our method accurately captures motion in expression and pose, ensuring temporal consistency in the edited results. Furthermore, our method achieves accurate editing of local attributes while preserving the identity of the edited portrait, which remains challenging for other methods.
By conditioning the NeRF model on FLAME coefficients, our method enables the direct transfer of facial expressions and head poses from a source video to the edited portrait.
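Conceptually, reenactment then reduces to rendering the edited NeRF with expression and pose coefficients tracked from the source video. The sketch below assumes hypothetical `track_flame` and `render_frame` helpers and is illustrative only, not the released code.

```python
def reenact(edited_portrait_nerf, render_frame, track_flame,
            source_frames, source_cameras):
    """Drive the edited portrait with motion from a source video.

    `edited_portrait_nerf`, `render_frame`, and `track_flame` are
    hypothetical placeholders for the edited NeRF, a conditional renderer,
    and a per-frame FLAME tracker.
    """
    outputs = []
    for frame, camera in zip(source_frames, source_cameras):
        expr, pose = track_flame(frame)        # tracked FLAME expression/pose
        outputs.append(render_frame(edited_portrait_nerf, camera,
                                    expression=expr, head_pose=pose))
    return outputs
```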
If you find our paper useful for your work, please cite:
@article{Xiao2023CosAvatar,
  author  = {Xiao, Haiyao and Zhong, Chenglai and Gao, Xuan and Guo, Yudong and Zhang, Juyong},
  title   = {CosAvatar: Consistent and Animatable Portrait Video Tuning with Text Prompt},
  journal = {arXiv:2311.18288},
  year    = {2023},
}