We first track the SMPL-X coefficients of the given monocular video, and then use a Neural Gaussian Texture mechanism to obtain a 3D Gaussian feature field. These neural Gaussians are then splatted to render portrait images. An iterative dataset update strategy is applied for portrait editing, and a Multimodal Face Aware Editing module is proposed to enhance expression quality and preserve personalized facial structures.
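The overall loop can be summarized as follows. This is only a hedged sketch of the pipeline structure; the helper names (`track_smplx`, `NeuralGaussianField`, `edit_2d`) are hypothetical placeholders, not the actual implementation.

```python
# Hedged sketch of the iterative dataset update pipeline.
# track_smplx, NeuralGaussianField, and edit_2d are hypothetical placeholders.

def edit_portrait_video(frames, instruction, num_rounds=10):
    smplx_params = track_smplx(frames)              # per-frame SMPL-X coefficients
    field = NeuralGaussianField(smplx_params)       # 3D Gaussians with learnable features
    dataset = {i: f for i, f in enumerate(frames)}  # editable training targets

    for _ in range(num_rounds):                     # iterative dataset update
        for i, src in enumerate(frames):
            render = field.render(smplx_params[i])          # splat features, decode to RGB
            dataset[i] = edit_2d(render, src, instruction)  # 2D editor (e.g. InstructPix2Pix)
        field.finetune(dataset, smplx_params)               # fit Gaussians to edited frames

    return [field.render(p) for p in smplx_params]
```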
Our scheme is a unified portrait video editing framework: any structure-preserving image editing model can be plugged in to synthesize a 3D-consistent and temporally coherent portrait video.
We use InstructPix2Pix as the 2D editing model. Its UNet takes three inputs: an input RGB image, a text instruction, and a noised latent. We add partial noise to the rendered image and edit it conditioned on the input source image and the instruction.
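A minimal sketch of this partial-noising edit is shown below. It assumes Stable Diffusion-style components (`encode`, `decode`, `unet`, `scheduler`, `text_emb`) that are placeholders for the corresponding model parts, and it omits classifier-free guidance for brevity; it is not the exact implementation.

```python
# Hedged sketch of SDEdit-style partial noising with an InstructPix2Pix-style editor.
# encode, decode, unet, scheduler, and text_emb are assumed components, not exact library calls.
import torch

def partial_noise_edit(rendered, source, text_emb, strength=0.6, steps=20):
    z = encode(rendered)                         # VAE latent of the rendered image
    t_start = int(strength * steps)              # how much of the reverse chain to rerun
    timesteps = scheduler.timesteps[-t_start:]   # keep only the low-noise tail of the schedule
    z = scheduler.add_noise(z, torch.randn_like(z), timesteps[0])  # add partial noise

    src_latent = encode(source)                  # IP2P conditions on the source image latent
    for t in timesteps:
        model_in = torch.cat([z, src_latent], dim=1)            # channel-wise concatenation
        eps = unet(model_in, t, encoder_hidden_states=text_emb).sample
        z = scheduler.step(eps, t, z).prev_sample               # one reverse-diffusion step
    return decode(z)                             # back to RGB
```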
We focus on two kinds of image-prompt editing tasks: one extracts the global style of a reference image, and the other customizes an image by placing an object at a specific location. These approaches are used in our experiments for style transfer and virtual try-on: we apply a Neural Style Transfer algorithm to transfer the style of a reference image to the dataset frames, and use AnyDoor to change the subject's clothes.
We utilize IC-Light to manipulate the illumination of the video frames. Given a text description as the light condition, our method can harmoniously adjust the lighting of the portrait video.
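Both the image-prompt editors above and the relighting model occupy the same per-frame editing slot in the dataset update loop. The dispatch below is a hedged sketch: `neural_style_transfer`, `anydoor_compose`, and `iclight_relight` are hypothetical wrappers around the respective methods, and the region argument is an assumption.

```python
# Hedged sketch: swapping the 2D editor used in the dataset update loop.
# neural_style_transfer, anydoor_compose, and iclight_relight are hypothetical wrappers.

def edit_frame(render, mode, reference=None, light_prompt=None):
    if mode == "style":    # global style transfer from a reference image
        return neural_style_transfer(content=render, style=reference)
    if mode == "tryon":    # place the reference garment on the subject (region is an assumption)
        return anydoor_compose(scene=render, object_image=reference, target_region="upper_body")
    if mode == "relight":  # text-conditioned relighting
        return iclight_relight(image=render, prompt=light_prompt)
    raise ValueError(f"unknown mode: {mode}")
```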
We compare our method with state-of-the-art video editing methods, including TokenFlow, Rerender A Video, CoDeF and AnyV2V. Our method markedly outperforms these methods in prompt preservation, identity preservation, and temporal consistency.
Inspired by the Neural Texture proposed in Deferred Neural Rendering, we propose Neural Gaussian Texture, which stores a learnable feature vector for each Gaussian instead of spherical harmonic (SH) coefficients. We then employ a 2D neural renderer to transform the splatted feature map into RGB signals. This approach provides more informative features than SH coefficients and allows for better fusion of splatted features, facilitating the editing of more complex styles such as Lego and pixel art.
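A minimal PyTorch sketch of this idea is given below, assuming a splatting routine `splat_features` (e.g. a Gaussian-splatting rasterizer extended to rasterize feature channels); that routine and the renderer architecture are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of a Neural Gaussian Texture: per-Gaussian learnable features
# splatted to a feature map, then decoded to RGB by a small 2D neural renderer.
# splat_features is a hypothetical feature rasterizer.
import torch
import torch.nn as nn

class NeuralGaussianTexture(nn.Module):
    def __init__(self, num_gaussians, feat_dim=32):
        super().__init__()
        # learnable feature per Gaussian, replacing SH coefficients
        self.features = nn.Parameter(torch.randn(num_gaussians, feat_dim) * 0.01)
        # small 2D neural renderer: splatted feature map -> RGB
        self.renderer = nn.Sequential(
            nn.Conv2d(feat_dim, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 1), nn.Sigmoid(),
        )

    def forward(self, means, covs, opacities, camera):
        feat_map = splat_features(means, covs, opacities,
                                  self.features, camera)   # (C, H, W) splatted features
        return self.renderer(feat_map.unsqueeze(0))[0]     # (3, H, W) RGB image
```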
When editing an upper-body image in which the face occupies a relatively small portion, the editing model may not be robust to head pose and facial structure. Face-Aware Portrait Editing (FA) enhances awareness of facial structure by performing the editing twice.
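One plausible realization of this two-pass scheme is sketched below: edit the full frame, then re-edit a face crop and paste it back. `edit_2d` and `detect_face_box` are hypothetical helpers (the 2D editing model and a face detector), and the paste-back step is an assumption about how the two passes are combined.

```python
# Hedged sketch of Face-Aware editing by running the 2D editor twice.
# edit_2d and detect_face_box are hypothetical helpers.
import torch
import torch.nn.functional as F

def face_aware_edit(render, source, instruction):
    edited = edit_2d(render, source, instruction)             # first pass: whole frame
    x0, y0, x1, y1 = detect_face_box(source)                  # face region from the source frame
    face_crop, src_crop = render[:, y0:y1, x0:x1], source[:, y0:y1, x0:x1]
    edited_face = edit_2d(face_crop, src_crop, instruction)   # second pass: face only
    edited_face = F.interpolate(edited_face.unsqueeze(0),     # resize in case the editor
                                size=(y1 - y0, x1 - x0),      # works at a fixed resolution
                                mode="bilinear", align_corners=False)[0]
    edited[:, y0:y1, x0:x1] = edited_face                     # paste the refined face back
    return edited
```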
By mapping the rendered image and the input source image into the latent expression space of EMOCA and optimizing for expression similarity, we further keep the expressions natural and consistent with the original input video frames.
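A hedged sketch of such an expression-consistency term is shown below; `emoca_expression` is a hypothetical wrapper returning an image's EMOCA expression code, and the plain L2 distance is an assumed choice of similarity.

```python
# Hedged sketch of an expression-consistency loss in EMOCA's latent expression space.
# emoca_expression is a hypothetical wrapper around the EMOCA encoder.
import torch

def expression_loss(rendered, source):
    psi_render = emoca_expression(rendered)           # expression code of the rendered image
    psi_source = emoca_expression(source).detach()    # expression code of the source frame (fixed target)
    return torch.mean((psi_render - psi_source) ** 2) # L2 expression similarity
```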
@inproceedings{Gao2024PortraitGen,
  title     = {Portrait Video Editing Empowered by Multimodal Generative Priors},
  author    = {Xuan Gao and Haiyao Xiao and Chenglai Zhong and Shimin Hu and Yudong Guo and Juyong Zhang},
  booktitle = {ACM SIGGRAPH Asia Conference Proceedings},
  year      = {2024}
}
This research was supported by the National Natural Science Foundation of China (No.62122071, No.62272433, No.62402468), the Fundamental Research Funds for the Central Universities (No. WK3470000021), and the advanced computing resources provided by the Supercomputing Center of University of Science and Technology of China.