PortraitGen
Portrait Video Editing Empowered by Multimodal Generative Priors

Xuan Gao, Haiyao Xiao, Chenglai Zhong, Shimin Hu, Yudong Guo, Juyong Zhang

University of Science and Technology of China

SIGGRAPH Asia 2024






PortraitGen lifts a 2D portrait video into a 4D Gaussian field.
It achieves multimodal portrait editing in just 30 minutes ⏰.
The edited 3D portrait can also be rendered at 100 FPS ⚡.


Pipeline overview

We first track the SMPL-X coefficients of the given monocular video and then use a Neural Gaussian Texture mechanism to obtain a 3D Gaussian feature field. These neural Gaussians are then splatted to render portrait images. An iterative dataset update strategy is applied for portrait editing, and a Multimodal Face-Aware Editing module is proposed to enhance expression quality and preserve personalized facial structures.
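
To make the editing loop concrete, here is a minimal, self-contained sketch of the iterative dataset update strategy (in the spirit of Instruct-NeRF2NeRF): training frames are periodically replaced by 2D edits of the current renders, so the Gaussian field gradually converges to the edited appearance. The render() and edit_2d() stubs stand in for the real splatting renderer and the diffusion editor; all names and sizes are purely illustrative.

import torch

def render(field, camera):
    # Placeholder for differentiable Gaussian splatting.
    return field.mean() + 0 * camera

def edit_2d(rendered, source, instruction):
    # Placeholder for the 2D diffusion editor (e.g. InstructPix2Pix).
    return rendered.detach()

field = torch.nn.Parameter(torch.randn(1000, 3))  # stand-in Gaussian parameters
cameras = torch.randn(100)                        # stand-in per-frame cameras
sources = [torch.randn(()) for _ in range(100)]   # original video frames
dataset = list(sources)                           # frames used as supervision
opt = torch.optim.Adam([field], lr=1e-2)

for step in range(1000):
    i = step % len(dataset)
    pred = render(field, cameras[i])
    if step % 10 == 0:  # periodically refresh this frame with a 2D edit
        dataset[i] = edit_2d(pred, sources[i], "a text instruction")
    loss = (pred - dataset[i]).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()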

🎬Introduction Video🎬


🎨Multimodal Portrait Editing🎨

Our scheme is a unified portrait video editing framework: any structure-preserving image editing model can be used to synthesize a 3D-consistent and temporally coherent portrait video.

Text Driven Editing

We use InstructPix2Pix as the 2D editing model. Its UNet takes three inputs: an input RGB image, a text instruction, and a noised latent. We add partial noise to the rendered image and edit it conditioned on the input source image and the instruction.
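
As a concrete illustration, below is a hedged sketch of one editing call via the Hugging Face diffusers InstructPix2Pix pipeline; the model ID, instruction, and guidance weights are illustrative. Note that this plain call denoises from pure noise, whereas our method adds only partial noise to the rendered image.

import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

render = Image.open("rendered_frame.png").convert("RGB")  # current render

edited = pipe(
    prompt="Turn him into a bronze statue",  # text instruction
    image=render,                            # image condition for the UNet
    num_inference_steps=20,
    guidance_scale=7.5,                      # weight on the text instruction
    image_guidance_scale=1.5,                # weight on the image condition
).images[0]
edited.save("edited_frame.png")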

Image Driven Editing

We focus on two kinds of editing based on an image prompt: one extracts the global style of a reference image, and the other customizes an image by placing an object at a specific location. These approaches serve style transfer and virtual try-on in our experiments: a Neural Style Transfer algorithm transfers the style of a reference image onto the dataset frames, and AnyDoor changes the clothes of the subject.
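
For reference, here is a minimal sketch of the Gram-matrix style loss at the core of classic Neural Style Transfer (Gatys et al.); the choice of feature layers is illustrative, and the features are assumed to come from a pretrained CNN such as VGG.

import torch
import torch.nn.functional as F

def gram(feat):
    # feat: (B, C, H, W) feature map from a pretrained CNN layer.
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(frame_feats, ref_feats):
    # Sum of Gram-matrix differences over the chosen layers.
    return sum(F.mse_loss(gram(f), gram(r))
               for f, r in zip(frame_feats, ref_feats))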

Relighting

We utilize IC-Light to manipulate the illumination of the video frames. Given a text description as the light condition, our method can harmoniously adjust the lighting of the portrait video.


🔍Comparison🔍

We compare our method with state-of-the-art video editing methods, including TokenFlow, Rerender A Video, CoDeF, and AnyV2V. Our method remarkably outperforms the other methods in prompt preservation, identity preservation, and temporal consistency.

🧪Ablation Study🧪

Neural Gaussian Texture

Inspired by the Neural Texture of Deferred Neural Rendering, we propose Neural Gaussian Texture, which stores a learnable feature vector for each Gaussian instead of spherical harmonic coefficients. A 2D neural renderer then transforms the splatted feature map into RGB signals. This yields more informative features than SH coefficients and allows better fusion of the splatted features, facilitating the editing of more complex styles such as Lego and pixel art.
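
A minimal sketch of the idea follows: each Gaussian carries a learnable feature vector, and a small 2D CNN decodes the splatted feature map into RGB. The feature dimension, network sizes, and the commented splat() call are illustrative assumptions, not our exact implementation.

import torch
import torch.nn as nn

FEAT_DIM = 32  # per-Gaussian feature size (illustrative)

class NeuralRenderer(nn.Module):
    """Decodes a splatted feature map into an RGB image."""
    def __init__(self, feat_dim=FEAT_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, feat_map):
        return self.net(feat_map)

# Learnable per-Gaussian features replace SH color coefficients.
gauss_feats = nn.Parameter(torch.randn(100_000, FEAT_DIM) * 0.01)

# feat_map = splat(gauss_feats, means, covs, opacities, cam)  # hypothetical splatter
feat_map = torch.randn(1, FEAT_DIM, 256, 256)  # stand-in for the splatted map
rgb = NeuralRenderer()(feat_map)               # (1, 3, 256, 256) image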

Face-Aware Portrait Editing

When editing an upper-body image in which the face occupies a relatively small portion, the editing model may not be robust to head pose and facial structure. Face-Aware Portrait Editing (FA) enhances awareness of the facial structure by performing the editing twice, as sketched below.
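
Below is a hedged sketch of the two-pass procedure: the full frame is edited once, then a face crop is re-edited and blended back. The hard paste is an illustrative simplification (a soft mask would hide the seam), and edit() stands in for any 2D editing model.

from PIL import Image

def face_aware_edit(frame, face_box, edit):
    # Pass 1: edit the whole upper-body frame.
    edited = edit(frame)
    # Pass 2: re-edit the face crop at a higher effective resolution.
    l, t, r, b = face_box
    face = edit(frame.crop(face_box).resize((512, 512)))
    # Paste the re-edited face back over the first-pass result.
    edited.paste(face.resize((r - l, b - t)), (l, t))
    return edited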

Expression Similarity Guidance

By mapping the rendered image and the input source image into the latent expression space of EMOCA and optimizing for expression similarity, we further keep the expressions natural and consistent with the original input video frames.
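
A minimal sketch of this guidance term follows, where encode() is a hypothetical wrapper around EMOCA's image-to-expression-code encoder (EMOCA's real interface differs).

import torch.nn.functional as F

def expression_similarity_loss(rendered, source, encode):
    z_r = encode(rendered)  # latent expression code of the edited render
    z_s = encode(source)    # latent expression code of the original frame
    return F.mse_loss(z_r, z_s)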

BibTex

@inproceedings{Gao2024PortraitGen,
  title     = {Portrait Video Editing Empowered by Multimodal Generative Priors},
  author    = {Xuan Gao and Haiyao Xiao and Chenglai Zhong and Shimin Hu and Yudong Guo and Juyong Zhang},
  booktitle = {ACM SIGGRAPH Asia Conference Proceedings},
  year      = {2024},
}

Acknowledgements

This research was supported by the National Natural Science Foundation of China (No. 62122071, No. 62272433, No. 62402468), the Fundamental Research Funds for the Central Universities (No. WK3470000021), and the advanced computing resources provided by the Supercomputing Center of the University of Science and Technology of China.

We use the templates released by LAMP and GeneAvatar.