HERA: Hybrid Explicit Representation

for Ultra-Realistic Head Avatars

CVPR 2025

Hongrui Cai1*     Yuting Xiao2*     Xuan Wang3†     Jiafei Li3     Yudong Guo1     Yanbo Fan4     Shenghua Gao5     Juyong Zhang1†

1University of Science and Technology of China     2ShanghaiTech University    
3Xi’an Jiaotong University     4Nanjing University     5University of Hong Kong    

*Equal contribution     †Corresponding authors

TL;DR: We propose a hybrid explicit representation for modeling ultra-realistic head avatars.

Abstract

We introduce a novel approach to creating ultra-realistic head avatars and rendering them in real time (≥ 30 fps at 2048 × 1334 resolution). First, we propose a hybrid explicit representation that combines the advantages of two primitive-based efficient rendering techniques. A UV-mapped 3D mesh captures sharp, rich textures on smooth surfaces, while 3D Gaussian Splatting represents complex geometric structures. In the avatar-modeling pipeline, after tracking parametric models from captured multi-view RGB videos, our goal is to simultaneously optimize the texture and opacity maps of the mesh as well as a set of 3D Gaussian splats localized and rigged onto the mesh facets. Specifically, we perform α-blending on the color and opacity values based on the merged and re-ordered z-buffer from the rasterization results of the mesh and 3DGS. Through this process, the mesh and 3DGS adaptively fit the captured visual information to outline a high-fidelity digital avatar. To avoid artifacts caused by Gaussian splats crossing the mesh facets, we design a stable hybrid depth sorting strategy. Experiments show that our modeled results exceed those of state-of-the-art approaches.
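The per-pixel α-blending over the merged, re-ordered z-buffer can be sketched as follows. This is an illustrative NumPy sketch, not the paper's renderer: the fragment tuples, their ordering, and the early-termination threshold are assumptions.

```python
import numpy as np

def alpha_blend_merged(fragments):
    """Front-to-back alpha-blending of one pixel's merged fragments.

    `fragments` is a list of (depth, color, opacity) tuples for a single
    pixel, combining the rasterized mesh fragment (texture color plus
    opacity-map value) with the projected Gaussian splats covering that
    pixel. Sorting by depth merges the two z-buffers before compositing.
    """
    out = np.zeros(3)
    transmittance = 1.0
    for depth, color, alpha in sorted(fragments, key=lambda f: f[0]):
        out += transmittance * alpha * np.asarray(color, dtype=float)
        transmittance *= (1.0 - alpha)
        if transmittance < 1e-4:  # early termination once nearly opaque
            break
    return out
```

For example, a fully opaque mesh fragment at depth 2.0 behind two half-transparent splats at depths 1.0 and 1.5 composites to a mix of all three colors, with the mesh absorbing the remaining transmittance.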

Method

The overall pipeline of the proposed HERA. In the canonical space, there is a mesh with a texture UV map (visualized in RGB format) and an opacity UV map, along with several Gaussian splats defined in the local coordinate systems of the mesh facets. During animation, the positions of the mesh vertices change, causing the rigged splats to move accordingly. Under the camera view, both the mesh and the Gaussian splats are rasterized using the proposed hybrid approach, and the image is rendered through α-blending. The entire pipeline is fully differentiable. Guided by the captured images, the texture map and the opacity map are optimized while the rigged Gaussian splats are updated and densified simultaneously.
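Rigging a splat in a facet's local coordinate system can be sketched as below. The facet frame construction and the choice of the centroid as the local origin are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def facet_frame(v0, v1, v2):
    """Orthonormal local frame (tangent, bitangent, normal) of a triangle."""
    t = v1 - v0
    t = t / np.linalg.norm(t)
    n = np.cross(v1 - v0, v2 - v0)
    n = n / np.linalg.norm(n)
    b = np.cross(n, t)
    return np.stack([t, b, n], axis=1)  # columns are the frame axes

def rig_splat(local_pos, v0, v1, v2):
    """Map a splat position from facet-local to world coordinates.

    During animation the vertex positions change; re-evaluating this
    with the deformed vertices moves the rigged splat accordingly.
    """
    R = facet_frame(v0, v1, v2)
    origin = (v0 + v1 + v2) / 3.0  # facet centroid as local origin (an assumption)
    return origin + R @ np.asarray(local_pos, dtype=float)
```

Because the local coordinates are fixed while the frame follows the deforming mesh, the splats track the facets through expression changes without per-frame re-optimization.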

Hybrid Depth Sorting

To sort the depths of different geometric primitives, we propose a strategy that prevents Gaussian splats from crossing mesh facets. Instead of (a) directly sorting the depth of each Gaussian splat (a single per-splat value) against the mesh depth (a per-pixel value), (b) we compare the depth of the projected point with the splat depth, yielding a stable ordering.

Free Viewpoint Video

Our HERA renders an avatar at a resolution of 2048 × 1334 with a frame rate of approximately 81 FPS and achieves a PSNR of 34.0 ± 0.5 dB.

Comparisons

We compare our HERA with state-of-the-art methods of animatable avatars.

Novel View Synthesis

Novel Expression Animation

More Evaluations of Hybrid Representation

We conduct comparisons of our proposed hybrid representation on novel view synthesis in static scenes and dynamic scenes.

NVS in static scenes

NVS in dynamic scenes

BibTeX

If you find HERA useful for your work, please cite:

@inproceedings{Cai2025HERA,
  author    = {Cai, Hongrui and Xiao, Yuting and Wang, Xuan and Li, Jiafei and Guo, Yudong and Fan, Yanbo and Gao, Shenghua and Zhang, Juyong},
  title     = {{HERA}: Hybrid Explicit Representation for Ultra-Realistic Head Avatars},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025}
}

Acknowledgements

This research was supported by the National Natural Science Foundation of China (Nos. 62441224 and 62272433).