In this paper, we propose an efficient neural-radiance-field-based novel view synthesis method for human performances. Given a monocular self-rotating video of a human performer, our method can be trained from scratch and achieve high-fidelity results in about twenty minutes. Some recent works have utilized neural radiance fields for dynamic human reconstruction. However, most of these methods need multi-view inputs and require hours of training, which still hinders practical use. To address this challenging problem, we introduce a surface-relative representation based on multi-resolution hash encoding that greatly improves training speed and aggregates inter-frame information. Extensive experimental results on several datasets demonstrate the effectiveness and efficiency of our approach on challenging monocular videos.
Overview of our method. Given a sample point at any frame, we first obtain its surface-relative representation, conditioned on the human body surface via a k-nearest-neighbor (KNN) search, which aggregates the corresponding point information across frames. We then apply multi-resolution hash encoding to this representation to obtain a feature, which is the encoded input to the NeRF MLP that regresses color and density.
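To make this pipeline concrete, below is a minimal PyTorch sketch of the three stages it describes: KNN-based surface-relative coordinates, a multi-resolution hash encoding, and a small NeRF MLP head. All names, table sizes, and the distance-weighted KNN aggregation are our own illustrative assumptions for exposition, not the paper's released implementation; in particular, real hash encodings (e.g., Instant-NGP) interpolate trilinearly between grid vertices, whereas this sketch hashes the nearest vertex only.

```python
import torch
import torch.nn as nn

def surface_relative(query, surface_pts, k=4):
    """Express query points relative to their k nearest body-surface points.

    Because the code depends only on offsets to the (posed) surface, the
    same canonical representation is shared across frames, letting
    observations from different frames aggregate. Hypothetical formulation:
    a distance-weighted mean offset.
    """
    d = torch.cdist(query, surface_pts)              # (N, M) pairwise distances
    dist, idx = d.topk(k, largest=False)             # (N, k) nearest neighbors
    w = torch.softmax(-dist, dim=-1)                 # closer neighbors weigh more
    offsets = query[:, None, :] - surface_pts[idx]   # (N, k, 3)
    return (w[..., None] * offsets).sum(dim=1)       # (N, 3) relative coordinate

class HashEncoding(nn.Module):
    """Simplified multi-resolution hash encoding (Instant-NGP style)."""
    def __init__(self, n_levels=16, n_features=2, log2_table_size=19,
                 base_res=16, max_res=2048):
        super().__init__()
        self.tables = nn.ParameterList([
            nn.Parameter(torch.empty(2 ** log2_table_size,
                                     n_features).uniform_(-1e-4, 1e-4))
            for _ in range(n_levels)
        ])
        growth = (max_res / base_res) ** (1.0 / (n_levels - 1))
        self.res = [int(base_res * growth ** l) for l in range(n_levels)]
        # large primes for the spatial hash, as in Instant-NGP
        self.register_buffer("primes",
                             torch.tensor([1, 2654435761, 805459861]))

    def forward(self, x):
        # x: (N, 3), assumed normalized to [0, 1]
        feats = []
        for table, res in zip(self.tables, self.res):
            idx = (x * res).long()                     # grid vertex per level
            h = (idx * self.primes).sum(-1) % table.shape[0]
            feats.append(table[h])                     # (N, n_features)
        return torch.cat(feats, dim=-1)                # (N, n_levels*n_features)

class NeRFHead(nn.Module):
    """Small MLP regressing color and density from the hash feature."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                      # RGB + density
        )

    def forward(self, feat):
        out = self.mlp(feat)
        rgb = torch.sigmoid(out[..., :3])
        sigma = torch.relu(out[..., 3:])
        return rgb, sigma
```

In this sketch, the surface-relative coordinate would be rescaled into the unit cube before hashing (e.g., clamping offsets to a bound around the body surface); the compact hash tables and shallow MLP are what make the per-scene training time short.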
We test our method on multiple datasets: the ZJU-MoCap dataset [Peng et al. 2021], the People-Snapshot dataset [Alldieck et al. 2018], and a dataset we collected ourselves.
We compare our method with state-of-the-art implicit human novel view synthesis methods.
We evaluate the choice of the human surface model (SMPL or SelfRecon) in our algorithm.