SHERF: Generalizable Human NeRF from a Single Image

  • ¹S-Lab, Nanyang Technological University
  • ²SenseTime Research
  • *Equal Contribution
  • ✉Corresponding Author
TL;DR: SHERF learns a Generalizable Human NeRF to animate 3D humans from a single image.
Existing Human NeRF methods for reconstructing 3D humans typically rely on multiple 2D images from multi-view cameras or monocular videos captured from fixed camera views. However, in real-world scenarios, human images are often captured from random camera angles, presenting challenges for high-quality 3D human reconstruction. In this paper, we propose SHERF, the first generalizable Human NeRF model for recovering animatable 3D humans from a single input image. SHERF extracts and encodes 3D human representations in canonical space, enabling rendering and animation from free views and poses. To achieve high-fidelity novel view and pose synthesis, the encoded 3D human representations should capture both global appearance and local fine-grained textures. To this end, we propose a bank of 3D-aware hierarchical features, including global, point-level, and pixel-aligned features, to facilitate informative encoding. Global features enhance the information extracted from the single input image and complement the information missing from the partial 2D observation. Point-level features provide strong clues of 3D human structure, while pixel-aligned features preserve more fine-grained details. To effectively integrate the 3D-aware hierarchical feature bank, we design a feature fusion transformer. Extensive experiments on THuman, RenderPeople, ZJU_MoCap, and HuMMan datasets demonstrate that SHERF achieves state-of-the-art performance, with better generalizability for novel view and pose synthesis.
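As an illustration of what a pixel-aligned feature lookup might involve, the sketch below projects 3D points into the input image and bilinearly samples a 2D feature map at the projected locations. The text above does not spell out this procedure, so the pinhole projection, the function names `project_points` and `bilinear_sample`, and the `(H, W, C)` feature layout are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def project_points(points_cam, K):
    """Project (N, 3) camera-space points to (N, 2) pixel coords (assumed pinhole model)."""
    uvw = points_cam @ K.T               # row-wise K @ p
    return uvw[:, :2] / uvw[:, 2:3]      # perspective divide

def bilinear_sample(feat, uv):
    """Sample a (H, W, C) feature map at continuous pixel coords uv = (x, y), shape (N, 2)."""
    H, W, _ = feat.shape
    # Clamp to the image so out-of-view points read the border feature.
    x = np.clip(uv[:, 0], 0, W - 1)
    y = np.clip(uv[:, 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    # Interpolate along x on the two neighboring rows, then along y.
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy
```

In this sketch each 3D point receives the feature of the pixel it projects to, which is one way local texture detail from the input image could be attached to points in space.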
Method Overview
Figure 1. SHERF Framework.

To render the target image, we first cast rays and sample points in the target space. The sample points are transformed to the canonical space through inverse linear blend skinning (LBS). We then query the corresponding 3D-aware global, point-level, and pixel-aligned features. The deformed points, together with the bank of features, are fed into the feature fusion transformer and NeRF decoder to predict RGB color and density, which are then composited into the target image through volume rendering.
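Two steps of this pipeline, inverse LBS and volume rendering, can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the per-point blend weights are taken as given (how they are assigned to sample points is not described here), the bone transforms are assumed to map canonical space to target space, and the function names `inverse_lbs` and `volume_render` are hypothetical. The compositing follows the standard NeRF quadrature rather than any SHERF-specific code.

```python
import numpy as np

def inverse_lbs(points, weights, joint_transforms):
    """Map (N, 3) target-space points back to canonical space.

    weights:          (N, J) blend weights, each row summing to 1 (assumed given).
    joint_transforms: (J, 4, 4) canonical-to-target bone transforms (assumption).
    """
    # Blend the bone transforms per point, then invert the blended transform.
    blended = np.einsum("nj,jab->nab", weights, joint_transforms)        # (N, 4, 4)
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)   # (N, 4)
    canonical = np.linalg.solve(blended, homo[..., None])[..., 0]        # (N, 4)
    return canonical[:, :3]

def volume_render(rgb, sigma, deltas):
    """Composite per-sample color and density along each ray.

    rgb: (R, S, 3), sigma: (R, S), deltas: (R, S) spacing between samples.
    Returns (R, 3) pixel colors and (R, S) per-sample weights.
    """
    alpha = 1.0 - np.exp(-sigma * deltas)              # opacity of each sample
    trans = np.cumprod(1.0 - alpha + 1e-10, axis=1)    # light surviving past each sample
    # Shift so that the first sample sees full transmittance.
    trans = np.concatenate([np.ones_like(trans[:, :1]), trans[:, :-1]], axis=1)
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(axis=1), weights
```

In a full model, the canonical points returned by `inverse_lbs` would be used to query the feature bank, and the decoder outputs would supply `rgb` and `sigma` for `volume_render`.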

Citation
Hu, Shoukang; Hong, Fangzhou; Pan, Liang; Mei, Haiyi; Yang, Lei; Liu, Ziwei. SHERF: Generalizable Human NeRF from a Single Image. arXiv preprint.
Wonderful Human Generation

There are lots of wonderful related works that might be of interest to you.

3D Human Generation

+ EVA3D is a high-quality unconditional 3D human generative model that only requires 2D image collections for training.

+ AvatarCLIP generates and animates diverse 3D avatars given descriptions of body shapes, appearances, and motions in a zero-shot way.

2D Human Generation

+ Text2Human proposes a text-driven controllable human image generation framework.

+ StyleGAN-Human scales up a high-quality 2D human dataset and achieves impressive 2D human generation results.

Motion Generation

+ MotionDiffuse is the first diffusion-model-based text-driven motion generation framework with probabilistic mapping, realistic synthesis and multi-level manipulation ability.

+ Bailando applies an actor-critic-based reinforcement learning scheme to a GPT to achieve synchronized alignment between diverse motion tempos and music beats.