HumanGif: Single-View Human Diffusion with Generative Prior

  • Sony AI
  • Sony Group Corporation
TL;DR: HumanGif learns a single-view human diffusion model with a generative prior.
Abstract
Previous 3D human creation methods have made significant progress in synthesizing view-consistent and temporally aligned results from sparse-view images or monocular videos. However, it remains challenging to produce perceptually realistic, view-consistent, and temporally coherent human avatars from a single image, as only limited information is available in the single-view input setting. Motivated by the success of 2D character animation, we propose HumanGif, a single-view human diffusion model with a generative prior. Specifically, we formulate single-view-based 3D human novel view and pose synthesis as a single-view-conditioned human diffusion process, utilizing generative priors from foundational diffusion models to complement the missing information. To ensure fine-grained and consistent novel view and pose synthesis, we introduce a Human NeRF module in HumanGif that learns spatially aligned features from the input image, implicitly capturing the relative camera and human pose transformation. Furthermore, we introduce an image-level loss during optimization to bridge the gap between the latent and image spaces of diffusion models. Extensive experiments on the RenderPeople and DNA-Rendering datasets demonstrate that HumanGif achieves the best perceptual performance, with better generalizability to novel view and pose synthesis.
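The image-level loss mentioned above can be understood as decoding the predicted clean latent back to pixel space and supervising it against the ground-truth frame, in addition to the usual latent-space noise loss. The following is a minimal sketch of that idea, assuming an epsilon-prediction latent diffusion setup and a Stable-Diffusion-style VAE (with the common 0.18215 latent scaling); the function name and loss weight are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def image_level_loss(pred_noise, noise, noisy_latent, alpha_bar_t, vae, target_image):
    """Sketch of combining a latent-space diffusion loss with an image-level loss.

    Assumptions: epsilon-prediction parameterization, alpha_bar_t broadcastable
    to the latent shape, and an SD-style VAE whose decode() returns .sample.
    """
    # Standard latent-space noise-prediction loss
    latent_loss = F.mse_loss(pred_noise, noise)

    # Estimate the clean latent x0 from the noisy latent and predicted noise:
    # x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    pred_x0 = (noisy_latent - torch.sqrt(1 - alpha_bar_t) * pred_noise) / torch.sqrt(alpha_bar_t)

    # Decode to image space and supervise against the ground-truth frame
    pred_image = vae.decode(pred_x0 / 0.18215).sample
    pixel_loss = F.l1_loss(pred_image, target_image)

    # The 0.1 weight is an illustrative choice, not a reported hyperparameter
    return latent_loss + 0.1 * pixel_loss
```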
Method Overview
Figure 1. HumanGif Framework.

Given a single input human image and a target pose sequence, HumanGif produces a sequence of target images that preserve the human subject in the input image and follow the target human poses. To synthesize view-consistent and pose-consistent outputs, HumanGif incorporates a generative prior, a Human NeRF module, an image-level loss, multi-view attention, and temporal attention, as sketched below.
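The sketch below shows one plausible way the listed components could be wired together in a single denoising step. All module and argument names (HumanGifSketch, ref_encoder, nerf_features, and so on) are placeholders chosen for illustration under these assumptions; they are not the released code.

```python
import torch.nn as nn

class HumanGifSketch(nn.Module):
    """Hypothetical wiring of the components named above; module names and
    tensor shapes are illustrative, not the authors' implementation."""

    def __init__(self, nerf, ref_encoder, denoiser):
        super().__init__()
        self.nerf = nerf                # Human NeRF: input image -> spatially aligned features
        self.ref_encoder = ref_encoder  # encodes the reference image for appearance guidance
        self.denoiser = denoiser        # video diffusion UNet with multi-view / temporal attention

    def forward(self, noisy_latents, timesteps, input_image, target_poses, target_cameras):
        # NeRF features rendered under the target cameras and poses give the
        # denoiser spatially aligned guidance for each output frame.
        nerf_feats = self.nerf(input_image, target_poses, target_cameras)  # (B, F, C, h, w)

        # Reference features preserve the subject's identity and appearance.
        ref_feats = self.ref_encoder(input_image)                          # (B, C, h, w)

        # The UNet denoises all frames jointly; its multi-view and temporal
        # attention layers share information across views and time steps.
        return self.denoiser(
            noisy_latents,
            timesteps,
            reference=ref_feats,
            nerf_features=nerf_feats,
            pose=target_poses,
        )
```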

Demo Video