We present a method that takes as input as few as three images of a person’s face – of arbitrary identity, expression, and lighting condition – and reconstructs a personalized 3D face model that can render high-quality, photorealistic novel views of that person, including fine details like freckles, wrinkles, eyelashes, and teeth.
To overcome reconstruction ambiguities, our method uses a pre-trained volumetric face model as a prior, trained on a large dataset of synthetic faces with a variety of identities, expressions, and viewpoints rendered in a single environment. At inference time, our method first fits the coefficients of our prior model to a small set of real input images. It then further fine-tunes the model weights, effectively performing domain adaptation during the few-shot reconstruction process (see Fig. 3). While the prior model is trained only once on a large collection of synthetic face images, the inference-time optimization is performed on a per-subject basis from as few as three (e.g., smartphone) images captured in the wild (see Fig. 9).
3.1 Background: NeRF and Preface
NeRF. Our prior face model is built upon Neural Radiance Fields (NeRF) [Mildenhall et al. 2020]. NeRFs represent 3D objects as density and (emissive) radiance fields parameterized by neural networks. Given a camera ray \(\mathbf{r}\), a NeRF samples 3D points \(\mathbf{x}\) along the ray, which are fed together with the view direction \(\mathbf{d}\) into an MLP. The output is the corresponding density \(\sigma\) and color value \(\mathbf{c}\) at \(\mathbf{x}\). A NeRF is rendered into any view via volumetric rendering. The color \(\mathbf{c}(\mathbf{p})\) of a pixel \(\mathbf{p}\) is determined by compositing the density and color along the camera ray \(\mathbf{r}\) within an interval between a near and a far camera plane \([t_n, t_f]\):
\[
\mathbf{c}(\mathbf{p}) = \int_{t_n}^{t_f} T(t)\,\sigma_{\theta}\big(\mathbf{r}(t)\big)\,\mathbf{c}_{\theta}\big(\mathbf{r}(t), \mathbf{d}\big)\,\mathrm{d}t, \tag{1}
\]
with accumulated transmittance
\[
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma_{\theta}\big(\mathbf{r}(s)\big)\,\mathrm{d}s\right). \tag{2}
\]
The variable \(\theta\) denotes the model parameters that are fit to the input data.
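For intuition, the discrete quadrature of Eq. 1 used when rendering a ray can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the MLP outputs are replaced by placeholder density and color arrays, and Mip-NeRF 360 details such as conical frustums are omitted.

```python
import numpy as np

def composite_ray(sigmas, colors, t_vals):
    """Discrete volume rendering along one ray (quadrature of Eq. 1).

    sigmas: (S,) densities at the sampled points, colors: (S, 3) radiance values,
    t_vals: (S + 1,) sample-interval boundaries within [t_n, t_f].
    """
    deltas = np.diff(t_vals)                        # interval lengths
    alphas = 1.0 - np.exp(-sigmas * deltas)         # opacity contributed by each interval
    trans = np.cumprod(1.0 - alphas + 1e-10)        # accumulated transmittance T(t)
    trans = np.concatenate([[1.0], trans[:-1]])     # shift so T is measured *before* each sample
    weights = trans * alphas                        # compositing weights
    return (weights[:, None] * colors).sum(axis=0)  # pixel color c(p)

# Toy usage: 64 random samples between the near and far plane of one ray.
t = np.linspace(2.0, 6.0, 65)
rgb = composite_ray(np.random.rand(64), np.random.rand(64, 3), t)
```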
Preface. Our method also extends Preface [Buehler et al. 2023], a method for novel view synthesis of neutral faces from sparse inputs. Besides the position \(\mathbf{x}\) and view direction \(\mathbf{d}\), Preface also takes a learned latent code \(\mathbf{w}\) as input. The latent code represents the identity and is optimized while training the model as an auto-decoder [Bojanowski et al. 2018]. During inference, Preface first projects the sparse input images into its latent space by optimizing a single identity code \(\mathbf{w}\). It then fine-tunes all model parameters under regularization constraints on the predicted normals and the view weights. While Preface excels at high-resolution novel view synthesis of neutral faces, it struggles in the presence of strong, idiosyncratic expressions (Fig. 7). In the following, we address this limitation while building an improved prior from synthetic images alone.
3.2 Pretraining an Expressive Prior Model
We train a prior model to capture the distribution of human heads with arbitrary facial expressions. Like Preface [Buehler et al. 2023], our prior model is implemented as a conditional Neural Radiance Field [Mildenhall et al. 2020; Rebain et al. 2022] with a Mip-NeRF 360 backbone [Barron et al. 2022] (Fig. 4). However, we observe that the simple Preface auto-decoder [Bojanowski et al. 2018; Rebain et al. 2022] cannot achieve high-quality fitting results on expressive faces (see the ablation in Sec. 4.3). We hypothesize that the distribution of expressive faces is much more difficult to model and disentangle than the distribution of neutral faces. To address this limitation, we decompose the latent space of our prior model into three latent spaces: a predefined identity space \(\mathcal{B}\), a predefined expression space \(\Psi\), and a learned latent space \(\mathcal{W}\). The identity and expression spaces come from a linear 3D Morphable Model (3DMM) in the style of Wood et al. [2021]. The latent codes in these two spaces are known a priori and represent the face shape for the arbitrary identity and expression in each synthetic training image. These codes are frozen and do not change during training. The latent space \(\mathcal{W}\) represents characteristics that are not modeled by the 3DMM, such as hair, beard, clothing, glasses, appearance, and lighting, and is learned while training the auto-decoder as in Preface [Buehler et al. 2023]. With this conditioning, we adapt Eq. 1 to obtain:
\[
\mathbf{c}(\mathbf{p} \mid \boldsymbol{\beta}, \boldsymbol{\psi}, \mathbf{w}) = \int_{t_n}^{t_f} T(t)\,\sigma_{\theta_\text{p}}\big(\mathbf{r}(t), \boldsymbol{\beta}, \boldsymbol{\psi}, \mathbf{w}\big)\,\mathbf{c}_{\theta_\text{p}}\big(\mathbf{r}(t), \mathbf{d}, \boldsymbol{\beta}, \boldsymbol{\psi}, \mathbf{w}\big)\,\mathrm{d}t, \tag{3}
\]
where \(\boldsymbol{\beta} \in \mathcal{B} \subset \mathbb{R}^{48}\) and \(\boldsymbol{\psi} \in \Psi \subset \mathbb{R}^{157}\) are the 3DMM identity and expression parameters, and \(\mathbf{w} \in \mathcal{W} \subset \mathbb{R}^{64}\) is a learned parameter encoding additional characteristics.
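To make the conditioning concrete, the sketch below shows one plausible way to expose the decomposed latent spaces as inputs of the field MLP. This is a hypothetical PyTorch interface; the layer sizes, positional-encoding dimensions, and head structure are illustrative assumptions, not the paper's Mip-NeRF 360 architecture.

```python
import torch
import torch.nn as nn

class ConditionalFaceField(nn.Module):
    """Field MLP conditioned on frozen 3DMM codes (beta, psi) and a learned code w."""

    def __init__(self, pos_dim=63, dir_dim=27, beta_dim=48, psi_dim=157, w_dim=64, hidden=256):
        super().__init__()
        cond_dim = beta_dim + psi_dim + w_dim
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, x_enc, d_enc, beta, psi, w):
        # x_enc / d_enc: positionally encoded point and view direction;
        # beta, psi: frozen 3DMM identity/expression codes; w: learned latent code.
        h = self.trunk(torch.cat([x_enc, beta, psi, w], dim=-1))
        sigma = torch.nn.functional.softplus(self.density_head(h))
        rgb = self.color_head(torch.cat([h, d_enc], dim=-1))
        return sigma, rgb
```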
We train this prior on synthetic data alone – it never sees a real face. While it would be feasible to train a prior model on real data (see our ablation in Tbl. 2), we chose synthetic over real for multiple reasons. Real datasets exhibit limited diversity: most face datasets feature monocular frontal views only, with few expressions other than smiles. Some multi-view, multi-expression datasets exist [Kirschstein et al. 2023; Wuu et al. 2022; Zhu et al. 2023], but they consist of relatively few individuals due to the complexity and expense of running a capture studio. Further, subjects must adhere to wardrobe restrictions: glasses are forbidden and hair must be tucked away. A prior trained on such data will not generalize well to expressive faces captured in the wild. Besides, the logistics of capturing large-scale real data are extremely expensive, time- and energy-consuming, and cumbersome. Instead, synthetics guarantee a wide range of identity, expression, and appearance diversity at orders of magnitude lower cost and effort. In addition, synthetics provide perfect ground-truth annotations: each render is accompanied by its 3DMM latent codes \(\boldsymbol{\beta}\) and \(\boldsymbol{\psi}\).
We synthesize facial training data as in Wood et al. [2021]. We first generate the 3D face geometry by sampling the identity and expression spaces of the 3DMM. We then make these faces look realistic by applying physically based skin materials, attaching strand-based hairstyles, and dressing them up with clothes and glasses from our digital wardrobe. The scene is rendered with environment lighting using Cycles, a physically based ray tracer (www.cycles-renderer.org). Examples are shown in Fig. 5 and on the supplementary HTML page. To help disentangle identity from expression, we sample 13 different random expressions for each random identity. Each expression is then rendered from 30 random viewpoints around the head. All faces are rendered under the same lighting condition, which was chosen to minimize shadows on the face.
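The sampling layout described above (13 random expressions per identity, each rendered from 30 random viewpoints, all under a shared environment light) could be scripted roughly as follows. The sampling functions here are simplistic stand-ins for the 3DMM sampling and the Cycles rendering pipeline; they are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N_EXPRESSIONS, N_VIEWS = 13, 30   # per identity, as described in the text

def sample_identity():             # stand-in: draw a 3DMM identity code in R^48
    return rng.normal(size=48)

def sample_expression():           # stand-in: draw a 3DMM expression code in R^157
    return rng.normal(size=157)

def sample_viewpoint():            # stand-in: random camera position on a sphere around the head
    v = rng.normal(size=3)
    return 1.2 * v / np.linalg.norm(v)

def build_render_jobs(n_identities):
    """Enumerate (identity, expression, viewpoint) triplets sharing one environment light."""
    jobs = []
    for _ in range(n_identities):
        beta = sample_identity()
        for _ in range(N_EXPRESSIONS):      # multiple expressions help disentangle identity from expression
            psi = sample_expression()
            for _ in range(N_VIEWS):        # 30 random viewpoints per expression
                jobs.append({"beta": beta, "psi": psi,
                             "camera": sample_viewpoint(),
                             "lighting": "shared_environment"})
    return jobs
```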
Each training iteration randomly samples rays \(\mathbf{r}\) from a subset of all identities and expressions. A ray is rendered into a pixel color as given by Eq. 3. We optimize the network parameters \(\theta_\text{p}\) and \(N\) per-identity latent codes \(\mathbf{w}_{1..N}\) while keeping the 3DMM identity and expression codes \(\boldsymbol{\beta}\) and \(\boldsymbol{\psi}\) frozen:
\[
\theta_\text{p}^{*}, \mathbf{w}_{1..N}^{*} = \operatorname*{arg\,min}_{\theta_\text{p},\, \mathbf{w}_{1..N}} \; \sum_{\mathbf{r}} \mathcal{L}_\text{recon} + \mathcal{L}_\text{prop}. \tag{4}
\]
Here, \(\mathcal{L}_\text{recon}\) is the mean absolute error between the predicted and ground-truth colors, and \(\mathcal{L}_\text{prop}\) is the weight-distribution-matching loss from Mip-NeRF [Barron et al. 2021]. Please see Sec. 2 in the supp. PDF for the spelled-out loss terms.
We find that, when trained from scratch, the model quickly collapses and outputs zero densities everywhere. We resolve this by first training on images with background for 50,000 steps and then continuing without background.
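A condensed sketch of the prior-training loop (Eq. 4) might look like the following. Here, `render_rays` wraps the conditional field and volume rendering, `proposal_loss` stands for the Mip-NeRF proposal supervision, and the two-phase background schedule reflects the collapse workaround described above; the function names, optimizer, and learning rate are assumptions rather than the paper's code.

```python
import torch

def train_prior(model, w_codes, dataset, steps=500_000, warmup_with_background=50_000):
    """Optimize the prior parameters theta_p and N per-identity codes w; beta/psi stay frozen."""
    opt = torch.optim.Adam(list(model.parameters()) + [w_codes], lr=1e-3)
    for step in range(steps):
        # First 50k steps keep the background to avoid densities collapsing to zero.
        keep_background = step < warmup_with_background
        rays, target_rgb, beta, psi, idx = dataset.sample_rays(with_background=keep_background)
        pred_rgb, aux = model.render_rays(rays, beta, psi, w_codes[idx])  # Eq. 3
        loss = (pred_rgb - target_rgb).abs().mean()      # L_recon (mean absolute error)
        loss = loss + model.proposal_loss(aux)           # L_prop from Mip-NeRF
        opt.zero_grad()
        loss.backward()
        opt.step()
```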
3.3 Inference from Sparse Views
We use our low-resolution synthetic prior model to enable high-resolution novel view synthesis of real expressive faces from a few input images. We first describe how we obtain the conditional inputs and camera parameters in Sec. 3.3.1, followed by the fine-tuning of our model in Sec. 3.3.2.
3.3.1 3DMM Fitting and Camera Estimation.
Figure 3 gives an overview of the 3DMM fitting. During inference, the first step is to recover camera and 3DMM parameters from uncalibrated input images. We follow the approach of previous work [Wood et al. 2022] and fit to dense 2D landmarks (see Fig. 6). We first predict 599 probabilistic 2D landmarks. Each landmark corresponds to a vertex in our 3DMM and is predicted as a 2D isotropic Gaussian with expected 2D location \(\boldsymbol{\mu}\) and scalar uncertainty \(\sigma\). Next, we minimize an energy \(E(\Phi; L)\), where \(L\) denotes the landmarks and \(\Phi\) all optimized parameters, including 3DMM identity and expression, joint rotations, global translation, and, if unknown, intrinsic and extrinsic camera parameters. Minimizing \(E\) encourages the 3DMM to explain the observed landmarks via a probabilistic 2D landmark energy, and discourages unlikely faces using regularizers on the 3DMM parameters and mesh self-intersections (additional detail is given in [Wood et al. 2022] and in Sec. 2 of the supp. mat.).
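As a rough illustration, the probabilistic landmark data term can be thought of as a Gaussian negative log-likelihood over the projected 3DMM vertices. The sketch below assumes the projections under the current parameters \(\Phi\) are already available and omits the regularizers and the self-intersection term; the exact form and weighting follow [Wood et al. 2022].

```python
import numpy as np

def landmark_energy(projected, mu, sigma):
    """Probabilistic 2D landmark data term (schematic).

    projected: (599, 2) projections of the corresponding 3DMM vertices,
    mu: (599, 2) predicted 2D landmark means, sigma: (599,) scalar uncertainties.
    """
    sq_err = np.sum((projected - mu) ** 2, axis=-1)
    # Isotropic Gaussian negative log-likelihood: confident landmarks (small sigma) weigh more.
    return np.sum(sq_err / (2.0 * sigma ** 2) + 2.0 * np.log(sigma))
```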
The benefit of the 3DMM fitting is twofold. First, we obtain a good estimate of the world positions of the camera and the head, so that the camera parameters can be frozen during the later inversion and fine-tuning of the model. Second, thanks to the alignment between the 3DMM latent space and our prior model's latent space, we can directly feed the 3DMM parameters into the model, which serves as a good initialization for the subsequent inversion stage.
The outputs of this step are the camera parameters (shared intrinsics \(\mathbf{K}\) and per-camera extrinsics \([\mathbf{R}_i \mid \mathbf{t}_i]\)), a shared identity code \(\boldsymbol{\beta}\), and per-image expression codes \(\boldsymbol{\psi}_i\). For casual in-the-wild captures, it can be challenging to hold the same expression while the data is being captured. Therefore, we allow the expression codes to vary slightly between images, which makes the inversion robust to small, involuntary micro-changes in expression. In the studio setting, however, the cameras are synchronized, and hence it is sufficient to optimize a single, shared expression code.
3.3.2 Fine-tuning on Sparse Views.
This section describes how to fine-tune the low-resolution synthetic prior model on sparse, high-resolution real input images. Fine-tuning requires a short warm-up phase in which only the latent code for the target \(\mathbf{w}_\text{target}\) is optimized. After that, fine-tuning optimizes all model parameters under additional constraints on the geometry and the appearance weights. We randomly sample rays from all available inputs, typically three images, and mask them to the foreground by multiplying with an estimated foreground mask [Pandey et al. 2021].
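The foreground-masked ray sampling can be sketched as below. This assumes per-image foreground mattes in [0, 1] and that the mask simply multiplies the supervision targets so that background rays carry no signal; both are assumptions about the setup rather than the exact implementation.

```python
import torch

def sample_foreground_rays(images, masks, rays_per_image=1024):
    """Randomly sample pixels from each input image and mask the targets to the foreground.

    images: list of (H, W, 3) tensors, masks: list of (H, W) foreground mattes in [0, 1].
    """
    pixels, targets = [], []
    for img, m in zip(images, masks):
        h, w, _ = img.shape
        ys = torch.randint(0, h, (rays_per_image,))
        xs = torch.randint(0, w, (rays_per_image,))
        pixels.append(torch.stack([ys, xs], dim=-1))
        targets.append(img[ys, xs] * m[ys, xs, None])   # background pixels become zero
    return pixels, targets
```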
Warm-up by Latent Code Inversion. While the 3DMM fitting provides the identity and expression codes \(\boldsymbol{\beta}, \boldsymbol{\psi}_i\), our model also requires the conditional input \(\mathbf{w}\), which models out-of-model characteristics like hair, clothing, and appearance. We follow [Buehler et al. 2023] and search the learned latent space \(\mathcal{W}\) of the prior model for a latent code that roughly matches the geometry and appearance of the input images. We downscale the three input images to the resolution of the prior model, sample random patches, and optimize
\[
\mathbf{w}_\text{target} = \operatorname*{arg\,min}_{\mathbf{w} \in \mathcal{W}} \; \mathcal{L}_\text{recon} + \lambda_\text{LPIPS}\, \mathcal{L}_\text{LPIPS}. \tag{5}
\]
The photometric reconstruction term \(\mathcal{L}_\text{recon}\) is the mean absolute error between the rendered and the ground-truth patch, and \(\mathcal{L}_\text{LPIPS}\) is a perceptual loss in the feature space of a pre-trained image classifier [Simonyan and Zisserman 2015; Zhang et al. 2018]. Note that the LPIPS loss is only employed during inversion, not during model fitting. The camera, 3DMM identity, and expression parameters are frozen during inversion.
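The warm-up thus reduces to optimizing a single latent code against downscaled patches. A possible sketch is shown below; the LPIPS weight, step count, and the `render_patch` call are assumptions for illustration, while `lpips` refers to the publicly available LPIPS package of Zhang et al. [2018].

```python
import torch
import lpips  # perceptual loss in the feature space of a pre-trained classifier

def invert_latent(model, beta, psi_list, patches, cams, steps=2000, lam=0.2):
    """Optimize only the target code w; model weights, cameras, and 3DMM codes stay frozen."""
    w = torch.zeros(1, 64, requires_grad=True)           # code in the learned latent space W
    perceptual = lpips.LPIPS(net="vgg")
    opt = torch.optim.Adam([w], lr=1e-2)
    for _ in range(steps):
        i = torch.randint(len(patches), (1,)).item()     # pick a random low-resolution patch
        pred = model.render_patch(cams[i], beta, psi_list[i], w)   # hypothetical renderer call
        recon = (pred - patches[i]).abs().mean()                    # L_recon
        perc = perceptual(pred.permute(2, 0, 1)[None] * 2 - 1,      # LPIPS expects [-1, 1] images
                          patches[i].permute(2, 0, 1)[None] * 2 - 1).mean()
        loss = recon + lam * perc
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```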
Model Fitting. The output of the warm-up is a rough approximation of the input images in the low-resolution synthetic domain. In model fitting, we cross a substantial domain gap to enable detailed novel view synthesis at high resolution for realistic faces, so that the output can contain details that have never been seen during prior-model training.
Model fitting optimizes all model parameters on sparse input images, usually two or three. In a randomly initialized NeRF [Mildenhall et al. 2020], this optimization would overfit and fail to produce correct novel views [Buehler et al. 2023; Truong et al. 2023; Yang et al. 2023]. Thanks to our pretrained prior model, we can employ both implicit and explicit regularization, which yields high-quality results even in such sparse settings.
Implicit regularization comes from the fact that our prior model is trained on an aligned dataset of human faces: initializing a NeRF with the weights of the prior model and the correct latent code avoids total collapse. However, the optimization can still produce strong artifacts like duplicate ears and view-dependent color distortions. We therefore follow [Buehler et al. 2023] and add explicit regularization on the consistency of predicted vs. analytical normals, as well as an L2 regularization on the weights of the view branch to avoid view-dependent flickering. In addition, we add a distortion loss term \(\mathcal{L}_\text{dist}\) [Barron et al. 2022] for a more compact geometry:
\[
\theta^{*} = \operatorname*{arg\,min}_{\theta} \; \mathcal{L}_\text{recon} + \mathcal{L}_\text{prop} + \lambda_\text{normal}\,\mathcal{L}_\text{normal} + \lambda_{d}\,\mathcal{L}_{d} + \lambda_\text{dist}\,\mathcal{L}_\text{dist}, \tag{6}
\]
where \(\mathcal{L}_\text{normal}\) and \(\mathcal{L}_{d}\) are the regularizers on predicted normals and view weights from Preface. We list and explain all loss terms in more detail in Sec. 2 of the supp. PDF.
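Putting the regularizers together, one fine-tuning step could be assembled as follows (Eq. 6). The loss weights and the helper functions for the normal, view-branch, and distortion terms are placeholders for the terms detailed in the supplementary PDF, not the exact implementation.

```python
import torch

def finetune_step(model, opt, batch, lam_normal=1e-3, lam_view=1e-2, lam_dist=1e-3):
    """One optimization step over all model parameters on the sparse real inputs."""
    out = model.render_rays(batch["rays"], batch["beta"], batch["psi"], batch["w"])
    loss = (out["rgb"] - batch["rgb"]).abs().mean()                   # L_recon
    loss = loss + model.proposal_loss(out)                            # L_prop (Mip-NeRF)
    loss = loss + lam_normal * (out["pred_normals"] -
                                out["analytic_normals"]).square().sum(-1).mean()  # L_normal
    loss = loss + lam_view * model.view_branch_weight_norm()          # L_d: L2 on view-branch weights
    loss = loss + lam_dist * model.distortion_loss(out)               # L_dist [Barron et al. 2022]
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```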
Inference In-the-wild. For in-the-wild (ITW) images, we capture three images sequentially with a hand-held camera. The captured face might exhibit small movements during the capture, called micromotions. To mitigate these micromotions, we fine-tune with an individual expression code for every input image. The 3DMM fitting yields a per-image expression code \(\boldsymbol{\hat{\psi}}_i\). During inference, we interpolate these expression codes based on their distance to the target camera, where the weight of each training frame is the inverse squared distance between the target camera and that frame's camera. The interpolated expression code \(\boldsymbol{\tilde{\psi}}_t\) for a target camera is computed as
\[
\boldsymbol{\tilde{\psi}}_t = \frac{1}{Z} \sum_{i} \frac{\boldsymbol{\hat{\psi}}_i}{\lVert \boldsymbol{\tilde{l}}_t - \boldsymbol{\hat{l}}_i \rVert^2 + \epsilon}, \qquad Z = \sum_{i} \frac{1}{\lVert \boldsymbol{\tilde{l}}_t - \boldsymbol{\hat{l}}_i \rVert^2 + \epsilon}, \tag{7}
\]
where \(\boldsymbol{\hat{\psi}}_i\) are the expression codes of the training frames, \(\boldsymbol{\tilde{l}}_t\) is the position of the target camera, \(\boldsymbol{\hat{l}}_i\) are the positions of the training cameras, \(\epsilon\) is a small constant, and \(Z\) is a normalization factor that ensures the weights sum to 1.
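The inverse-squared-distance interpolation of Eq. 7 is straightforward to implement; a small self-contained sketch (dimensions follow the text, the example camera positions are arbitrary):

```python
import numpy as np

def interpolate_expression(psi_hat, l_hat, l_target, eps=1e-6):
    """Blend per-image expression codes by inverse squared distance to the target camera.

    psi_hat: (N, 157) expression codes of the training frames,
    l_hat: (N, 3) training camera positions, l_target: (3,) target camera position.
    """
    sq_dist = np.sum((l_hat - l_target) ** 2, axis=-1)   # ||l_t - l_i||^2
    weights = 1.0 / (sq_dist + eps)
    weights /= weights.sum()                              # divide by Z so the weights sum to 1
    return weights @ psi_hat                              # interpolated code psi_tilde_t

# Example: three in-the-wild captures, one novel target view.
psi_hat = np.random.randn(3, 157)
l_hat = np.array([[0.0, 0.0, 1.0], [0.3, 0.0, 0.95], [-0.3, 0.0, 0.95]])
psi_t = interpolate_expression(psi_hat, l_hat, np.array([0.1, 0.05, 1.0]))
```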