We present a method that takes as input as few as three images of a person’s face – of arbitrary identity, expression, and lighting condition – and reconstructs a personalized 3D face model that can render high-quality, photorealistic novel views of that person, including fine details like freckles, wrinkles, eyelashes, and teeth.
To overcome reconstruction ambiguities, our method uses a pre-trained volumetric face model as a prior, trained on a large dataset of synthetic faces with a variety of identities, expressions, and viewpoints rendered in a single environment. At inference time, our method first fits the coefficients of our prior model to a small set of real input images. It then further fine-tunes the model weights, effectively performing domain adaptation during the few-shot reconstruction process (see Fig. 3). While the prior model is trained only once on a large collection of synthetic face images, the inference-time optimization is performed on a per-subject basis from as few as three (e.g., smartphone) images captured in the wild (see Fig. 9).
3.1 Background: NeRF and Preface
NeRF. Our prior face model is built upon Neural Radiance Fields (NeRF) [Mildenhall et al. 2020]. NeRFs represent 3D objects as density and (emissive) radiance fields parameterized by neural networks. Given a camera ray \(\mathbf{r}\), a NeRF samples 3D points \(\mathbf{x}\) along the ray, which are fed together with the view direction \(\mathbf{d}\) into an MLP. The output is the corresponding density \(\sigma\) and color value \(\mathbf{c}\) at \(\mathbf{x}\). A NeRF is rendered into any view via volumetric rendering. The color \(\mathbf{c}(\mathbf{p})\) of a pixel \(\mathbf{p}\) is determined by compositing the density and color along the camera ray \(\mathbf{r}\) within an interval between a near and a far camera plane \([t_n, t_f]\):
\[
\mathbf{c}(\mathbf{p}) = \int_{t_n}^{t_f} T(t)\,\sigma_{\theta}\big(\mathbf{r}(t)\big)\,\mathbf{c}_{\theta}\big(\mathbf{r}(t), \mathbf{d}\big)\,\mathrm{d}t, \tag{1}
\]
with accumulated transmittance
\[
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma_{\theta}\big(\mathbf{r}(s)\big)\,\mathrm{d}s\right). \tag{2}
\]
The variable \(\theta\) denotes the model parameters that are fit to the input data.
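For intuition, the discrete quadrature of Eq. 1 used when rendering a ray can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the MLP outputs are replaced by placeholder density and color arrays, and Mip-NeRF 360 details such as conical frustums are omitted.

```python
import numpy as np

def composite_ray(sigmas, colors, t_vals):
    """Discrete volume rendering along one ray (quadrature of Eq. 1).

    sigmas: (S,) densities at the sampled points, colors: (S, 3) radiance values,
    t_vals: (S + 1,) sample-interval boundaries within [t_n, t_f].
    """
    deltas = np.diff(t_vals)                        # interval lengths
    alphas = 1.0 - np.exp(-sigmas * deltas)         # opacity contributed by each interval
    trans = np.cumprod(1.0 - alphas + 1e-10)        # accumulated transmittance T(t)
    trans = np.concatenate([[1.0], trans[:-1]])     # shift so T is measured *before* each sample
    weights = trans * alphas                        # compositing weights
    return (weights[:, None] * colors).sum(axis=0)  # pixel color c(p)

# Toy usage: 64 random samples between the near and far plane of one ray.
t = np.linspace(2.0, 6.0, 65)
rgb = composite_ray(np.random.rand(64), np.random.rand(64, 3), t)
```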
Preface. Our method also extends Preface [Buehler et al. 2023], a method for novel view synthesis of neutral faces from sparse inputs. Besides the position \(\mathbf{x}\) and view direction \(\mathbf{d}\), Preface also takes a learned latent code \(\mathbf{w}\) as input. The latent code represents the identity and is optimized while training the model as an auto-decoder [Bojanowski et al. 2018]. During inference, Preface first projects the sparse input images into its latent space by optimizing a single identity code \(\mathbf{w}\). It then fine-tunes all model parameters under regularization constraints on the predicted normals and the view weights. While Preface excels at high-resolution novel view synthesis of neutral faces, it struggles in the presence of strong, idiosyncratic expressions (Fig. 7). In the following, we address this limitation while building an improved prior from synthetic images alone.
3.2 Pretraining an Expressive Prior Model
We train a prior model to capture the distribution of human heads with arbitrary facial expressions. Like Preface [Buehler et al. 2023], our prior model is implemented as a conditional Neural Radiance Field [Mildenhall et al. 2020; Rebain et al. 2022] with a Mip-NeRF 360 backbone [Barron et al. 2022] (Fig. 4). However, we observe that the simple Preface auto-decoder [Bojanowski et al. 2018; Rebain et al. 2022] cannot achieve high-quality fitting results on expressive faces (see the ablation in Sec. 4.3). We hypothesize that the distribution of expressive faces is much more difficult to model and disentangle than the distribution of neutral faces. To address this limitation, we decompose the latent space of our prior model into three latent spaces: a predefined identity space \(\mathcal{B}\), a predefined expression space \(\Psi\), and a learned latent space \(\mathcal{W}\). The identity and expression spaces come from a linear 3D Morphable Model (3DMM) in the style of Wood et al. [2021]. The latent codes in these two spaces are known a priori and represent the face shape for the arbitrary identity and expression in each synthetic training image. These codes are frozen and do not change during training. The latent space \(\mathcal{W}\) represents characteristics that are not modeled by the 3DMM, such as hair, beard, clothing, glasses, appearance, and lighting, and is learned while training the auto-decoder as in Preface [Buehler et al. 2023]. With this conditioning, we adapt Eq. 1 to obtain:
\[
\mathbf{c}(\mathbf{p} \mid \boldsymbol{\beta}, \boldsymbol{\psi}, \mathbf{w}) = \int_{t_n}^{t_f} T(t)\,\sigma_{\theta_\text{p}}\big(\mathbf{r}(t), \boldsymbol{\beta}, \boldsymbol{\psi}, \mathbf{w}\big)\,\mathbf{c}_{\theta_\text{p}}\big(\mathbf{r}(t), \mathbf{d}, \boldsymbol{\beta}, \boldsymbol{\psi}, \mathbf{w}\big)\,\mathrm{d}t, \tag{3}
\]
where \(\boldsymbol{\beta} \in \mathcal{B} \subset \mathbb{R}^{48}\) and \(\boldsymbol{\psi} \in \Psi \subset \mathbb{R}^{157}\) are the 3DMM identity and expression parameters, and \(\mathbf{w} \in \mathcal{W} \subset \mathbb{R}^{64}\) is a learned parameter encoding additional characteristics.
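To make the conditioning concrete, the sketch below shows one plausible way to expose the decomposed latent spaces as inputs of the field MLP. This is a hypothetical PyTorch interface; the layer sizes, positional-encoding dimensions, and head structure are illustrative assumptions, not the paper's Mip-NeRF 360 architecture.

```python
import torch
import torch.nn as nn

class ConditionalFaceField(nn.Module):
    """Field MLP conditioned on frozen 3DMM codes (beta, psi) and a learned code w."""

    def __init__(self, pos_dim=63, dir_dim=27, beta_dim=48, psi_dim=157, w_dim=64, hidden=256):
        super().__init__()
        cond_dim = beta_dim + psi_dim + w_dim
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, x_enc, d_enc, beta, psi, w):
        # x_enc / d_enc: positionally encoded point and view direction;
        # beta, psi: frozen 3DMM identity/expression codes; w: learned latent code.
        h = self.trunk(torch.cat([x_enc, beta, psi, w], dim=-1))
        sigma = torch.nn.functional.softplus(self.density_head(h))
        rgb = self.color_head(torch.cat([h, d_enc], dim=-1))
        return sigma, rgb
```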
We train this prior on synthetic data alone – it never sees a real face. While it would be feasible to train a prior model on real data (see our ablation in Tbl. 2), we chose synthetic over real for multiple reasons. Real datasets exhibit limited diversity: most face datasets feature monocular frontal views only, with few expressions other than smiles. Some multi-view, multi-expression datasets exist [Kirschstein et al. 2023; Wuu et al. 2022; Zhu et al. 2023], but they consist of relatively few individuals due to the complexity and expense of running a capture studio. Further, subjects must adhere to wardrobe restrictions: glasses are forbidden and hair must be tucked away. A prior trained on such data will not generalize well to expressive faces captured in the wild. Besides, the logistics of capturing large-scale real data are extremely expensive, time- and energy-consuming, and cumbersome. Instead, synthetics guarantee a wide range of identity, expression, and appearance diversity at orders of magnitude lower cost and effort. In addition, synthetics provide perfect ground-truth annotations: each render is accompanied by its 3DMM latent codes \(\boldsymbol{\beta}\) and \(\boldsymbol{\psi}\).
We synthesize facial training data as in Wood et al. [2021]. We first generate the 3D face geometry by sampling the identity and expression spaces of the 3DMM. We then make these faces look realistic by applying physically based skin materials, attaching strand-based hairstyles, and dressing them up with clothes and glasses from our digital wardrobe. The scene is rendered with environment lighting using Cycles, a physically based ray tracer (www.cycles-renderer.org). Examples are shown in Fig. 5 and on the supplementary HTML page. To help disentangle identity from expression, we sample 13 different random expressions for each random identity. Each expression is then rendered from 30 random viewpoints around the head. All faces are rendered under the same lighting condition, which was chosen to minimize shadows on the face.
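The sampling layout described above (13 random expressions per identity, each rendered from 30 random viewpoints, all under a shared environment light) could be scripted roughly as follows. The sampling functions here are simplistic stand-ins for the 3DMM sampling and the Cycles rendering pipeline; they are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N_EXPRESSIONS, N_VIEWS = 13, 30   # per identity, as described in the text

def sample_identity():             # stand-in: draw a 3DMM identity code in R^48
    return rng.normal(size=48)

def sample_expression():           # stand-in: draw a 3DMM expression code in R^157
    return rng.normal(size=157)

def sample_viewpoint():            # stand-in: random camera position on a sphere around the head
    v = rng.normal(size=3)
    return 1.2 * v / np.linalg.norm(v)

def build_render_jobs(n_identities):
    """Enumerate (identity, expression, viewpoint) triplets sharing one environment light."""
    jobs = []
    for _ in range(n_identities):
        beta = sample_identity()
        for _ in range(N_EXPRESSIONS):      # multiple expressions help disentangle identity from expression
            psi = sample_expression()
            for _ in range(N_VIEWS):        # 30 random viewpoints per expression
                jobs.append({"beta": beta, "psi": psi,
                             "camera": sample_viewpoint(),
                             "lighting": "shared_environment"})
    return jobs
```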
Each training iteration randomly samples rays \(\mathbf{r}\) from a subset of all identities and expressions. A ray is rendered into a pixel color as given by Eq. 3. We optimize the network parameters \(\theta_\text{p}\) and \(N\) per-identity latent codes \(\mathbf{w}_{1..N}\) while keeping the 3DMM identity and expression codes \(\boldsymbol{\beta}\) and \(\boldsymbol{\psi}\) frozen:
\[
\theta_\text{p}^{*}, \mathbf{w}_{1..N}^{*} = \operatorname*{arg\,min}_{\theta_\text{p},\, \mathbf{w}_{1..N}} \; \sum_{\mathbf{r}} \mathcal{L}_\text{recon} + \mathcal{L}_\text{prop}. \tag{4}
\]
Here, \(\mathcal{L}_\text{recon}\) is the mean absolute error between the predicted and ground-truth colors, and \(\mathcal{L}_\text{prop}\) is the weight-distribution-matching loss from Mip-NeRF [Barron et al. 2021]. Please see Sec. 2 in the supp. PDF for the spelled-out loss terms.
We find that, when trained from scratch, the model quickly collapses and outputs zero densities everywhere. We resolve this by first training on images with background for 50,000 steps and then continuing without background.
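A condensed sketch of the prior-training loop (Eq. 4) might look like the following. Here, `render_rays` wraps the conditional field and volume rendering, `proposal_loss` stands for the Mip-NeRF proposal supervision, and the two-phase background schedule reflects the collapse workaround described above; the function names, optimizer, and learning rate are assumptions rather than the paper's code.

```python
import torch

def train_prior(model, w_codes, dataset, steps=500_000, warmup_with_background=50_000):
    """Optimize the prior parameters theta_p and N per-identity codes w; beta/psi stay frozen."""
    opt = torch.optim.Adam(list(model.parameters()) + [w_codes], lr=1e-3)
    for step in range(steps):
        # First 50k steps keep the background to avoid densities collapsing to zero.
        keep_background = step < warmup_with_background
        rays, target_rgb, beta, psi, idx = dataset.sample_rays(with_background=keep_background)
        pred_rgb, aux = model.render_rays(rays, beta, psi, w_codes[idx])  # Eq. 3
        loss = (pred_rgb - target_rgb).abs().mean()      # L_recon (mean absolute error)
        loss = loss + model.proposal_loss(aux)           # L_prop from Mip-NeRF
        opt.zero_grad()
        loss.backward()
        opt.step()
```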
3.3 Inference from Sparse Views
We use our low-resolution synthetic prior model to enable high-resolution novel view synthesis of real expressive faces from a few input images. We first describe how we obtain the conditional inputs and camera parameters in Sec. 3.3.1, followed by the fine-tuning of our model in Sec. 3.3.2.
3.3.1 3DMM Fitting and Camera Estimation.
Figure 3 gives an overview of the 3DMM fitting. During inference, the first step is to recover camera and 3DMM parameters from uncalibrated input images. We follow the approach of previous work [Wood et al. 2022] and fit to dense 2D landmarks (see Fig. 6). We first predict 599 probabilistic 2D landmarks. Each landmark corresponds to a vertex in our 3DMM and is predicted as a 2D isotropic Gaussian with expected 2D location \(\boldsymbol{\mu}\) and scalar uncertainty \(\sigma\). Next, we minimize an energy \(E(\Phi; L)\), where \(L\) denotes the landmarks and \(\Phi\) all optimized parameters, including 3DMM identity and expression, joint rotations, global translation, and, if unknown, intrinsic and extrinsic camera parameters. Minimizing \(E\) encourages the 3DMM to explain the observed landmarks via a probabilistic 2D landmark energy, and discourages unlikely faces using regularizers on the 3DMM parameters and mesh self-intersections (additional detail is given in [Wood et al. 2022] and in Sec. 2 of the supp. mat.).
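As a rough illustration, the probabilistic landmark data term can be thought of as a Gaussian negative log-likelihood over the projected 3DMM vertices. The sketch below assumes the projections under the current parameters \(\Phi\) are already available and omits the regularizers and the self-intersection term; the exact form and weighting follow [Wood et al. 2022].

```python
import numpy as np

def landmark_energy(projected, mu, sigma):
    """Probabilistic 2D landmark data term (schematic).

    projected: (599, 2) projections of the corresponding 3DMM vertices,
    mu: (599, 2) predicted 2D landmark means, sigma: (599,) scalar uncertainties.
    """
    sq_err = np.sum((projected - mu) ** 2, axis=-1)
    # Isotropic Gaussian negative log-likelihood: confident landmarks (small sigma) weigh more.
    return np.sum(sq_err / (2.0 * sigma ** 2) + 2.0 * np.log(sigma))
```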
The benefit of the 3DMM fitting is twofold. First, we obtain a good estimate of the world positions of the camera and the head, so that the camera parameters can be frozen during the later inversion and fine-tuning of the model. Second, thanks to the alignment between the 3DMM latent space and our prior model's latent space, we can directly feed the 3DMM parameters into the model, which serves as a good initialization for the subsequent inversion stage.
The outputs of this step are the camera parameters (shared intrinsics \(\mathbf{K}\) and per-camera extrinsics \([\mathbf{R}_i \mid \mathbf{t}_i]\)), a shared identity code \(\boldsymbol{\beta}\), and per-image expression codes \(\boldsymbol{\psi}_i\). For casual in-the-wild captures, it can be challenging to hold the same expression while the data is being captured. Therefore, we allow the expression codes to vary slightly between images, which makes the inversion robust to small, involuntary micro-changes in expression. In the studio setting, however, the cameras are synchronized, and hence it is sufficient to optimize a single, shared expression code.
3.3.2 Fine-tuning on Sparse Views.
This section describes how to fine-tune the low-resolution synthetic prior model on sparse, high-resolution real input images. Fine-tuning requires a short warm-up phase in which only the latent code for the target \(\mathbf{w}_\text{target}\) is optimized. After that, fine-tuning optimizes all model parameters under additional constraints on the geometry and the appearance weights. We randomly sample rays from all available inputs, typically three images, and mask them to the foreground by multiplying with an estimated foreground mask [Pandey et al. 2021].
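The foreground-masked ray sampling can be sketched as below. This assumes per-image foreground mattes in [0, 1] and that the mask simply multiplies the supervision targets so that background rays carry no signal; both are assumptions about the setup rather than the exact implementation.

```python
import torch

def sample_foreground_rays(images, masks, rays_per_image=1024):
    """Randomly sample pixels from each input image and mask the targets to the foreground.

    images: list of (H, W, 3) tensors, masks: list of (H, W) foreground mattes in [0, 1].
    """
    pixels, targets = [], []
    for img, m in zip(images, masks):
        h, w, _ = img.shape
        ys = torch.randint(0, h, (rays_per_image,))
        xs = torch.randint(0, w, (rays_per_image,))
        pixels.append(torch.stack([ys, xs], dim=-1))
        targets.append(img[ys, xs] * m[ys, xs, None])   # background pixels become zero
    return pixels, targets
```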
Warm-up by Latent Code Inversion. While the 3DMM fitting provides the identity and expression codes \(\boldsymbol{\beta}, \boldsymbol{\psi}_i\), our model also requires the conditional input \(\mathbf{w}\), which models out-of-model characteristics like hair, clothing, and appearance. We follow [Buehler et al. 2023] and search the learned latent space \(\mathcal{W}\) of the prior model for a latent code that roughly matches the geometry and appearance of the input images. We downscale the three input images to the resolution of the prior model, sample random patches, and optimize
\[
\mathbf{w}_\text{target} = \operatorname*{arg\,min}_{\mathbf{w} \in \mathcal{W}} \; \mathcal{L}_\text{recon} + \lambda_\text{LPIPS}\, \mathcal{L}_\text{LPIPS}. \tag{5}
\]
The photometric reconstruction term \(\mathcal{L}_\text{recon}\) is the mean absolute error between the rendered and the ground-truth patch, and \(\mathcal{L}_\text{LPIPS}\) is a perceptual loss in the feature space of a pre-trained image classifier [Simonyan and Zisserman 2015; Zhang et al. 2018]. Note that the LPIPS loss is only employed during inversion, not during model fitting. The camera, 3DMM identity, and expression parameters are frozen during inversion.
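The warm-up thus reduces to optimizing a single latent code against downscaled patches. A possible sketch is shown below; the LPIPS weight, step count, and the `render_patch` call are assumptions for illustration, while `lpips` refers to the publicly available LPIPS package of Zhang et al. [2018].

```python
import torch
import lpips  # perceptual loss in the feature space of a pre-trained classifier

def invert_latent(model, beta, psi_list, patches, cams, steps=2000, lam=0.2):
    """Optimize only the target code w; model weights, cameras, and 3DMM codes stay frozen."""
    w = torch.zeros(1, 64, requires_grad=True)           # code in the learned latent space W
    perceptual = lpips.LPIPS(net="vgg")
    opt = torch.optim.Adam([w], lr=1e-2)
    for _ in range(steps):
        i = torch.randint(len(patches), (1,)).item()     # pick a random low-resolution patch
        pred = model.render_patch(cams[i], beta, psi_list[i], w)   # hypothetical renderer call
        recon = (pred - patches[i]).abs().mean()                    # L_recon
        perc = perceptual(pred.permute(2, 0, 1)[None] * 2 - 1,      # LPIPS expects [-1, 1] images
                          patches[i].permute(2, 0, 1)[None] * 2 - 1).mean()
        loss = recon + lam * perc
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```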
Model Fitting. The output of the warm-up is a rough approximation of the input images in the low-resolution synthetic domain. In model fitting, we cross a substantial domain gap to enable detailed novel view synthesis at high resolution for realistic faces, so that the output can contain details that have never been seen during prior-model training.
Model fitting optimizes all model parameters on sparse input images, usually two or three. In a randomly initialized NeRF [Mildenhall et al. 2020], this optimization would overfit and fail to produce correct novel views [Buehler et al. 2023; Truong et al. 2023; Yang et al. 2023]. Thanks to our pretrained prior model, we can employ both implicit and explicit regularization, which yields high-quality results even in such sparse settings.
Implicit regularization comes from the fact that our prior model is trained on an aligned dataset of human faces: initializing a NeRF with the weights of the prior model and the correct latent code avoids total collapse. However, the optimization can still produce strong artifacts like duplicate ears and view-dependent color distortions. We therefore follow [Buehler et al. 2023] and add explicit regularization on the consistency of predicted vs. analytical normals, as well as an L2 regularization on the weights of the view branch to avoid view-dependent flickering. In addition, we add a distortion loss term \(\mathcal{L}_\text{dist}\) [Barron et al. 2022] for a more compact geometry:
\[
\theta^{*} = \operatorname*{arg\,min}_{\theta} \; \mathcal{L}_\text{recon} + \mathcal{L}_\text{prop} + \lambda_\text{normal}\,\mathcal{L}_\text{normal} + \lambda_{d}\,\mathcal{L}_{d} + \lambda_\text{dist}\,\mathcal{L}_\text{dist}, \tag{6}
\]
where \(\mathcal{L}_\text{normal}\) and \(\mathcal{L}_{d}\) are the regularizers on predicted normals and view weights from Preface. We list and explain all loss terms in more detail in Sec. 2 of the supp. PDF.
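Putting the regularizers together, one fine-tuning step could be assembled as follows (Eq. 6). The loss weights and the helper functions for the normal, view-branch, and distortion terms are placeholders for the terms detailed in the supplementary PDF, not the exact implementation.

```python
import torch

def finetune_step(model, opt, batch, lam_normal=1e-3, lam_view=1e-2, lam_dist=1e-3):
    """One optimization step over all model parameters on the sparse real inputs."""
    out = model.render_rays(batch["rays"], batch["beta"], batch["psi"], batch["w"])
    loss = (out["rgb"] - batch["rgb"]).abs().mean()                   # L_recon
    loss = loss + model.proposal_loss(out)                            # L_prop (Mip-NeRF)
    loss = loss + lam_normal * (out["pred_normals"] -
                                out["analytic_normals"]).square().sum(-1).mean()  # L_normal
    loss = loss + lam_view * model.view_branch_weight_norm()          # L_d: L2 on view-branch weights
    loss = loss + lam_dist * model.distortion_loss(out)               # L_dist [Barron et al. 2022]
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```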
Inference In-the-wild. For in-the-wild (ITW) images, we capture three images sequentially with a hand-held camera. The captured face might exhibit small movements during the capture, called micromotions. To mitigate these micromotions, we fine-tune with an individual expression code for every input image. The 3DMM fitting yields a per-image expression code \(\boldsymbol{\hat{\psi}}_i\). During inference, we interpolate these expression codes based on their distance to the target camera, where the weight of each training frame is the inverse squared distance between the target camera and that frame's camera. The interpolated expression code \(\boldsymbol{\tilde{\psi}}_t\) for a target camera is computed as
\[
\boldsymbol{\tilde{\psi}}_t = \frac{1}{Z} \sum_{i} \frac{\boldsymbol{\hat{\psi}}_i}{\lVert \boldsymbol{\tilde{l}}_t - \boldsymbol{\hat{l}}_i \rVert^2 + \epsilon}, \qquad Z = \sum_{i} \frac{1}{\lVert \boldsymbol{\tilde{l}}_t - \boldsymbol{\hat{l}}_i \rVert^2 + \epsilon}, \tag{7}
\]
where \(\boldsymbol{\hat{\psi}}_i\) are the expression codes of the training frames, \(\boldsymbol{\tilde{l}}_t\) is the position of the target camera, \(\boldsymbol{\hat{l}}_i\) are the positions of the training cameras, \(\epsilon\) is a small constant, and \(Z\) is a normalization factor that ensures the weights sum to 1.
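The inverse-squared-distance interpolation of Eq. 7 is straightforward to implement; a small self-contained sketch (dimensions follow the text, the example camera positions are arbitrary):

```python
import numpy as np

def interpolate_expression(psi_hat, l_hat, l_target, eps=1e-6):
    """Blend per-image expression codes by inverse squared distance to the target camera.

    psi_hat: (N, 157) expression codes of the training frames,
    l_hat: (N, 3) training camera positions, l_target: (3,) target camera position.
    """
    sq_dist = np.sum((l_hat - l_target) ** 2, axis=-1)   # ||l_t - l_i||^2
    weights = 1.0 / (sq_dist + eps)
    weights /= weights.sum()                              # divide by Z so the weights sum to 1
    return weights @ psi_hat                              # interpolated code psi_tilde_t

# Example: three in-the-wild captures, one novel target view.
psi_hat = np.random.randn(3, 157)
l_hat = np.array([[0.0, 0.0, 1.0], [0.3, 0.0, 0.95], [-0.3, 0.0, 0.95]])
psi_t = interpolate_expression(psi_hat, l_hat, np.array([0.1, 0.05, 1.0]))
```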