MixLight: Borrowing the Best of both Spherical Harmonics and Gaussian Models

Xinlong Ji, Fangneng Zhan, Shijian Lu, Shi-Sheng Huang, Hua Huang Xinlong Ji is with the School of Computing Science & Technology, Beijing Institute of Technology, Beijing 100081, China. E-mail: [email protected]. Fangneng Zhan is with the Max Planck Institute for Informatics, Saarbrücken 66123, Germany. E-mail: [email protected]. Shijian Lu is with the School of Computer Science and Engineering, Nanyang Technological University, 639798, Singapore. E-mail: [email protected]. Shi-Sheng Huang and Hua Huang are with the School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China. E-mail: {huangss, huahuang}@bnu.edu.cn. Hua Huang is the corresponding author.

Abstract

Accurately estimating scene lighting is critical for applications such as mixed reality. Existing works estimate illumination by generating illumination maps or regressing illumination parameters. However, the method of generating illumination maps has poor generalization performance and parametric models such as Spherical Harmonic (SH) and Spherical Gaussian (SG) fall short in capturing high-frequency or low-frequency components. This paper presents MixLight, a joint model that utilizes the complementary characteristics of SH and SG to achieve a more complete illumination representation, which uses SH and SG to capture low-frequency ambient and high-frequency light sources respectively. In addition, a special spherical light source sparsemax (SLSparsemax) module that refers to the position and brightness relationship between spherical light sources is designed to improve their sparsity, which is significant but omitted by prior works. Extensive experiments demonstrate that MixLight surpasses state-of-the-art (SOTA) methods on multiple metrics. In addition, experiments on Web Dataset also show that MixLight as a parametric method has better generalization performance than non-parametric methods.

Index Terms:

Illumination estimation, mixed reality, spherical harmonics, spherical gaussian, deep learning.

1 Introduction

Recovering high-dynamic-range (HDR) illumination from a single limited field of view (FOV) image is a challenging yet attractive task. This task holds significant relevance across various domains within computer graphics, encompassing mixed reality, HDR relighting, and virtual characters insertion. Addressing this challenging problem entails confronting three primary challenges. First, inferring panoramic illumination from a image of limited FOV poses a prominent difficulty. Second, recovering HDR illumination from low dynamic range (LDR) observation falls into an ill-posed problems. Third, the observed images are the result of intricate interplay between geometry, material, and illumination, leading to substantial ambiguity for lighting estimation.

Refer to caption — (a) Ambient Light estimation

Prior works [1, 2, 3, 4, 5, 6, 7] mitigate this ill-posedness by incorporating additional information; however, these approaches often lack user-friendliness, such as requiring obtaining the depth map [5, 6] or multi-view pictures [2, 7]. Recent works have leaned towards leveraging neural networks to predict illumination directly from a single image, either by generating illumination maps [8, 9, 10, 11, 12, 13] or by regressing illumination parameters [14, 10, 11, 15, 16, 17, 18, 12, 13]. Illumination maps, which are panoramic images containing illumination information in all directions, offer the potential to fully represent illumination. However, their high-dimensional nature often results in stability issues and poor generalization performance for lighting estimation. On the other hand, the regression-based methods parameterize the illumination with basis functions such as SH [17, 19, 18] and SG [14, 10, 11, 15, 12, 13], followed by a network to regress the illumination parameters of lower dimensions. These methods have better generalization performance but struggle to capture either high-frequency or low-frequency characteristics, leading to incomplete representation.

Specifically, Fig. 1 shows that SH is better suited for capturing low-frequency ambient light (with a less error in Fig. 1a), while SG is more suitable for capturing high-frequency light sources (with a less error in Fig. 1b). Therefore, the joint representation of illumination by utilizing the complementary advantages of SH and SG in the frequency-domain can lead to a more accurate representation of illumination information.

Inspired by this, a SH, SG joint model of illumination representation called MixLight is proposed. An HDR illumination map can be divided into two parts: ambient light and light source using a simple brightness threshold segmentation method [10]. Among them, the ambient light is mainly the low-frequency illumination information. The light source is a steep peak in brightness, which corresponds to the high-frequency part of the illumination information. MixLight uses a neural network to regress both SH and SG parameters simultaneously, where SH represents the low-frequency ambient light part and SG represents the high-frequency light source part.

In practice, real-world light sources often exhibit a sparse characteristic, typically confined to a relatively small yet variable number. Nevertheless, prior SG methods either overlook this sparsity [10, 11] by generating extraneous tiny light sources, or trap in oversimplified settings by featuring a fixed number of light sources [14, 12, 13]. Inspired by the sparsemax theory [20, 21], this paper designs the Spherical Light Source Sparsemax (SLSparsemax) mechanism to impose sparsity constraints on light sources at the neural network level. SLSparsemax considers the positions and brightness relationships among light sources on the sphere, calculates the credibility of light source predictions, and adaptively determines the number of sparse light sources.

The contributions of this work are summarized as follows:

1.

A SH, SG joint model named MixLight is proposed to represent illumination more completely and accurately with a few parameters.
2.

The spherical light source sparsemax (SLSparsemax) mechanism is designed to impose sparsity constraints on the estimation of light sources.
3.

Experiments on several datasets show that MixLight outperforms several SOTA methods in prediction accuracy and has good generalization ability.

2 Related Works

Illumination estimation is a classic and challenging task in the fields of computer vision and computer graphics. The panoramic HDR illumination estimated from a single image can be directly used as illumination conditions for many subsequent applications such as the virtual characters insertion or relighting [22, 23, 24, 25, 26, 27, 28, 29, 30, 31]. Early research efforts attempt to introduce more information to alleviate the ill-posedness, such as using landmarks in the scene to obtain partial geometric (or material) information [1, 2, 3, 4, 32, 33], or directly obtaining scene geometry from depth maps [5, 6] and multi-view pictures [2, 7]. However, this introduction of additional information is not user-friendly.

In recent years, most illumination estimation works are based on neural networks, which generate illumination maps [8, 10, 9, 11, 12] or regress parameters [14, 10, 11, 15, 16, 17, 18, 19, 12, 13], no longer relying on additional information. To generate high-dimensional illumination maps, a large amount of real HDR training data is required to ensure the generalization performance and the realism of the prediction. However, constructing such large-scale realistic datasets that currently lacks consumes considerable time and effort. Therefore, Gardner et al. [8] train a CNN mainly on a large-scale synthetic dataset and then fine-tuned light source intensity on a small-scale real dataset. More recent works [10, 9, 13, 12] rely on the powerful generative adversarial network (GAN) to generate illumination maps with realistic details. Zhan et al. [10] use the regressed parameters of SG to guide GAN [34] generating, similar to another two hybrid models [13, 12], while Wang et al. [9] directly generate illumination maps with scene style encoded in styleGAN [35]. However, GANs are trained only on the limited-scale real HDR dataset, resulting in overfitting to the training set and poor generalization performance.

Parameters regressing methods use parameterized representations such as SH [19, 18, 17] and SG [14, 10, 11, 15]. As SH supports low overhead, fast and differentiable illumination maps reconstruction and object rendering, SH based methods are introduced in resource-limited mobile devices [19, 18, 17], as well as reconstruction loss [18] and rendering loss [19] that improves the quality of predictions. However, limited to representation defects in the frequency domain (shown in Fig. 1), SH methods represent high-frequency illumination inaccurately. Oppositely, SG methods are poor in low-frequency representation, but also suffer from inaccuracy in high-frequency representation as their thoughtless about the sparsity of light sources. Gardner et al. [14] oversimplify the number of light sources to a fixed number (e.g., 3), which cannot accurately represent the varying number and shape of light sources in the real world. Zhan et al. [10] achieve variable SG light sources using spherical distribution but fail to remove numerous small distribution values in the predicted results, which is also not sparse, not consistent with reality, and misleads the subsequent generation.

As a method that generates high-dimensional illumination maps requires a large-scale realistic dataset currently unavailable to ensure generalization performance, using low-dimensional parameterized representations is a good alternative. Note that parametric representations can better connect with many existing applications [36, 37, 38, 39] that choose parametric illumination for a more efficient rendering. However, previous works typically use either SH or SG independently, which fail to accurately represent both high-frequency and low-frequency illumination simultaneously. This paper introduces MixLight, which combines the strengths of SH and SG, to respectively represent low-frequency ambient light and high-frequency light sources, achieving a lower-dimensional yet more complete representation. Against the non-sparsity that weakens the accuracy of high-frequency light source prediction, SLSparsemax is designed, which leverages the positional and intensity information of spherical light sources to help the network output adaptively sparse light sources.

3 Proposed Method

The proposed MixLight combines the complementary characteristics of SH and SG, using them to represent the ambient light and the light sources respectively, as illustrated in Fig. 3. SLSparsemax is designed in regressing network to improve the accuracy of high-frequency illumination prediction. The following sections introduce the methods used for the MixLight parameters decomposition (Section 3.1), the design of SLSparsemax (Section 3.2), and details about loss functions (Section 3.3).

3.1 Parameters Decomposition

As visualized in the right half of Fig. 3, the illumination map $I$ is separated into the ambient light $I_{A}$ and the light source regions $I_{L}$ firstly, which are further parameterized by Spherical Harmonics and Spherical Gaussian, respectively.

Separation. In the real world, the energy of light sources is mainly located in the high-frequency domain while ambient light is comparatively low-frequency. This distinction arises from the fact that the light source exhibits sharp changes in brightness, whereas ambient light changes more gradually. Separating light sources from ambient light effectively separates high and low-frequency light, facilitating subsequent parameters decomposition processes. In addition to the frequency differences, there is a huge brightness difference in that, light sources are always much brighter than other objects that comprise ambient light in the scene. HDR format illumination maps can capture such a huge brightness difference, where the above two light types can be effectively separated by using a 5% brightness threshold, as proposed in [10]. This method involves identifying the top 5% brightest pixels as light sources and categorizing the others as ambient light.

SH Parameters Decomposition. In the field of lighting estimation, SH is an important method for representing illumination [19, 17, 18]. Similar to the Fourier transform, SH can use a set of orthogonal spherical basis and their corresponding coefficients to approximate illumination [40, 41, 42], especially to represent the ambient light $I_{A}$ in this paper, as shown in Equation 1:

I_{A}(c,\rho)\approx\sum_{k=0}^{K}\sum_{m=-k}^{k}A_{m,c}^{k}B_{m}^{k}(\rho)

(1)

Here, $c$ represents the channel of RGB format $I_{A}$ , where $c$ can be 1, 2, and 3, corresponding to red, green, and blue channels, respectively. $\rho=(\phi,\theta)$ represents the spherical coordinate angles defined on the spherical image domain $\Omega$ , where $\phi$ is the azimuth angle and $\theta$ is the elevation angle. $B_{m}^{k}$ denotes the spherical harmonic basis function with order $k$ (ranging from 0 to $K$ ) and degree $m$ (ranging from - $k$ to $k$ ), and $A_{m,c}^{k}$ represents the corresponding coefficients for the channel $c$ . $K$ is the upper bound of $k$ .

As the order $k$ increases, SH can capture information from low frequencies to high frequencies, resulting in a more accurate representation [40, 41, 42]. However, SH should only be used to represent low-frequency information considering its representation defects, and the scale of the $A_{m,c}^{k}$ also rapidly expands with increasing order. Thus, blindly raising the upper boundary of $k$ just leads to SH losing its advantage of parameter sparsity, without much real gain. It is common to only use low-order SH to approximate illumination, setting $K$ as a small value.

As the inverse process of Equation 1, the SH coefficients $A_{m,c}^{k}$ can be obtained by projecting the ambient light $I_{A}$ onto the SH basis:

A_{m,c}^{k}=\frac{4\pi}{wh}\sum_{\rho}^{\Omega}I_{A}(c,\rho)B_{m}^{k}(\rho)

(2)

Here, $w$ , $h$ , and their product $wh$ represent the width, the height, and the area of the ambient light $I_{A}$ .

SG Parameters Decomposition. Similar to the approach in [10], the light source part $I_{L}$ is represented using SG on $N$ evenly distributed anchor points. The intensity of the light source is denoted as $E$ , its color is represented by $R$ , and the energy distribution of light sources is modeled using the light distribution $P$ . The method of seeking the SG parameters is also basically similar to [10]. To explain this process more clearly, details are provided here.

To calculate the light source intensity $E$ , sum up the pixel values in each channel, yielding:

T_{c}=\sum_{\rho}^{\Omega}I_{L}(c,\rho),\quad c=1,2,3

(3)

$T_{c}$ represents the sum of the pixel values in the $c$ channel.

The total intensity of the light source is the L2 norm of the vector formed by $T_{1}$ , $T_{2}$ and $T_{3}$ :

E=\left|\left|\left(T_{1},T_{2},T_{3}\right)\right|\right|_{2}^{2}

(4)

The overall color of the light source is represented by the color ratios $R$ :

R=\frac{(T_{1},T_{2},T_{3})}{E}

(5)

Next, the pixel values within the light source region are assigned to the nearest anchor point, and the energy distribution on $N$ anchor points is computed as $P$ .

As a reverse process of parameter decomposition, light sources can be reconstituted in the form of spherical Gaussian maps, as employed in Zhan et al. [10]. However, it is important to note that the Gaussian Map reconstructed by Zhan is not entirely accurate. The reconstructed Gaussian map mainly approximates the distribution of light, serving primarily as a rough guideline for subsequent generators, without providing an accurate intensity measure.

Different from Zhan et al. [10] but similar to the idea of [15, 13, 12], the Gaussian map generated by MixLight directly serves as final result of the light source part estimation, which must be numerically accurate in intensity. Therefore, a normalization term $q$ is multiplied in the Gaussian map reconstruction function to ensure that the integral of a single Gaussian kernel is kept as 1 (before multiplied by the amplitude value), which aims to keep the pixel value sum of pixels that assigned to an anchor point in $I_{L}$ equal to the integral of the corresponding SG pixel values in Gaussian map. Finally, the Gaussian map reconstruction formula is written as:

I_{L}\approx\sum_{i=1}^{N}v_{i}e^{\frac{d_{i}u-1}{s}}q

(6)

$N$ is the number of anchor points, $v_{i}$ denotes the RGB values of the $i$ -th anchor points, which is the product of the light distribution, intensity, and color ratios on that anchor point (namely $v_{i}=P_{i}ER$ ), $d_{i}$ is the direction of an anchor point (predefined by the Vogel method [43]), $u$ is the unit vector giving a direction on the sphere, $s$ is the angular size of the Gaussian kernel.

Normalization term $q$ is the reciprocal of the integral value of the spherical Gaussian and is dependent on the angular size $s$ and the sphere radius $r$ :

q=\frac{1}{{2\pi sr^{2}\left(1-e^{-2s}\right)}}

(7)

Discussion. Why use SH to represent low-frequency ambient light instead of using other functions on the sphere, such as SG? After all, by adjusting the angular size of the SG, it can be made smoother as well to represent low-frequency information. In the comparison experiment (Fig. 1) illustrated in the Introduction section, the angular size of SG is indeed carefully adjusted to produce two versions of SG. One sharp version is physically closer to the light source, which is compared with SH in the light source estimation task; another smooth version of SG that is physically closer to ambient light is used for comparison with SH in ambient light estimation tasks. More details about this adjustment can be found in the supplementary file. In this experiment, the parameters of SG were carefully set for the sake of fairness of comparison. But this also reflects a problem. When SG is used to represent lighting, many aspects of it (such as the value of the angular size, the number of SG kernels, whether it is a position-fixed model to predict the amplitude like [10, 11] or a number-fixed model to predict the position and amplitude simultaneously like [12, 14]) all requires someone with specialized knowledge to determine, and the process involves too many variable parameters and is often very cumbersome and uncertain. In contrast, SH only needs to determine which order to use, which is simple but efficient. In addition, previous works [40, 41, 42] have fully studied the usage of SH to represent low-frequency illumination information, where the 2-order SH has been proven to be very effective. Therefore, even users who lack professional knowledge can obtain a better lighting representation effect (refer to Fig. 1a) by simply setting the order to 2. In general, except for excellent accuracy in representing low-frequency light, SH has far fewer parameters to adjust and is easy to set, which is more user-friendly and efficient.

3.2 SLSparsemax

Light sources in the real world are sparse. In this paper, the sparsity of light sources refers to: in a scene, there are usually only a few light sources, and the number of light sources is variable. Previous works on SG do not effectively constrain the sparsity of the predicted results, which leads to inaccurate predictions.

Sparsemax, as described in [20], is an activation function utilized as the final layer in the neural network to improve the sparsity of the output. It has been applied in multi-class classification problems [21]. When used in illumination estimation, sparsemax tends to retain several of the brightest light sources while filtering out others, which helps create a sparsity relative to individual SG light sources. Further details on sparsemax can be found in the supplementary file and [20].

As a general method, the original sparsemax fails to utilize additional information in illumination estimation issues. In the real world, light sources usually have different shapes (e.g., patches, bars, points) and varying areas, implying useful prior information for lighting estimation. However, the original SG light source is point-shaped. Therefore, when simulating light sources using SG, it is encouraged to connect local point SG light sources on the spherical surface to simulate the patch-like and bar-like light sources in the real world. To this end, sparsity should be maintained between several SG clusters, rather than just between SG individuals. In summary, SLSparsemax encourages “local clustering, global sparsity” for SG light sources.

How can the SG point light on the sphere form a “locally clustered, globally sparse” distribution? The classical intrinsic decomposition task [44, 45, 46, 47, 48] seems to have similar requirements. The intrinsic decomposition task decomposes a picture into a reflectance layer and a shading layer, where the reflectance layer represents the color of the object itself, and the shading layer represents the effect of light on the object. The reflectance layer is usually assumed to consist of a limited number of color blocks, and the pixels within a block have the same color, but the blocks differ from one another. This is also a prior of local clustering and global sparsity. To satisfy this prior, the chroma or brightness similarity of a pixel with adjacent pixels is often measured, which is then incorporated into iterative optimization to help acquire the reflectance value. Inspired by this, the brightness similarity of one spherical light source with its neighborhood is used in SLSparsemax, where light sources more similar to neighbors have a better chance of being preserved. In this setting, if two light sources have the same brightness in the initial prediction result, the one with a more similar brightness to the neighborhood will be preferentially retained to encourage the network to eventually output a “locally clustered, globally sparse” light source prediction. Even if the wrong light source is retained, it will also be punished by the supervision of subsequent loss function, thus improving the prediction accuracy of the network.

Input :

P

Normalize

P^{norm}=P-\max\left\{P\right\}

Calculate

P^{cred}

s.t.

P_{i}^{simi}=\exp\left(-|P^{norm}_{i}-\frac{1}{|\Gamma(i)|}\sum\limits_{j\in% \Gamma(i)}P^{norm}_{j}|\right)

P_{i}^{cred}=\frac{P_{i}^{norm}}{P_{i}^{simi}}

Sort

P

P^{cred}(1)\geq\dots\geq P^{cred}(N)

Find

\kappa(P):=

\max\left\{\kappa\in[N]\mid 1+\kappa\min\limits_{1\leq i\leq\kappa}\{P(i)\}>% \sum\limits_{j\leq\kappa}P(j)\right\}

Define

\tau(P)=\frac{\left(\sum_{j\leq\kappa(P)}P(j)\right)-1}{\kappa(P)}

Output :

P^{sparse}

s.t.

P^{sparse}_{i}=\left[P_{i}-\tau(P)\right]_{+}

Algorithm 1 SLSparsemax

As depicted in Algorithm 1, SLSparsemax takes the initially predicted non-sparse light distribution $P$ as input. Each element in $P$ represents the brightness probability of a corresponding light source located on the spherical surface. To effectively filter out small light sources and locally cluster light sources, the credibility $P_{i}^{cred}$ of each SG light source is computed based on its normalized brightness probability $P_{i}^{norm}$ and its similarity $P_{i}^{simi}$ (in terms of brightness) with neighboring light sources in its neighborhood $\Gamma(i)$ . Subsequently, the input $P$ is sorted in descending order of the corresponding credibility $P_{i}^{simi}$ . Through a convex optimization process, a threshold $\tau(P)$ is determined to retain the top $\kappa(P)$ most reliable light sources, thereby producing the sparse output $P^{sparse}$ . In contrast to the original sparsemax [20] which focuses solely on the brightness of light sources, SLSparsemax additionally takes into account the brightness similarity between the light source and its neighborhood, where higher similarity results in enhanced gains in credibility, and ultimately a higher chance of being retained.

3.3 Loss Functions

For the SG section, MixLight adopts some effective SG-related loss functions that have been proven useful in [10]. Same with [10], the original L2 loss is used to constrain the distribution of lighting sources $P$ , the intensity $E$ and the color ratio $R$ , as well as their SML loss for $P$ . These loss functions can be found directly in Zhan et al. [10]. In addition to these, a masked L1 loss is devised for the SG parameter $P$ , in conjunction with SLSparsemax, to ensure accurate estimation of the light sources on the lower hemisphere.

The original sparsemax is a general method and not directly applicable to the lighting estimation problem. Unlike multi-class classification problems where sparsemax has been introduced and improved sparsity, the light distribution $P$ in lighting estimation problems is biased, with the majority of light sources distributed in the upper hemisphere of the spherical space. In this case, directly introducing the original sparsemax in the neural network will cause the network to inadequately learn the light distribution in the lower hemisphere of the spherical space, resulting in the network predicting numerous small light sources in the lower hemisphere where there are originally very few light sources. SLSparsemax also failed when facing this special light distribution-biased problem. To solve this problem, a masked L1 loss function specifically designed for light distributions $P$ is introduced to penalize the predicted tiny light sources in some places where there is no light source in the GT, which mainly occurs in the lower hemisphere. In other words, the masked L1 loss function $L_{masked-L1}$ will penalize tiny light sources that occur (have non-zero value) in predicted $\widehat{P}$ but do not occur in ground truth $P$ . This loss function compensates for the inadequacy of the original sparsemax method in handling the bias in data.

L_{masked-L1}=\lVert M^{P}\odot(\widehat{P}-P)\rVert_{1}

(8)

$M^{P}$ is a mask with the same shape as $P$ . Where $P$ is 0, the corresponding position in $M$ is 1; where $P$ is non-zero, the corresponding position in $M$ is 0. $\odot$ means Mask operation.

Regarding the SH parameters, MixLight mainly considers constraints on the SH coefficients themselves, constraints on the SH reconstruction results (used in [18]), and constraints on the SH rendering results (used in [19]).

The loss function for SH coefficients is defined as:

L_{\text{SH-co}}=\sum_{c=1}^{3}\sum_{k=0}^{K}\left(\frac{1}{2k+1}\sum_{m=-k}^{% k}\left(\widehat{A}_{m,c}^{k}-A_{m,c}^{k}\right)^{2}\right)

(9)

However, achieving a low error in the spherical harmonic (SH) coefficients does not necessarily ensure accurate light prediction. Even minor variations in the SH coefficients can result in significant alterations to both the reconstructed lighting map and the rendered outcomes. Therefore, imposing direct constraints on both the reconstructed lighting map and the rendering results can prove beneficial [18, 19].

Thanks to the differentiable nature of SH reconstruction [18] for light map and rendering [19], the SH reconstruction loss function and the rendering loss function can be directly constructed and optimized by backpropagation to update the network parameters. The SH reconstruction loss has been confirmed to contribute to the stability of the training process [18], which is defined in Equation 10:

\small\frac{1}{3wh}\sum_{c,\rho}\cos\theta\left(\sum_{m,k}\left(\widehat{A}_{m% ,c}^{k}B_{m}^{k}\left(\rho\right)-A_{m,c}^{k}B_{m}^{k}\left(\rho\right)\right)% \right)^{2}

(10)

Similar to [49], importance weighting $\cos{\theta}$ is also applied to the reconstructed lighting map in Equation 10. In spherical space, higher dimensions occupy smaller proportions of the sphere. Therefore, when calculating pixel errors on the lighting map, weights related to the elevation angle should be used to weaken the influence near the poles and increase the influence near the equator.

As to SH rendering Loss, Cheng et al. [19] randomly selected a few objects from several prepared objects for rendering during training. However, rendering losses defined on specific objects are not widely representative. In fact, any illumination values in any rendered objects come from the “irradiance environment map”[50], which preserves the irradiance for all orientations, while the rendering result of a specific object only includes a subset of the irradiance. The former is a superset (or the complete set) of the latter. Therefore, constraining the loss on the irradiance environment map indirectly constrains the loss of all potential objects to be rendered.

Given predicted SH coefficients $\widehat{A}$ and ground truth $A$ , the SH rendering loss $L_{\text{SH-rd}}$ is given by Equation 11, with a function shRender $(\cdot)$ helping to calculate irradiance environment map from coefficients (learn more in [50, 19]).

\small\frac{1}{3wh}\sum_{c,\rho}\cos\theta\left(\text{shRender}(\widehat{A})-% \text{shRender}(A)\right)^{2}

(11)

4 Experiments

4.1 Dataset

To train and evaluate MixLight, the publicly available Laval Indoor HDR Dataset [8] is utilized. This dataset comprises 2100 HDR illumination maps. Following the data preprocessing approach of [8, 10], each HDR panorama is cropped and warped eight times, resulting in a total of 19,556 training pairs. Similar to the method in [10], 200 training pairs are randomly selected as the test set, while the remaining 19,356 pairs are used as the training set.

Most prior works only assess their methods on the Laval Indoor HDR Dataset. However, this paper introduces an additional Web Dataset for testing. The images in the Web Dataset are collected from the internet and have no intersection with the Laval Indoor HDR Dataset, ensuring that previous SOTA methods have not reached them. This approach aims to provide a fairer evaluation. Meanwhile, comparing a method’s prediction errors on the Web Dataset and the Laval Indoor HDR Dataset allows for the assessment of its generalization performance. Following the same cropping and warping process, the Web Dataset comprises a total of 200 samples for testing. There are two reasons for performing additional tests on the Web Dataset:

•

For one thing, generative methods based on GANs have been criticized for their poor generalization performance [16]. When the training set and test set are split from one limited in-scale dataset (e.g., Laval Indoor HDR Dataset) with a specific scene style, these generative methods are able to memorize such delicate scene style due to their large-scale network parameter, and thus generate delicate illumination maps when tested, performing better than others. However, they may suffer from the overfitting to such a specific dataset style, and perform worse once the test samples are randomly collected from wild scenes with significant style differences. To validate this claim, a dataset having a random style is required.
•

For another, as some SOTA methods are not open-sourced and all compared methods are trained on the Laval Indoor HDR Dataset, it is difficult to ensure their training sets do not overlap with the test set used in this paper. While, constructing an additional Web Dataset ensures that it does not overlap with the training sets used by all methods, thus providing a fairer evaluation result.

4.2 Training Settings

Following [14, 8, 10], the size of the illumination map is 128x256, and the size of the limited FOV image that serves as the input is 192x256. In line with the best practice outlined in [10], a 5% threshold is used to separate ambient light from light sources. Moreover, the number of anchor points $N$ is set to 128, and the angular size $s$ is set to 0.0025, which constructs a sharp SG to represent the sharp light sources. For the setting of $N$ , note that in the experiment conducted by Zhan et al.[10], the prediction performance drops slightly when 64 instead of 128 anchor points are used, and increasing anchor points to 196 doesn’t improve the performance obviously. They conjecture that the larger number of parameters with 196 anchor points affects the regression accuracy negatively, choosing $N=128$ as the best setting.

Additionally, research by [50] shows that using 2nd order SH is sufficient for representing low-frequency ambient light (with an error below 1% when rendering Lambertian objects). Therefore, we use 2nd order SH to represent the ambient light, setting $K$ to 2. We found that reducing the SH parameter number will lead to the degeneration of representation capabilities, while increasing their number will make the regression task more challenging, leading to unstable training.

As to SLSparsemax, 6 closest points are considered to be in the neighborhood of an SG light source. We have also tried multiplying or dividing the components in the brackets of credibility by 10 and finally found that the current configuration is the best. The proposed MixLight is implemented using the PyTorch framework. Adam optimizer with a learning rate decay mechanism (initial learning rate of 0.0001) is used during training. It has been trained 130 epochs with a batch size of 32, on one NVIDIA GeForce RTX 3090 GPU with 24GB of memory.

4.3 Evaluation Method and Metric

Like several previous works [10, 9, 11, 14], MixLight undergoes both quantitative and qualitative evaluation.

For quantitative evaluation, all the compared methods estimate the illumination of samples from both the test set of the Laval Indoor HDR Dataset and the Web Dataset. The predicted illumination is then used to render three spheres, as shown in Fig. 4, using the Blender rendering engine [51]. The rendered results are then assessed against the GT using commonly used evaluation metrics, including RMSE and si-RMSE [45], which focus on the estimated light intensity and light directions (or shadings), respectively.

In qualitative evaluation, the predicted results are first visualized as illumination maps. Furtherly, similar 3D scenes, as used by [10, 9], each containing a virtual object and a background image, are rendered. The accuracy of illumination estimation can then be reflected based on the visual consistency of the rendered object with the scene.

Note that the actual predicted result is in HDR form, but only the LDR form picture can be displayed in the paper. However, LDR results do not show the true luminance estimates for each method because the tone mapping process adjusts the dynamic range of all images to the best viewing luminance. Furthermore, showing only the LDR light map would also make the observer psychologically biased toward the generative approach, which is unfair because the generative approach uses a higher-dimensional representation. For this reason, three spheres rendered during quantitative evaluation are shown along with the LDR form illumination maps, which reflect the true luminance and reduce potential unfairness.

TABLE I: Comparison of MixLight with several SOTA lighting estimation methods on two datasets. The evaluation metrics include the widely used RMSE, si-RMSE. D, S, M denote a diffuse, a matte silver and a mirror material of the rendered objects, respectively.

	StyleLight			EMLight			Gardner19			MixLight
Metrics	D	S	M	D	S	M	D	S	M	D	S	M
	on Laval Indoor HDR Dataset
RMSE $\downarrow$	0.181	0.218	0.207	0.202	0.274	0.275	0.170	0.213	0.207	0.095	0.185	0.199
si-RMSE $\downarrow$	0.055	0.147	0.155	0.065	0.147	0.160	0.048	0.143	0.151	0.039	0.133	0.144
	on Web Dataset
RMSE $\downarrow$	0.758	0.682	0.609	0.511	0.443	0.435	0.522	0.492	0.455	0.505	0.411	0.375
si-RMSE $\downarrow$	0.155	0.326	0.354	0.184	0.353	0.397	0.168	0.324	0.331	0.134	0.285	0.301

4.4 Quantitative Evaluation

In this section, MixLight is compared with three SOTA methods: Gardner19 [14], which solely relies on SG, StyleLight [9], known for direct illumination map generation, and EMLight [10], which utilizes SG to guide illumination map generation. The quantitative evaluation results on both the Laval Indoor HDR Dataset and the Web Dataset are presented in Table I.

MixLight demonstrates superior performance compared to all SOTA methods across all evaluation metrics and ball materials. This is primarily attributed to the complementary advantages of SH and SG in the frequency domain, allowing for more accurate predictions. Gardner19 simplifies the modeling of ambient light, leading to inaccuracies in predicting low-frequency illumination and subsequently affecting the RMSE metric. Additionally, the fixed number of SG light sources to 3 in Gardner19 oversimplifies the representation of high-frequency light sources, resulting in errors in the si-RMSE metric. StyleLight exhibits notable degradation in the RMSE metric when transitioning from the Laval Indoor HDR Dataset to the Web Dataset, suggesting poor generalization performance due to overfitting to the limited-scale training set. Similarly, EMLight also suffers from overfitting errors, but to a lesser extent, attributed to the guidance of the low-dimensional SG model. However, EMLight does not consider the sparsity of the light source, leading to inaccurate predictions of high-frequency illumination.

Unlike the SG component in Gardner19 and EMLight, the sparse SG in MixLight effectively models the sparse real-life light sources due to the utilization of SLSparsemax, resulting in more precise high-frequency illumination predictions. In contrast to pure generation methods such as StyleLight, MixLight, as a low-dimensional parameterized model, exhibits less performance degradation from the Laval Indoor Dataset to the Web Dataset, highlighting its superior generalization capabilities.

4.5 Qualitative Evaluation

This section visualizes the predicted results and the rendering results. Visualizing the predicted results allows direct observation of the light source’s position and the ambient light, while visualized rendering results depict details (e.g., hard shadows) that are challenging to capture through quantitative evaluation but significantly impact realism. The SOTA methods compared are consistent with those used in the quantitative evaluation. Finally, some failure cases in rendering results will be shown and analyzed.

Comparison of predicted results. Fig. 5 shows the predicted results of various methods, along with the GT illumination maps. The first two rows are test samples from Laval Indoor HDR Dataset, with the last two rows from the Web Dataset. In all four samples, the light source positions and ambient color variance predicted by MixLight are all close to the GT. In contrast, Gardner19 oversimplifies light sources and ambient light representation by fixing 3 light sources and one ambient light color, leading to a poor fit for the strip lights in the first sample and an inability to represent the free-varying ambient light as demonstrated by MixLight. StyleLight and EMLight tend to overfit the indoor style of the Laval Indoor HDR Dataset, performing well on the first two samples but predict inaccurately when the samples are collected from the internet with a significantly different style, as observed in the third and fourth rows. For instance, in the third sample, the pink walls, and in the fourth sample, the abandoned house are not present in the Laval Indoor HDR Dataset. In these two cases, StyleLight and EMLight still generate ceiling, yellow walls, and exquisite doors and windows consistent with the Laval Indoor HDR Dataset style, which is notably different from the GT. Additionally, EMLight does not constrain the sparsity of SG, resulting in the presence of numerous unnecessary light sources in the generated illumination maps.

It is important to note that the comparison of visualized illumination maps serves as a supplementary tool for understanding the limitations of certain methods as identified in the quantitative evaluation section (e.g., Gardner19 oversimplifies the light source representation). The predicted results of low-dimensional parametric methods seem to be smoother. However, the perceived smoothness or sharpness of the LDR illumination map does not inherently reflect the accuracy of the prediction. This is due to two primary factors: (1) The angular information of light is convolved and highly blurred during rendering. Thus the perceived smoothness or sharpness might not be important because it will inevitably be blurred after rendering. (2) Typically, the subjective quality in LDR space does not necessarily align with the results in HDR. In the HDR space, the ambient light predicted by all methods appears smooth (compared with the light source).

Comparison of rendered results. Fig. 6 shows the rendered images alongside the corresponding predicted results used for rendering. It can be observed that MixLight predicts accurately in terms of light source position and variation of ambient light, thus producing realistic rendering results where the color, brightness, and shadows of the virtual objects are close to the GT. Conversely, Gardner19 oversimplifies the modeling of ambient light and light sources. The use of a single color fails to accurately represent the ambient light, resulting in inaccurate color and brightness of virtual objects after rendering, especially in the first sample. Additionally, the limited number of three light sources struggles to represent the variable light sources in the real world, leading to overly harsh shadows in the rendering results, particularly noticeable in the last sample. StyleLight tends to overfit the Laval Indoor HDR Dataset, and its brightness predictions are generally lower than the ground truth, resulting in excessively low brightness in the rendering results. In contrast, EMLight employs the predicted parameters to guide the generation of illumination maps, thereby mitigating the overfitting issue to some extent. This approach demonstrates improved performance in the second, third, and fourth samples. However, the generative model trained on a small-scale dataset still exhibits instability, leading to significant distortion of the illumination maps in the first sample and resulting in orange-colored objects in the rendering.

Failure cases in rendering results. Illumination estimation is still a difficult problem due to its strong ill-posed nature. So even in other papers, it is normal for the proposed method to have limited effectiveness on some samples. The proposed MixLight exceeds all SOTAs in its entirety but also fails on individual samples, as illustrated in Fig. 7. There are two main types of failure: inaccurate brightness or location prediction of light sources. For one thing, the brightness of the light source has greater uncertainty than other parameters, whose loss is more difficult to converge during training. For the chair and pillow in the first two examples in Fig. 7, the brightness is predicted too weak to render clear shadows even if the light source position is very accurate. For the bag in the third example, the brightness is predicted too high, leading to harder Shadows. For another, the position of light Sources could be predicted better if more clues (e.g., shading and shadows) exist in the input. Without clues, even humans can not guess accurate positions. The third example has a huge desktop with neither shading nor clear shadows, where the shadow indicates light from the top but is insufficient to help further locate its azimuth.

TABLE II: Ablation study of choosing different representation models for ambient light and light sources. There are four combinations in total, namely using SH and SG to represent ambient light and light sources (MixLight), using SG for both ambient light and light sources (SGG), using SH for both ambient light and light sources (SHH), and using SG for ambient light and SH for light sources (SGH).

Metrics	MixLight			SGG			SHH			SGH
Metrics	D	S	M	D	S	M	D	S	M	D	S	M
	on Laval Indoor HDR Dataset
RMSE $\downarrow$	0.095	0.185	0.199	0.121	0.223	0.230	0.097	0.210	0.232	0.105	0.212	0.233
si-RMSE $\downarrow$	0.039	0.133	0.144	0.067	0.179	0.185	0.053	0.158	0.165	0.057	0.159	0.165
	on Web Dataset
RMSE $\downarrow$	0.505	0.411	0.375	0.585	0.539	0.507	0.525	0.463	0.425	0.506	0.448	0.415
si-RMSE $\downarrow$	0.133	0.285	0.301	0.226	0.445	0.448	0.213	0.403	0.396	0.200	0.391	0.387

TABLE III: Ablation study of the proposed SLSparsemax used in MixLight. The MixLight variants with SLSparsemax (SLS), original sparsemax (S), and No Sparsemax (NS) undergo quantitative testing on two datasets. Best scores are presented in bold, while second-best scores are displayed in italics.

Metrics	No Sparsemax (NS)			Sparsemax (S)			SLSparsemax (SLS)
Metrics	D	S	M	D	S	M	D	S	M
	on Laval Indoor HDR Dataset
RMSE $\downarrow$	0.111	0.198	0.218	0.102	0.199	0.209	0.095	0.185	0.199
si-RMSE $\downarrow$	0.052	0.141	0.151	0.043	0.141	0.150	0.039	0.133	0.144
	on Web Dataset
RMSE $\downarrow$	0.527	0.436	0.396	0.510	0.432	0.384	0.505	0.411	0.375
si-RMSE $\downarrow$	0.159	0.307	0.326	0.144	0.305	0.314	0.134	0.285	0.301

4.6 Ablation Study

MixLight uses SH to represent ambient light and SG to represent light sources. Table II shows the ablation study of choosing different representation models for ambient light and light sources, using the same datasets, metrics, and comparison methods as quantitative evaluation. The other three (model selection) combinations are SG for both ambient light and light sources (SGG), SH for both ambient light and light sources (SHH), and SG for ambient light but SH for light sources (SGH). Note that these three combinations have the same or similar parameter scales with MixLight, and more details about the parameter settings are in the supplementary file. As shown in Table II, MixLight surpasses the other three combinations in all test sets, all materials of the spheres, and all metrics, mainly benefiting from the accurate prediction ability of SH and SG to low-frequency and high-frequency information respectively.

Another experiment is conducted to quantitatively compare the MixLight variants with SLSparsemax (SLS), original sparsemax (S), and No Sparsemax (NS), as depicted in Table III. Best scores are presented in bold, while second-best scores are displayed in italics. Among the 12 metric results in the two test sets, SLS outperforms the original S with most best scores while S gets more second-best scores than NS. This decreasing error from NS, S to SLS can be observed more intuitively in visualized rendering results and respective error maps in Fig. 8, where S and SLS achieve a much sparser light source prediction than NS, better at mimicking the sparse light sources in GT and consequently producing pronounced shadows. SLSparsemax further generates locally clustered yet globally sparse light sources that closely resemble reality, thus rendering more realistic shadows and highlights and getting better scores in Table III.

5 Limitations and Future Works

This paper explores the feasibility of achieving a more accurate illumination representation model by combining inherent representation models. By observing the law of light sources in indoor scene datasets, we propose the sparsity assumption of light sources, and further design SLSparsemax modules based on this assumption. However, it should be noted that there are also times when the sparsity assumption does not apply. Illumination estimation for outdoor scenes is also not the focus of this paper for the time being. In addition, predicting spatially-varying illumination and predicting illumination with more details are all directions that can be further explored in the future.

Assumption of sparsity. In this paper, an assumption of sparsity is made: in a scene, there are usually only a few light sources, and the number of light sources is variable. This assumption is made as the target dataset (Laval Indoor) presents such sparsity characteristics. It should be noted that sometimes this assumption does not hold (e.g., multiple small light sources in the laboratory). Despite this limitation, making this assumption is a step forward from previous assumptions explicitly or implicitly used by previous SG methods. For example, Gardner19 [14] uses only three SG kernels, and [13] uses only one SG kernel to represent light sources. In the future, SLSparsemax could be introduced to more SG works to model sparse and variable numbers of SG light sources. A more general light source assumption can also be proposed based on the current one.

Outdoor scenes. Outdoor scenes have not been considered in this paper. However, both SH and SG can be promoted to outdoor scenes like [19, 12]. Therefore, as a SH, SG combined method, MixLight can also be promoted to predict outdoor illumination. Additionally, outdoor light sources (typically the sun) are fewer in number but have higher energy, different from indoor light sources. This difference previously blocks the formation of a unified indoor-outdoor light source model. However, this bottleneck is naturally suitable for the proposed SLSparsemax to tackle, which produces number-variant light sources adaptively. It can be further investigated in future work.

Spatially-varying illumination. MixLight is currently not a spatially-varying illumination representation. Considering that the depth information of scenes can be obtained from the Laval indoor HDR Dataset, this problem could be addressed by generating spatially-varying training pairs when pre-processing data, similar with [17], or by regressing the depth of light sources and promoting them from 2D sphere surface to 3D like [14].

Generalization performance. Recovering photo-realistic scene details from a single limited FOV image with a lack of mirror objects remains a challenge, as previous GAN-based approaches have shown poor generalization performance. Inputting multiple images can provide more scene details, which may be a possible solution to address this challenge. Building a large-scale real HDR dataset also helps improve the generalization performance.

6 Conclusions

This paper introduces MixLight, a joint illumination representation model that leverages the complementary advantages of SH and SG in the frequency domain. SLSparsemax is designed to help MixLight achieve as sparse light source predictions as that in real life. The results of quantitative and qualitative evaluations demonstrate that MixLight accurately predicts illumination from single indoor scene limited FOV images. Further experiments on a Web Dataset show that parametric methods have better generalization performance compared to generation methods.

Appendix A SH-SG comparison experiment

In the Introduction section of the main manuscript, an experiment (Fig. 1) is used to illustrate the different advantages of SH and SG in representing ambient light and light sources. The training and testing sets used are both from the Laval Indoor HDR Dataset [8]. The partition method of the dataset is consistent with the experimental settings in the main manuscript. The test set consisted of 200 randomly selected samples. For each sample, ambient light (or light sources) is predicted and used to render three balls. 200 samples render 600 results. For each rendering result, the error is calculated by the formula $\sqrt{\alpha\cdot\beta}$ , where $\alpha$ represents the Root Mean Square Error (RMSE) and $\beta$ represents si-RMSE [45]. Finally, 600 error values are presented in the form of a violin chart (e.g., Fig. 1a shown in the main manuscript), which shows the distribution of the error values on the test set, when using one of the SH and SG models to predict ambient light or light source.

To fairly compare the performance of SH or SG on the ambient light prediction task or the light source prediction task, different versions of SG are designed to predict light sources and ambient light.

In the ambient light estimation task, a 9-kernel smooth SG is designed, with its angular size $s$ adjusted to 0.2423, which ensures that the 9 SG functions cover the sphere as comprehensively as possible while maintaining a decay of roughly 0.5 on the boundaries. This smooth version of SG, along with its $s$ adjusting method, mirrored the approach in [15]. The only difference is that the number of SG kernels is reduced to 9 in this experiment to keep it focused on capturing low-frequency ambient light. This also helps to maintain SG’s parameter number at 27 (in RGB three channels), consistent with a 2nd order SH served as a competitor in ambient light prediction task.

In the light source estimation task, a sharp version of SG is utilized, which is essentially the spherical Gaussian distribution proposed in EMLight. The angular size $s$ is set to 0.0025 to keep the SG kernel function as sharp as the light sources. Then compare the sharp 128-kernel SG with a 6th-order SH, which comprised 132 and 147 parameters, respectively. While the parameters of SG and SH for the light sources prediction task were not entirely consistent, the 6th-order SH was the closest in terms of parameter scale with 128-kernel sharp SG.

Appendix B Original Sparsemax

Sparsemax [20] operates under the assumption that although the network’s output distribution values are fundamentally reliable after numerous training iterations, they are still not sparse enough. To address this, it sorts all elements within the input distribution value in descending order and retains only the leading large values, effectively filtering out small values that are close to zero. Additionally, to minimize alterations to the somewhat trusted input, it only applies an overall offset to the distribution value to preserve the distinction between elements, while striving to retain as many elements as possible to minimize changes in quantity.

Appendix C Ablation Study details

Comparison fairness is ensured when comparing MixLight with 3 other combinations in the ablation study.

In the first experiment of the ablation study, we replace the 2nd order SH of MixLight with 9-kernel SG (smooth version SG) to construct SGG. We replace the 128-kernel SG of MixLight with 6th order SH to construct SHH. We replace both above to construct SGH. So MixLight, SGG, SHH, and SGH have the same or close (27+132), (27+132), (27+147), and (27+147) parameters respectively.

Appendix D More results

More rendered results are given in Fig. 9 to supplement Fig. 6 of the main manuscript.

In Fig. 5 of the main manuscript, the predicted results are visualized as illumination maps to help analyze the pros and cons of various methods given in quantitative evaluation, where rendering results is the final goal that needs to be evaluated quantitatively. To aid the assessment, here are more direct predictions visualized in the form of light maps. Again, the results are split into two different test sets to compare how different methods behave on test sets from familiar (Fig. 10) to unfamiliar styles (Fig. 11).

References

[1] P. Debevec, “Rendering synthetic objects into real scenes: bridging traditional and image-based graphics with global illumination and high dynamic range photography,” in Proceedings of the 25th annual conference on Computer graphics and interactive techniques, 1998, pp. 189–198.
[2] G. Li, C. Wu, C. Stoll, Y. Liu, K. Varanasi, Q. Dai, and C. Theobalt, “Capturing relightable human performances under general uncontrolled illumination,” Computer Graphics Forum, vol. 32, no. 2pt3, pp. 275–284, 2013.
[3] S. B. Knorr and D. Kurz, “Real-time illumination estimation from faces for coherent rendering,” in 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2014, pp. 113–122.
[4] S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, T. Tuytelaars, and L. Van Gool, “What is around the camera?” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5170–5178.
[5] J. T. Barron and J. Malik, “Intrinsic scene properties from a single rgb-d image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 17–24.
[6] L. Gruber, T. Richter-Trummer, and D. Schmalstieg, “Real-time photometric registration from arbitrary geometry,” in 2012 IEEE international symposium on mixed and augmented reality (ISMAR). IEEE, 2012, pp. 119–128.
[7] L. Bin, X. Kun, and R. R. Martin, “Static scene illumination estimation from videos with applications [j],” Journal of Computer Science and Technology, vol. 32, no. 3, pp. 430–442, 2017.
[8] M.-A. Gardner, K. Sunkavalli, E. Yumer, X. Shen, E. Gambaretto, C. Gagné, and J.-F. Lalonde, “Learning to predict indoor illumination from a single image,” ACM Transactions on Graphics, vol. 36, no. 6, pp. 1–14, 2017.
[9] G. Wang, Y. Yang, C. C. Loy, and Z. Liu, “Stylelight: Hdr panorama generation for lighting estimation and editing,” in European Conference on Computer Vision, 2022, pp. 477–492.
[10] F. Zhan, C. Zhang, Y. Yu, Y. Chang, S. Lu, F. Ma, and X. Xie, “Emlight: Lighting estimation via spherical distribution approximation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 3287–3295.
[11] F. Zhan, Y. Yu, C. Zhang, R. Wu, W. Hu, S. Lu, F. Ma, X. Xie, and L. Shao, “Gmlight: Lighting estimation via geometric distribution approximation,” IEEE Transactions on Image Processing, vol. 31, pp. 2268–2278, 2022.
[12] M. R. K. Dastjerdi, J. Eisenmann, Y. Hold-Geoffroy, and J.-F. Lalonde, “Everlight: Indoor-outdoor editable hdr lighting estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7420–7429.
[13] H. Weber, M. Garon, and J.-F. Lalonde, “Editable indoor lighting estimation,” in European Conference on Computer Vision. Springer, 2022, pp. 677–692.
[14] M.-A. Gardner, Y. Hold-Geoffroy, K. Sunkavalli, C. Gagné, and J.-F. Lalonde, “Deep parametric indoor lighting estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7175–7183.
[15] M. Li, J. Guo, X. Cui, R. Pan, Y. Guo, C. Wang, P. Yu, and F. Pan, “Deep spherical gaussian illumination estimation for indoor scene,” in Proceedings of the ACM Multimedia Asia, 2019, pp. 1–6.
[16] F. Zhan, C. Zhang, W. Hu, S. Lu, F. Ma, X. Xie, and L. Shao, “Sparse needlets for lighting estimation with spherical transport loss,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12 830–12 839.
[17] M. Garon, K. Sunkavalli, S. Hadap, N. Carr, and J.-F. Lalonde, “Fast spatially-varying indoor lighting estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6908–6917.
[18] D. Xu, Z. Li, and Y. Zhang, “Real-time illumination estimation for mixed reality on mobile devices,” in 2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW). IEEE, 2020, pp. 702–703.
[19] D. Cheng, J. Shi, Y. Chen, X. Deng, and X. Zhang, “Learning scene illumination by pairwise photos from rear and front mobile cameras,” in Computer Graphics Forum, vol. 37, no. 7. Wiley Online Library, 2018, pp. 213–221.
[20] A. Martins and R. Astudillo, “From softmax to sparsemax: A sparse model of attention and multi-label classification,” in International conference on machine learning. PMLR, 2016, pp. 1614–1623.
[21] A. Laha, S. A. Chemmengath, P. Agrawal, M. Khapra, K. Sankaranarayanan, and H. G. Ramaswamy, “On controllable sparse alternatives to softmax,” Advances in neural information processing systems, vol. 31, 2018.
[22] Y. Mei, H. Zhang, X. Zhang, J. Zhang, Z. Shu, Y. Wang, Z. Wei, S. Yan, H. Jung, and V. M. Patel, “Lightpainter: interactive portrait relighting with freehand scribble,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 195–205.
[23] T. Nestmeyer, J.-F. Lalonde, I. Matthews, and A. Lehrmann, “Learning physics-guided face relighting under directional light,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5124–5133.
[24] R. Pandey, S. Orts-Escolano, C. Legendre, C. Haene, S. Bouaziz, C. Rhemann, P. E. Debevec, and S. R. Fanello, “Total relighting: learning to relight portraits for background replacement.” ACM Trans. Graph., vol. 40, no. 4, pp. 43–1, 2021.
[25] T. Sun, J. T. Barron, Y.-T. Tsai, Z. Xu, X. Yu, G. Fyffe, C. Rhemann, J. Busch, P. Debevec, and R. Ramamoorthi, “Single image portrait relighting,” ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1–12, 2019.
[26] Z. Wang, X. Yu, M. Lu, Q. Wang, C. Qian, and F. Xu, “Single image portrait relighting via explicit multiple reflectance channel modeling,” ACM Transactions on Graphics (TOG), vol. 39, no. 6, pp. 1–13, 2020.
[27] Y.-Y. Yeh, K. Nagano, S. Khamis, J. Kautz, M.-Y. Liu, and T.-C. Wang, “Learning to relight portrait images via a virtual light stage and synthetic-to-real adaptation,” ACM Transactions on Graphics (TOG), vol. 41, no. 6, pp. 1–21, 2022.
[28] L. Zhang, Q. Zhang, M. Wu, J. Yu, and L. Xu, “Neural video portrait relighting in real-time via consistency modeling,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 802–812.
[29] X. Zhang, S. Fanello, Y.-T. Tsai, T. Sun, T. Xue, R. Pandey, S. Orts-Escolano, P. Davidson, C. Rhemann, P. Debevec et al., “Neural light transport for relighting and view synthesis,” ACM Transactions on Graphics (TOG), vol. 40, no. 1, pp. 1–17, 2021.
[30] A. Meka, C. Haene, R. Pandey, M. Zollhöfer, S. Fanello, G. Fyffe, A. Kowdle, X. Yu, J. Busch, J. Dourgarian et al., “Deep reflectance fields: high-quality facial reflectance field inference from color gradient illumination,” ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1–12, 2019.
[31] H. Kim, M. Jang, W. Yoon, J. Lee, D. Na, and S. Woo, “Switchlight: Co-design of physics-driven architecture and pre-training framework for human portrait relighting,” arXiv preprint arXiv:2402.18848, 2024.
[32] H. Weber, D. Prévost, and J.-F. Lalonde, “Learning to estimate indoor lighting from 3d objects,” in 2018 International Conference on 3D Vision (3DV). IEEE, 2018, pp. 199–207.
[33] H.-X. Yu, S. Agarwala, C. Herrmann, R. Szeliski, N. Snavely, J. Wu, and D. Sun, “Accidental light probes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 521–12 530.
[34] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image synthesis with spatially-adaptive normalization,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2337–2346.
[35] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8110–8119.
[36] H. Zhu, H. Yang, L. Guo, Y. Zhang, Y. Wang, M. Huang, M. Wu, Q. Shen, R. Yang, and X. Cao, “Facescape: 3d facial dataset and benchmark for single-view 3d face reconstruction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[37] Y. Feng, H. Feng, M. J. Black, and T. Bolkart, “Learning an animatable detailed 3d face model from in-the-wild images,” ACM Transactions on Graphics (ToG), vol. 40, no. 4, pp. 1–13, 2021.
[38] Z. Chen, A. Chen, G. Zhang, C. Wang, Y. Ji, K. N. Kutulakos, and J. Yu, “A neural rendering framework for free-viewpoint relighting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5599–5610.
[39] H. Zhou, S. Hadap, K. Sunkavalli, and D. W. Jacobs, “Deep single-image portrait relighting,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 7194–7202.
[40] D. Frolova, D. Simakov, and R. Basri, “Accuracy of spherical harmonic approximations for images of lambertian objects under far and near lighting,” in Computer Vision-ECCV 2004: 8th European Conference on Computer Vision, Prague, Czech Republic, May 11-14, 2004. Proceedings, Part I 8. Springer, 2004, pp. 574–587.
[41] R. Ramamoorthi, “Analytic pca construction for theoretical analysis of lighting variability in images of a lambertian object,” IEEE transactions on pattern analysis and machine intelligence, vol. 24, no. 10, pp. 1322–1333, 2002.
[42] ——, “Modeling illumination variation with spherical harmonics,” Face Processing: Advanced Modeling Methods, pp. 385–424, 2006.
[43] H. Vogel, “A better way to construct the sunflower head,” Mathematical biosciences, vol. 44, no. 3-4, pp. 179–189, 1979.
[44] Q. Zhao, P. Tan, Q. Dai, L. Shen, E. Wu, and S. Lin, “A closed-form solution to retinex with nonlocal texture constraints,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 7, pp. 1437–1444, 2012.
[45] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Freeman, “Ground truth dataset and baseline evaluations for intrinsic image algorithms,” in 2009 IEEE 12th International Conference on Computer Vision, 2009, pp. 2335–2342.
[46] S. Sengupta, A. Kanazawa, C. D. Castillo, and D. W. Jacobs, “Sfsnet: Learning shape, reflectance and illuminance of facesin the wild’,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6296–6305.
[47] A. Meka, M. Shafiei, M. Zollhöfer, C. Richardt, and C. Theobalt, “Real-time global illumination decomposition of videos,” ACM Transactions on Graphics (ToG), vol. 40, no. 3, pp. 1–16, 2021.
[48] R. Wang, Q. Zhang, C.-W. Fu, X. Shen, W.-S. Zheng, and J. Jia, “Underexposed photo enhancement using deep illumination estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6849–6857.
[49] V. Gkitsas, N. Zioulis, F. Alvarez, D. Zarpalas, and P. Daras, “Deep lighting environment map estimation from spherical panoramas,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 640–641.
[50] R. Ramamoorthi and P. Hanrahan, “An efficient representation for irradiance environment maps,” in Proceedings of the 28th annual conference on Computer graphics and interactive techniques, 2001, pp. 497–500.
[51] R. Hess, Blender foundations: The essential guide to learning blender 2.5. Taylor & Francis, 2013.