MixLight: Borrowing the Best of both Spherical Harmonics and Gaussian Models

Xinlong Ji, Fangneng Zhan, Shijian Lu, Shi-Sheng Huang, Hua Huang

Xinlong Ji is with the School of Computing Science & Technology, Beijing Institute of Technology, Beijing 100081, China. E-mail: [email protected]. Fangneng Zhan is with the Max Planck Institute for Informatics, Saarbrücken 66123, Germany. E-mail: [email protected]. Shijian Lu is with the School of Computer Science and Engineering, Nanyang Technological University, 639798, Singapore. E-mail: [email protected]. Shi-Sheng Huang and Hua Huang are with the School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China. E-mail: {huangss, huahuang}@bnu.edu.cn. Hua Huang is the corresponding author.

Abstract

Accurately estimating scene lighting is critical for applications such as mixed reality. Existing works estimate illumination either by generating illumination maps or by regressing illumination parameters. However, methods that generate illumination maps generalize poorly, and parametric models such as Spherical Harmonics (SH) and Spherical Gaussians (SG) fall short in capturing high-frequency and low-frequency components, respectively. This paper presents MixLight, a joint model that exploits the complementary characteristics of SH and SG to achieve a more complete illumination representation, using SH to capture low-frequency ambient light and SG to capture high-frequency light sources. In addition, a spherical light source sparsemax (SLSparsemax) module, which takes into account the position and brightness relationships between spherical light sources, is designed to improve their sparsity, a property that is significant but has been overlooked by prior works. Extensive experiments demonstrate that MixLight surpasses state-of-the-art (SOTA) methods on multiple metrics. Experiments on the Web Dataset further show that MixLight, as a parametric method, generalizes better than non-parametric methods.

Index Terms:
Illumination estimation, mixed reality, spherical harmonics, spherical Gaussian, deep learning.

1 Introduction

Recovering high-dynamic-range (HDR) illumination from a single limited field-of-view (FOV) image is a challenging yet attractive task. It is relevant to many areas of computer graphics, including mixed reality, HDR relighting, and virtual character insertion. Solving it involves three primary challenges. First, panoramic illumination must be inferred from an image with a limited FOV. Second, recovering HDR illumination from a low-dynamic-range (LDR) observation is an ill-posed problem. Third, the observed image results from an intricate interplay between geometry, material, and illumination, which introduces substantial ambiguity into lighting estimation.

(a) Ambient light estimation
(b) Light source estimation
Figure 1: Two violin plots illustrate the distinct advantages of SH and SG in illumination estimation tasks. The objective is to examine how SH and SG differ in representing the various light components. In the first scenario, SH and SG with the same number of parameters are used to represent ambient light, and two networks are trained to predict the SH and SG parameters, respectively. The predicted ambient light is then used to render spheres (depicted in Fig. 4) on the test set. Prediction accuracy is evaluated by calculating the error between the predicted rendering and the real rendering. The errors from all test set samples are visualized in the violin plot in Fig. 1a, which displays the distribution and average of the errors. In the second scenario, SH and SG with a similar number of parameters are used to represent the light sources; the networks are trained in the same way, and the rendering error is plotted in Fig. 1b. For further details about the experiment's design, refer to the supplementary file.

Prior works [1, 2, 3, 4, 5, 6, 7] mitigate this ill-posedness by incorporating additional information; however, these approaches are often not user-friendly, for example because they require depth maps [5, 6] or multi-view images [2, 7]. Recent works have leaned towards using neural networks to predict illumination directly from a single image, either by generating illumination maps [8, 9, 10, 11, 12, 13] or by regressing illumination parameters [14, 10, 11, 15, 16, 17, 18, 12, 13]. Illumination maps, which are panoramic images containing illumination information in all directions, have the potential to fully represent illumination. However, their high dimensionality often results in stability issues and poor generalization in lighting estimation. Regression-based methods, on the other hand, parameterize the illumination with basis functions such as SH [17, 19, 18] and SG [14, 10, 11, 15, 12, 13] and use a network to regress the lower-dimensional illumination parameters. These methods generalize better but struggle to capture either high-frequency or low-frequency characteristics, leading to an incomplete representation.

Figure 2: The proposed MixLight estimates illumination and re-illuminates multiple virtual objects (in the second row). MixLight estimates low-dimensional lighting parameters that can be visualized as illumination maps (at the top right of each example) from limited field-of-view pictures (at the top left of each example).

Specifically, Fig. 1 shows that SH is better suited to capturing low-frequency ambient light (with a lower error in Fig. 1a), while SG is better suited to capturing high-frequency light sources (with a lower error in Fig. 1b). Therefore, a joint representation that exploits the complementary advantages of SH and SG in the frequency domain can represent illumination information more accurately.

Inspired by this, a joint SH and SG model of illumination representation called MixLight is proposed. An HDR illumination map can be divided into two parts, ambient light and light sources, using a simple brightness threshold segmentation method [10]. The ambient light mainly carries the low-frequency illumination information. A light source is a steep peak in brightness, which corresponds to the high-frequency part of the illumination information. MixLight uses a neural network to regress the SH and SG parameters simultaneously, where SH represents the low-frequency ambient light and SG represents the high-frequency light sources.

In practice, real-world light sources are sparse: a scene typically contains a relatively small yet variable number of them. Nevertheless, prior SG methods either overlook this sparsity [10, 11] by generating extraneous tiny light sources, or are confined to oversimplified settings with a fixed number of light sources [14, 12, 13]. Inspired by the sparsemax theory [20, 21], this paper designs the Spherical Light Source Sparsemax (SLSparsemax) mechanism to impose sparsity constraints on light sources at the neural network level. SLSparsemax considers the position and brightness relationships among light sources on the sphere, calculates the credibility of light source predictions, and adaptively determines the number of sparse light sources.

The contributions of this work are summarized as follows:

  1. A joint SH and SG model named MixLight is proposed to represent illumination more completely and accurately with only a few parameters.

  2. The spherical light source sparsemax (SLSparsemax) mechanism is designed to impose sparsity constraints on the estimation of light sources.

  3. Experiments on several datasets show that MixLight outperforms several SOTA methods in prediction accuracy and has good generalization ability.

Figure 3: MixLight parameter decomposition and estimation. In the right half of the figure, an illumination map is separated into the light sources and the ambient light component, which are then decomposed into the ground-truth MixLight parameters (the SG and SH parameters). The left half describes the parameter regression process. Specifically, the MixLight model uses an SLSparsemax activation layer in the network to enforce the sparsity of the light source parameters. Consistent with [10, 11, 17, 14], MixLight uses DenseNet121 as the backbone network. The input of the network is cropped from the illumination map (corresponding to the area in the red box).

2 Related Works

Illumination estimation is a classic and challenging task in computer vision and computer graphics. The panoramic HDR illumination estimated from a single image can be used directly as the illumination condition for many subsequent applications such as virtual character insertion or relighting [22, 23, 24, 25, 26, 27, 28, 29, 30, 31]. Early research efforts attempt to introduce more information to alleviate the ill-posedness, such as using landmarks in the scene to obtain partial geometric (or material) information [1, 2, 3, 4, 32, 33], or directly obtaining scene geometry from depth maps [5, 6] and multi-view images [2, 7]. However, requiring this additional information is not user-friendly.

In recent years, most illumination estimation works are based on neural networks, which generate illumination maps [8, 10, 9, 11, 12] or regress parameters [14, 10, 11, 15, 16, 17, 18, 19, 12, 13] without relying on additional information. Generating high-dimensional illumination maps requires a large amount of real HDR training data to ensure generalization and realistic predictions. However, such large-scale realistic datasets are currently lacking, and constructing them consumes considerable time and effort. Therefore, Gardner et al. [8] train a CNN mainly on a large-scale synthetic dataset and then fine-tune the light source intensity on a small-scale real dataset. More recent works [10, 9, 13, 12] rely on generative adversarial networks (GANs) to generate illumination maps with realistic details. Zhan et al. [10] use the regressed SG parameters to guide GAN [34] generation, similar to two other hybrid models [13, 12], while Wang et al. [9] directly generate illumination maps with the scene style encoded in StyleGAN [35]. However, these GANs are trained only on limited-scale real HDR datasets, which results in overfitting to the training set and poor generalization.

Parameter regression methods use parameterized representations such as SH [19, 18, 17] and SG [14, 10, 11, 15]. Because SH supports low-overhead, fast, and differentiable illumination map reconstruction and object rendering, SH-based methods have been deployed on resource-limited mobile devices [19, 18, 17], together with a reconstruction loss [18] and a rendering loss [19] that improve prediction quality. However, limited by their representation defects in the frequency domain (shown in Fig. 1), SH methods represent high-frequency illumination inaccurately. Conversely, SG methods are poor at low-frequency representation, and they also suffer from inaccuracy in high-frequency representation because they neglect the sparsity of light sources. Gardner et al. [14] fix the number of light sources (e.g., 3), which cannot accurately represent the varying number and shape of light sources in the real world. Zhan et al. [10] achieve a variable number of SG light sources using a spherical distribution but fail to remove the numerous small distribution values in the predicted results; the prediction is therefore not sparse, is inconsistent with reality, and misleads the subsequent generation.

Because methods that generate high-dimensional illumination maps need a large-scale realistic dataset, which is currently unavailable, to ensure generalization, low-dimensional parameterized representations are a good alternative. Parametric representations also connect more easily with many existing applications [36, 37, 38, 39] that choose parametric illumination for more efficient rendering. However, previous works typically use either SH or SG alone, which fails to accurately represent high-frequency and low-frequency illumination at the same time. This paper introduces MixLight, which combines the strengths of SH and SG to represent low-frequency ambient light and high-frequency light sources, respectively, achieving a lower-dimensional yet more complete representation. To counter the non-sparsity that weakens the accuracy of high-frequency light source prediction, SLSparsemax is designed; it uses the positional and intensity information of spherical light sources to help the network output adaptively sparse light sources.

3 Proposed Method

The proposed MixLight combines the complementary characteristics of SH and SG, using them to represent the ambient light and the light sources, respectively, as illustrated in Fig. 3. SLSparsemax is built into the regression network to improve the accuracy of high-frequency illumination prediction. The following sections introduce the MixLight parameter decomposition (Section 3.1), the design of SLSparsemax (Section 3.2), and the loss functions (Section 3.3).

3.1 Parameters Decomposition

As visualized in the right half of Fig. 3, the illumination map $I$ is first separated into the ambient light $I_A$ and the light source regions $I_L$, which are further parameterized by Spherical Harmonics and Spherical Gaussian, respectively.

Separation. In the real world, the energy of light sources is mainly located in the high-frequency domain, while ambient light is comparatively low-frequency. This distinction arises because a light source exhibits sharp changes in brightness, whereas ambient light changes more gradually. Separating light sources from ambient light therefore effectively separates high-frequency from low-frequency light, which facilitates the subsequent parameter decomposition. In addition to the frequency difference, there is a large brightness difference: light sources are always much brighter than the other objects that make up the ambient light in the scene. HDR illumination maps can capture this brightness difference, so the two light types can be effectively separated with a 5% brightness threshold, as proposed in [10]. This method identifies the top 5% brightest pixels as light sources and categorizes the remaining pixels as ambient light.
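The thresholding step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `separate_light`, the use of the RGB mean as per-pixel brightness, and filling the complementary region with zeros are all assumptions made here.

```python
import numpy as np

def separate_light(hdr, top_fraction=0.05):
    """Split an HDR map of shape (h, w, 3) into ambient and light source parts.

    Assumption: per-pixel brightness is the mean of the RGB channels.
    """
    brightness = hdr.mean(axis=-1)
    # Brightness value above which a pixel belongs to the top `top_fraction`.
    threshold = np.quantile(brightness, 1.0 - top_fraction)
    mask = brightness > threshold                      # True on light source pixels
    light = np.where(mask[..., None], hdr, 0.0)        # I_L: light sources only
    ambient = np.where(mask[..., None], 0.0, hdr)      # I_A: everything else
    return ambient, light, mask
```

By construction the two parts add back up to the input map, so no energy is lost in the split.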

SH Parameters Decomposition. In the field of lighting estimation, SH is an important method for representing illumination [19, 17, 18]. Similar to the Fourier transform, SH uses a set of orthogonal spherical basis functions and their corresponding coefficients to approximate illumination [40, 41, 42]; in this paper it represents the ambient light $I_A$, as shown in Equation 1:

$$I_A(c,\rho) \approx \sum_{k=0}^{K} \sum_{m=-k}^{k} A_{m,c}^{k} B_{m}^{k}(\rho) \tag{1}$$

Here, $c$ represents the channel of the RGB-format $I_A$, where $c$ can be 1, 2, or 3, corresponding to the red, green, and blue channels, respectively. $\rho = (\phi, \theta)$ represents the spherical coordinate angles defined on the spherical image domain $\Omega$, where $\phi$ is the azimuth angle and $\theta$ is the elevation angle. $B_{m}^{k}$ denotes the spherical harmonic basis function with order $k$ (ranging from 0 to $K$) and degree $m$ (ranging from $-k$ to $k$), and $A_{m,c}^{k}$ represents the corresponding coefficient for channel $c$. $K$ is the upper bound of $k$.

As the order $k$ increases, SH can capture information from low frequencies to high frequencies, resulting in a more accurate representation [40, 41, 42]. However, SH should only be used to represent low-frequency information considering its representation defects, and the number of coefficients $A_{m,c}^{k}$ also expands rapidly with increasing order. Thus, blindly raising the upper bound of $k$ only causes SH to lose its advantage of having few parameters, without much real gain. It is common to use only low-order SH to approximate illumination, setting $K$ to a small value.
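To make the growth in parameter count concrete, the number of SH coefficients can be counted directly from the index ranges in Equation 1. The count below is standard SH bookkeeping rather than a result from this paper, and the order-5 case is included only as an illustrative comparison.

```latex
\#\{A_{m,c}^{k}\}\ \text{per channel} \;=\; \sum_{k=0}^{K} (2k+1) \;=\; (K+1)^2
% K = 2: 9 coefficients per channel, 27 for the three RGB channels
% K = 5: 36 coefficients per channel, 108 for the three RGB channels
```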

As the inverse process of Equation 1, the SH coefficients $A_{m,c}^{k}$ can be obtained by projecting the ambient light $I_A$ onto the SH basis:

$$A_{m,c}^{k} = \frac{4\pi}{wh} \sum_{\rho}^{\Omega} I_A(c,\rho) B_{m}^{k}(\rho) \tag{2}$$

Here, $w$, $h$, and their product $wh$ represent the width, the height, and the area of the ambient light $I_A$.
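Equations 1 and 2 can be sketched together for $K = 2$. This is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, the basis uses the standard real SH constants up to order 2, and the mapping from equirectangular pixels to directions (z-up, elevation from $-\pi/2$ to $\pi/2$) is an assumed convention. As in Equation 2, every pixel is weighted equally.

```python
import numpy as np

def sh_basis_order2(dirs):
    """Real SH basis up to order 2 for unit directions (..., 3) -> (..., 9)."""
    x, y, z = dirs[..., 0], dirs[..., 1], dirs[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),          # k=0
        0.488603 * y,                        # k=1, m=-1
        0.488603 * z,                        # k=1, m=0
        0.488603 * x,                        # k=1, m=1
        1.092548 * x * y,                    # k=2, m=-2
        1.092548 * y * z,                    # k=2, m=-1
        0.315392 * (3.0 * z * z - 1.0),      # k=2, m=0
        1.092548 * x * z,                    # k=2, m=1
        0.546274 * (x * x - y * y),          # k=2, m=2
    ], axis=-1)

def equirect_dirs(h, w):
    """Unit direction per pixel of an h x w panorama (assumed convention)."""
    theta = (np.arange(h) + 0.5) / h * np.pi - np.pi / 2      # elevation
    phi = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi      # azimuth
    phi, theta = np.meshgrid(phi, theta)
    return np.stack([np.cos(theta) * np.cos(phi),
                     np.cos(theta) * np.sin(phi),
                     np.sin(theta)], axis=-1)

def project_sh(ambient):
    """Eq. 2: ambient map (h, w, 3) -> coefficients of shape (3, 9)."""
    h, w, _ = ambient.shape
    basis = sh_basis_order2(equirect_dirs(h, w))               # (h, w, 9)
    return 4.0 * np.pi / (w * h) * np.einsum('hwc,hwk->ck', ambient, basis)

def reconstruct_sh(coeffs, h, w):
    """Eq. 1: coefficients (3, 9) -> ambient map (h, w, 3)."""
    basis = sh_basis_order2(equirect_dirs(h, w))
    return np.einsum('ck,hwk->hwc', coeffs, basis)
```

With $K = 2$ the whole ambient component is described by a `(3, 9)` array, that is, 27 numbers.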

SG Parameters Decomposition. Similar to the approach in [10], the light source part $I_L$ is represented using SG on $N$ evenly distributed anchor points. The intensity of the light source is denoted as $E$, its color is represented by $R$, and the energy distribution of light sources is modeled by the light distribution $P$. The procedure for obtaining the SG parameters is largely the same as in [10]; the details are given here for clarity.

To calculate the light source intensity $E$, sum up the pixel values in each channel, yielding:

$$T_c = \sum_{\rho}^{\Omega} I_L(c,\rho), \quad c = 1, 2, 3 \tag{3}$$

$T_c$ represents the sum of the pixel values in channel $c$.

The total intensity of the light source is the L2 norm of the vector formed by $T_1$, $T_2$ and $T_3$:

$$E = \left\| \left( T_1, T_2, T_3 \right) \right\|_2^2 \tag{4}$$

The overall color of the light source is represented by the color ratios $R$:

$$R = \frac{(T_1, T_2, T_3)}{E} \tag{5}$$

Next, the pixel values within the light source region are assigned to the nearest anchor point, and the energy distribution on the $N$ anchor points is computed as $P$.
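The decomposition steps above can be sketched as follows. Several choices here are assumptions rather than details from the paper: the function names, the golden-angle (Vogel-style) spiral used for the anchors, the use of each pixel's RGB sum as its energy, and the normalization of $P$ to sum to 1. $E$ is taken as the plain L2 norm described in the prose, while Eq. 4 as printed carries an additional square; because the per-anchor amplitude $P_i E R$ equals $P_i (T_1, T_2, T_3)$ either way, the choice cancels in the reconstruction.

```python
import numpy as np

def vogel_anchors(n):
    """n roughly evenly spread unit vectors from a golden-angle spiral."""
    i = np.arange(n)
    z = 1.0 - (2.0 * i + 1.0) / n
    r = np.sqrt(1.0 - z * z)
    phi = i * np.pi * (3.0 - np.sqrt(5.0))    # golden angle increments
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=-1)

def decompose_sg(light_vals, light_dirs, anchors):
    """light_vals: (M, 3) RGB of light source pixels; light_dirs: (M, 3) unit vectors."""
    T = light_vals.sum(axis=0)                          # Eq. 3: per-channel sums
    E = np.linalg.norm(T)                               # intensity (L2 norm, see lead-in)
    if E == 0.0:                                        # no light source pixels at all
        return 0.0, np.zeros(3), np.zeros(len(anchors))
    R = T / E                                           # Eq. 5: color ratios
    # Nearest anchor = largest cosine between pixel direction and anchor direction.
    nearest = np.argmax(light_dirs @ anchors.T, axis=1)
    energy = np.bincount(nearest, weights=light_vals.sum(axis=1),
                         minlength=len(anchors))
    P = energy / energy.sum()                           # assumed to sum to 1
    return E, R, P
```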

As the reverse of parameter decomposition, light sources can be reconstructed in the form of spherical Gaussian maps, as done by Zhan et al. [10]. However, the Gaussian map reconstructed by Zhan et al. is not entirely accurate: it mainly approximates the distribution of light and serves as a rough guide for the subsequent generator, without providing an accurate intensity measure.

Different from Zhan et al. [10] but similar in spirit to [15, 13, 12], the Gaussian map generated by MixLight directly serves as the final estimate of the light source part, so it must be numerically accurate in intensity. Therefore, the Gaussian map reconstruction function is multiplied by a normalization term $q$ that keeps the integral of a single Gaussian kernel equal to 1 (before multiplication by the amplitude). This keeps the sum of the pixel values assigned to an anchor point in $I_L$ equal to the integral of the corresponding SG in the Gaussian map. The Gaussian map reconstruction formula is then written as:

$$I_L \approx \sum_{i=1}^{N} v_i e^{\frac{d_i u - 1}{s}} q \tag{6}$$

$N$ is the number of anchor points, $v_i$ denotes the RGB value of the $i$-th anchor point, which is the product of the light distribution, intensity, and color ratios on that anchor point (namely $v_i = P_i E R$), $d_i$ is the direction of the anchor point (predefined by the Vogel method [43]), $u$ is the unit vector giving a direction on the sphere, and $s$ is the angular size of the Gaussian kernel.

The normalization term $q$ is the reciprocal of the integral value of the spherical Gaussian and depends on the angular size $s$ and the sphere radius $r$:

$$q = \frac{1}{2\pi s r^{2} \left( 1 - e^{-2/s} \right)} \tag{7}$$
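The reconstruction in Equation 6 can be sketched as below. This is an illustrative sketch with hypothetical function names, not the authors' code. The normalizer is written as the reciprocal of the closed-form integral of the kernel $e^{(d \cdot u - 1)/s}$ over a sphere of radius $r$, which evaluates to $2\pi s r^{2}(1 - e^{-2/s})$; with this choice a single normalized kernel integrates to 1, as the text requires.

```python
import numpy as np

def sg_normalizer(s, r=1.0):
    """Reciprocal of the integral of exp((d.u - 1) / s) over a sphere of radius r."""
    return 1.0 / (2.0 * np.pi * s * r ** 2 * (1.0 - np.exp(-2.0 / s)))

def gaussian_map(v, anchors, dirs, s, r=1.0):
    """Eq. 6 evaluated at query directions.

    v:       (N, 3) RGB amplitude per anchor (v_i = P_i * E * R)
    anchors: (N, 3) unit anchor directions d_i
    dirs:    (M, 3) unit query directions u
    returns: (M, 3) RGB radiance per query direction
    """
    kernel = np.exp((dirs @ anchors.T - 1.0) / s)       # (M, N), equals 1 at u = d_i
    return (sg_normalizer(s, r) * kernel) @ v
```

A small `s` gives a sharp peak around each anchor, while a larger `s` spreads the same total energy over a wider lobe.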

Discussion. Why use SH to represent low-frequency ambient light instead of another function on the sphere, such as SG? After all, the angular size of an SG can be adjusted to make it smoother so that it also represents low-frequency information. In the comparison experiment (Fig. 1) described in the Introduction, the angular size of SG is indeed carefully adjusted to produce two versions. A sharp version, physically closer to a light source, is compared with SH in the light source estimation task; a smooth version, physically closer to ambient light, is compared with SH in the ambient light estimation task. More details about this adjustment can be found in the supplementary file. In this experiment, the SG parameters were carefully set for the sake of a fair comparison, but this also reveals a problem. When SG is used to represent lighting, many design choices (such as the angular size, the number of SG kernels, and whether to use a position-fixed model that predicts the amplitude like [10, 11] or a number-fixed model that predicts position and amplitude simultaneously like [12, 14]) require specialized knowledge to determine, and the process involves many variable parameters and is often cumbersome and uncertain. In contrast, SH only requires choosing the order, which is simple yet efficient. In addition, previous works [40, 41, 42] have thoroughly studied the use of SH to represent low-frequency illumination, and 2-order SH has been shown to be very effective. Therefore, even users who lack professional knowledge can obtain a better lighting representation (see Fig. 1a) by simply setting the order to 2. In summary, besides its excellent accuracy in representing low-frequency light, SH has far fewer parameters to tune and is easy to set up, which makes it more user-friendly and efficient.

3.2 SLSparsemax

Light sources in the real world are sparse. In this paper, sparsity of light sources means that a scene usually contains only a few light sources and that their number is variable. Previous works on SG do not effectively constrain the sparsity of the predicted results, which leads to inaccurate predictions.

Sparsemax, as described in [20], is an activation function utilized as the final layer in the neural network to improve the sparsity of the output. It has been applied in multi-class classification problems [21]. When used in illumination estimation, sparsemax tends to retain several of the brightest light sources while filtering out others, which helps create a sparsity relative to individual SG light sources. Further details on sparsemax can be found in the supplementary file and [20].
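For reference, the original sparsemax of [20] can be sketched in NumPy as follows. The function name is illustrative, and the code covers only the forward pass for a single score vector, not the training-time gradient.

```python
import numpy as np

def sparsemax(z):
    """Project a score vector z onto the probability simplex (forward pass only)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                   # scores in descending order
    ks = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1.0 + ks * z_sorted > cumsum        # prefix sizes that stay in the support
    k = ks[support].max()                         # size of the support
    tau = (cumsum[k - 1] - 1.0) / k               # threshold
    return np.maximum(z - tau, 0.0)               # entries below tau become exactly 0
```

Unlike softmax, the output contains exact zeros, which is what makes it attractive for suppressing weak light source candidates.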

As a general method, the original sparsemax does not exploit the additional information available in illumination estimation. In the real world, light sources usually have different shapes (e.g., patches, bars, points) and varying areas, which provides a useful prior for lighting estimation. However, the original SG light source is point-shaped. Therefore, when simulating light sources with SG, local point SG light sources on the spherical surface should be encouraged to connect so that they can simulate the patch-like and bar-like light sources of the real world. To this end, sparsity should be maintained between SG clusters rather than only between individual SGs. In summary, SLSparsemax encourages “local clustering, global sparsity” for SG light sources.

How can the SG point lights on the sphere form a “locally clustered, globally sparse” distribution? The classical intrinsic decomposition task [44, 45, 46, 47, 48] has similar requirements. Intrinsic decomposition splits an image into a reflectance layer and a shading layer, where the reflectance layer represents the color of the object itself and the shading layer represents the effect of light on the object. The reflectance layer is usually assumed to consist of a limited number of color blocks; the pixels within a block have the same color, but the blocks differ from one another. This is also a prior of local clustering and global sparsity. To satisfy this prior, the chroma or brightness similarity of a pixel to its adjacent pixels is often measured and then incorporated into an iterative optimization that recovers the reflectance value. Inspired by this, SLSparsemax uses the brightness similarity of a spherical light source to its neighborhood, so that light sources more similar to their neighbors have a better chance of being preserved. In this setting, if two light sources have the same brightness in the initial prediction, the one whose brightness is more similar to its neighborhood is preferentially retained, which encourages the network to eventually output a “locally clustered, globally sparse” light source prediction. Even if a wrong light source is retained, it is penalized by the supervision of the subsequent loss function, which improves the prediction accuracy of the network.

Input: $P$
Normalize $P^{norm} = P - \max\{P\}$
Calculate $P^{cred}$ s.t.
    $P_i^{simi} = \exp\left( -\left| P_i^{norm} - \frac{1}{|\Gamma(i)|} \sum_{j \in \Gamma(i)} P_j^{norm} \right| \right)$
    $P_i^{cred} = \frac{P_i^{norm}}{P_i^{simi}}$
Sort $P$ as $P^{cred}(1) \geq \dots \geq P^{cred}(N)$
Find $\kappa(P) := \max\left\{ \kappa \in [N] \mid 1 + \kappa \min_{1 \leq i \leq \kappa} \{P(i)\} > \sum_{j \leq \kappa} P(j) \right\}$
Define $\tau(P) = \frac{\left( \sum_{j \leq \kappa(P)} P(j) \right) - 1}{\kappa(P)}$
Output: $P^{sparse}$ s.t. $P_i^{sparse} = \left[ P_i - \tau(P) \right]_+$
Algorithm 1: SLSparsemax

As depicted in Algorithm 1, SLSparsemax takes the initially predicted non-sparse light distribution $P$ as input. Each element of $P$ represents the brightness probability of the corresponding light source on the spherical surface. To filter out small light sources and cluster light sources locally, the credibility $P_i^{cred}$ of each SG light source is computed from its normalized brightness probability $P_i^{norm}$ and its brightness similarity $P_i^{simi}$ to the light sources in its neighborhood $\Gamma(i)$. The input $P$ is then sorted in descending order of the corresponding credibility $P_i^{cred}$. Through a convex optimization process, a threshold $\tau(P)$ is determined to retain the top $\kappa(P)$ most reliable light sources, thereby producing the sparse output $P^{sparse}$.
In contrast to the original sparsemax [20] which focuses solely on the brightness of light sources, SLSparsemax additionally takes into account the brightness similarity between the light source and its neighborhood, where higher similarity results in enhanced gains in credibility, and ultimately a higher chance of being retained.
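
The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the normalization that produces $P^{norm}$ is not restated in this section and is replaced here by an assumed division by the maximum, and the function name `sl_sparsemax` and the `neighbors` index lists standing in for $\Gamma(i)$ are hypothetical.

```python
import math


def sl_sparsemax(p, neighbors):
    """Sketch of Algorithm 1.

    p         -- list of N raw brightness scores, one per SG anchor point
    neighbors -- neighbors[i] lists the indices in the neighborhood of point i
    """
    n = len(p)

    # ASSUMPTION: P^norm is obtained by dividing by the largest score.
    peak = max(p) or 1.0
    p_norm = [v / peak for v in p]

    # Similarity to the neighborhood mean, then credibility = norm / simi.
    cred = []
    for i in range(n):
        nb_mean = sum(p_norm[j] for j in neighbors[i]) / len(neighbors[i])
        simi = math.exp(-abs(p_norm[i] - nb_mean))
        cred.append(p_norm[i] / simi)

    # Visit the entries of P in descending order of credibility.
    order = sorted(range(n), key=lambda i: -cred[i])

    # kappa is the largest prefix length k with 1 + k * min > sum.
    tau = 0.0
    running_sum, running_min = 0.0, float("inf")
    for k, idx in enumerate(order, start=1):
        running_sum += p[idx]
        running_min = min(running_min, p[idx])
        if 1.0 + k * running_min > running_sum:
            tau = (running_sum - 1.0) / k  # threshold for kappa = k

    # Entries at or below the threshold are set to exactly zero.
    return [max(v - tau, 0.0) for v in p]


ring = [[3, 1], [0, 2], [1, 3], [2, 0]]  # toy neighborhoods on a 4-point ring
sparse = sl_sparsemax([2.0, 1.8, 0.1, 0.0], ring)
```

In this toy example the two dim entries are zeroed out, while the two bright, adjacent entries keep positive values.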

3.3 Loss Functions

For the SG part, MixLight adopts the SG-related loss functions that proved effective in [10]. As in [10], an L2 loss constrains the light source distribution $P$, the intensity $E$, and the color ratio $R$, and their SML loss is additionally applied to $P$. These loss functions are defined in Zhan et al. [10]. In addition, a masked L1 loss is devised for the SG parameter $P$ and used in conjunction with SLSparsemax to ensure accurate estimation of the light sources on the lower hemisphere.

The original sparsemax is a general method and is not directly applicable to lighting estimation. Unlike the multi-class classification problems for which sparsemax was introduced to improve sparsity, the light distribution $P$ in lighting estimation is biased: the majority of light sources lie in the upper hemisphere of the spherical space. Directly inserting the original sparsemax into the neural network therefore causes the network to learn the light distribution in the lower hemisphere inadequately, so that it predicts numerous small light sources in the lower hemisphere, where there are originally very few. SLSparsemax alone also fails on this biased light distribution. To solve this problem, a masked L1 loss designed specifically for the light distribution $P$ is introduced to penalize tiny predicted light sources at positions where the GT contains no light source, which mainly occurs in the lower hemisphere. In other words, the masked L1 loss $L_{masked-L1}$ penalizes tiny light sources that occur (have a non-zero value) in the prediction $\widehat{P}$ but not in the ground truth $P$. This loss compensates for the inadequacy of sparsemax in handling the bias in the data.

$L_{masked-L1} = \lVert M^{P} \odot (\widehat{P} - P) \rVert_{1}$ (8)

$M^{P}$ is a mask with the same shape as $P$: where $P$ is 0, the corresponding entry of $M^{P}$ is 1; where $P$ is non-zero, the corresponding entry of $M^{P}$ is 0. $\odot$ denotes the masking operation, i.e., element-wise multiplication.
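
As a minimal sketch of Equation 8, the masked L1 loss can be written for flat Python lists as below. The function name `masked_l1_loss` and the list-based layout are illustrative assumptions; a real training pipeline would apply the same rule to tensors.

```python
def masked_l1_loss(pred, gt):
    """Equation 8 on flat lists: L1 error restricted to positions where GT is zero."""
    total = 0.0
    for p_hat, p in zip(pred, gt):
        mask = 1.0 if p == 0 else 0.0  # M^P: 1 where the GT has no light source
        total += mask * abs(p_hat - p)
    return total


loss = masked_l1_loss(pred=[0.5, 0.2, 0.0, 0.1], gt=[0.6, 0.0, 0.0, 0.0])
```

Here the error at the first position (0.5 vs. 0.6) is ignored because the GT is non-zero there; only the spurious light sources at the second and fourth positions contribute to the loss.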

Regarding the SH parameters, MixLight mainly considers constraints on the SH coefficients themselves, constraints on the SH reconstruction results (used in [18]), and constraints on the SH rendering results (used in [19]).

The loss function for SH coefficients is defined as:

$L_{\text{SH-co}} = \sum_{c=1}^{3} \sum_{k=0}^{K} \left( \frac{1}{2k+1} \sum_{m=-k}^{k} \left( \widehat{A}_{m,c}^{k} - A_{m,c}^{k} \right)^{2} \right)$ (9)
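
Equation 9 averages the squared error within each SH band before summing over bands and color channels. A small sketch, assuming a hypothetical nested-list layout `a[c][k][m + k]` for the coefficients:

```python
def sh_coefficient_loss(a_pred, a_gt, max_order=2):
    """Equation 9. Coefficients are indexed as a[c][k][m + k] (assumed layout)."""
    loss = 0.0
    for c in range(3):                      # color channels
        for k in range(max_order + 1):      # SH bands 0..K
            band_error = sum(
                (a_pred[c][k][i] - a_gt[c][k][i]) ** 2 for i in range(2 * k + 1)
            )
            loss += band_error / (2 * k + 1)  # average over the 2k+1 coefficients
    return loss


def constant_coefficients(value, max_order=2):
    """Helper for the example: every coefficient set to the same value."""
    return [[[value] * (2 * k + 1) for k in range(max_order + 1)] for _ in range(3)]


loss = sh_coefficient_loss(constant_coefficients(1.0), constant_coefficients(0.0))
```

Because of the $\frac{1}{2k+1}$ factor, each band contributes equally in this example regardless of how many coefficients it holds.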

However, achieving a low error in the spherical harmonic (SH) coefficients does not necessarily ensure accurate light prediction. Even minor variations in the SH coefficients can result in significant alterations to both the reconstructed lighting map and the rendered outcomes. Therefore, imposing direct constraints on both the reconstructed lighting map and the rendering results can prove beneficial [18, 19].

Thanks to the differentiable nature of SH reconstruction [18] for light map and rendering [19], the SH reconstruction loss function and the rendering loss function can be directly constructed and optimized by backpropagation to update the network parameters. The SH reconstruction loss has been confirmed to contribute to the stability of the training process [18], which is defined in Equation 10:

$\frac{1}{3wh} \sum_{c,\rho} \cos\theta \left( \sum_{m,k} \left( \widehat{A}_{m,c}^{k} B_{m}^{k}(\rho) - A_{m,c}^{k} B_{m}^{k}(\rho) \right) \right)^{2}$ (10)

Similar to [49], the importance weighting $\cos\theta$ is applied to the reconstructed lighting map in Equation 10. On the sphere, pixels at higher latitudes cover smaller areas. Therefore, when calculating pixel errors on the lighting map, weights related to the elevation angle should be used to weaken the influence near the poles and increase the influence near the equator.
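
The weighting can be sketched as a cosine-weighted mean squared error between two lighting maps. The sketch assumes an equirectangular map stored as `map[c][row][col]` whose rows span elevations from $+\pi/2$ (top) to $-\pi/2$ (bottom), sampled at row centers; the function names are illustrative.

```python
import math


def row_weights(height):
    """cos(elevation) at the center of each row of an equirectangular map."""
    return [
        math.cos(math.pi / 2 - (r + 0.5) * math.pi / height) for r in range(height)
    ]


def cos_weighted_mse(pred, gt):
    """Weighted squared error averaged over channels, rows, and columns."""
    channels, height, width = len(pred), len(pred[0]), len(pred[0][0])
    weights = row_weights(height)
    total = 0.0
    for c in range(channels):
        for r in range(height):
            for col in range(width):
                diff = pred[c][r][col] - gt[c][r][col]
                total += weights[r] * diff * diff
    return total / (channels * width * height)


ones = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(3)]   # 3 x 2 x 2 toy map
zeros = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]
loss = cos_weighted_mse(ones, zeros)
```

Rows close to the poles receive weights near 0 and rows close to the equator receive weights near 1, so a unit error near the top or bottom of the map counts less than the same error at the horizon.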

As for the SH rendering loss, Cheng et al. [19] randomly selected a few objects from several prepared objects for rendering during training. However, rendering losses defined on specific objects are not widely representative. In fact, every illumination value on a rendered object comes from the “irradiance environment map” [50], which preserves the irradiance for all orientations, while the rendering result of a specific object only includes a subset of the irradiance. The former is a superset (or the complete set) of the latter. Therefore, constraining the loss on the irradiance environment map indirectly constrains the loss of all potential objects to be rendered.

Given the predicted SH coefficients $\widehat{A}$ and the ground truth $A$, the SH rendering loss $L_{\text{SH-rd}}$ is given by Equation 11, where the function shRender$(\cdot)$ computes the irradiance environment map from the coefficients (see [50, 19] for details).

$\frac{1}{3wh} \sum_{c,\rho} \cos\theta \left( \text{shRender}(\widehat{A}) - \text{shRender}(A) \right)^{2}$ (11)
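
One plausible form of shRender$(\cdot)$ is sketched below for a single color channel. It is an assumption rather than the authors' code: it uses the widely known closed form for Lambertian irradiance from 2nd-order SH, in which each band is scaled by a constant ($\pi$, $2\pi/3$, $\pi/4$) before evaluating the real SH basis. The coefficient ordering, the z-up axis convention, the equirectangular pixel mapping, and the names `sh_basis` and `sh_render` are all illustrative choices.

```python
import math

# Per-band scaling for bands 0, 1, 2 (assumed cosine-lobe convolution factors).
BAND_FACTORS = (math.pi, 2.0 * math.pi / 3.0, math.pi / 4.0)
BAND_OF = (0, 1, 1, 1, 2, 2, 2, 2, 2)  # band index of each of the 9 coefficients


def sh_basis(x, y, z):
    """Real SH basis up to order 2 for a unit direction (x, y, z)."""
    return [
        0.282095,
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ]


def sh_render(coeffs, height, width):
    """Irradiance map (height x width) from 9 SH coefficients of one channel."""
    image = []
    for r in range(height):
        elev = math.pi / 2 - (r + 0.5) * math.pi / height
        row = []
        for col in range(width):
            azim = (col + 0.5) * 2.0 * math.pi / width
            x = math.cos(elev) * math.cos(azim)
            y = math.cos(elev) * math.sin(azim)
            z = math.sin(elev)
            basis = sh_basis(x, y, z)
            row.append(sum(
                BAND_FACTORS[BAND_OF[i]] * coeffs[i] * basis[i] for i in range(9)
            ))
        image.append(row)
    return image


uniform = sh_render([1.0] + [0.0] * 8, height=4, width=8)
vertical = sh_render([0.0, 0.0, 1.0] + [0.0] * 6, height=4, width=8)
```

With only the band-0 coefficient set, every orientation receives the same irradiance; with only the z-aligned band-1 coefficient set, the map is positive in the upper half and negative in the lower half. The rendering loss of Equation 11 would then compare two such maps pixel by pixel.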

4 Experiments

4.1 Dataset

To train and evaluate MixLight, the publicly available Laval Indoor HDR Dataset [8] is utilized. This dataset comprises 2100 HDR illumination maps. Following the data preprocessing approach of [8, 10], each HDR panorama is cropped and warped eight times, resulting in a total of 19,556 training pairs. Similar to the method in [10], 200 training pairs are randomly selected as the test set, while the remaining 19,356 pairs are used as the training set.

Most prior works assess their methods only on the Laval Indoor HDR Dataset. This paper introduces an additional Web Dataset for testing. The images in the Web Dataset are collected from the internet and have no intersection with the Laval Indoor HDR Dataset, ensuring that previous SOTA methods have not seen them during training. This provides a fairer evaluation. Moreover, comparing a method’s prediction errors on the Web Dataset and on the Laval Indoor HDR Dataset allows its generalization performance to be assessed. Following the same cropping and warping process, the Web Dataset comprises a total of 200 samples for testing. There are two reasons for performing additional tests on the Web Dataset:

  • First, generative methods based on GANs have been criticized for their poor generalization performance [16]. When the training set and test set are split from one limited-scale dataset (e.g., the Laval Indoor HDR Dataset) with a specific scene style, these generative methods can memorize that scene style thanks to their large number of network parameters, and thus generate delicate illumination maps at test time and outperform other methods. However, they may overfit to the style of that specific dataset and perform worse once the test samples are randomly collected from in-the-wild scenes with significantly different styles. Validating this claim requires a dataset with a random style.

  • Second, since some SOTA methods are not open-sourced and all compared methods are trained on the Laval Indoor HDR Dataset, it is difficult to ensure that their training sets do not overlap with the test set used in this paper. Constructing an additional Web Dataset ensures that the test data do not overlap with the training set of any method, thus providing a fairer evaluation.

4.2 Training Settings

Following [14, 8, 10], the size of the illumination map is 128×256, and the size of the limited FOV input image is 192×256. In line with the best practice outlined in [10], a 5% threshold is used to separate ambient light from light sources. Moreover, the number of anchor points $N$ is set to 128, and the angular size $s$ is set to 0.0025, which yields a sharp SG suited to representing sharp light sources. Regarding the choice of $N$, in the experiment conducted by Zhan et al. [10], the prediction performance drops slightly when 64 instead of 128 anchor points are used, while increasing the number of anchor points to 196 does not noticeably improve the performance. They conjecture that the larger number of parameters with 196 anchor points negatively affects the regression accuracy, and choose $N = 128$ as the best setting.

Additionally, [50] shows that 2nd-order SH is sufficient for representing low-frequency ambient light (with an error below 1% when rendering Lambertian objects). Therefore, we use 2nd-order SH to represent the ambient light, setting $K$ to 2. We found that reducing the number of SH parameters degrades the representation capability, while increasing it makes the regression task more challenging and leads to unstable training.

As for SLSparsemax, the 6 closest points are considered to form the neighborhood of an SG light source. We also tried multiplying or dividing the components in the brackets of the credibility by 10 and found the current configuration to be the best. The proposed MixLight is implemented in PyTorch. The Adam optimizer with learning rate decay (initial learning rate of 0.0001) is used during training. The model is trained for 130 epochs with a batch size of 32 on one NVIDIA GeForce RTX 3090 GPU with 24GB of memory.
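
A neighborhood of the 6 closest points can be built from the anchor directions alone. The sketch below assumes that anchors are given as unit vectors and that closeness is measured by angular distance (largest dot product); how the anchor points are placed on the sphere is not covered here, and `nearest_neighbors` is a hypothetical helper.

```python
def nearest_neighbors(directions, k=6):
    """For each unit vector, return the indices of its k angularly closest others."""
    result = []
    for i, a in enumerate(directions):
        scored = []
        for j, b in enumerate(directions):
            if j == i:
                continue
            dot = a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
            scored.append((-dot, j))  # larger dot product = smaller angle
        scored.sort()
        result.append([j for _, j in scored[:k]])
    return result


# Toy example: the six axis directions, with 4 neighbors each.
axes = [
    (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0), (0.0, -1.0, 0.0),
    (0.0, 0.0, 1.0), (0.0, 0.0, -1.0),
]
axis_neighbors = nearest_neighbors(axes, k=4)
```

In the toy example, each axis direction is assigned the four perpendicular directions as neighbors and excludes its antipode. With 128 anchors, `k=6` would match the setting described above.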

Figure 4: The scenes used in evaluations consist of three spheres with different materials: (a) gray diffuse, (b) matte silver, and (c) mirror silver.

4.3 Evaluation Method and Metric

Like several previous works [10, 9, 11, 14], MixLight undergoes both quantitative and qualitative evaluation.

For quantitative evaluation, all the compared methods estimate the illumination of samples from both the test set of the Laval Indoor HDR Dataset and the Web Dataset. The predicted illumination is then used to render three spheres, as shown in Fig. 4, using the Blender rendering engine [51]. The rendered results are then assessed against the GT using commonly used evaluation metrics, including RMSE and si-RMSE [45], which focus on the estimated light intensity and light directions (or shadings), respectively.
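
The two metrics can be sketched on flattened pixel lists as follows. RMSE is standard; for si-RMSE, the sketch assumes the common scale-invariant formulation in which the prediction is first multiplied by the least-squares optimal scale, since the exact definition from [45] is not restated in this section. Both function names are illustrative.

```python
import math


def rmse(pred, gt):
    """Root mean square error between two equally long lists of pixel values."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def si_rmse(pred, gt):
    """RMSE after rescaling pred by the scale that best matches gt (assumed form)."""
    denom = sum(p * p for p in pred)
    alpha = sum(p * g for p, g in zip(pred, gt)) / denom if denom > 0 else 0.0
    return rmse([alpha * p for p in pred], gt)


pred = [1.0, 2.0, 3.0]
gt = [2.0, 4.0, 6.0]   # same shading pattern, twice the intensity
plain_error = rmse(pred, gt)
scale_free_error = si_rmse(pred, gt)
```

In this example the prediction has the right pattern but the wrong overall intensity, so RMSE is large while si-RMSE vanishes, which illustrates why the two metrics are reported together.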

In the qualitative evaluation, the predicted results are first visualized as illumination maps. Furthermore, 3D scenes similar to those used by [10, 9], each containing a virtual object and a background image, are rendered. The accuracy of illumination estimation is then reflected by the visual consistency of the rendered object with the scene.

Note that the actual predictions are in HDR form, but only LDR images can be displayed in the paper. However, LDR results do not show the true luminance estimates of each method because the tone mapping process adjusts the dynamic range of all images to the best viewing luminance. Furthermore, showing only the LDR light map would psychologically bias the observer toward the generative approach, which is unfair because the generative approach uses a higher-dimensional representation. For this reason, the three spheres rendered during quantitative evaluation are shown along with the LDR illumination maps, which reflects the true luminance and reduces potential unfairness.

TABLE I: Comparison of MixLight with several SOTA lighting estimation methods on two datasets. The evaluation metrics include the widely used RMSE, si-RMSE. D, S, M denote a diffuse, a matte silver and a mirror material of the rendered objects, respectively.

| Dataset | Metric | StyleLight D | StyleLight S | StyleLight M | EMLight D | EMLight S | EMLight M | Gardner19 D | Gardner19 S | Gardner19 M | MixLight D | MixLight S | MixLight M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Laval Indoor HDR Dataset | RMSE ↓ | 0.181 | 0.218 | 0.207 | 0.202 | 0.274 | 0.275 | 0.170 | 0.213 | 0.207 | 0.095 | 0.185 | 0.199 |
| Laval Indoor HDR Dataset | si-RMSE ↓ | 0.055 | 0.147 | 0.155 | 0.065 | 0.147 | 0.160 | 0.048 | 0.143 | 0.151 | 0.039 | 0.133 | 0.144 |
| Web Dataset | RMSE ↓ | 0.758 | 0.682 | 0.609 | 0.511 | 0.443 | 0.435 | 0.522 | 0.492 | 0.455 | 0.505 | 0.411 | 0.375 |
| Web Dataset | si-RMSE ↓ | 0.155 | 0.326 | 0.354 | 0.184 | 0.353 | 0.397 | 0.168 | 0.324 | 0.331 | 0.134 | 0.285 | 0.301 |
Figure 5: Visual comparison of the predicted results. The first column shows the limited FOV images that serve as the input for all illumination estimation methods, followed by the predicted results visualized as illumination maps. Note that the first two rows are test samples from the Laval Indoor HDR Dataset, while the last two rows are from the Web Dataset.

4.4 Quantitative Evaluation

In this section, MixLight is compared with three SOTA methods: Gardner19 [14], which solely relies on SG, StyleLight [9], known for direct illumination map generation, and EMLight [10], which utilizes SG to guide illumination map generation. The quantitative evaluation results on both the Laval Indoor HDR Dataset and the Web Dataset are presented in Table I.

MixLight outperforms all compared SOTA methods across all evaluation metrics and sphere materials. This is primarily attributed to the complementary advantages of SH and SG in the frequency domain, which allow for more accurate predictions. Gardner19 simplifies the modeling of ambient light, leading to inaccuracies in predicting low-frequency illumination and subsequently affecting the RMSE metric. Additionally, fixing the number of SG light sources to 3 in Gardner19 oversimplifies the representation of high-frequency light sources, resulting in errors in the si-RMSE metric. StyleLight exhibits notable degradation in the RMSE metric when moving from the Laval Indoor HDR Dataset to the Web Dataset, suggesting poor generalization performance due to overfitting to the limited-scale training set. EMLight also suffers from overfitting, but to a lesser extent, which is attributed to the guidance of the low-dimensional SG model. However, EMLight does not consider the sparsity of the light sources, leading to inaccurate predictions of high-frequency illumination.

Unlike the SG component in Gardner19 and EMLight, the sparse SG in MixLight effectively models sparse real-life light sources thanks to SLSparsemax, resulting in more precise high-frequency illumination predictions. In contrast to pure generation methods such as StyleLight, MixLight, as a low-dimensional parametric model, exhibits less performance degradation from the Laval Indoor HDR Dataset to the Web Dataset, highlighting its superior generalization capabilities.
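
The degradation between datasets can be made concrete with a small calculation on the RMSE values of Table I. The measure used here, the difference between the mean RMSE over the three materials on the Web Dataset and on the Laval Indoor HDR Dataset, is one illustrative choice and not a metric defined in the paper.

```python
# RMSE values (D, S, M) copied from Table I.
RMSE = {
    "StyleLight": {"laval": (0.181, 0.218, 0.207), "web": (0.758, 0.682, 0.609)},
    "EMLight": {"laval": (0.202, 0.274, 0.275), "web": (0.511, 0.443, 0.435)},
    "Gardner19": {"laval": (0.170, 0.213, 0.207), "web": (0.522, 0.492, 0.455)},
    "MixLight": {"laval": (0.095, 0.185, 0.199), "web": (0.505, 0.411, 0.375)},
}


def mean(values):
    return sum(values) / len(values)


def rmse_gap(method):
    """Increase in mean RMSE when moving from the Laval test set to the Web Dataset."""
    scores = RMSE[method]
    return mean(scores["web"]) - mean(scores["laval"])


gaps = {name: rmse_gap(name) for name in RMSE}
for name, gap in gaps.items():
    print(f"{name}: +{gap:.3f}")
```

Under this measure, StyleLight shows the largest gap among the four methods, in line with the overfitting discussed above.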

Figure 6: Visual comparisons of rendered results. In the first column are background images of 3D scenes, which act as the input for all illumination estimation methods. The predicted illumination maps (in the top-left red box of each rendered image) are then employed to render virtual objects.
Figure 7: Failure cases in rendering results.

4.5 Qualitative Evaluation

This section visualizes the predicted results and the rendering results. Visualizing the predicted results allows direct observation of the light source’s position and the ambient light, while visualized rendering results depict details (e.g., hard shadows) that are challenging to capture through quantitative evaluation but significantly impact realism. The SOTA methods compared are consistent with those used in the quantitative evaluation. Finally, some failure cases in rendering results will be shown and analyzed.

Comparison of predicted results. Fig. 5 shows the predicted results of various methods, along with the GT illumination maps. The first two rows are test samples from the Laval Indoor HDR Dataset, and the last two rows are from the Web Dataset. In all four samples, the light source positions and ambient color variance predicted by MixLight are close to the GT. In contrast, Gardner19 oversimplifies the representation of light sources and ambient light by fixing 3 light sources and one ambient light color, leading to a poor fit for the strip lights in the first sample and an inability to represent freely varying ambient light as MixLight does. StyleLight and EMLight tend to overfit the indoor style of the Laval Indoor HDR Dataset: they perform well on the first two samples but predict inaccurately when the samples are collected from the internet with a significantly different style, as observed in the third and fourth rows. For instance, the pink walls in the third sample and the abandoned house in the fourth sample are not present in the Laval Indoor HDR Dataset. In these two cases, StyleLight and EMLight still generate ceilings, yellow walls, and exquisite doors and windows consistent with the Laval Indoor HDR Dataset style, which is notably different from the GT. Additionally, EMLight does not constrain the sparsity of SG, resulting in numerous unnecessary light sources in the generated illumination maps.

It is important to note that the comparison of visualized illumination maps serves as a supplementary tool for understanding the limitations of certain methods as identified in the quantitative evaluation section (e.g., Gardner19 oversimplifies the light source representation). The predicted results of low-dimensional parametric methods seem to be smoother. However, the perceived smoothness or sharpness of the LDR illumination map does not inherently reflect the accuracy of the prediction. This is due to two primary factors: (1) The angular information of light is convolved and highly blurred during rendering. Thus the perceived smoothness or sharpness might not be important because it will inevitably be blurred after rendering. (2) Typically, the subjective quality in LDR space does not necessarily align with the results in HDR. In the HDR space, the ambient light predicted by all methods appears smooth (compared with the light source).

Comparison of rendered results. Fig. 6 shows the rendered images alongside the corresponding predicted results used for rendering. It can be observed that MixLight predicts accurately in terms of light source position and variation of ambient light, thus producing realistic rendering results where the color, brightness, and shadows of the virtual objects are close to the GT. Conversely, Gardner19 oversimplifies the modeling of ambient light and light sources. The use of a single color fails to accurately represent the ambient light, resulting in inaccurate color and brightness of virtual objects after rendering, especially in the first sample. Additionally, the limited number of three light sources struggles to represent the variable light sources in the real world, leading to overly harsh shadows in the rendering results, particularly noticeable in the last sample. StyleLight tends to overfit the Laval Indoor HDR Dataset, and its brightness predictions are generally lower than the ground truth, resulting in excessively low brightness in the rendering results. In contrast, EMLight employs the predicted parameters to guide the generation of illumination maps, thereby mitigating the overfitting issue to some extent. This approach demonstrates improved performance in the second, third, and fourth samples. However, the generative model trained on a small-scale dataset still exhibits instability, leading to significant distortion of the illumination maps in the first sample and resulting in orange-colored objects in the rendering.

Failure cases in rendering results. Illumination estimation remains a difficult problem due to its strongly ill-posed nature, so it is common for a proposed method to have limited effectiveness on some samples. MixLight outperforms all compared SOTA methods overall but still fails on individual samples, as illustrated in Fig. 7. There are two main types of failure: inaccurate brightness prediction and inaccurate location prediction of light sources. First, the brightness of the light source has greater uncertainty than the other parameters, and its loss is more difficult to converge during training. For the chair and pillow in the first two examples in Fig. 7, the predicted brightness is too weak to render clear shadows even though the light source position is very accurate. For the bag in the third example, the predicted brightness is too high, leading to harder shadows. Second, the position of light sources could be predicted better if more clues (e.g., shading and shadows) existed in the input. Without such clues, even humans cannot guess accurate positions. The third example has a huge desktop with neither shading nor clear shadows, where the shadow indicates light from the top but is insufficient to help further locate its azimuth.

TABLE II: Ablation study of choosing different representation models for ambient light and light sources. There are four combinations in total, namely using SH and SG to represent ambient light and light sources (MixLight), using SG for both ambient light and light sources (SGG), using SH for both ambient light and light sources (SHH), and using SG for ambient light and SH for light sources (SGH).

| Dataset | Metric | MixLight D | MixLight S | MixLight M | SGG D | SGG S | SGG M | SHH D | SHH S | SHH M | SGH D | SGH S | SGH M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Laval Indoor HDR Dataset | RMSE ↓ | 0.095 | 0.185 | 0.199 | 0.121 | 0.223 | 0.230 | 0.097 | 0.210 | 0.232 | 0.105 | 0.212 | 0.233 |
| Laval Indoor HDR Dataset | si-RMSE ↓ | 0.039 | 0.133 | 0.144 | 0.067 | 0.179 | 0.185 | 0.053 | 0.158 | 0.165 | 0.057 | 0.159 | 0.165 |
| Web Dataset | RMSE ↓ | 0.505 | 0.411 | 0.375 | 0.585 | 0.539 | 0.507 | 0.525 | 0.463 | 0.425 | 0.506 | 0.448 | 0.415 |
| Web Dataset | si-RMSE ↓ | 0.133 | 0.285 | 0.301 | 0.226 | 0.445 | 0.448 | 0.213 | 0.403 | 0.396 | 0.200 | 0.391 | 0.387 |
TABLE III: Ablation study of the proposed SLSparsemax used in MixLight. The MixLight variants with SLSparsemax (SLS), original sparsemax (S), and No Sparsemax (NS) undergo quantitative testing on two datasets. Best scores are presented in bold, while second-best scores are displayed in italics.

| Dataset | Metric | NS D | NS S | NS M | S D | S S | S M | SLS D | SLS S | SLS M |
|---|---|---|---|---|---|---|---|---|---|---|
| Laval Indoor HDR Dataset | RMSE ↓ | 0.111 | 0.198 | 0.218 | 0.102 | 0.199 | 0.209 | 0.095 | 0.185 | 0.199 |
| Laval Indoor HDR Dataset | si-RMSE ↓ | 0.052 | 0.141 | 0.151 | 0.043 | 0.141 | 0.150 | 0.039 | 0.133 | 0.144 |
| Web Dataset | RMSE ↓ | 0.527 | 0.436 | 0.396 | 0.510 | 0.432 | 0.384 | 0.505 | 0.411 | 0.375 |
| Web Dataset | si-RMSE ↓ | 0.159 | 0.307 | 0.326 | 0.144 | 0.305 | 0.314 | 0.134 | 0.285 | 0.301 |
Figure 8: Visual comparison of the rendered results from MixLight with No Sparsemax (NS), original Sparsemax method (S), or SLSparsemax (SLS). Panels: (a) GT, (b) NS, (c) S, (d) SLS. In the top-left red box of each example is the input limited FOV image, while in the top-right yellow box is the predicted illumination map.

4.6 Ablation Study

MixLight uses SH to represent ambient light and SG to represent light sources. Table II shows the ablation study of choosing different representation models for ambient light and light sources, using the same datasets, metrics, and comparison procedure as the quantitative evaluation. The other three model combinations are SG for both ambient light and light sources (SGG), SH for both ambient light and light sources (SHH), and SG for ambient light but SH for light sources (SGH). Note that these three combinations have the same or a similar parameter scale as MixLight; more details about the parameter settings are given in the supplementary file. As shown in Table II, MixLight surpasses the other three combinations on all test sets, all sphere materials, and all metrics, mainly benefiting from the ability of SH and SG to accurately predict low-frequency and high-frequency information, respectively.

Another experiment quantitatively compares the MixLight variants with SLSparsemax (SLS), original sparsemax (S), and No Sparsemax (NS), as reported in Table III. Best scores are presented in bold, while second-best scores are displayed in italics. Across the 12 metric results on the two test sets, SLS achieves the best score in every case, while S obtains more second-best scores than NS. This decrease in error from NS to S to SLS can be observed more intuitively in the visualized rendering results and respective error maps in Fig. 8, where S and SLS achieve much sparser light source predictions than NS, better mimic the sparse light sources in the GT, and consequently produce pronounced shadows. SLSparsemax further generates locally clustered yet globally sparse light sources that closely resemble reality, thus rendering more realistic shadows and highlights and obtaining better scores in Table III.

5 Limitations and Future Works

This paper explores the feasibility of achieving a more accurate illumination representation by combining existing representation models. By observing the regularities of light sources in indoor scene datasets, we propose the sparsity assumption for light sources and design the SLSparsemax module based on it. However, there are cases in which the sparsity assumption does not apply. Illumination estimation for outdoor scenes is also not the focus of this paper for the time being. In addition, predicting spatially-varying illumination and predicting illumination with more details are directions that can be further explored in the future.

Assumption of sparsity. In this paper, an assumption of sparsity is made: in a scene, there are usually only a few light sources, and the number of light sources is variable. This assumption is made as the target dataset (Laval Indoor) presents such sparsity characteristics. It should be noted that sometimes this assumption does not hold (e.g., multiple small light sources in the laboratory). Despite this limitation, making this assumption is a step forward from previous assumptions explicitly or implicitly used by previous SG methods. For example, Gardner19 [14] uses only three SG kernels, and [13] uses only one SG kernel to represent light sources. In the future, SLSparsemax could be introduced to more SG works to model sparse and variable numbers of SG light sources. A more general light source assumption can also be proposed based on the current one.

Outdoor scenes. Outdoor scenes are not considered in this paper. However, both SH and SG can be extended to outdoor scenes, as in [19, 12]. Therefore, as a combined SH and SG method, MixLight can also be extended to predict outdoor illumination. Additionally, outdoor light sources (typically the sun) are fewer in number but have higher energy than indoor light sources. This difference has previously hindered the formation of a unified indoor-outdoor light source model. The proposed SLSparsemax, which adaptively produces a variable number of light sources, is naturally suited to tackling this bottleneck. This can be further investigated in future work.

Spatially-varying illumination. MixLight is currently not a spatially-varying illumination representation. Since depth information for scenes can be obtained from the Laval Indoor HDR Dataset, this problem could be addressed by generating spatially-varying training pairs during data pre-processing, similar to [17], or by regressing the depth of light sources and lifting them from the 2D sphere surface to 3D, as in [14].

Generalization performance. Recovering photo-realistic scene details from a single limited-FOV image that lacks mirror-like objects remains a challenge, as previous GAN-based approaches have shown poor generalization performance. Using multiple input images can provide more scene details and may be a possible solution to this challenge. Building a large-scale real HDR dataset would also help improve generalization performance.

6 Conclusions

This paper introduces MixLight, a joint illumination representation model that leverages the complementary advantages of SH and SG in the frequency domain. SLSparsemax is designed to help MixLight predict light sources that are as sparse as those in real scenes. Quantitative and qualitative evaluations demonstrate that MixLight accurately predicts illumination from single limited-FOV images of indoor scenes. Further experiments on a Web Dataset show that parametric methods generalize better than generative methods.

Appendix A SH-SG comparison experiment

In the Introduction section of the main manuscript, an experiment (Fig. 1) is used to illustrate the different advantages of SH and SG in representing ambient light and light sources. The training and testing sets are both from the Laval Indoor HDR Dataset [8]. The dataset partition is consistent with the experimental settings in the main manuscript. The test set consists of 200 randomly selected samples. For each sample, the ambient light (or the light sources) is predicted and used to render three balls, so the 200 samples yield 600 rendered results. For each rendered result, the error is calculated as $\sqrt{\alpha\cdot\beta}$, where $\alpha$ is the Root Mean Square Error (RMSE) and $\beta$ is the si-RMSE [45]. Finally, the 600 error values are presented as a violin plot (e.g., Fig. 1a in the main manuscript), which shows the distribution of error values on the test set when either SH or SG is used to predict ambient light or light sources.
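The combined error above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: it assumes that si-RMSE is the RMSE measured after rescaling the prediction by the least-squares-optimal scalar, and the function names `rmse`, `si_rmse`, and `combined_error` are hypothetical.

```python
import numpy as np


def rmse(pred, gt):
    """Root mean square error over all pixels and channels."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))


def si_rmse(pred, gt):
    """Scale-invariant RMSE (assumed definition).

    The prediction is multiplied by the scalar that minimises the squared
    error against the ground truth before the RMSE is taken.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    denom = np.sum(pred * pred)
    scale = np.sum(pred * gt) / denom if denom > 0 else 0.0
    return rmse(scale * pred, gt)


def combined_error(pred, gt):
    """Geometric mean of RMSE (alpha) and si-RMSE (beta): sqrt(alpha * beta)."""
    return float(np.sqrt(rmse(pred, gt) * si_rmse(pred, gt)))


# Toy example: a rendering that is exactly twice as bright as the reference
# has a non-zero RMSE but a (numerically) zero si-RMSE.
gt = np.array([[0.2, 0.4], [0.6, 0.8]])
pred = 2.0 * gt
print(rmse(pred, gt), si_rmse(pred, gt), combined_error(pred, gt))
```

Taking the geometric mean means that a prediction with the right relative shading but the wrong overall brightness is penalised less than one that is wrong in both respects.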

To compare SH and SG fairly on both the ambient light prediction task and the light source prediction task, a different version of SG is designed for each task.

In the ambient light estimation task, a 9-kernel smooth SG is designed, with its angular size $s$ set to 0.2423, which ensures that the 9 SG functions cover the sphere as comprehensively as possible while maintaining a decay of roughly 0.5 at the boundaries. This smooth version of SG, along with the method for setting $s$, follows the approach in [15]. The only difference is that the number of SG kernels is reduced to 9 in this experiment to keep it focused on capturing low-frequency ambient light. This also keeps the number of SG parameters at 27 (over the three RGB channels), matching the 2nd-order SH that serves as its competitor in the ambient light prediction task.

In the light source estimation task, a sharp version of SG is used, which is essentially the spherical Gaussian distribution proposed in EMLight. The angular size $s$ is set to 0.0025 to keep the SG kernel function as sharp as the light sources. The sharp 128-kernel SG is then compared with a 6th-order SH; the two comprise 132 and 147 parameters, respectively. Although the parameter counts of SG and SH for the light source prediction task are not identical, the 6th-order SH is the closest in parameter scale to the 128-kernel sharp SG.
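The SH parameter counts quoted in this appendix follow from the number of SH basis functions up to a given order. The sketch below assumes that an order-$n$ SH uses all bands 0 through $n$, giving $(n+1)^2$ coefficients per colour channel; the function name `sh_param_count` is illustrative and does not come from the paper. The 132-parameter figure for the sharp SG is taken from the text as given and is not derived here.

```python
def sh_param_count(order: int, channels: int = 3) -> int:
    """Number of SH coefficients for bands 0..order, times colour channels."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2 * channels


# 2nd-order SH used for ambient light: 9 coefficients x 3 channels.
print(sh_param_count(2))  # 27

# 6th-order SH used as the light source competitor: 49 coefficients x 3 channels.
print(sh_param_count(6))  # 147
```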

Appendix B Original Sparsemax

Sparsemax [20] operates under the assumption that although the network’s output distribution values are fundamentally reliable after numerous training iterations, they are still not sparse enough. To address this, it sorts all elements within the input distribution value in descending order and retains only the leading large values, effectively filtering out small values that are close to zero. Additionally, to minimize alterations to the somewhat trusted input, it only applies an overall offset to the distribution value to preserve the distinction between elements, while striving to retain as many elements as possible to minimize changes in quantity.
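The sort, threshold, and offset steps described above can be sketched as follows. This is a minimal NumPy illustration of the original sparsemax of [20] for a single 1-D input vector, not of the SLSparsemax module proposed in this paper; the variable names are chosen for illustration.

```python
import numpy as np


def sparsemax(z):
    """Original sparsemax: Euclidean projection of z onto the probability simplex."""
    z = np.asarray(z, dtype=np.float64)

    # Sort the input in descending order.
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)

    # Keep the largest k whose k-th sorted value still passes the support test.
    support = 1.0 + k * z_sorted > cumsum
    k_z = k[support][-1]

    # A single shared offset tau is subtracted from every element, so the
    # differences between retained elements are preserved.
    tau = (cumsum[k_z - 1] - 1.0) / k_z

    # Elements at or below tau are clipped to exactly zero.
    return np.maximum(z - tau, 0.0)


p = sparsemax([0.1, 1.1, 0.9])
print(p, p.sum())  # approximately [0.0, 0.6, 0.4], summing to 1
```

In this example the smallest input is set exactly to zero, while the gap of 0.2 between the two retained values is unchanged, which is the "overall offset" behaviour described above.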

Appendix C Ablation Study details

Comparison fairness is ensured when comparing MixLight with 3 other combinations in the ablation study.

In the first experiment of the ablation study, we replace the 2nd-order SH of MixLight with a 9-kernel SG (the smooth version of SG) to construct SGG. We replace the 128-kernel SG of MixLight with a 6th-order SH to construct SHH. We apply both replacements to construct SGH. As a result, MixLight, SGG, SHH, and SGH have the same or similar parameter counts: (27+132), (27+132), (27+147), and (27+147), respectively.

Appendix D More results

More rendered results are given in Fig. 9 to supplement Fig. 6 of the main manuscript.

In Fig. 5 of the main manuscript, the predicted results are visualized as illumination maps to help analyze the pros and cons of the various methods observed in the quantitative evaluation, where the rendered results are the final target that is evaluated quantitatively. To aid this assessment, more direct predictions are visualized here in the form of illumination maps. Again, the results are split into two test sets to compare how different methods behave on test sets ranging from familiar (Fig. 10) to unfamiliar styles (Fig. 11).

Figure 9: More visual comparisons of rendered results. The first column shows the background images of 3D scenes, which act as the input for all illumination estimation methods. The predicted illumination maps (in the top-right red box of each rendered image) are then employed to render virtual objects.
Figure 10: More visual comparisons of the predicted results on the Laval Indoor HDR Dataset [8]. These samples are all randomly sampled from the test set of the Laval Indoor HDR Dataset, which is the most frequently used dataset in the illumination estimation community. MixLight and the three other compared SOTA methods are all trained on the training set of the Laval Indoor HDR Dataset. Therefore, all of these illumination prediction networks are familiar with the style of this dataset and obtain better scores when tested on its test set (see Table 1 in the main manuscript).
Figure 11: More visual comparisons of the predicted results on the Web Dataset. These samples are all randomly sampled from the test set of the Web Dataset. The Web Dataset is collected from the internet, has no intersection with the Laval Indoor HDR Dataset, and thus has diverse styles that none of the compared methods have seen. When tested on this unfamiliar “in the wild” test set, the performance of all methods degrades to varying degrees, as can be seen in Table 1 of the main manuscript. The degradation is most severe for the generative approaches, as they always reproduce the style of the Laval Indoor HDR Dataset on this new test set. Specifically, more failed predictions emerge for EMLight, such as the fifth and seventh samples.

References

  • [1] P. Debevec, “Rendering synthetic objects into real scenes: bridging traditional and image-based graphics with global illumination and high dynamic range photography,” in Proceedings of the 25th annual conference on Computer graphics and interactive techniques, 1998, pp. 189–198.
  • [2] G. Li, C. Wu, C. Stoll, Y. Liu, K. Varanasi, Q. Dai, and C. Theobalt, “Capturing relightable human performances under general uncontrolled illumination,” Computer Graphics Forum, vol. 32, no. 2pt3, pp. 275–284, 2013.
  • [3] S. B. Knorr and D. Kurz, “Real-time illumination estimation from faces for coherent rendering,” in 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR).   IEEE, 2014, pp. 113–122.
  • [4] S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, T. Tuytelaars, and L. Van Gool, “What is around the camera?” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5170–5178.
  • [5] J. T. Barron and J. Malik, “Intrinsic scene properties from a single rgb-d image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 17–24.
  • [6] L. Gruber, T. Richter-Trummer, and D. Schmalstieg, “Real-time photometric registration from arbitrary geometry,” in 2012 IEEE international symposium on mixed and augmented reality (ISMAR).   IEEE, 2012, pp. 119–128.
  • [7] L. Bin, X. Kun, and R. R. Martin, “Static scene illumination estimation from videos with applications,” Journal of Computer Science and Technology, vol. 32, no. 3, pp. 430–442, 2017.
  • [8] M.-A. Gardner, K. Sunkavalli, E. Yumer, X. Shen, E. Gambaretto, C. Gagné, and J.-F. Lalonde, “Learning to predict indoor illumination from a single image,” ACM Transactions on Graphics, vol. 36, no. 6, pp. 1–14, 2017.
  • [9] G. Wang, Y. Yang, C. C. Loy, and Z. Liu, “Stylelight: Hdr panorama generation for lighting estimation and editing,” in European Conference on Computer Vision, 2022, pp. 477–492.
  • [10] F. Zhan, C. Zhang, Y. Yu, Y. Chang, S. Lu, F. Ma, and X. Xie, “Emlight: Lighting estimation via spherical distribution approximation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 3287–3295.
  • [11] F. Zhan, Y. Yu, C. Zhang, R. Wu, W. Hu, S. Lu, F. Ma, X. Xie, and L. Shao, “Gmlight: Lighting estimation via geometric distribution approximation,” IEEE Transactions on Image Processing, vol. 31, pp. 2268–2278, 2022.
  • [12] M. R. K. Dastjerdi, J. Eisenmann, Y. Hold-Geoffroy, and J.-F. Lalonde, “Everlight: Indoor-outdoor editable hdr lighting estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7420–7429.
  • [13] H. Weber, M. Garon, and J.-F. Lalonde, “Editable indoor lighting estimation,” in European Conference on Computer Vision.   Springer, 2022, pp. 677–692.
  • [14] M.-A. Gardner, Y. Hold-Geoffroy, K. Sunkavalli, C. Gagné, and J.-F. Lalonde, “Deep parametric indoor lighting estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7175–7183.
  • [15] M. Li, J. Guo, X. Cui, R. Pan, Y. Guo, C. Wang, P. Yu, and F. Pan, “Deep spherical gaussian illumination estimation for indoor scene,” in Proceedings of the ACM Multimedia Asia, 2019, pp. 1–6.
  • [16] F. Zhan, C. Zhang, W. Hu, S. Lu, F. Ma, X. Xie, and L. Shao, “Sparse needlets for lighting estimation with spherical transport loss,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12 830–12 839.
  • [17] M. Garon, K. Sunkavalli, S. Hadap, N. Carr, and J.-F. Lalonde, “Fast spatially-varying indoor lighting estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6908–6917.
  • [18] D. Xu, Z. Li, and Y. Zhang, “Real-time illumination estimation for mixed reality on mobile devices,” in 2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW).   IEEE, 2020, pp. 702–703.
  • [19] D. Cheng, J. Shi, Y. Chen, X. Deng, and X. Zhang, “Learning scene illumination by pairwise photos from rear and front mobile cameras,” in Computer Graphics Forum, vol. 37, no. 7.   Wiley Online Library, 2018, pp. 213–221.
  • [20] A. Martins and R. Astudillo, “From softmax to sparsemax: A sparse model of attention and multi-label classification,” in International conference on machine learning.   PMLR, 2016, pp. 1614–1623.
  • [21] A. Laha, S. A. Chemmengath, P. Agrawal, M. Khapra, K. Sankaranarayanan, and H. G. Ramaswamy, “On controllable sparse alternatives to softmax,” Advances in neural information processing systems, vol. 31, 2018.
  • [22] Y. Mei, H. Zhang, X. Zhang, J. Zhang, Z. Shu, Y. Wang, Z. Wei, S. Yan, H. Jung, and V. M. Patel, “Lightpainter: interactive portrait relighting with freehand scribble,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 195–205.
  • [23] T. Nestmeyer, J.-F. Lalonde, I. Matthews, and A. Lehrmann, “Learning physics-guided face relighting under directional light,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5124–5133.
  • [24] R. Pandey, S. Orts-Escolano, C. Legendre, C. Haene, S. Bouaziz, C. Rhemann, P. E. Debevec, and S. R. Fanello, “Total relighting: learning to relight portraits for background replacement.” ACM Trans. Graph., vol. 40, no. 4, pp. 43–1, 2021.
  • [25] T. Sun, J. T. Barron, Y.-T. Tsai, Z. Xu, X. Yu, G. Fyffe, C. Rhemann, J. Busch, P. Debevec, and R. Ramamoorthi, “Single image portrait relighting,” ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1–12, 2019.
  • [26] Z. Wang, X. Yu, M. Lu, Q. Wang, C. Qian, and F. Xu, “Single image portrait relighting via explicit multiple reflectance channel modeling,” ACM Transactions on Graphics (TOG), vol. 39, no. 6, pp. 1–13, 2020.
  • [27] Y.-Y. Yeh, K. Nagano, S. Khamis, J. Kautz, M.-Y. Liu, and T.-C. Wang, “Learning to relight portrait images via a virtual light stage and synthetic-to-real adaptation,” ACM Transactions on Graphics (TOG), vol. 41, no. 6, pp. 1–21, 2022.
  • [28] L. Zhang, Q. Zhang, M. Wu, J. Yu, and L. Xu, “Neural video portrait relighting in real-time via consistency modeling,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 802–812.
  • [29] X. Zhang, S. Fanello, Y.-T. Tsai, T. Sun, T. Xue, R. Pandey, S. Orts-Escolano, P. Davidson, C. Rhemann, P. Debevec et al., “Neural light transport for relighting and view synthesis,” ACM Transactions on Graphics (TOG), vol. 40, no. 1, pp. 1–17, 2021.
  • [30] A. Meka, C. Haene, R. Pandey, M. Zollhöfer, S. Fanello, G. Fyffe, A. Kowdle, X. Yu, J. Busch, J. Dourgarian et al., “Deep reflectance fields: high-quality facial reflectance field inference from color gradient illumination,” ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1–12, 2019.
  • [31] H. Kim, M. Jang, W. Yoon, J. Lee, D. Na, and S. Woo, “Switchlight: Co-design of physics-driven architecture and pre-training framework for human portrait relighting,” arXiv preprint arXiv:2402.18848, 2024.
  • [32] H. Weber, D. Prévost, and J.-F. Lalonde, “Learning to estimate indoor lighting from 3d objects,” in 2018 International Conference on 3D Vision (3DV).   IEEE, 2018, pp. 199–207.
  • [33] H.-X. Yu, S. Agarwala, C. Herrmann, R. Szeliski, N. Snavely, J. Wu, and D. Sun, “Accidental light probes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 521–12 530.
  • [34] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image synthesis with spatially-adaptive normalization,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2337–2346.
  • [35] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8110–8119.
  • [36] H. Zhu, H. Yang, L. Guo, Y. Zhang, Y. Wang, M. Huang, M. Wu, Q. Shen, R. Yang, and X. Cao, “Facescape: 3d facial dataset and benchmark for single-view 3d face reconstruction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [37] Y. Feng, H. Feng, M. J. Black, and T. Bolkart, “Learning an animatable detailed 3d face model from in-the-wild images,” ACM Transactions on Graphics (ToG), vol. 40, no. 4, pp. 1–13, 2021.
  • [38] Z. Chen, A. Chen, G. Zhang, C. Wang, Y. Ji, K. N. Kutulakos, and J. Yu, “A neural rendering framework for free-viewpoint relighting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5599–5610.
  • [39] H. Zhou, S. Hadap, K. Sunkavalli, and D. W. Jacobs, “Deep single-image portrait relighting,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 7194–7202.
  • [40] D. Frolova, D. Simakov, and R. Basri, “Accuracy of spherical harmonic approximations for images of lambertian objects under far and near lighting,” in Computer Vision-ECCV 2004: 8th European Conference on Computer Vision, Prague, Czech Republic, May 11-14, 2004. Proceedings, Part I 8.   Springer, 2004, pp. 574–587.
  • [41] R. Ramamoorthi, “Analytic pca construction for theoretical analysis of lighting variability in images of a lambertian object,” IEEE transactions on pattern analysis and machine intelligence, vol. 24, no. 10, pp. 1322–1333, 2002.
  • [42] ——, “Modeling illumination variation with spherical harmonics,” Face Processing: Advanced Modeling Methods, pp. 385–424, 2006.
  • [43] H. Vogel, “A better way to construct the sunflower head,” Mathematical biosciences, vol. 44, no. 3-4, pp. 179–189, 1979.
  • [44] Q. Zhao, P. Tan, Q. Dai, L. Shen, E. Wu, and S. Lin, “A closed-form solution to retinex with nonlocal texture constraints,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 7, pp. 1437–1444, 2012.
  • [45] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Freeman, “Ground truth dataset and baseline evaluations for intrinsic image algorithms,” in 2009 IEEE 12th International Conference on Computer Vision, 2009, pp. 2335–2342.
  • [46] S. Sengupta, A. Kanazawa, C. D. Castillo, and D. W. Jacobs, “Sfsnet: Learning shape, reflectance and illuminance of faces ‘in the wild’,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6296–6305.
  • [47] A. Meka, M. Shafiei, M. Zollhöfer, C. Richardt, and C. Theobalt, “Real-time global illumination decomposition of videos,” ACM Transactions on Graphics (ToG), vol. 40, no. 3, pp. 1–16, 2021.
  • [48] R. Wang, Q. Zhang, C.-W. Fu, X. Shen, W.-S. Zheng, and J. Jia, “Underexposed photo enhancement using deep illumination estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6849–6857.
  • [49] V. Gkitsas, N. Zioulis, F. Alvarez, D. Zarpalas, and P. Daras, “Deep lighting environment map estimation from spherical panoramas,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 640–641.
  • [50] R. Ramamoorthi and P. Hanrahan, “An efficient representation for irradiance environment maps,” in Proceedings of the 28th annual conference on Computer graphics and interactive techniques, 2001, pp. 497–500.
  • [51] R. Hess, Blender foundations: The essential guide to learning blender 2.5.   Taylor & Francis, 2013.