A Closed-form Solution for Weight Optimization in Fully-connected Feed-forward Neural Networks

Slavisa Tomic, João Pedro Matos-Carvalho, and Marko Beko This work is funded by Fundação para a Ciência e Tecnologia under the project UIDB/50008/2020, UIDB/04111/2020, CEECINST/00147/2018/CP1498/CT0015 and 2021.04180.CEECIND, as well as Instituto Lusófono de Investigação e Desenvolvimento (ILIND) under Project COFAC/ILIND/COPELABS/1/2022.Slavisa Tomic is with Center of Technology and Systems (UNINOVA-CTS) and Associated Lab of Intelligent Systems (LASI), 2829-516 Caparica, Portugal and COPELABS, Lusófona University, Campo Grande 376, 1749-024 Lisbon, Portugal (email: [email protected]).João Pedro Matos-Carvalho is with Center of Technology and Systems (UNINOVA-CTS) and Associated Lab of Intelligent Systems (LASI), 2829-516 Caparica, Portugal and COPELABS, Lusófona University, Campo Grande 376, 1749-024 Lisbon, Portugal (email: [email protected]).Marko Beko is with Center of Technology and Systems (UNINOVA-CTS) and Associated Lab of Intelligent Systems (LASI), 2829-516 Caparica, Portugal, Instituto de Telecomunicações, Instituto Superior Técnico, University of Lisbon, Lisboa, Portugal, and COPELABS, Lusófona University, Campo Grande 376, 1749-024 Lisbon, Portugal (email: [email protected]).
Abstract

This work addresses weight optimization problem for fully-connected feed-forward neural networks. Unlike existing approaches that are based on back-propagation (BP) and chain rule gradient-based optimization (which implies iterative execution, potentially burdensome and time-consuming in some cases), the proposed approach offers the solution for weight optimization in closed-form by means of least squares (LS) methodology. In the case where the input-to-output mapping is injective, the new approach optimizes the weights in a back-propagating fashion in a single iteration by jointly optimizing a set of weights in each layer for each neuron. In the case where the input-to-output mapping is not injective (e.g., in classification problems), the proposed solution is easily adapted to obtain its final solution in a few iterations. An important advantage over the existing solutions is that these computations (for all neurons in a layer) are independent from each other; thus, they can be carried out in parallel to optimize all weights in a given layer simultaneously. Furthermore, its running time is deterministic in the sense that one can obtain the exact number of computations necessary to optimize the weights in all network layers (per iteration, in the case of non-injective mapping). Our simulation and experimental results show that the proposed scheme, BPLS, works well and is competitive with existing ones in terms of accuracy, but significantly surpasses them in terms of running time. To summarize, the new method is straightforward to implement, is competitive and computationally more efficient than the existing ones, and is well-tailored for parallel implementation.

Index Terms:
Back-propagation (BP), least squares (LS), neural networks, weight optimization.

I Introduction

The advent of powerful computer hardware and workstations in the first decade of the twenty-first century enabled extraordinary ascent of machine learning. Since then, the area of machine learning (as a part of artificial intelligence) has been constantly evolving and attracting research interest, especially since it has been successfully applied to various academic or industrial problems [1]. Extremely powerful processing capabilities of such machines are allowing them to even transcend performance of human beings in some cases [2]. Machine learning tools are particularly interesting for a wide range of practical problems in which it is difficult to establish accurate mathematical models [1], since their solution often relies on solving some particular realizations of the problem with reasonably good accuracy [3]. However, such problems could be solved more efficiently via machine learning techniques, where a set of sufficiently large samples (training data) are provided, based on which a machine could be able to learn and make decisions concerning a set of new samples (testing data).

Methods for solving convex programming problems with low convergence rates date back from the last decades of the 20th century, like [4, 5], for instance. More recently, several algorithms that successfully minimize regret (measured as the value of difference between a made decision and the optimal decision) in the online learning setting have been proposed [6]-[13]. All of the referred methods are related through the use of gradient descent, which is a relatively efficient optimization method given that the objective function is differentiable with respect to its parameters. In this case, the computation of first-order partial derivatives with respect to the parameters is of the same order of computational complexity as simply evaluating the function. The work in [4] proposed a method that solves the problem in Hilbert space by constructing a minimizing sequence of points that are not relaxational. The latter property, together with a simple tuning of a parameter in the subdivision process to not start with 1111, allowed the authors to reduce the amount of computation at each step to a minimum and obtain an estimate of the convergence rate of their method. In [5], the author showed that introducing a momentum (MOM) term in gradient descent learning algorithms improves the speed of convergence and that it can increase the range of learning rate over which the system converges, by bringing certain eigen components of the system closer to critical damping. The work in [6] showed how Long Short-term Memory recurrent neural networks could be used to generate complex sequences with long-range structure, by predicting a single data point at a time. The authors in [6] used a form of stochastic gradient descent (SGD) where the gradients are divided by a running average of their recent magnitude and generated their parameter updates by using a momentum on the re-scaled gradient. In [7], a family of sub-gradient methods that dynamically incorporate knowledge about certain properties of the data observed in previous iterations and perform more informative gradient-based learning was proposed. The authors generalized the dubious online learning paradigm to a sub-problem consisting of adapting an algorithm to fit a particular dataset by altering the proximal function and by automatically adjusting the learning rates for online learning. Over the recent years, works in [8]-[11] showed that SGD method is an efficient optimization approach in various machine learning problems and that contributed to recent advances in deep learning. The authors in [12] showed that applying SGD with a momentum with a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter can train both deep and recurrent neural networks to levels of performance that were previously achievable only with Hessian-Free optimization. For this purpose, the authors employed a Nesterov’s accelerated gradient (NAG). In [13], an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments was presented. The method in [13] calculates adaptive learning rates individually for different parameters from estimates of first and second moments of the gradients by using initialization bias correction terms to avoid large step sizes that can lead to divergence.

Even though the described solutions were derived for a much broader class of optimization problems, their application in weight optimization in neural networks is straightforward (and widely used nowadays) by exploiting them as optimizers that dynamically adjust the learning rate in a parameter estimation. In sharp contrast to the existing methods, this work proposes a novel approach for weight optimization whose solution is given in closed-form, by resorting to back-propagation (BP) and least squares (LS) methodology. The proposed solution optimizes the weights in a back-propagating (going from the output towards the input layer) manner by optimizing a set of weights in each layer for each neuron. The proposed scheme achieves weight optimization in a single iteration (per layer) and in just a few iterations for problems with injective and non-injective input-to-output mapping, respectively. What enhances even further the efficiency of the new approach is the fact that the computations in each layer are independent from each other, enabling them to be carried out in parallel. Moreover, it is worth mentioning that the running time of the proposed scheme is deterministic in the sense that one knows the exact number of computations necessary to optimize the weights in all network layers (per iteration, in the case of non-injective mapping) and is about 20202020 times faster than the state-of-the-art solutions in the considered network architectures.

II The Proposed Approach

Consider a fully-connected feed-forward neural network111For the sake of simplicity of the following derivations, a network with no activation functions is considered as an illustrative example. as illustrated in Fig. 1. According to the network structure, the j𝑗jitalic_j-th output, 𝒚jsubscript𝒚𝑗\boldsymbol{y}_{j}bold_italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, is obviously calculated as

𝒚j=j=1Jt=1Twtj(L+1)𝒏t(L),subscript𝒚𝑗superscriptsubscript𝑗1𝐽superscriptsubscript𝑡1𝑇superscriptsubscript𝑤𝑡𝑗𝐿1superscriptsubscript𝒏𝑡𝐿\boldsymbol{y}_{j}=\displaystyle\sum_{j=1}^{J}\displaystyle\sum_{t=1}^{T}w_{tj% }^{(L+1)}\boldsymbol{n}_{t}^{(L)},bold_italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_J end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_t italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_L + 1 ) end_POSTSUPERSCRIPT bold_italic_n start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_L ) end_POSTSUPERSCRIPT , (1)

for j=1,,J𝑗1𝐽j=1,...,Jitalic_j = 1 , … , italic_J, where the i𝑖iitalic_i-th neuron in the l𝑙litalic_l-th layer (for l2𝑙2l\geq 2italic_l ≥ 2) is given by

𝒏i(l)=i=1Nlk=1Nl1wki(l)𝒏k(l1),superscriptsubscript𝒏𝑖𝑙superscriptsubscript𝑖1subscript𝑁𝑙superscriptsubscript𝑘1subscript𝑁𝑙1superscriptsubscript𝑤𝑘𝑖𝑙superscriptsubscript𝒏𝑘𝑙1\boldsymbol{n}_{i}^{(l)}=\displaystyle\sum_{i=1}^{N_{l}}\displaystyle\sum_{k=1% }^{N_{l-1}}w_{ki}^{(l)}\boldsymbol{n}_{k}^{(l-1)},bold_italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_k italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT bold_italic_n start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l - 1 ) end_POSTSUPERSCRIPT , (2)

with Nlsubscript𝑁𝑙N_{l}italic_N start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT and Nl1subscript𝑁𝑙1N_{l-1}italic_N start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT denoting the total number of neurons in the l𝑙litalic_l-th layer and the l1𝑙1l-1italic_l - 1-th layer respectively (note that N0subscript𝑁0N_{0}italic_N start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT denotes the number of features of the input data, while NL+1subscript𝑁𝐿1N_{L+1}italic_N start_POSTSUBSCRIPT italic_L + 1 end_POSTSUBSCRIPT denotes the number of outputs), while the neurons in the first layer (l=1𝑙1l=1italic_l = 1) are computed according to

𝒏p(1)=p=1Pm=1Mwmp(1)𝒙m.superscriptsubscript𝒏𝑝1superscriptsubscript𝑝1𝑃superscriptsubscript𝑚1𝑀superscriptsubscript𝑤𝑚𝑝1subscript𝒙𝑚\boldsymbol{n}_{p}^{(1)}=\displaystyle\sum_{p=1}^{P}\displaystyle\sum_{m=1}^{M% }w_{mp}^{(1)}\boldsymbol{x}_{m}.bold_italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_p = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_m italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT . (3)
Refer to caption
Figure 1: An example of a generic fully-connected neural network.

To the best of our knowledge, until the present day, existing works focus their attention on training the neural networks by using back-propagation method in the following manner. The input data is multiplied with randomly generated weights in order to compute the output by exploiting the known network architecture. The so-produced output is then compared with the ground truth (i.e., the desired output) in order to update the weights by employing partial derivatives based on the chain rule as

wi(t+1)=wi(t)+ηΔwi(t),superscriptsubscript𝑤𝑖𝑡1superscriptsubscript𝑤𝑖𝑡𝜂Δsuperscriptsubscript𝑤𝑖𝑡w_{i}^{(t+1)}=w_{i}^{(t)}+\eta\Delta w_{i}^{(t)},italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t + 1 ) end_POSTSUPERSCRIPT = italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT + italic_η roman_Δ italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ,

where wi(t)superscriptsubscript𝑤𝑖𝑡w_{i}^{(t)}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT is the weight value that one is trying to optimize in the t𝑡titalic_t-th iteration (epoch), η𝜂\etaitalic_η is the learning rate (usually, either a fixed value or determined dynamically in each epoch by using an optimizer such as MOM [5], RMSProp [6], NAG [12] or Adam [13]), and Δwi(t)Δsuperscriptsubscript𝑤𝑖𝑡\Delta w_{i}^{(t)}roman_Δ italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT is the chain derivative of the output with respect to the i𝑖iitalic_i-th weight. Obviously, this method is iterative and non-deterministic in the sense that one cannot know how many iterations will the algorithm require to converge beforehand. Moreover, this approach is sub-optimal, meaning that it does not give any guarantees that the obtained solution is actually the optimal one, global-wise.

This work takes a different approach to accomplish weight optimization in a fully-connected feed-forward neural network. Given that the problem can be modeled as an injective function between the input and the output, the proposed solution requires a single iteration to optimize the weights of each neuron without resorting to derivatives and its solution is given in a closed-form (which allows for deterministic calculation of its running time). For problems with non-injective input-output relationship (such as classification problems), this work proposes a straightforward adaptation of the original idea that optimizes the weights in just a few epochs, as described in the following text.

The proposed weight optimization process consists of optimizing the weights of each neuron individually in every layer in the back-propagating direction. First, similar to the existing approaches, randomly generated weights are assigned to the network in order to obtain the outputs, 𝒚jsubscript𝒚𝑗\boldsymbol{y}_{j}bold_italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, according to (1), by exploiting the known network architecture. Once 𝒚jsubscript𝒚𝑗\boldsymbol{y}_{j}bold_italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPTs are available, the network is peeled layer by layer (in the opposite direction, from the output towards the input layers) and the corresponding weights are optimized based on an LS approach. More precisely, the weights of the i𝑖iitalic_i-th neuron (i=1,,Nl𝑖1subscript𝑁𝑙i=1,...,N_{l}italic_i = 1 , … , italic_N start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT) in the l𝑙litalic_l-th layer, 𝒘i(l)=[wki(l)]Tsuperscriptsubscript𝒘𝑖𝑙superscriptdelimited-[]superscriptsubscript𝑤𝑘𝑖𝑙𝑇\boldsymbol{w}_{i}^{(l)}=[w_{ki}^{(l)}]^{T}bold_italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT = [ italic_w start_POSTSUBSCRIPT italic_k italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT, k=1,,Nl1𝑘1subscript𝑁𝑙1k=1,...,N_{l-1}italic_k = 1 , … , italic_N start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT, are optimized by solving the following LS problem

𝒘^i(l)=arg min𝒘i(l)𝑨l𝒘i(l)𝒃l2.superscriptsubscript^𝒘𝑖𝑙superscriptsubscript𝒘𝑖𝑙arg minsuperscriptnormsubscript𝑨𝑙superscriptsubscript𝒘𝑖𝑙subscript𝒃𝑙2\hat{\boldsymbol{w}}_{i}^{(l)}=\underset{\boldsymbol{w}_{i}^{(l)}}{\text{arg\,% min}}\,\|\boldsymbol{A}_{l}\boldsymbol{w}_{i}^{(l)}-\boldsymbol{b}_{l}\|^{2}.over^ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT = start_UNDERACCENT bold_italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT end_UNDERACCENT start_ARG arg min end_ARG ∥ bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT bold_italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT - bold_italic_b start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (4)

The matrix 𝑨lsubscript𝑨𝑙\boldsymbol{A}_{l}bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT and the vector 𝒃lsubscript𝒃𝑙\boldsymbol{b}_{l}bold_italic_b start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT in (4) are defined according to

𝑨l={[𝒙m]T,ifl=1,[𝒏i(l1)]T,if   1<lL+1,subscript𝑨𝑙casessuperscriptmatrixsubscript𝒙𝑚𝑇if𝑙1otherwiseotherwisesuperscriptmatrixsuperscriptsubscript𝒏𝑖𝑙1𝑇if1𝑙𝐿1\boldsymbol{A}_{l}=\begin{cases}\begin{bmatrix}\vdots\\ \boldsymbol{x}_{m}\\ \vdots\end{bmatrix}^{T},&\text{if}\,\,\,l=1,\\ \\ \begin{bmatrix}\vdots\\ \boldsymbol{n}_{i}^{(l-1)}\\ \vdots\end{bmatrix}^{T},&\text{if}\,\,\,1<l\leq L+1\\ \end{cases},bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = { start_ROW start_CELL [ start_ARG start_ROW start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL bold_italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , end_CELL start_CELL if italic_l = 1 , end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL [ start_ARG start_ROW start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL bold_italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l - 1 ) end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , end_CELL start_CELL if 1 < italic_l ≤ italic_L + 1 end_CELL end_ROW ,
𝒃l={[𝒏^i(l)]T,if   1lL,[𝒚~l]T,ifl=L+1,subscript𝒃𝑙casessuperscriptdelimited-[]superscriptsubscript^𝒏𝑖𝑙𝑇if1𝑙𝐿otherwiseotherwisesuperscriptdelimited-[]subscript~𝒚𝑙𝑇if𝑙𝐿1\boldsymbol{b}_{l}=\begin{cases}[\hat{\boldsymbol{n}}_{i}^{(l)}]^{T},&\text{if% }\,\,\,1\leq l\leq L,\\ \\ [\widetilde{\boldsymbol{y}}_{l}]^{T},&\text{if}\,\,\,l=L+1\end{cases},bold_italic_b start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = { start_ROW start_CELL [ over^ start_ARG bold_italic_n end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , end_CELL start_CELL if 1 ≤ italic_l ≤ italic_L , end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL [ over~ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , end_CELL start_CELL if italic_l = italic_L + 1 end_CELL end_ROW ,

i.e., 𝑨ld×Nl1subscript𝑨𝑙superscript𝑑subscript𝑁𝑙1\boldsymbol{A}_{l}\in\mathbb{R}^{d\times N_{l-1}}bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_N start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT and 𝒃ld×1subscript𝒃𝑙superscript𝑑1\boldsymbol{b}_{l}\in\mathbb{R}^{d\times 1}bold_italic_b start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × 1 end_POSTSUPERSCRIPT, with d𝑑ditalic_d representing the number of input data (𝒙m1×dsubscript𝒙𝑚superscript1𝑑\boldsymbol{x}_{m}\in\mathbb{R}^{1\times d}bold_italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 1 × italic_d end_POSTSUPERSCRIPT), 𝒚~jsubscript~𝒚𝑗\widetilde{\boldsymbol{y}}_{j}over~ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT denoting the desired output of the network (for a given input) and

𝒏^i(l)={𝒚~jt=1tiTw^tj(l+1)𝒏t(l)w^ij(l+1),ifl=L,𝒏^k(l+1)u=1uiNlw^uk(l+1)𝒏u(l)w^ik(l+1),otherwise,superscriptsubscript^𝒏𝑖𝑙casessubscript~𝒚𝑗superscriptsubscript𝑡1𝑡𝑖𝑇superscriptsubscript^𝑤𝑡𝑗𝑙1superscriptsubscript𝒏𝑡𝑙superscriptsubscript^𝑤𝑖𝑗𝑙1if𝑙𝐿otherwiseotherwisesuperscriptsubscript^𝒏𝑘𝑙1superscriptsubscript𝑢1𝑢𝑖subscript𝑁𝑙superscriptsubscript^𝑤𝑢𝑘𝑙1superscriptsubscript𝒏𝑢𝑙superscriptsubscript^𝑤𝑖𝑘𝑙1otherwise\hat{\boldsymbol{n}}_{i}^{(l)}=\begin{cases}\frac{\widetilde{\boldsymbol{y}}_{% j}-\displaystyle\sum_{\begin{subarray}{c}t=1\\ t\neq i\end{subarray}}^{T}\hat{w}_{tj}^{(l+1)}\boldsymbol{n}_{t}^{(l)}}{\hat{w% }_{ij}^{(l+1)}},&\text{if}\,\,\,l=L,\\ \\ \frac{\hat{\boldsymbol{n}}_{k}^{(l+1)}-\displaystyle\sum_{\begin{subarray}{c}u% =1\\ u\neq i\end{subarray}}^{N_{l}}\hat{w}_{uk}^{(l+1)}\boldsymbol{n}_{u}^{(l)}}{% \hat{w}_{ik}^{(l+1)}},&\text{otherwise}\end{cases},over^ start_ARG bold_italic_n end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT = { start_ROW start_CELL divide start_ARG over~ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - ∑ start_POSTSUBSCRIPT start_ARG start_ROW start_CELL italic_t = 1 end_CELL end_ROW start_ROW start_CELL italic_t ≠ italic_i end_CELL end_ROW end_ARG end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_t italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l + 1 ) end_POSTSUPERSCRIPT bold_italic_n start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT end_ARG start_ARG over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l + 1 ) end_POSTSUPERSCRIPT end_ARG , end_CELL start_CELL if italic_l = italic_L , end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL divide start_ARG over^ start_ARG bold_italic_n end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l + 1 ) end_POSTSUPERSCRIPT - ∑ start_POSTSUBSCRIPT start_ARG start_ROW start_CELL italic_u = 1 end_CELL end_ROW start_ROW start_CELL italic_u ≠ italic_i end_CELL end_ROW end_ARG end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT end_POSTSUPERSCRIPT over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_u italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l + 1 ) end_POSTSUPERSCRIPT bold_italic_n start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT end_ARG start_ARG over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l + 1 ) end_POSTSUPERSCRIPT end_ARG , end_CELL start_CELL otherwise end_CELL end_ROW , (5)

as illustrated in Figs. 2(a)2(b) and 2(c).

Refer to caption
(a) Output (the last) layer
Refer to caption
(b) Hidden (middle) layer
Refer to caption
(c) Input (the first) layer
Figure 2: An example of the proposed network trimming for weight optimization of each neuron in a layer.

Therefore, the solution of (4) is given in closed-form, i.e., the proposed weight optimization process boils down to just

𝒘^i(l)=(𝑨lT𝑨l)1(𝑨lT𝒃l).superscriptsubscript^𝒘𝑖𝑙superscriptsuperscriptsubscript𝑨𝑙𝑇subscript𝑨𝑙1superscriptsubscript𝑨𝑙𝑇subscript𝒃𝑙\hat{\boldsymbol{w}}_{i}^{(l)}=\left(\boldsymbol{A}_{l}^{T}\boldsymbol{A}_{l}% \right)^{-1}\left(\boldsymbol{A}_{l}^{T}\boldsymbol{b}_{l}\right).over^ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT = ( bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_b start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) . (6)

After this process has been performed for all layers of the network during the training phase, the network is left with the optimal weights (in the LS sense) for a given dataset.

To summarize, the proposed approach forces the network to produce the desired output by adjusting the weight values going into each neuron according to the proposed back-propagating LS (BPLS) methodology. The interesting thing is that the optimized weights in the last layer guarantee that the network will produce the desired output, but it also allows for updating the values of neurons across all layers. This is simply accomplished by exploiting the optimized weights and the desired output/updated value of the neurons in the preceding layer, together with the known network architecture in a back-propagating manner. In other words, each layer of the network is observed as a set of parallel systems with known inputs (the values computed by resorting to random weights) and known outputs (the desired ones). Hence, the optimization of weights is achieved in a simple and efficient manner in each layer and each neuron. The desired outputs in each layer are then used to push the information throughout the network by using the optimized weights to perform updates of the values of each neuron, after which the process is repeated for the succeeding layer.

It is worth mentioning that the proposed approach works also for networks with injective activation functions (such as sigmoid, hyperbolic tangent, exponential linear unit, softmax, softplus, softminus, etc.). In this case, the computation of the desired values in each neuron/output is adjusted to account for the use of the activation function, i.e., the updates of the neurons are obtained by calculating the inverse of the activation function whose argument is given on the right-hand side in (5). Finally, unlike the existing solutions, when the problem has an injective input-to-output mapping, the running time of the new scheme is deterministic in the sense that one can compute its exact running time, given that its total number of computations is exactly Nl×(L+1)subscript𝑁𝑙𝐿1N_{l}\times\left(L+1\right)italic_N start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT × ( italic_L + 1 ).

A pseudo-code of the proposed algorithm, called BPLS, is given in Algorithm 1. At line 1111 of the algorithm, each of the weights in the network is randomly initialized according to a uniform random variable on the interval [wlow,wupp]subscript𝑤lowsubscript𝑤upp[w_{\text{low}},w_{\text{upp}}][ italic_w start_POSTSUBSCRIPT low end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT upp end_POSTSUBSCRIPT ]. Line 2222 is required so that all intermediate results (neuron values, nil,i=1,,Nl,l=1,,Lformulae-sequencesubscript𝑛𝑖𝑙𝑖1subscript𝑁𝑙𝑙1𝐿n_{il},i=1,...,N_{l},l=1,...,Litalic_n start_POSTSUBSCRIPT italic_i italic_l end_POSTSUBSCRIPT , italic_i = 1 , … , italic_N start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT , italic_l = 1 , … , italic_L) for the random network are acquired. It is worth mentioning that, unlike the existing methods, lines 5105105-105 - 10 can be executed in parallel, i.e., one can obtain all weights for a given layer simultaneously by suitably rewriting the involved parameters in a matrix form.

Algorithm 1  Pseudo-code of the Proposed BPLS Algorithm
0:  M::𝑀absentM:italic_M : Number of inputs
0:  𝒙m::subscript𝒙𝑚absent\boldsymbol{x}_{m}:bold_italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT : Input data m=1,,M𝑚1𝑀m=1,...,Mitalic_m = 1 , … , italic_M
0:  L::𝐿absentL:italic_L : Number of layers in the network
0:  Nl::subscript𝑁𝑙absentN_{l}:italic_N start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT : Number of neurons in the l𝑙litalic_l-th layer, l=1,,L𝑙1𝐿l=1,...,Litalic_l = 1 , … , italic_L
0:  J::𝐽absentJ:italic_J : Number of outputs
0:  𝒚~j::subscript~𝒚𝑗absent\widetilde{\boldsymbol{y}}_{j}:over~ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT : Desired outcome at the j𝑗jitalic_j-th output
1:  Initialization: Set wki(l)𝒰[wlow,wupp],i=1,,Nl,k=1,,Nl1,l=1,,L+1formulae-sequencesimilar-tosuperscriptsubscript𝑤𝑘𝑖𝑙𝒰subscript𝑤lowsubscript𝑤uppformulae-sequence𝑖1subscript𝑁𝑙formulae-sequence𝑘1subscript𝑁𝑙1𝑙1𝐿1w_{ki}^{(l)}\sim\mathcal{U}[w_{\text{low}},w_{\text{upp}}],\,\,i=1,...,N_{l},% \,k=1,...,N_{l-1},\,l=1,...,L+1italic_w start_POSTSUBSCRIPT italic_k italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ∼ caligraphic_U [ italic_w start_POSTSUBSCRIPT low end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT upp end_POSTSUBSCRIPT ] , italic_i = 1 , … , italic_N start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT , italic_k = 1 , … , italic_N start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT , italic_l = 1 , … , italic_L + 1
2:  𝒚jsubscript𝒚𝑗absent\boldsymbol{y}_{j}\leftarrowbold_italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ← (1) //Get output of random network
3:  𝒚jsubscript𝒚𝑗absent\boldsymbol{y}_{j}\leftarrowbold_italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ← (1) //Get output of random network
4:  for l=L+1:1: 1:𝑙𝐿11:1l=L+1\,:\,-1\,:\,1italic_l = italic_L + 1 : - 1 : 1 do
5:     for i=1: 1:Nl1:𝑖11:subscript𝑁𝑙1i=1\,:\,1\,:\,N_{l-1}italic_i = 1 : 1 : italic_N start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT do
6:        𝒘^i(l)superscriptsubscript^𝒘𝑖𝑙absent\hat{\boldsymbol{w}}_{i}^{(l)}\leftarrowover^ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ← (6) //Update weights in l𝑙litalic_l-th layer
7:        if l1𝑙1l\neq 1italic_l ≠ 1 then
8:           𝒏^i(l1)superscriptsubscript^𝒏𝑖𝑙1absent\hat{\boldsymbol{n}}_{i}^{(l-1)}\leftarrowover^ start_ARG bold_italic_n end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l - 1 ) end_POSTSUPERSCRIPT ← (5) //Update neurons in l1𝑙1l-1italic_l - 1-th layer
9:        end if
10:     end for
11:  end for
12:  Return: 𝒚^jsubscript^𝒚𝑗\hat{\boldsymbol{y}}_{j}over^ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT //Use 𝒘^i(l)superscriptsubscript^𝒘𝑖𝑙\hat{\boldsymbol{w}}_{i}^{(l)}over^ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT &\&& netw. arch. for output at j𝑗jitalic_j

The proposed BPLS solution can also be applied for problems that can be seen as non-injective systems, such as classification problems. In this case, the proposed algorithm is summarized in the pseudo-code in Algorithm 2. Until line 14141414, the main adjustment in our approach is that another LS problem is solved by using updated coefficient matrix of the system. This is required since it might not be possible to obtain accurate (enough) weight estimates from a high quantity of erroneous data. Once an initial estimate of the weights is acquired, the algorithm iteratively updates the weight values by exclusively using data for which an incorrect label has been given, rather than the entire dataset. This allows for faster execution of the algorithm compared to the existing ones and the procedure rewards previous weight estimates that reduce the number of missed classifications, line 20202020.

Algorithm 2  Pseudo-code for BPLS Adapted for Non-injective Systems
0:  M::𝑀absentM:italic_M : Number of inputs
0:  𝒙m::subscript𝒙𝑚absent\boldsymbol{x}_{m}:bold_italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT : Input data m=1,,M𝑚1𝑀m=1,...,Mitalic_m = 1 , … , italic_M
0:  L::𝐿absentL:italic_L : Number of layers in the network
0:  Nl::subscript𝑁𝑙absentN_{l}:italic_N start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT : Number of neurons in the l𝑙litalic_l-th layer, l=1,,L𝑙1𝐿l=1,...,Litalic_l = 1 , … , italic_L
0:  J::𝐽absentJ:italic_J : Number of outputs
0:  𝒚~j::subscript~𝒚𝑗absent\widetilde{\boldsymbol{y}}_{j}:over~ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT : Desired outcome at the j𝑗jitalic_j-th output
0:  τmax::subscript𝜏maxabsent\tau_{\text{max}}:italic_τ start_POSTSUBSCRIPT max end_POSTSUBSCRIPT : Maximum number of iterations
1:  Initialization: Set wki(l)𝒰[wlow,wupp],i=1,,Nl,k=1,,Nl1,l=1,,L+1formulae-sequencesimilar-tosuperscriptsubscript𝑤𝑘𝑖𝑙𝒰subscript𝑤lowsubscript𝑤uppformulae-sequence𝑖1subscript𝑁𝑙formulae-sequence𝑘1subscript𝑁𝑙1𝑙1𝐿1w_{ki}^{(l)}\sim\mathcal{U}[w_{\text{low}},w_{\text{upp}}],\,\,i=1,...,N_{l},% \,k=1,...,N_{l-1},\,l=1,...,L+1italic_w start_POSTSUBSCRIPT italic_k italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ∼ caligraphic_U [ italic_w start_POSTSUBSCRIPT low end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT upp end_POSTSUBSCRIPT ] , italic_i = 1 , … , italic_N start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT , italic_k = 1 , … , italic_N start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT , italic_l = 1 , … , italic_L + 1
2:  𝒚jsubscript𝒚𝑗absent\boldsymbol{y}_{j}\leftarrowbold_italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ← (1) //Get output of random network
3:  for l=L+1:1: 1:𝑙𝐿11:1l=L+1\,:\,-1\,:\,1italic_l = italic_L + 1 : - 1 : 1 do
4:     for i=1: 1:Nl1:𝑖11:subscript𝑁𝑙1i=1\,:\,1\,:\,N_{l-1}italic_i = 1 : 1 : italic_N start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT do
5:        if l1𝑙1l\neq 1italic_l ≠ 1 then
6:           𝒘~i(l)superscriptsubscript~𝒘𝑖𝑙absent\widetilde{\boldsymbol{w}}_{i}^{(l)}\leftarrowover~ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ← (6) //Initial update in l𝑙litalic_l-th layer
7:           𝒏^i(l1)superscriptsubscript^𝒏𝑖𝑙1absent\hat{\boldsymbol{n}}_{i}^{(l-1)}\leftarrowover^ start_ARG bold_italic_n end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l - 1 ) end_POSTSUPERSCRIPT ← (5) //Update neurons in l1𝑙1l-1italic_l - 1-th layer
8:           𝑨^l[(𝒏^i)(l1)T]T\hat{\boldsymbol{A}}_{l}\leftarrow[\ldots\,(\hat{\boldsymbol{n}}_{i}{{}^{(l-1)% }})^{T}\,\ldots]^{T}over^ start_ARG bold_italic_A end_ARG start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ← [ … ( over^ start_ARG bold_italic_n end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_FLOATSUPERSCRIPT ( italic_l - 1 ) end_FLOATSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT … ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT //Upd. coeff. matrix
9:           𝒘^i(l)(𝑨^l𝑨^lT)1(𝑨^l𝒃lT)superscriptsubscript^𝒘𝑖𝑙superscriptsubscript^𝑨𝑙superscriptsubscript^𝑨𝑙𝑇1subscript^𝑨𝑙superscriptsubscript𝒃𝑙𝑇\hat{\boldsymbol{w}}_{i}^{(l)}\leftarrow\left(\hat{\boldsymbol{A}}_{l}\hat{% \boldsymbol{A}}_{l}^{T}\right)^{-1}\left(\hat{\boldsymbol{A}}_{l}\boldsymbol{b% }_{l}^{T}\right)over^ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ← ( over^ start_ARG bold_italic_A end_ARG start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT over^ start_ARG bold_italic_A end_ARG start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over^ start_ARG bold_italic_A end_ARG start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT bold_italic_b start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) //Estimate weights
10:        else
11:           𝒘^i(l)superscriptsubscript^𝒘𝑖𝑙absent\hat{\boldsymbol{w}}_{i}^{(l)}\leftarrowover^ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ← (6) //Estimate weights in 1111-st layer
12:        end if
13:     end for
14:  end for
15:  𝒚^jsubscript^𝒚𝑗\hat{\boldsymbol{y}}_{j}over^ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT //Estimate output at j𝑗jitalic_j
16:  μ(0)(𝒚^j==𝒚~j)\mu^{(0)}\leftarrow(\hat{\boldsymbol{y}}_{j}==\widetilde{\boldsymbol{y}}_{j})italic_μ start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT ← ( over^ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = = over~ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) //Estimated = desired labels?
17:  Initialize: τ1𝜏1\tau\leftarrow 1italic_τ ← 1 and μ(τ)106superscript𝜇𝜏superscript106\mu^{(\tau)}\leftarrow 10^{6}italic_μ start_POSTSUPERSCRIPT ( italic_τ ) end_POSTSUPERSCRIPT ← 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT
18:  while μ(τ1)>μ(τ)superscript𝜇𝜏1superscript𝜇𝜏\mu^{(\tau-1)}>\mu^{(\tau)}italic_μ start_POSTSUPERSCRIPT ( italic_τ - 1 ) end_POSTSUPERSCRIPT > italic_μ start_POSTSUPERSCRIPT ( italic_τ ) end_POSTSUPERSCRIPT and ττmax𝜏subscript𝜏max\tau\leq\tau_{\text{max}}italic_τ ≤ italic_τ start_POSTSUBSCRIPT max end_POSTSUBSCRIPT do
19:     𝒘^i,miss(l)superscriptsubscript^𝒘𝑖miss𝑙absent\hat{\boldsymbol{w}}_{i,\text{miss}}^{(l)}\leftarrowover^ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_i , miss end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ← repeat lines 3143143-143 - 14 using inputs that produced a miss
20:     𝒘^i(l)(1#misses#data)𝒘^i(l)+#misses#data𝒘^i,miss(l)superscriptsubscript^𝒘𝑖𝑙1#misses#datasuperscriptsubscript^𝒘𝑖𝑙#misses#datasuperscriptsubscript^𝒘𝑖miss𝑙\hat{\boldsymbol{w}}_{i}^{(l)}\leftarrow\left(1-\frac{\#\text{misses}}{\#\text% {data}}\right)\hat{\boldsymbol{w}}_{i}^{(l)}+\frac{\#\text{misses}}{\#\text{% data}}\hat{\boldsymbol{w}}_{i,\text{miss}}^{(l)}over^ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ← ( 1 - divide start_ARG # misses end_ARG start_ARG # data end_ARG ) over^ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT + divide start_ARG # misses end_ARG start_ARG # data end_ARG over^ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_i , miss end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT
21:     𝒚^jsubscript^𝒚𝑗\hat{\boldsymbol{y}}_{j}over^ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT //Estimate output at j𝑗jitalic_j
22:     μ(τ)(𝒚^j==𝒚~j)\mu^{(\tau)}\leftarrow(\hat{\boldsymbol{y}}_{j}==\widetilde{\boldsymbol{y}}_{j})italic_μ start_POSTSUPERSCRIPT ( italic_τ ) end_POSTSUPERSCRIPT ← ( over^ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = = over~ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) //Check labels
23:     ττ+1𝜏𝜏1\tau\leftarrow\tau+1italic_τ ← italic_τ + 1
24:  end while
25:  Return: 𝒚^jsubscript^𝒚𝑗\hat{\boldsymbol{y}}_{j}over^ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT //Use 𝒘^i(l)superscriptsubscript^𝒘𝑖𝑙\hat{\boldsymbol{w}}_{i}^{(l)}over^ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT &\&& netw. arch. for output at j𝑗jitalic_j

III Discussion

In this section, an analysis of the proposed approach in terms of its computational complexity and feasibility of its solution is provided. The section is organized accordingly.

III-A Computational Complexity Analysis

To analyse the computational complexity of the proposed approach, the Big O notation is adopted. Since the proposed approach allows for optimization of all weights in a given layer l𝑙litalic_l simultaneously, (6) can be also be rewritten as

𝒘^(l)=(𝑨lT𝑨l)1(𝑨lT𝑩l),superscript^𝒘𝑙superscriptsuperscriptsubscript𝑨𝑙𝑇subscript𝑨𝑙1superscriptsubscript𝑨𝑙𝑇subscript𝑩𝑙\hat{\boldsymbol{w}}^{(l)}=\left(\boldsymbol{A}_{l}^{T}\boldsymbol{A}_{l}% \right)^{-1}\left(\boldsymbol{A}_{l}^{T}\boldsymbol{B}_{l}\right),over^ start_ARG bold_italic_w end_ARG start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT = ( bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_B start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) , (7)

where 𝒘^(l)=[𝒘^i(l)]Nl1×Nlsuperscript^𝒘𝑙delimited-[]superscriptsubscript^𝒘𝑖𝑙superscriptsubscript𝑁𝑙1subscript𝑁𝑙\hat{\boldsymbol{w}}^{(l)}=[\hat{\boldsymbol{w}}_{i}^{(l)}]\in\mathbb{R}^{N_{l% -1}\times N_{l}}over^ start_ARG bold_italic_w end_ARG start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT = [ over^ start_ARG bold_italic_w end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ] ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, 𝑨ld×Nl1subscript𝑨𝑙superscript𝑑subscript𝑁𝑙1\boldsymbol{A}_{l}\in\mathbb{R}^{d\times N_{l-1}}bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_N start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT and 𝑩l=[𝒃l]d×Nlsubscript𝑩𝑙delimited-[]subscript𝒃𝑙superscript𝑑subscript𝑁𝑙\boldsymbol{B}_{l}=[\boldsymbol{b}_{l}]\in\mathbb{R}^{d\times N_{l}}bold_italic_B start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = [ bold_italic_b start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ] ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_N start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, with d𝑑ditalic_d representing the number of input data used. Essentially, the proposed solution (7) involves two matrix operations in the weight optimization process in each layer: multiplication and inversion. The former operation requires a complexity of 𝒪(αβγ)𝒪𝛼𝛽𝛾\mathcal{O}(\alpha\beta\gamma)caligraphic_O ( italic_α italic_β italic_γ ) for multiplication of an α×β𝛼𝛽\alpha\times\betaitalic_α × italic_β size matrix with a β×γ𝛽𝛾\beta\times\gammaitalic_β × italic_γ size matrix, while the latter operation requires a cubed complexity in the size of the matrix. Hence, by neglecting minor terms and focusing only on the dominant ones, the worst-case computational complexity of the proposed algorithm is 𝒪(h3+h2d)𝒪superscript3superscript2𝑑\mathcal{O}(h^{3}+h^{2}d)caligraphic_O ( italic_h start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT + italic_h start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d ), where h=max{N0,,NL}subscript𝑁0subscript𝑁𝐿h=\max\{N_{0},...,N_{L}\}italic_h = roman_max { italic_N start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , … , italic_N start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT }.

III-B Feasibility Analysis

According to (7), let

f(𝒘(l))=𝑨l𝒘(l)𝑩l2==(𝒘(l))T𝑨lT𝑨l𝒘(l)2(𝒘(l))T𝑨lT𝑩l+𝑩lT𝑩l𝑓superscript𝒘𝑙superscriptnormsubscript𝑨𝑙superscript𝒘𝑙subscript𝑩𝑙2absentabsentsuperscriptsuperscript𝒘𝑙𝑇superscriptsubscript𝑨𝑙𝑇subscript𝑨𝑙superscript𝒘𝑙2superscriptsuperscript𝒘𝑙𝑇superscriptsubscript𝑨𝑙𝑇subscript𝑩𝑙superscriptsubscript𝑩𝑙𝑇subscript𝑩𝑙\begin{array}[]{l}f(\boldsymbol{w}^{(l)})=\|\boldsymbol{A}_{l}\boldsymbol{w}^{% (l)}-\boldsymbol{B}_{l}\|^{2}=\\ =(\boldsymbol{w}^{(l)})^{T}\boldsymbol{A}_{l}^{T}\boldsymbol{A}_{l}\boldsymbol% {w}^{(l)}-2(\boldsymbol{w}^{(l)})^{T}\boldsymbol{A}_{l}^{T}\boldsymbol{B}_{l}+% \boldsymbol{B}_{l}^{T}\boldsymbol{B}_{l}\end{array}start_ARRAY start_ROW start_CELL italic_f ( bold_italic_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ) = ∥ bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT bold_italic_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT - bold_italic_B start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = end_CELL end_ROW start_ROW start_CELL = ( bold_italic_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT bold_italic_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT - 2 ( bold_italic_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_B start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT + bold_italic_B start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_B start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT end_CELL end_ROW end_ARRAY

denote the cost function to be minimized and \mathcal{F}caligraphic_F denote the set of feasible solutions. Since 𝑨lT𝑨lsuperscriptsubscript𝑨𝑙𝑇subscript𝑨𝑙\boldsymbol{A}_{l}^{T}\boldsymbol{A}_{l}bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT is positive semi-definite by definition, the cost function is convex. Moreover, one can even guarantee strong convexity of the objective function by simply adding a (small) regularization term, i.e.,

f~(𝒘(l))=𝑨l𝒘(l)𝑩l2+ϵ𝒘(l)2=(𝒘(l))T(𝑨lT𝑨l+ϵ𝑰Nl1)𝒘(l)2(𝒘(l))T𝑨lT𝑩l+𝑩lT𝑩l,~𝑓superscript𝒘𝑙superscriptnormsubscript𝑨𝑙superscript𝒘𝑙subscript𝑩𝑙2italic-ϵsuperscriptnormsuperscript𝒘𝑙2absentsuperscriptsuperscript𝒘𝑙𝑇superscriptsubscript𝑨𝑙𝑇subscript𝑨𝑙italic-ϵsubscript𝑰subscript𝑁𝑙1superscript𝒘𝑙2superscriptsuperscript𝒘𝑙𝑇superscriptsubscript𝑨𝑙𝑇subscript𝑩𝑙superscriptsubscript𝑩𝑙𝑇subscript𝑩𝑙\begin{array}[]{l}\tilde{f}(\boldsymbol{w}^{(l)})=\|\boldsymbol{A}_{l}% \boldsymbol{w}^{(l)}-\boldsymbol{B}_{l}\|^{2}+\epsilon\|\boldsymbol{w}^{(l)}\|% ^{2}=\\ (\boldsymbol{w}^{(l)})^{T}\left(\boldsymbol{A}_{l}^{T}\boldsymbol{A}_{l}+% \epsilon\boldsymbol{I}_{N_{l-1}}\right)\boldsymbol{w}^{(l)}-2(\boldsymbol{w}^{% (l)})^{T}\boldsymbol{A}_{l}^{T}\boldsymbol{B}_{l}+\boldsymbol{B}_{l}^{T}% \boldsymbol{B}_{l},\end{array}start_ARRAY start_ROW start_CELL over~ start_ARG italic_f end_ARG ( bold_italic_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ) = ∥ bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT bold_italic_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT - bold_italic_B start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_ϵ ∥ bold_italic_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = end_CELL end_ROW start_ROW start_CELL ( bold_italic_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT + italic_ϵ bold_italic_I start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) bold_italic_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT - 2 ( bold_italic_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_B start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT + bold_italic_B start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_B start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT , end_CELL end_ROW end_ARRAY

with ϵ=106italic-ϵsuperscript106\epsilon=10^{-6}italic_ϵ = 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT, for instance.

From the practical/engineering perspective, the two objective functions are virtually the same, but f~~𝑓\tilde{f}over~ start_ARG italic_f end_ARG is strongly convex, meaning that it has a unique solution, since the matrix 𝑨lT𝑨l+ϵ𝑰Nl1superscriptsubscript𝑨𝑙𝑇subscript𝑨𝑙italic-ϵsubscript𝑰subscript𝑁𝑙1\boldsymbol{A}_{l}^{T}\boldsymbol{A}_{l}+\epsilon\boldsymbol{I}_{N_{l-1}}bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT + italic_ϵ bold_italic_I start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT is non-singular. Hence, the necessary and sufficient condition for 𝒘(l)superscript𝒘𝑙\boldsymbol{w}^{(l)}bold_italic_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT to be a minimizer of f~(𝒘(l))~𝑓superscript𝒘𝑙\tilde{f}(\boldsymbol{w}^{(l)})over~ start_ARG italic_f end_ARG ( bold_italic_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ) is

f~(𝒘(l))=0(𝑨lT𝑨l+ϵ𝑰Nl1)𝒘(l)𝑨lT𝑩l=0.~𝑓superscript𝒘𝑙0superscriptsubscript𝑨𝑙𝑇subscript𝑨𝑙italic-ϵsubscript𝑰subscript𝑁𝑙1superscript𝒘𝑙superscriptsubscript𝑨𝑙𝑇subscript𝑩𝑙0\nabla\tilde{f}(\boldsymbol{w}^{(l)})=0\Leftrightarrow\left(\boldsymbol{A}_{l}% ^{T}\boldsymbol{A}_{l}+\epsilon\boldsymbol{I}_{N_{l-1}}\right)\boldsymbol{w}^{% (l)}-\boldsymbol{A}_{l}^{T}\boldsymbol{B}_{l}=0.∇ over~ start_ARG italic_f end_ARG ( bold_italic_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ) = 0 ⇔ ( bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT + italic_ϵ bold_italic_I start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) bold_italic_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT - bold_italic_A start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_B start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = 0 .

Thus, it follows that \mathcal{F}\neq\varnothingcaligraphic_F ≠ ∅, i.e., a (unique) solution for the proposed approach is feasible in practice.

IV Numerical Results

This section validates the performance of the proposed approach, by comparing it to the following benchmark algorithms: AdaGrad in [7], SGD in [8], NAG in [12], and Adam in [13]. The section first provides a set of numerical results for a couple of toy examples for which mathematical models between the input and the output are well-defined. Likewise, an experiment with real world datasets is performed by resorting to the MNIST and the Fashion-MNIST datasets for multi-class digit and fashion image recognition problems, respectively. The organization of this section is therefore done accordingly. It is worth mentioning that the initial weights for all neurons in all layers were drawn randomly from a uniform distribution on the interval [1,1]11[-1,1][ - 1 , 1 ] in all considered scenarios. Moreover, for toy examples, all considered algorithms were executed in MATLAB by using CPU, whereas for experimental validation, all considered algorithms were executed in Python employing CUDA and PyTorch library with both CPU and GPU. Lastly, the studied architectures were set empirically, but the generalization of all considered algorithms is straightforward to any type of architecture.

IV-A Toy Examples

The performance of the proposed approach is validated here through a couple of functions that describe the known mathematical relationship between the input data and the desired output. Both linear and non-linear relationships are considered for a network comprising an input dataset and a bias fixed at value 1111 (resulting in M=2𝑀2M=2italic_M = 2), with one hidden layer (i.e., L=2𝐿2L=2italic_L = 2) composed of three neurons, N1=3subscript𝑁13N_{1}=3italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 3, and two distinct outputs, J=2𝐽2J=2italic_J = 2, each one describing an independent function of the input data. In the latter case, sigmoid activation functions, defined as a(x)=11+exp{x}𝑎𝑥11𝑥a(x)=\frac{1}{1+\exp\left\{-x\right\}}italic_a ( italic_x ) = divide start_ARG 1 end_ARG start_ARG 1 + roman_exp { - italic_x } end_ARG for some x𝑥x\in\mathbb{R}italic_x ∈ blackboard_R, are employed, as illustrated in Fig. 3. For the training phase, the input dataset composed of a series of odd numbers ranging from 1111 to 9999, while the test is performed by a series of even numbers ranging from 2222 to 10101010. Given the parsimonious quantity of data used, the maximum number of Epochs for the existing methods222All gradient descent-based algorithms were implemented by considering good default settings presented in [13], including the learning rate of η=103𝜂superscript103\eta=10^{-3}italic_η = 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT, except for AdaGrad. In our experiments, we found that AdaGrad performs much better when considering η=101𝜂superscript101\eta=10^{-1}italic_η = 10 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT; hence, η=101𝜂superscript101\eta=10^{-1}italic_η = 10 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT is adopted for AdaGrad in the toy examples. was set to 1000. In all simulations presented in this section, the training dataset is corrupted by noise, nisubscript𝑛𝑖n_{i}italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, modeled as a zero-mean Gaussian random variable with variance σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, i.e., ni𝒩(0,σ2)similar-tosubscript𝑛𝑖𝒩0superscript𝜎2n_{i}\sim\mathcal{N}(0,\sigma^{2})italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ). The principal metric used for assessing accuracy of an approach in this setting is the root mean squared error (RMSE), defined as RMSE=1Mcm=1Mc𝒚~m𝒚^m2,RMSE1subscript𝑀𝑐superscriptsubscript𝑚1subscript𝑀𝑐superscriptnormsubscript~𝒚𝑚subscript^𝒚𝑚2\text{RMSE}=\sqrt{\frac{1}{M_{c}}\sum_{m=1}^{M_{c}}\|\widetilde{\boldsymbol{y}% }_{m}-\hat{\boldsymbol{y}}_{m}\|^{2}},RMSE = square-root start_ARG divide start_ARG 1 end_ARG start_ARG italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ over~ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT - over^ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG , where 𝒚^msubscript^𝒚𝑚\hat{\boldsymbol{y}}_{m}over^ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT is the estimate of the desired output, 𝒚~msubscript~𝒚𝑚\widetilde{\boldsymbol{y}}_{m}over~ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, in the m𝑚mitalic_m-th Monte Carlo, Mcsubscript𝑀𝑐M_{c}italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT, run.

Refer to caption
Figure 3: The considered network with activation function in the toy example.

Fig. 4 illustrates RMSE versus σ𝜎\sigmaitalic_σ performance of the considered algorithms for a random linear input/output relationship, defined as 𝒚~1=13𝒙1+2subscript~𝒚113subscript𝒙12\widetilde{\boldsymbol{y}}_{1}=-\frac{1}{3}\boldsymbol{x}_{1}+2over~ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = - divide start_ARG 1 end_ARG start_ARG 3 end_ARG bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + 2 and 𝒚~2=2𝒙11subscript~𝒚22subscript𝒙11\widetilde{\boldsymbol{y}}_{2}=2\boldsymbol{x}_{1}-1over~ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 2 bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - 1, for both training 4(a) and test 4(b) phases. As expected, all algorithms exhibit deterioration in their performance when the noise power grows. The proposed BPLS algorithm in Algorithm 1 matches the performance of the state-of-the-art approach in the training phase and even slightly surpasses it in the test phase. This is an important accomplishment given that the proposed approach is significantly faster that the existing ones, and requires a single iteration (epoch) to obtain its final solution.

Refer to caption
(a) Training phase
Refer to caption
(b) Test phase
Figure 4: RMSE versus σ𝜎\sigmaitalic_σ illustration, for a well-defined linear mathematical model and Mc=1000subscript𝑀𝑐1000M_{c}=1000italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = 1000.

Fig. 5 illustrates RMSE versus σ𝜎\sigmaitalic_σ performance of the considered algorithms for a random non-linear input/output relationship, defined as 𝒚~1=11+exp{log10(𝒙132)}subscript~𝒚111subscript10superscriptsubscript𝒙132\widetilde{\boldsymbol{y}}_{1}=\frac{1}{1+\exp\left\{\log_{10}\left(% \boldsymbol{x}_{1}^{-\frac{3}{2}}\right)\right\}}over~ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG 1 + roman_exp { roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - divide start_ARG 3 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT ) } end_ARG and 𝒚~2=11+exp{𝒙114}subscript~𝒚211superscriptsubscript𝒙114\widetilde{\boldsymbol{y}}_{2}=\frac{1}{1+\exp\left\{\boldsymbol{x}_{1}^{-% \frac{1}{4}}\right\}}over~ start_ARG bold_italic_y end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG 1 + roman_exp { bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 4 end_ARG end_POSTSUPERSCRIPT } end_ARG, for both training 5(a) and test 5(b) phases. Similarly as in the previous case, the proposed approach virtually matches the best performance among the existing ones and outperforms them in the test phase somewhat more clearly. These results can be seen as an indicator that the existing methods slightly over-fitted the model in comparison with the proposed one.

Refer to caption
(a) Training phase
Refer to caption
(b) Test phase
Figure 5: RMSE versus σ𝜎\sigmaitalic_σ illustration, for a well-defined non-linear mathematical model and Mc=1000subscript𝑀𝑐1000M_{c}=1000italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = 1000.

IV-B Real-world Experiments

To experimentally assess the proposed method, two large, well-known and publicly-available image datasets (MNIST [14, 8], a dataset of handwritten Arabic digits, and Fashion-MNIST [15], a dataset of fashion images) are used for training and testing the considered approaches. Both datasets contain a training set of 6×1046superscript1046\times 10^{4}6 × 10 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT examples and a test set of 1×1041superscript1041\times 10^{4}1 × 10 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT examples of 28×28282828\times 2828 × 28 gray-scale images, associated with a label from 10 classes. Similar as before, the existing methods were implemented by considering the default settings presented in [13] with maximum 40404040 Epochs and the learning rate η=103𝜂superscript103\eta=10^{-3}italic_η = 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT. Furthermore, for the existing algorithms, batch=1batch1\text{batch}=1batch = 1 (in order to optimize the weight updates) and Mc=100subscript𝑀𝑐100M_{c}=100italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = 100 was considered. In order to make the comparison completely fair, random weights were generated in each Mcsubscript𝑀𝑐M_{c}italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT run and all methods were tested against exactly the same initial weights. For this setting, classification accuracy (CA), defined as the proportion of correctly labelled images to the total number of images, is chosen as the main performance metric. For both datasets, a fully-connected network with one hidden layer (i.e., L=2𝐿2L=2italic_L = 2) composed of N1=60subscript𝑁160N_{1}=60italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 60 and two hidden layers (i.e., L=3𝐿3L=3italic_L = 3) composed of N1=100subscript𝑁1100N_{1}=100italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 100 and N2=70subscript𝑁270N_{2}=70italic_N start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 70 neurons, and J=10𝐽10J=10italic_J = 10 distinct outputs is considered, with softmax activation functions, defined as a(𝒙)i=exp{xi}j=1Kexp{xj}𝑎subscript𝒙𝑖subscript𝑥𝑖superscriptsubscript𝑗1𝐾subscript𝑥𝑗a(\boldsymbol{x})_{i}=\frac{\exp\left\{x_{i}\right\}}{\sum_{j=1}^{K}\,\exp% \left\{x_{j}\right\}}italic_a ( bold_italic_x ) start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = divide start_ARG roman_exp { italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_exp { italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } end_ARG for 𝒙K𝒙superscript𝐾\boldsymbol{x}\in\mathbb{R}^{K}bold_italic_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT, as illustrated in Fig. 6.

Refer to caption
(a) Network with one hidden layer
Refer to caption
(b) Network with two hidden layers
Figure 6: The considered network with softmax activation function in the experimental examples.

Fig. 7 illustrates CA (%percent\%%) versus epochs performance of the considered algorithms for the MNIST training dataset in a fully-connected network with a single hidden layer, for both training 7(a) and test 7(b) phases. The figure shows that the proposed BPLS algorithm in Algorithm 2 requires only 11111111 iterations in total to obtain its final solution, while the existing ones were allowed to make at most 40404040 epochs (following the example set in [13]). One can observe that, apart from AdaGrad, the existing algorithms saturate within approximately 10101010 epochs gaining very little-to-nothing in the remaining ones. The only method that does not converge within the considered maximum number of epochs is SGD, which indicates that it requires tuning of its learning rate or more epochs. From Fig. 7, it can be seen that the proposed BPLS algorithms practically matches the performance of the state-of-the-art solutions, in both training and test phases.

Refer to caption
(a) Training phase
Refer to caption
(b) Test phase
Figure 7: CA (%) versus epochs illustration, with one hidden layer (N1=60subscript𝑁160N_{1}=60italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 60) using MNIST dataset.

Fig. 8 illustrates CA (%percent\%%) versus epochs performance of the considered algorithms for the MNIST training dataset in a fully-connected network with a pair of hidden layers, for both training 8(a) and test 8(b) phases. Naturally, in this setting, the proposed solution requires a somewhat higher number of iterations in total (14141414) than in the previous setting, although it gets saturated after roughly 5555, similar as most of the considered ones. Nevertheless, the performance of all considered methods is basically unchanged with the addition of an extra hidden layer, and the proposed BPLS is competitive with the best existing methods in both training and test phases.

Refer to caption
(a) Training phase
Refer to caption
(b) Test phase
Figure 8: CA (%) versus epochs illustration, with two hidden layers (N1=100subscript𝑁1100N_{1}=100italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 100 and N2=70subscript𝑁270N_{2}=70italic_N start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 70) using MNIST dataset.

Fig. 9 illustrates CA (%percent\%%) versus epochs performance of the considered algorithms for the Fashion-MNIST training dataset in a fully-connected network with a single hidden layer, for both training 9(a) and test 9(b) phases. Similar as for the MNIST dataset, Fig. 9 shows that the proposed BPLS algorithm matches the performance of the existing methods and requires only 6666 iterations to obtain its final solution. Similarly, most of considered existing methods require the same number epochs to saturate and letting them to continue working leads to insignificant benefits.

Refer to caption
(a) Training phase
Refer to caption
(b) Test phase
Figure 9: CA (%) versus epochs illustration, with one hidden layer (N1=60subscript𝑁160N_{1}=60italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 60) using Fashion-MNIST dataset.

Fig. 10 illustrates CA (%percent\%%) versus epochs performance of the considered algorithms for the Fashion-MNIST training dataset in a fully-connected network with a pair of hidden layers, for both training 10(a) and test 10(b) phases. Once again, the figure exhibits that introducing an additional hidden layer practically does not change the performance of the considered algorithms, and that the proposed BPLS is competitive with the best existing ones.

Refer to caption
(a) Training phase
Refer to caption
(b) Test phase
Figure 10: CA (%) versus epochs illustration, with two hidden layers (N1=100subscript𝑁1100N_{1}=100italic_N start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 100 and N2=70subscript𝑁270N_{2}=70italic_N start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 70) using Fashion-MNIST dataset.

Table I summarizes the best achieved CA (%) results (on average) for the two datasets and two considered network architectures. The results presented in the table corroborate previous claim that the proposed solution practically matches the performance of the state-of-the-art methods in both training and test phases.

TABLE I: The Best Achieved Training and Test Results for Different Image Datasets
Dataset Algorithm BPLS Adam NAG SGD AdaGrad
MNIST, L=2𝐿2L=2italic_L = 2 (training & test) 89.77 89.47 94.40 92.91 94.40 92.91 90.94 90.69 57.72 58.10
MNIST, L=3𝐿3L=3italic_L = 3 (training & test) 88.72 88.38 93.96 92.77 93.96 92.77 50.50 50.91 91.35 91.23
Fashion-MNIST, L=2𝐿2L=2italic_L = 2 (training & test) 83.25 81.23 87.37 83.94 86.93 83.94 62.30 61.89 80.71 79.75
Fashion-MNIST, L=3𝐿3L=3italic_L = 3 (training & test) 82.03 80.21 85.67 83.31 85.43 83.09 54.95 54.78 82.90 81.65

The average running times (in seconds) are exhibited in Table II. It is worth mentioning that this analysis was performed on the machine with the following characteristics: CPU: Intel Core i7-8700 CPU 3.203.203.203.20GHz ×\times× 12121212 with 64646464GB RAM and GPU: NVIDIA GeForce RTX 2070207020702070 SUPER/PCIe/SSE2222 where the Fashion-MNIST dataset was employed in a fully-connected network with two hidden layers, where batch=1batch1\text{batch}=1batch = 1 was considered for the existing algorithms. It is worth mentioning here that, by employing the existing PyTorch library, the state-of-the-art methods are fully optimized and adapted to CPU implementation. Even so, the table clearly shows the superiority of the proposed BPLS algorithm over the existing ones, obtaining its results about 20202020 times faster.

TABLE II: Average Running Time (s) of the Considered Algorithms
Algorithm BPLS Adam NAG SGD AdaGrad
Time (CPU) 0.26 4.93 4.94 4.83 4.86
Time (GPU) 0.17 4.85 4.82 4.69 4.73

To conclude the results analysis, Fig. 11 illustrates the CA (%) versus average GPU running time (s) comparison for the Fashion-MNIST test dataset in a network with two hidden layers, which summarizes the main findings of this section. Ideally, one would desire the results to lie in the upper-left corner of the figure. It is exactly where the results of the proposed BPLS algorithm can be found, while the majority of results of the existing algorithms are situated in the right superior corner, illustrating the significant difference in running times.

Refer to caption
Figure 11: CA (%) versus average GPU running time (s) per epoch comparison for the Fashion-MNIST test dataset in a fully-connected network with two hidden layers.

V Conclusions

This work proposed a novel approach to optimize weights for a fully-connected feed-forward neural network. Unlike the existing solutions that base their computation on iterative back-propagation and first-order gradient-based optimization schemes, the proposed solution optimizes the weights of each neuron in a single iteration and its solution is given in closed-form. It is founded on the least squares criterion, where a system of linear equations is derived by using back propagation according to the network architecture (going from right to left), where the solutions in the preceding layer are used to obtain the inputs in the succeeding one. The new algorithm is suitable for parallel implementation and can thus optimize (in the LS sense) the weights of all neurons in a single layer simultaneously. The presented numerical experiments show that the proposed solution matches the performance of the state-of-the-art methods in terms of accuracy and surpasses them significantly in terms of running time (around 20202020 times faster in the considered network architecture).

References

  • [1] Mohamed A. Abdou, “Literature Review: Efficient Deep Neural Networks Techniques for Medical Image Analysis” Neural Computing and Applications, vol. 34, pp. 5791–5812, February 2022.
  • [2] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Upper Saddle River, NJ, USA, 2009.
  • [3] V. Nath , S. E. Levinson. Autonomous Robotics and Deep Learning. Springer Cham, Switzerland, 2014.
  • [4] Y. E. Nesterov, “A Method of Solving a Convex Programming Problem with Convergence Rate 𝒪(1k2)𝒪1superscript𝑘2\mathcal{O}\left(\frac{1}{k^{2}}\right)caligraphic_O ( divide start_ARG 1 end_ARG start_ARG italic_k start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ),” Doklady Akademii Nauk SSSR, vol. 269, no. 3, pp. 543–547, July 1982.
  • [5] N. Qian, “On the Momentum Term in Gradient Descent Learning Algorithms,” Neural networks, vol. 12, no. 1, pp. 145–151, January 1999.
  • [6] A. Graves, “Generating Sequences with Recurrent Neural Networks,” arXiv preprint arXiv:1308.0850, 2013.
  • [7] J. Duchi, E. Hazan, and Y. Singer, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization,” The Journal of Machine Learning Research, vol. 12, pp. 2121–2159, July, 2011.
  • [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based Learning Applied to Document Recognition,” Proceedings of the IEEE, vol. 86, no. 11 pp. 2278–2324, November 1998.
  • [9] G. Hinton, L. Deng, D. Yu, et. al. “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, November 2012.
  • [10] A. Krizhevsky, I. Sutskever, G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Communications of the ACM, vol. 60, no. 6 pp. 84–90, May 2017.
  • [11] L. Deng, J. Li, J. T. Huang, et al. “Recent Advances in Deep Learning for Speech Research at Microsoft,” The 38th International Conference on Acoustics, Speech, and Signal Processing, Vancouver, BC, Canada, pp. 8604–8608, May 26–31, 2013.
  • [12] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the Importance of Initialization and Momentum in Deep Learning,” The 30th International Conference on Machine Learning, Atlanta, GA, USA, pp. 1139–1147, June 16–21, 2013.
  • [13] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” 3rd International Conference for Learning Representations, San Diego, CA, USA, pp. 1–15, May 7–9, 2015.
  • [14] Y. LeCun, C. Cortes, and C. J. C. Burges, “The MNIST Database of Handwritten Digits,”, 1999. Retrieved from http://yann.lecun.com/exdb/mnist/
  • [15] H. Xiao, K. Rasul, and R. Vollgraf, ”Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms,” arXiv preprint arXiv:1708.07747, August, 2017.