Elastic net regularization

In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods. In practice, elastic net regularization is typically more accurate than either method alone with regard to reconstruction. [1]

Specification

The elastic net method overcomes the limitations of the LASSO (least absolute shrinkage and selection operator) method which uses a penalty function based on

$\|\beta\|_1 = |\beta_1| + |\beta_2| + \cdots + |\beta_p| .$

Use of this penalty function has several limitations. [2] For example, in the "large p, small n" case (high-dimensional data with few examples), the LASSO selects at most n variables before it saturates. Also if there is a group of highly correlated variables, then the LASSO tends to select one variable from a group and ignore the others. To overcome these limitations, the elastic net adds a quadratic part ($\|\beta\|^2$) to the penalty, which when used alone is ridge regression (known also as Tikhonov regularization). The estimates from the elastic net method are defined by

$\hat{\beta} \equiv \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1 \right).$
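
As an illustration of how this objective can be minimized, the following MATLAB sketch applies cyclical coordinate descent with soft-thresholding, the strategy used by coordinate-descent solvers such as glmnet [10]. The function name, the fixed iteration count, and the absence of a convergence check are illustrative choices here, not part of any cited implementation.

    % Minimal illustrative sketch: cyclical coordinate descent for the
    % elastic net objective ||y - X*beta||^2 + lambda2*||beta||^2 + lambda1*||beta||_1.
    function beta = elastic_net_cd(X, y, lambda1, lambda2, numIters)
        [~, p] = size(X);
        beta = zeros(p, 1);
        for iter = 1:numIters
            for j = 1:p
                % Partial residual with the j-th coefficient removed.
                r = y - X*beta + X(:,j)*beta(j);
                z = X(:,j)' * r;
                % Soft-thresholding handles the L1 term; lambda2 enters the denominator.
                beta(j) = sign(z) * max(abs(z) - lambda1/2, 0) / (X(:,j)'*X(:,j) + lambda2);
            end
        end
    end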

The quadratic penalty term makes the loss function strongly convex, and it therefore has a unique minimum. The elastic net method includes the LASSO and ridge regression: in other words, each of them is a special case, obtained for $\lambda_2 = 0$ (LASSO) or $\lambda_1 = 0$ (ridge regression). Meanwhile, the naive version of the elastic net method finds an estimator in a two-stage procedure: first, for each fixed $\lambda_2$ it finds the ridge regression coefficients, and then it performs a LASSO-type shrinkage. This kind of estimation incurs a double amount of shrinkage, which leads to increased bias and poor predictions. To improve the prediction performance, the coefficients of the naive version of the elastic net are sometimes rescaled by multiplying the estimated coefficients by $(1 + \lambda_2)$. [2]
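
The two-stage interpretation can be made explicit through the augmented-data identity of Zou and Hastie [2]; the following worked equations (a sketch in the notation above, stated without proof) show why the naive elastic net is a LASSO on augmented data and where the rescaling factor comes from. Define

$X^{*} = (1+\lambda_2)^{-1/2} \begin{pmatrix} X \\ \sqrt{\lambda_2}\, I_p \end{pmatrix}, \qquad y^{*} = \begin{pmatrix} y \\ 0 \end{pmatrix}, \qquad \beta^{*} = \sqrt{1+\lambda_2}\, \beta .$

Then

$\|y^{*} - X^{*}\beta^{*}\|^2 + \frac{\lambda_1}{\sqrt{1+\lambda_2}} \|\beta^{*}\|_1 = \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1 ,$

so the naive elastic net estimate is $\hat{\beta}^{\text{naive}} = (1+\lambda_2)^{-1/2}\, \hat{\beta}^{*}$, where $\hat{\beta}^{*}$ is the LASSO solution on $(X^{*}, y^{*})$, and rescaling by $(1+\lambda_2)$ undoes the extra shrinkage.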

Examples of where the elastic net method has been applied are:

  - support vector machines [3]
  - metric learning [4]
  - portfolio optimization [5]
  - cancer prognosis [6]

Reduction to support vector machine

It was proven in 2014 that the elastic net can be reduced to the linear support vector machine. [7] A similar reduction was previously proven for the LASSO in 2014. [8] The authors showed that for every instance of the elastic net, an artificial binary classification problem can be constructed such that the hyper-plane solution of a linear support vector machine (SVM) is identical to the solution $\hat{\beta}$ (after re-scaling). The reduction immediately enables the use of highly optimized SVM solvers for elastic net problems. It also enables the use of GPU acceleration, which is often already used for large-scale SVM solvers. [9] The reduction is a simple transformation of the original data and regularization constants

$X \in \mathbb{R}^{n \times p},\; y \in \mathbb{R}^{n},\; \lambda_1 \ge 0,\; \lambda_2 \ge 0$

into new artificial data instances and a regularization constant that specify a binary classification problem and the SVM regularization constant $C$:

$X_2 \in \mathbb{R}^{2p \times n},\; y_2 \in \{-1, 1\}^{2p},\; C \ge 0 .$

Here, $y_2$ consists of binary labels $-1, 1$. When $2p > n$ it is typically faster to solve the linear SVM in the primal, whereas otherwise the dual formulation is faster. Some authors have referred to the transformation as Support Vector Elastic Net (SVEN), and provided the following MATLAB pseudo-code:

    function β = SVEN(X, y, t, λ2)
        [n, p] = size(X);
        % Build the 2p artificial instances (rows of X2), each living in R^n.
        X2 = [bsxfun(@minus, X, y./t), bsxfun(@plus, X, y./t)]';
        Y2 = [ones(p,1); -ones(p,1)];
        C = 1/(2*λ2);
        if 2*p > n
            % Primal is cheaper when there are more instances than dimensions.
            w = SVMPrimal(X2, Y2, C);
            α = C * max(1 - Y2.*(X2*w), 0);
        else
            α = SVMDual(X2, Y2, C);
        end
        % Recover the elastic net coefficients from the dual variables.
        β = t * (α(1:p) - α(p+1:2*p)) / sum(α);
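
As a sketch of how the routine might be invoked (hypothetical: it assumes SVMPrimal and SVMDual have been bound to an actual linear SVM solver, and it treats t as the L1-norm budget of the constrained elastic net formulation, an interpretation not stated explicitly above):

    % Hypothetical invocation of the SVEN pseudo-code above.
    X = randn(100, 500);                 % n = 100 samples, p = 500 features
    beta0 = [3; -2; zeros(498, 1)];      % sparse ground-truth coefficients
    y = X*beta0 + 0.1*randn(100, 1);     % noisy responses
    beta_hat = SVEN(X, y, 5, 0.1);       % t = 5 (assumed L1 budget), λ2 = 0.1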

Software

Implementations of elastic net regularized regression include:

  - "glmnet": an R package (also available as a MATLAB toolbox) that fits lasso and elastic net regularized generalized linear models via cyclical coordinate descent computed along a regularization path. [10] [11]
  - "pensim": an R package for simulation of high-dimensional data and parallelized repeated penalized regression, with support for tuning the elastic net parameters. [12] [13]
  - SVEN: a MATLAB implementation of the Support Vector Elastic Net reduction described above. [14]
  - SpaSM: a MATLAB toolbox for sparse statistical modeling that includes elastic net regularized regression. [15]
  - Apache Spark MLlib: linear regression with an elastic net mixing parameter, also exposed through pyspark.ml. [16]
  - SAS: the GLMSELECT procedure supports elastic net regularization. [17]

Related Research Articles

In machine learning, support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories, SVMs are one of the most studied models, being based on statistical learning frameworks of VC theory proposed by Vapnik and Chervonenkis (1974).

In regression analysis, least squares is a parameter estimation method based on minimizing the sum of the squares of the residuals made in the results of each individual equation.

In statistics, the Gauss–Markov theorem states that the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero. The errors do not need to be normal, nor do they need to be independent and identically distributed. The requirement that the estimator be unbiased cannot be dropped, since biased estimators exist with lower variance. See, for example, the James–Stein estimator, ridge regression, or simply any degenerate estimator.

In mathematics and computing, the Levenberg–Marquardt algorithm, also known as the damped least-squares (DLS) method, is used to solve non-linear least squares problems. These minimization problems arise especially in least squares curve fitting. The LMA interpolates between the Gauss–Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be slower than the GNA. LMA can also be viewed as Gauss–Newton using a trust region approach.

Ridge regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated. It has been used in many fields including econometrics, chemistry, and engineering. Also known as Tikhonov regularization, named for Andrey Tikhonov, it is a method of regularization of ill-posed problems. It is particularly useful to mitigate the problem of multicollinearity in linear regression, which commonly occurs in models with large numbers of parameters. In general, the method provides improved efficiency in parameter estimation problems in exchange for a tolerable amount of bias.
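
For reference, in the notation of the specification above, the ridge estimator has the closed form

$\hat{\beta}^{\text{ridge}} = (X^{\top}X + \lambda_2 I_p)^{-1} X^{\top} y ,$

which is the special case of the elastic net objective with $\lambda_1 = 0$.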

In mathematics, statistics, finance, and computer science, particularly in machine learning and inverse problems, regularization is a process that converts the answer to a problem into a simpler one. It is often used in solving ill-posed problems or to prevent overfitting.

In statistics, Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables. Poisson regression assumes the response variable Y has a Poisson distribution, and assumes the logarithm of its expected value can be modeled by a linear combination of unknown parameters. A Poisson regression model is sometimes known as a log-linear model, especially when used to model contingency tables.

In statistics, a generalized additive model (GAM) is a generalized linear model in which the linear response variable depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.

Proportional hazards models are a class of survival models in statistics. Survival models relate the time that passes, before some event occurs, to one or more covariates that may be associated with that quantity of time. In a proportional hazards model, the unique effect of a unit increase in a covariate is multiplicative with respect to the hazard rate. The hazard rate at time $t$ is the probability per short time $dt$ that an event will occur between $t$ and $t + dt$, given that up to time $t$ no event has occurred yet. For example, taking a drug may halve one's hazard rate for a stroke occurring, or, changing the material from which a manufactured component is constructed, may double its hazard rate for failure. Other types of survival models such as accelerated failure time models do not exhibit proportional hazards. The accelerated failure time model describes a situation where the biological or mechanical life history of an event is accelerated.

Bayesian linear regression is a type of conditional modeling in which the mean of one variable is described by a linear combination of other variables, with the goal of obtaining the posterior probability of the regression coefficients and ultimately allowing the out-of-sample prediction of the regressand conditional on observed values of the regressors. The simplest and most widely used version of this model is the normal linear model, in which $y$ given $X$ is distributed Gaussian. In this model, and under a particular choice of prior probabilities for the parameters—so-called conjugate priors—the posterior can be found analytically. With more arbitrarily chosen priors, the posteriors generally have to be approximated.

In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). PCR is a form of reduced rank regression. More specifically, PCR is used for estimating the unknown regression coefficients in a standard linear regression model.

In statistical theory, the field of high-dimensional statistics studies data whose dimension is larger than typically considered in classical multivariate analysis. The area arose owing to the emergence of many modern data sets in which the dimension of the data vectors may be comparable to, or even larger than, the sample size, so that justification for the use of traditional techniques, often based on asymptotic arguments with the dimension held fixed as the sample size increased, was lacking.

In statistics and machine learning, lasso is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model. The lasso method assumes that the coefficients of the linear model are sparse, meaning that few of them are non-zero. It was originally introduced in geophysics, and later by Robert Tibshirani, who coined the term.
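
In the same notation as the specification above, the lasso estimate is the special case of the elastic net objective with $\lambda_2 = 0$:

$\hat{\beta}^{\text{lasso}} = \underset{\beta}{\operatorname{argmin}} \left( \|y - X\beta\|^2 + \lambda_1 \|\beta\|_1 \right).$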

Least-squares support-vector machines (LS-SVM) for statistics and in statistical modeling, are least-squares versions of support-vector machines (SVM), which are a set of related supervised learning methods that analyze data and recognize patterns, and which are used for classification and regression analysis. In this version one finds the solution by solving a set of linear equations instead of a convex quadratic programming (QP) problem for classical SVMs. Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle. LS-SVMs are a class of kernel-based learning methods.

Proximal gradient methods for learning is an area of research in optimization and statistical learning theory which studies algorithms for a general class of convex regularization problems where the regularization penalty may not be differentiable. One such example is $\ell_1$ (lasso) regularization of the form

$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} (y_i - \langle w, x_i \rangle)^2 + \lambda \|w\|_1 .$

Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm. Reasons to use multiple kernel learning include a) the ability to select for an optimal kernel and parameters from a larger set of kernels, reducing bias due to kernel selection while allowing for more automated machine learning methods, and b) combining data from different sources that have different notions of similarity and thus require different kernels. Instead of creating a new kernel, multiple kernel algorithms can be used to combine kernels already established for each individual data source.

The de-sparsified lasso is used to construct confidence intervals and statistical tests for single or low-dimensional components of a large parameter vector in high-dimensional models.

In statistics, linear regression is a model that estimates the linear relationship between a scalar response and one or more explanatory variables. A model with exactly one explanatory variable is a simple linear regression; a model with two or more explanatory variables is a multiple linear regression. This term is distinct from multivariate linear regression, which predicts multiple correlated dependent variables rather than a single dependent variable.

Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution.

Structured sparsity regularization is a class of methods, and an area of research in statistical learning theory, that extend and generalize sparsity regularization learning methods. Both sparsity and structured sparsity regularization methods seek to exploit the assumption that the output variable to be learned can be described by a reduced number of variables in the input space $X$. Sparsity regularization methods focus on selecting the input variables that best describe the output. Structured sparsity regularization methods generalize and extend sparsity regularization methods, by allowing for optimal selection over structures like groups or networks of input variables in $X$.

References

  1. Huang, Yunfei; et al. (2019). "Traction force microscopy with optimized regularization and automated Bayesian parameter selection for comparing cells". Scientific Reports. 9 (1): 537. arXiv: 1810.05848 . Bibcode:2019NatSR...9..539H. doi: 10.1038/s41598-018-36896-x . PMC   6345967 . PMID   30679578.
  2. 1 2 Zou, Hui; Hastie, Trevor (2005). "Regularization and Variable Selection via the Elastic Net". Journal of the Royal Statistical Society, Series B. 67 (2): 301–320. CiteSeerX   10.1.1.124.4696 . doi:10.1111/j.1467-9868.2005.00503.x. S2CID   122419596.
  3. Wang, Li; Zhu, Ji; Zou, Hui (2006). "The doubly regularized support vector machine" (PDF). Statistica Sinica. 16: 589–615.
  4. Liu, Meizhu; Vemuri, Baba (2012). "A robust and efficient doubly regularized metric learning approach". Proceedings of the 12th European Conference on Computer Vision. Lecture Notes in Computer Science. Vol. Part IV. pp. 646–659. doi:10.1007/978-3-642-33765-9_46. ISBN   978-3-642-33764-2. PMC   3761969 . PMID   24013160.
  5. Shen, Weiwei; Wang, Jun; Ma, Shiqian (2014). "Doubly Regularized Portfolio with Risk Minimization". Proceedings of the AAAI Conference on Artificial Intelligence. 28: 1286–1292. doi: 10.1609/aaai.v28i1.8906 . S2CID   11017740.
  6. Milanez-Almeida, Pedro; Martins, Andrew J.; Germain, Ronald N.; Tsang, John S. (2020-02-10). "Cancer prognosis with shallow tumor RNA sequencing". Nature Medicine. 26 (2): 188–192. doi:10.1038/s41591-019-0729-3. ISSN   1546-170X. PMID   32042193. S2CID   211074147.
  7. Zhou, Quan; Chen, Wenlin; Song, Shiji; Gardner, Jacob; Weinberger, Kilian; Chen, Yixin. A Reduction of the Elastic Net to Support Vector Machines with an Application to GPU Computing. Association for the Advancement of Artificial Intelligence.
  8. Jaggi, Martin (2014). Suykens, Johan; Signoretto, Marco; Argyriou, Andreas (eds.). An Equivalence between the Lasso and Support Vector Machines. Chapman and Hall/CRC. arXiv: 1303.1152 .
  9. "GTSVM". uchicago.edu.
  10. Friedman, Jerome; Trevor Hastie; Rob Tibshirani (2010). "Regularization Paths for Generalized Linear Models via Coordinate Descent". Journal of Statistical Software. 33 (1): 1–22. doi: 10.18637/jss.v033.i01 . PMC   2929880 . PMID   20808728.
  11. "CRAN - Package glmnet". r-project.org. 22 August 2023.
  12. Waldron, L.; Pintilie, M.; Tsao, M. -S.; Shepherd, F. A.; Huttenhower, C.; Jurisica, I. (2011). "Optimized application of penalized regression methods to diverse genomic data". Bioinformatics. 27 (24): 3399–3406. doi:10.1093/bioinformatics/btr591. PMC   3232376 . PMID   22156367.
  13. "CRAN - Package pensim". r-project.org. 9 December 2022.
  14. "mlcircus / SVEN — Bitbucket". bitbucket.org.
  15. Sjöstrand, Karl; Clemmensen, Line; Einarsson, Gudmundur; Larsen, Rasmus; Ersbøll, Bjarne (2 February 2016). "SpaSM: A Matlab Toolbox for Sparse Statistical Modeling" (PDF). Journal of Statistical Software.
  16. "pyspark.ml package — PySpark 1.6.1 documentation". spark.apache.org. Retrieved 2019-04-17.
  17. "Proc Glmselect" . Retrieved 2019-05-09.
  18. "A Survey of Methods in Variable Selection and Penalized Regression" (PDF).

Further reading