In computer science, regularization is the addition of information in order to solve an ill-posed problem. It is also an approach that helps address over-fitting. In machine learning, regularization refers to any modification made to a learning algorithm that is intended to reduce its generalization error but not its training error.
Dealing with over-fitting is a key concern in machine learning. An over-fitted model produces unpredictable output and has low accuracy on new data. This usually happens because the model picks up data points that do not represent the actual properties of the target data. Such points amount to noise, which is common in training datasets.
In this case, the function fails to capture the underlying pattern in the dataset because it has been trained to predict the given target values exactly, including the noise present at individual data points. As a result, the function is likely to show little or no error on the training set.
However, when it is used to make predictions on test datasets, the errors are likely to be large. Simply put, the model does not perform as well on a novel test dataset as it did on the training dataset.
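The gap between training and test error described above can be reproduced with a small experiment. The sketch below is illustrative only: the sine-shaped signal, the noise level, the sample sizes, and the polynomial degrees are all assumptions chosen for the demonstration, not values from this article.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

def true_signal(x):
    # Hypothetical "real" relationship that the model should recover.
    return np.sin(2 * np.pi * x)

# A small noisy training set and a larger test set from the same process.
x_train = rng.uniform(0.0, 1.0, size=15)
y_train = true_signal(x_train) + rng.normal(0.0, 0.3, size=15)
x_test = rng.uniform(0.0, 1.0, size=200)
y_test = true_signal(x_test) + rng.normal(0.0, 0.3, size=200)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

# Least-squares polynomial fits of low and high flexibility.
simple_model = Polynomial.fit(x_train, y_train, deg=3)
flexible_model = Polynomial.fit(x_train, y_train, deg=10)

for name, model in [("degree 3", simple_model), ("degree 10", flexible_model)]:
    print(name,
          "| train MSE:", mse(model, x_train, y_train),
          "| test MSE:", mse(model, x_test, y_test))
```

The more flexible polynomial can never have a higher training error than the simpler one, because it contains the simpler one as a special case. Its test error is typically much worse, which is the signature of over-fitting.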
Thus, there is a need to correct this, and that is where regularization comes in. It involves adding an extra penalty term to the error function. The penalty is scaled by a regularization parameter, lambda (λ).
The penalty grows as the model becomes more complex, so higher-order terms are penalized more heavily. This reduces the weight given to those terms and brings the model to a less complex state.
This extra term keeps the function from fluctuating excessively by preventing the coefficients from taking extreme values. Keeping the coefficients small is critical in reducing the model's error on new data.
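In general form, the regularized error function described here can be written as follows. The names "error" and "penalty" are placeholders introduced for illustration; the concrete choices for each appear later in the article.

```latex
\text{cost}(\beta) \;=\; \text{error}(\beta) \;+\; \lambda \cdot \text{penalty}(\beta),
\qquad \lambda \ge 0
```

Setting λ = 0 removes the penalty entirely, while larger values of λ make large coefficients more costly.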
Methods of this kind are known as shrinkage methods. In neural networks, the same idea is called weight decay. Besides regularization, several other methods help address over-fitting.
One of them is increasing the size of the training dataset. However, do not confuse increasing the number of observations with adding features to the same dataset. Increasing the number of features (columns) can make the complexity grow exponentially to levels that are difficult to manage, resulting in below-par performance. In contrast, increasing the number of rows produces a richer dataset and more realistic results.
Before tackling the two main regularization techniques, remember that you can handle over-fitting with other approaches such as stepwise regression, pruning, reducing the number of features, or cross-validation. However, these methods work best with a small set of features, where feature selection is simple.
When there are many features, the regularization techniques below become very handy.
Below we examine popular regularization techniques and what differentiates them. First, note that having many features in your dataset leads to creating a complex model. Thus, the need to create a less complex model calls for the use of the regularization techniques, which are essential in addressing over-fitting.
These techniques fall into two categories, namely L1 and L2 regularization. A regression model that uses the L2 penalty is called ridge regression, whereas one that uses the L1 penalty is called Lasso regression.
The differentiating factor between the two techniques is the penalty term. Lasso can also shrink the coefficients of less valuable features all the way to zero, which removes those features from the model. This makes it well suited to feature selection when there is a considerable number of features.
Linear regression is represented by the equation below, where β denotes the coefficients of the different predictors or variables (X), and Y denotes the response whose relation to the predictors is learned.
Y ≈ β0 + β1X1 + β2X2 + … + βpXp
The fitting process uses a loss function called RSS, the residual sum of squares. The coefficients are chosen to minimize this loss function. However, the resulting coefficients depend heavily on the training data.
If there is noise in the training data, the estimated coefficients will not generalize reliably to test data or to future data.
This weakness is what regularization addresses: it shrinks the learned estimates toward zero.
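To make the RSS concrete, the sketch below computes it for a least-squares fit using NumPy. The data are synthetic and the "true" coefficients are assumptions made up for the example; `rss` is a helper name introduced here, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 50 observations, 3 predictors, known coefficients plus noise.
n, p = 50, 3
X = rng.normal(size=(n, p))
true_beta = np.array([1.5, -2.0, 0.5])
y = 4.0 + X @ true_beta + rng.normal(0.0, 0.5, size=n)

def rss(beta0, beta, X, y):
    """Residual sum of squares for intercept beta0 and coefficients beta."""
    residuals = y - (beta0 + X @ beta)
    return float(residuals @ residuals)

# Ordinary least squares: prepend a column of ones so the intercept is fitted too.
X1 = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
beta0_hat, beta_hat = coef[0], coef[1:]

print("OLS estimates:", beta0_hat, beta_hat)
print("RSS at the OLS estimates:", rss(beta0_hat, beta_hat, X, y))
print("RSS after nudging the coefficients:", rss(beta0_hat, beta_hat + 0.1, X, y))
```

Any change to the least-squares coefficients increases the RSS on the training data, which is exactly why those coefficients track the training data, noise included.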
Lasso is an abbreviation for Least Absolute Shrinkage and Selection Operator. It works by adding the absolute value of each coefficient's magnitude to the loss function as the penalty term.
Note that when lambda is zero, we get back ordinary least squares (OLS). When lambda is very large, the coefficients become zero, and the model under-fits.
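In standard notation, with n observations and p predictors, the lasso cost function described above is usually written as follows. The indices i and j and the symbols x_ij and y_i follow common textbook usage and are assumptions rather than notation taken from this article.

```latex
\sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\;=\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```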
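The two extremes of lambda can be checked numerically. The sketch below minimizes RSS + λ Σ|βj| with a plain coordinate-descent loop and soft-thresholding, which is one common way to solve the lasso and is swapped in here purely for illustration. The data, the sparse "true" coefficients, and the helper names are assumptions, and the data are centered so that no intercept is needed.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if |value| <= threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=500):
    """Minimize RSS + lam * sum(|beta_j|) for centered X and y (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)  # x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution removed.
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
# Hypothetical truth: only the first two features matter.
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0]) + rng.normal(0.0, 0.5, size=n)
y -= y.mean()

ols, *_ = np.linalg.lstsq(X, y, rcond=None)
# At or above this value every coefficient is zero for this objective.
lam_max = 2.0 * np.max(np.abs(X.T @ y))

beta_no_penalty = lasso_coordinate_descent(X, y, lam=0.0)
beta_moderate = lasso_coordinate_descent(X, y, lam=0.1 * lam_max)
beta_huge = lasso_coordinate_descent(X, y, lam=10.0 * lam_max)

print("OLS:            ", ols)
print("lasso, lam = 0: ", beta_no_penalty)
print("lasso, moderate:", beta_moderate)
print("lasso, huge:    ", beta_huge)
```

With λ = 0 the loop converges to the OLS solution, and with a sufficiently large λ every coefficient is exactly zero, so the model predicts only the mean and under-fits.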
The way lasso penalizes large coefficients is the principal difference between lasso regression and ridge regression: it bases its penalty on the modulus |βj| rather than the square of βj. This is known as the L1 norm.
In essence, lasso solves a problem in which the sum of the moduli of the coefficients, Σ|βj|, is less than or equal to s. Ridge regression, on the other hand, solves one in which the sum of the squared coefficients, Σβj², is less than or equal to s.
Here, s is a constant that exists for every value of the shrinkage factor λ. These inequalities are known as the constraint functions.
With two parameters, the lasso coefficients have the smallest RSS (loss function) among all points that lie within the diamond given by |β1| + |β2| ≤ s.
With ridge regression, the coefficients have the smallest RSS among all points that lie within the circle given by the following equation.
β1² + β2² ≤ s
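For a general number of coefficients p, the two constrained problems described above can be written side by side. This is the standard textbook form, given here as a sketch rather than a reproduction of the original figures.

```latex
\begin{aligned}
\text{lasso:}\quad & \min_{\beta}\ \mathrm{RSS}
  \quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le s \\
\text{ridge:}\quad & \min_{\beta}\ \mathrm{RSS}
  \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^{2} \le s
\end{aligned}
```

With p = 2, the first constraint region is the diamond and the second is the circle.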
In the ridge cost function, the term added to the RSS represents L2 regularization. The penalty term is the squared magnitude of each coefficient, added to the loss function.
If lambda is very large, the penalty adds too much weight, resulting in under-fitting. If lambda is zero, we get back OLS. Choosing lambda well is therefore critical; when it is chosen well, ridge regression is an effective way of dealing with over-fitting.
In ridge regression, the RSS is modified by adding the shrinkage quantity.
The coefficients are then estimated by minimizing this modified function. The tuning parameter (λ) decides how strongly the flexibility of the model is penalized.
A model's flexibility increases as its coefficients increase. Minimizing the modified function therefore means keeping the coefficients as small as possible. This is how ridge regression prevents coefficients from reaching very large values.
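Written out in standard notation, the modified RSS that ridge regression minimizes is usually given as follows. The indices and the symbols x_ij and y_i follow common textbook usage and are assumptions rather than notation taken from this article.

```latex
\sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}
\;+\; \lambda \sum_{j=1}^{p} \beta_j^{2}
\;=\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^{2}
```

The sum in the penalty starts at j = 1, so the intercept β0 is not penalized.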
The intercept, β0, is a measure of the mean value of the response when xi1 = xi2 = … = xip = 0.
The estimated association of each variable with the response is shrunk, but the intercept is not.
The estimates from ridge regression are equivalent to those of least squares when λ = 0, because the penalty term then has no effect.
However, as λ grows toward infinity, the effect of the shrinkage penalty increases, and the ridge regression coefficient estimates approach zero. Selecting a good value of λ is therefore very important, and cross-validation is useful for this. Note that the ridge penalty is based on the L2 norm of the coefficients.
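Both limits can be verified with the closed-form ridge solution. The sketch below is a minimal illustration on synthetic data; `ridge_fit`, the data, and the grid of λ values are assumptions introduced for the example. The intercept is left unpenalized by centering the data first.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Return (intercept, coefficients) minimizing RSS + lam * sum(beta_j ** 2).

    The intercept is not penalized: the data are centered first and the
    intercept is recovered from the means afterwards.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    return y_mean - x_mean @ beta, beta

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(0.0, 0.5, size=60)

lambdas = [0.0, 1.0, 10.0, 100.0, 1e6]
norms = []
for lam in lambdas:
    _, beta = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    print(f"lambda={lam:g}  coefficients={np.round(beta, 4)}")
```

At λ = 0 the result matches least squares, and the length of the coefficient vector shrinks steadily toward zero as λ grows.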
The coefficients produced by the standard least-squares method are scale equivariant. This means that if we multiply a predictor by a constant w, the corresponding coefficient is scaled by a factor of 1/w.
In other words, the product of coefficient and predictor, Xjβj, stays the same regardless of how the predictor is scaled. Unfortunately, this does not hold for ridge regression. The predictors need to be standardized, that is, brought to the same scale, before ridge regression is carried out.
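The difference in behavior under rescaling is easy to check. In the sketch below, one predictor is multiplied by w = 10, as if it had been recorded in different units. The data, the value λ = 10, and the helper names are assumptions for the demonstration, and the data are centered so that no intercept is needed.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares coefficients for centered data (no intercept)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ridge_fit_centered(X, y, lam):
    """Ridge coefficients for centered data (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
X -= X.mean(axis=0)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 0.5, size=100)
y -= y.mean()

w = 10.0
X_scaled = X.copy()
X_scaled[:, 0] *= w  # the same predictor expressed on a different scale

ols_before, ols_after = ols_fit(X, y), ols_fit(X_scaled, y)
ridge_before = ridge_fit_centered(X, y, 10.0)
ridge_after = ridge_fit_centered(X_scaled, y, 10.0)

print("OLS:   old coefficient / w =", ols_before[0] / w, "| new =", ols_after[0])
print("ridge: old coefficient / w =", ridge_before[0] / w, "| new =", ridge_after[0])
```

For least squares the two numbers agree, so the fitted contribution Xjβj is unchanged. For ridge regression they differ, because the penalty treats the rescaled coefficient differently.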
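A standard way to do this is to divide each predictor by its estimated standard deviation, so that every standardized predictor has a standard deviation of one.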
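One commonly used standardization formula, given here as an assumption about what is meant since the formula itself is not reproduced in this text, divides each value by the standard deviation of its predictor:

```latex
\tilde{x}_{ij} \;=\; \frac{x_{ij}}{\sqrt{\dfrac{1}{n} \sum_{i=1}^{n} \bigl( x_{ij} - \bar{x}_j \bigr)^{2}}}
```

Here x̄j is the mean of predictor j over the n observations, and the denominator is its estimated standard deviation.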
Model interpretability is a known disadvantage of ridge regression. Although it significantly shrinks the coefficients of the less important predictors, it never makes them exactly zero, so all predictors are included in the final model. With lasso, in contrast, the L1 penalty can force some of the coefficient estimates to exactly zero, and more of them as the tuning parameter increases. The lasso therefore also performs variable selection and often leads to sparse models.
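The contrast is clearest in the special case where the predictors are orthonormal, because each coefficient can then be treated on its own. The sketch below assumes that special case, along with made-up least-squares estimates and λ = 1; it illustrates the two shrinkage rules and is not a general solver.

```python
import numpy as np

# Hypothetical least-squares estimates for four predictors, assuming the
# predictors are orthonormal (X^T X = I) so each coefficient is independent.
beta_ols = np.array([3.0, -1.5, 0.4, -0.2])
lam = 1.0

# Ridge (RSS + lam * sum(beta^2)): every coefficient is divided by (1 + lam).
beta_ridge = beta_ols / (1.0 + lam)

# Lasso (RSS + lam * sum(|beta|)): soft-thresholding at lam / 2.
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2.0, 0.0)

print("ridge:", beta_ridge)  # all four coefficients shrunk, none exactly zero
print("lasso:", beta_lasso)  # the two smallest coefficients become exactly zero
```

Ridge scales every coefficient by the same factor, so small coefficients stay small but non-zero. Lasso subtracts a fixed amount from each magnitude, so any coefficient smaller than that amount is removed from the model.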
The value of λ should be selected with great care, because this tuning parameter controls the impact of regularization on both variance and bias. As the tuning parameter increases, the coefficients shrink, which reduces the variance.
Up to a point, increasing the tuning parameter is beneficial, because it reduces only the variance, which is at the core of avoiding over-fitting, without losing important properties of the data.
However, once the tuning parameter increases beyond a certain threshold, the model begins to lose important properties. This introduces bias into the model, and the result is under-fitting.
Regularization is essential for reducing a model's variance without substantially increasing its bias. It is simply a corrective for models that do not generalize well to datasets other than their training data. Such variance is common in standard least-squares models.
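One practical way to find a value of λ between these two failure modes is k-fold cross-validation over a grid of candidates. The sketch below uses closed-form ridge regression on synthetic data; the grid, the number of folds, the data, and the helper names are all assumptions chosen for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge fit with an unpenalized intercept; returns (intercept, coefficients)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ beta, beta

def cv_mse(X, y, lam, k=5):
    """Mean squared validation error of ridge regression over k folds."""
    indices = np.arange(len(y))
    errors = []
    for fold in np.array_split(indices, k):
        train = np.setdiff1d(indices, fold)
        b0, beta = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[fold] - (b0 + X[fold] @ beta)) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(5)
n, p = 40, 20  # few observations relative to features, so variance matters
X = rng.normal(size=(n, p))
true_beta = np.concatenate([[2.0, -1.5, 1.0], np.zeros(p - 3)])
y = X @ true_beta + rng.normal(0.0, 1.0, size=n)

lambdas = np.logspace(-3, 4, 15)
cv_errors = np.array([cv_mse(X, y, lam) for lam in lambdas])
best_lambda = lambdas[np.argmin(cv_errors)]

for lam, err in zip(lambdas, cv_errors):
    print(f"lambda={lam:10.3f}  CV MSE={err:.3f}")
print("selected lambda:", best_lambda)
```

The validation error at the largest λ in the grid is dominated by bias, since the coefficients have been shrunk almost to zero. The selected λ is simply the grid value with the lowest validation error.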