This article is an attempt to summarize the book chapter “Regularization for Deep Learning” and expects readers to know the basic terms.
To start, I will answer the “what”, “why” and “how” questions (in no particular order) for each topic, sticking to the important points only.
Why do I have to read this chapter at all?
Scenario- Plotting training and validation set loss.
Training loss -> Decreasing over time
Validation loss -> flat at first, then increasing over time
Possible Conclusion → Overfitting!!!!
P.S. It happens more often than you think and IS NOT FUN!!!
Aim -> A model with lower generalization error, i.e. one that works just as well on unseen data (production :p ).
Solution -> Intelligent people said: use Regularization!
That answers the why and the what!
Now let’s focus on how we can do it!
Oh, and we are only going to discuss the topics below today-
1. Parameter Norm Penalties
In neural networks, the main learnable parameters → Weights!
We basically learn weights.
Why regularize weights?
More units = more weights = a bigger network
A bigger network = a more complex model
Occam’s razor says: prefer the simpler explanation. My headache today could be a brain tumor, but it’s more likely to be all that time I spent watching “Lord of the Rings”. In short, the simpler, the better!
But we are all about “Deep” networks! i.e. more complexity :p
Problems with weights?
The bigger the weight, the bigger its impact!
If w = 5 then w² = 25, and if w = 0.5 then w² = 0.25!
With large weights, a small change in the input leads to a big difference in the output!
A classic high-variance situation!
So, let’s try to stop the weights from getting bigger.
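To make that concrete, here is a toy sketch (my own illustration, not from the book) of how a bigger weight amplifies the same small input change:

```python
def output_shift(w, dx):
    # for a linear unit y = w * x, changing the input by dx
    # changes the output by w * dx
    return w * dx

print(output_shift(5.0, 0.1))   # ~0.5  -> big weight, big swing
print(output_shift(0.5, 0.1))   # ~0.05 -> small weight, small swing
```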
Oh and btw, we leave the biases unregularized because a) regularizing them can cause underfitting and b) each bias controls only a single variable, so it doesn’t introduce that much variance.
Also, you can put a separate penalty (with its own coefficient) on each layer rather than one penalty on all the layers combined!
7.1.1 L2 Parameter Regularization
If you see any of these names, know that they all mean the same thing:
“weight decay”, “ridge regression”, “Tikhonov regularization”
The sum of the squared weights = the (squared) L2 norm.
Idea -> Penalize it, make this value shrink!
And since we are anyway trying to minimize the loss in the objective function, why not add the penalty there and minimize both together!! Easy!
Now, we can also control how much we want to penalize, because maybe a unit genuinely needs a slightly bigger weight. The coefficient Alpha lets us do that.
Alpha = 0 → no penalty; the bigger Alpha gets, the stronger the penalty!
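Here is a minimal NumPy sketch of that idea (the helper names are mine, not from the book):

```python
import numpy as np

def l2_penalty(weights):
    # ||w||^2: sum of squared entries over every weight array
    return sum(np.sum(w ** 2) for w in weights)

def regularized_loss(data_loss, weights, alpha):
    # J_tilde(w) = J(w) + (alpha / 2) * ||w||^2
    return data_loss + 0.5 * alpha * l2_penalty(weights)

# toy usage: two weight arrays and a made-up data loss
W = [np.array([[0.5, -1.0], [2.0, 0.1]]), np.array([0.3, -0.2])]
print(regularized_loss(data_loss=1.25, weights=W, alpha=0.01))
```

In practice you rarely write this by hand; frameworks bake it in (PyTorch optimizers, for example, expose it via the weight_decay argument).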
7.1.2 L1 Regularization
Basically everything is the same, except that instead of the sum of the squares of the weights it’s,
the sum of the absolute values of the weights = L1.
It’s also called “LASSO” (least absolute shrinkage and selection operator).
Differences
L1 tends to drive many weights to exactly zero (a sparse solution, handy for feature selection), while L2 only shrinks them smoothly, so when to use which depends on the type of the problem/goal.
L1 + L2 can also be used in combination, as in the sketch below.
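A quick sketch of the L1 version, plus the combined penalty (my own helper names again; the combination is in the spirit of elastic net):

```python
import numpy as np

def l1_penalty(weights):
    # ||w||_1: sum of absolute values over every weight array
    return sum(np.sum(np.abs(w)) for w in weights)

def combined_loss(data_loss, weights, alpha1, alpha2):
    # loss + alpha1 * ||w||_1 + (alpha2 / 2) * ||w||^2
    l2 = sum(np.sum(w ** 2) for w in weights)
    return data_loss + alpha1 * l1_penalty(weights) + 0.5 * alpha2 * l2
```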
7.2 Norm Penalties as Constrained Optimization
Same as above in principle, but the methods above only “encourage” the weights to be smaller; they don’t “enforce” it. Putting an explicit constraint does.
Example: in Deep Learning assignment 2, one of the scripts had an exploding-gradient problem, and clipping the gradients solved it.
In the same way, if the norm of the weights > a given threshold, simply project the weights back down to the threshold and discard the excess.
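For the record, that kind of norm clipping looks roughly like this (a minimal NumPy sketch, not the actual assignment code):

```python
import numpy as np

def clip_by_norm(grad, threshold):
    # rescale the gradient so its L2 norm never exceeds the threshold
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])        # norm = 5
print(clip_by_norm(g, 1.0))     # rescaled to norm 1 -> [0.6 0.8]
```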
How to use it?
Force the weights to have unit norm, or cap the maximum size of the norm, or bound both the maximum and the minimum size of the norm; see the sketch below.
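The same reprojection works for the weights themselves. A sketch of the three options above (the helper name is mine):

```python
import numpy as np

def constrain_norm(w, max_norm, min_norm=None):
    # reproject w whenever its L2 norm leaves the allowed range;
    # pass min_norm = max_norm = 1.0 to force a unit norm
    norm = np.linalg.norm(w)
    if norm > max_norm:
        return w * (max_norm / norm)
    if min_norm is not None and 0 < norm < min_norm:
        return w * (min_norm / norm)
    return w
```

Frameworks offer similar hooks; Keras, for instance, ships weight constraints such as MaxNorm that reproject after each gradient update.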
Advantages over penalties-
- The penalty approach can get non-convex optimization stuck in local minima with small weights (think “dead” units that barely contribute).
- The constraint approach doesn’t keep nudging the weights toward the origin; it only kicks in once the norm actually crosses the threshold.
- You can take advantage of higher learning rates while the reprojection keeps the model stable.
References-
@book{Goodfellow-et-al-2016,
title={Deep Learning},
author={Ian Goodfellow and Yoshua Bengio and Aaron Courville},
publisher={MIT Press},
note={\url{http://www.deeplearningbook.org}},
year={2016}
}