SE Can't Code

A Tokyo-based Software Engineer. Not System Engineer :(

Solving the problem of overfitting.

In Coursera's Machine Learning course, I learned about regularization as a way to avoid overfitting. There are two failure modes in machine learning: underfitting and overfitting. Underfitting refers to a model that can neither model the training data nor generalize to new data. It is not discussed as often, because it is easy to detect given a good performance metric. By contrast, overfitting refers to a model that models the training data too well. In general, training data contains noise, and fitting that noise is what makes a model overfit.

f:id:fixxman:20160929072728p:plain

from Underfitting vs. Overfitting — scikit-learn 0.18 documentation

If we have many features, the learned hypothesis may fit the training set very well (

{ \displaystyle
J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}{(h_\theta(x^i)-y^i)^2}\approx0
} ), but fail to generalize to new examples (i.e. predict values on examples it has never seen). In machine learning, what we really care about is how well the hypothesis generalizes to new examples.
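To see this concretely, here is a minimal Python sketch of my own, loosely following the scikit-learn example the figure above comes from (the data and polynomial degrees are made up for illustration, not taken from the course). A high-degree polynomial fit drives the training error toward zero while the test error gets much worse:

```python
# Overfitting demo: higher polynomial degree -> near-zero training error,
# much larger test error. Toy data, for illustration only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
x_train = np.sort(rng.rand(15, 1), axis=0)
y_train = np.cos(1.5 * np.pi * x_train).ravel() + rng.randn(15) * 0.1  # noisy samples
x_test = np.linspace(0, 1, 100).reshape(-1, 1)
y_test = np.cos(1.5 * np.pi * x_test).ravel()                          # clean targets

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    # degree 1 underfits (both errors high); degree 15 overfits
    print(f"degree={degree}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```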


Addressing Overfitting

There are two options to avoid overfitting:

  1. Reduce the number of features
  2. Regularization

Reducing the number of features means manually selecting which features to keep, or using a model selection algorithm to do it. But it also means throwing away some of the information your data carries about the problem, so it is not always a good way to deal with overfitting. In general, regularization is used instead. With regularization you keep all the features but reduce the magnitude of the parameters theta, and it works well when you have a lot of features, each of which contributes a little to predicting y.
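As a quick illustration of "keep all the features but shrink the parameters", here is a small sketch of my own using scikit-learn's Ridge (its alpha plays the role of lambda; the data is random toy data, not from the course), comparing coefficient norms of plain least squares and ridge regression:

```python
# Ridge regression keeps every feature but shrinks coefficient magnitudes
# compared with ordinary least squares. Toy data for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(1)
X = rng.randn(30, 10)                       # 30 examples, 10 features
y = X @ rng.randn(10) + rng.randn(30) * 0.5

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)         # alpha is the regularization strength

print("OLS   ||theta||:", np.linalg.norm(ols.coef_))
print("Ridge ||theta||:", np.linalg.norm(ridge.coef_))  # smaller, but all 10 features kept
```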


Cost Function of Regularization for Overfitting

Regularization adds a penalty to the cost function to reduce the complexity of the model. By penalizing the parameters theta, regularization yields a simpler hypothesis that is less prone to overfitting. We want to push the unimportant parameters, the ones that are less likely to be relevant, close to zero, but we don't know in advance which ones to pick, so we penalize all of them. The regularized cost function is:

{ \displaystyle
J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}{(h_\theta(x^i)-y^i)^2} + \lambda \sum_{j=1}^{n}\theta_j^2\right]
}

This regularization penalizes all parameters and pushes them toward zero. There is a trade-off between the two terms: the data-fitting term wants to fit the training data well, while the regularization term wants to keep the parameters small. To control this trade-off we use { \displaystyle
\lambda
}, and we have to choose its value carefully, because a regularization parameter that is too big causes underfitting.
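To make the formula concrete, here is a minimal NumPy sketch of my own (the variable names and toy data are made up) of the regularized cost for linear regression, where the hypothesis is X @ theta and theta_0 is excluded from the penalty:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = 1/(2m) * (sum of squared errors + lam * sum of theta_j^2 for j >= 1)."""
    m = len(y)
    errors = X @ theta - y                   # h_theta(x^i) - y^i for every example
    penalty = lam * np.sum(theta[1:] ** 2)   # theta_0 is conventionally not penalized
    return (np.sum(errors ** 2) + penalty) / (2 * m)

# toy usage: X already contains a leading column of ones for theta_0
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])
y = np.array([1.0, 2.0, 3.0])
print(regularized_cost(np.array([0.1, 0.9]), X, y, lam=1.0))
```

With a large lam the penalty dominates and pushes all theta_j toward zero, which is exactly the underfitting risk mentioned above.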


Regularized Linear Regression

In linear regression, we can add regularization to gradient descent by repeating the updates below (the first for theta_0, which is not regularized, and the second for the remaining parameters theta_j):

{ \displaystyle
\theta_0=\theta_0 - \alpha\frac{1}{m}{\sum_{i=1}^{m}{(h_\theta(x^i)-y^i)x_0^i}}
}

{ \displaystyle
\theta_j=\theta_j (1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^{m}{(h_\theta(x^i)-y^i)x_j^i}
}

Since { \displaystyle
1 - \alpha\frac{\lambda}{m}
} is slightly less than 1 (typically around 0.99), each update first shrinks the parameter a little and then applies the usual gradient step.
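Putting both updates together, here is a short NumPy sketch of my own (with toy data; this is not code from the course) of one regularized gradient-descent step, treating theta_0 separately:

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lam):
    """One regularized update; X has a leading column of ones for theta_0."""
    m = len(y)
    errors = X @ theta - y
    grad = (X.T @ errors) / m                   # 1/m * sum (h_theta(x^i) - y^i) x_j^i
    new_theta = theta * (1 - alpha * lam / m) - alpha * grad
    new_theta[0] = theta[0] - alpha * grad[0]   # theta_0 is updated without shrinkage
    return new_theta

# toy usage: repeat the step until convergence
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])
y = np.array([1.0, 2.0, 3.0])
theta = np.zeros(2)
for _ in range(2000):
    theta = gradient_descent_step(theta, X, y, alpha=0.1, lam=1.0)
print(theta)
```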

And we can also add regularization to the normal equation. With the design matrix X and target vector y below, the closed-form solution becomes:

{ \displaystyle
X = 
\left(
\begin{array}{c}
(x^1)^T\\
(x^2)^T\\
\vdots \\
(x^m)^T
\end{array}
\right)}

{ \displaystyle
y = 
\left(
\begin{array}{c}
y^1\\
y^2\\
\vdots \\
y^m
\end{array}
\right)}

{ \displaystyle
\theta = \left( X^TX + \lambda
\left(
\begin{array}{cccc}
0&0&\cdots&0\\
0&1&\cdots&0\\
\vdots & \vdots&\ddots & \vdots\\
0 & 0& \cdots&1
\end{array}
\right)
\right)^{-1}X^T y
}


Note that without regularization, {\displaystyle
X^TX
} can be non-invertible (singular); this happens when {\displaystyle
m\leq n
}, i.e. when there are no more training examples than features. As long as {\displaystyle
\lambda > 0
}, the regularized matrix above is invertible, so you don't need to worry about it.
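Here is a matching NumPy sketch of my own (again with toy data) of the regularized normal equation; with lam > 0 the system matrix is non-singular even when m <= n:

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """theta = (X^T X + lam * L)^(-1) X^T y, with L = diag(0, 1, ..., 1)."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0                                # theta_0 is not regularized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# toy usage: X has a leading column of ones for theta_0
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])
y = np.array([1.0, 2.0, 3.0])
print(regularized_normal_equation(X, y, lam=1.0))
```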
