

Gradient Descent and Normal Equation.

Revisiting linear regression with multiple variables. In this post I'm going to note down how linear regression can be extended to accommodate multiple input features. In particular, I'm going to talk about Gradient Descent and the Normal Equation: two methods for determining the parameters θ that minimize the cost function.

Gradient Descent

Gradient descent is an iterative algorithm that finds a minimum of a function using its derivative. It is the simplest of the gradient methods: starting from an initialized point, it repeatedly steps in the direction of steepest descent, with the step size controlled by the parameter α.

Hypothesis: $h_\theta(x) = \theta^\top x = \theta_1 x_1 + \dots + \theta_n x_n$

Cost Function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

$x^{(i)}$ is the i-th sample (from a set of m samples) and $y^{(i)}$ is the i-th expected result.
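
To make the cost concrete, here is a minimal NumPy sketch (the function and variable names are my own, not from the post), assuming X is an m×n design matrix whose rows are the samples:

    import numpy as np

    def compute_cost(X, y, theta):
        """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
        m = X.shape[0]
        error = X @ theta - y              # h_theta(x^(i)) - y^(i) for every sample
        return (error @ error) / (2 * m)   # sum of squared errors over 2m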

Gradient Descent: repeat until convergence { $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ }

If there are multiple features in linear regression (i.e. parameters $\theta_1, \theta_2, \dots, \theta_n$), you must update every θ simultaneously. Note that if the parameter α is too large, $J(\theta)$ may not decrease on every iteration and may fail to converge. Conversely, if α is too small, convergence takes many iterations and much time. However, as the descent approaches a local minimum, the derivative $\frac{\partial}{\partial \theta_j} J(\theta)$ itself becomes smaller, so the steps naturally become more moderate even with a fixed α. You have to choose the learning rate α by trial, checking a plot of $J(\theta)$ over the iterations until it converges. A commonly tried range is, for example, α ∈ {0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1}. You have to know these characteristics of gradient descent's learning rate when you choose it.
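
Putting the update rule together, here is a minimal batch gradient descent sketch in NumPy (a sketch under my own naming; it assumes X already includes a column of ones if you want an intercept):

    import numpy as np

    def gradient_descent(X, y, alpha=0.1, num_iters=1000):
        """Batch gradient descent for linear regression.

        X: (m, n) design matrix, y: (m,) targets, alpha: learning rate.
        Returns the fitted theta and the cost J(theta) per iteration.
        """
        m, n = X.shape
        theta = np.zeros(n)
        J_history = []
        for _ in range(num_iters):
            error = X @ theta - y                # h_theta(x^(i)) - y^(i)
            gradient = (X.T @ error) / m         # dJ/dtheta_j for every j
            theta = theta - alpha * gradient     # simultaneous update of all theta_j
            J_history.append((error @ error) / (2 * m))
        return theta, J_history

Plotting J_history is exactly the convergence check described above: if the cost ever increases from one iteration to the next, α is probably too large.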

Normal Equation

There is another method for calculating θ, called the Normal Equation: an analytical solution to the linear regression problem with a least-squares cost function. In some cases it is more effective than applying Gradient Descent. In particular, when the number of features n is small, it lets you obtain θ more easily, in a single computation.

The hypothesis and cost function are the same as in the Gradient Descent section above.

Normal Equation: $\theta = (X^\top X)^{-1} X^\top y$
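
As a literal NumPy transcription (again a sketch with my own names), noting that in practice solving the linear system is numerically preferable to forming the inverse explicitly:

    import numpy as np

    def normal_equation(X, y):
        """Solve (X^T X) theta = X^T y, i.e. theta = (X^T X)^{-1} X^T y."""
        return np.linalg.solve(X.T @ X, X.T @ y)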

The Normal Equation computes the minimizing θ directly, with no need for feature scaling; writing the problem with vectors and matrices lets you obtain the unknown θ in one step. But you may wonder: "What do I do if $X^\top X$ is non-invertible (singular/degenerate)?" When using the Normal Equation for linear regression, you should check two points for this problem. First, redundant features (linearly dependent columns): delete the feature that causes the dependency. For example, suppose a linear regression for predicting home prices has two features, size in feet and size in meters; feet can be computed from meters (1 m ≈ 3.28 ft), so you should delete one of the two. Second, too many features, in particular the case m ≤ n (m is the size of the data set, n is the number of features): here you should delete some features or use regularization.
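
For the singular case, one hedged workaround in NumPy is the Moore-Penrose pseudo-inverse, which is defined even when $X^\top X$ is degenerate:

    import numpy as np

    def normal_equation_pinv(X, y):
        """Pseudo-inverse variant: still defined when X^T X is singular
        (redundant features, or m <= n)."""
        return np.linalg.pinv(X.T @ X) @ X.T @ y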

Comparing Gradient Descent and the Normal Equation

You may have a question: "Which should I choose, Gradient Descent or the Normal Equation, when I calculate θ?" Of course, each method has its advantages and disadvantages. I summarized the differences below:

Gradient Descent

  • Need to choose α
  • Need many iterations
  • Works well even when m (data size) is large

Normal Equation

  • No need to choose α
  • No need to iterate
  • Needs to compute $(X^\top X)^{-1}$
  • Slow if n (features) is very large, because the cost is $O(n^3)$
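
To see that the two methods agree in practice, here is a small sanity check on synthetic data (reusing the gradient_descent and normal_equation sketches from above; all numbers are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    m = 100
    X = np.column_stack([np.ones(m), rng.normal(size=(m, 3))])  # intercept + 3 features
    true_theta = np.array([4.0, 3.0, -2.0, 0.5])
    y = X @ true_theta + rng.normal(scale=0.1, size=m)          # noisy targets

    theta_gd, _ = gradient_descent(X, y, alpha=0.1, num_iters=5000)
    theta_ne = normal_equation(X, y)

    print(theta_gd)  # both should be close to true_theta
    print(theta_ne)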

Recall that θ is our unknown. When you need θ in linear regression with multiple features, the two methods above are the standard ways to compute it: one iteratively follows the partial derivative with respect to each parameter, the other solves a matrix-vector equation in one step. Look at your data sets and choose the method that fits your data and your problem.
