SE Can't Code

A Tokyo-based Software Engineer. Not a System Engineer :(

Revisiting linear regression with multiple variables. I'm gonna note down here how linear regression can be extended to accommodate multiple input features. In particular, I'm gonna talk about Gradient Descent and the Normal Equation, two methods for finding the θ that minimizes the cost function.

Gradient descent is an iterative algorithm that finds a minimum of a function using its derivative. It is the simplest of the gradient methods: starting from an initialized point, it repeatedly steps in the direction of steepest descent, with the step size scaled by the parameter α.
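As a minimal illustration of the idea before moving to the multivariate case (the function f(x) = (x − 3)², the starting point, and α are my own example, not from the post):

```python
# 1D gradient descent on f(x) = (x - 3)^2, whose derivative is 2(x - 3).
# The minimum is at x = 3; each step moves against the gradient, scaled by alpha.

def gradient_descent_1d(x0, alpha=0.1, iterations=100):
    x = x0
    for _ in range(iterations):
        grad = 2 * (x - 3)    # derivative of f at the current x
        x = x - alpha * grad  # step downhill, scaled by the learning rate
    return x

print(gradient_descent_1d(x0=0.0))  # converges toward the minimum at x = 3
```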

Hypothesis: hθ(x) = θᵀx = θ₀x₀ + θ₁x₁ + … + θₙxₙ (with x₀ = 1)

Cost Function: J(θ) = (1/2m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)², where x⁽ⁱ⁾ is the i-th sample (from a set of m samples) and y⁽ⁱ⁾ is the i-th expected result.

Gradient Descent: repeat until convergence { θⱼ := θⱼ − α (1/m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾ } (simultaneously update θⱼ for j = 0, …, n)

In linear regression with multiple features (e.g. x₁, x₂, …, xₙ), you must update every θⱼ simultaneously. Also note how the learning rate α behaves: if α is too large, J(θ) may not decrease on every iteration and may not converge. Conversely, if α is too small, convergence takes many steps and a long time. Near a local minimum the derivative itself becomes smaller, so the descent naturally slows down as it approaches the minimum. You have to try several values and choose the learning rate α by checking a plot of J(θ) over the iterations until it converges. In general, a range such as α ∈ {0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1} is often tried; this is just an example. You have to know these characteristics of the learning rate when you choose it.
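The update rule above can be sketched with NumPy as a minimal batch gradient descent. The toy data and hyperparameters here are illustrative assumptions, not from the post:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for linear regression.
    X: (m, n) feature matrix without the bias column; y: (m,) targets."""
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])  # prepend x0 = 1 for the intercept term
    theta = np.zeros(Xb.shape[1])
    for _ in range(iterations):
        errors = Xb @ theta - y               # h_theta(x^(i)) - y^(i) for all i
        theta -= alpha * (Xb.T @ errors) / m  # simultaneous update of every theta_j
    return theta

# Toy data generated from y = 1 + 2x, so theta should approach [1, 2].
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent(X, y))
```

In practice you would also record J(θ) each iteration and plot it, which is exactly how you confirm that your chosen α converges.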

Normal Equation

There is another method for calculating θ, called the Normal Equation: an analytical solution to the linear regression problem with a least-squares cost function. In some cases it is more effective than applying Gradient Descent. In particular, if you have a small number of features n, it enables you to get θ more easily.


Normal Equation: θ = (XᵀX)⁻¹Xᵀy

You can compute the θ that minimizes J(θ) using the Normal Equation, without feature scaling. Arranging the samples into a matrix X and the targets into a vector y lets you solve for the unknown θ directly. But you may think, "What will I do if XᵀX is non-invertible (singular/degenerate)?" When using the Normal Equation for linear regression, you should check two points for this problem. First, redundant features (linearly dependent features); if you find this problem, you should delete the feature that causes it. For example, suppose there are two features, size in feet and size in meters, for predicting home prices with linear regression. Feet can be represented by meters (1 m = 3.28 feet), so if you have both features when you use the Normal Equation, you should delete one of them. Second, too many features, especially the m ≤ n case (m is the size of the data set, n is the number of features). In this case, you should delete some features or use regularization.
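A minimal sketch of the Normal Equation with NumPy (the toy data is an illustrative assumption; note that `np.linalg.solve` raises an error when XᵀX is singular, which is exactly the redundant-features / m ≤ n situation described above):

```python
import numpy as np

def normal_equation(X, y):
    """Solve theta = (X^T X)^(-1) X^T y for linear regression."""
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])  # prepend x0 = 1 for the intercept
    # solve() avoids forming the inverse explicitly; it raises LinAlgError
    # if X^T X is singular (redundant features or m <= n).
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Same toy data as before, generated from y = 1 + 2x.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(normal_equation(X, y))  # recovers theta ≈ [1, 2] in one step, no iterations
```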

Compare Gradient Descent and Normal Equation

You may have a question: "Which should I choose, Gradient Descent or the Normal Equation, when I'm gonna calculate θ?" Of course, each method has its good points and bad points. I summarized the differences below:

Gradient Descent:
• Need to choose the learning rate α
• Needs many iterations
• Works well even if n (features) is very large

Normal Equation:
• No need to choose α
• No iterations
• Need to compute (XᵀX)⁻¹
• Slow if n (features) is very large, because the order of this computation is O(n³)
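As a quick sanity check of the comparison above, both methods should land on (approximately) the same θ for the same data. A sketch, where the random data and hyperparameters are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.normal(size=(m, n))
true_theta = np.array([0.5, 1.0, -2.0, 3.0])  # bias + three weights (made up)
Xb = np.hstack([np.ones((m, 1)), X])          # bias column x0 = 1
y = Xb @ true_theta                           # noiseless targets

# Normal Equation: one direct solve, no alpha, no iterations.
theta_ne = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Gradient Descent: many cheap iterative updates, needs alpha.
theta_gd = np.zeros(n + 1)
for _ in range(5000):
    theta_gd -= 0.1 * (Xb.T @ (Xb @ theta_gd - y)) / m

print(np.allclose(theta_gd, theta_ne, atol=1e-6))  # both methods agree on theta
```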