SE Can't Code

A Tokyo based Software Engineer. Not System Engineer :(

Neural Network.

In coursera, I learned the Neural Network that is a data structure and machine learning algorithm that mimics brain. This is developed from simulation of network of brain neuron. You suppose example of computer vision, you are learning to recognize cars from 100×100 pixel images (not RBG), Let the features be pixel intensity values. If you train logistic regression including all the quadratic terms as features, about how many features will you have ? This answer is 50 million, this means that it's too much features for classification in nonlinear hypothesis. So it doesn't apply to logistic regression. But the Neural Network enable you to clear these complex hypothesis or complex nonlinear hypothesis.


Neuron Model

The neuron is computation units that takes many inputs and computes from inputs and sends output of computation to other neuron. You can draw below neuron model. In this image, there is not { \displaystyle
X_0
} that is the bias unit, but you can draw this extra note bias unit and sometimes not.

f:id:fixxman:20161010144412p:plain

This neuron model use the sigmoid activation function as logistic regression.

{ \displaystyle
h_\theta(x)=\frac{1}{1 + e^{-\theta^T x}}
}

{ \displaystyle
\theta = \begin{pmatrix}
\theta_0\\
\theta_1\\
\theta_2\\
\theta_3
\end{pmatrix}
,
\hspace{10px}
x = \begin{pmatrix}
x_0\\
x_1\\
x_2\\
x_3
\end{pmatrix}
}

{ \displaystyle
X_0
} is bias unit that is always 1. θ is weights parameters vector.


Neural Network and Forward Propagation

In Neural Network, this model is composed from three layer, "Input Layer", "Hidden Layer" and "Output Layer". { \displaystyle
a_i^{(j)}
} is "activation" of unit i in layer j. { \displaystyle
\Theta^{(j)}
} is matrix of weights controlling function mapping from layer j to layer j+1.

f:id:fixxman:20161010164908p:plain


We can compute this model below:

{ \displaystyle
a_1^{(2)}=g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3) \\
a_2^{(2)}=g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3) \\
a_3^{(2)}=g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3) \\
h_\Theta(x)=a_1^{(3)}=g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)})
}

If network has { \displaystyle
S_j
} units in layer j, { \displaystyle
S_j
} units in layer j+1, then { \displaystyle
\Theta^{(j)}
} will be of dimension { \displaystyle
S_{j+1}*(S_j+1)
} .

And we can compute neural network model is vectorized implementation. Replace input of logistic function to { \displaystyle
Z_{no of unit}^{(layer)}
}, as above network image you suppose compute hidden layer. So below is x and z vector:

{ \displaystyle
x = \begin{pmatrix}
x_0\\
x_1\\
x_2\\
x_3
\end{pmatrix}
,
\hspace{10px}
z^{(2)} = \begin{pmatrix}
z_1^{(2)}\\
z_2^{(2)}\\
z_3^{(2)}\\
\end{pmatrix}
}

And compute { \displaystyle
h_\theta(x)
} below:

{ \displaystyle
z^{(2)} = \Theta^{(1)}x \\
\alpha^{(2)} = g(z^{(2)})
}

and add { \displaystyle
\alpha_0^{(2)} = 1
}

{ \displaystyle
z^{(3)} = \Theta^{(2)} \alpha^{(2)} \\
h_\theta(x) = \alpha^{(3)} = g(z^{(3)})
}


It called "Forward propagation". It start of with the activation of the input-units and then we sort of forward propagation that to the hidden layer and compute the activation of the hidden layer and then we sort of forward propagation that and compute the activation of the output layer, in summary this process of computing the activation from the input then the hidden then output layer.


Cost Function of Neural Network

First, it define that { \displaystyle
L
} is total of number of layers in network and { \displaystyle
S_l
} is number of units (not counting bias unit) in layer l. In Neural Network, it make cost function of logistic regression generalize more below:

Logistic Regression's cost function

{ \displaystyle
y \in \mathbb{R}, \quad h_\theta(x) \in \mathbb{R}
}
{ \displaystyle
J(\theta) = -\frac{1}{m}\Biggl[\sum_{i=1}^{m}y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(h_\theta(x^{(i)}))\Biggr] + \frac{\lambda}{2m}\sum_{j=1}^{n}(\theta_{j})^2
}

Neural Network's cost function

{ \displaystyle
y_k \in \mathbb{R}^K,\quad h_\Theta(x) \in \mathbb{R}^K,\quad(h_\Theta(x))_k = k^{th} output\\
}
{ \displaystyle
J(\Theta) = -\frac{1}{m}\Biggl[\sum_{i=1}^{m}\sum_{k=1}^{K}y^{(i)}_k\log(h_\Theta(x^{(i)}))_k + (1-y^{(i)}_k)\log(h_\Theta(x^{(i)}))_k\Biggr] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l}\sum_{j=1}^{s_l+1}(\Theta^{(l)}_{ji})^2
}

First, it calculates sum of number K outputs in this formula. Second is regularization that calculates sum of each elements that squared. You don't need to add i=0 that is bias unit, so i starts from 1. But this is just one possible convention, and even if you were to sum over i equals 0 up to { \displaystyle
S_1
} that would work about the same and doesn't make a big difference.


Back Propagation Algorithm

As always, we have to make cost function minimum. So we can use "Back Propagation Algorithm" In Neural Network. To compute advanced optimization, it need below code:

{ \displaystyle
J(\Theta) , \quad 
\frac{\partial}{\partial \Theta^{(l)}_{ij}}J(\Theta)
}

This calculates partial differential J using { \displaystyle
\Theta
} of each of elements of each of layer that sees transition of cost J when { \displaystyle
\Theta
} is minor changed. The name of Back Propagation comes from the fact that we start by computing the delta term for the output layer and then we go back a layer and compute the delta terms for the hidden layer and then go back another step to compute any delta and so on. We're sort of back propagating the error from the output layer to other layer.
f:id:fixxman:20161010174717p:plain

Below is the procedure of back propagation, this run in loop for all training data. First, initialize { \displaystyle
\Theta
} of each of layer. For important point, you have to assign value which is selected randomly from below { \displaystyle
\epsilon
} decided already because it would be fail if it set 0 to all elements in initialization.

{ \displaystyle
-\epsilon \leq \Theta^{(l)}_{ij} \leq \epsilon
}

Second, Run the forward propagation (refer above "Neural Network and Forward Propagation").
Third, Run the back propagation. { \displaystyle
\delta_j^{(i)}
} is "error" of node j in layer l, which is in some sense going to capture our error in activation of that neural network. In back propagation, first step is calculating error on output layer.

{ \displaystyle
\delta^{(3)}_j = (a^{(3)}_j − y_j)
}

Next, calculating error on hidden layer below:

{ \displaystyle
\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)}.*g'(z^{(2)})
}

In this step, if { \displaystyle
g(z)
} is sigmoid function, { \displaystyle
g'(z)
} is expressed below:

{ \displaystyle
g'(z) = \frac{\partial}{\partial z}g(z) = g(z)(1-g(z))
}

This expression is equal to mathematically the derivative of the g function of the activation function.
Next, add error to matrix delta to accumulates error of each layer of each training sets below:

{ \displaystyle
\Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T
}

Finally, it divides error delta which is accumulated on each layer by number of training data that means partial differential of θ of j after run all training data.

{ \displaystyle
\frac{\partial}{\partial \Theta^{(l)}_{ij}}J(\Theta) = D^{(l)}_{ij} =\frac{1}{m}\Delta^{(l)}_{ij} \hspace{60pt} for\;j=0
}
{ \displaystyle
\frac{\partial}{\partial \Theta^{(l)}_{ij}}J(\Theta) = D^{(l)}_{ij} =\frac{1}{m}\Delta^{(l)}_{ij} + \frac{\lambda}{m}\Theta^{(l)}_{ij} \qquad for\;j\geq1
}

In above formula, it adds normalization to second formula (j≧1) to avoid overfitting.


Gradient Checking

There is possibility of penetrating bugs into this back propagation because this implement is too complex (especially partial differential of θ of j), so we have to confirm whether this is normal or not using gradient checking. Below formula can be used when { \displaystyle
\epsilon
} is small enough.

{ \displaystyle

\frac{\partial}{\partial \theta_1}J(x)\approx\frac{J(\theta_1+\epsilon,\theta_2,\theta_3,...,\theta_n) - J(\theta_1-\epsilon,\theta_2,\theta_3,...,\theta_n)}{2\epsilon} \\

\frac{\partial}{\partial \theta_2}J(x)\approx\frac{J(\theta_1,\theta_2+\epsilon,\theta_3,...,\theta_n) - J(\theta_1,\theta_2-\epsilon,\theta_3,...,\theta_n)}{2\epsilon} \\

\vdots \\

\frac{\partial}{\partial \theta_n}J(x)\approx\frac{J(\theta_1,\theta_2,\theta_3,...,\theta_n+\epsilon) - J(\theta_1,\theta_2,\theta_3,...,\theta_n-\epsilon)}{2\epsilon}
}

If this value closes enough partial differential value of J calculated back propagation, your can confirm that implement is normal. I'm gonna note down this implementation of back propagation:

  • Implement backprop to compute DVec (unrolled { \displaystyle
D^{(1)}, D^{(2)}, D^{(3)}
} ).
  • Implement numerical gradient check to compute gradApprox.
  • Make sure they give similar values.
  • Turn off gradient checking. Using backprop code for learning.

Important point is that be sure to disable your gradient checking code before training your classifier. If you run numerical gradient computation on every iteration of gradient descent (or in the inner loop of cost function) your code will be very slow.


Summary, Training a Neural Network

Let's wrap up! I'm gonna summarize above steps to train a neural network.

  1. Randomly initialize weights
  2. Implement forward propagation to get { \displaystyle
h_\Theta (x^{(i)})
} for any { \displaystyle
x^{(i)}
}
  3. Implement code to compute cost function { \displaystyle
J(\Theta)
}
  4. Implement backprop to compute partial derivatives { \displaystyle
\frac{\partial}{\partial \Theta^{(l)}_{jk}}J(\Theta)
}
  5. Use gradient checking to compare { \displaystyle
\frac{\partial}{\partial \Theta^{(l)}_{jk}}J(\Theta)
} computed using backpropagation vs.using numerical estimate of gradient of { \displaystyle
J(\Theta)
} . Then disable gradient checking code.
  6. Use gradient descent or advanced optimization method with backpropagation to try to minimize { \displaystyle
J(\Theta)
} as a function of parameters { \displaystyle
\Theta
}.

That's all.

Remove all ads