SE Can't Code

A Tokyo-based Software Engineer. Not a System Engineer :(

Logistic Regression.

I'm gonna note down Logistic Regression here because I learned it again in Coursera's Machine Learning course. In machine learning, Logistic Regression is often used for classification. It measures the relationship between a categorical dependent variable and one or more independent variables by estimating probabilities with the logistic function, which is the cumulative distribution function of the logistic distribution; the 0/1 label itself is modeled as a Bernoulli (binomial) outcome. Take a classification problem such as predicting whether a tumor is malignant from its size (y=1 is Yes, y=0 is No). Here y is 0 or 1, but if you use linear regression, $h_\theta(x)$ can be > 1 or < 0. That is bad because we expect y = 0 or 1. Logistic Regression is a better fit: despite its name, it is a classification algorithm that we apply when the label y is a discrete value, either 0 or 1.
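
To see the out-of-range problem concretely, here is a minimal sketch that fits an ordinary least-squares line to 0/1 labels. The tumor sizes and labels are made-up toy values, and `linear_prediction` is just an illustrative helper name.

```python
import numpy as np

# Hypothetical toy data: tumor size and label (1 = malignant, 0 = benign).
sizes = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Ordinary least-squares line fitted directly to the 0/1 labels.
slope, intercept = np.polyfit(sizes, labels, 1)

def linear_prediction(size):
    """Linear-regression output for one tumor size (not a probability)."""
    return slope * size + intercept

low = linear_prediction(0.5)    # a very small tumor
high = linear_prediction(12.0)  # a very large tumor
print(low, high)
```

On this toy data the line dips below 0 for the small tumor and goes above 1 for the large one, which is exactly the behaviour we don't want from a classifier.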

Hypothesis Representation of Logistic Regression

The Logistic Regression model is below. We want the hypothesis to stay between 0 and 1:

$$0 \le h_\theta(x) \le 1$$

$$h_\theta(x) = g(\theta^\top x)$$

Sigmoid function:

$$g(z) = \frac{1}{1+e^{-z}}$$

Substituting the sigmoid function into the model, we get the Logistic Regression model below:

$$h_\theta(x) = \frac{1}{1+e^{-\theta^\top x}}$$

Plotting this model gives the graph below, and you can see it has asymptotes at one and zero.
(Figure: the sigmoid curve, from "Logistic regression - Wikipedia, the free encyclopedia")
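
The asymptotes can also be checked numerically. This is a small sketch that evaluates the sigmoid at a few hand-picked z values (the values themselves are arbitrary choices).

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary sample points: far left, near zero, far right.
zs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
values = sigmoid(zs)

for z, g in zip(zs, values):
    print(z, g)
```

The outputs get close to 0 on the far left and close to 1 on the far right without ever reaching them, and g(0) is exactly 0.5.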

To interpret the hypothesis output: the model estimates the probability that y=1 on input x. In the example of tumor size predicting malignancy, x is given below:

$$x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumorSize} \end{bmatrix}$$

If $h_\theta(x) = 0.7$ (meaning the probability that y = 1), the doctor can tell the patient that there is a 70% chance of the tumor being malignant.

$h_\theta(x) = P(y=1 \mid x; \theta)$ means "the probability that y=1, given x, parameterized by θ", and this y is 0 or 1.

Decision Boundary of Logistic Regression

Logistic regression supposes:
predict "y=1" if $h_\theta(x) \geq 0.5$

predict "y=0" if $h_\theta(x) < 0.5$

Looking at the graph above, the horizontal axis is z and the vertical axis is g(z): $z \geq 0$ gives $g(z) \geq 0.5$, and $z < 0$ gives $g(z) < 0.5$. Since $z = \theta^\top x$, the model predicts "y=1" exactly when $\theta^\top x \geq 0$.
And we can draw a line for each choice of θ values. For example, if $h_\theta(x) = g(\theta_0 + \theta_1 X_1 + \theta_2 X_2)$ and θ=[-3, 1, 1], you predict "y=1" if $-3 + X_1 + X_2 \geq 0$. This can be rewritten as $X_1 + X_2 \geq 3$, and the line $X_1 + X_2 = 3$ is the Decision Boundary, where $h_\theta(x)$ is exactly 0.5.
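
Here is a minimal sketch of that θ=[-3, 1, 1] example. The `predict` helper and the three test points are hypothetical, chosen so that one point falls on each side of the line and one falls exactly on it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])  # theta_0, theta_1, theta_2 from the example

def predict(x1, x2):
    """Return (estimated probability that y = 1, predicted label) for one point."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    h = sigmoid(z)
    return h, int(h >= 0.5)

# Hypothetical test points: below the line, on the line, above the line.
print(predict(1.0, 1.0))  # X_1 + X_2 = 2
print(predict(1.0, 2.0))  # X_1 + X_2 = 3
print(predict(3.0, 3.0))  # X_1 + X_2 = 6
```

A point with X_1 + X_2 below 3 is labeled 0, a point above is labeled 1, and a point exactly on the line gets probability 0.5, which the "≥ 0.5" rule assigns to class 1.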

Cost Function of Logistic Regression

How do we choose the parameters θ (how do we fit the data)? In Logistic regression, the cost function is below:
If y = 1, $\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$

If y = 0, $\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$

With this log-based cost, the overall cost function J(θ) is convex and free of local optima. This is in contrast to plugging the sigmoid hypothesis into the squared-error cost of Linear regression, which gives a non-convex J(θ). If the prediction misses, you pay a higher cost.
You can capture the intuition this way: if $h_\theta(x) = 0$ (predicting $P(y=1 \mid x; \theta) = 0$) but y=1, we penalize the learning algorithm with a very large cost.
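
To put numbers on that intuition, here is a small sketch of the per-example cost. The function name `cost` and the sample hypothesis values are my own illustrative choices.

```python
import numpy as np

def cost(h, y):
    """Per-example logistic cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

# Sample hypothesis outputs: confident "yes", unsure, confident "no".
for h in (0.99, 0.5, 0.01):
    print(h, cost(h, 1), cost(h, 0))
```

When y=1, a confident correct prediction (h = 0.99) costs almost nothing, while a confident wrong one (h = 0.01) costs far more, and the cost keeps growing as h approaches 0.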

The two cases can be combined into a single expression, which is what we plug into gradient descent:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$
This is the cross-entropy error function. It can be derived from statistics using the principle of maximum likelihood estimation, which is an idea in statistics for how to efficiently find parameters for different models. We fit the parameters θ by finding min J(θ), then use them to make a prediction for a new x. To minimize J(θ), repeat the gradient descent step below, simultaneously updating all $\theta_j$:

$$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

You will notice that this algorithm looks identical to the one for linear regression. The difference between Logistic regression and Linear regression is the hypothesis:

Logistic regression : $h_\theta(x) = \frac{1}{1+e^{-\theta^\top x}}$

Linear regression : $h_\theta(x) = \theta^\top x$
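
As a sanity check on the gradient descent update, here is a minimal sketch that compares the analytic gradient of J(θ), written in vectorized form as (1/m) Xᵀ(h − y), with a finite-difference estimate. The toy data, the starting θ, and the function names are assumptions for illustration, not part of the course material.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost J(theta) averaged over all examples."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    """Vectorized form of (1/m) * sum((h - y) * x_j) for every j."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# Hypothetical toy data; the first column of ones is the intercept term x_0.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.array([0.1, -0.2])

# Central finite differences, perturbing one parameter at a time.
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(len(theta))
])
analytic = gradient(theta, X, y)
print(analytic, numeric)
```

If the two vectors agree to several decimal places, the update rule is using the right gradient for the cross-entropy cost.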

#coding: utf-8

import numpy as np
import matplotlib.pyplot as plt

def plotData(X, y):
    # indices of positive (y = 1) and negative (y = 0) examples
    positive = [i for i in range(len(y)) if y[i] == 1]
    negative = [i for i in range(len(y)) if y[i] == 0]

    plt.scatter(X[positive, 0], X[positive, 1], c='red', marker='o', label="positive")
    plt.scatter(X[negative, 0], X[negative, 1], c='blue', marker='o', label="negative")

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def safe_log(x, minval=0.0000000001):
    # clip to avoid log(0)
    return np.log(x.clip(min=minval))

def computeCost(X, y, theta):
    m = len(y)      # length of training data
    h = sigmoid(np.dot(X, theta))
    J = (1.0 / m) * np.sum(-y * safe_log(h) - (1 - y) * safe_log(1 - h))
    return J

def gradientDescent(X, y, theta, alpha, iterations):
    m = len(y)      # length of training data
    J_history = []  # cost of each update
    for i in range(iterations):
        h = sigmoid(np.dot(X, theta))
        # simultaneous update of all theta
        theta = theta - alpha * (1.0 / m) * np.dot(X.T, h - y)
        cost = computeCost(X, y, theta)
        J_history.append(cost)
        print(i, cost)
    return theta, J_history

def main():
    data = np.genfromtxt("ex2data1.txt", delimiter=",")
    X = data[:, (0, 1)]
    y = data[:, 2]
    m = len(y)

    plotData(X, y)

    # add a column of ones for the intercept term x_0
    X = X.reshape((m, 2))
    X = np.hstack((np.ones((m, 1)), X))

    # initialize parameters to 0
    theta = np.zeros(3)
    iterations = 300000
    alpha = 0.001

    # compute the initial cost
    initialCost = computeCost(X, y, theta)
    print("initial cost:", initialCost)

    # estimate parameters using gradient descent
    theta, J_history = gradientDescent(X, y, theta, alpha, iterations)
    print("theta:", theta)
    print("final cost:", J_history[-1])

    # plot decision boundary: theta0 + theta1 * x1 + theta2 * x2 = 0
    xmin, xmax = min(X[:,1]), max(X[:,1])
    xs = np.linspace(xmin, xmax, 100)
    ys = [- (theta[0] / theta[2]) - (theta[1] / theta[2]) * x for x in xs]
    plt.plot(xs, ys, 'b-', label="decision boundary")
    plt.xlim((30, 100))
    plt.ylim((30, 100))
    plt.legend()
    plt.show()

if __name__ == "__main__":
    main()

The result is below:

initial cost: 0.69314718056
theta: [-9.25573205  0.07960975  0.07329322]
final cost: 0.283686931959
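
As a rough usage sketch, the fitted θ reported above can be plugged back into the hypothesis to score a new input. The feature values (45, 85) are a hypothetical input, and `predict_probability` is an illustrative helper name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# theta copied from the result above.
theta = np.array([-9.25573205, 0.07960975, 0.07329322])

def predict_probability(x1, x2):
    """Estimated probability that y = 1 for one (x1, x2) input."""
    x = np.array([1.0, x1, x2])  # leading 1 is the intercept term x_0
    return sigmoid(theta @ x)

p = predict_probability(45.0, 85.0)  # hypothetical new input
print(p, "-> predict y =", int(p >= 0.5))
```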


Advanced Optimization of Logistic Regression

For logistic regression, there are several optimization algorithms for finding θ. The Coursera course lists gradient descent, conjugate gradient, BFGS, and L-BFGS.

I'm gonna explain the advantages and disadvantages of these algorithms other than gradient descent. They don't need you to manually pick α (the learning rate), because they have an inner loop called a line search algorithm that automatically tries out different values for the learning rate and picks a good one, so it can even pick a different learning rate for every iteration. They also often converge faster than gradient descent. The disadvantage is that they are more complex, so you shouldn't implement them yourself.
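
Instead of implementing such an optimizer yourself, you can hand the cost and gradient to a library. Below is a minimal sketch using SciPy's `scipy.optimize.minimize` with `method="BFGS"`; using SciPy is my assumption (the post doesn't name a library), and the toy data is made up and deliberately not perfectly separable so that a finite optimum exists.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # Clip so that log() stays finite even for extreme predictions.
    h = np.clip(sigmoid(X @ theta), 1e-10, 1 - 1e-10)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# Hypothetical toy data (column of ones = intercept term).
X = np.array([[1, 1.0], [1, 2.0], [1, 3.0], [1, 4.0], [1, 5.0], [1, 6.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

# No learning rate alpha to pick: the optimizer's line search handles step sizes.
result = minimize(cost, x0=np.zeros(2), args=(X, y), jac=gradient, method="BFGS")
print(result.x, result.fun)
```

Note that there is no α anywhere in this sketch, which is the main convenience described above.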

Multiclass Classification One-vs-All of Logistic Regression

Using One-vs-All (also called one-vs-rest), multiclass classification with logistic regression is easy. For example, say you wanna filter or tag email with three labels such as "Work (y=1)", "Friends (y=2)", "Hobby (y=3)". In this case, you can train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class i to predict the probability that y=i. On a new input x, to make a prediction, you pick the class i that maximizes $h_\theta^{(i)}(x)$.
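
The one-vs-all recipe can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: the helper names, the learning rate, the iteration count, and the three-cluster toy data (standing in for the three email labels) are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.1, iterations=5000):
    """Plain batch gradient descent for one binary classifier."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        theta -= alpha * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta

def train_one_vs_all(X, y, classes):
    # One classifier per class i, trained on "y == i" versus everything else.
    return {c: train_binary(X, (y == c).astype(float)) for c in classes}

def predict(thetas, x):
    # Pick the class whose classifier reports the highest probability.
    return max(thetas, key=lambda c: sigmoid(x @ thetas[c]))

# Hypothetical data: intercept column plus two features, labels 1..3.
X = np.array([[1, 0.0, 0.0], [1, 0.5, 0.2], [1, 5.0, 0.0],
              [1, 5.5, 0.3], [1, 0.0, 5.0], [1, 0.3, 5.5]])
y = np.array([1, 1, 2, 2, 3, 3])

thetas = train_one_vs_all(X, y, classes=(1, 2, 3))
print(predict(thetas, np.array([1, 5.2, 0.1])))
```

Each new input is scored by all three classifiers, and the label with the largest estimated probability wins.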