- 5 mins

# Logistic Regression

Our group was a little smaller this week, but with all the events happening and the Alumni Council visiting campus this weekend, I’m going to assume we’re still going strong! I guess any size of group works for me since every size of groups has its benefits. Anyways, this week we talked about logistic regression, the foundation of classification problems.

Next week is the weekend before mid-semester week, so I’m guessing a lot of people will not show up. I have no exams or papers (this entire semester) so I will still be there working on some projects. Feel free to stop by to chat or start thinking about projects you can start working on. I won’t be teaching much from now on, since I think understanding linear and logistic regression thoroughly gets you through the basics of machine learning.

Lastly, Grinnell Pioneer Weekend 2017 just got announced for April 7-9. This is a business-hackathon-ish event where you come up with ideas for a startup. The judges choose what they think are the best ideas and award \$2000, \$1000, and \$500 to the winning three teams. You should definitely check it out!

That’s all I have for you. See you all next weekend or after spring break!

# Summary 2017-03-03

This week, we talked about extending the concepts and framework we learned from linear regression into the realm of classification. Logistic regression is the most standard/simple method of creating a model to classify data based on qualitative categories. I would argue that most classification methods are simply extensions of logistic regression. After learning about logistic regression, you will have the tools to analyze most of the simple data you would come across.

## What is logistic regression?

Logistic regression is in a family of regression models called the Generalized Linear Models, which are extensions of the basic linear regression. The models are generally a transformation of the linear regression to fit data in different ways.

In the case of logistic regression, we use a sigmoid function on the linear model.

## What is the sigmoid function?

The sigmoid function is the following:

If you aren’t a math-person, this probably does not mean much to you, so here’s a visual to better understand it:

Notice that the output (range) of the sigmoid function is from 0 to 1 exclusive (without 0 or 1). We can then use this function to represent the probability of a data observation being a 0 or a 1, depending on how 0 and 1 are defined.

## How do we apply the sigmoid function to linear regression?

That’s simple; just apply the sigmoid function to a linear model. We learned in previous weeks that the matrix multiplication of X and theta is the vector representing the prediction of a linear regression model. Applying the sigmoid function to X*theta, we get a new hypothesis function:

Based on our training data X and parameters theta, our hypothesis function generates the probability that each observation will correspond to 1.

Now that we have a general idea of how the logistic regression works, we can figure out the cost function and which optimization algorithm to use for it.

## Can we use the same cost function as in linear regression?

For linear regression, we used:

Let’s try it! We can use randomly generated data with one feature, its square, and its cube to visualize the cost function as a function of theta_1. First, we try this with the regular linear regression:

Just as we talked about for linear regression, we get a quadratic curve with a clear global minimum. Now let’s try using the same cost function but after applying the sigmoid function:

Notice that there is a local minimum that is not necessarily the global minimum when varying theta_1. Thus, we need to adapt the cost function so we can reach the global optimum better.

## What cost function should we use instead?

Without going into too much detail, the cost function we use instead is:

I would be happy to dicuss where this cost function comes from, but I’m realizing how long this summary is already. Ask me if you’re curious.

## How should we optimize our parameters theta?

Similar to linear regression, we can use batch gradient descent to optimize our parameters. Here’s what we used in linear regression:

repeat {
theta = theta - alpha*der
}

where alpha is the step parameter and der is the derivative of the cost function. The derivative of our cost function is:

There are other optimization algorithms, but you can explore those in your own time if needed. Here are a few: conjugate gradient, BFGS.

## Final Notes

Don’t forget to feature scale before optimizing. I wasn’t very detailed about the last few steps because I realized the summary was too long already. Let me know if you have any questions about those last few steps. I can explain in detail as needed.

# Up next.

Next week, we will talk about evaluation metrics to judge how good your model is. This will be particularly useful for classification problems since there are other factors we need to consider for classification problems.