Week 2 Summary - Regularization- 6 mins
Given how hard it is to keep members engaged in groups on campus, I’m veeery surprised so many of you showed up again. Maybe I should hold clubs at Grinnell to a higher standard. Anyways, thanks to everyone for showing up again this week. For those who couldn’t make it this week, hopefully you can make it to the next meetings. This week we talked about adding regularization terms to regression models. Many of us are also going to the hackathon at ISU next weekend (HackISU), so we discussed potential ideas for products. We can use some of those ideas in the future for projects when we meet as well.
Since HackISU is next weekend and I will be there, we won’t be meeting next week. Feel free to use the space (JRC202) to work on data science related projects or to study–the room is reserved from 2-3pm. I will send a reminder email for the week after that. See you all in two weeks!
This week, we talked about regularization. In all of our statistics courses at Grinnell, there is no mention about regularization, but it is an essential concept in data science. You’ll see it used in logistic regression models, artificial neural networks, and many other algorithms. We can add a regularization for linear regressions as well, but as we’ll find, it is not as useful for simple linear regressions.
What is regularization?
According to Wikipedia, regularization “refers to a process introducing additional information in order to solve an ill-posed problem or to prevent overfitting.” In other words, we use regularization terms to ensure that when we train a model, we do not overfit the model. If you have not heard of the overfitting problem, you will hear about it a lot in the context of data science. Given that we have some amount of data, it’s not too difficult to fit a model that catches every single point exactly (e.g. for a data set with 10 points, use a 10th degree polynomial!). In predictive modeling, however, our goal is to build a model that can predict data the model has not seen yet. Regularization is what we use to make our models a little more robust in the context of overfitting. We can implement regularization into any supervised learning model.
So how do we do this?
Last week, we talked about the structure of machine learning problems. The regularization term will be added to the steps concerning the cost function and the optimization of our model parameters. To be less abstract, let’s see how this applies to the linear regression we learned last week!
First, let’s start with how we apply regularization into the cost function. Recall from last week that the cost function for linear regression models is:
We penalize the cost function by adding a regularization that we can adjust using a parameter
lambda greater than or equal to 0. The new cost function is:
Notice that we added an extra summation term. This is the regularization. Notice that if we increase lambda to infinity, our overfit model will be trivial and if we decrease lambda to 0, the regularization is trivial leaving the overfit model. Based on the extremes, we can see that the regularization actually accounts for overfitting. Thus, we can think of the cost function as a balance between the linear regression cost function, which is an overfitting term, and the regularization term, which is an underfitting term.
NOTE: In some supervised learning algorithms, we use
C, which is the inverse of
C is then multiplied by the model instead of the regularization term.
Next, we apply the regularization term to the optimization of the model parameters. Recall from last week that we change our model parameters based on the following:
where the partial derivative of our cost function is
Since we have updated our cost function to include the regularization term, taking the derivative of the new cost function, we get
We can then substitute this partial derivative to update our parameters. We will talk about how to choose
lambda properly in the future. Now that we have added our regularization term, we can hopefully get a better model with out-of-sample predictability!
But wait–there’s more (RIP Billy Mays). Let’s consider a case where our parameters vector
theta has parameters with vastly different values. For example, let us consider using a linear regression to predict the values of houses. Some obvious features we might use to predict house price are house size, room numbers, and number of bathrooms. Then we might have the following:
Running a simple linear regression gives us the following values:
theta_0 = -68872.3 theta_1 = 160.6 theta_2 = 8539.7
Do you see the problem here? When we use add the regularization term, it will put a larger weight on
theta_0 and then
theta_1 and lastly
theta_2. However, there is nothing special about
theta_0 other than the fact that the scale used for its respective feature is larger. So how do we go about fixing this problem? We use feature scaling. This is another way of saying normalization. You can just create t statistics out of out feature in the data set and feed that into the algorithm to weigh each feature equally.
There are other ways of feature scaling, but I personally like this one.
NOTE: Feature scaling actually helps run the gradient descent algorithm for linear regression run faster as well. Think about why that might be.
If you include everything we have talked about from last week and this week, you have covered the general structure of most supervised learning algorithms and have learned how to build one of them, linear regression. You will actually find that regularization has little effect on strictly linear regressions, but as we explore more complicated models, regularization will work wonders.
Like I mentioned, we will not be meeting next week because of the hackathon. The following week (week 4), I will talk about logistic regression! See you all in two weeks.