Week 1 Summary - Linear Regression


Data Science Introduction and Linear Regression

This week, we talked about the basics of data science, machine learning, and the implementation of linear regression from scratch. Admittedly, I did a poor job structuring how I talked about linear regression, so here's a better, more organized summary of what we covered. Next week, we will talk about two concepts used in regression models called feature scaling and regularization. Here's a nice link if you want to learn about them before next week: https://datanice.github.io/machine-learning-101-what-is-regularization-interactive.html

Let me know if you have any questions.

Summary 2017-02-09

What is data science?

Krit talked about how it's pretty broad and vague, but, simply put, it is the intersection of statistics and computer science. I would argue that it currently incorporates a heavy business aspect as well, though with the growing number of applications of data science, that is quickly changing.

What do data scientists do? What structure do they follow?

I feel relatively comfortable teaching the machine learning components, so that's what you'll be hearing about from me.

What are some basic methods data scientists use?

Data scientists use a large variety of tools in the different steps mentioned previously, but a lot of data science education focuses on the data analysis (machine learning) step. The most well-known methods are classified as either supervised learning or unsupervised learning. A few supervised learning methods are linear regression, logistic regression, artificial neural networks (deep learning), and support vector machines. A few unsupervised learning methods are principal component analysis, clustering, and anomaly detection.

We will start by learning about linear regression and implementing it from scratch, but first, let's talk about the structure shared by many machine learning algorithms and how they arrive at the optimal parameters for a given model.

What is the structure of machine learning problems?

All machine learning problems have the same basic structure (from what I've seen):
1. Choose a hypothesis, a model with parameters theta that maps inputs to predictions.
2. Define a cost function that measures how poorly the hypothesis fits the data.
3. Use an optimization method to find the parameter values that minimize the cost.

How do we implement a linear regression from scratch?

Before going into the details, here are some labels to make our lives easier:
m = number of observations
n = number of independent variables
X = an m by (n+1) matrix of all the values for each observation and independent variable (aka features)
NOTE: X is m by (n+1) because we add a column of 1s as the first column. Ask me if you're curious why.
Y = an m by 1 matrix of all the values for the response variable
superscript i = indicates the ith observation (y^i = response value of the ith observation)
subscript j = indicates the jth variable
x_j = jth column of X
x^i = ith row of X
theta = vector containing all the parameters of a model
theta_0 = intercept of the model
We will study how to implement the regression model using one feature (n = 1). Now we can get into how to run a linear regression.
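To make the notation concrete, here is a minimal NumPy sketch of how the pieces might be set up for the one-feature case. The data values here are made up purely for illustration:

```python
import numpy as np

# Made-up data: m = 5 observations of a single feature (n = 1)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # feature values
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # response values
m = len(x)

# X is m by (n+1): a column of 1s (for the intercept) next to the feature column
X = np.column_stack([np.ones(m), x])

# theta holds the parameters: theta[0] is the intercept, theta[1] is the slope
theta = np.zeros(2)
```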

Initial trial

Suppose we ran a straight horizontal line through the following points (see below). We call the equation of this line h_theta(x). That looks pretty terrible, so we can guess that the cost will be high.
[Figure: Initial Linear Regression, a horizontal line through the data points]
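In code, the hypothesis is just h_theta(x) = theta_0 + theta_1 * x, which the matrix setup above computes for every observation at once. A small sketch of the initial horizontal line, assuming (as one arbitrary choice) that it sits at the mean of Y:

```python
def h(theta, X):
    """Hypothesis h_theta(x): the predicted response for every row of X.
    Because of the column of 1s, this is theta_0 + theta_1 * x for each observation."""
    return X @ theta

# A horizontal line has slope theta_1 = 0; placing it at the mean of Y is an arbitrary choice
theta_initial = np.array([Y.mean(), 0.0])
predictions = h(theta_initial, X)  # the same value for every observation
```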

Cost Function J

So what does the cost look like? For a linear regression, the cost function is J(theta_0, theta_1). Here theta_0 corresponds to the intercept and theta_1 corresponds to the slope of the linear model. This cost function measures the distance between the actual points and the line:
J(theta_0, theta_1) = (1/(2m)) * sum from i = 1 to m of (h_theta(x^i) - y^i)^2
We square each residual, sum them up, and divide by m, the number of points, to get the average of the squared residuals.
NOTE: we also divide by 2 because of convention. There is nothing special about the 2 other than it cancels nicely when we take the derivative later.
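Here is a direct, loop-based sketch of the cost function, reusing the X, Y, and theta_initial from the snippets above:

```python
def cost(theta, X, Y):
    """J(theta_0, theta_1): sum of squared residuals divided by 2m."""
    m = len(Y)
    total = 0.0
    for i in range(m):
        residual = (theta[0] + theta[1] * X[i, 1]) - Y[i]  # h_theta(x^i) - y^i
        total += residual ** 2
    return total / (2 * m)

print(cost(theta_initial, X, Y))  # large, because the horizontal line fits the points poorly
```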

Reducing the Cost Function J

Notice that the cost function J is a sum of squared terms, so if you plot J as a function of any single theta, you get a parabola:
[Figure: Cost Function Graph, J plotted against theta as a parabola]
Thus, we want to change our theta values so that J gets closer and closer to the minimum of the parabola. This idea of iterating over different values of theta until we reach the minimum (convergence) is handled by an optimization method. The one we will use is called gradient descent, or steepest descent. We update the parameters theta using the following equation:
theta_j := theta_j - alpha * (partial derivative of J(theta_0, theta_1) with respect to theta_j)
NOTE: each theta needs to be updated simultaneously, i.e. within each iteration, compute the update for every theta_j (j = 0, 1, …, n) first, and only then assign the new values.
You can get the partial derivative using basic calculus:
partial J / partial theta_j = (1/m) * sum from i = 1 to m of (h_theta(x^i) - y^i) * x_j^i (for theta_0, x_0^i = 1, so the last factor disappears)

This equation makes sense because if you start on the left side of the parabola, the derivative is negative, so subtracting a negative value gives a larger theta, which moves you closer to the minimum of the parabola. You can then use a while (or repeat) loop that keeps updating theta until the cost function reaches its global minimum. The same logic works when you start on the right side of the parabola: the derivative is positive, so theta decreases toward the minimum.
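A loop-based sketch of these partial derivatives and one simultaneous update, continuing the snippets above (alpha is the learning rate discussed just below; 0.05 is only an illustrative value):

```python
def gradients(theta, X, Y):
    """Partial derivatives of J with respect to theta_0 and theta_1."""
    m = len(Y)
    d0, d1 = 0.0, 0.0
    for i in range(m):
        error = (theta[0] + theta[1] * X[i, 1]) - Y[i]  # h_theta(x^i) - y^i
        d0 += error            # times x_0^i, which is 1
        d1 += error * X[i, 1]  # times x_1^i
    return d0 / m, d1 / m

# One step of gradient descent: compute both derivatives first, then update both thetas together
alpha = 0.05
d0, d1 = gradients(theta_initial, X, Y)
theta_new = np.array([theta_initial[0] - alpha * d0, theta_initial[1] - alpha * d1])
```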

We give alpha as a parameter to this algorithm to determine how large the steps are per iteration. On one hand, if alpha is tiny, you move towards the bottom of the parabola in tiny steps, making the algorithm take a very long time. On the other hand, if alpha is huge, each iteration may overshoot the minimum and throw you from the left side of the parabola to the right, over and over again, so you never reach the minimum (the cost can even diverge). You have to figure out what alpha works best.
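Putting it together, here is a sketch of the full loop, reusing the cost and gradients functions above; the alpha, iteration cap, and tolerance are arbitrary values you would tune:

```python
def gradient_descent(X, Y, alpha=0.05, max_iters=10000, tol=1e-9):
    """Repeat the simultaneous update until the cost barely changes (convergence)."""
    theta = np.zeros(2)
    previous_cost = cost(theta, X, Y)
    for _ in range(max_iters):
        d0, d1 = gradients(theta, X, Y)
        theta = np.array([theta[0] - alpha * d0, theta[1] - alpha * d1])
        current_cost = cost(theta, X, Y)
        if abs(previous_cost - current_cost) < tol:
            break  # the cost has essentially stopped improving
        previous_cost = current_cost
    return theta

theta_best = gradient_descent(X, Y)
print(theta_best)  # the fitted intercept and slope
```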

Once the algorithm converges, you now have the optimized parameters, theta, which will fit the best line through your points:
[Figure: Complete Linear Model, the fitted line through the data points]

Final Comments

One important comment is that when we implement these optimization methods on large data sets, for loops start to become pretty inefficient, so we use a concept called vectorization. This is just a fancy way of saying: forget the for loops and use matrix algebra, which is highly optimized, instead. Here are the implementations using vectorization:
J(theta) = (1/(2m)) * (X*theta - Y)^T * (X*theta - Y)
theta := theta - (alpha/m) * X^T * (X*theta - Y)
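The same cost and update written with matrix operations instead of for loops (again just a sketch, reusing the X and Y defined above):

```python
def cost_vectorized(theta, X, Y):
    """J(theta) = (1/(2m)) * (X @ theta - Y)^T (X @ theta - Y)."""
    m = len(Y)
    residuals = X @ theta - Y
    return (residuals @ residuals) / (2 * m)

def update_vectorized(theta, X, Y, alpha):
    """One step: theta := theta - (alpha/m) * X^T (X @ theta - Y).
    All the thetas are updated at once, so the update is automatically simultaneous."""
    m = len(Y)
    return theta - (alpha / m) * (X.T @ (X @ theta - Y))
```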
Lastly, I want to mention that gradient descent is not the only optimization method that works for linear regression. Another commonly used one is the normal equation, which is actually much simpler, but is slow for large data sets. You can find the derivation and the equation here: http://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression/
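For completeness, here is a sketch of the normal equation (see the link above for the derivation); np.linalg.solve is used instead of explicitly inverting X^T X:

```python
def normal_equation(X, Y):
    """Solve (X^T X) theta = X^T Y directly. Exact and with no alpha to tune,
    but it gets slow when X is very large."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

theta_exact = normal_equation(X, Y)  # should closely match the gradient descent result
```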

Up next…

Next week, we will learn about feature scaling and regularization to thoroughly implement linear regression.

Jun Taek Lee

Wishing to one day be as cool as Roger Peng
