Week 1 Summary - Linear Regression
Data Science Introduction and Linear Regression
This week, we talked about the basics of data science, machine learning, and the implementation of linear regression from scratch. Admittedly, I did a poor job structuring how I talked about linear regression, so here’s a better version (summary) of what we talked about. Next week, we will talk about two concepts used in regression models called feature scaling and regularization. Here’s a nice link if you want to learn about it before next week: https://datanice.github.io/machine-learning-101-what-is-regularization-interactive.html
Let me know if you have any questions.
What is data science?
Krit talked about how it's pretty broad and vague, but, simply put, it is the intersection of statistics and computer science. I would argue that it currently incorporates a heavy business aspect as well, but with the growing number of applications of data science, that is quickly changing.
What do data scientists do? What structure do they follow?
- Identify a problem (ask interesting questions)
- Find data (webscraping, databases, collect your own)
- Clean data (regular expressions, merging data, etc.)
- Explore the data (data mining, finding patterns, etc.)
- Build models (machine learning, model selection, etc.)
- Evaluate models (evaluation metrics)
- Present models (data visualization, presentation)
I feel relatively comfortable teaching about the machine learning components, so that’s what you’ll be hearing from me.
What are some basic methods data scientists use?
Data scientists use a large variety of tools in the different steps mentioned previously, but a lot of data science education focuses on the data analysis (machine learning) step. The most well-known methods are classified as supervised learning or unsupervised learning tasks. A few supervised learning methods are linear regression, logistic regression, artificial neural networks (deep learning), and support vector machines. A few unsupervised learning methods are principal component analysis, clustering, and anomaly detection.
We will start by learning about linear regression and implementing it from scratch, but first, let's talk about the structure shared by many machine learning algorithms and how they arrive at the optimal parameters for a model.
What is the structure of machine learning problems?
All machine learning problems have the same structure (from what I’ve seen) in solving a problem:
- Run a model with appropriate parameters
- Calculate the error/cost of a model
- Change the parameters to reduce the error/cost
- Repeat previous steps till you get the lowest error/cost possible
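The steps above can be sketched as a generic loop. This is only an illustration of the structure, not a real library API; compute_cost and update_params are hypothetical callables that a specific algorithm would supply, and the names are my own.

```python
# A minimal sketch of the generic machine learning loop described above.
# compute_cost and update_params are hypothetical placeholders that a
# concrete algorithm (e.g. linear regression with gradient descent) fills in.

def fit(params, compute_cost, update_params, tol=1e-8, max_iters=10000):
    prev_cost = float("inf")
    for _ in range(max_iters):
        cost = compute_cost(params)        # run the model, measure the error
        if abs(prev_cost - cost) < tol:    # stop once the cost settles down
            break
        params = update_params(params)     # change parameters to reduce cost
        prev_cost = cost
    return params
```

For example, minimizing cost(t) = t^2 with the update t -> 0.8 * t (which is gradient descent on t^2 with alpha = 0.1) drives t toward 0.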
How do we implement a linear regression from scratch?
Before going into the details, here are some labels to make our lives easier:
m = number of observations
n = number of independent variables
X = an m × (n+1) matrix of all the values for each observation and independent variable (aka features). It is (n+1) because we add a column of 1s as the first column. Ask me if you're curious why.
Y = an m × 1 matrix of all the values for the response variable
i = indicates the ith observation (y^i = response value of the ith observation)
j = indicates the jth column of the ith row of X
theta = vector containing all the parameters of a model
theta_0 = intercept of the model
We will study how to implement the regression model using one feature. Now we can get into how to run a linear regression.
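To make the notation concrete, here is a hypothetical single-feature data set and its matrix X built in NumPy. The numbers are made up for illustration.

```python
import numpy as np

# Hypothetical toy data with one feature (n = 1); the numbers are made up.
x = np.array([0.0, 1.0, 2.0, 3.0])    # feature values
Y = np.array([1.1, 2.9, 5.2, 6.8])    # response values
m = len(x)

# X: a first column of 1s (paired with the intercept theta_0) next to
# the feature column, giving X the shape m × (n+1).
X = np.column_stack([np.ones(m), x])
```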
Suppose we ran a straight horizontal line through our points. We call the equation of this line h_theta(x). A horizontal line looks like a pretty terrible fit, so we can guess that the cost will be high.
So what does the cost look like? For a linear regression, the cost function is

J(theta_0, theta_1) = (1 / (2m)) * sum_{i=1..m} (h_theta(x^i) - y^i)^2

where h_theta(x) = theta_0 + theta_1 * x, theta_0 corresponds to the intercept, and theta_1 corresponds to the slope of the linear model. This cost function measures the distance between the actual points and the line.
We sum the squared residuals and divide by m, the number of points, to get an average of the squared residuals.
NOTE: we also divide by 2 because of convention. There is nothing special about the 2 other than it will make some computations cleaner later.
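The cost function above can be sketched in NumPy as follows (the function name is my own):

```python
import numpy as np

def compute_cost(X, Y, theta):
    # J(theta) = (1 / (2m)) * sum of squared residuals
    m = len(Y)
    residuals = X @ theta - Y        # h_theta(x^i) - y^i for every i
    return residuals @ residuals / (2 * m)
```

With a perfect fit the residuals are all zero, so the cost is 0; the worse the line, the larger the cost.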
Reducing the Cost Function
Notice that the cost function J has a squared term, so if you plotted J as a function of any single theta, you would get a parabola.
Thus, we want to change our theta values so that J gets closer and closer to the minimum of the parabola. This idea of iterating over different values of theta till we reach the minimum (convergence) is called an optimization method. The optimization method we will use is called gradient descent, or steepest descent. We update the parameters, theta, using the following equation:

theta_j := theta_j - alpha * (∂/∂theta_j) J(theta)
NOTE: each theta needs to be updated simultaneously, i.e. for each iteration, compute the update for every theta_j first, then apply them all at once.
You can get the partial derivative using basic calculus:

(∂/∂theta_j) J(theta) = (1/m) * sum_{i=1..m} (h_theta(x^i) - y^i) * x_j^i
This equation makes sense because if you start on the left side of the parabola, then the derivative is negative, so subtracting a negative value gives a higher value of theta. That means we are closer to the minimum of the parabola. You can then use a while loop or repeat loop to keep updating theta until the parabolic cost function reaches its global minimum. The same logic works when you start on the right side of the parabola.
We give alpha as a parameter to this algorithm to determine how large the steps are per iteration. On one hand, if alpha is tiny, then you will move towards the center of the parabola in tiny steps, making the algorithm take a loooong time. On the other hand, if alpha is huge, then each iteration may move you from the left side of the parabola to the right over and over again. In which case you would be stuck in an infinite loop, never reaching the minimum. You have to figure out what alpha works best.
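Putting the update rule and the alpha discussion together, here is a loop-based sketch of gradient descent for linear regression. The function name and the default values for alpha and n_iters are my own picks, not universal choices.

```python
import numpy as np

def gradient_descent(X, Y, alpha=0.1, n_iters=2000):
    # Plain (loop-based) gradient descent for linear regression.
    m, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iters):
        grad = np.zeros(p)
        for j in range(p):                        # one partial per theta_j
            for i in range(m):
                grad[j] += (X[i] @ theta - Y[i]) * X[i, j]
        theta = theta - (alpha / m) * grad        # simultaneous update
    return theta
```

As noted above, alpha usually needs tuning per data set: too small and convergence crawls, too large and the iterates bounce across the parabola.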
Once the algorithm converges, you now have the optimized parameters, theta, which will fit the best line through your points.
One important comment is that when we implement these optimization methods with large data sets, for loops start to become pretty inefficient, so we use a concept called vectorization. This is just a fancy way of saying forget the for loops when using matrices and vectors; instead, use matrix algebra operations, which are highly optimized. In vectorized form, the predictions are X * theta and the gradient descent update is

theta := theta - (alpha / m) * X^T * (X * theta - Y)
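A vectorized sketch of the same algorithm, again with hypothetical names and defaults of my own choosing:

```python
import numpy as np

def gradient_descent_vectorized(X, Y, alpha=0.1, n_iters=2000):
    # Same algorithm as the loop version, but the entire gradient is one
    # matrix expression: X.T @ (X @ theta - Y). No explicit loops over i, j.
    m, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iters):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - Y))
    return theta
```

The two versions compute identical updates; the vectorized one simply delegates the summations to optimized matrix routines.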
Lastly, I want to mention that gradient descent is not the only optimization method that works for linear regression. Another commonly used one is the normal equation, which is actually much simpler, but is slow for large data sets. You can find the derivation and the equation here: http://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression/
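The normal equation solves theta = (X^T X)^(-1) X^T Y in one shot; see the linked derivation for why. A sketch (the function name is my own):

```python
import numpy as np

def normal_equation(X, Y):
    # Solve the system (X^T X) theta = X^T Y. Using np.linalg.solve is
    # preferred over explicitly inverting X^T X, for numerical stability.
    return np.linalg.solve(X.T @ X, X.T @ Y)
```

No alpha, no iterations, but solving the system gets slow when the number of features is very large, which is why gradient descent is still used.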
Next week, we will learn about feature scaling and regularization to thoroughly implement linear regression.