The Glowing Python: Linear regression with Numpy

Saturday, March 24, 2012

Linear regression with Numpy

A few posts ago, we saw how to use the function numpy.linalg.lstsq(...) to solve an over-determined system. This time, we'll use it to estimate the parameters of a regression line.
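A linear regression line has the form w₁x + w₂ = y, and it is the line that minimizes the sum of the squared vertical distances from each data point to the line. So, given n pairs of data (x_i, y_i), the parameters we are looking for are the w₁ and w₂ that minimize the error

$$E = \sum_{i=1}^{n} \left( y_i - (w_1 x_i + w_2) \right)^2$$

and we can compute the parameter vector w = (w₁, w₂)^T as the least-squares solution of the following over-determined system

$$\begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$
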
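For readers who want to see where the least-squares solution comes from, here is a short derivation sketch. It assumes the matrix notation A for the n×2 matrix whose i-th row is (x_i, 1) and y for the vector of observations; those symbols are introduced here for illustration and do not appear elsewhere in the post.

```latex
% Error written in matrix form, with A = [x_i, 1] row by row
E(w) = \lVert A w - y \rVert^2 = (A w - y)^T (A w - y)

% Setting the gradient to zero gives the normal equations
\nabla_w E = 2 A^T (A w - y) = 0
\quad\Longrightarrow\quad
A^T A \, w = A^T y

% For a straight line this is a 2x2 system
\begin{pmatrix} \sum_i x_i^2 & \sum_i x_i \\ \sum_i x_i & n \end{pmatrix}
\begin{pmatrix} w_1 \\ w_2 \end{pmatrix}
=
\begin{pmatrix} \sum_i x_i y_i \\ \sum_i y_i \end{pmatrix}
```

A least-squares solver such as numpy.linalg.lstsq returns a w that satisfies these equations, although it does not need to form A^T A explicitly to do so.
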
Let's use numpy to compute the regression line:

from numpy import arange, array, ones, linalg
from pylab import plot, show

xi = arange(0, 9)
# first row: x values; second row: constant 1 for the intercept term
A = array([xi, ones(9)])
# observed values (roughly linear, with noise)
y = [19, 20, 20.5, 21.5, 22, 23, 23, 25.5, 24]
# A.T has one row (x_i, 1) per data point; w[0] is the slope, w[1] the intercept
w = linalg.lstsq(A.T, y, rcond=None)[0]

# plotting the line
line = w[0]*xi + w[1]  # regression line
plot(xi, line, 'r-', xi, y, 'o')
show()

We can see the result in the plot below.
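
As a sanity check on the fit, the following minimal sketch compares the lstsq solution with the textbook closed-form slope and intercept for a single predictor, and also reads back the sum of squared errors that the fit minimizes. The variable names (dx, sse, and so on) are chosen here for illustration; the data is the same as in the listing above.

```python
import numpy as np

xi = np.arange(9)
y = np.array([19, 20, 20.5, 21.5, 22, 23, 23, 25.5, 24])
A = np.column_stack([xi, np.ones(len(xi))])  # one row (x_i, 1) per data point

# lstsq returns (solution, residual sum of squares, rank, singular values)
w, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)

# textbook closed-form slope and intercept for a single predictor
dx = xi - xi.mean()
slope = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
intercept = y.mean() - slope * xi.mean()

# sum of squared errors, computed by hand from the fitted line
sse = np.sum((y - (w[0] * xi + w[1])) ** 2)

print("lstsq:      ", w)
print("closed form:", slope, intercept)
print("SSE by hand:", sse, " SSE from lstsq:", residuals[0])
```

The two estimates should agree up to floating-point error, and the hand-computed error should match the residual value that lstsq reports for a full-rank, over-determined system.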

You can find more about data fitting using numpy in the following posts:

Update: the same result can be achieved using the function scipy.stats.linregress (thanks ianalis!):

from numpy import arange
from pylab import plot, show
from scipy import stats

xi = arange(0, 9)
# observed values (roughly linear, with noise)
y = [19, 20, 20.5, 21.5, 22, 23, 23, 25.5, 24]

slope, intercept, r_value, p_value, std_err = stats.linregress(xi, y)

print('r value', r_value)
print('p_value', p_value)
# std_err is the standard error of the estimated slope
print('standard error of the slope', std_err)

line = slope*xi + intercept
plot(xi, line, 'r-', xi, y, 'o')
show()
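
To make two of the values returned by linregress less opaque, here is a small numpy-only sketch that recomputes the correlation coefficient and the standard error of the slope by hand. It assumes the usual simple-regression formulas; the names r, sxx and slope_stderr are introduced here for illustration, and the results should agree with r_value and std_err above.

```python
import numpy as np

xi = np.arange(9)
y = np.array([19, 20, 20.5, 21.5, 22, 23, 23, 25.5, 24])
n = len(xi)

# correlation coefficient between x and y
r = np.corrcoef(xi, y)[0, 1]

# ordinary least-squares fit of a degree-1 polynomial: [slope, intercept]
slope, intercept = np.polyfit(xi, y, 1)

# sum of squared errors of the fitted line
sse = np.sum((y - (slope * xi + intercept)) ** 2)

# standard error of the slope: sqrt( SSE / (n - 2) / sum((x - mean(x))^2) )
sxx = np.sum((xi - xi.mean()) ** 2)
slope_stderr = np.sqrt(sse / (n - 2) / sxx)

print("r:", r)
print("r squared:", r ** 2)
print("standard error of the slope:", slope_stderr)
```

Note that this quantity describes the uncertainty of the slope estimate, not the standard deviation of the data.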

27 comments:

G. Steve Arnold, March 24, 2012 at 2:36 PM
Possible bugs: x_lst is unused and w[] is undefined?

JustGlowing, March 24, 2012 at 2:41 PM
Thanks Steve, I fixed it. I changed the code at the end to make it consistent with the notation.

ianalis, March 27, 2012 at 12:09 AM
Another method is to use scipy.stats.linregress()

Anonymous, October 15, 2012 at 2:06 AM
What are the r_value and the p_value in the second program? What do they represent?

JustGlowing, October 15, 2012 at 8:09 AM
r_value is the correlation coefficient and p_value is the p-value for a hypothesis test whose null hypothesis is that the slope is zero.

For more information about correlation, see my last post:
http://glowingpython.blogspot.com/2012/10/visualizing-correlation-matrices.html

And you can find more about the p-value here:
http://en.wikipedia.org/wiki/P-value

Anonymous, October 22, 2012 at 3:01 PM
How can I get the sum of squared errors of the regression from this function?

Anonymous, October 24, 2012 at 3:50 PM
Is there a way to calculate the maximum and minimum gradient, given multiple pairs of (x,y) measurements at each point, e.g. repeated trials? Thanks!!

Andrew M, January 30, 2013 at 4:36 PM
The following function is quite nice: scipy.stats.linregress

http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

It provides the p-value and r-value without extra work.

Jared Forsyth, April 20, 2013 at 2:21 AM
Awesome! Just what I was looking for.

David, February 4, 2014 at 9:38 PM
I stumbled upon this fine piece of work, and it seemed to work just fine.
I did come across a problem, though: once the slope (from the updated code) turned negative, the "line" list became empty. To solve this, I simply did the following instead, which solved my issue:

line = A[0]+intercept

JustGlowing, February 5, 2014 at 1:09 PM
Hi David, at the moment I'm using the implementation provided by sklearn; maybe you will find it helpful: http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#example-linear-model-plot-ols-py

Adviser, April 10, 2014 at 4:14 AM
How about a 2D linear regression? Can you please suggest what the easiest way is to perform the same analysis on a 2D dataset?

Anonymous, April 11, 2014 at 5:03 AM
Is there an easy way to plot a regression line that is based on only part of the y data? For example, plot the whole y but fit the regression line only to:
[20.5, 21.5, 22, 23, 23, 25.5, 24]

Anonymous, July 30, 2014 at 3:56 PM
std_err is not the standard deviation, but the standard error of the estimated slope!

Unknown, January 25, 2017 at 11:58 AM
Is it necessary to add "ones(9)"? I usually have just the independent variable x and the dependent one y. I don't know how, why, and when I should add a ones column to my independent variable (x). Regards

Unknown, January 25, 2017 at 12:02 PM
Just one more: how do I predict on a new set of points? I use Xtrain and Ytrain to fit the model W = inv(Xtrain.T*Xtrain)*Xtrain.T*Ytrain (with np.dot, of course), so when predicting I use Ypred = Xvalid*W, but it's not working for me :(

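
On the last two questions, here is a minimal sketch of the normal-equation approach with an explicit ones column, followed by prediction on new points. The column of ones is what lets the model learn an intercept; without it the fitted line is forced through the origin. The helper add_ones_column and the x_valid values are hypothetical, introduced only for illustration. One thing worth checking when predictions fail is that the validation matrix gets the same ones column as the training matrix, and that a matrix product (np.dot or @) is used rather than the elementwise *.

```python
import numpy as np

def add_ones_column(x):
    """Build the design matrix [x, 1] so the model can learn an intercept."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x, np.ones_like(x)])

# training data (same values as in the post)
x_train = np.arange(9)
y_train = np.array([19, 20, 20.5, 21.5, 22, 23, 23, 25.5, 24])
X_train = add_ones_column(x_train)

# normal equations (X^T X) w = X^T y, solved without forming an explicit inverse
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

# new points need the same ones column before the matrix product with w
x_valid = np.array([9, 10, 11])  # hypothetical new inputs
X_valid = add_ones_column(x_valid)
y_pred = X_valid @ w

print("w:", w)
print("predictions:", y_pred)
```

Using np.linalg.solve here instead of inv is a design choice for numerical stability; np.linalg.lstsq, as used in the post, is generally the more robust option.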
