Friday, October 12, 2012

Visualizing correlation matrices

Correlation is one of the most common and most useful statistics. A correlation coefficient is a single number that describes the degree of linear relationship between two variables. The function corrcoef provided by numpy returns a matrix R of correlation coefficients calculated from an input matrix X whose rows are variables and whose columns are observations. Each element of the matrix R represents the correlation between two variables X and Y, and it is computed as

    R(X,Y) = cov(X,Y) / (σX σY)

where cov(X,Y) is the covariance between X and Y, while σX and σY are their standard deviations. If N is the number of variables, then R is an N-by-N matrix. When we have a large number of variables, we need a way to visualize R. The following snippet uses a pseudocolor plot to visualize R:
from numpy import corrcoef, sum, log, arange
from numpy.random import rand
from pylab import pcolor, show, colorbar, xticks, yticks

# generating some uncorrelated data
data = rand(10,100) # each row represents a variable

# creating correlation between the variables
# variable 2 is correlated with all the other variables
data[2,:] = sum(data,0)
# variable 4 is correlated with variable 8
data[4,:] = log(data[8,:])*0.5

# plotting the correlation matrix
R = corrcoef(data)
pcolor(R)
colorbar()
yticks(arange(0.5,10.5),range(0,10))
xticks(arange(0.5,10.5),range(0,10))
show()
The result should be as follows:

[Figure: pseudocolor plot of the correlation matrix R, with colorbar]

As expected, the correlation coefficients for variable 2 are higher than the others, and we observe a strong correlation between variables 4 and 8.
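
The definition of the correlation coefficient in terms of the covariance and the standard deviations can be checked numerically. The following is a minimal sketch rather than part of the original snippet: the variable names, the seed, and the synthetic data are assumptions made only for illustration. It computes the coefficient by hand for two variables and compares it with the value returned by corrcoef.

```python
import numpy as np

# fixed seed, chosen only so that the sketch is reproducible
rng = np.random.default_rng(0)

# two variables with 100 observations each (synthetic, for illustration)
x = rng.random(100)
y = 0.5 * x + rng.random(100)

# covariance and standard deviations, all using the same normalization
cov_xy = np.cov(x, y)[0, 1]      # np.cov divides by (n - 1) by default
sigma_x = np.std(x, ddof=1)      # ddof=1 matches the normalization of np.cov
sigma_y = np.std(y, ddof=1)

# correlation from the definition, and the same quantity from corrcoef
r_manual = cov_xy / (sigma_x * sigma_y)
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

The two printed values should agree up to floating-point rounding. The normalization has to be the same in the numerator and the denominator, which is why ddof=1 is passed to np.std here.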

15 comments:

  1. Don't use the jet colormap!

    http://www.jwave.vt.edu/~rkriz/Projects/create_color_table/color_07.pdf

    https://abandonmatlab.wordpress.com/2011/05/07/lets-talk-colormaps/

    http://cresspahl.blogspot.com/2012/03/expanded-control-of-octaves-colormap.html

    I think the hot colormap would be a better choice here

  2. In some cases, Hinton diagrams can be far more useful. See http://www.scipy.org/Cookbook/Matplotlib/HintonDiagrams

  3. hey,

    i get a strange error when running the script:


    /Users/xxx/src/matplotlib/lib/matplotlib/backends/backend_macosx.pyc in draw_quad_mesh(self, gc, master_transform, meshWidth, meshHeight, coordinates, offsets, offsetTrans, facecolors, antialiased, showedges)
    98 facecolors,
    99 antialiased,
    --> 100 showedges)
    101
    102 def new_gc(self):

    "only length-1 arrays can be converted to Python scalars"

    also, the colorbar is not visible
    what to do?

  4. which version of matplotlib/python are you using?

  5. hey,

    i'm using Python 2.7.3 and matplotlib '1.2.x' on os x.
    btw: if i leave out the colorbar command the error doesn't show up.

  6. hello again.

    actually, i dont know why i had this unstable version installed.
    i used pip to install the stable 1.1.1 version and now it works like a charm.

    thanks for the fast reply and keep up the good work here :)

  7. I like the correlation example and will try that later on some of my data. It is also cool that we use the same theme on Blogger. /Magnus

    Replies
    1. Thanks Magnus. I like this theme because it's simple. If you're interested in matrix visualization don't forget to try Hinton diagrams also.

  8. This comment has been removed by the author.

  9. Love this blog. Here's the same matrix made in Plotly: http://on.fb.me/14oU6ej
    Different colormap and 20 instead of 10 rows.

    Replies
    1. You should force 0 to be white dude, otherwise it's great.

  10. I found it difficult to get a result for 288 rows by 1000 columns. Any suggestions?

  11. Thanks a lot for this! very helpful!
    Just one question why is the correlation coeff range going from -0.15 to 1 and not from -1 to 1 ?

    Replies
    1. Hi, the correlation coefficient always lies between -1 and 1. A value of 1 means that the two variables increase together in a perfectly linear way; this is what we get when we compare a variable with itself (see the values on the diagonal). A value of -1 is also a maximal correlation, but a negative one: when one variable increases, the other decreases. The colorbar doesn't reach -1 because no pair of variables in this example is that strongly negatively correlated.

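
Two points raised in the comments, the range of the colorbar and the suggestion to force 0 to be white, can be illustrated together. The following is a minimal sketch under stated assumptions: the synthetic variables, the seed, and the choice of matplotlib's diverging colormap "RdBu_r" are illustrative and are not taken from the original post. It builds variables whose correlations are exactly 1 and -1, then fixes the color limits to [-1, 1] so that the near-white midpoint of the colormap corresponds to zero correlation.

```python
import numpy as np
import matplotlib.pyplot as plt

# fixed seed, chosen only so that the sketch is reproducible
rng = np.random.default_rng(0)
x = rng.random(100)

# each row is a variable, each column an observation (as in the post)
data = np.vstack([
    x,                 # variable 0
    2.0 * x + 1.0,     # variable 1: increasing linear function of variable 0
    -x,                # variable 2: decreasing linear function of variable 0
    rng.random(100),   # variable 3: independent noise
])
R = np.corrcoef(data)

fig, ax = plt.subplots()
# vmin/vmax pin the color scale to the full range of the coefficient,
# so the middle of the diverging colormap always means "no correlation"
mesh = ax.pcolor(R, cmap="RdBu_r", vmin=-1.0, vmax=1.0)
fig.colorbar(mesh, ax=ax)

# put the tick labels at the centers of the cells
ticks = np.arange(R.shape[0])
labels = [str(t) for t in ticks]
ax.set_xticks(ticks + 0.5)
ax.set_xticklabels(labels)
ax.set_yticks(ticks + 0.5)
ax.set_yticklabels(labels)
```

Calling plt.show() afterwards displays the figure. With fixed limits the colorbar always spans -1 to 1, whatever values happen to occur in R, which makes plots of different data sets directly comparable.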
