Wednesday, July 27, 2011

PCA and image compression with numpy

In the previous post we have seen the princomp function. This function performs principal components analysis (PCA) on the n-by-p data matrix and uses all the p principal component to computed the principal component scores. In this new post, we will see a modified version of the princomp where the representation of the original data in the in the principal component space is computed with less than p principal components:
from numpy import mean,cov,cumsum,dot,linalg,size,flipud

def princomp(A,numpc=0):
 # computing eigenvalues and eigenvectors of covariance matrix
 M = (A-mean(A.T,axis=1)).T # subtract the mean (along columns)
 [latent,coeff] = linalg.eig(cov(M))
 p = size(coeff,axis=1)
 idx = argsort(latent) # sorting the eigenvalues
 idx = idx[::-1]       # in ascending order
 # sorting eigenvectors according to the sorted eigenvalues
 coeff = coeff[:,idx]
 latent = latent[idx] # sorting eigenvalues
 if numpc < p and numpc >= 0:
  coeff = coeff[:,range(numpc)] # cutting some PCs if needed
 score = dot(coeff.T,M) # projection of the data in the new space
 return coeff,score,latent
The following code uses the new version of the princomp to compute the PCA of a matrix that represents an image in gray scale. The PCA is computed ten times with an increasing number of principal components. The script show the images reconstructed using less than 50 principal components (out of 200).
from pylab import imread,subplot,imshow,title,gray,figure,show,NullLocator
A = imread('shakira.jpg') # load an image
A = mean(A,2) # to get a 2-D array
full_pc = size(A,axis=1) # numbers of all the principal components
i = 1
dist = []
for numpc in range(0,full_pc+10,10): # 0 10 20 ... full_pc
 coeff, score, latent = princomp(A,numpc)
 Ar = dot(coeff,score).T+mean(A,axis=0) # image reconstruction
 # difference in Frobenius norm
 # showing the pics reconstructed with less than 50 PCs
 if numpc <= 50:
  ax = subplot(2,3,i,frame_on=False)
  ax.xaxis.set_major_locator(NullLocator()) # remove ticks
  i += 1 
  title('PCs # '+str(numpc))

title('numpc FULL')
The resulting images:

We can see that 40 principal components are enough to reconstruct the original image.

At the end of this experiment, we can plot the distance of the reconstructed images from the original image in Frobenius norm (red curve) and the cumulative sum of the eigenvalues (blue curve). Recall that the cumulative sum of the eigenvalues shows the level of variance accounted by each of the corresponding eigenvectors. On the x axis there is the number of eigenvalues/eigenvectors used.
from pylab import plot,axis
perc = cumsum(latent)/sum(latent)
dist = dist/max(dist)


  1. I know this is an old post ( and I appreciated it as a reference ), but I was amused at the following:
    In you code you
    from numpy import mean,cov,cumsum,dot,linalg,size,flipud
    from pylab import *

    The from pylab import * overwrites *everything* you imported from numpy in the global namespace. =)

    1. Hi, thanks for the comment. I usually extract my snippets from bigger programs and I have to rewrite the import before the publication.

  2. JustGlowing, you say in your program:

    if numpc < p or numpc >= 0: # then cut some PCs

    Should it be "and" instead of "or" there?

  3. Could you please explain about this line:

    Ar = dot(coeff,score).T+mean(A,axis=0) # image reconstruction

    I confused where you add the old subtracted mean of the original matrix to the dot product. The Original data is already projected onto a new vector space, are we allowed to add that mean subtracted from the original data back to the new scores? I also do not understand why should we multiply the scores again to the eigenvectors!? By this time We have the sorted scores and taking the first, second ets scores will give us the projected data onto the PC1 and PC2 ... so why didn't you just take the first 'n' scores in order to show the data in the reduced space?

    1. Hi Sinooshka,

      In the line you pointed we try to rebuild the original data matrix. The motivation that lead to that line of code are the following. To compute the coefficient matrix we use the following equation:

      P = V*A (here, P is coeff, V is latent and A is our data matrix)

      To get A back, one could use

      A = V^-1*P

      but, since V is orthonormal (which means that V^1=V^T), we can compute A as

      A = V^T*P (which corresponds to dot(coeff,score).T)

      Now, since the PCA requires that the data are centered, the mean is subtracted to all the data points in the function princomp. Then, to rebuild the data, we need to add again the mean.

  4. Thank you For the clarification JustGlowing !
    I also have another question :
    What if we want to apply PCA on the Original Image with all 3 band? In that case how we are going to define our latent variables and consequently the coefficients ?

    1. In this case you have three different matrices and you need to apply the PCA on each matrix separately.

  5. OK , now lets think that I have a cube (a hyperspectral image) with not 3 bands but lets say 200 bands... now I want to reduce the number of these band into lest say 5 bands ... What I need to do is to roll out each band (2D image) into a 1D vector and create a (n x 200) matrix from them ... now I want to do PCA or any dimension reduction method on this data. The question is Is that a correct way of using PCA on this data ?

    1. This is a different kind of problem. You should try to give a look at the literature in this field.

  6. Just in case for others having the same question:

    In order to apply conventional PCA to a hypercube, it is
    necessary to ‘unfold’ the hypercube into a two-dimensional
    matrix in which each row represents the spectrum of 1 pixel.

  7. why do you use

    M = (A-mean(A.T,axis=1)).T

    for example

    A= np.array([[1,2],[2,3],[7,6]])
    array([[1, 2],
    [2, 3],
    [7, 6]])

    M = (A-np.mean(A.T,axis=1)).T

    array([[-2.33333333, -1.33333333, 3.66666667],
    [-1.66666667, -0.66666667, 2.33333333]])

    so matrix seems to change dimensions

    1. yes, it changes dimensions but it's just the result of the transposition.

    2. but why we need to transpose matrix?

  8. Hi, thanks for this useful post.

    Is this line from the 2nd block of code correct?

    A = mean(A,2) # to get a 2-D array

    Shouldn't imread load the .jpg as a 2D array already? And why would you use numpy.mean() here to "get a 2-D array" ?

    Thanks for your time.

    1. The original image is in RBG, which means that when you load it you have a 3D matrix. Taking the mean along the second axis you have a 2D matrix and you can interpret it as a grayscale image.

    2. Ah, hah. I had wrongly assumed the image was originally grayscale.


  9. Hi JustGlowing,

    I have just tried running this program but it won't work. It says:

    File "C:\Users\Martin Dalgaard\Anaconda\lib\site-packages\numpy\core\", line 50, in _count_reduce_items
    items *= arr.shape[ax]

    IndexError: tuple index out of range

    Since the error seems to be in the numpy files I don't know how to solve it in my code. I hope you can help me. Thanks in advance. :-)

    - Martin Dalgaard

    1. Hi Martin, I've never seen this error before. You could try updating your numpy version.

  10. hi!, Ihave a doubt,i have tried runnig your program but i can´t see nothing and only see: Process finished with exit code 0....

  11. hi, how can i apply this code for more than 10 images. pls help me