The Glowing Python: Latent Semantic Analysis with Term-Document matrix

Friday, May 20, 2011

Latent Semantic Analysis with Term-Document matrix

This example is inspired by the second paragraph of the paper Matrices, vector spaces, and information retrieval. The paper describes a vector space model in which each document in a collection is represented as a vector of term weights, together with a query-matching algorithm that finds the relevant documents. The example implements the model and the query matching using the linear algebra module provided by numpy, and it is tested on the sample data in Figure 2 of the paper.
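To make the matching rule concrete: a document's score for a query is the cosine of the angle between their two vectors. Here is a minimal sketch with made-up three-term vectors; the helper name `cosine` and the numbers are illustrative assumptions, not taken from the paper.

```python
import numpy

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))

# hypothetical weights over three terms
doc = numpy.array([1.0, 1.0, 0.0])    # contains terms 0 and 1
query = numpy.array([1.0, 0.0, 0.0])  # asks for term 0 only

score = cosine(doc, query)
print(score)  # 1/sqrt(2), about 0.707
```

If both vectors are scaled to unit length beforehand, as the program below does, the cosine reduces to a plain dot product.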

from __future__ import print_function # print() works on Python 2 and 3
import numpy

def buildTermDocumentMatrix(terms, docs):
    """ build a term-document matrix """
    tlen = len(terms)
    dlen = len(docs)
    A = numpy.zeros((tlen, dlen))

    for i, t in enumerate(terms):
        for j, d in enumerate(docs):
            # term frequency, counted as substring occurrences
            A[i,j] = d.lower().count(t)

    # scale each column to unit length
    # (assumes every document contains at least one of the terms)
    for j in range(dlen):
        A[:,j] = A[:,j]/numpy.linalg.norm(A[:,j])

    return A

def query(A, q, docs):
    """ make the query and print the result """
    q = q/numpy.linalg.norm(q) # normalize query vector
    for i in range(len(docs)):
        # dot product of two unit vectors = cosine of the angle between them
        print('-Doc  :', docs[i], '\n-Match:', numpy.dot(A[:,i], q))

# documents collection
docs = ['How to Bake Bread Without Recipes',
        'The Classic Art of Viennese Pastry',
        'Numerical Recipes: The Art of Scientific Computing',
        'Breads, Pastries, Pies and Cakes : Quantity Baking Recipes',
        'Pastry: A Book of Best French Recipe']
# interesting terms (stems, so that "bak" matches both "bake" and "baking")
terms = ['bak','recipe','bread','cake','pastr','pie']

# will return a matrix 6 terms x 5 documents
A = buildTermDocumentMatrix(terms, docs)
print('Normalized Terms-Documents matrix')
print(A)

print('\n*** Query: "bak(e,ing)" + "bread"')
q1 = numpy.array([1,0,1,0,0,0])
query(A, q1, docs)

print('\n*** Query: "bak(e,ing)" only')
q2 = numpy.array([1,0,0,0,0,0])
query(A, q2, docs)

The results are the same as in the reference paper:

Normalized Terms-Documents matrix
[[ 0.57735027  0.          0.          0.40824829  0.        ]
 [ 0.57735027  0.          1.          0.40824829  0.70710678]
 [ 0.57735027  0.          0.          0.40824829  0.        ]
 [ 0.          0.          0.          0.40824829  0.        ]
 [ 0.          1.          0.          0.40824829  0.70710678]
 [ 0.          0.          0.          0.40824829  0.        ]]

*** Query: "bak(e,ing)" + "bread"
-Doc  : How to Bake Bread Without Recipes 
-Match: 0.816496580928
-Doc  : The Classic Art of Viennese Pastry 
-Match: 0.0
-Doc  : Numerical Recipes: The Art of Scientific Computing 
-Match: 0.0
-Doc  : Breads, Pastries, Pies and Cakes : Quantity Baking Recipes 
-Match: 0.57735026919
-Doc  : Pastry: A Book of Best French Recipe 
-Match: 0.0

*** Query: "bak(e,ing)" only
-Doc  : How to Bake Bread Without Recipes 
-Match: 0.57735026919
-Doc  : The Classic Art of Viennese Pastry 
-Match: 0.0
-Doc  : Numerical Recipes: The Art of Scientific Computing 
-Match: 0.0
-Doc  : Breads, Pastries, Pies and Cakes : Quantity Baking Recipes 
-Match: 0.408248290464
-Doc  : Pastry: A Book of Best French Recipe 
-Match: 0.0
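Latent semantic analysis is usually described as working with a low-rank approximation of the term-document matrix computed from its singular value decomposition. The sketch below rebuilds the same matrix, reproduces the first query with a single matrix-vector product, and then repeats it against a rank-3 approximation. The helper names `build_matrix` and `low_rank` and the choice k = 3 are assumptions made for illustration; none of this is code from the post or the paper.

```python
import numpy

docs = ['How to Bake Bread Without Recipes',
        'The Classic Art of Viennese Pastry',
        'Numerical Recipes: The Art of Scientific Computing',
        'Breads, Pastries, Pies and Cakes : Quantity Baking Recipes',
        'Pastry: A Book of Best French Recipe']
terms = ['bak', 'recipe', 'bread', 'cake', 'pastr', 'pie']

def build_matrix(terms, docs):
    """Term-frequency matrix with each column scaled to unit length."""
    A = numpy.array([[d.lower().count(t) for d in docs] for t in terms],
                    dtype=float)
    return A / numpy.linalg.norm(A, axis=0)

def low_rank(A, k):
    """Rank-k approximation of A built from its k largest singular values."""
    U, s, Vt = numpy.linalg.svd(A, full_matrices=False)
    return numpy.dot(U[:, :k] * s[:k], Vt[:k, :])

A = build_matrix(terms, docs)
q = numpy.array([1.0, 0, 1, 0, 0, 0])    # "bak" + "bread"
q = q / numpy.linalg.norm(q)

# full matrix: one cosine score per document, then best match first
full_scores = numpy.dot(A.T, q)
ranking = numpy.argsort(-full_scores)
print(full_scores)
print(docs[ranking[0]])

# rank-3 approximation (k = 3 is an arbitrary choice for this sketch);
# its columns are no longer unit length, so divide by their norms
A3 = low_rank(A, 3)
reduced_scores = numpy.dot(A3.T, q) / numpy.linalg.norm(A3, axis=0)
print(reduced_scores)
```

When k is at least the rank of the matrix, the reduced scores coincide with the original ones. A smaller k discards the weakest singular directions, which can give a document a non-zero score for a query it shares no literal term with. Whether that helps depends on the collection.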

Other resources about the model implemented can be found here:

5 comments:

Unknown, May 20, 2011 at 7:49 PM
In the default difflib there is a SequenceMatcher object that can be used to do some comparison like this. I used it to index websites at the company I work for and look for changes greater than a certain ratio to indicate that the website had been broken but you could use it for smaller chunks of text to search for relevant documents.
JustGlowing, May 20, 2011 at 8:13 PM
Hi Gekitsuu, thanks for the suggestion. It's a good idea for a new post.
Anonymous, April 17, 2014 at 10:57 PM
Hey, why doesn't the sum of the normalization go to zero?