Wednesday, September 24, 2014

Text summarization with NLTK

The goal of automatic text summarization is to reduce a textual document to a summary that retains the pivotal points of the original document. Research on text summarization is very active and in recent years many summarization algorithms have been proposed.
In this post we will see how to implement a simple text summarizer using the NLTK library (which we also used in a previous post) and how to apply it to some articles extracted from the BBC news feed. The algorithm that we are going to see tries to extract one or more sentences that cover the main topics of the original document, based on the idea that if a sentence contains the most recurrent words in the text, it probably covers most of the topics of the text. Here's the Python class that implements the algorithm:
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest

class FrequencySummarizer:
  def __init__(self, min_cut=0.1, max_cut=0.9):
    """
     Initialize the text summarizer.
     Words that have a term frequency lower than min_cut 
     or higher than max_cut will be ignored.
    """
    self._min_cut = min_cut
    self._max_cut = max_cut 
    self._stopwords = set(stopwords.words('english') + list(punctuation))

  def _compute_frequencies(self, word_sent):
    """ 
      Compute the frequency of each word.
      Input: 
       word_sent, a list of sentences already tokenized.
      Output: 
       freq, a dictionary where freq[w] is the frequency of w.
    """
    freq = defaultdict(int)
    for s in word_sent:
      for word in s:
        if word not in self._stopwords:
          freq[word] += 1
    # frequencies normalization and filtering
    m = float(max(freq.values()))
    for w in list(freq.keys()):  # iterate over a copy so keys can be deleted safely
      freq[w] = freq[w]/m
      if freq[w] >= self._max_cut or freq[w] <= self._min_cut:
        del freq[w]
    return freq

  def summarize(self, text, n):
    """
      Return a list of n sentences 
      which represent the summary of text.
    """
    sents = sent_tokenize(text)
    assert n <= len(sents)
    word_sent = [word_tokenize(s.lower()) for s in sents]
    self._freq = self._compute_frequencies(word_sent)
    ranking = defaultdict(int)
    for i,sent in enumerate(word_sent):
      for w in sent:
        if w in self._freq:
          ranking[i] += self._freq[w]
    sents_idx = self._rank(ranking, n)    
    return [sents[j] for j in sents_idx]

  def _rank(self, ranking, n):
    """ return the first n sentences with highest ranking """
    return nlargest(n, ranking, key=ranking.get)
The FrequencySummarizer tokenizes the input into sentences and then computes the term frequency map of the words. The frequency map is then filtered to ignore both very rare and very frequent words: this makes it possible to discard noisy words such as determiners, which are very frequent but don't carry much information, as well as words that occur only a few times. Finally, the sentences are ranked according to the frequency of the words they contain and the top sentences are selected for the final summary.
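Here's a minimal usage sketch of the class above on a short made-up text (the NLTK punkt and stopwords data need to be downloaded beforehand, e.g. via nltk.download()):

sample = ("The cat sat on the mat. "
          "The cat chased a mouse across the mat. "
          "It was raining outside.")

fs = FrequencySummarizer()
# print a one-sentence summary of the sample text
for s in fs.summarize(sample, 1):
  print s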

To test the summarizer, let's create a function that extracts the text of an article from an HTML page using BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup

def get_only_text(url):
 """ 
  return the title and the text of the article
  at the specified url
 """
 page = urllib2.urlopen(url).read().decode('utf8')
 soup = BeautifulSoup(page)
 text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
 return soup.title.text, text
We can finally apply our summarizer on a set of articles extracted from the BBC news feed:
feed_xml = urllib2.urlopen('http://feeds.bbci.co.uk/news/rss.xml').read()
feed = BeautifulSoup(feed_xml.decode('utf8'))
to_summarize = map(lambda p: p.text, feed.find_all('guid'))

fs = FrequencySummarizer()
for article_url in to_summarize[:5]:
  title, text = get_only_text(article_url)
  print '----------------------------------'
  print title
  for s in fs.summarize(text, 2):
   print '*',s
And here are the results:
----------------------------------
BBC News - Scottish independence: Campaigns seize on Scotland powers pledge
* Speaking ahead of a visit to apprentices at an engineering firm in 
Renfrew, Deputy First Minister Nicola Sturgeon said: Only a 'Yes' vote will 
ensure we have full powers over job creation - enabling us to create more 
and better jobs across the country.
* Asked if the move smacks of panic, Mr Alexander told BBC Breakfast: 
I don't think there's any embarrassment about placing policies on the 
front page of papers with just days two go.
----------------------------------
BBC News - US air strike supports Iraqi troops under attack
* Gabriel Gatehouse reports from the front line of Peshmerga-held territory 
in northern Iraq The air strike south-west of Baghdad was the first taken as 
part of our expanded efforts beyond protecting our own people and humanitarian 
missions to hit Isil targets as Iraqi forces go on offence, as outlined in the 
president's speech last Wednesday, US Central Command said.
* But Iran's Supreme Leader Ayatollah Ali Khamenei said on Monday that the US 
had requested Iran's co-operation via the US ambassador to Iraq.
----------------------------------
BBC News - Passport delay victims deserve refund, say MPs
* British adult passport costs Normal service - £72.50 Check  Send - 
Post Office staff check application correct and it is sent by Special Delivery 
- £81.25 Fast-Track - Applicant attends Passport Office in person and passport 
delivered within one week - £103 Premium - Passport available for collection 
on same day applicant attends Passport Office - £128 In mid-June it announced 
that - for people who could prove they were booked to travel within seven days 
and had submitted passport applications more than three weeks earlier - there 
would be a free upgrade to its fast-track service.
* The Passport Office has since cut the number of outstanding applications to 
around 90,000, but the report said: A number of people have ended up 
out-of-pocket due to HMPO's inability to meet its service standard.
----------------------------------
BBC News - UK inflation rate falls to 1.5%
* Howard Archer, chief UK and European economist at IHS Global Insight, 
said: August's muted consumer price inflation is welcome news for consumers' 
purchasing power as they currently continue to be hampered by very 
low earnings growth.
* Consumer Price Index (CPI) inflation fell to 1.5% from 1.6% in August, 
the Office for National Statistics said.
----------------------------------
BBC News - Thailand deaths: Police have 'number of suspects'
* The BBC's Jonathan Head, on Koh Tao, says police are focussing on the 
island's Burmese community BBC south-east Asia correspondent Jonathan Head 
said the police's focus on Burmese migrants would be quite controversial as 
Burmese people were often scapegoated for crimes in Thailand.
* By Jonathan Head, BBC south-east Asia correspondent The shocking death of 
the two young tourists has cast a pall over this scenic island resort Locals 
say they can remember nothing like it happening before.
Of course, evaluating a text summarizer is not an easy task. But from the results above we can note that the summarizer often picked quoted text reported in the original article, and that the sentences it selected often provide decent insights when compared with the title of the article.

88 comments:

  1. Hello, this is an amazing program, can you tell me which algorithm this is based on? Or is it your own?
    Why did you choose min_cut=0.1 and max_cut=0.9? I mean to say, why can't 0.2 be the lower cut-off to declare that a word is not required?

    ReplyDelete
    Replies
    1. I'm pretty sure that you will find very similar algorithms in the literature as well. min_cut and max_cut are parameters that should be tuned according to the data you have. 0.1 and 0.9 were just a guess.

      Delete
    2. Oh, so the min and max cut values you have provided are not ideal, right? I shall try to modify them and see the results.

      Do you know any way, apart from using domain experts, to manually evaluate how good the summary is? Like some automation technique?

      Delete
    3. Yes, the values may not be the best choice.

      You really should look at the literature about text summarization evaluation, you could start from here:

      http://www.cai.sk/ojs/index.php/cai/article/viewFile/37/24

      Delete
  2. Hey! great program. I have a small doubt though, the sentences are displayed in rank order, can they be displayed in order of occurrence?

    ReplyDelete
    Replies
    1. I don't think it makes sense because I expect to have a very low number of duplicated sentences in a text.

      Delete
  3. Are you aware of any summarisation algorithm for online reviews? It would be different from news summaries as one needs to first identify the product features mentioned, then the sentiment associated with that feature.

    ReplyDelete
    Replies
    1. You're looking for a cross-document summary. You could use this algorithm giving it all the reviews as input. Of course, it is very naive, but the high frequency terms should be driven by the product features mentioned by the reviewers.
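      A rough sketch of what I mean (the reviews here are made up):

      reviews = [
        "The battery life of this phone is excellent.",
        "Great battery, but the camera is disappointing.",
        "The camera could be better and the battery lasts two days."
      ]
      fs = FrequencySummarizer()
      # treat all the reviews as a single document and extract one sentence
      print fs.summarize(' '.join(reviews), 1)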

      Delete
  4. It doesn't work with many wikipedia dump files like when we download wikipedia pages to a file and give that as input to the summarizer.

    ReplyDelete
    Replies
    1. Yes, the parser proposed in the example is specific for the bbc website.

      Delete
    2. Why so? How can we make this code work for any link?

      Delete
    3. Yes you can. Just parse the html document and extract the body portion using BS4. Then feed that string to the FS object.
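      Something along these lines (a rough sketch; the url is just a placeholder and fs is a FrequencySummarizer instance):

      page = urllib2.urlopen('http://example.com/some-article').read().decode('utf8')
      soup = BeautifulSoup(page)
      # join the text of every paragraph found in the page
      body_text = ' '.join(p.text for p in soup.find_all('p'))
      for s in fs.summarize(body_text, 2):
        print s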

      Delete
  5. Are the three blocks of code you provided supposed to be given in the same file?

    ReplyDelete
  6. This comment has been removed by the author.

    ReplyDelete
    Replies
    1. Hi nishita, python uses indentation to specify the code blocks (loops, functions, ifs). If you don't indent the code it won't work.

      Delete
  7. I keep getting a syntax error specifying the argument passed to get_only_text. I tried changing the url there and also in bs4.py. But it seems like I'm missing something.

    ReplyDelete
  8. Hey, why do you say that it works only for BBC? If we feed it with a different url, say, a wikipedia page, wouldn't it extract details the same way it did for BBC?

    ReplyDelete
    Replies
    1. The function was tested on the bbc website at the time I wrote the snippet. It may also work on other websites.

      Delete
    2. Oh okay thanks a lot I'll definitely try playing around with the code. And also I'd love it if you direct me to some papers or books that helped you write the code. I'm just a beginner and I want to explore more.

      Delete
    3. Another user asked a similar question on this post. Look at the reply he got.

      Delete
  9. Research papers are produced now in such mass quantity, and often universities and researchers are doing the same types of studies getting similar results, and each of their papers is published in a separate or different journal. online text summarizer

    ReplyDelete
  10. I think this post is really worth a try. I like to thank you for sharing this information. One of this days I will use your tips and see if I come up with much better result if not close to what you have been doing right now.

    ReplyDelete
  11. Thanks - this code is great. Only one strange thing for me - if I try the code as it is, I get an error:

    in _compute_frequencies
    for w in freq.keys():
    RuntimeError: dictionary changed size during iteration

    I'm new to Python, but I understand the error - the code is trying to modify the thing that controls the iteration during the loop. Why don't others have this problem? If I take out that 'del freq[w]' it works, but obviously without the 'cut' functionality. Any ideas? I'm using Python3.
    Thanks again for the code, though.

    ReplyDelete
    Replies
    1. Hi, you can copy freq.keys() into a separate list and use it for the iteration.
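      For example, something like this, leaving the rest of the method unchanged:

      for w in list(freq.keys()):
        freq[w] = freq[w]/m
        if freq[w] >= self._max_cut or freq[w] <= self._min_cut:
          del freq[w]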

      Delete
    2. Thanks - great. Works perfectly.

      Delete
    3. you can use 'for w in list(freq)'. It works in Python3.

      Delete
    4. Thank you, Om Prakash. It works for me.

      Delete
  12. Hi, have you tried with other languages? Thanks.

    ReplyDelete
    Replies
    1. Sorry to not specify more. I meant Spanish or Portuguese etc. Non English text.

      Delete
    2. No, I didn't. It would be interesting to do so.

      Delete
  13. When I try this code, an error occurs: FrequencySummarizer instance has no attribute 'summarize'. Can you resolve it?

    ReplyDelete
  14. Hello, can you tell me which algorithm this is based on? Is it a term frequency algorithm?

    ReplyDelete
    Replies
    1. Hi Firda, if you look through the comments there are some references.

      Term frequencies are a core part of the algorithm.

      Delete
    2. Thank you, so the core algorithm is term frequency, not TF-IDF? Sorry, I'm just
      rather confused. I've heard there are algorithms like sentence clustering, lexical chains or the TF-IDF algorithm. Is the algorithm you made none of them?

      Delete
    3. This algorithm just uses frequencies.

      Delete
    4. This is an amazing program, but is it really important to determine the parameters? You chose min_cut=0.1 and max_cut=0.9. If the parameter filtering was not used in the program, would it be wrong? Or would it cause a problem for the summarization?

      Delete
    5. Hello again Firda, these parameters simply filter out words with too high or too low frequencies. If you think that they're not a problem for your input data, you could simply relax the cuts so that nothing gets filtered. It really depends on the language you're dealing with and on whether you applied some other type of normalization before using this algorithm.
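      For example (after normalization the frequencies lie in (0, 1], so max_cut needs to be above 1.0 to keep every word):

      fs = FrequencySummarizer(min_cut=0.0, max_cut=1.1)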

      Delete
    6. Thank you, but why did you choose 0.9 as the max_cut parameter? Is there a specific reason?

      Delete
    7. It was just a guess based on the outcomes that I observed. In theory, you should tune them using a proper validation methodology (like cross validation for example).

      Delete
    8. Hi, sorry to ask again :D why did you not choose 1.0 as your max_cut parameter instead of 0.9?

      Delete
    9. This comment has been removed by the author.

      Delete
  15. This comment has been removed by the author.

    ReplyDelete
  16. This comment has been removed by the author.

    ReplyDelete
  17. When not using list() before the map(), the following is the error I get... please help me out

    Traceback (most recent call last):
    File "C:\Users\MAA\Desktop\text.py", line 77, in
    for article_url in to_summarize[:5]:
    TypeError: 'map' object is not subscriptable

    ReplyDelete
    Replies
    1. Did you solve this error Tarun? I am having the same error

      Delete
  18. I am using your program to summarize a set of disaster responses.
    It is repeating the same sentence multiple times, and although I am asking for a 2-line summary it is giving me more than two lines. What is the reason behind it?

    ReplyDelete
  19. Good work. The same algo is used by many online summary tools. I found the results similar. But there are better algorithms available. Full marks to you for implementing this algo and sharing with people.

    ReplyDelete
  20. This comment has been removed by the author.

    ReplyDelete
  21. How do I link the three blocks: should they all be in the same file or in different files that are then linked? And can we do it with .csv or .json files?

    ReplyDelete
  22. Nice post. i am getting the below error, any idea what it might be??
    TypeError: 'map' object is not subscriptable

    ReplyDelete
    Replies
    1. Hi Ganesh, you can switch to Python 2 or take a look at this: http://stackoverflow.com/questions/6800481/python-map-object-is-not-subscriptable
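      In Python 3 map() returns a lazy iterator that cannot be sliced, so a quick fix is to wrap it in list():

      # wrap map() in list() so that the result can be sliced with [:5]
      to_summarize = list(map(lambda p: p.text, feed.find_all('guid')))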

      Delete
  23. Hey I was trying to run the code but it is showing various types of errors even after installing all the required modules .

    ReplyDelete
  24. Traceback (most recent call last):
    File "check3.py", line 1, in
    feed_xml = urllib.urlopen('http://feeds.bbci.co.uk/news/rss.xml').read()
    NameError: name 'urllib' is not defined

    What should I do? Please tell me, I have only two days to submit my project.

    ReplyDelete
    Replies
    1. The code in this post uses urllib2, not urllib.

      Delete
    2. I had the same issue, I solved it by removing a leading blank space from that line.

      Delete
  25. Can I get a more detailed explanation as to how this can be applied to other webpages that are not RSS feeds? What is 'guid'? When I try to find_all('p'), it seems to be reading just the first paragraph on that page.

    ReplyDelete
    Replies
    1. Hi Flavour, guid is a tag used in the xml code of the feed which contains the url of the article. When you run find_all('p'), make sure you're iterating over the results of the function, as it should return all the paragraphs in the html you're working with. It's hard to advise on what to do if your data is not in rss, as the way you need to retrieve it strongly depends on the format of the input data.
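      For example, something like this should print every paragraph of the page rather than just the first one:

      soup = BeautifulSoup(page)
      for p in soup.find_all('p'):
        print p.text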

      Delete
  26. Still can not figure out how to solve
    TypeError: 'builtin_function_or_method' object is not subscriptable
    :( Tried out several things. I need to run it in python 3 anaconda.

    ReplyDelete
    Replies
    1. from nltk.tokenize import sent_tokenize,word_tokenize
      from nltk.corpus import stopwords
      from collections import defaultdict
      from string import punctuation
      from heapq import nlargest

      class FrequencySummarizer:
        def __init__(self, min_cut=0.1, max_cut=0.9):
          """
          Initialize the text summarizer.
          Words that have a term frequency lower than min_cut
          or higher than max_cut will be ignored.
          """
          self._min_cut = min_cut
          self._max_cut = max_cut
          self._stopwords = set(stopwords.words('english') + list(punctuation))

        def _compute_frequencies(self, word_sent):
          """
          Compute the frequency of each word.
          Input:
            word_sent, a list of sentences already tokenized.
          Output:
            freq, a dictionary where freq[w] is the frequency of w.
          """
          freq = defaultdict(int)
          for s in word_sent:
            for word in s:
              if word not in self._stopwords:
                freq[word] += 1
          # frequencies normalization and filtering
          m = float(max(freq.values()))
          for w in list(freq):
            freq[w] = freq[w]/m
            if freq[w] >= self._max_cut or freq[w] <= self._min_cut:
              del freq[w]
          return freq

        def summarize(self, text, n):
          """
          Return a list of n sentences
          which represent the summary of text.
          """
          sents = sent_tokenize(text)
          assert n <= len(sents)
          word_sent = [word_tokenize(s.lower()) for s in sents]
          self._freq = self._compute_frequencies(word_sent)
          ranking = defaultdict(int)
          for i,sent in enumerate(word_sent):
            for w in sent:
              if w in self._freq:
                ranking[i] += self._freq[w]
          sents_idx = self._rank(ranking, n)
          return (sents[j] for j in sents_idx)

        def _rank(self, ranking, n):
          """ return the first n sentences with highest ranking """
          return nlargest(n, ranking, key=ranking.get)

      Delete
    2. It looks the same as the original, what are the changes you made to make it work in python3?

      Delete
  27. I adapted your code to Python 3 and the code grabs the titles only from the RSS feed. After grabbing the first 4 titles of articles, it threw this error:

    Traceback (most recent call last):
    Pound hits highest since Brexit vote on rate rise speech - BBC News
    File "C:/Users/OneDrive/Anaconda NLP 3/text_summarizer/text_summarizer.py", line 93, in
    for s in fs.summarize(text, 2):
    AttributeError: 'FrequencySummarizer' object has no attribute 'summarize'


    Can you please let me know what is wrong and how it can be fixed?

    ReplyDelete
    Replies
    1. Hi, I've been having an error on this line "for article_url in to_summarize[:5]:". I have python3 and I tried a lot to fix this error but couldn't. Can you please help? It would mean everything!

      Delete
  28. Hello, the code was working perfectly but I updated python to 3.6.
    I got this problem after I fixed the list(freq) issue:

    assert n <= len(sents)
    AssertionError

    ReplyDelete
  29. Thanks for this, is it possible to summarize emails and chats, where the text is extracted and summarized for the conversation or thread?

    ReplyDelete
  30. title, text = get_only_text(url)
    NameError: name 'get_only_text' is not defined

    I am getting this error in the last. Please help me to solve it out.

    ReplyDelete
    Replies
    1. Did you copy the function get_only_text ?

      Delete
    2. This error is resolved. I am getting a new error.

      title= fs.get_only_text(aa)
      TypeError: get_only_text() takes exactly 1 argument (2 given)

      Delete
  31. This comment has been removed by the author.

    ReplyDelete
  32. Can you please tell me how to resolve the error for urllib2? It is continuously giving this error and my python is not installing urllib2. Please help.

    ReplyDelete
  33. Hi, I've been having an error on this line "for article_url in to_summarize[:5]:". I have python3 and I tried a lot to fix this error but couldn't. Can you please help? It would mean everything!

    ReplyDelete
  34. 31 # frequencies normalization and fitering
    32 m = float(max(freq.values()))
    ---> 33 for w in freq.keys():
    34 freq[w] = freq[w]/m
    35 if freq[w] >= self._max_cut or freq[w] <= self._min_cut:

    RuntimeError: dictionary changed size during iteration

    ReplyDelete
    Replies
    1. It seems that many of you have the urllib2 error, generally in python3. The simple solution to this that I found is:
      import urllib.request
      #and
      req = urllib.request.Request(url)
      page = urllib.request.urlopen(req).read().decode('utf8')
      soup = BeautifulSoup(page,"html5lib")

      Another thing is to mention "html5lib" in BeautifulSoup to remove the parser error.
      The rest seems to be correct.

      Delete
  35. What if two sentences have the same ranking? How will the sentences be ranked in the summary?

    ReplyDelete
  36. Hello,
    Is this your own algorithm for summarization? If not, which method did you use? Can I get a paper about this method? Please, help me out!

    ReplyDelete
    Replies
    1. Hi Cynthia,

      I wrote this using a paper I read quite a long time ago. Unfortunately I can't find it anymore.

      Delete
  37. Hello, very useful and interesting algorithm. I'm new to python. May I ask how to modify the algorithm to parse and summarize text from a local txt file?

    Many thanks

    ReplyDelete
    Replies
    1. Hi Guglielmo, you can open a local file with the function open(). Once you open the file you need to read it into memory, and that depends on how the file is structured.
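      For a plain text file it could be as simple as this (article.txt is just a placeholder name):

      with open('article.txt') as f:
        text = f.read()
      fs = FrequencySummarizer()
      for s in fs.summarize(text, 2):
        print s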

      Delete
  38. # frequencies normalization and fitering
    32 m = float(max(freq.values()))
    ---> 33 for w in freq.keys():
    34 freq[w] = freq[w]/m
    35 if freq[w] >= self._max_cut or freq[w] <= self._min_cut:

    RuntimeError: dictionary changed size during iteration

    ReplyDelete
  39. Hello sir, I'm getting the following error. Could you please help?
    ModuleNotFoundError: No module named 'urllib2'

    I have tried to install it on cmd but it throws the following error
    ERROR: Could not find a version that satisfies the requirement urllib2 (from versions: none)
    ERROR: No matching distribution found for urllib2

    ReplyDelete
    Replies
    1. This example was written using python 2 and urllib2 doesn't exist in python 3. You need to use urllib.request.
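      For example, the first lines of get_only_text could become something like this (just a sketch, not tested):

      from urllib.request import urlopen
      page = urlopen(url).read().decode('utf8')
      soup = BeautifulSoup(page, 'html.parser')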

      Delete
