Sunday, February 24, 2013

Selection Sort animated

Recently I came across the Wikipedia page dedicated to the Selection Sort algorithm and found an interesting animation that shows its behavior. So I decided to write this post, which shows how to recreate that animation using Python.
The idea behind the selection sort is straightforward: at each iteration the unsorted element with the smallest (or largest) value is moved to its proper position in the array. Assume that we wish to sort an array in increasing order. We begin by finding the smallest element and moving it to the lowest index position, which we can do by swapping the element at the lowest index with the smallest element. We then reduce the effective size of the unsorted portion by one element and repeat the process on the smaller unsorted (sub)array. The process stops when only one unsorted item remains.
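To make the procedure concrete, here's a minimal, non-animated version of the algorithm (just a sketch for reference; the animated version follows below):

def selectionsort(a):
    for j in range(len(a)-1):
        iMin = j # index of the smallest value in the unsorted part
        for i in range(j+1, len(a)):
            if a[i] < a[iMin]:
                iMin = i
        if iMin != j: # move the smallest value to position j
            a[iMin], a[j] = a[j], a[iMin]
    return a

print(selectionsort([5, 2, 4, 1, 3])) # prints [1, 2, 3, 4, 5]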
Let's see a Python implementation of the selection sort which is able to visualize with a graph the status of the sorting at each iteration:
import pylab
from random import shuffle

def selectionsort_anim(a):
 x = range(len(a))
 for j in range(len(a)-1):
  iMin = j
  for i in range(j+1,len(a)):
   if a[i] < a[iMin]: # find the smallest value in the unsorted part
    iMin = i
  if iMin != j: # place the value into its proper location
   a[iMin], a[j] = a[j], a[iMin]
  # plotting the current status of the array
  pylab.plot(x,a,'k.',markersize=6)
  pylab.savefig("selectionsort/img" + '%04d' % j + ".png") # the directory selectionsort must exist
  pylab.clf() # figure clear

# running the algorithm
a = list(range(300))  # initialization of the array
shuffle(a)            # shuffle!
selectionsort_anim(a) # sorting
At each iteration the status of the algorithm is visualized by plotting the indices of the array against its values. Every plot is saved as an image and we can easily join the images into a video using ffmpeg:
$ cd selectionsort # the directory where the images are
$ ffmpeg -qscale 5 -r 20 -b 9600 -i img%04d.png movie.mp4
The result should be as follows

At each frame of the video we can see that the elements on the left form a subset of items already sorted, while the rest of the items remain to be sorted. It's nice to see that the number of unsorted elements decreases at each iteration while the subset of sorted items grows, forming a straight line.

This is just the first of a series of posts about the visualization of sorting algorithms. Stay tuned for the animations of other sorting algorithms!

Monday, February 18, 2013

Visualizing the tangent

The tangent to a curve is the straight line that touches the curve at a given point. One reason that tangents are so important is that they give the slope of the curve at that point. So, if your curve represents a time series, you can tell the rate of change of your values just by looking at the tangent.
Suppose that a curve is given as the graph of a function, y = f(x). The slope of the curve at the point (a, f(a)) is equal to the derivative of f at a,

f'(a) = lim_{h→0} [f(a+h) - f(a)] / h

and the equation of the tangent line can be stated as follows:

y = f(a) + f'(a)(x - a)
With this in mind we can write a snippet of code which visualizes the tangent of a curve:
from numpy import sin,linspace,power
from pylab import plot,show

def f(x): # sample function
 return x*sin(power(x,2))

# evaluation of the function
x = linspace(-2,4,150)
y = f(x)

a = 1.4 # point where the tangent is computed
h = 0.1 # step for the finite difference
fprime = (f(a+h)-f(a))/h # forward-difference approximation of f'(a)
tan = f(a)+fprime*(x-a)  # tangent line through (a, f(a))

# plot of the function and the tangent
plot(x,y,'b',a,f(a),'om',x,tan,'--r')
show()
The result is as follows:


The function f is plotted as a blue curve, the tangent is the dashed red line and the magenta dot marks the point of tangency. Looking at the graph it is easy to see that the tangent gives us a way to visualize the slope of a curve at a point. Let's see how this can help us in a practical example. Consider the fresh potatoes consumer price index between the years 1949 and 2006:
from numpy import arange
# Fresh potatoes: Annual Consumer price index, 1949-2006
# obtained at https://explore.data.gov/Agriculture/U-S-Potato-Statistics/cgk7-6ccj
price_index = [21.0,17.6,19.3,28.9,21.1,20.5,22.1,26.4,22.3,24.4,
24.6,28.0,24.7,24.9,25.7,31.6,39.1,31.3,31.3,32.1,34.4,38.0,36.7,
39.6,58.8,71.8,57.7,62.6,63.8,66.3,63.6,81.0,109.5,92.7,91.3,116.0,
101.5,96.1,116.0,119.1,153.5,162.6,144.6,141.5,154.6,174.3,174.7,
180.6,174.1,185.2,193.1,196.3,202.3,238.6,228.1,231.1,247.7,273.1]
t = arange(1949,2007) # the years
From calculus we know that the derivative is positive when f is increasing, negative when f is decreasing, and zero at a stationary point. So, if we look at the tangent of the consumer price index curve in a certain year, it has a positive slope when the price index is increasing, a negative slope when the price index is decreasing, and it is flat when the trend is about to change. Using an interpolation of the data we loaded above, we can plot the tangent in any year we want:
from scipy import interpolate

def draw_tangent(x,y,a):
 # interpolate the data with a spline
 spl = interpolate.splrep(x,y)
 small_t = arange(a-5,a+5)
 fa = interpolate.splev(a,spl,der=0)     # f(a)
 fprime = interpolate.splev(a,spl,der=1) # f'(a)
 tan = fa+fprime*(small_t-a) # tangent
 plot(a,fa,'om',small_t,tan,'--r')

draw_tangent(t,price_index,1991)
draw_tangent(t,price_index,1998)

plot(t,price_index,alpha=0.5)
show()



The graph shows the data contained in the array price_index and the tangent of the curve for the years 1991 and 1998. Using the tangent, this graph emphasizes that the price index is decreasing in the years around 1991 and increasing around 1998.
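As a quick check we can also look directly at the sign of the interpolated derivative in those two years. Here is a small sketch that reuses the variables t and price_index defined above:

from scipy import interpolate

# spline interpolation of the price index (t and price_index as defined above)
spl = interpolate.splrep(t, price_index)

for year in (1991, 1998):
    slope = interpolate.splev(year, spl, der=1) # first derivative of the spline
    print(year, 'increasing' if slope > 0 else 'decreasing')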

Find out more about derivative approximation in the post Finite differences with Toeplitz matrix.
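On a related note, the tangent snippet above approximates f'(a) with a simple forward difference. The exact derivative of x*sin(x^2) is sin(x^2) + 2*x^2*cos(x^2), so a quick sketch can show how the approximation improves as h gets smaller:

from numpy import sin, cos

def f(x): # the same sample function used above
    return x*sin(x**2)

def fprime_exact(x): # exact derivative of x*sin(x^2)
    return sin(x**2) + 2*x**2*cos(x**2)

a = 1.4
for h in (0.1, 0.01, 0.001):
    approx = (f(a+h) - f(a)) / h # forward difference, as in the snippet above
    print(h, abs(approx - fprime_exact(a))) # the error shrinks roughly linearly with h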

Wednesday, February 6, 2013

Betweenness Centrality

In Network Analysis the identification of important nodes is a common task. There are various centrality measures that we can use, and in this post we will focus on the Betweenness Centrality. We will see how this measure is computed and how to use the library networkx in order to create a visualization of the network where the nodes with the highest betweenness are highlighted. The betweenness focuses on the number of visits through the shortest paths: if a walker moves from one node to another via the shortest paths, then the nodes with a large number of visits have a higher centrality. The betweenness centrality is defined as
C_B(v) = Σ_{s ≠ v ≠ t} σ_v(s,t) / σ(s,t)

where σ(s,t) is the total number of shortest paths from node s to node t and σ_v(s,t) is the number of those paths that pass through v.
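To get a feel for the definition, here is a minimal sketch on a toy path graph (not part of the original example): the inner nodes of a path lie on many shortest paths, while the endpoints lie on none.

import networkx as nx

# a simple path graph: 0 - 1 - 2 - 3
P = nx.path_graph(4)
print(nx.betweenness_centrality(P))
# the inner nodes 1 and 2 get a positive score because the shortest paths
# between the two ends of the path pass through them, while the endpoints
# 0 and 3 get a score of 0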
Now let's compute the betweenness of a bigger network with networkx. As a first step we have to load a sample network (yes, it's the same one used in this post):
import networkx as nx

# read the graph (gml format)
G = nx.read_gml('lesmiserables.gml',relabel=True)
Now we have a representation G of our network and we can use the function betweenness_centrality() to compute the centrality of each node. This function returns a dictionary that maps each node label to its centrality value. We can use this information to trim the original network and keep only the most important nodes:
def most_important(G):
 """ returns a copy of G with
     the most important nodes
     according to the pagerank """ 
 ranking = nx.betweenness_centrality(G).items()
 print(ranking)
 r = [x[1] for x in ranking]
 m = sum(r)/len(r) # mean centrality
 t = m*3 # threshold: keep only the nodes with a centrality of at least 3 times the mean
 Gt = G.copy()
 for k, v in ranking:
  if v < t:
   Gt.remove_node(k)
 return Gt

Gt = most_important(G) # trimming
We can now use the original network and the trimmed one to visualize the graph as follows:
from pylab import show
# create the layout
pos = nx.spring_layout(G)
# draw the nodes and the edges (all)
nx.draw_networkx_nodes(G,pos,node_color='b',alpha=0.2,node_size=8)
nx.draw_networkx_edges(G,pos,alpha=0.1)

# draw the most important nodes with a different style
nx.draw_networkx_nodes(Gt,pos,node_color='r',alpha=0.4,node_size=254)
# also the labels this time
nx.draw_networkx_labels(Gt,pos,font_size=12,font_color='b')
show()
The graph should be like this one:


This graph is pretty interesting: it highlights the nodes which are very influential on the way information spreads over the network. In the sample network we used, each node represents a character and a connection between two characters represents their co-appearance in the same chapter of the book 'Les Misérables'. Looking at the graph we can easily say which are the most important characters according to the Betweenness Centrality. We can also observe some interesting situations, like those of Valjean and Myriel: they are connected to groups of characters who don't have a direct connection with the main ones.