Up above us all so high...
In the first part of this tutorial we shall imagine ourselves in
a satellite taking
photographs of the earth. In the process we shall learn some
image processing as well as some clustering techniques.
This shows part of a blue ocean and two land
masses with some green vegetation. We can recognize these by the
color. A human eye is extremely adept at detecting large regions of
similar colors in an image. A camera, however, has no such
merely scans a scene in a mechanical way, faithfully reporting every dot
in it, but has no idea if the dots constitute any pattern or not. The
above image, for example, is made of 3000 dots, arranged in a rectangle
that is 75 dots wide and 40 dots high. Each dot, by the way, is
called a pixel (a short form for "picture element"). The
camera uses three filters: red,
green and blue. The camera reports the red-ness, green-ness
and blue-ness of each pixel. The numbers are from 0 to 255. Thus, a perfectly
red dot will have red value 255, green value 0 and blue value 0. If you
increase the green component the dot will appear yellowish (since red plus
green light make yellow light).
The 3000 dots in our pictures, therefore, are just 9000 numbers to the
satellite. It is better to think of them as 3000 points in
3D. Our eye could quickly group (or cluster) these 3000 points
into two clusters: ocean and land. The ability to group a
large number of points in d-dimensions into a relatively
smaller number of classes is the aim of cluster analysis.
The file sat.dat stores the data set for this image. Before
going into the statistics let us learn how to turn these 3000
points into an image using R.
First read the data.
|A satellite image
dat = read.table("sat.dat")
Notice that this file has no header line, so we have the omitted
head=T option. We are told that the data
file has 3 columns and 3000 rows of numbers. Each row is for a
pixel, and the 3 columns are for red, green and blue,
respectively. Let us assign these names to the columns.
names(dat) = c("red","green","blue")
Next make a vector of colors, one color for each point.
mycol = rgb(red,green,blue,max=255)
rgb makes colors by combining red,
green and blue values. The
specification means 255 is to be considered the highest value for
Next we need to specify the size of the image. It is 75 pixels in
width and 40
pixels in height.
rows = 1:75
columns = 1:40
So the 3000 points are actually arranged as a 40 by 75
matrix. (The height, 40, is the number of rows.)
z = matrix(1:3000,nrow=75)
Now we are ready to make the image.
If you do not like to see the axes and labels, then you may
Based on the 3000 points
the statistical brain of the satellite has to decide that is seeing two
land masses peeping out from a vast blue ocean. And one important tool to
achieve this is called clustering.
But before we go into it let us make a 3D scatterplot out of the 3
library(scatterplot3d) #Warning! It won't work unless
#scatterplot3d is installed.
#I mention this to explain how
#you may do it if you can download
#this (pretty large) package.
#Also, this package is readily available for
#only Windows. To use it in Linux, you have
#to download the source code and compile!
#But we shall see a work around for this soon.
Then you can use:
One problem with
scatterplot3d is that there is no
simple way to specify the viewing angle. This is rectified in
the small script scat3d.r, which you can download.
You can also turn the plot around specifying the angles
phi as follows.
scat3d(red,green,blue,col=mycol,theta = 20, phi=0)
Just a jumble of points, right? We shall apply a technique called
k-means clustering to group the 3000 points into 2
clusters. Then later we shall learn how k-means clustering
Apply the command
cl = kmeans(dat,2)
and see what
kmeans has returned.
Each pixel is put into one of two clusters (called 1 and 2), and
cl$clus tells us the cluster of each pixel. The
centers of the two clusters are given by
Now we shall make an image of the clustering result. We are going
to have an image consisting of only two colors, one for each
cluster. It is natural to choose the center as the representative
color for a cluster.
c2 = rgb(cl$cen[,1],cl$cen[,2],cl$cen[,3],max=255)
we have to wrap the long vector
cl$clus in to a 75
by 40 matrix in order to make the image.
Do you think that the ocean is well-separated from the land?
What about adding some more details? For this we shall increase
k to 3, say.
cl = kmeans(dat,3)
c3 = rgb(cl$cen[,1],cl$cen[,2],cl$cen[,3],max=255)
As you increase the number of clusters more and more details are
added to the image. Notice that in our example the details get
added to the land masses instead of the ocean which is just a
flat uninteresting sheet of water.
How the k-means algorithm works
Consider the data set shown below. We can see that it has two
clusters of points.
In order to find these two clusters we shall employ the
k-means algorithm with k=2.
The algorithm proceeds by alternating two steps that we shall
call Monarchy and Democracy (see the figure below). In step 1, we apply Monarchy by
choosing two random points as the kings of the two
clusters. These two kings create their empires by enlisting the
data points nearest to them as their subjects. So we get two
kingdoms separated by the perpendicular bisector between the
In step 2 we have Democracy, where the king is abolished and the
data points in each kingdom choose their respective leaders as
the average of themselves.
This election over, Monarchy kicks in once again as the elected
leaders behave as kings, enlisting the nearest data points as
subjects. This redefines the boundary of the kingdoms.
Then Democracy starts again, after which comes Monarchy. This
process continues until the kingdoms do not change any
more. These two kingdoms finally give the two clusters. The kings
(or elected leaders) are the centers of the clusters.
|Want to find the two clusters
|Steps in the 2-means algorithm
Since the amount of details increases with a larger number of
clusters, we come to the question ``How to choose the best
number of clusters?" One way to resolve the issue is to first take a look
at the results for all possible numbers of clusters. Clearly, if
there are n cases in the data, then the number of clusters can go
from 1 to n. This is the idea behind hierarchical clustering. It is
like zooming down gradually on the details.
Consider the four shapes below.
Suppose that you are to group the similar shapes together making just
clusters. You'd most possibly put the two cubes in one cluster, while the
sphere and the cone (both having round surfaces) will make the second
cluster. So the two clusters are
(small cube, large cube) , (sphere, cone).
Next, suppose that you are to further split any one of the
clusters (making three clusters in all). One natural way is to
split the (sphere, cone) cluster into two separate clusters (sphere) and
(cone) resulting in three clusters:
(small cube, large cube) , (sphere), (cone).
If we are to further increase the number of clusters, we have to split the
(small cube), (large cube) , (sphere), cone).
And we have reached the maximum number of clusters. This step-by-step
clustering process may be expressed using the following diagram which is
sometimes called a clustering tree and sometimes called a
Note that this single tree contains information about all
possible numbers of clusters.
Here we shall illustrate an application of hierarchical clustering using
the data set shapley.dat.
shap = read.table("shapley.dat",head=T)
In this file we have some missing values for the variable
in column 3 that are coded as 0 (instead of
NA). Let us
convert these to
shap[shap[,3]==0,3] = NA
Now we shall perform hierarchical clustering, but we shall use
only a subset of the data. We shall take only those cases where
V is between 12000 and 16000. Also we shall consider
only the three variables
shap = shap[V>12000 & V<16000,c("RA","DE","V")]
Next we shall centre and scale each variable (i.e., we
shall subtract the mean and divide by the standard
deviation). This is conveniently done by the
shap = scale(shap)
In order to perform hierarchical clustering using the
hclust function we have to find the distance matrix of
the data. The (i,j)-th entry of this matrix gives the
Euclidean distance between the i-th and the j-th
point in the data set.
d = dist(shap)
mytree = hclust(d) #this may take some time
As we have mentioned earlier this hairy tree-like object (called
a dendrogram) represents clustering allowing all
possible numbers of clusters. In order to use it we have to
``cut'' it at a particular height as shown in the diagram below.
Suppose that we want 3 clusters for the Shapley data set. The
dendrogram shows that we need to cut near height 6.
|Cutting a dendrogram
classes = cutree(mytree,h=6)
table(classes) #How many points in each class
Since we are working with 3D data (we have only 3 columns) so we
can make a 3D scatterplot.
Let us take a look at a method called AGglomerative NESting (AGNES).
The R function that we shall employ is called
shall apply it on the Shapley data.
agn = agnes(shap) #this will take some time
R can make two possible plots of the output from
Here the argument
which=2 requests the second of
these. The returned object
agn is a clustering tree
somewhat like that returned by
hclust. But the
internal structures differ. So you cannot directly apply the
cutree function to it. We first need to convert it to
the structure compatible with
agnh = as.hclust(agn)
tmp = cutree(agnh,h=2.2)
Here we have performed the cut at a specified height. We could
also have specified the number of required clusters.
tmp2 = cutree(agnh,k=10)
You may recall the terms "single linkage","complete linkage" and
"average linkage" from the theory class. What
by default is "average linkage". Let us see how "single linkage"
performs for our data.
agn.sing = agnes(shap,method="single")
Write an R script to split the Shapley data into 5 clusters using
the "complete linkage" method of AGNES, and then make a 3D
scatterplot of the data set using the colors red, green, blue,
black and pink for the 5 clusters. [Solution]
k-nearest neighbors classification
This method is motivated by how a child is taught to read. The
teacher first shows her some letters and tells her the names of
the letters, so that she can associate the names with the shapes.
Then she is presented with some letters to see if she can
|The teacher's voice is shown in red
When she sees a new letter the child quickly matches the new
shape with the ones she has learned, and says the name of the
letter that comes closest to the new letter.
Let us take a note of the different parts of this process. First,
we have a training data set: the letter shapes and their
names (shown in the first picture). Then we have a test data set
which is very much like the training set, except that the
names are not given.
We shall employ the same method to recognize the Hyades stars! We
shall start with a training data set consisting of 120 stars of
which the first 60 will be Hyades and the others not Hyades. We
are supposed to learn to recognize a Hyades star based on these.
Then we shall be given a test data set consisting of 56 new
stars, some of which are Hyades (we are not told which). Our
job will be to identify the Hyades stars in the test data sets.
First download (right-click and save to your computer) the two data sets:train.dat and tst.dat.
Next, load them into R.
|The child is to identify these
trn = read.table("train.dat",head=T)
tst = read.table("tst.dat",head=T)
We are told that the first 60 stars in the training data set are
Hyades, while the remaining 60 are not. So accordingly
we shall make a vector
of size 120:
trueClass = c(rep("Hyad",60),rep("NonHyad",60))
Now it is time to apply the k-nearest neighbors
foundClass = knn(trn,tst,trueClass,k=3)
tst consists of 56 stars, so
foundClass is a vector of that length.
Here it so happens that there is not a single mistake!