Principal Component Analysis

A simple example

Consider 100 students with Physics and Statistics grades shown in the diagram below. The data set is in marks.dat.

If we want to compare among the students which grade should be a better discriminating factor? Physics or Statistics? Surely Physics, since the variation is larger there. This is a common situation in data analysis where the direction along which the data varies the most is of special importance.

Now suppose that the plot looks like the following. What is the best way to compare the students now?

Here the direction of maximum variation is like a slanted straight line. This means we should take linear combination of the two grades to get the best result. In this simple data set the direction of maximum variation is more or less clear. But for many data sets (especially high dimensional ones) such visual inspection is not adequate or even possible! So we need an objective method to find such a direction. Principal Component Analysis (PCA) is one way to do this.

dat = read.table("marks.dat",head=T)
dim(dat)
names(dat)
pc = princomp(~Stat+Phys,dat)
pc$loading

Notice the somewhat non-intuitive syntax of the princomp function. The first argument is a so-called formula object in R (we have encountered this beast in the regression tutorial). In princomp the first argument must start with a ~ followed by a list of the variables (separated by plus signs).

The output may not be readily obvious. The next diagram will help.

R has returned two principal components. (Two because we have two variables). These are a unit vector at right angles to each other. You may think of PCA as choosing a new coordinate system for the data, the principal components being the unit vectors along the axes. The first principal component gives the direction of the maximum spread of the data. The second gives the direction of maximum spread perpendicular to the first direction. These two directions are packed inside the matrix pc$loadings. Each column gives a direction. The direction of maximum spread (the first principal component) is in the first column, the next principal component in the second and so on.

There is yet more information. Type

pc

to learn the amount of spread of the data along the chosen directions. Here the spread along the first direction is 12.40. while that along the second is much smaller 1.98. These numbers are often not of main importance, it is their relative magnitude that matters.

To see all the stuff that is neatly tucked inside pc we can type

names(pc)

We shall not go into all these here. But one thing deserves mention: scores. These are the projections of the data points along the principal components. We have already mentioned that the principal components may be viewed as a new reference frame. Then the scores are the coordinates of the points w.r.t. this frame.

pc$scores

Higher dimensions

Most statisticians consider PCA a tool for reducing dimension of data. To see this consider the interactive 3D scatterplot below. It is possible to rotate this plot with the mouse. By rotating suitably we can see that the cloud of points is basically confined in a 2D plane. In other words, the data set is essentially 2D.

Drag the picture with the mouse

The same conclusion may be obtained by PCA. Here the first two components will be along the plane, while the third will be perpendicular to the plane. These are shown as the three lines.

Putting it to action

Now that we have seen how PCA can identify if the data cloud resides in a lower dimensional space, we are ready to apply our knowledge to astronomy. We shall work with the SDSS Quasar data set stored in the file SDSS_quasar.dat. First we prepare the data set for analysis.

quas = read.table("SDSS_quasar.dat",head=T)
dim(quas)
names(quas)
quas = na.omit(quas)
dim(quas)

Now we shall apply PCA.

pc = princomp(quas[,-1],scores=T)

The scores=T option will automatically compute the projections of the data along the principal component directions.

Before looking inside pc let us make a mental note of what information we should be looking for. We should look for the loadings (which are 22 mutually perpendicular directions in a 22-dimensional space). Each direction is represented by a unit vector with 22 components. So we have 22 times 22 = 484 numbers! Whew! We should also know the spread of the data cloud along each of the 22 directions. The spread along each direction is given by just a single number. So we have to just look at 22 numbers for this (lot less than 484). So we shall start by looking for these 22 numbers first. Type

pc

to see them. Well, some of these are much larger than the rest. To get an idea of the relative magnitudes, let us plot them.

plot(pc)

Incidentally, this plot is often called a screeplot, and R has a function with that name (it produces the same output as the plot command).

screeplot(pc)

By default, the spreads are shown as bars. Most textbooks, however, prefer to depict the same information as a line diagram:

screeplot(pc,type="lines")

The term ``scree'' refers to pebbles lying around the base of a cliff, and a screeplot drawn with lines makes the analogy clear. But one should not forget that the lines do not have any significance. The points are the only important things.

We can see from the screeplot that only the first 2 components account for the bulk. In other words, the 22-dimensional data cloud essentially resides in just a 2D plane! Which is that plane? The answer lies in the first two columns of pc$loadings.

pc$loading[,1:2]

These give two mutually perpendicular unit vectors defining the plane. To make sure that this is indeed the case you may check as

M = pc$loading[,1:2]
t(M) %*% M #should ideally produce the 2 by 2 identity matrix

You might like to project the entire data set onto this plane.

plot(pc$scores[,1],pc$scores[,2],pch=".")

This is how the data cloud looks like in that magic plane in 22-dimensional space. And with my limited astronomy knowledge I have no idea why these look like this!!! (An astronomy student had once told me that this pattern has to do something with the way the survey was conducted in outer space.)

Peeping behind the scree...

It may be of some interest to know how PCA works. We shall not go into all the nitty gritty details, but the basic idea is not too hard to grasp, if you know eigen values and eigen vectors.

What PCA does is, roughly speaking, computing the eigen values and eigen vectors of the covariance matrix of the data. A detailed exposition of why that is done is beyond the scope of this tutorial. But we may use R's eigen analysis tools to hack a rough imitation of princomp ourselves. We shall take the students' grades data as our running example.

First compute the covariance matrix.

dat = read.table("marks.dat",head=T)
covar = cov(dat)

Next compute the eigen values and vectors:

eig = eigen(covar)

Now compare:

val = eig$values
sqrt(val)
pc = princomp(~Stat+Phys,dat)
pc

The values will not match exactly as there are more bells and whistles inside princomp than I have cared to divulge here. But still the results are comparable.

A better way

The princomp function is numerically less stable than prcomp which is a better alternative. However, its output structure differs somewhat from that for princomp.

Exercise: Read the help on prcomp and redo the above computation on the SDSS quasar data using this function. [Solution]