dat = read.table("marks.dat",head=T) dim(dat) names(dat) pc = princomp(~Stat+Phys,dat) pc$loadingNotice the somewhat non-intuitive syntax of the
princomp
function. The first argument is a so-called
formula object in R (we have encountered this beast in the
regression tutorial). In princomp
the first argument
must start with a ~ followed by a list of the variables
(separated by plus signs).
The output may not be readily obvious. The next diagram will
help.
pc$loadings
. Each column gives a direction. The
direction of maximum spread (the first principal component) is in
the first column, the next principal component in the second and
so on.
There
is yet more information. Type
pcto learn the amount of spread of the data along the chosen directions. Here the spread along the first direction is 12.40. while that along the second is much smaller 1.98. These numbers are often not of main importance, it is their relative magnitude that matters. To see all the stuff that is neatly tucked inside
pc
we can type
names(pc)We shall not go into all these here. But one thing deserves mention:
scores
. These are the projections of the
data points along the principal components. We have already
mentioned that the
principal components may be viewed as a new reference frame. Then
the scores
are the coordinates of the points w.r.t. this frame.
pc$scores
Drag the picture with the mouse |
quas = read.table("SDSS_quasar.dat",head=T) dim(quas) names(quas) quas = na.omit(quas) dim(quas)Now we shall apply PCA.
pc = princomp(quas[,-1],scores=T)The
scores=T
option will automatically compute the
projections of the data along the principal component directions.
Before looking inside pc
let us make a mental note
of what information we should be looking for. We should look for
the loadings (which are 22 mutually perpendicular directions in a
22-dimensional space). Each direction is represented by a unit
vector with 22 components. So we have 22 times 22 = 484 numbers!
Whew! We should also know the spread of the data cloud along each
of the 22 directions. The spread along each direction is given by
just a single number. So we have to just look at 22 numbers for
this (lot less than
484). So we shall start by looking for these 22 numbers
first. Type
pcto see them. Well, some of these are much larger than the rest. To get an idea of the relative magnitudes, let us plot them.
plot(pc)Incidentally, this plot is often called a screeplot, and R has a function with that name (it produces the same output as the
plot
command).
screeplot(pc)By default, the spreads are shown as bars. Most textbooks, however, prefer to depict the same information as a line diagram:
screeplot(pc,type="lines")The term ``scree'' refers to pebbles lying around the base of a cliff, and a screeplot drawn with lines makes the analogy clear. But one should not forget that the lines do not have any significance. The points are the only important things. We can see from the screeplot that only the first 2 components account for the bulk. In other words, the 22-dimensional data cloud essentially resides in just a 2D plane! Which is that plane? The answer lies in the first two columns of
pc$loadings
.
pc$loading[,1:2]These give two mutually perpendicular unit vectors defining the plane. To make sure that this is indeed the case you may check as
M = pc$loading[,1:2] t(M) %*% M #should ideally produce the 2 by 2 identity matrixYou might like to project the entire data set onto this plane.
plot(pc$scores[,1],pc$scores[,2],pch=".")This is how the data cloud looks like in that magic plane in 22-dimensional space. And with my limited astronomy knowledge I have no idea why these look like this!!! (An astronomy student had once told me that this pattern has to do something with the way the survey was conducted in outer space.)
princomp
ourselves. We shall take
the students' grades data as our running example.
First compute the covariance matrix.
dat = read.table("marks.dat",head=T) covar = cov(dat)Next compute the eigen values and vectors:
eig = eigen(covar)Now compare:
val = eig$values sqrt(val) pc = princomp(~Stat+Phys,dat) pcThe values will not match exactly as there are more bells and whistles inside
princomp
than I have cared to
divulge here. But still the results are comparable.
princomp
function is numerically less stable than
prcomp
which is a better alternative. However, its
output structure differs somewhat from that for
princomp
.
Exercise:
Read the help on prcomp and redo the above computation
on the SDSS quasar data using this function. [Solution]
|