dat = read.table("marks.dat",head=T) dim(dat) names(dat) pc = princomp(~Stat+Phys,dat) pc$loadingNotice the somewhat non-intuitive syntax of the
princompfunction. The first argument is a so-called formula object in R (we have encountered this beast in the regression tutorial). In
princompthe first argument must start with a ~ followed by a list of the variables (separated by plus signs). The output may not be readily obvious. The next diagram will help.
pc$loadings. Each column gives a direction. The direction of maximum spread (the first principal component) is in the first column, the next principal component in the second and so on. There is yet more information. Type
pcto learn the amount of spread of the data along the chosen directions. Here the spread along the first direction is 12.40. while that along the second is much smaller 1.98. These numbers are often not of main importance, it is their relative magnitude that matters. To see all the stuff that is neatly tucked inside
pcwe can type
names(pc)We shall not go into all these here. But one thing deserves mention:
scores. These are the projections of the data points along the principal components. We have already mentioned that the principal components may be viewed as a new reference frame. Then the scores are the coordinates of the points w.r.t. this frame.
|Drag the picture with the mouse|
quas = read.table("SDSS_quasar.dat",head=T) dim(quas) names(quas) quas = na.omit(quas) dim(quas)Now we shall apply PCA.
pc = princomp(quas[,-1],scores=T)The
scores=Toption will automatically compute the projections of the data along the principal component directions. Before looking inside
pclet us make a mental note of what information we should be looking for. We should look for the loadings (which are 22 mutually perpendicular directions in a 22-dimensional space). Each direction is represented by a unit vector with 22 components. So we have 22 times 22 = 484 numbers! Whew! We should also know the spread of the data cloud along each of the 22 directions. The spread along each direction is given by just a single number. So we have to just look at 22 numbers for this (lot less than 484). So we shall start by looking for these 22 numbers first. Type
pcto see them. Well, some of these are much larger than the rest. To get an idea of the relative magnitudes, let us plot them.
plot(pc)Incidentally, this plot is often called a screeplot, and R has a function with that name (it produces the same output as the
screeplot(pc)By default, the spreads are shown as bars. Most textbooks, however, prefer to depict the same information as a line diagram:
screeplot(pc,type="lines")The term ``scree'' refers to pebbles lying around the base of a cliff, and a screeplot drawn with lines makes the analogy clear. But one should not forget that the lines do not have any significance. The points are the only important things. We can see from the screeplot that only the first 2 components account for the bulk. In other words, the 22-dimensional data cloud essentially resides in just a 2D plane! Which is that plane? The answer lies in the first two columns of
pc$loading[,1:2]These give two mutually perpendicular unit vectors defining the plane. To make sure that this is indeed the case you may check as
M = pc$loading[,1:2] t(M) %*% M #should ideally produce the 2 by 2 identity matrixYou might like to project the entire data set onto this plane.
plot(pc$scores[,1],pc$scores[,2],pch=".")This is how the data cloud looks like in that magic plane in 22-dimensional space. And with my limited astronomy knowledge I have no idea why these look like this!!! (An astronomy student had once told me that this pattern has to do something with the way the survey was conducted in outer space.)
princompourselves. We shall take the students' grades data as our running example. First compute the covariance matrix.
dat = read.table("marks.dat",head=T) covar = cov(dat)Next compute the eigen values and vectors:
eig = eigen(covar)Now compare:
val = eig$values sqrt(val) pc = princomp(~Stat+Phys,dat) pcThe values will not match exactly as there are more bells and whistles inside
princompthan I have cared to divulge here. But still the results are comparable.
princompfunction is numerically less stable than
prcompwhich is a better alternative. However, its output structure differs somewhat from that for
Read the help on