Regression Analysis
Introduction
In every walk of science we come across
variables related to one another. Mathematical formulations
of such relations occupy an
important place in scientific research. When only two variables are involved we often plot a curve to show the functional relationship, as in the left-hand figure below, which depicts the relation between the length of a steel wire and the weight suspended from it.
However, what the experimenter sees in the laboratory is not this neat line. He sees a bunch of points, as shown in the right-hand figure. One idealizes these points to get the straight line.
This process of extracting an ideal (and, desirably, simple)
relation between variables based on observed data is called
regression analysis.
Often the relationship between two variables has a causal
flavor. In our example, we like to think of the
weight as causing the elongation. Thus, weight is the
variable that is under the direct control of the experimenter, so we
call it the explanatory variable. The other variable
(elongation) is called the response. If we have a single
explanatory variable and a single response (as here), then we
have a bivariate regression problem. A multiple
regression problem has more than one explanatory variable, but
(usually) a single response. We shall talk about bivariate
regression first.
Bivariate regression
A typical regression analysis proceeds in four steps:
- First we postulate a form of the relation, like a
straight line or a parabola or an exponential curve. The form
involves unknown numbers to be determined. For instance, a straight line
has equation y = a + bx, where a,b are unknown
numbers to
be determined. Typically the form of the relation is obtained by
theoretical considerations and/or looking at the scatterplot.
- Next we have to decide upon an objective criterion of
goodness of fit. For instance, in the plot below, we can see that
line A is a bad fit. But both B and C appear
equally good to the naked eye. An objective criterion for
goodness of fit is needed to choose one over the other.
- Once we have chosen the form as well as the criterion, it is
a matter of routine computation to
find a relation of our chosen form that fits the data best.
Statistics textbooks spend many pages elaborating on this
step. However, for us it is just a matter of invoking R.
- Last but not least, we have to check whether the ``best''
relation is indeed what we expected. (Common sense above routine
computation!)
Let us load the Hipparcos data set and extract the Hyades
stars. We shall do this by just invoking the script that we had
created earlier.
source("hyad.r")
attach(hip[HyadFilter,])
We shall now define a luminosity variable logL
logL = (15 - Vmag - 5 * log10(Plx)) / 2.5
Let us make a scatterplot of this variable against B.V:
plot(B.V,logL)
Well, the plot seems to indicate a relation of the form

logL = a + b B.V

So we have finished the first step.
Next we have to decide upon an
objective criterion for goodness of fit. We shall use the most
popular choice: least squares.
To understand the meaning of this consider the following graph
that shows a line drawn over the scatterplot.
[Figure: The concept of Least Squares]
From each of the data points we have drawn a vertical line segment to the
line. The length of each vertical measures how much the actual
value of logL differs from the value predicted by the
line. In the least squares method we choose the line that
minimizes the sum of squares of these errors.
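To make the criterion concrete, here is a minimal sketch in R (the values of a and b below are arbitrary guesses, not the best fit):
a = 1; b = -2
errors = logL - (a + b * B.V)   #the signed vertical distances from the line
sum(errors^2)                   #the quantity that least squares minimizes
Trying a few different pairs (a, b) by hand and comparing the resulting sums of squares gives a feel for what R will do for us automatically in the next step.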
In the third step you will invoke the lm function of R to find the best values of a and b, as follows.
bestfit = lm(logL ~ B.V)
Notice the somewhat weird notation logL ~ B.V. This means the formula

logL = a + b B.V

In fact, if we write y ~ x1 + x2 + ... + xk, then R interprets this as

y = a0 + a1 x1 + ... + ak xk,

where a0,...,ak are the parameters to be determined.
plot(B.V,logL)
abline(bestfit)
bestfit
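If you want the fitted a and b as plain numbers for later use, they can be extracted from the fitted object; a one-line sketch:
coef(bestfit)   #the fitted intercept (a) and slope (b)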
This finishes the third step. The final step is to check how good
the fit is. Remember that a fundamental aim of regression is to
be able to predict the value of the response for a given value of
the explanatory variable. So we must make sure that we are able
to make good predictions and that we cannot improve it any
further.
The very first thing to do is to plot the data and overlay the
fitted line on it, as we have already done. The points should be
close to the line without any pattern. The lack of pattern
is important, because if there is some pattern then we can
possibly improve the fit by taking this into account. In our
example there is a pattern: the points in the middle are
mostly below the line, while the points near the extremes are
above. This suggests a slight curvature. So we may be
better off with a quadratic fit.
The second thing to do is to plot the residuals. These are
the actual values of the response variable minus the fitted
values. In terms of the least squares plot shown earlier, these are
the lengths of the vertical lines representing the errors. For
points above the line the residual is positive, while for points
below the line it is negative.
The residuals are computed automatically by the lm function.
res = bestfit$resid
plot(res)
abline(h=0)
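Incidentally, for a least squares fit that includes an intercept the residuals always add up to (essentially) zero, so it is their pattern, and not their average, that carries information. A quick check:
mean(res)   #numerically zero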
Ideally the points should all be scattered equally around the
zero line. But here the points below the line look more densely
spaced.
Next we should plot the residuals against the explanatory
variable.
plot(B.V,res)
abline(h=0)
The clear U-shaped pattern is a strong indication that our
best fit can be improved by fitting a parabola
instead of a straight line.
Now that we are a bit wiser after our first attempt at regression
analysis of the Hyades data set we should go back to step 1 and
choose the parabolic form

logL = a + b B.V + c B.V^2.
This form has three unknown parameters to be determined:
a,b and c. Again we come to step 2. We shall still
choose the least squares method. The third step is almost as
before
bestfit2 = lm(logL ~ B.V + I(B.V^2))
The only unexpected piece in the above line is the I. This admittedly awkward symbol means that B.V^2 should be treated ``as is''. (What happens without this I is somewhat strange, and its explanation is beyond the present scope.)
bestfit2
summary(bestfit2)
Now let us look at the fitted curve. However, unlike the straight
line case, here we do not have any ready-made function like
abline. We shall need to draw the curve directly using
the fitted values. Here is the first attempt:
plot(B.V,logL)
lines(B.V,bestfit2$fit)
Oops! We did not want this mess! Actually, what the
lines function does is this: it plots all the supplied
points and joins them in that order. Since the values of
B.V are not sorted from small to large, the lines
get drawn haphazardly. To cure the problem we need to sort
B.V and order the fitted values accordingly.
This is pretty easy in R:
ord = order(B.V)
plot(B.V,logL)
lines(B.V[ord],bestfit2$fit[ord])
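An alternative sketch, if you prefer to avoid the sorting step, is to evaluate the fitted curve on a regular grid of B.V values using the standard predict function (the grid of 100 points is arbitrary):
grid = data.frame(B.V = seq(min(B.V), max(B.V), length=100))
plot(B.V,logL)
lines(grid$B.V, predict(bestfit2, newdata=grid), col="blue")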
This curve does seem to be a better fit than the straight line
that we fitted earlier. Let us make the residual plot. But this
time we shall use a built-in feature of R.
plot(bestfit2,3)
In fact, R can automatically make 4 different plots to assess the
performance of the fit. The 3 in the command asks R to produce
the 3rd of these. The others are somewhat more advanced in
nature. There are four points to note here.
- the vertical axis shows the square root of something called
standardized residual. These are obtained by massaging the
residuals that we were working with. The extra massaging
makes the residuals more ``comparable''
(somewhat like making the denominators equal before comparing two
fractions!)
- the horizontal axis shows the fitted values.
- the wavy red line tries to draw our attention to the general
pattern of the points. Ideally it should be a horizontal line.
- There are three points that are rather too far from the zero
line. These are potential trouble-makers (outliers) and R has
labeled them with their case numbers.
To see the values for the point
labelled 54 you
may use
B.V[54]
logL[54]
Typically it is a good idea to take a careful look at these values to
make sure there is nothing wrong with them (typos etc). Also,
these stars may indeed be special. Someone analyzing data about
the ozonosphere once stumbled across such outliers, which turned
out to be the holes in the ozone layer!
Comparing models
Sometimes we have two competing fits for the same
data. For example, we had the straight line as well as the
quadratic curve. How do we compare them? ``Choose the one
that passes closer to the points'' might look like a tempting
answer. But unfortunately there is a snag. While we want the fit
to pass close to the points, we also want to keep the equation of
the fit as simple as possible! (This is not merely out of an
ascetic love for simplicity, there are also deep statistical
implications that we cannot discuss here.) Typically, these two
criteria: simplicity and proximity to all the
points act in opposite directions. You gain one only by
sacrificing the other. There are criteria that seek to strike a
balance between the twain. We shall discuss two of them here:
Akaike's Information Criterion (AIC) and the Bayesian Information
Criterion (BIC), both computed using a function called AIC in R:
AIC(bestfit)
AIC(bestfit2)
The smaller the AIC, the better the fit. So here the quadratic
fit is indeed better. However, AIC is known to overfit:
it has a bias in favour of more complicated models. The BIC
seeks to correct this bias:
n = length(B.V)
AIC(bestfit,k=log(n)) #This computes BIC
AIC(bestfit2,k=log(n))
Well, BIC also confirms the superiority of the
quadratic. Incidentally, both the AIC and the BIC are merely
dumb mathematical formulas.
It is always important to use domain knowledge and common sense to check the fit.
In particular, here one must remove those
three outliers and redo the computation just to see if there is a
better fit.
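A rough sketch of what that would look like is given below. The case numbers must be read off your own residual plot (54 is one of them; the other two are left for you to fill in):
out = c(54)                        #put all three labelled case numbers here
keep = !(1:length(logL) %in% out)  #TRUE for the stars we retain
refit = lm(logL ~ B.V + I(B.V^2), subset = keep)
summary(refit)
Comparing this summary (and the corresponding residual plots) with the earlier ones shows whether the outliers were driving the fit.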
Robustness
Outliers are a constant source of headache for the statistician,
and astronomical data sets are well known for a high content of
outliers (after all, the celestial bodies are under no obligation
to follow petty statistical models!) In fact, detecting outliers
is sometimes the most fruitful outcome of an analysis. In order
to detect them we need statistical procedures that are dominated
by the general trend of the data (so that the outliers show up in
contrast!) A statistical process that is dominated only by the
majority of the points (and not by the outliers) is called
robust. The second step in a regression analysis (where we
choose the criterion) determines how robust our analysis will
be. The least squares criterion that we have used so far is not
at all robust.
We shall now compare this with a more robust choice that is implemented in the function lqs of the MASS package in R.
A package in R is like an add-on.
It is a set of functions and data sets (plus online help on them)
that may reside on your machine but are not automatically loaded into
R. You have to load them with the function library, like
library(MASS) #MASS is the name of the package
The package mechanism is the main way R grows, as more and more
people all over the world write and contribute packages to R.
It is possible to make R download and install necessary packages
from the internet, as we shall see later.
To keep the exposition simple we shall again
fit a straight line.
library(MASS)
fit = lm(logL ~ B.V)
robfit = lqs(logL ~ B.V)
plot(B.V,logL)
abline(fit,col="red")
abline(robfit,col="blue")
Well, the lines are different. But just by looking at this plot
there is not much reason to prefer the robust (blue) line over the
least squares (red) one. Indeed, the least squares line seems a slightly
better fit. In order to appreciate the benefit of robustness we
have to run the following script. Here you will again start with
the same two lines, but now you can add one new point to
the plot by clicking with your mouse. The plot will automatically
update the two lines.
source("robust.r")
You'll see that a single extra point (the
outlier) causes the least squares (red)
line to swing much more than the robust (blue) line. In fact, the blue
line does not seem to move at all!
Right click on the plot and select "Stop" from the pop-up menu to
come out. You may also need to click on the little red STOP
button on the R toolbar.
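If the interactive script is not available, a rough non-interactive sketch conveys the same idea: append one wild, completely made-up point and refit both lines.
x2 = c(B.V, 2.5); y2 = c(logL, 4)   #the extra point is artificial
plot(x2, y2)
abline(lm(y2 ~ x2), col="red")      #least squares gets pulled by the outlier
abline(lqs(y2 ~ x2), col="blue")    #the robust line is much less affected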
Parametric vs. nonparametric
Here we shall consider the first step (the choice of the form)
once more. The choice of the form, as we have already pointed
out, depends on the underlying theory and the general pattern of
the points. There may be situations, however, where it is
difficult to come up with a simple form. Then it is common to use
non-parametric regression, which internally uses a
complicated form with tremendous flexibility. We shall
demonstrate the use of one such method, called LOWESS, which is
implemented as the function lowess in R.
plot(B.V,logL)
fit = lowess(logL~B.V)
lines(fit)
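The amount of smoothing done by lowess is controlled by its f argument, the fraction of the data used around each point (the default is 2/3). A quick sketch with two arbitrary values:
lines(lowess(logL ~ B.V, f=0.2), col="red")    #less smoothing: a wigglier curve
lines(lowess(logL ~ B.V, f=0.9), col="blue")   #more smoothing: a flatter curve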
A newer variant called loess produces a very similar result. However, drawing the curve is slightly messier here.
fit2 = loess(logL~B.V)
ord = order(B.V)
lines(B.V[ord],fit2$fitted[ord],col="red")
Multiple regression
So far we have been working with bivariate regression, where we
have one response and one explanatory variable. In multiple
regression we work with a single response and at least two
explanatory variables. The same functions (lm, lqs, loess etc)
handle multiple regression in R. We shall not discuss multiple
regression much, owing to its fundamental similarity with the
bivariate case. We shall just show a single example, introducing
RA and DE as explanatory variables along with B.V.
fit = lm(logL~B.V+RA+DE)
summary(fit)
We shall not go any deeper, beyond pointing out that the absence
of asterisks in the RA and DE lines of the summary output means
that these extra explanatory variables are not useful here.
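If you want a more formal confirmation, the two nested fits can be compared with an F-test; a minimal sketch (a large p-value would mean that RA and DE add nothing useful):
anova(lm(logL ~ B.V), fit)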
A word of wisdom
Our exposition so far has been strictly from the users'
perspective with the focus being on commands to achieve things,
and not how these commands are internally executed. Sometimes the
same problem can be presented in different ways that are all
essentially the same to the user, but quite different for the
underlying mathematics. One such example is that R prefers the
explanatory variables to be centered, that is, shifted so that
their mean is 0. This enhances the stability of the
formulas used. So instead of
lm(logL ~ B.V)
we should do
B.V1 = B.V - mean(B.V)
lm(logL ~ B.V1)
Always center the explanatory variables before performing regression analysis. This does not apply to the response variable.
k-nearest neighbors
Imagine that you are given a data set of average parent heights
and their adult sons' heights, like this (ignore the *'s for
the time being):
Parent Son
5.5 5.9*
5.4 5.3*
5.7 5.9
5.1 5.3*
6.0 5.8
5.5 5.2*
6.1 6.0
A prospective
couple come to you, report that their average height is 5.3
feet, and want you to predict (based on this data set) the
height their son would attain
at adulthood. How should you proceed?
A pretty simple method is this: look at the cases in your data
set with average parents' height near 5.3. Say, we consider the 3
nearest cases. These are marked with *'s in the above data
set. Well, there are 4 cases, and not 3. This is because there
is a tie (the three cases 5.1, 5.5 and 5.5 are all at the same
distance from 5.3).
Now just report the average of the sons' heights for these cases.
Well, that is all there is to it in 3-NN regression (NN = Nearest
Neighbor)! Let us
implement this in R. First the data set
parent = c(5.5, 5.4, 5.7, 5.1, 6.0, 5.5, 6.1)
son = c(5.9, 5.3, 5.9, 5.3, 5.8, 5.2, 6.0)
Now the new case:
newParent = 5.3
Find the distances of the parents' heights from this new height:
d = abs(parent- newParent)
Rank these:
rnk = rank(d,tie="min")
rnk
Notice the tie="min" option (an abbreviation of ties.method="min"). It makes each of the three joint second-nearest cases get rank 2.
We are not using order
here (as it does not handle ties
gracefully). Now identify the 3 nearest cases (or more in case of
ties).
nn = (rnk<=3)
nn
Now it is just a matter of taking averages.
newSon = mean(son[nn])
newSon
It is instructive to write a function that performs the above
computation.
knnRegr = function(x,y,newX,k) {
  d = abs(x - newX)            #distances of the old x-values from the new one
  rnk = rank(d, tie = "min")   #rank the distances, handling ties as before
  mean(y[rnk <= k])            #average the y-values of the k nearest cases
}
Now we can use it like
newSon = knnRegr(parent,son,newParent,3)
Exercise:
You shall apply the function that we have just written to the Hipparcos data set.
plot(RA,pmRA)
We want to explore the relation between RA and pmRA using the k-nearest neighbour method. Our aim is to predict the value of pmRA for a new value of RA:
newRA = 90
Use the knnRegr function to do this with k=13.
Nadaraya-Watson regression
While the k-nearest neighbor method is simple and
reasonable it is nevertheless open to one objection that we illustrate
now. Suppose that we show the x-values along a number line
using crosses as below. The new x-value is shown
with an arrow. The 3 nearest neighbors are circled.
[Figure: Are the 3 NN equally relevant?]
The k-NN method now requires us to simply average the
y-values of the circled cases. But since two of the points
are rather far away from the arrow, shouldn't a weighted
average be a better method, where remote points get low
weights?
This is precisely the idea behind the Nadaraya-Watson
regression. Here we average over all the cases (not just
nearest neighbors) but cases farther away get less weight. The
weights are decided by a kernel function (typically a
function peaked at zero, and tapering off symmetrically on both
sides).
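In symbols (this is just the weighted-average idea written out), if K denotes the kernel, h the bin width, and (x1,y1),...,(xn,yn) the observed cases, the Nadaraya-Watson estimate at a new point x is

yhat(x) = [ K((x1-x)/h) y1 + ... + K((xn-x)/h) yn ] / [ K((x1-x)/h) + ... + K((xn-x)/h) ].

Every case contributes, but with a weight that shrinks as xi moves farther from x.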
[Figure: How the kernel determines the weights]
Now we may simply take the weighted average. We shall apply it to
our parent-son example first.
parent = c(5.5, 5.4, 5.7, 5.1, 6.0, 5.5, 6.1)
son = c(5.9, 5.3, 5.9, 5.3, 5.8, 5.2, 6.0)
newParent = 5.3
kernel = dnorm #this is the N(0,1) density
Now let us find the weights (for a bin width (h) = 0.5, say):
h = 0.5
wt = kernel((parent- newParent)/h)/h
Finally the weighted average:
newSon = weighted.mean(son,w = wt)
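Incidentally, weighted.mean is just shorthand for the explicit ratio above; you can check that
sum(wt * son) / sum(wt)   #gives the same value as newSon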
Exercise:
Write a function called, say, nw, like this
nw = function(x,y,newX,kernel,h) {
  #write commands here
}
[Hint: Basically collect the lines that we used just now.]
Now use it to predict the value of pmRA when RA is 90 (use h=0.2):
newRA = 90
newpmRA = nw(RA,pmRA,newRA,dnorm,0.2)