Regression Analysis

Introduction

In every walk of science we come across variables related to one another. Mathematical formulations of such relations occupy an important place in scientific research. When only two variables are involved we often plot a curve to show functional relationships as in the left hand figure below which depicts the relation between the length of a steel wire and the weight suspended from it.

However, what the experimenter sees in the laboratory is not this neat line. He sees a bunch of points as shown in the right hand figure. One idealizes these points to get the straight line. This process of extracting an ideal (and, desirably, simple) relation between variables based on observed data is called regression analysis.

Often the relationship between two variables has a causal flavor. For example, here we like to think of the weight as causing the elongation. Thus, weight is the variable that is under direct control of the experimenter. So we call it the explanatory variable. The other variable (elongation) is called the response. If we have a single explanatory variable and a single response (as here), then we have a bivariate regression problem. A multiple regression problem has more than one explanatory variable, but (usually) a single response. We shall talk about bivariate regression first.

Bivariate regression

A typical regression analysis proceeds in four steps:
  1. First we postulate a form of the relation, like a straight line or a parabola or an exponential curve. The form involves unknown numbers to be determined. For instance, a straight line has equation y = a + bx, where a,b are unknown numbers to be determined. Typically the form of the relation is obtained by theoretical considerations and/or looking at the scatterplot.
  2. Next we have to decide upon an objective criterion of goodness of fit. For instance, in the plot below, we can see that line A is a bad fit. But both B and C appear equally good to the naked eye. An objective criterion for goodness of fit is needed to choose one over the other.

  3. Once we have chosen the form as well as the criterion, it is a matter of routine computation to find a relation of our chosen form that fits the data best. Statistics textbooks spend many pages elaborating on this step. However, for us it is just a matter of invoking R.
  4. Last but not least, we have to check whether the ``best'' relation is indeed what we expected. (Common sense above routine computation!)
Let us load the Hipparcos data set and extract the Hyades stars. We shall do this by just invoking the script that we had created earlier.
source("hyad.r")
attach(hip[HyadFilter,])
We shall now define a luminosity variable logL
logL = (15 - Vmag - 5 * log10(Plx)) / 2.5
Let us make a scatterplot of this variable against B.V
plot(B.V,logL)
Well, the plot seems to indicate a relation of the form
logL = a + b B.V
So we have finished the first step.

Next we have to decide upon an objective criterion for goodness of fit. We shall use the most popular choice: least squares. To understand the meaning of this consider the following graph that shows a line drawn over the scatterplot.

The concept of Least Squares

From each of the data points we have drawn a vertical to the line. The lengths of these verticals measure how much the actual values of logL differ from those predicted by the line. In the Least Squares method we try to choose a line that minimizes the sum of squares of these errors.

Exercise: To get an idea of this run the script
source("lsq.r")
The scripts used in these tutorials often contain relatively advanced R techniques. You may like to explore their contents to continue learning R after finishing the tutorials.
The script will open a small window like this:

Move the sliders to change a,b

Here you can interactively play with a and b to minimize the amount of error.
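If you prefer a non-interactive check, the quantity being minimized is easy to compute yourself. Here is a minimal sketch (the helper sse is ours, and the values of a and b passed to it are arbitrary guesses, not the fitted values):
sse = function(a, b) sum((logL - (a + b*B.V))^2, na.rm=TRUE) #sum of squared vertical errors
sse(0, 5) #one guess for (a,b)
sse(0.5, 4) #another guess; the smaller the value, the better the fit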

In the third step you will invoke the lm function of R to find the best values of a and b as follows.
bestfit = lm(logL ~ B.V)
Notice the somewhat weird notation logL ~ B.V. This means the formula
logL = a + b B.V
In fact, if we write y ~ x1+x2+...+xk then R interprets this as
y = a0 + a1 x1 + ... + ak xk,
where a0,...,ak are the parameters to be determined.
plot(B.V,logL)
abline(bestfit)
bestfit
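Printing bestfit displays the estimated values of a (the intercept) and b (the slope). If you want to use them in further computations, they can be extracted with the coef function:
coef(bestfit) #both coefficients at once
a = coef(bestfit)[1] #the intercept a
b = coef(bestfit)[2] #the slope b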
This finishes the third step. The final step is to check how good the fit is. Remember that a fundamental aim of regression is to be able to predict the value of the response for a given value of the explanatory variable. So we must make sure that we are able to make good predictions and that we cannot improve it any further.
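Incidentally, once bestfit is available we can already make such predictions with the predict function. A minimal sketch; the value B.V = 0.5 is an arbitrary choice for illustration:
predict(bestfit, data.frame(B.V = 0.5)) #predicted logL at B.V = 0.5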

The very first thing to do is to plot the data and overlay the fitted line on it, as we have already done. The points should be close to the line without any pattern. The lack of pattern is important, because if there is some pattern then we can possibly improve the fit by taking this into account. In our example there is a pattern: the points in the middle are mostly below the line, while the points near the extremes are above. This suggests a slight curvature. So we may be better off with a quadratic fit.

The second thing to do is to plot the residuals. These are the actual values of the response variable minus the fitted values. In terms of the least squares plot shown earlier, they are the lengths of the vertical error lines taken with a sign: positive for points above the line and negative for points below it.

The residuals are computed by the lm function automatically.
res = bestfit$resid
plot(res)
abline(h=0)
Ideally the points should all be scattered equally around the zero line. But here the points below the line look more densely spaced.
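Incidentally, for a least squares fit with an intercept the residuals always average to exactly zero, so any asymmetry we see concerns how the points spread around the line, not their mean. A quick check:
mean(res) #essentially zero, up to rounding error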

Next we should plot the residuals against the explanatory variable.
plot(B.V,res)
abline(h=0)
The clear U-shaped pattern is a strong indication that our best fit can be improved further by fitting a parabola instead of a straight line.

Now that we are a bit wiser after our first attempt at regression analysis of the Hyades data set we should go back to step 1 and choose the parabolic form
logL = a + b B.V + c B.V^2.
This form has three unknown parameters to be determined: a,b and c. Again we come to step 2. We shall still choose the least squares method. The third step is almost as before
bestfit2 = lm(logL ~ B.V + I(B.V^2))
The only unexpected piece in the above line is the I. This admittedly awkward symbol means that B.V^2 should be treated as is. (What happens without the I is somewhat strange, and its explanation is beyond the present scope.)
bestfit2
summary(bestfit2)
Now let us look at the fitted curve. However, unlike the straight line case, here we do not have a ready-made function like abline. We shall need to draw the curve directly using the fitted values. Here is the first attempt:
plot(B.V,logL)
lines(B.V,bestfit2$fit)
Oops! We did not want this mess! Actually, what the lines function does is this: it plots all the supplied points and joins them in that order. Since the values of B.V are not sorted from small to large, the lines get drawn haphazardly. To cure the problem we need to sort B.V and order the fitted values accordingly. This is pretty easy in R:
ord = order(B.V)
plot(B.V,logL)
lines(B.V[ord],bestfit2$fit[ord])
This curve does seem to be a better fit than the straight line that we fitted earlier. Let us make the residual plot. But this time we shall use a built-in feature of R.
plot(bestfit2,3)
In fact, R can automatically make 4 different plots to assess the performance of the fit. The 3 in the command asks R to produce the 3rd of these. The others are somewhat more advanced in nature. There are a few points to note here.
  1. the vertical axis shows the square root of something called the standardized residual. These are obtained by massaging the residuals that we were working with. The extra massaging makes the residuals more ``comparable'' (somewhat like making the denominators equal before comparing two fractions!). See the short sketch after this list for how to compute them yourself.
  2. the horizontal axis shows the fitted values.
  3. the wavy red line tries to draw our attention to the general pattern of the points. Ideally it should be a horizontal line.
  4. There are three points that are rather too far from the zero line. These are potential trouble-makers (outliers) and R has labeled them with their case numbers.
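The standardized residuals mentioned in point 1 are available through the rstandard function, so the plot above can also be reproduced by hand. A minimal sketch (rstandard and fitted are standard accessors for lm fits):
plot(fitted(bestfit2), sqrt(abs(rstandard(bestfit2)))) #essentially what plot(bestfit2,3) draws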
To see the values for the point labelled 54 you may use
B.V[54]
logL[54]
Typically it is a good idea to take a careful look at these values to make sure there is nothing wrong with them (typos etc.). Also, these stars may indeed be special. Someone analyzing data about the ozonosphere once stumbled across such outliers, and they turned out to be the holes in the ozone layer!

Comparing models

Sometimes we have two competing fits for the same data. For example, we had the straight line as well as the quadratic curve. How do we compare them? ``Choose the one that passes closer to the points'' might look like a tempting answer. But unfortunately there is a snag. While we want the fit to pass close to the points, we also want to keep the equation of the fit as simple as possible! (This is not merely out of an ascetic love for simplicity; there are also deep statistical implications that we cannot discuss here.) Typically these two criteria, simplicity and proximity to the points, act in opposite directions: you gain one only by sacrificing the other. There are criteria that seek to strike a balance between the two. We shall discuss two of them here: Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC), both computed using a function called AIC in R:
AIC(bestfit)
AIC(bestfit2)
The smaller the AIC the better is the fit. So here the quadratic fit is indeed better. However, AIC is known to overfit: it has a bias to favour more complicated models. The BIC seeks to correct this bias:
n = length(B.V)
AIC(bestfit,k=log(n)) #This computes BIC
AIC(bestfit2,k=log(n)) 
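As a side note, R also provides a BIC function that computes the same quantity directly; the following should agree with the values above (provided no observations were dropped because of missing values):
BIC(bestfit)
BIC(bestfit2)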
Well, BIC also confirms the superiority of the quadratic. Incidentally, both the AIC and the BIC are merely dumb mathematical formulas.
It is always important to use domain knowledge and common sense to check the fit.
In particular, here one must remove those three outliers and redo the computation just to see if there is a better fit.
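Here is a minimal sketch of how one might redo the quadratic fit without them. The vector outlierCases below is a placeholder: 54 is the only case number quoted above, so you will need to add the other labelled case numbers from your own plot.
outlierCases = c(54) #placeholder: add the other labelled case numbers here
keep = !(seq_along(logL) %in% outlierCases) #TRUE for the cases we retain
bestfit3 = lm(logL ~ B.V + I(B.V^2), subset = keep)
summary(bestfit3)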

Robustness

Outliers are a constant source of headache for the statistician, and astronomical data sets are well known for a high content of outliers (after all, the celestial bodies are under no obligation to follow petty statistical models!) In fact, detecting outliers is sometimes the most fruitful outcome of an analysis. In order to detect them we need statistical procedures that are dominated by the general trend of the data (so that the outliers show up in contrast!) A statistical process that is dominated only by the majority of the points (and not by the outliers) is called robust. The second step in a regression analysis (where we choose the criterion) determines how robust our analysis will be. The least squares criterion that we have used so far is not at all robust.

We shall now compare this with a more robust choice that is implemented in the function lqs of the MASS package in R.
A package in R is like an add-on: a set of functions and data sets (plus online help for them) that may reside on your machine but are not automatically loaded into R. You have to load a package with the library function, like this:
library(MASS) #MASS is the name of the package
The package mechanism is the main way R grows, as people all over the world write and contribute packages to R. It is possible to make R download and install the necessary packages from the internet, as we shall see later.
To keep the exposition simple we shall again fit a straight line.
library(MASS)
fit = lm(logL ~ B.V)
robfit = lqs(logL ~ B.V)
plot(B.V,logL)
abline(fit,col="red")
abline(robfit,col="blue")
Well, the lines are different. But just by looking at this plot there is not much reason to prefer the robust (blue) line over the least squares (red) one. Indeed, the least squares line seems a slightly better fit. In order to appreciate the benefit of robustness we have to run the following script. Here you will again start with the same two lines, but now you can add a new point to the plot by clicking with your mouse. The plot will automatically update the two lines.
source("robust.r")
You'll see that a single extra point (the outlier) causes the least squares (red) line to swing more than the robust (blue) line. In fact, the blue line hardly seems to move at all!

Right click on the plot and select "Stop" from the pop-up menu to come out. You may also need to click on the little red STOP button in R toolbar.
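If you cannot run the interactive script, the same effect can be seen non-interactively. The sketch below appends a single artificial outlier (its coordinates are simply made up) and refits both lines:
B.V.out = c(B.V, 1.0) #an arbitrary B.V value for the fake outlier
logL.out = c(logL, 4.0) #a wildly large logL value
plot(B.V.out, logL.out)
abline(lm(logL.out ~ B.V.out), col="red") #least squares fit including the outlier
abline(lqs(logL.out ~ B.V.out), col="blue") #robust fit including the outlier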

Parametric vs. nonparametric

Here we shall consider the first step (the choice of the form) once more. The choice of the form, as we have already pointed out, depends on the underlying theory and the general pattern of the points. There may be situations, however, where it is difficult to come up with a simple form. Then it is common to use nonparametric regression, which internally uses a complicated form with tremendous flexibility. We shall demonstrate the use of one such method called LOWESS, which is implemented as the function lowess in R.
plot(B.V,logL)
fit = lowess(logL~B.V)
lines(fit)
A newer variant called loess produces a very similar result. However, drawing the line is slightly messier here.
fit2 = loess(logL~B.V)
ord = order(B.V)
lines(B.V[ord],fit2$fitted[ord],col="red")
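One advantage of loess over lowess is that it returns a proper model object, so we can predict the response at new values of the explanatory variable. A minimal sketch (the value 0.5 is an arbitrary choice):
predict(fit2, data.frame(B.V = 0.5))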

Multiple regression

So far we have been working with bivariate regression, where we have one response and one explanatory variable. In multiple regression we work with a single response and at least two explanatory variables. The same functions (lm, lqs, loess etc) handle multiple regression in R. We shall not discuss much of multiple regression owing to its fundamental similarity with the bivariate case. We shall just show a single example introducing RA and DE as explanatory variables as well as B.V.
fit = lm(logL~B.V+RA+DE)
summary(fit)
We shall not go any deeper, beyond pointing out that the absence of asterisks in the RA and DE lines means that these extra explanatory variables are not useful here.
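If you want a more formal comparison, one option is to test the larger model against the earlier bivariate fit with a nested-model F test via anova (a sketch; it relies on the usual linear model assumptions):
fitSmall = lm(logL ~ B.V) #the bivariate fit from before
anova(fitSmall, fit) #does adding RA and DE improve the fit significantly?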

A word of wisdom

Our exposition so far has been strictly from the user's perspective, with the focus on commands to achieve things, not on how these commands are executed internally. Sometimes the same problem can be presented in different ways that are all essentially the same to the user, but quite different for the underlying mathematics. One such example is that R prefers the explanatory variables to be centered, that is, shifted so that their mean is 0. This enhances the numerical stability of the computations. So instead of
lm(logL ~ B.V)
we should do
B.V1 = B.V - mean(B.V)
lm(logL ~ B.V1)
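To convince yourself that centering changes only the intercept and not the slope, compare the two sets of coefficients (a quick check):
coef(lm(logL ~ B.V)) #original explanatory variable
coef(lm(logL ~ B.V1)) #centered explanatory variable: same slope, different intercept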
Always center the explanatory variables before performing regression analysis. This does not apply to the response variable.

k-nearest neighbors

Imagine that you are given a data set of average parent heights and their adult sons' heights, like this (ignore the *'s for the time being):

Parent  Son
5.5     5.9*
5.4     5.3*
5.7     5.9
5.1     5.3*
6.0     5.8
5.5     5.2*
6.1     6.0     

A prospective couple come to you, report that their average height is 5.3 feet, and want you to predict (based on this data set) the height their son would attain at adulthood. How should you proceed?

A pretty simple method is this: look at the cases in your data set with average parents' height near 5.3. Say we consider the 3 nearest cases. These are marked with *'s in the above data set. Well, there are 4 such cases, and not 3. This is because there is a tie (the three parent heights 5.1, 5.5 and 5.5 are all at the same distance from 5.3).

Now just report the average of the sons' heights for these cases: (5.9 + 5.3 + 5.3 + 5.2)/4 = 5.425 feet.

Well, that is all there is to it in 3-NN regression (NN = Nearest Neighbor)! Let us implement this in R. First the data set
parent = c(5.5, 5.4, 5.7, 5.1, 6.0, 5.5, 6.1)
son = c(5.9, 5.3, 5.9, 5.3, 5.8, 5.2, 6.0)
Now the new case:
newParent = 5.3
Find the distances of the parents' heights from this new height:
d = abs(parent- newParent)
Rank these:
rnk = rank(d,tie="min")
rnk
Notice the tie="min" option. This allows the three joint seconds to be each given rank 2. We are not using order here (as it does not handle ties gracefully). Now identify the 3 nearest cases (or more in case of ties).
nn = (rnk<=3)
nn
Now it is just a matter of taking averages.
newSon = mean(son[nn])
newSon
It is instructive to write a function that performs the above computation.
knnRegr = function(x,y,newX,k) {
   d = abs(x-newX)          #distances from the new x-value
   rnk = rank(d,tie="min")  #rank the distances; ties share the minimum rank
   mean(y[rnk<=k])          #average y over the k nearest cases (more if there are ties)
}
Now we can use it like
newSon = knnRegr(parent,son,newParent,3)

Exercise: Now apply the function that we have just written to the Hipparcos data set.
plot(RA,pmRA)
We want to explore the relation between RA and pmRA using the k-nearest neighbour method. Our aim is to predict the value pmRA based on a new value of RA:
newRA = 90.
Use the knnRegr function to do this with k=13.

Nadaraya-Watson regression

While the k-nearest neighbor method is simple and reasonable it is nevertheless open to one objection that we illustrate now. Suppose that we show the x-values along a number line using crosses as below. The new x-value is shown with an arrow. The 3 nearest neighbors are circled.

Are the 3 NN equally relevant?

The k-NN method now requires us to simply average the y-values of the circled cases. But since two of the points are rather far away from the arrow, shouldn't a weighted average be a better method, where remote points get low weights?

This is precisely the idea behind the Nadaraya-Watson regression. Here we average over all the cases (not just nearest neighbors) but cases farther away get less weight. The weights are decided by a kernel function (typically a function peaked at zero, and tapering off symmetrically on both sides).
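In symbols: if K denotes the kernel and h the bandwidth, then the prediction at a new point x is the weighted average
newY = sum(wi yi) / sum(wi), where wi = K((xi - x)/h),
and the sum runs over all the cases (xi, yi) in the data set.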

How the kernel determines the weights

Now we may simply take the weighted average. We shall apply it to our parent-son example first.
parent = c(5.5, 5.4, 5.7, 5.1, 6.0, 5.5, 6.1)
son = c(5.9, 5.3, 5.9, 5.3, 5.8, 5.2, 6.0)
newParent = 5.3
kernel = dnorm #this is the N(0,1) density
Now let us find the weights (for a bandwidth h = 0.5, say):
h = 0.5
wt = kernel((parent- newParent)/h)/h
Finally the weighted average:
newSon = weighted.mean(son,w = wt)

Exercise: Write a function called, say, nw, like this
nw = function(x,y,newX,kernel,h) {
   #write commands here 
}
[Hint: Basically collect the lines that we used just now.]

Now use it to predict the value of pmRA when RA is 90 (use h=0.2):
newRA = 90
newpmRA = nw(RA,pmRA,newRA,dnorm,0.2)