Nonparametric statistics
This is the practical lab tutorial for nonparametric
statistics. But we shall start with a little theory that will be
crucial for understanding the underlying concepts.
Distributions
In statistics we often come across questions like
``Does a given data set come from such-and-such a distribution?''
Before we can answer this question we need to know what is meant
by a distribution. We shall talk about this first.
In statistics we work with random
variables, whose values are controlled by chance.
The distribution of a random variable is the rule by which
chance governs its values.
If you know the distribution then you know the
chance of the random variable taking a value in any given range.
This rule may be expressed in various ways. One popular way is
via the Cumulative Distribution Function (CDF), which we
illustrate now. If a random variable X has a distribution with CDF
F(x), then for any given number a the chance that
X will be ≤ a is F(a).
| Example:
Suppose that I choose a random star from the Hipparcos data
set. If I tell you that its Vmag is a random variable
with the following CDF then what is the chance that it takes values
≤ 10? Also find out the value x such that
the chance of Vmag being ≤ x is 0.4.
For the first question the required probability is F(10)
which, according to the
graph, is slightly above 0.8.
In the second part we need F(x) = 0.4. From the graph it
seems to be about 7.5.
Just to make sure that these are indeed meaningful, let us load
the Hipparcos data set and check:
hip = read.table("HIP.dat", header = TRUE, fill = TRUE)
attach(hip)
n = length(Vmag) #total number of cases
count = sum(Vmag <= 10) #how many <= 10
count/n #should be slightly above 0.8
This checks the answer to the first question. For the second:
count = sum(Vmag <= 7.5) #how many <= 7.5
count/n #should be around 0.4
|
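Incidentally, R also has a built-in function ecdf that computes the empirical CDF of a data vector. This aside is an addition to the tutorial, but it should give the same proportions as the counting we just did:
Fn = ecdf(Vmag)  #the empirical CDF, as a function
Fn(10)           #proportion of stars with Vmag <= 10, same as count/n above
Fn(7.5)          #should again be around 0.4
plot(Fn)         #a picture of the empirical CDF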
So you see how powerful a CDF is: in a sense it stores as much
information as the entire Vmag
data. It is a
common practice in statistics to regard the distribution as the
ultimate truth behind the data. We want to infer about the
underlying distribution based on the data.
R has many standard distributions already built into it. This
basically means that R has functions to compute their CDFs.
These functions all start with the letter p.
| Example:
A random variable has the standard Gaussian distribution. We know
that R computes its CDF using the function pnorm. How do we
find the probability that the random variable takes a value
≤ 1?
The answer may be found as follows:
pnorm(1)
|
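If you want to see what this CDF looks like (an optional extra, not part of the original example), you can plot it:
curve(pnorm(x), from = -3, to = 3)  #the standard Gaussian CDF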
For every p-function there is a
q-function that is basically its inverse.
| Example:
A random variable has the standard Gaussian distribution. Find
x such that the random variable is ≤ x with
chance 0.3.
Now we shall use the function qnorm :
qnorm(0.3)
|
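To see the inverse relation in action (a small check added here), apply pnorm after qnorm:
pnorm(qnorm(0.3))  #gives back 0.3
qnorm(pnorm(1))    #gives back 1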
OK, now that we are through our little theory session, we are
ready for the nonparametric lab.
One sample nonparametric tests
Sign-test
In a sense this is the simplest of all tests. Here
we shall consider the data set LMC.dat that stores the measured
distances to the Large Magellanic Cloud.
LMC = read.table("LMC.dat", header = TRUE)
data = LMC[,2]  #these are the measurements
data
We want to test whether the measurements exceed 18.41 on
average. This does not mean asking whether the average of the
data exceeds 18.41, which is trivial to check. The question here
is actually about the underlying distribution. We want to know if
the median of the underlying distribution exceeds 18.41, which is
a less trivial question, since we are not given that
distribution.
We shall use the sign test to see if the median is 18.41 or larger.
First we find how many of the
observations are above this value:
abv = sum(data>18.41)
abv
Clearly, if this number is large, then we should think that the
median (of the underlying distribution) exceeds 18.41. The
question is how large is ``large
enough''? For this we consult the binomial distribution to get
the p-value. The rationale behind this should come from
the theory class.
n = nrow(LMC)
pValue = 1 - pbinom(abv-1, n, 0.5)  #P(at least abv out of n, each with chance 0.5)
pValue
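As a cross-check (this is an addition to the tutorial), R's built-in binom.test should report the same one-sided p-value:
binom.test(abv, n, p = 0.5, alternative = "greater")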
We shall learn more about p-values in a later
tutorial. For now, we shall use the following rule of thumb:
If this p-value is below 0.05, say, we shall conclude that
the median is indeed larger than 18.41.
Wilcoxon's Signed Rank Test
As you should already know from the theoretical class, there is a
test called Wilcoxon's Signed Rank test that is better (if a little
more complicated) than the sign test. R provides the function
wilcox.test
for this purpose. We shall work with the
Hipparcos data set this time, which we have already loaded.
We want to see if the median of the distribution of the
pmRA
variable is 0 or not.
wilcox.test(pmRA, mu=0)
Since the p-value is pretty large (above 0.05, say) we
shall not reject the hypothesis that the median is 0. R itself
has come to the same conclusion.
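If you want just the number itself (a small aside added here), you can extract the p-value from the output:
wilcox.test(pmRA, mu = 0)$p.value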
Incidentally, there is a little caveat. For Wilcoxon's Signed Rank Test
to be applicable we need the underlying distribution to be
symmetric around the median. To get an idea about this we should
draw the histogram.
hist(pmRA)
Well, it looks pretty symmetric (around a centre line).
But would you apply Wilcoxon's
Signed Rank Test to the variable DE?
hist(DE)
No, this is not at all symmetric!
Kolmogorov-Smirnov test
The theory class has talked about a non-parametric test called
the Kolmogorov-Smirnov test. We shall cover it in a later tutorial.
k-nearest neighbours
Imagine that you are given a data set of average parent heights
and their adult sons' heights, like this (ignore the asterisks for
the time being):
Parent  Son
  5.5   5.9  *
  5.4   5.3  *
  5.7   5.9
  5.1   5.3  *
  6.0   5.8
  5.5   5.2  *
  6.1   6.0
A prospective
couple comes to you, reports that their average height is 5.3
feet, and wants you to predict (based on this data set) the
height their son would attain
at adulthood. How should you proceed?
A pretty simple method is this: look at the cases in your data
set with average parents' height near 5.3. Say, we consider the 3
nearest cases. These are marked with an asterisk in the above data
set. Well, there are 4 cases, and not 3. This is because there
is a tie (the three cases 5.1, 5.5, 5.5 are all at the same
distance from 5.3).
Now just report the average of the sons' heights for these cases.
Well, that is all there is to it in 3-NN regression (NN = Nearest
Neighbour)! Let us
implement this in R. First the data set:
parent = c(5.5, 5.4, 5.7, 5.1, 6.0, 5.5, 6.1)
son = c(5.9, 5.3, 5.9, 5.3, 5.8, 5.2, 6.0)
Now the new case:
newParent = 5.3
Find the distances of the parents' heights from this new height:
d = abs(parent - newParent)
Rank these:
rnk = rank(d, ties.method = "min")
rnk
Notice the ties.method = "min"
option. This allows
the three joint seconds to each be given rank 2.
We are not using order
here (as it does not handle ties
gracefully). Now identify the 3 nearest cases (or more in case of
ties).
nn = (rnk<=3)
nn
Now it is just a matter of taking averages.
newSon = mean(son[nn])
newSon
It is instructive to write a function that performs the above
computation.
knnRegr = function(x, y, newX, k) {
  d = abs(x - newX)                   #distances from the new x-value
  rnk = rank(d, ties.method = "min")  #rank the distances, ties share the minimum rank
  mean(y[rnk <= k])                   #average y over the k nearest cases (more if tied)
}
Now we can use it like this:
newSon = knnRegr(parent,son,newParent,3)
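As a quick sanity check (an addition to the tutorial), this should reproduce the value we computed step by step above:
newSon  #5.425, the same answer as the manual computation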
| Exercise:
You shall apply the function that we have just written to the
Hipparcos data set.
plot(RA,pmRA)
We want to explore the relation between RA and
pmRA using the k-nearest neighbour
method.
Our aim is to predict the value of pmRA based on a
new value of RA:
newRA = 90.
Use the knnRegr function to do this with k=13.
|
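One possible way to carry out this exercise (a sketch, assuming the Hipparcos columns are still attached from before):
newRA = 90
newpmRA = knnRegr(RA, pmRA, newRA, 13)
newpmRA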
Nadaraya-Watson regression
While the k-nearest neighbour method is simple and
reasonable, it is nevertheless open to one objection, which we
illustrate now. Suppose that we show the x-values along a number line
using crosses as below. The new x-value is shown
with an arrow. The 3 nearest neighbours are circled.
[Figure: Are the 3 NN equally relevant?]
The k-NN method now requires us to simply average the
y-values of the circled cases. But since two of the points
are rather far away from the arrow, shouldn't a weighted
average be a better method, where remote points get low
weights?
This is precisely the idea behind the Nadaraya-Watson
regression. Here we average over all the cases (not just
nearest neighbours) but cases farther away get less weight. The
weights are decided by a kernel function (typically a
function peaked at zero, and tapering off symmetrically on both
sides).
[Figure: How the kernel determines the weights]
Now we may simply take the weighted average. We shall apply it to
our parent-son example first.
parent = c(5.5, 5.4, 5.7, 5.1, 6.0, 5.5, 6.1)
son = c(5.9, 5.3, 5.9, 5.3, 5.8, 5.2, 6.0)
newParent = 5.3
kernel = dnorm #this is the N(0,1) density
Now let us find the weights (for a bandwidth h = 0.5, say):
h = 0.5
wt = kernel((parent - newParent)/h)/h
Finally the weighted average:
newSon = weighted.mean(son,w = wt)
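For the record (this remark is an addition), weighted.mean(son, w = wt) computes sum(wt * son) / sum(wt); the division by h in the weights cancels between numerator and denominator, so it does not affect the answer.
sum(wt * son) / sum(wt)  #same value as weighted.mean(son, w = wt)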
| Exercise:
Write a function called, say, nw , like this
nw = function(x,y,newX,kernel,h) {
#write commands here
}
[Hint: Basically collect the lines that we used just now.]
Now use it to predict the value of pmRA when
RA is 90 (use h=0.2):
newRA = 90
newpmRA = nw(RA,pmRA,newRA,dnorm,0.2)
|
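Following the hint, one possible body for nw (a sketch; there are other ways to write it):
nw = function(x, y, newX, kernel, h) {
  wt = kernel((x - newX)/h)/h  #kernel weights for each case
  weighted.mean(y, w = wt)     #weighted average of the y-values
}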
Two-sample nonparametric tests
Here we shall be talking about the shape of distributions. Let us
first make sure of this concept. Suppose that I make a list of
all the people in a capitalist country
and make a histogram of their incomes.
I should get a histogram like this:
[Figure: Income histogram of a capitalist country]
But if the same thing is done for a socialist country we shall
see something like this (where almost everybody is in the middle
income group).
[Figure: Income histogram of a socialist country]
Clearly, the shapes of the histograms differ for the two
populations. We must be careful about the term ``shape'' here. For
example, the following two histograms are for two capitalist
countries, one rich the other poor.
[Figure: Income histograms of two capitalist countries]
Here the histograms have the same shape but differ in
location. You can get one from the other by applying a
shift in the location.
This is a very common phenomenon in statistics. If we have two
populations with comparable structure they often have similar
shapes, and differ by just a location shift. In such a situation
we are interested in knowing the amount of shift. If the
amount of shift is zero, then the two populations behave
identically.
Wilcoxon's rank sum test is one method to learn about the amount
of shift based on samples from the two populations.
We shall start with the example done in the theory class, where
we had two samples
x = c(37,49,55,57)
y = c(23,31,46)
m = length(x)
n = length(y)
We are told that both the samples come from populations with the
same shape, though one may be a shifted version of the
other.
Our aim is to check if indeed there is any shift or not.
We shall consider the pooled data (i.e., all the 7 numbers
taken together) and rank them
pool = c(x,y)
pool
r = rank(pool)
r
The first m=4 of these ranks are for the first
sample. Let us
sum them:
H = sum(r[1:m])
H
If the two distributions are really identical (i.e.,
if there is no shift) then we should have (according to the
theory class) H close to the value
m*(m+n+1)/2 #Remember: * means multiplication
Can the H that we computed from our data be considered
``close enough'' to this value? We shall let R determine that
for us.
wilcox.test(x,y)
So R clearly tells us that there is a shift in location. The
output also mentions a W and a
p-value. We shall learn more about the p-value in a later
tutorial. But W is essentially what we had called
H. Well, H was 21, while W is 11. This
is because R has the habit of subtracting m(m+1)/2
from H and calling it W. Indeed for our data set
m(m+1)/2 = 4×5/2 = 10,
which perfectly accounts for the difference between our H
and R's W.
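You can verify this directly (a quick check added here):
H - m*(m+1)/2  #gives 11, the W reported by wilcox.test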
You may recall from the theory class that H is called
Wilcoxon's rank sum statistic, while W is the Mann-Whitney
statistic (denoted by U in class).
Now that R has told us that there is a shift, we naturally want to
estimate the amount of shift.
wilcox.test(x,y,conf.int=T)
The output gives two different forms of estimate. It
gives a single value (a point estimate), which is 16.
Before that it gives a 95% confidence interval: -9 to
34. This basically means that we can say with 95% confidence that
the true value of the shift is between -9 and 34.
We shall talk more about confidence intervals later. For now let
us see how R got the point estimate 16. It used the so-called
Hodges-Lehmann formula, which we have seen in the theory
class. To see how this method works we shall first form the following
``difference table'':
   |  23  31  46
---+------------
37 |  14   6  -9
49 |  26  18   3
55 |  32  24   9
57 |  34  26  11
Here the rows are headed by the first sample, and the columns by the
second sample. Each entry is obtained by subtracting the column
heading from the row heading (e.g., -9 = 37 - 46). This table, by the
way, can be created very easily with the function outer.
outer(x,y,"-")
The outer function is a very useful (and somewhat
tricky!) tool in R. It takes two vectors x and y,
say, and some function f(x,y). Then it computes a matrix
whose (i,j)-th entry is f(x[i], y[j]).
Now we take the median of all the entries in the table:
median(outer(x,y,"-"))
This gives us the Hodges-Lehmann estimate of the location shift.
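Indeed (a quick cross-check added here), this median of pairwise differences agrees with the point estimate reported by R:
wilcox.test(x, y, conf.int = TRUE)$estimate  #should again be 16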