Nonparametric statistics

This is the practical lab tutorial for nonparametric statistics. But we shall start with a little theory that will be crucial for understanding the underlying concepts.

Distributions

In statistics we often come across questions like

``Is a given data from such-n-such distribution?''

Before we can answer this question we need to know what is meant by a distribution. We shall talk about this first.

In statistics, we work with random variables. Chance controls their values.

The distribution of a random variable is the rule by which chance governs the values. If you know the distribution then you know the chance of this random variable taking value in any given range. This rule may be expressed in various ways. One popular way is via the Cumulative Distribution Function (CDF), that we illustrate now. If a random variable (X) has distribution with CDF F(x) then for any given number a the chance that X will be ≤ a is F(a).

Example: Suppose that I choose a random star from the Hipparcos data set. If I tell you that its Vmag is a random variable with the following CDF then what is the chance that it takes values ≤ 10? Also find out the value x such that the chance of Vmag being ≤ x is 0.4.

For the first question the required probability is F(10) which, according to the graph, is slightly above 0.8.
In the second part we need F(x) = 0.4. From the graph it seems to be about 7.5.
Just to make sure that these are indeed meaningful, let us load the Hipparcos data set and check:
hip = read.table("HIP.dat",head=T) attach(hip) n = length(Vmag) #total number of cases count = sum(Vmag<=10) #how many <= 10 count/n #should be slightly above 0.8
This checks the answer of the first problem. For the second
count = sum(Vmag<=7.5) #how many <= 7.5 count/n #should be around 0.4

So you see how powerful a CDF is: in a sense it stores as much information as the entire Vmag data. It is a common practice in statistics to regard the distribution as the ultimate truth behind the data. We want to infer about the underlying distribution based on the data.

R has many standard distributions already built into it. This basically means that R has functions to compute their CDFs. These functions all start with the letter p.

Example: A random variable has standard Gaussian distribution. We know that R computes its CDF using the function pnorm. How to find the probability that the random variable takes values ≤ 1?
The answer may be found as follows:
pnorm(1)

For every p-function there is a q-function that is basically its inverse.

Example: A random variable has standard Gaussian distribution. Find x such that the random variable is ≤ x with chance 0.3.
Now we shall use the function qnorm:
qnorm(0.3)

OK, now that we are through our litttle theory session, we are ready for the nonparametric lab.

One sample nonparametric tests

Sign-test

In a sense this is the simplest possible of all tests. Here we shall consider the data set LMC.dat that stores the measured distances to the Large Magellanic Cloud.


LMC = read.table("LMC.dat",head=T)
data = LMC[,2] #These are the measurments
data

We want to test if the measurements exceed 18.41 on an average. Now, this does not mean whether the average of the data exceeds 18.41, which is trivial to find out. The question here is actually about the underlying distribution. We want to know if the median of the underlying distribution exceeds 18.41, which is a less trivial question, since we are not given that distribution.

We shall use the sign test to see if the median is 18.41 or larger. First it finds how many of the observations are above this value:


abv = sum(data>18.41)
abv

Clearly, if this number is large, then we should think that the median (of the underlying distribution) exceeds 18.41. The question is how large is ``large enough''? For this we consult the binomial distribution to get the p-value. The rationale behind this should come from the theory class.


n = nrow(LMC)
pValue = 1-pbinom(abv-1,n,0.5)
pValue

We shall learn more about p-values in a later tutorial. For now, we shall use the following rule of thumb:

If this p-value is below 0.05, say, we shall conclude that the median is indeed larger than 18.41.

Wilcoxon's Signed Rank Test

As you should already know from the theoretical class, there is a test called Wilcoxon's Signed Rank test that is better (if a little more complicated) than the sign test. R provides the function wilcox.test for this purpose. We shall work with the Hipparcos data set this time, which we have already loaded. We want to see if the median of the distribution of the pmRA variable is 0 or not.


wilcox.test(pmRA, mu=0)

Since the p-value is pretty large (above 0.05, say) we shal conclude that the median is indeed equal to 0. R itself has come to the same conclusion.

Incidentally, there is a little caveat. For Wilcoxon's Rank Sum Test to be applicable we need the underlying distribution to be symmetric around the median. To get an idea about this we should draw the histogram.


hist(pmRA)

Well, it looks pretty symmetric (around a centre line). But would you apply Wilcoxon's Rank Sum Test to the variable DE?


hist(DE)

No, this is not at all symmetric!

Kernel density estimation

We have already mentioned how important the distribution underlying a sample is, and how the distribution may be specified via its CDF. An (almost) equivalent way of specifying the distribution is via its density function, which is just the derivative of the CDF. The simplest way to get an idea about the density function underlying a data set is to make its histogram. The diagram below shows a density and the typical shape of the histogram that you can expect for a data set with this density. Note the similarity in shape.

A density function and a typical histogram

Here is how we can use histograms to estimate the density of Vmag.


hist(Vmag, prob=T)

Typically densities are smooth functions, while histograms are jagged. A better estimate may be obtained using kernels as we have seen in the theory class.

Let us start with a very simple example to get the procedure right. We shall apply the Gaussian kernel to a sample consisting of just two points!


source("krnl1.r")

We shall take some time to understand the different parts of the plot that pops up. Recall that the kernel density estimator has the formula

where K is the kernel used. In our example, the sum has just two terms

that are half the Gaussian densities centred around the two points. These are shown in red. The sum (which is the kernel density estimate) is shown in black.

Now notice the little pop up window that looks something like this:

Kernel pop up

Move the slider to change the bin width (which controls how peaked the kernel is). See how a large value of the bin width produces a flat estimator, while a small value produces two peaks (one for each point). You may not readily notice that the black curve is getting flatter with increasing bin width, until you notice how the x-axis is growing and the y-axis shrinking.

Next we shall move over to a more realistic situation where we shall work with 100 observations that are artificially generated from the standard Gaussian distribution. The aim now will be to compare the estimated density with the actual density (which we know since the data set is artificial).


source("krnl2.r")

Again play with the slider and convince yourself that small bin widths produce jagged estimated, while large ones produce flat estimates. The best value is somewhere in the middle. The theoretical class has taught you how to find a good bin width. You can also let R find a good value for you as we shall see next. We shall use the Hipparcos data set once again and the function density.


dd = density(Vmag)
plot(dd$x,dd$y,ty="l")

It might be more interesting to see the estimate overlayed on top of the histogram:


hist(Vmag,prob=T)
lines(dd$x,dd$y,col="red")

What bin width did R use? To see this look inside dd:

dd

If you would like to try a different bin width (say 0.1) then that is OK too:


dd2 = density(Vmag,bw = 0.1)
lines(dd2$x,dd2$y,col="blue")

Exercise: All these examples work with the Gaussian kernel. R allows you to use a number of other kernels as well: Epanechnikov, Rectangular, and Triangular, among others. Try these out to see that the choice of the kernels does not matter as much as the bin width does. Here is how you specify the kernel:
dd3 = density(Vmag,kernel="epan") #abbrevs are OK lines(dd3$x,dd3$y,col="green")

Q-Q plot

Suppose that you are given a random sample and asked to check if it comes from the standard Gaussian distribution. There are many ways to go about it, and the Q-Q plot is one of the easiest.

To understand this process let us use a small data set:


x = c(2.3, -0.3, 0.4, 2.0, -1.1, 1.0, 2.1, 3.4, -0.4, 0.5, 1.2,
      2.1)
n = length(x)
n

First we shall sort them in increasing order:


sortedX = sort(x)
sortedX

Now pick a number in this sorted list. Say we pick the 3rd number, which is -0.3. What is the fraction of numbers that are less than or equal to this? Clearly, it is 3/n = 3/12 = 0.25.


frac = 3/n

So we see that 0.25 fraction of the numbers are less than or equal to -0.3 in the data.

Now it is time to compare it with what we should see for a true standard Gaussian data. We shall find the number M such that the a standard Gaussian random variable will be ≤ M with chance 0.25. Then ideally M should be close to our observed -0.3.


M = qnorm(frac)

Well, due to a little technicality we subtract 1/2n from 0.25. This hardly matters when n is large.


M = qnorm(frac -  1/(2*n) )
M

This M is far from -0.3, and so we become suspicious that our sample is not from the standard Gaussian distribution.

So far we are working with just the 3rd number in sortedX, which we picked arbitrarily. We now repeat the process for all the points.


frac = (1:n)/n
M = qnorm(frac - 1/(2*n))
plot(M,sortedX)

This is called the Q-Q plot of the sample against the standard Gaussian distribution. If the points are close to the y=x line then we can be hopeful that the sample is indeed from a standard Gaussian distribution. To facilitate the comparision let us add the y=x line to the plot.


abline(a=0,b=1)

Well, our data is far from standard Gaussian!

Comparison with the Gaussian distribution is so very common that R provides a function qqnorm for it.


qqnorm(x)

R does not provide any readymade function for making a Q-Q plot against other distributions. But one can always follow the method shown here to make a Q-Q plot against any distribution for which R can compute the CDF (or, rather, its inverse using a q-function).

Kolmogorov-Smirnov test

The theory class has talked about a non-parametric test called Kolmogorov-Smirnov test. We shall cover it in a later tutorial.

k-nearest neighbours

Imagine that you are given a data set of average parent heights and their adult sons' heights, like this (ignore the colour for the time being):


Parent  Son
5.5     5.9
5.4     5.3
5.7     5.9
5.1     5.3
6.0     5.8
5.5     5.2
6.1     6.0

A prosepective couple comes to you, reports that their average height is 5.3 feet, and wants you to predict (based on this data set) the height their son would attain at adulthood. How should you proceed?

A pretty simple method is this: look at the cases in your data set with average parents' height near 5.3. Say, we consider the 3 nearest cases. These are shown in red in the above data set. Well, there are 4 cases, and not 3. This is because there is a tie (three cases 5.1, 5.5, 5.5 that are at same distance from 5.3).

Now just report the average of the sons' height for these cases.

Well, that is all there is to it in 3-NN regression (NN = Nearest Neightbour)! Let us implement this in R. First the data set


parent = c(5.5, 5.4, 5.7, 5.1, 6.0, 5.5, 6.1)
son = c(5.9, 5.3, 5.9, 5.3, 5.8, 5.2, 6.0)

Now the new case:


newParent = 5.3

Find the ditances of the parents' heights from this new height:


d = abs(parent- newParent)

Rank these:


rnk = rank(d,tie="min")
rnk

Notice the tie="min" option. This allows the three joint seconds to be each given rank 2. We are not using order here (as it does not handle ties gracefully). Now identify the 3 nearest cases (or more in case of ties).


nn = (rnk<=3)
nn

Now it is just a matter of taking averages.


newSon = mean(son[nn])
newSon

It is instructive to write a function that performs the above computation.


knnRegr = function(x,y,newX,k) {
  d = abs(x-newX)
  rnk = rank(d,tie="min")
  mean(y[rnk<=k])
}

Now we can use it like


newSon = knnRegr(parent,son,newParent,3)

Exercise: You shall apply the function that we have just written to the Hipparcos data set.
plot(RA,pmRA)
We want to explore the relation between RA and pmRA using the k-nearest neighbour method. Our aim is to predict the value pmRA based on a new value of RA:
newRA = 90.
Use the knnRegr function to do this with k=13.

Nadaraya-Watson regression

While the k-nearest neighbour method is simple and reasonable it is nevertheless open to one objection that we illustrate now. Suppose that we show the x-values along a number line using crosses as below. The new x-value is shown with an arrow. The 3 nearest neighbours are circled.

Are the 3 NN equally relevant?

The k-NN method now requires us to simply average the y-values of the circled cases. But since two of the points are rather far away from the arrow, shouldn't a weighted average be a better method, where remote points get low weights?

This is precisely the idea behind the Nadaraya-Watson regression. Here we average over all the cases (not just nearest neighbours) but cases farther away get less weight. The weights are decided by a kernel function (typically a function peaked at zero, and tapering off symmetrically on both sides).

How the kernel determines the weights

Now we may simply take the weighted average. We shall apply it to our parent-son example first.


parent = c(5.5, 5.4, 5.7, 5.1, 6.0, 5.5, 6.1)
son = c(5.9, 5.3, 5.9, 5.3, 5.8, 5.2, 6.0)
newParent = 5.3
kernel = dnorm #this is the N(0,1) density

Now let us find the weights (for a bin width (h) = 0.5, say):


h = 0.5
wt = kernel((parent- newParent)/h)/h

Finally the weighted average:


newSon = weighted.mean(son,w = wt)

Exercise: Write a function called, say, nw, like this
nw = function(x,y,newX,kernel,h) { #write commands here }
[Hint: Basically collect the lines that we used just now.]
Now use it to predict the value of pmRA when RA is 90 (use h=0.2):
newRA = 90 newpmRA = nw(RA,pmRA,newRA,dnorm,0.2)

Two-sample nonparametric tests

Here we shall be taking about the shape of distributions. Let us first make sure of this concept. Suppose that I make a list of all the people in a capitalist country and make a histogram of their incomes. I should get a histogram like this

Income histogram of a capitalist country

But if the same thing is done for a socialist country we shall see something like this (where almost everybody is in the middle income group).

Income histogram of a socialist country

Clearly, the shapes of the histograms differ for the two populations. We must be careful about the term ``shape'' here. For example, the following two histograms are for two capitalist countries, one rich the other poor.

Income histograms two capitalist countries

Here the histograms have the same shape but differ in location. You can get one from the other by applying a shift in the location.

This is a very common phenomon in statistics. If we have two populations with comparable structure they often have similar shapes, and differ in just a location shift. In such a situation we are interested in knowing about the amount of shift. if the amount of shift is zero, then the two populations behave identically.

Wilcoxon's rank sum test is one method to learn about the amount of shift based on samples from the two populations.

We shall start with the example done in the theory class, where we had two samples


x = c(37,49,55,57)
y = c(23,31,46)
m = length(x)
n = length(y)

We are told that both the samples come from populations with the same shape, though one may be a shifted version of the other.

Our aim is to check if indeed there is any shift or not.

We shall consider the pooled data (i.e., all the 7 numbers taken together) and rank them


pool = c(x,y)
pool
r = rank(pool)
r

The first m=4 of these ranks are for the first sample. Let us sum them:


H = sum(r[1:m])
H

If the the two distributions are really identical (i.e., if there is no shift) then we should have (according to the theory class) H close to the value


m*(m+n+1)/2   #Remember: * means multiplication

Can the H that we computed from our data be considered ``close enough" to this value? We shall let us R determine that for us.


wilcox.test(x,y)

So R clearly tells us that there is a shift in location. The output also mentions a W and a p-value. tutorial. But W is essentially what we had called H. Well, H was 21, while W is 11. This is because R has the habit of subtracting

m(m+1)/2

from H and calling it W. Indeed for our data set

m(m+1)/2 = 10,

which perfectly accounts for the difference between our H and R's W.

You may recall from the theory class that H is called Wilcoxon's rank sum statistic, while W is the Mann-Whitney statistic (denoted by U in class).

Now that R has told us that there is a shift, we should demand to estimate the amount of shift.


wilcox.test(x,y,conf.int=T)

The output gives two different forms of estimates. It gives a single value (a point estimate) which is 16. Before it gives a 95% confidence interval: -9 to 34. This basically means that

we can say with 95% confidence that the true value of the shift is between -9 and 34.

We shall talk more about confidence intervals later. For now let us see how R got the poit estimate 16. It used the so-called Hodges-Lehman formula, that we have seen in the theoretical class. To see how this method works we shall first form the following "differecne table"


  | 23 31 46
------------
37| 14  6 -9
49| 26 18  3
55| 32 24  9
57| 34 26 11

Here the rows are headed by the first sample, and coumns by the second sa,mple. Each entry is obtained by subtracting the column heading from the row heading (e.g., -9 = 37-46). This table, by the way, can be created very easily with the function outer.


outer(x,y,"-")


	The `outer` function is a very useful (and somewhat tricky!) tool in R. It takes two vectors x, and y, say, and some function f(x,y). Then it computes a matrix where the (i,j)-th entry is f(x[i],y[j]).

Now we take the median of all the entries in the table:


median(outer(x,y,"-"))

This gives us the Hodges-Lehman estimate of the location shift.