Descriptive Statistics in R

Descriptive statistics

In the course of learning a bit about how to generate data summaries in R, one will inevitably learn some useful R syntax and commands. Thus, this first tutorial on descriptive statistics serves a dual role as a brief introduction to R. When this tutorial is used online, the indented lines in non-proportional font

 
   # like this one

are meant to be copied and pasted directly into R at the command prompt.

Datasets and other files used in this tutorial:

HIP_star.dat

Obtaining astronomical datasets

The astronomical community has a vast complex of on-line databases. Many databases are hosted by data centers such as NASA's archive research centers, the Centre de Données astronomiques de Strasbourg (CDS), the NASA/IPAC Extragalactic Database (NED), and the Astrophysics Data System (ADS). The Virtual Observatory (VO) is developing new flexible tools for accessing, mining and combining datasets at distributed locations; see the Web sites for the international, European, and U.S. VO for information on recent developments. The VO Web Services, Summer Schools, and Core Applications provide helpful entries into these new capabilities.

We initially treat here only input of tabular data such as catalogs of astronomical sources. We give two examples of interactive acquisition of tabular data. One of the multivariate tabular datasets used here is a dataset of stars observed with the European Space Agency's Hipparcos satellite during the 1990s. It is a table with 9 columns and 2719 rows, one row for each Hipparcos star lying between 40 and 50 parsecs from the Sun. The dataset was acquired using CDS's Vizier Catalogue Service as follows:

In Web browser, go to http://vizier.u-strasbg.fr/viz-bin/VizieR?-source=I/239/hip_main
Set Max Entries to 9999, Output layout ASCII table
Remove "Compute r" and "Compute Position" buttons
Set parallax constraint "20 .. 25" to give stars between 40 and 50 pc
Retrieve 9 properties: HIP, V, RA, Dec, Plx, pmRA, pmDE, e_Plx, and B-V
Submit Query
Use ASCII editor to trim header to one line with variable names
Trim trailer
Save ASCII file on disk for ingestion into R

Reading data into R

Enter R by typing "R" (UNIX) or double-clicking to execute Rgui.exe (Windows) or R.app (Mac). In the commands below, we start by extracting some system and user information, the version of R you are using (R.version), and some of its capabilities. The citation function tells how to cite R in publications. R is released under the GNU General Public License, as indicated by copyright. Typing a question mark in front of a command opens the help file for that command.

   Sys.info()
   R.version 
   capabilities() 
   citation() 
   ?copyright

The various capitalizations above are important as R is case-sensitive. When using R interactively, it is very helpful to know that the up-arrow key can retrieve previous commands, which may be edited using the left- and right-arrow keys and the delete key.

The last command above, ?copyright, is equivalent to help(copyright) or help("copyright"). However, to use this command you have to know that the function called "copyright" exists. Suppose that you knew only that there was a function in R that returned copyright information but you could not remember what it was called. In this case, the help.search function provides a handy reference tool:

   help.search("copyright")

Last but certainly not least, a vast array of documentation and reference materials may be accessed via a simple command:

   help.start()

The initial working directory in R is set by default or by the directory from which R is invoked (if it is invoked on the command line). It is possible to read and set this working directory using the getwd or setwd commands. A list of the files in the current working directory is given by list.files, which has a variety of useful options and is only one of several utilities interfacing to the computer's files. In the setwd command, note that in Windows, path (directory) names are not case-sensitive and may contain either forward slashes or backward slashes; in the latter case, a backward slash must be written as "\\" when enclosed in quotation marks.

   getwd()
   list.files() # what's in this directory?
   # The # symbol means that the rest of that line is a comment.

We wish to read an ASCII data file into an R object using the read.table command or one of its variants. Let's begin with a cleaned-up version of the Hipparcos dataset described above, a description of which is given at http://astrostatistics.psu.edu/datasets/HIP_star.html. There are two distinct lines below that read the dataset and create an object named hip. The first (currently commented out) may be used whenever one has access to the internet; the second assumes that the HIP_star.dat file has been saved into the current working directory.

#   hip  <-  read.table("http://astrostatistics.psu.edu/datasets/HIP_star.dat",
#      header=T,fill=T) # T is short for TRUE
   hip  <-  read.table("HIP_star.dat", header=T,fill=T)

The "<-", which is actually "less than" followed by "minus", is the R assignment operator. Admittedly, this is a bit hard to type repeatedly, so fortunately R also allows the use of a single equals sign (=) for assignment.

Note that no special character must be typed when a command is broken across lines as in the example above. Whenever a line is entered that is not yet syntactically complete, R will replace the usual prompt, ">" with a + sign to indicate that more input is expected. The read.table function can refer to a location on the web, though a filename (of a file in the working directory) or a pathname would have sufficed. The "header=TRUE" option is used because the first row of the file is a header containing the names of the columns. We used the "fill=TRUE" option because some of the rows have only 8 of the 9 columns filled, and "fill=TRUE" instructs R to fill in blank fields at the end of the line with missing values, denoted by the special R constant NA ("not available"). Warning: This only works in this example because all of the empty cells are in the last column of the table. (You can verify this by checking the ASCII file HIP_star.dat.) Because the read.table function uses delimiters to determine where to break between columns, any row containing only 8 values would always put the NA in the 9th column, regardless of where it was intended to be placed. As a general rule, data files with explicit delimiters are to be preferred to files that use line position to indicate column number, particularly when missing data are present. If you must use line position, R provides the read.fortran and read.fwf functions for reading fixed width format files.
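The warning about fill=TRUE can be seen on a tiny made-up table. The following is only a sketch: the column names mimic those of the tutorial, the values are invented, and the text argument of read.table stands in for a file on disk.

```r
# Toy table: the values are invented, not taken from HIP_star.dat.
toy_lines <- c("HIP Vmag B.V",
               "1 9.10 0.48",
               "2 8.65",      # B.V (the last column) is missing
               "3 0.61")      # intended meaning: Vmag missing, B.V = 0.61
toy <- read.table(text = toy_lines, header = TRUE, fill = TRUE)
print(toy)
# Row 2 is padded as intended, but in row 3 the value 0.61 is read as Vmag
# and the NA is placed in B.V, because fill=TRUE pads only at the end of a line.
```

A file that marked the missing Vmag explicitly (for instance with the string NA in that position) would not suffer from this shift.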

Summarizing the dataset

The following R commands list the dimensions of the dataset and print the variable names (from the single-line header). Then we list the first row, the first 20 rows for the 7th column, and the sum of the 3rd column.

   dim(hip)
   names(hip) 
   hip[1,]
   hip[1:20,7]
   sum(hip[,3])

Note that vectors, matrices, and arrays are indexed using the square brackets and that "1:20" is shorthand for the vector containing integers 1 through 20, inclusive. Even punctuation marks such as the colon have help entries, which may be accessed using help(":").
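The indexing rules above can be tried on a small object before applying them to the full dataset. This is a minimal sketch using an invented data frame called toy (its name, columns, and values are assumptions made for illustration):

```r
# Toy data frame with invented values.
toy <- data.frame(id  = 1:4,
                  mag = c(9.1, 8.7, 7.2, 6.5),
                  bv  = c(0.5, 0.6, 0.4, 0.7))
toy[1, ]         # first row, all columns
toy[1:3, 2]      # first three entries of the 2nd column
toy[, "mag"]     # a column selected by name
toy$mag          # the same column, using the $ shortcut for data frames
toy[-1, ]        # a negative index drops that row
sum(toy[, 2])    # sum of the 2nd column
```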

Next, list the maximum, minimum, median, and median absolute deviation (a robust analogue of the standard deviation, computed by the mad function) of each column. First we do this using a for-loop, which is a slow process in R. Inside the loop, c is a generic R function that combines its arguments into a vector and print is a generic R command that prints the contents of an object. After the inefficient but intuitively clear approach using a for-loop, we then do the same job in a more efficient fashion using the apply command. Here the "2" refers to columns of hip; a "1" would refer to rows.

   for(i in 1:ncol(hip)) {
      print(c(max(hip[,i]), min(hip[,i]), median(hip[,i]), mad(hip[,i]))) 
   }
   apply(hip, 2, max) 
   apply(hip, 2, min) 
   apply(hip, 2, median) 
   apply(hip, 2, mad)

The curly braces {} in the for loop above are optional because there is only a single command inside. Notice that the output gives only NA for the last column's statistics. This is because a few values in this column are missing. We can tell how many are missing and which rows they come from as follows:

   sum(is.na(hip[,9]))
   which(is.na(hip[,9]))
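The column-wise summaries and the count of missing values can also be gathered for every column at once. The sketch below uses an invented data frame rather than the Hipparcos data, and it swaps in sapply (which loops over the columns of a data frame) together with an anonymous function; this is one possible variation, not the only way to do it.

```r
# Toy data frame with invented values; column b has one missing entry.
toy <- data.frame(a = c(3, 1, 4, 1, 5),
                  b = c(2, NA, 1, 8, 2))
# One column of results per variable, one row per statistic.
stats <- sapply(toy, function(col)
  c(max = max(col), min = min(col), median = median(col), mad = mad(col)))
print(stats)                     # every statistic for column b is NA
n_missing <- colSums(is.na(toy)) # number of NAs in each column
print(n_missing)
```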

There are a couple of ways to deal with the NA problem. One is to repeat all of the above calculations on a new R object consisting of only those rows containing no NAs:

   y  <-  na.omit(hip)
   for(i in 1:ncol(y)) {
      print(c(max(y[,i]), min(y[,i]), median(y[,i]), mad(y[,i]))) 
   }

Another possibility is to use the na.rm (remove NA) option of the summary functions. This solution gives slightly different answers from the solution above; can you see why?

   for(i in 1:ncol(hip)) {
      print(c(max(hip[,i],na.rm=T), min(hip[,i],na.rm=T), median(hip[,i],na.rm=T), mad(hip[,i],na.rm=T))) 
   }
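One way to explore the difference between the two approaches is to try both on a small invented data frame (the object toy and its values are assumptions for illustration). Note what happens to column x, which has no missing values of its own:

```r
# Toy data frame: only column y has a missing value.
toy <- data.frame(x = c(10, 20, 30, 40),
                  y = c(1, NA, 3, 4))
median(toy$x, na.rm = TRUE)   # 25: all four x values are used
median(na.omit(toy)$x)        # 30: the whole second row was dropped
nrow(na.omit(toy))            # 3 rows remain after na.omit
```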

A vector can be sorted with the sort function, using the Shellsort or Quicksort algorithms; rank returns the ranks of the values in a numeric vector; and order returns a vector of indices that will sort a vector. The last of these functions, order, is often the most useful of the three, because it allows one to reorder all of the rows of a matrix according to one of the columns:

   sort(hip[1:10,3])
   hip[order(hip[1:10,3]),]

Each of the above lines gives the sorted values of the first ten entries of the third column, but the second line reorders all ten rows into this order. Note that neither of these commands actually alters the value of hip, but we could reassign hip to equal its sorted values if desired.
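The distinction between sort, rank, and order is easiest to see on a very short vector. A minimal sketch with invented values:

```r
v <- c(30, 10, 20)   # invented values
sort(v)              # 10 20 30  (the values in increasing order)
rank(v)              # 3 1 2     (the rank of each value in place)
order(v)             # 2 3 1     (the indices that would sort v)
v[order(v)]          # same result as sort(v)

# Reordering the rows of a (toy) data frame by one of its columns,
# and reassigning so that the change is kept:
toy <- data.frame(name = c("a", "b", "c"), v = v)
toy <- toy[order(toy$v), ]
toy
```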

Standard errors and confidence intervals

The standard error of an estimator is, by definition, an estimate of the standard deviation of that estimator. Let's consider an example.

Perhaps the most commonly used estimator is the sample mean (called a statistic because it depends only on the data), which is an estimator of the population mean (called a parameter). Assuming that our sample of data truly consists of independent observations of a random variable X, the true standard deviation of the sample mean equals stdev(X)/sqrt(n), where n is the sample size. However, we do not usually know stdev(X), so we estimate the standard deviation of the sample mean by replacing stdev(X) by an estimate thereof.

If the Vmag column (the 2nd column) of our dataset may be considered a random sample from some larger population, then we may estimate the true mean of this population by

   mean(hip[,2])

and the standard error of this estimator is

   sd(hip[,2]) / sqrt(2719)

We know that our estimator of the true population mean is not exactly correct, so a common way to incorporate the uncertainty in our measurements into reporting estimates is by reporting a confidence interval. A confidence interval for some population quantity is always a set of "reasonable" values for that quantity. In this case, the Central Limit Theorem tells us that the sample mean has a roughly Gaussian, or normal, distribution centered at the true population mean. Thus, we may use the fact that 95% of the mass of any Gaussian distribution is contained within 1.96 standard deviations of its mean to construct the following 95% confidence interval for the true population mean of Vmag:

   mean(hip[,2]) + c(-1.96,1.96)*sd(hip[,2]) / sqrt(2719)

In fact, many confidence intervals in statistics have exactly the form above, namely, (estimator) +/- (critical value) * (standard error of estimator).
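The recipe (estimator) +/- (critical value) * (standard error) can be wrapped in a small function. The helper below is hypothetical (mean_ci is not part of R), and it is a sketch under the normal approximation used in this section; for small samples a critical value from the t distribution (qt) is commonly used instead.

```r
# Hypothetical helper: normal-approximation confidence interval for a mean.
mean_ci <- function(x, level = 0.95) {
  x  <- x[!is.na(x)]                  # drop missing values
  n  <- length(x)
  se <- sd(x) / sqrt(n)               # estimated standard error of the mean
  z  <- qnorm(1 - (1 - level) / 2)    # critical value, about 1.96 for 95%
  mean(x) + c(-1, 1) * z * se
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # invented data, used only to show the mechanics
mean_ci(x)                       # interval centered at mean(x) = 5
mean_ci(x, level = 0.99)         # a wider interval
```

Computing n from the data, rather than typing the sample size by hand, keeps the calculation correct if rows are later added or removed.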

The precise interpretation of a confidence interval is a bit tricky. For instance, notice that the above interval is centered not at the true mean (which is unknown), but at the sample mean. If we were to take a different random sample of the same size, the confidence interval would change even though the true mean remains fixed. Thus, the correct way to interpret the "95%" in "95% confidence interval" is to say that roughly 95% of all such hypothetical intervals will contain the true mean. In particular, it is not correct to claim, based on the previous output, that there is a 95% probability that the true mean lies between 8.189 and 8.330. Although this latter interpretation is incorrect, if one chooses to use Bayesian estimation procedures, then the analogue of a confidence interval is a so-called "credible interval"; and the incorrect interpretation of a confidence interval is actually the correct interpretation of a credible interval (!).
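The "95% of all such hypothetical intervals" interpretation can be checked by simulation. The sketch below draws many samples from a population whose mean is known (all settings are invented: the true mean, the sample size, the number of repetitions, and the seed) and counts how often the interval contains the true mean.

```r
set.seed(1)        # arbitrary seed, so the run can be repeated
true_mean <- 8     # invented population mean
n    <- 50         # size of each simulated sample
reps <- 2000       # number of simulated samples

covered <- replicate(reps, {
  x  <- rnorm(n, mean = true_mean, sd = 1)
  ci <- mean(x) + c(-1.96, 1.96) * sd(x) / sqrt(n)
  ci[1] <= true_mean && true_mean <= ci[2]   # does this interval cover?
})
mean(covered)      # fraction of intervals containing the true mean
```

The resulting fraction should be close to, though not exactly, 0.95; each individual interval either contains the true mean or does not.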

More R syntax

Arithmetic in R is straightforward. Some common operators are: + for addition, - for subtraction, * for multiplication, / for division, %/% for integer division, %% for modular arithmetic, ^ for exponentiation. The help page for these operators may be accessed by typing, say,

   ?'+'
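A few sample evaluations show how the less familiar operators behave, including with negative numbers (the parentheses around -7 are added only for clarity):

```r
7 / 2         # 3.5   ordinary division
7 %/% 2       # 3     integer division
7 %% 2        # 1     remainder (modular arithmetic)
2 ^ 10        # 1024  exponentiation
(-7) %/% 2    # -4    integer division rounds toward minus infinity
(-7) %% 2     # 1     the remainder takes the sign of the divisor
```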

Some common built-in functions are exp for the exponential function, sqrt for square root, log10 for base-10 logarithms, and cos for cosine. The syntax resembles "sqrt(z)". Comparisons are made using < (less than), <= (less than or equal), == (equal to) with the syntax "a >= b". To test whether a and b are exactly equal and return a TRUE/FALSE value (for instance, in an "if" statement), use the command identical(a,b) rather than a==b. Compare the following two ways of comparing the vectors a and b:

   a <- c(1,2);b <- c(1,3)
   a==b
   identical(a,b)

Also note that in the above example, 'all(a==b)' is equivalent to 'identical(a,b)'.
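The equivalence just noted holds for that particular example; in general the two tests can disagree, because identical also compares how the values are stored. A short sketch with invented values, which also shows the all.equal function for comparing floating-point numbers up to a small tolerance:

```r
a <- c(1, 2)       # stored as double
b <- c(1L, 2L)     # stored as integer (the L suffix)
all(a == b)        # TRUE:  the values compare equal element by element
identical(a, b)    # FALSE: the storage types differ

x <- 0.1 + 0.2
x == 0.3                    # FALSE, because of floating-point rounding
isTRUE(all.equal(x, 0.3))   # TRUE: equal up to a small numerical tolerance
```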

R also has other logical operators such as & (AND), | (OR), ! (NOT). There is also an xor (exclusive or) function. Each of these four functions performs elementwise comparisons in much the same way as arithmetic operators:

   a <- c(TRUE,TRUE,FALSE,FALSE);b <- c(TRUE,FALSE,TRUE,FALSE)
   !a
   a & b
   a | b
   xor(a,b)

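However, when 'and' and 'or' are used in programming, say in 'if' statements, generally the '&&' and '||' forms are preferable. These longer forms of 'and' and 'or' evaluate left to right, and evaluation terminates as soon as a result is determined. They expect operands of length one: older versions of R silently examined only the first element of each vector, whereas R 4.3.0 and later signal an error if a longer vector is supplied. Other operators are documented in R's help pages; see, for example, ?Syntax.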
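The early termination of '&&' is useful for guarding a test that would otherwise fail. In this sketch the objects x, v, msg, and has_negative are invented for illustration:

```r
x <- NULL
# With &&, the right-hand side is not evaluated once the left-hand side is
# FALSE, so the comparison x > 0 is never attempted on the NULL object.
if (!is.null(x) && x > 0) {
  msg <- "positive"
} else {
  msg <- "missing or not positive"
}
msg

# To use a whole vector in an 'if' statement, first reduce the elementwise
# result to a single TRUE/FALSE with any() or all().
v <- c(1, -2, 3)
has_negative <- any(v < 0)
if (has_negative) print("v contains a negative value")
```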

The expression "y == x^2" evaluates as TRUE or FALSE, depending upon whether y equals x squared, and performs no assignment (if either y or x does not currently exist as an R object, an error results).

Let's continue with simple characterization of the dataset: find the row number of the object with the smallest value of the 4th column using which.min. A longer, but instructive, way to accomplish this task creates a long vector of logical constants (tmp), mostly FALSE with one TRUE, then pick out the row with "TRUE".

   which.min(hip[,4])
   tmp <- (hip[,4]==min(hip[,4])) 
   (1:nrow(hip))[tmp] # or equivalently, 
   which(tmp)
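The two approaches differ when the minimum occurs more than once: which.min reports only the first position, while the logical-vector approach reports all of them. A small sketch with invented values:

```r
v <- c(5, 2, 8, 2)     # the minimum, 2, occurs twice
which.min(v)           # 2    (first position of the minimum only)
tmp <- (v == min(v))   # FALSE TRUE FALSE TRUE
which(tmp)             # 2 4  (every position where the minimum occurs)
```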

The cut function divides the range of x into intervals and codes the values of x according to the interval into which they fall. It is a quick way to group a vector into bins. Use the "breaks" argument to either specify a vector of bin boundaries, or give the number of intervals into which x should be cut. Bin string labels can be specified. The cut function converts numeric vectors into an R object of class "factor" which can be ordered and otherwise manipulated; e.g. with the levels command. A more flexible method for dividing a vector into groups using user-specified rules is given by split.

   table(cut(hip[,"Plx"],breaks=20:25))

The command above uses several tricks. Note that a column in a matrix may be referred to by its name (e.g., "Plx") instead of its number. The notation '20:25' is short for 'c(20,21,22,23,24,25)' and in general, 'a:b' is the vector of consecutive integers starting with a and ending with b (this also works if a is larger than b). Finally, the table command tabulates the values in a vector or factor.

Although R makes it easy for experienced users to invoke multiple functions in a single line, it may help to recognize that the previous command accomplishes the same task as the following string of commands:

   p <- hip[,"Plx"]
   cuts <- cut(p,breaks=20:25)
   table(cuts)

The only difference is that the string of three separate commands creates two additional R objects, p and cuts. The preferred method of carrying out these operations depends on whether there will later be any use for these additional objects.
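The labels, levels, and split features mentioned earlier can be tried on a short invented vector (the values and the label strings below are assumptions for illustration). The last two lines show a detail worth knowing: by default the intervals are open on the left, so a value equal to the lowest break is not assigned to any bin.

```r
p <- c(20.5, 21.2, 22.9, 24.1, 24.8, 20.1)   # invented parallax values
bins <- cut(p, breaks = 20:25,
            labels = c("20-21", "21-22", "22-23", "23-24", "24-25"))
table(bins)       # counts per bin, including the empty one
levels(bins)      # the bin labels, in order
split(p, bins)    # a list holding the values that fall in each bin

cut(20, breaks = 20:25)                          # NA: first interval is (20,21]
cut(20, breaks = 20:25, include.lowest = TRUE)   # now falls in [20,21]
```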

Univariate plots

Recall the variable names in the Hipparcos dataset using the names function. By using attach, we can automatically create temporary variables with these names (these variables are not saved as part of the R session, and they are superseded by any other R objects of the same names).

   names(hip)
   attach(hip)

After using the attach command, we can obtain, say, individual summaries of the variables:

   summary(Vmag)
   summary(B.V)
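An alternative that avoids attaching the data frame is the with function, which evaluates a single expression using the columns of a data frame as variables. This is offered only as a sketch of another option, on an invented data frame whose column names mimic those of hip:

```r
# Toy data frame with invented values.
toy <- data.frame(Vmag = c(9.1, 8.7, 7.2),
                  B.V  = c(0.5, NA, 0.4))
with(toy, summary(Vmag))              # columns are visible inside with()
with(toy, mean(B.V, na.rm = TRUE))    # 0.45
summary(toy$Vmag)                     # the explicit form, using $
```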

Next, summarize some of this information graphically using a simple yet sometimes effective visualization tool called a dotplot or dotchart, which lets us view all observations of a quantitative variable simultaneously:

   dotchart(B.V)

The shape of the distribution of the B.V variable may be viewed using a traditional histogram. If we use the prob=TRUE option for the histogram so that the vertical axis is on the probability scale (i.e., the histogram has total area 1), then a so-called kernel density estimate, or histogram smoother, can be overlaid:

   hist(B.V,prob=T)
   d <- density(B.V,na.rm=T)
   lines(d,col=2,lwd=2,lty=2)
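The claim that a prob=TRUE histogram has total area 1 can be verified numerically. The sketch below uses simulated values as a stand-in for B.V (the seed and the distribution are invented) and asks hist to return its computations without drawing anything:

```r
set.seed(2)                              # arbitrary seed
x <- rnorm(500, mean = 0.6, sd = 0.3)    # invented stand-in for a color index
h <- hist(x, plot = FALSE)               # compute the histogram, do not draw it
sum(h$density * diff(h$breaks))          # bar heights times bar widths: area 1
d <- density(x)
sum(d$y) * diff(d$x[1:2])                # approximate area under the density curve
```

Because both objects are on the same probability scale, the density curve can be overlaid on the histogram without rescaling.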

The topic of density estimation will be covered in a later tutorial. For now, it is important to remember that even though histograms and density estimates are drawn in two-dimensional space, they are intrinsically univariate analysis techniques here: We are only studying the single variable B.V in this example (though there are multivariate versions of these techniques as well).

Check the help file for the par function (by typing "?par") to see what the col, lwd, and lty options accomplish in the lines function above.

A simplistic histogram-like object for small datasets, which both gives the shape of a distribution and displays each observation, is called a stem-and-leaf plot. It is easy to create by hand, but R will create a text version:

   stem(sample(B.V,100))

The sample command was used above to obtain a random sample of 100, without replacement, from the B.V vector.

Finally, we consider box-and-whisker plots (or "boxplots") for the four variables Vmag, pmRA, pmDE, and B.V (the last variable used to be B-V, or B minus V, but R does not allow certain characters). These are the 2nd, 6th, 7th, and 9th columns of 'hip'.

   boxplot(hip[,c(2,6,7,9)])

Our first attempt above looks pretty bad due to the different scales of the variables, so we construct an array of four single-variable plots:

   par(mfrow=c(2,2))
   for(i in c(2,6,7,9)) 
      boxplot(hip[,i],main=names(hip)[i])
   par(mfrow=c(1,1))

The boxplot command does more than produce plots; it also returns output that can be more closely examined. Below, we produce boxplots and save the output.

   b <- boxplot(hip[,c(2,6,7,9)])
   names(b)

'b' is an object called a list. To understand its contents, read the help for boxplot. Suppose we wish to see all of the outliers in the pmRA variable, which is the second of the four variables in the current boxplot:

   b$names[2]
   b$out[b$group==2]
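The structure of the list returned by boxplot is easier to read on a tiny example. In this sketch the data frame toy is invented, and plot=FALSE asks boxplot to return its summaries without drawing:

```r
# Toy data frame: column u contains one extreme value.
toy <- data.frame(u = c(1, 2, 3, 4, 100),
                  w = c(5, 6, 7, 8, 9))
b <- boxplot(toy, plot = FALSE)
b$names                 # "u" "w"
b$stats                 # five summary numbers per variable, one column each
b$out                   # 100: the only point beyond the whiskers
b$group                 # 1: that point belongs to the first variable
b$out[b$group == 1]     # outliers of the first variable only
```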

R scripts

While R is often run interactively, one often wants to carefully construct R scripts and run them later. A file containing R code can be run using the source command. In addition, R may be run in batch mode.
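As a minimal sketch of the source command, the lines below write a two-line script to a temporary file and then run it. The file contents and the name myscript.R in the comments are hypothetical.

```r
script <- tempfile(fileext = ".R")     # temporary file, used only for this demo
writeLines(c("x <- 1:10",
             "m <- mean(x)"), script)
source(script)    # run the commands in the file
m                 # 5.5, created by the script

# From a shell (outside R), a script file could be run in batch mode with, e.g.,
#   Rscript myscript.R
#   R CMD BATCH myscript.R
```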

The editor Emacs, together with "Emacs Speaks Statistics" (ESS), provides a nice way to produce R scripts.