Descriptive statistics
In the course of learning a bit about how to generate data summaries in R,
one will inevitably learn some useful R syntax and commands. Thus, this
first tutorial on descriptive statistics
serves a dual role as a brief introduction to R. When this tutorial is
used online, the indented lines in non-proportional font
# like this one
are meant to be copied and pasted directly into R at the command prompt.
Datasets and other files used in this tutorial:
Obtaining astronomical datasets
The astronomical community has a vast complex of on-line
databases. Many databases are hosted by data centers such as NASA's archive
research centers, the Centre de Données astronomiques de Strasbourg (CDS),
the NASA/IPAC Extragalactic Database (NED), and the Astrophysics Data System
(ADS). The Virtual Observatory (VO) is developing new flexible tools for
accessing, mining and combining datasets at distributed locations; see
the Web sites for the international, European, and U.S. VO for information
on recent developments. The VO Web Services, Summer Schools, and Core
Applications provide helpful entries into these new capabilities.
We initially treat here only the input of tabular data such as catalogs of
astronomical sources. We give two examples of interactive acquisition
of tabular data. One of the multivariate tabular datasets used here
contains stars observed with the European Space Agency's Hipparcos
satellite during the 1990s. It is a table with 9 columns and 2719 rows,
listing the Hipparcos stars lying between 40 and 50 parsecs from
the Sun. The dataset was acquired using CDS's VizieR Catalogue Service as
follows:
- In a Web browser, go to
http://vizier.u-strasbg.fr/viz-bin/VizieR?-source=I/239/hip_main
- Set Max Entries to 9999, Output layout ASCII table
- Remove "Compute r" and "Compute Position" buttons
- Set parallax constraint "20 .. 25" to give stars between 40 and 50 pc
- Retrieve 9 properties: HIP, V, RA, Dec, Plx, pmRA, pmDE, e_Plx, and B-V
- Submit Query
- Use ASCII editor to trim header to one line with variable names
- Trim trailer
- Save ASCII file on disk for ingestion into R
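The parallax constraint in the steps above follows from the standard relation
between parallax and distance: the distance in parsecs is the reciprocal of
the parallax in arcseconds. The short sketch below assumes that the Plx
values are given in milliarcseconds; the variable names are purely
illustrative.

```r
# Parallax limits used in the query, assumed to be in milliarcseconds
plx_mas <- c(20, 25)
# Distance in parsecs is 1000 divided by the parallax in milliarcseconds
dist_pc <- 1000 / plx_mas
dist_pc   # 50 40
```

A larger parallax corresponds to a nearer star, which is why the limits
appear in reverse order.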
Reading data into R
Enter R by typing "R" (UNIX) or double-clicking to execute Rgui.exe (Windows)
or R.app (Mac). In the commands below, we start by extracting some system
and user information, the R.version you are using, and some of its
capabilities. The citation function tells how to cite R in publications.
R is released under the GNU General Public License, as indicated by copyright.
Typing a question mark in front of a command opens the help file
for that command.
Sys.info()
R.version
capabilities()
citation()
?copyright
The various capitalizations above are important as R is
case-sensitive. When using R interactively, it is very helpful to know that
the up-arrow key can retrieve previous commands,
which may be edited using the left- and right-arrow keys and the delete key.
The last command above, ?copyright, is equivalent to help(copyright) or
help("copyright"). However, to use this command you have to know
that the function called "copyright" exists. Suppose that you knew
only that there was a function in R
that returned copyright information but you could not remember what it
was called. In this case, the
help.search function provides a
handy reference tool:
help.search("copyright")
Last but certainly not least, a vast array of documentation and reference materials
may be accessed via a simple command:
help.start()
The initial working directory in R is set by default or by the directory from
which R is invoked (if it is invoked on the command line). It is possible to
read and set this working directory using the
getwd or
setwd
commands. A list of the files in the current working directory is given by
list.files, which
has a variety of useful options and is only one of several utilities
interfacing to the computer's files.
In the
setwd
command, note that in Windows, path (directory) names are not case-sensitive and
may contain either forward slashes or backward slashes; in the latter case,
a backward slash must be written as "\\" when enclosed in quotation marks.
getwd()
list.files() # what's in this directory?
# The # symbol means that the rest of that line is a comment.
We wish to read an ASCII data file into an R object using the
read.table command or one of its
variants.
Let's begin with a cleaned-up version of
the Hipparcos dataset described above, a description of which is given at
http://astrostatistics.psu.edu/datasets/HIP_star.html. There are two
distinct lines below that read the dataset and create an object named hip.
The first (currently commented out) may be used whenever one has access to
the internet; the second assumes that the HIP_star.dat file has been
saved into the current working directory.
# hip <- read.table("http://astrostatistics.psu.edu/datasets/HIP_star.dat",
# header=T,fill=T) # T is short for TRUE
hip <- read.table("HIP_star.dat", header=T,fill=T)
The "<-", which is actually "less than" followed by "minus", is the R assignment
operator. Admittedly, this is a bit hard to type repeatedly, so fortunately R
also allows the use of a single equals sign (=) for assignment.
Note that no special character must be typed when a command is broken across lines
as in the example above. Whenever a line is entered that is not yet syntactically
complete, R will replace the usual prompt, ">", with a "+" sign to indicate that
more input is expected.
The read.table function can refer
to a location on the web, though a filename (of a file in the working directory)
or a pathname would have sufficed.
The "header=TRUE" option is used because the first row of the file is
a header containing the names of the columns. We used the "fill=TRUE" option
because some of the rows have only 8 of the 9 columns filled, and "fill=TRUE"
instructs R to fill in blank fields at the end of the line with missing values,
denoted by the special R constant NA ("not
available"). Warning: This only works in this example because
all of the empty cells are in the last column of the table. (You can verify
this by checking the ASCII file
HIP_star.dat.)
Because the read.table function uses delimiters to determine where to break between
columns, any row containing only 8 values would always put the
NA in the
9th column, regardless of where it was intended to be placed.
As a general rule, data files with explicit delimiters are to be preferred to
files that use line position to indicate column number, particularly when missing
data are present. If you must use line position, R provides the
read.fortran and
read.fwf functions for reading
fixed width format files.
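The behavior of "fill=TRUE" described above can be tried on a tiny made-up
table. This is only a sketch: the three-column table and the names txt and
toy are invented for illustration, and the data are passed inline through
the text argument of read.table instead of a file.

```r
# Hypothetical three-column table; the second data row has only two values
txt <- "a b c
1 2 3
4 5"
toy <- read.table(text = txt, header = TRUE, fill = TRUE)
toy
# The NA is placed in the last column, whichever value was actually missing
is.na(toy$c)
```

Because the fields are separated only by white space, R has no way of knowing
whether the short row lacked its first, second, or third value.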
Summarizing the dataset
The following R commands list
the dimensions
of the dataset and print the variable names
(from the single-line header). Then
we list the first row, the first 20 rows for the 7th
column, and the sum
of the 3rd column.
dim(hip)
names(hip)
hip[1,]
hip[1:20,7]
sum(hip[,3])
Note that vectors, matrices, and arrays are indexed using the square brackets
and that "1:20" is shorthand for the vector containing integers 1 through 20,
inclusive. Even punctuation marks such as the colon have help entries, which
may be accessed using
help(":").
Next, list the maximum, minimum, median, and median absolute deviation
(a robust measure of spread, similar to the standard deviation) of each
column. First we do this using a for-loop, which is a slow process in R.
Inside the loop, c is a generic R function that combines its arguments into
a vector and print is a generic R command that prints the contents of an
object. After the inefficient but intuitively clear approach using a
for-loop, we then do the same job in a more efficient fashion using the
apply command. Here the "2" refers to the columns of hip; a "1" would refer
to rows.
for(i in 1:ncol(hip)) {
print(c(max(hip[,i]), min(hip[,i]), median(hip[,i]), mad(hip[,i])))
}
apply(hip, 2, max)
apply(hip, 2, min)
apply(hip, 2, median)
apply(hip, 2, mad)
The curly braces {} in the for loop above are optional because there is
only a single command inside. Notice that the output gives only NA for the last column's
statistics. This is because a few values in this column are missing. We can
tell how many are missing and which rows they come from as follows:
sum(is.na(hip[,9]))
which(is.na(hip[,9]))
There are a couple of ways to deal with the NA problem. One is to repeat
all of the above calculations on a new R object consisting of only those
rows containing no NAs:
y <- na.omit(hip)
for(i in 1:ncol(y)) {
print(c(max(y[,i]), min(y[,i]), median(y[,i]), mad(y[,i])))
}
Another possibility is to use the na.rm (remove NA) option of the summary
functions. This solution gives slightly different answers from
the solution above; can you see why?
for(i in 1:ncol(hip)) {
print(c(max(hip[,i],na.rm=T), min(hip[,i],na.rm=T), median(hip[,i],na.rm=T), mad(hip[,i],na.rm=T)))
}
A vector can be sorted with sort, which can use the Shellsort or Quicksort
algorithms; rank returns the ranks of the values in a numeric vector; and
order returns a vector of indices that will sort a vector. The last of these
functions, order, is often the most useful of the three, because it allows
one to reorder all of the rows of a matrix according to one of the columns:
sort(hip[1:10,3])
hip[order(hip[1:10,3]),]
Each of the above lines gives the sorted values of the first ten entries of the
third column, but the second line reorders each of the ten rows in this order. Note
that neither of these commands actually alters the value of hip, but we could reassign
hip to equal its sorted values if desired.
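The difference between the three functions discussed above is easiest to see
on a very short vector. The values and the small matrix m below are made up
for illustration.

```r
v <- c(30, 10, 20)   # made-up values
sort(v)              # the values in increasing order
rank(v)              # the rank of each element within v
order(v)             # the indices that would sort v
# Reorder whole rows of a small matrix by its 'value' column
m <- cbind(id = 1:3, value = v)
m[order(m[, "value"]), ]
```

Indexing a vector by its own order, as in v[order(v)], gives the same result
as sort(v).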
Standard errors and confidence intervals
The standard error of an estimator is, by definition, an estimate of the
standard deviation of that estimator. Let's consider an example.
Perhaps the most commonly used estimator is the sample mean (called
a statistic because it depends only on the data), which is
an estimator of the population mean (called a parameter).
Assuming that our sample of data truly consists
of independent observations of a random variable X, the true standard
deviation of the sample mean equals stdev(X)/sqrt(n), where n is the sample
size. However, we do not usually know stdev(X), so we estimate the standard
deviation of the sample mean by replacing stdev(X) by an estimate thereof.
If the Vmag column (the 2nd column) of our dataset may be considered a random sample from
some larger population, then we may estimate the true mean of this population
by
mean(hip[,2])
and the standard error of this estimator is
sd(hip[,2]) / sqrt(2719)
We know that our estimator of the true population mean is not exactly correct, so
a common way to incorporate the uncertainty in our measurements into reporting
estimates is by reporting a confidence interval. A confidence interval for
some population quantity is always a set of "reasonable" values for that quantity.
In this case, the Central Limit Theorem tells us that the sample mean
has a roughly Gaussian, or normal, distribution centered at the true population mean.
Thus, we may
use the fact that 95% of the mass of any Gaussian distribution is contained within
1.96 standard deviations of its mean to construct the following 95% confidence interval
for the true population mean of Vmag:
mean(hip[,2]) + c(-1.96,1.96)*sd(hip[,2]) / sqrt(2719)
In fact, many confidence intervals in statistics have exactly the form above,
namely, (estimator) +/- (critical value) * (standard error of estimator).
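As a sketch of this general recipe, the same calculation can be written so
that the sample size is taken from the data instead of being typed in by
hand. The data here are simulated and are not the Hipparcos values; the seed,
the sample size, and the variable names are arbitrary choices.

```r
set.seed(1)                            # arbitrary seed, for reproducibility
x <- rnorm(200, mean = 8, sd = 2)      # simulated data
n <- length(x)                         # sample size taken from the data
se <- sd(x) / sqrt(n)                  # standard error of the sample mean
ci <- mean(x) + c(-1.96, 1.96) * se    # approximate 95% confidence interval
ci
```

Using length(x) avoids having to update the calculation by hand if the
number of observations changes.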
The precise interpretation of a confidence interval is a bit tricky.
For instance, notice that
the above interval is centered not at the true mean (which is unknown), but
at the sample mean. If we were to take a different random sample of the same
size, the confidence interval would change even though the true mean
remains fixed. Thus, the correct way to interpret the "95%" in "95% confidence
interval" is to say that roughly 95% of all such hypothetical intervals will
contain the true mean. In particular, it is not correct to claim,
based on the previous output, that
there is a 95% probability that the true mean lies between 8.189 and 8.330.
Although this latter interpretation is incorrect,
if one chooses to use Bayesian estimation procedures, then the analogue of a
confidence interval is a so-called "credible interval"; and the incorrect
interpretation of a confidence interval is actually the correct interpretation
of a credible interval (!).
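The "hypothetical intervals" interpretation described above can be explored
with a small simulation. The sketch below assumes a Gaussian population with
a known mean, which is never available in practice; the population values,
sample size, and number of repetitions are arbitrary.

```r
set.seed(2)          # arbitrary seed, for reproducibility
true_mean <- 8       # known only because the data are simulated
n <- 50              # size of each hypothetical sample
covers <- replicate(2000, {
  x <- rnorm(n, mean = true_mean, sd = 2)
  ci <- mean(x) + c(-1.96, 1.96) * sd(x) / sqrt(n)
  ci[1] <= true_mean && true_mean <= ci[2]   # does this interval cover the truth?
})
mean(covers)         # proportion of intervals containing the true mean
```

The proportion should come out close to 0.95, though not exactly, both
because of simulation noise and because the interval itself is approximate.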
More R syntax
Arithmetic
in R is straightforward. Some common operators are: + for
addition, - for subtraction, * for multiplication, / for division, %/%
for integer division, %% for modular arithmetic, ^ for exponentiation.
The help page for these operators may be accessed by typing, say,
?'+'
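A few of the arithmetic operators listed above, applied to arbitrary
numbers:

```r
7 + 2      # 9
7 %/% 2    # 3, integer division
7 %% 2     # 1, the remainder
2^10       # 1024
```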
Some common
built-in functions are exp
for the exponential function, sqrt
for square root, log10
for base-10 logarithms, and cos
for cosine. The syntax resembles "sqrt(z)". Comparisons
are made using < (less than), <= (less than or equal), ==
(equal to) with the syntax "a >= b". To test whether a and b
are exactly equal and return a TRUE/FALSE value (for instance, in an
"if" statement), use the command identical(a,b)
rather than a==b. Compare the following two ways of comparing the vectors a and b:
a <- c(1,2);b <- c(1,3)
a==b
identical(a,b)
Also note that in the above example, 'all(a==b)' is equivalent to 'identical(a,b)'.
R also has other logical
operators such as & (AND), | (OR), ! (NOT). There is also an xor (exclusive or)
function. Each of these four functions performs elementwise comparisons in much the same
way as arithmetic operators:
a <- c(TRUE,TRUE,FALSE,FALSE);b <- c(TRUE,FALSE,TRUE,FALSE)
!a
a & b
a | b
xor(a,b)
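However, when 'and' and 'or' are used in programming, say in 'if' statements,
generally the '&&' and '||' forms are preferable. These longer forms of 'and'
and 'or' evaluate left to right and work on single logical values instead of
elementwise (older versions of R silently examined only the first element of
each vector, whereas versions from 4.3.0 onward signal an error when given a
longer vector), and evaluation terminates as soon as a result is determined.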
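The early termination described above means that the right-hand side is not
even evaluated once the answer is known. In the sketch below, the calls to
stop would raise an error if they were ever run; the variable x is a
hypothetical input that may be absent.

```r
r1 <- FALSE && stop("never evaluated")   # the right-hand side is skipped
r2 <- TRUE || stop("never evaluated")    # likewise
r1
r2
# A common use: guard a test that only makes sense when x exists
x <- NULL
if (!is.null(x) && x > 0) print("positive") else print("missing or not positive")
```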
Some other operators are listed on the help page opened by typing ?Syntax.
The
expression "y == x^2" evaluates as TRUE or FALSE, depending upon whether
y equals x squared, and performs no assignment (if either y or x does not
currently exist as an R object, an error results).
Let's continue with simple characterization of the dataset: find the
row number of the object with the smallest value of
the 4th column using which.min. A longer, but instructive, way to
accomplish this task creates a long vector of logical constants (tmp),
mostly FALSE with one TRUE, and then picks out the row with "TRUE".
which.min(hip[,4])
tmp <- (hip[,4]==min(hip[,4]))
(1:nrow(hip))[tmp] # or equivalently,
which(tmp)
The cut
function divides the range of x into intervals and codes the values of
x according to the interval into which they fall. This is a quick way
to group a vector into bins. Use the "breaks" argument to either
specify a vector of bin boundaries, or give the number of intervals
into which x should be cut. Bin string labels can be
specified. The cut function converts numeric vectors into an R object of
class "factor",
which can be ordered and otherwise manipulated, e.g. with the levels command.
A more flexible method for dividing a vector into groups using
user-specified rules is given by split.
table(cut(hip[,"Plx"],breaks=20:25))
The command above uses several tricks. Note that a column in a matrix may be referred to by its
name (e.g., "Plx") instead of its number. The notation '20:25' is short for 'c(20,21,22,23,24,25)'
and in general, 'a:b' is the vector of consecutive
integers starting with a and ending with b (this also works
if a is larger than b). Finally, the
table command
tabulates the values in a vector or factor.
Although R makes it easy for experienced users to invoke multiple functions
in a single line, it may help to recognize that the previous command accomplishes
the same task as the following string of commands:
p <- hip[,"Plx"]
cuts <- cut(p,breaks=20:25)
table(cuts)
The only difference is that the string of three separate commands creates
two additional R objects, p and cuts. The preferred method of carrying out
these operations depends on whether there will later be any use for these
additional objects.
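To see exactly how cut assigns values to bins, it may help to try it on a
handful of made-up numbers. The vector x below is invented for illustration
and merely resembles parallax values.

```r
x <- c(20.5, 21.2, 22.0, 24.9)    # made-up values
bins <- cut(x, breaks = 20:25)    # by default, intervals are closed on the right
levels(bins)                      # the interval labels
table(bins)                       # number of values in each interval
split(x, bins)                    # the values themselves, grouped by interval
```

Note that 22.0 is counted in the interval (21,22], and that a value exactly
equal to the lowest break would be coded as NA unless include.lowest=TRUE
is given.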
Univariate plots
Recall the variable names in the Hipparcos dataset using the
names
function. By using
attach,
we can automatically create temporary variables with these names (these variables
are not saved as part of the R session, and they are superseded by any other R objects
of the same names).
names(hip)
attach(hip)
After using the attach command, we can obtain, say, individual
summaries
of the variables:
summary(Vmag)
summary(B.V)
Next, summarize some of this information graphically using a simple
yet sometimes effective visualization tool called a dotplot or dotchart,
which lets us view all observations of a quantitative variable simultaneously:
dotchart(B.V)
The shape of the distribution of the B.V variable may be viewed using a
traditional histogram. If we use the prob=TRUE option for the histogram
so that the vertical axis is on the probability scale (i.e., the histogram
has total area 1), then a so-called kernel density estimate, or
histogram smoother, can be overlaid:
hist(B.V,prob=T)
d <- density(B.V,na.rm=T)
lines(d,col=2,lwd=2,lty=2)
The topic of density estimation will be covered in a later tutorial.
For now, it is important to remember that even though histograms and
density estimates are drawn in two-dimensional space, they are
intrinsically univariate analysis techniques here: We are only
studying the single variable B.V in this example (though there
are multivariate versions of these techniques as well).
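The statement that a histogram on the probability scale has total area 1 can
be checked numerically. This sketch uses simulated data in place of B.V, and
the names x, h, and area are arbitrary.

```r
set.seed(3)                               # arbitrary seed
x <- rnorm(500)                           # simulated data
h <- hist(x, plot = FALSE)                # compute the histogram without drawing it
area <- sum(h$density * diff(h$breaks))   # bar heights times bar widths
area
```

The density component of the object returned by hist holds the bar heights
that are drawn when prob=TRUE is used.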
Check the help file for the
par function (by typing "?par")
to see what the col, lwd, and lty options accomplish
in the lines function above.
A simplistic histogram-like object for small datasets, which both gives the
shape of a distribution and displays each observation, is called a stem-and-leaf
plot. It is easy to create by hand, but R will create a text version:
stem(sample(B.V,100))
The sample command was used above to obtain a random sample of 100, without
replacement, from the
B.V vector.
Finally, we consider
box-and-whisker plots
(or "boxplots") for the four variables Vmag, pmRA, pmDE, and B.V (the last variable
used to be B-V, or B minus V, but R does not allow a minus sign in a variable
name, so read.table replaced it with a period). These
are the 2nd, 6th, 7th, and 9th columns of 'hip'.
boxplot(hip[,c(2,6,7,9)])
Our first attempt above looks pretty bad due to the different scales of
the variables, so we construct an array of four single-variable plots:
par(mfrow=c(2,2))
for(i in c(2,6,7,9))
boxplot(hip[,i],main=names(hip)[i])
par(mfrow=c(1,1))
The
boxplot
command does more than produce plots; it also returns output that can be
more closely examined. Below, we produce boxplots
and save the output.
b <- boxplot(hip[,c(2,6,7,9)])
names(b)
'b' is an object called a list. To understand its contents, read the help for
boxplot. Suppose we wish to see
all of the outliers in the pmRA variable, which is the second of the four variables
in the current boxplot:
b$names[2]
b$out[b$group==2]
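The structure of the list returned by boxplot is easier to follow on a tiny
example. The data below are made up, and plot=FALSE is used so that only the
statistics are computed.

```r
toy <- list(a = c(1, 2, 3, 4, 100), b = c(1, 2, 3))   # made-up data
b2 <- boxplot(toy, plot = FALSE)   # compute the statistics without drawing
b2$names                 # one name per box
b2$out                   # all points lying beyond the whiskers
b2$group                 # the box to which each of those points belongs
b2$out[b2$group == 1]    # outliers of the first box only
```

Here the only outlier is the value 100 in the first group, so both out and
group have length one.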
R scripts
While R is often run interactively, one often wants to carefully
construct R scripts and run them later. A file containing R code can be run using the source
command. In addition, R may be run in batch mode.
The editor Emacs, together with "Emacs Speaks Statistics" (ESS),
provides a nice way to produce R scripts.
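As a minimal sketch of the source command mentioned above, the lines below
write a two-line script to a temporary file and then run it. The file
location and the variable name z are arbitrary; in practice the script would
be a file created in an editor.

```r
script <- tempfile(fileext = ".R")                # hypothetical script location
writeLines(c("z <- 1 + 1", "print(z)"), script)   # a two-line script
source(script)                                    # runs every line of the file
z                                                 # objects created by the script remain available
```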