Introduction to R
R is free statistical software. It has many uses, including
- performing simple calculations (like a very powerful pocket
calculator),
- making plots (graphs, diagrams, etc.),
- analyzing data using ready-made statistical tools,
- and above all, serving as a powerful programming language.
We shall acquaint ourselves with the basics of R in this tutorial.
First you must have R installed on your computer. Then you'll
have to do one of a number of things depending on your computer
setup. The simplest technique is to turn to someone who has worked
with R in your lab, and ask for help! If no such person is at hand,
then you may try one of these.
- If you see an R icon, then (double) clicking on it should work.
- Open a command window (xterm, say) and try typing R.
- Search for the path where R is installed, and cd to its bin
folder. Then issue the ./R command.
If everything goes well, you should see something like this.
R : Copyright 2005, The R Foundation for Statistical Computing
Version 2.1.1 (2005-06-20), ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for a HTML browser interface to help.
Type 'q()' to quit R.
The > symbol is the R prompt.
You have to type commands in front of this prompt and press the
Enter key on your keyboard.
R may be used like a simple calculator. Type the following
command in front of the prompt and hit Enter.
2 + 3
Ignore the [1] in the output for the time being. We
shall learn its meaning later.
2 / 3
What about the following? Wait! Don't type these all over again.
2 * 3
2 - 3
Just hit the up-arrow key of your keyboard to replay the last
line. Now use the left and right cursor keys and the Backspace
(or Delete) key to make the necessary changes.
So now you know about three different types of `numbers' that R can
handle: ordinary numbers, infinities, NaN (Not a Number).
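As a hedged illustration of these three kinds of numbers, the sketch below uses a few divisions chosen for this example; they are not necessarily the commands the tutorial originally showed. The functions is.finite and is.nan are standard base R functions.

```r
# An ordinary number
a = 2 / 3
# Infinities arise when a nonzero number is divided by zero
b = 1 / 0       # Inf
c0 = -1 / 0     # -Inf
# NaN arises from undefined operations such as 0/0
d = 0 / 0       # NaN
print(c(a, b, c0, d))
print(is.finite(c(a, b, c0, d)))   # only the first entry is finite
print(is.nan(d))
```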
What does R say to the following?
Guess the result of
R can work with variables. For example
x = 4
assigns the value 4 to the variable x. This assignment
occurs silently, so you do not see any visible effect
immediately. To see the value of x type
x
This is a very important thing to remember: to see the value of
a variable just type its name and hit Enter.
Let us create another variable.
y = -4
Try the following.
x + y
x - 2*y
x^2 + 1/y
The caret (^) in the last line denotes power.
What happens if you type the following?
and what about the next line?
X + Y
Well, I should have told you already: R is case sensitive!
Can you explain the effect of this?
x = 2*x
Unlike many other languages, R allows the dot character (.) as
part of a variable name. Thus you can write
speed.light = 3e8 # 300 000 000 m/s
R knows most of the standard functions.
sin(pi) #pi is a built-in constant
The part of a line after # is called a
comment. It is meant for U, the UseR who uses R!
R does not care about comments.
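A small sketch of a few more standard functions follows. The particular calls are chosen for illustration and are not taken from the tutorial. It also shows a point worth knowing about sin(pi): since pi is stored only approximately, the result is a tiny nonzero number rather than exactly 0.

```r
sqrt(16)       # square root
exp(0)         # exponential function
abs(-3.5)      # absolute value
s = sin(pi)    # mathematically zero, but pi is stored only approximately
print(s)       # a tiny nonzero number, not exactly 0
print(abs(s) < 1e-15)
```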
While you are in the mood of using R as a calculator you may
want to try the log function as well.
What is the base of the logarithm?
R has many many features and it is impossible to keep all its
nuances in one's head. So R has an efficient online help
system. The next exercise introduces you to this.
Suppose that you desperately need logarithm to the base 10. You
want to know if R has a ready-made function to compute that. So
type
?log
A new window (the help window) will pop up. Do you find what you
need?
Always look up the help of
anything that does not seem clear. The technique is to type a
question mark followed by the name of the thing you are
interested in. All words written like this in this
tutorial have online help.
Sometimes, you may not know the exact name of the function that
you are interested in. Then you can try the help.search
function. It has a useful abbreviation:
help.search("logarithm")
can be contracted to
??logarithm
Searching for functions with names known only approximately
Can R compute the Gamma function? As your first effort try
?Gamma
Oops! Apparently this is not the Gamma function you are looking
for. So try
help.search("Gamma")
This will list all the topics that involve Gamma. After some
deliberation you can see that ``Special Functions of
Mathematics'' matches your need most closely. So type
?Special
Got the information you needed?
Sometimes it is easier to search the internet than to use
R's built-in help search.
We can type
sin(1)
to get the value of sin(1). Here
sin is a standard built-in
function. R allows us to create new functions of our own. For
example, suppose that some computation requires you to find
the value of
f(x) = x/(1-x)
repeatedly. Then we can write a function to do this as follows.
f = function(x) x/(1-x)
Now you may type
y = 4
f(y)
f is the name of the function. It can be any name of your
choice (as long as it does not conflict with names already
existing in R).
A couple of points are in order here. First, the choice of the
name depends completely on you. Second, the name of the argument
is also a matter of personal choice. But you must use the same
name also inside the body of the function.
It is also possible to write functions of more than one
argument. Try out the following.
g = function(x,y) (x+2*y)/3
Write a function with name myfun that computes
x+2*y/3. Use it to compute 2+2*3/3.
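The anatomy of a function described above can be sketched as follows. The functions f and g are the ones from the text; f2 is a hypothetical name introduced here only to show that the argument name is a free choice. The exercise function myfun is deliberately left to the reader.

```r
# One-argument function from the text: f(x) = x/(1-x)
f = function(x) x/(1-x)
# Same function with a different argument name; the body must use that name
f2 = function(t) t/(1-t)
# Two-argument function from the text
g = function(x,y) (x+2*y)/3

print(f(2))          # 2/(1-2)
print(f2(2))         # same value as f(2)
print(g(1,2))        # (1 + 2*2)/3
print(g(y=2,x=1))    # arguments may also be passed by name
```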
So far R appears to be little more than a sophisticated
calculator. But unlike most calculators it can handle vectors,
which are basically lists of numbers.
x = c(1,2,4,56)
The c function is for concatenating numbers (or
variables) into vectors.
y = c(x,c(-1,5),x)
There are useful methods to create long vectors whose elements
are in arithmetic progression:
x = 1:20
If the common difference is not 1 or -1 then we can use the seq function.
Try the following
Do you see the meaning of the numbers inside the square brackets?
How would you create the following vector in R?
1, 1.1, 1.2, ..., 1.9, 2, 5, 5.2, 5.4, ..., 9.8, 10
Hint: First make the two parts separately, and then concatenate
them.
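The tools needed for such constructions can be sketched with different numbers, so that the exercise above stays open. This assumes the standard seq(from, to, by) form; the variable names are chosen for this example only.

```r
# Consecutive integers with the colon operator
a = 3:7                  # 3 4 5 6 7
b = 7:3                  # common difference -1
# Other common differences with seq(from, to, by)
s1 = seq(0, 1, 0.25)     # 0.00 0.25 0.50 0.75 1.00
s2 = seq(10, 2, -2)      # 10 8 6 4 2
# Two progressions glued together with c
both = c(s1, s2)
print(both)
# A long vector wraps over several lines when printed; the number in
# square brackets at the start of each line is the position of the
# first entry shown on that line
print(100:130)
```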
Working with vectors
Now that we know how to create vectors in R, it is time to use
them. There are several different types of functions to work with
vectors, including
- those that work entrywise,
- those that summarize a vector into a few numbers (like sum, which
finds the sum of all the numbers).
It is very easy to add/subtract/multiply/divide two vectors entry
by entry.
x = c(1,2,-3,0)
y = c(0,3,4,0)
Most operations that work with numbers act entrywise when applied
to vectors. Try this.
x = 1:5
Next we meet some functions that summarize a vector into one or
two numbers.
Try the following and guess the meanings of the commands.
val = c(2,1,-4,4,56,-4,2)
Guess the outcome of
Check your guess with the online help.
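The contrast between entrywise operations and summarizing functions can be sketched as below. The vectors are the ones defined in the text; the particular commands applied to them are assumptions made for this illustration, since they are standard base R functions but not necessarily the ones the tutorial listed.

```r
x = c(1,2,-3,0)
y = c(0,3,4,0)
# Entrywise operations return a vector of the same length
print(x + y)       # 1 5 1 0
print(x * y)       # 0 6 -12 0
print((1:5)^2)     # 1 4 9 16 25
# Summarizing functions reduce a vector to one or two numbers
val = c(2,1,-4,4,56,-4,2)
print(sum(val))      # 57
print(length(val))   # 7
print(mean(val))     # sum(val)/length(val)
print(range(val))    # smallest and largest entries: -4 56
```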
Extracting parts of a vector
If x is a vector of length 3 then its entries may be accessed as
x[1], x[2] and x[3].
x = c(2,4,-1)
i = 3
Note that the counting starts from 1 and proceeds left-to-right.
The quantity inside the square brackets is called the
subscript or index.
C/C++ and Java users beware: indexing in R
starts from 1, and not from 0.
It is also possible to access multiple entries of a vector by
using a subscript that is itself a vector.
x = 3:10
What is the effect of the following?
x = c(10,3,4,1)
ind = c(3,2,4,1) #a permutation of 1,2,3,4
x[ind]
This technique is often useful to rearrange a vector.
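A minimal sketch of vector subscripts, using a hypothetical vector v and permutation p that are not part of the tutorial, so that the question above remains an exercise:

```r
v = c(50, 60, 70, 80)    # hypothetical example vector
print(v[2])              # a single entry: 60
print(v[c(4,1)])         # entries 4 and 1, in that order: 80 50
print(v[2:4])            # a range of positions: 60 70 80
print(v[c(2,2,3)])       # a position may be repeated: 60 60 70
# Rearranging with a permutation of the positions 1,2,3,4
p = c(4,3,2,1)
print(v[p])              # 80 70 60 50
```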
Try the following to find how R interprets negative subscripts.
x = 3:10
x[-1]
Subscripting allows us to find one or more entries in a vector if
we know the position(s) in the vector. There is a different (and
very useful) form of subscripting that allows us to extract
entries with some given property.
x = c(100,2,200,4)
x[x>50]
The second line extracts all the entries in x that exceed
50. There are some nifty things that we can achieve using this
kind of subscripting. To find the sum of all entries exceeding 50
we can use
sum(x[x>50])
How does this work? If you type
x>50
you will get a vector of TRUEs and FALSEs. A TRUE
stands for a case where the entry exceeds 50. When such a
True-False vector is used as the subscript only the entries
corresponding to the TRUEs are retained. Even that is not
all. Internally a TRUE is basically a 1, while a FALSE is a 0.
So if you type
sum(x>50)
you will get the number of entries exceeding 50.
The number of entries satisfying some given property (like ``less
than 4'') may be found by summing the corresponding True-False
vector.
If
val = c(1,30,10,24,24,30,10,45)
then what will be the result of the following?
sum(val >= 10 & val <= 40)
sum(val > 40 | val < 10) # | means "OR"
sum(val == 30) #we are using == and not =
sum(val != 24)
Be careful with ==. It is different from =. The former
means comparing for equality, while the latter means assignment
of a value to a variable.
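The mechanics of True-False subscripts can be sketched step by step. The vector x is the one used in the text; the name big and the last two conditions are introduced here for illustration only, and the val exercise above is left untouched.

```r
x = c(100,2,200,4)
big = x > 50           # logical vector: TRUE FALSE TRUE FALSE
print(big)
print(x[big])          # entries exceeding 50: 100 200
print(sum(x[big]))     # their sum: 300
print(sum(big))        # how many exceed 50: 2 (each TRUE counts as 1)
# Conditions can be combined with & (AND) and | (OR)
print(x[x > 3 & x < 150])    # 100 4
print(x == 2)                # == compares, it does not assign
```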
What does mean(x>5) compute? No, it is not the mean of all the x's exceeding 5.
Try and interpret the results of the following.
x = c(100,2,200,4)
x = c(2,3,4,5,3,1)
y = sort(x)
Look up the help of the sort function to find out how to
sort in decreasing order.
Sometimes we need to order one vector according to another
vector.
x = c(2,3,4,5,3,1)
y = c(3,4,1,3,8,9)
ord = order(x)
ord[1] is the position of the smallest number,
ord[2] is the position of the next smallest
number, and so on.
x[ord] #same as sort(x)
y[ord] #y sorted according to x
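Worked through with the vectors from the text, the ordering looks like this. Ties (the two 3s in x) are kept in their original order, which is why position 2 comes before position 5.

```r
x = c(2,3,4,5,3,1)
y = c(3,4,1,3,8,9)
ord = order(x)
print(ord)        # 6 1 2 5 3 4
print(x[ord])     # 1 2 3 3 4 5
print(y[ord])     # 9 3 4 8 1 3
print(identical(x[ord], sort(x)))
```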
R has no direct way to create an arbitrary matrix. You have to first list all the
entries of the matrix as a single vector (an m by n matrix
will need a vector of length mn) and then fold the vector into a
matrix. To create the matrix
1 2
3 4
we first list the entries column by column to get
1, 3, 2, 4.
To create the matrix in R:
A = matrix(c(1,3,2,4),nrow=2)
The nrow=2 argument tells R that the matrix has 2 rows (then R
can compute the number of columns by dividing the length of the vector by
nrow). You could have also typed:
A <- matrix(c(1,3,2,4),ncol=2) #<- is same as =
to get the same effect. Notice that R folds a vector into a matrix
column by column. Sometimes, however, we may need to fold row by row:
A = matrix(c(1,3,2,4),nrow=2,byrow=T)
Here T is the same as TRUE.
Subscripting a matrix is done much like subscripting a vector,
except that for a matrix we need two subscripts. To see the
(1,2)-th entry (i.e., the entry in row 1 and column 2) of A type
A[1,2]
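The two folding orders and the two-subscript notation can be compared in a short sketch. The name R1 for the row-wise version is hypothetical, chosen so that A keeps the column-wise meaning used in the text.

```r
A = matrix(c(1,3,2,4), nrow=2)                # filled column by column
print(A)
R1 = matrix(c(1,3,2,4), nrow=2, byrow=TRUE)   # filled row by row
print(R1)
print(dim(A))      # 2 2
print(A[1,2])      # row 1, column 2: 2
print(R1[1,2])     # 3, because the folding order differs
print(A[2,])       # leaving a subscript empty selects the whole row: 3 4
print(A[,1])       # whole first column: 1 3
print(identical(R1, t(A)))   # here the row-wise fold is the transpose of A
```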
Matrix operations in R are more or less straightforward.
Try the following.
A = matrix(c(1,3,2,4),ncol=2)
B = matrix(2:7,nrow=2)
C = matrix(5:2,ncol=2)
A%*%C #matrix multiplication
A*C #entrywise multiplication
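To make the difference between the two products concrete, here is the same computation with the results spelled out in comments. The matrices are the ones defined above.

```r
A = matrix(c(1,3,2,4),ncol=2)    # rows (1,2) and (3,4)
B = matrix(2:7,nrow=2)           # a 2 by 3 matrix
C = matrix(5:2,ncol=2)           # rows (5,3) and (4,2)
print(A %*% C)        # matrix product: rows (13,7) and (31,17)
print(A * C)          # entrywise product: rows (5,6) and (12,8)
print(dim(A %*% B))   # a 2 by 2 times a 2 by 3 gives a 2 by 3
# A * B would fail, since entrywise operations need equal dimensions
```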
Try out the following commands to find what they do.
Working with rows and columns
Consider the following.
A = matrix(c(1,3,2,4),ncol=2)
sin(A)
The sin function applies entrywise. Now suppose that we want
to find the sum of each column. So we want to apply the sum
function columnwise. We achieve this by using the
apply function like this:
apply(A,2,sum)
The 2 above means columnwise. If we need to find the
rowwise means we can use
apply(A,1,mean)
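A small sketch of apply on the matrix A from the text. The names col.sums and row.means are chosen for this example; colSums and rowMeans are standard base R shortcuts, shown only as a cross-check.

```r
A = matrix(c(1,3,2,4),ncol=2)    # rows (1,2) and (3,4)
col.sums = apply(A, 2, sum)      # 2 means columnwise: 4 6
row.means = apply(A, 1, mean)    # 1 means rowwise: 1.5 3.5
print(col.sums)
print(row.means)
# colSums and rowMeans are built-in shortcuts for these two cases
print(colSums(A))
print(rowMeans(A))
```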
Vectors and matrices in R are two ways to work with a collection
of objects. Lists provide a third method. Unlike a vector
or a matrix a list can hold different kinds of objects. Thus, one
entry in a list may be a number, while the next is a matrix,
while a third is a character string (like "Hello R!"). Lists are
useful to store different pieces of information about some common
entity. The following list, for example, stores details about a
student.
x = list(name="Chang", nationality="Chinese", height=5.5, grades=c(95,45,80))
We can now extract the different fields of x using the $ symbol.
x$hei #abbreviations are OK
In the coming tutorials we shall never
need to make a list ourselves. But the statistical functions of R
usually return the result in the form of lists. So we must know
how to unpack a list using the
$ symbol as above.
To see the online help about symbols like $ type
?"$"
Notice the quotes surrounding the symbol.
Let us see an example of unpacking a list. Suppose we want to
write a function that finds the length, total and mean of a vector.
Since the function is returning three different pieces of
information we should use lists as follows.
f = function(x) list(len=length(x),total=sum(x),mean=mean(x))
Now we can use it like this:
dat = 1:10
result = f(dat)
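Unpacking the returned list might then look like the sketch below. The function f and the data are the ones from the text; names and the double square bracket form are standard R, added here as an illustration.

```r
f = function(x) list(len=length(x),total=sum(x),mean=mean(x))
dat = 1:10
result = f(dat)
print(names(result))       # "len" "total" "mean"
print(result$len)          # 10
print(result$total)        # 55
print(result$mean)         # 5.5
print(result[["total"]])   # double square brackets also extract a field
```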
Doing statistics with R
Now that we know R to some extent it is time to put our knowledge
to use and perform some statistics with R. There are basically
three ways to do this.
- Doing elementary statistical summarization or plotting of
data.
- Using R as a calculator to compute some formula obtained
from some statistics text.
- Using the sophisticated statistical tools built into R.
In this first tutorial we shall content ourselves with the first
of these three. But first we need to get our data set inside R.
Loading a data set into R
We shall consider part of a data set given in
Distance to the Large Magellanic Cloud: The RR Lyrae Stars
Gisella Clementini, Raffaele Gratton, Angela Bragaglia,
Eugenio Carretta, Luca Di Fabrizio, and Marcella Maio
Astronomical Journal 125, 1309-1329 (2003).
We have slightly doctored the data file to make it compatible
with R. The file is called LMC.dat and resides in some folder
F:\astro, say. The data set has three columns: Method, Dist and
Err. Here are the first few lines of the file:
Method Dist Err
"Cepheids: trig. paral." 18.70 0.16
"Cepheids: MS fitting" 18.55 0.06
"Cepheids: B-W" 18.55 0.10
Note the following points: the first line contains the variable
names. Character strings with spaces in them are surrounded by
quotes. There is a single case per line.
There are various ways to load the data set. One is to use
the read.table function:
LMC = read.table("F:/astro/LMC.dat", header=T)
Note the use of forward slash (/) even if you are using
Windows. Also the header=T tells R that the first line
of the data file gives the names of the columns. Here we have
used the absolute path of the data file. In Unix the
absolute path starts with a forward slash (/).
The object LMC is like a matrix (more precisely it
is called a data frame). Each column stores the values of
one variable, and each row stores a case. Its main difference
from a matrix is that different columns can hold different types
of data (for example, the Method column stores character strings,
while the other two columns hold numbers). Otherwise, a data
frame is really like a matrix.
We can find the mean of the Dist variable like this
mean(LMC$Dist)
Note that each column of the LMC matrix is a variable, so it is
tempting to write
mean(Dist)
but this will not work, since Dist is inside LMC. We
can ``bring it out'' by the command
attach(LMC)
Now the command
mean(Dist)
works.
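For readers without the file at hand, the following sketch rebuilds a tiny data frame from the three data lines quoted above, using the text argument of read.table instead of a file path. The name txt is hypothetical, and the result covers only these three rows, not the full data set. The with function is shown as an alternative way of reaching a column, which is an addition here and not necessarily the author's approach.

```r
# The three data lines shown above, supplied as text instead of a file
txt = 'Method Dist Err
"Cepheids: trig. paral." 18.70 0.16
"Cepheids: MS fitting" 18.55 0.06
"Cepheids: B-W" 18.55 0.10'
LMC = read.table(text=txt, header=TRUE)
print(dim(LMC))                 # 3 rows, 3 columns
print(names(LMC))               # "Method" "Dist" "Err"
print(mean(LMC$Dist))           # column extracted with $
print(mean(LMC[,2]))            # same column by position
print(with(LMC, mean(Dist)))    # with() evaluates an expression inside LMC
```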
All the values of the Dist variable are different measurements of
the same distance. So it is only natural to use the average as
an estimate of the true distance. But the Err variable tells us
that not all the measurements are equally reliable. So a better
estimate might be a weighted mean, where the weights are
inversely proportional to the errors.
We can use R as a calculator to directly implement this formula:
sum(Dist/Err)/sum(1/Err)
or you may want to be a bit more explicit
wt = 1/Err
sum(wt*Dist)/sum(wt)
Actually there is a smarter way than both of these.
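One candidate for such a shortcut is the built-in weighted.mean function; that this is the way the author had in mind is an assumption. The sketch below compares it with a direct implementation of the weighted mean, sum(w*d)/sum(w) with w = 1/Err, using only the three data rows quoted earlier rather than the full data set.

```r
Dist = c(18.70, 18.55, 18.55)    # first three distances from the file
Err  = c(0.16, 0.06, 0.10)       # the corresponding errors
wt = 1/Err                       # weights inversely proportional to errors
m1 = sum(Dist/Err)/sum(1/Err)    # the formula typed directly
m2 = sum(wt*Dist)/sum(wt)        # the more explicit version
m3 = weighted.mean(Dist, wt)     # built-in function from the stats package
print(c(m1, m2, m3))             # all three agree
print(mean(Dist))                # unweighted mean, for comparison
```

The weighted estimate is pulled toward the measurements with the smaller errors, so here it falls below the plain average.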
So far we have been using R interactively, where we type commands
at the prompt and R executes each line before we type the next
line. But sometimes we may want to submit many lines of commands
to R at a single go. Then we need to use scripts.
A script file in R is a text file containing R commands (much as
you would type them at the prompt). As an example, open a text
editor (e.g., notepad in Windows, or gedit in
Linux). Avoid fancy editors like MSWord.
Create a file called, say, test.r containing the
following lines.
x = seq(0,10,0.1)
y = sin(x)
plot(x,y,ty="l") #guess what this line does!
Use script files to save frequently used command
sequences. Script files are also useful for replaying
an analysis at a later date.
Save the file in some folder.
In order to make R execute this script, call the source function
with the path of the file.
If your script has any mistake in it then R will produce error
messages at this point. Otherwise, it will execute your script.
In this as well as other examples involving files, you must
use the actual path on your system for things to work. The
examples give the paths that work on the author's machine.
The variables x and y created inside
the script file are available for use from the prompt now. For
example, you can check the value of x by simply
typing its name at the prompt.
Commands inside a script file are executed pretty much like commands
typed at the prompt. One important difference is that
in order to print the value of a variable x on the screen you have
to write
print(x)
Merely writing
x
on a line by itself will not do inside a script file.
Printing results of the intermediate steps using print
from inside a script file is a good way to debug R scripts.
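The whole cycle can be tried without choosing a folder by hand. The sketch below writes a short script to a temporary file (the generated path stands in for a real one such as the test.r of the text) and runs it with source. The plot line is left out here so that the example produces only text output.

```r
# Write a small script to a temporary file (the path is generated, not fixed)
script = tempfile(fileext=".r")
writeLines(c(
  "x = seq(0,10,0.1)",
  "y = sin(x)",
  "x[1:3]            # evaluated, but nothing is shown when sourced",
  "print(length(x))  # print is needed to show a value from a script"
), script)
source(script)        # runs the script; only the print line produces output
print(max(y) <= 1)    # x and y are now available at the prompt
```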