Detecting periodic patterns
in unevenly spaced gene expression time series
using Lomb-Scargle periodograms

Home    Paper    Supplementary Information    R Code    Feedback

Analysis Programs in

Download
LombScargle.zip for Windows, or
LombScargle.tar.gz for Linux
(download files include Bozdech data in CSV format -- see notes)

Getting Started

The following assumes you have R installed on Windows or Linux and gives the instructions for extracting the needed R and data files to reproduce figures from the paper.

Windows

  1. Unzip files in LombScargle.ZIP into work area, e.g., C:\LombScargle
  2. Start "R"
  3. File | Change dir | C:\LombScargle | OK
  4. Enter commands shown in "Comments" section below

Linux

  1. Copy LombScarlge.tar.gz to a directory.  Untar files in LombScargle.tar.gz into a LombScargle subdirectory:
    gunzip < LombScargle.tar.gz | tar xvf -
  2. cd LombScargle
  3. Start "R"
  4. Enter commands shown in "Comments" column below

Web
In some cases below commands can be copied from this web page and pasted directly into R for immediate results.

Testing
These "R" programs have been tested using R version 2.1.1 under Windows and Linux.

(R versions 2.0.1 and earlier will give slightly different p.adjust values because of how missing values were treated.  The use of R 2.1.0 or later is recommended for the improved treatment of missing values in p.adjust.)

The files below reproduce the figures from the paper.  To work with your dataset, see Step-by-Step Instructions:  How to Apply Lomb-Scargle Analysis to Your Set of Expression Data.

R File
Requires
Comments
cosine.R -

This file is perhaps a good starting point since it is a self-contained example of a Lomb-Scargle periodogram analysis of a 20-point cosine curve with even spacing over a 120 minute period.

Run the script from a local file:

source("cosine.R")

or directly from this page (copy and paste this line into R):

source("http://research.stowers-institute.org/efg/2005/LombScargle/R/cosine.R")

Output consists of a single screen of four graphics showing:

  1. Cosine Curve (N=20, evenly spaced)
  2. Histogram of time interval variability
  3. Lomb-Scargle Periodogram
  4. Peak Significance (p-value) Curve
R-Mnemonics.R  -

"R" symbolic constants, mostly for graphics. This list is far from complete.

The R community doesn't seem to want to adopt a good programming practice of using mnemonic constants in programs, but rather uses arbitrary numeric values.  For example, for "line type" R uses "1" for a solid line instead of a mnemonic like LTY_SOLID suggested in this file

See Chapter 11, "The Power of Variable Names," in Code Complete 2 by Steve McConnell for why such mnemonics should adopted by the R community.

R-efg.R
R-Mnemonics.R "WriteStamped..." routines to create time-stamped graphics files.
LombScargle.R R-Mnemonics.R

Library of Lomb-Scargle periodogram computations.  Most used functions are:
- ComputeLombScargle
- ComputeAndPlotLombScargle

To use these functions in R directly from this page (copy and paste this line into R):

source("http://research.stowers-institute.org/efg/2005/LombScargle/R/LombScargle.R")

or using a local copy:

source("LombScargle.R")

The following is a brief example using an "Expression" profile at time values that are unevenly spaced (cut and paste these commands into R after running one of the above "source" statements) :

unit <<- "hour" # hourly data
set.seed(19)    # make this example reproducible

# Use "random" times for uneven spacing
# (Note: Using "t" (transpose)or "T" (TRUE) as "R"
# "time" variables is not a good idea since these
# symbols already used in a way that could conflict.)
time <- sort( runif(48, 0, 48) )

Expression <- cos(2*pi*time/24 + pi/3) # 24 hour period

M <- 4*length(time) # often 2 or 4

# Use test frequencies corresponding to
# periods from 8 hours to 96 hours
TestFrequencies <- (1/96) + (1/8 - 1/96) * (1:M / M)

# Use Horne & Baliunas' estimate of independent frequencies
Nindependent <- NHorneBaliunas(length(Expression))

ComputeAndPlotLombScargle(time, Expression,
  TestFrequencies, Nindependent,
  "Cosine Curve (24 hour period)")

Output consists of a single figure with four plots showing:

  1. Cosine Curve (N=48, unevenly spaced).  The peak of the loess-smoothed curve (dotted line) is used for phase.
  2. Histogram of time interval variability
  3. Lomb-Scargle Periodogram with peak corresponding to 24.1 hour period (expected value is 24.0 hours)
  4. Peak Significance Curve:  p-value of Lomb-Scargle peak is quite significant:  p = 3.97E-9.
cosine48.R LombScargleLibrary.R

Used for experiments with cosine curves.  See below. 

Examples periodograms from cosine curves with periods of 48 and 24 hours:
source("cosine48.R")

or copy and paste this line into R:

source("http://research.stowers-institute.org/efg/2005/LombScargle/R/cosine48.R")

and then

par(mfrow=c(2,2))
DemoLombScargleCosine(48, 48,1.0, 0.00, 0.05)
DemoLombScargleCosine(24, 48, 1.0, 0.00, 0.08)

Output from each DemoLombScargleCosine call results in a single screen with three graphics (the histogram is suppressed):

  1. Cosine Curve
  2. Lomb-Scargle Periodogram
  3. Peak Significance Curve

Alternatively, in R (under Windows) select History | Recording in graphics window and don't bother with the mfrow setting above.  You'll see three individual graphics that way using the PageUp/PageDown keys to view them.

Numerical Experiments
SinglePeriod.R cosine48.R

Show Lomb-Scargle periodograms for "ideal gene" with single dominant period ranging from 4 to 72 hours with even and uneven time spacing, and with and without noise.
Run script:  source("SinglePeriod.R")
Output is written to file SinglePeriod.pdf (18 pages).  See A-1 in Numerical Experiments for discussion.

Nyquist.R cosine48.R Show spurious spikes seen in periodogram above Nyquist limit. 
Run script:  source("Nyquist.R")
Output is written to file Nyquist.pdf (6 pages) .  See A-2 in Numerical Experiments for discussion.
DutyCycle.R LombScargleLibrary.R

"Duty cycle" represents fraction of expression cycle that gene is on or off.  See TwoPeriods.R.  

Example of gene on 2/3rds of time and off 1/3 of time:

source("DutyCycle.R")
DutyCycle( c(rep(1,16), rep(0, 8)),
    0.00, 0.50, 2,
    "duty cycle: 2/3")

Output is a single screen of three graphics (as with cosine48.R). 

See A-3.2 in Numerical Experiments for discussion.

TwoPeriods.R
cosine48.R
DutyCycle.R
Two periodicities can occur from two separate independent periods, or from an asymmetric "duty cycle." 
Run script:  source("TwoPeriods.R")
Output is written to file TwoPeriods.pdf (16 pages) and DutyCycle.pdf (4 pages).  See A-3 in Numerical Experiments for discussion.
ThreePeriods.R cosine48.R Three periodicities from three separate independent periods.
Run script:  source("ThreePeriods.R")
Output is written to file ThreePeriods.pdf (20 pages).  See A-4 in Numerical Experiments for discussion.
Ndetermination.R cosine48.R "N" determination for single gene "p" value for cosine curve using linear least squares curve fit. 
Run script:  source("Ndetermination.R")
Output is written to file Ndetermination.pdf (21 pages) with least squares results displayed on the screen. See A-5 in Numerical Experiments for discussion.
ProcessLombScargle
Dataset.R
R-Mnemonics.R
R-efg.R
LombScargle.R
Common code for processing a set of expression profiles, whether from a dataset like from Bozdech, or one from numerical experiments.  Normally not used directly.
NoiseExperiment
Setup.R
ProcessLombScargle
DataSet.R
Code for defining"Noise" and "Mixture" experiments (see below).  Normally not used directly.

Noise-1.R
Noise-2.R
Noise-3.R

NoiseExperiment
Setup.R

Experiment with "ideal gene" and noise:  cosine curve plus known amounts of noise.  Run scripts in this order:

1) source("Noise-1.R")

This can take 1 to 2 hours (faster on newer hardware) to create three files (Dominant.CSV, Periodogram.CSV, and pValue.CSV) in 25 directories. 

15 directories have names:

  <period>Cosine<seed>Noise<amount>

where

  <period> = 12, 24 or 48
  <seed> = seed used for random number generator
  <amount> = noise amount, 0, 10, 20, 50, 100

and 10 directories have names:

  <period>Random<seed>

where

  <period> = 12, 24 or 36 (1 each) or 48 (7 replicates)
  <seed> = seed used for random number generator

Each directory needs about 13 MB of space.

2) source("Noise-2.R")

This takes a few seconds to summarize the "Cosine" directories into a single table.  Results are stored in Noise-2.pdf  (30 pages) and Noise-2.txt.

3) source("Noise-3.R")

This takes a few seconds to summarize the "Random" directories from Noise-1 in a table.  Results are stored in Noise-3.pdf (20 pages) and Noise-3.txt.

See A-6 in Numerical Experiments for discussion.

Mixture-1.R
Mixture-2.R
NoiseExperiment
Setup.R

Histograms of "p" values for various mixtures.  Run scripts in this order:  

1) source("Mixture-1.R")

This takes ~15 minutes to create three files (Dominant.CSV, Periodogram.CSV, and pValue.CSV) in 5 directories named:

  48Mixture<Fraction>Seed191Noise20

where

<fraction> =  0, 250, 500, 750 or 1000 for
      0.0, 0.25, 0.50, 0.75, and 1.00

2. source("Mixture-2.R")

Creates out file MixtureExperiment.pdf (10 pages) with histograms and multiple testing correction summaries.

See A-7 in Numerical Experiments for discussion.

Exploratory analysis and Results
FindExtremes.R R-efg.R
R-Mnemonics.R

Find extreme values in raw Bozdech dataset and plot charts of "extreme" values. 
To run:    source("FindExtremes.R")
Results in PDF files are in a directory named "bad".  The boxplot-Extremes.pdf file show a summary of the original, unmodified data.  PDFs for eight probes containing the extreme values can be inspected.

Bozdech03.R

ProcessLombScargle
Dataset.R

File:  Complete_Dataset.csv

Routines to process Bozdech 2003 Plasmodium dataset.

Process complete dataset and create charts shown on Exploratory analysis page, sections B-1, B-3, B-4, B-5, B-6, and Results page, sections C-1 and C-2,  (note the chart in B-7 is only created if certain Bioconductor packages are available.):

  source("Bozdech03.R")
  ProcessAll()


The process above takes about 20 minutes on a reasonably fast Linux processor.  For now, the analysis shows all the graphics as they are being computed, which adds considerable processing time.  The "PhaseShift" value is only computed now as part of plotting the curves.

ProcessAll creates a "Complete" directory with three CSV files (Dominant.CSV, Periodogram.CSV, pValue.CSV) and 8 PDF files with figures for B-1 through B-7.

Analyze a single probe at any time:

  source("Bozdech03.R")
  d <- LoadPlasmodiumData()

  ShowSingleProbeProfile(d, 2000, "Row 2000")
or
  ShowSingleProbeProfileByName(d, "opfi17738")

Any filename can be specified with ProcessAll or LoadPlasmodiumData but the default filename is Complete_Dataset.csv.

Analyze1.R

R-efg.R
R-Mnemonics.R

File:  Complete_Dataset.csv

Post-process data created by ProcessAll function above.  Create figures C-3 ("p" histogram), C-4 (period histogram) and C-5 (multiple testing methods), and file Complete/Dominant-p-adjust.CSV 

Run script: 
   source("Analyze1.R")

The script will prompt for a filename:  Complete/Dominant.csv.

This code should execute in both R 2.0.1 or R 2.1.0 or later. NOTE:  Results will vary slightly by version of R.    The output file Target.txt will have 4358 probes if run in R 2.0.1 or earlier and 4355 probes if run in R 2.1.0 or later.

Output files: 
- Dominant-p-adjust.csv
- p-adjust.pdf
- histogram-Period.pdf
- histogram-N.pdf
- histogram-log10p.pdf
- Target.txt

StatsAndSets.R Files:  Complete_Dataset.csv,
QC_Dataset.csv,
Overview_Dataset.csv,
Target.txt

Perform a variety of set operations to get the cardinality of the various sets in the Venn diagram (Fig 5. in the paper).  Compare Lomb-Scargle "p" value to Bozdech's "Max Power" heuristic, as well as phase shifts.  See Results, Sections C-6a - C-6d. 

Run script:     source("StatsAndSets.R")

Output files include: 
- StatsAndSets.pdf
- BozdechMinusLombScargle.txt (108 probes)
- LombScargleMinusBozedech.txt (501 probes)
- LombScargleBozdechManyMissing.txt (243 probes)

Analyze2.R

Bozdech03.R

Files: 
Complete_Dataset.csv
Dominant-p-adjust.csv

Create "Small N" and "Long Period" charts in PDF files shown in C-7.

Run script:     source("Analyze2.R")

Output files:  SetLongPeriod.pdf and SetSmallN.pdf

Analyze3.R

Bozdech03.R

Files:
Complete_Dataset.csv
Dominant-p-adjust.csv
Periodogram.csv

Create Periodogram Map (C-8) and Phaseogram (C-9) in Results.

Run script:     source("Analyze3.R")

Output files: 
PhaseogramHeatmap.png
and PhaseShiftHeatmap.png

loess-imputation.R

-

Demonstrate that loess could be used to impute pseudodata on large gap.  Imputing data points for such a wide gap would usually not be considered appropriate. 

Run script:  source("loess-imputation.R")

Output:  loess-imputation.pdf

MissingTests.R -

Process "Complete" Bozdech dataset for an ever-wider, hypothetical "gap" in the data.

Run script:  source("MissingTests.R")

This script calls the ProcessAll() function, which was described above, and takes considerable time -- about an 1-2 hours, depending on processor speed.  [Each ProcessAll() could be run in parallel under Linux for a good speedup.]

Several subdirectories are created:  Missing7, Missing11, Missing15, Missing19, Missing25.  These directories contain the same output as in the "Complete" directory when ProcessAll() is run.

Run this script to determine the number of significant periodic probes: 
   source("Analyze1.R")

The script will prompt for a filename:  <directory>/Dominant.csv.

Here is a output from Analyze1 for the MissingTests output..

Getting Started With Your Own Data

SetupLombScargle
ExampleSet.R

Analyze
MyTimeSeriesSet.R

-

See Step-by-Step Instructions: How to Apply Lomb-Scargle Analysis to Your Set of Expression Data.

source("SetupLombScargleExampleSet.R")
source("AnalyzeMyTimeSeriesSet.R")

For paper and web site
CreateFigures.R

cosine48.R
Bozdech03.R

File:  Complete_Dataset.csv

Create the computer-generated figures used in the paper.

Run script:  source("CreateFigures.R")

Output files:  Figure1.pdf  Figure2.pdf  Figure3.pdf  Figure4.pdf 
Figure6.pdf Figure7.pdf   (Figure 5 was created manually)

Logo.R -

Create "logo" of two out-of-phase sine curves with uneven time-spacing representing periodic genes where red is "up regulated" and green is "down regulated."  Run this script to reproduce the logo:  source("http://research.stowers-institute.org/efg/2005/LombScargle/R/Logo.R")

imputation-GIF.R -

Use R to create 8 graphs as .png files to be frames in an animated GIF of missing data points.  If on Linux, ImageMagick's convert is used to create the animated GIF from the PNGs. See "Missing Value Tolerance" (C-10).

Run script:  source("imputation-GIF.R")

Output:  missing*.png files and missing.gif

 

 


Updated
22 Sept 2005