Monday, January 23, 2012

Title, legends, text in R

Axes and Text

Many high-level plotting functions (plot, hist, boxplot, etc.) allow you to include axis and text options (as well as other graphical parameters). For example:
# Specify axis options within plot()
plot(x, y, main="title", sub="subtitle",
  xlab="X-axis label", ylab="y-axis label",
  xlim=c(xmin, xmax), ylim=c(ymin, ymax))

For finer control or for modularization, you can use the functions described below.

Titles

Use the title( ) function to add labels to a plot.
title(main="main title", sub="sub-title",
   xlab="x-axis label", ylab="y-axis label")

Many other graphical parameters (such as text size, font, rotation, and color) can also be specified in the title( ) function.
# Add a red title and a blue subtitle. Make x and y
# labels 25% smaller than the default and green.
title(main="My Title", col.main="red",
  sub="My Sub-title", col.sub="blue",
  xlab="My X label", ylab="My Y label",
  col.lab="green", cex.lab=0.75)
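
As a runnable sketch of the same idea, the snippet below first draws a plot with its default annotation switched off and then adds the titles with title( ). The data and label strings are made up for illustration.

```r
# Made-up data for illustration
x <- 1:10
y <- x^2

# ann = FALSE suppresses the default main title and axis labels,
# so that title() does not draw on top of them
plot(x, y, ann = FALSE)

# Add the annotation separately, with its own colors and sizes
title(main = "Squares", col.main = "red",
      sub = "illustrative data", col.sub = "blue",
      xlab = "x", ylab = "x squared",
      col.lab = "darkgreen", cex.lab = 0.75)
```

Without ann=FALSE (or empty xlab and ylab strings), the labels added by title( ) would overprint the ones plot( ) already drew.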

Text Annotations

Text can be added to graphs using the text( ) and mtext( ) functions. text( ) places text within the graph while mtext( ) places text in one of the four margins.
text(location, "text to place", pos, ...)
mtext("text to place", side, line=n, ...)

Common options are described below.
option description
location location can be an x,y coordinate. Alternatively, the text can be placed interactively via mouse by specifying location as locator(1).
pos position relative to location. 1=below, 2=left, 3=above, 4=right. If you specify pos, you can specify offset= as a fraction of a character width.
side which margin to place text in. 1=bottom, 2=left, 3=top, 4=right. You can specify line= to indicate the line in the margin, starting with 0 and moving out. You can also specify adj=0 for left/bottom alignment or adj=1 for top/right alignment.
Other common options are cex, col, and font (for size, color, and font style respectively).
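
The options above can be tried out with a small sketch like the following. The data values and label strings are invented for the example.

```r
# Made-up data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 3, 5, 1)
plot(x, y, pch = 19, xlim = c(0, 6), ylim = c(0, 6))

# Label the highest point just above it (pos = 3);
# offset is measured in character widths
top <- which.max(y)
text(x[top], y[top], "maximum", pos = 3, offset = 0.5,
     col = "red", cex = 0.8)

# Left-aligned note on the innermost line of the top margin
mtext("left-aligned margin note", side = 3, line = 0, adj = 0, cex = 0.8)

# Note in the right margin, one line out, aligned to the top end
mtext("right margin note", side = 4, line = 1, adj = 1, cex = 0.8)
```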

Labeling points

You can use the text( ) function (see above) for labeling points as well as for adding other text annotations. Specify location as a set of x, y coordinates and specify the text to place as a vector of labels. The x, y, and label vectors should all be the same length.
# Example of labeling points
attach(mtcars)
plot(wt, mpg, main="Mileage vs. Car Weight",
   xlab="Weight", ylab="Mileage", pch=18, col="blue")
text(wt, mpg, row.names(mtcars), cex=0.6, pos=4, col="red")

Math Annotations

You can add mathematical formulas to a graph using TeX-like rules. See help(plotmath) for details and examples.
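
A minimal sketch of a math annotation, using expression( ) with plotmath syntax. The curve and the placement coordinates are arbitrary choices for the example.

```r
# Standard normal density curve as a backdrop
x <- seq(-3, 3, length.out = 200)

# Plotmath expressions: == draws an equals sign, frac() a fraction,
# ^ a superscript, and Greek letter names become symbols
title_label   <- expression(paste("Standard normal: ", mu == 0, ", ", sigma == 1))
formula_label <- expression(f(x) == frac(1, sqrt(2 * pi)) * e^{-x^2 / 2})

plot(x, dnorm(x), type = "l", ylab = "density", main = title_label)
text(0, 0.1, formula_label)
```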

Axes

You can create custom axes using the axis( ) function.
axis(side, at=, labels=, pos=, lty=, col=, las=, tck=, ...)
where
option description
side an integer indicating the side of the graph to draw the axis (1=bottom, 2=left, 3=top, 4=right)
at a numeric vector indicating where tick marks should be drawn
labels a character vector of labels to be placed at the tickmarks
(if NULL, the at values will be used)
pos the coordinate at which the axis line is to be drawn.
(i.e., the value on the other axis where it crosses)
lty line type
col the line and tick mark color
las labels are parallel (=0) or perpendicular (=2) to axis
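tck length of tick mark as fraction of plotting region (negative number is outside graph, positive number is inside, 0 suppresses ticks, 1 creates gridlines). In base R, par("tck") defaults to NA, in which case the tick length is set by tcl (default -0.5, in lines of text) instead.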
(...) other graphical parameters
If you are going to create a custom axis, you should suppress the axis automatically generated by your high level plotting function. The option axes=FALSE suppresses both x and y axes. xaxt="n" and yaxt="n" suppress the x and y axis respectively. Here is a (somewhat overblown) example.
# A Silly Axis Example

# specify the data
x <- c(1:10); y <- x; z <- 10/x

# create extra margin room on the right for an axis
par(mar=c(5, 4, 4, 8) + 0.1)

# plot x vs. y
plot(x, y,type="b", pch=21, col="red",
   yaxt="n", lty=3, xlab="", ylab="")

# add x vs. 10/x
lines(x, z, type="b", pch=22, col="blue", lty=2)

# draw an axis on the left
axis(2, at=x,labels=x, col.axis="red", las=2)

# draw an axis on the right, with smaller text and ticks
axis(4, at=z,labels=round(z,digits=2),
  col.axis="blue", las=2, cex.axis=0.7, tck=-.01)

# add a title for the right axis
mtext("y=10/x", side=4, line=3, cex.lab=1,las=2, col="blue")

# add a main title and bottom and left axis labels
title("An Example of Creative Axes", xlab="X values",
   ylab="Y=X")

Minor Tick Marks

The minor.tick( ) function in the Hmisc package adds minor tick marks.
# Add minor tick marks
library(Hmisc)
minor.tick(nx=n, ny=n, tick.ratio=n)

nx is the number of minor tick marks to place between x-axis major tick marks.
ny does the same for the y-axis. tick.ratio is the size of the minor tick mark relative to the major tick mark. The length of the major tick mark is retrieved from par("tck").
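
If Hmisc is not available, a similar effect can be sketched in base R by drawing a second, unlabeled axis with shorter ticks. This is an alternative technique, not what minor.tick( ) does internally, and the tick positions below are chosen by hand.

```r
x <- 0:10
plot(x, x^2, xaxt = "n")          # suppress the default x-axis

# Major ticks every 2 units, minor ticks every 0.5 units in between
major <- seq(0, 10, by = 2)
minor <- setdiff(seq(0, 10, by = 0.5), major)

axis(1, at = major)                                # labeled major ticks
axis(1, at = minor, labels = FALSE, tcl = -0.25)   # short, unlabeled minor ticks
```

Here tcl=-0.25 makes the minor ticks half as long as the default major ticks (tcl=-0.5).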

Reference Lines

Add reference lines to a graph using the abline( ) function.
abline(h=yvalues, v=xvalues)
Other graphical parameters (such as line type, color, and width) can also be specified in the abline( ) function.
# add solid horizontal lines at y=1,5,7
abline(h=c(1,5,7))
# add dashed blue vertical lines at x = 1,3,5,7,9
abline(v=seq(1,10,2),lty=2,col="blue")

Note: You can also use the grid( ) function to add reference lines.

Legend

Add a legend with the legend() function.
legend(location, title, legend, ...)
Common options are described below.
option description
location There are several ways to indicate the location of the legend. You can give an x,y coordinate for the upper left hand corner of the legend. You can use locator(1), in which case you use the mouse to indicate the location of the legend. You can also use the keywords "bottom", "bottomleft", "left", "topleft", "top", "topright", "right", "bottomright", or "center". If you use a keyword, you may want to use inset= to specify an amount to move the legend into the graph (as fraction of plot region).
title A character string for the legend title (optional)
legend A character vector with the labels
... Other options. If the legend labels colored lines, specify col= and a vector of colors. If the legend labels point symbols, specify pch= and a vector of point symbols. If the legend labels line width or line style, use lwd= or lty= and a vector of widths or styles. To create colored boxes for the legend (common in bar, box, or pie charts), use fill= and a vector of colors.
Other common legend options include bty for box type, bg for background color, cex for size, and text.col for text color. Setting horiz=TRUE sets the legend horizontally rather than vertically.
# Legend Example
attach(mtcars)
boxplot(mpg~cyl, main="Mileage by Car Weight",
   yaxt="n", xlab="Mileage", horizontal=TRUE,
   col=terrain.colors(3))
legend("topright", inset=.05, title="Number of Cylinders",
   c("4","6","8"), fill=terrain.colors(3), horiz=TRUE)

For more on legends, see help(legend). The examples in the help are particularly informative.
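
The example above uses fill= for colored boxes. For a legend that labels lines and point symbols instead, a sketch along these lines may help. The two series are invented for the example.

```r
x <- 1:10
plot(x, x, type = "b", pch = 21, col = "red", lty = 3, ylab = "value")
lines(x, 10 / x, type = "b", pch = 22, col = "blue", lty = 2)

# One entry per series: col, pch and lty are matched to the labels by position.
# bty = "n" drops the box around the legend.
lg <- legend("top", inset = 0.05, title = "Series",
             legend = c("y = x", "y = 10/x"),
             col = c("red", "blue"), pch = c(21, 22), lty = c(3, 2),
             bty = "n", cex = 0.8)
```

legend( ) invisibly returns the coordinates of the legend box and text, which is occasionally useful for placing further annotations next to it.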

Laplace Distribution

Description

These functions provide information about the Laplace distribution with location parameter equal to m and dispersion equal to s: density, cumulative distribution, quantiles, log hazard, and random generation.
The Laplace distribution has density
f(y) = exp(-abs(y-m)/s)/(2*s)
where m is the location parameter of the distribution and s is the dispersion.
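
To make the formula concrete, here is a base-R sketch of the density, with the cumulative distribution and quantile functions derived from it by integration and inversion. The function names laplace_pdf, laplace_cdf and laplace_quantile are hypothetical, chosen so they do not clash with the packaged functions listed under Usage; they are not the package's implementation.

```r
# Density: f(y) = exp(-abs(y - m) / s) / (2 * s)
laplace_pdf <- function(y, m = 0, s = 1) exp(-abs(y - m) / s) / (2 * s)

# CDF, obtained by integrating the density on each side of m
laplace_cdf <- function(q, m = 0, s = 1) {
  z <- (q - m) / s
  ifelse(z < 0, 0.5 * exp(z), 1 - 0.5 * exp(-z))
}

# Quantile function, obtained by inverting the CDF
laplace_quantile <- function(p, m = 0, s = 1) {
  m - s * sign(p - 0.5) * log(1 - 2 * abs(p - 0.5))
}

laplace_pdf(5, m = 2, s = 1)      # equals exp(-3) / 2

# Sanity check: the density integrates to 1 (split at the kink at y = m)
area <- integrate(laplace_pdf, -Inf, 2, m = 2, s = 1)$value +
        integrate(laplace_pdf, 2, Inf, m = 2, s = 1)$value
```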

Usage

dlaplace(y, m=0, s=1, log=FALSE)
plaplace(q, m=0, s=1)
qlaplace(p, m=0, s=1)
hlaplace(y, m=0, s=1)
rlaplace(n, m=0, s=1)

Arguments

y vector of responses.
q vector of quantiles.
p vector of probabilities
n number of values to generate
m vector of location parameters.
s vector of dispersion parameters.
log if TRUE, log probabilities are supplied.

Author(s)

J.K. Lindsey

See Also

dexp for the exponential distribution and dcauchy for the Cauchy distribution.

Examples

dlaplace(5, 2, 1)
plaplace(5, 2, 1)
qlaplace(0.95, 2, 1)
rlaplace(10, 2, 1)

PROBABILITY DISTRIBUTIONS, QUANTILES, CHECKS FOR NORMALITY

Probability Distributions
R has density and distribution functions built-in for about 20 probability distributions, including those in the following table:
distribution     function   type
binomial         binom      discrete
chi-squared      chisq      continuous
F                f          continuous
hypergeometric   hyper      discrete
normal           norm       continuous
Poisson          pois       discrete
Student's t      t          continuous
uniform          unif       continuous
By prefixing a "d" to the function name in the table above, you can get probability density values (pdf). By prefixing a "p", you can get cumulative probabilities (cdf). By prefixing a "q", you can get quantile values. By prefixing an "r", you can get random numbers from the distribution. I will demonstrate using the normal distribution.
PDF The dnorm( ) function returns the height of the normal curve at some value along the x-axis. This is illustrated in the figure at left. Here the value of dnorm(1) is shown by the vertical line at x=1...

> dnorm(1)
[1] 0.2419707
With no options specified, the value of "x" is treated as a standard score or z-score. To change this, you can specify "mean=" and "sd=" options. In other words, dnorm( ) returns the probability density function or pdf.
CDF The pnorm( ) function is the cumulative distribution function or cdf. It returns the area below the given value of "x", or for x=1, the shaded region in the figure at right...

> pnorm(1)
[1] 0.8413447
Once again, the defaults for mean and sd are 0 and 1 respectively. These can be set to other values as in the case of dnorm( ). To find the area above the cutoff x-value, either subtract from 1, or set the "lower.tail=" option to FALSE...
> 1 - pnorm(1)
[1] 0.1586553
> pnorm(1, lower.tail=FALSE)
[1] 0.1586553
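
The same function gives the area between two cutoffs by subtraction, and the "mean=" and "sd=" options rescale the problem without changing the answer. In the sketch below the mean of 100 and sd of 15 are arbitrary values picked for illustration.

```r
# P(-1 < Z < 1) for a standard normal variable
within_one_sd <- pnorm(1) - pnorm(-1)

# The same probability on a scale with mean 100 and sd 15:
# 85 and 115 are one sd below and above the mean
within_one_sd_scaled <- pnorm(115, mean = 100, sd = 15) -
                        pnorm(85,  mean = 100, sd = 15)

# Upper-tail area computed two equivalent ways
upper_a <- 1 - pnorm(1)
upper_b <- pnorm(1, lower.tail = FALSE)

c(within_one_sd, within_one_sd_scaled, upper_a, upper_b)
```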

So, good news! No more tables! To get quantiles or "critical values", you can use the qnorm( ) function as in the following examples...

> qnorm(.95)                           # p = .05, one-tailed (upper)
[1] 1.644854
> qnorm(c(.025,.975))                  # p = .05, two-tailed
[1] -1.959964  1.959964
> qnorm(seq(.1,.9,.1))                 # deciles from the unit normal dist.
[1] -1.2815516 -0.8416212 -0.5244005 -0.2533471  0.0000000  0.2533471  0.5244005
[8]  0.8416212  1.2815516
Once again, there are "mean=" and "sd=" options. To use these functions with other distributions, more parameters may need to be given. Here are some examples...

> pt(2.101, df=8)                      # area below t = 2.101, df = 8
[1] 0.9655848
> qchisq(.95, df=1)                    # critical value of chi square, df = 1
[1] 3.841459
> qf(c(.025,.975), df1=3, df2=12)
[1] 0.06975178 4.47418481
> dbinom(60, size=100, prob=.5)        # a discrete binomial probability
[1] 0.01084387
The help pages for these functions will give the necessary details. Random numbers are generated from a given distribution like this...

> runif(9)                             # 9 uniformly distributed random nos.
[1] 0.01961714 0.62086249 0.64193142 0.99583719 0.06294405 0.94324289 0.88233387
[8] 0.11851026 0.60300929
> rnorm(9)                             # 9 normally distributed random nos.
[1] -0.95186711  0.09650050 -0.37148202  0.56453509 -0.44124876 -0.43263580
[7] -0.46909466  1.38590806 -0.06632486
> rt(9, df=10)                         # 9 t-distributed random nos.
[1] -1.538466123 -0.249067184 -0.324245905 -0.009964799  0.143282490
[6]  0.619253016  0.247399305  0.691629869 -0.177196453
Once again, I refer you to the help pages for all the gory details.
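
Because the numbers are random, the output shown above will not match what you see on your own machine. If you need the same draws every time, you can set the seed of the random number generator first. A minimal sketch, with an arbitrary seed value and arbitrary mean and sd:

```r
set.seed(42)                           # 42 is an arbitrary choice
first <- rnorm(5, mean = 10, sd = 2)   # 5 draws from a normal with mean 10, sd 2

set.seed(42)                           # reset to the same seed
second <- rnorm(5, mean = 10, sd = 2)

identical(first, second)               # the two sets of draws are the same
```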
Empirical Quantiles
Suppose you want quartiles or deciles or percentiles or whatever from a sample or empirical distribution. The appropriate function is quantile( ). From the help page for this function, the syntax is...

quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
         names = TRUE, type = 7, ...)
This says "enter a vector, x, of data values, or the name of such a vector, and I will return quantiles for positions 0, .25, .5, .75, and 1 (in other words, quartiles along with the min and max values), without removing missing values (and if missing values exist the function will fail and return an error message), I'll give each of the returned values a name, and I will use method 7 (of 9) to do the calculations." Let's see this happen using the built-in data set "rivers"...
> quantile(rivers)
  0%  25%  50%  75% 100% 
 135  310  425  680 3710
Compare this to what you get with a summary...
> summary(rivers)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  135.0   310.0   425.0   591.2   680.0  3710.0
So what's the point? The quantile( ) function is much more versatile because you can change the default "probs=" values...
> quantile(rivers, probs=seq(.2,.8,.2))     # quintiles
20% 40% 60% 80% 
291 375 505 735 
> quantile(rivers, probs=seq(.1,.9,.1))     # deciles
 10%  20%  30%  40%  50%  60%  70%  80%  90% 
 255  291  330  375  425  505  610  735 1054 
> quantile(rivers, probs=.55)               # 55th percentile
55% 
460
> quantile(rivers, probs=c(.05,.95))        # and so on
  5%  95% 
 230 1450
And then there is the "type=" option. It turns out there is some disagreement among different sources as to just how quantiles should be calculated from an empirical distribution. R doesn't take sides. It gives you nine different methods! Pick the one you like best by setting the "type=" option to a number between 1 and 9. Here are some details (and more are available on the help page): type=2 will give the results most people are taught to calculate in an intro stats course, type=3 is the SAS definition, type=6 is the Minitab and SPSS definition, type=7 is the default and the S definition and seems to work well when the variable is continuous.
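The differences between the methods are easiest to see on a tiny sample. The following sketch computes the first quartile of five made-up values under each of the nine types.

```r
x <- c(1, 2, 3, 4, 10)   # small made-up sample

# First quartile under each of the nine definitions
q_by_type <- sapply(1:9, function(k) {
  quantile(x, probs = 0.25, type = k, names = FALSE)
})
names(q_by_type) <- paste0("type", 1:9)
q_by_type
```

With larger samples the nine answers usually sit much closer together, so the choice matters most for small data sets.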
Checks For Normality
Parametric procedures like the t-test, F-test (ANOVA), and Pearson r assume the data are distributed normally. There are several ways to check this assumption.
qqnorm The qqnorm( ) function allows a graphical evaluation...

> qqnorm(rivers)
If the values in the vector are normally distributed, the points on the plot will fall (more or less) along a straight line. This line can be plotted on the graph like this...
> qqline(rivers)
As you can see, the "rivers" vector is strongly skewed, as indicated by the bowing of the points up away from the expected straight line. The very long upper tail (strong positive skew) in this distribution could also have been visualized using...
> plot(density(rivers))
...the output of which is not shown here. You can see this in the QQ plot as well by the fact that the higher sample values are much too large to be from a theoretical normal distribution. The lower tail of the distribution appears to be a bit short.
Statistical tests for normality are also available. Perhaps the best known of these is the Shapiro-Wilk test...

> shapiro.test(rivers)

        Shapiro-Wilk normality test

data:  rivers 
W = 0.6666, p-value < 2.2e-16
I believe we can safely reject the null hypothesis of normality here! Here's a question for all you stat students out there: how often should the following result in a rejection of the null hypothesis if our random number generator is worth its salt?
> shapiro.test(rnorm(100))

        Shapiro-Wilk normality test

data:  rnorm(100) 
W = 0.9894, p-value = 0.6192
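
One way to explore that question empirically is to repeat the test on many fresh normal samples and count how often the p-value falls below the chosen cutoff. This is only a simulation sketch: the seed, the number of repetitions (2000) and the cutoff (0.05) are arbitrary choices.

```r
set.seed(1)   # arbitrary seed, so the run is repeatable

# p-values from 2000 Shapiro-Wilk tests, each on a new sample of 100 normal values
p_values <- replicate(2000, shapiro.test(rnorm(100))$p.value)

# Proportion of samples for which normality would be rejected at the 0.05 level
rejection_rate <- mean(p_values < 0.05)
rejection_rate
```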
This question will be on the exam! Another test that can be used here is the Kolmogorov-Smirnov test...

> ks.test(rivers, "pnorm", alternative="two.sided")

        One-sample Kolmogorov-Smirnov test

data:  rivers 
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided 

Warning message:
In ks.test(rivers, "pnorm", alternative = "two.sided") :
  cannot compute correct p-values with ties
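But it gets upset when there are ties in the data. Inside the function, the value "pnorm" tells the test to compare the empirical cumulative distribution function of "rivers" to the cumulative distribution function of a normal distribution. Because no "mean=" or "sd=" values were passed along, that normal distribution is the standard normal (mean 0, sd 1), which is why D comes out as exactly 1: every value in "rivers" lies far above anything the standard normal would plausibly produce. The null hypothesis says the two will match. Clearly they do not, so the null hypothesis is rejected. We conclude once again, and for the last time in this tutorial, that "rivers" is not normally distributed.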
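
Extra arguments to ks.test( ) are passed on to the named distribution function, so a normal distribution with a particular mean and sd can be specified. The sketch below plugs in the sample mean and sd of "rivers". Note the assumption being made: estimating the parameters from the same data makes the reported p-value only approximate, so treat the result as a rough guide rather than an exact test.

```r
# Comparison against the standard normal (mean 0, sd 1), as in the call above.
# suppressWarnings() hides the warning about ties.
ks_std <- suppressWarnings(ks.test(rivers, "pnorm"))

# Comparison against a normal with the sample's own mean and sd
ks_fit <- suppressWarnings(
  ks.test(rivers, "pnorm", mean = mean(rivers), sd = sd(rivers))
)

# The D statistics from the two comparisons
c(standard = unname(ks_std$statistic), fitted = unname(ks_fit$statistic))
```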

Quantile-Quantile Plots

Description

qqnorm is a generic function the default method of which produces a normal QQ plot of the values in y. qqline adds a line to a normal quantile-quantile plot which passes through the first and third quartiles.
qqplot produces a QQ plot of two datasets.
Graphical parameters may be given as arguments to qqnorm, qqplot and qqline.
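
The description of qqline can be made concrete by computing such a line by hand. This sketch assumes the default quartile definition used by quantile( ), and the seed is arbitrary.

```r
set.seed(7)                                   # arbitrary seed
y <- rt(200, df = 5)

y_q <- quantile(y, c(0.25, 0.75), names = FALSE)   # sample quartiles
x_q <- qnorm(c(0.25, 0.75))                        # matching normal quartiles

# Line through the two points (x_q[1], y_q[1]) and (x_q[2], y_q[2])
slope     <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

qqnorm(y)
abline(intercept, slope, col = 2)   # expected to coincide with qqline(y)
```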

Usage

qqnorm(y, ...)
## Default S3 method:
qqnorm(y, ylim, main = "Normal Q-Q Plot",
       xlab = "Theoretical Quantiles",
       ylab = "Sample Quantiles", plot.it = TRUE, datax = FALSE,
       ...)
qqline(y, datax = FALSE, ...)
qqplot(x, y, plot.it = TRUE, xlab = deparse(substitute(x)),
       ylab = deparse(substitute(y)), ...)

Arguments

x The first sample for qqplot.
y The second or only data sample.
xlab, ylab, main plot labels.
plot.it logical. Should the result be plotted?
datax logical. Should data values be on the x-axis?
ylim, ... graphical parameters.

Value

For qqnorm and qqplot, a list with components
x The x coordinates of the points that were/would be plotted
y The original y vector, i.e., the corresponding y coordinates including NAs.
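
A short sketch of the returned list, using plot.it = FALSE so that nothing is drawn. The sample values are made up, and the comparison with ppoints (see See Also) shows where the x coordinates come from when there are no missing values.

```r
y <- c(2.3, -0.4, 1.1, 0.7, -1.8, 0.2)    # small made-up sample

qq <- qqnorm(y, plot.it = FALSE)           # compute the coordinates only
str(qq)

# Theoretical quantiles, arranged to match the original order of y
theoretical <- qnorm(ppoints(length(y)))[order(order(y))]
all.equal(qq$x, theoretical)
```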

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

See Also

ppoints.

Examples

y <- rt(200, df = 5)
qqnorm(y); qqline(y, col = 2)
qqplot(y, rt(300, df = 5))

qqnorm(precip, ylab = "Precipitation [in/yr] for 70 US cities")