Tuesday, November 12, 2013

MODEL FORMULAE for predictive analytics

This is a short tutorial on writing model formulae for ANOVA and regression analyses. It will be linked to from those tutorials, but you are welcome to read it just for kicks if you'd like.
R functions such as aov( ), lm( ), and glm( ) use a formula interface to specify the variables to be included in the analysis. The formula determines the model that will be built (and tested) by the R procedure. The basic format of such a formula is...
response variable ~ explanatory variables
The tilde should be read "is modeled by" or "is modeled as a function of." The trick is in how the explanatory variables are given.
A basic regression analysis would be formulated this way...
y ~ x
...where "x" is the explanatory variable or IV, and "y" is the response variable or DV. Additional explanatory variables would be added in as follows...
y ~ x + z
...which would make this a multiple regression with two predictors. This raises a critical issue that must be understood to get model formulae correct. Symbols used as mathematical operators in other contexts do not have their usual mathematical meaning inside model formulae. The following table lists the meaning of these symbols when used in a formula.
symbol   example        meaning
+        + x            include this variable
-        - x            delete this variable
:        x : z          include the interaction between these variables
*        x * z          include these variables and the interactions between them
/        x / z          nesting: include z nested within x
|        x | z          conditioning: include x given z
^        (u + v + w)^3  include these variables and all interactions up to three way
poly     poly(x,3)      polynomial regression: orthogonal polynomials
Error    Error(a/b)     specify the error term
I        I(x*z)         as is: include a new variable consisting of these variables multiplied
1        - 1            intercept: delete the intercept (regress through the origin)
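One way to see what these symbols do is to ask R which terms a formula expands to, without fitting anything. The sketch below uses terms( ) from base R for that; the helper name expand_terms is made up for this illustration, and "x", "z", and "y" are just placeholder variable names.

```r
# Hypothetical helper: list the terms a formula expands to.
# terms() only parses the formula, so no data are needed here.
expand_terms <- function(f) attr(terms(f), "term.labels")

expand_terms(y ~ x + z)        # main effects only
expand_terms(y ~ x:z)          # the interaction only
expand_terms(y ~ x * z)        # main effects plus their interaction
expand_terms(y ~ x / z)        # x, plus z nested within x
expand_terms(y ~ x * z - x:z)  # "-" takes the interaction back out
expand_terms(y ~ I(x * z))     # inside I(), "*" is ordinary multiplication

# The intercept is recorded separately: 1 means present, 0 means removed.
attr(terms(y ~ x), "intercept")
attr(terms(y ~ x - 1), "intercept")
```

Comparing the output for x * z and I(x * z) shows the point made above: the same symbol means "cross these variables" in one place and "multiply these numbers" in the other.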
You may have noticed already that some formula structures can be specified in more than one way...
y ~ u + v + w + u:v + u:w + v:w + u:v:w
y ~ u * v * w
y ~ (u + v + w)^3
All three of these specify a model in which the variables "u", "v", "w", and all the interactions between them are included. Any of these formats...
y ~ u + v + w + u:v + u:w + v:w
y ~ u * v * w - u:v:w
y ~ (u + v + w)^2
...would delete the three way interaction.
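If you would rather check these equivalences than take them on faith, a small sketch along the following lines should do it. It compares the expanded term labels of each formula; the helper name term_set and the list names are invented for the example.

```r
# Hypothetical helper: the sorted set of terms a formula expands to.
term_set <- function(f) sort(attr(terms(f), "term.labels"))

# Three ways of writing the full model
full <- list(
  y ~ u + v + w + u:v + u:w + v:w + u:v:w,
  y ~ u * v * w,
  y ~ (u + v + w)^3
)

# Three ways of writing the model without the three way interaction
reduced <- list(
  y ~ u + v + w + u:v + u:w + v:w,
  y ~ u * v * w - u:v:w,
  y ~ (u + v + w)^2
)

full_terms <- lapply(full, term_set)
reduced_terms <- lapply(reduced, term_set)

# TRUE when every formula in a group expands to the same terms
all(sapply(full_terms, identical, full_terms[[1]]))
all(sapply(reduced_terms, identical, reduced_terms[[1]]))

# The one term the reduced model leaves out
setdiff(full_terms[[1]], reduced_terms[[1]])
```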
The nature of the variables--binary, categorical (factors), numerical--will determine the nature of the analysis. For example, if "u" and "v" are factors...
y ~ u + v
...dictates an analysis of variance (without the interaction term). If "u" and "v" are numerical, the same formula would dictate a multiple regression. If "u" is numerical and "v" is a factor, then an analysis of covariance is dictated.
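Here is a rough sketch of that idea using lm( ) on a small made-up data frame. The data are random numbers generated only for illustration, and the column names (u_num, u_fac, and so on) are invented so that the numerical and factor versions of each variable can sit side by side. The default treatment contrasts are assumed.

```r
# Made-up data, for illustration only: 12 cases, 3 levels of u by 4 levels of v
set.seed(1)
d <- data.frame(
  y     = rnorm(12),
  u_num = rep(1:3, times = 4),  # u stored as a number
  v_num = rep(1:4, each = 3)    # v stored as a number
)
d$u_fac <- factor(d$u_num)      # the same values stored as factors
d$v_fac <- factor(d$v_num)

# The same formula shape, y ~ u + v, with different variable types
fit_anova  <- lm(y ~ u_fac + v_fac, data = d)  # both factors
fit_reg    <- lm(y ~ u_num + v_num, data = d)  # both numerical
fit_ancova <- lm(y ~ u_num + v_fac, data = d)  # one of each

# Factors get one coefficient per non-reference level;
# numerical variables get a single slope.
names(coef(fit_anova))
names(coef(fit_reg))
names(coef(fit_ancova))
```

The coefficient names make the difference visible: each factor is expanded into dummy variables, while each numerical variable contributes one slope.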
That ought to do it for now. Specific examples will appear in the tutorials devoted to specific analyses.
