Tuesday, March 27, 2012

SAS: Multivariate regression in SAS

In many ways, multivariate regression is similar to MANOVA.  The hypotheses, the methods used to obtain the estimates, and the assumptions are all similar, and the multivariate test statistics are the same.  The hypothesis tested by a multivariate regression is that there is a joint linear effect of the set of predictors on the set of responses.  Hence, the null hypothesis is that the slopes of all of the coefficients are simultaneously zero.  Note that the "set" of predictors may contain only one predictor (or even none), but usually it contains more.
The basic assumptions of multivariate regression are 1) multivariate normality of the residuals, 2) homogeneous variances of the residuals conditional on the predictors, 3) a common covariance structure across observations, and 4) independent observations.  Unfortunately, testing the first three assumptions is very difficult.  Currently, many of the common statistical packages, such as SAS and SPSS, do not offer a test of multivariate normality.  However, you can see whether your data are close to multivariate normal by creating some graphs.  First, check whether the residuals for each dependent variable are normal by themselves.  This is necessary, but not sufficient, for multivariate normality.  Next, create scatterplots of the residuals.  You want the points on the graph to form an ellipse (as opposed to a V-shape, a wedge shape, or some other kind of pattern).  Remember that a circle is a special case of an ellipse.  You would like the points to line up in a "flattened" ellipse, because the dependent variables are supposed to be correlated for MANOVA or multivariate regression to be the analysis of choice, but this is not necessary for multivariate normality.

Regarding the second assumption, homogeneity of variances, several tests are available, but most of them are very sensitive to nonnormality.  Fortunately, the F statistic is fairly robust against violations of this assumption.  As for the third assumption, the covariance matrices are rarely equal in practice.  Monte Carlo studies have shown that keeping the number of observations (subjects) per group approximately equal is an effective way of ensuring that violations of this assumption will not be too problematic.

Regarding the independence of observations, there is no statistical test; rather, it is an issue of methodology.  Care should be taken to ensure that the observations are independent, because even small intraclass correlations can cause serious problems.
For example, suppose an experimenter had three groups with 30 subjects per group and a small dependence between the observations, say an intraclass correlation of .10.  The actual alpha value would be .4917, rather than the standard .05.
If all of these assumptions are met, then the coefficients will be unbiased, the least-squares estimates will have minimum variance, and the relationships among the coefficients will reflect the relationships among the predictors.  Furthermore, a multivariate hypothesis test will account for the relationship among the coefficients, whereas a univariate F test would not.
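The residual checks described above can also be sketched outside SAS.  Below is a minimal illustration in Python (with numpy/scipy, and simulated data standing in for hsb2 -- the coefficients and covariances are made up for the example): it fits both responses by least squares at once, tests each column of residuals for univariate normality, and summarizes how "flattened" the residual ellipse is.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200

# Hypothetical stand-ins for the predictors (write, math, science)
X = rng.normal(50, 10, size=(n, 3))
X_design = np.column_stack([np.ones(n), X])  # add an intercept column

# Two correlated responses standing in for read and socst
noise = rng.multivariate_normal([0, 0], [[49, 20], [20, 68]], size=n)
Y = 5 + X @ np.array([[0.2, 0.5], [0.4, 0.3], [0.3, 0.1]]) + noise

# Least-squares fit of both responses simultaneously; residual matrix
B, *_ = np.linalg.lstsq(X_design, Y, rcond=None)
resid = Y - X_design @ B

# Check each response's residuals for univariate normality
# (necessary, but not sufficient, for multivariate normality)
for j, name in enumerate(["read", "socst"]):
    w, p = stats.shapiro(resid[:, j])
    print(f"{name}: Shapiro-Wilk p = {p:.3f}")

# A scatterplot of the two residual columns should look elliptical;
# their correlation summarizes how "flattened" the ellipse is
r = np.corrcoef(resid[:, 0], resid[:, 1])[0, 1]
print(f"residual correlation: {r:.3f}")
```

This is only a diagnostic sketch; within SAS the same residuals can be obtained with an output statement and plotted there.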

With all of this in mind, let's try a multivariate multiple regression.  We will use the hsb2 data set for our example, and we will use read and socst as our dependent variables and write, math and science as our independent variables.  The proc reg statement is the same as it would be in a univariate regression, but the model statement is a little different: we now have two (we could have more) dependent variables listed before the equals sign.  Also, we have included the mtest statement, which is used to test hypotheses in multivariate regression.  If no equations are listed on the mtest statement, SAS tests the hypothesis that all coefficients except the intercept are zero.  You can specify some options on the mtest statement, including canprint, which will print the canonical correlations for the hypothesis combinations and the dependent variable combinations.  The details option will display the M matrix, and the print option will display the H and E matrices.
proc reg data = "g:\SAS\hsb2";
model read socst = write math science;
mtest / details print;
run;
quit;
The REG Procedure
Model: MODEL1
Dependent Variable: read

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     3          11313     3771.09916      76.94    <.0001
Error                   196     9606.12253       49.01083
Corrected Total         199          20919

Root MSE              7.00077    R-Square     0.5408
Dependent Mean       52.23000    Adj R-Sq     0.5338
Coeff Var            13.40374

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1        4.36993        3.20878       1.36      0.1748
write         1        0.23767        0.06969       3.41      0.0008
math          1        0.37840        0.07463       5.07      <.0001
science       1        0.29693        0.06763       4.39      <.0001
The REG Procedure
Model: MODEL1
Dependent Variable: socst

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     3     9551.66620     3183.88873      46.62    <.0001
Error                   196          13385       68.28841
Corrected Total         199          22936

Root MSE              8.26368    R-Square     0.4164
Dependent Mean       52.40500    Adj R-Sq     0.4075
Coeff Var            15.76888

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1        8.86989        3.78763       2.34      0.0202
write         1        0.46567        0.08227       5.66      <.0001
math          1        0.27630        0.08810       3.14      0.0020
science       1        0.08512        0.07984       1.07      0.2877
The REG Procedure
Model: MODEL1
Multivariate Test 1

                               L Ginv(X'X) L'   LB-cj

0.0000991078      -0.000042904      -0.000028518      0.2376705687      0.4656741023
-0.000042904      0.0001136529      -0.000044399      0.3784014963      0.2763008055
-0.000028518      -0.000044399      0.0000933347      0.2969346843      0.0851168364

                         Inv(L Ginv(X'X) L')    Inv()(LB-cj)

   17878.875         10911.025          10653.25          11541.35         12247.225
   10911.025         17465.795          11642.35          12659.33         10897.755
    10653.25          11642.35           19507.5           12729.9           9838.15

       Error Matrix (E)

9606.1225306      3657.5503071
3657.5503071      13384.528803

     Hypothesis Matrix (H)

11313.297469      9955.8196929
9955.8196929      9551.6661967

 Hypothesis + Error Matrix (T)

    20919.42          13613.37
    13613.37         22936.195


         Eigenvectors

    0.004986          0.002488
   -0.007281          0.008053

 Eigenvalues

    0.587507
    0.051687                                                                

The REG Procedure
Model: MODEL1
Multivariate Test 1

                 Multivariate Statistics and F Approximations

                             S=2    M=0    N=96.5

Statistic                        Value    F Value    Num DF    Den DF    Pr > F

Wilks' Lambda               0.39117291      38.93         6       390    <.0001
Pillai's Trace              0.63919333      30.69         6       392    <.0001
Hotelling-Lawley Trace      1.47878554      47.94         6    258.23    <.0001
Roy's Greatest Root         1.42428180      93.05         3       196    <.0001

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.
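All four multivariate statistics are functions of the eigenvalues of inv(E + H)H, where E and H are the error and hypothesis matrices printed above.  As a quick check outside SAS, we can reproduce the table from those two matrices in Python:

```python
import numpy as np

# Error (E) and hypothesis (H) matrices copied from the SAS output above
E = np.array([[9606.1225306,  3657.5503071],
              [3657.5503071, 13384.528803]])
H = np.array([[11313.297469,  9955.8196929],
              [9955.8196929,  9551.6661967]])

# theta_i: eigenvalues of inv(E + H) @ H -- the 0.587507 and 0.051687
# shown in the "Eigenvalues" section of the output
theta = np.sort(np.linalg.eigvals(np.linalg.solve(E + H, H)).real)[::-1]

wilks     = np.prod(1 - theta)            # product of (1 - theta_i)
pillai    = theta.sum()                   # sum of theta_i
hotelling = (theta / (1 - theta)).sum()   # sum of theta_i / (1 - theta_i)
roy       = theta[0] / (1 - theta[0])     # largest theta_i / (1 - theta_i)

print(wilks, pillai, hotelling, roy)
# matches Wilks' Lambda 0.3912, Pillai 0.6392, H-L 1.4788, Roy 1.4243
```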
Looking at the very bottom of the output, we can see that the overall model is statistically significant.  The first half of the output shows the univariate results.  For the dependent variable read, the overall model is statistically significant, as is each of the predictors.  For the dependent variable socst, the overall model is statistically significant, as are the predictors write and math, but not science.  In other words, the multivariate tests tell us that the set of predictors accounts for a statistically significant portion of the variance in the dependent variables, and the univariate tests break this down for us so that we can see where the significant relationships are.

Let's run the same model again, but this time, we will specify some hypotheses to be tested on the mtest statement.  In the first mtest statement, we will test the hypothesis that the parameter for write is the same for read and socst. In the second mtest statement, we will test the hypothesis that the parameter for science is the same for read and socst.  You will notice that, as with test statements in other procs, we can use a label before the statement so that it is labeled in the output.
proc reg data = "g:\SAS\hsb2";
model read socst = write math science;
write: mtest read - socst, write / details print;
science: mtest read - socst, science / details print;
run;
quit;
The REG Procedure
Model: MODEL1
Dependent Variable: read

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     3          11313     3771.09916      76.94    <.0001
Error                   196     9606.12253       49.01083
Corrected Total         199          20919

Root MSE              7.00077    R-Square     0.5408
Dependent Mean       52.23000    Adj R-Sq     0.5338
Coeff Var            13.40374

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1        4.36993        3.20878       1.36      0.1748
write         1        0.23767        0.06969       3.41      0.0008
math          1        0.37840        0.07463       5.07      <.0001
science       1        0.29693        0.06763       4.39      <.0001

The REG Procedure
Model: MODEL1
Dependent Variable: socst

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     3     9551.66620     3183.88873      46.62    <.0001
Error                   196          13385       68.28841
Corrected Total         199          22936

Root MSE              8.26368    R-Square     0.4164
Dependent Mean       52.40500    Adj R-Sq     0.4075
Coeff Var            15.76888

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1        8.86989        3.78763       2.34      0.0202
write         1        0.46567        0.08227       5.66      <.0001
math          1        0.27630        0.08810       3.14      0.0020
science       1        0.08512        0.07984       1.07      0.2877

The REG Procedure
Model: MODEL1
Multivariate Test: write

                Multivariate Statistics and Exact F Statistics

                             S=1    M=-0.5    N=97

Statistic                        Value    F Value    Num DF    Den DF    Pr > F

Wilks' Lambda               0.96762141       6.56         1       196    0.0112
Pillai's Trace              0.03237859       6.56         1       196    0.0112
Hotelling-Lawley Trace      0.03346205       6.56         1       196    0.0112
Roy's Greatest Root         0.03346205       6.56         1       196    0.0112

The REG Procedure
Model: MODEL1
Multivariate Test: science

                Multivariate Statistics and Exact F Statistics

                             S=1    M=-0.5    N=97

Statistic                        Value    F Value    Num DF    Den DF    Pr > F

Wilks' Lambda               0.97024627       6.01         1       196    0.0151
Pillai's Trace              0.02975373       6.01         1       196    0.0151
Hotelling-Lawley Trace      0.03066616       6.01         1       196    0.0151
Roy's Greatest Root         0.03066616       6.01         1       196    0.0151
For the dependent variable read, the predictors write, math and science are significant.  For the dependent variable socst, the predictors write and math are significant.  The last two pages of the output indicate that both of the hypotheses regarding the parameters are statistically significant (F = 6.56, p = 0.0112 and F = 6.01, p = 0.0151, respectively).  Hence, based on the results of the first test (which we labeled write), we would conclude that the parameters for read and socst are not the same for the variable write.  Similarly, the second test (which we labeled science) suggests that the parameters for read and socst are not the same for the variable science.
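Notice that all four statistics agree in these two tables.  For a test with S = 1, the four statistics are equivalent and the F statistic is exact: F = ((1 - Lambda) / Lambda) * (den df / num df).  A quick arithmetic check in Python on the numbers from the write test:

```python
from scipy import stats

# Wilks' Lambda and degrees of freedom from the "write" mtest above
wilks = 0.96762141
num_df, den_df = 1, 196

# Exact F for an S = 1 multivariate test
F = (1 - wilks) / wilks * (den_df / num_df)
p = stats.f.sf(F, num_df, den_df)  # upper-tail p-value

print(round(F, 2))  # 6.56, matching the table
```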
