SAS: Simple and Multiple Regression

Chapter Outline
    1.0 Introduction
    1.1 A First Regression Analysis
    1.2 Examining Data
    1.3 Simple Linear Regression
    1.4 Multiple Regression
    1.5 Transforming variables
    1.6 Summary
    1.7 For more information
1.0 Introduction
This web book is composed of four chapters covering a variety of topics about using SAS for regression. We should emphasize that this book is about "data analysis" and that it demonstrates how SAS can be used for regression analysis, as opposed to a book that covers the statistical basis of multiple regression.  We assume that you have had at least one statistics course covering regression analysis and that you have a regression book that you can use as a reference (see the Regression With SAS page and our Statistics Books for Loan page for recommended regression analysis books). This book is designed to help you apply your knowledge of regression, combined with instruction on SAS, to perform, understand, and interpret regression analyses.
This first chapter will cover topics in simple and multiple regression, as well as the supporting tasks that are important in preparing to analyze your data, e.g., data checking, getting familiar with your data file, and examining the distribution of your variables.  We will illustrate the basics of simple and multiple regression and demonstrate the importance of inspecting, checking and verifying your data before accepting the results of your analysis. In general, we hope to show that the results of your regression analysis can be misleading without further probing of your data, which could reveal relationships that a casual analysis could overlook. 
In this chapter, and in subsequent chapters, we will be using a data file that was created by randomly sampling 400 elementary schools from the California Department of Education's API 2000 dataset.  This data file contains a measure of school academic performance as well as other attributes of the elementary schools, such as class size, enrollment, and poverty.
You can access this data file over the web by clicking on elemapi.sas7bdat, or by visiting the Regression with SAS page, where you can download all of the data files used in all of the chapters of this book.  The examples will assume you have stored your files in a folder called c:\sasreg, but you can store the files in any folder you choose; if you run these examples, be sure to change c:\sasreg\ to the name of the folder you have selected.
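As an aside, rather than quoting the physical path in every procedure as we do throughout this book, you could assign a libname to the folder once and refer to the file through it. Below is a minimal sketch of that alternative, assuming the file is stored in c:\sasreg; the library name sasreg is our own choice.
    libname sasreg "c:\sasreg";        /* point a library at the folder once */

    proc reg data=sasreg.elemapi;      /* equivalent to data="c:\sasreg\elemapi" */
      model api00 = acs_k3 meals full;
    run;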
1.1 A First Regression Analysis
Let's dive right in and perform a regression analysis using the variables api00, acs_k3, meals and full. These measure the academic performance of the school (api00), the average class size in kindergarten through 3rd grade (acs_k3), the percentage of students receiving free meals (meals) - which is an indicator of poverty, and the percentage of teachers who have full teaching credentials (full). We expect that better academic performance would be associated with lower class size, fewer students receiving free meals, and a higher percentage of teachers having full teaching credentials.  Below, we use proc reg to run this regression model, followed by the SAS output.
    proc reg data="c:\sasreg\elemapi";
      model api00 = acs_k3 meals full;
    run;
    
    The REG Procedure
    Model: MODEL1
    Dependent Variable: api00 api 2000
    
                                 Analysis of Variance
    
                                        Sum of           Mean
    Source                   DF        Squares         Square    F Value    Pr > F
    
    Model                     3        2634884         878295     213.41    <.0001
    Error                   309        1271713     4115.57673
    Corrected Total         312        3906597
    
    
    Root MSE             64.15276    R-Square     0.6745
    Dependent Mean      596.40575    Adj R-Sq     0.6713
    Coeff Var            10.75656
    
    
                                        Parameter Estimates
    
                                                Parameter       Standard
    Variable     Label                  DF       Estimate          Error    t Value    Pr > |t|
    
    Intercept    Intercept               1      906.73916       28.26505      32.08      <.0001
    acs_k3       avg class size k-3      1       -2.68151        1.39399      -1.92      0.0553
    meals        pct free meals          1       -3.70242        0.15403     -24.04      <.0001
    full         pct full credential     1        0.10861        0.09072       1.20      0.2321
Let's focus on the three predictors, whether they are statistically significant and, if so, the direction of the relationship. The average class size (acs_k3, b=-2.68) is not significant (p=0.0553), but only just so, and the coefficient is negative, which would indicate that larger class sizes are related to lower academic performance -- which is what we would expect.   Next, the effect of meals (b=-3.70, p<.0001) is significant and its coefficient is negative, indicating that the greater the proportion of students receiving free meals, the lower the academic performance.  Please note that we are not saying that free meals are causing lower academic performance.  The meals variable is highly related to income level and functions more as a proxy for poverty. Thus, higher levels of poverty are associated with lower academic performance. This result also makes sense.  Finally, the percentage of teachers with full credentials (full, b=0.11, p=.2321) seems to be unrelated to academic performance. This would seem to indicate that the percentage of teachers with full credentials is not an important factor in predicting academic performance -- this result was somewhat unexpected.
Should we take these results and write them up for publication?  From these results, we would conclude that lower class sizes are related to higher performance, that fewer students receiving free meals is associated with higher performance, and that the percentage of teachers with full credentials was not related to academic performance in the schools.  Before we write this up for publication, we should do a number of checks to make sure we can firmly stand behind these results.  We start by getting more familiar with the data file, doing preliminary data checking, looking for errors in the data. 
1.2 Examining Data
First, let's use proc contents to learn more about this data file.  We can verify how many observations it has and see the names of the variables it contains. 
    proc contents data="c:\sasreg\elemapi" ;
    run;
    The CONTENTS Procedure
    
    Data Set Name: c:\sasreg\elemapi                       Observations:         400
    Member Type:   DATA                                    Variables:            21
    Engine:        V8                                      Indexes:              0
    Created:       4:58 Saturday, January 9, 1960          Observation Length:   83
    Last Modified: 4:58 Saturday, January 9, 1960          Deleted Observations: 0
    Protection:                                            Compressed:           NO
    Data Set Type:                                         Sorted:               NO
    Label:
    
    
         -----Engine/Host Dependent Information-----
    
    Data Set Page Size:         8192
    Number of Data Set Pages:   5
    First Data Page:            1
    Max Obs per Page:           98
    Obs in First Data Page:     56
    Number of Data Set Repairs: 0
    File Name:                  c:\sasreg\elemapi.sas7bdat
    Release Created:            7.0000M0
    Host Created:               WIN_NT
    
    
                -----Alphabetic List of Variables and Attributes-----
    
     #    Variable    Type    Len    Pos    Label
    -----------------------------------------------------------------------------
    11    acs_46      Num       3     39    avg class size 4-6
    10    acs_k3      Num       3     36    avg class size k-3
     3    api00       Num       4     12    api 2000
     4    api99       Num       4     16    api 1999
    17    avg_ed      Num       8     57    avg parent ed
    15    col_grad    Num       3     51    parent college grad
     2    dnum        Num       4      8    district number
     7    ell         Num       3     27    english language learners
    19    emer        Num       3     73    pct emer credential
    20    enroll      Num       4     76    number of students
    18    full        Num       8     65    pct full credential
    16    grad_sch    Num       3     54    parent grad school
     5    growth      Num       4     20    growth 1999 to 2000
    13    hsg         Num       3     45    parent hsg
    21    mealcat     Num       3     80    Percentage free meals in 3 categories
     6    meals       Num       3     24    pct free meals
     9    mobility    Num       3     33    pct 1st year in school
    12    not_hsg     Num       3     42    parent not hsg
     1    snum        Num       8      0    school number
    14    some_col    Num       3     48    parent some college
     8    yr_rnd      Num       3     30    year round school 
We will not go into all of the details of this output.  Note that there are 400 observations and 21 variables.  We have variables about academic performance in 2000 and 1999 and the change in performance (api00, api99 and growth, respectively). We also have various characteristics of the schools, e.g., class size, parents' education, percent of teachers with full and emergency credentials, and number of students.  Note that when we did our original regression analysis it said that there were 313 observations, but the proc contents output indicates that we have 400 observations in the data file.
If you want to learn more about the data file, you could use proc print to show some of the observations.  For example, below we use proc print to show the first five observations.
    proc print data="c:\sasreg\elemapi"(obs=5) ;
    run;
                                  m             s c g
                                   o        n    o o r                   m
                        g       y  b  a  a  o    m l a    a           e  e
                a   a   r  m    r  i  c  c  t    e _ d    v           n  a
         s   d  p   p   o  e    _  l  s  s  _    _ g _    g     f  e  r  l
     O   n   n  i   i   w  a  e r  i  _  _  h  h c r s    _     u  m  o  c
     b   u   u  0   9   t  l  l n  t  k  4  s  s o a c    e     l  e  l  a
     s   m   m  0   9   h  s  l d  y  3  6  g  g l d h    d     l  r  l  t
    
      1 906 41 693 600 93 67  9 0 11 16 22  0  0 0 0 0  .      76 24 247 2
      2 889 41 570 501 69 92 21 0 33 15 32  0  0 0 0 0  .      79 19 463 3
      3 887 41 546 472 74 97 29 0 36 17 25  0  0 0 0 0  .      68 29 395 3
      4 876 41 571 487 84 90 27 0 27 20 30 36 45 9 9 0 1.91000 87 11 418 3
      5 888 41 478 425 53 89 30 0 44 18 31 50 50 0 0 0 1.50000 87 13 520 3
This takes up lots of space on the page, but does not give us a lot of information.  Listing our data can be very helpful, but it is more helpful if you list just the variables you are interested in.  Let's list the first 10 observations for the variables that we looked at in our first regression analysis.
    proc print data="c:\sasreg\elemapi"(obs=10) ;
      var api00 acs_k3 meals full;
    run;
    Obs    api00    acs_k3    meals    full
      1     693       16        67       76
      2     570       15        92       79
      3     546       17        97       68
      4     571       20        90       87
      5     478       18        89       87
      6     858       20         .      100
      7     918       19         .      100
      8     831       20         .       96
      9     860       20         .      100
     10     737       21        29       96   
We see that among the first 10 observations, we have four missing values for meals.  It is likely that the missing data for meals had something to do with the fact that the number of observations in our first regression analysis was 313 and not 400.
Another useful tool for learning about your variables is proc means. Below we use proc means to learn more about the variables api00, acs_k3, meals, and full.  
    proc means data="c:\sasreg\elemapi";
      var api00 acs_k3 meals full;
    run;
    The MEANS Procedure
    
    Variable  Label                  N          Mean       Std Dev       Minimum
    ----------------------------------------------------------------------------
    api00     api 2000             400   647.6225000   142.2489610   369.0000000
    acs_k3    avg class size k-3   398    18.5477387     5.0049328   -21.0000000
    meals     pct free meals       315    71.9936508    24.3855697     6.0000000
    full      pct full credential  400    66.0568000    40.2979258     0.4200000
    ----------------------------------------------------------------------------
    
    Variable  Label                     Maximum
    -------------------------------------------
    api00     api 2000              940.0000000
    acs_k3    avg class size k-3     25.0000000
    meals     pct free meals        100.0000000
    full      pct full credential   100.0000000
    -------------------------------------------
We see that the api00 scores don't have any missing values (because the N is 400) and the scores range from 369 to 940.  This makes sense since the api scores can range from 200 to 1000.  We see that the average class size (acs_k3) had 398 valid values ranging from -21 to 25, with 2 missing. It seems odd for a class size to be -21. The percent receiving free meals (meals) ranges from 6 to 100, but there are only 315 valid values (85 are missing). This seems like a large number of missing values.  The percent with full credentials (full) ranges from .42 to 100 with no missing values.
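If you would like the missing counts reported directly rather than inferred from the Ns, one option (a sketch; output not shown) is to request the nmiss statistic explicitly on proc means:
    proc means data="c:\sasreg\elemapi" n nmiss min max;
      var api00 acs_k3 meals full;
    run;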
We can also use proc freq to learn more about any categorical variables, such as yr_rnd, as shown below.
    proc freq data="c:\sasreg\elemapi";
      tables yr_rnd;
    run;
                         year round school
    
                                       Cumulative    Cumulative
    yr_rnd    Frequency     Percent     Frequency      Percent
    -----------------------------------------------------------
         0         308       77.00           308        77.00
         1          92       23.00           400       100.00
The variable yr_rnd is coded 0=No (not year round) and 1=Yes (year round).  Of the 400 schools, 308 are non-year round and 92 are year round, and none are missing.
The above commands have uncovered a number of peculiarities worthy of further examination. For example, let us look further into the average class size by getting more detailed summary statistics for acs_k3 using proc univariate.  
    proc univariate data="c:\sasreg\elemapi";
      var acs_k3;
    run;
    The UNIVARIATE Procedure
    Variable:  acs_k3  (avg class size k-3)
    
                                Moments
    N                         398    Sum Weights                398
    Mean               18.5477387    Sum Observations          7382
    Std Deviation      5.00493282    Variance            25.0493526
    Skewness           -7.1055928    Kurtosis            53.0136683
    Uncorrected SS         146864    Corrected SS        9944.59296
    Coeff Variation    26.9840594    Std Error Mean      0.25087461
    
                  Basic Statistical Measures
    
        Location                    Variability
    Mean     18.54774     Std Deviation            5.00493
    Median   19.00000     Variance                25.04935
    Mode     19.00000     Range                   46.00000
                          Interquartile Range      2.00000
    
               Tests for Location: Mu0=0
    
    Test           -Statistic-    -----p Value------
    Student's t    t  73.93231    Pr > |t|    <.0001
    Sign           M       193    Pr >= |M|   <.0001
    Signed Rank    S     37839    Pr >= |S|   <.0001
Quantiles (Definition 5)

Quantile      Estimate
100% Max            25
99%                 23
95%                 21
90%                 21
75% Q3              20
50% Median          19
25% Q1              18
10%                 17
5%                  16
1%                 -20
0% Min             -21

        Extreme Observations

----Lowest----        ----Highest---
Value      Obs        Value      Obs
  -21       43           22      365
  -21       42           23       36
  -21       41           23       79
  -20       40           23      361
  -20       38           25      274

               Missing Values

                       -----Percent Of-----
Missing                             Missing
  Value       Count     All Obs         Obs
      .           2        0.50      100.00
Looking in the section labeled Extreme Observations, we see some of the class sizes are -21 and -20, so it seems as though some of the class sizes somehow became negative, as though a negative sign was incorrectly typed in front of them.   Let's do a proc freq for class size to see if this seems plausible.
    proc freq data="c:\sasreg\elemapi";
      tables acs_k3;
    run;
                         avg class size k-3
    
                                       Cumulative    Cumulative
    acs_k3    Frequency     Percent     Frequency      Percent
    -----------------------------------------------------------
       -21           3        0.75             3         0.75
       -20           2        0.50             5         1.26
       -19           1        0.25             6         1.51
        14           2        0.50             8         2.01
        15           1        0.25             9         2.26
        16          14        3.52            23         5.78
        17          20        5.03            43        10.80
        18          64       16.08           107        26.88
        19         143       35.93           250        62.81
        20          97       24.37           347        87.19
        21          40       10.05           387        97.24
        22           7        1.76           394        98.99
        23           3        0.75           397        99.75
        25           1        0.25           398       100.00
    
    Frequency Missing = 2
Indeed, it seems that some of the class sizes somehow got negative signs put in front of them.  Let's look at the school and district number for these observations to see if they come from the same district.   Indeed, they all come from district 140.  
    proc print data="c:\sasreg\elemapi";
      where (acs_k3 < 0);
      var snum dnum acs_k3;
    run;
    Obs    snum    dnum    acs_k3
     38     600     140      -20
     39     596     140      -19
     40     611     140      -20
     41     595     140      -21
     42     592     140      -21
     43     602     140      -21
     85     116     294        .
    306    4534     630        .
Notice that when we looked at the observations where (acs_k3 < 0), this also included observations where acs_k3 is missing (represented as a period).  This happens because SAS treats missing values as smaller than any number.  To be more precise, the above command should exclude such observations, like this.
proc print data="c:\sasreg\elemapi";
  where (acs_k3 < 0) and (acs_k3 ^= .);
  var snum dnum acs_k3;
run;
Obs    snum    dnum    acs_k3
 38     600     140      -20
 39     596     140      -19
 40     611     140      -20
 41     595     140      -21
 42     592     140      -21
 43     602     140      -21
Now,  let's look at all of the observations for district 140.
    proc print data="c:\sasreg\elemapi";
      where (dnum = 140);
      var snum dnum acs_k3;
    run;
    Obs    snum    dnum    acs_k3
     38     600     140      -20
     39     596     140      -19
     40     611     140      -20
     41     595     140      -21
     42     592     140      -21
     43     602     140      -21
All of the observations from district 140 seem to have this problem.  When you find such a problem, you want to go back to the original source of the data to verify the values. We have to reveal that we fabricated this error for illustration purposes, and that the actual data had no such problem. Let's pretend that we checked with district 140 and there was a problem with the data there, a hyphen was accidentally put in front of the class sizes making them negative.  We will make a note to fix this!  Let's continue checking our data.
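For the record, the kind of fix we have in mind could be done with a short data step. The sketch below uses a hypothetical data set name, elemapi_fixed, and reuses the condition from the proc print above to flip the sign on the negative class sizes; we are not actually applying it here.
    data elemapi_fixed;                /* hypothetical corrected copy */
      set "c:\sasreg\elemapi";
      if (acs_k3 < 0) and (acs_k3 ^= .) then acs_k3 = abs(acs_k3);
    run;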
Let's take a look at some graphical methods for inspecting data.  It is useful to inspect each variable using a histogram, boxplot, and stem-and-leaf plot.  These graphs can show you information about the shape of your variables better than simple numeric statistics can. We already know about the problem with acs_k3, but let's see how these graphical methods would have revealed the problem with this variable.
First, we show a histogram for acs_k3. This shows us the observations where the average class size is negative.
    proc univariate data="c:\sasreg\elemapi";
      var acs_k3 ;
      histogram / cfill=gray;
    run;  
    
Likewise, a boxplot and stem-and-leaf plot would have called these observations to our attention as well.   In SAS you can use the plot option with proc univariate to request a boxplot and stem-and-leaf plot. Below we show just the combined boxplot and stem-and-leaf plot from this output. You can see the outlying negative observations way at the bottom of the boxplot.
    proc univariate data="c:\sasreg\elemapi" plot;
      var acs_k3;
    run;
    
                      Histogram                       #  Boxplot
     25+*                                              1     0
       .**                                            10     |
       .****************************                 137  +-----+
       .******************************************   207  *--+--*
       .*******                                       34     |
       .*                                              3     0
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .*                                              3     *
    -21+*                                              3     *
        ----+----+----+----+----+----+----+----+--
        * may represent up to 5 counts
We recommend plotting all of these graphs for the variables you will be analyzing. We will omit, due to space considerations, showing these graphs for all of the variables. However, in examining the variables, the stem-and-leaf plot for full seemed rather unusual.  Up to now, we have not seen anything problematic with this variable, but look at the stem-and-leaf plot for full below. It shows 104 observations where the percent with a full credential is much lower than in all other observations.  This is over 25% of the schools and seems very unusual.
    proc univariate data="c:\sasreg\elemapi" plot;
      var full;
    run;
                       Histogram                   #  Boxplot
      102.5+***************************            81     |
           .******************                     54  +-----+
       92.5+****************                       46  |     |
           .************                           36  *-----*
       82.5+**********                             30  |     |
           .******                                 17  |     |
       72.5+***                                     8  |     |
           .**                                      6  |  +  |
       62.5+**                                      4  |     |
           .**                                      5  |     |
       52.5+*                                       1  |     |
           .**                                      4  |     |
       42.5+*                                       3  |     |
           .*                                       1  |     |
       32.5+                                           |     |
           .                                           |     |
       22.5+                                           |     |
           .                                           |     |
       12.5+                                           |     |
           .                                           |     |
        2.5+***********************************   104  +-----+
            ----+----+----+----+----+----+----+
            * may represent up to 3 counts
Let's look at the frequency distribution of full to see if we can understand this better.  The values go from 0.42 to 1.0, then jump to 37 and go up from there.   It appears as though some of the percentages were actually entered as proportions, e.g., 0.42 was entered instead of 42, and 0.96 instead of 96.
    proc freq data="c:\sasreg\elemapi" ;
      tables full;
    run;
    
                           pct full credential
    
                                             Cumulative    Cumulative
            full    Frequency     Percent     Frequency      Percent
    -----------------------------------------------------------------
    0.4199999869           1        0.25             1         0.25
    0.4499999881           1        0.25             2         0.50
    0.4600000083           1        0.25             3         0.75
    0.4699999988           1        0.25             4         1.00
    0.4799999893           1        0.25             5         1.25
             0.5           3        0.75             8         2.00
    0.5099999905           1        0.25             9         2.25
    0.5199999809           1        0.25            10         2.50
    0.5299999714           1        0.25            11         2.75
    0.5400000215           1        0.25            12         3.00
    0.5600000024           2        0.50            14         3.50
    0.5699999928           2        0.50            16         4.00
    0.5799999833           1        0.25            17         4.25
    0.5899999738           3        0.75            20         5.00
    0.6000000238           1        0.25            21         5.25
    0.6100000143           4        1.00            25         6.25
    0.6200000048           2        0.50            27         6.75
    0.6299999952           1        0.25            28         7.00
    0.6399999857           3        0.75            31         7.75
    0.6499999762           3        0.75            34         8.50
    0.6600000262           2        0.50            36         9.00
    0.6700000167           6        1.50            42        10.50
    0.6800000072           2        0.50            44        11.00
    0.6899999976           3        0.75            47        11.75
    0.6999999881           1        0.25            48        12.00
    0.7099999785           1        0.25            49        12.25
    0.7200000286           2        0.50            51        12.75
    0.7300000191           6        1.50            57        14.25
            0.75           4        1.00            61        15.25
    0.7599999905           2        0.50            63        15.75
    0.7699999809           2        0.50            65        16.25
    0.7900000215           3        0.75            68        17.00
    0.8000000119           5        1.25            73        18.25
    0.8100000024           8        2.00            81        20.25
    0.8199999928           2        0.50            83        20.75
    0.8299999833           2        0.50            85        21.25
    0.8399999738           2        0.50            87        21.75
    0.8500000238           3        0.75            90        22.50
    0.8600000143           2        0.50            92        23.00
    0.8999999762           3        0.75            95        23.75
    0.9200000167           1        0.25            96        24.00
    0.9300000072           1        0.25            97        24.25
    0.9399999976           2        0.50            99        24.75
    0.9499999881           2        0.50           101        25.25
    0.9599999785           1        0.25           102        25.50
               1           2        0.50           104        26.00
              37           1        0.25           105        26.25
              41           1        0.25           106        26.50
              44           2        0.50           108        27.00
              45           2        0.50           110        27.50
              46           1        0.25           111        27.75
              48           1        0.25           112        28.00
              53           1        0.25           113        28.25
              57           1        0.25           114        28.50
              58           3        0.75           117        29.25
              59           1        0.25           118        29.50
              61           1        0.25           119        29.75
              63           2        0.50           121        30.25
              64           1        0.25           122        30.50
              65           1        0.25           123        30.75
              68           2        0.50           125        31.25
              69           3        0.75           128        32.00
              70           1        0.25           129        32.25
              71           3        0.75           132        33.00
              72           1        0.25           133        33.25
              73           2        0.50           135        33.75
              74           1        0.25           136        34.00
              75           4        1.00           140        35.00
              76           4        1.00           144        36.00
              77           2        0.50           146        36.50
              78           4        1.00           150        37.50
              79           3        0.75           153        38.25
              80          10        2.50           163        40.75
              81           4        1.00           167        41.75
              82           3        0.75           170        42.50
              83           9        2.25           179        44.75
              84           4        1.00           183        45.75
              85           8        2.00           191        47.75
              86           5        1.25           196        49.00
              87          12        3.00           208        52.00
              88           6        1.50           214        53.50
              89           5        1.25           219        54.75
              90           9        2.25           228        57.00
              91           8        2.00           236        59.00
              92           7        1.75           243        60.75
              93          12        3.00           255        63.75
              94          10        2.50           265        66.25
              95          17        4.25           282        70.50
              96          17        4.25           299        74.75
              97          11        2.75           310        77.50
              98           9        2.25           319        79.75
             100          81       20.25           400       100.00
Let's see which district(s) these data came from. 
    proc freq data="c:\sasreg\elemapi" ;
      where (full <= 1);
      tables dnum;
    run;
                         district number
    
                                     Cumulative    Cumulative
    dnum    Frequency     Percent     Frequency      Percent
    ---------------------------------------------------------
     401         104      100.00           104       100.00
We note that all 104 observations in which full was less than or equal to one came from district 401.  Let's see if this accounts for all of the observations that come from district 401.
    proc freq data="c:\sasreg\elemapi" ;
      where (dnum = 401);
      tables dnum;
    run;
                         district number
    
                                     Cumulative    Cumulative
    dnum    Frequency     Percent     Frequency      Percent
    ---------------------------------------------------------
     401         104      100.00           104       100.00
All of the observations from this district seem to be recorded as proportions instead of percentages.  Again, let us state that this is a pretend problem that we inserted into the data for illustration purposes.  If this were a real life problem, we would check with the source of the data and verify the problem.  We will make a note to fix this problem in the data as well.
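As with the class sizes, the eventual repair could be a one-line data step. This sketch builds on the hypothetical elemapi_fixed data set from the class size fix above and rescales the district 401 proportions to percentages:
    data elemapi_fixed2;               /* hypothetical corrected copy */
      set elemapi_fixed;
      if (dnum = 401) and (full <= 1) then full = full * 100;
    run;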
Another useful graphical technique for screening your data is a scatterplot matrix. While this is probably more relevant as a diagnostic tool for finding non-linearities and outliers in your data, it can also be a useful data screening tool, possibly revealing information in the joint distributions of your variables that would not be apparent from examining univariate distributions.  Let's look at the scatterplot matrix for the variables in our regression model.  This reveals the problems we have already identified, i.e., the negative class sizes and the percent full credential being entered as proportions.
    proc insight data="c:\sasreg\elemapi";
      scatter api00 acs_k3 meals full *  api00 acs_k3 meals full;
    run;
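Note that proc insight is part of SAS/INSIGHT, an interactive tool that may not be available in your SAS installation (it has been discontinued in more recent releases). If you do not have it, a comparable scatterplot matrix can be produced with proc sgscatter; this is a sketch, assuming SAS 9.2 or later:
    proc sgscatter data="c:\sasreg\elemapi";
      matrix api00 acs_k3 meals full;
    run;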
We have identified three problems in our data.  There are numerous missing values for meals, there were negatives accidentally inserted before some of the class sizes (acs_k3), and over a quarter of the values for full were proportions instead of percentages.  The corrected version of the data is called elemapi2.  Let's use that data file and repeat our analysis to see if the results are the same as our original analysis. First, let's repeat our original regression analysis below.
proc reg data="c:\sasreg\elemapi";
  model api00 = acs_k3 meals full;
run;

Dependent Variable: api00 api 2000

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     3        2634884         878295     213.41    <.0001
Error                   309        1271713     4115.57673
Corrected Total         312        3906597

Root MSE             64.15276    R-Square     0.6745
Dependent Mean      596.40575    Adj R-Sq     0.6713
Coeff Var            10.75656

                                    Parameter Estimates

                                            Parameter       Standard
Variable     Label                  DF       Estimate          Error    t Value    Pr > |t|

Intercept    Intercept               1      906.73916       28.26505      32.08      <.0001
acs_k3       avg class size k-3      1       -2.68151        1.39399      -1.92      0.0553
meals        pct free meals          1       -3.70242        0.15403     -24.04      <.0001
full         pct full credential     1        0.10861        0.09072       1.20      0.2321
Now, let's use the corrected data file and repeat the regression analysis.  We see quite a difference in the results!  In the original analysis (above), acs_k3 was nearly significant, but in the corrected analysis (below) the results show this variable to be not significant, perhaps due to the cases where class size was given a negative value.  Likewise, the percentage of teachers with full credentials was not significant in the original analysis, but is significant in the corrected analysis, perhaps due to the cases where the value was given as the proportion with full credentials instead of the percent.   Also, note that the corrected analysis is based on 398 observations instead of 313, because the corrected data file has complete data for the meals variable, which previously had many missing values.
    proc reg data="c:\sasreg\elemapi2";
      model api00 = acs_k3 meals full ;
    run;
    
    Dependent Variable: api00 api 2000
    
                                 Analysis of Variance
    
                                        Sum of           Mean
    Source                   DF        Squares         Square    F Value    Pr > F
    Model                     3        6604966        2201655     615.55    <.0001
    Error                   394        1409241     3576.75370
    Corrected Total         397        8014207
    
    Root MSE             59.80597    R-Square     0.8242
    Dependent Mean      648.46985    Adj R-Sq     0.8228
    Coeff Var             9.22263
    
                                  Parameter Estimates
    
                                          Parameter     Standard
    Variable   Label                DF     Estimate        Error  t Value  Pr > |t|
    Intercept  Intercept             1    771.65811     48.86071    15.79    <.0001
    acs_k3     avg class size k-3    1     -0.71706      2.23882    -0.32    0.7489
    meals      pct free meals        1     -3.68626      0.11178   -32.98    <.0001
    full       pct full credential   1      1.32714      0.23887     5.56    <.0001
From this point forward, we will use the corrected, elemapi2, data file.  
So far we have covered some topics in data checking/verification, but we have not really discussed regression analysis itself.  Let's now talk more about performing regression analysis in SAS.

1.3 Simple Linear Regression
Let's begin by showing some examples of simple linear regression using SAS. In this type of regression, we have only one predictor variable. This variable may be continuous, meaning that it may assume all values within a range, for example, age or height, or it may be dichotomous, meaning that the variable may assume only one of two values, for example, 0 or 1. The use of categorical variables with more than two levels will be covered in Chapter 3. There is only one response or dependent variable, and it is continuous.
In the model statement of proc reg, the dependent variable is listed first, followed by an equal sign and then one or more predictor variables. Let's examine the relationship between the size of school and academic performance to see if the size of the school is related to academic performance.  For this example, api00 is the dependent variable and enroll is the predictor.
    proc reg data="c:\sasreg\elemapi2";
      model api00 = enroll;
    run;
    Dependent Variable: api00 api 2000
    
                                 Analysis of Variance
    
                                        Sum of           Mean
    Source                   DF        Squares         Square    F Value    Pr > F
    Model                     1         817326         817326      44.83    <.0001
    Error                   398        7256346          18232
    Corrected Total         399        8073672
    
    Root MSE            135.02601    R-Square     0.1012
    Dependent Mean      647.62250    Adj R-Sq     0.0990
    Coeff Var            20.84949
    
                                 Parameter Estimates
    
                                         Parameter     Standard
    Variable   Label               DF     Estimate        Error  t Value  Pr > |t|
    Intercept  Intercept            1    744.25141     15.93308    46.71    <.0001
    enroll     number of students   1     -0.19987      0.02985    -6.70    <.0001
Let's review this output a bit more carefully. First, we see that the F-test is statistically significant, which means that the model is statistically significant. The R-squared of .1012 means that approximately 10% of the variance of api00 is accounted for by the model, in this case, enroll. The t-test for enroll equals -6.70 and is statistically significant, meaning that the regression coefficient for enroll is significantly different from zero. Note that (-6.70)^2 = 44.89, which is the same as the F-statistic (with some rounding error). The coefficient for enroll is -.19987, or approximately -0.2, meaning that for a one-unit increase in enroll, we would expect a 0.2-unit decrease in api00. In other words, a school with 1100 students would be expected to have an api score 20 units lower than a school with 1000 students.  The constant is 744.25141, and this is the predicted value when enroll equals zero.  In most cases, the constant is not very interesting.  We have prepared an annotated output which shows the output from this regression along with an explanation of each of the items in it.
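To make this concrete, the fitted prediction equation is

    predicted api00 = 744.25141 - 0.19987*enroll

so a school with 1000 students has a predicted api00 of about 744.25 - 199.87 = 544.38, while a school with 1100 students has a predicted api00 of about 744.25 - 219.86 = 524.39, roughly 20 units lower.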
In addition to getting the regression table, it can be useful to see a scatterplot of the predicted and outcome variables with the regression line plotted.  SAS makes this very easy for you by using the plot statement as part of proc reg.  For example, below we show how to make a scatterplot of the outcome variable, api00 and the predictor, enroll. Note that the graph also includes the predicted values in the form of the regression line.
    proc reg data="c:\sasreg\elemapi2";
      model api00 = enroll ;
      plot api00 * enroll ;
    run;  
    
As you see, this one command produces a scatterplot and regression line, and it also includes the regression model with the correlation of the two variables in the title.  We could include a 95% prediction interval using the pred option on the plot statement as illustrated below.
proc reg data="c:\sasreg\elemapi2"  ;
  model api00 = enroll ;
  plot api00 * enroll / pred;
run;  
quit;
Another kind of graph that you might want to make is a residual versus fitted plot.  As shown below, we can use the plot statement to make this graph.  The keywords residual. and predicted. in this context refer to the residual value and predicted value from the regression analysis and can be abbreviated as r. and p., respectively.
    proc reg data="c:\sasreg\elemapi2";
      model api00 = enroll ;
      plot residual. * predicted. ;
    run;
The table below shows a number of other keywords that can be used with the plot statement and the statistics they display.
Keyword                     Statistic
------------------------------------------------------------------------------
COOKD.                      Cook's D influence statistic
COVRATIO.                   standard influence of observation on covariance of betas
DFFITS.                     standard influence of observation on predicted value
H.                          leverage
LCL.                        lower bound of 100(1-α)% confidence interval for individual prediction
LCLM.                       lower bound of 100(1-α)% confidence interval for the mean of the dependent variable
PREDICTED. | PRED. | P.     predicted values
PRESS.                      residuals from refitting the model with current observation deleted
RESIDUAL. | R.              residuals
RSTUDENT.                   studentized residuals with the current observation deleted
STDI.                       standard error of the individual predicted value
STDP.                       standard error of the mean predicted value
STDR.                       standard error of the residual
STUDENT.                    residuals divided by their standard errors
UCL.                        upper bound of 100(1-α)% confidence interval for individual prediction
UCLM.                       upper bound of 100(1-α)% confidence interval for the mean of the dependent variable
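For example, one way you might combine these keywords (a sketch; output not shown) is to plot the studentized residuals against the predicted values, with a reference line at zero:
    proc reg data="c:\sasreg\elemapi2";
      model api00 = enroll;
      plot rstudent.*predicted. / vref=0;  /* studentized residuals vs. fitted values */
    run;
    quit;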
1.4 Multiple Regression
Now, let's look at an example of multiple regression, in which we have one outcome (dependent) variable and multiple predictors. For this multiple regression example, we will regress the dependent variable, api00, on all of the predictor variables in the data set.
    proc reg data="c:\sasreg\elemapi2"  ;
      model api00 = ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll ;
    run;  
    
    Dependent Variable: api00 api 2000
    
                                 Analysis of Variance
    
                                        Sum of           Mean
    Source                   DF        Squares         Square    F Value    Pr > F
    Model                     9        6740702         748967     232.41    <.0001
    Error                   385        1240708     3222.61761
    Corrected Total         394        7981410
    
    
    Root MSE             56.76810    R-Square     0.8446
    Dependent Mean      648.65063    Adj R-Sq     0.8409
    Coeff Var             8.75172
    
                                     Parameter Estimates
    
                                                Parameter     Standard
    Variable   Label                      DF     Estimate        Error  t Value  Pr > |t|
    Intercept  Intercept                   1    758.94179     62.28601    12.18    <.0001
    ell        english language learners   1     -0.86007      0.21063    -4.08    <.0001
    meals      pct free meals              1     -2.94822      0.17035   -17.31    <.0001
    yr_rnd     year round school           1    -19.88875      9.25844    -2.15    0.0323
    mobility   pct 1st year in school      1     -1.30135      0.43621    -2.98    0.0030
    acs_k3     avg class size k-3          1      1.31870      2.25268     0.59    0.5586
    acs_46     avg class size 4-6          1      2.03246      0.79832     2.55    0.0113
    full       pct full credential         1      0.60972      0.47582     1.28    0.2008
    emer       pct emer credential         1     -0.70662      0.60541    -1.17    0.2439
    enroll     number of students          1     -0.01216      0.01679    -0.72    0.4693
Let's examine the output from this regression analysis.  As with the simple regression, we look to the p-value of the F-test to see if the overall model is significant. With a p-value of zero to four decimal places, the model is statistically significant. The R-squared is 0.8446, meaning that approximately 84% of the variability of api00 is accounted for by the variables in the model. In this case, the adjusted R-squared indicates that about 84% of the variability of api00 is accounted for by the model, even after taking into account the number of predictor variables in the model. The coefficient for each of the variables indicates the amount of change one could expect in api00 given a one-unit change in the value of that variable, given that all other variables in the model are held constant. For example, consider the variable ell.   We would expect a decrease of 0.86 in the api00 score for every one-unit increase in ell, assuming that all other variables in the model are held constant.  The interpretation of much of the output from the multiple regression is the same as it was for the simple regression.  We have prepared an annotated output that more thoroughly explains the output of this multiple regression analysis.
You may be wondering what a 0.86 change in ell really means, and how you might compare the strength of that coefficient to the coefficient for another variable, say meals. To address this problem, we can use the stb option on the model statement to request that, in addition to the standard output, SAS also display a table of the standardized coefficients, sometimes called beta coefficients.  Below we show just the portion of the output that includes these standardized values.  The beta coefficients are used by some researchers to compare the relative strength of the various predictors within the model. Because the beta coefficients are all measured in standard deviations, instead of the units of the variables, they can be compared to one another. In other words, the beta coefficients are the coefficients that you would obtain if the outcome and predictor variables were all transformed to standard scores, also called z-scores, before running the regression.
    proc reg data="c:\sasreg\elemapi2"  ;
      model api00 = ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll / stb;
    run;  
                     Parameter Estimates
    
                                              Standardized
    Variable   Label                      DF      Estimate
    
    Intercept  Intercept                   1             0
    ell        english language learners   1      -0.14958
    meals      pct free meals              1      -0.66070
    yr_rnd     year round school           1      -0.05914
    mobility   pct 1st year in school      1      -0.06864
    acs_k3     avg class size k-3          1       0.01273
    acs_46     avg class size 4-6          1       0.05498
    full       pct full credential         1       0.06380
    emer       pct emer credential         1      -0.05801
    enroll     number of students          1      -0.01936
Because these standardized coefficients are all in the same standardized units you can compare them to assess the relative strength of each of the predictors.  In this example, meals has the largest beta coefficient, -0.66, and acs_k3 has the smallest, 0.013.  Thus, a one standard deviation increase in meals leads to a 0.66 standard deviation decrease in predicted api00, with the other variables held constant. And a one standard deviation increase in acs_k3, in turn, leads to a 0.013 standard deviation increase in predicted api00 with the other variables in the model held constant.
In interpreting this output, remember that the difference between the regular coefficients (from the prior output) and the standardized coefficients above is the units of measurement.  For example, to describe the raw coefficient for ell you would say  "A one-unit decrease in ell would yield a .86-unit increase in the predicted api00." However, for the standardized coefficient (Beta) you would say, "A one standard deviation decrease in ell would yield a .15 standard deviation increase in the predicted api00."
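If you want to verify that claim for yourself, you could standardize all of the variables and rerun the regression; the raw coefficients from that run should match the beta coefficients above (up to differences in which observations enter each analysis). This is just a sketch, and the output data set name zapi is our own:
    proc standard data="c:\sasreg\elemapi2" mean=0 std=1 out=zapi;
      var api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll;
    run;

    proc reg data=zapi;
      model api00 = ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll;
    run;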
So far, we have concerned ourselves with testing a single variable at a time, for example looking at the coefficient for ell and determining if that is significant. We can also test sets of variables, using the test statement, to see if the set of variables is significant.  First, let's start by testing a single variable, ell, using the test statement.  Note that the part before the test keyword, test1:, is merely a label to identify the output of the test statement.  This could be any short label.
    proc reg data="c:\sasreg\elemapi2"  ;
      model api00 = ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll ;
      test1: test ell =0;
    run;  
     Test TEST1 Results for Dependent Variable api00

                                Mean
Source             DF         Square    F Value    Pr > F

Numerator           1          53732      16.67    <.0001
Denominator       385     3222.61761
If you compare this output with the output from the last regression, you can see that the result of the F-test, 16.67, is the same as the square of the result of the t-test in the regression ((-4.083)^2 = 16.67). Note that you could get the same results if you typed the following, since SAS defaults to comparing the term(s) listed to 0.
    proc reg data="c:\sasreg\elemapi2"  ;
      model api00 = ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll / stb;
      test2: test ell;
    run;  
    
         Test TEST2 Results for Dependent Variable api00
    
                                    Mean
    Source             DF         Square    F Value    Pr > F
    
    Numerator           1          53732      16.67    <.0001
    Denominator       385     3222.61761
Perhaps a more interesting test would be to see if the contribution of class size is significant.  Since the information regarding class size is contained in two variables, acs_k3 and acs_46, we include both of these, separated by a comma, on the test statement.  Below we show just the output from the test statement.
    proc reg data="c:\sasreg\elemapi2"  ;
      model api00 = ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll ;
      test_class_size: test acs_k3, acs_46;
    run;  
    Test TEST_CLASS_SIZE Results for Dependent Variable api00
    
                                    Mean
    Source             DF         Square    F Value    Pr > F
    
    Numerator           2          12742       3.95    0.0200
    Denominator       385     3222.61761
The significant F-test, 3.95, means that the collective contribution of these two variables is significant.  One way to think of this is that there is a significant difference between a model with acs_k3 and acs_46 and a model without them, i.e., there is a significant difference between the "full" model and the "reduced" model.
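You can see where this F value comes from by fitting the reduced model yourself and combining the two error sums of squares, assuming the same observations are used in both fits. A sketch (output not shown):
    proc reg data="c:\sasreg\elemapi2";
      model api00 = ell meals yr_rnd mobility full emer enroll;  /* reduced model */
    run;
    /* F = ((SSE_reduced - SSE_full)/2) / MSE_full, where SSE_full = 1240708 and
       MSE_full = 3222.61761 from the full model above, giving F = 3.95 */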
Finally, as part of doing a multiple regression analysis you might be interested in seeing the correlations among the variables in the regression model.  You can do this with proc corr as shown below.
    proc corr data="c:\sasreg\elemapi2"  ;
      var api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll ;
    run;
                                Pearson Correlation Coefficients
                                   Prob > |r| under H0: Rho=0
                                      Number of Observations
    
                                     api00          ell        meals       yr_rnd     mobility
    
    api00                          1.00000     -0.76763     -0.90070     -0.47544     -0.20641
    api 2000                                     <.0001       <.0001       <.0001       <.0001
                                       400          400          400          400          399
    
    ell                           -0.76763      1.00000      0.77238      0.49793     -0.02046
    english language learners       <.0001                    <.0001       <.0001       0.6837
                                       400          400          400          400          399
    
    meals                         -0.90070      0.77238      1.00000      0.41845      0.21665
    pct free meals                  <.0001       <.0001                    <.0001       <.0001
                                       400          400          400          400          399
    
    yr_rnd                        -0.47544      0.49793      0.41845      1.00000      0.03479
    year round school               <.0001       <.0001       <.0001                    0.4883
                                       400          400          400          400          399
    
    mobility                      -0.20641     -0.02046      0.21665      0.03479      1.00000
    pct 1st year in school          <.0001       0.6837       <.0001       0.4883
                                       399          399          399          399          399
    
    acs_k3                         0.17100     -0.05565     -0.18797      0.02270      0.04014
    avg class size k-3              0.0006       0.2680       0.0002       0.6517       0.4245
                                       398          398          398          398          398
    
    acs_46                         0.23291     -0.17330     -0.21309     -0.04207      0.12769
    avg class size 4-6              <.0001       0.0005       <.0001       0.4032       0.0110
                                       397          397          397          397          396
    
    full                           0.57441     -0.48476     -0.52756     -0.39771      0.02521
    pct full credential             <.0001       <.0001       <.0001       <.0001       0.6156
                                       400          400          400          400          399
    
    emer                          -0.58273      0.47218      0.53304      0.43472      0.05961
    pct emer credential             <.0001       <.0001       <.0001       <.0001       0.2348
                                       400          400          400          400          399
    
    enroll                        -0.31817      0.40302      0.24103      0.59182      0.10502
    number of students              <.0001       <.0001       <.0001       <.0001       0.0360
                                       400          400          400          400          399
    
                                Pearson Correlation Coefficients
                                   Prob > |r| under H0: Rho=0
                                      Number of Observations
    
                                    acs_k3       acs_46         full         emer       enroll
    
    api00                          0.17100      0.23291      0.57441     -0.58273     -0.31817
    api 2000                        0.0006       <.0001       <.0001       <.0001       <.0001
                                       398          397          400          400          400
    
    ell                           -0.05565     -0.17330     -0.48476      0.47218      0.40302
    english language learners       0.2680       0.0005       <.0001       <.0001       <.0001
                                       398          397          400          400          400
    
    meals                         -0.18797     -0.21309     -0.52756      0.53304      0.24103
    pct free meals                  0.0002       <.0001       <.0001       <.0001       <.0001
                                       398          397          400          400          400
    
    yr_rnd                         0.02270     -0.04207     -0.39771      0.43472      0.59182
    year round school               0.6517       0.4032       <.0001       <.0001       <.0001
                                       398          397          400          400          400
    
    mobility                       0.04014      0.12769      0.02521      0.05961      0.10502
    pct 1st year in school          0.4245       0.0110       0.6156       0.2348       0.0360
                                       398          396          399          399          399
    
    acs_k3                         1.00000      0.27078      0.16057     -0.11033      0.10890
    avg class size k-3                           <.0001       0.0013       0.0277       0.0298
                                       398          395          398          398          398
    
    acs_46                         0.27078      1.00000      0.11773     -0.12446      0.02829
    avg class size 4-6              <.0001                    0.0190       0.0131       0.5741
                                       395          397          397          397          397
    
    full                           0.16057      0.11773      1.00000     -0.90568     -0.33769
    pct full credential             0.0013       0.0190                    <.0001       <.0001
                                       398          397          400          400          400
    
    emer                          -0.11033     -0.12446     -0.90568      1.00000      0.34309
    pct emer credential             0.0277       0.0131       <.0001                    <.0001
                                       398          397          400          400          400
    
    enroll                         0.10890      0.02829     -0.33769      0.34309      1.00000
    number of students              0.0298       0.5741       <.0001       <.0001
                                       398          397          400          400          400
We can see that the strongest correlation with api00 is meals, with a correlation of -0.90.  The variables ell and emer are also strongly correlated with api00. All three of these correlations are negative, meaning that as the value of one variable goes up, the value of the other tends to go down. Knowing that these variables are strongly associated with api00, we might predict that they would be statistically significant predictors in the regression model. Note that the number of cases used for each correlation is determined on a "pairwise" basis; for example, there are 398 valid pairs of data for enroll and acs_k3, so the correlation of 0.10890 between them is based on 398 observations.
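By default, proc corr computes each correlation from all of the cases that are non-missing on that particular pair of variables. If you would rather have every correlation in the table based on the same set of cases, one option (a minimal sketch) is to add the nomiss option, which requests listwise deletion:

    proc corr data="c:\sasreg\elemapi2" nomiss ;
      * nomiss drops any school missing on any listed variable ;
      var api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll ;
    run;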
1.5 Transforming Variables
Earlier we focused on screening your data for potential errors.  In the next chapter, we will focus on regression diagnostics to verify whether your data meet the assumptions of linear regression.  Here, we will focus on the issue of normality.  Some researchers believe that linear regression requires that the outcome (dependent) and predictor variables be normally distributed. We need to clarify this issue: in actuality, it is the residuals that need to be normally distributed, and even then only for the t-tests to be valid. The estimation of the regression coefficients does not require normally distributed residuals. Because we are interested in having valid t-tests, we will investigate issues concerning normality.
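For instance, here is a minimal sketch of how you could examine the residuals directly, using the simple regression of api00 on enroll from earlier in the chapter (the data set name apires and the variable name resid are our own labels):

    proc reg data="c:\sasreg\elemapi2" noprint;
      model api00 = enroll;
      output out=apires r=resid;  * save the residuals to a data set ;
    run;
    quit;

    proc univariate data=apires;
      var resid;
      histogram / cfill=gray normal;  * inspect the residuals for normality ;
    run;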
A common cause of non-normally distributed residuals is non-normally distributed outcome and/or predictor variables.  So, let us explore the distribution of our variables and how we might transform them to a more normal shape.  Let's start by making a histogram of the variable enroll, which we looked at earlier in the simple regression.
    proc univariate data="c:\sasreg\elemapi2";
      var enroll ;
      histogram / cfill=gray;
    run;
We can use the normal option to superimpose a normal curve on this graph and the midpoints option to indicate that we want bins with midpoints from 100 to 1500 going in increments of 100.
proc univariate data="c:\sasreg\elemapi2";
  var enroll ;
  histogram / cfill=gray normal midpoints=100 to 1500 by 100;
run;
Histograms are sensitive to the number of bins or columns that are used in the display. An alternative to histograms is the kernel density plot, which approximates the probability density of the variable. Kernel density plots have the advantage of being smooth and of being independent of the choice of origin, unlike histograms. You can add a kernel density plot to the above plot with the kernel option, as illustrated below.
    proc univariate data="c:\sasreg\elemapi2";
      var enroll ;
      histogram / cfill=gray normal midpoints=100 to 1500 by 100 kernel;
    run;
    
Not surprisingly, the kernel density plot also indicates that the variable enroll does not look normal.  
There are two other types of graphs that are often used to examine the distribution of variables: quantile-quantile plots and normal probability plots.
A quantile-quantile plot graphs the quantiles of a variable against the quantiles of a normal (Gaussian) distribution. Such plots are sensitive to non-normality near the tails, and indeed we see considerable deviations from the normal (the diagonal line) in the tails. This plot is typical of variables that are strongly skewed to the right.
    proc univariate data="c:\sasreg\elemapi2";
      var enroll ;
      qqplot / normal;
    run;  
    
The normal probability plot is also useful for examining the distribution of variables and is sensitive to deviations from normality nearer to the center of the distribution. We will use SAS proc capability to get the normal probability plot. Again, we see indications of non-normality in enroll.
    proc capability data="c:\sasreg\elemapi2" noprint;
      ppplot enroll ;
    run;
Given the skewness to the right in enroll, let us try a log transformation to see if that makes it more normal.  Below we create a variable lenroll that is the natural log of enroll and then we repeat some of the above commands to see if lenroll is more normally distributed.
data elemapi3;
  set "c:\sasreg\elemapi2";
  lenroll = log(enroll);
run;
Now let's try showing a histogram for lenroll with a normal overlay and a kernel density estimate.
proc univariate data=elemapi3 noprint;
  var lenroll ;
  histogram / cfill=grayd0  normal kernel (color = red);
run;
We can see that lenroll looks quite normal.  We could then create a quantile-quantile plot and a normal probability plot to further assess whether lenroll seems normal (a sketch follows below), as well as examine how lenroll affects the residuals, which is really the important consideration.
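For instance, a sketch of those follow-up checks, repeating the qqplot and ppplot commands from above with lenroll in place of enroll:

    proc univariate data=elemapi3;
      var lenroll ;
      qqplot / normal;
    run;

    proc capability data=elemapi3 noprint;
      ppplot lenroll ;
    run;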
1.6 Summary
In this chapter we have discussed the basics of how to perform simple and multiple regressions, the basics of interpreting output, as well as some related commands. We examined some tools and techniques for screening for bad data and the consequences such data can have on your results.  Finally, we touched on the assumptions of linear regression and illustrated how you can check the normality of your variables and how you can transform your variables to achieve normality.  The next chapter will pick up where this chapter has left off, going into a more thorough discussion of the assumptions of linear regression and how you can use SAS to assess these assumptions for your data.   In particular, the next chapter will address the following issues.
  • Checking for points that exert undue influence on the coefficients
  • Checking for constant error variance (homoscedasticity)
  • Checking for linear relationships
  • Checking model specification
  • Checking for multicollinearity
  • Checking normality of residuals
1.7 For more information

SAS: Multivariate regression in SAS

Multivariate regression in SAS

In many ways, multivariate regression is similar to MANOVA.  The hypotheses, the methods used to obtain the estimates, and the assumptions are all similar. The multivariate test statistics are the same.  The hypothesis being tested by a multivariate regression is that there is a joint linear effect of the set of predictors on the set of responses.  Hence, the null hypothesis is that all of the slope coefficients are simultaneously zero.  Note that the "set" of predictors may include no predictors or only one predictor, but usually it contains more.
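In matrix notation (ours, purely for illustration), the multivariate regression model and this null hypothesis can be written as

$$ Y = XB + E, \qquad H_0\colon B_1 = 0, $$

where $Y$ is the $n \times p$ matrix of responses, $X$ is the design matrix, $B$ is the coefficient matrix, and $B_1$ is the submatrix of $B$ containing all of the slope rows (everything except the intercept row).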
The basic assumptions of multivariate regression are 1) multivariate normality of the residuals, 2) homogeneous variances of residuals conditional on predictors, 3) common covariance structure across observations, and 4) independent observations.  Unfortunately, testing the first three assumptions is very difficult.  Currently, many of the common statistical packages, such as SAS and SPSS, do not offer a test of multivariate normality.  However, you can see if your data are close to being multivariate normal by creating some graphs.  First, you want to see if your residuals for each dependent variable are normal by themselves.  This is necessary, but not sufficient, for multivariate normality.  Next, you can create scatterplots of the residuals.  You want to see the points on the graph form an ellipse (as opposed to a V-shape, a wedge-shape, or some other kind of shape).  Remember that a circle is a special case of an ellipse.  You would like the points to line up in a "flattened" ellipse because the dependent variables are supposed to be correlated for MANOVA or multivariate regression to be the analysis of choice, but this is not necessary for multivariate normality.

Regarding the second assumption, homogeneity of variances, there are several tests available, but most of them are very sensitive to nonnormality.  Fortunately, the F statistic is fairly robust against violations of this assumption.  As for the third assumption, the covariance matrices are rarely equal in practice.  Monte Carlo studies have shown that keeping the number of observations (subjects) per group approximately equal is an effective way of ensuring that violations of this assumption will not be too problematic.  Regarding the independence of observations, there is no statistical test; rather, it is an issue of methodology.  Care should be taken to ensure that the observations are independent, because even small intraclass correlations can cause serious problems.  For example, suppose an experimenter had three groups with 30 subjects per group and a small dependence between the observations, say an intraclass correlation of .10.  The actual alpha value would be .4917, rather than the nominal .05.
If all of these assumptions are met, then the coefficients will be unbiased, the least-squares estimates will have minimum variance, and the relationships among the coefficients will reflect the relationships among the predictors.  Furthermore, a multivariate hypothesis test will account for the relationship among the coefficients, whereas a univariate F test would not.
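Before turning to the example, here is a minimal sketch of the graphical checks just described, using the hsb2 data set and variables from the example that follows; the output data set resids and the residual names rread and rsocst are our own labels:

proc reg data = "g:\SAS\hsb2" noprint;
model read socst = write math science;
output out=resids r=rread rsocst;  * one residual variable per dependent variable ;
run;
quit;

* check each set of residuals for univariate normality - necessary but not sufficient ;
proc univariate data=resids;
  var rread rsocst;
  histogram / normal;
run;

* plot the residuals against each other and look for an elliptical cloud ;
proc gplot data=resids;
  plot rread*rsocst;
run;
quit;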

With all of this in mind, let's try a multivariate multiple regression.  We will use the hsb2 data set for our example, and we will use read and socst as our dependent variables and write, math and science as our independent variables.  The proc reg statement is the same as it would be in a univariate regression, but the model statement is a little different: we now have two (we could have more) dependent variables listed before the equals sign.  Also, we have included the mtest statement, which is used to test hypotheses in multivariate regression.  If no equations are listed on the mtest statement, SAS tests the hypothesis that all coefficients except the intercept are zero.  You can specify some options on the mtest statement, including canprint, which will print the canonical correlations for the hypothesis combinations and the dependent variable combinations.  The details option will display the M matrix, and the print option will display the H and E matrices.
proc reg data = "g:\SAS\hsb2";
model read socst = write math science;
mtest / details print;
run;
quit;
The REG Procedure
Model: MODEL1
Dependent Variable: read

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     3          11313     3771.09916      76.94    <.0001
Error                   196     9606.12253       49.01083
Corrected Total         199          20919

Root MSE              7.00077    R-Square     0.5408
Dependent Mean       52.23000    Adj R-Sq     0.5338
Coeff Var            13.40374

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1        4.36993        3.20878       1.36      0.1748
write         1        0.23767        0.06969       3.41      0.0008
math          1        0.37840        0.07463       5.07      <.0001
science       1        0.29693        0.06763       4.39      <.0001
The REG Procedure
Model: MODEL1
Dependent Variable: socst

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     3     9551.66620     3183.88873      46.62    <.0001
Error                   196          13385       68.28841
Corrected Total         199          22936

Root MSE              8.26368    R-Square     0.4164
Dependent Mean       52.40500    Adj R-Sq     0.4075
Coeff Var            15.76888

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1        8.86989        3.78763       2.34      0.0202
write         1        0.46567        0.08227       5.66      <.0001
math          1        0.27630        0.08810       3.14      0.0020
science       1        0.08512        0.07984       1.07      0.2877
The REG Procedure
Model: MODEL1
Multivariate Test 1

                               L Ginv(X'X) L'   LB-cj

0.0000991078      -0.000042904      -0.000028518      0.2376705687      0.4656741023
-0.000042904      0.0001136529      -0.000044399      0.3784014963      0.2763008055
-0.000028518      -0.000044399      0.0000933347      0.2969346843      0.0851168364

                         Inv(L Ginv(X'X) L')    Inv()(LB-cj)

   17878.875         10911.025          10653.25          11541.35         12247.225
   10911.025         17465.795          11642.35          12659.33         10897.755
    10653.25          11642.35           19507.5           12729.9           9838.15

       Error Matrix (E)

9606.1225306      3657.5503071
3657.5503071      13384.528803

     Hypothesis Matrix (H)

11313.297469      9955.8196929
9955.8196929      9551.6661967

 Hypothesis + Error Matrix (T)

    20919.42          13613.37
    13613.37         22936.195


         Eigenvectors

    0.004986          0.002488
   -0.007281          0.008053

 Eigenvalues

    0.587507
    0.051687                                                                

The REG Procedure
Model: MODEL1
Multivariate Test 1

                 Multivariate Statistics and F Approximations

                             S=2    M=0    N=96.5

Statistic                        Value    F Value    Num DF    Den DF    Pr > F

Wilks' Lambda               0.39117291      38.93         6       390    <.0001
Pillai's Trace              0.63919333      30.69         6       392    <.0001
Hotelling-Lawley Trace      1.47878554      47.94         6    258.23    <.0001
Roy's Greatest Root         1.42428180      93.05         3       196    <.0001

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.
Looking at the very bottom of the output we can see that the overall model is statistically significant.  We can look at the first half of the output to see the univariate results.  Here we see that with only the dependent variable read, the overall model is statistically significant, as well as each of the predictors.  When we look at the univariate results for socst, we see that the overall model is statistically significant, as are the predictors write and math, but not science.  In other words, multivariate tests tell us that the set of predictors accounts for a statistically significant portion of the variance in the dependent variables, and the univariate tests break this down for us so that we can see where the significant differences are.

Let's run the same model again, but this time, we will specify some hypotheses to be tested on the mtest statement.  In the first mtest statement, we will test the hypothesis that the parameter for write is the same for read and socst. In the second mtest statement, we will test the hypothesis that the parameter for science is the same for read and socst.  You will notice that, as with test statements in other procs, we can use a label before the statement so that it is labeled in the output.
proc reg data = "g:\SAS\hsb2";
model read socst = write math science;
write: mtest read - socst, write / details print;
science: mtest read - socst, science / details print;
run;
quit;
The REG Procedure
Model: MODEL1
Dependent Variable: read

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     3          11313     3771.09916      76.94    <.0001
Error                   196     9606.12253       49.01083
Corrected Total         199          20919

Root MSE              7.00077    R-Square     0.5408
Dependent Mean       52.23000    Adj R-Sq     0.5338
Coeff Var            13.40374

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1        4.36993        3.20878       1.36      0.1748
write         1        0.23767        0.06969       3.41      0.0008
math          1        0.37840        0.07463       5.07      <.0001
science       1        0.29693        0.06763       4.39      <.0001

The REG Procedure
Model: MODEL1
Dependent Variable: socst

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     3     9551.66620     3183.88873      46.62    <.0001
Error                   196          13385       68.28841
Corrected Total         199          22936

Root MSE              8.26368    R-Square     0.4164
Dependent Mean       52.40500    Adj R-Sq     0.4075
Coeff Var            15.76888

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1        8.86989        3.78763       2.34      0.0202
write         1        0.46567        0.08227       5.66      <.0001
math          1        0.27630        0.08810       3.14      0.0020
science       1        0.08512        0.07984       1.07      0.2877

The REG Procedure
Model: MODEL1
Multivariate Test: write

                Multivariate Statistics and Exact F Statistics

                             S=1    M=-0.5    N=97

Statistic                        Value    F Value    Num DF    Den DF    Pr > F

Wilks' Lambda               0.96762141       6.56         1       196    0.0112
Pillai's Trace              0.03237859       6.56         1       196    0.0112
Hotelling-Lawley Trace      0.03346205       6.56         1       196    0.0112
Roy's Greatest Root         0.03346205       6.56         1       196    0.0112

The REG Procedure
Model: MODEL1
Multivariate Test: science

                Multivariate Statistics and Exact F Statistics

                             S=1    M=-0.5    N=97

Statistic                        Value    F Value    Num DF    Den DF    Pr > F

Wilks' Lambda               0.97024627       6.01         1       196    0.0151
Pillai's Trace              0.02975373       6.01         1       196    0.0151
Hotelling-Lawley Trace      0.03066616       6.01         1       196    0.0151
Roy's Greatest Root         0.03066616       6.01         1       196    0.0151
For the dependent variable read, the predictors write, math and science are significant.  For the dependent variable socst, the predictors write and math are significant.  The last two pages of the output indicate that both of the hypotheses regarding the parameters were statistically significant (F = 6.56, p = 0.0112 and F = 6.01, p = 0.0151, respectively).  Hence, we would conclude, based on the results of the first test (which we called write), that the parameters for read and socst are not the same for the variable write.  Likewise, the second test (which we called science) suggests that the parameters for read and socst are not the same for the variable science.

R: Binding two vectors and displaying them as a matrix in R

Hello,

I have two vectors.

x<-c(1, 2, 3)
y<-c(4,5,6)

How do I create a 3 by 3 matrix from these two, such that  

 (1,4)  (1,5)  (1,6)
 (2,4)  (2,5)  (2,6)
 (3,4)  (3,5)  (3,6)

I tried something like this:

xy <- as.data.frame(c(0,0,0), dim=c(3,3))
for(i in 1:3)
for(j in 1:3)
xy[i][j]<-c(x[i],y[j])

but I got errors.

Try

 x1 <- matrix(1,3,1) %x% x  # Kronecker product: stacks x three times, giving (1,2,3,1,2,3,1,2,3)
 y1 <- y %x% matrix(1,3,1)  # repeats each element of y three times, giving (4,4,4,5,5,5,6,6,6)

 Z <- cbind(x1, y1)         # a 9-by-2 matrix whose rows are the nine (x, y) pairs

Z lists every pair as a row. To display the pairs in a 3 by 3 grid, you would need to move to a list or a character matrix, since an ordinary numeric matrix cannot hold a pair in a single cell.

R: Cross-Validation for Linear Regression


cv.lm {DAAG}                                   R Documentation

Cross-Validation for Linear Regression

Description

This function gives internal and cross-validation measures of predictive accuracy for ordinary linear regression. The data are randomly assigned to a number of `folds'. Each fold is removed, in turn, while the remaining data is used to re-fit the regression model and to predict at the deleted observations.

Usage

cv.lm(df = houseprices, m = 3, form.lm = formula(sale.price ~ area), dots = FALSE,
 seed = 29)

Arguments

df        a data frame
m         the number of folds
form.lm   a formula object
dots      uses pch=16 for the plotting character
seed      random number generator seed

Value

For each fold, a table listing the residuals, together with ms, the overall mean square of prediction error.

Author(s)

J.H. Maindonald

See Also

lm

Examples

library(DAAG)  # provides cv.lm() and the houseprices data
data(houseprices)
cv.lm()

R: Using Validation and Cross Validation

Example 42.2 Using Validation and Cross Validation

This example shows how you can use both test set and cross validation to monitor and control variable selection. It also demonstrates the use of split classification variables.
The following statements produce the analysis and test data sets. Note that a single data step generates both: each observation is randomly assigned to the analysis or the test role, in a ratio of approximately two to one.
   
 data analysisData testData;
    drop i j c3Num;
    length c3$ 7;

    array x{20} x1-x20;

    do i=1 to 1500;
       do j=1 to 20;
          x{j} = ranuni(1);
       end;

       c1 = 1 + mod(i,8);
       c2 = ranbin(1,3,.6); 
  
       if      i < 50   then do; c3 = 'tiny';     c3Num=1;end;
       else if i < 250  then do; c3 = 'small';    c3Num=1;end;
       else if i < 600  then do; c3 = 'average';  c3Num=2;end;
       else if i < 1200 then do; c3 = 'big';      c3Num=3;end;
       else                  do; c3 = 'huge';     c3Num=5;end;  

       y = 10 + x1 + 2*x5 + 3*x10 + 4*x20  + 3*x1*x7 + 8*x6*x7
              + 5*(c1=3)*c3Num + 8*(c1=7)  + 5*rannor(1);
     
       if ranuni(1) < 2/3 then output analysisData;
                          else output testData;
   end;
 run;  
Suppose you suspect that the dependent variable depends on both main effects and two-way interactions. You can use the following statements to select a model:
ods graphics on;

proc glmselect data=analysisData testdata=testData
               seed=1 plots(stepAxis=number)=(criterionPanel ASEPlot);
   partition fraction(validate=0.5);
   class c1 c2 c3(order=data);  
   model y =  c1|c2|c3|x1|x2|x3|x4|x5|x5|x6|x7|x8|x9|x10
             |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
           / selection=stepwise(choose = validate 
                                select = sl) 
             hierarchy=single stb;
run;
Note that a TESTDATA= data set is named in the PROC GLMSELECT statement and that a PARTITION statement is used to randomly assign half the observations in the analysis data set for model validation and the rest for model training. You find details about the number of observations used for each role in the number of observations tables shown in Output 42.2.1.
Output 42.2.1 Number of Observations Tables


The GLMSELECT Procedure

Observation Profile for Analysis Data
Number of Observations Read 1010
Number of Observations Used 1010
Number of Observations Used for Training 510
Number of Observations Used for Validation 500

The "Class Level Information" and "Dimensions" tables are shown in Output 42.2.2. The "Dimensions" table shows that at each step of the selection process, 278 effects are considered as candidates for entry or removal. Since several of these effects have multilevel classification variables as members, there are 661 parameters.
Output 42.2.2 Class Level Information and Problem Dimensions


Class Level Information
Class Levels Values
c1 8 1 2 3 4 5 6 7 8
c2 4 0 1 2 3
c3 5 tiny small average big huge


Dimensions
Number of Effects 278
Number of Parameters 661

The MODEL statement options request stepwise selection, with the default entry and stay significance levels used both to select entering and departing effects and to stop the selection method. The CHOOSE=VALIDATE suboption specifies that the selected model is chosen to minimize the predicted residual sum of squares when the models at each step are scored on the observations reserved for validation. The HIERARCHY=SINGLE option specifies that interactions can enter the model only if the corresponding main effects are already in the model, and that main effects cannot be dropped from the model if an interaction with such an effect is in the model. These settings are listed in the model information table shown in Output 42.2.3.
Output 42.2.3 Model Information


The GLMSELECT Procedure

Data Set WORK.ANALYSISDATA
Test Data Set WORK.TESTDATA
Dependent Variable y
Selection Method Stepwise
Select Criterion Significance Level
Stop Criterion Significance Level
Choose Criterion Validation ASE
Entry Significance Level (SLE) 0.15
Stay Significance Level (SLS) 0.15
Effect Hierarchy Enforced Single
Random Number Seed 1

The stop reason and stop details tables are shown in Output 42.2.4. Note that because the STOP= suboption of the SELECTION= option was not explicitly specified, the stopping criterion used is the selection criterion, namely significance level.
Output 42.2.4 Stop Details


Selection stopped because the candidate for entry has SLE > 0.15 and the candidate for removal has SLS < 0.15.


Stop Details
Candidate For    Effect    Candidate Significance    Compare Significance
Entry            x2*x5     0.1742                    > 0.1500 (SLE)
Removal          x5*x10    0.0534                    < 0.1500 (SLS)

The criterion panel in Output 42.2.5 shows how the various fit criteria evolved as the stepwise selection method proceeded. Note that other than the ASE evaluated on the validation data, these criteria are evaluated on the training data. You see that the minimum of the validation ASE occurs at step 9, and hence the model at this step is selected.

Output 42.2.5 Criterion Panel


Output 42.2.6 shows how the average squared error (ASE) evolved on the training, validation, and test data. Note that while the ASE on the training data decreases monotonically, the errors on both the validation and test data start increasing beyond step 9. This indicates that models after step 9 are beginning to overfit the training data.

Output 42.2.6 Average Squared Errors


Output 42.2.7 shows the selected effects, analysis of variance, and fit statistics tables for the selected model. Output 42.2.8 shows the parameter estimates table.

Output 42.2.7 Selected Model Details


The GLMSELECT Procedure
Selected Model


The selected model, based on Validation ASE, is the model at Step 9.

Effects: Intercept c1 c3 c1*c3 x1 x5 x6 x7 x10 x20


Analysis of Variance
Source            DF    Sum of Squares    Mean Square    F Value
Model             44         22723         516.43621      20.49
Error            465         11722          25.20856
Corrected Total  509         34445


Root MSE 5.02081
Dependent Mean 21.09705
R-Square 0.6597
Adj R-Sq 0.6275
AIC 2200.75319
AICC 2210.09228
SBC 1879.30167
ASE (Train) 22.98427
ASE (Validate) 27.71105
ASE (Test) 24.82947

Output 42.2.8 Parameter Estimates


Parameter Estimates
Parameter            DF    Estimate      Standardized Estimate    Standard Error    t Value
Intercept 1 6.867831 0 1.524446 4.51
c1 1 1 0.226602 0.008272 2.022069 0.11
c1 2 1 -1.189623 -0.048587 1.687644 -0.70
c1 3 1 25.968930 1.080808 1.693593 15.33
c1 4 1 1.431767 0.054892 1.903011 0.75
c1 5 1 1.972622 0.073854 1.664189 1.19
c1 6 1 -0.094796 -0.004063 1.898700 -0.05
c1 7 1 5.971432 0.250037 1.846102 3.23
c1 8 0 0 0 . .
c3 tiny 1 -2.919282 -0.072169 2.756295 -1.06
c3 small 1 -4.635843 -0.184338 2.218541 -2.09
c3 average 1 0.736805 0.038247 1.793059 0.41
c3 big 1 -1.078463 -0.063580 1.518927 -0.71
c3 huge 0 0 0 . .
c1*c3 1 tiny 1 -2.449964 -0.018632 4.829146 -0.51
c1*c3 1 small 1 5.265031 0.069078 3.470382 1.52
c1*c3 1 average 1 -3.489735 -0.064365 2.850381 -1.22
c1*c3 1 big 1 0.725263 0.017929 2.516502 0.29
c1*c3 1 huge 0 0 0 . .
c1*c3 2 tiny 1 5.455122 0.050760 4.209507 1.30
c1*c3 2 small 1 7.439196 0.131499 2.982411 2.49
c1*c3 2 average 1 -0.739606 -0.014705 2.568876 -0.29
c1*c3 2 big 1 3.179351 0.078598 2.247611 1.41
c1*c3 2 huge 0 0 0 . .
c1*c3 3 tiny 1 -19.266847 -0.230989 3.784029 -5.09
c1*c3 3 small 1 -15.578909 -0.204399 3.266216 -4.77
c1*c3 3 average 1 -18.119398 -0.395770 2.529578 -7.16
c1*c3 3 big 1 -10.650012 -0.279796 2.205331 -4.83
c1*c3 3 huge 0 0 0 . .
c1*c3 4 tiny 0 0 0 . .
c1*c3 4 small 1 4.432753 0.047581 3.677008 1.21
c1*c3 4 average 1 -3.976295 -0.091632 2.625564 -1.51
c1*c3 4 big 1 -1.306998 -0.033003 2.401064 -0.54
c1*c3 4 huge 0 0 0 . .
c1*c3 5 tiny 1 6.714186 0.062475 4.199457 1.60
c1*c3 5 small 1 1.565637 0.022165 3.182856 0.49
c1*c3 5 average 1 -4.286085 -0.068668 2.749142 -1.56
c1*c3 5 big 1 -2.046468 -0.045949 2.282735 -0.90
c1*c3 5 huge 0 0 0 . .
c1*c3 6 tiny 1 5.135111 0.039052 4.754845 1.08
c1*c3 6 small 1 4.442898 0.081945 3.079524 1.44
c1*c3 6 average 1 -2.287870 -0.056559 2.601384 -0.88
c1*c3 6 big 1 1.598086 0.043542 2.354326 0.68
c1*c3 6 huge 0 0 0 . .
c1*c3 7 tiny 1 1.108451 0.010314 4.267509 0.26
c1*c3 7 small 1 7.441059 0.119214 3.135404 2.37
c1*c3 7 average 1 1.796483 0.038106 2.630570 0.68
c1*c3 7 big 1 3.324160 0.095173 2.303369 1.44
c1*c3 7 huge 0 0 0 . .
c1*c3 8 tiny 0 0 0 . .
c1*c3 8 small 0 0 0 . .
c1*c3 8 average 0 0 0 . .
c1*c3 8 big 0 0 0 . .
c1*c3 8 huge 0 0 0 . .
x1 1 2.713527 0.091530 0.836942 3.24
x5 1 2.810341 0.098303 0.816290 3.44
x6 1 4.837022 0.167394 0.810402 5.97
x7 1 5.844394 0.207035 0.793775 7.36
x10 1 2.463916 0.087712 0.794599 3.10
x20 1 4.385924 0.156155 0.787766 5.57

The magnitudes of the standardized estimates and the t statistics of the parameters of the effect "c1" reveal that only levels "3" and "7" of this effect contribute appreciably to the model. This suggests that a more parsimonious model with similar or better predictive power might be obtained if parameters corresponding to the levels of "c1" are allowed to enter or leave the model independently. You request this with the SPLIT option in the CLASS statement, as shown in the following statements:
proc glmselect data=analysisData testdata=testData
               seed=1 plots(stepAxis=number)=all;
   partition fraction(validate=0.5);
   class c1(split) c2 c3(order=data);  
   model y =  c1|c2|c3|x1|x2|x3|x4|x5|x5|x6|x7|x8|x9|x10
             |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
           / selection=stepwise(stop   = validate 
                                select = sl) 
             hierarchy=single;  
   output out=outData;
run;
The "Class Level Information" and "Dimensions" tables are shown in Output 42.2.9. The "Dimensions" table shows that while the model statement specifies 278 effects, after splitting the parameters corresponding to the levels of "c1," there are 439 split effects that are considered for entry or removal at each step of the selection process. Note that the total number of parameters considered is not affected by the split option.
Output 42.2.9 Class Level Information and Problem Dimensions


The GLMSELECT Procedure

Class Level Information
Class Levels Values
c1 8 * 1 2 3 4 5 6 7 8
c2 4 0 1 2 3
c3 5 tiny small average big huge
* Associated Parameters Split


Dimensions
Number of Effects 278
Number of Effects after Splits 439
Number of Parameters 661

The stop reason and stop details tables are shown in Output 42.2.10. Since the validation ASE is specified as the stopping criterion, the selection stops at step 11, where the validation ASE achieves a local minimum and the model at this step is the selected model.
Output 42.2.10 Stop Details


Selection stopped at a local minimum of the residual sum of squares of the validation data.


Stop Details
Candidate For    Effect    Candidate Validation ASE    Compare Validation ASE
Entry            x18       25.9851                     > 25.7462
Removal          x6*x7     25.7611                     > 25.7462

You find details of the selected model in Output 42.2.11. The list of selected effects confirms that parameters corresponding to levels "3" and "7" only of "c1" are in the selected model. Notice that the selected model with classification variable "c1" split contains 18 parameters, whereas the selected model without splitting "c1" has 45 parameters. Furthermore, by comparing the fit statistics in Output 42.2.7 and Output 42.2.11, you see that this more parsimonious model has smaller prediction errors on both the validation and test data.
Output 42.2.11 Details of the Selected Model


The GLMSELECT Procedure
Selected Model


The selected model is the model at the last step (Step 11).

Effects: Intercept c1_3 c1_7 c3 c1_3*c3 x1 x5 x6 x7 x6*x7 x10 x20


Analysis of Variance
Source            DF    Sum of Squares    Mean Square    F Value
Model             17         22111        1300.63200      51.88
Error            492         12334          25.06998
Corrected Total  509         34445


Root MSE 5.00699
Dependent Mean 21.09705
R-Square 0.6419
Adj R-Sq 0.6295
AIC 2172.72685
AICC 2174.27787
SBC 1736.94624
ASE (Train) 24.18515
ASE (Validate) 25.74617
ASE (Test) 22.57297

When you use a PARTITION statement to subdivide the analysis data set, an output data set created with the OUTPUT statement contains a variable named "_ROLE_" that shows the role each observation was assigned to. See the section OUTPUT Statement and the section Using Validation and Test Data for additional details.
The following statements use PROC PRINT to produce Output 42.2.12, which shows the first five observations of the outData data set.
proc print data=outData(obs=5); 
run; 
Output 42.2.12 Output Data Set with _ROLE_ Variable


Obs c3 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 c1 c2 y _ROLE_ p_y
1 tiny 0.18496 0.97009 0.39982 0.25940 0.92160 0.96928 0.54298 0.53169 0.04979 0.06657 0.81932 0.52387 0.85339 0.06718 0.95702 0.29719 0.27261 0.68993 0.97676 0.22651 2 1 11.4391 VALIDATE 18.5069
2 tiny 0.47579 0.84499 0.63452 0.59036 0.58258 0.37701 0.72836 0.50660 0.93121 0.92912 0.58966 0.29722 0.39104 0.47243 0.67953 0.16809 0.16653 0.87110 0.29879 0.93464 3 1 31.4596 TRAIN 26.2188
3 tiny 0.51132 0.43320 0.17611 0.66504 0.40482 0.12455 0.45349 0.19955 0.57484 0.73847 0.43981 0.04937 0.52238 0.34337 0.02271 0.71289 0.93706 0.44599 0.94694 0.71290 4 3 16.4294 VALIDATE 17.0979
4 tiny 0.42071 0.07174 0.35849 0.71143 0.18985 0.14797 0.56184 0.27011 0.32520 0.56918 0.04259 0.43921 0.91744 0.52584 0.73182 0.90522 0.57600 0.18794 0.33133 0.69887 5 3 15.4815 VALIDATE 16.1567
5 tiny 0.42137 0.03798 0.27081 0.42773 0.82010 0.84345 0.87691 0.26722 0.30602 0.39705 0.34905 0.76593 0.54340 0.61257 0.55291 0.73591 0.37186 0.64565 0.55718 0.87504 6 2 26.0023 TRAIN 24.6358

Cross validation is often used to assess the predictive performance of a model, especially when you do not have enough observations for test set validation. See the section Cross Validation for further details. The following statements provide an example where cross validation is used as the CHOOSE= criterion.
proc glmselect data=analysisData testdata=testData
               plots(stepAxis=number)=(criterionPanel ASEPlot);
   class c1(split) c2 c3(order=data);  
   model y =  c1|c2|c3|x1|x2|x3|x4|x5|x5|x6|x7|x8|x9|x10
             |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
           / selection = stepwise(choose = cv 
                                  select = sl)  
             stats     = press 
             cvMethod  = split(5) 
             cvDetails = all 
             hierarchy = single;
    output out=outData;
run; 
The CVMETHOD=SPLIT(5) option in the MODEL statement requests five-fold cross validation, with the five subsets consisting of observations {1, 6, 11, ...}, {2, 7, 12, ...}, and so on. The STATS=PRESS option requests that the leave-one-out cross validation predicted residual sum of squares (PRESS) also be computed and displayed at each step, even though this statistic is not used in the selection process.
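For reference (this is the standard definition, not something shown in the GLMSELECT output), the leave-one-out PRESS statistic is

$$ \mathrm{PRESS} = \sum_{i=1}^{n} \bigl( y_i - \hat{y}_{(i)} \bigr)^2 = \sum_{i=1}^{n} \Bigl( \frac{r_i}{1 - h_{ii}} \Bigr)^2, $$

where $\hat{y}_{(i)}$ is the prediction for observation $i$ from the model fit without observation $i$, $r_i$ is the ordinary residual, and $h_{ii}$ is the leverage of observation $i$.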
Output 42.2.13 shows how several fit statistics evolved as the selection process progressed. The five-fold CV PRESS statistic achieves its minimum at step 19. Note that this gives a larger model than was selected when the stopping criterion was determined using validation data. Furthermore, you see that the PRESS statistic has not achieved its minimum within 25 steps, so an even larger model would have been selected based on leave-one-out cross validation.

Output 42.2.13 Criterion Panel


Output 42.2.14 shows how the average squared error compares on the test and training data. Note that the ASE error on the test data achieves a local minimum at step 11 and is already slowly increasing at step 19, which corresponds to the selected model.

Output 42.2.14 Average Squared Error Plot


The CVDETAILS=ALL option in the MODEL statement requests the "Cross Validation Details" table in Output 42.2.15 and the cross validation parameter estimates that are included in the "Parameter Estimates" table in Output 42.2.16. For each cross validation index, the predicted residual sum of squares on the observations omitted is shown in the "Cross Validation Details" table and the parameter estimates of the corresponding model are included in the "Parameter Estimates" table. By default, these details are shown for the selected model, but you can request this information at every step with the DETAILS= option in the MODEL statement. You use the "_CVINDEX_" variable in the output data set shown in Output 42.2.17 to find out which observations in the analysis data are omitted for each cross validation fold.
Output 42.2.15 Breakdown of CV Press Statistic by Fold


Cross Validation Details
Index    Observations Fitted    Observations Left Out      CV PRESS
1        808                    202                       5059.7375
2        808                    202                       4278.9115
3        808                    202                       5598.0354
4        808                    202                       4950.1750
5        808                    202                       5528.1846
Total                                                    25293.5024

Output 42.2.16 Cross Validation Parameter Estimates


Parameter Estimates
Parameter            Cross Validation Estimates
                         1          2          3          4          5
Intercept 10.7617 10.1200 9.0254 13.4164 12.3352
c1_3 28.2715 27.2977 27.0696 28.6835 27.8070
c1_7 7.6530 7.6445 7.9257 7.4217 7.6862
c3 tiny -3.1103 -4.4041 -5.1793 -8.4131 -7.2096
c3 small 2.2039 1.5447 1.0121 -0.3998 1.4927
c3 average 0.3021 -1.3939 -1.2201 -3.3407 -2.1467
c3 big -0.9621 -1.2439 -1.6092 -3.7666 -3.4389
c3 huge 0 0 0 0 0
c1_3*c3 tiny -21.9104 -21.7840 -22.0173 -22.6066 -21.9791
c1_3*c3 small -20.8196 -20.2725 -19.5850 -20.4515 -20.7586
c1_3*c3 average -16.8500 -15.1509 -15.0134 -15.3851 -13.4339
c1_3*c3 big -12.7212 -12.1554 -12.0354 -12.3282 -13.0174
c1_3*c3 huge 0 0 0 0 0
x1 0.9238 1.7286 2.5976 -0.2488 1.2093
x1*c3 tiny -1.5819 -1.1748 -3.2523 -1.7016 -2.7624
x1*c3 small -3.7669 -3.2984 -2.9755 -1.8738 -4.0167
x1*c3 average 2.2253 2.4489 1.5675 4.0948 2.0159
x1*c3 big 0.9222 0.5330 0.7960 2.6061 1.2694
x1*c3 huge 0 0 0 0 0
x5 -1.3562 0.5639 0.3022 -0.4700 -2.5063
x6 -0.9165 -3.2944 -1.2163 -2.2063 -0.5696
x7 5.2295 5.3015 6.2526 4.1770 5.8364
x6*x7 6.4211 7.5644 6.1182 7.0020 5.8730
x10 1.9591 1.4932 0.7196 0.6504 -0.3989
x5*x10 3.6058 1.7274 4.3447 2.4388 3.8967
x15 -0.0079 0.6896 1.6811 0.0136 0.1799
x15*c1_3 -3.5022 -2.7963 -2.6003 -4.2355 -4.7546
x7*x15 -5.1438 -5.8878 -5.9465 -3.6155 -5.3337
x18 -2.1347 -1.5656 -2.4226 -4.0592 -1.4985
x18*c3 tiny 2.2988 1.1931 2.6491 6.1615 5.6204
x18*c3 small 4.6033 3.2359 4.4183 5.5923 1.7270
x18*c3 average -2.3712 -2.5392 -0.6361 -1.1729 -1.6481
x18*c3 big 2.3160 1.4654 2.7683 3.0487 2.5768
x18*c3 huge 0 0 0 0 0
x6*x18 3.0716 4.2036 4.1354 4.9196 2.7165
x20 4.1229 4.5773 4.5774 4.6555 4.2655

The following statements display the first eight observations in the outData data set.
proc print data=outData(obs=8);
run; 
Output 42.2.17 First Eight Observations in the Output Data Set


Obs c3 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 c1 c2 y _CVINDEX_ p_y
1 tiny 0.18496 0.97009 0.39982 0.25940 0.92160 0.96928 0.54298 0.53169 0.04979 0.06657 0.81932 0.52387 0.85339 0.06718 0.95702 0.29719 0.27261 0.68993 0.97676 0.22651 2 1 11.4391 1 18.1474
2 tiny 0.47579 0.84499 0.63452 0.59036 0.58258 0.37701 0.72836 0.50660 0.93121 0.92912 0.58966 0.29722 0.39104 0.47243 0.67953 0.16809 0.16653 0.87110 0.29879 0.93464 3 1 31.4596 2 24.7930
3 tiny 0.51132 0.43320 0.17611 0.66504 0.40482 0.12455 0.45349 0.19955 0.57484 0.73847 0.43981 0.04937 0.52238 0.34337 0.02271 0.71289 0.93706 0.44599 0.94694 0.71290 4 3 16.4294 3 16.5752
4 tiny 0.42071 0.07174 0.35849 0.71143 0.18985 0.14797 0.56184 0.27011 0.32520 0.56918 0.04259 0.43921 0.91744 0.52584 0.73182 0.90522 0.57600 0.18794 0.33133 0.69887 5 3 15.4815 4 14.7605
5 tiny 0.42137 0.03798 0.27081 0.42773 0.82010 0.84345 0.87691 0.26722 0.30602 0.39705 0.34905 0.76593 0.54340 0.61257 0.55291 0.73591 0.37186 0.64565 0.55718 0.87504 6 2 26.0023 5 24.7479
6 tiny 0.81722 0.65822 0.02947 0.85339 0.36285 0.37732 0.51054 0.71194 0.37533 0.22954 0.68621 0.55243 0.58182 0.17472 0.04610 0.64380 0.64545 0.09317 0.62008 0.07845 7 1 16.6503 1 21.4444
7 tiny 0.19480 0.81673 0.08548 0.18376 0.33264 0.70558 0.92761 0.29642 0.22404 0.14719 0.59064 0.46326 0.41860 0.25631 0.23045 0.08034 0.43559 0.67020 0.42272 0.49827 1 1 14.0342 2 20.9661
8 tiny 0.04403 0.51697 0.68884 0.45333 0.83565 0.29745 0.40325 0.95684 0.42194 0.78079 0.33106 0.17210 0.91056 0.26897 0.95602 0.13720 0.27190 0.55692 0.65825 0.68465 2 3 14.9830 3 17.5644

This example demonstrates the usefulness of effect selection when you suspect that interactions of effects are needed to explain the variation in your dependent variable. Ideally, a priori knowledge should be used to decide what interactions to allow, but in some cases this information might not be available. Simply fitting a least squares model allowing all interactions produces a model that overfits your data and generalizes very poorly.
The following statements use forward selection with selection based on the SBC criterion, which is the default selection criterion. At each step, the effect whose addition to the model yields the smallest SBC value is added. The STOP=NONE suboption specifies that this process continue even when adding an effect increases the SBC statistic, and so the selection terminates at the full least squares model. The BUILDSSCP=FULL option is specified in a PERFORMANCE statement, since building the SSCP matrix incrementally is counterproductive in this case. See the section for details. Note that if all you are interested in is a full least squares model, then it is much more efficient to simply specify SELECTION=NONE in the MODEL statement. However, in this example the aim is to add effects in roughly increasing order of explanatory power.
proc glmselect data=analysisData testdata=testData plots=ASEPlot;
   class c1 c2 c3(order=data);  
   model y =  c1|c2|c3|x1|x2|x3|x4|x5|x5|x6|x7|x8|x9|x10
             |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
           / selection=forward(stop=none) 
             hierarchy=single;
   performance buildSSCP = full;
run;

ods graphics off;
The ASE plot shown in Output 42.2.18 clearly demonstrates the danger in overfitting the training data. As more insignificant effects are added to the model, the growth in test set ASE shows how the predictions produced by the resulting models worsen. This decline is particularly rapid in the latter stages of the forward selection, because the use of the SBC criterion results in insignificant effects with lots of parameters being added after insignificant effects with fewer parameters.

Output 42.2.18 Average Squared Error Plot
