Search This Blog

Thursday, November 14, 2013

Logistic Regression in the structural equation modeling (SEM) framework


Dear lavaan users,

As far as I understand, regression models are a special case of the more general structural equation models. I wonder whether it is possible to model logistic regression in lavaan? Wouldn't one just need to transform the left-hand side of the formula with the logit? Probably it's much more complicated, isn't it?



It is. lavaan cannot handle logistic regression (yet). It can handle probit regression (using the probit link instead of the logit link) for binary outcomes. But the parameters can only be estimated using WLS(MV), not ML (at least not in 0.5-9), although the results are typically almost identical. The only downside is that you cannot directly interpret the (exponentiated) regression coefficients as odds (ratios).

This would be an example (if y is binary):

model <- 'y ~ x1 + x2 + x3'
fit <- sem(model, data = myData, ordered = "y")
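To see the whole workflow in one place, here is a self-contained sketch; the simulated data, the coefficients and the variable names x1-x3 and y are illustrative additions, not part of the original answer:

library(lavaan)

set.seed(1)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
ystar  <- 0.8 * x1 - 0.5 * x2 + 0.3 * x3 + rnorm(n)   # latent continuous response
myData <- data.frame(x1, x2, x3, y = as.integer(ystar > 0))  # observed binary y

model <- 'y ~ x1 + x2 + x3'
fit <- sem(model, data = myData, ordered = "y")   # probit link, WLSMV estimation
summary(fit)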



Beyond logistic regression: structural equations modelling for binary variables and its application to investigating unobserved confounders


Emil Kupek


     This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Structural equation modelling (SEM) has been increasingly used in medical statistics for solving a system of related regression equations. However, a great obstacle for its wider use has been its difficulty in handling categorical variables within the framework of generalised linear models.

Methods

A large data set with a known structure among two related outcomes and three independent variables was generated to investigate the use of Yule's transformation of the odds ratio (OR) into the Q-metric by (OR-1)/(OR+1) to approximate Pearson's correlation coefficients between binary variables, whose covariance structure can then be further analysed by SEM. The percentage of correctly classified events and non-events was compared with the classification obtained by logistic regression. The performance of SEM based on the Q-metric was also checked on a small (N = 100) random sample of the generated data and on a real data set.

Results

SEM successfully recovered the generated model structure. SEM of the real data suggested a significant influence of a latent confounding variable which would not have been detectable by standard logistic regression. SEM classification performance was broadly similar to that of logistic regression.

Conclusion

The analysis of binary data can be greatly enhanced by Yule's transformation of odds ratios into an estimated correlation matrix that can be further analysed by SEM. The interpretation of results is aided by expressing them as odds ratios, which are the most frequently used measure of effect in medical statistics.

Background

Statistical problems that require going beyond standard logistic regression

Although logistic regression has become the cornerstone of modelling categorical outcomes in medical statistics, separate regression analysis for each outcome of interest is hardly challenged as a pragmatic approach even in situations where the outcomes are naturally related. This is common in process evaluation, where the same variable can be an outcome at one point in time and a predictor of another outcome in the future. For example, preterm delivery is both an important obstetric outcome and a risk factor for low birthweight, which in turn can adversely affect future health. The sequential nature of these outcomes is not encompassed by repeated measures models, which deal with the same outcome at different time points. Another example of a research problem difficult to handle by a logistic regression model is when an outcome is determined not only by direct influences of the predictor variables but also by their unobserved common cause. For example, survival time since the onset of an immune system disease may be adversely affected by the concomitant occurrence of various markers of disease progression indicating immunosuppression as an underlying common factor, the latter being an unobserved latent variable whose estimation requires solving a system of related regression equations.
Structural equation modelling (SEM) is a very general statistical framework for dealing with the above issues. In recent years, it has been increasingly used in medical statistics. In addition to traditional areas such as the psychometric properties of health questionnaires and tests, behavioural genetics [1], measurement errors [2] and covariance structure in mixed regression models [3] have received particular attention. Beyond specific applications, important research methodology issues in SEM have been given more space in medical statistics, among which a comparison with multiple regression [4], the relevance of latent variable means in clinical trials [5] and the power of statistical tests [6] deserve special attention.
However, a great obstacle to wider use of SEM has been its difficulty in handling categorical variables. The aim of this paper is to briefly review the main aspects of this difficulty and to demonstrate a new approach to this problem based on a simple transformation. Two examples, with simulated and real data, are provided to illustrate this approach.
SEM includes both observed and unobserved (latent) variables such as common factors and measurement errors. The Linear Structural Relationships (LISREL) model [7] was the first to spread in psychometric applications due to the availability of software. Other formulations of SEM and corresponding software emerged (see [8] for an overview). The details of these models, as well as important issues regarding their identifiability, estimation and robustness, are beyond the scope of this work but an illustration of the situations where SEM is needed is presented instead (Figure 1). As a general rule, SEM is indicated when more than one regression equation is necessary for statistical modelling of the phenomena under investigation.
Figure 1. Statistical problems needing the SEM approach.
The left part of Figure 1 shows a situation where two outcomes, denoted Y1 and Y2, are mutually related (a feed-back loop) and influenced by two predictors, denoted X1 and X2. For example, the outcomes could be demand and supply of a particular health service or risk perception and incidence of a particular health problem. The predictor variables' error terms, denoted e1 and e2, may be correlated (r) if an important variable influencing both predictors is omitted, i.e. in the case of bias in exposure measures. The terms d1 and d2 indicate disturbances of the two outcomes. The right part of Figure 1 illustrates a combination of common factors and regression model. In this case, it is of interest to test whether the outcome Y is determined not only by direct influences of the predictor variables, denoted X1, X2, X3 and X4, but also by their latent determinant as indicated by the regression coefficient b.
SEM has received many criticisms, most of which have been concerned with vulnerability of complex models relying on many assumptions, as well as with uncritical use and interpretation of SEM. These are well placed concerns but are not intrinsic to SEM; even well known and widely applied techniques such as regression share the same concerns. Complex phenomena require complex models whose inferential aspects are more prone to error as the number of parameters increases. SEM is often the only statistical framework by which many of these issues can be addressed by testing and comparing the models obtained [9].

Handling categorical variables in SEM

Specific criticism regarding the treatment of categorical and ordinal variables in SEM has been a strong deterrent to its wider use. The naive treatment of binary and ordered categorical variables as if they were normally distributed in some SEM applications was partly due to the lack of viable alternatives in its early days. The inadequate use of standardized regression coefficients as measures of effect in some SEM applications was also criticised [10]. Even when the distributional properties of categorical variables were taken into account, the interpretation of SEM parameter estimates in terms of impact measures such as attributable risk was not applied. Standard errors and confidence limits, rarely used in SEM, generally underestimate structural model uncertainties such as the selection of relevant variables and the correct specification of their influences.
A recent review of handling categorical and other non-normal variables in SEM [11] listed four main strategies: a) asymptotic distribution free (ADF) estimators adjusting for non-normality by taking into account kurtosis in the joint multivariate distribution [12], b) the use of robust maximum likelihood estimation or resampling techniques such as the jackknife or bootstrap to obtain the standard errors of SEM parameters, as these are most affected by departure from multivariate normality [13], c) calculating polyserial, tetrachoric or polychoric correlations for pairs of variables with non-normal joint distribution by assuming that these have an underlying (latent) continuous scale whose large sample joint distribution is bivariate normal, then using these correlations as the input for SEM [14], and d) estimating probit or logit model scores for observed categorical variables at the first level, then proceeding with SEM based on these scores at the second level [15]. ADF estimation generally requires large samples to keep the type II error at a reasonable level, and extremely non-normal variables such as binary ones may be difficult to handle with sufficient precision. The last two strategies depend critically on how well the first-level model fits the data.
A review of statistical models for categorical data reveals the lack of a method capable of handling more than one regression equation [16]. Although log-linear models for contingency tables may analyse related categorical outcomes and their relationship with independent variables, possibly complex interactions between the variables in the model do not indicate the direction of influences as in regression models. This underlines the need for a SEM framework for categorical data analysis in order to handle both dimensionality reduction and regression techniques within the same model (cf. the right part of Figure 1).
Two major recent developments in handling categorical data include Muthén's extension of SEM to the 'latent variable modeling' approach [17] and an extension of generalized linear models to latent and mixed variables under the GLLAMM (Generalized Linear Latent And Mixed Models) framework [18]. Despite coming from different statistical backgrounds, both Muthén's Mplus software [19] and GLLAMM are capable of modelling a mixture of continuous, ordinal and nominal scale variables, multiple groups (including clusters) and hierarchical (multi-level) data, random effects, missing data, latent variables (including latent classes and latent growth models) and discrete-time survival models. Both of these developments are based on the vision of generalized linear models as a unifying framework for both continuous and categorical variables, where the latter are first transformed into continuous linear functions and subsequently modelled by SEM. This paper follows the same line but proposes a different transformation for categorical variables, so far unused in SEM. A simulated and a real data example with a latent confounding variable are presented.

Methods

Data generation and transformation

This work illustrates the application of SEM for binary variables using Yule's transformation to approximate the matrix of Pearson's correlation coefficients from odds ratios (OR) by the well known formula (OR-1)/(OR+1). The first example is based on known data generating processes to avoid uncertainty about the true model, virtually inevitable for empirical data. A data set with 5000 observations was generated to allow normal theory approximation. First, three continuous random variables, denoted x1 to x3, were created from the uniform distribution. The variables were uncorrelated in the population. Their binary versions, denoted BIN1 to BIN3, were obtained by coding the values above the mean as one versus zero otherwise. Two continuous dependent variables were created by the following equations: m = 1.5 x1 + 2 x2 + e1 and y = 0.5 x2 - 2.5 x3 + 1.3 m + e2, with e1 and e2 being normally distributed N(0,1) random errors generated from different seeds. The binary versions of the dependent variables, denoted MBIN and YBIN, were created by applying the logistic regression classification rule, i.e. score 1 if exp(m)/(1+exp(m)) and exp(y)/(1+exp(y)) exceed 0.5 versus 0 otherwise, where 'exp' stands for exponentiation.
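A sketch of this generating process in R (the seed and the object names are arbitrary choices, since the paper does not report them):

set.seed(123)
n  <- 5000
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)   # uncorrelated uniforms

# Binary versions: 1 above the mean, 0 otherwise
BIN1 <- as.integer(x1 > mean(x1))
BIN2 <- as.integer(x2 > mean(x2))
BIN3 <- as.integer(x3 > mean(x3))

e1 <- rnorm(n); e2 <- rnorm(n)                   # N(0,1) errors, different draws
m  <- 1.5 * x1 + 2.0 * x2 + e1
y  <- 0.5 * x2 - 2.5 * x3 + 1.3 * m + e2

# Logistic classification rule: 1 if exp(.)/(1+exp(.)) exceeds 0.5
MBIN <- as.integer(exp(m) / (1 + exp(m)) > 0.5)
YBIN <- as.integer(exp(y) / (1 + exp(y)) > 0.5)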
Observed odds ratios between the variables of interest in the generated data sets are reported in table 1. The structural relationships among the variables in the second data set are depicted in Figure 2.
Table 1. Simulated data: Observed odds ratios (OR), associated 95% confidence intervals (CI) and SEM regression coefficients with corresponding standard errors (SE) obtained via ML estimation (N = 5000)
Figure 2. Simulated model.
In addition, a random sample of 100 observations was taken from the generated data set with 5000 observations in order to illustrate small sample performance of the SEM based on Yule's transformation compared to logistic regression. Finally, a real data example with related binary obstetric outcomes, including premature birth, lower segment Caesarian section, low birthweight (<2500 g) and utilization of a special baby care unit, was used to compare SEM with the standard logistic regression technique. This type of data was extracted from the obstetric records of 10574 multiparous women who delivered singleton pregnancies between 1st August 1994 and 31st July 1995 in nine maternity units in England and Wales [20].
Yule's transformation was used to estimate the matrix of Pearson's correlation coefficients for both the simulated and the real obstetric data. The correlations were used as input for SEM. For the simulated data, both the logistic and the SEM analyses were repeated for a random subset of 100 observations taken from the original data set. Maximum likelihood (ML) estimation was used.
SEM raw regression coefficients were back-transformed from Q-metric into odds metric by (1+Q)/(1-Q) to get an impact measure for the binary predictor variables. SAS software procedures CALIS and LOGISTIC were used for SEM and logistic analysis, respectively [21].
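As an illustrative R sketch of these two steps, continuing from the simulated variables above (the function names yule_q and q_to_odds are mine, not the paper's):

# Yule's transformation: odds ratio of a 2x2 table -> Q in (-1, 1),
# used as an approximation to the Pearson correlation
yule_q <- function(a, b) {
  tab <- table(a, b)                                  # 2x2 cross-tabulation
  or  <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
  (or - 1) / (or + 1)                                 # Yule's Q
}

# Assemble the estimated correlation matrix for the binary variables
vars <- cbind(BIN1, BIN2, BIN3, MBIN, YBIN)
p <- ncol(vars)
R <- diag(p)
for (i in 1:(p - 1)) for (j in (i + 1):p)
  R[i, j] <- R[j, i] <- yule_q(vars[, i], vars[, j])
colnames(R) <- rownames(R) <- colnames(vars)          # input matrix for SEM

# Back-transformation of a Q-metric coefficient to the odds metric
q_to_odds <- function(q) (1 + q) / (1 - q)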

Evaluation of classification performance

Raw data residuals were calculated as the difference between observed and SEM-predicted values for both data sets. The predicted values were calculated by multiplying the raw regression parameters obtained in SEM with the corresponding observed values of the predictor variables. The back-transformation from SEM parameters, denoted S, to the odds metric is given by (1+S)/(1-S) and provides the odds of being a case for each independent variable; summing these odds over the independent variables gives the odds of being a case for each profile of independent variables. Odds greater than one were classified as SEM-predicted cases, and as non-cases otherwise.
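One reading of this classification rule as an R sketch, assuming coefs holds the raw SEM regression parameters in the Q-metric and X is the 0/1 matrix of predictor values (both names are illustrative):

classify_sem <- function(coefs, X) {
  odds  <- sweep(X, 2, (1 + coefs) / (1 - coefs), `*`)  # per-variable odds of being a case
  total <- rowSums(odds)                                # summed over the predictor profile
  as.integer(total > 1)                                 # odds > 1 => SEM-predicted case
}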
For logistic regression, the percent of correctly classified outcomes was calculated using the cut-off point of 0.5 for the estimated probability of outcome variables.
The classification performance of SEM and logistic regression was compared on a real data set with several obstetric outcomes of interest [20] and on a small random sample of 100 observations taken from the simulated data set of 5000 observations.

Power analysis

Statistical power analysis used a calculation based on the non-central chi-squared distribution, providing the number of observations required to achieve 90% power (beta, or type II error, of 0.10), denoted as N [22,23]. If n denotes the number of observations used in SEM, k denotes the multiplying factor for a chosen power level, degrees of freedom and alpha (type I error), and d denotes the chi-square difference between the SEM with and without the parameter(s) of interest, then N = k*n/d gives the required sample size. Releasing one parameter at a time (one degree of freedom), with a fixed type I error of 5% and type II error of 10%, points to the tabulated k-value of 10.51 [23]. This approach assumes that the model is correctly specified.
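The formula is easy to check with a small helper; the n and d values below are made up for illustration:

required_n <- function(n, d, k = 10.51) k * n / d   # N = k*n/d

required_n(n = 100, d = 2.5)   # about 420 observations needed for 90% power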

Results

Table 1 contains observed odds ratios for the simulated data set and their decomposition into regression effects based on SEM using Yule's transformation of odds ratios.
A standard approach to the analysis of binary variables using multivariate logistic regression for the simulated data is presented in Table 2.
Table 2. Multivariate logistic regression for generated data: parameter estimates (standard errors) for large (N = 5000) and small (N = 100) samples
The normal probability plot of raw data residuals between observed outcomes and the estimated probability of outcome based on SEM for the simulated data showed some departure from the normal distribution (Figure 3). Nevertheless, the residuals fell within the normal range. Both the SEM and logistic regression models for the real obstetric data (Figure 4) showed satisfactory fit regarding individual data residuals.
Figure 3. Normal probability plots for raw data residuals in the simulated data model with two related outcomes: YBIN (top) and MBIN (bottom). An asterisk may represent up to 30 residuals.
Figure 4. Comparison of SEM and logistic model estimates for the obstetric data example.
The comparison of classification performance for SEM versus logistic regression showed slightly better results with the latter for one outcome in a small sample analysis and very similar results for all other comparisons (Table 4). True positive fraction for events was always considerably higher for SEM compared to logistic regression, albeit at the expense of lower true negative fraction for non-events.
Table 3. Small sample (N = 100) parameter estimates and their standard errors (SE) for SEM using Q-statistic input (correlations estimated via Yule's transformation)
Table 4. Percentage of correctly classified events for logistic regression (LR) models in Table 2 versus SEM in Tables 1 and 3
Logistic regression showed better overall classification rate due to better prediction of non-events (Table 5). On the other hand, events were better predicted by SEM.
Table 5. Classification performance for the obstetric data example (N = 10574): logistic regression (LR) and SEM with Q-metric input (see Figure 4)
SEM permitted further investigation of the unobserved determinant of observed obstetric risk factors in predicting the need for specialised neonatal care through a latent variable. A model was tested assuming that a common cause of some of the risk factors is a latent confounding variable influencing both observed risk factors and the outcome of interest (special baby care unit) and adding predictive power over and above the observed risk variables (Figure 5). The estimation was possible upon solving the observed variables' parameters first (so-called path analysis) and fixing the factor loading for preterm delivery to the value of one – a convention allowing the comparison of the contribution of the other two observed risk variables to the unobserved latent risk using premature birth as unit risk. The factor loadings (standard errors) for Caesarian section and low birthweight were -0.3948 (0.003) and 0.8630 (0.001), respectively.
Figure 5. SEM with latent risk variable for the obstetric data example.
The relevance of the latent variable for predicting the use of special care baby unit was also tested by linear regression with raw data SEM residuals (observed minus SEM predicted probability of using special care baby unit) as the dependent variable and the latent variable scores as the predictor variable. The predictor was estimated at 0.0874 (standard error 0.0053) and was highly significant (p < 0.001).
The model suggested that a propensity for premature birth resulting in low birthweight upon delivery without Caesarian section increased the chances of special neonatal care utilization. The raw SEM coefficient representing this effect, denoted b4 in Figure 5, was estimated at 0.0956 with a corresponding standard error of 0.016, leading to a highly significant t-value of 61.54. Transforming back to the odds metric via (1+b4)/(1-b4) resulted in an odds ratio of 1.21 with a 95% confidence interval from 1.14 to 1.29. Although a multivariate logistic regression model for special baby care unit utilization did not find the above combination of risk factors statistically significant when it was added as an interaction term to the risk factors themselves (odds ratio 1.16 with 95% confidence interval from 0.72 to 1.86), it should be stressed that this is a different model from the above SEM.
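The back-transformation of b4 can be verified directly in R:

b4 <- 0.0956
(1 + b4) / (1 - b4)   # 1.2114..., i.e. the odds ratio of 1.21 reported above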
Statistical power analysis found that only the b3 parameter in Table 3 would require a larger sample size (N = 5918) than the one available to achieve 90% power.

Discussion

The analysis demonstrated the viability of SEM using Yule's Q-transformation of the odds ratio as input for binary variable models. At the level of individual data points, the raw data residuals were within the normal range, and the discriminant rule for classifying outcomes into events and non-events based on SEM Q-scores performed slightly worse than, but still similarly to, the results based on the standard approach using logistic regression. The conclusion holds for the small sample example with generated data and for the real data set tested here. All these elements point to the feasibility and utility of SEM using Yule's transformation for binary data, principally when complex relationships between the variables are present. For example, the investigation of the common cause of obstetric risk indicators on the outcome of interest identified a latent confounding variable which increased the chances of utilizing special neonatal care over and above the impact of the same risk indicators taken as independent predictors (Figure 5). The interpretation of the latent variable may lead to hypothesising a health service routine of treating premature births in a particular way (i.e. refraining from Caesarian section) or a biological propensity for birth complications, with both of these alternatives leading to an increased need for intensive neonatal care. This illustrates how SEM helps generate and investigate complex hypotheses not accessible by other methods. Yule's transformation may be helpful in preparing binary data for SEM. By using the odds ratio both as a starting point and for the presentation of results, the proposed transformation facilitates the interpretation of effects in the model.
For the alpha level < 0.05, both the univariate t-test and the likelihood ratio test for the b3 parameter being equal to zero indicated its statistical significance in SEM (details not shown) despite non-significance of the observed odds ratio (Table 3). However, the power of this test is less than the pre-established criterion of 90%, and the impact of this parameter is clearly inferior to that of the other predictors in the model. The tendency to include extra parameters was also reported for SEM ML estimates where ordered categorical variables were treated as continuous [24] and may be expected for ADF estimates in SEM with raw binary data input. It should be noted that binary variables and the amount of noise introduced in the model analysed are serious obstacles to specifying the correct relationship between the variables for ADF estimation methods, typically applied to data with smaller departures from the multivariate normal distribution. However, there has been some progress in developing both large sample and finite sample robustness of SEM parameters in handling non-normal data and outliers [25,26].
The advantage of SEM over separate logistic regression models for each outcome is twofold. First, SEM can model all regression equations simultaneously, thus providing a flexible framework for testing a range of possible relationships between the variables in the model, including mediating effects and possible latent confounding variables. Second, on a more general level, SEM parameters can quantify the contribution of each predictor to a covariance structure such as a common factors model (Figure 5 is an example), whereas neither the interaction of continuous variables, defined as their crossproduct, nor the interaction terms for categorical independent variables in a regression model can do this. Modelling a common cause of observed risk factors and its influence on the outcome of interest is impossible outside the SEM framework. Genetic propensity for various diseases is probably the most vivid example of the need for the above model, enabling an investigation of the latent confounding variables frequently cited in the study design literature. This includes latent growth models with a relatively long sequence of indicators of an evolving process such as a disease whose symptoms are typically binary indicators used for statistical modelling of the outcomes of interest. It is no coincidence that some recent developments in regression modelling have been marked by efforts to integrate regression with a variety of covariance structure models [1-3].
Another advantage of SEM using Yule's Q-transformation of odds ratios for binary variables over the two-level approach, based on probit or logit models or estimated correlations for non-normal variables as the first level and SEM as the second level, may lie in the fact that the former is based on data transformation rather than estimation, thus avoiding the sources of error due to the latter. However, this view is not universally accepted, and the discussion goes back to the beginning of the 20th century, when Karl Pearson and George Udny Yule argued over whether a measure of association between two binary variables needs to assume an underlying continuum and bivariate normal distribution [16]. While the former based his calculation of the tetrachoric correlation on these assumptions, the latter disagreed, arguing that some categorical variables are inherently discrete, so that the continuum assumption is tenuous and in fact unnecessary, because a measure of association for such cases can be obtained directly from the cell counts in a 2 by 2 table, as in the odds ratio and its transformation, today known as Yule's Q. Although the popularity of the odds ratio over Pearson's correlation in medical statistics points to a prevailing tendency to embrace Yule's view in this field, an attempt to reconcile the two viewpoints has been made [16].
The fact that Yule's transformation is well known and allows an easy back-transformation of model parameters to the odds metric makes them easier to interpret as effect measures. Although SEM estimates based on existing methods for handling categorical variables could also be converted to an odds ratio metric for the purpose of interpretation, this has been done very rarely in the published literature, and almost exclusively with GLLAMM.
Usual tools for evaluating SEM fit such as the analysis of residuals are available not only for input covariance matrix but also for individual data points. When classification of outcomes into events and non-events is of interest, sensitivity and specificity parameters can easily be obtained, thus making this approach applicable to a wide range of research problems.
Although other measures of comparative model fit, abundant in the SEM literature [9], may also be useful to assess various aspects of this important issue, classification performance is a preferred measure of predictive power in practice, particularly if cross-validated. For example, both data sets analysed here used saturated models which perfectly predicted the input correlation matrices, so the fit indices based on the discrepancy between observed and SEM-predicted correlation matrices obtained maximum values possible, but this was not particularly informative. On the other hand, SEM fit indices may be useful to select the best model in many other situations.
Despite the advantages of SEM mentioned above, this work has several limitations. First, Yule's Q is not exactly Pearson's correlation coefficient but rather an approximation to it, which seems reasonable in large samples and for the types of models tested. Although the illustration of small sample performance seems satisfactory compared to logistic regression models, the approach is yet to be tested fully for a much wider range of dependency structures than presented here in order to evaluate the robustness of the parameters obtained. However, this requirement is a consequence of the complex modelling issues which often arise in SEM, as Yule's Q is no new estimator. Therefore, the findings about the properties of ML, ADF and least squares estimators in SEM, accumulated over almost three decades of research, apply here. This is the main reason why no simulation study of SEM parameter estimates has been attempted in this work. Second, the lack of a simple rule for variable selection in SEM and the need to test a variety of models before selecting the acceptable ones can make it difficult to use this approach for the quick decision making often favoured in routine applications of medical statistics. Model selection based on Bayes factors [27] may be helpful in this situation. Finally, although the logit is the most popular transformation in modelling binary outcomes in medical statistics, there are many other link functions which may be more suitable for a particular model. GLLAMM [18] theory and software seem to be the most complete framework for such investigation to date.
When SEM variables are not on the same scale or their variances differ significantly, covariance matrix input should be preferred to correlation matrix input. Although SEM standard errors are less accurate with the latter, even with sample sizes of a few hundred, the data used here had much larger sample sizes and were therefore less influenced by the type of input matrix. In addition, all SEM variables were input on the same scale, i.e. in the odds metric. On the other hand, many SEM applications are performed on moderate and small samples, where the covariance matrix input would be preferable. With a multivariate normal distribution, the sample covariance matrix contains all the necessary information for SEM. However, with non-normal data, kurtosis was shown to be the most relevant parameter to take into account when correcting the standard errors of SEM parameters, as in ADF estimators [12]. If means are of interest in SEM, the input covariance matrix can be augmented with this information as well. Another way of dealing with SEM standard errors from non-normal data is bootstrapping, already included in several statistical packages with a SEM module.
If the raw regression parameters from SEM exceed the domain of the inverse of Yule's transformation function, i.e. the interval from -1 to 1, then standardized SEM parameters can be used to get the odds metric via (1+Q)/(1-Q). Alternatively, a transformation mapping the raw SEM coefficients to this interval may be used, such as Yule's or logit, with corresponding back-transformation of the results to odds metric.
Although this work does not address the question of the association between continuous and dichotomous variables, extensions to include this case can be envisaged. One strategy would be to transform continuous variables into ordered categories, with one of them serving as a baseline, and then calculate odds ratios using logistic regression. Subsequently, Yule's transformation can be used to convert the odds into the correlation metric to be analysed by SEM. Another strategy would be to use polychoric or polyserial correlations for the above situation and only substitute tetrachoric correlation by Yule's Q, particularly when the structural relationships of interest are between the binary variables in the model and some exogenous variables are ordered or continuous.
Further research is needed to elucidate various aspects of SEM based on Q-metric input, particularly small sample performance for a wide range of statistical models and their classification performance. In addition, the variance of odds ratios may be used to weight the estimated correlation matrix, so that the Q-metric input for SEM takes into account the precision of the original scale and not only the magnitude of the association between two binary variables. Relative fit measures such as those recently proposed by Agresti & Caffo [28] may help in selecting among competing models of different kinds.

Conclusion

SEM based on Q-transformation of odds ratios can be used to investigate complex dependency structures such as latent confounding factors and their influences on both observed risk factors and categorical outcome variables.

Competing interests

The author(s) declare that they have no competing interests.

References

1. Posthuma D, de Geus EJ, Neale MC, Hulshoff Pol HE, Baare WEC, Kahn RS, Boomsma D: Multivariate genetic analysis of brain structure in an extended twin design. Behav Genet 2000, 30:311-319.

2. Plummer MT, Clayton DG: Measurement error in dietary assessment, an investigation using covariance structure models. Stat Med 1993, 12:925-948.

3. Littell RC, Pendergast J, Natarajan R: Modelling covariance structure in the analysis of repeated measures data. Stat Med 2000, 19:1793-1819.

4. Wall MM, Li R: Comparison of multiple regression to two latent variable techniques for estimation and prediction. Stat Med 2003, 22:3671-3685.

5. Donaldson GW: General linear contrasts on latent variable means: structural equation hypothesis tests for multivariate clinical trials. Stat Med 2003, 22:2893-2917.

6. Miles J: A framework for power analysis using a structural equation modelling procedure. BMC Med Res Methodol 2003, 3:27.

7. Jöreskog KG, Sörbom D: LISREL VI. Uppsala: Uppsala University Press; 1981.

8. Structural Equation Modelling [http://www.gsu.edu/~mkteer]

9. Tanaka JS: Multifaceted conceptions of fit in structural equation models. In Testing structural equation models. Edited by Bollen KA, Long J. Newbury Park, CA: Sage Publications; 1993:10-39.

10. Greenland S, Schlesselman JJ, Criqui MH: The fallacy of employing standardized regression coefficients and correlations as measures of effect. Am J Epidemiol 1986, 2:203-208.

11. Kupek E: Log-linear transformation of binary variables: a suitable input for structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal 2005, 12:35-47.

12. Browne MW: Asymptotically distribution-free methods for the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology 1984, 37:62-83.

13. Bollen KA, Stine RA: Bootstrapping goodness-of-fit measures in structural equation models. In Testing structural equation models. Edited by Bollen KA, Long J. Newbury Park, CA: Sage Publications; 1993:111-135.

14. Jöreskog KG, Sörbom D: PRELIS: A preprocessor for LISREL. Chicago, IL: Scientific Software International; 1994.

15. Muthén BO: Goodness of fit with categorical and other nonnormal variables. In Testing structural equation models. Edited by Bollen KA, Long J. Newbury Park, CA: Sage Publications; 1993:205-234.

16. Agresti A: An introduction to categorical data analysis. New York: John Wiley & Sons; 1996.

17. Muthén BO: Beyond SEM: general latent variable modeling. Behaviormetrika 2002, 29:81-117.

18. Skrondal A, Rabe-Hesketh S: Generalized latent variable modelling: multilevel, longitudinal and structural equation models. Boca Raton: Chapman & Hall/CRC; 2004.

19. Muthén LK, Muthén BO: Mplus 2.0 [computer program]. Los Angeles, CA: Muthén & Muthén; 2001.

20. Vause S, Maresh M: Indicators of quality of antenatal care: a pilot study. Br J Obstet Gynaecol 1999, 106:197-205.

21. SAS Institute: The CALIS procedure. In SAS/STAT User's Guide. Cary, NC: SAS Institute; 1989:245-366.

22. Satorra A, Saris WE: The power of the likelihood ratio test in covariance structure analysis. Psychometrika 1985, 50:83-90.

23. Dunn G, Everitt B, Pickles A: Modelling covariances and latent variables using EQS. London: Chapman & Hall; 1993.

24. Bentler PM, Newcomb MD: Linear structural modeling with nonnormal continuous variables. Application: relations among social support, drug use, and health in young adults. In Statistical models for longitudinal studies of health. Edited by Dwyer JH, Feinleib M, Lippert P. Oxford: Oxford University Press; 1992:132-160.

25. Yuan KH, Bentler PM: Robust mean and covariance structure analysis. British Journal of Mathematical and Statistical Psychology 1998, 51:63-88.

26. Yuan KH, Bentler PM: Normal theory based test statistics in structural equation modeling. British Journal of Mathematical and Statistical Psychology 1998, 51:289-309.

27. Raftery AE: Bayesian model selection in structural equation models. In Testing structural equation models. Edited by Bollen KA, Long J. Newbury Park, CA: Sage Publications; 1993:163-180.

28. Agresti A, Caffo B: Measures of relative model fit. Computational Statistics and Data Analysis 2002, 39:127-136.


Tuesday, November 12, 2013

Interpreting Regression Output


Introduction 

This guide assumes that you have at least a little familiarity with the concepts of linear multiple regression, and are capable of performing a regression in some software package such as Stata, SPSS or Excel. You may wish to read our companion page Introduction to Regression first. For assistance in performing regression in particular software packages, there are some resources at the UCLA Statistical Computing Portal.

Brief review of regression 

Remember that regression analysis is used to produce an equation that will predict a dependent variable using one or more independent variables. This equation has the form
  • Y = b1X1 + b2X2 + ... + A
where Y is the dependent variable you are trying to predict, X1, X2 and so on are the independent variables you are using to predict it, b1, b2 and so on are the coefficients or multipliers that describe the size of the effect the independent variables are having on your dependent variable Y, and A is the value Y is predicted to have when all the independent variables are equal to zero.
In the Stata regression shown below, the prediction equation is price = -294.1955 (mpg) + 1767.292 (foreign) + 11905.42 - telling you that price is predicted to increase 1767.292 when the foreign variable goes up by one, decrease by 294.1955 when mpg goes up by one, and is predicted to be 11905.42 when both mpg and foreign are zero.
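The same model can be reproduced in R, assuming Stata's auto dataset has been exported and loaded as a data frame named auto (an assumption; any loading route works):

fit <- lm(price ~ mpg + foreign, data = auto)   # price predicted from mpg and foreign
coef(fit)
# (Intercept)         mpg     foreign
#    11905.42     -294.20     1767.29    (cf. the Stata output above)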
Coming up with a prediction equation like this is only a useful exercise if the independent variables in your dataset have some correlation with your dependent variable. So in addition to the prediction components of your equation--the coefficients on your independent variables (betas) and the constant (alpha)--you need some measure to tell you how strongly each independent variable is associated with your dependent variable.
When running your regression, you are trying to discover whether the coefficients on your independent variables are really different from 0 (so the independent variables are having a genuine effect on your dependent variable) or if alternatively any apparent differences from 0 are just due to random chance. The null (default) hypothesis is always that each independent variable is having absolutely no effect (has a coefficient of 0) and you are looking for a reason to reject this theory.

P, t and standard error 

The t statistic is the coefficient divided by its standard error. The standard error is an estimate of the standard deviation of the coefficient, the amount it would vary across repeated samples. It can be thought of as a measure of the precision with which the regression coefficient is measured. If a coefficient is large compared to its standard error, then it is probably different from 0.
How large is large? Your regression software compares the t statistic on your variable with values in the Student's t distribution to determine the P value, which is the number you really need to be looking at. The Student's t distribution describes how the mean of a sample with a certain number of observations (your n) is expected to behave. For more information on the t distribution, see this web page.
If 95% of the t distribution is closer to the mean than the t-value on the coefficient you are looking at, then you have a P value of 5%. This is also referred to as a significance level of 5%. The P value is the probability of seeing a result as extreme as the one you are getting (a t value as large as yours) in a collection of random data in which the variable had no effect. A P of 5% or less is the generally accepted point at which to reject the null hypothesis. With a P value of 5% (or .05), there is only a 5% chance that the results you are seeing would have come up in a random distribution, so you can say with 95% confidence that the variable is having some effect, assuming your model is specified correctly.
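Continuing the R sketch above, the t statistic and P value for a single coefficient can be recomputed by hand:

est    <- coef(summary(fit))["mpg", ]
t_stat <- est["Estimate"] / est["Std. Error"]      # t = coefficient / standard error
p_val  <- 2 * pt(-abs(t_stat), df.residual(fit))   # two-sided P value from Student's t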
The 95% confidence interval for your coefficients shown by many regression packages gives you the same information. You can be 95% confident that the real, underlying value of the coefficient that you are estimating falls somewhere in that 95% confidence interval, so if the interval does not contain 0, your P value will be .05 or less.
Note that the size of the P value for a coefficient says nothing about the size of the effect that variable is having on your dependent variable - it is possible to have a highly significant result (a very small P value) for a minuscule effect.

Coefficients 

In simple or multiple linear regression, the size of the coefficient for each independent variable gives you the size of the effect that variable is having on your dependent variable, and the sign on the coefficient (positive or negative) gives you the direction of the effect. In regression with a single independent variable, the coefficient tells you how much the dependent variable is expected to increase (if the coefficient is positive) or decrease (if the coefficient is negative) when that independent variable increases by one. In regression with multiple independent variables, the coefficient tells you how much the dependent variable is expected to increase when that independent variable increases by one, holding all the other independent variables constant. Remember to keep in mind the units which your variables are measured in.
Note: in forms of regression other than linear regression, such as logistic or probit, the coefficients do not have this straightforward interpretation. Explaining how to deal with these is beyond the scope of an introductory guide.

R-Squared and overall significance of the regression 

The R-squared of the regression is the fraction of the variation in your dependent variable that is accounted for (or predicted by) your independent variables. (In regression with a single independent variable, it is the same as the square of the correlation between your dependent and independent variable.) The R-squared is generally of secondary importance, unless your main concern is using the regression equation to make accurate predictions. The P value tells you how confident you can be that each individual variable has some correlation with the dependent variable, which is the important thing.
Another number to be aware of is the P value for the regression as a whole. Because your independent variables may be correlated, a condition known as multicollinearity, the coefficients on individual variables may be insignificant when the regression as a whole is significant. Intuitively, this is because highly correlated independent variables are explaining the same part of the variation in the dependent variable, so their explanatory power and the significance of their coefficients is "divided up" between them.
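In the R sketch above, both quantities appear in the standard model summary:

s <- summary(fit)
s$r.squared     # fraction of variation in price explained by the model
s$fstatistic    # F statistic for the regression as a whole, with its df
pf(s$fstatistic["value"], s$fstatistic["numdf"], s$fstatistic["dendf"],
   lower.tail = FALSE)   # P value for the whole regression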

Correlation and Causation

Correlation and Causation 


What are correlation and causation and how are they different?


Two or more variables are considered to be related, in a statistical context, if their values change together, i.e. as the value of one variable increases or decreases, so does the value of the other variable (although it may be in the opposite direction).

For example, for the two variables "hours worked" and "income earned" there is a relationship between the two if the increase in hours worked is associated with an increase in income earned. If we consider the two variables "price" and "purchasing power", as the price of goods increases a person's ability to buy these goods decreases (assuming a constant income).


Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables. A correlation between variables, however, does not automatically mean that the change in one variable is the cause of the change in the values of the other variable.


Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events. This is also referred to as cause and effect. 


Theoretically, the difference between the two types of relationships are easy to identify — an action or occurrence can cause another (e.g. smoking causes an increase in the risk of developing lung cancer), or it can correlate with another (e.g. smoking is correlated with alcoholism, but it does not cause alcoholism). In practice, however, it remains difficult to clearly establish cause and effect, compared with establishing correlation.


Why are correlation and causation important?

The objective of much research or scientific analysis is to identify the extent to which one variable relates to another variable. For example:
  • Is there a relationship between a person's education level and their health?
  • Is pet ownership associated with living longer?
  • Did a company's marketing campaign increase their product sales?
These and other questions explore whether a correlation exists between two variables, and if there is a correlation, this may guide further research into investigating whether one action causes the other. Understanding correlation and causality allows policies and programs that aim to bring about a desired outcome to be better targeted.

How is correlation measured?


For two variables, a statistical correlation is measured by the use of a Correlation Coefficient, represented by the symbol (r), which is a single number that describes the degree of relationship between two variables. 

The coefficient's numerical value ranges from +1.0 to –1.0, which provides an indication of the strength and direction of the relationship.
If the correlation coefficient has a negative value (below 0) it indicates a negative relationship between the variables. This means that the variables move in opposite directions (ie when one increases the other decreases, or when one decreases the other increases).
If the correlation coefficient has a positive value (above 0) it indicates a positive relationship between the variables meaning that both variables move in tandem, i.e. as one variable decreases the other also decreases, or when one variable increases the other also increases.
Where the correlation coefficient is 0 this indicates there is no relationship between the variables (one variable can remain constant while the other increases or decreases).
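A toy R example (invented numbers) showing r close to +1 for a near-linear relationship:

hours  <- c(10, 20, 30, 40, 50)
income <- c(250, 510, 740, 1010, 1260)   # roughly linear in hours worked
cor(hours, income)                       # close to +1: strong positive relationship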
While the correlation coefficient is a useful measure, it has its limitations: correlation coefficients are usually associated with measuring a linear relationship.
For example, if you compare hours worked and income earned for a tradesperson who charges an hourly rate for their work, there is a linear (or straight line) relationship since with each additional hour worked the income will increase by a consistent amount. 

If, however, the tradesperson charges based on an initial call out fee and an hourly fee which progressively decreases the longer the job goes for, the relationship between hours worked and income would be non-linear, where the correlation coefficient may be closer to 0. 

Care is needed when interpreting the value of 'r'. It is possible to find correlations between many variables, however the relationships can be due to other factors and have nothing to do with the two variables being considered. 
For example, sales of ice creams and the sales of sunscreen can increase and decrease across a year in a systematic manner, but it would be a relationship that would be due to the effects of the season (ie hotter weather sees an increase in people wearing sunscreen as well as eating ice cream) rather than due to any direct relationship between sales of sunscreen and ice cream.

The correlation coefficient should not be used to say anything about a cause and effect relationship. By examining the value of 'r', we may conclude that two variables are related, but the 'r' value does not tell us whether one variable caused the change in the other.
How can causation be established?


Causality is an area of statistics commonly misunderstood and misused, in the mistaken belief that because the data show a correlation there is necessarily an underlying causal relationship.


The use of a controlled study is the most effective way of establishing causality between variables. In a controlled study, the sample or population is split in two, with both groups being comparable in almost every way. The two groups then receive different treatments, and the outcomes of each group are assessed.

For example, in medical research, one group may receive a placebo while the other group is given a new type of medication. If the two groups have noticeably different outcomes, the different experiences may have caused the different outcomes.
For ethical reasons, there are limits to the use of controlled studies; it would not be appropriate to take two comparable groups and have one of them undergo a harmful activity while the other does not. To overcome this, observational studies are often used to investigate correlation and causation for the population of interest. These studies can look at the groups' behaviours and outcomes and observe any changes over time.
The objective of these studies is to provide statistical information to add to the other sources of information that would be required for the process of establishing whether or not causality exists between two variables.