AMOS Model Fit Measures
The minimum sample discrepancy model:
The following fit measures are based on the minimum value of the discrepancy.
CMIN: CMIN is the minimum value, \( \hat{C} \), of the discrepancy, C.
P: P is the probability of getting as large a discrepancy as occurred with the present sample (under appropriate distributional assumptions and assuming a correctly specified model). That is, P is a "p value" for testing the hypothesis that the model fits perfectly in the population.

One approach to model selection employs statistical hypothesis testing to eliminate from consideration those models that are inconsistent with the available data. Hypothesis testing is a widely accepted procedure, and there is a lot of experience in its use. However, its unsuitability as a device for model selection was pointed out early in the development of analysis of moment structures (Jöreskog, 1969). It is generally acknowledged that most models are useful approximations that do not fit perfectly in the population. In other words, the null hypothesis of perfect fit is not credible to begin with and will in the end be accepted only if the sample is not allowed to get too big.

If you encounter resistance to the foregoing view of the role of hypothesis testing in model fitting, the following quotations may come in handy. The first two quotes predate the development of structural modeling, and refer to other model fitting problems.
▪ "The power of the test to detect
an underlying disagreement between theory and data is controlled largely by the
size of the sample. With a small sample an alternative hypothesis which departs
violently from the null hypothesis may still have a small probability of
yielding a significant value of χ2. In a very large sample, small and
unimportant departures from the null hypothesis are almost certain to be
detected." (Cochran, 1952)
▪ "If the sample is small then the
χ2 test will show that the data are 'not significantly different from' quite
a wide range of very different theories, while if the sample is large, the χ2
test will show that the data are significantly different from those expected on
a given theory even though the difference may be so very slight as to be
negligible or unimportant on other criteria." (Gulliksen & Tukey,
1958, pp. 95–96)
▪ "Such a hypothesis [of perfect
fit] may be quite unrealistic in most empirical work with test data. If a
sufficiently large sample were obtained this χ2 statistic would, no doubt,
indicate that any such non-trivial hypothesis is statistically untenable."
(Jöreskog, 1969, p. 200)
▪ "... in very large samples
virtually all models that one might consider would have to be rejected as
statistically untenable .... In effect, a nonsignificant chi-square value is
desired, and one attempts to infer the validity of the hypothesis of no
difference between model and data. Such logic is well-known in various
statistical guises as attempting to prove the null hypothesis. This procedure
cannot generally be justified, since the chi-square variate v can be made small
by simply reducing sample size." (Bentler & Bonett, 1980, p. 591)
▪ "Our opinion ... is that this
null hypothesis [of perfect fit] is implausible and that it does not help much
to know whether or not the statistical test has been able to detect that it is
false." (Browne & Mels, 1992, p. 78).
CMIN/DF
CMIN/DF is the minimum discrepancy, \( \hat{C} \), divided by its degrees of freedom:
\[ \mathrm{CMIN/DF} = \frac{\hat{C}}{d}. \]
Several writers have
suggested the use of this ratio as a measure of fit. For every estimation
criterion except for Uls and Sls, the ratio should
be close to one for correct models. The trouble is that it isn't clear how far
from one you should let the ratio get before concluding that a model is
unsatisfactory.
Rules of thumb:
▪ "...Wheaton et al. (1977) suggest that the researcher also compute a relative chi-square (χ2/df) .... They suggest a ratio of approximately five or less 'as beginning to be reasonable.' In our experience, however, χ2 to degrees of freedom ratios in the range of 2 to 1 or 3 to 1 are indicative of an acceptable fit between the hypothetical model and the sample data." (Carmines and McIver, 1981, page 80)
▪ "... different researchers have recommended using ratios as low as 2 or as high as 5 to indicate a reasonable fit." (Marsh & Hocevar, 1985)
▪ "... it seems clear that a χ2/df ratio > 2.00 represents an inadequate fit." (Byrne, 1989, p. 55)
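As a worked check against the Example 6 output shown later in this document, Model A gives
\[ \mathrm{CMIN/DF} = \frac{71.544}{6} = 11.924, \]
which exceeds even the most lenient of the thresholds quoted above.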
FMIN
FMIN is the minimum value, \( \hat{F} \), of the discrepancy, F.
Measures of parsimony
Models
with relatively few parameters (and relatively many degrees of freedom) are
sometimes said to be high in parsimony, or simplicity. Models with many
parameters (and few degrees of freedom) are said to be complex, or lacking in
parsimony. This use of the terms, simplicity and complexity, does not always
conform to everyday usage. For example, the saturated model would be called
complex while a model with an elaborate pattern of linear dependencies but with
highly constrained parameter values would be called simple.
While
one can inquire into the grounds for preferring simple, parsimonious models
(e.g., Mulaik, et al., 1989), there does not appear to be any disagreement that
parsimonious models are preferable to complex ones. When it comes to
parameters, all other things being equal, less is more. At the same time, well
fitting models are preferable to poorly fitting ones. Many fit measures
represent an attempt to balance these two conflicting objectives—simplicity and
goodness of fit.
"In
the final analysis, it may be, in a sense, impossible to define one best way to
combine measures of complexity and measures of badness-of-fit in a single
numerical index, because the precise nature of the best numerical tradeoff
between complexity and fit is, to some extent, a matter of personal taste. The
choice of a model is a classic problem in the two-dimensional analysis of
preference." (Steiger, 1990, p. 179)
NPAR:
NPAR
is the number of distinct parameters (q) being estimated. Two parameters (two
regression weights, say) that are required to be equal to each other count as a
single parameter, not two.
DF:
DF is the number of degrees of freedom for testing the model:
\[ d = p - q, \]
where p is the number of sample moments and q is the number of distinct parameters.
Rigdon (1994a) gives a detailed explanation of the calculation and
interpretation of degrees of freedom.
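As a worked example using the Example 6 output shown later, the saturated model has p = 21 sample moments and Model A estimates q = 15 distinct parameters, so
\[ d = p - q = 21 - 15 = 6. \]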
PRATIO:
The parsimony ratio (James, Mulaik & Brett, 1982; Mulaik, et al., 1989) expresses the number of constraints in the model being evaluated as a fraction of the number of constraints in the independence model:
\[ \mathrm{PRATIO} = \frac{d}{d_b}, \]
where d is the degrees of freedom of the model being evaluated and \( d_b \) is the degrees of freedom of the independence model. The parsimony ratio is used in the calculation of PNFI and PCFI (see "Parsimony adjusted measures" below).
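For Model A in Example 6, with d = 6 and an independence model with \( d_b = 15 \),
\[ \mathrm{PRATIO} = \frac{6}{15} = .400. \]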
Measures based on population discrepancy:
Steiger and Lind (1980) introduced the use of the population discrepancy function as a measure of model adequacy. The population discrepancy function, \( F_0 \), is the value of the discrepancy function obtained by fitting a model to the population moments rather than to sample moments. That is,
\[ F_0 = \min_{\gamma}\, F\big(\alpha(\gamma),\ \alpha_0\big), \]
in contrast to
\[ \hat{F} = \min_{\gamma}\, F\big(\alpha(\gamma),\ a\big). \]
Steiger, Shapiro and Browne (1985) showed that, under certain conditions, \( \hat{C} = n\hat{F} \) has a noncentral chi-square distribution with d degrees of freedom and noncentrality parameter \( \delta = C_0 = nF_0 \). The Steiger-Lind approach to model evaluation centers around the estimation of \( \delta \) and related quantities.
The present discussion of measures
related to the population discrepancy relies mainly on Steiger and Lind (1980) and Steiger, Shapiro and Browne (1985).
The notation is based on Browne and Mels (1992).
NCP
NCP is an estimate of the noncentrality parameter, \( \delta = C_0 = nF_0 \).
The columns labeled LO 90 and HI 90 contain the lower limit (\( \delta_L \)) and upper limit (\( \delta_U \)) of a 90% confidence interval for \( \delta \). \( \delta_L \) is obtained by solving
\[ \Phi\big(\hat{C} \mid \delta,\ d\big) = .95 \]
for \( \delta \), and \( \delta_U \) is obtained by solving
\[ \Phi\big(\hat{C} \mid \delta,\ d\big) = .05 \]
for \( \delta \), where \( \Phi(x \mid \delta, d) \) is the distribution function of the noncentral chi-squared distribution with noncentrality parameter \( \delta \) and d degrees of freedom.
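A minimal Python sketch of this interval, assuming SciPy is available and using max(CMIN − d, 0) as the point estimate; it illustrates the defining equations above rather than Amos's own algorithm, and the function name is an illustrative choice.

```python
# Solve Phi(Cmin | delta, d) = .95 for delta_lo and = .05 for delta_hi.
from scipy.stats import chi2, ncx2
from scipy.optimize import brentq

def ncp_interval(cmin, d, level=0.90):
    ncp_hat = max(cmin - d, 0.0)                         # point estimate of delta
    lo_tail, hi_tail = (1 + level) / 2, (1 - level) / 2  # .95 and .05 for a 90% CI

    def f(delta, target):
        cdf = ncx2.cdf(cmin, d, delta) if delta > 0 else chi2.cdf(cmin, d)
        return cdf - target

    # If even delta = 0 cannot reach the target tail probability, the limit is 0
    delta_lo = 0.0 if f(0.0, lo_tail) < 0 else brentq(f, 0.0, cmin, args=(lo_tail,))
    delta_hi = 0.0 if f(0.0, hi_tail) < 0 else brentq(f, 0.0, 10 * cmin + 100, args=(hi_tail,))
    return ncp_hat, delta_lo, delta_hi

# Model A from Example 6: CMIN = 71.544 with d = 6 gives NCP = 65.544
print(ncp_interval(71.544, 6))
```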
F0
F0 = \( \max(\hat{C} - d,\ 0)/n = \mathrm{NCP}/n \) is an estimate of \( F_0 = \delta/n \).
The columns labeled LO 90 and HI 90 contain the lower limit and upper limit of a 90% confidence interval for \( F_0 \):
\[ \mathrm{LO\ 90} = \frac{\delta_L}{n}, \qquad \mathrm{HI\ 90} = \frac{\delta_U}{n}. \]
RMSEA:
\( F_0 \) incorporates no penalty for model complexity and will tend to favor models with many parameters. In comparing two nested models, \( F_0 \) will never favor the simpler model. Steiger and Lind (1980) suggested compensating for the effect of model complexity by dividing \( F_0 \) by the number of degrees of freedom for testing the model. Taking the square root of the resulting ratio gives the population "root mean square error of approximation", called RMS by Steiger and Lind, and RMSEA by Browne and Cudeck (1993):
\[ \text{population RMSEA} = \sqrt{\frac{F_0}{d}}, \qquad \text{estimated RMSEA} = \sqrt{\frac{\hat{F}_0}{d}}. \]
The columns labeled LO 90 and HI 90 contain the lower limit and upper limit of a 90% confidence interval for the population value of RMSEA. The limits are given by
\[ \mathrm{LO\ 90} = \sqrt{\frac{\delta_L}{n\,d}}, \qquad \mathrm{HI\ 90} = \sqrt{\frac{\delta_U}{n\,d}}. \]
Rule
of thumb:
"Practical experience has made
us feel that a value of the RMSEA of about .05 or less would indicate a close
fit of the model in relation to the degrees of freedom. This figure is based on
subjective judgment. It cannot be regarded as infallible or correct, but it is
more reasonable than the requirement of exact fit with the RMSEA = 0.0. We are
also of the opinion that a value of about 0.08 or less for the RMSEA would
indicate a reasonable error of approximation and would not want to employ a
model with a RMSEA greater than 0.1." (Browne and
Cudeck, 1993)
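A minimal sketch of the point estimate and interval limits, under the assumption that CMIN = n × FMIN with n = N − 1 for a single group (931 for the Wheaton data of Example 6); the limits take the δ_L and δ_U produced by the NCP sketch above as inputs, and the function names are illustrative.

```python
import math

def rmsea_point(cmin, d, n):
    f0_hat = max(cmin - d, 0.0) / n        # estimate of the population discrepancy F0
    return math.sqrt(f0_hat / d)           # RMSEA = sqrt(F0_hat / d)

def rmsea_limits(delta_lo, delta_hi, d, n):
    return math.sqrt(delta_lo / (n * d)), math.sqrt(delta_hi / (n * d))

# Model A from Example 6: CMIN = 71.544, d = 6, N = 932 so n = 931
print(round(rmsea_point(71.544, 6, 931), 3))   # about .108
```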
PCLOSE:
is a "p value" for
testing the null hypothesis that the population RMSEA is no greater than .05:
.
By contrast, P is
for testing the hypothesis that the population RMSEA is zero:
.
Based on their experience
with RMSEA, Browne and Cudeck (1993) suggest that a RMSEA of
.05 or less indicates a "close fit". Employing this definition of
"close fit", PCLOSE gives a test of close fit
while P gives a test of exact fit.
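A minimal Python sketch of both tests (assuming SciPy; the close-fit noncentrality .05² n d follows directly from the definition above, and the function name is an illustrative choice):

```python
from scipy.stats import chi2, ncx2

def exact_and_close_fit(cmin, d, n, rmsea0=0.05):
    p_exact = chi2.sf(cmin, d)                       # P: population RMSEA is zero
    p_close = ncx2.sf(cmin, d, rmsea0 ** 2 * n * d)  # PCLOSE: population RMSEA <= .05
    return p_exact, p_close

# Model A from Example 6: CMIN = 71.544, d = 6, n = 931; both p values are near zero
print(exact_and_close_fit(71.544, 6, 931))
```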
Information theoretic measures:
Amos reports several statistics of the form \( \hat{C} + kq \) or \( \hat{F} + kq \), where k is some positive constant. Each of these statistics creates a composite measure of badness of fit (\( \hat{C} \) or \( \hat{F} \)) and complexity (q) by forming a weighted sum of the two. Simple models that fit well receive low scores according to such a criterion. Complicated, poorly fitting models get high scores. The constant k determines the relative penalties to be attached to badness of fit and to complexity.
The statistics described in this
section are intended for model comparisons and not for the evaluation of an
isolated model.
All of these statistics were
developed for use with maximum likelihood estimation. Amos reports them
for Gls and Adf estimation as well, although
it is not clear that their use is appropriate there.
AIC
The Akaike information criterion (Akaike, 1973; Akaike, 1987) is given by
\[ \mathrm{AIC} = \hat{C} + 2q. \]
BCC
The Browne-Cudeck (Browne & Cudeck, 1989) criterion is given by
\[ \mathrm{BCC} = \hat{C} + 2q\,\frac{\displaystyle\sum_{g=1}^{G} b^{(g)}\,\frac{p^{(g)}\big(p^{(g)}+3\big)}{N^{(g)}-p^{(g)}-2}}{\displaystyle\sum_{g=1}^{G} p^{(g)}\big(p^{(g)}+3\big)}, \]
where \( b^{(g)} = N^{(g)} - 1 \) if the Emulisrel6 method has been used, or \( b^{(g)} = n\,N^{(g)}/N \) if it has not.
BCC imposes a slightly greater penalty for model complexity
than does AIC.
BCC is the only measure in this section that was developed
specifically for analysis of moment structures. Browne and Cudeck provided some
empirical evidence suggesting that BCC may be superior to more
generally applicable measures. Arbuckle
(unpublished) gives an alternative
justification for BCC and derives the above formula for
multiple groups.
BIC
The Bayes information criterion (Schwarz, 1978; Raftery, 1995) is given by the formula
\[ \mathrm{BIC} = \hat{C} + q \ln N. \]
Amos 4 used the formula (Raftery, 1993)
\[ \mathrm{BIC} = \hat{C} - d \ln N. \]
In comparison to AIC, BCC and CAIC, BIC assigns
a greater penalty to model complexity, and so has a greater tendency to pick
parsimonious models. BIC is reported only for the case of a
single group where means and intercepts are not explicit model parameters.
CAIC
Bozdogan's (Bozdogan, 1987) CAIC (consistent AIC) is given by the formula
\[ \mathrm{CAIC} = \hat{C} + q\,\big(\ln N + 1\big). \]
CAIC assigns a greater penalty to
model complexity than either AIC or BCC, but not as great a penalty as does
BIC. CAIC is reported only for the case of a single group where means and
intercepts are not explicit model parameters.
ECVI
Except for a constant scale factor, ECVI is the same as AIC:
\[ \mathrm{ECVI} = \frac{1}{n}\big(\hat{C} + 2q\big) = \frac{\mathrm{AIC}}{n}. \]
The columns labeled LO 90 and HI 90 give the lower limit and upper limit of a 90% confidence interval for the population ECVI:
\[ \mathrm{LO\ 90} = \frac{\delta_L + d + 2q}{n}, \qquad \mathrm{HI\ 90} = \frac{\delta_U + d + 2q}{n}. \]
MECVI
Except for a scale factor, MECVI is identical to BCC:
\[ \mathrm{MECVI} = \frac{\mathrm{BCC}}{n}, \]
where, as in the BCC formula, \( b^{(g)} = N^{(g)} - 1 \) if the Emulisrel6 method has been used, or \( b^{(g)} = n\,N^{(g)}/N \) if it has not.
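A minimal sketch of the single-group criteria above (BCC and MECVI are omitted because they need the per-group p(g) and N(g) weights); n = N − 1 is assumed for the single group, and the function name is an illustrative choice.

```python
import math

def information_criteria(cmin, q, N):
    n = N - 1
    return {
        "AIC":  cmin + 2 * q,                    # AIC  = Cmin + 2q
        "BIC":  cmin + q * math.log(N),          # BIC  = Cmin + q ln N
        "CAIC": cmin + q * (math.log(N) + 1),    # CAIC = Cmin + q (ln N + 1)
        "ECVI": (cmin + 2 * q) / n,              # ECVI = AIC / n
    }

# Model A from Example 6: Cmin = 71.544 with q = 15 parameters, N = 932
print(information_criteria(71.544, 15, 932))
```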
Comparison to baseline model
Several
fit measures encourage you to reflect on the fact that, no matter how badly
your model fits, things could always be worse.
Bentler
and Bonett (1980) and Tucker and Lewis (1973) suggested fitting the independence model or some other very badly
fitting "baseline" model as an exercise to see how large the
discrepancy function becomes. The object of the exercise is to put the fit of
your own model(s) into some perspective. If none of your models fit very well,
it may cheer you up to see a really bad model. For example, as
the following output shows, Model A from Example 6 has a rather large
discrepancy (71.544) in relation to its degrees of freedom. On the other hand,
71.544 does not look so bad compared to 2131.790 (the discrepancy for the independence
model).
Model                          NPAR       CMIN   DF      P   CMIN/DF
Model A: No Autocorrelation      15      71.544    6   .000    11.924
Model B: Most General            16       6.383    5   .271     1.277
Model C: Time-Invariance         13       7.501    8   .484      .938
Model D: A and C Combined        12      73.077    9   .000     8.120
Saturated model                  21        .000    0
Independence model                6    2131.790   15   .000   142.119
This
things-could-be-worse philosophy of model evaluation is incorporated into a
number of fit measures. All of the measures tend to range between zero and one,
with values close to one indicating a good fit. Only NFI (described
below) is guaranteed to be between zero and one, with one indicating a perfect
fit. (CFI is also guaranteed to be between zero and one, but this
is because values bigger than one are reported as one, while values less than
zero are reported as zero.)
The
independence model is only one example of a model that can be chosen as the
baseline model, although it is the one most often used, and the one that Amos
uses. Sobel and Bohrnstedt (1985) contend that the choice of the
independence model as a baseline model is often inappropriate. They suggest
alternatives, as did Bentler and Bonett (1980), and give some examples to
demonstrate the sensitivity of NFI to the choice of baseline
model.
NFI
The Bentler-Bonett (Bentler & Bonett, 1980) normed fit index (NFI), or \( \Delta_1 \) in the notation of Bollen (1989b), can be written
\[ \mathrm{NFI} = \Delta_1 = 1 - \frac{\hat{C}}{\hat{C}_b} = \frac{\hat{C}_b - \hat{C}}{\hat{C}_b}, \]
where \( \hat{C} \) is the minimum discrepancy of the model being evaluated and \( \hat{C}_b \) is the minimum discrepancy of the baseline model.
In Example 6 the independence model can be obtained by adding constraints to any of the other models. Any model can be obtained by constraining the saturated model. So Model A, for instance, with \( \hat{C} = 71.544 \), is unambiguously "in between" the perfectly fitting saturated model (\( \hat{C} = 0 \)) and the independence model (\( \hat{C}_b = 2131.790 \)).
Looked
at in this way, the fit of Model A is a lot closer to the fit of the saturated
model than it is to the fit of the independence model. In fact you might say
that Model A has a discrepancy that is 96.6% of the way between the (terribly
fitting) independence model and the (perfectly fitting) saturated model:
\[ \mathrm{NFI} = \frac{2131.790 - 71.544}{2131.790} = .966. \]
Rule of
thumb:
"Since
the scale of the fit indices is not necessarily easy to interpret (e.g., the
indices are not squared multiple correlations), experience will be required to
establish values of the indices that are associated with various degrees of
meaningfulness of results. In our experience, models with overall fit indices
of less than .9 can usually be improved substantially. These indices, and the
general hierarchical comparisons described previously, are best understood by
examples." (Bentler
& Bonett, 1980, p. 600, referring to both theNFI and
the TLI)
RFI
Bollen's (Bollen, 1986) relative fit index (RFI) is given by
\[ \mathrm{RFI} = \rho_1 = 1 - \frac{\hat{C}/d}{\hat{C}_b/d_b}, \]
where \( \hat{C} \) and d are the discrepancy and the degrees of freedom for the model being evaluated, and \( \hat{C}_b \) and \( d_b \) are the discrepancy and the degrees of freedom for the baseline model.
The RFI is obtained from the NFI by substituting F/d for F.
RFI values close to 1 indicate a very good fit.
IFI
Bollen's (Bollen, 1989b) incremental fit index (IFI) is given by
\[ \mathrm{IFI} = \Delta_2 = \frac{\hat{C}_b - \hat{C}}{\hat{C}_b - d}, \]
where \( \hat{C} \) and d are the discrepancy and the degrees of freedom for the model being evaluated, and \( \hat{C}_b \) and \( d_b \) are the discrepancy and the degrees of freedom for the baseline model.
IFI values close to 1 indicate a very good fit.
TLI
The Tucker-Lewis coefficient (\( \rho_2 \) in the notation of Bollen, 1989b) was discussed by Bentler and Bonett (1980) in the context of analysis of moment structures, and is also known as the Bentler-Bonett non-normed fit index (NNFI):
\[ \mathrm{TLI} = \rho_2 = \frac{\dfrac{\hat{C}_b}{d_b} - \dfrac{\hat{C}}{d}}{\dfrac{\hat{C}_b}{d_b} - 1}, \]
where \( \hat{C} \) and d are the discrepancy and the degrees of freedom for the model being evaluated, and \( \hat{C}_b \) and \( d_b \) are the discrepancy and the degrees of freedom for the baseline model.
The typical range for TLI lies
between zero and one, but it is not limited to that range. TLI values
close to 1 indicate a very good fit.
CFI
The comparative fit index (CFI; Bentler, 1990) is given by
\[ \mathrm{CFI} = 1 - \frac{\max\big(\hat{C} - d,\ 0\big)}{\max\big(\hat{C}_b - d_b,\ 0\big)} = 1 - \frac{\mathrm{NCP}}{\mathrm{NCP}_b}, \]
where \( \hat{C} \), d and NCP are the discrepancy, the degrees of freedom and the noncentrality parameter estimate for the model being evaluated, and \( \hat{C}_b \), \( d_b \) and \( \mathrm{NCP}_b \) are the discrepancy, the degrees of freedom and the noncentrality parameter estimate for the baseline model.
The CFI is identical to the McDonald and Marsh (1990) relative noncentrality index (RNI),
\[ \mathrm{RNI} = 1 - \frac{\hat{C} - d}{\hat{C}_b - d_b}, \]
except that the CFI is truncated to fall in the range from 0 to 1. CFI values close to 1 indicate a very good fit.
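The five baseline-comparison indices can all be computed from the same four numbers. A minimal sketch (not Amos's code; the function name and argument names are illustrative), with c and d for the fitted model and cb and db for the baseline model:

```python
def baseline_indices(c, d, cb, db):
    nfi = 1 - c / cb                               # NFI  (Bentler-Bonett normed fit index)
    rfi = 1 - (c / d) / (cb / db)                  # RFI  (Bollen, 1986)
    ifi = (cb - c) / (cb - d)                      # IFI  (Bollen, 1989b)
    tli = ((cb / db) - (c / d)) / ((cb / db) - 1)  # TLI / NNFI
    ncp, ncp_b = max(c - d, 0.0), max(cb - db, 0.0)
    cfi = min(1.0, max(0.0, 1 - ncp / ncp_b))      # CFI: RNI truncated to [0, 1]
    return {"NFI": nfi, "RFI": rfi, "IFI": ifi, "TLI": tli, "CFI": cfi}

# Model A from Example 6 against the independence model
print(baseline_indices(71.544, 6, 2131.790, 15))   # NFI is about .966
```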
Parsimony adjusted measures:
James, Mulaik and Brett (1982) suggested multiplying the NFI
by a "parsimony index" so as to take into account the number of
degrees of freedom for testing both the model being evaluated and the baseline
model. Mulaik, et al. (1989) suggested applying the same adjustment to the GFI.
Amos also applies a parsimony adjustment to the CFI.
PNFI
The PNFI is the result of applying the James, Mulaik and Brett (1982) parsimony adjustment to the NFI:
\[ \mathrm{PNFI} = \mathrm{NFI} \times \mathrm{PRATIO} = \mathrm{NFI}\,\frac{d}{d_b}, \]
where d is the degrees of freedom for the model being evaluated, and \( d_b \) is the degrees of freedom for the baseline model.
PCFI
The PCFI is the result of applying the James, Mulaik and Brett (1982) parsimony adjustment to the CFI:
\[ \mathrm{PCFI} = \mathrm{CFI} \times \mathrm{PRATIO} = \mathrm{CFI}\,\frac{d}{d_b}, \]
where d is the degrees of freedom for the model being evaluated, and \( d_b \) is the degrees of freedom for the baseline model.
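As a worked illustration with the Example 6 numbers for Model A (NFI = .966, and PRATIO = 6/15 = .400 as computed earlier),
\[ \mathrm{PNFI} = .400 \times .966 = .386, \]
and PCFI is obtained from the CFI in exactly the same way.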
GFI and related measures:
GFI
The GFI (goodness of fit index) was devised by Jöreskog and Sörbom (1984) for Ml and Uls estimation, and generalized to other estimation criteria by Tanaka and Huba (1985). The GFI is given by
\[ \mathrm{GFI} = 1 - \frac{\hat{F}}{\hat{F}_b}, \]
where \( \hat{F} \) is the minimum value of the discrepancy function defined in Appendix B and \( \hat{F}_b \) is obtained by evaluating F with \( \Sigma^{(g)} = 0 \), g = 1, 2, ..., G. An exception has to be made for maximum likelihood estimation, since (D2) in Appendix B is not defined for \( \Sigma^{(g)} = 0 \). For the purpose of computing GFI in the case of maximum likelihood estimation, \( f\big(\Sigma^{(g)}; S^{(g)}\big) \) in Appendix B is calculated as
\[ f\big(\Sigma^{(g)}; S^{(g)}\big) = \tfrac{1}{2}\,\mathrm{tr}\Big[K^{(g)\,-1}\big(S^{(g)} - \Sigma^{(g)}\big)\Big]^2, \]
with \( K^{(g)} = \Sigma^{(g)}\big(\hat{\gamma}_{\mathrm{ML}}\big) \), where \( \hat{\gamma}_{\mathrm{ML}} \) is the maximum likelihood estimate of \( \gamma \).
GFI is less than or equal to 1. A value of 1 indicates a
perfect fit.
AGFI
The AGFI (adjusted goodness of fit index) takes into account the degrees of freedom available for testing the model. It is given by
\[ \mathrm{AGFI} = 1 - \big(1 - \mathrm{GFI}\big)\,\frac{d_b}{d}, \]
where
\[ d_b = \sum_{g=1}^{G} p^{*(g)}. \]
The AGFI is
bounded above by one, which indicates a perfect fit. It is not, however,
bounded below by zero, as the GFI is.
PGFI
The PGFI (parsimony goodness of fit index), suggested by Mulaik, et al. (1989), is a modification of the GFI that takes into account the degrees of freedom available for testing the model:
\[ \mathrm{PGFI} = \mathrm{GFI}\,\frac{d}{d_b}, \]
where d is the degrees of freedom for the model being evaluated, and \( d_b = \sum_{g=1}^{G} p^{*(g)} \) is the degrees of freedom for the baseline zero model.
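A minimal sketch of the two adjustments, taking a GFI value, the model degrees of freedom d, and d_b (the total number of sample moments) as given; the function names are illustrative and the GFI value in the example call is hypothetical, not taken from Example 6.

```python
def agfi(gfi, d, d_b):
    return 1 - (1 - gfi) * (d_b / d)   # penalizes models that spend degrees of freedom

def pgfi(gfi, d, d_b):
    return gfi * (d / d_b)             # Mulaik et al. (1989) parsimony adjustment

# Hypothetical illustration: GFI = .975 with d = 6 and d_b = 21 sample moments
print(round(agfi(0.975, 6, 21), 3), round(pgfi(0.975, 6, 21), 3))
```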
Miscellaneous measures:
Hoelter index:
Hoelter's "critical
N" (Hoelter, 1983) is the largest sample size for which one would accept the
hypothesis that a model is correct. Hoelter does not specify a significance
level to be used in determining the critical N, although he uses .05 in his
examples. Amos reports a critical N for significance levels of .05 and .01.
Here are the critical N's displayed by Amos for each of the models in Example
6.
Model                          HOELTER .05   HOELTER .01
Model A: No Autocorrelation             164           219
Model B: Most General                  1615          2201
Model C: Time-Invariance               1925          2494
Model D: A and C Combined               216           277
Independence model                       11            14
Model A, for instance,
would have been accepted at the .05 level if the sample moments had been
exactly as they were found to be in the Wheaton study, but with a sample size
of 164. With a sample size of 165, Model A would have been rejected. Hoelter
argues that a critical N of 200 or better indicates a satisfactory fit. In an
analysis of multiple groups, he suggests a threshold of 200 times the number of
groups. Presumably this threshold is to be used in conjunction with a
significance level of .05. This standard eliminates Model A and the
independence model in Example 6. Models B, C and D are satisfactory according
to the Hoelter criterion. I am not myself convinced by Hoelter's arguments in
favor of the 200 benchmark. Unfortunately, the use of critical N as a practical
aid to model selection requires some such standard. Bollen
and Liang (1988) report some studies of the
critical N statistic.
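A minimal sketch of the critical N computation for a single group, under the assumption that CMIN = (N − 1) × FMIN and using SciPy for the chi-square quantile; with these assumptions it reproduces the .05 and .01 values shown above for Model A. The function name is an illustrative choice.

```python
import math
from scipy.stats import chi2

def hoelter_cn(cmin, d, N, alpha=0.05):
    fmin = cmin / (N - 1)                # minimum discrepancy per unit of n
    crit = chi2.ppf(1 - alpha, d)        # critical chi-square at the chosen level
    return math.floor(crit / fmin) + 1   # largest N with (N - 1) * FMIN <= crit

# Model A from Example 6 (Wheaton data, N = 932): prints 164 and 219
print(hoelter_cn(71.544, 6, 932, 0.05), hoelter_cn(71.544, 6, 932, 0.01))
```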
RMR
The RMR (root mean square residual) is the square root of the average squared amount by which the sample variances and covariances differ from their estimates obtained under the assumption that your model is correct:
\[ \mathrm{RMR} = \sqrt{\frac{\displaystyle\sum_{g=1}^{G}\ \sum_{i=1}^{p_g}\ \sum_{j=1}^{i} \big(\hat{s}^{(g)}_{ij} - \hat{\sigma}^{(g)}_{ij}\big)^2}{\displaystyle\sum_{g=1}^{G} p^{*(g)}}}. \]
The smaller the RMR is,
the better. An RMR of zero indicates a perfect fit.
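A minimal single-group sketch using NumPy, with the sample covariance matrix S and the model-implied matrix Sigma taken as given; the function name and the two matrices in the example call are hypothetical, not drawn from Example 6.

```python
import numpy as np

def rmr(S, Sigma):
    S, Sigma = np.asarray(S, float), np.asarray(Sigma, float)
    idx = np.tril_indices_from(S)                # variances and distinct covariances
    resid = S[idx] - Sigma[idx]
    return float(np.sqrt(np.mean(resid ** 2)))   # root of the mean squared residual

# Hypothetical two-variable illustration
S     = [[4.0, 1.2], [1.2, 9.0]]
Sigma = [[4.0, 1.0], [1.0, 9.5]]
print(round(rmr(S, Sigma), 4))
```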
References:
- Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Petrov, B.N. & Csaki, F. [Eds.], Proceedings of the 2nd International Symposium on Information Theory. Budapest: Akademiai Kiado, 267–281.
- Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317–332.
- Bentler, P. M., & Bonett, D. G. (1980).
Significance tests and goodness of fit in the analysis of covariance
structures. Psychological bulletin, 88(3), 588.
- Bollen, K.A. (1989b). A new incremental fit index
for general structural equation models. Sociological Methods and
Research, 17, 303–316.
- Bollen, K.A. & Long, J.S. [Eds.] (1993). Testing structural equation
models. Newbury Park, CA:
Sage.
- Bozdogan, H. (1987). Model selection and Akaike's
information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345–370.
- Browne, M. W., & Mels, G. (1992). RAMONA
user’s guide.
- Browne, M. W., & Cudeck, R. (1993).
Alternative ways of assessing model fit. Sage focus editions, 154, 136-136.
- Byrne, B.M. (1989). A primer of LISREL: Basic
applications and programming for confirmatory factor analytic models. New York:
Springer-Verlag.
- Carmines, E., & McIver, J. (1981).
Analyzing models with unobserved variables, Social measurement: Current issues. Beverly Hills: Sage.
- Cochran, W. G. (1952). The χ2 test of goodness
of fit. The Annals of Mathematical Statistics, 315-345.
- Gulliksen, H., & Tukey, J. W. (1958).
Reliability for the law of comparative judgment. Psychometrika, 23(2), 95-110.
- Hoelter, J.W. (1983). The analysis of covariance
structures: Goodness-of-fit indices. Sociological Methods and
Research, 11, 325–344.
- James, L. R., Mulaik, S. A., & Brett, J. M.
(1982). Causal analysis: Assumptions, models, and data (Vol. 1). SAGE Publications, Incorporated.
- Jöreskog, K. G. (1969). A general approach to
confirmatory maximum likelihood factor analysis. Psychometrika, 34(2), 183-202.
- Jöreskog, K.G. & Sörbom, D. (1984). LISREL-VI user's guide (3rd ed.).
Mooresville, IN: Scientific Software.
- McDonald, R.P. & Marsh, H.W. (1990). Choosing
a multivariate model: Noncentrality and goodness of fit. Psychological Bulletin, 107, 247-255.
- Marsh, H. W., & Hocevar, D. (1985).
Application of confirmatory factor analysis to the study of self-concept:
First-and higher order factor models and their invariance across groups. Psychological bulletin, 97(3), 562.
- Mulaik, S. A., James, L. R., Van Alstine, J.,
Bennett, N., Lind, S., & Stilwell, C. D. (1989). Evaluation of
goodness-of-fit indices for structural equation models. Psychological bulletin, 105(3), 430.
- Raftery, A. (1995). Bayesian model selection in
social research. In P. Marsden (Ed.), Sociological Methodology 1995 (pp.
111-163): San Francisco.
- Raftery, A.E. (1993). Bayesian model selection in
structural equation models. In Bollen, K.A. & Long, J.S. [Eds.] Testing structural equation
models. Newbury Park, CA:
Sage, 163–180.
- Rigdon, E. E. (1998).
Structural equation modeling.
- Schwarz, G. (1978).
Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
- Sobel, M.E. &
Bohrnstedt, G.W. (1985). Use of null models in evaluating the fit of covariance
structure models. In Tuma, N.B [Ed.] Sociological methodology 1985. San Francisco:
Jossey-Bass, 152–178.
- Steiger, J. H. (1990).
Structural model evaluation and modification: An interval estimation approach. Multivariate behavioral
research, 25(2), 173-180.
- Steiger, J.H. & Lind,
J.C. (1980, May 30, 1980). Statistically-based tests for
the number of common factors. Paper presented at the Annual Spring Meeting of the
Psychometric Society, Iowa City.
- Steiger, J.H., Shapiro, A.
& Browne, M.W. (1985). On the multivariate asymptotic distribution of
sequential chi-square statistics. Psychometrika, 50, 253–263.
- Tanaka, J.S. & Huba,
G.J. (1985). A fit index for covariance structure models under arbitrary GLS
estimation. British Journal of Mathematical and Statistical Psychology, 38, 197–201.
- Tucker, L.R & Lewis, C.
(1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1–10.
- Wheaton, B., Muthen, B.,
Alwin, D. F., & Summers, G. F. (1977). Assessing reliability and stability
in panel models. Sociological methodology, 8, 84-136.