In discussions of the uses of IRT for psychological measurement the contention is often heard that IRT allows only the ordering of persons and items and that therefore a metric measurement of amounts of change or of treatment effects, e.g., in clinical psychology, cannot be established by means of IRT. The standard argument is that ultimately all measurements of abilities or traits must remain ordinal because we may at any time apply strictly monotone transformations to latent scales.
On the other hand, it might be countered that the same is true in physics, where we may at any time decide to replace the scales of properties like mass, force, or voltage - the metric scale properties of which are never doubted - by monotone functions of these scales, like square roots or logarithms. To this again it might be retorted that the fundamental measurement of physical properties is based on concatenation operations as an underpinning of the resulting ratio scales, whereas such operations do not exist for person traits in psychology.
It may be shown, however, that the principle of Specific Objectivity (SO), which is an empirically testable axiom introduced by Rasch (1968, 1972, 1977) in IRT and which was applied by Fischer (1987) to the measurement of change, actually is a perfect analogue of a concatenation operation. To make this clearer, we have to introduce the formal definition of SO on which the paper of Fischer (1987) was based.
Parameterization: Consider a patient $S$ suffering from anxiety, who undergoes some therapy. Let $\theta_t$, for $t=1,2,3$, be $S$'s positions on the latent anxiety scale at three time points $T_1, T_2, T_3$, and denote the amount of change (= the effect of a treatment) between $T_a$ and $T_b$ by $\eta_{ab}$. Then it is assumed that there exists a continuous function $U(x,y)$, defined on $\mathbb{R}^2$, such that $\eta_{ab} = U(\theta_b, \theta_a)$. $U$ is called a `comparator' because it serves to compare the two values of anxiety, $\theta_a$ and $\theta_b$, in order to arrive at an assessment of the amount of change, $\eta_{ab}$. For any fixed value $x_0$, $z=U(x_0,y)$ is assumed to be a bijective decreasing mapping, and for any fixed value $y_0$, $z=U(x,y_0)$ is assumed to be a bijective increasing mapping.
It then follows immediately from the monotonicity of $U$ that there exists a function $F$ such that $\theta_b = F(\theta_a, \eta_{ab})$. The meaning of $F$ is that, if we know both $S$'s latent anxiety at time point $T_a$ before the treatment and the effect of the treatment given between $T_a$ and $T_b$, denoted $\eta_{ab}$, we can predict uniquely $S$'s amount of anxiety at time point $T_b$. $F$ is therefore called the `effect function'.
Specific Objectivity (SO): The total effect of any two treatments with effect parameters $\eta_{12}$ and $\eta_{23}$, denoted by $\eta_{13}$, is
$$\eta_{13} = H(\eta_{12}, \eta_{23}). \tag{1}$$
This definition of SO in a measurement of change framework is analogous to Rasch's (1967, 1968, 1972, 1977) definition of SO for the comparison of items or persons. The analogy becomes obvious if one inserts $\eta_{ab} = U(\theta_b, \theta_a)$ in (1), yielding
$$U(\theta_3, \theta_1) = H[U(\theta_2, \theta_1),\, U(\theta_3, \theta_2)]. \tag{2}$$
This equation, which should hold for all possible values of the parameters $\theta_1, \theta_2, \theta_3$, has strong implications for the functions $H$ and $U$, because the right-hand side is a function of $\theta_2$, whereas the left-hand side is independent of $\theta_2$; this means that $\theta_2$ must somehow cancel out of the right-hand side.
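From results in the theory of functional equations (see Aczél, 1966) it follows that there exist continuous, bijective functions (`scale transformations') $k$ and $m$ such that
$$U(\theta_b, \theta_a) = m^{-1}[k(\theta_b) - k(\theta_a)],$$
$$H(\eta_{12}, \eta_{23}) = m^{-1}[m(\eta_{12}) + m(\eta_{23})].$$
Moreover, the scale transformation $k$ is unique up to positive linear transformations $ck + d$, with $c>0$, and $m$ is unique up to similarity transformations $cm$.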
Using a term introduced by G. Rasch, we may say that U and H are `latently additive' functions. From this property it is further concluded that, once this admissible additive representation of the latent trait and the effect parameters has been adopted, the scale for anxiety is an interval scale, and that for the treatment effect, a ratio scale.
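The cancellation of the intermediate trait value under latent additivity can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the scale transformations `k` and `m` are arbitrary choices made here (they are not taken from Fischer, 1987), and the comparator and the composition rule are built from them as $U(x,y) = m^{-1}[k(x) - k(y)]$ and $H(u,v) = m^{-1}[m(u) + m(v)]$.

```python
import math

# Hypothetical scale transformations, chosen only for illustration.
def k(theta):
    """Strictly increasing bijection of the real line (trait scale)."""
    return theta ** 3 + theta

def m(eta):
    """Strictly increasing bijection with m(0) = 0 (effect scale)."""
    return math.sinh(eta)

def m_inv(z):
    return math.asinh(z)

def U(x, y):
    """Comparator: amount of change from trait value y (earlier) to x (later)."""
    return m_inv(k(x) - k(y))

def H(eta_12, eta_23):
    """Total effect of two consecutive treatments."""
    return m_inv(m(eta_12) + m(eta_23))

theta1, theta3 = -0.7, 1.2
direct = U(theta3, theta1)

# Whatever the intermediate position theta2, composing the two partial
# effects reproduces the direct comparison of theta3 with theta1.
for theta2 in (-2.0, 0.0, 0.5, 3.0):
    composed = H(U(theta2, theta1), U(theta3, theta2))
    print(theta2, abs(composed - direct) < 1e-9)
```

Any other pair of strictly increasing bijections could be substituted for `k` and `m`; the cancellation does not depend on the particular choice.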
These results, however, are only one side of the problem of constructing an IRT model that allows for a specifically objective assessment of treatment effects. The other side is the search for an appropriate ICC that links the latent parameters to the observed reactions. Suppose the manifest reactions are $S$'s responses to a dichotomous item or the presence vs. absence of a certain symptom of anxiety, and assume that the comparator function $U$ is a function of the two response probabilities $P(+ \mid \theta_a)$ and $P(+ \mid \theta_b)$ at time points $T_a$ and $T_b$, respectively, $U(\theta_b, \theta_a) = W[P(+ \mid \theta_b), P(+ \mid \theta_a)]$, which is some monotone function of an (unconditional or conditional) likelihood of the observed responses; then it can be shown that the ICC $P(+ \mid \theta)$ must be a logistic function (cf. Fischer, 1987),
$$P(+ \mid \theta) = \frac{\exp(\theta - \beta)}{1 + \exp(\theta - \beta)},$$
where $\theta$ is expressed on the additive scale and $\beta$ is a parameter of the item or symptom.
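Why a logistic ICC permits a comparison that is free of the person parameter can be seen in a small numerical sketch. It assumes, for illustration only, a logistic ICC with a hypothetical item parameter `beta`, a person with trait value `theta` before the treatment and `theta + eta` after it, and independent responses at the two time points. Given that the symptom is present at exactly one of the two time points, the probability that it is present at the second one depends on `eta` alone.

```python
import math

def icc(theta, beta=0.0):
    """Logistic ICC: probability that the symptom is present."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def p_present_after_given_one(theta, eta, beta=0.0):
    """P(absent before, present after | present at exactly one time point)."""
    p_before = icc(theta, beta)
    p_after = icc(theta + eta, beta)
    p01 = (1.0 - p_before) * p_after   # absent, then present
    p10 = p_before * (1.0 - p_after)   # present, then absent
    return p01 / (p01 + p10)

eta = -0.8  # hypothetical effect: the treatment lowers anxiety
for theta in (-3.0, 0.0, 2.0):
    # Same value for every theta: exp(eta) / (1 + exp(eta))
    print(theta, p_present_after_given_one(theta, eta))
```

The conditional probability equals $\exp(\eta)/[1+\exp(\eta)]$ for every `theta` and every `beta`, which is the property that conditional inference exploits.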
The reason why we get such strong results about the measurement of treatment effects lies in the functional equations (1) and (2), whereby continuity and strict monotonicity of the functions play a decisive role. But are these assumptions reasonable? In fact they are. Firstly, monotonicity is implied by the underlying psychological concepts: it is a defensible assumption, for instance, that a medicinal or psychotherapeutic treatment of anxiety will always reduce - rather than increase - anxiety. Secondly, the assumption of continuity is defensible because we may increase the dosage of a treatment as much or as little as we want, thus creating treatments with infinitely many different effect parameters, and it appears to be plausible that a small dosage of the treatment will at most produce a small change of the latent trait parameter. For this argument, however, it is essential that the dosage of a treatment is a manipulable experimental factor rather than an incidental factor. This is the substantive justification of the assumptions that yield strong results concerning the measurement of effects.
It is worthwhile to mention that in other psychometric settings we do not obtain equally strong results about the measurement of, say, abilities in educational testing: if persons can be tested only once and if the test consists of a finite number of items with fixed item parameters conforming to the Rasch Model (RM), it can be shown that the scale properties depend on whether the differences between item parameters are rational or irrational numbers; in other words, the scale properties of the measurement are empirically undecidable (see Fischer, 1995). (This is primarily of theoretical interest, though, and has few practical consequences because interval scale properties can be shown to hold for infinitely many discrete points along the latent continuum; but in principle the scale level remains undecidable.) The reason for this dilemma is that continuity in terms of the item parameters - tacitly assumed by Rasch (1967, 1968, 1972, 1977) and most authors in the IRT literature - is not justifiable: we cannot change item content in such a manner that arbitrarily small changes of item difficulty result.
These considerations show why applications of IRT to clinical or applied psychology are particularly nice from a psychometric point of view. There are also other reasons why IRT is especially fruitful for clinical psychology; some of them are given below.
Although treatment effect studies basically are experimental because treatments can be given to or withheld from patients and dosages can be manipulated, the randomized assignment of patients to treatment groups is often hard to realize: there may be practical or ethical reasons to reject randomization. Therefore, treatment groups are often compared to waiting groups (i.e., groups of patients waiting for treatment). Unfortunately, experience shows that in most cases waiting groups differ significantly from the treatment groups with respect to the distribution of relevant person characteristics. Moreover, in psychotherapy the patients play an active role in the selection of the therapy method. Lastly, patients' syndromes and the severity of their illness influence the treatment decisions made by doctors or psychologists. Therefore, treatment effect studies often are not experimental studies in a strict sense; they are only what is called `quasi-experimental'.
We therefore have to ask, how can results about treatment effects be obtained that remain valid in spite of systematic differences between treatment and control groups? IRT has been seen to be an excellent basis for solving this problem as long as the differences between treatment and control groups reside in the distributions of the person parameters $\theta_v$. The answer again lies in the method of conditional inference introduced in IRT by Rasch (1960) and Andersen (1973). Conditional inference is the statistical counterpart of the epistemic principle of SO. It requires that there exist a nontrivial likelihood function depending on the treatment effect parameter(s) but being independent of the incidental person parameters $\theta_v$. Maximizing that function yields a conditional maximum likelihood (CML) estimator of the treatment effect independently of the distribution of the latent person parameters in the samples. This CML estimator will generally be biased (like other ML estimators), but if its consistency can be proved, it suffices to increase the sample size sufficiently to obtain as precise a result as one desires. SO as a general methodological principle and conditional inference as a practical tool therefore provide an excellent answer to the scientific challenges of therapy effect studies - as long as the differences between groups are describable by the distributions of the person parameters. (On the many uses of the CML methods in the Rasch model and related models, cf. the monograph by Fischer & Molenaar, 1995.)
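The independence of the CML estimator from the person parameter distribution can be illustrated by a small simulation. The sketch below is not the estimation procedure of any particular program. It assumes the simplest possible design: one dichotomous symptom observed before and after a treatment, a logistic ICC, and one effect parameter `eta_true`, with all names and numbers hypothetical. In this design the conditional likelihood depends only on the numbers of patients who change in either direction, and the CML estimate has a closed form.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def count_changers(thetas, eta, rng):
    """Simulate the symptom before and after treatment; count the changers."""
    n01 = n10 = 0
    for theta in thetas:
        before = rng.random() < logistic(theta)        # present at T1?
        after = rng.random() < logistic(theta + eta)   # present at T2?
        if after and not before:
            n01 += 1
        elif before and not after:
            n10 += 1
    return n01, n10

def cml_effect(n01, n10):
    """Closed-form maximizer of the conditional (binomial) likelihood."""
    return math.log(n01 / n10)

rng = random.Random(1998)
eta_true = -1.0

# Two hypothetical samples that differ strongly in the severity of anxiety.
severe = [rng.gauss(1.0, 1.0) for _ in range(40000)]
mild = [rng.gauss(-2.0, 1.0) for _ in range(40000)]

est_severe = cml_effect(*count_changers(severe, eta_true, rng))
est_mild = cml_effect(*count_changers(mild, eta_true, rng))
print(round(est_severe, 2), round(est_mild, 2))  # both close to eta_true
```

Both samples recover the same effect although their trait distributions differ, because patients whose response does not change carry no information about the effect and the person parameter cancels for the others.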
The greater part of the IRT literature deals with dichotomous items. In clinical psychology, however, symptoms may be expressed in several degrees, or critical events (such as epileptic seizures or migraine attacks) may occur more or less frequently. Therefore, in clinical psychology IRT models are required for observations that are realizations of bounded or unbounded integer variables. A fairly general formal framework for treatment effect studies in clinical or applied psychology therefore is the following:
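Let the state of a person at time point $T_1$ be expressed in terms of the distribution of a nonnegative integer random variable $H_1$ (a so-called `lattice variable') with probability functions $P_1(h \mid \theta)$, $h = 0, 1, 2, \ldots$, governed by a real valued parameter $\theta$, where the $P_1(h \mid \theta)$ are continuous in $\theta$ and $\sum_h P_1(h \mid \theta) = 1$. For time point $T_2$, let the state of the person similarly be expressed in the form of a distribution of a lattice variable $H_2$ with probability functions $P_2(h \mid \theta + \eta)$, $h = 0, 1, 2, \ldots$, where $\eta$ is a real valued parameter and $\sum_h P_2(h \mid \theta + \eta) = 1$. The variables $H_1$ and $H_2$ are assumed to be stochastically independent and increasing in $\theta$ in the sense that $P_1(h+1 \mid \theta)/P_1(h \mid \theta)$ and $P_2(h+1 \mid \theta)/P_2(h \mid \theta)$, for all $h$, are strictly monotone increasing in $\theta$,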
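with limits
$$\lim_{\theta \to -\infty} \frac{P_1(h \mid \theta)}{P_1(0 \mid \theta)} = \lim_{\theta \to -\infty} \frac{P_2(h \mid \theta)}{P_2(0 \mid \theta)} = 0 \quad \text{for all } h > 0.$$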
The present formalization covers cases where the observation is a frequency, or where it is a response to a polytomous item with ordered categories, or a response to a dichotomous item; in the first of these cases, $h$ is allowed to take infinitely many values, $h = 0, 1, 2, \ldots$; in the second, $H_1$ and $H_2$ are bounded such that $h = 0, 1, \ldots, m$, where $m+1$ is the number of response categories; and in the third, $m = 1$. Notice that the parameter of $H_2$ is written as $\theta + \eta$ without loss of generality because we know that, if a specifically objective result concerning change is possible, the parameters of the person and of the treatment effect can always be transformed into the additive system (6) through (8).
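Under these assumptions about the variables $H_1$ and $H_2$, the following result can be derived (see Fischer, 1995, pp. 298-303): If the conditional probability $P(h_1, h_2 \mid h_1 + h_2 = h)$, for any realizations $h_1$ and $h_2$, is some function independent of $\theta$ and strictly monotone in $\eta$, then $P_1(h \mid \theta)$ and $P_2(h \mid \theta + \eta)$ must be probability functions of the form
$$P_1(h \mid \theta) = \frac{p_h \mu^h}{\sum_l p_l \mu^l}, \qquad P_2(h \mid \theta + \eta) = \frac{q_h (\mu\delta)^h}{\sum_l q_l (\mu\delta)^l}, \tag{11}$$
with $\mu = \exp(\theta)$, $\delta = \exp(\eta)$, and positive constants $p_h$ and $q_h$.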
The distribution form underlying (11) is a so-called `power series distribution' which is well-known in statistics (see, e.g., Noack, 1950; Patil, 1965; Johnson & Kotz, 1969, Vol. 1). However, the general form of the model in (11) is not yet useful because the power series distributions comprise infinitely many unknown constants $p_h$ and $q_h$. To render (11) useful, we either have to introduce more restrictive assumptions on the model or to truncate the variables $H_1$ and $H_2$ so that $h$ is restricted to the values $h = 0, 1, \ldots, m$, as, e.g., in rating scale items with $m+1$ ordered response categories; then the number of constants $p_h$ and $q_h$ is at most $2m$, so that they can be estimated empirically.
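The defining property of the truncated power series model, namely that the conditional distribution of the two observations given their sum is free of the person parameter, can be verified numerically. The sketch below assumes the power series form with probabilities proportional to `p[h] * mu**h` at the first time point and to `q[h] * (mu * delta)**h` at the second; the constants in `p` and `q` and the value of `delta` are arbitrary illustrative numbers.

```python
def power_series_pmf(consts, mu):
    """Truncated power series distribution on h = 0, ..., len(consts) - 1."""
    weights = [c * mu ** h for h, c in enumerate(consts)]
    total = sum(weights)
    return [w / total for w in weights]

def conditional_given_sum(p, q, mu, delta, s):
    """Distribution of (h1, h2) given h1 + h2 = s, for independent H1 and H2."""
    pmf1 = power_series_pmf(p, mu)
    pmf2 = power_series_pmf(q, mu * delta)
    joint = {
        (h1, s - h1): pmf1[h1] * pmf2[s - h1]
        for h1 in range(len(p))
        if 0 <= s - h1 < len(q)
    }
    norm = sum(joint.values())
    return {pair: value / norm for pair, value in joint.items()}

p = [1.0, 2.0, 0.5, 0.3]   # hypothetical constants p_h, m = 3
q = [1.0, 1.5, 0.7, 0.2]   # hypothetical constants q_h
delta = 0.6                # hypothetical effect, delta = exp(eta)

# The person parameter mu varies widely, the conditional distribution does not.
for mu in (0.2, 1.0, 5.0):
    print(mu, conditional_given_sum(p, q, mu, delta, s=3))
```

The conditional probabilities are proportional to `p[h1] * q[h2] * delta**h2`, so `mu` drops out; this is what makes a conditional likelihood for the effect available.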
Regarding the first of these two cases, consider an application where $H_1$ is the number of migraine attacks of a patient that occur during an observation period prior to the migraine treatment; that is, $H_1$ represents the so-called `base rate' of that patient. Let $H_2$ be the corresponding number of attacks during an observation period after the treatment. To enable a meaningful comparison of both periods of observation, we assume that the two distributions are the same, that is, $P_1 = P_2$; in other words, all that possibly changes resides in the latent trait parameter $\theta$. Clearly, for any chosen observation period, some attacks will have happened prior to it and others would happen after its end if the observation were continued; that is, some attacks will remain unobserved. The same holds for the observation period after the treatment. The methodological problem is that the unobserved events may be a source of bias of the result about the treatment effect. If it is postulated, however, that ignoring the unobserved events does not systematically affect the result about $\eta$ (denoted `ignorability principle' by Fischer, 1991), it can be shown that the model must be equivalent to Rasch's (1960, 1973) multiplicative Poisson model, which results from (11) by setting $p_h = 1/h!$ and $q_h = 1/h!$.
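For the multiplicative Poisson model the conditional approach leads to a particularly simple estimator. The sketch below assumes that each patient's counts before and after the treatment are independent Poisson variables with means $\lambda_v$ and $\lambda_v\delta$, where $\delta = \exp(\eta)$. Given a patient's total, the count after the treatment is then binomial with probability $\delta/(1+\delta)$, whatever the patient's base rate. The attack counts are invented for illustration.

```python
import math

def conditional_loglik(delta, before, after):
    """Log-likelihood of delta given each patient's total number of attacks."""
    pi = delta / (1.0 + delta)
    total = 0.0
    for a, b in zip(before, after):
        log_binom = math.lgamma(a + b + 1) - math.lgamma(a + 1) - math.lgamma(b + 1)
        total += log_binom + b * math.log(pi) + a * math.log(1.0 - pi)
    return total

def cml_delta(before, after):
    """Closed-form maximizer of the conditional likelihood."""
    return sum(after) / sum(before)

# Hypothetical migraine attack counts of six patients.
before = [6, 2, 9, 4, 12, 1]
after = [3, 1, 5, 1, 7, 0]

delta_hat = cml_delta(before, after)
print(delta_hat)            # 0.5
print(math.log(delta_hat))  # estimated effect eta on the additive scale
```

The individual base rates never enter the estimate; only the totals before and after the treatment do.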
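In the second case, where $H_1$ and $H_2$ are responses to rating scale items, let the variables $H_1$ and $H_2$ be restricted to the values $h = 0, 1, \ldots, m$. Introducing new parameters $\beta_{1h} = \ln p_h$ and $\beta_{2h} = \ln q_h$, and rewriting (11), we obtain
$$P_1(h \mid \theta) = \frac{\exp(h\theta + \beta_{1h})}{\sum_{l=0}^{m} \exp(l\theta + \beta_{1l})}, \qquad P_2(h \mid \theta + \eta) = \frac{\exp[h(\theta + \eta) + \beta_{2h}]}{\sum_{l=0}^{m} \exp[l(\theta + \eta) + \beta_{2l}]}.$$
Generalized to persons $S_v$, items $I_i$ with category parameters $\beta_{ih}$, time points $T_t$, and treatments with effects $\eta_j$ given in dosages $q_{vjt}$, this is the linear partial credit model (LPCM), in which the effect term $\eta$ is replaced by $\sum_j q_{vjt}\eta_j$.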
By applying again the CML method, the $\theta_v$ parameters can be eliminated and the so-called `basic parameters' $\beta_{ih}$ and $\eta_j$ estimated jointly. A problem that arises is that treatment effect studies mostly employ only small samples, so that estimating the $\beta_{ih}$ jointly with the treatment effect parameters $\eta_j$ may bias the latter. Suppose, however, that a sample of calibration data for the items is available. The information about the $\beta_{ih}$ contained in the calibration data can be used by simply adding the calibration sample to the actual treatment effect data: notice that the model for time point $T_1$ - prior to the treatments, where the dosages $q_{vj1}$ are zero for all $v$ and $j$ - is just a PCM, so that the calibration data may simply be added to the observations obtained at time point $T_1$. This will stabilize the $\hat\beta_{ih}$ and thus render the $\hat\eta_j$ more precise.
Here we encounter one more payoff of the CML approach: since the $\theta_v$ are eliminated, the remaining `basic parameters' of the LPCM may generalize over different samples or populations. The LPCM is therefore very well suited for those problems which are treated in Meta Analysis, which nowadays is so popular in clinical psychology: the central question in Meta Analysis is, do treatment effects generalize over different studies? In the present models, the $H_0$ of the generalizability of treatment effects over different samples is immediately testable by means of conditional likelihood ratio tests.
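How such a conditional likelihood ratio test works can be sketched for the simplest case. The code below is a hypothetical illustration, not the procedure implemented in any particular program: it assumes two studies, one dichotomous symptom observed before and after the same treatment, and a logistic ICC, so that each study's conditional likelihood is a binomial likelihood of the numbers of patients changing in either direction. The test statistic is referred to a chi-square distribution with one degree of freedom, and all counts are assumed to be positive.

```python
import math

def cond_loglik(n01, n10, pi):
    """Conditional log-likelihood; pi = exp(eta) / (1 + exp(eta))."""
    return n01 * math.log(pi) + n10 * math.log(1.0 - pi)

def clr_test(study_a, study_b):
    """Test H0: the same treatment effect holds in both studies.

    Each study is a pair (n01, n10): the number of patients changing from
    'absent' to 'present' and from 'present' to 'absent', respectively.
    """
    (a01, a10), (b01, b10) = study_a, study_b
    pi_a = a01 / (a01 + a10)
    pi_b = b01 / (b01 + b10)
    pi_0 = (a01 + b01) / (a01 + a10 + b01 + b10)   # estimate under H0
    stat = 2.0 * (
        cond_loglik(a01, a10, pi_a) + cond_loglik(b01, b10, pi_b)
        - cond_loglik(a01, a10, pi_0) - cond_loglik(b01, b10, pi_0)
    )
    stat = max(stat, 0.0)
    # Upper tail probability of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

print(clr_test((20, 60), (10, 30)))   # equal effects in both studies
print(clr_test((20, 60), (45, 35)))   # clearly different effects
```

A large statistic (small p-value) indicates that the effect does not generalize across the two studies; with more studies or more effect parameters the degrees of freedom increase accordingly.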
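There are two important special cases of the LPCM which deserve mention. In the LPCM, there is one parameter $\beta_{ih}$ for each combination of item $\times$ category, that is, the items may have different response categories. In items which all have the same response categories, it often makes sense to replace the $\beta_{ih}$ by $h\beta_i + \omega_h$, that is, to assign one scalar item (easiness) parameter to each item and one category (attractiveness) parameter to each response category. Thereby, the number of parameters to be estimated is considerably reduced. This yields
$$P(X_{vit} = h) = \frac{\exp[h(\theta_v + \beta_i + \sum_j q_{vjt}\eta_j) + \omega_h]}{\sum_{l=0}^{m} \exp[l(\theta_v + \beta_i + \sum_j q_{vjt}\eta_j) + \omega_l]},$$
where $X_{vit}$ denotes the response of person $S_v$ to item $I_i$ at time point $T_t$.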
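The relation between the two parameterizations can be made concrete in a few lines of code. The sketch assumes that the probability of category $h$ is proportional to $\exp[h(\theta + \delta) + \beta_{ih}]$, where `delta` stands for the sum of the dosage-weighted treatment effects; the function names and all numerical values are hypothetical.

```python
import math

def category_probs(theta, delta, betas):
    """Category probabilities for one item; betas[h] is the parameter of category h."""
    logits = [h * (theta + delta) + b for h, b in enumerate(betas)]
    top = max(logits)                       # guard against overflow
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def rating_scale_betas(beta_i, omegas):
    """Restriction beta_ih = h * beta_i + omega_h (one item, common categories)."""
    return [h * beta_i + w for h, w in enumerate(omegas)]

omegas = [0.0, 0.3, -0.2]            # category parameters shared by all items
item_easiness = [0.5, -0.4, 1.1]     # one scalar parameter per item

for beta_i in item_easiness:
    betas = rating_scale_betas(beta_i, omegas)
    print(category_probs(theta=0.2, delta=-0.6, betas=betas))
```

With $k$ items and $m+1$ categories, the unrestricted model has one parameter per item and category, whereas the restricted version needs only one parameter per item plus one per category (before normalization). For a dichotomous item, `category_probs` reduces to the logistic function.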
The other special case occurs if the items are dichotomous (`yes' vs. `no'; correct vs. incorrect; `+' vs. `-'); then the Linear Logistic Test Model (LLTM; Fischer, 1973, 1976, 1995b) for the measurement of change results as a special case. All these models - including of course the simple RM, the PCM and the RSM applied to data obtained at just one time point - can be estimated, and hypotheses within these models can be tested, by means of a recently developed program that runs on PCs under Windows (LPCM-WIN 1.0; Fischer & Ponocny-Seliger, 1998). This program is an elegant tool especially for treatment effect studies in clinical or applied psychology.
In the family of models described so far it is assumed that all items $I_i$ measure one and the same unidimensional ability or trait $\theta$. In many domains of applied research in psychology or education, however, the researcher will opt for multidimensional item sets to monitor change of behavior under the effects of treatments. A typical example is the treatment of depressive patients, where the items are various symptoms related to emotion, cognition, body functions, circadian rhythm, work, and social relations. Clearly, such a heterogeneous set of items will never form a unidimensional scale of depressiveness. It is therefore paramount to extend the models so that items may be multidimensional.
The method of dealing with multidimensionality within the family of LPCMs is a simple reinterpretation of the LPCM: to each person $S_v$, assign one separate parameter $\theta_{vi}$ per item $I_i$. These parameters will be eliminated again, however, by conditioning on one raw score per item, namely, the sum of $S_v$'s responses to item $I_i$ over the time points.
To obtain a multidimensional LPCM, all we have to do is to rewrite
(15) as
To specify the subsets Jd and to handle the data accordingly, various tools are provided by the computer program LPCM-WIN 1.0. The program supports incomplete designs where different samples of items from each subset Jd are selected at different time points, which has the advantage that memory and satiation effects, which often are quite troublesome in repeated measurement designs, are avoided. To summarize this, LPCM-WIN 1.0 allows for