## Applying the Postulate of Specific Objectivity to the Measurement of Treatment Effects in Clinical Psychology

#### Gerhard H. Fischer, University of Vienna

In discussions of the uses of IRT for psychological measurement the contention is often heard that IRT allows only the ordering of persons and items and that therefore a metric measurement of amounts of change or of treatment effects, e.g., in clinical psychology, cannot be established by means of IRT. The standard argument is that ultimately all measurements of abilities or traits must remain ordinal because we may at any time apply strictly monotone transformations to latent scales.

On the other hand, it might be countered that the latter is equally true in physics, where we may at any time decide to replace the scales of properties like mass, force, or voltage - the metric scale properties of which are never doubted - by monotone functions of these scales, like square roots or logarithms. To this again it might be retorted that the fundamental measurement of physical properties is based on concatenation operations as an underpinning of the resulting ratio scales, whereas such operations do not exist for person traits in psychology.

It may be shown, however, that the principle of Specific Objectivity (SO), which is an empirically testable axiom introduced by Rasch (1968, 1972, 1977) in IRT and which was applied by Fischer (1987) to the measurement of change, actually is a perfect analogue of a concatenation operation. To make this clearer, we have to introduce the formal definition of SO on which the paper of Fischer (1987) was based.

Parameterization: Consider a patient S suffering from anxiety who undergoes some therapy. Let $\theta_t$, for t=1,2,3, be S's positions on the latent anxiety scale at three time points T1,T2,T3, and denote the amount of change (= the effect of a treatment) between Ta and Tb by $\delta_{ab}$. Then it is assumed that there exists a continuous function U(x,y), defined on $\mathbb{R}^2$, such that $\delta_{ab} = U(\theta_b, \theta_a)$. U is called a 'comparator' because it serves to compare the two values of anxiety, $\theta_a$ and $\theta_b$, in order to arrive at an assessment of the amount of change, $\delta_{ab}$. For any fixed value x0, z=U(x0,y) is assumed to be a bijective decreasing mapping, and for any fixed value y0, z=U(x,y0) is assumed to be a bijective increasing mapping.

It then follows immediately from the monotonicity of U that there exists a function F such that $\theta_b = F(\theta_a, \delta_{ab})$. The meaning of F is that, if we know both S's latent anxiety at time point Ta before the treatment and the effect of the treatment given between Ta and Tb, denoted $\delta_{ab}$, we can predict uniquely S's amount of anxiety at time point Tb. F is therefore called the 'effect function'.

Specific Objectivity (SO): The total effect of any two treatments with effect parameters $\delta_{12}$ and $\delta_{23}$, denoted by $\delta_{13}$, is

$$\delta_{13} = H(\delta_{12}, \delta_{23}) \qquad (1)$$

where H(x,y) is some continuous function defined on R2.

This definition of SO in a measurement of change framework is analogous to Rasch's (1967, 1968, 1972, 1977) definition of SO for the comparison of items or persons. The analogy becomes obvious if one inserts $\delta_{ab} = U(\theta_b, \theta_a)$ in (1), yielding

$$U(\theta_3, \theta_1) = H[U(\theta_2, \theta_1), U(\theta_3, \theta_2)] \qquad (2)$$

This equation, which should hold for all possible values of the parameters $\theta_1, \theta_2, \theta_3$, has strong implications for the functions H and U, because the right-hand side is a function of $\theta_2$, whereas the left-hand side is independent of $\theta_2$; this means that $\theta_2$ must somehow cancel out of the right-hand side.

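From results in the theory of functional equations (see Aczél, 1966) it follows that there exist continuous, bijective functions ('scale transformations') $k$ and $m$ such that

$$U(\theta_b, \theta_a) = m^{-1}[k(\theta_b) - k(\theta_a)] \qquad (3)$$

$$F(\theta_a, \delta_{ab}) = k^{-1}[k(\theta_a) + m(\delta_{ab})] \qquad (4)$$

for a,b = 1,2,3, and

$$H(\delta_{12}, \delta_{23}) = m^{-1}[m(\delta_{12}) + m(\delta_{23})] \qquad (5)$$

Or equivalently,

$$m(\delta_{ab}) = k(\theta_b) - k(\theta_a) \qquad (6)$$

$$k(\theta_b) = k(\theta_a) + m(\delta_{ab}) \qquad (7)$$

for a,b = 1,2,3, and

$$m(\delta_{13}) = m(\delta_{12}) + m(\delta_{23}) \qquad (8)$$

Moreover, the scale transformation $k$ is unique up to positive linear transformations $ck + d$, with c>0, and $m$ is unique up to similarity transformations $cm$.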

Using a term introduced by G. Rasch, we may say that U and H are 'latently additive' functions. From this property it is further concluded that, once this admissible additive representation of the latent trait and the effect parameters has been adopted, the scale for anxiety is an interval scale, and that for the treatment effect, a ratio scale.
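The latently additive structure can be checked numerically. The following is a minimal sketch, assuming the additive form in which the transformed amount of change equals the transformed 'after' value minus the transformed 'before' value. The particular scale transformations `k` and `m` (here sinh and the cube) and all function names are hypothetical choices made only for illustration; they are not taken from the sources cited above.

```python
import math

# Hypothetical scale transformations (illustrative assumptions only):
# both are continuous, strictly increasing bijections of the real line.
def k(theta):
    return math.sinh(theta)

def m(delta):
    return delta ** 3

def m_inv(value):
    # Real cube root that also works for negative arguments.
    return math.copysign(abs(value) ** (1.0 / 3.0), value)

def comparator(theta_after, theta_before, scale=1.0, shift=0.0):
    """Amount of change: m^{-1}[k(after) - k(before)].

    `scale` (> 0) and `shift` replace k by scale*k + shift and m by scale*m,
    i.e. they apply an admissible transformation to both scales.
    """
    diff = (scale * k(theta_after) + shift) - (scale * k(theta_before) + shift)
    return m_inv(diff / scale)

def combine(delta_first, delta_second):
    """Total effect of two successive treatments: m^{-1}[m(d1) + m(d2)]."""
    return m_inv(m(delta_first) + m(delta_second))

theta_1, theta_2, theta_3 = 1.2, 0.4, -0.7   # anxiety at T1, T2, T3
delta_12 = comparator(theta_2, theta_1)
delta_23 = comparator(theta_3, theta_2)
delta_13 = comparator(theta_3, theta_1)

# The intermediate state theta_2 cancels: the total effect depends only on
# the two partial effects.
print(abs(combine(delta_12, delta_23) - delta_13) < 1e-9)

# A positive linear rescaling of k (with m rescaled by the same factor)
# leaves the assessed amount of change untouched.
print(abs(comparator(theta_3, theta_1, scale=2.5, shift=-4.0) - delta_13) < 1e-9)
```

Both checks should print `True` for any choice of the three anxiety values, which is the numerical counterpart of the intermediate state cancelling out of the right-hand side of the functional equation.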

These results, however, are only one side of the problem of constructing an IRT model that allows for a specifically objective assessment of treatment effects. The other side is the search for an appropriate ICC that links the latent parameters to the observed reactions. Suppose the manifest reactions are S's responses to a dichotomous item or the presence vs. absence of a certain symptom of anxiety, and assume that the comparator function U be a function of the two response probabilities and at time points Ta and Tb, respectively, , which is some monotone function of an (unconditional or conditional) likelihood of the observed responses; then it can be shown that the ICC must be a logistic function (cf. Fischer, 1987),

 (9)

where c>0 and d are arbitrary constants as above. This again shows that the latent trait parameter $\theta$ is defined (is measurable) up to positive linear transformations, and the treatment effect parameter $\delta$ up to similarity transformations.
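Why a logistic ICC permits a specifically objective comparison can be illustrated with a small computation. The sketch below assumes the standard logistic form with c = 1 and d = 0, a treatment that shifts the latent parameter additively by `delta`, and local independence of the two responses; the function names are hypothetical. It computes the probability that the symptom is absent at Ta and present at Tb, given that it is present at exactly one of the two time points.

```python
import math

def icc(theta):
    """Logistic item characteristic curve (assumed form, with c = 1, d = 0)."""
    return 1.0 / (1.0 + math.exp(-theta))

def prob_symptom_appears(theta, delta):
    """P(absent at Ta, present at Tb | present at exactly one time point)."""
    p_before = icc(theta)
    p_after = icc(theta + delta)
    absent_then_present = (1.0 - p_before) * p_after
    present_then_absent = p_before * (1.0 - p_after)
    return absent_then_present / (absent_then_present + present_then_absent)

delta = -0.8  # hypothetical effect; negative values lower the symptom probability
for theta in (-2.0, 0.0, 1.5, 3.0):
    # The printed conditional probability is the same for every theta.
    print(theta, round(prob_symptom_appears(theta, delta), 6))
```

Under these assumptions the conditional probability equals `icc(delta)` whatever the patient's initial level, so the comparison of the two time points is free of the person parameter.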

The reason why we get such strong results about the measurement of treatment effects lies in the functional equations (1) and (2), whereby continuity and strict monotonicity of the functions play a decisive role. But are these assumptions reasonable? In fact they are. Firstly, monotonicity is implied by the underlying psychological concepts: it is a defensible assumption that a medicinal or psychotherapeutic treatment of anxiety will always reduce - rather than increase - anxiety; etc. Secondly, the assumption of continuity is defensible because we may increase the dosage of a treatment as much or as little as we want, thus creating treatments with infinitely many different effect parameters, and it appears to be plausible that a small dosage of the treatment will at most produce a small change of the latent trait parameter. For this argument, however, it is essential that the dosage of a treatment is a manipulable experimental factor rather than an incidental factor. This is the substantive justification of the assumptions that yield strong results concerning the measurement of effects.

It is worthwhile to mention that in other psychometric settings we do not obtain equally strong results about the measurement of, say, abilities in educational testing: if persons can be tested only once and if the test consists of a finite number of items with fixed item parameters conforming to the Rasch Model (RM), it can be shown that the scale properties depend on whether the differences between item parameters are rational or irrational numbers; in other words, the scale properties of the measurement are empirically undecidable (see Fischer, 1995). (This is primarily of theoretical interest, though, and has few practical consequences because interval scale properties can be shown to hold for infinitely many discrete points along the latent continuum; but in principle the scale level remains undecidable.) The reason for this dilemma is that continuity in terms of the item parameters - tacitly assumed by Rasch (1967, 1968, 1972, 1977) and most authors in the IRT literature - is not justifiable: we cannot change item content in such a manner that arbitrarily small changes of item difficulty result.

These considerations show why applications of IRT to clinical or applied psychology are particularly nice from a psychometric point of view. There are also other reasons why IRT is especially fruitful for clinical psychology; some of them are given below.

Although treatment effect studies basically are experimental because treatments can be given to or withheld from patients and dosages can be manipulated, the randomized assignment of patients to treatment groups is often hard to realize: there may be practical or ethical reasons to reject randomization. Therefore, treatment groups are often compared to waiting groups (i.e., groups of patients waiting for treatment). Unfortunately, experience shows that in most cases waiting groups differ significantly from the treatment groups with respect to the distribution of relevant person characteristics. Moreover, in psychotherapy the patients play an active role in the selection of the therapy method. Lastly, patients' syndromes and the severity of their illness influence the treatment decisions made by doctors or psychologists. Therefore, treatment effect studies often are not experimental studies in a strict sense; they are only what is called 'quasi-experimental'.

We therefore have to ask, how can results about treatment effects be obtained that remain valid in spite of systematic differences between treatment and control groups? IRT has been seen to be an excellent basis for solving this problem as long as the differences between treatment and control groups reside in the distributions of the person parameters. The answer again lies in the method of conditional inference introduced in IRT by Rasch (1960) and Andersen (1973). Conditional inference is the statistical counterpart of the epistemic principle of SO. It requires that there exist a nontrivial likelihood function depending on the treatment effect parameter(s) but being independent of the incidental person parameters. Maximizing that function yields a conditional maximum likelihood (CML) estimator of the treatment effect independently of the distribution of the latent person parameters in the samples. This CML estimator will generally be biased (like other ML estimators), but if its consistency can be proved, it suffices to increase the sample size to obtain as precise a result as one desires. SO as a general methodological principle and conditional inference as a practical tool therefore provide an excellent answer to the scientific challenges of therapy effect studies - as long as the differences between groups are describable by the distributions of the person parameters. (On the many uses of the CML methods in the Rasch model and related models, cf. the monograph by Fischer & Molenaar, 1995.)
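The independence of the CML estimator from the person parameter distribution can be illustrated by simulation. The following is a minimal sketch for the simplest conceivable design, assuming one dichotomous symptom observed before and after treatment, a logistic ICC, and an additive effect `true_delta`; the group means, sample sizes, seed, and function names are hypothetical. In this special case the conditional likelihood of the patients whose symptom status changed is binomial, and the CML estimate is the log ratio of the two kinds of change.

```python
import math
import random

def icc(theta):
    """Logistic item characteristic curve (assumed form)."""
    return 1.0 / (1.0 + math.exp(-theta))

def simulate_group(n_patients, mean_theta, delta, rng):
    """Return counts (n01, n10) of patients whose symptom status changed."""
    n01 = n10 = 0
    for _ in range(n_patients):
        theta = rng.gauss(mean_theta, 1.0)          # incidental person parameter
        before = rng.random() < icc(theta)
        after = rng.random() < icc(theta + delta)
        if not before and after:
            n01 += 1                                 # symptom appeared
        elif before and not after:
            n10 += 1                                 # symptom disappeared
    return n01, n10

def cml_effect(n01, n10):
    """CML estimate of delta: maximizes n01*log(p) + n10*log(1 - p), p = icc(delta)."""
    return math.log(n01 / n10)

rng = random.Random(2000)
true_delta = -0.8
severe = simulate_group(20000, 1.5, true_delta, rng)    # e.g. a treatment group
mild = simulate_group(20000, -1.0, true_delta, rng)     # e.g. a waiting group
estimate_severe = cml_effect(*severe)
estimate_mild = cml_effect(*mild)
print(round(estimate_severe, 2), round(estimate_mild, 2))
```

Although the two simulated groups differ markedly in their distribution of person parameters, both estimates should lie close to the same effect, because only patients with a changed response enter the conditional likelihood and their contribution does not involve the person parameter.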

The greater part of the IRT literature deals with dichotomous items. In clinical psychology, however, symptoms may be expressed in several degrees, or critical events (such as epileptic seizures or migraine attacks) may occur more or less frequently. Therefore, in clinical psychology IRT models are required for observations that are realizations of bounded or unbounded integer variables. A fairly general formal framework for treatment effect studies in clinical or applied psychology therefore is the following:

Let the state of a person at time point T1 be expressed in terms of the distribution of a positive integer random variable H1 (a so-called lattice variable') with probability functions , , governed by a real valued parameter , where the are continuous in and . For time point T2, let the state of the person similarly be expressed in the form of a distribution of a lattice variable H2 with probability functions , , where is a real valued parameter and . The variables H1 and H2 are assumed to be stochastically independent and increasing in in the sense that and , for all h, are strictly monotone increasing in , with limits

 (10)

for .

The present formalization covers cases where the observation is a frequency, or where it is a response to a polytomous item with ordered categories, or a response to a dichotomous item; in the first of these cases, h is allowed to take infinitely many values, ; in the second, H1 and H2 are bounded such that , where m+1 is the number of response categories; and in the third, m =1. Notice that the parameter of H2 is written as without loss of generality because we know that, if a specifically objective result concerning change is possible, the parameters of the person and of the treatment effect can always be transformed into the additive system (6) through (8).

Under these assumptions about the variables H1 and H2, the following result can be derived (see Fischer, 1995, pp. 298-303): If the conditional probability , for any realizations h1 and h2, is some function independent of and strictly monotone in , then and must be probability functions of the form

 (11)

where with c > 0, , and with certain constants ph > 0 and qh > 0.

The distribution form underlying (11) is a so-called 'power series distribution' which is well-known in statistics (see, e.g., Noack, 1950; Patil, 1965; Johnson & Kotz, 1969, Vol. 1). However, the general form of the model in (11) is not yet useful because the power series distributions comprise infinitely many unknown constants ph and qh. To render (11) useful, we either have to introduce more restrictive assumptions on the model or to truncate the variables H1 and H2 so that h is restricted to the values $h = 0, 1, \ldots, m$, as, e.g., in rating scale items with m+1 ordered response categories; then the number of constants ph and qh is at most 2m, so that they can be estimated empirically.

Regarding the first of these two cases, consider an application where H1 is the number of migraine attacks of a patient that occur during an observation period prior to the migraine treatment, that is, H1 represents the so-called 'base rate' of that patient, and H2 is the number of attacks during an observation period after the treatment. To enable a meaningful comparison of both periods of observation, we assume that the two distributions are the same, that is, $p_h = q_h$ for all h; in other words, all that possibly changes resides in the latent trait parameter.

Clearly, for any chosen observation period, some attacks will have happened prior to it and others would happen after its end if the observation were continued; that is, some attacks will remain unobserved. The same holds for the observation period after the treatment. The methodological problem is that the unobserved events may be a source of bias of the result about the treatment effect. If it is postulated, however, that ignoring the unobserved events does not systematically affect the result about the treatment effect (denoted 'ignorability principle' by Fischer, 1991), it can be shown that the model must be equivalent to Rasch's (1960, 1973) multiplicative Poisson model, which results from (11) by setting $p_h = 1/h!$ and $q_h = 1/h!$.
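The way the multiplicative Poisson model separates the treatment effect from the person parameter can be verified directly. The sketch below assumes the usual multiplicative parameterization, in which the attack counts before and after treatment are independent Poisson variables with rates `base_rate` and `base_rate * effect`; these names, and the numerical values, are hypothetical. Given the total number of attacks, the number observed after treatment should then follow a binomial distribution that involves only the effect.

```python
import math

def poisson_pmf(h, rate):
    """Probability of h events under a Poisson distribution with the given rate."""
    return math.exp(-rate) * rate ** h / math.factorial(h)

def cond_pmf_after(h_after, total, base_rate, effect):
    """P(H2 = h_after | H1 + H2 = total) under the multiplicative Poisson model."""
    def joint(l):
        # Independent counts: (total - l) attacks before, l attacks after treatment.
        return poisson_pmf(total - l, base_rate) * poisson_pmf(l, base_rate * effect)
    return joint(h_after) / sum(joint(l) for l in range(total + 1))

def binomial_pmf(h, n, p):
    return math.comb(n, h) * p ** h * (1.0 - p) ** (n - h)

effect = 0.5          # hypothetical: treatment halves the attack rate
total = 12            # attacks observed in both periods together
p_after = effect / (1.0 + effect)
for base_rate in (0.5, 3.0, 10.0):
    # Each pair of values agrees, whatever the patient's base rate.
    print(base_rate,
          round(cond_pmf_after(4, total, base_rate, effect), 6),
          round(binomial_pmf(4, total, p_after), 6))
```

The base rate cancels from the conditional distribution, so conditioning on the total count removes the person parameter in the same way as conditioning on the raw score does for dichotomous items.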

In the second case, where H1 and H2 are responses to rating scale items, let the variables H1 and H2 be restricted to the values . Introducing new parameters ln and ln , and rewriting (11), we obtain

 (12)

 (13)

Basically, this model has the form of a Partial Credit Model (Masters, 1982; Wright and Masters, 1982); however, at time point T2 there is an additive treatment effect parameter that is weighted by h and added to the item category parameters. In general, if there are several treatments and more time points, (13) will become
 (14)

where qvjt is the dosage of treatment Bj given to patient Sv up to time point Tt and is the effect of one dosage unit of treatment Bj. Since the term is added to the item category parameters , that is, since the model is linear in the parameters and , it is denoted the Linear Partial Credit Model (LPCM; Fischer & Ponocny, 1994, 1995). The parameters cannot be determined uniquely, so that some normalization is needed; it is customary to put , for all items Ii, and .
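How the dosage-weighted treatment effects enter the category probabilities can be sketched in a few lines. The code below assumes the usual partial credit form, in which the probability of category h is proportional to exp(h × (person parameter + sum of dosage × effect) + item category parameter), with the parameter of category 0 fixed at zero. The function name, argument names, and numerical values are hypothetical and serve only as an illustration of the linear structure described above.

```python
import math

def lpcm_probs(theta, category_params, dosages, effects):
    """Category probabilities for one person, item, and time point.

    category_params[h] is the item category parameter of category h
    (category_params[0] is fixed at 0 as a normalization);
    dosages[j] and effects[j] refer to treatment j.
    """
    shift = sum(q * eta for q, eta in zip(dosages, effects))
    logits = [h * (theta + shift) + beta for h, beta in enumerate(category_params)]
    top = max(logits)                        # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

category_params = [0.0, 0.5, -0.3, -1.2]     # hypothetical item with four categories
effects = [-0.4, 0.25]                       # hypothetical effects per dosage unit
before = lpcm_probs(0.8, category_params, [0, 0], effects)   # no treatment given yet
after = lpcm_probs(0.8, category_params, [2, 1], effects)    # two units of B1, one of B2
print([round(p, 3) for p in before])
print([round(p, 3) for p in after])
```

With all dosages equal to zero the function reduces to an ordinary Partial Credit Model, and each adjacent-category log odds is shifted by exactly the dosage-weighted sum of effects once treatments have been given.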

By applying again the CML method, the person parameters can be eliminated and the so-called 'basic parameters', that is, the item category parameters and the treatment effects, estimated jointly. A problem that arises is that treatment effect studies mostly employ only small samples, so that estimating the item category parameters jointly with the treatment effect parameters may bias the latter. Suppose, however, that a sample of calibration data for the items is available. The information about the item category parameters contained in the calibration data can be used by simply adding the calibration sample to the actual treatment effect data: notice that the model for time point T1 - prior to the treatments, where the dosages qvj1 are zero for all v and j - is just a PCM, so that the calibration data may simply be added to the observations obtained at time point T1. This will stabilize the item category parameter estimates and thus render the treatment effect estimates more precise.

Here we encounter one more payoff of the CML approach: since the person parameters are eliminated, the remaining 'basic parameters' of the LPCM may generalize over different samples or populations. The LPCM is therefore very well suited for those problems which are treated in Meta Analysis, which nowadays is so popular in clinical psychology: the central question in Meta Analysis is, do treatment effects generalize over different studies? In the present models, the H0 of the generalizability of treatment effects over different samples is immediately testable by means of conditional likelihood ratio tests.
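A conditional likelihood ratio test of generalizability can be sketched for the simplest special case. The code below assumes a single dichotomous symptom observed before and after treatment in each study, so that the conditional likelihood of each study is binomial in the counts of changed responses; it is not a general LPCM implementation, and the function names and counts are hypothetical. The p-value shown applies only to the comparison of two studies (one degree of freedom).

```python
import math

def cond_loglik(n01, n10, p):
    """Conditional log-likelihood of the changed responses; p = P(symptom appears)."""
    return n01 * math.log(p) + n10 * math.log(1.0 - p)

def clr_statistic(studies):
    """Conditional LR statistic for H0: one common treatment effect in all studies.

    studies: list of (n01, n10) pairs, all counts assumed to be positive.
    """
    pooled_01 = sum(n01 for n01, _ in studies)
    pooled_10 = sum(n10 for _, n10 in studies)
    p_pooled = pooled_01 / (pooled_01 + pooled_10)
    loglik_h0 = sum(cond_loglik(a, b, p_pooled) for a, b in studies)
    loglik_h1 = sum(cond_loglik(a, b, a / (a + b)) for a, b in studies)
    return max(0.0, 2.0 * (loglik_h1 - loglik_h0))

def p_value_two_studies(statistic):
    """Upper tail of the chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(statistic / 2.0))

same_effect = [(30, 70), (60, 140)]        # hypothetical counts, equal proportions
different_effect = [(30, 70), (70, 30)]    # hypothetical counts, opposite trends
for studies in (same_effect, different_effect):
    stat = clr_statistic(studies)
    print(round(stat, 3), round(p_value_two_studies(stat), 4))
```

With more than two studies the statistic is computed in the same way, but its reference distribution has one degree of freedom fewer than the number of studies, so a general chi-square tail function would be needed.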

There are two important special cases of the LPCM which deserve being mentioned. In the LPCM, there is one parameter for each combination of item and category, that is, the items may have different response categories. In items which all have the same response categories, it often makes sense to decompose the item category parameters, that is, to assign one scalar item (easiness) parameter to each item and one category (attractiveness) parameter to each response category. Thereby, the number of parameters to be estimated is considerably reduced. This yields

 (15)

which is the so-called Linear Rating Scale Model (LRSM; Fischer & Parzer, 1991) as an extension of the Rating Scale Model (Rasch, 1965; Andrich, 1978a,b; Wright & Masters, 1982).

The other special case occurs if the items are dichotomous ('yes' vs. 'no'; correct vs. incorrect; '+' vs. '-'); then the Linear Logistic Test Model (LLTM; Fischer, 1973, 1976, 1995b) for the measurement of change results as a special case. All these models - including of course the simple RM, the PCM and the RSM applied to data obtained at just one time point - can be estimated and hypotheses within these models can be tested by means of a recently developed program that runs on PCs under Windows (LPCM-WIN 1.0; Fischer & Ponocny-Seliger, 1998). This program is an elegant tool especially for treatment effect studies in clinical or applied psychology.

In the family of models described so far it is assumed that all items Ii measure one and the same unidimensional ability or trait. In many domains of applied research in psychology or education, however, the researcher will opt for multidimensional item sets to monitor change of behavior under the effects of treatments. A typical example is the treatment of depressive patients, where the items are various symptoms related to emotion, cognition, body functions, circadian rhythm, work, and social relations. Clearly, such a heterogeneous set of items will never form a unidimensional scale of depressiveness. It is therefore paramount to extend the models so that items may be multidimensional.

The method of dealing with multidimensionality within the family of LPCMs is a simple reinterpretation of the LPCM: to each person Sv, assign one separate person parameter per item Ii. These parameters will be eliminated again, however, by conditioning on one raw score per item, namely,

$$r_{vi} = \sum_t \sum_h h\, x_{vith} \qquad (16)$$

where xvith=1 if Sv at time point Tt gives a response to item Ii in category Ch, and xvith=0 otherwise. The conditional likelihood function will then depend only on the treatment effect parameters and on the item category parameters. Therefore, the person parameters may assume any values whatsoever; they may be correlated or in any other way dependent on each other, or they may also be independent. So the latent traits measured by the items may, for instance, be functions of certain 'primary' factors. Therefore, the model will always be compatible with any substantive theory about the structure of the latent traits.

To obtain a multidimensional LPCM, all we have to do is to rewrite (15) as

 (17)

where the underlie the restrictions
 (18)

with Gg for the g-th treatment group and qgjt for the dosage of treatment Bj given to all patients of group Gg up to time point Tt. We now have to condition on the rvi defined in (16). In essence, the procedure boils down to a reinterpretation of the model and of the data. In practice, it is sometimes known beforehand that certain subsets of items Jd, where the Jd are given subsets of the item sample, measure the same latent dimension Dd, so that, within person Sv, one parameter underlies the responses to all items of Jd. In such cases, the raw scores on which the likelihood is to be conditioned become

$$r_{vd} = \sum_{i \in J_d} \sum_t \sum_h h\, x_{vith} \qquad (19)$$

Depending on which subsets Jd - which may just be single items - are unidimensional, the data have to be rearranged and statistics computed in appropriate form.
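The rearrangement of the data into conditioning statistics can be sketched as follows. The code assumes that each observation is stored as a (person, item, time point, category) record, with categories scored 0, 1, ..., m; since xvith equals 1 only for the observed category, the raw score is simply the sum of the observed category scores over the relevant items and time points. The function name, the record layout, and the sample data are hypothetical.

```python
from collections import defaultdict

def raw_scores(responses, item_to_dimension=None):
    """Sum the observed category scores per person and dimension.

    responses: iterable of (person, item, time_point, category) records.
    item_to_dimension: optional dict mapping each item to its subset Jd;
    if omitted, every item is treated as a dimension of its own.
    """
    scores = defaultdict(int)
    for person, item, _time_point, category in responses:
        dimension = item if item_to_dimension is None else item_to_dimension[item]
        scores[(person, dimension)] += category
    return dict(scores)

responses = [
    ("S1", "I1", 1, 2), ("S1", "I1", 2, 1),
    ("S1", "I2", 1, 0), ("S1", "I2", 2, 3),
    ("S2", "I1", 1, 1), ("S2", "I1", 2, 0),
]

per_item = raw_scores(responses)                              # one score per item
per_subset = raw_scores(responses, {"I1": "D1", "I2": "D1"})  # both items in one subset
print(per_item)
print(per_subset)
```

Treating every item as its own subset yields one raw score per person and item, while mapping several items to a common subset pools their scores into a single conditioning statistic per person and dimension.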

To specify the subsets Jd and to handle the data accordingly, various tools are provided by the computer program LPCM-WIN 1.0. The program supports incomplete designs where different samples of items from each subset Jd are selected at different time points, which has the advantage that memory and satiation effects, which often are quite troublesome in repeated measurement designs, are avoided. To summarize this, LPCM-WIN 1.0 allows for

• any number of unidimensional subsets of items, Jd, which may also be single items,
• the specification of designs with any number of treatment groups, treatment combinations, and time points, including a single time point,
• the selection of different samples of items for different time points,
• the choice of any of the models PCM, RSM, or - in the dichotomous case - RM as the underlying model to which the linear structure (18) is added,
• the inclusion of calibration data and the combination of data from different studies,
• the formulation and the testing of a host of hypotheses on treatment effects (on change), in particular tests for the generalizability of treatment effects over groups of items (symptoms) or over person groups (over different studies), and
• the assessment of change in individuals ('single case studies') when the items are dichotomous and the item parameters are known.

The family of LPCMs and the LPCM-WIN 1.0 software therefore are a rather flexible tool for research in clinical (and also in applied) psychology, fitting particularly well to the underlying psychological concepts and to the tests commonly used in these areas. Experience gained so far with IRT applications in these areas is quite positive.

#### References

• Aczél, J. (1966). Lectures on functional equations and their applications. New York: Academic Press.
• Andersen, E. B. (1973a). Conditional inference for multiple-choice questionnaires. British Journal of Mathematical and Statistical Psychology, 26, 31-44.
• Andersen, E. B. (1973b). Conditional inference and models for measuring. Copenhagen: Mentalhygiejnisk Forlag.
• Andrich, D. (1978a). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.
• Andrich, D. (1978b). Application of a psychometric rating model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2, 581-594.
• Fischer, G. H. (1987). Applying the principles of specific objectivity and generalizability to the measurement of change. Psychometrika, 52, 565-587.
• Fischer, G. H. (1991). On power series models and the specifically objective assessment of change in event frequencies. In J.-C. Falmagne & J.-P. Doignon (Eds.), Mathematical Psychology: Current Developments (pp. 293-310). New York: Springer-Verlag.
• Fischer, G. H. (1995). The derivation of polytomous Rasch models. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models. Foundations, recent developments, and applications (pp. 157-180). New York: Springer-Verlag.
• Fischer, G. H., & Molenaar, I. W. (Eds.) (1995). Rasch models. Foundations, recent developments, and applications. New York: Springer-Verlag.
• Fischer, G. H., & Ponocny, I. (1995). Extended rating scale and partial credit models for assessing change. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models. Foundations, recent developments, and applications (pp. 353-370). New York: Springer-Verlag.
• Fischer, G. H., & Ponocny-Seliger, E. (1998). Structural Rasch modeling. Handbook of the usage of LPCM-WIN 1.0. Groningen: ProGAMMA.
• Glas, C. A. W., & Verhelst, N. D. (1989). Extensions of the partial credit model. Psychometrika, 54, 635-659.
• Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
• Noack, A. (1950). A class of random variables with discrete distributions. Annals of Mathematical Statistics, 21, 127-132.
• Patil, G. P. (1965). On the multivariate generalized power series distribution and its application to the multinomial and negative multinomial. In G. P. Patil (Ed.), Classical and contagious discrete distributions (pp. 183-194). London: Pergamon Press.
• Rasch, G. (1968). A mathematical theory of objectivity and its consequences for model construction. Paper presented at the European Meeting on Statistics, Econometrics, and Management Science, Amsterdam, September 2-7, 1968.
• Rasch, G. (1972). Objectivitet i samfundsvidenskaberne et metodeproblem. [Objectivity in the social sciences as a methodological problem.] Nationaløkonomisk Tidsskrift, 110, 161-196.
• Rasch, G. (1973). Two applications of the multiplicative Poisson model in road accidents statistics. Invited paper presented at the 1973 Meeting of the International Statistical Institute in Vienna.
• Rasch, G. (1977). On specific objectivity. An attempt at formalizing the request for generality and validity of scientific statements. In M. Blegvad (Ed.), The Danish Yearbook of Philosophy (pp. 58-94). Copenhagen: Munksgaard.

ODL-Team
Wed Jan 12 2000