CTT is Classical Test Theory, IRT is Item Response Theory.
IRT can handle anything CTT can. But I am curious why IRT is seldom mentioned on this site.
Can IRT supersede CTT nowadays?
To expand on and clean up my previous comment, I would like to add some considerations, and I will deliberately limit myself to measurement theories, since this is what CTT and IRT are all about. By the way, you actually forgot about Generalizability Theory, which provides an interesting approach when studying rater-by-subject interactions, or more generally in reliability analysis.
I don't have a definitive answer to your two questions: I would say "yes and no", because both approaches are complementary when it comes to developing a questionnaire. If you start with an already existing and validated scale, then IRT is probably enough --- but sooner or later you will have to report individual scores and face end users who don't want to spend too much time dealing with the logit scale and the constraints put on the posterior distribution of ability. I guess the real question amounts to quantifying how well raw scores approximate IRT scores for the latent traits. (In 15 years, I have never seen anyone report ability scores on the "latent trait" scale, except in the case of adaptive testing.)
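To make the raw-score vs. latent-score contrast concrete, here is a minimal sketch (my own helper, not from any particular package; item difficulties are assumed known, e.g. from a prior calibration) that computes a maximum likelihood estimate of ability under a Rasch model. Under that model the raw score is a sufficient statistic for ability, so the two orderings agree; what differs is the nonlinear logit metric that end users balk at.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rasch_theta_mle(responses, difficulties):
    """ML ability estimate under a Rasch model with known item difficulties.

    Perfect and zero raw scores have no finite MLE; the bounds just clip them.
    """
    def neg_loglik(theta):
        p = 1.0 / (1.0 + np.exp(-(theta - difficulties)))
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_loglik, bounds=(-6, 6), method="bounded").x

difficulties = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])  # logit scale (made up)
responses = np.array([1, 1, 1, 0, 0])                  # one respondent
print("raw score:", responses.sum())                   # 3 out of 5
print("theta (logits):", round(rasch_theta_mle(responses, difficulties), 2))
```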
IRT has some nice properties that are lacking in CTT, which assumes that the standard error of measurement is the same for all individuals (an assumption that rarely holds in practice) and offers no way to separate items from respondents: this is known as the separability, or specific objectivity, issue, and it partly motivated the development of the Rasch model. Then we quickly come to the problem of overfitting: while the Rasch model was devised as a measurement model, isn't adding more and more parameters (discrimination and pseudo-guessing in the 2-PL and 3-PL models) a way to make the model fit the data better, rather than to validate a measurement model?
That the model is not true is certainly correct, no models are --- not even the Newtonian laws. (...) Models should not be true, but it is important that they are applicable. --- G. Rasch, Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen, Denmark: Danmarks Paedagogiske Institut, 1960.
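For concreteness, the nested models mentioned above can be written in standard notation (not taken from the original post) as

$$P(X_{ij} = 1 \mid \theta_i) \;=\; c_j + (1 - c_j)\,\frac{\exp\{a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}},$$

where $b_j$ is the item difficulty. The Rasch model is the special case $a_j = 1$ and $c_j = 0$; the 2-PL frees the discrimination $a_j$; the 3-PL adds the pseudo-guessing lower asymptote $c_j$. Each extra parameter improves fit, which is exactly the overfitting worry raised above.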
IRT is more flexible in that it can handle dichotomous and polytomous items with an underlying probabilistic model of response (or of the transition from one response category to the next). It has close relationships with confirmatory factor analysis (CFA), which was initially concerned with continuous responses, and it aims to provide a unified approach to the measurement of latent traits and classes (see Bartholomew, D. and Knott, M., Latent Variable Models and Factor Analysis, Arnold (1999)). This equivalence has been known since the 80s, and CFA can do much the same job; later on, it was the marginal approach from mixed-effects modeling that somewhat took the lead.
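The IRT-CFA correspondence can even be written down: for a normal-ogive model fitted to tetrachoric correlations, the 2-PL discrimination and difficulty follow from the standardized factor loading $\lambda_j$ and threshold $\tau_j$ (the classical result of Takane and de Leeuw, 1987), up to the usual scaling constant $D \approx 1.702$ when moving to the logistic metric:

$$a_j = \frac{\lambda_j}{\sqrt{1 - \lambda_j^2}}, \qquad b_j = \frac{\tau_j}{\lambda_j}.$$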
IRT is useful for adaptive testing and item bank calibration (including item anchoring), and for test reduction, i.e. when you want to shorten a 40-item questionnaire to a 10-item one while preserving comparable precision of measurement. Nothing in CTT allows that.
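What makes test reduction principled in IRT is the test information function: each 2-PL item contributes $I_j(\theta) = a_j^2 P_j(\theta)(1 - P_j(\theta))$, information adds across items, and $\mathrm{SE}(\theta) = 1/\sqrt{I(\theta)}$. A toy sketch with made-up item parameters, keeping the 10 items most informative around a target ability:

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.uniform(0.5, 2.0, size=40)   # discriminations (made up)
b = rng.normal(0.0, 1.0, size=40)    # difficulties (made up)

def item_info(theta, a, b):
    # Fisher information of a 2-PL item: a^2 * P(theta) * (1 - P(theta))
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

theta_target = 0.0                          # ability region of interest
keep = np.argsort(item_info(theta_target, a, b))[::-1][:10]

se_full = 1 / np.sqrt(item_info(theta_target, a, b).sum())
se_short = 1 / np.sqrt(item_info(theta_target, a[keep], b[keep]).sum())
print(f"SE(theta) with 40 items: {se_full:.2f}, with the best 10: {se_short:.2f}")
```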
Lastly, IRT is helpful when it comes to assessing measurement invariance (differential item functioning across groups or over time), as is the case for multi-group confirmatory factor analysis; this is harder under the CTT framework because of the assumption of constant measurement error mentioned above. Moreover, we have model fit indices (and everybody loves p-values).
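One common screen for DIF, short of a full IRT fit, is Swaminathan and Rogers' logistic regression approach: regress each item response on the rest score, group, and their interaction; a group effect flags uniform DIF, the interaction flags non-uniform DIF. A sketch on simulated data (all names and numbers made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
group = rng.integers(0, 2, n)        # 0 = reference, 1 = focal group
rest = rng.normal(0.0, 1.0, n)       # rest score (proxy for ability)
# Simulate uniform DIF: the item is harder for the focal group.
logit = 0.8 * rest - 0.7 * group
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

X = sm.add_constant(np.column_stack([rest, group, rest * group]))
fit = sm.Logit(y, X).fit(disp=0)
print(fit.summary(xname=["const", "rest", "group", "rest:group"]))
# Uniform DIF: test the 'group' coefficient; non-uniform: 'rest:group'.
```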
However, CTT remains useful in test development or the construction of item banks. Think about item analysis (i.e., detecting floor or ceiling effects, wording issues, distractor behavior in the case of MCQ items, etc.), the development of parallel forms of a questionnaire, or the analysis of score reliability. I should note that the latter is what really motivates the use of CTT: its aim is ultimately to analyse the test as a whole, and the unit of analysis is rarely the item, unlike in IRT, which provides tons of ways to assess item and person fit; whatever the decision threshold for declaring an item or person a possible outlier, we have a p-value (again). The reliability of test scores can be estimated as well.
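The CTT workhorse for that last point is Cronbach's $\alpha = \frac{k}{k-1}\bigl(1 - \sum_i s_i^2 / s_X^2\bigr)$, usually reported alongside corrected (rest-score) item-total correlations. A minimal sketch, with purely random data so $\alpha$ should hover near 0:

```python
import numpy as np

def cronbach_alpha(X):
    # X: n_subjects x k_items score matrix
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def rest_score_corr(X):
    # Corrected item-total correlation: item vs. total of the other items.
    total = X.sum(axis=1)
    return np.array([np.corrcoef(X[:, j], total - X[:, j])[0, 1]
                     for j in range(X.shape[1])])

X = np.random.default_rng(1).integers(0, 2, size=(200, 10)).astype(float)
print("alpha:", round(cronbach_alpha(X), 2))
print("rest-score correlations:", rest_score_corr(X).round(2))
```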
Note also that there exist many models or techniques that do not, strictly speaking, belong to the IRT family of psychometric models, and that more generally are not concerned with ipsative or other forms of tests. Multidimensional scaling and conjoint analysis are two such examples. While those approaches sit quite at an angle to CTT itself, they remain anchored in the idea of selecting, ranking, or classifying a set of objects.
Sidenote:
But I am curious why IRT is seldom mentioned on this site.
There's an irt tag, which is already good news, IMO. When I first added the psychometrics tag 10 years ago, I did not expect so many questions and nice responses related to factor-analytic methods, scale validation, reliability analysis, and the modeling of scale scores.