How to create one score from a mixed set of positive and negative variables?

Question

I have 3,000 observations (administrative communities) characterized by five variables. Four of them work in the direction 'the more, the worse' and one goes in the opposite direction.

I'd like to create one score or an ordered list of these observations that will best take into account all of those five variables.

I have tried clustering using the MCLUST package in R; it gives some meaningful results, but it's hard to decide on the ordering of observations on the basis of cluster membership.

My second attempt was to run PCA and extract the first component, which is closer to what I'd like to get.

What other solutions (R- or Stata-based preferably) could I use to deal with this problem?

When you say "will best take into account", what do you mean by "best"? E.g. do you want the variables to be weighted equally, or do you want to give some variables more weight than others? Or a different way to put it, what would be the purpose of the score you'd like to get? — SheldonCooper, Mar 11 '11 at 18:25
@SheldonCooper: Thanks for your comment. The purpose of the score would be to determine the relative 'position' of observations. In a way, I'd like to be able to say what the standing of one particular community is within the population. Initially my approach would probably avoid weights (or use data-driven weights?). But it might be possible to obtain these weights from another dataset and use them later on as well, in a similar vein to what Gordon (1995) suggests: http://jech.bmj.com/content/49/Suppl_2/S39.abstract — radek, Mar 12 '11 at 13:24

score 7 · Accepted Answer · edited Mar 13 '11 at 01:08

You might consider u-scores as defined in Wittkowski, K. M., Lee, E., Nussbaum, R., Chamian, F. N. and Krueger, J. G. (2004), Combining several ordinal measures in clinical studies. Statistics in Medicine, 23: 1579–1592.

The basic idea is that, for each observation, you count how many other observations it is definitely better than (four variables lower, one higher) and how many it is definitely worse than, and then combine the two counts into a single score.
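A minimal Python sketch of this counting idea, not the authors' implementation. It assumes that "definitely better" means at least as good on every variable and strictly better on at least one, and that the combined score is the number of observations beaten minus the number that beat it. The function name `u_scores` and the toy data are made up for illustration.

```python
def u_scores(rows, higher_is_better):
    """Return one score per row: (# rows it beats) - (# rows that beat it)."""

    def oriented(row):
        # Flip 'the more, the worse' variables so that larger is always better.
        return [v if good else -v for v, good in zip(row, higher_is_better)]

    data = [oriented(r) for r in rows]

    def dominates(a, b):
        # Assumption: a is "definitely better" than b when it is worse on no
        # variable and strictly better on at least one.
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    scores = []
    for a in data:
        better_than = sum(dominates(a, b) for b in data)
        worse_than = sum(dominates(b, a) for b in data)
        scores.append(better_than - worse_than)
    return scores


# Toy data: four 'the more, the worse' variables, then one 'the more, the better'.
communities = [
    [1, 1, 1, 1, 9],  # best on everything
    [2, 2, 2, 2, 8],
    [3, 1, 3, 3, 7],  # not comparable with the row above (mixed wins and losses)
    [4, 4, 4, 4, 1],  # worst on everything
]
directions = [False, False, False, False, True]
print(u_scores(communities, directions))  # [3, 0, 0, -3]
```

In this sketch the comparison is made on all five variables at once, so there is no separate per-variable score to sum afterwards. Pairs of observations that cannot be ordered (each wins on some variable) contribute nothing to either count.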

answered Mar 11 '11 at 19:47 by Aniko · edited Mar 13 '11 at 01:08 by Jeromy Anglim

Just one question. I understand how to generate a score for one variable depending on the relative position of an observation. What is the next step for combining the scores of the five variables? A simple sum? – radek Mar 12 '11 at 14:03

Jeromy Anglim · Answer 2 · 2011-03-13T01:06:48.167

Data or Theory Driven?

The first issue is whether you want the composite to be data driven or theory driven. If you wish to form a composite variable, you probably think that each component variable is important in measuring some overall domain.

In this case, you are likely going to prefer a theoretical set of weights. If, alternatively, you are interested in whatever is shared or common amongst the component variables, at the risk of not including one of the variables because it measures something that is orthogonal or less related to the remaining set, then you might want to explore data driven approaches.

This question maps onto the discussion in the structural equation modelling literature about reflective versus formative measures (e.g., see here).

Whatever you do, it is important to align your measurement with your actual research question.

Theory Driven

If the composite is theory driven, then you will want to form a weighted composite of the component variables, where each weight reflects the theoretical importance you assign to that component. If the variables are ordinal, you'll have to think about how to scale them. After scaling each component variable, you'll have to think about the theoretical relative weighting and about differences in the standard deviations of the variables: in a sum of raw scores, variables with larger standard deviations implicitly carry more weight. One simple strategy is to convert all component variables into z-scores and sum the z-scores. If some component variables are positive and others negative, you'll need to reverse either just the negative or just the positive ones so that all components point in the same direction.
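The z-score strategy can be sketched in a few lines of Python using only the standard library. The function names, the equal default weights, and the toy data are assumptions for illustration; here the composite is oriented so that a higher value means a worse standing.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardise one variable to zero mean and unit (sample) standard deviation."""
    m, s = mean(values), stdev(values)  # fails if the variable is constant
    return [(v - m) / s for v in values]


def composite(columns, signs, weights=None):
    """Weighted sum of z-scores; a sign of -1 reverses a variable before summing."""
    weights = weights or [1.0] * len(columns)
    z_cols = [z_scores(col) for col in columns]
    n = len(columns[0])
    return [
        sum(w * s * z[i] for w, s, z in zip(weights, signs, z_cols))
        for i in range(n)
    ]


# Toy data: each inner list is one variable measured on four communities.
columns = [
    [10, 20, 30, 40],       # 'the more, the worse'
    [1, 2, 3, 4],           # 'the more, the worse'
    [5, 5, 6, 8],           # 'the more, the worse'
    [0.1, 0.2, 0.2, 0.5],   # 'the more, the worse'
    [90, 70, 60, 20],       # 'the more, the better', so it is reversed
]
signs = [1, 1, 1, 1, -1]
scores = composite(columns, signs)
print(scores)
```

Passing a `weights` list (for example `[2, 1, 1, 1, 1]`) is where a theoretical weighting would enter; with the default, every standardised variable counts equally.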

I wrote a post on forming composites which addresses several scenarios for forming composites.

Theory-driven approaches can be implemented easily in any statistical package. score.items in the psych package is one function that makes it a little easier, but it is limited. You might just write your own equation using simple arithmetic, and perhaps the scale function.

Data Driven

If you are more interested in being data driven, then there are many possible approaches.

Taking the first principal component sounds like a reasonable idea.
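As a rough illustration of what the first component does, here is a standard-library Python sketch that standardises the variables, builds their correlation matrix, and finds the leading eigenvector by power iteration. In practice a statistical package's PCA routine would be used; the function names, the starting vector, and the toy data are assumptions.

```python
from statistics import mean, pstdev


def standardise(col):
    m, s = mean(col), pstdev(col)
    return [(v - m) / s for v in col]


def first_pc(columns, iterations=200):
    """Return (loadings, scores) for the first principal component."""
    z = [standardise(c) for c in columns]
    p, n = len(z), len(z[0])
    # Correlation matrix of the standardised variables.
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(p)]
            for i in range(p)]
    # Power iteration converges to the leading eigenvector. Assumption: the
    # start vector (first basis vector) is not orthogonal to that eigenvector.
    w = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        w = [sum(corr[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    # The sign of a component is arbitrary; orient it so variable 1 loads positively.
    if w[0] < 0:
        w = [-x for x in w]
    scores = [sum(w[j] * z[j][i] for j in range(p)) for i in range(n)]
    return w, scores


# Toy data: four 'the more, the worse' variables and one 'the more, the better'.
columns = [
    [10, 20, 30, 40],
    [1, 2, 3, 4],
    [5, 5, 6, 8],
    [0.1, 0.2, 0.2, 0.5],
    [90, 70, 60, 20],
]
loadings, scores = first_pc(columns)
print(loadings)
print(scores)
```

With this toy data the fifth variable receives a negative loading, so the component reverses it without being told to. The loadings act as data-driven weights, which is the contrast with fixing the weights by theory.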

If you have ordinal variables you might think about categorical PCA which would allow the component variables to be reweighted. This could automatically handle the quantification given the constraints you provide.

(+1) Of note, another interesting discussion on formative vs. reflective models can be found in Chapter 3 of *Measuring the Mind*, by Denny Borsboom (Cambridge, 2005). — chl, Mar 12 '11 at 08:40

score 2 · Answer 3 · answered Mar 13 '11 at 03:12

For a non-ordinal measure, you could try MDS (multidimensional scaling), which can be done easily in R. It will try to arrange the points on a line (1D in your case) in such a way that the distances between points are preserved as well as possible.
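For intuition only, a one-dimensional classical (Torgerson) MDS can be sketched in standard-library Python: square the distances, double-centre them, and take the leading eigenvector by power iteration. This is one variant of MDS, chosen here as an assumption because it is short to write down; the function name and toy points are hypothetical.

```python
import math
import random


def classical_mds_1d(dist, iterations=100, seed=0):
    """One coordinate per point from a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J, with J the centring matrix.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    def times_b(v):
        return [sum(bij * vj for bij, vj in zip(row, v)) for row in b]

    # Random start, so it is very unlikely to be orthogonal to the leading eigenvector.
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        v = times_b(v)
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    eigenvalue = sum(x * y for x, y in zip(v, times_b(v)))
    return [x * math.sqrt(eigenvalue) for x in v]


# Toy points that lie exactly on a line, so 1D can reproduce every distance.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (12.0, 16.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords = classical_mds_1d(dist)
print(coords)
```

The direction of the resulting axis is arbitrary, so it has to be checked against the variables before reading it as "better" or "worse". The nested lists are fine for a handful of points but would be slow for 3,000 observations, where a numerical library is the practical choice. With Euclidean distances on standardised variables, classical MDS gives the same coordinates as the first principal component up to sign.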

Some general comments: as you probably realize, the question is pretty vague, and not much can be said without knowing more about the data. For example, normalizing the variables (to zero mean and unit variance) may or may not be appropriate; weighting all variables equally may or may not be appropriate; etc. If this is not an exploratory analysis and you do have some 'correct' score in mind, then it may be appropriate to learn a set of weights, either on a different dataset or on a subset of your current dataset, and use these weights instead.

score 1 · Answer 4 · answered Mar 30 '11 at 18:54

I am sorry, as this may not be a straight answer to your question, but if you are using this "total score" as a predictor of something, why don't you try regression and evaluate the results with the AUC of the ROC curve?

Or, the other way, maybe use neural networks / random forests / support vector machines on them to predict the given outcome?
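If a binary outcome were available, the AUC of any candidate score could be checked with a short Python sketch like the one below. It uses the rank interpretation of the AUC (the share of positive/negative pairs in which the positive case has the higher score, ties counting half). The outcome and score values are invented for illustration; the question itself does not mention an outcome variable.

```python
def auc(scores, outcomes):
    """Area under the ROC curve for a score against a 0/1 outcome."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    # Count pairs where the positive case outranks the negative one; ties count half.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical composite scores and a hypothetical binary outcome.
scores = [-1.2, -0.4, 0.1, 0.3, 1.5]
outcomes = [0, 0, 1, 0, 1]
print(auc(scores, outcomes))  # 5 of the 6 pairs are ordered correctly
```

An AUC near 0.5 would mean the score carries little information about the outcome, while a value near 1 would mean it separates the two groups almost perfectly.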

Regards Luke
