5

I have 5 categories, each divided into the subcategories low, medium and high. An object can belong to one or more of these categories, with a number between 1 and 100 in each subcategory, but the sum for each category cannot exceed 100. Is there a way to summarise this into a single number?

Any hints or directions are very welcome.

johannes
  • As a hypothetical example, suppose the categories are named "A" through "E". Could an object simultaneously have values of 10 for A.Low, 20 for B.High, 50 for B.Medium, 15 for B.Low, and 5 for C.High (and zeros in all other subcategories)? Or are you treating all five categories independently so that, e.g., the object could have 100 in A.Medium, 90 in B.High and 10 in B.Medium, 50 in C.Low, 10 in E.High (and zeros in the other 10 subcategories)? In the former case you would seek a single index; in the latter, five index values (one for each of A, B, ..., E). – whuber Oct 26 '10 at 15:58
  • Thanks for your answer. Yes, an object could have A.low = 20, A.medium = 40, B.High = 40 and C.High = 5. All others would be zero. So yes, I am looking for a single index. – johannes Oct 26 '10 at 19:07
  • OK, just to be clear, because this is an important detail: is there a natural hierarchy to the categories and subcategories? Could one assume that the ordering is A.High A.Medium A.Low B.High ... E.Low (or something comparable)? (If so, is there anything at all to prevent us from thinking of these 15 classifications as forming a single *ordinal* variable?) Also, it would help to clarify what the numerical values are intended to mean. *E.g.*, should they be interpreted as fuzzy degrees of membership, as probability of membership, as combined results of multiple measurements, or something else? – whuber Oct 26 '10 at 19:24
  • For example: if I am interested in the overall urbanization of a state, my five categories would be: roads, residential areas, industrial areas, airports and intense agriculture. Each of the categories is divided into 3 subcategories (low, mid, high), with the percentage of area falling into each of them. Now I would like to express the degree of urbanization in one number without losing too much detail. Maybe I should get rid of the subcategories? Thanks a lot for your effort. – johannes Oct 26 '10 at 19:44

2 Answers

6

Your solution should depend on how you plan to use the information. If, for instance, you intend to use these data as potential explanatory variables in a model, then you are better off without an index, because that is likely to cause a loss of explanatory power. Just use the original variables. If, on the other hand, you would like to make a map to portray degrees of urbanization in a simple manner, then an index makes sense.

What remains in the second case is to make "urbanization" operational. Suppose, for the sake of exploring this issue, that instead of using the word "urbanization" you used some term that was completely unintelligible to me. How would I go about finding out what it meant? There is nothing in the data that will reveal the answer. What you need is either a quantitative definition of urbanization in terms of these 15 variables or else you need a 16th variable that correctly captures the degree of urbanization in some "test" or "calibration" cases. Then you could statistically explore the correlations among the 15 original variables and the degree of urbanization with the aim of finding a combination of those 15 that is reliably associated with urbanization. This can be done using canonical correlation analysis.
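As a rough illustration of the calibration idea: when the 16th variable is a single number per case, canonical correlation reduces to ordinary multiple regression, so the weights can be fitted by least squares. The sketch below assumes NumPy and uses entirely made-up data; `fit_index_weights`, `apply_index`, and `true_w` are hypothetical names and values, not anything taken from the question.

```python
import numpy as np

def fit_index_weights(X_cal, y_cal):
    """Least-squares fit of a known score on the 15 variables (calibration cases)."""
    X_cal = np.asarray(X_cal, dtype=float)
    y_cal = np.asarray(y_cal, dtype=float)
    A = np.column_stack([np.ones(len(X_cal)), X_cal])  # prepend an intercept column
    # lstsq returns the minimum-norm solution, so collinear columns do not break it
    coef, *_ = np.linalg.lstsq(A, y_cal, rcond=None)
    return coef[0], coef[1:]

def apply_index(X, intercept, weights):
    """Apply the fitted combination to new objects to get one index value each."""
    return intercept + np.asarray(X, dtype=float) @ weights

# Made-up example: 40 calibration cases, 15 subcategory percentages each.
# Values are drawn from [0, 30], so no category (3 subcategories) exceeds 100.
rng = np.random.default_rng(0)
X_cal = rng.uniform(0, 30, size=(40, 15))
true_w = np.linspace(0.1, 1.5, 15)                  # hypothetical "true" weights
y_cal = X_cal @ true_w + rng.normal(0, 1, size=40)  # hypothetical expert score

intercept, weights = fit_index_weights(X_cal, y_cal)

# Score objects that were not part of the calibration set
X_new = rng.uniform(0, 30, size=(100, 15))
index_new = apply_index(X_new, intercept, weights)
true_index_new = X_new @ true_w
```

The fitted weights are only as meaningful as the calibration score itself, which is the point of the paragraph above.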

An alternative that (I suspect) is frequently used is just to dodge these fundamental considerations and make up an answer. Some people invent "weights" for each of the categories, form the weighted sum of the values, and declare that value to be whatever they would like it to be, whether it's "urbanization" or "environmental impact" or whatever. The problems of that approach in an objective or scientific context ought to be obvious, but it is an easy solution.

An intermediate approach remains agnostic about what "urbanization" might mean and merely seeks a parsimonious description of the variables you have. In a special situation they might all be a mixture of two extremes, allowing the vector of 15 values to be described by a single parameter (the mixture proportions). You could take that parameter to be an "index" of the data and then go on to explore the extent to which it seems to agree with your idea of urbanization. This approach is carried out using Principal Components Analysis (PCA) or Factor Analysis.
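A minimal sketch of that idea, assuming NumPy and made-up data: objects are generated as mixtures of two hypothetical extreme profiles (`rural` and `urban`, invented for illustration), and the score on the first principal component is taken as the index. The function name `first_pc_index` is likewise hypothetical.

```python
import numpy as np

def first_pc_index(X):
    """Return (scores, loadings, explained) for the first principal component.

    X has one row per object and one column per variable.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each variable
    # Rows of Vt are the principal axes of the centered data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[0]
    # The sign of a principal axis is arbitrary; make the largest loading positive
    if loadings[np.argmax(np.abs(loadings))] < 0:
        loadings = -loadings
    scores = Xc @ loadings                       # one index value per object
    explained = s[0] ** 2 / np.sum(s ** 2)       # share of total variance captured
    return scores, loadings, explained

# Made-up extremes: 5 categories x (low, medium, high); each category sums to <= 100
rural = np.array([20, 5, 0, 30, 5, 0, 5, 0, 0, 1, 0, 0, 40, 10, 0], dtype=float)
urban = np.array([10, 30, 50, 10, 30, 40, 10, 20, 30, 5, 5, 5, 5, 0, 0], dtype=float)

rng = np.random.default_rng(1)
t = rng.uniform(0, 1, size=200)                  # hidden mixture proportion per object
X = np.outer(1 - t, rural) + np.outer(t, urban)
X = np.clip(X + rng.normal(0, 0.5, size=X.shape), 0, None)  # small noise, no negatives

scores, loadings, explained = first_pc_index(X)
```

In this contrived case the scores track the hidden proportion `t` closely and `explained` is near 1. With real data, a low `explained` value would be a warning that one number cannot summarise the 15 variables well.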

All these methods (canonical correlation, PCA, FA) are typically available in full-featured statistical software. PCA has been discussed at length on this site: explore the "PCA" tag for more information.

whuber
  • Many thanks for this detailed answer, it makes a lot of things clearer and I think I will follow your suggestion regarding PCA. – johannes Oct 26 '10 at 21:06
0

There is also an alternative approach that I have recently used, from Michael Anderson (2005), "Multiple Inference and Gender Differences in the Effects of Early Intervention: A Reevaluation of the Abecedarian, Perry Preschool, and Early Training Projects". He creates a summary index of his outcome variables. The math is pretty straightforward; the complicated part is knowing how to code it in whichever statistical package you use. I can tell you that it is possible in both Stata and R.
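For illustration only, here is a sketch in Python (not Stata or R) of an inverse-covariance-weighted summary index of the kind that paper describes, as I understand it: standardize each variable, then weight by the row sums of the inverse covariance matrix so that highly correlated variables count for less. The paper standardizes against a control group; with no control group here, the sketch assumes standardizing on the whole sample. The function name and the data are made up.

```python
import numpy as np

def inverse_covariance_index(Y):
    """Summary index: weighted mean of standardized columns of Y.

    Weights are proportional to the row sums of the inverse covariance matrix
    of the standardized variables. Assumes no column is constant and the
    covariance matrix is invertible.
    """
    Y = np.asarray(Y, dtype=float)
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)   # standardize on the full sample
    sigma = np.cov(Z, rowvar=False)                    # covariance of standardized data
    raw = np.linalg.solve(sigma, np.ones(Z.shape[1]))  # Sigma^{-1} 1 (row sums of inverse)
    weights = raw / raw.sum()                          # normalize so weights sum to 1
    return Z @ weights, weights

# Made-up data: b nearly duplicates a, while c carries independent information
rng = np.random.default_rng(2)
a = rng.normal(size=500)
b = a + 0.5 * rng.normal(size=500)
c = rng.normal(size=500)
Y = np.column_stack([a, b, c])

index, weights = inverse_covariance_index(Y)
```

Here `c` receives the largest weight, because `a` and `b` largely repeat each other. All variables should be oriented so that larger values mean "more" of the concept being summarised before the index is computed.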

NOS