1

I have demographic data for various areas of the US, given as proportions of the population in five buckets: Child, Teen, Student, Adult and Retired. For example, for Texas I have:

Child: 0.2, Teen: 0.2, Student: 0.1, Adult: 0.3, Retired: 0.2

I want to combine these into one single measure that I can use in a model. I want a measure which captures the age distribution of an area, but I'm not sure which approach to take.

Karen
  • Could you explain what aspects of the age distribution are of interest and how you might be using them in the model? – whuber May 31 '19 at 18:07

3 Answers

2

This answer focuses on what I guess is meant to be an ordinal aspect of the categories, taking age into account.

Do you know (even approximately) the numerical age boundaries or midpoints of the categories? Perhaps the most significant challenge would be to distinguish between the 'Teen' and 'Student' categories, and between the 'Adult' and 'Retired' categories.

For example, if you could make meaningful guesses $m_i$ for the category age midpoints, you could use $a = \sum_{i=1}^5 m_ip_i$ (where the $p_i$ are the category proportions) as an approximate average age. Maybe use something like $m = (5, 15, 20, 40, 75),$ if 'Student' means a student beyond high school. (Looking at more detailed US demographic tables of age distributions might be of some help.)

Perhaps stretching the idea one step too far, you might even try to make sense of an approximate measure of variability, such as $v_1 = \sum_{i=1}^5 p_i(m_i - a)^2,\,v_2 = \sqrt{v_1},$ or $v_3 = \sum_{i=1}^5 p_i|m_i - a|.$
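As a rough sketch of these formulas, the R code below applies them to the Texas proportions from the question. The midpoints in `m` are only the guessed values suggested above, not known category boundaries, and the variable names are made up for illustration.

```r
# Proportions for Texas from the question (Child, Teen, Student, Adult, Retired)
p_texas <- c(Child = 0.2, Teen = 0.2, Student = 0.1, Adult = 0.3, Retired = 0.2)

# ASSUMPTION: guessed category midpoints in years; replace with real values if known
m <- c(5, 15, 20, 40, 75)

a  <- sum(m * p_texas)            # approximate average age
v1 <- sum(p_texas * (m - a)^2)    # variance-like measure
v2 <- sqrt(v1)                    # standard-deviation-like measure
v3 <- sum(p_texas * abs(m - a))   # mean absolute deviation from a

round(c(a = a, v1 = v1, v2 = v2, v3 = v3), 3)
```

With these assumed midpoints, the approximate average age for Texas comes out at 33 years, with $v_2 \approx 24.6$ and $v_3 = 21.$ Different guesses for the midpoints would change all of these values.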

Note: Of course, useful answers depend on the use you want to make of your combined 'measure'. You mentioned age, so I focused on that. But the different categories might have varying impacts on the consumer economy, kinds of social services required, voting patterns, and so on.

BruceET
1

Use Shannon Entropy, $H(x)=-\sum_i p_i \log_2(p_i)$, where $p_i$ is the proportion in each category.
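A minimal sketch of this calculation in R, applied to the Texas proportions from the question. The function name `shannon_entropy` is made up for illustration, and dropping zero proportions (treating $0 \log_2 0$ as 0) is an assumption based on the usual convention, not something stated above.

```r
# Shannon entropy in bits; zero proportions are dropped so that
# 0 * log2(0) does not produce NaN (usual convention: that term counts as 0)
shannon_entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

p_texas <- c(0.2, 0.2, 0.1, 0.3, 0.2)   # Texas proportions from the question
h_texas <- shannon_entropy(p_texas)
h_max   <- log2(length(p_texas))        # upper bound, reached when all proportions are equal

round(c(H = h_texas, H_max = h_max), 3)
```

For Texas this gives roughly 2.25 bits, against a maximum of $\log_2 5 \approx 2.32$ bits for five equally sized categories. Note that the measure ignores the ordering of the categories.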

0

Comments on ordinal vs. nominal categories:

Ordinal. Suppose we are interested in ages, treat the categories as ordinal, and seek to use the available information on the location and variability of age in each region. Then, using 'average' ages m = c(5, 15, 20, 40, 75) as in my other answer, the average ages in three different regions (1, 2, 3), with three different vectors of category proportions, are as shown below.

The first region has the same proportion in each category; the second tends to have 'older' inhabitants than the third. The three averages reflect these differences as expected.

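# Guessed category midpoints (years) and proportions for three hypothetical regions
m = c(5, 15, 20, 40, 75)
p1 = rep(.2, 5)
p2 = c(.1, .2, .3, .2, .2)
p3 = c(.2, .3, .2, .2, .1)
# Approximate average age in each region
a1 = sum(m*p1); a2 = sum(m*p2); a3 = sum(m*p3)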
a1; a2; a3
[1] 31
[1] 32.5
[1] 25

Using my second variability formula (something like a standard deviation), we see that the first region may have more diverse ages than the third.

# Square root of the weighted variance; here v1, v2, v3 are indexed by region, not by formula
v1 = sqrt(sum(p1*(m-a1)^2))
v2 = sqrt(sum(p2*(m-a2)^2))
v3 = sqrt(sum(p3*(m-a3)^2))
v1; v2; v3
[1] 24.77902
[1] 23.58495
[1] 20.24846

My formulas for average and variability are essentially ones that have long been used to approximate the sample mean and sample standard deviation from grouped data.

Nominal. If we view the categories as nominal, then it may make sense to use some kind of index of diversity among the categories. If we are mainly interested in the needs for various kinds of social services (pre-school, parks with playing fields, elder care) in the regions, then looking at a diversity index might be a reasonable approach.

User @4k3x9d7r suggested using Shannon entropy in his Answer (+1). I think it would be inappropriate to dismiss this suggestion. This method does not require speculating about 'average ages' in the various categories. Results for Shannon entropy (using $\log_2$) are shown below. Notice that regions 2 and 3 have the same Shannon entropy because they have the same proportions (but assigned to different categories).

se1 = -sum(p1*log2(p1))
se2 = -sum(p2*log2(p2))
se3 = -sum(p3*log2(p3))
se1; se2; se3
[1] 2.321928
[1] 2.246439
[1] 2.246439

Simpson's diversity index (for five categories) is $\lambda =\sum_{i=1}^5 p_i^2.$ Results for the three hypothetical regions are shown below. For five categories, $\lambda$ takes its minimum possible value, $1/5 = 0.2,$ when all proportions are equal, as in region 1.

sd1 = sum(p1^2); sd2 = sum(p2^2); sd3 = sum(p3^2)
sd1; sd2; sd3
[1] 0.2
[1] 0.22
[1] 0.22

These two indexes are discussed in Wikipedia, including rationales for the formulas involved. Another recent Q & A on this site has discussed the Simpson diversity index.
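For convenience, the measures used in this answer can be gathered into one helper and applied to several regions at once. This is only a sketch: `age_summaries` is a made-up name, the midpoints in `m` are the same guessed values as before, and the check that proportions sum to 1 is an added assumption about the input.

```r
# ASSUMPTION: guessed category midpoints in years, as used earlier in this answer
m <- c(5, 15, 20, 40, 75)

# One row of category proportions per hypothetical region
regions <- rbind(
  r1 = rep(0.2, 5),
  r2 = c(0.1, 0.2, 0.3, 0.2, 0.2),
  r3 = c(0.2, 0.3, 0.2, 0.2, 0.1)
)

# Hypothetical helper: all four summaries for one vector of proportions
age_summaries <- function(p, m) {
  stopifnot(length(p) == length(m), abs(sum(p) - 1) < 1e-8)
  a  <- sum(m * p)                 # approximate average age
  nz <- p[p > 0]                   # drop zeros so 0 * log2(0) does not give NaN
  c(mean_age = a,
    sd_age   = sqrt(sum(p * (m - a)^2)),
    shannon  = -sum(nz * log2(nz)),
    simpson  = sum(p^2))
}

# apply() returns one column per region, so transpose to one row per region
summaries <- t(apply(regions, 1, age_summaries, m = m))
round(summaries, 3)
```

Regions 2 and 3 differ in the two ordinal columns but agree in the two nominal ones, which is one way to see what each kind of measure can and cannot distinguish.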

BruceET