
Background: My organization currently compares its workforce diversity statistics (ex. % persons with disabilities, % women, % veterans) to the total labor force availability for those groups based on the American Community Survey (a survey project run by the US Census Bureau). This is an inaccurate benchmark, because we have a very specific set of jobs whose demographics differ from those of the labor force as a whole. Say, for example, that my organization is mostly engineers. Engineering is only about 20% women in my state. If we compare ourselves to the total labor force benchmark, which is more like 50% women, the result is panic that “we only have 20% women, this is a disaster!” when really, 20% is what we should expect, because that’s what the labor landscape looks like.

My goal: What I would like to do is take the American Community Survey occupation data (by diversity category) and re-weight it based on the composition of jobs in my business. Here is a sample data set for Social and Community Service workers. I want to add together the job codes listed (because our crosswalk is to job groups, not to specific job codes), then weight that benchmark by the number of people we have in that category (ex. our 3,000 Social and Community Service workers), then do the same for all the other job groups, add those numbers together, and divide by our total number of workers. This would give me a new re-weighted diversity measure (ex. from 6% persons with a disability to 2% persons with a disability).
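The arithmetic described above can be sketched in a few lines. Everything here is a made-up placeholder for illustration — the group names, ACS rates, and headcounts are not actual ACS figures:

```python
# Hypothetical numbers for illustration only -- the groups, ACS rates,
# and headcounts below are invented, not taken from the actual ACS tables.
job_groups = {
    # group: (ACS disability rate for that job group, our employee headcount)
    "Social and Community Service": (0.060, 3000),
    "Engineering":                  (0.035, 9000),
    "Office and Administrative":    (0.050, 2000),
}

total_employees = sum(n for _, n in job_groups.values())

# Re-weighted benchmark: sum of (group rate * our headcount in that group),
# divided by our total number of workers.
expected_disabled = sum(rate * n for rate, n in job_groups.values())
benchmark = expected_disabled / total_employees

print(round(benchmark, 4))
```

With these invented inputs the company-weighted benchmark comes out lower than a simple average of the group rates, which is exactly the effect the re-weighting is meant to capture.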

My questions: How do I fit margins of error to this final rolled-up benchmark? I do not have the raw census data set (obviously), but you can view margins of error for each number in the link that I provided by toggling the “Estimate” field to “Margin of Error” at the top of the table. My other co-workers who are working with this data fully intend to ignore the margins of error, but I am worried that we are creating a statistically meaningless benchmark for ourselves. Is this data even still usable after the manipulation described above?

DanicaE
  • Don't reweight the ACS -- it is a delicate, highly sophisticated product, and with all due respect I don't think you are as good a statistician as the Census Bureau collectively is. If you can get job definitions consistent with your task in [ACS](https://usa.ipums.org/usa-action/variables/group/work) or [CPS](https://cps.ipums.org/cps-action/variables/group/work) for nationwide comparisons, then the apples-to-apples comparison would be to compute the expected number of "diversity" categories based on ACS for your business to act as reasonable diversity targets. – StasK Jan 11 '14 at 04:32
  • Stas, I agree with you, but as I indicate below, this is not actually a reweighting of ACS. – Steve Samuels Jan 14 '14 at 14:17
  • In survey statistics, "reweighting" would mean transformation of the _original_ survey weights. An example of this would be post-stratification, sample raking, or calibration so that certain marginal distributions for the reweighted sample match distributions known externally, say from the census or ACS. The procedure Danica mentions doesn't touch the ACS weights. – Steve Samuels Jan 14 '14 at 18:19
  • What may help is to write down the finite population quantity that you want to know. Also does the ACS have replicate weights? These may help with variance estimation. – probabilityislogic Jan 16 '14 at 11:08

3 Answers


Update 2014-01-15

I realize that I didn't answer Danica's original question about whether the margin of error for the indirectly adjusted proportion disabled would be larger or smaller than the margin of error for the same rate in ACS. The answer is: if the company category proportions do not differ drastically from the state ACS proportions, the margin of error given below will be smaller than the ACS margin of error. The reason: the indirect rate treats organization job category person counts (or relative proportions) as fixed numbers. The ACS estimate of proportion disabled requires, in effect, an estimate of those proportions, and the margins of error will increase to reflect this.

To illustrate, write the disabled rate as:

$$ \hat{P}_{adj} = \sum_i \frac{n_i}{n} \hat{p}_i $$

where $\hat{p}_i$ is the estimated disabled rate in category $i$ in the ACS.

On the other hand, the ACS estimated rate is, in effect:

$$ \hat{P}_{acs} = \sum_i\widehat{\left(\frac{N_i}{N}\right)} \hat{p}_i $$

where $N_i$ and $N$ are respectively the population category and overall totals and $N_i/N$ is the population proportion in category $i$.

Thus, the standard error for the ACS rate will be larger because of the need to estimate $N_i/N$ in addition to $p_i$.
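To make the comparison concrete, here is a sketch of the two variances, ignoring covariance terms for readability (the full calculation, like the $nn'Vnn$ quadratic form below, would include them). Treating the organization shares $n_i/n$ as fixed gives

$$ \operatorname{var}(\hat{P}_{adj}) \approx \sum_i \left(\frac{n_i}{n}\right)^2 \operatorname{var}(\hat{p}_i) $$

while for the ACS rate, with estimated shares $\widehat{W}_i = \widehat{N_i/N}$, a delta-method expansion gives

$$ \operatorname{var}(\hat{P}_{acs}) \approx \sum_i \widehat{W}_i^2 \operatorname{var}(\hat{p}_i) + \sum_i \hat{p}_i^2 \operatorname{var}(\widehat{W}_i) $$

The second sum is the extra contribution from having to estimate the category shares, which is why $SE(\hat{P}_{acs})$ is typically the larger of the two.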

If the organization category proportions and population estimated proportions differ greatly, then it is possible that $SE( \hat{P}_{adj} )>SE( \hat{P}_{acs} )$. In a two-category example that I constructed, the categories were represented in proportions $N_1/N= 0.7345$ and $N_2/N= 0.2655$. The standard error for the estimated proportion disabled was $SE( \hat{P}_{acs} ) = 0.0677$.

If I considered 0.7345 and 0.2655 to be the fixed values $n_1/n$ and $n_2/n$ (the indirect adjustment approach), $SE(\hat{P}_{adj} )=0.0375$, much smaller. If instead $n_1/n= 0.15$ and $n_2/n =0.85$, $SE( \hat{P}_{adj} )=0.0678$, about the same as $SE( \hat{P}_{acs} )$. At the extreme $n_1/n= 0.001$ and $n_2/n =0.999$, $SE( \hat{P}_{adj} )=0.079$. I'd be surprised if organization and population category proportions differ so drastically. If they don't, I think that it's safe to use the ACS margin of error as a conservative, possibly very conservative, estimate of the true margin of error.

Update 2014-01-14

Short answer

In my opinion, it would be irresponsible to present such a statistic without a CI or margin of error (half CI length). To compute these, you will need to download and analyze the ACS Public Use Microdata Sample (PUMS) (http://www.census.gov/acs/www/data_documentation/public_use_microdata_sample/).

Long answer

This isn't really a re-weighting of the ACS. It is a version of indirect standardization, a standard procedure in epidemiology (google or see any epi text). In this case state ACS job (category) disability rates are weighted by organization job category employee counts. This will compute an expected number of disabled people in the organization E, which can be compared to the observed number O. The usual metric for the comparison is a standardized ratio R= (O/E). (The usual term is "SMR", for "standardized mortality ratio", but here the "outcome" is disability.). R is also the ratio of the observed disability rate (O/n) and the indirectly standardized rate (E/n), where n is the number of the organization's employees.

In this case, it appears that only a CI for E or E/n will be needed, so I will start with that:

If

 n_i = the organization employee count in job category i

 p_i = disability rate for job category i in the ACS

Then

 E = sum (n_i p_i)

The variance of E is:

 var(E) = nn' V nn

where nn is the column vector of organization category counts and V is the estimated variance-covariance matrix of the ACS category disability rates.

Also, trivially, se(E) = sqrt(var(E)) and se(E/n) = se(E)/n.

and a 90% CI for E is

  E ± 1.645 SE(E)

Divide by n to get the CI for E/n.

To estimate var(E) you would need to download and analyze the ACS Public Use Microdata Sample (PUMS) data (http://www.census.gov/acs/www/data_documentation/public_use_microdata_sample/).

I can only speak to the process for computing var(E) in Stata. As I don't know whether that's available to you, I'll defer the details. However, someone knowledgeable about the survey capabilities of R or (possibly) SAS could also provide code based on the equations above.
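The quadratic form itself is simple once V has been estimated, whatever package produced it. A minimal Python sketch of the arithmetic, with invented placeholder numbers standing in for the PUMS-derived rates and covariance matrix:

```python
import math

# Hypothetical inputs for illustration. In practice p and V would come from
# a survey-adjusted analysis of the ACS PUMS, not from made-up numbers.
nn = [3000, 9000, 2000]            # organization employee counts by job category
p  = [0.060, 0.035, 0.050]         # ACS disability rate per job category
V  = [[4.0e-6, 2.0e-7, 1.0e-7],    # estimated variance-covariance matrix of p
      [2.0e-7, 1.0e-6, 5.0e-8],
      [1.0e-7, 5.0e-8, 3.0e-6]]

n = sum(nn)
k = len(nn)

# Expected number of disabled employees: E = sum(n_i * p_i)
E = sum(ni * pi for ni, pi in zip(nn, p))

# var(E) = nn' V nn, the quadratic form from the answer above
var_E = sum(nn[i] * V[i][j] * nn[j] for i in range(k) for j in range(k))
se_E = math.sqrt(var_E)

# 90% CI for E; divide the endpoints by n to get the CI for the rate E/n
lo, hi = E - 1.645 * se_E, E + 1.645 * se_E
print(E, se_E, (lo / n, hi / n))
```

The loop is just the matrix product written out; with an actual PUMS extract the same two lines would follow whatever variance estimation your software reports for the category rates.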

Confidence Interval for the ratio R

Confidence intervals for R are ordinarily based on a Poisson assumption for O, but this assumption may be incorrect.

We can consider O and E to be independent, so

 log R = log(O) - log(E) ->

 var(log R) = var(log O) + var(log(E))

var(log(E)) can be computed as one more Stata step after the computation of var(E).

Under the Poisson independence assumption:

 var(log O) ~ 1/E(O).

A program like Stata could fit, say, a negative binomial model or generalized linear model and give you a more accurate variance term.

An approximate 90% CI for log R is

 log R ± 1.645 sqrt(var(log R))

and the endpoints can be exponentiated to get the CI for R.
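Putting those pieces together, a short sketch of the log-scale interval (again with invented O, E, and var(E); the Poisson approximation var(log O) ≈ 1/O is the assumption flagged above):

```python
import math

# Hypothetical inputs -- all three numbers are placeholders for illustration.
O = 520.0          # observed number of disabled employees
E = 595.0          # expected number from the indirect standardization
var_E = 142.8      # var(E) = nn' V nn, from the PUMS analysis

R = O / E          # standardized ratio

# Delta method: var(log E) ~ var(E)/E^2; Poisson assumption: var(log O) ~ 1/O
var_logR = 1.0 / O + var_E / E**2

# Approximate 90% CI for log R, exponentiated back to the R scale
half = 1.645 * math.sqrt(var_logR)
lo, hi = R * math.exp(-half), R * math.exp(half)
print(R, (lo, hi))
```

As whuber cautions in the comments below, exponentiating the log-scale endpoints can give a poor interval for R itself; bootstrapping, the delta method applied directly to R, or profile likelihood are alternatives worth considering.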

Steve Samuels
  • This is a good discussion. At the end, though, your recommendation to exponentiate a CI for $\log(R)$ can result in a truly poor CI for $R$ itself. – whuber Jan 14 '14 at 16:52
  • This didn't seem to me a case where smearing was appropriate, but I could be wrong. What would you suggest? – Steve Samuels Jan 14 '14 at 17:06
  • Some methods mentioned on CV include bootstrapping the CI, the delta method, and profiling the likelihood function. – whuber Jan 14 '14 at 19:18
  • Thanks for your answer. Is it possible to pull PUMS data with R? I do not have SAS. I have pulled PUMS data before using the DataFerret tool provided by the census, but I'm not sure that that gives me anything I could usefully manipulate in Excel, which is what I have. I can install R, obviously, but I do not have any experience with it. – DanicaE Jan 15 '14 at 17:29
  • You are welcome, Danica. If this answer is helpful, please hit the check mark to accept it officially. Notice that I updated the answer. I recommend that you present the ACS margins of error as conservative substitutes for the proper ones. – Steve Samuels Jan 16 '14 at 04:53
  • Just a quick note that because of the complex sample design, using the sample counts is probably bad, as they are likely to be unrepresentative of the population counts (ie biased). This makes the comparison less meaningful as there is unquantified error in the standard error you report for $ P_{adj} $. – probabilityislogic Jan 16 '14 at 11:22
  • Perhaps my choice of notation was unfortunate, but the n_i are not sample counts; they are the number of employees in the poster's organization who fall into certain categories. – Steve Samuels Jan 16 '14 at 22:13

FWIW there are good resources for the ACS and accessing PUMS here (http://www.asdfree.com/2012/12/analyze-american-community-survey-acs.html).

Also there is a package for handling ACS data on CRAN -- called, naturally, acs -- which I have found really helpful for doing atypical things with ACS data. This is a good step-by-step guide for the package (unfortunately the documentation isn't super intuitive): http://dusp.mit.edu/sites/all/files/attachments/publication/working_with_acs_R.pdf

pricele2

adding to the http://asdfree.com link in @pricele2's answer... in order to solve this problem with free software, i would encourage you to follow these steps:

(1) (two hours of hard work) get acquainted with the r language. watch the first 50 videos, two minutes each

http://twotorials.com/

(2) (one hour of easy instruction-following) install monetdb on your computer

http://www.asdfree.com/2013/03/column-store-r-or-how-i-learned-to-stop.html

(3) (thirty minutes of instruction-following + overnight download) download the acs pums onto your computer. only get the years you need.

https://github.com/ajdamico/usgsd/blob/master/American%20Community%20Survey/download%20all%20microdata.R

(4) (four hours of learning and programming and checking your work) recode the variables that you need to recode, according to whatever specifications you require

https://github.com/ajdamico/usgsd/blob/master/American%20Community%20Survey/2011%20single-year%20-%20variable%20recode%20example.R

(5) (two hours of actual analysis) run the exact command you're looking for, capture the standard error, and calculate a confidence interval.

https://github.com/ajdamico/usgsd/blob/master/American%20Community%20Survey/2011%20single-year%20-%20analysis%20examples.R

(6) (four hours of programming) if you need a ratio estimator, follow the ratio estimation example (with correctly-survey-adjusted standard error) here:

https://github.com/ajdamico/usgsd/blob/master/Censo%20Demografico/variable%20recode%20example.R#L552
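once the pums files are downloaded, the variance machinery is the part people skip. the acs pums person files carry a full-sample weight plus 80 replicate weights, and the documented successive-difference formula is var = (4/80) * sum of squared deviations of the replicate estimates from the full-sample estimate. here is a toy sketch of that formula in python -- the records are randomly generated stand-ins, not real pums data (real columns would be e.g. DIS, PWGTP, PWGTP1..PWGTP80), so only the formula itself should be taken from this:

```python
import math
import random

random.seed(7)

# Toy stand-in for ACS PUMS person records: a disability flag, a full-sample
# weight, and 80 replicate weights (here just randomly perturbed copies).
records = []
for _ in range(500):
    w = random.uniform(20, 200)
    records.append({
        "disabled": random.random() < 0.06,
        "weight": w,
        "replicates": [w * random.uniform(0.7, 1.3) for _ in range(80)],
    })

def weighted_rate(recs, weight_of):
    """Weighted proportion disabled under a given choice of weight."""
    num = sum(weight_of(r) for r in recs if r["disabled"])
    den = sum(weight_of(r) for r in recs)
    return num / den

theta = weighted_rate(records, lambda r: r["weight"])

# Successive difference replication, as documented for the ACS PUMS:
# var(theta) = (4/80) * sum over the 80 replicates of (theta_r - theta)^2
theta_r = [weighted_rate(records, lambda r, k=k: r["replicates"][k])
           for k in range(80)]
var_theta = (4.0 / 80.0) * sum((t - theta) ** 2 for t in theta_r)
se_theta = math.sqrt(var_theta)
margin_of_error = 1.645 * se_theta   # ACS publishes 90% margins of error

print(round(theta, 4), round(margin_of_error, 4))
```

the survey packages in step (5) do exactly this bookkeeping for you; the sketch is only to show there is no magic in it.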

Anthony Damico
  • Thank you, those are excellent resources. If anyone else comes here looking for this info, the R tutorials I've been using are https://www.datacamp.com/ and https://www.coursera.org/course/rprog. Data Camp is a fantastic interactive tutorial. The Coursera course is more heavy on theory/structure/names for things. – DanicaE Oct 01 '14 at 00:25