
Small area estimation is a group of related techniques for estimating parameters associated with sub-populations. For example, suppose I have sub-populations $S_1, \cdots, S_n$ with total population $S = S_1 \cup \cdots \cup S_n$.

For the sake of simplicity, let's say I want to measure the proportion of people who like cats, and say that, through whatever small-area estimation techniques I use, I estimate $p_1, \cdots, p_n$, so that the overall proportion is \begin{align*} \widehat{p} = \frac{\sum_{i=1}^{n}p_i |S_i|}{\sum_{i=1}^{n}|S_i|} \end{align*} where $|\cdot|$ denotes cardinality. However, from an external validation study, I know the truth is actually $p^*$ (we can assume this is the truth), so I expect the aggregate of the small-area estimates to match it closely. What's the best way to adjust my small-area estimates?
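To make the setup concrete, here is a minimal R sketch of the aggregation (the numbers are made up for illustration):

```r
# Made-up small-area estimates p_i and sub-population sizes |S_i|
p    <- c(0.10, 0.20, 0.25)
size <- c(150, 180, 200)

# Size-weighted aggregate p-hat
p_hat <- sum(p * size) / sum(size)
p_hat
## [1] 0.1905660
```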

Tom Chen

1 Answer


I would adjust on the log-odds scale:

$$ \delta_p = \ln\frac{p^*}{1-p^*} - \ln \frac{\hat p}{1-\hat p} $$

$$ \tilde p_i = 1/\Bigl[1 + \exp\Bigl(-\ln \frac{p_i}{1-p_i} - \delta_p\Bigr)\Bigr] $$

That will not make the aggregate match $p^*$ exactly, though, because the inverse logit is nonlinear, so you may have to repeat this a couple of times iteratively.
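A minimal R sketch of that iteration, with made-up numbers (`qlogis` and `plogis` are base R's logit and inverse logit):

```r
p      <- c(0.10, 0.20, 0.25)  # made-up small-area estimates
N      <- c(150, 180, 200)     # sub-population sizes
p_star <- 0.22                 # externally validated truth

for (k in 1:20) {
  p_hat <- sum(N * p) / sum(N)             # current aggregate
  if (abs(p_hat - p_star) < 1e-10) break
  delta <- qlogis(p_star) - qlogis(p_hat)  # log-odds discrepancy
  p     <- plogis(qlogis(p) + delta)       # shift every area by delta
}
sum(N * p) / sum(N)  # now numerically equal to p_star
## [1] 0.22
```

In this toy example the iteration converges in a handful of steps, since each pass shrinks the aggregate discrepancy by a large factor.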

The above treats all of the $p_i$ the same way. In practice, sample sizes vary, and the uncertainty of $p_i$ is $O(n_i^{*\,-1/2})$, where $n^*_i$ is the effective sample size after corrections for clustering and weighting. A more meaningful procedure is therefore to estimate $\lambda$ implicitly, by iterative root-finding, from

$$ \tilde p_i = 1/\Bigl[1 + \exp\Bigl(-\ln \frac{p_i}{1-p_i} - \lambda/\sqrt{n^*_i}\Bigr)\Bigr] \mbox{ s.t. } \frac{\sum_i |S_i|\, \tilde p_i}{\sum_i |S_i|} = p^* $$

df <- data.frame(p = c(0.10, 0.20, 0.25),  # small-area estimates
                 n = c(20, 30, 50),        # effective sample sizes n*_i
                 N = c(150, 180, 200))     # sub-population sizes |S_i|
sum(df$N * df$p) / sum(df$N)               # unadjusted aggregate
## [1] 0.1905660
p0 <- 0.22                                 # external truth p*
# Aggregate discrepancy as a function of lambda, with the shift applied
# on the log-odds scale (qlogis = logit, plogis = inverse logit)
dp <- function(lambda) {
  p1 <- plogis(qlogis(df$p) + lambda / sqrt(df$n))
  sum(df$N * p1) / sum(df$N) - p0
}
fit <- uniroot(dp, lower = -10, upper = 10, tol = 1e-9)
df$p1 <- plogis(qlogis(df$p) + fit$root / sqrt(df$n))
round(df$p1, 4)
## [1] 0.1240 0.2335 0.2798
sum(df$p1 * df$N) / sum(df$N)              # now matches p0
## [1] 0.22

Jon Rao's book would likely have more profound ideas.

StasK
  • Thanks!! Going through Rao's book, I see that he mentions GREG and calibration. Is there a name for your proposed approach, and is there a statistical justification for simply shifting the estimator to eliminate the bias? On the second point, I guess your approach is similar to GREG. – Tom Chen Aug 01 '19 at 16:30
  • I guess I just need to publish it. I have seen calibration for small area estimation mentioned recently in a different context by Lehtonen and Veijanen (2019) http://isi-iass.org/home/wp-content/uploads/Survey_Statistician_January_2019.pdf. A lot of survey statistics relies on what just makes sense, since information comes in weird ways from weird external sources. "Traditional" statistics that relies on the data being i.i.d. from $f(x;\theta)$ rarely has good ways to handle external information that estimates can be pegged on to. – StasK Sep 18 '19 at 17:32