How to estimate mean and variance for rate of change when I only have state data at different ages

Question

I'll give you the intuition behind my problem first. I have data on whether children ($n \approx 200$) can read and their age in integers from 0 to 14.

For each age, it is straightforward to calculate the expected value $$E(read|age=age_t) = \frac{k}{n},$$ where $k$ of the $n$ children observed at that age can read.

For instance, if I observe 5 children at age 6 who can read and 5 who cannot, the expected value of $read$ given $age = 6$ is 0.5. If I then observe 7 children who can read and 3 who cannot at age 7, I might think intuitively that 20 percent of children learned to read at age 6 (assuming the sample is random and there is no attrition).

What I am really interested in is the rate of change in the state variable $read$. I want to know the mean age at which children learn to read, and the variance of that new variable. As a first pass, I have tried to tackle this discretely by calculating $\Delta read_t = E(read|age = age_{t+1}) - E(read|age = age_t)$ for each age. Then I take $\sum_{t=0}^{13}age_t \times \Delta read_t$ to be the mean value at which children learn to read.

I'm aware that this is not an adequate approach to this problem, but I'm trying to think it through. My reasoning is that the proportion of children who can read at each age behaves like a cdf for the variable "when learned to read." The rate of change is then analogous to the pdf, and I can find the expected value of "when learned to read" in the normal way.
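The discrete reasoning above can be sketched in R as follows. This is a minimal illustration, not the actual data: the vector `p_read` is made up, it assumes every child eventually reads, and it attributes each increment to the younger of the two ages, as in the sum over $t = 0, \dots, 13$.

```r
# Hypothetical share of children who can read at ages 0..14 (illustrative only)
age    <- 0:14
p_read <- c(0, 0, 0, 0, 0.05, 0.20, 0.50, 0.70, 0.85, 0.95, 1, 1, 1, 1, 1)

# Increment between consecutive ages, attributed to the younger age t
delta <- diff(p_read)          # length 14, for t = 0..13
age_t <- age[-length(age)]

# The increments only form a probability mass function if they are
# non-negative and sum to one
stopifnot(all(delta >= 0), isTRUE(all.equal(sum(delta), 1)))

mean_age <- sum(age_t * delta)                    # mean of "when learned to read"
var_age  <- sum((age_t - mean_age)^2 * delta)     # its variance
c(mean = mean_age, variance = var_age)
```

With real data the `stopifnot()` check will usually fail, because observed proportions are noisy and need not rise monotonically to 1.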

However, each estimate of $E(read|age=age_t)$ is itself a random variable, so the estimated cdf need not be increasing: the observed proportion of readers in an older age group might be lower than in a younger one. I would also ideally like to be able to construct valid confidence intervals and have a better grasp on the problem in general. I should also add that this is historical data, so the final outcome for reading is not 100% literacy, but more like 70%. I suspect that I'll have to normalize the variable in some way.
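One possible way to handle both issues, offered only as a sketch, is to force the observed proportions to be non-decreasing with isotonic regression (`isoreg` in base R) and then divide by the plateau, so that the curve describes children who eventually learn to read. The numbers in `p_read` are invented, and this version ignores how many children were observed at each age.

```r
# Hypothetical noisy shares of readers by age, levelling off near 0.7 (illustrative only)
age    <- 0:14
p_read <- c(0, 0, 0.02, 0, 0.05, 0.15, 0.40, 0.35, 0.55,
            0.65, 0.62, 0.70, 0.68, 0.72, 0.70)

# Pool adjacent violators so the estimated curve is non-decreasing in age
p_mono <- isoreg(age, p_read)$yf

# Rescale by the plateau so the curve acts like a cdf among eventual readers
p_cdf <- p_mono / max(p_mono)

delta    <- diff(p_cdf)              # increments, attributed to the younger age
age_t    <- age[-length(age)]
mean_age <- sum(age_t * delta)
var_age  <- sum((age_t - mean_age)^2 * delta)
c(mean = mean_age, variance = var_age)
```

The resulting mean and variance are conditional on eventually learning to read, and they do not come with confidence intervals.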

Any help or guidance is very welcome. I suspect the problem will involve maximum likelihood estimation using some combination of a normal and binomial distribution. I've explored this method somewhat, but I feel like I'm at a bit of a roadblock and would appreciate some guidance before proceeding. Is what I am attempting even possible?

Do you have data over many years or few? Does it make sense to assume a stationary process? I guess a few more details are needed ... – kjetil b halvorsen, Nov 17 '20 at 10:16
I have data over about 50 years. I'm using bins of about 5-10 years to try to get enough observations in each bin, but the point is ultimately to observe change over time. So, no, I don't think it is a stationary process, but it may be approximately so in a given bin. – Lou Henderson, Nov 17 '20 at 14:58
I should add, I've run a probit regression with 'read' as the dependent variable and 'age,' along with interactions for different bins, as the independent variables. This seems to work okay, but it would be nice to be able to present a graph. – Lou Henderson, Nov 17 '20 at 15:14

score 1 · Answer 1 · answered Nov 18 '20 at 19:26

Suppose the age when children learn to read is distributed normally, with mean $m$ and standard deviation $s$. Then the fraction of children who can read at age $a$ is $\Phi((a-m)/s)$. (Here, $\Phi$ is the cumulative distribution function for the standard normal, and we'll also want its inverse $\Phi^{-1}$ which calculates quantiles or $z$-scores.)

So for this model, we are looking for $m$ and $s$ which best fit the equation $$\frac{a_t - m}{s} = \Phi^{-1}\left(\frac{k_t}{n_t}\right)$$

We can approximate this using a simple linear regression to determine $$\Phi^{-1}\left(\frac{k_t}{n_t}\right) = \beta_1 a_t + \beta_0$$ and then taking $$m=\frac{-\beta_0}{\beta_1}, \ \ s=\frac{1}{\beta_1}.$$
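A minimal R sketch of this regression, assuming made-up counts `k` and `n` by age (not the asker's data). Ages where $k_t/n_t$ is exactly 0 or 1 are dropped, because $\Phi^{-1}$ is infinite there.

```r
# Hypothetical counts: k readers out of n children at each age (illustrative only)
age <- 4:12
n   <- c(12, 15, 14, 16, 13, 15, 14, 12, 13)
k   <- c( 1,  3,  5,  8,  8, 11, 12, 11, 12)

# Keep only ages with 0 < k/n < 1, since qnorm(0) and qnorm(1) are infinite
ok <- k > 0 & k < n
d  <- data.frame(a = age[ok], z = qnorm(k[ok] / n[ok]))   # z = Phi^{-1}(k_t / n_t)

# Unweighted least squares on the probit scale
fit <- lm(z ~ a, data = d)
b0  <- unname(coef(fit)[1])
b1  <- unname(coef(fit)[2])

m <- -b0 / b1   # estimated mean age of learning to read
s <-  1 / b1    # estimated standard deviation
c(m = m, s = s)
```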

This is a first approximation; the data points can also be weighted, as suggested by whuber here, which gives a more admissible answer.
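The sketch below is not necessarily the weighting referred to above. It shows one standard way to let the group sizes enter: fit the same model as a binomial GLM with a probit link, which also gives a delta-method standard error for $m$. The counts are again invented for illustration.

```r
# Hypothetical counts: k readers out of n children at each age (illustrative only)
age <- 4:12
n   <- c(12, 15, 14, 16, 13, 15, 14, 12, 13)
k   <- c( 1,  3,  5,  8,  8, 11, 12, 11, 12)

# Probit regression on grouped counts; each age is weighted by its binomial likelihood
fit <- glm(cbind(k, n - k) ~ age, family = binomial(link = "probit"))
b   <- coef(fit)

m <- unname(-b[1] / b[2])   # mean age
s <- unname( 1 / b[2])      # standard deviation

# Delta-method standard error for m = -b0/b1:
# dm/db0 = -1/b1 and dm/db1 = b0/b1^2
V    <- vcov(fit)
grad <- c(-1 / b[2], b[1] / b[2]^2)
se_m <- sqrt(drop(t(grad) %*% V %*% grad))

c(m = m, s = s, se_m = se_m)
```

An approximate 95% interval for $m$ would then be `m + c(-1, 1) * 1.96 * se_m`, under the assumption that the normal model and the large-sample approximation are reasonable.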

Matt F.

I very much doubt that a normal distribution for start of reading age can be a good model! It must have a thicker upper tail, at least. Apart from max percentage in this case around 70% ... – kjetil b halvorsen Nov 18 '20 at 19:38
If only 70% of them learn to read, then “the mean age at which children learn to read” would have to be specified more carefully, and might not be an appropriate metric at all. So this answer was assuming that all children in this population learn to read by around age 14. – Matt F. Nov 18 '20 at 20:06
Thanks, Matt. This helps me think about the problem more clearly. I think this method might be less efficient than a probit using individual-level observations though. I'll have to refresh my memory on weighting, but it seems like it would swap my N=200 for an N=15 sample. – Lou Henderson Nov 19 '20 at 17:05
kjetil, you're right to worry about thick right tails. My prior from the qualitative sources would be that the right-tail skew diminishes over time as the education system became more standardized. So a flexible maximum likelihood method would be quite nice, but I'm happy to have the kind of first-approximation help too. – Lou Henderson Nov 19 '20 at 17:08

score 1 · Answer 2 · answered Nov 23 '20 at 23:59

Build a common regression model with age and year as predictors. Since the response is binary (can read or not), logistic regression is a natural starting point. That is, $$ \DeclareMathOperator{\P}{\mathbb{P}} \P(Y_i=1 \mid a, t) = p_{a,t} $$ where $a$ is age and $t$ is the time point (year). Since the expectation of a binary (0/1) variable is the probability that it equals 1, $p_{a,t}$ corresponds to your $E(read|age=age_t)$ (I find that notation ambiguous, so I avoid it). With a binomial distribution for $Y_i$ and a logit link function (you could try other links) we get $$ \operatorname{logit}(p_{a,t}) = \mu + \alpha a + \beta t $$ for a linear model, but you could try other models. I will simulate from a linear model, but show a fit with a spline for $t$, using glm in R. An alternative could be mgcv, which is more automatic.

n <- 200
set.seed(7*11*13) # My public seed

# First simulate from a model that is linear on the probability scale:

year <- rep( seq.int(from=-25, to=25, by=3) , each=ceiling(n/17) )  # 17 time points
age  <- sample(5:14, size=length(year), replace=TRUE)
p    <- -0.1482 + age*(.5/9) + year*(0.3/98)
   # This gives too many small probabilities ... so change for a more
   # representative simulation ... 
Y    <- rbinom(length(year), size=1, prob=p )

# Keep the response in the data frame so that glm() finds it via data=
mydata <- data.frame(Y, year, age, p)

Then we fit a model, and show a plot of the fitted spline:

library(splines)
mod0 <-  glm( Y  ~ age + ns(year, df=4), data=mydata, family=binomial) # alternatively mgcv 

Yhat <- predict(mod0, type="response")   # fitted probabilities

### Output from summary and anova:
summary(mod0)
anova(mod0, test="Chisq")

Call:
glm(formula = Y ~ age + ns(year, df = 4), family = binomial, 
    data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7770  -0.8661  -0.5005   1.0026   2.1880  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -4.60462    0.89591  -5.140 2.75e-07 ***
age                0.38446    0.06806   5.649 1.61e-08 ***
ns(year, df = 4)1  0.28385    0.78697   0.361   0.7183    
ns(year, df = 4)2  1.46611    0.75326   1.946   0.0516 .  
ns(year, df = 4)3  0.16797    1.45083   0.116   0.9078    
ns(year, df = 4)4 -0.06443    0.65292  -0.099   0.9214    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 271.40  on 203  degrees of freedom
Residual deviance: 226.88  on 198  degrees of freedom
AIC: 238.88

Number of Fisher Scoring iterations: 4

Analysis of Deviance Table

Model: binomial, link: logit

Response: Y

Terms added sequentially (first to last)


                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                               203     271.40              
age               1   38.626       202     232.78 5.133e-10 ***
ns(year, df = 4)  4    5.893       198     226.88    0.2073    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

### Let us get terms predictions to plot the estimated spline

Yhat.terms <- predict(mod0, type="terms")

The code for the plot is:

library(ggplot2)

ggp <- ggplot( within(mydata, {spline <- Yhat.terms[, "ns(year, df = 4)"]  
                               Yhat <- Yhat} ),
              aes(y=spline, x=year) )

ggp + geom_point( color="red") + ggtitle("Fitted spline of year")
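
For the mgcv alternative mentioned earlier, a minimal sketch might look like the following. It re-simulates data of the same form so that it runs on its own; the seed, the data frame name `d`, the basis dimension `k = 6`, and the use of REML are choices made here for illustration, not part of the fit shown above.

```r
library(mgcv)

# Simulate data of the same form as above (illustrative seed)
set.seed(1)
year <- rep(seq.int(from = -25, to = 25, by = 3), each = 12)
age  <- sample(5:14, size = length(year), replace = TRUE)
p    <- -0.1482 + age * (0.5 / 9) + year * (0.3 / 98)
Y    <- rbinom(length(year), size = 1, prob = p)
d    <- data.frame(Y, age, year)

# Logistic GAM: linear in age, penalized smooth in year;
# the amount of smoothing is chosen automatically
mod_gam <- gam(Y ~ age + s(year, k = 6), family = binomial,
               data = d, method = "REML")

summary(mod_gam)
plot(mod_gam, shade = TRUE)   # estimated smooth of year with a confidence band
```

Here the smoothness of the year effect is estimated from the data instead of being fixed by `df = 4`.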
