I'm studying the growth of a population (users of a website). I have the cumulative user count for each time block (each block is 2 weeks long). I'd like to understand whether this growth follows a well-known growth curve. To start with, I ran this regression in R on 30 observations with glm(tot_users ~ block_id, family = binomial(logit), data = df). These are the observations:

25,83,111,164,251,370,557,815,1154,1513,2032,2605,3590,4904,5718,6602,7628,8727,9471,10263,11047,11799,12441,13040,13634,14168,14582,15143,15649,16164,16472

This is what I get:

Coefficients:
(Intercept)   block_id10   block_id11   block_id12   block_id13   block_id14   block_id15   block_id16   block_id17   block_id18   block_id19    block_id2   block_id20   block_id21   block_id22   block_id23   block_id24   block_id25   block_id26   block_id27   block_id28   block_id29    block_id3   block_id30   block_id31    block_id4    block_id5    block_id6    block_id7    block_id8  
 -9.397e+00    6.331e+00    6.667e+00    7.148e+00    7.398e+00    7.714e+00    7.968e+00    8.151e+00    8.364e+00    8.623e+00    8.979e+00    1.958e-13    9.263e+00    9.430e+00    9.591e+00    9.817e+00    9.985e+00    1.020e+01    1.048e+01    1.067e+01    1.098e+01    1.126e+01    6.932e-01    1.196e+01    3.196e+01    2.946e+00    3.468e+00    4.754e+00    5.555e+00    5.896e+00  
  block_id9  
  6.150e+00  

Degrees of Freedom: 30 Total (i.e. Null);  0 Residual
Null Deviance:      17.57 
Residual Deviance: 3.167e-10    AIC: 74.87

If I plot this model, I get a surprisingly close approximation. The residual deviance seems incredibly low, indicating that the model is very good (is it?). My questions are:

1) Can I assume that my variable growth follows a logistic curve?

2) What other curves do you think I should try to fit?

3) What is the usual validation process to make the regression results publishable? What measures are usually reported?

Thanks for any hint. Mulone

  • As far as I know, 'glm()' with the binomial family expects a binary response. Your 'tot_users' seems to be continuous. Is there some trick behind this? – martin Aug 10 '13 at 17:22
  • Posting all the data would make it easier for people to comment. The simplest logistic growth curve levels off at an asymptote. Does your series level off? – Nick Cox Aug 10 '13 at 18:32

2 Answers

Sounds like you're trying to fit a logistic growth curve of the type used in biological modeling, where you're fitting a population growth rate variable $r$. A very simple model is

$$ r_i = r_{i-1}(1 - n / k) $$ where $n$ is current population and $k$ some theoretical maximum population.
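For reference, and only as the standard textbook form (it may not be exactly the recursion intended above), continuous-time logistic growth is usually written as a differential equation with a closed-form solution, where $n_0$ is the population at time $0$ and $r$ is the intrinsic growth rate:

$$ \frac{dn}{dt} = r\,n\left(1 - \frac{n}{k}\right), \qquad n(t) = \frac{k}{1 + \frac{k - n_0}{n_0}\,e^{-rt}}. $$

As $t \to \infty$, $n(t) \to k$, and growth is fastest when $n = k/2$, which is what gives the curve its S shape.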

A binomial model in glm is appropriate for binary data, not for counts like yours, so you'll need another technique. I'm no expert in R, but a quick search turned up the question below, whose answer points you toward fitting the curve with nls():

What's the most pain-free way to fit logistic growth curves in R?

Regarding other curves, you could also try linear regression using curvilinear models, where you include powers of your independent variable(s), e.g. $$ y = b_0 + b_1X + b_2X^2.$$ You always have to be careful with these, however, not to extrapolate results beyond observed data, because the curves ultimately bend/curve according to their inherent mathematical form - which can even happen within the ranges of your observed data.
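A minimal sketch of that curvilinear approach in R, assuming the blocks are coded 1 to 31 (the variable names `users`, `block` and `quad_fit` are mine, not from the question). It fits the quadratic with lm() and then predicts past the observed range, which is where the caution about extrapolation applies:

```r
# Observations from the question; block is ASSUMED to run 1..31
users <- c(25, 83, 111, 164, 251, 370, 557, 815, 1154, 1513, 2032, 2605,
           3590, 4904, 5718, 6602, 7628, 8727, 9471, 10263, 11047, 11799,
           12441, 13040, 13634, 14168, 14582, 15143, 15649, 16164, 16472)
block <- seq_along(users)

# Quadratic in block: still linear in the coefficients, so lm() applies
quad_fit <- lm(users ~ block + I(block^2))
b <- coef(quad_fit)

# A parabola has exactly one turning point, at -b1 / (2 * b2)
turning_point <- unname(-b[2] / (2 * b[3]))

# Predictions at the last observed block and beyond the observed range
new_blocks <- data.frame(block = c(31, 40, 60))
extrapolated <- predict(quad_fit, newdata = new_blocks)

print(turning_point)
print(extrapolated)
```

Comparing `turning_point` with the range 1 to 31 shows whether the fitted parabola already bends inside the data, and the predictions at blocks 40 and 60 are dictated by the quadratic form rather than by any observation.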

As for evaluating results, each technique will usually have its own suite (and literature) of model-fit diagnostics, so that will be very case-specific.

thomas

You have 31 observations here, not 30.

I am not a routine R user but it's clear that your logit model fitted with glm is completely nonsensical as a fit of a logistic curve.

  • The logistic curve is not to be fitted using a glm with logit link. Such a model is usually for binary data, coded 0 or 1, although there is an extension to proportions, but your response variable is neither binary nor proportional. (There is a historical link in that Berkson borrowed the S-shape of the logistic as a suitable link function for modelling binary responses in what we call logit or logistic modelling but the software functions to be used in practice for logistic growth curves are quite different.)

  • As you fitted a model with as many parameters as data points (31 coefficients for 31 observations, hence 0 residual degrees of freedom), its good fit is not surprising.

  • Look at the output: R has evidently treated your identifiers alphabetically, i.e. the ordering is 10, ..., 19, 2, 20, ..., 29, 3, 30, 31, 4, ..., 9, with block 1 absorbed into the intercept. Manifestly block identifier is here a time measure and not categorical. The parameterisation makes no more sense than the model form.

  • I'm surprised that R even accepted such data for a logit model, but as it's quite the wrong model I'll leave my comments there.
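The ordering and parameter-count points can be reproduced in a few lines of R. This is only a sketch under the assumption that block_id was stored as text (or as a factor built from text) in the question's data frame; the names `block_chr`, `levs` and `X` are hypothetical:

```r
# Block identifiers stored as text (ASSUMED to mirror the question's df)
block_chr <- as.character(1:31)

# factor() orders character levels as strings, so "10" sorts before "2"
levs <- levels(factor(block_chr))
print(levs)

# One dummy column per level beyond the first, plus the intercept:
# the design matrix has as many columns as rows, so any model built on it
# has no residual degrees of freedom
X <- model.matrix(~ factor(block_chr))
print(dim(X))

# Converting to numeric restores time order and leaves a single slope
block_num <- as.numeric(block_chr)
print(ncol(model.matrix(~ block_num)))
```

With 31 columns for 31 rows the model can reproduce every observation exactly, which is why a near-zero residual deviance says nothing about the shape of the growth curve.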

I concur with the pointer given by @tabSF that nonlinear least squares is the easiest way forward. I put your data into Stata and got a nonlinear least squares fit quite readily. There is some choice over parameterisation but here K is an asymptote, A tunes the speed of approach and T is the time at which the count reaches K/2. The numbers supplied on the command line are just guesses to help the iteration process.

. nl  (users = {K=20000}/(1 + exp(-{A}*(block-{T=15}))))
(obs = 31)

Iteration 0:  residual SS =  2.48e+08
Iteration 1:  residual SS =  1.96e+08
Iteration 2:  residual SS =  4.49e+07
Iteration 3:  residual SS =   4253893
Iteration 4:  residual SS =   3998721
Iteration 5:  residual SS =   3997420
Iteration 6:  residual SS =   3997405
Iteration 7:  residual SS =   3997404
Iteration 8:  residual SS =   3997404

      Source |       SS       df       MS
-------------+------------------------------         Number of obs =        31
       Model |  2.6465e+09     3   882161243         R-squared     =    0.9985
    Residual |  3997404.48    28  142764.446         Adj R-squared =    0.9983
-------------+------------------------------         Root MSE      =  377.8418
       Total |  2.6505e+09    31  85499391.4         Res. dev.     =  452.7564

------------------------------------------------------------------------------
       users |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          /K |   16316.17   259.5102    62.87   0.000     15784.58    16847.75
          /A |   .2569907   .0099346    25.87   0.000     .2366405    .2773408
          /T |    17.8977   .2089318    85.66   0.000     17.46972    18.32568
------------------------------------------------------------------------------

From the numeric output the fit looks spectacularly good, but it is vital to plot the data and the fitted curve as well, and that plot is not so encouraging. A sceptic would doubt that the data show clear evidence of levelling off.
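One way to draw such a plot in R is sketched below. It does not refit anything: the values of K, A and T are copied from the Stata output above, the block coding 1 to 31 is an assumption, and T is renamed `T0` because `T` is an alias for TRUE in R.

```r
# Observations from the question; block is ASSUMED to run 1..31
users <- c(25, 83, 111, 164, 251, 370, 557, 815, 1154, 1513, 2032, 2605,
           3590, 4904, 5718, 6602, 7628, 8727, 9471, 10263, 11047, 11799,
           12441, 13040, 13634, 14168, 14582, 15143, 15649, 16164, 16472)
block <- seq_along(users)

# Estimates copied from the Stata output (not re-estimated here)
K  <- 16316.17
A  <- 0.2569907
T0 <- 17.8977

logistic <- function(t) K / (1 + exp(-A * (t - T0)))

# Data with the fitted curve and the estimated asymptote
plot(block, users, xlab = "block (2-week periods)", ylab = "total users")
curve(logistic(x), from = 1, to = 31, add = TRUE)
abline(h = K, lty = 2)

# Residuals against time can reveal systematic runs that R-squared hides
res <- users - logistic(block)
plot(block, res, type = "h", xlab = "block", ylab = "residual")
abline(h = 0)
```

The dashed line makes it easy to judge by eye whether the last observations are really approaching the estimated asymptote, and the residual plot shows whether the misfit is scattered or systematic over time.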

You should be able to do something similar in R. Don't be terribly surprised if you get small differences in parameter estimates. Nonlinear least squares is a dark art and small differences in algorithm have their consequences.
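A rough R counterpart of the Stata command, offered as a sketch rather than a checked translation. The starting values for K and T are the guesses from the Stata command line; the start for A (0.3) is my own guess, since nls() needs a nonzero starting value here; blocks are assumed to be coded 1 to 31; and T is named `T0` to avoid clashing with R's `T` alias for TRUE.

```r
# Observations from the question; block is ASSUMED to run 1..31
users <- c(25, 83, 111, 164, 251, 370, 557, 815, 1154, 1513, 2032, 2605,
           3590, 4904, 5718, 6602, 7628, 8727, 9471, 10263, 11047, 11799,
           12441, 13040, 13634, 14168, 14582, 15143, 15649, 16164, 16472)
block <- seq_along(users)

# Same parameterisation as the Stata fit: K asymptote, A rate, T0 midpoint
fit <- nls(users ~ K / (1 + exp(-A * (block - T0))),
           start = list(K = 20000, A = 0.3, T0 = 15))

est <- coef(fit)
print(summary(fit))

# Residual sum of squares and residual degrees of freedom
print(deviance(fit))
print(df.residual(fit))
```

If the iteration fails from these starting values, `SSlogis()` in the stats package offers a self-starting logistic; its `scal` parameter corresponds to 1/A in the parameterisation used here.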

Nick Cox