I have a dataset with two variables:

> head(as.data.frame(d))
   age asthma
1 54.0      1
2 57.0      1
3 42.0      2
4 44.0      1
5 56.6      1
6 58.0      1
> summary(d$asthma)
   1    2 NA's 
6937 1105  876 
> 

I would like to estimate the probability of asthma as a function of age, i.e. something like P(asthma==1|age=23) = p.

The estimate needs to take into account that this probability should be a continuous function of age.

In other words, P(asthma==1|age=45) ~ P(asthma==1|age=46).

I would like to know which method I should use and how to compute it in R with my data frame.

Ideally, I would like to end up with a plot of the estimated P(asthma==1|age=x).

kjetil b halvorsen
Donbeo
  • You need two things: splines and logistic regression. A great deal about both can be found by searching this site. – whuber Feb 14 '15 at 22:28
  • I was thinking of something like gam, but I was wondering if there is anything better. – Donbeo Feb 14 '15 at 22:30
  • While framed quite differently, I think this is a possible duplicate: http://stats.stackexchange.com/questions/137582/can-one-do-glm-with-loess-transformed-variables/137649#137649 --- actually, I am wondering about the feasibility of moving the answer here and closing the other as a duplicate, since I think this one doesn't get distracted by trying to transform the predictors. – Glen_b Feb 14 '15 at 23:12

1 Answer

You can use a logistic regression with the predictor age represented as a spline function. The simplest way is to use a regression spline, where you fix the amount of smoothing beforehand, most conveniently by specifying the degrees of freedom of the spline basis (the `df` argument). In R, something like:

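    # asthma is coded 1/2 (apparently a factor), so use the indicator asthma == 1
    # as the response to model P(asthma == 1 | age)
    mod0 <- glm(as.integer(asthma == 1) ~ splines::ns(age, df = 6),
                family = binomial, data = your_data_frame)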
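
To get the curve the question asks for, a fitted model of this kind can be evaluated on a grid of ages and plotted. What follows is only a minimal sketch on simulated data, not the asker's data: the data frame `sim`, the column `has_asthma`, the age range and the curve used to simulate are all assumptions made for illustration. Asthma is coded 1/2 as in the question.

```r
library(splines)

set.seed(1)
# Simulated stand-in for the real data (assumption): ages 20-80 and a
# smoothly varying P(asthma == 1 | age); asthma is a factor coded 1/2.
n   <- 2000
age <- runif(n, 20, 80)
p1  <- plogis(2 - 0.03 * (age - 50) + 0.5 * sin(age / 10))
sim <- data.frame(age = age,
                  asthma = factor(ifelse(rbinom(n, 1, p1) == 1, 1, 2)))

# 0/1 indicator so that "success" means asthma == 1 (hypothetical column name).
sim$has_asthma <- as.integer(sim$asthma == 1)

# Logistic regression on a natural cubic spline basis in age.
fit <- glm(has_asthma ~ ns(age, df = 6), family = binomial, data = sim)

# Evaluate the fitted probability on a regular grid over the observed ages.
grid   <- data.frame(age = seq(min(sim$age), max(sim$age), length.out = 200))
grid$p <- as.numeric(predict(fit, newdata = grid, type = "response"))

plot(grid$age, grid$p, type = "l", ylim = c(0, 1),
     xlab = "age", ylab = "estimated P(asthma == 1 | age)")
```

If it helps to see how the basis was built, `attr(ns(sim$age, df = 6), "knots")` returns the interior knots that `ns()` placed at quantiles of age.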

If you want to let the data determine the amount of smoothing, you can use a GAM (generalized additive model). In R, for example:

    # as.integer(asthma == 1) makes the model target P(asthma == 1 | age)
    mod1 <- mgcv::gam(as.integer(asthma == 1) ~ s(age), family = binomial,
                      data = your_data_frame)
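
A GAM fit can be plotted in the same way, and predicting on the link scale with standard errors gives an approximate pointwise band. This is again a sketch on simulated data under stated assumptions: `sim`, `has_asthma` and the simulated curve are made up for illustration, `method = "REML"` is one possible choice of smoothness selection rather than something the answer prescribes, and the 1.96 multiplier assumes an approximate normal interval on the logit scale.

```r
library(mgcv)

set.seed(2)
# Simulated stand-in for the real data (assumption), asthma coded 1/2.
n   <- 2000
age <- runif(n, 20, 80)
p1  <- plogis(1.5 + 0.6 * sin(age / 12))
sim <- data.frame(age = age,
                  asthma = factor(ifelse(rbinom(n, 1, p1) == 1, 1, 2)))
sim$has_asthma <- as.integer(sim$asthma == 1)  # hypothetical 0/1 indicator

# Penalized smooth of age; the amount of smoothing is estimated from the data.
fit <- gam(has_asthma ~ s(age), family = binomial, data = sim, method = "REML")

# Predict on the logit scale with standard errors, then back-transform.
grid <- data.frame(age = seq(min(sim$age), max(sim$age), length.out = 200))
pr   <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)

grid$p     <- as.numeric(plogis(pr$fit))
grid$lower <- as.numeric(plogis(pr$fit - 1.96 * pr$se.fit))
grid$upper <- as.numeric(plogis(pr$fit + 1.96 * pr$se.fit))

plot(grid$age, grid$p, type = "l", ylim = c(0, 1),
     xlab = "age", ylab = "estimated P(asthma == 1 | age)")
lines(grid$age, grid$lower, lty = 2)  # approximate pointwise lower limit
lines(grid$age, grid$upper, lty = 2)  # approximate pointwise upper limit
```

Building the band on the link scale and back-transforming with `plogis()` keeps the limits inside (0, 1).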

Also search this site for similar questions.

kjetil b halvorsen
  • Correct me if I'm wrong, but using splines with df=6 defaults to a 6th-order polynomial with no breakpoints. Penalized smoothers would probably be preferred. GAM effectively does this. – AdamO Nov 13 '19 at 16:34
  • @AdamO: Why do you think so? Wouldn't it also depend on the number of observations? I should admit that I did not thoroughly study the theory behind `ns` ... – kjetil b halvorsen Nov 13 '19 at 21:53
  • Because you can simply look at the output from `model.matrix(~ splines::ns(age,df=6), data=your_data_frame)` and see it is parametrized as such – AdamO Nov 13 '19 at 23:38
  • Thanks, will do! – kjetil b halvorsen Nov 14 '19 at 00:08