estimate conditional distribution from data

Question

I have a dataset defined by 2 variables

> head(as.data.frame(d))
   age asthma
1 54.0      1
2 57.0      1
3 42.0      2
4 44.0      1
5 56.6      1
6 58.0      1
> summary(d$asthma)
   1    2 NA's 
6937 1105  876 
>

I would like to estimate the conditional probability of asthma as a function of age, so something like P(asthma==1|age=23) = p.

In this estimation I need to take into account that the estimated probability should be a continuous function of age.

In other words P(asthma==1|age=45) ~ P(asthma==1|age=46)

I would like to know what is the method that I should use and how to compute it in R with my dataframe.

Ideally, I would like to end up with a plot of the estimated P(asthma==1|age=x).

You need two things: splines and logistic regression. A great deal about both can be found by searching this site. — whuber, Feb 14 '15 at 22:28
I was thinking something like gam. But I was wondering if there is anything better — Donbeo, Feb 14 '15 at 22:30
While framed quite differently, I think this is a possible duplicate: http://stats.stackexchange.com/questions/137582/can-one-do-glm-with-loess-transformed-variables/137649#137649 --- actually I am wondering about the feasibility of moving the answer here and closing the other as a duplicate, since I think this one doesn't get distracted by trying to transform the predictors.. — Glen_b, Feb 14 '15 at 23:12

kjetil b halvorsen · Answer 1 · 2021-12-02T17:22:43.250

1

You can use logistic regression with the predictor age represented as a spline function. The simplest way is to use a regression spline, where you fix the amount of smoothing beforehand, most conveniently by specifying the degrees of freedom (the `df` argument). In R, something like:

    library(splines)
    # asthma is coded 1/2, so recode the response: 1 if asthma == 1, 0 otherwise
    mod0 <- glm( as.numeric(asthma == 1) ~ splines::ns(age, df=6), 
                 family=binomial, data=your_data_frame)
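
To get the plot of P(asthma==1|age=x) asked for in the question, one option is to predict on a grid of ages and back-transform from the logit scale. The following is only a sketch: the data frame `d_sim` is simulated (the real data are not available here), and the names `d_sim`, `y`, `grid` and the "true" curve used to simulate are assumptions for illustration.

```r
library(splines)

# Simulated stand-in for the asker's data (hypothetical, not the real dataset)
set.seed(1)
n   <- 2000
age <- round(runif(n, 20, 80), 1)
p   <- plogis(2 - 0.03 * (age - 50))            # assumed true P(asthma == 1 | age)
d_sim <- data.frame(age = age,
                    asthma = factor(ifelse(rbinom(n, 1, p) == 1, 1, 2)))

# Recode so that the modelled event is asthma == 1
d_sim$y <- as.numeric(d_sim$asthma == 1)

# Logistic regression with a natural cubic spline in age
fit <- glm(y ~ ns(age, df = 6), family = binomial, data = d_sim)

# Predict on a regular age grid on the link (logit) scale, then back-transform
grid <- data.frame(age = seq(min(d_sim$age), max(d_sim$age), length.out = 200))
pr   <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)
grid$p_hat <- plogis(pr$fit)
grid$lower <- plogis(pr$fit - 1.96 * pr$se.fit)  # approximate pointwise 95% band
grid$upper <- plogis(pr$fit + 1.96 * pr$se.fit)

plot(grid$age, grid$p_hat, type = "l", ylim = c(0, 1),
     xlab = "age", ylab = "P(asthma == 1 | age)")
lines(grid$age, grid$lower, lty = 2)
lines(grid$age, grid$upper, lty = 2)
```

Building the interval on the link scale and then applying `plogis` keeps the band inside [0, 1], which would not be guaranteed if the standard errors were added on the probability scale.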

If you want to let the data determine the degree of smoothing, you can use a GAM (generalized additive model). In R, maybe:

    # same 0/1 recoding of the response; s(age) is a penalized smooth of age
    mod1 <- mgcv::gam( as.numeric(asthma == 1) ~ s(age), family=binomial, 
                        data=your_data_frame )
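
A small sketch of how the GAM fit might be inspected, again on simulated data (the data frame `d_sim`, the simulated curve and the choice `method = "REML"` are my assumptions, not something from the question). It shows the effective degrees of freedom the fit settled on, checks that neighbouring ages get similar probabilities, and plots the smooth on the probability scale.

```r
library(mgcv)

# Simulated stand-in for the asker's data (hypothetical, not the real dataset)
set.seed(2)
n   <- 2000
age <- round(runif(n, 20, 80), 1)
p   <- plogis(2 - 0.03 * (age - 50))            # assumed true P(asthma == 1 | age)
d_sim <- data.frame(age = age, y = rbinom(n, 1, p))  # y = 1 means asthma == 1

# Penalized smooth of age; the amount of smoothing is estimated from the data
fit_gam <- gam(y ~ s(age), family = binomial, data = d_sim, method = "REML")

# Effective degrees of freedom used by s(age)
edf_age <- summary(fit_gam)$edf
print(edf_age)

# Estimated probabilities at two neighbouring ages
p45 <- predict(fit_gam, newdata = data.frame(age = c(45, 46)), type = "response")
print(p45)

# Plot the smooth on the probability scale: add the intercept, then apply plogis
plot(fit_gam, trans = plogis, shift = coef(fit_gam)[1],
     xlab = "age", ylab = "P(asthma == 1 | age)")
```

Because the fitted curve is a smooth function of age, the two predicted probabilities in `p45` should be close, which is the continuity property requested in the question.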

Also search this site for similar questions.

edited Dec 02 '21 at 17:22

answered Oct 08 '19 at 07:51

kjetil b halvorsen


Correct me if I'm wrong, but using splines with df=6 defaults to a 6 order polynomial with no breakpoints. Penalized smoothers would probably be preferred. GAM effectively does this. – AdamO Nov 13 '19 at 16:34
@AdamO: Why do you think so? Wouldn't it also depend on the number of observations? I should admit that I did not study the theory behind `ns` thoroughly ... – kjetil b halvorsen Nov 13 '19 at 21:53
Because you can simply look at the output from `model.matrix(~ splines::ns(age,df=6), data=your_data_frame)` and see it is parametrized as such – AdamO Nov 13 '19 at 23:38
Thanks, will do! – kjetil b halvorsen Nov 14 '19 at 00:08
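
For anyone wanting to check the point discussed in these comments, the knots chosen by `ns` can be read off the basis object. As far as I understand the `splines::ns` documentation, supplying `df` without `knots` places `df - 1` interior knots at quantiles of the predictor (when there is no intercept column), so `df=6` would give a piecewise cubic with breakpoints rather than a single sixth-order polynomial. A minimal sketch on simulated ages (the vector `age` below is made up for illustration):

```r
library(splines)

# Simulated ages (hypothetical, only to inspect the basis)
set.seed(3)
age <- runif(500, 20, 80)

B <- ns(age, df = 6)

dim(B)                                    # one row per observation, 6 basis columns
interior <- attr(B, "knots")              # interior knots chosen by ns
boundary <- attr(B, "Boundary.knots")     # boundary knots (range of age by default)

print(interior)
print(boundary)
```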
