
I have a dataset df consisting of 8000 observations with the following columns:

org_id property1 property2  property3 uptimeDay event

org_id is a categorical variable with 1199 different levels. The other variables are properties of an organization and are numerical.

coxp_1<-coxph(formula = Surv(uptimeDay, event,type='right') ~ (peroperty1 + property3)^2 + property2 +  I(as.factor(org_id)), data = df_cox)

I am trying to fit the Cox model above in R, but I keep getting the error message below. My guess is that it is caused by my categorical variable (org_id) having too many different levels.

Error in fitter(X, Y, strats, offset, init, control, weights = weights,  : 
  NA/NaN/Inf in foreign function call (arg 6)

Does anybody know what could be a potential solution for this problem?

kjetil b halvorsen
UserYmY
  • I don't think that `traceback()` should be used as a predictor in your call to `coxph`. – EdM Sep 06 '16 at 10:13
  • It seems unlikely that you want 1198 different coefficients for org_id. Why not treat it as a random effect and use coxme from the coxme package? – mdewey Sep 06 '16 at 12:29
  • @mdewey because my goal is to understand the difference in the response variable for each of these organizations. Is that possible with a random effect? – UserYmY Sep 06 '16 at 13:12
  • The usual rule of thumb to avoid overfitting is that you need about 15 events per effective predictor variable, where a categorical variable counts effectively as 1 less than the number of its levels. So even if you solved the problem with the error message you can't really accomplish what you want with a 1199-level categorical variable and only 8000 observations. – EdM Sep 07 '16 at 01:05
  • Just checking to see if `peroperty1` is spelled correctly in your code. Always check the trivial! – user918967 Oct 10 '16 at 23:23
  • Look at https://stats.stackexchange.com/questions/227125/preprocess-categorical-variables-with-many-values/277302#277302 and links therein. Maybe some similar ideas can be used with cox models? – kjetil b halvorsen May 17 '17 at 11:24

1 Answer

The Cox proportional hazards model needs at least one event and one non-event (event = 0) within each level of the categorical variable. When a level lacks one of these, the situation is called perfect classification. To check for it, look at the result of `xtabs(~event + org_id, data = df_cox)`.
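As a rough sketch of that check, the cross-tabulation can also be used to list the affected levels directly. The data frame below is a small synthetic stand-in (an assumption, not the asker's data), and `no_event`, `no_censored` and `problem_levels` are illustrative names:

```r
# Synthetic stand-in for df_cox (hypothetical data, for illustration only)
set.seed(1)
df_cox <- data.frame(
  org_id = sample(paste0("org", 1:50), 300, replace = TRUE),
  event  = rbinom(300, 1, 0.6)
)

# Rows: event status (0 = non-event, 1 = event); columns: org_id levels.
# factor(..., levels = 0:1) guarantees both rows exist in the table.
tab <- xtabs(~ factor(event, levels = 0:1) + org_id, data = df_cox)

# Levels that have no events at all, or no non-events at all
no_event    <- colnames(tab)[tab["1", ] == 0]
no_censored <- colnames(tab)[tab["0", ] == 0]
problem_levels <- union(no_event, no_censored)

print(problem_levels)
```

With 1199 levels the full table is hard to read by eye, so a list of flagged levels may be more practical than printing the table itself.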

Since your dataset has 8000 observations and 1199 different levels, my guess is that a solution would be to increase the number of observations or to combine several levels into one.
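One possible way to combine levels, sketched here under stated assumptions, is to pool every organization with few events into a single "other" level. The data are synthetic, and both the `min_events` threshold and the `org_grouped` column are hypothetical choices made for illustration. For scale, the rule of thumb quoted in the comments (about 15 events per coefficient) would call for roughly 1198 × 15 = 17,970 events to estimate all org_id coefficients, which is more than the 8000 observations available.

```r
# Synthetic stand-in for df_cox with unevenly sized organizations (hypothetical)
set.seed(2)
df_cox <- data.frame(
  org_id = sample(paste0("org", 1:40), 500, replace = TRUE, prob = (40:1)^2),
  event  = rbinom(500, 1, 0.5)
)

min_events <- 15  # assumed, illustrative threshold; tune for your own data

# Count events per organization and find the ones below the threshold
events_per_org <- tapply(df_cox$event, df_cox$org_id, sum)
small_orgs <- names(events_per_org)[events_per_org < min_events]

# Pool the small organizations into one level; keep the rest as they are
org_chr <- as.character(df_cox$org_id)
df_cox$org_grouped <- factor(ifelse(org_chr %in% small_orgs, "other", org_chr))

nlevels(df_cox$org_grouped)
```

The pooled variable `org_grouped` could then replace `org_id` in the model formula. Note that pooling hides the differences between the small organizations, so it is a trade-off rather than a free fix.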

Anuj Sao