I was wondering whether it is appropriate to use target encoding (as in CatBoost) for a survival analysis problem (most likely I will approach it first with Cox Proportional Hazards). I have several variables with high cardinality, some of which appear with similar frequencies. I am unsure whether target encoding is a good fit when subjects are censored.
1 Answer
First, before you jump into a model with high-cardinality predictors, see if you can use your knowledge of the subject matter or unsupervised learning methods to get around that problem. It might make sense to do things like combine related categories, etc. Harrell discusses those approaches in Chapter 4 of his text and course notes, with examples of application in later chapters.
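As a minimal sketch of that first step (assuming a data frame `df` with a high-cardinality factor `region`; all names here are hypothetical placeholders), you could lump rare levels together before any supervised modeling, so the grouping never looks at the outcome:

```r
# Lump levels seen fewer than, say, 20 times into a single "Other" level,
# based purely on frequency (no peeking at the survival outcome).
library(dplyr)
library(forcats)

df <- df %>%
  mutate(region_grouped = fct_lump_min(region, min = 20, other_level = "Other"))

table(df$region_grouped)  # check how many levels remain
```

Subject-matter groupings (e.g., merging clinically related categories by hand) are usually preferable to purely frequency-based lumping when you have the domain knowledge.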
Second, if you intend to use boosted trees for modeling, consider whether there is anything substantial to be gained by replacing the levels of those categorical predictors (as-is, or with categories combined as suggested in the first paragraph) with their average target values. In that modeling context, are the risks of overfitting and bias introduced by target encoding worth it, if you can (with adequately slow learning) use all the predictors without overfitting?
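Here is a hedged sketch of that "use the predictors as-is with slow learning" route, assuming a data frame `df` with survival time `time`, event indicator `status`, and factor predictors (all names are placeholders). The gbm package accepts factor predictors directly and offers a Cox partial-likelihood loss, so no encoding is needed:

```r
library(survival)
library(gbm)

fit <- gbm(
  Surv(time, status) ~ .,          # boosted trees on the raw factor predictors
  data              = df,
  distribution      = "coxph",     # Cox partial-likelihood loss
  n.trees           = 3000,
  shrinkage         = 0.005,       # slow learning rate to limit overfitting
  interaction.depth = 2,
  cv.folds          = 5
)

best_iter <- gbm.perf(fit, method = "cv")  # choose the number of trees by cross-validation
```

CatBoost itself also handles categorical features natively (via its own ordered target statistics), which is another reason explicit target encoding may buy you little in a tree-boosting workflow.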
Third, it isn't immediately obvious how you could reliably use target encoding when the target is a potentially censored survival time. One possibility that comes to mind is to use a cumulative hazard estimate and some measure of the censoring fraction as "targets" for this purpose. The author of the R mice package recommends using individual Nelson-Aalen cumulative hazard estimates, provided by the package's nelsonaalen() function, as predictors along with censoring indicators for purposes of multiple imputation; see Section 9.1.8 of van Buuren's Flexible Imputation of Missing Data. But I'm not at all sure how well that use of individual estimates and censoring indicators carries over to the averaging used in target encoding, and I'm not aware of any published use of this approach for target encoding in survival models. Finally, it's not at all clear how reliably a model based on that type of target encoding would predict on new data samples.
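To make the idea concrete, here is a speculative sketch, not an established method: compute each subject's Nelson-Aalen cumulative hazard estimate with mice::nelsonaalen() (assuming it is called with the data frame and unquoted column names, as in van Buuren's examples) and average it within each level of a high-cardinality factor. Column names (`time`, `status`, `region`) are placeholders, and leakage control (e.g., out-of-fold encoding) is omitted for brevity but would matter in practice:

```r
library(mice)
library(dplyr)

# Per-subject Nelson-Aalen cumulative hazard at the observed time
df$cumhaz <- nelsonaalen(df, timevar = time, statusvar = status)

# Level-wise summaries used as a makeshift "target" encoding
encoding <- df %>%
  group_by(region) %>%
  summarise(
    region_hazard_enc = mean(cumhaz),       # average cumulative hazard per level
    region_cens_frac  = mean(status == 0)   # censoring fraction per level
  )

df <- left_join(df, encoding, by = "region")
```

Whether such level-wise averages behave sensibly under heavy or informative censoring, and whether they transfer to new samples, are exactly the open questions raised above.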

Thank you for your answer and for the recommended docs! I will try the first approach and group categories as much as possible. – Gabriela Stoica Nov 23 '21 at 08:38