Can you help me understand how to handle feature selection for a survival model? I am trying to predict the median lifetime of consumers based on panel data (i.e. a representative sample regularly reporting whether they are still using the product or not).

The data set size is decent, with ~17k observations, but the number of possible covariates is small, since only a limited number of variables are known for the general population. One candidate predictor I am attempting to include is which product customers are using. The KM-estimator looks like this:

[Figure: survival function for products A, B, C]

Products A and B clearly differ from each other, but product C is relatively new on the market, and it looks like longer lifetimes have simply not yet been observed for it, so in prediction (extrapolated Weibull) I get a shorter lifetime expectancy for C. This is not intuitively right, as the product is performing great in general. How do I deal with such a situation? Is it some kind of left-censoring? Should I discard this feature entirely, or somehow transform it?

Silverfish

1 Answer

The answer depends on whether you want (1) simply to model the duration of use of individual Products, or (2) estimate overall customer lifetime, with customers' use of various Products as predictors.

Case (1) is straightforward, if you assume that a Product is not re-activated once abandoned. I assume that time = 0 is the time that a customer acquires the product (one typically starts from time = 0 rather than time = 1), and the "event" occurs when the customer reports that the product has been abandoned. Despite your sense that Product C "is performing great in general," your Kaplan-Meier plots seem to show that Product C is abandoned much more quickly at early times than Products A or B, although things seem to even out by time = 10.* It's hard to judge without error estimates on the curves and a sense of right-censoring (if any); a very limited number of observations on Product C (despite the overall size of your dataset) might be playing a role.
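To make the point about error estimates concrete, here is a minimal sketch of a Kaplan-Meier estimate with Greenwood standard errors and the number at risk, which you could compute separately per Product. The function name `kaplan_meier` and the input layout (one duration and one event flag per customer-product pair) are assumptions for illustration, not something taken from your data.

```python
import math
from collections import Counter

def kaplan_meier(durations, events):
    """Kaplan-Meier estimate with Greenwood standard errors.

    durations: observed time in use for each customer (hypothetical layout).
    events: 1 if abandonment was observed, 0 if right-censored.
    Returns a list of (time, n_at_risk, survival, std_error) tuples,
    one per distinct time at which an abandonment was observed.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)          # everyone leaving the risk set at t
    n_at_risk = len(durations)
    surv, greenwood = 1.0, 0.0
    rows = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            surv *= 1 - d / n_at_risk
            if n_at_risk > d:
                greenwood += d / (n_at_risk * (n_at_risk - d))
                se = surv * math.sqrt(greenwood)
            else:
                se = float("nan")       # Greenwood is undefined once S(t) = 0
            rows.append((t, n_at_risk, surv, se))
        n_at_risk -= exits[t]           # events and censorings both leave
    return rows
```

If the number at risk for Product C is small at later times, the standard errors there will be wide, and an apparent difference from A and B may not mean much.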

For Case (2) you would need clear definitions of time = 0 for each customer, and for the time of the "event" of losing the customer.** Once you have those defined, then you could consider use of each of the Products as time-dependent covariates, maybe even considering combinations of Products as interaction terms. Such data are typically coded in a (startTime, stopTime, event) format for each combination of predictor values over time, left-truncated and (potentially) right-censored for each time interval.*** This would, for example, handle Product C being released at a time after a customer enters your study, then adopted by an established customer. This also allows for a customer to re-adopt a Product after previously abandoning it, so that the current use of all Products by a customer is related to the current risk of losing that customer.
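As a rough illustration of the (startTime, stopTime, event) coding, the sketch below splits one customer's follow-up into intervals over which the set of Products in use is constant. The function name, the dictionary keys, and the restriction to Products A, B and C are hypothetical; real data would also need a rule for turning interview waves into change times.

```python
def to_counting_process(product_changes, exit_time, lost):
    """Split one customer's history into (start, stop, event) rows.

    product_changes: list of (time, set_of_products_in_use), sorted by time;
        the first entry marks the customer's time = 0 (or a later entry time).
    exit_time: time the customer was lost or last observed.
    lost: True if the customer was lost at exit_time, False if right-censored.
    """
    times = [t for t, _ in product_changes]
    if times != sorted(set(times)) or times[-1] >= exit_time:
        raise ValueError("change times must be increasing and before exit_time")
    rows = []
    for i, (start, products) in enumerate(product_changes):
        is_last = i + 1 == len(product_changes)
        stop = exit_time if is_last else product_changes[i + 1][0]
        rows.append({
            "start": start,
            "stop": stop,
            # the event can only occur at the end of the final interval
            "event": int(lost and is_last),
            "uses_A": int("A" in products),
            "uses_B": int("B" in products),
            "uses_C": int("C" in products),
        })
    return rows

# A customer who starts with A, adopts C at time 6, drops A at time 9,
# and is lost at time 14.
history = [(0, {"A"}), (6, {"A", "C"}), (9, {"C"})]
rows = to_counting_process(history, exit_time=14, lost=True)
```

Rows of this shape are what time-varying Cox implementations generally expect as input, with each row contributing to the risk set only between its start and stop times.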


*I'm assuming that these are empirical Kaplan-Meier curves rather than extrapolations from some model. It's really dangerous to try to extrapolate beyond the survival times over which you have collected data. Also, it looks like you have discrete-time data rather than the continuous-time data appropriate for things like Weibull models, so you should consider discrete-time modeling unless you can get the actual abandonment times (between the interview times) from the customers.
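A minimal sketch of the discrete-time idea, assuming each customer is summarized by the last interview wave at which they were observed and whether abandonment was reported at that wave (both names and the data layout are hypothetical). Each customer is expanded into one row per wave at risk, and the hazard at each wave is the fraction of at-risk customers who abandon then.

```python
def person_period(last_wave, abandoned):
    """One (wave, event) row for every wave the customer was at risk."""
    return [(w, int(abandoned and w == last_wave))
            for w in range(1, last_wave + 1)]

def discrete_hazard(customers):
    """Empirical discrete-time hazard and survival per wave.

    customers: iterable of (last_wave, abandoned) pairs.
    Returns a list of (wave, hazard, survival) tuples.
    """
    at_risk, events = {}, {}
    for last_wave, abandoned in customers:
        for wave, event in person_period(last_wave, abandoned):
            at_risk[wave] = at_risk.get(wave, 0) + 1
            events[wave] = events.get(wave, 0) + event
    table, surv = [], 1.0
    for wave in sorted(at_risk):
        hazard = events[wave] / at_risk[wave]
        surv *= 1 - hazard
        table.append((wave, hazard, surv))
    return table
```

With covariates, the same person-period rows can be fed to a binary regression (logistic or complementary log-log link) with wave as a predictor, which is the usual way to fit a discrete-time survival model.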

**This time for loss of a customer is not always easy to define. As a friend used to ask me: "When you're popping popcorn, how do you know when the very last kernel has popped?"

*** Left truncation means you have no information prior to the startTime of a time interval. Left censoring means you know a maximum value for an observation, just not the precise value.

EdM
  • Thank you for the elaborate answer, EdM! I now see that my problem is not connected with censoring; it is about time-dependency of covariates. One thing (which I now see is missing from the data I provided) makes me suspect that product C is not worse than A and B: if we split the data for A and B into periods (let's say before and after the launch of C, though I think it is more related to the pandemic), we will see that recent purchases of A and B behave more like C. I think I can do the formal check now, but if that is the case, in what direction should I think next? – Ema Nymton Mar 28 '21 at 08:14
  • @EmaNymton first make sure you have a good definition of the starting `time = 0` for each individual and of the "event" of losing a customer. Survival models generally do best when you incorporate as much information as you have that might be related to outcome. In your case that requires time-dependent covariates, of which "during the pandemic" seems to be particularly important. Remember that it's the _instantaneous_ values of covariates at event times that are used in modeling. How to predict about post-pandemic times? Don't know that the model alone can do that. – EdM Mar 28 '21 at 15:07