I have been given a new analytics problem to solve. The context is app analytics, where we would like to predict total revenue per app install 30 days after install based on just the first 7 days of data. I.e., a week after install, can we predict what revenue will look like after 30 days?

The problem is that the vast majority of installs do not spend. Just for example, let's say it's 99%. So the data are imbalanced, and in most cases revenue will be 0. For the remaining cases we may see any positive dollar amount, e.g. 1, 2, 100, 999, depending on how much each spender spends.

The goal in this case is predictive accuracy rather than inference. Also, I'll be using R.

I went into this thinking it was a regression problem. However, after some research this week I encountered the concepts of 'hurdle' and 'zero-inflated' models, which were new to me. This post was particularly useful.

Having read that post and some others on zero-inflated and hurdle models, I'm now trying to decide whether I should stick to just running a regression or try something new: a zero-inflated or hurdle model.

The concept of a hurdle model sounds simple to reproduce on my own with an ensemble method such as XGBoost. Is it as simple as creating a binary classifier and then, after defining a threshold, applying a regression model to those cases above the threshold? Is there anything 'wrong' with using XGBoost for both the binary classifier and then another XGBoost model for the regression part?

I'm thinking XGBoost because the goal here is accuracy over inference and I've had good results with XGBoost in the past. R has a package, pscl, with functions zeroinfl() and hurdle(). It looks like these functions use a predefined approach based on GLMs in R. Since the package already exists, it's tempting to use it. But since it uses GLMs, I'm wondering whether the package is aimed more at explanatory power than predictive power?

Given the problem described above, what are some sound approaches to predicting 30-day revenue based on behavior in the first 7 days after install?

Summary of relevant tidbits:

  • Ultimately seeking a numeric outcome - 30 day cumulative revenue per install
  • Expect most cases (~99%) to be 0
  • For this project, the primary goal is accuracy over inference
  • Trying to determine whether a hurdle or zero-inflated model is appropriate
  • Seeking validation that using XGBoost is also a sound approach:

    • Just a single XGBoost regression, ignoring the fact I have imbalanced data OR

    • Imitate what I have learned about hurdle models using XGBoost: first build a binary classifier for spender (1) vs. non-spender (0), then use the class probabilities to define a threshold for a 'spender', and finally apply a regression model to those cases to predict 30-day revenue.

    • I could use XGBoost for both the binary classifier and the regression models.
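
To make the two-stage idea in the bullets above concrete, here is a minimal sketch in Python (used purely for illustration; the structure should carry over to R). Everything in it is an assumption made up for the example: the synthetic installs, the single `bucket` feature, the spend rates, and the helper names. Per-bucket averages stand in for the two XGBoost models. It contrasts the hard-threshold variant with combining the two stages as E[revenue | x] = P(spend | x) × E[revenue | spend, x].

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical spend probability per day-7 engagement bucket (made-up numbers).
TRUE_P_SPEND = [0.002, 0.01, 0.05]


def simulate_install():
    """Return (bucket, 30-day revenue) for one synthetic install."""
    bucket = random.choice([0, 1, 2])
    if random.random() < TRUE_P_SPEND[bucket]:
        # Spenders get a right-skewed, strictly positive revenue.
        return bucket, random.lognormvariate(1.0 + 0.5 * bucket, 1.0)
    return bucket, 0.0


train = [simulate_install() for _ in range(100_000)]

# Sufficient statistics per bucket for both stages.
installs = defaultdict(int)
spenders = defaultdict(int)
spender_revenue = defaultdict(float)
for bucket, revenue in train:
    installs[bucket] += 1
    if revenue > 0:
        spenders[bucket] += 1
        spender_revenue[bucket] += revenue


def p_spend(bucket):
    """Stage 1 (stand-in for the classifier): estimated P(revenue > 0 | bucket)."""
    return spenders[bucket] / installs[bucket]


def mean_if_spend(bucket):
    """Stage 2 (stand-in for a regressor fit on spenders only)."""
    if spenders[bucket] == 0:
        return 0.0
    return spender_revenue[bucket] / spenders[bucket]


def predict_threshold(bucket, threshold=0.5):
    """Hard hurdle: predict revenue only if the install is classified as a spender."""
    return mean_if_spend(bucket) if p_spend(bucket) >= threshold else 0.0


def predict_expected(bucket):
    """Soft hurdle: probability of spending times expected revenue given spending."""
    return p_spend(bucket) * mean_if_spend(bucket)


for bucket in sorted(installs):
    print(bucket, round(p_spend(bucket), 4), round(mean_if_spend(bucket), 2),
          predict_threshold(bucket), round(predict_expected(bucket), 4))
```

With spend rates this low, no bucket reaches a 0.5 spend probability, so the thresholded variant predicts 0 for every install, whereas the product of the two stages still sums to the total observed revenue on the training data.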

Doug Fir
  • A follow-up question: would a sound approach be to take the predicted probability of being a spender from the classifier and multiply it by the numerical output of the regression model? I.e., a weighted regression model where the predictions are multiplied by the class probability of 1 (spender)? – Doug Fir Nov 30 '19 at 18:37
  • Did you try using SMOTE or SMOTER to balance the zero and nonzero values and then applying a regression model? – thereandhere1 Apr 03 '20 at 20:07
