Interpreting ARIMA Model Output with Exogenous Variables

Question

I am using the R "forecast" package to predict churn by including external variables.

However, in my case the result is a bit confusing. I would expect that when you introduce more titles, fewer people will leave. Yet the coefficient of titles_live is negative, whereas that of toptitles_new is positive?

When I look at a correlation analysis, both variables are negatively correlated with churn.

1- Correlation

> cor(churn_rate, titles_live)
  [1] -0.6511904

> cor(churn_rate, toptitles_new)
 [1] -0.3265537

Whereas in a simple linear model, titles_live shows a negative relationship with churn rate and toptitles_new a positive one.

2- Simple Linear Model

summary(lm (churn_rate ~  titles_live  + toptitles_new, 
    data = in_out_p_month))

Call:
     lm(formula = churn_rate ~ titles_live  + toptitles_new, 
        data = in_out_p_month)

Residuals:
      Min      1Q    Median      3Q     Max 
    -6.7563 -1.5096  0.1252  1.7473  9.3720 

 Coefficients:
                  Estimate Std. Error     t value Pr(>|t|)    
 (Intercept)       32.29875     3.32502   9.714   3.87e-10 ***
 titles_live.      -0.05407     0.01428  -3.787   0.000813 ***
 toptitles_new     0.03337      0.37485   0.089   0.929750    
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Residual standard error: 3.684 on 26 degrees of freedom
 Multiple R-squared:  0.4242,    Adjusted R-squared:  0.3799 
 F-statistic: 9.578 on 2 and 26 DF,  p-value: 0.0007644

3- Forecasting with auto.arima

When I include those two variables as external regressors in my ARIMA model, my prediction for churn is quite close to the real values, even though one of them was not significantly associated with churn.

The direction of the relationships is the same as in the linear regression. As far as I understand, when using dynamic regression we can interpret the regression coefficients. Can someone explain what could be the reason for this weird relationship? Also, could you please help me interpret the output from the model?

summary(arima_model_churn_rate)
   Series: ts_churn_rate_train 
   Regression with ARIMA(2,0,0) errors 

Coefficients:
        ar1      ar2       intercept  titles_live       toptitles_new
        0.8090  -0.5021    32.5879    -0.0573            0.3096
s.e.    0.1742   0.1833     4.5682     0.0190            0.3121

sigma^2 estimated as 9.454:  log likelihood=-58.67
AIC=129.34   AICc=134.28   BIC=136.41

Training set error measures:
            ME          RMSE    MAE       MPE     MAPE      MASE        
Training set -0.05194746 2.735777 2.347408 -2.053318 12.17807 0.3342007 
          ACF1 
         -0.06623817

Many thanks in advance !!!

This would be easier to answer if you explained churn of what, and titles of what. — Peter Ellis, Jul 12 '17 at 21:32
One likely explanation is at https://stats.stackexchange.com/questions/31841/coefficients-change-signs/32237#32237. It is difficult to be specific because your alleged output is inconsistent with the commands you posted! (The variable names are different between the commands and the `lm` summary.) — whuber, Jul 12 '17 at 21:35
@Peter. Churn is how many people are going to leave an online subscription for watching a sports channel. — James Taylor, Jul 13 '17 at 06:49

score 1 · Accepted Answer · answered Jul 12 '17 at 21:45

Putting aside the inconsistency in your code and output...

I don't think your problem, as I understand it, is actually anything to do with the time series nature of the problem. The results from auto.arima and from lm are pretty similar with regard to the relationship of titles_live to churn (about -0.05, with a standard error about a third the size of the estimate indicating it's quite a way from zero). And the estimated coefficient of toptitles_new is small compared to its standard error in both cases.
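As a quick check of that comparison, the estimate-to-standard-error ratios can be computed directly from the numbers printed in the two summaries in the question. This is only a rough sketch: the 1.96 multiplier assumes an approximately normal sampling distribution, which is an approximation for a sample this small.

```r
# Estimates and standard errors copied from the lm and ARIMA summaries above
est <- c(lm_titles_live      = -0.05407,
         lm_toptitles_new    =  0.03337,
         arima_titles_live   = -0.0573,
         arima_toptitles_new =  0.3096)
se  <- c(0.01428, 0.37485, 0.0190, 0.3121)

# Ratio of each estimate to its standard error
ratio <- est / se
print(round(ratio, 2))

# Rough 95% intervals (normal approximation, an assumption)
ci <- cbind(lower = est - 1.96 * se, upper = est + 1.96 * se)
print(round(ci, 4))
```

The intervals for titles_live exclude zero in both models, while those for toptitles_new straddle zero.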

You can interpret the coefficients of auto.arima with xreg output similarly to output from lm - this is by design and if you search the web and in particular Rob Hyndman's excellent Hyndsight blog you will find various explanations of his design choices.
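A minimal sketch of that parallel, on simulated data rather than the data from the question. It uses base R's `arima()` with `xreg`, which fits the same kind of regression with ARIMA errors; the sample size, the simulated values and the true coefficients (-0.05 and 0) are all assumptions made for illustration.

```r
set.seed(42)
n <- 200

# Simulated regressors (illustrative values only)
titles_live   <- rnorm(n, mean = 200, sd = 30)
toptitles_new <- rnorm(n, mean = 5, sd = 2)

# AR(2) errors, then a response with a true slope of -0.05 on titles_live
# and no effect of toptitles_new
err <- arima.sim(model = list(ar = c(0.8, -0.5)), n = n)
churn_rate <- 30 - 0.05 * titles_live + 0 * toptitles_new + as.numeric(err)

# Regression with AR(2) errors versus ordinary least squares
xreg <- cbind(titles_live = titles_live, toptitles_new = toptitles_new)
fit_arima <- arima(churn_rate, order = c(2, 0, 0), xreg = xreg)
fit_lm    <- lm(churn_rate ~ titles_live + toptitles_new)

# Coefficients and standard errors from the ARIMA fit
est <- coef(fit_arima)
se  <- sqrt(diag(fit_arima$var.coef))
print(round(cbind(estimate = est, se = se), 4))

# Coefficients from lm for comparison
print(round(coef(fit_lm), 4))
```

The xreg coefficients are read like lm slopes: the expected change in the response per unit change in the regressor, holding the other regressor fixed. The ar1 and ar2 terms describe the autocorrelation of the errors.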

So, I don't think there is any "weird" relationship to explain here. There is good evidence that titles_live is negatively related to churn even when controlling for toptitles_new; and there isn't good evidence of a relationship between toptitles_new and churn, once controlling for titles_live. In other words, the correlation of -0.3 you are seeing between toptitles_new and churn in your first use of cor() is either pure noise (which is what I suspect), or perhaps toptitles_new is correlated with titles_live and is serving as a proxy for that; an effect which disappears in the correctly specified model with both explanatory variables in it.
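The proxy explanation can be illustrated with a small simulation. The data below are invented, not taken from the question: the second variable has no effect of its own on the response and is merely correlated with the first, which is an assumption chosen to reproduce the pattern.

```r
set.seed(1)
n <- 500

# titles_live drives churn; toptitles_new is only correlated with titles_live
titles_live   <- rnorm(n)
toptitles_new <- 0.6 * titles_live + rnorm(n, sd = 0.8)
churn_rate    <- -1 * titles_live + rnorm(n)

# Marginal correlations: both come out negative
r_live <- cor(churn_rate, titles_live)
r_new  <- cor(churn_rate, toptitles_new)
print(c(titles_live = r_live, toptitles_new = r_new))

# Joint model: the toptitles_new coefficient is small relative to its
# standard error and can take either sign
fit <- lm(churn_rate ~ titles_live + toptitles_new)
print(round(summary(fit)$coefficients, 4))
```

A negative marginal correlation together with a near-zero (possibly positive) coefficient in the joint model is therefore not contradictory.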

There is less churn (whatever that is) when you have more "live titles" (whatever that means); but having more "new top titles" (whatever that means) doesn't help reduce churn (or at least, there is no evidence here that it does).

Thanks very much. I did edit the code, as I could not share it here. If you would like to have a look, I can share the data and code with you. It would indeed be a great help to me. — James Taylor, Jul 13 '17 at 06:46

score 0 · Answer 2 · answered Jul 12 '17 at 21:29

Simply take the AR polynomial and use it as a multiplier of each of your two predictors to get the response function. In this case you would get 3 coefficients for each of your predictors, reflecting the contemporaneous, lag-1 and lag-2 effects. This kind of useful output is available with some software. Perhaps you can get the author of your forecasting package to do just that.
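A sketch of that calculation using the coefficients printed in the question's ARIMA summary. It assumes the sign convention used by R's `arima()`, in which the AR polynomial is 1 - ar1*B - ar2*B^2, so that multiplying the model through by it gives lag weights of beta, -ar1*beta and -ar2*beta on each predictor (with lagged values of the response also appearing on the right-hand side).

```r
# Coefficients copied from the ARIMA summary in the question
ar   <- c(ar1 = 0.8090, ar2 = -0.5021)
beta <- c(titles_live = -0.0573, toptitles_new = 0.3096)

# AR polynomial 1 - ar1*B - ar2*B^2 written as weights on lags 0, 1, 2
phi <- c(lag0 = 1, lag1 = -ar[["ar1"]], lag2 = -ar[["ar2"]])

# One row per predictor, one column per lag
lag_coefs <- outer(beta, phi)
print(round(lag_coefs, 4))
```

These lag weights apply to the form of the model that also conditions on lagged churn, so they answer a different question from the long-run regression coefficients reported by the package.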
