One of the factors in my regression analysis is the customer's familiarity with the store, which equals 1 if the customer visited the store more than $N$ times and 0 otherwise. Is there a right way to choose $N$? For some values of $N$ this factor is statistically significant, and for others it is not.

8k14
  • You are encountering the substantive import of *aggregation bias* and the necessity of theoretical justification for aggregation. See, for example, Gehlke, C. E. and Biehl, K. (1934). Certain Effects of Grouping Upon the Size of the Correlation Coefficient in Census Tract Material. *Journal of the American Statistical Association*, 29(185):169–170. Or possibly see Openshaw, S. and Taylor, P. J. (1979). Statistical Applications in the Spatial Sciences, chapter A million or so correlation coefficients: Three experiments on the modifiable areal unit problem, pages 127–144. *Pion*, London, UK. – Alexis Dec 26 '17 at 19:48
  • Must you choose a threshold $N$ at all? Can you not study directly how the number of visits is related to the response? – whuber Dec 26 '17 at 19:55
  • @whuber A customer with 200 visits is not twice as familiar with the store as a customer with 100 visits. – 8k14 Dec 26 '17 at 20:01
  • Nobody is suggesting that. The first order of business is to study how number of visits relates to the response, and then to incorporate the number of visits into the regression in a suitable fashion. Forcing that number into a binary variable may be too Procrustean. – whuber Dec 26 '17 at 20:06
  • @whuber Thanks for your comments. Do you suggest using the number of visits as is, as a factor? – 8k14 Dec 26 '17 at 20:13
  • Why not model the relationship using [nonparametric regression](https://en.wikipedia.org/wiki/Nonparametric_regression) on the number of visits, thereby limiting *a priori* assertions of a specific functional form relating *Y* and *X* (within the limit of the number of smooths accommodated by the sample size and joint distribution, naturally)? See, for example, Buja, A., Hastie, T., and Tibshirani, R. (1989). [Linear Smoothers and Additive Models](http://projecteuclid.org/download/pdf_1/euclid.aos/1176347115). *The Annals of Statistics*, 17(2):453–510. – Alexis Dec 26 '17 at 21:47
  • @Alexis Thanks. I think it's too complicated for my needs. – 8k14 Dec 27 '17 at 04:00
  • It's no more complicated to use and interpret than linear regression. Usually the syntax is along the lines of `npreg Y X`. – Alexis Dec 27 '17 at 23:11

1 Answer

Don't discretize your predictor at all. This would amount to treating everyone with $0$ to $N$ visits exactly the same, and also treating everyone with $N+1, N+2, \dots, 2N, \dots, 1000N, \dots$ exactly the same, with a discontinuous step at $N$. This is almost certainly not a good reflection of reality. See this earlier thread for more information: "What is the justification for unsupervised discretization of continuous variables?", in particular this page edited by Frank Harrell.

As you note, it makes little sense to include the number of visits "as is", as familiarity with the store will not scale linearly with the number of visits.

My recommendation would be to transform the number of visits using, e.g., restricted cubic splines or natural splines. A very good introduction can be found at the very beginning of Frank Harrell's Regression Modeling Strategies.
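
To make the spline transformation concrete, here is a minimal sketch in plain Python of a restricted cubic spline basis in the truncated power form. The function name `rcs_basis`, the made-up visit counts, and the choice of four knots at the 5th, 35th, 65th and 95th percentiles are assumptions for illustration; in practice you would use a library routine such as `splines::ns` in R.

```python
from statistics import quantiles


def rcs_basis(x, knots):
    """Restricted cubic spline regressors for a single value x.

    For k knots this returns k - 1 numbers: x itself plus k - 2 nonlinear
    terms that are constrained to be linear beyond the outer knots.
    """
    t = list(knots)
    if len(t) < 3 or any(a >= b for a, b in zip(t, t[1:])):
        raise ValueError("need at least 3 strictly increasing knots")

    def cube_plus(u):
        # Truncated cubic: u^3 for positive u, zero otherwise.
        return u ** 3 if u > 0 else 0.0

    span = t[-1] - t[-2]
    scale = (t[-1] - t[0]) ** 2  # rescaling only, keeps the terms moderate in size
    terms = [float(x)]
    for tj in t[:-2]:
        term = (cube_plus(x - tj)
                - cube_plus(x - t[-2]) * (t[-1] - tj) / span
                + cube_plus(x - t[-1]) * (t[-2] - tj) / span)
        terms.append(term / scale)
    return terms


# Hypothetical visit counts, made up for illustration only.
visits = [1, 1, 2, 2, 3, 4, 5, 6, 8, 9, 11, 13, 15, 18, 22, 27, 33, 41, 55, 80]

# Knots at the 5th, 35th, 65th and 95th percentiles of the observed visits.
cuts = quantiles(visits, n=20, method="inclusive")  # cut points at 5%, 10%, ..., 95%
knots = [cuts[0], cuts[6], cuts[12], cuts[18]]

# One row of spline regressors per customer; these replace the raw count
# (or the 0/1 dummy) in the regression.
design = [rcs_basis(v, knots) for v in visits]
```

Each customer now contributes three regressors instead of one dummy, and the fitted effect of visits is a smooth curve rather than a step at $N$.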

Stephan Kolassa
  • Thank you for your detailed explanation. How can I choose the spline knots? Isn't it the same problem as choosing $N$ in my original question? – 8k14 Dec 28 '17 at 11:50
  • I wouldn't say that it's the same as choosing $N$. Yes, it's a choice for a parameter. But the *consequences* are different: choosing a threshold makes your response discontinuous, but choosing spline knots only deforms a continuous response curve. ... – Stephan Kolassa Dec 28 '17 at 15:06
  • ... You would typically set the knots at specific quantiles of your observed number of store visits. Harrell has a rule of thumb table in his book. You can look at the default behavior of the `splines::ns()` function in R, [see this earlier thread](https://stats.stackexchange.com/q/7316/1352), or look [through earlier questions](https://stats.stackexchange.com/search?q=%5Bsplines%5D+knots+is%3Aq). – Stephan Kolassa Dec 28 '17 at 15:06
  • Thank you. By saying that choosing knots is the same as choosing the threshold, I mean that it is also unrelated to the relationship between familiarity and the number of visits. For example, the idea that this relationship is not linear is not reflected there, right? – 8k14 Dec 28 '17 at 17:06
  • The nonlinearity will come in once you have transformed your original variable into a set of multiple spline regressors and fitted a model. The weighted sum of the spline regressors, weighted by the estimated coefficients, will be a nonlinear response function. Take a look at [the Wikipedia page](https://en.wikipedia.org/wiki/Spline_(mathematics)), or run the example in the help page for `splines::ns`. – Stephan Kolassa Dec 28 '17 at 17:12
  • Thanks. How can the coefficients of `ns` be interpreted? – 8k14 Dec 28 '17 at 18:53
  • Interpretation of spline coefficients is hard. Better to just calculate the matrix product between the spline regressors and the parameter estimates and plot this (the response function) against the number of visits. – Stephan Kolassa Dec 28 '17 at 19:06
  • Thanks again. Could you please be a bit more detailed? I just need to know if the corresponding factor has a statistically significant effect on the outcome. – 8k14 Dec 28 '17 at 20:26
  • Ah. In that case, I would recommend that you compare two models, e.g., using ANOVA or a likelihood ratio test. Model 1 would contain all your predictors *except* the number of visits (or any transform). Model 2 would contain all predictors *plus* the spline-transformed number of visits. Thus, Model 2 nests Model 1, and you can compare them using ANOVA or similar. – Stephan Kolassa Dec 28 '17 at 20:31
  • Thank you very much. So I can see the statistical significance, but what about the direction of the effect? The signs of the spline coefficients? It's fine if they are all the same, but what if they aren't? – 8k14 Dec 28 '17 at 20:46
  • Even the signs are not overly helpful. The response function can curve up and down. Best to plot it and eyeball it. – Stephan Kolassa Dec 28 '17 at 20:51
  • Thank you. Then why on earth is linear regression still in use? No relation is purely linear... – 8k14 Dec 29 '17 at 04:34
  • [A simpler model can perform better than a correct one](https://ideas.repec.org/a/for/ijafaa/y2016i40p20-26.html) because of the [bias-variance tradeoff](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff). – Stephan Kolassa Dec 29 '17 at 07:35
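
The nested-model comparison and the response function discussed in these comments can be sketched as follows. This is only an illustration on simulated data, assuming NumPy is available: the names `rcs_basis` and `ols`, the stand-in predictor `age`, the four quantile knots, and the data-generating process are all made up. The comparison shown is a partial F-test for ordinary least squares, which is one way to carry out the ANOVA-style comparison of Model 1 and Model 2.

```python
import numpy as np


def rcs_basis(x, knots):
    """Restricted cubic spline regressors: k - 1 columns for k knots."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    if t.size < 3 or np.any(np.diff(t) <= 0):
        raise ValueError("need at least 3 strictly increasing knots")

    def cube_plus(u):
        return np.clip(u, 0.0, None) ** 3

    span, scale = t[-1] - t[-2], (t[-1] - t[0]) ** 2
    cols = [x]
    for tj in t[:-2]:
        cols.append((cube_plus(x - tj)
                     - cube_plus(x - t[-2]) * (t[-1] - tj) / span
                     + cube_plus(x - t[-1]) * (t[-2] - tj) / span) / scale)
    return np.column_stack(cols)


def ols(X, y):
    """Least-squares coefficients and residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)


# Simulated data, for illustration only; 'age' stands in for the other predictors.
rng = np.random.default_rng(0)
n = 300
visits = rng.geometric(0.05, size=n).astype(float)
age = rng.normal(40.0, 10.0, size=n)
y = 1.0 + 0.05 * age + 2.0 * np.log1p(visits) + rng.normal(0.0, 1.0, size=n)

knots = np.quantile(visits, [0.05, 0.35, 0.65, 0.95])
S = rcs_basis(visits, knots)

X_reduced = np.column_stack([np.ones(n), age])  # Model 1: everything except visits
X_full = np.column_stack([X_reduced, S])        # Model 2: plus the spline terms

beta_reduced, rss_reduced = ols(X_reduced, y)
beta_full, rss_full = ols(X_full, y)

# Partial F statistic for the joint contribution of all spline terms.
df_num = X_full.shape[1] - X_reduced.shape[1]
df_den = n - X_full.shape[1]
f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
# The p-value is the upper tail of an F(df_num, df_den) distribution,
# e.g. scipy.stats.f.sf(f_stat, df_num, df_den) if SciPy is installed.

# Estimated response function: spline regressors times their coefficients,
# evaluated on a grid of visit counts (this is the curve worth plotting).
grid = np.linspace(visits.min(), visits.max(), 50)
curve = rcs_basis(grid, knots) @ beta_full[-df_num:]
```

A single test on all spline terms together answers the significance question, while `curve` plotted against `grid` shows the direction and shape of the effect, which the individual coefficient signs do not.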