
In a logistic regression with an N of 40,000, the purchase decision appears unrelated to price. However, once certain demographic variables are controlled, price shows a positive coefficient of meaningful strength. (Searches for 2-, 3-, and 4-way interactions involving price have yielded nothing.)

The goal here is to maximize the leverage one can get out of discounting. There seems to be a real risk of confusing the "real" relationship with one that, in the words of Elazar Pedhazur, has an "air of fantasy about it" because of statistical manipulation (regression control). Any suggestions on how to proceed?

Karl2
  • Alternatively, one could interpret the initial results of no relation as the "statistical fantasy," because the subsequent "meaningful" relationships found upon controlling for demographics suggest the original model was a bad fit to the data. What results do your goodness-of-fit tests and other model diagnostic tests show? – whuber Feb 17 '12 at 18:08
  • Over many regression trials on subgroups, here's a fairly typical set of results when only price was used as a predictor vs. when 4 other vars were controlled: N=2958, -2LL = 2534 vs. 2210, Chi-Square = 6 vs. 329, p = .02 vs. <.0001, Cox pseudo RSq = .002 vs. .11, correct classification rate = .846 vs. .841. Actually the latter did little to improve classification over the null model. Please let me know if you were looking for other indicators. – Karl2 Feb 17 '12 at 22:29
  • That sure looks like including the covariates has improved things, but it's hard to say. Hosmer & Lemeshow describe several ways to check for approximate linearity of the logit versus the independent variables: that would be useful to do here. You might want to apply some robust methods, too, and see whether things change much (in both models) when you exclude outliers and high-leverage data. – whuber Feb 17 '12 at 22:33
  • Thanks. Would checking for linearity between the logit and the Xs be akin to doing visual checks of scatterplots between Y and each X? Because those didn't reveal any nonlinearity. – Karl2 Feb 17 '12 at 22:44
  • One method smooths the Y's (e.g., Lowess), takes the logit of the smooth, and compares that to each X (see the sketch just below these comments). – whuber Feb 17 '12 at 22:47
  • Re the question in the question text: causation cannot be obtained from any type of analysis, logistic regression or otherwise; it rests on study design. – Michelle Feb 19 '12 at 18:53
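A minimal sketch of the smoothed-logit check whuber describes above, assuming Python with statsmodels: lowess-smooth the binary outcome against a predictor, take the logit of the smoothed values, and look for departures from linearity. The variable names (`price`, `purchase`) and the simulated data are illustrative placeholders, not the questioner's data.

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

def logit_linearity_plot(x, y, frac=0.3, eps=1e-4, ax=None):
    """Plot logit(lowess-smoothed y) against x for a binary y."""
    smoothed = lowess(y, x, frac=frac)          # columns: sorted x, smoothed y
    p = np.clip(smoothed[:, 1], eps, 1 - eps)   # keep smoothed values off 0 and 1
    if ax is None:
        ax = plt.gca()
    ax.plot(smoothed[:, 0], np.log(p / (1 - p)))
    ax.set_xlabel("x")
    ax.set_ylabel("logit of smoothed y")
    return ax

# Simulated stand-in data in place of the real purchase/price variables:
rng = np.random.default_rng(0)
price = rng.uniform(10, 100, 5000)
purchase = rng.binomial(1, 1 / (1 + np.exp(-(-3 + 0.04 * price))))
logit_linearity_plot(price, purchase)
plt.show()
```

If the plotted curve bends noticeably for some X, a transformation or spline term for that predictor is a common remedy, along the lines of the Hosmer & Lemeshow checks mentioned in the comments.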

1 Answer


I agree with @Michelle. In general, experimental control allows for causal inferences, but statistical control does not. In principle, statistically controlling for all confounding variables would allow you to make valid causal inferences, but in practice you have two problems:

First, fishing through a lot of different candidate predictors, and fitting many different models to find demographic variables that, once controlled for, 'improve' the picture, will lead to substantial errors. For one thing, the p-values (should you care about them) will be inaccurate: the p-value returned by your software might be < .05, while the real p-value is much higher. For another, your parameter estimates will be badly biased. I discussed this issue in a related way here, which may help make clear what is going on.
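To make that concrete, here is a small self-contained simulation, assuming Python with statsmodels (none of this is the questioner's data): price truly has no effect on the outcome, yet searching across many candidate controls for a specification in which price "works" yields a nominally significant price coefficient far more often than the 5% the p-value promises.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n, n_candidates, n_sims = 1000, 20, 200
hits_single, hits_fished = 0, 0

for _ in range(n_sims):
    price = rng.normal(size=n)
    # candidate "demographic" controls correlated with price but unrelated to y
    candidates = 0.8 * price[:, None] + 0.6 * rng.normal(size=(n, n_candidates))
    y = rng.binomial(1, 0.3, size=n)   # purchase unrelated to everything

    # single pre-specified model: price only
    p_single = sm.Logit(y, sm.add_constant(price)).fit(disp=0).pvalues[1]
    hits_single += p_single < 0.05

    # "fishing": try each candidate control and keep the best-looking result
    p_best = min(
        sm.Logit(y, sm.add_constant(np.column_stack([price, candidates[:, j]])))
          .fit(disp=0).pvalues[1]
        for j in range(n_candidates)
    )
    hits_fished += p_best < 0.05

print(f"False-positive rate, pre-specified model: {hits_single / n_sims:.2f}")  # typically near .05
print(f"False-positive rate after fishing:        {hits_fished / n_sims:.2f}")  # typically well above .05
```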

Second, even if you pick out variables that really are ones you need to control for, and none that you don't, you still have the problem of endogeneity, because you have no way of ensuring that you have controlled for all such variables. (In this, I am assuming, based on your question, that you are conducting an observational study with secondary data.)

This situation is very unfortunate and very common. With respect to the first issue, in general, my advice would be to pick out a single model to fit, based on information other than your data. Another approach is to split your data into several groups at random (with N=40k, I should think you have plenty), explore one subset, and test candidate models on a different subset. With respect to the second issue, an instrumental variables approach may be your best bet.
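A minimal sketch of the split-sample idea, again assuming Python with pandas and statsmodels; the data frame below is synthetic stand-in data, and the column names (`purchase`, `price`, `age`) are placeholders for whatever the real 40k records contain.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def split_explore_validate(df, explore_frac=0.5, seed=0):
    """Randomly partition df into an exploration set and a validation set."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(df)) < explore_frac
    return df[mask], df[~mask]

# Stand-in data; replace with the real purchase/price/demographic columns.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "price": rng.uniform(10, 100, 40_000),
    "age": rng.integers(18, 80, 40_000),
    "purchase": rng.binomial(1, 0.3, 40_000),
})

explore, validate = split_explore_validate(df)

# Fish as much as you like on the exploration half ...
candidate = smf.logit("purchase ~ price + age", data=explore).fit(disp=0)

# ... then fit the single chosen specification once on the held-out half;
# its p-values and estimates are not contaminated by the search.
confirm = smf.logit("purchase ~ price + age", data=validate).fit(disp=0)
print(confirm.summary())
```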

gung - Reinstate Monica
  • The first 'problem' is orthogonal to causal inference issues. Conditioning can, but does not necessarily, *identify* the/a causal effect by blocking backdoor paths, thereby solving the second problem, at least in some circumstances. And instrumental variables are used when such conditioners are unknown but exogenous causes of the supposed cause can be found. Accurately representing uncertainty in these analyses with p-values or whatever is a separate issue. – conjugateprior Feb 20 '12 at 10:18
  • @ConjugatePrior, I largely agree. However, when results don't look like what you expect, it can be quite common to go looking for candidate predictors that will give you the picture you expected to see. Thus, in practice, this can occur. In addition, the question, as stated, suggests this might have occurred here. I don't want to sound critical, because this approach is quite intuitive & many people believe that it is the appropriate strategy. I just want to explicitly raise the issue to nip it in the bud. – gung - Reinstate Monica Feb 21 '12 at 20:15