
I am using stepwise regression to predict if a customer would give a donation.

I used many variables in modelling, and a variable called perception_rating is coming out as very important. This variable is subjective: it is a rating given by the solicitor based on their idea of how much a donor is worth (for example, an expensive car or a big house leads to a higher rating). There is no scientific basis or reasoning behind it; it is just perception and very subjective. I think I shouldn't be using it, but it seems like a good predictor. Should I use this variable or not?

Jeremy Miles

2 Answers


I agree with Noah; it is not really a technical statistical question per se. There are several questions you need to have a clear answer to.

Do you have a "consistent" subjective rating? Let's say your training data come from the opinion of an existing employee: will a new employee's assessment yield the same ratings? It is really problematic if opinions and ratings become inconsistent after the model is implemented, because you can then no longer infer how the feature performs. I think this is probably the most problematic assumption if you decide to use it.
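One way to put a number on "consistent" is to have two solicitors rate the same donors and measure their agreement. Below is a minimal sketch using Cohen's kappa in plain Python. The two rating lists and the 1 to 3 scale are made up for illustration; they are not from the question.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters scoring the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: share of donors given the same rating by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical ratings (1 = low, 3 = high) from two solicitors for the same 8 donors.
solicitor_1 = [1, 2, 3, 3, 2, 1, 3, 2]
solicitor_2 = [1, 2, 3, 2, 2, 1, 3, 3]
kappa = cohens_kappa(solicitor_1, solicitor_2)
print(f"kappa = {kappa:.2f}")
```

A kappa near 1 suggests the raters largely agree beyond chance, while a value near 0 suggests the rating depends heavily on who assigns it. Because the rating is ordinal, a weighted kappa that gives partial credit for near misses may be a better fit; the unweighted version here is only the simplest case.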

What is your modeling objective? If the goal is solely to maximize the predictive capability of the model, you have a legitimate reason to use it.

Are there any other business constraints? Sometimes even if you have a significant predictor, you can't use it because of business or legal constraints. For example, if you were to build a credit model to predict loan default in the financial sector, you can't use age and gender (in the U.S.), etc.

Is it ethical to include the variable? This question probably holds your modeling to a higher standard; it depends on the context of your business domain.

Potential solution: Is it possible to derive an estimate from another variable? For example, do you have the address of the donor? If so, using the address as an intermediate variable to get an estimate of the donor's net worth (e.g., Zillow's Zestimate) may be a good idea.

P.S. There is a well-discussed topic on stepwise regression; you should check out the post here

Anthony Lei

If you used stepwise regression, it is possible that you are making a type I error and capitalizing on chance, so be careful about interpreting its results without a cross-validation sample. In addition, if this variable is highly correlated with another variable in your sample (e.g., wealth), the fact that it emerged as important and the other variable did not could be due to chance.
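To see what capitalizing on chance looks like, here is a small simulation in plain Python. It is a simplified stand-in for stepwise selection, not the procedure from the question: it performs a single selection step, picking the predictor with the largest in-sample correlation out of many pure-noise predictors, and then re-measures that same predictor on held-out rows. The sample sizes and the number of predictors are arbitrary assumptions.

```python
import random

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def one_trial(rng, n_train=40, n_test=40, n_predictors=200):
    """Select the noise predictor with the largest in-sample |r|, then
    measure that same predictor on held-out rows."""
    n = n_train + n_test
    # Outcome and predictors are independent noise: no true relationship exists.
    y = [rng.gauss(0, 1) for _ in range(n)]
    predictors = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_predictors)]
    in_sample = [abs(pearson_r(x[:n_train], y[:n_train])) for x in predictors]
    best = max(range(n_predictors), key=in_sample.__getitem__)
    held_out = abs(pearson_r(predictors[best][n_train:], y[n_train:]))
    return in_sample[best], held_out

rng = random.Random(0)
trials = [one_trial(rng) for _ in range(20)]
mean_in_sample = sum(t[0] for t in trials) / len(trials)
mean_held_out = sum(t[1] for t in trials) / len(trials)
print(f"selected predictor, in-sample |r|: {mean_in_sample:.2f}")
print(f"same predictor, held-out |r|:     {mean_held_out:.2f}")
```

Even though no predictor is related to the outcome, the selected one looks noticeably correlated in the data used to select it, and that apparent strength shrinks on the held-out rows. This is why the importance of a selected variable should be checked on data that played no part in the selection.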

That said, whether to include this variable in a model depends on what the model is attempting to do. If it is to be used to optimally predict the outcome in a new data set, then sure, use every variable you have that is helpful for doing so. The meaning of the variable is irrelevant.
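If prediction is the goal, whether the rating is "helpful" can be checked directly by comparing held-out performance with and without it. The sketch below does this on synthetic data with a small hand-written logistic regression in plain Python. Everything except the name perception_rating is an assumption made for illustration: the data are simulated so that the outcome depends on the rating, `other_feature` is a hypothetical unrelated predictor, and the learning rate and iteration count are arbitrary.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(row, weights, bias):
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))

def fit_logistic(X, y, lr=0.5, n_iter=300):
    """Plain batch gradient descent on the mean log loss."""
    n, p = len(X), len(X[0])
    weights, bias = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, target in zip(X, y):
            err = predict(row, weights, bias) - target
            grad_b += err
            for j in range(p):
                grad_w[j] += err * row[j]
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

def mean_log_loss(X, y, weights, bias):
    total = 0.0
    for row, target in zip(X, y):
        p = min(max(predict(row, weights, bias), 1e-12), 1 - 1e-12)
        total -= target * math.log(p) + (1 - target) * math.log(1 - p)
    return total / len(X)

# Synthetic donors: the outcome is simulated to depend on the rating only.
rng = random.Random(0)
rows = []
for _ in range(900):
    other_feature = rng.gauss(0, 1)      # hypothetical unrelated predictor
    perception_rating = rng.gauss(0, 1)  # standardized rating (simulated)
    donated = 1 if rng.random() < sigmoid(2.0 * perception_rating) else 0
    rows.append((other_feature, perception_rating, donated))
train, test = rows[:600], rows[600:]

def held_out_loss(columns):
    """Fit on the training rows using the given columns, score on the test rows."""
    X_train = [[r[c] for c in columns] for r in train]
    X_test = [[r[c] for c in columns] for r in test]
    w, b = fit_logistic(X_train, [r[2] for r in train])
    return mean_log_loss(X_test, [r[2] for r in test], w, b)

loss_without = held_out_loss([0])
loss_with = held_out_loss([0, 1])
print(f"held-out log loss without rating: {loss_without:.3f}")
print(f"held-out log loss with rating:    {loss_with:.3f}")
```

On real data the comparison could go either way; the point of the sketch is only that the decision rests on held-out loss, not on the in-sample importance reported by the selection procedure.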

If you are trying to make an inference about the relationship between predictors and the outcome in the population, then this variable doesn't do much to explain anything about an individual's characteristics and their decision to donate. Instead, it should hint that you need to collect additional data on the common causes of the perception and the propensity to donate. For example, maybe someone's job influences both an onlooker's perception of their wealth and their decision to donate, independent of their actual wealth. Including such a common cause as a predictor would create a model with more explanatory power.

In general, this is a substantive rather than a statistical question and depends on the type of inference you want to make. Is your model meant to be optimally predictive in an external sample? Is it meant to explain variance in the outcome? Is it meant to represent causal relationships between the predictors and the outcome? How you model and what variables you should include in your model are determined by the answers to these questions.

Noah