
I'm new to model building and I'm trying to figure out how to build a model when my data don't meet all the assumptions of linear regression. For example, suppose I want to model house insurance premiums. It's a multivariate model with Y = the premium and the X's being many things, such as customer characteristics and house characteristics, a mix of continuous and categorical variables.

If the data don't meet the assumptions of linear regression (normality, constant variance, etc.), is the first step to try a transformation to make them normal? If so, how do I know which transformation to use?

Or is it better to add interactions to my equation (i.e. Y = X1 + X2 + X1*X2, etc.)? How do I know what type of interactions to create?

Or is it better to use non-linear techniques?

semidevil
  • It's striking that you mention normality first, but it's the least important assumption (read: ideal condition) for regression. Also, what is "it"? Normality of outcome or response variable is not an assumption. There are many, many threads on regression problems here: your question isn't very distinctive without more information on your data. For example, what are house insurance premiums precisely? If they are constrained to be positive, then a generalised linear model with logarithmic link is likely to make much more sense than plain regression. – Nick Cox May 11 '20 at 09:01
  • Your question needs to focus on a single problem. – Nick Cox May 11 '20 at 09:02

2 Answers


What to do is going to depend on the particular problem and can only really be learned by experience.

However, when your dependent variable is a price (as in your case), it often makes sense to take the log of price. This is for substantive reasons as well as statistical ones: if your premium goes up, you tend to think of the change in percentage terms rather than dollar terms. I don't know what the prices typically are in your case but, e.g., an increase from 100 to 110 is much bigger in relative terms (10%) than one from 1000 to 1010 (1%), even though both are increases of 10.
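
To make the percentage point concrete, here is a small sketch using only the figures above. The helper names `pct_change` and `log_diff` are made up for illustration. It shows that a difference of logs tracks the relative change, not the dollar change.

```python
import math

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

def log_diff(old, new):
    """Difference on the log scale; close to the relative change when it is small."""
    return math.log(new) - math.log(old)

# Both pairs differ by 10 in absolute terms.
for old, new in [(100, 110), (1000, 1010)]:
    print(old, new, round(pct_change(old, new), 2), round(log_diff(old, new), 4))
```

On the log scale the first increase is roughly ten times the second, which matches how one would usually think about a premium change.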

You can then examine the assumptions again.
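
As a rough sketch of what re-examining the constant-variance assumption might look like, the following uses simulated data, not real premiums. The data-generating process, the coefficients, and the helper `spread_ratio` are all assumptions made for illustration. It fits a straight line by least squares and compares the residual spread in the upper and lower halves of the predictor, before and after taking logs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical predictor and a premium with multiplicative noise (an assumption).
x = rng.uniform(0.0, 2.0, size=n)
premium = np.exp(5.0 + 1.0 * x + rng.normal(0.0, 0.3, size=n))

def spread_ratio(y, x):
    """Fit y ~ a + b*x by least squares, then return the ratio of the
    residual standard deviation for high x to that for low x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    upper = x > np.median(x)
    return resid[upper].std() / resid[~upper].std()

raw_ratio = spread_ratio(premium, x)          # well above 1: spread grows with x
log_ratio = spread_ratio(np.log(premium), x)  # close to 1: spread roughly constant
print(raw_ratio, log_ratio)
```

A ratio near 1 is only a crude check; in practice a plot of residuals against fitted values is usually more informative.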

If they still aren't met, you can try methods that do not make these assumptions, such as quantile regression, robust regression, or tree-based models.

Whether to add interactions is really a separate question.

Peter Flom

You can use tree-based regression methods such as a decision tree regressor or a random forest regressor. Tree-based methods do not require the linearity assumption to be met.
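
A minimal sketch of this idea, assuming scikit-learn is the library in question (the answer does not say) and using simulated data with a deliberately nonlinear relationship. The variable names and the data-generating function are illustrative assumptions, not a recipe for real premium data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000

# Two hypothetical predictors; only the first one drives the response,
# and it does so nonlinearly (an assumption for illustration).
X = rng.uniform(0.0, 1.0, size=(n, 2))
y = 1000.0 + 400.0 * np.sin(6.0 * X[:, 0]) + rng.normal(0.0, 20.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# .score returns R^2 on the held-out data.
linear_r2 = linear.score(X_test, y_test)
forest_r2 = forest.score(X_test, y_test)
print(linear_r2, forest_r2)
```

The forest picks up the curved relationship without any transformation, while the straight-line fit cannot. The trade-off is that the forest gives no simple coefficients to interpret.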