I am trying to see if there are any statistical models that (better) "exploit" distributional knowledge of the predictor variables.

For example, I feel that there is a common misconception (e.g. "Where does the misconception that Y must be normally distributed come from?") regarding the requirement of the normal distribution in regression modelling. As I understand it, the response variable itself does not need to be normally distributed. However:

  • The response variable needs to be "conditionally normal" given the covariates (i.e. Y | X = x is normally distributed around the mean E(Y | X = x))
  • The residuals need to be normally distributed (i.e. (Y-truth - Y-predicted) ~ Normal Distribution)
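
To make these two bullets concrete, this is how I understand the usual normal linear model. The symbols beta and sigma^2 (regression coefficients and error variance) are my own notation, not taken from a specific reference:

```latex
Y_i = x_i^{\top}\beta + \varepsilon_i, \qquad
\varepsilon_i \mid X_i = x_i \;\sim\; \mathcal{N}(0, \sigma^2)
\quad\Longleftrightarrow\quad
Y_i \mid X_i = x_i \;\sim\; \mathcal{N}\!\left(x_i^{\top}\beta,\ \sigma^2\right)
```

Written this way, the two bullets appear to be the same assumption stated twice, and nothing in it specifies a distribution for the covariates X_i.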

These requirements are quite convenient, since they place no distributional assumption on the covariates themselves. GLMs go further and relax the normality assumption on the conditional distribution of the response.

My question relates to the following: suppose we happen to have information about the distributions of the covariates (e.g. suppose we have enough data, and when we plot the histograms and fit candidate distributions by maximum likelihood, the data correspond well to common probability distributions). Are there any statistical models that are able to "exploit" this information for the purpose of "enriching the quality" of the statistical models?

It seems to me that regression models do not require the covariates to have certain distributions (e.g. if we want to predict "age" based on "height and weight", the columns in our data corresponding to the "height" and "weight" variables do not need to be normally distributed) - however, at the same time, if we happened to know the distributions of "height" and "weight", it seems like regression models would not be able to make use of this information. If they could, is it possible that a potentially "better" statistical model could be made by "exploiting" this information?

I spent some time thinking about ways that statistical models can "exploit" distributional information, and came up with the following examples:

  • Suppose the response variable and the covariates in the data all have normal distributions, and we believe that we can model the joint probability distribution of the response variable and all covariates through a multivariate normal distribution, i.e. (Y, X1, X2, ..., Xn) ~ MVN(mu, Sigma). Doing so, we have now effectively "exploited" the distributional information about the covariates (the regression model would have effectively ignored this information). In practice, we can now predict different values of Y conditional on X, e.g. Y | X = x ~ Normal(mu*, sigma*^2), effectively performing the same task as a regression model, but with the added benefit of potentially enriching our model by better exploiting available information on the covariates.

  • A similar approach can also be used even if each variable is believed to have come from a different probability distribution. Copula models can create a joint probability distribution using the cumulative distribution functions of each variable, thus allowing each variable to have a fundamentally different probability distribution (e.g. height ~ Normal, weight ~ Gamma, etc.). Random samples from the conditional distribution of the copula model with respect to the response variable and some observed values of the covariates can be generated as well, also performing the same task as a regression model, P(Y | X = x), but with the added benefit of potentially enriching our model by better exploiting available information on the covariates.
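
The two examples above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: I use a Gaussian copula (one of several possible copula families), and the marginals, the correlation matrix, and the function names `conditional_mvn` and `sample_y_given_x` are all hypothetical, chosen for illustration rather than estimated from any data. The first function is the multivariate normal conditioning step; the second applies it to the latent normal scores of the copula.

```python
import numpy as np
from scipy import stats


def conditional_mvn(mu, sigma, x):
    """Mean and variance of Y | X = x when (Y, X) ~ MVN(mu, sigma).

    Index 0 is Y; the remaining indices are the covariates X.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    x = np.asarray(x, dtype=float)
    s_yx = sigma[0, 1:]                 # Cov(Y, X)
    s_xx = sigma[1:, 1:]                # Cov(X, X)
    w = np.linalg.solve(s_xx, s_yx)     # Sigma_xx^{-1} Sigma_xy
    cond_mean = mu[0] + w @ (x - mu[1:])
    cond_var = sigma[0, 0] - w @ s_yx
    return cond_mean, cond_var


def sample_y_given_x(marginals, corr, x, size, rng):
    """Draw Y | X = x under a Gaussian copula.

    marginals: frozen scipy.stats distributions, Y first.
    corr: correlation matrix of the latent normal scores, same order.
    """
    # Map the observed covariates to latent standard-normal scores.
    z_x = np.array(
        [stats.norm.ppf(dist.cdf(val)) for dist, val in zip(marginals[1:], x)]
    )
    # Condition in the latent space (zero means, unit variances).
    z_mean, z_var = conditional_mvn(np.zeros(len(marginals)), corr, z_x)
    z_y = rng.normal(z_mean, np.sqrt(z_var), size=size)
    # Map the latent draws back through the quantile function of Y.
    return marginals[0].ppf(stats.norm.cdf(z_y))


rng = np.random.default_rng(0)

# Hypothetical marginals and correlations, for illustration only.
marginals = [
    stats.gamma(a=9.0, scale=4.0),       # Y: age
    stats.norm(loc=170.0, scale=10.0),   # X1: height (cm)
    stats.gamma(a=25.0, scale=3.0),      # X2: weight (kg)
]
corr = np.array([
    [1.0, 0.3, 0.5],
    [0.3, 1.0, 0.6],
    [0.5, 0.6, 1.0],
])

draws = sample_y_given_x(marginals, corr, x=[180.0, 85.0], size=10_000, rng=rng)
print(draws.mean(), np.quantile(draws, [0.05, 0.95]))
```

If every marginal is normal, the Gaussian copula reduces to the multivariate normal model of the first example, and `conditional_mvn` can be called directly on mu and Sigma.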

Can someone please comment on the above? Is what I have described correct and relevant? In general, are there any statistical models that (better) "exploit" distributional knowledge of the predictor variables?

Thanks!

stats_noob