
I am trying to learn data science and have heard a lot of people talk about the importance of understanding the main types of probability distributions and being able to identify which distribution fits your data.

I understand that distributions are very important because they are fundamental concepts in statistics, but I don't understand how knowing the distribution can help you form a better predictive model. From what I understand, we can assume the coefficient estimates of the independent variables are drawn from a normal distribution regardless of the distribution of the independent variables themselves, due to the central limit theorem.

So it doesn't seem like knowing the distribution of the independent variables would influence the way you form a model. When people say it is important to know your distribution, are they talking about the dependent variable? And how can this help with prediction?

1 Answer


First of all, the central limit theorem is about the convergence of the sample mean. It says nothing that allows you to assume a normal distribution for any parameter. Yes, you can use a normal approximation in many cases, but nothing guarantees that the approximation is better than, or even as good as, the exact solution.
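A minimal sketch of this point, with simulated data and NumPy/SciPy as my own assumptions: the means of repeated samples from a skewed distribution look roughly normal, while the raw observations themselves stay skewed, so the CLT tells you about the sample mean, not about the distribution of the data or of arbitrary parameters.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 10,000 samples of size 50 drawn from a skewed exponential distribution
samples = rng.exponential(scale=1.0, size=(10_000, 50))
sample_means = samples.mean(axis=1)

# The raw observations stay strongly skewed; the sample means are close to symmetric (normal-like)
print("skewness of raw observations:", stats.skew(samples.ravel()))
print("skewness of sample means:   ", stats.skew(sample_means))
```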

Second, you are mixing two things: the likelihood function, i.e. the distribution of the observed data parameterized by some parameter of interest, and the distributions of the parameters themselves. Again, you can use a normal approximation here as well, but it can be far from optimal. For example, you could use a normal approximation of the Bernoulli distribution and fit a linear regression instead of a logistic one, but it usually won't be a good idea.
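A rough sketch of that example (the simulated data and scikit-learn usage are my assumptions, not part of the original answer): fitting both models to a Bernoulli outcome shows that the linear model can predict "probabilities" outside [0, 1], while logistic regression, which models the Bernoulli likelihood directly, keeps them in range by construction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)

# Simulate a binary outcome whose success probability depends on one predictor
x = rng.normal(size=(1_000, 1))
p = 1 / (1 + np.exp(-3 * x[:, 0]))        # true success probability
y = rng.binomial(1, p)                    # Bernoulli-distributed response

linear = LinearRegression().fit(x, y)     # treats y as if it were Gaussian
logistic = LogisticRegression().fit(x, y) # models the Bernoulli likelihood

# Predictions at extreme predictor values
x_new = np.array([[-3.0], [0.0], [3.0]])
print("linear predictions:  ", linear.predict(x_new))            # can fall outside [0, 1]
print("logistic predictions:", logistic.predict_proba(x_new)[:, 1])  # always in [0, 1]
```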

So the fact that you own a hammer does not make it a perfect tool for every job.

Tim