
This is a bit of a conceptual question that has been nagging me for a long time.

Given a set of data on $k$ variables, $(X_1, X_2, X_3, \ldots, X_k)$, observed for $i = 1, \ldots, n$,
is there an explicit relationship between

  1. Fitting a multivariate distribution on all of the data, and
  2. Estimating a regression model on the same data?

Both concepts seem very similar, for two reasons:

  1. Estimating simple linear regression models and fitting distributions can both be accomplished using the same method, Maximum Likelihood Estimation (MLE), and
  2. After fitting a distribution (let's say the Normal) and obtaining the parameters of its pdf, one can calculate the conditional distribution, $P(X_1 | X_2, X_3, \ldots , X_k)$, which would allow one to predict values for $X_1$ based on new values ($i = n+1, n+2, \ldots$) of $X_2, \ldots, X_k$, much as one could obtain predictions for $X_1$ by running the following regression, with a Normally distributed error term $\epsilon$: $$X_1 = \beta_0 + \beta_1X_2 + \beta_2X_3 + \ldots + \beta_{k-1}X_k + \epsilon\, ;$$ both methods allow one to make predictions with new data after first performing some sort of fitting.

Any insights into this connection (if it is indeed real), such as the pros/cons of fitting a distribution vs. estimating a simple regression model when it comes to forecasting, would be greatly appreciated.

Coolio2654
  • Do you really mean to be summing random variables (or observations) in your second sentence or is the intended statement that $X=(X_1,X_2, \ldots, X_k)$? You might find it useful to review descriptions of regression models (I provide a broad, general one at https://stats.stackexchange.com/questions/148638). Look for any assumptions about the distribution of $(X_2,X_3,\ldots, X_k).$ – whuber Jul 05 '19 at 13:49
  • The edit doesn’t clear anything up for me. Do you mean that $X$ is a 1-vector formed as the sum of the $X_k$? The notation is confusing. Why not just write $X=X_1 + X_2 + \ldots + X_k$ if you mean the sum? I.e. no parentheses. – Peter K. Jul 06 '19 at 17:10
  • Thank you. I updated my notation to be more accurate. Your link is informative in general, and I'll go through it. However, going back to my question, is there any theoretical merit to what I am saying? What is the difference between the conditional distribution of one variable given the others, obtained from the normal pdf estimated through MLE, and the prediction of $y = x + \ldots$, or, in this case, $X_1 = \beta_0 + \beta_1 X_2 + \ldots$ (using the same variables), from an OLS regression? – Coolio2654 Jul 06 '19 at 17:11
  • I have updated my notation to be more accurate. – Coolio2654 Jul 09 '19 at 02:56
  • What do you mean by *Estimating linear regression models, via OLS, ... can ... be accomplished using ... Maximum Likelihood Estimation*? OLS is a mechanical procedure for getting estimators from data. It has a concrete, strict definition, regardless of how you interpret it from the perspective of MLE. It is true, however, that MLE may coincide with OLS estimators under certain distributional assumptions. – Richard Hardy Jul 09 '19 at 08:50
  • You're absolutely right: OLS stands for a procedure for fitting linear models, not for the type of models themselves, as I was stating. I've fixed my question once again. – Coolio2654 Jul 09 '19 at 18:41

2 Answers

  1. Estimating linear regression models, via OLS, and fitting distributions can both be accomplished using the same method, Maximum Likelihood Estimation (MLE), and

Yes, you are correct on this. When using maximum likelihood, we are always fitting some kind of distribution to the data. The difference, however, lies in the particular kinds of distributions that we are fitting.

In a regression model, we predict the conditional mean (or sometimes other quantities, such as the median, quantiles, or mode) of one variable ($X_1$ in your notation) given the other variables ($X_2,X_3,\dots,X_k$), where the relationship has a functional form $f$:

$$ E(X_1|X_2,X_3,\dots,X_k) = f(X_2,X_3,\dots,X_k) $$

So, for example, with linear regression the assumed conditional distribution is normal, and we have

$$ X_1 \sim \mathsf{Normal}(\,f(X_2,X_3,\dots,X_k),\; \sigma^2\,) $$

where, for linear regression, $f$ is a linear function

$$ f(X_2,X_3,\dots,X_k) = \beta_0 + \beta_1X_2 + \beta_2X_3 + \ldots + \beta_{k-1}X_k $$

but it doesn't have to be linear in other kinds of regression models.
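
To illustrate the first point numerically, here is a minimal sketch (assuming Python with NumPy and SciPy; it is only an illustration, not part of the model above): maximizing the normal log-likelihood of the linear model recovers essentially the same coefficients as ordinary least squares.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + two predictors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.7, size=n)

# OLS: the usual least-squares fit
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# MLE: minimize the negative normal log-likelihood of y around the linear predictor
def neg_log_lik(params):
    beta, log_sigma = params[:-1], params[-1]
    return -norm.logpdf(y, loc=X @ beta, scale=np.exp(log_sigma)).sum()

fit = minimize(neg_log_lik, x0=np.zeros(X.shape[1] + 1))

print(beta_ols)     # OLS coefficients
print(fit.x[:-1])   # MLE coefficients -- agree up to optimizer tolerance
```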

On the other hand, when people talk about "just" fitting a distribution, they usually mean searching for the unknown parameters of a joint distribution of some variables. For example, if we again used the (multivariate) normal distribution, this would be something like

$$ (X_1,X_2,X_3,\dots,X_k) \sim \mathsf{MVN}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) $$

Notice the difference: here we do not assume any specific functional form for the relationship between $X_1$ and $X_2,X_3,\dots,X_k$. In regression, we choose the functional relationship that we assume for the variables, while when fitting the distribution, the relationship is governed by the choice of the distribution (e.g. in the multivariate normal distribution, it is governed by the covariance matrix).
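
To make that last point explicit (a standard result, stated here for concreteness): partition $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ so that the index $1$ refers to $X_1$ and $r$ to the remaining variables $(X_2,\dots,X_k)$. Then

$$ X_1 \mid X_2,\dots,X_k \;\sim\; \mathsf{Normal}\!\left( \mu_1 + \Sigma_{1r}\Sigma_{rr}^{-1}\big( (X_2,\dots,X_k)^\top - \boldsymbol{\mu}_r \big),\;\; \Sigma_{11} - \Sigma_{1r}\Sigma_{rr}^{-1}\Sigma_{r1} \right) $$

so, in the multivariate normal case, the conditional mean of $X_1$ turns out to be a linear function of the other variables, with the slopes $\Sigma_{1r}\Sigma_{rr}^{-1}$ read off the covariance matrix rather than chosen up front.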

  2. After fitting a distribution (let's say the Normal) and gaining the parameters for its pdf, one can calculate the conditional distribution, $P(X_1 | X_2, X_3, \ldots , X_k)$, which would allow one to predict values for $X_1$ based on new values of $X_2, \ldots X_k$,

What do you mean by "new values" here? A regression model could be something like

$$ \mathsf{salary}_i = \beta_0 + \beta_1 \mathsf{age}_i + \beta_2 \mathsf{gender}_i + \varepsilon_i $$

So if your data consisted of $i=1,2,\dots,n$ individuals, then you could make predictions about salary for the $(n+1)$-th individual, who was not observed in your data. However, if you picked another feature for the model, say $\mathsf{height}_i$, then the estimated regression model would tell you nothing about the relationship between height and salary. I wouldn't call the features "new values", because that would be very misleading.

very similarly to the way one could gain predictions for $X_1$ by running the following regression $$X_1 = \beta_0 + \beta_1X_2 + \beta_2X_3 + \ldots + \beta_{k-1}X_k + \epsilon\, ;$$ both methods allow one to make predictions with new data, after first performing some sort of fitting.

You are correct that if we know the joint distribution $p(X_1,X_2,X_3,\dots,X_k)$ and the marginal distribution $p(X_2,X_3,\dots,X_k)$, then we can obtain the conditional distribution,

$$ p(X_1|X_2,X_3,\dots,X_k) = \frac{p(X_1,X_2,X_3,\dots,X_k)}{p(X_2,X_3,\dots,X_k)} $$

or conditional expectations, etc. The difference, however, is that with regression this is available right away, while in the case of the "raw" joint distribution you would need to calculate those quantities from the distribution (e.g. take integrals, or run a Monte Carlo simulation).

Notice also that with regression you cannot "go back" to the joint distribution, or estimate other kinds of conditional distributions (or expectations). So regression is a simplified case. "Simplified" is not bad here; for example, it means that you need much less data to get reliable estimates than with a more complicated model.
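
To see what the joint fit buys you, here is a small sketch (assuming Python with NumPy; `mvn_conditional` is a helper written just for this illustration, not a library function): once the joint normal is fitted, the same $(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}})$ yields the conditional mean of *any* variable given the others, whereas the regression of $X_1$ on the rest gives only the first of these.

```python
import numpy as np

def mvn_conditional(mu, S, j):
    """Intercept and slopes of E[X_j | all other variables] implied by a fitted MVN."""
    rest = [i for i in range(len(mu)) if i != j]
    slopes = np.linalg.solve(S[np.ix_(rest, rest)], S[rest, j])  # Sigma_rr^{-1} Sigma_rj
    intercept = mu[j] - slopes @ mu[rest]
    return intercept, slopes

# Simulated data, then the estimates of the mean vector and covariance matrix
rng = np.random.default_rng(3)
data = rng.multivariate_normal([0.0, 1.0, -1.0],
                               [[2.0, 0.8, 0.3],
                                [0.8, 1.0, 0.4],
                                [0.3, 0.4, 1.5]], size=1000)
mu_hat, S_hat = data.mean(axis=0), np.cov(data, rowvar=False)

print(mvn_conditional(mu_hat, S_hat, 0))  # E[X1 | X2, X3] -- matches OLS of X1 on X2, X3
print(mvn_conditional(mu_hat, S_hat, 1))  # E[X2 | X1, X3] -- not available from that regression
```

In the Gaussian case the first call reproduces the OLS intercept and slopes exactly (up to floating point), which is one concrete sense in which the regression is the "simplified" special case.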

Tim
  • Maybe one thing to add to that answer: as Tim said, with the model for prediction you get an expression close to $E[Y|X=x]$ ($Y$ is the target variable, $X$ are the features). This is a highly complicated mathematical object: the factorization of the conditional expectation. However, we basically know that $$E[Y|X=x] = \int_{y} y\cdot p(y|x)\, dy,$$ i.e. given that we know the joint distribution $p(y,x)$ we can compute $p(y|x)$ and then we can compute $E[Y|X=x]$. However, it is not possible (without further assumptions) to reverse this process, i.e. knowing $p(y,x)$ is 'better' than just the prediction. – Fabian Werner Jul 09 '19 at 07:55
  • @FabianWerner it gives you more information, but it does not make things easier, and with limited data it does not have to be more precise. So "better" only in some sense. – Tim Jul 09 '19 at 08:23
  • Yes, I meant 'better' in the sense that you can 'create' the quantity $E[Y|X=x]$ from $p(y|x)$ (not even really from $p(y,x)$ because $p(x)$ is missing, that's clear) and had no intention of saying what approach actually 'works better' in reality :-) – Fabian Werner Jul 09 '19 at 08:45
  • Re "here we do not assume any specific functional form for the relationship between $X_1$ and $X_2,X_3,\dots,X_k$": that doesn't seem quite true. Indeed, your example of an MVN joint distribution is tantamount to assuming a *linear* relationship between $X_1$ and the other $X_i$ (together with assuming an MVN distribution for $X_2,\ldots,X_k,$ something that is not usually part of a regression setup). One could view fitting parametric multivariate distribution families in a similar way, perhaps getting a little bit closer to the insight sought by the OP. – whuber Jul 09 '19 at 19:13
  • Thanks, Fabian, that additional context actually is useful, knowing the steps by which the joint distribution, $p(y, x)$, can be made into the expectation, which perfectly (though inefficiently, as has been said) mirrors the regression's output. Additionally, I was also about to make a comment on the same line that whuber responded to, that "we do not assume any specific form of relationship...", where, I immediately thought, a Gaussian distribution relationship is at least assumed? I think the answer I am looking for does indeed reside somewhere in why a MVN joint distribution assumes a – Coolio2654 Jul 09 '19 at 20:11
  • linear relationship. Would a Poisson multivariate joint distribution, or a Weibull one, make such an assumption? Would other distributions also assume the conditional distribution (derived from the joint distribution) represents a linear relationship? Are there ways to derive a conditional, $P(X_1| X_2, \ldots, X_k)$, that assume a *non-linear* relationship between the variables? – Coolio2654 Jul 09 '19 at 20:14
  • Thank you for the answer so far, though; it elucidated some aspects of my question very clearly already. – Coolio2654 Jul 09 '19 at 20:19
  • One thing seemingly absent is that (to my knowledge) you don't need to assume anything about the joint density of $X_2,...,X_k$ in regression. They can be of any type and that will not affect the likelihood for $X_1$. In fitting a joint distribution, you do have to make such an assumption on the joint distribution, and it's likely to be wrong (while the sole univariate distributional assumption of the disturbance in regression seems more reasonable). – Noah Jul 09 '19 at 20:34
  • You can construct families of multivariate distributions with linear regression functions using copulas. This approach is described in my post at https://stats.stackexchange.com/questions/257779. It will reveal that "a" Poisson joint or "a" Weibull joint distribution is ill-defined: you can create multivariate distributions with Poisson (or Weibull or whatever) marginals that have linear regression functions and others that have nonlinear regression functions. – whuber Jul 10 '19 at 00:16
  • Noah, but the answer, as well as Fabian's first comment, seems to indicate that the joint Normal distribution easily yields the conditional expectation, which is perfectly equivalent to a normally distributed linear regression model's conditional expectation. If they both lead to the same thing, how does a regression not make the same assumptions on the joint density $X_2, \ldots, X_k$ as does fitting a joint distribution, as you say? Could you elaborate more on how it happens that, though fitting the joint is more "likely to be wrong", it ends up producing the same answer as regression? – Coolio2654 Jul 10 '19 at 04:53
  • @Coolio2654 one difference is that for MVN you need *all* the variables to be normally distributed. With linear regression, you do not even have to assume that the features are random variables, see https://stats.stackexchange.com/questions/246047/independent-variable-random-variable/246388#246388 – Tim Jul 10 '19 at 07:14
  • I will have to think on that comment for a bit, Tim (but that just means it might yield a deep insight!), so thanks for that. My final question, that I think will answer my remaining thoughts on the matter, is this. From the *prediction* perspective, when would *either* fitting the joint distribution, or estimating a linear regression, be preferred? When it comes down to it, in at least the linear, Gaussian case, is there ever a legitimate reason to fit a joint distribution, and then derive its conditional expectations? – Coolio2654 Jul 11 '19 at 22:25
  • @Coolio2654 when making predictions you need only the conditional distribution, so I can't think of any reason to model the joint distribution instead. That said, "fitting a joint distribution" is very broad, and lots of things in machine learning and statistics could be considered as such, so it would also depend on the case at hand. – Tim Jul 12 '19 at 04:47

If I understand you correctly, I think one distinction that has been made in the literature is that between discriminative models, which learn $p(y|x)$, and generative models, which learn $p(x,y)$.

For the most thorough theoretical and experimental treatment of this distinction, see this study.
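
To make the distinction concrete, here is a minimal sketch (my own illustration, assuming scikit-learn is installed; it is not taken from the study): logistic regression is a discriminative model of $p(y|x)$, while Gaussian naive Bayes is a generative model that fits $p(x|y)$ and $p(y)$, i.e. the joint distribution, and only then derives $p(y|x)$ via Bayes' rule.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),    # class 0
               rng.normal(2, 1, size=(100, 2))])   # class 1
y = np.repeat([0, 1], 100)

disc = LogisticRegression().fit(X, y)  # models p(y|x) directly
gen = GaussianNB().fit(X, y)           # models p(x|y) and p(y), i.e. the joint

x_new = np.array([[1.0, 1.0]])
print(disc.predict_proba(x_new))
print(gen.predict_proba(x_new))
```

Both give class probabilities for the new point, but only the generative model could also be used to, say, simulate new $(x, y)$ pairs.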

user3235916
  • It would be helpful if you explained the terms and discussed the consequences of using such models. – Tim Jul 10 '19 at 04:39