In regression analysis, does the data generate the model, or does the model generate the data?

Question

I am learning regression analysis and, right at the start, I have encountered two statements:

S1: model generates data

S2: data generates model

Given that only one of them is correct, I picked S2, thinking that we first have raw data, then do some analysis and conclude that the data follow a particular model. But this was wrong.

Also, once the analysis is complete, does it really matter whether S1 or S2 is correct?

Kindly clarify my doubts.

The following are the pages from the book:

Scortchi - Reinstate Monica · Answer 1 · 2020-08-02T13:10:22.097

It would seem more natural, or more consonant with most people's metaphysical leanings, to say that a model represents, or describes, or indeed models, the data-generating process, rather than that the model itself generates data.† But it makes no odds with regard to the authors' point, which is at heart the hoary one that the observed data are considered a realization of a random variable; the parameters constant, though unknown; & that inference about the latter from the former is a kind of "backwards" use of probability theory.

I'm not at all sure they're right, though, in thus explaining "regression". The term's said to have come from Galton's "regression toward the mean": see Why are regression problems called "regression" problems?.

It doesn't strike me as wrong to talk of data generating a model, merely uncommon—& of course "generate" here isn't being used in the sense in which the model may be said to generate the data.

† Except perhaps in the context of simulation.
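In the simulation context the footnote mentions, both directions can be made concrete. The following is only a minimal sketch: the parameter values, sample size and variable names are made up for illustration. A model with fixed parameters generates the data, and least squares then works "backwards" from the data to estimates of those parameters.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

# Hypothetical "true" parameters: constant, and unknown to the analyst in real work.
beta0_true, beta1_true, sigma_true = 1.0, 2.0, 0.5

# Forward direction: the model generates the data.
n = 500
x = rng.uniform(0.0, 10.0, size=n)
y = beta0_true + beta1_true * x + rng.normal(0.0, sigma_true, size=n)

# Backward direction: infer the parameters from the observed data.
X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("true:     ", beta0_true, beta1_true)
print("estimated:", beta_hat)
```

The estimates land close to the true values but do not equal them; a different seed gives a different realization and slightly different estimates, while the parameters stay fixed.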

edited Aug 02 '20 at 13:10

answered Aug 01 '20 at 11:56

Scortchi - Reinstate Monica


Please elaborate a little more and please comment on my second question. – Singh Aug 01 '20 at 14:31
Francis Galton is the appropriate name, not (Karl) Pearson, as in the linked thread (which includes some depressingly poor answers). – Nick Cox Aug 02 '20 at 11:12
@NickCox: Thank you! For some reason that's a mistake I keep making. – Scortchi - Reinstate Monica Aug 02 '20 at 13:15
Most attributions to people are wrong or if right objectionable on other grounds. – Nick Cox Aug 02 '20 at 13:20

score 0 · Answer 2 · answered Aug 01 '20 at 06:19

I think the idea the textbook is getting at here is that there are two steps: model generation and model usage. You are correct that we must first use data to generate a model. There is no way to know how the independent variables are associated with the dependent variable unless we first have data from which to estimate these relationships. This is called model fitting, and model fitting is done on training data.

However, once these relationships have been estimated by model fitting, we can take the betas we obtained (the estimated effects of the independent variables), input our X1, X2, ..., Xn (the independent variables), and arrive at an estimate of y (our dependent variable). I think it is in this sense that the text says we "use the model to generate data", though this is a poor way to word what linear regression models actually accomplish. Much better is the later passage, where it says we can "use the proposed model [to estimate] the solution of the posed problem."
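The two steps can be sketched in a few lines of NumPy. This is an illustration only: the training values below are invented, and `add_intercept` is a helper name introduced here, not something from the textbook.

```python
import numpy as np

# Made-up training data (illustrative only): two predictors and a response.
X_train = np.array([[1.0, 2.0],
                    [2.0, 1.0],
                    [3.0, 4.0],
                    [4.0, 3.0],
                    [5.0, 5.0]])
y_train = np.array([8.1, 7.9, 17.2, 16.8, 23.1])

def add_intercept(X):
    """Prepend a column of ones so the first coefficient is the intercept."""
    return np.column_stack([np.ones(len(X)), X])

# Step 1 (data -> model): estimate the betas by least squares.
betas, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

# Step 2 (model -> estimates): plug new predictor values into the fitted equation.
X_new = np.array([[2.5, 2.5],
                  [6.0, 4.0]])
y_pred = add_intercept(X_new) @ betas

print("betas:", betas)
print("predictions:", y_pred)
```

Note that step 2 produces estimates of y for given inputs; it does not produce new observations, which is why "generate data" is a loose description of it.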

markowitz · Answer 3 · 2020-08-05T06:47:38.183

From a theoretical point of view, S1 seems to me the correct answer; in fact, your material says so. It seems to me that your material gives a simple but exhaustive explanation. In fact, it is written:

Obviously S1 is correct. It can be broadly thought that the model exist in nature but is unknown to the experimenter.

This is a paradigm for causal inference; in this sense you can think of the model as a structural equation. Sometimes this model is also called the true model. The true model can be interpreted in different ways, and those differences can cause many problems; in my view the structural equation interpretation is the correct one. More importantly, this interpretation is the closest to the statement given in your material and quoted above. In any case, reading this may be useful: What is a 'true' model?

If the structural model is completely known, there is nothing to “infer from data”. This can be true in simulations, but in the real world you can never be completely sure that you know the true model exhaustively. Here begins the attempt to “model reality”, that is, to model the real links among real variables (data). This is the “move in the backward direction” intended in your material. Regression can be useful for this purpose. Here it can seem that we move from data to model, but this is only an impression because, if we work properly, we collect data with a theory in mind, and that theory, at least ideally, refers to the true model (so S1 holds).

I don’t know what your material is precisely focused on; however, you have to know that:

regression is a free concept; per se it says nothing about the nature and meaning of the data. Under very general conditions you can perform a regression on any data you want, regardless of their meaning.

In fact, any regression can suffer from misspecification, where misspecification is judged in comparison to the true model. The regression, even at the population level, does not generate anything "in nature". What generates $Y$ is the data generating mechanism (= structural equation = true model), a very different concept. Many problems come from misunderstanding this and related points. Reading these may help: endogenous regressor and correlation; Regression and causality in econometrics; Does homoscedasticity imply that the regressor variables and the errors are uncorrelated?
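A small simulation can illustrate the gap between the regression and the data generating mechanism. It is only a sketch under assumptions made up for this example: a structural equation $Y = 1 + 2X + U$ in which $X$ is deliberately built to be correlated with the structural error $U$. The regression of $Y$ on $X$ is perfectly computable, but its slope is not the structural coefficient; with these numbers the population regression slope is $2 + 0.8/1.64 \approx 2.49$.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the run is reproducible
n = 100_000

# Hypothetical data generating mechanism (structural equation): Y = 1 + 2*X + U.
# X is constructed to be correlated with U, i.e. it is an endogenous regressor.
u = rng.normal(size=n)
x = 0.8 * u + rng.normal(size=n)
y = 1.0 + 2.0 * x + u

# Regression of Y on X: always computable, whatever the data mean.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print("structural slope:", 2.0)
print("regression slope:", coef[1])
```

The data were generated by the structural equation, not by the fitted line; the regression only summarizes the association between $X$ and $Y$ that the mechanism produced.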

Now some warnings. I repeat that I don’t know what your material is precisely focused on, and the fact that it, like many other sources, speaks about a “model” in too general a manner does not help us. Sometimes these “models” are focused on causal problems, sometimes they are not, and sometimes it is not clear. It is possible to speak about a “model” as in S1 without giving it any causal/theoretical/substantive meaning. I do not much appreciate this possibility but, more importantly, it is not consistent with the statement from your material quoted above. Reading this may help: Structural equation and causal model in economics

Finally, the question in your book refers to the true model. In fact, none of the models built by researchers generate anything "in nature". They can generate predictions. Now, referring to the models built/specified by researchers, and regardless of the meaning of the true model, we can focus on pure prediction. Only in this case should your own model, at least in finite samples and hence in practice, be completely data driven. Only in this case do I feel that S2 can be saved. S1 becomes useless rather than incorrect, because we are not focused on discovering the true model. Reading this may help: Endogeneity in forecasting
