What does "the process that generates the data" mean, and how does feature selection help in recovering it?

Question

In [1], one of the motivations to use feature selection is stated to be: "to gain knowledge about the process that generated the data".

What does this "process" actually mean, and how does feature selection help in recovering it?

[1] Guyon, Isabelle, and André Elisseeff. "An introduction to feature extraction." Feature extraction. Springer Berlin Heidelberg, 2006. 1-25.

+1! Very good question! If you want to read more on the issue of Statistical Models and Data Generating Processes, I recommend Chapter 4 of Davidson's 'Econometric Theory'. It is easy to understand for anyone with a background in statistics and summarizes the issue beautifully. — Jeremias K, Jan 22 '16 at 11:14

score 4 · Accepted Answer · edited Apr 13 '17 at 12:44

A Data Generating Process is the mathematical (typically probabilistic) model that is assumed to have generated the data.

For example, if you run a regression model with regressors $X$ and dependent variable $Y$, you implicitly hypothesize a data generating process for $Y$. This data generating process can be described by the statistical model \begin{align} Y = X\beta + \varepsilon, \end{align} where $X$ is a $1 \times k$ vector of random variables, $\beta \in \mathbb{R}^k$ is the $k \times 1$ vector of coefficients, and $\varepsilon$ is an error term.
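To make the idea concrete, here is a minimal sketch that simulates data from such a model and then recovers the coefficients by least squares. The sample size, the coefficient values, and the normal distributions for $X$ and $\varepsilon$ are assumptions chosen for illustration only; they are not part of the model above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n, k = 1000, 3                       # sample size and number of regressors (assumed)
beta = np.array([2.0, -1.0, 0.5])    # hypothetical "true" coefficients

X = rng.normal(size=(n, k))          # each row is one 1 x k draw of the regressors
eps = rng.normal(scale=1.0, size=n)  # error term, assumed independent of X
y = X @ beta + eps                   # the data generating process Y = X beta + eps

# Ordinary least squares estimate of beta from the simulated sample.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("true beta:     ", beta)
print("estimated beta:", beta_hat)
```

In a simulation you know the data generating process because you wrote it down yourself. With real data you only ever see `X` and `y`, and the model is a hypothesis about how they came about.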

An example of variable selection in the case of a regression model would be where you have two sets of regressors, say $X_1$ and $X_2$, such that $X_1 \subset X_2$. Suppose that the true Data Generating Process is \begin{align} Y = X_1\beta + \varepsilon, \end{align} but that you have all regressors in $X_2$ at your disposal. Then model selection (in theory) helps you to discern the relevant regressors (i.e., $X_1$) from those that are not relevant (i.e., $X_2 \setminus X_1$). This can be done with the BIC, the AIC, or t-statistics. Note that this might affect statistical inference; see also my recent post here: Post Model Selection Inference problems - which remedies exist?
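As a rough illustration of selection by BIC, the sketch below simulates five candidate regressors of which only two enter the true model, then scores every subset. The column indices, coefficients, sample size, and the exhaustive search over subsets are all assumptions made for this toy example; with many regressors an exhaustive search is not feasible.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

n = 500
X2 = rng.normal(size=(n, 5))      # all regressors at your disposal
true_idx = [0, 1]                 # columns forming X_1 (hypothetical choice)
beta = np.array([1.5, -2.0])      # hypothetical true coefficients
y = X2[:, true_idx] @ beta + rng.normal(size=n)


def bic(X, y):
    """Gaussian BIC of an OLS fit, up to an additive constant."""
    n_obs, n_par = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coef) ** 2)
    return n_obs * np.log(rss / n_obs) + n_par * np.log(n_obs)


# Score every non-empty subset of columns and keep the one with the lowest BIC.
subsets = [
    cols
    for size in range(1, X2.shape[1] + 1)
    for cols in itertools.combinations(range(X2.shape[1]), size)
]
best = min(subsets, key=lambda cols: bic(X2[:, list(cols)], y))
print("selected columns:", best)
```

The penalty term `n_par * np.log(n_obs)` is what discourages keeping the irrelevant columns: adding one of them reduces the residual sum of squares only slightly, usually not enough to pay for the extra parameter.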

On a side note, the notion of a Data Generating Process is fragile. In specifying a statistical model, we impose the axiom of correct specification. In a regression model, this happens insofar as we consider only linear combinations of the regressors we hypothesize to have an effect on $Y$. How do we know the true relationship is not nonlinear? We don't! We simply have to assume it. This is why, recently, a new school of statisticians operates without this axiom when doing inference. They only try to select statistical models (such as the regression model) that approximate the true Data Generating Process well enough. To make this clearer, suppose the true Data Generating Process for our above regression model is \begin{align} Y = \sum_{i=1}^{\infty}X_i\frac{c}{i} + \varepsilon. \end{align} While there are infinitely many random variables $X_i$ that affect $Y$, their coefficients decay at rate $O(1/i)$. Hence, a good feature selection scheme would select the first $k$ regressors to approximate the true Data Generating Process reasonably well.
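To see why the first $k$ regressors can suffice, here is a short sketch under extra assumptions that are not part of the example above: suppose the $X_i$ are mutually uncorrelated with mean zero and unit variance. Then the part of $Y$ that a model using only the first $k$ regressors leaves out has variance \begin{align} \operatorname{Var}\left(\sum_{i=k+1}^{\infty} X_i\frac{c}{i}\right) = c^2\sum_{i=k+1}^{\infty}\frac{1}{i^2} \le c^2\int_{k}^{\infty}\frac{dx}{x^2} = \frac{c^2}{k}, \end{align} which goes to zero as $k$ grows. Under these assumptions the omitted terms behave like a small amount of additional noise, so the truncated model is a good approximation even though it is not correctly specified.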

Similarly, this applies to other statistical models.

score 2 · Answer 2 · answered Jan 22 '16 at 11:07

The process that generated the data is the "true" model. That is, if you had perfect knowledge about the world, this would be the "equation" you'd come up with to describe the interactions and processes that cause (truly, unequivocally cause) your dependent variable.

Feature selection usually involves applying some sort of "domain" knowledge (what you know, theoretically, about the nature of the problem). So, for example, if you were studying some medical problem, you could use a doctor's help in trying to pick the "features" (markers, test results, measurements, diagnostic history, genetics...) that medical science (the theory) would "blame" for the outcome. That's how it could aid in recovering the true process: by careful feature selection, you can weed out the irrelevant variables and focus on what truly matters.

However, it does have a flip side: the theory could be wrong. And you could miss out on important variables that you wouldn't consider just because the theory of the day doesn't account for them.

(There are other approaches to feature selection, obviously, from hierarchical and step-wise approaches, to factor analysis / PCA, to deep learning, where the features are learned by the model...)
