Is there a way to differentiate between linear and non-linear datasets in simulation studies?

Question

I'm trying to find a measure of "linearity" for different simulated data sets, and a way to control for such differences.

This idea came from studying Hastie and Tibshirani's work on variable selection using the LASSO. A measure they use a lot is the signal-to-noise ratio (SNR), which helps show in which situations a procedure or algorithm might perform better. It is defined as follows:

$$ SNR = \frac{{\rm Var}(f(X))}{{\rm Var}(\varepsilon)} $$
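To make the definition concrete, here is a minimal sketch of how a simulation can be tuned to a target SNR: fix the signal $f(X)$, then pick the noise variance as ${\rm Var}(f(X)) / SNR$. The helper name `simulate_with_snr`, the linear $f$, and the coefficient values are assumptions made up for illustration; they are not taken from the cited sources.

```python
import numpy as np

def simulate_with_snr(f, X, snr, rng):
    """Return (y, signal, eps) with y = f(X) + eps.

    The noise variance is set to Var(f(X)) / snr, so that
    Var(f(X)) / Var(eps) is approximately the requested SNR.
    """
    signal = f(X)
    noise_var = np.var(signal) / snr
    eps = rng.normal(0.0, np.sqrt(noise_var), size=signal.shape)
    return signal + eps, signal, eps

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
beta = np.array([1.0, -2.0, 0.5])  # hypothetical coefficients

y, signal, eps = simulate_with_snr(lambda X: X @ beta, X, snr=2.0, rng=rng)

# Empirical check: should be close to the requested value of 2.0
empirical_snr = np.var(signal) / np.var(eps)
print(empirical_snr)
```

The empirical ratio only matches the target up to sampling error in the noise draw, which shrinks as the sample size grows.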

I'm wondering if there is a measure similar in spirit to the SNR that could quantify how non-linear a data set is, for use in simulation studies (meaning that constructions that only make sense in a simulation setting are fine).

The two sources for the SNR that I have read are 1 and 2, in case anyone is interested.

Edit:

Here, by non-linear I mean a model that cannot be explained by the main effects alone (a linear combination of the predictors). In that sense, non-linear terms would be added to the data-generating process for the simulation, e.g. log transformations, spline transformations, indicator variables and interactions.

What I am looking for is a way of quantifying and generating such datasets, in the same fashion as the SNR metric.
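To illustrate the kind of quantity I have in mind, here is one possible construction; it is a sketch under my own assumptions, not an established metric. Take the share of ${\rm Var}(f(X))$ that the best least-squares fit on the main effects leaves unexplained, and build $f$ so that this share equals a chosen `alpha`. The names `linear_residual`, `nonlinear_share`, `blended_signal`, and the particular non-linear term `g` are all hypothetical.

```python
import numpy as np

def linear_residual(v, X):
    """Residual of v after a least-squares fit on an intercept plus the columns of X."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ coef

def nonlinear_share(signal, X):
    """Fraction of Var(signal) not explained by the main effects (0 = purely linear)."""
    return np.var(linear_residual(signal, X)) / np.var(signal)

def blended_signal(X, beta, g, alpha):
    """Mix a standardized linear part with a standardized, linearly-orthogonal
    non-linear part so that the in-sample nonlinear share equals alpha."""
    lin = X @ beta
    lin = (lin - lin.mean()) / lin.std()
    nonlin = linear_residual(g(X), X)  # strip whatever main effects can explain
    nonlin = nonlin / nonlin.std()
    return np.sqrt(1 - alpha) * lin + np.sqrt(alpha) * nonlin

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
beta = np.array([1.0, -1.0])  # hypothetical coefficients

# Hypothetical non-linear term: an interaction plus a log transformation
g = lambda X: X[:, 0] * X[:, 1] + np.log1p(X[:, 0] ** 2)

signal = blended_signal(X, beta, g, alpha=0.3)
share = nonlinear_share(signal, X)
print(share)
```

Because the non-linear part is orthogonalized against the main effects in the sample itself, the share is controlled in-sample only. Noise could then be added on top at a chosen SNR, so that the two quantities are varied independently.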

What does it mean for a *dataset* to be "nonlinear"? As written elsewhere on the site, [There are (at least) three senses in which a regression can be considered "linear"](https://stats.stackexchange.com/a/148713/). — gung - Reinstate Monica, Aug 21 '19 at 19:13
In my mind, and I shall clarify in the question, non-linear means that a linear combination (not including interactions) is not enough to better explain the variations in the response variable. — Guilherme Marthe, Aug 21 '19 at 19:16
"better" is still going to need more precision, but note that you are referring to a nonlinear *relationship* between variables *in* a dataset, not "nonlinear data". Please do read the linked answer to help you clarify what you want to ask about. There are different model forms given that can specify different meanings of "nonlinear". — gung - Reinstate Monica, Aug 21 '19 at 19:21
Neither data sets nor variables can be linear or non-linear. Only relationships can be. And they can be nonlinear in many many ways. — Peter Flom, Aug 21 '19 at 19:36
@gung I appreciate your reference to that thread :-), but I think the intended meaning is evident: to what extent can a dataset, viewed as a collection of $n$-dimensional points, be effectively viewed as lying on an affine subspace of positive codimension? That begs the questions of what kinds of errors and of what magnitudes one is willing to tolerate in making such a characterization, as well as considerations of what the dimension of this manifold ought to be. For instance, points along a circle in $\mathbb{R}^3$ lie entirely in a two-dimensional space, but does that make them "linear"? — whuber, Aug 21 '19 at 19:37
Here I'll add an edit that hasn't been accepted yet, which I think may help to clarify. By non-linear I mean a model that cannot be explained by the main effects alone (linear combinations). In that sense, non-linear terms would be added to the data-generating process for the simulation, e.g. log transformations, spline transformations, indicator variables and interactions. What I am looking for is a way of quantifying and generating such datasets, in the same fashion as the SNR metric. — Guilherme Marthe, Aug 21 '19 at 19:49
Check out [Setodji et al. (2007)](http://doi.org/10.1097/EDE.0000000000000734) for a measure of model "complexity", which gets at the degree to which a linear model would not explain the outcome. — Noah, Aug 21 '19 at 23:09
While the post could be further clarified I think it's now sufficiently clear that the general intent can be understood. I nominated to reopen. — Glen_b, Aug 21 '19 at 23:36
