I have a dataset consisting of pairs of points, $(x_i,y_i)$. Visually, I see that the points are not completely smeared, but that $x$ seems to exert some control on $y$. In fact, I suspect that $y$ is a function of $x$ plus some noise, but I have no idea of the shape of this function. Is there a statistical test to determine whether $y$ is a function of $x$, without knowing the shape of the function a priori?

To be more precise, I want to determine whether the data is consistent with a model of the form:

$$Y = f(X)+\eta$$

where $\eta$ is normally distributed with zero mean and unknown variance. The function $f$ is also unknown.

becko
  • "Is a function" in what sense? A huge map, that maps 10000000 distinct inputs to 10000000 distinct outputs is also a function, so whatever your data is, it can be output of some very, very complicated function. – Tim Oct 16 '17 at 13:03
  • @Tim Not really. A function means that if $x$ is the same, then $y$ is the same too (except for noise). – becko Oct 16 '17 at 13:04
  • well, yes, what I described is a function in that meaning. E.g., f(1) = 132123, f(2.1) = 3333, f(5) = 8, ... – Tim Oct 16 '17 at 13:05
  • "Function" is way too broad, instead you can test independence using [distance correlation](https://en.wikipedia.org/wiki/Distance_correlation), see also [When is distance covariance less appropriate than linear covariance?](https://stats.stackexchange.com/questions/23785/when-is-distance-covariance-less-appropriate-than-linear-covariance) – Francis Oct 16 '17 at 13:15
  • @Francis But "independence" is not specific enough. $X$ and $Y$ might be dependent, and yet $Y$ not be a function of $X$. Imagine a circle. There is not a single value of $Y$ corresponding to each value of $X$. – becko Oct 16 '17 at 13:20
  • A rough but effective exploratory rule of thumb is to split the $x_i$ into tertiles, then compare the mean of the $y_i$ corresponding to the top third of the $x_i$ to the mean of the $y_i$ corresponding to the bottom third. (This can be carried out mentally when inspecting a scatterplot.) But there exist more sophisticated methods depending on what you know, or are willing to assume, about the phenomenon you are studying and how the data were collected. – whuber Oct 16 '17 at 14:26
  • `[I] want to determine whether the data is consistent with a model of the form:`. Sure, you can determine this insofar as you can list characteristics distinguishing $f(.)$ from $\eta$, with a precision that will be bounded by the difference (under your model) between the former and the latter. – user603 Oct 16 '17 at 17:29
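
The tertile rule of thumb from the comments above can be sketched in a few lines. This is only an illustrative sketch: the function name `tertile_mean_difference` is made up here, and the comparison only picks up a roughly monotone trend (a U-shaped $f$ would give a difference near zero even though $y$ clearly depends on $x$).

```python
from statistics import mean


def tertile_mean_difference(xs, ys):
    """Mean of y over the top third of x minus mean of y over the bottom third."""
    # Sort the pairs by x only, so y never influences which tertile a point lands in.
    pairs = sorted(zip(xs, ys), key=lambda p: p[0])
    k = len(pairs) // 3
    if k == 0:
        raise ValueError("need at least 3 points")
    bottom = [y for _, y in pairs[:k]]
    top = [y for _, y in pairs[-k:]]
    return mean(top) - mean(bottom)


# Monotone example: y = 2x gives a clearly non-zero difference.
xs = list(range(9))
print(tertile_mean_difference(xs, [2 * x for x in xs]))

# Symmetric U-shape: y = x^2 on a symmetric grid gives a difference of zero.
xs_sym = list(range(-4, 5))
print(tertile_mean_difference(xs_sym, [x * x for x in xs_sym]))
```

A large difference relative to the scatter of the $y_i$ suggests that $x$ shifts the level of $y$; a small one does not rule out a non-monotone relationship.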

1 Answer

$Y$ being a function of $X$ is a much more constrained scenario than $Y$ being a function of $X$ plus some noise.

The former case implies that for each value of $X$, there is only one possible value of $Y$ (i.e., the distribution of $Y$ conditional on any given value of $X$ is degenerate). It's easy to check whether a dataset satisfies this rule.
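
As a minimal sketch of that check (the helper name `is_function` and the tolerance parameter `tol` are illustrative assumptions, not part of any library), one can group the observations by $x$ and verify that each group contains a single value of $y$:

```python
from collections import defaultdict


def is_function(pairs, tol=0.0):
    """Return True if every distinct x maps to a single y (within tol)."""
    ys_by_x = defaultdict(list)
    for x, y in pairs:
        ys_by_x[x].append(y)
    # Each x passes if the spread of its y values does not exceed tol.
    return all(max(ys) - min(ys) <= tol for ys in ys_by_x.values())


print(is_function([(1, 2.0), (1, 2.0), (3, 5.0)]))  # True
print(is_function([(1, 2.0), (1, 2.5), (3, 5.0)]))  # False
```

Note that the check is only informative when some values of $x$ are repeated; if every $x_i$ is distinct, any dataset passes trivially.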

The meaning of the latter depends on what you mean by "noise". If you mean a random variable with mean 0 that's independent of $X$ and $Y$, then you could investigate how credible this proposition is by looking at the mean of $Y$ conditional on each value $x$ of $X$—it should be closer to $x$ the larger your sample is.

This said, none of this is likely to solve your real problem.

Kodiologist
  • I mean the latter case, see edit. Yes, the "noise" is normally distributed with mean zero but the variance is unknown. I did not understand the last statement, that "looking at the mean of $Y$ conditional on $X$ it should be closer to $X$ for larger sample"? Why? The unknown function relating $Y$ to $X$ need not be linear. – becko Oct 16 '17 at 17:11
  • @becko By hypothesis, the conditional distribution $(Y|X = x)$ is equal to $x + ε$ where $ε$ is a random variable with mean 0, so the mean of $x + ε$ is just $x$. The functional form of $f$ doesn't get to play a role here because of the conditionalization on a particular value of $X$. – Kodiologist Oct 16 '17 at 17:29
  • I disagree. If $f$ were known, the conditional distribution $P(Y|X=x,f)$ would be the normal distribution of the noise (assuming also the noise variance is known), in which case the mean of $Y$ would be $f(x)$. – becko Oct 16 '17 at 17:44
  • @becko Ack, you're right, sorry: the conditional mean should be $f(x)$. I guess I was asleep at the wheel. What you can do, then, is look at whether the conditional distribution is normal. That's the only part of your model that's restrictive enough to have much of a consequence for the data. – Kodiologist Oct 16 '17 at 17:53
  • But I don't know the variance nor the mean! – becko Oct 16 '17 at 18:39
  • @becko True, but you needn't know either to judge how normal-looking a distribution is. A normal distribution still looks quite different from many other distributions with the same mean and variance. – Kodiologist Oct 16 '17 at 19:31
  • Is there a test that I can apply? – becko Oct 16 '17 at 19:52
  • @becko Do you mean a significance test? For that, you would need to be able to formulate an appropriate null hypothesis, but I don't see how you could. – Kodiologist Oct 16 '17 at 20:19
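
One informal way to act on the suggestion above (judging how normal the conditional distribution looks without knowing its mean or variance) is sketched below. It assumes the data contain repeated values of $x$, which the question does not state; with continuous $x$ one would need to bin or smooth first. The helper names and the simulated $f(x) = \sin x$ are purely illustrative. The idea is to subtract the group mean of $y$ within each repeated $x$, pool the residuals, and look at their sample skewness and excess kurtosis, both of which should be near zero for normal noise.

```python
import math
import random
from collections import defaultdict
from statistics import mean


def pooled_residuals(pairs):
    """Within-group deviations y - mean(y | x), pooled over all repeated x values."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    residuals = []
    for ys in groups.values():
        if len(ys) < 2:
            continue  # a single observation carries no information about the noise
        centre = mean(ys)
        residuals.extend(y - centre for y in ys)
    return residuals


def skewness_and_excess_kurtosis(values):
    """Moment-based sample skewness and excess kurtosis (both 0 for a normal law)."""
    n = len(values)
    centre = mean(values)
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    m4 = sum((v - centre) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Simulated data (an assumption for illustration): f(x) = sin(x), normal noise,
# 20 distinct x values with 200 replicates each.
random.seed(0)
data = [(x, math.sin(x) + random.gauss(0.0, 0.5))
        for x in range(20) for _ in range(200)]
skew, kurt = skewness_and_excess_kurtosis(pooled_residuals(data))
print(f"skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

This is a descriptive diagnostic rather than a significance test; values far from zero would cast doubt on the normal-noise part of the model, while values near zero say nothing about the shape of $f$.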