What is the probability that two data points in a normally distributed data set have the same value?

Question

How does one determine the probability that two data points have the same value, given that the data set is normally distributed, and how does the actual value of these points influence this probability?

If $X\sim\mathcal{N}(\mu,\sigma)$ and $Y\sim\mathcal{N}(\mu,\sigma)$, then P(X=Y) = 0 for all values since the probability of a *continuously* distributed variable taking on any *single* value is infinitely small, no? — Alexis, Jan 20 '18 at 05:28
@Alexis $X\sim N(\mu,\sigma)$ and $Y\sim N(\mu, \sigma)$ definitely does not imply $P(X=Y)=0$ without extra assumptions -- one could even have $P(X=Y)=1$. — Juho Kokkala, Jan 20 '18 at 15:51

Carl · Answer 1 · 2018-01-20T18:58:21.100

1

If the distribution is truly normal, then all the values are real numbers. Let us take as an example of a normal distribution $X\sim\mathcal{N}(\mu,\sigma)$ the standard normal, $X\sim\mathcal{N}(0,1)$. Since any normal distribution can be transformed linearly into the standard normal distribution, what we observe for the standard normal applies to all other normal distributions as well, at some other scale and location. This includes the difference of two normal variables: when $(X,Y)$ is bivariate normal, $Z=X-Y$ is normal and, provided its variance is not zero, it can be transformed $Z\rightarrow W$ with $W\sim\mathcal{N}(0,1)$. The event $X=Y$ is then the event $Z=0$. Thus the difference between two normal variables reduces to the standard normal distribution, a well-known fact used in one-sample testing.

Suppose we try to draw a value $X_i$ from $X\sim\mathcal{N}(0,1)$ that is as close to zero as possible. Zero is the mean of the standard normal and the point of maximum density, i.e., where the sample data are most densely packed. We might obtain a value of 0.001, or -0.0005, but if we try to obtain a value that is exactly zero, all we can do is approach it more closely as we continue to acquire more and more random $X$-values. Provided that we record each $X$-value as a real number with infinite precision, we will (with probability one) never reproduce a zero value exactly. So the answer is that the probability is as small as we like, provided that the recorded precision is sufficiently fine, which is a "nice" way of saying that the probability is zero.

Now, given that the probability is zero even where the density is at its maximum, it is certainly no larger for any other value, so the answer is $p=0$ everywhere.
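
The limiting argument above can be checked numerically. The following is a minimal sketch, not part of the original answer, and it assumes that $X$ and $Y$ are *independent* with the same mean, so that $Z=X-Y$ is normal with mean zero and standard deviation $\sqrt{\sigma_X^2+\sigma_Y^2}$. The function name `prob_within` and the tolerance grid are illustrative choices. It evaluates $P(|X-Y|\le\varepsilon)$ in closed form with the error function and shows the probability shrinking toward zero along with $\varepsilon$.

```python
import math


def prob_within(eps, sigma_x=1.0, sigma_y=1.0):
    """P(|X - Y| <= eps) for independent normals X and Y with equal means.

    Assumption (not from the original answer): X and Y are independent,
    so Z = X - Y is normal with mean 0 and sd sqrt(sigma_x^2 + sigma_y^2).
    """
    sigma_z = math.hypot(sigma_x, sigma_y)  # standard deviation of Z
    # For Z ~ N(0, sigma_z^2): P(|Z| <= eps) = erf(eps / (sigma_z * sqrt(2)))
    return math.erf(eps / (sigma_z * math.sqrt(2.0)))


# Tighten the recording precision step by step: 1e-1, 1e-2, ..., 1e-7.
tolerances = [10.0 ** -k for k in range(1, 8)]
probabilities = [prob_within(eps) for eps in tolerances]

for eps, p in zip(tolerances, probabilities):
    print(f"eps = {eps:.0e}   P(|X - Y| <= eps) = {p:.3e}")
```

For small $\varepsilon$ the probability is roughly proportional to $\varepsilon$, so each tenfold gain in precision cuts it about tenfold, and at $\varepsilon=0$ it is exactly zero.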

edited Jan 20 '18 at 18:58

answered Jan 20 '18 at 07:11

Carl

This analysis only addresses whether both distributions equal the same *specified* value. That's a single point in the bivariate distribution. The event "the two values are equal" is an entire line. Proving that it has probability zero for an absolutely continuous distribution is not trivial, as illustrated in the analysis at https://stats.stackexchange.com/questions/256444. In the present case, though, all you have to do is note that for *bivariate* Normal $(X,Y)$, the event $X=Y$ is the same as $Z=0$ for $Z=X-Y$ and that $Z$ has a Normal distribution. *Then* you can apply your argument. – whuber Jan 20 '18 at 14:04
@whuber The event at $Z=0$ for $Z=X-Y$ is an alternative parameterization implicit to the written text. – Carl Jan 20 '18 at 18:00
Your answer would be correct if you made that explicit. Currently, your answer is not correct because it overlooks the distinction I was trying to call your attention to. – whuber Jan 20 '18 at 18:03
@whuber I thought that was obvious, but if you say it wasn't then I yield to your experience. Made it more explicit, as per your comment. – Carl Jan 20 '18 at 18:25
(1) The normality of $X-Y$ is not a consequence of the CLT: it *requires* the assumption of bivariate normality of $(X,Y)$. The CLT has nothing at all to say about this result. (2) You are using the terms "parameterization" and "reparameterization" in a sense that differs from the usual statistical one. "Linear transformation" might come closer to what you're trying to say. (3) I'm not sure what you think was obvious. Since the question is, at bottom, about a basic fact of continuous distributions, we should be cautious not to assume that related concepts, however elementary, are obvious. – whuber Jan 20 '18 at 18:33
@whuber (1) CLT removed. You emphasize *bivariate* and that is *not* obvious to me. So, in the spirit of *not assuming that related concepts are obvious* would you clarify, please? (2) I agree parameterization was too vague, used restricted language for edit. (3) Thanks for taking your time to refine the concepts here, it is instructive for many, not just me. – Carl Jan 20 '18 at 19:08
https://stats.stackexchange.com/questions/30159 contains a nicely illustrated exposition of this point. Also see https://stats.stackexchange.com/questions/81469 and especially look at https://stats.stackexchange.com/a/120900/919, because some examples there make the possible non-normality of $X-Y$ truly obvious. (I already applied my +1 to your answer in acknowledgment of the substantial improvements you have made to it.) – whuber Jan 20 '18 at 19:13
@whuber From those links it would appear that *bivariate* means the *parametric expression of paired x-value observations from two variates*. Is that correct? BTW, thanks for the (+1), I appreciate both that and your contribution. – Carl Jan 20 '18 at 19:26
"Bivariate" means $(X,Y)$ is viewed as a two-dimensional random variable. This emphasizes the need to consider the full joint distribution of $(X,Y)$, rather than just focusing on the marginal distributions of $X$ and $Y$ separately. – whuber Jan 21 '18 at 16:06
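
To see why the joint distribution matters, here is a small simulation sketch. It uses a standard counterexample, chosen for illustration and not quoted from the linked threads: take $X\sim\mathcal{N}(0,1)$, an independent random sign $S=\pm 1$ with equal probability, and set $Y=SX$. Both marginals are standard normal, yet $(X,Y)$ is not bivariate normal, $X-Y$ is not normal, and $P(X=Y)=1/2$. The function name `simulate_sign_flip`, the sample size, and the seed are arbitrary assumptions.

```python
import random


def simulate_sign_flip(n=100_000, seed=0):
    """Draw n pairs (X, Y) with X ~ N(0, 1) and Y = S * X, S = +/-1 at random.

    Returns the fraction of pairs with X == Y and the list of Y draws.
    This is an illustrative construction, not the thread's own example.
    """
    rng = random.Random(seed)
    n_equal = 0
    ys = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        s = rng.choice((-1.0, 1.0))  # random sign, independent of x
        y = s * x                    # y equals x exactly whenever s == 1.0
        ys.append(y)
        n_equal += (x == y)
    return n_equal / n, ys


frac_equal, ys = simulate_sign_flip()

# Sample mean and variance of Y: by symmetry, Y is marginally N(0, 1).
mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)

print(f"fraction with X == Y: {frac_equal:.3f}")
print(f"mean of Y: {mean_y:.3f}   variance of Y: {var_y:.3f}")
```

About half of the simulated pairs coincide exactly, even though each variable on its own looks standard normal. Knowing only the marginals therefore cannot settle $P(X=Y)$.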