
Apologies if this is a really simple question; I'm sure if only I knew what to google I'd be able to find the answer myself, but it's been driving me mad.

I have two datasets with approximately Gaussian distributions. Both are measurements of the same background distribution, taken to test the reproducibility of some optical instrumentation I've developed. I need to prove this using the two measurements.

My understanding is that, to achieve this, I should integrate the common area underneath both measured distributions. However...

In my case, Gaussian 1 has a mean of 41.3 and a standard deviation of 1.0, and Gaussian 2 has a mean of 41.7 and a standard deviation of 1.6. Because the standard deviations differ, the two Gaussians intersect twice.

When I integrate the common area, I get 0.76, which I interpret to mean there's a 0.76 probability that the two measurements are of the same background distribution. This sounds way too low to me.
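For concreteness, here is a minimal sketch of how I compute that overlap (using `scipy` to integrate the pointwise minimum of the two fitted densities; the integration limits just need to comfortably cover both distributions):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# The two Gaussians fitted to my measurements
g1 = stats.norm(loc=41.3, scale=1.0)
g2 = stats.norm(loc=41.7, scale=1.6)

# Common area = integral of the pointwise minimum of the two densities
overlap, _ = quad(lambda x: np.minimum(g1.pdf(x), g2.pdf(x)), 30.0, 55.0)
print(overlap)  # close to the 0.76 quoted above
```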

I had a look at KL divergence, but this is asymmetric and assumes that one of the measured distributions is the 'true' distribution - this is not the case for my measurements.

I have some more similar comparisons with more than two measured distributions to worry about, but I'd like to walk before trying to run...

kjetil b halvorsen
dr_who_99
    One can never actually measure an entire distribution--that's physically impossible. How, then, do you obtain these means and SDs? This is fundamentally important, for otherwise there is no disciplined correct way to answer your question. – whuber Jan 19 '17 at 00:36
    I'm measuring size distributions of particles. A computer algorithm interrogates the scattered light from single particles and uses this to infer a size of that particle. Each of my distributions is composed of a large number (thousands) of such measurements. The two distributions are measuring the same particles; as such I am confident of the same background distribution. – dr_who_99 Jan 19 '17 at 08:57
  • Following @whuber's comments, it seems you use `distribution` in an un-statistical sense, which makes your question confusing. – Xi'an Jan 19 '17 at 10:09
  • @Xi'an I've made an edit which hopefully clarifies the question. – dr_who_99 Jan 19 '17 at 13:20
  • It's unclear what you are asking. Your question seems to be "I need to prove this," but what exactly does "this" refer to? That the optical measurements are reproducible? That the background particle size distributions are indeed the same? That they are indeed approximately Gaussian? (That, by the way, would be unusual for a particle size distribution: typically it's the cube roots of the sizes that tend to have Gaussian distributions.) And precisely what data--and how much of them--do you have to make your "proof"? – whuber Jan 19 '17 at 19:00
  • "I had a look at KL divergence, but this is asymmetric and assumes that one of the measured distributions is the 'true' distribution". It is a very common thing to use the mean of the two KL divergences: $D = 0.5 \times D_{KL}(p||q) + 0.5 \times D_{KL}(q||p)$ – Eskapp Jan 19 '17 at 20:14
  • @whuber I need to demonstrate that the measurements are reproducible. Each 'measurement' is composed of a large number of measurements of individual particles (several thousand particles). I'm happy that the background distribution is closely approximated by a gaussian for these particular particles, and to a very good approximation the measurements of this distribution are also gaussian. – dr_who_99 Jan 19 '17 at 20:40
  • Okay, that helps us understand better what you're doing. In general it is difficult, if not impossible, to assess reproducibility of measurements by means of just two of them: typically one collects a larger set of measurements in order to assess how much they typically vary. Do you seek a way of *quantifying the difference* between two observed distributions? And if so, which aspects of their difference are of greatest importance in your application? – whuber Jan 19 '17 at 22:17
  • @whuber I'm specifically seeking a way of calculating the probability that the two measurements of my background distribution are indeed both measurements of the (unknown) background distribution. Failing that, I'd like to quantify the similarity between them or find the probability that they are the same. I'm more interested in the difference (or similarity) of the means than I am in the differences between the standard deviations. – dr_who_99 Jan 20 '17 at 15:50
  • That probability does not exist unless you (a) make probability assumptions about what the background might be and (b) also adopt a probability model of how the measurements arise from a distribution. In fact, it's hard to make sense of your question: your measurements clearly *are* measurements of your background distribution. Moreover, they are so detailed that it's extremely unlikely they ever would be the same. If you're interested in the difference of means, then you are in the classical textbook setting of testing the difference of means of two Normal distributions. – whuber Jan 20 '17 at 16:11
  • That's actually incredibly helpful - I was beginning to suspect that exactly the measure I was looking for was impossible to calculate, but I didn't have the expertise to confirm this myself. Thanks for taking the time to decipher my imprecise waffle. – dr_who_99 Jan 21 '17 at 15:51
  • Look at my answer here: https://stats.stackexchange.com/questions/271582/is-the-intersection-area-between-2-pdfs-a-probability/271590#271590 – kjetil b halvorsen May 06 '17 at 17:31

2 Answers


What you are looking for is a two-sample test for equality of distribution. There are a number of known tests of this kind, including the Wald-Wolfowitz two-sample runs test, the Friedman-Rafsky two-sample runs test, the Kolmogorov-Smirnov two-sample test, the Henze nearest-neighbour test, and the Zech-Aslan minimum energy test. There are many others, but these will get you started. The Kolmogorov-Smirnov two-sample test is particularly common; it is easy to implement and has an explicit formula for estimating the p-value.
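For illustration, here is a minimal sketch of the Kolmogorov-Smirnov version using `scipy` (with simulated data standing in for your two measurements):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated stand-ins for the two measured particle-size samples
sample1 = rng.normal(41.3, 1.0, size=5000)
sample2 = rng.normal(41.7, 1.6, size=5000)

# Null hypothesis: both samples were drawn from the same distribution
stat, p_value = stats.ks_2samp(sample1, sample2)
print(stat, p_value)
```

A small p-value is evidence against the null hypothesis that both samples come from the same distribution; note that with thousands of points per sample, even tiny differences will be flagged as significant.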

Ben

Concerning your comments on KL divergence:

> I had a look at KL divergence, but this is asymmetric and assumes that one of the measured distributions is the 'true' distribution - this is not the case for my measurements.

I would suggest you use the symmetrized KL divergence: $$ KL_{\mathrm{sym}}(P, Q) = \tfrac{1}{2}\left( KL(P \| Q) + KL(Q \| P) \right) $$
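For two univariate Gaussians, both directions of the KL divergence have a closed form, so the symmetrized version is only a few lines (the helper names below are my own):

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) ) for univariate Gaussians."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2.0 * s2**2) - 0.5

def kl_sym(mu1, s1, mu2, s2):
    """Symmetrized KL divergence: the average of the two directions."""
    return 0.5 * (kl_gauss(mu1, s1, mu2, s2) + kl_gauss(mu2, s2, mu1, s1))

# The two Gaussians from the question: N(41.3, 1.0^2) and N(41.7, 1.6^2)
print(kl_sym(41.3, 1.0, 41.7, 1.6))
```

Unlike the one-sided divergence, this quantity is zero only when the two distributions coincide, and it gives the same answer whichever measurement you list first.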

lynnjohn