
Here's some data$^1$ showing how the mean of the samples (y-axis) varies with the number of samples (x-axis). The uncertainties are the $1\sigma$ standard deviation of those samples.

Fig1. Graph of the data

Clearly the data has a trend. It converges to some value and has less uncertainty with more samples.

However, the data isn't smooth $^2$ and is made up of a jagged line of dips and bumps.

  1. How does one quantitatively decide whether these peaks and valleys are actual features of the data or just random fluctuations?$^3$

For example, here is one way to justify that most features are just stochastic noise.

i. Calculate a data trend (here a MeanFilter of kernel radius $\in\{1,3,11\}$ was used)

ii. Check whether any data points, with their $1\sigma$ error bars, lie entirely outside the $1\sigma$ band of the trendline. If they do, then up to $1\sigma$ they are actual features.

Here it is in action:

Fig2. Graph of data and a trendline with a MeanFilter of radius $3$.

Fig3. Graph of data and trendlines with more drastic MeanFilter radii
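Steps i and ii can be sketched in Python roughly as follows. This is only an illustration, not the code behind the figures: the names `mean_filter` and `flag_features` are made up here, the window is assumed to shrink at the edges of the data, and the "$1\sigma$ trendline" is assumed to be the mean-filtered data plus or minus the mean-filtered uncertainties.

```python
from statistics import fmean


def mean_filter(values, radius):
    """Moving average over a window of up to 2*radius + 1 points.

    Assumption: near the ends the window is simply truncated.
    """
    n = len(values)
    return [
        fmean(values[max(0, i - radius):min(n, i + radius + 1)])
        for i in range(n)
    ]


def flag_features(means, sigmas, radius):
    """Indices whose 1-sigma error bar lies entirely outside the trend band.

    Assumption: the trend band is filtered(means) +/- filtered(sigmas).
    Two intervals of half-widths s and ts do not overlap exactly when
    their centres are more than s + ts apart.
    """
    trend = mean_filter(means, radius)
    trend_sigma = mean_filter(sigmas, radius)
    flagged = []
    for i, (y, s, t, ts) in enumerate(zip(means, sigmas, trend, trend_sigma)):
        if abs(y - t) > s + ts:
            flagged.append(i)
    return flagged
```

On a flat series with a single large spike, e.g. `flag_features([0.78] * 5 + [10.0] + [0.78] * 5, [0.04] * 11, radius=1)`, this flags the spike and also its two neighbours, because the spike drags the local mean away from them.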

  2. Is this heuristic a valid way of categorising bumps as actual features or noise?

Footnotes

$^1$ This was a Monte Carlo simulation to calculate the value of $\pi/4$. $100$ points were selected uniformly at random from a square circumscribing a unit circle, and the ratio of the points that lay within the circle to all the points was calculated. This formed one sample. The x-axis on the graph displays the number of samples, while the y-axis shows the arithmetic average of those samples. The uncertainties are the $1\sigma$ standard deviation of those samples.
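A minimal Python sketch of the simulation described in this footnote, for anyone who wants to reproduce similar data. It is not the original code: the function names, the seed handling and the choice of the square $[-1,1]^2$ are assumptions made for illustration.

```python
import random
import statistics


def one_sample(rng, points_per_sample=100):
    """Fraction of uniformly random points in [-1, 1]^2 inside the unit circle."""
    inside = 0
    for _ in range(points_per_sample):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    return inside / points_per_sample


def summarize(n_samples, seed=0):
    """Mean and 1-sigma standard deviation of n_samples samples (n_samples >= 2)."""
    rng = random.Random(seed)  # seeded so that a run is repeatable
    samples = [one_sample(rng) for _ in range(n_samples)]
    return statistics.fmean(samples), statistics.stdev(samples)
```

Calling `summarize(n)` for increasing `n` gives one (mean, standard deviation) pair per point on the x-axis.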

$^2$ and it keeps fluctuating even when the x-axis is sampled more finely.

$^3$ In this case I expect them to be genuine statistical noise.
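A rough size for that noise can be worked out, assuming each sample is the fraction of $100$ independent points that land inside the circle (as in footnote 1), so that the count is binomial with $p=\pi/4$:

$$\sigma_{\text{sample}} = \sqrt{\frac{p(1-p)}{100}} \approx 0.041, \qquad p = \frac{\pi}{4}$$

$$\sigma_{\text{mean}}(n) = \frac{\sigma_{\text{sample}}}{\sqrt{n}}$$

Under that assumption the spread of individual samples stays near $0.041$, while the mean of $n$ samples fluctuates around $\pi/4$ on a scale that shrinks like $1/\sqrt{n}$.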

lineage
  • What might the meaning of an “actual feature” be in this setting? – Sycorax Dec 27 '21 at 05:55
  • @Sycorax say somewhere the line shot up to 10...one would say that there's something going on there regardless of what the statistics are .... that ability to say that there's definitely a peak (or valley or some discrepancy) is what I mean by the presence of an "actual feature"..I am asking if there is an objective way to do that – lineage Dec 27 '21 at 06:18
  • Do you have a specific real world problem that you’re working on, and you want to do some kind of peak detection on that data? As it stands, this reads like an XY problem https://xyproblem.info/ – Sycorax Dec 27 '21 at 14:32
  • Here’s an example of one method https://stats.stackexchange.com/q/22974/22311 – Sycorax Dec 27 '21 at 14:38
  • @Sycorax not for now....I just took this as a simple MC example to explain *MCing*...the listener then asked why there were peaks and I said that they actually weren't - just stochastic fluctuations and that got me thinking and led to this question.....in this case the problem isn't an xy but rather the concrete use case the question is for – lineage Dec 27 '21 at 14:41
  • Would it be fair to say that your question is how to explain Monte Carlo results to a person unfamiliar with statistics? – Sycorax Dec 27 '21 at 14:43
  • @Sycorax not really...the stress is on when undulations in a graph can be disregarded as statistical fluctuations and when they should be considered a symptom of something mysterious going on (an actual attribute of the underlying process that produced the graph) ...I think this goes beyond just MC to any stochastic process.....maybe the question is equivalent to asking which peak finding algorithm is clever enough to differentiate statistical noise from peaks – lineage Dec 27 '21 at 14:48
  • @Sycorax I do have another plot (not posted) where I increased the total no. of samples in steps of $1$...there I think any peak finding algorithm (I ran those available in Mathematica) will come up with a peak or two even though it's all *actually* noise. Should I post that too? – lineage Dec 27 '21 at 14:51
  • The core problem with using this specific example to motivate detecting "When is there a peak?" is that we know the true value of $\pi/4$ must be a constant (even if it is an unknown constant). There can't be any "signal" anywhere in the plot because changing the sample size of the underlying simulation cannot possibly be evidence that $\pi/4$ takes on a different value. That's why I suspected there must be an XY problem; I don't think you really believe that $\pi/4$ changes value as sample size changes. Therefore, looking for a way to detect that seems ... poorly considered. – Sycorax Dec 27 '21 at 14:54
  • @Sycorax "There can't be any "signal" anywhere in the plot$\ldots$"....what if there was something wrong in the code (maybe I *mistakenly*;-) multiplied by $\sin(\text{noOfSamples}/10^6)$)? The point is that seeing features in the plot that we don't expect to be there (yep the value here is indeed expected to be a constant) provides information regarding the code (e.g. its validity). The problem then is should I really start debugging my code having concluded there is some spurious signal when I should have instead *confidently* ascribed those variations to statistics? How do I justify my confidence? – lineage Dec 27 '21 at 15:02
  • @Sycorax By the way I appreciate you sticking around – lineage Dec 27 '21 at 15:02
  • That's a question about *unit testing.* – Sycorax Dec 27 '21 at 15:10
  • @Sycorax though not technically incorrect, u maybe overthinking it....... – lineage Dec 27 '21 at 17:12
