0

I have a set of measurements from an air pollution sensor. I want to determine the min and the max value over a period of time (say, one day).

The min and the max don't have to be the true mathematical min and max, and I want to determine them robustly, because I suspect that there are outliers in the sensor data.

I want to use the 1st percentile and the 99th percentile. Is that okay?
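For concreteness, the percentile approach described above could be sketched as follows. This is a minimal illustration using only Python's standard library; the function name `percentile_min_max` and the sample `readings` are hypothetical and not taken from the real sensor data.

```python
import statistics

def percentile_min_max(readings, lower_pct=1, upper_pct=99):
    """Return the lower_pct-th and upper_pct-th percentiles as a robust (min, max)."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(readings, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]

# Hypothetical day of readings: values 0..99 plus two gross outliers.
readings = list(range(100)) + [10_000, -10_000]
robust_low, robust_high = percentile_min_max(readings)
print(robust_low, robust_high)
```

Note that this only protects against outliers as long as fewer than about 1% of the readings in each tail are bad; beyond that, the percentile itself lands on an outlier.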

lukin155

2 Answers

0

If you have a list of all the values over a window of time (e.g. 24 hours), then:

  1. Sort the values in the list in ascending or descending order.
  2. Calculate the median of the first n values (e.g. take the median of the first five values).
  3. Check the values manually for outliers by looking at the list.
  4. If all n values are outliers, increase n to include more normal values.
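The sorting and median steps above could be sketched like this. It is only an illustration under assumptions: the helper name `median_of_extremes` is hypothetical, applying the same rule to both ends of the sorted list is an assumption, and the manual inspection step is not automated here.

```python
import statistics

def median_of_extremes(values, n=5):
    """Median of the n smallest and of the n largest values."""
    ordered = sorted(values)                 # step 1: sort ascending
    low = statistics.median(ordered[:n])     # step 2: median of the first n values
    high = statistics.median(ordered[-n:])   # same rule applied to the last n values
    return low, high

# Hypothetical readings: 0..99 plus one faulty, very large value.
values = list(range(100)) + [10_000]
low, high = median_of_extremes(values, n=5)
print(low, high)
```

If more than half of the n extreme values are outliers, the median is an outlier too, which is why n has to be increased in that case.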
Thulfiqar
  • This recipe is problematic for several reasons. First, it ignores the likelihood of serial correlation. Second, it's too vague because it doesn't define (or even describe) what an "outlier" might be. Third, it simply won't work: try it out on some data. – whuber Jan 21 '21 at 13:09
  • @whuber thank you for your informative comment. I had assumed the outlier to be a huge value due to a faulty sensor – Thulfiqar Jan 21 '21 at 13:42
0

Denote $X_1,\dots,X_n$ the sensor data from which you want to compute the max.

A preliminary approach could be to take $$\widehat{\max}(X_1,\dots,X_n) = \operatorname{Median}(X_1,\dots,X_n)+\Phi^{-1}\left(\frac{n-\alpha}{n-2\alpha+1} \right)\frac{\operatorname{IQR}(X_1,\dots,X_n)}{\Phi^{-1}(3/4)-\Phi^{-1}(1/4)} $$ with $\alpha=0.375$, $\Phi$ the Gaussian CDF and $\operatorname{IQR}$ the interquartile range. The idea is to take the approximation of the expected maximum order statistic found here, $\mu+\sigma\,\Phi^{-1}\left(\frac{n-\alpha}{n-2\alpha+1}\right)$, and replace $\mu$ by the median and $\sigma$ by $\frac{\operatorname{IQR}(X_1,\dots,X_n)}{\Phi^{-1}(3/4)-\Phi^{-1}(1/4)}$.

Then, if the data are Gaussian, you get an approximation of the expectation of the maximum. On the other hand, if the data are Gaussian but contaminated with outliers, you still get a robust estimator of the max, because it uses only the median and the IQR, and both can tolerate up to $25\%$ outliers. This is very preliminary because it supposes a Gaussian model for the inliers; nonetheless, if your data are well behaved (we would need to see the data to assess that, typically with a Q-Q plot), this should work.
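A minimal sketch of this estimator, using only Python's standard library, might look as follows. The function name `robust_max` and the synthetic data are assumptions made for illustration, and the quartiles are computed with the "inclusive" interpolation rule, which is one choice among several.

```python
from statistics import NormalDist, median, quantiles

def robust_max(x, alpha=0.375):
    """Median + Phi^{-1}((n - alpha) / (n - 2*alpha + 1)) * IQR-based scale."""
    n = len(x)
    std_normal = NormalDist()  # standard Gaussian, provides Phi^{-1} via inv_cdf
    q1, _, q3 = quantiles(x, n=4, method="inclusive")
    # Rescale the IQR so that it estimates sigma when the inliers are Gaussian.
    sigma_hat = (q3 - q1) / (std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25))
    p = (n - alpha) / (n - 2 * alpha + 1)
    return median(x) + std_normal.inv_cdf(p) * sigma_hat

# Synthetic example: 1000 evenly spaced quantiles of N(50, 5^2) as "inliers",
# plus 10 gross outliers at 500 (about 1% contamination).
inliers = [NormalDist(50, 5).inv_cdf((i + 0.5) / 1000) for i in range(1000)]
data = inliers + [500.0] * 10
estimate = robust_max(data)
print(estimate)
```

For a robust minimum, the same scale term can be subtracted from the median instead of added, by symmetry of the Gaussian model.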

TMat