I have a set of numbers, and I need to calculate their average excluding outlier values (which I don't know a priori).

It came to mind that many years ago I studied Standard Deviation. Could I apply it to this problem?

If so, could someone give me an example of how to do it since I have to code it into PHP?

Nick Cox
Gonzalo
    There are many threads on this topic. One key word is robust (statistics). Another is outliers. Those should surely be tags. Your problem is chicken and egg in that only when you have determined what you regard as outliers can you take the mean of the other values. It's far from agreed that you should _exclude_ outliers. There are many, many ways of addressing the question: the very simplest, usually, is to take the median. – Nick Cox Nov 29 '18 at 18:29
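
A minimal sketch of the median suggestion in the comment above, together with one possible way to "exclude outliers" before averaging. The function name `mad_trimmed_mean`, the cutoff `k=3.0`, and the sample data are illustrative assumptions, not something given in the thread. The rule used here (median plus or minus `k` scaled median absolute deviations) is swapped in for the plain standard deviation because the standard deviation is itself inflated by the very outliers it is meant to detect.

```python
from statistics import mean, median

def mad_trimmed_mean(values, k=3.0):
    """Mean of the values lying within k robust SDs of the median.

    Hypothetical helper: the cutoff k is an assumption, not a standard.
    """
    values = list(values)
    med = median(values)
    # Median absolute deviation, scaled by 1.4826 so that it estimates the
    # standard deviation when the bulk of the data is roughly normal.
    mad = 1.4826 * median(abs(v - med) for v in values)
    if mad == 0:
        return med  # at least half the values equal the median
    kept = [v for v in values if abs(v - med) <= k * mad]
    return mean(kept)

data = [10, 12, 11, 13, 12, 11, 95]  # made-up sample with one wild value
print(mean(data))              # about 23.43, dragged up by 95
print(median(data))            # 12
print(mad_trimmed_mean(data))  # 11.5
```

Whether excluding values is appropriate at all remains a judgment call; the median alone avoids having to decide which values count as outliers.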

1 Answer

If you are willing to assume that the non-outlier series is uncorrelated, then you could use the R package tsoutliers (https://cran.r-project.org/web/packages/tsoutliers/tsoutliers.pdf): it flags the anomalies, which lets you compute the mean of the outlier-free values. If, however, the non-outlier series is autocorrelated, then things are a tad more complicated. AUTOBOX, a time series analysis package, is designed to simultaneously identify both the ARIMA structure and the form of the anomalies.

Unusual values provide an insight into possible drivers and are ignored at the user's peril. If they are identified they should be allowed for in the forecast.

EDITED TO PRESENT HOW PULSE IDENTIFICATION WORKS.

Suppose you have 60 values and are trying to find the most "unusual" value.

Run 61 OLS models to determine whether or not there is an unusual value and, if so, where it is.
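
The search above can be sketched as follows, under an assumption the answer does not spell out: that the n + 1 models are one constant-only model plus n models that each add a 0/1 pulse dummy at a single position. For a constant-plus-dummy model the OLS fit has a closed form (the intercept is the mean of the other values), so no regression library is needed. The name `find_pulse` and the cutoff `t_threshold=3.0` are hypothetical; a real procedure would choose the critical value from a preset confidence level and allow for the fact that n candidates are being compared.

```python
import math

def find_pulse(values, t_threshold=3.0):
    """Try a 0/1 pulse dummy at every position and keep the strongest one."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    total = sum(values)
    best = None
    for i, x in enumerate(values):
        # The dummy fits observation i exactly, so the intercept is the
        # mean of the remaining n - 1 values.
        others_mean = (total - x) / (n - 1)
        rss = sum((v - others_mean) ** 2
                  for j, v in enumerate(values) if j != i)
        effect = x - others_mean  # OLS coefficient on the dummy
        # Residual variance uses n - 2 degrees of freedom (two parameters).
        se = math.sqrt(rss / (n - 2) * n / (n - 1))
        if se > 0:
            t = effect / se
        else:
            t = 0.0 if effect == 0 else math.copysign(math.inf, effect)
        if best is None or abs(t) > abs(best["t"]):
            best = {"index": i, "effect": effect, "t": t,
                    "mean_without": others_mean}
    # Fall back to the constant-only model when no dummy is strong enough.
    best["is_outlier"] = abs(best["t"]) > t_threshold
    best["mean"] = best["mean_without"] if best["is_outlier"] else total / n
    return best

data = [10, 12, 11, 13, 12, 11, 95]  # made-up sample
result = find_pulse(data)
print(result["index"], result["mean"])  # 6 11.5
```

Because each candidate model depends only on which value is singled out, the result does not change if the data are reordered, apart from the reported index.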

IrishStat
  • Nothing in the question implies that the OP is talking about time series. – Nick Cox Nov 29 '18 at 23:04
  • By specifying that the error structure is free of time series complications, you are good to go; that essentially denies any time series structure. Unusual values can be identified via tsoutliers even if the data are not a time series. Try it; it might be informative. – IrishStat Nov 30 '18 at 00:43
  • OK, but the answer is still just: use certain software. You don't give any details on what criteria are used to identify outliers. (FWIW, I wasn't the downvoter or upvoter here.) – Nick Cox Nov 30 '18 at 07:20
  • The approach is described at https://pdfs.semanticscholar.org/09c4/ba8dd3cc88289caf18d71e8985bdd11ad21c.pdf. The user presets a level of confidence, and fundamentally it is an iterative process in which alternatives are evaluated and the winners are selected to add dummy (0,1) series to a GLM. – IrishStat Nov 30 '18 at 09:01
  • References are fine but a more detailed summary would be much better. With a non-time series dataset I would expect the approach you suggest to give identical answers regardless of any order in which the data are presented. Can you confirm that? – Nick Cox Nov 30 '18 at 11:43
  • Confirmed. One caveat: the procedure can detect not only pulses but also level shifts (a sequence of similar values, e.g. 0,0,0,1,1,1,1,0,0,0), seasonal pulses, and local time trends. The last three should be suppressed, as they have no role in cross-sectional data. Cross-sectional data can be seen as a particular case of time series data in which there is no autocorrelation. – IrishStat Nov 30 '18 at 14:59
  • @Nick I have added a few words to my response that might help explain the "search procedure" – IrishStat Nov 30 '18 at 15:10