Are there ways to automatically detect outliers (we can restrict attention to one-dimensional datasets) when the underlying distribution is difficult to model?
Intuitively, resampling techniques could help.
(1) You split the data into two sets, S1 and S2.
(2) You fit an empirical distribution on S1 (e.g. using histograms).
(3) You devise a rule to decide whether each point in S2 is compatible with the non-parametric distribution fitted on S1.
(4) Repeat this for many S1, S2 splits and record how many times each point was detected as an outlier.
As an example of Step (3), one could compute the interquartile range from S1 and check how many, and which, points in S2 fall outside it.
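To make steps (1) to (4) concrete, here is a minimal sketch in Python with NumPy. It is only an illustration under my own assumptions, not an established method: the function name `resampled_outlier_frequency`, the 50/50 split, the number of splits, and the multiplier `k` are all hypothetical choices. Since the interquartile range of S1 covers only about half of the data, a literal "outside the IQR" rule would flag roughly every second point; the sketch therefore widens the interval to [Q1 - k*IQR, Q3 + k*IQR] (Tukey-style fences), and setting `k=0` recovers the literal rule from the example above.

```python
import numpy as np

def resampled_outlier_frequency(x, n_splits=200, k=1.5, seed=0):
    """Fraction of splits in which each point, when it landed in S2,
    fell outside fences computed from S1 (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    n = x.size
    flagged = np.zeros(n)  # times a point was flagged while in S2
    tested = np.zeros(n)   # times a point was in S2 at all
    for _ in range(n_splits):
        # Step (1): random split into S1 (first half) and S2 (the rest).
        perm = rng.permutation(n)
        s1, s2 = perm[: n // 2], perm[n // 2:]
        # Step (2): summarize S1 by its empirical quartiles.
        q1, q3 = np.percentile(x[s1], [25, 75])
        iqr = q3 - q1
        lo, hi = q1 - k * iqr, q3 + k * iqr  # assumption: fences, k=0 is the plain IQR
        # Step (3): flag points of S2 that fall outside the fences.
        out = (x[s2] < lo) | (x[s2] > hi)
        # Step (4): accumulate counts over splits.
        flagged[s2] += out
        tested[s2] += 1
    # Points that never landed in S2 get frequency 0.
    return flagged / np.maximum(tested, 1)

# Usage: 50 evenly spaced values plus one distant point.
x = np.append(np.linspace(-1.0, 1.0, 50), 25.0)
freq = resampled_outlier_frequency(x)
print(freq[-1], freq[:50].max())
```

The returned frequencies are a ranking rather than a test: deciding which frequency counts as "outlier" still needs a threshold, which this sketch does not provide.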
Questions:
I am not sure whether this intuitive idea makes sense, whether it can be made more rigorous, or even whether it can be implemented in a sensible way. Are there such methods?
Alternatively, is there a "standard" approach to outlier detection in a non-parametric setting?