13

Let $Q_n = C_n \cdot \{|X_i-X_j|;\ i < j\}_{(k)}$, so for a very short sample like $\{1,3,6,2,7,5\}$ it can be calculated by finding the $k$th order statistic of the pairwise differences (tabulated below for the sorted sample):

    7 6 5 3 2 1
1   6 5 4 2 1
2   5 4 3 1
3   4 3 2
5   2 1
6   1
7

$h = [n/2] + 1 = 4$

$k = h(h-1)/2 = 6$

Thus $Q_n = C_n \cdot 2$.
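
Spelling out the selection step: the $\binom{6}{2} = 15$ pairwise differences from the table above, sorted in increasing order, are

    1 1 1 1 2 [2] 2 3 3 4 4 4 5 5 6

and the $k$th ($k = 6$) of them, shown in brackets, equals $2$.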

Obviously, for large samples, say one consisting of 80,000 records, this approach needs a very large amount of memory.

Is there any way to calculate $Q_n$ in 1-D ($O(n)$) space instead of 2-D ($O(n^2)$) space?

A link to a possible answer: ftp://ftp.win.ua.ac.be/pub/preprints/92/Timeff92.pdf, although I cannot fully understand it.

Glen_b
K-1
  • 1
    OK, the answer for those who will read this later: if you just want to calculate a robust scale estimator for a piece of data, 1) install the latest version of R, 2) install the robustbase package, 3) ready to go! But if you are developing code outside this environment, you need to use weighted high medians to minimize the calculations required for Sn or Qn. – K-1 Oct 29 '10 at 07:28
  • 1
    The link to the paper doesn't work. A proper reference (even better, with a quote of the most relevant information) would have helped us locate the information; as it stands it's useless when the link dies (as often happens). – Glen_b Mar 16 '16 at 01:49
  • 2
    shouldn't it be k = h choose 2 = h(h-1)/2= **6** ? It doesn't change the end result, though. – a tiger Apr 25 '17 at 11:18
  • Why is $Q_n = C_n \cdot 2$; why 2? How was it calculated? – lidox Feb 15 '18 at 15:12

3 Answers

15

Update: The crux of the problem is that, in order to achieve the $O(n\log(n))$ time complexity, one needs on the order of $O(n)$ storage.


No, $O(n\log(n))$ is the theoretical lower bound for the time complexity of selecting the $k^{th}$ element among all $\frac{n(n-1)}{2}$ possible differences $|x_i - x_j|$, $1 \leq i \lt j \leq n$ (see (1)).

You can get $O(1)$ space, but only by naively checking all combinations of $x_i-x_j$ in time $O(n^2)$.
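
For what it's worth, below is a minimal C sketch of one low-memory compromise (this is neither the exact Croux and Rousseeuw algorithm nor the selection scheme of (1); the function names count_pairs_le and kth_pairwise_diff are made up for the example): sort the sample once, then bisect on the value of the difference, using a two-pointer scan to count how many pairs fall at or below the current guess. Nothing of size $O(n^2)$ is ever stored, the cost is one $O(n\log(n))$ sort plus $O(n)$ per bisection step, and the $k$th difference is pinned down only to the bisection tolerance.

    #include <stdio.h>
    #include <stdlib.h>

    /* comparator for qsort() */
    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Number of pairs i < j with x[j] - x[i] <= d, for x sorted in increasing
       order.  Two-pointer scan: O(n) time, O(1) space. */
    static long count_pairs_le(const double *x, int n, double d)
    {
        long count = 0;
        for (int j = 1, i = 0; j < n; j++) {
            while (x[j] - x[i] > d)
                i++;
            count += j - i;
        }
        return count;
    }

    /* k-th smallest pairwise absolute difference (k is 1-based), located by
       bisection on the value of the difference: no O(n^2) array is stored.
       Cost: one O(n log n) sort plus O(n) per bisection step.  The result is
       exact only up to the bisection tolerance. */
    double kth_pairwise_diff(double *x, int n, long k)
    {
        qsort(x, n, sizeof(double), cmp_double);
        double lo = 0.0, hi = x[n - 1] - x[0];
        for (int iter = 0; iter < 100 && hi - lo > 1e-12 * (1.0 + hi); iter++) {
            double mid = lo + (hi - lo) / 2.0;
            if (count_pairs_le(x, n, mid) >= k)
                hi = mid;   /* the k-th difference is <= mid */
            else
                lo = mid;   /* the k-th difference is  > mid */
        }
        return hi;
    }

    int main(void)
    {
        double x[6] = {1, 3, 6, 2, 7, 5};
        /* n = 6, h = n/2 + 1 = 4, k = h*(h-1)/2 = 6: expect 2 (times the constant C_n) */
        printf("k-th smallest pairwise difference: %g\n", kth_pairwise_diff(x, 6, 6));
        return 0;
    }

Bisecting on the value rather than selecting by rank is what keeps the memory footprint flat; the price is that the answer is only as precise as the tolerance, which is rarely an issue for a scale estimate.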

The good news is that you can use the $\tau$ estimator of scale (see (2); see (3) for an improved version and some timing comparisons), implemented in the function scaleTau2() in the R package robustbase. The univariate $\tau$ estimator is a two-step (i.e. re-weighted) estimator of scale. It has 95 percent Gaussian efficiency, a 50 percent breakdown point, and a complexity of $O(n)$ time and $O(1)$ space (plus it can easily be made 'online', shaving off half the computational cost in repeated use; although you will have to dig into the R code to implement this option, it is rather straightforward to do). A rough sketch of the construction in C is given after the references below.

  1. Frederickson, G. N. and Johnson, D. B. (1982). The complexity of selection and ranking in X + Y and matrices with sorted columns. Journal of Computer and System Sciences, 24(2), 197–208.
  2. Yohai, V. and Zamar, R. (1988). High breakdown point estimates of regression by means of the minimization of an efficient scale. Journal of the American Statistical Association, 83, 406–413.
  3. Maronna, R. and Zamar, R. (2002). Robust estimates of location and dispersion for high-dimensional data sets. Technometrics, 44, 307–317.
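
For readers who, like the OP, work outside R, here is a rough, self-contained C sketch of a two-step $\tau$-type scale estimate in the spirit of (2) and (3). It is not a transcription of scaleTau2(): the tuning constants c1 = 4.5 and c2 = 3 are the commonly quoted ones but should be checked against the robustbase source, no Gaussian-consistency correction is applied, and the median/MAD are obtained here by sorting a scratch copy, so this version runs in $O(n\log(n))$ time and $O(n)$ space rather than the $O(n)$/$O(1)$ quoted above. All function names are made up for the example.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* comparator for qsort() */
    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Median via sorting a scratch copy: O(n log n) time, O(n) scratch space.
       (A selection algorithm would bring this down to O(n).) */
    static double median(const double *x, int n, double *scratch)
    {
        for (int i = 0; i < n; i++)
            scratch[i] = x[i];
        qsort(scratch, n, sizeof(double), cmp_double);
        return (n % 2) ? scratch[n / 2]
                       : 0.5 * (scratch[n / 2 - 1] + scratch[n / 2]);
    }

    /* Two-step tau-type scale estimate (no consistency correction; assumes the
       MAD s0 is nonzero):
       step 1: robust location = weighted mean, with smooth weights that vanish
               for points more than c1 MADs away from the median;
       step 2: scale = MAD times the square root of the average squared
               standardized residual, each term capped at c2^2. */
    double tau_scale(const double *x, int n)
    {
        const double c1 = 4.5, c2 = 3.0;           /* illustrative tuning constants */
        double *scratch = malloc((size_t)n * sizeof(double));

        double med = median(x, n, scratch);
        for (int i = 0; i < n; i++)
            scratch[i] = fabs(x[i] - med);
        double s0 = median(scratch, n, scratch);   /* unscaled MAD */

        double wsum = 0.0, wxsum = 0.0;            /* step 1: weighted-mean location */
        for (int i = 0; i < n; i++) {
            double u = (x[i] - med) / (c1 * s0);
            double w = (fabs(u) < 1.0) ? (1.0 - u * u) * (1.0 - u * u) : 0.0;
            wsum  += w;
            wxsum += w * x[i];
        }
        double mu = wxsum / wsum;

        double rhosum = 0.0;                       /* step 2: truncated second moment */
        for (int i = 0; i < n; i++) {
            double u2 = (x[i] - mu) / s0;
            u2 *= u2;
            rhosum += (u2 < c2 * c2) ? u2 : c2 * c2;
        }

        free(scratch);
        return s0 * sqrt(rhosum / n);
    }

    int main(void)
    {
        double x[6] = {1, 3, 6, 2, 7, 5};
        printf("tau-type scale estimate: %g\n", tau_scale(x, 6));
        return 0;
    }

The smooth cut-off in the location step and the cap at $c_2^2$ in the scale step are what give the estimator its breakdown point, while the re-weighting buys back efficiency relative to the raw MAD.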

Edit: To use this:

  1. Fire up R (it's free and can be downloaded from here).
  2. Install the package by typing:

     install.packages("robustbase")

  3. Load the package by typing:

     library("robustbase")

  4. Load your data file and run the function:

     mydatavector <- read.table("address to my file in text format", header=T)
     scaleTau2(mydatavector)
user603
  • I'm using Mathematica to develop the code I need; how can I use what I downloaded from the robustbase website? – K-1 Oct 27 '10 at 08:24
  • You can find the source of the function in this thread: http://stats.stackexchange.com/questions/4017/translate-with-rcpp and try to translate from there to Mathematica. – user603 Oct 27 '10 at 10:02
  • OK, I installed the software and used scan to read my data: mydat – K-1 Oct 28 '10 at 07:16
  • Thanks kwak, after all I could test my code against the R results. :) – K-1 Oct 29 '10 at 07:32
  • @K-1:> Oh, I was going to respond... then I somehow forgot. Sorry about that. So it worked? – user603 Oct 29 '10 at 12:10
  • Yes, it worked. It's much faster than my code in Mathematica. Now I should look for a way to connect the Mathematica and R kernels together. Thanks for the helpful comments. – K-1 Oct 30 '10 at 00:42
  • @K-1:> if you look at the code/formula of the Tau estimator (type "scaleTau2" in R), it's basically just a weighted mean (with weights set by the median/MAD and cut off to 0 below a threshold). This should run fast on anything. – user603 Oct 31 '10 at 11:38
  • Could you please give a hint on how it can be made online? Do you mean an exact or an approximate version? – Quartz Nov 22 '13 at 17:53
  • @Quartz: which one, the mad/tau or the Qn? – user603 Nov 23 '13 at 13:50
  • 2
    @user603: the tau you were referring to. Btw why isn't it widespread if it has such good statistical and computational efficiencies and breakdown point? – Quartz Nov 25 '13 at 13:01
  • 2
    a) you can compute the MAD and median [online](http://stats.stackexchange.com/a/3379/603). From there it's trivial to compute the Tau. b) breakdown ain't robustness and the Tau has terrible bias in the presence of outliers. You can find more arguments against it in section 5 of [the Qn paper](http://web.ipac.caltech.edu/staff/fmasci/home/statistics_refs/BetterThanMAD.pdf) – user603 Nov 25 '13 at 13:26
  • 1
    @user603 you mean this paper? https://wis.kuleuven.be/stat/robust/papers/publications-1994/rousseeuwcroux-biaskstepmestimators-spl-1994.pdf – German Demidov Mar 16 '16 at 00:05
  • @user603 the problem is that $Q_n$ and $S_n$ also have bias. I did not find a paper about it; I just simulated data: 20% of outliers already create a huge bias... =( So probably scaleTau2 is the best, and it also returns a mean estimate, which is somewhat similar to the Hodges-Lehmann estimator which I typically use... – German Demidov Mar 16 '16 at 10:36
  • @GermanDemidov: can you be more specific? What do you mean by bias here? – user603 Mar 16 '16 at 10:43
  • 1
    @user603 according to the paper, the bias curve tells us how much the estimator can change due to a given fraction of contamination. $Q_n$ and $S_n$ were biased for my simulated examples (normal distribution + 20% of extremely high/low values), and the level of bias was comparable. Maybe I got something wrong, but both $S_n$ and $Q_n$ seem to suffer from the same problem. – German Demidov Mar 16 '16 at 10:47
  • 1
    @user603 sorry, the effect could not be seen for samples of size 100. I clearly see the problem using large sample sizes. All of them have terrible biases, but $\tau$ has the largest one. – German Demidov Mar 16 '16 at 10:59
  • @GermanDemidov: I don't know why I only saw your comments above now, 3 years later. Yes indeed, your points are all correct. Generally speaking, the bias of robust estimators tends to get out of whack (but stays bounded) at $\varepsilon/2$, where $\varepsilon$ is the nominal breakdown point. – user603 Apr 06 '19 at 08:51
0

(Very short answer) The text for commenting says

avoid answering questions in comments.

so here it goes: There is a paper about an online algorithm which seemingly runs quite well: Applying the $Q_n$ Estimator Online.

EDIT

(By user603.) The algorithm linked to in this article is a moving-window version of the $Q_n$.

Given a large sample $\{x_i\}_{i=1}^N$ divided into time windows of width $n<N$, $\{x_i\}_{i=t-n+1}^{t}$, we can apply the $Q_n$ to each time window, yielding $N-n+1$ values of $Q_n$. Denote these values $\{Q_n^i\}_{i=1}^{N-n+1}$.

The algorithm cited here makes it possible to obtain $Q_n^i$ given $Q_n^{i-1}$ at an average cost lower than the worst-case $O(n\log(n))$ needed to compute $Q_n^i$ from scratch.

This algorithm cannot, however, be used to compute the $Q_n$ of the full original sample $\{x_i\}_{i=1}^N$. It also needs to maintain a buffer whose size can be as large as $O(n^2)$ (though it is often much smaller).
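
For concreteness, here is a brute-force C sketch of the moving-window scheme just described: each window's $Q_n$ is recomputed from scratch with the naive $O(n^2)$ method, which is exactly the per-window cost the cited online algorithm amortizes away. The consistency constant is again omitted, and the function name qn_naive, the toy data, and the window width are made up for the example.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* comparator for qsort() */
    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Naive Qn of one window of width n (consistency constant omitted):
       sort the n(n-1)/2 pairwise differences and return the k-th smallest,
       k = C(h,2) with h = n/2 + 1.  O(n^2) space, O(n^2 log n) time. */
    static double qn_naive(const double *x, int n)
    {
        long m = (long)n * (n - 1) / 2;
        long h = n / 2 + 1;
        long k = h * (h - 1) / 2;
        double *d = malloc((size_t)m * sizeof(double));
        long y = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                d[y++] = fabs(x[i] - x[j]);
        qsort(d, (size_t)m, sizeof(double), cmp_double);
        double q = d[k - 1];
        free(d);
        return q;
    }

    int main(void)
    {
        double x[] = {1, 3, 6, 2, 7, 5, 4, 9, 8, 0};   /* toy "large" sample */
        int N = (int)(sizeof x / sizeof x[0]);
        int n = 6;                                      /* toy window width */

        /* one value per window of width n: N - n + 1 values in total */
        for (int t = n - 1; t < N; t++)
            printf("Q_n over x[%d..%d] = %g\n", t - n + 1, t, qn_naive(x + t - n + 1, n));

        return 0;
    }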

serv-inc
  • While you should not answer in comments, you should also not post comments as answers, and if your answer is only a link, it's [not an answer](http://stats.stackexchange.com/help/deleted-answers) (but might be a comment). If you want it to be an answer rather than a comment, your answer should contain the relevant information in some manner, such as a [quote](http://stats.stackexchange.com/help/how-to-answer) from a properly referenced link, or your own explanation of the important details. If you can, please provide the necessary details; alternatively I can convert this to a comment for you. – Glen_b Mar 16 '16 at 01:42
  • @Glen_b: go ahead and convert. Thank you for the clarification. – serv-inc Mar 16 '16 at 17:23
  • 1
    @user603 Then perhaps you could (as in the links in my comment) edit the essential information into the above answer -- as it stands at present it's not within the SE network's guidelines for answers. – Glen_b Mar 16 '16 at 23:17
  • No problem, I will! (But it is really late here.) – user603 Mar 16 '16 at 23:51
  • @user603 Thanks; I'll leave it here for now then – Glen_b Mar 17 '16 at 01:09
0

This is my implementation of $Q_n$... It is the naive approach: generate all pairwise absolute differences, sort them, and return the $k$th smallest (the consistency constant $C_n$ is left out).

I programmed it in C and the result is this:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void bubbleSort(double *datos, long N)   // simple O(N^2) exchange sort; qsort() would be much faster
{
 for (long j=0; j<N-1; j++)
  for (long i=j+1; i<N; i++)
   if (datos[i]<datos[j])
   {
    double tmp=datos[i];
    datos[i]=datos[j];
    datos[j]=tmp;
   }
}

double fQ_n(double *datos, int N)  // Rousseeuw and Croux (1993) Qn scale estimator, naive version
{
 long m=(long)N*(N-1)/2;           // number of pairwise differences, C(N,2); storing them all is
                                   // exactly the O(N^2) memory cost the question asks to avoid
 long h=N/2+1;                     // h = floor(N/2) + 1
 long k=h*(h-1)/2;                 // k = C(h,2); the k-th smallest difference is returned

 double *D=malloc((size_t)m*sizeof(double));
 //double Cn=2.2219;               // asymptotic consistency constant, not applied here
                                   // https://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/qn_scale.htm

 long y=0;
 for (int i=0; i<N; i++)           // collect all |x_i - x_j| with i < j
  for (int j=i+1; j<N; j++)
   D[y++]=fabs(datos[i]-datos[j]); // fabs(), not abs(): abs() would truncate the doubles to int

 bubbleSort(D, m);                 // sorting C(N,2) values: O(N^4) with bubble sort

 double Qn=D[k-1];                 // k-th order statistic (1-based); multiply by Cn for consistency
 free(D);
 return Qn;
}

int main(void)
{
 double datos[6]={1,2,3,5,6,7};
 int N=6;

 // Printing the final result in the terminal
 printf("\n==[Results] ========================================\n\n");

 printf(" Q_n=%0.3f\n",fQ_n(datos,N));

 return 0;
}
victor
  • 1
    Although implementation is often mixed with substantive content in questions, we are supposed to be a site for providing information about statistics, machine learning, etc., not code. It can be good to provide code as well, but please elaborate your substantive answer in text for people who don't read this language well enough to recognize & extract the answer from the code. – gung - Reinstate Monica May 17 '18 at 02:02
  • This is the naive $O(n^2)$ algorithm. – user603 Apr 06 '19 at 09:03