
Previous question

In my previous question (Kolmogorov-Smirnov test - reliability) I misused the two-sample KS test for normality testing of one-sample data. I was advised to use the one-sample KS test instead, and the Anderson-Darling test was also recommended.

Current situation

I use the following C++ implementations of the Anderson-Darling (AD) and Kolmogorov-Smirnov (KS) tests to check whether a given cluster has a normal distribution or not.
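
For reference, here is a minimal sketch of how a one-sample AD statistic for normality can be computed (a sketch only, assuming the mean and variance are estimated from the sample, i.e., the usual composite-normality case; the implementations I linked may differ in detail):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Standard normal CDF, Phi(z), via the complementary error function.
static double normal_cdf(double z) {
    return 0.5 * std::erfc(-z / std::sqrt(2.0));
}

// One-sample Anderson-Darling statistic A^2 for normality,
// with mean and variance estimated from the sample.
double anderson_darling_A2(std::vector<double> x) {
    const std::size_t n = x.size();
    std::sort(x.begin(), x.end());

    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= n;

    double var = 0.0;
    for (double v : x) var += (v - mean) * (v - mean);
    var /= (n - 1);
    const double sd = std::sqrt(var);

    // A^2 = -n - (1/n) * sum_{i=1}^{n} (2i-1) [ln F(z_i) + ln(1 - F(z_{n+1-i}))]
    // (For very extreme z, normal_cdf can underflow to 0 and log() returns -inf.)
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double zi  = (x[i] - mean) / sd;
        const double zri = (x[n - 1 - i] - mean) / sd;
        s += (2.0 * (i + 1) - 1.0)
           * (std::log(normal_cdf(zi)) + std::log(1.0 - normal_cdf(zri)));
    }
    return -static_cast<double>(n) - s / n;
}
```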

Problems

If I deal with, let's say, hundreds of points per cluster, AD gives me satisfactory results. Nevertheless, if I have thousands of points in my cluster, I get large values of A-squared from AD that are far beyond AD's critical values.
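
For completeness: since the mean and variance are estimated from the sample, A-squared is usually multiplied by Stephens' small-sample correction before comparison with a critical value. Note that this correction only adjusts for small n; it will not pull large-sample A-squared values back down. A sketch of the decision step, assuming the commonly cited 5% critical value of 0.752 for this case:

```cpp
#include <cstddef>

// Decision at the 5% level, assuming Stephens' small-sample correction
// A*^2 = A^2 * (1 + 0.75/n + 2.25/n^2) and the commonly cited critical
// value 0.752 for normality with mean and variance estimated.
bool reject_normality_5pct(double A2, std::size_t n) {
    const double nn = static_cast<double>(n);
    const double A2_star = A2 * (1.0 + 0.75 / nn + 2.25 / (nn * nn));
    return A2_star > 0.752;
}
```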

I have read here that AD fails for large numbers of points and that it is better to use KS. However, I am not sure if I did something wrong, but KS is equal to 1.0 in almost every case (cross-checked in Matlab too).
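
For comparison, a minimal sketch of the one-sample KS statistic against a fitted normal CDF is below. One caveat worth flagging: estimating the mean and standard deviation from the same sample invalidates the standard KS critical values and p-values (the Lilliefors variant corrects for this).

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One-sample KS statistic D against a normal CDF whose mean and standard
// deviation are estimated from the same sample. Caveat: with estimated
// parameters the standard KS critical values/p-values are not valid;
// the Lilliefors test corrects for this.
double ks_statistic_normal(std::vector<double> x) {
    const std::size_t n = x.size();
    std::sort(x.begin(), x.end());

    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= n;

    double var = 0.0;
    for (double v : x) var += (v - mean) * (v - mean);
    var /= (n - 1);
    const double sd = std::sqrt(var);

    double D = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        // Phi(z) via the complementary error function.
        const double F = 0.5 * std::erfc(-((x[i] - mean) / sd) / std::sqrt(2.0));
        const double d_plus  = (i + 1.0) / n - F;               // ECDF above fitted CDF
        const double d_minus = F - static_cast<double>(i) / n;  // ECDF below fitted CDF
        D = std::max(D, std::max(d_plus, d_minus));
    }
    return D;
}
```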

Example data

The example data are located here. They represent a cluster of Gaussian mixture points. I want to test this cluster for normality (and for these data, I want that test to fail). Consequently, I want to split the data into sub-clusters, test each sub-cluster, and split again until I end up with "leaf clusters" that are normally distributed.
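
To make the intended procedure concrete, a rough sketch follows; the split rule (cutting at the sample mean) and the minimum cluster size below are hypothetical placeholders I have not settled on:

```cpp
#include <cstddef>
#include <vector>

// Assumed to exist: true if the cluster passes the chosen normality test
// (e.g., corrected A^2 below its critical value).
bool looks_normal(const std::vector<double>& cluster);

// Recursively split a cluster until every leaf passes the normality test.
void split_until_normal(const std::vector<double>& cluster,
                        std::vector<std::vector<double>>& leaves) {
    if (cluster.size() < 8 || looks_normal(cluster)) {  // placeholder size guard
        leaves.push_back(cluster);
        return;
    }
    // Placeholder split rule: partition at the sample mean.
    double mean = 0.0;
    for (double v : cluster) mean += v;
    mean /= cluster.size();

    std::vector<double> lo, hi;
    for (double v : cluster) (v < mean ? lo : hi).push_back(v);

    if (lo.empty() || hi.empty()) {  // degenerate split; stop here
        leaves.push_back(cluster);
        return;
    }
    split_until_normal(lo, leaves);
    split_until_normal(hi, leaves);
}
```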

Michal
    The algorithm described in the last paragraph will, with rare exceptions, terminate only when each cluster has one or two points: that's about the only way a cluster can look perfectly "normally distributed." If instead you allow for slight deviations from normality *it is not appropriate to use the p-value of a hypothesis test for such a measure* (except in the very special case when all clusters have the same counts). This leaves you with a bunch of problems but, as of yet, no clearly articulated question. *What exactly are you trying to accomplish in the end?* – whuber May 07 '14 at 14:56
  • I want to end up with intuitive peaks (if we plot the data as a histogram). I was thinking about allowing slight deviations from normality, as you said, in order to have some stopping criterion - a threshold. – Michal May 07 '14 at 15:19
  • You seem to be developing your own algorithm to identify a *mixture model*. That's not easy to do, but there exist several good, well-known approaches and plenty of software. Check out the [links to this term on our site.](http://stats.stackexchange.com/search?tab=votes&q=%20mixture%20model%20gaussian) – whuber May 07 '14 at 15:23
  • Yes, that is true. However, I have already implemented the Approximate Gaussian Mixtures algorithm (http://stackoverflow.com/questions/22169492/number-of-peaks-in-histogram?lq=1), which is based on EM. This algorithm works great; nevertheless, I have to work with bad 1D peaks (lots of noise, large overlaps). In every EM iteration, AGM uses a convolution criterion to merge clusters that overlap significantly. I am trying to develop an algorithm that will compete with AGM by dividing the cluster that differs most from a normal distribution. – Michal May 12 '14 at 12:19
