Defining the "uniformity" of a dataset

Question

I am working on a few algorithms where I have a list of $N$ samples. So far I have plotted these as a histogram, which shows how uniformly the values are distributed within an interval. That works well as a visualization, but I need a single comparable value for how uniform the dataset is, in order to measure how robust it is compared to my other algorithms.

I have been looking at the chi-squared test, but could not figure out how it would be helpful in my use case.

Sample from dataset:

Code for importing data and applying chi-squared in R:

mydata = read.csv2("/opt/doc/stat/uniform_test_1.csv")
x <- sapply(mydata, as.numeric)
chisq.test(x)

Result: X-squared = 1664769844, df = 999998, p-value < 2.2e-16

You are asking a (simplified) one-dimensional version of http://stats.stackexchange.com/questions/40928/measure-the-uniformity-of-distribution-of-points-in-a-2d-square, all of whose answers are applicable to your question with essentially no change at all. — whuber, Jan 08 '13 at 18:27
@whuber, thank you for pointing me to that question. But I must admit that this part of statistics is a bit rusty in my mind. Can you be a bit more specific? Thanks. — JavaCake, Jan 08 '13 at 18:40
[A Baumann](http://stats.stackexchange.com/a/40929) suggests a KS-test: read about it on Wikipedia and find an open-source implementation in `R`. [Ben Allison](http://stats.stackexchange.com/a/40935) recommends a chi-squared test and helpfully describes how to conduct it by dividing the range of data "into equally sized non-overlapping patches." Many sources, including [Wikipedia](http://en.wikipedia.org/wiki/Goodness_of_fit#Pearson.27s_chi-squared_test), describe this test and provide formulas. Open source software such as `R` will do the calculations automatically. — whuber, Jan 08 '13 at 18:44
@whuber, I have tried to apply `chisq.test` to my dataset and get the following result: `X-squared = 1664769844, df = 999998, p-value < 2.2e-16`. What does the `df` represent? I will look into the `KS-test`. — JavaCake, Jan 08 '13 at 18:51
You may have misapplied it (although it's conceivable this result is correct): in an edit to your question, can you provide some details describing what you did? `df` is the **D**egrees of **F**reedom mentioned in all the references. — whuber, Jan 08 '13 at 18:52
That's what I suspected. It means you're misapplying the test, because you have not binned the data as described in the answers I linked to, but it also shows you're on the right track. With your edits, the question has become clearer and more definite, so I'm confident it will get some great answers soon. If you could provide an example of your data, such as the first five or ten lines of the file, that would be even more helpful. — whuber, Jan 08 '13 at 19:58
One last question (I hope!) In principle it makes a difference if you know the range of the data beforehand. That is, do you want to check for uniformity within a *predetermined* range of values or do you want both to *estimate* what the range seems to be and *also* check for uniformity? (The "in principle" would be relevant to smaller datasets: with a million values, the distinction is not likely of practical importance.) Another detail: are all values necessarily integral? Although this is another minor point, it is one that can affect the validity of the test even with a million values. — whuber, Jan 08 '13 at 20:31
@whuber, I'm working with algorithms that require RNGs, and the interval is $[0:10^4]$. I do not require that the values be equally divided in the interval. Basically, it would be interesting to measure how uniform the values are in the interval and also how uniform the dataset is in general (as I can see in the histogram). — JavaCake, Jan 08 '13 at 20:37
I think your question has been answered in http://stats.stackexchange.com/questions/375, http://stats.stackexchange.com/questions/4331, and http://stats.stackexchange.com/questions/30. — whuber, Jan 08 '13 at 20:42
@whuber, I'm not really interested in determining the randomness, so Diehard is somewhat out of the picture. http://stats.stackexchange.com/questions/4331/uniform-distribution-test seems useful, but it is much the same as my question, including the failure with `chi-square`. — JavaCake, Jan 08 '13 at 20:49
Because the idea of randomness includes uniformity--and uniformity is so basic to randomness--all tests of randomness include tests of uniformity. That's why you will find your question answered in those threads about testing randomness. — whuber, Jan 08 '13 at 21:03
@whuber, that is in fact true; I might be moving in the wrong direction. But I find this field very confusing. I understand that Diehard is pretty much deprecated, but the number of alternatives is massive. Is there a particular recommended way to test randomness? — JavaCake, Jan 08 '13 at 21:07
I see that chi-square is also recommended, but why is my approach failing? — JavaCake, Jan 08 '13 at 21:09

score 3 · Accepted Answer · answered Jan 08 '13 at 21:17

Chi-squared is used in a LOT of ways in statistics. The R command `chisq.test` is described as: "chisq.test performs chi-squared contingency table tests and goodness-of-fit tests." And in particular, "If x ... is a vector and y is not given, then a goodness-of-fit test is performed (x is treated as a one-dimensional contingency table)." So if your $x$ is your raw data, each value is treated as the count in its own cell, and you're getting nonsensical results.
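A minimal sketch of that effect, using simulated values as a stand-in for the asker's file (the sample size and the `runif` call are assumptions for illustration only). Because every raw value is read as a cell count, the degrees of freedom come out as the number of values minus one:

```r
set.seed(1)
# Assumption: 1000 simulated values on [0, 10^4] stand in for the real data.
x <- runif(1000, min = 0, max = 1e4)

# Each raw value is treated as the count in its own cell of a
# one-dimensional contingency table.
raw <- chisq.test(x)
raw$parameter   # degrees of freedom: length(x) - 1
```

This matches the pattern in the question, where `df` is roughly the number of rows in the file rather than the number of bins.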

It sounds like you're conflicted on what you're calling "uniform". Visually, you're looking at a histogram, which bins the data in intervals and displays the counts in each interval. Yet you don't require numbers to be equally divided in your interval?

Based on what you're seeing in the histogram, you should bin your data, as in the histogram's bins, and then you can run `chisq.test` on the bin counts, or look at the variance among the bins, or look at quantiles of the bin counts, or something else.
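One way the binning step might look, as a sketch rather than the only approach. It assumes simulated data on the interval $[0, 10^4]$ mentioned in the comments and an arbitrary choice of 20 equal-width bins; `cut` and `table` do the counting:

```r
set.seed(42)
# Assumption: simulated data on the asker's interval [0, 10^4].
x <- runif(1e5, min = 0, max = 1e4)

# 20 equal-width bins spanning the known range, not just the observed range.
breaks <- seq(0, 1e4, length.out = 21)
counts <- table(cut(x, breaks = breaks, include.lowest = TRUE))

# Goodness-of-fit test against equal expected counts in every bin.
fit <- chisq.test(counts)
fit$statistic
fit$p.value

# Simpler descriptive alternatives: spread and quantiles of the bin counts.
var(as.vector(counts))
quantile(as.vector(counts))
```

With equal expected counts the two measures are directly related: the chi-squared statistic equals $(k-1)$ times the variance of the counts divided by their mean, where $k$ is the number of bins.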

From what you've said, the big difference between what you want and checking a random number generator is that you don't care about the order in which the numbers were generated, only the set of numbers that were generated. In that case, you'd expect the count of numbers in each bin to be proportional to the size of the bin, and deviation from that would indicate non-uniformity.
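If the bins are not all the same width, the expected proportions can be passed to `chisq.test` through its `p` argument. The following sketch assumes simulated data and a set of unequal break points chosen purely for illustration:

```r
set.seed(7)
# Assumption: simulated data on [0, 10^4]; the break points are arbitrary.
x <- runif(5e4, min = 0, max = 1e4)
breaks <- c(0, 500, 2000, 5000, 1e4)

counts <- table(cut(x, breaks = breaks, include.lowest = TRUE))

# Under uniformity, each bin's expected share is proportional to its width.
p <- diff(breaks) / sum(diff(breaks))

res <- chisq.test(counts, p = p)
res$expected    # expected counts: length(x) * p
res$p.value
```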

Wayne


I have realised that my approach to uniform values has perhaps been wrong. You mention binning my data; can you please explain how I should input the data to `chisq.test`? The doc http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html does not explain in detail how the data should be structured. – JavaCake Jan 08 '13 at 21:22
@JavaCake: There are various means for binning data. If you're using the `hist` function to generate your histograms, it does more than just graph, and you can do something like: `foo – Wayne Jan 08 '13 at 21:31
That did in fact give a result. Since my interval is $[1:10^4]$, would the chisq function need this information in order to give the correct result? – JavaCake Jan 08 '13 at 21:39
My idea was to set `breaks=max` for `hist` and overcome that problem with not getting all bins? – JavaCake Jan 08 '13 at 21:46
@JavaCake: I'd look at `cut` and `table` instead of leaning on `hist`, but you need to tell `hist` what your actual range is. Something like: `foo – Wayne Jan 08 '13 at 21:50
@JavaCake: If I understand `hist`, `breaks=max` is saying that it should find the maximum number in your input and use that as the number of bins to break your data into. You need to slow down a bit and understand what you're doing a bit more. – Wayne Jan 08 '13 at 21:54
I am trying to put the puzzle together concerning how data should be input to `chisq.test`. I thought that every value in the dataset had to have its own bin, with duplicates summed up in that particular bin, in order to get the correct statistical result? – JavaCake Jan 08 '13 at 21:59
@JavaCake: No, the values that fall into a bin (a range of x values) are counted. Depending on what function you use to calculate and plot a histogram, it may give "density" or "Frequency"/"Count". The default for `hist`, if the bins are all equal-width, is "Frequency", which is the number of x's that fell within each bin (range). – Wayne Jan 08 '13 at 22:17
That makes good sense. I'm experimenting with `cut`, and I see the `chisq.test` results change significantly depending on the number of bins/breaks I use. As an example, I used `10`, which gave me `X-squared = 5.1192, df = 8, p-value = 0.7448`, and `1000`, which gave me `X-squared = 1087.253, df = 998, p-value = 0.02524`. The last probability is at a level where it does not make sense for my null hypothesis. You mention that only values that fall into a bin are counted, so if I set my breaks to `10000`, is there a chance that every value gets its own bin? – JavaCake Jan 08 '13 at 22:28
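The dependence on the number of bins can be explored with a small sketch. It assumes simulated uniform data on $[0, 10^4]$, and `chisq_with_bins` is a hypothetical helper introduced only for this illustration:

```r
set.seed(3)
# Assumption: simulated uniform data on the interval [0, 10^4].
x <- runif(1e5, min = 0, max = 1e4)

# Hypothetical helper: chi-squared test using k equal-width bins.
chisq_with_bins <- function(x, k, lower = 0, upper = 1e4) {
  breaks <- seq(lower, upper, length.out = k + 1)
  counts <- table(cut(x, breaks = breaks, include.lowest = TRUE))
  chisq.test(counts)
}

for (k in c(10, 100, 1000)) {
  res <- chisq_with_bins(x, k)
  cat(sprintf("k = %4d  X-squared = %9.2f  df = %3d  p = %.4f  expected/bin = %.0f\n",
              as.integer(k), res$statistic, as.integer(res$parameter),
              res$p.value, length(x) / k))
}
```

With $N$ values and $k$ bins the expected count per bin is $N/k$. The chi-squared approximation becomes unreliable when expected counts are small (a common rule of thumb asks for at least 5 per bin), so giving every distinct value its own bin is generally not the goal.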

score 1 · Answer 2 · answered Apr 23 '13 at 04:23

I think if you're after a measure of uniformity, goodness of fit tests for the uniform offer a variety of statistics that can provide suitable 'uniformity' measures.

If your upper and lower limits are known, Kolmogorov-Smirnov, Cramér-von Mises or Anderson-Darling statistics offer measures of uniformity (though there are a bunch of other measures available from other statistics).
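As a sketch of two of these statistics with known limits: the Kolmogorov-Smirnov distance comes from base R's `ks.test`, and the Cramér-von Mises statistic is computed here directly from its standard formula. The data are simulated, the skewed comparison sample is an assumption for illustration, and `cvm_stat` is a hypothetical helper name:

```r
set.seed(11)
lower <- 0
upper <- 1e4   # known limits (the interval from the comments)

# Assumption: one simulated uniform sample and one clearly non-uniform sample.
x_unif <- runif(1000, lower, upper)
x_skew <- lower + (upper - lower) * rbeta(1000, 2, 5)

# Kolmogorov-Smirnov distance D between the empirical and uniform CDFs.
ks_unif <- ks.test(x_unif, "punif", min = lower, max = upper)
ks_skew <- ks.test(x_skew, "punif", min = lower, max = upper)

# Cramer-von Mises statistic:
# W2 = 1/(12 n) + sum((u_(i) - (2i - 1) / (2n))^2),
# where u_(i) are the sorted values mapped onto [0, 1].
cvm_stat <- function(x, lower, upper) {
  u <- sort((x - lower) / (upper - lower))
  n <- length(u)
  1 / (12 * n) + sum((u - (2 * seq_len(n) - 1) / (2 * n))^2)
}

c(D_unif = unname(ks_unif$statistic), D_skew = unname(ks_skew$statistic))
c(W2_unif = cvm_stat(x_unif, lower, upper), W2_skew = cvm_stat(x_skew, lower, upper))
```

Smaller values indicate a sample closer to uniform. If the real values are integers, `ks.test` will warn about ties, since it assumes a continuous distribution.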

If the upper and lower limits are unknown, you could do correlation against uniform scores (expected uniform order statistics or similar), which doesn't depend on the limits being known.
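A minimal sketch of the correlation idea, assuming simulated samples and using $i/(n+1)$ as the expected uniform order statistics. The function names `uniform_scores` and `uniformity_cor` are hypothetical:

```r
set.seed(21)
# Assumption: simulated samples whose limits are treated as unknown.
x_unif <- runif(1000, 250, 730)
x_skew <- 250 + 480 * rbeta(1000, 2, 5)

# Expected order statistics of a standard uniform sample: E[U_(i)] = i / (n + 1).
uniform_scores <- function(n) seq_len(n) / (n + 1)

# Correlation between the sorted sample and the uniform scores.
# It is unchanged by shifting or positively rescaling the data, so the
# limits never enter the calculation.
uniformity_cor <- function(x) cor(sort(x), uniform_scores(length(x)))

uniformity_cor(x_unif)
uniformity_cor(x_skew)
```

Values closer to 1 indicate a sample that lines up more closely with a uniform distribution.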

An alternative is to use the sample max and min to scale the remainder of the sample to $(0,1)$; if the sample is from a uniform distribution, that rescaling leaves you with a standard uniform sample with two fewer observations; then one of the goodness-of-fit statistics can be used as a measure of uniformity.
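The rescaling could be sketched as follows, assuming a simulated sample and a hypothetical helper `rescale_interior`. The Kolmogorov-Smirnov distance is used here only as one possible choice of goodness-of-fit statistic:

```r
set.seed(5)
# Assumption: simulated sample whose limits are treated as unknown.
x <- runif(500, 120, 480)

# Drop the sample min and max, then map the remaining values onto (0, 1).
rescale_interior <- function(x) {
  s <- sort(x)
  n <- length(s)
  (s[2:(n - 1)] - s[1]) / (s[n] - s[1])
}

u <- rescale_interior(x)
length(u)             # two fewer than length(x)
ks.test(u, "punif")   # distance of the rescaled sample from a standard uniform
```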

Chi-square test statistics can be used but they don't make efficient use of the available information (they have relatively low power against interesting alternatives).
