Let's say I have a set of numbers {5, 5, 5, 5}, which has a mean of 5 and a variance of 0. I can redistribute the data in different ways, for instance:

{4, 3, 5, 8}; mean = 5, variance = 3.5

{4, 4, 4, 8}; mean = 5, variance = 3

{3, 3, 5, 9}; mean = 5, variance = 6

{6, 6, 6, 2}; mean = 5, variance = 3
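As a quick check, the variances listed above are population variances (dividing by n = 4 rather than n - 1). A minimal sketch using Python's standard `statistics` module reproduces them; the variable name `redistributions` is only illustrative:

```python
from statistics import mean, pvariance

# The four redistributions listed above.
redistributions = [
    [4, 3, 5, 8],
    [4, 4, 4, 8],
    [3, 3, 5, 9],
    [6, 6, 6, 2],
]

for bins in redistributions:
    # pvariance divides by n (population variance), not n - 1.
    print(bins, "mean =", mean(bins), "variance =", pvariance(bins))
```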

Now I would like to choose the one that is best balanced compared with the original distribution {5, 5, 5, 5}. My question is: what statistical test can I use to choose the most favourable redistribution in this example?

  • This needs some clarification. What is "favorable" in this case? What do you mean by "better balanced?" What is "the data" that you are redistributing? – shadowtalker May 25 '15 at 00:01
  • In this example, the most favourable ones in my view are {4, 4, 4, 8} and {6, 6, 6, 2}. Think of the data as balls (20 in total here) that I'm distributing among 4 bins. By better balance I mean how evenly these 4 bins are filled. – Joarder Kamal May 25 '15 at 00:06
  • You mean, you want the distribution that's closest to uniform. Right? – shadowtalker May 25 '15 at 00:22
  • That's right, I'll choose the redistribution that's closest to uniformity. – Joarder Kamal May 25 '15 at 00:24
  • @ssdecontrol what do you mean by 'uniform' in this instance? Isn't "5 in each cell" perfectly uniform? How could the OP seek to make that more uniform? – Glen_b May 25 '15 at 02:23
  • @Glen_b ack I meant "constant" – shadowtalker May 25 '15 at 02:24
  • @sddecontrol -- sorry to be dense but I am still not clear on the intent of the new term. What would be more constant than 5,5,5,5? – Glen_b May 25 '15 at 03:26
  • This question is incomplete: it needs to specify how "better balanced" and "most favorable" ought to be evaluated or, at a minimum, what it is intended to mean. Note, too, that "uniformity" (as referenced in the comments) could be taken to mean several distinct things: both (5,5,5,5,5) and (3,4,5,6,7) are perfectly uniform (and have the same mean). Please edit the question to include enough information to enable all readers to understand it in the same way. – whuber May 26 '15 at 14:50

1 Answer

Another way to look at this is as a template-matching problem, where the template is {5, 5, 5, 5} and the query is, for instance, {4, 4, 4, 8}. Treat each distribution as a histogram and compare the two. If your bins are distinct and you have some information about them, you can use a cross-bin distance such as the Earth Mover's Distance (EMD); otherwise ordinary bin-to-bin metrics serve the purpose. To find the candidate with the smallest relative entropy with respect to the template, you can use the Kullback-Leibler divergence (KL divergence). Another good choice is the chi-square distance, which is derived from Pearson's $\chi^2$ statistic.
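To make this concrete, here is a minimal pure-Python sketch of the three distances, applied to the numbers from the question. It is an illustration under stated assumptions, not a reference implementation: the function names are my own, the chi-square distance uses the symmetric form $\frac{1}{2}\sum_i (p_i - q_i)^2 / (p_i + q_i)$ (one of several in use), and the EMD shown is only valid if the bins are ordered and equally spaced, which the question does not confirm.

```python
import math

TEMPLATE = [5, 5, 5, 5]
CANDIDATES = [[4, 3, 5, 8], [4, 4, 4, 8], [3, 3, 5, 9], [6, 6, 6, 2]]


def normalize(counts):
    """Turn raw bin counts into proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def chi_square_distance(p, q):
    """Symmetric chi-square distance: 0.5 * sum((p - q)^2 / (p + q))."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def kl_divergence(p, q):
    """KL divergence D(p || q); assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def emd_1d(p, q):
    """EMD for ordered, equally spaced bins (an assumption about the bins)."""
    running, total = 0.0, 0.0
    for a, b in zip(p, q):
        running += a - b       # mass that still has to move past this bin
        total += abs(running)  # cost of moving it one bin further
    return total


def rank_candidates(distance, template=TEMPLATE, candidates=CANDIDATES):
    """Return candidates sorted from closest to farthest from the template."""
    q = normalize(template)
    return sorted(candidates, key=lambda c: distance(normalize(c), q))


if __name__ == "__main__":
    for dist in (chi_square_distance, kl_divergence, emd_1d):
        print(dist.__name__, rank_candidates(dist))
```

With these numbers, both the chi-square distance and the KL divergence rank {4, 4, 4, 8} as closest to the template. The 1-D EMD depends on the order of the bins, so it is only meaningful if that order carries information.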

There are plenty of other metrics for comparing two histograms that might interest you. Please refer to this page for a list of the most common histogram distances and this GitHub page for their MATLAB implementation.

Kourosh Meshgi