
I think my assumptions are a bit naive regarding this matter. I have two metrics for my data set: the number of items and the cardinality of the items. Low cardinality means a lot of repeated items, and high cardinality means few repeated items (tending to a uniform distribution). Based on these, I want to identify whether my data set is skewed.

{1,2,3,4,5,6,7,8,9,10} = high cardinality = 10 and 10 totally different items.
{1,1,1,1,2,2,2,2,3,3} = low cardinality = 3 and 10 items, with a lot of repetition.

+-----------------+-------------+--------------+-----------------+
| number of items | cardinality |     calc     |     skewed?     |
+-----------------+-------------+--------------+-----------------+
|              10 |          10 | 10/10 = 1    | totally uniform |
|              10 |           2 | 2/10 = 0.2   | skew            |
|              10 |           8 | 8/10 = 0.8   | uniform         |
|             100 |           8 | 8/100 = 0.08 | skew            |
|             100 |          50 | 50/100 = 0.5 | skew            |
|             100 |          80 | 80/100 = 0.8 | uniform         |
+-----------------+-------------+--------------+-----------------+

Is this a reasonable way to check whether I have a skewed data set? I set a threshold of 0.8: if cardinality divided by the number of items is less than 0.8, I take it that the data set is skewed.
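
In R terms, the check I have in mind looks something like this (the vector name is just for illustration):

x <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3)
ratio <- length(unique(x)) / length(x)  # cardinality / number of items = 0.3
ratio < 0.8                             # TRUE, so I would flag this data set as skewed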

  • Can you please explain what you mean by *cardinality of the items*? – kjetil b halvorsen Aug 13 '19 at 09:20
  • I wrote an example – Felipe Aug 13 '19 at 09:52
  • Are the items you have seen **all the types there could be**, or could some be unseen? – kjetil b halvorsen Aug 13 '19 at 09:59
  • I am afraid I didn't understand what you mean. If I see the item, I count it; otherwise, there is no way to include the item in my data set. – Felipe Aug 13 '19 at 10:02
  • I think it would help if you could tell us the context where this arises! You are asking for some measure of *uniformity*, not of *skewness*. – kjetil b halvorsen Aug 13 '19 at 10:04
  • I thought that the opposite of uniformity is skewness. My case is a join algorithm, where I have to join items from 2 tables and the join key is skewed. – Felipe Aug 13 '19 at 10:06
  • I think OP means skewed towards certain values rather than being uniform the way that a die has 1/6 probability for each number. As a heads up, that is not what skewness means in statistics. The bell curve is not skewed, but it certainly is more likely to generate values near the mean than out in the tails (i.e. about average is more common than exceptionally good or exceptionally bad). For the OP's problem, I think a chi-squared test would work, and if no one beats me to it, I'll write a full answer when I'm not restricted to using just my phone. – Dave Aug 13 '19 at 10:40
  • Thanks. Could you also answer: what is an OP problem? – Felipe Aug 13 '19 at 10:45
  • @Felipe It means that you're the original poster. – Dave Aug 13 '19 at 11:24
  • Thanks. I would like to see what you have in mind. – Felipe Aug 13 '19 at 12:00

1 Answer


First, let's get into what skewed means versus uniform.

Here is an unskewed distribution that is not uniform. This is the standard normal bell curve.

[Plot of the standard normal density, produced by the code below]

x <- seq(-3, 3, 0.01)  # grid of x values
plot(x, dnorm(x, 0, 1), type = 'l', xlab = '', ylab = '')  # standard normal density

Here is a skewed distribution ($F_{5,5}$).

[Plot of the $F_{5,5}$ density, produced by the code below]

x <- seq(0, 4, 0.01)  # grid of x values
plot(x, df(x, 5, 5), type = 'l', xlab = '', ylab = '')  # F(5,5) density, right-skewed

However, both distributions have values that they prefer. In the normal distribution, for instance, you would expect to get samples around 0 more than you would expect values around 2. Therefore, the distributions are not uniform. A uniform distribution would be something like how a die has a 1/6 chance of landing on each number.

I see your problem as being akin to checking if a die is biased towards particular numbers. In your first example, each number between 1 and 10 is equally represented. You have a uniform distribution on $\{1,2,3,4,5,6,7,8,9,10\}$.

$$P(X = 1) = P(X=2) = \cdots = P(X=9) = P(X=10) = \frac{1}{10}$$

In your second example, you have some preference for 1 and 2 at the expense of 3.

$$P(X=1) = P(X=2) = \frac{4}{10}, \quad P(X=3) = \frac{2}{10}$$
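
You can read these empirical proportions straight off the sample in R (the vector name is mine):

x <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3)
table(x) / length(x)  # 1: 0.4, 2: 0.4, 3: 0.2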

The number of unique items has nothing to do with uniformity.

What I think you want to do is test if your sample indicates a preference for particular numbers. If you roll a die 12 times and get $\{3,2,6,5,4,1,2,1,3,4,5,4\}$, you'd notice that you have a slight preference for 4 at the expense of 6. However, you'd probably chalk this up to luck of the draw and figure that if you did the experiment again, you'd be just as likely to find that 6 is preferred at the expense of some other number. The lack of uniformity is due to sampling variability (chance or luck of the draw, but nothing suggesting that the die lacks balance). Similarly, if you flip a coin four times and get HHTH, you probably won't think anything is fishy. That seems perfectly plausible for a fair coin.
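
Tallying those 12 rolls in R makes the slight imbalance visible:

rolls <- c(3, 2, 6, 5, 4, 1, 2, 1, 3, 4, 5, 4)
table(rolls)  # counts: 1:2, 2:2, 3:2, 4:3, 5:2, 6:1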

However, what if you roll the die 12,000 or 12 billion times and still get a preference for 4 at the expense of 6, or you do billions of coin flips and find that heads is preferred 75% of the time? Then you'd start thinking that there is a lack of balance and that the lack of uniformity in your observations is not just due to random chance.

There is a statistical hypothesis test to quantify this. It's called Pearson's chi-squared test. The example on Wikipedia is pretty good. I'll summarize it here. It uses a die.

$$H_0: P(X=1) = \cdots = P(X=6) = \frac{1}{6}$$

This means that we are assuming equal probabilities of each face of the die and trying to find evidence suggesting that is false. This is called the null hypothesis.

Our alternative hypothesis is that $H_0$ is false, that some probability is not $\frac{1}{6}$ and the lack of uniformity in the observations is not due to chance alone.

We conduct an experiment of rolling the die 60 times. "The number of times it lands with 1, 2, 3, 4, 5, and 6 face up is 5, 8, 9, 8, 10, and 20, respectively."

For face 1, we would expect 10, but we got 5. This is a difference of 5. Then we square the difference to get 25. Then we divide by the expected number to get 2.5.

For face 2, we would expect 10, but we got 8. This is a difference of 2. Then we square the difference to get 4. Then we divide by the expected number to get 0.4.

Do the same for the remaining faces to get 0.1, 0.4, 0, and 10.
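
These by-hand steps can be mirrored in R (the variable names are mine):

observed <- c(5, 8, 9, 8, 10, 20)   # counts for faces 1 through 6
expected <- rep(10, 6)              # 60 rolls, 6 equally likely faces
(observed - expected)^2 / expected  # 2.5 0.4 0.1 0.4 0.0 10.0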

Now add up all of the values: $2.5 + 0.4 + 0.1 + 0.4 + 0 + 10 = 13.4$. This is our test statistic. We test against a $\chi^2$ distribution with 5 degrees of freedom. We get five because there are six outcomes, and we subtract 1.

Now we can get our p-value. The R command to do that is pchisq(13.4, 5, lower.tail = FALSE). The result is about 0.02, meaning that there is only a 2% chance of getting this level of non-uniformity (or more) due to random chance alone. It is common to reject the null hypothesis when the p-value is less than 0.05, so at the 0.05 level, we can say that we reject the null hypothesis in favor of the alternative. However, if we want to test at the 0.01 level, we lack sufficient evidence to say that the die is biased.

Try this out for an experiment where you roll a die 180 times and get 1, 2, 3, 4, 5, and 6 in the amounts of 60, 15, 24, 24, 27, and 30, respectively. When I do this in R, I get a p-value of about $1.36 \times 10^{-7}$ (1.36090775991073e-07 is the printout).
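
To check that result yourself (the expected count is 180/6 = 30 per face):

observed <- c(60, 15, 24, 24, 27, 30)         # counts for faces 1 through 6
statistic <- sum((observed - 30)^2 / 30)      # 40.2
pchisq(statistic, df = 5, lower.tail = FALSE) # about 1.36e-07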

Now for the shortcut in R, for when you get the idea of this test and can do it by hand but don't want to.

V <- c(60, 15, 24, 24, 27, 30)  # observed counts for faces 1 through 6
chisq.test(V)                   # Pearson's chi-squared test against equal probabilities

This creates a vector of the frequencies (V) and then tests that vector.
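
The printout should look roughly like this (the p-value matches the hand computation above):

        Chi-squared test for given probabilities

data:  V
X-squared = 40.2, df = 5, p-value = 1.361e-07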

  • Thanks for this answer. When you said "there is only a 2% chance of getting this level of non-uniformity", can I rephrase it as "the level of non-uniformity is 2%" or "the outcomes of my die are 92% uniform"? However, on the second example you got 1.36090775991073e-07, and I don't know how to interpret this in the same way as your first example. – Felipe Aug 13 '19 at 17:50
  • The p-value is a measure of how inconsistent the observations are with your assumptions, with the thinking being that if the observations are inconsistent with your assumptions, the assumptions are probably wrong. What you're saying about 2% non-uniformity is incorrect. In fact, the ratios are the same in the 60-roll experiment as the 180-roll experiment, yet the larger sample size of 180 results in a much smaller chance of the observations simply being by bad luck. That's why the p-values are different despite the ratios being the same. (Think HHTH vs HHH HHH TTT HHH for flipping a coin.) – Dave Aug 13 '19 at 18:44
  • Your example is good. However, I guess I was looking for a measure that says if my data set is uniform vs non-uniform. The skewness that I mean is non-uniformity. – Felipe Aug 14 '19 at 06:23
  • @Felipe The test I described in my post concerns uniformity. – Dave Aug 14 '19 at 10:14