I cannot decide if this question is patently silly or actually deep, so I figured that makes it a perfect question for Cross Validated.
We have a dataset with N values that is not normally distributed; assume it's just heavily right-skewed. We are using two columns of this dataset, x and y, to partition the data into "buckets" based on whether each element falls above or below the median values of x and y.
So, as an example: an element a that has x[a] >= median(x) and y[a] < median(y) would fall into one "bucket" (greater than or equal to median x, less than median y), whereas an element b that has x[b] < median(x) and y[b] < median(y) would be in another "bucket" (less than median x, less than median y).
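To make the bucketing concrete, here is a minimal sketch in Python with NumPy. The function name bucket_shares and the simulated lognormal columns are illustrative assumptions, not our actual data or code; the sketch only shows how the four percentages are computed.

```python
import numpy as np

def bucket_shares(x, y):
    """Return the fraction of elements in each of the four median-split buckets.

    Hypothetical helper for illustration: an element counts as "high" on a
    column when its value is >= that column's median, matching the rule above.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_hi = x >= np.median(x)
    y_hi = y >= np.median(y)
    return {
        "x>=med, y>=med": float(np.mean(x_hi & y_hi)),
        "x>=med, y<med": float(np.mean(x_hi & ~y_hi)),
        "x<med, y>=med": float(np.mean(~x_hi & y_hi)),
        "x<med, y<med": float(np.mean(~x_hi & ~y_hi)),
    }

# Assumed stand-in data: two right-skewed columns drawn independently.
rng = np.random.default_rng(0)
x = rng.lognormal(size=1000)
y = rng.lognormal(size=1000)

for bucket, share in bucket_shares(x, y).items():
    print(f"{bucket}: {share:.1%}")
```

By construction, roughly half the elements are "high" on x and roughly half are "high" on y; the question is about how those halves overlap across the four buckets.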
Imagine we do this for the entire dataset. Is there any reason to assume that the percentages that would fall into these four "buckets" should be balanced? I'm having a tough time finding an argument - either proof-oriented or intuitive - to say that we should have no expectation of balance in this scenario. Any help would be appreciated!