
I cannot decide whether this question is patently silly or actually deep, so I figured that makes it a perfect question for Cross Validated.

We have a dataset with N values that is not normally distributed; assume it's just heavily right-skewed. We are using two columns of this dataset, x and y, to partition the data into "buckets" according to whether each element lies above or below the median values of x and y.

So, as an example: an element a with x[a] >= median(x) and y[a] < median(y) would fall into one "bucket" (greater than or equal to median x, less than median y), whereas an element b with x[b] < median(x) and y[b] < median(y) would be in another "bucket" (less than median x, less than median y).

Imagine we do this for the entire dataset. Is there any reason to assume that the percentages that would fall into these four "buckets" should be balanced? I'm having a tough time finding an argument - either proof-oriented or intuitive - to say that we should have no expectation of balance in this scenario. Any help would be appreciated!

Kyle Shank

1 Answer


In general, no. The medians split the data in half along the marginal distributions of $x$ and $y$. The four quadrants will be evenly populated in expectation (about 25% each, up to sampling variation) if $x$ and $y$ are independent, but if they are not independent then this isn't guaranteed.
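To illustrate the independent case with skewed data like that in the question, here is a minimal sketch. The log-normal and exponential distributions, the sample size, and the seed are assumptions chosen only for illustration; they do not come from the question.

```r
set.seed(1)        # arbitrary seed, only for reproducibility
n <- 10000
x <- rlnorm(n)     # right-skewed (log-normal), assumed for illustration
y <- rexp(n)       # right-skewed (exponential), drawn independently of x

# Proportion of points in each of the four median-defined buckets
quad <- table(xhalf = x >= median(x), yhalf = y >= median(y))
prop <- prop.table(quad)
print(round(prop, 3))
```

Each proportion should come out close to 0.25, since skewness in the marginals does not matter here; only the dependence between $x$ and $y$ does.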

I don't have a proof on hand, but it's straightforward to find counter-examples. For instance, here are the results when we make $x$ and $y$ moderately correlated:

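x <- rnorm(1000)
y <- rnorm(1000)/2 + x/2
# cross-tabulate which half of each marginal distribution every point falls in
table(data.frame(xhalf = x > median(x), yhalf = y > median(y)))
#        yhalf
# xhalf   FALSE TRUE
#   FALSE   383  117
#   TRUE    117  383
plot(x,y)
abline(v = median(x))  # vertical line at the median of x
abline(h = median(y))  # horizontal line at the median of y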

[Figure: scatter plot of y against x with lines at the marginal medians]
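The symmetry of the table is not a coincidence. As a sketch, assume continuous marginals (no ties at the medians) and write $m_x$ and $m_y$ for the two medians and $p$ for the probability of the upper-right quadrant (this notation is introduced here for illustration). Because each median cuts its own marginal in half, the four probabilities are tied together:

$$
\begin{aligned}
P(x > m_x,\ y > m_y) &= p, \\
P(x > m_x,\ y \le m_y) &= \tfrac12 - p, \\
P(x \le m_x,\ y > m_y) &= \tfrac12 - p, \\
P(x \le m_x,\ y \le m_y) &= 1 - p - 2\left(\tfrac12 - p\right) = p.
\end{aligned}
$$

So the whole partition is determined by a single number $p \in [0, \tfrac12]$, and the buckets are balanced only when $p = \tfrac14$, which is what independence gives ($\tfrac12 \cdot \tfrac12$). For a bivariate normal pair with correlation $\rho$, the standard orthant-probability result is

$$
p = \frac14 + \frac{\arcsin \rho}{2\pi}.
$$

In the simulation above, $\operatorname{Cov}(x, y) = \tfrac12$ and $\operatorname{Var}(y) = \tfrac14 + \tfrac14 = \tfrac12$, so $\rho = 1/\sqrt{2} \approx 0.71$, $\arcsin \rho = \pi/4$, and $p = \tfrac14 + \tfrac18 = \tfrac38$. That predicts about 375 of 1000 points in each diagonal cell and 125 in each off-diagonal cell, in line with the observed 383 and 117.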

This is related to the use of copula plots: instead of plotting $x$ against $y$ directly, one plots their empirical distribution function values (in effect, their rescaled ranks). Since the resulting distribution, a copula, has uniform marginal distributions, this is useful for examining the dependence between the variables separately from the shapes of their marginals.
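A minimal sketch of such a plot, reusing the construction from the example above. The seed, the `n + 1` rescaling of the ranks, and the variable names `u`, `v`, and `quad_counts` are choices made here for illustration rather than part of the original answer.

```r
set.seed(1)                   # arbitrary seed, only for reproducibility
n <- 1000
x <- rnorm(n)
y <- rnorm(n) / 2 + x / 2     # same construction as in the example above

# Pseudo-observations: ranks rescaled into (0, 1), i.e. empirical CDF values
u <- rank(x) / (n + 1)
v <- rank(y) / (n + 1)

plot(u, v, xlab = "rank(x) / (n + 1)", ylab = "rank(y) / (n + 1)")
abline(v = 0.5, h = 0.5)      # the medians of x and y map to 0.5 on each axis

# On this scale the four "buckets" are the four quarters of the unit square
quad_counts <- table(uhalf = u > 0.5, vhalf = v > 0.5)
print(quad_counts)
```

Because ranks preserve order, the counts in the four quarters of the unit square match the median-split table computed on the original scale; the transformation only removes the marginal shapes.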

    For an extreme example, check out the upper right distribution in the first figure at https://stats.stackexchange.com/a/30205/919. There, the marginals are both Normal but there is the most extreme disparity in the partitioning by the marginal medians. – whuber May 26 '21 at 19:05