
I have 100 unique joint probability mass functions (PMFs) and a dataset recording how many instances come from each joint PMF, like this:

The total number of instances in this case is 16,073. Each joint PMF looks something like this:

 {'F': 0.3, 'M': 0.7},
 {'0–18': 0.1,
  '19–25': 0.3,
  '26–35': 0.2,
  ...
 },
 {'African American': 0.13,
  'Asian': 0.2,
  'Caucasian': 0.6,
  ...
 }, ...

The attributes within each joint PMF can be assumed independent (i.e., the probability of an instance is the product of its marginal probabilities).
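To make the independence assumption concrete, here is a minimal sketch of how the probability of one instance could be computed. The helper name `joint_prob` is hypothetical, and the dictionaries only contain the categories shown above (the elided ones are left out, so they do not sum to 1).

```python
from math import prod

# Marginals copied from the example above; the elided ("...") categories are
# deliberately left out, so these dicts do not sum to 1.
marginals = [
    {'F': 0.3, 'M': 0.7},
    {'0–18': 0.1, '19–25': 0.3, '26–35': 0.2},
    {'African American': 0.13, 'Asian': 0.2, 'Caucasian': 0.6},
]

def joint_prob(marginals, instance):
    """Probability of one instance as the product of its marginal probabilities."""
    return prod(m[value] for m, value in zip(marginals, instance))

# One value per attribute, in the same order as the marginals.
p = joint_prob(marginals, ('F', '19–25', 'Asian'))  # 0.3 * 0.3 * 0.2
```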

I have used stratified sampling to represent each joint PMF in proportion to its number of instances in the dataset. For example, for a random sample of n = 100k, there are approximately 100k / 16.073k * 378 ≈ 2,352 instances from PMF 1, 959 instances from PMF 2, ... , 1,319 instances from PMF 100. This results in a dataset (with 100k rows) like this:
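The proportional allocation above can be sketched as follows. The function name `stratum_size` is hypothetical; only the figures for PMF 1 (378 out of 16,073) are taken from the text. Note that rounding each stratum separately may leave the total a few rows off from n.

```python
def stratum_size(count, total, n):
    """Expected number of sampled rows for a PMF that has `count` of `total` instances."""
    return n * count / total

# PMF 1: 378 of the 16,073 instances, sample size n = 100,000.
n1 = round(stratum_size(378, 16_073, 100_000))
```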

I'm trying to calculate the Jensen–Shannon distance between two datasets that look like the first embedded table, each with a different number of instances per PMF (but the same 100 unique PMFs). Since there is no closed-form solution for the JS distance between joint PMFs, I'm trying to implement a Monte Carlo approximation, following the linked approach.
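For reference, this is roughly what I mean for the single-PMF case: a minimal sketch that estimates the JS distance between two joint PMFs by sampling from each and averaging the log ratios against the midpoint distribution. The function names and the toy PMFs `p` and `q` are made up for illustration, and log base 2 is an assumption so that the distance lies in [0, 1].

```python
import math
import random

def joint_prob(marginals, x):
    """Product of marginal probabilities (independence assumption)."""
    return math.prod(m.get(v, 0.0) for m, v in zip(marginals, x))

def sample(marginals, n, rng):
    """Draw n instances by sampling each marginal independently."""
    cols = [rng.choices(list(m), weights=list(m.values()), k=n) for m in marginals]
    return list(zip(*cols))

def js_distance_mc(p, q, n=20_000, seed=0):
    """Monte Carlo estimate of the Jensen-Shannon distance (log base 2)."""
    rng = random.Random(seed)

    def half_kl(src, other):
        # Estimates E_src[log2(src(x) / m(x))], with m = (src + other) / 2.
        total = 0.0
        for x in sample(src, n, rng):
            a, b = joint_prob(src, x), joint_prob(other, x)
            total += math.log2(a / (0.5 * (a + b)))
        return total / n

    jsd = 0.5 * half_kl(p, q) + 0.5 * half_kl(q, p)
    return math.sqrt(max(jsd, 0.0))  # clamp tiny negative sampling noise

# Hypothetical toy PMFs, not taken from my data.
p = [{'F': 0.3, 'M': 0.7}, {'a': 0.5, 'b': 0.5}]
q = [{'F': 0.6, 'M': 0.4}, {'a': 0.5, 'b': 0.5}]
d = js_distance_mc(p, q)
```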

However, I'm stuck on how to do this, since each of my datasets is made up of multiple joint PMFs instead of a single joint PMF. Any ideas would be very helpful! Thank you so much!
