
So my situation is that I have a large set of events, each of which contains many variables (e.g. mass, length, momentum, colour, ...). This set of events can be divided into two categories according to the colour variable: every event is either "red" or "blue".

I want to use these events to train a neural net that will learn to guess the colour variable based on length, momentum and the other variables. However, I know for a fact that colour does not depend directly on length, although it may depend on the interaction between length and other variables. To ensure that the neural net does not learn an unwanted correlation, I want to reweight the "red" and "blue" data individually so that they have about the same distribution in length.

I am told not to duplicate events in the set, only to delete them. (I don't know why; is there some obvious reason for this?)

My plan at present is (a rough code sketch follows the list):

  1. Create a probability histogram of length for each colour with N bins (say 50).

  2. Take the sum of the absolute differences between corresponding bars as a measure of how different the histograms are; call this the 'distance'. I want to reduce this distance below some target, say 0.1.

  3. Find the bin in which the two bars differ the most.

  4. Delete one event contributing to the larger of the two bars.

  5. Check whether the distance between the histograms has dropped below the target; if so, stop. If not, return to step 2.
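
A minimal sketch of that loop in Python/NumPy, assuming the events sit in a pandas DataFrame with columns named `length` and `colour` (the function name, column names, bin count and target value are illustrative choices, not part of the original setup):

```python
import numpy as np


def prune_to_match(df, var="length", label="colour", n_bins=50, target=0.1, seed=0):
    """Delete events until the per-colour probability histograms of `var`
    differ by less than `target` (summed absolute bin-by-bin difference)."""
    rng = np.random.default_rng(seed)
    df = df.copy()
    # Fix the bin edges once, over the pooled data, so both histograms share bins.
    edges = np.histogram_bin_edges(df[var], bins=n_bins)

    while True:
        # Step 1: probability histogram of `var` for each colour.
        h_red, _ = np.histogram(df.loc[df[label] == "red", var], bins=edges)
        h_blue, _ = np.histogram(df.loc[df[label] == "blue", var], bins=edges)
        p_red = h_red / h_red.sum()
        p_blue = h_blue / h_blue.sum()

        # Steps 2 and 5: summed absolute difference between corresponding bars.
        distance = np.abs(p_red - p_blue).sum()
        if distance < target:
            return df

        # Step 3: the bin in which the two bars differ the most.
        i = int(np.argmax(np.abs(p_red - p_blue)))
        over = "red" if p_red[i] > p_blue[i] else "blue"

        # Step 4: delete one event contributing to the larger of the two bars.
        lo, hi = edges[i], edges[i + 1]
        mask = (df[label] == over) & (df[var] >= lo)
        if i == n_bins - 1:
            mask &= df[var] <= hi   # np.histogram's last bin is right-inclusive
        else:
            mask &= df[var] < hi
        in_bin = df[mask]
        df = df.drop(rng.choice(in_bin.index))
```

Calling `prune_to_match(events, target=0.1)` on such a DataFrame returns the pruned copy. Since each iteration deletes only a single event, a large set may need many passes; one obvious speed-up is to delete several events from the worst bin per pass.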

This is probably not an optimal solution. When searching for similar predicaments I have seen a lot about the Kolmogorov–Smirnov test; however, in my case the 'length' variable does not follow any known analytic distribution.
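
For reference, the two-sample form of the Kolmogorov–Smirnov test compares the two empirical samples to each other directly, so no analytic form for length is required. A minimal sketch with SciPy, where the two arrays are toy stand-ins for the real per-colour length values:

```python
import numpy as np
from scipy.stats import ks_2samp

# Toy stand-ins for the per-colour 'length' arrays (assumed, not real data).
rng = np.random.default_rng(0)
red_lengths = rng.normal(loc=1.0, scale=0.3, size=5000)
blue_lengths = rng.normal(loc=1.2, scale=0.3, size=5000)

# Two-sample KS test: compares the two empirical CDFs directly;
# no parametric model for 'length' is assumed.
result = ks_2samp(red_lengths, blue_lengths)
print(f"KS statistic = {result.statistic:.3f}, p-value = {result.pvalue:.3g}")
```

The KS statistic (the largest vertical gap between the two empirical CDFs) could also stand in for the binned 'distance' in step 2, which would remove the dependence on the bin count.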

My question is what concepts should I be reading about in order to figure out a better plan?
