We have a dataset that lists employers in a fairly niche industry, along with the number of employees at each company and the breakdown of those employees as male or female.
The size range of the companies in this dataset is fairly broad, so we'd like to break the dataset into four "bins" by size (measured by number of employees) so we can easily see whether gender diversity varies significantly with company size. For example, we want to examine whether larger or smaller companies in this industry have a more gender-diverse workforce on average.
The debate we are currently having is about the most statistically sound way to break the dataset up into four "bins" for this type of analysis.
The strategy we are currently using is to choose the cut points based on the number of employees plus what we know about the industry. For example, one bin is made up of companies with more than 700 employees, which we felt is a group of firms that are roughly equivalent in size, market reach, and capability, and that are often viewed as reasonable substitutes for one another.
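To make that concrete, here's a minimal pandas sketch of the kind of binning we're doing. The column names and the numbers in the toy data are made up for illustration, and the only cut point taken from our actual scheme is the 700-employee threshold; the other breaks are placeholder round numbers.

```python
import pandas as pd

# Toy stand-in for our survey data; column names and values are made up.
df = pd.DataFrame({
    "employees":        [45, 120, 380, 950, 1500, 60, 210, 800],
    "female_employees": [20,  42, 190, 360,  700, 33,  84, 260],
})
df["pct_female"] = df["female_employees"] / df["employees"]

# Domain-informed cut points: the top bin is "more than 700 employees";
# the other breaks are placeholder round numbers for illustration.
size_breaks = [0, 100, 300, 700, float("inf")]
df["size_bin"] = pd.cut(df["employees"], bins=size_breaks,
                        labels=["small", "medium", "large", "largest"])
```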
Using this strategy, we end up with four "bins" containing somewhat similar numbers of companies. But because there are more small companies than large ones (as in most industries), the bin with the smallest companies is somewhat larger than the rest, by about 30 companies, and the other bins differ from one another by roughly 10-15 companies.
Basically, what we want to know is whether selecting bin boundaries this way is statistically bogus. Does the number of companies in each bin have to be the same? Is it okay that part of our consideration is also choosing round cut points for sanity's sake? If we instead break things up so the bins contain exactly the same number of companies, we get awkward employee ranges (firms with 109-346 employees, for example), which seems harder than necessary to explain, and I worry that we're lumping together companies that shouldn't necessarily be lumped together.
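For comparison, the equal-count alternative we've been debating would look roughly like this (same kind of made-up toy data; in practice the quartile boundaries are picked from the data, which is where ranges like 109-346 come from):

```python
import pandas as pd

# Same kind of toy data as above; column names and values are hypothetical.
df = pd.DataFrame({
    "employees":  [45, 120, 380, 950, 1500, 60, 210, 800],
    "pct_female": [0.44, 0.35, 0.50, 0.38, 0.47, 0.55, 0.40, 0.33],
})

# Equal-count alternative: four quartile bins by employee count.
df["quartile_bin"] = pd.qcut(df["employees"], q=4,
                             labels=["Q1", "Q2", "Q3", "Q4"])

# Either way, the comparison we actually care about is roughly this:
print(df.groupby("quartile_bin", observed=True)["pct_female"].mean())
```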
One possible complication: the dataset is based on survey data. The response rate seems unusually good compared with other reports I've seen for this industry, especially among the largest companies, though it's more variable among the smaller ones, if that matters.