We have a dataset that lists employers in a fairly unique industry, along with each employer's number of employees and the breakdown of those employees as male or female.

The size range of the companies in this dataset is fairly broad, so we’d like to break the dataset into four “bins” by size (number of employees) to see easily whether gender diversity varies significantly with company size. We want to examine, for example, whether larger or smaller companies in this industry have a more gender-diverse workforce on average.

What we are currently debating is the best, most statistically sound way to break the dataset into four “bins” for this type of analysis.

The strategy we are currently using is to break up the dataset based on the number of employees and also what we know about the industry. So for example, one bin is made up of companies with, say, more than 700 employees, which we felt was a bin composed of firms that are roughly equivalent in terms of size, market reach, and capability, and are often viewed as reasonable substitutes for each other.

Using this strategy, we have created four “bins” with somewhat similar numbers of companies in each. However, because there are more small companies than large ones (as in most industries), the bin with the smallest companies is a bit larger than the rest (by about 30 companies). All the bins also vary in size to some extent, by roughly 10 to 15 companies.

Basically, what we want to know is whether selecting bin boundaries this way is statistically bogus. Does the number of companies in each bin have to be the same? Is it OK that part of our consideration is choosing round numbers for sanity's sake? If we try to break things up so that the bins have exactly the same number of companies, we get odd employee ranges (firms with 109-346 employees, for example), which seem harder than necessary to explain, and sometimes I worry that we're lumping together companies that shouldn't necessarily be lumped together.
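To make the two options concrete, here is a minimal sketch of both binning schemes using only the Python standard library. The employee counts are made up for illustration and are not our survey data. Of the round-number thresholds, only 700 comes from the description above; 50 and 200 are placeholders, and the helper name `assign_bins` is hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

# Hypothetical employee counts, skewed toward small firms (not the survey data).
employees = [12, 18, 25, 40, 55, 80, 109, 150,
             220, 346, 500, 720, 900, 1500, 3200, 8000]


def assign_bins(values, cuts):
    """Return a bin index (0..len(cuts)) for each value.

    A value exactly equal to a cut point stays in the lower bin, so the top
    bin here means "more than 700 employees".
    """
    return [bisect_left(cuts, v) for v in values]


# Scheme A: round-number thresholds chosen from industry knowledge.
# Only 700 is taken from the text; 50 and 200 are placeholder assumptions.
round_cuts = [50, 200, 700]

# Scheme B: quartile cut points, which give (near-)equal company counts per bin
# but usually land on awkward, non-round boundaries.
quartile_cuts = quantiles(employees, n=4)

round_counts = Counter(assign_bins(employees, round_cuts))
quartile_counts = Counter(assign_bins(employees, quartile_cuts))

print("round cuts:   ", round_cuts, dict(sorted(round_counts.items())))
print("quartile cuts:", quartile_cuts, dict(sorted(quartile_counts.items())))
```

On this toy data the quartile scheme puts the same number of firms in every bin, while the round-number scheme gives unequal counts but boundaries that are easier to explain. The sketch only shows the trade-off; it does not say which scheme is statistically preferable.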

The dataset is based on survey data, which may complicate things somewhat. The response rate seems unusually good compared with other reports I’ve seen for this industry, especially among the largest companies, but it is more varied among smaller ones, in case that matters.

kjetil b halvorsen
Jack1234
  • Why bin at all? Your variable #employees is part of the data. No need to degrade it. Relationships between diversity (does this just mean % males or % females or something more complicated) and firm size will be noisy, but binning makes identifying underlying relationships all but impossible. Plot and then consider a smooth curve. – Nick Cox Jul 12 '17 at 16:24
  • "fairly unique": all my English teachers are sending distress signals from their current resting places. – Nick Cox Jul 12 '17 at 16:25
  • Interesting, thank you! And fair enough on "fairly unique"... – Jack1234 Jul 12 '17 at 22:51
  • See [What is the benefit of breaking up a continuous predictor variable?](https://stats.stackexchange.com/q/68834/17230). Also bear in mind that you might expect variability in the proportion of male employees to be higher among smaller companies. – Scortchi - Reinstate Monica Jul 17 '17 at 10:31
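
The last comment's point about variability can be illustrated with a rough sketch. It assumes a simple binomial model, in which each of a firm's n employees is independently male with the same probability p; that is a simplifying assumption, not something established for this dataset. Under that model the standard error of a firm's observed proportion is sqrt(p(1 - p) / n), so it shrinks as the firm grows. The function name `proportion_se` is hypothetical.

```python
from math import sqrt


def proportion_se(p, n):
    """Standard error of an observed proportion under a binomial model.

    p: assumed underlying proportion (e.g. of male employees)
    n: number of employees at the firm
    """
    return sqrt(p * (1 - p) / n)


# Same assumed underlying proportion, different firm sizes (illustrative only).
for n in (10, 100, 1000):
    print(f"n = {n:>4}: SE = {proportion_se(0.5, n):.4f}")
```

A tenfold increase in firm size cuts the standard error by a factor of about 3.2 (the square root of 10), so under this model small firms would show a wider spread in observed proportions even if the underlying proportion were the same at every size.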
