
Some background to my problem: I use survey data on firms, where I want to measure the relationship between a binary variable (perceived growth barriers) and firm size. However, I cannot treat firm size as continuous; instead, I need to categorize firms. For this, I have chosen to categorize them based on their statistical relationship to the dependent variables.

My approach to fitting categories has been to run OLS regressions in which I try out dummy variables for all consecutive firm-size intervals (250 regressions per round). I have then defined the first size category as the interval whose dummy gives the highest $R^2$, after which I have repeated the process on the remaining sizes until all of them are categorized.

However, my data exhibits high variance among larger firms, which means that I cannot use $R^2$ alone, as it would end up creating overly wide categories. Therefore, I have also weighted each $R^2$ by the estimated kernel density at the point where the category ends (e.g., a category containing firms with 4-16 employees would be weighted by the kernel density at 16 employees). This was done to "slow down" the algorithm and force it to include influential groups that are relevant to my research.
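To make this concrete, here is a rough sketch of one round of the search (Python, with simulated data; `size` and `barrier` are invented names standing in for my survey variables):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Simulated stand-ins for the survey variables: firm size in employees and
# a binary "perceived growth barrier" indicator.
size = rng.integers(1, 251, size=1000)
barrier = (rng.random(1000) < 1 / (1 + np.exp(0.01 * size - 0.5))).astype(int)

# Kernel density of firm size, used to weight each candidate upper bound.
kde = gaussian_kde(size)

lower = int(size.min())        # the first category starts at the smallest firm
best_cut, best_score = None, -np.inf

# One "round": try every upper bound for the first size category, regress the
# barrier dummy on a dummy for that interval with OLS, and score the fit by
# R^2 weighted by the kernel density at the interval's upper end.
for upper in range(lower, int(size.max())):
    in_category = ((size >= lower) & (size <= upper)).astype(float)
    X = sm.add_constant(in_category)
    r_squared = sm.OLS(barrier, X).fit().rsquared
    score = r_squared * kde(upper)[0]
    if score > best_score:
        best_cut, best_score = upper, score

print(f"first category: {lower}-{best_cut} employees (score {best_score:.4f})")
```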

However, this kernel-weighting step was an ad-hoc solution, not grounded in previous research (I could find none on creating categories this way).

My question is now:

Are there any alternative model-fit measures to $R^2$ that are less sensitive to heteroscedasticity in the data? (Ideally, a measure that would not require kernel density weights to solve the above problem.)

Alternatively, do you have any suggestions on improvements or alternative approaches to solving this issue?

user216262
  • How about trying the prediction error after splitting your dataset into test and training sets? – ERT Jul 31 '18 at 10:54
  • Do you have a "ground truth" you can learn, i.e., are doing supervised classification? Or do you just want to cluster your firms, i.e., unsupervised learning? – Stephan Kolassa Jul 31 '18 at 11:13
  • I just want to cluster the firms in the most efficient way possible (wrt. explaining variations in growth barriers). The end result is a coefficient plot with Prob(Growth barrier = 1) on the Y-axis and firm size on the X-axis. – user216262 Jul 31 '18 at 13:10
  • @ERT - Kind of like running a monte carlo to evaluate the model performance? – user216262 Jul 31 '18 at 13:11
  • http://backtestingblog.com/glossary/out-of-sample-testing/ Check out "out-of-sample testing" on Google. It is a way to (i) avoid overfitting, and (ii) directly compare the ability of different analyses (see the sketch after these comments for one way to set this up). – ERT Jul 31 '18 at 13:20
  • So, *do* you have a ground truth you can train by? To be honest, I'm confused by what you are aiming to do. You start off by writing about a relationship between growth barriers and firm size, but then you want to categorize firm sizes. At some point, dummies come in (what dummies?). What do you want to do, based on what? – Stephan Kolassa Jul 31 '18 at 13:24
  • @StephanKolassa I apologize for being unclear. You are correct; I want to trace the specific relationship between growth barriers and firm size. To do so, however, I need to identify firm size categories that efficiently explain this relationship. The end result that I want is to say that "Firms with XX-YY employees are the most likely to face growth barrier X" – user216262 Jul 31 '18 at 13:34
  • OK, thank you. Why can't you use firm size as a continuous predictor? – Stephan Kolassa Jul 31 '18 at 13:36
  • @StephanKolassa I want to explore the specific non-linear relationship between these variables; however, the nature of the data means that I cannot estimate polynomials of firm size directly. This comes, in turn, because the dependent variables are binary, so I am restricted to a probit/logit model, and marginal effects estimates cannot deliver anything else than a linear measure. This then led me to using dummy variables instead. A bit of a lengthy answer, hope that it is readable. – user216262 Jul 31 '18 at 13:53
  • Thanks. I think I can now provide an answer below. Hope it helps. – Stephan Kolassa Jul 31 '18 at 14:07
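Following up on the out-of-sample-testing suggestion in the comments, here is a minimal sketch of such a comparison (Python/scikit-learn; the arrays `size` and `barrier` are simulated placeholders for the survey data, and the two candidate sets of cut points are arbitrary examples):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(1)

# Simulated stand-ins for the survey variables.
size = rng.integers(1, 251, size=1000).astype(float)
barrier = (rng.random(1000) < 1 / (1 + np.exp(0.01 * size - 0.5))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    size.reshape(-1, 1), barrier, test_size=0.3, random_state=0)

# Two candidate categorizations of firm size, compared by held-out log-loss.
candidate_bins = {
    "cuts at 10, 50":  [0, 10, 50, 250],
    "cuts at 20, 100": [0, 20, 100, 250],
}
for name, edges in candidate_bins.items():
    # One-hot encode the size categories implied by this set of cut points.
    cats_train = np.digitize(X_train.ravel(), edges[1:-1])
    cats_test = np.digitize(X_test.ravel(), edges[1:-1])
    enc_train = np.eye(len(edges) - 1)[cats_train]
    enc_test = np.eye(len(edges) - 1)[cats_test]
    model = LogisticRegression().fit(enc_train, y_train)
    p = model.predict_proba(enc_test)[:, 1]
    print(f"{name}: test log-loss = {log_loss(y_test, p):.4f}")
```

The specification with the lower held-out log-loss is the better-supported one, which gives a direct comparison that does not rely on in-sample $R^2$.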

2 Answers


You have a binary outcome ("experiences growth barrier" - yes/no), and a continuous predictor (firm size). You suspect a nonlinear relationship between the two.

Your best bet is a standard logistic regression. In order to model potential nonlinearities, do not feed firm size into the logistic regression as-is. Rather, transform it using splines.

In a comment, you write:

marginal effects estimates cannot deliver anything else than a linear measure

This is incorrect: just use splines. These work for logistic regression just as well as for "vanilla" OLS. I have used splines to model nonlinearities in logistic regression models (regressing the likelihood of developing PTSD on spline-transformed traumatic event load, Kolassa et al., 2010, J Clin Psych, and likewise the likelihood of spontaneous remission, Kolassa et al., 2010, Psych Trauma).

I very much recommend Frank Harrell's Regression Modeling Strategies on splines.

Do not use discretization to model nonlinearities, since the discretization will introduce discontinuities that are typically spurious. (In your specific scenario, discontinuities may actually be valid for regulatory reasons; e.g., certain regulations on reporting or employee protection may only apply to firms of a certain size. If something like this is pertinent, add one or more Boolean indicator variables.)
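For concreteness, here is a minimal sketch of what this could look like in Python with statsmodels and patsy (the column names `size` and `barrier` and the simulated data are placeholders for your survey; `df=4` is an arbitrary choice of spline flexibility):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Placeholder data: replace with the survey columns for firm size and the
# binary growth-barrier indicator.
df = pd.DataFrame({"size": rng.integers(1, 251, size=1000)})
df["barrier"] = (rng.random(1000)
                 < 1 / (1 + np.exp(0.01 * df["size"] - 0.5))).astype(int)

# Logistic regression with a natural (restricted) cubic spline basis for
# firm size; cr() comes from patsy and is available inside the formula.
fit = smf.logit("barrier ~ cr(size, df=4)", data=df).fit()
print(fit.summary())

# Predicted probability of facing a growth barrier across the size range,
# which traces the nonlinear curve without imposing arbitrary categories.
grid = pd.DataFrame({"size": np.arange(1, 251)})
grid["p_barrier"] = fit.predict(grid)
print(grid.head())
```

The predicted-probability curve over the size grid gives you the Prob(growth barrier = 1) versus firm size plot you describe, without having to fix category boundaries first.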

Stephan Kolassa
  • Thank you Stephan for your detailed answer, this does indeed sound like a potential way forward! I will start by looking into Frank Harrell's work and familiarize myself with splines. – user216262 Jul 31 '18 at 14:13
  • Dear Stephan, I am currently running regressions where I impose linear splines at different firm sizes, and picking the one with the lowest AIC value to be the inflection point for each corresponding dependent variable. Then I re-do the process with the first inflection point fixed to find the second one, and so on. Does that sound like a reasonable strategy to you? (Given that I don't overfit it, of course.) – user216262 Aug 01 '18 at 13:45
  • Hm. Redoing analyses does sound a bit like overfitting. I'd recommend using restricted cubic splines with knots set at specific percentiles (see the sketch after these comments). Harrell's book offers a table with very helpful rules of thumb. I'd go with these, unless I *really* knew what I was doing. – Stephan Kolassa Aug 01 '18 at 16:13
  • Thank you Stephan, I will look into Harrell's work more closely! – user216262 Aug 06 '18 at 06:26
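Following up on the comment about restricted cubic splines with pre-specified knots, here is a sketch of fixing the knots at percentiles of firm size (Python with statsmodels/patsy; the 5th/35th/65th/95th percentiles are one four-knot layout of the kind Harrell's rules of thumb suggest, and the data are again simulated placeholders):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Placeholder data again; swap in the real survey columns.
df = pd.DataFrame({"size": rng.integers(1, 251, size=1000)})
df["barrier"] = (rng.random(1000)
                 < 1 / (1 + np.exp(0.01 * df["size"] - 0.5))).astype(int)

# Knots at fixed percentiles of firm size (a four-knot layout; check the
# exact percentiles against Harrell's table before relying on them).
knots = np.percentile(df["size"], [5, 35, 65, 95])

# Restricted (natural) cubic spline with the knots fixed in advance, so
# their placement is not re-tuned on the data in repeated rounds.
fit = smf.logit("barrier ~ cr(size, knots=knots)", data=df).fit()
print(fit.summary())
```

Fixing the knots in advance avoids re-searching their locations round after round, which is where the overfitting risk in the AIC-based search comes from.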

An alternative approach would be to take the log of firm size instead of using it raw. That way, the differences between larger sizes become smaller.

After that, my hunch is that you will be able to simplify your classification algorithm, since your data will have been preprocessed to be more homoscedastic.
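For example (a tiny sketch with made-up firm sizes):

```python
import numpy as np

# Made-up firm sizes in employees. The raw gap between 50 and 250 employees
# dwarfs the gap between 1 and 4; on the log scale the spacing is much more even.
size = np.array([1, 4, 16, 50, 250])
log_size = np.log(size)       # roughly [0.00, 1.39, 2.77, 3.91, 5.52]
print(log_size)
```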

danuker