2

I have a logit model aiming to explain a binary variable. I am wondering whether I have included a size variable incorrectly.

I have included the farm size in hectares (farmsize), along with a dummy variable, largefarm, which is 1 when the farm size is above 20ha and 0 otherwise. I have also included an interaction between the two. There is a regulation that causes a change in the farmsize coefficient at 20ha. All three terms are significant.

Is it incorrect to use largefarm as it directly relates to the value of the farmsize variable? It seems incorrect to me but I'm not sure how to create a break in the continuous variable at this point.

Thank you for any assistance.

whuber
Paula

4 Answers

2

The pertinent question here is: is it reasonable to assume that the impact (i.e., the slope) of farm size on the outcome is the same across all levels of farm size?

Based on your description, the answer to the above question is: probably not, because a regulation applies once farm size exceeds 20ha. This justifies including the interaction term to test what impact the change in regulation has (if any) on the outcome.

For interpretation, let’s start with a model that contains only one variable: farm size. $$ \operatorname{logit}(p) = b_0 + b_1 x $$

Here, the interpretation of the coefficient is straightforward: each one-unit increase in farm size changes the estimated log odds of the outcome by $b_1$.

Now, here’s the model with the interaction term included, where $x_{20}$ is the large-farm dummy (1 when farm size is above 20ha, 0 otherwise): $$ \operatorname{logit}(p) = b_0 + b_1 x + b_2 (x \cdot x_{20}) $$ The interpretation of $b_2$ is as follows: when the farm size is greater than 20ha, each one-unit increase in farm size changes the estimated log odds of the outcome by $b_2$, in addition to the baseline change of $b_1$.
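To make the two slopes concrete, here is a minimal Python sketch of the linear predictor from the interaction model. The coefficient values and the function name `log_odds` are made up for illustration; they are not estimates from any data.

```python
def log_odds(x, b0=-1.0, b1=0.05, b2=0.03, threshold=20.0):
    """Linear predictor b0 + b1*x + b2*(x * x20), with x20 = 1 if x > threshold.

    All coefficient values are hypothetical.
    """
    x20 = 1.0 if x > threshold else 0.0
    return b0 + b1 * x + b2 * (x * x20)


# Change in log odds per extra hectare, below and above the threshold.
slope_small = log_odds(11.0) - log_odds(10.0)  # equals b1
slope_large = log_odds(31.0) - log_odds(30.0)  # equals b1 + b2

# With no main effect for the dummy, the log odds also jump by
# roughly b2 * threshold when a farm crosses 20 ha.
jump = log_odds(20.0 + 1e-9) - log_odds(20.0)
```

Under these assumptions the slope is $b_1$ for small farms and $b_1 + b_2$ for large farms, but the fitted line is also forced to jump by about $20 b_2$ at the threshold. Adding the dummy's own coefficient lets the size of that jump be estimated separately from the change in slope.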

You should ensure that including the interaction term does not introduce a high degree of multicollinearity into the model. You can run some multicollinearity diagnostics to check this. (Check out this answer if you need help.)
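As a crude first look (variance inflation factors are the more usual diagnostic), one can check how strongly farm size correlates with the interaction term. The sketch below uses a hand-written Pearson correlation and a made-up list of farm sizes; `sizes`, `large`, and `interaction` are hypothetical names, not the asker's data.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical farm sizes in hectares (illustration only).
sizes = [2, 5, 8, 12, 15, 18, 22, 25, 30, 40, 55, 70]
large = [1 if s > 20 else 0 for s in sizes]           # largefarm dummy
interaction = [s * d for s, d in zip(sizes, large)]   # farmsize * largefarm

r_size_interaction = pearson(sizes, interaction)
r_dummy_interaction = pearson(large, interaction)
```

On this toy sample the correlation between farm size and the interaction term is above 0.9. Whether that degree of overlap actually inflates the standard errors enough to matter is something the real data have to answer.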

Another thing to check is whether including $x_{20}$ itself, in addition to the interaction term, improves model performance.

Vishal
  • 1
    If you fit the model with farm size, an indicator for large farm, and their interaction, it may be helpful to parameterise farm size to be zero at 20 ha (subtract 20) so that the coefficient for the large-farm indicator is the difference at the break. If you want the two lines to meet at 20 ha, then you need a broken stick model (linear spline with one knot). – mdewey Apr 11 '16 at 13:21
  • Normally I would also include $ x_{20} $, but in this case the dummy variable used in the interaction is intended to produce a change in the marginal effect that is not caused by an actual interaction with another variable. – Fuca26 Apr 11 '16 at 13:45
1

I would be very careful about using interaction terms in non-linear models (in this case the logit model). In non-linear models there is already a sort of "interaction" between all the variables: the marginal effect of any variable depends on the values of all the other variables, and on the value of that variable itself. For more detail, see Berry, W. D., Golder, M., & Milton, D. (2012). Improving tests of theories positing interaction. Journal of Politics, 74(3), 653-671.

Have you thought of using the Regression Kink Model? I believe it could be well suited to your analysis. That said, I do not know your precise reasons for using a logit model, so do not give this suggestion too much weight.
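The point about marginal effects in a logit model can be illustrated with a small Python sketch. It assumes a model with farm size `x` and one other covariate `z`, and no interaction term at all; the coefficients `B0`, `B1`, `B2` are invented for the example.

```python
import math

# Hypothetical coefficients for logit(p) = B0 + B1*x + B2*z (no interaction term).
B0, B1, B2 = -2.0, 0.08, 1.0


def prob(x, z):
    """Predicted probability from the logit model."""
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * x + B2 * z)))


def marginal_effect_of_x(x, z):
    """dp/dx = B1 * p * (1 - p); it varies with both x and z."""
    p = prob(x, z)
    return B1 * p * (1.0 - p)


# Same coefficient B1, different effects on the probability scale:
me_z0 = marginal_effect_of_x(10.0, 0.0)
me_z1 = marginal_effect_of_x(10.0, 1.0)
me_x25 = marginal_effect_of_x(25.0, 0.0)
```

Even though the model contains no product term, the effect of one more hectare on the probability differs between `z = 0` and `z = 1` and between `x = 10` and `x = 25`. This is why a significant interaction coefficient on the log-odds scale does not translate directly into a statement about interaction on the probability scale.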

Fuca26
0

This largely depends on the goals of your analysis. If your goal is simply to make good predictions, then what you have done isn't problematic. However, if you want to interpret the coefficients, you have a problem on your hands due to the multicollinearity you've likely induced, and it would be best to simply include the farm size in hectares and drop the large farm size indicator variable.

StatsStudent
  • 1
    Thank you for your quick answer! I am looking to interpret. I previously had farmsize and its square; I think I'll switch back to that. Much easier to interpret as well. – Paula Apr 10 '16 at 19:39
  • 1
    This answer seems misleading, if not actually incorrect. The stated purpose is to handle a break in the "farmsize coefficient." The interaction does precisely that. Whether it introduces a problematic amount of collinearity is a matter to be decided by the data, but recommending dropping it is tantamount to saying that it's impossible to implement the desired model. Surely there is a better solution. – whuber Apr 10 '16 at 19:46
  • Maybe I'm misunderstanding the OP's question. It sounds like he has included two variables: a binary large/small indicator variable AND farmsize. Why would you want to include both of these -- it's essentially redundant information? I'm not sure what you mean by a "break in the farmsize coefficient." – StatsStudent Apr 10 '16 at 20:24
-1

What you're trying to do can be done properly with linear splines; see the Stata examples of the mkspline function here. For instance, you introduce a knot point $s_0=20$, then get two new variables for size $s$:

$$x_1=\min(s,s_0)$$ $$x_2=\max(s,s_0)$$

or the alternative definition $$x'_2=\max(s,s_0)-s_0$$

This way you get a model $$y=\beta_0+\beta_1x_1+\beta_2x_2+e$$ This model has different slopes for small farms ($\beta_1$) and large farms ($\beta_2$), which is exactly what you need. Note that there are a couple of ways to construct linear splines, and they give slightly different interpretations of the slopes.
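A minimal Python sketch of this construction, using the alternative definition $x'_2=\max(s,s_0)-s_0$ for the second variable. The function names and coefficient values are hypothetical and chosen only to show the mechanics.

```python
def spline_basis(s, knot=20.0):
    """Split farm size s into two linear-spline variables around the knot."""
    x1 = min(s, knot)          # grows with s up to the knot, then stays at the knot
    x2 = max(s, knot) - knot   # zero up to the knot, then grows with s
    return x1, x2


def linear_predictor(s, beta0=-1.0, beta1=0.05, beta2=0.08, knot=20.0):
    """beta0 + beta1*x1 + beta2*x2 with hypothetical coefficients."""
    x1, x2 = spline_basis(s, knot)
    return beta0 + beta1 * x1 + beta2 * x2


# Slope per hectare on each side of the knot.
slope_below = linear_predictor(11.0) - linear_predictor(10.0)  # beta1
slope_above = linear_predictor(31.0) - linear_predictor(30.0)  # beta2

# The two segments meet at the knot: no jump at 20 ha.
gap_at_knot = linear_predictor(20.0 + 1e-9) - linear_predictor(20.0)
```

With this basis the two line segments join at the knot, so the model allows a change in slope at 20ha but not a jump in level. In a logit model the same two variables would enter the linear predictor for the log odds.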

What you're doing now is not good; it introduces collinearity for no reason. Collinearity causes certain issues with interpretation of the data.

UPDATE On the multicollinearity issue with splines. First, there's no perfect multicollinearity with a linear spline: $x_1$ is not a linear combination of $x_2$ and the other variables. So, there's no identification issue.

Second, although we split $s$ into two variables $x_1,x_2$, we don't really consider them as separate variables. It's just a technical trick to introduce nonlinearity into the model: it allows different slopes in the two sections of the variable.

Aksakal
  • Note that this makes the lines join at s0 which may not be what the OP intended. Otherwise of course it is fine. – mdewey Apr 11 '16 at 14:06
  • @mdewey, yes, that Stata's mkspline method has two options, in one of them the lines do not join, something like $max(s,s_0)-s_0$ – Aksakal Apr 11 '16 at 14:08
  • @Aksakal, wouldn't $x_1$ and $x_2$ still have some level of collinearity? – Vishal Apr 13 '16 at 21:52
  • I like the idea of splines instead of interaction term, but it's not clear how that solution is preferable in light of multicollinearity. – Vishal Apr 19 '16 at 12:14
  • There's no multicollinearity problem, because you can't get $x_1$ as a combination of $x_2$ and other variables. The only issue here is if $\beta_1\approx\beta_2$, i.e. the slopes are very similar. – Aksakal Apr 19 '16 at 13:36