Linear regression: how to treat an explanatory variable that is discrete but does not have a natural zero

Question

Background/study system:

One of my MS students is studying the biomechanics of strand breakage in Spanish moss (an epiphyte--or plant that lives on other plants). Spanish moss has strands that can grow into larger clumps of strands ("festoons")...where new strands originate from renewal shoots at each node. Old strands often form a wire support from which new strands grow. The number of nodes depends on the age of the strand, but she has found that most plants have 7 to 13 nodes (mode = 10, counting from the proximal point where it joins the "wire" to the distal end of the strand).

When a single strand is pulled, it always breaks at a node (never at the inter-node) and this appears to be an important mechanism by which Spanish moss reproduces (vegetatively): a strand can break off and (if it lands in a good spot) can keep growing. For each strand, she has made cuts at the midpoint around each node and examined material properties (such as "yield strength" and "work to breakage"). We want to regress these material properties against nodal position to determine if they change along a strand.

Statistical problem:

Ultimately, I would like to set this up as a mixed-effects model with nodal position as a fixed effect and strand as a random effect.

My question pertains to how the node variable should be treated in a regression analysis. Nodal position seems to be discrete (1,2,3...13) but it has no natural zero point, which would make it difficult to interpret the y-intercept. I have seen a lot of sources that discuss "truncated" variables and left-"censored" variables, but this seems to be a slightly different problem. It could also be argued that this is a form of an integer distribution without zero, but I cannot seem to find any discussion of such a probability distribution or how one would interpret regression results for such a distribution.

Alternatives:

My original thought was to treat this as an ordinal variable. My concern is that I am losing information and that the results will be harder to interpret because there will be a separate regression coefficient for each node. Adding random effects or accounting for a nonlinear response would make the results even more complicated.

I have also considered treating nodal position as a categorical variable with simple coding or something like Helmert coding, but with this approach, you are throwing away information about the order of the nodes and you again have lots of regression coefficients.

A compromise that has been suggested is defining some measure of top-middle-end, but this also comes with some loss of information and would require some sort of control for variation in strand length.

A previous study transformed the predictor by dividing the distance to the node (from the proximal end) by the total length of the strand. This simplifies the regression analysis and controls for strand length, but I am not sure if this is a good idea biologically since nodal properties for the same proportions may not be the same if the number of nodes per strand varies...and I am not sure if this is a good idea statistically since you are making a naturally discrete variable continuous and, if you transform it (as described), you end up with a disproportionate number of data points at 100%.

You might be (way) overthinking this. Simple, intuitive analogs of your situation (if I understand it correctly) would range from the number of stories in a building to the number of rings in a tree to the number of employees (including the owner) in a business. Interpreting the intercept is rarely an issue in models that use such explanatory variables--and if it is, one can always use the simple expedient of renumbering. The real issue would appear to concern *how* to model this variable--as ordinal or with splines or linearly--as you discuss. — whuber, Jul 29 '18 at 16:37
By renumbering, are you suggesting that the first node would be treated as zero? If the intercept were not important, is the simple answer to just remove it when fitting the model? Seems like a common problem (indeed), but one that has not been addressed! — coreydevinanderson, Jul 29 '18 at 16:47
Yes, it's common--and it has been extensively addressed. The question of removing an intercept is discussed in all multiple regression textbooks and has been discussed in [several threads on this site, too.](https://stats.stackexchange.com/search?q=regression+intercept+force) Renumbering is not the same as changing the first node. If you intend to treat the counts as *numbers,* then a valid renumbering for a linear model would be some affine transformation. That could be as simple as subtracting 1 from each number so that the sequence begins at 0. — whuber, Jul 29 '18 at 17:00
I have seen a lot of discussion about the validity of removing the intercept, but not in this context...and these threads don't necessarily discuss the problem stated above. Are you saying that if my goal is to test the effect (of say nodal position on yield strength) the y-intercept does not matter? — coreydevinanderson, Jul 29 '18 at 18:17
I wasn't saying that, but if indeed that's your objective, then the y-intercept would appear to be irrelevant. — whuber, Jul 29 '18 at 22:50

score 3 · Answer 1 · answered Jul 29 '18 at 21:46

If all you really care about is whether the properties "change along a strand," then the value of the intercept itself is unimportant. You are correct in your comment: "if [the] goal is to test the effect (of say nodal position on yield strength) the y-intercept does not matter." However you model the node position, the change along a strand will be indicated by the coefficient(s) for node position.

The idea to simply subtract 1 from each node position, proposed in comments by @whuber, would give a simple interpretation to the intercept: it would be the (predicted) value of a mechanical property at the very first node along the strand. (Think about this as having chosen to start numbering from 0 instead of from 1, as some computer languages do for indices into arrays.) In a simple random effects model for strands, the random effect (for an intercept) would then pick up the variation among strands in that property at the first node. This is simple, will be interpretable for all strands with nodes, and is easy to explain to a reviewer.
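The effect of renumbering can be checked with a minimal sketch in plain Python. The data values below are made up purely for illustration, the helper `fit_line` is a hypothetical name, and the strand random effect is left out to keep the example short. The sketch fits an ordinary least-squares line to the raw node numbers and again after subtracting 1.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: node positions 1..7 and a made-up material property.
node = [1, 2, 3, 4, 5, 6, 7]
strength = [10.2, 9.6, 9.1, 8.3, 8.0, 7.1, 6.8]

# Fit with node numbered from 1 (intercept refers to a nonexistent "node 0").
b0_raw, b1_raw = fit_line(node, strength)

# Fit with node numbered from 0 (intercept is the prediction at the first node).
node_shifted = [n - 1 for n in node]
b0_shift, b1_shift = fit_line(node_shifted, strength)

print(f"raw:     intercept={b0_raw:.3f}, slope={b1_raw:.3f}")
print(f"shifted: intercept={b0_shift:.3f}, slope={b1_shift:.3f}")
```

The slope is the same in both fits, so any test of change along the strand is unaffected. Only the intercept moves: after the shift it equals the raw intercept plus one slope, which is the fitted value at node 1.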

Note that if you were to treat node number as purely categorical (or ordinal), you would still have to choose a reference category value against which the other categories are compared. Subtracting 1 from all your node numbers, if they are treated as numerical predictors in some way, is simply choosing node 1 as the reference.
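The role of the reference category can be sketched in a few lines of plain Python. This assumes a simple one-way layout with no random effects, where least-squares fitting with treatment (dummy) coding gives an intercept equal to the mean of the reference category and coefficients equal to differences in means from it. The observations and the function name `treatment_contrasts` are hypothetical.

```python
from collections import defaultdict

# Hypothetical (node, property) observations pooled over strands.
observations = [
    (1, 10.0), (1, 10.4),
    (2, 9.5), (2, 9.9),
    (3, 8.8), (3, 9.0),
]

def treatment_contrasts(data, reference):
    """Return (reference mean, {node: mean difference from the reference})."""
    groups = defaultdict(list)
    for node, value in data:
        groups[node].append(value)
    means = {node: sum(vals) / len(vals) for node, vals in groups.items()}
    ref_mean = means[reference]
    contrasts = {
        node: mean - ref_mean
        for node, mean in sorted(means.items())
        if node != reference
    }
    return ref_mean, contrasts

# Node 1 as the reference: one coefficient per remaining node.
intercept, contrasts = treatment_contrasts(observations, reference=1)
print("mean at reference node:", round(intercept, 3))
for node, diff in contrasts.items():
    print(f"node {node} minus node 1: {diff:+.3f}")
```

With 13 nodes this yields 12 separate contrasts, which is the proliferation of coefficients raised in the question.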

The bigger problem, as @whuber emphasized, is how to deal with the node numbers: as purely categorical, ordinal, numerical (possibly transformed in some way, or modeled with splines). Should node numbers perhaps be corrected for the total number of nodes along a strand? Your knowledge of the underlying subject matter, and exploratory plots of properties as functions of node number, stratified by the total number of nodes in the strands, should indicate useful ways to proceed.
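As a starting point for that kind of exploration, the stratified summary could be tabulated before plotting. The sketch below is plain Python with made-up records, and the function name `stratified_means` is hypothetical. It also assumes that every node on a strand was measured, so that the largest node position seen on a strand equals its total node count.

```python
from collections import defaultdict

# Hypothetical records: (strand id, node position, property value).
records = [
    ("A", 1, 10.1), ("A", 2, 9.4), ("A", 3, 8.9),
    ("B", 1, 10.5), ("B", 2, 9.8), ("B", 3, 9.0), ("B", 4, 8.2),
    ("C", 1, 9.9), ("C", 2, 9.2), ("C", 3, 8.7),
]

def stratified_means(rows):
    """Mean property per (total nodes on strand, node position) cell."""
    # Assumption: the largest node position observed is the strand's node count.
    total_nodes = {}
    for strand, node, _ in rows:
        total_nodes[strand] = max(total_nodes.get(strand, 0), node)

    cells = defaultdict(list)
    for strand, node, value in rows:
        cells[(total_nodes[strand], node)].append(value)

    return {key: sum(vals) / len(vals) for key, vals in sorted(cells.items())}

summary = stratified_means(records)
for (total, node), mean in summary.items():
    print(f"strands with {total} nodes, node {node}: mean = {mean:.2f}")
```

If the profiles for strands of different total length line up by node number, the raw (or shifted) node number may be adequate; if they line up better by relative position, a correction for total node count may be worth considering.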
