Does using count data as independent variable violate any of GLM assumptions?

Question

I would like to employ count data as covariates while fitting a logistic regression model. My question is:

Do I violate any assumption of logistic regression (and, more generally, of generalized linear models) by using count (non-negative integer) variables as independent variables?

I found a lot of references in the literature on how to use count data as the outcome, but not as covariates; see for example the very clear paper: "N E Breslow (1996) Generalized Linear Models: Checking Assumptions and Strengthening Conclusions, Congresso Nazionale Societa Italiana di Biometria, Cortona June 1995", available at http://biostat.georgiahealth.edu/~dryu/course/stat9110spring12/land16_ref.pdf.

Loosely speaking, it seems that glm assumptions may be expressed as follows:

iid residuals;
the link function must correctly represent the relationship between the dependent and independent variables;
absence of outliers

Does anybody know whether there is any other assumption or technical problem that would suggest using some other type of model for dealing with count covariates?

Finally, please note that my data contain relatively few samples (<100) and that the ranges of the count variables can differ by 3-4 orders of magnitude (i.e., some variables take values in the range 0-10, while others may take values in the range 0-10000).

A simple R example code follows:

###########################################################

# generating simulated data
var1 <- sample(0:10, 100, replace = TRUE)
var2 <- sample(0:1000, 100, replace = TRUE)
var3 <- sample(0:100000, 100, replace = TRUE)
outcome <- sample(0:1, 100, replace = TRUE)
dataset <- data.frame(outcome, var1, var2, var3)

# fitting the model
model <- glm(outcome ~ ., family = binomial, data = dataset)

# inspecting the model
print(model)

###########################################################

Welcome to the site! One remark: if you want to sign your posts, use your profile (especially the about me box). — , Oct 02 '12 at 19:21
Usually, in GLMs, the predictor ("independent") variables are just supposed to be known constants; there are __NO__ distributional assumptions about them! So there is nothing wrong with using count data as predictors. — kjetil b halvorsen, Oct 02 '12 at 21:17
kjetil That's correct--and a good answer to the question. Yet, with the extreme ranges of IVs described here, one would be wise to evaluate the influence of the data, check goodness of fit, and particularly assess the potential for a nonlinear relationship. This would be done in the hope that the relationship actually *is* nonlinear and that a re-expression of the IVs, such as a root or log, will linearize it, thereby simultaneously relieving some of the influence problems. This is probably what @user14583 is trying to indicate in their answer. — whuber, Oct 03 '12 at 15:16
@kjetilbhalvorsen - I agree on "no distributional assumptions," but I don't think you meant to say "known" or "constants," as neither of those words fits. — rolando2, Oct 03 '12 at 15:41
They are "constants" in the sense that they are not random: no distribution. They are "known" in the sense that they are assumed to be measured without error, so the measured value is the one that was actually at work in the data generation mechanism. The GLM assumes that all randomness is in the response mechanism, which is often dubious! — kjetil b halvorsen, Oct 03 '12 at 19:42

score 6 · Answer 1 · edited Apr 13 '17 at 12:44

There are some nuances at play here, and they may be creating some confusion.

You state that you understand the assumptions of a logistic regression include "iid residuals... ". I would argue that this is not quite correct. We generally do say that about the General Linear Model (i.e., regression), but in that case it means that the residuals are independent of each other, with the same distribution (typically normal) having the same mean (0), and variance (i.e., constant variance: homogeneity of variance / homoscedasticity). Note however that for the Bernoulli distribution and the Binomial distribution, the variance is a function of the mean. Thus, the variance couldn't be constant, unless the covariate were perfectly unrelated to the response. That would be an assumption so restrictive as to render logistic regression worthless. I note that in the abstract of the pdf you cite, it lists the assumptions starting with "the statistical independence of the observations", which we might call i-but-not-id (without meaning to be too cute about it).
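To make the point about non-constant variance concrete, here is a minimal R sketch. The probability grid and the simulation size are arbitrary choices for illustration, not values from the question. For a Bernoulli response with success probability p the variance is p(1 - p), so it changes whenever the mean changes.

```r
# Bernoulli variance as a function of the success probability p.
# The grid of probabilities below is an arbitrary illustration.
p <- c(0.05, 0.25, 0.50, 0.75, 0.95)
bernoulli_var <- p * (1 - p)

# Simulated check: the sample variance of 0/1 draws tracks p * (1 - p).
set.seed(1)
sim_var <- sapply(p, function(prob_i) var(rbinom(1e5, size = 1, prob = prob_i)))

print(round(cbind(p, theoretical = bernoulli_var, simulated = sim_var), 3))
```

The variance is largest at p = 0.5 and shrinks toward 0 as p approaches 0 or 1, which is why "identically distributed" residuals cannot hold once a covariate actually shifts the mean.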

Next, as @kjetilbhalvorsen notes in the comment above, covariate values (i.e., your independent variables) are assumed to be fixed in the Generalized Linear Model. That is, no particular distributional assumptions are made. Thus, it does not matter if they are counts or not, nor if they range from 0 to 10, from 1 to 10000, or from -3.1415927 to -2.718281828.
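One way to see that the scale of a covariate is not an assumption of the model is to refit after a linear rescaling. The sketch below uses simulated data loosely modelled on the question's example. The variable `var3_k`, the seed, and the data-generating coefficients are all hypothetical. The fitted probabilities should agree up to numerical tolerance, and only the coefficient's units change.

```r
# Simulated data (hypothetical): one small-range and one wide-range count covariate.
set.seed(42)
n <- 100
dataset <- data.frame(
  var1 = sample(0:10, n, replace = TRUE),
  var3 = sample(0:100000, n, replace = TRUE)
)
# Assumed data-generating mechanism, chosen only so the example is not pure noise.
dataset$outcome <- rbinom(n, 1, plogis(-1 + 0.00002 * dataset$var3))

# Fit on the raw scale.
fit_raw <- glm(outcome ~ var1 + var3, family = binomial, data = dataset)

# Refit with var3 expressed in thousands.
dataset$var3_k <- dataset$var3 / 1000
fit_scaled <- glm(outcome ~ var1 + var3_k, family = binomial, data = dataset)

# Fitted probabilities agree; the slope is simply multiplied by 1000.
print(all.equal(fitted(fit_raw), fitted(fit_scaled), tolerance = 1e-5))
ratio <- unname(coef(fit_scaled)["var3_k"] / coef(fit_raw)["var3"])
print(ratio)
```

Rescaling is a linear change of units, so it leaves the model unchanged. A nonlinear re-expression such as a log or root is a different model and has to be judged on its fit.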

One thing to consider, however, as @whuber notes, is that if you have a small number of observations that are very extreme on one of the covariate dimensions, those points could have a great deal of influence over the results of your analysis. That is, you might get a certain result only because of those points. One way to think about this is to do a kind of sensitivity analysis by fitting your model both with and without those observations. You may believe it is safer or more appropriate to drop those observations, to use some form of robust statistical analysis, or to transform those covariates so as to minimize the extreme leverage those points would have. I would not characterize these considerations as "assumptions", but they are certainly important considerations in developing an appropriate model.
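A sensitivity analysis of this kind might look like the sketch below. The data are simulated, and the three extreme counts, the cutoff of 100, and all object names are hypothetical choices for illustration. `hatvalues()` and `cooks.distance()` are the standard diagnostics in R's stats package for a fitted `glm`.

```r
# Simulated data (hypothetical): mostly small counts plus three extreme ones.
set.seed(123)
n <- 80
count_x <- rpois(n, lambda = 5)
count_x[1:3] <- c(400, 650, 900)
outcome <- rbinom(n, 1, plogis(-0.5 + 0.1 * pmin(count_x, 20)))
dat <- data.frame(outcome, count_x)

# Fit with all observations and inspect standard influence diagnostics.
fit_all <- glm(outcome ~ count_x, family = binomial, data = dat)
diagnostics <- data.frame(
  count_x  = dat$count_x,
  leverage = hatvalues(fit_all),
  cooks_d  = cooks.distance(fit_all)
)
print(head(diagnostics[order(-diagnostics$cooks_d), ], 5))

# Refit without the extreme covariate values (cutoff of 100 is an assumption).
extreme <- dat$count_x > 100
fit_trim <- glm(outcome ~ count_x, family = binomial, data = dat[!extreme, ])

# Compare the coefficients from the two fits side by side.
print(rbind(all_data = coef(fit_all), without_extremes = coef(fit_trim)))
```

If the slope changes materially between the two fits, the conclusion rests on those few points. Note that in a logistic regression a point that is extreme in the covariate does not always have a large hat value, because the working weights shrink as the fitted probability approaches 0 or 1, so it is worth looking at the covariate values themselves as well as the diagnostics.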

score 1 · Answer 2 · answered Oct 02 '12 at 23:04

One thing I would definitely check is the distributional properties of your independent variables. Very often with count data, you'll see some moderate to severe right-skew. In that case, you will likely want to transform your data, as you'll lose the log-linear relationship. But no, using a logistic (or other GLM) model is fine.

user14583

How does right skew lose 'the log-linear relationship'? – Glen_b Oct 03 '12 at 03:07

This comment seems incorrect to me. Like @Glen_b, I don't see how this would necessarily lose the log-linear relationship. In any case, it would be better to examine the relationship directly (through plotting, for instance). – Peter Flom Oct 03 '12 at 10:23

A nonlinear transformation of an IV will definitely change the log-linear relationship to something else, @Peter. This answer seems basically correct to me. – whuber Oct 03 '12 at 15:18

@whuber I agree that a nonlinear transform of one variable will change the relationship between it and another variable. That seems pretty clear. But from what sort of relationship to what sort? Why not examine the relationship directly instead of assuming how it will be changed? Also, the answer seems to say that the person *wants* to lose the log linear relationship. – Peter Flom Oct 03 '12 at 15:23

That's a good point @Peter. Yet some people *do* want to change the relationship; that's not necessarily a mistaken notion. I agree that a direct examination is the right procedure: it will suggest how to re-express the IV(s) involved in order to create linear relationships. – whuber Oct 03 '12 at 15:46
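
The direct examination suggested in these comments can be sketched in R as follows. This is only an illustration on simulated data: the right-skewed count, the assumed true relationship on the log scale, and all object names are hypothetical. It compares observed proportions across bins of the covariate and then contrasts a fit on the raw count with a fit on `log1p()` of the count.

```r
# Simulated data (hypothetical): a right-skewed count whose effect is linear in log1p(x).
set.seed(2012)
n <- 100
count_x <- rnbinom(n, size = 0.5, mu = 200)
outcome <- rbinom(n, 1, plogis(-2 + 0.6 * log1p(count_x)))
dat <- data.frame(outcome, count_x)

# Observed proportion of successes within bins of the covariate (log scale).
# Plotting these proportions against the bins would show the same thing graphically.
bins <- cut(log1p(dat$count_x), breaks = 4)
prop_by_bin <- tapply(dat$outcome, bins, mean)
print(round(prop_by_bin, 2))

# Compare the raw and the re-expressed covariate.
fit_raw <- glm(outcome ~ count_x, family = binomial, data = dat)
fit_log <- glm(outcome ~ log1p(count_x), family = binomial, data = dat)
aic_table <- AIC(fit_raw, fit_log)
print(aic_table)
```

Which version fits better depends on the data at hand. The point is to look at the relationship before choosing a re-expression, rather than transforming only because the covariate is skewed.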
