
While researching this topic, I have come across several regression models that allow the response variable to have many zeros. These include:

  • Negative Binomial Regression
  • Zero Inflated Regression
  • Hurdle Models
  • Tweedie GLM

However, all of these models are designed for the case where the response variable contains many zeros; nothing is said about whether they can also accommodate covariates containing many zeros.

To illustrate my question, I created an example in which the covariates and response contain many zeros (using the R programming language):

#create non-zero data (a seed is set so the simulation is reproducible)
set.seed(123)

response_variable = rnorm(100, 9, 5)
covariate_1 = rnorm(100, 10, 5)
covariate_2 = rnorm(100, 11, 5)

data_1 = data.frame(response_variable, covariate_1, covariate_2)

#create data clustered at zero
response_variable = abs(rnorm(1000, 0.1, 1))
covariate_1 = abs(rnorm(1000, 0.1, 1))
covariate_2 = abs(rnorm(1000, 0.1, 1))

data_2 = data.frame(response_variable, covariate_1, covariate_2)

#combine both together
final_data = rbind(data_1, data_2)

#add one regular covariate with no excess zeros
final_data$covariate_3 = rnorm(1100, 5, 1)

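A quick histogram (a minimal sketch, assuming the simulated final_data above) shows the spike near zero in the response:

#visualize the mass near zero in the response
hist(final_data$response_variable, breaks = 50,
     main = "Histogram of response_variable", xlab = "value")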

From here, several of the above regression models can be employed:

library(MASS)    #provides glm.nb
library(pscl)    #provides zeroinfl and hurdle
library(statmod) #provides the tweedie family for glm

#Negative Binomial Regression (note: this does not allow negative values, so I took the absolute value of the entire dataset; glm.nb also expects integer counts, so R will warn about the non-integer response)

summary(m1 <- glm.nb(response_variable ~ ., data = abs(final_data)))

#Zero Inflated Regression (note: this does not accept non-integer or negative values, so I converted all values to non-negative integers)

summary(m2 <- zeroinfl(response_variable ~ ., data = as.data.frame(lapply(abs(final_data), as.integer))))

#Hurdle Model (note: this does not accept non-integer or negative values, so I converted all values to non-negative integers)

summary(m3 <- hurdle(response_variable ~ ., data = as.data.frame(lapply(abs(final_data), as.integer))))

#Tweedie GLM (a Tweedie with 1 < var.power < 2 places a point mass at zero; link.power = 0 gives a log link; the response must be non-negative, so abs() is used again)

summary(m4 <- glm(response_variable ~ ., data = abs(final_data), family = tweedie(var.power = 1.5, link.power = 0)))

My Question: Although the above examples are probably unrealistic and a naive attempt to re-create real-world problems, none of the above regression models for "high-density zero data" seem, at first glance, to outright disallow covariates containing many zeros. But are there any theoretical (or logical) restrictions suggesting that these models are unlikely to perform well when the covariates contain many zeros? In practice, can such regression models successfully fit data in which both the response variable and the covariates contain many zeros?

  • There’s no general rule that covariates shouldn’t have many zeros. Some models may not work, though, and it also depends on the y variable. – Aksakal Nov 30 '21 at 16:19

1 Answer


Regression analysis accommodates situations where the explanatory variables/covariates can have any values, including zero. There is no particular model needed for this --- it can be implemented in almost any model. If there are lots of zeros for a particular covariate, the only issue this creates is that it affects the leverage of the data points and one must therefore be careful to assess the functional form of the posited relationship with the response variable.

– Ben
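A minimal sketch of the leverage point above, assuming the simulated final_data from the question: the few observations lying away from a covariate's zero cluster carry most of the leverage.

#compare average leverage outside vs. inside the near-zero cluster of covariate_1
fit <- lm(response_variable ~ ., data = final_data)
h <- hatvalues(fit)
tapply(h, final_data$covariate_1 > 1, mean)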
  • +1. After all, ANOVA is nothing but a regression on a number of 0-1 covariates (which encode group membership). Many zeros in a covariate correspond to a group that is sparsely represented. No major problem here, except of course that parameter estimates will be less precise. – Stephan Kolassa Nov 30 '21 at 07:56
  • 1) @Ben: Thank you so much for your answer! Can you please elaborate on this sentence: "affects the leverage of the data points and one must therefore be careful to assess the functional form of the posited relationship with the response variable"? I am not quite sure I understand. Could including explanatory variables with many zeros "harm" the model's relationship with the other variables? – stats_noob Nov 30 '21 at 15:40
  • 2) Something I have always wondered: would it be possible to fit a joint probability distribution over this kind of data? For example, suppose we believe that each variable (the response variable and the explanatory variables) has a negative binomial distribution - would it be possible to model all the variables jointly, e.g. P(Y, X1, X2, ..., Xn) ~ Multivariate Negative Binomial(r, p), and make predictions by first taking the conditional distribution at a desired point, e.g. P(Y | X1 = x1, X2 = x2, ...), and then generating random samples from this conditional distribution via MCMC? – stats_noob Nov 30 '21 at 15:43
  • 3) Or maybe a Copula model could also be used to achieve something similar as in 2)? – stats_noob Nov 30 '21 at 15:44
  • 4) I have always struggled to understand: why don't the explanatory variables in regression require some assumption on their probability distribution? Is this because, historically, explanatory variables were considered "fixed" rather than random variables? Suppose we do have information about the distribution of the explanatory variables - is there some way to incorporate this information and potentially "enrich" the model (e.g. as I suggested in 2))? – stats_noob Nov 30 '21 at 15:48
  • 5) Here is a post where I ask a similar question as in 4): https://stats.stackexchange.com/questions/553854/statistical-models-that-exploit-distributional-knowledge-of-the-predictor-vari. Thanks so much @Ben! – stats_noob Nov 30 '21 at 15:48
  • @Stephan Kolassa: Thank you so much for your comment! In the case of high-dimensional (big) data, I imagine it might be difficult to identify such groups and have traditional models work around them. In the past, I was very interested in these kinds of regression models (hurdle, zero-inflated, etc.) for data where the explanatory variables have many zeros - but in the end, I found that models like Random Forest provided a "cheap and easy fix" by "randomly" finding subsets of the data where the response variable was homogeneous. However, now I am interested in this again! – stats_noob Nov 30 '21 at 15:52
  • @ Stephan Kolassa : Maybe you can take a look at this question if you have time? This is something that I was wondering about: https://stats.stackexchange.com/questions/553854/statistical-models-that-exploit-distributional-knowledge-of-the-predictor-vari . Thank you so much! – stats_noob Nov 30 '21 at 15:53
  • 6) I always wondered whether statistical computing packages could still estimate the beta coefficients of a regression model if the explanatory variables had many zeros. For example (in OLS), if the explanatory variables have many zeros, can (X'X)^{-1} X'Y still be calculated, or would this simply return errors because there are too many zeros in the explanatory variables? (See the sketch after these comments.) – stats_noob Nov 30 '21 at 17:53
  • 7) Logically speaking, if a certain explanatory variable contains ONLY zeros, it obviously has no impact on the regression model, and the statistical software would likely prompt the user to re-run the model without this variable. I wonder whether, in practice, there is some "threshold" for the number of zeros (or "near-zero" values) an explanatory variable can contain before the regression model "crashes" (e.g. 8%)? – stats_noob Nov 30 '21 at 17:58
  • @stats555: All these questions would be better asked *as questions*. You can reference this post if you feel it adds useful context – Scortchi - Reinstate Monica Nov 30 '21 at 23:34
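Regarding comment 6) above, a minimal sketch, reusing the simulated final_data from the question: the normal equations stay solvable no matter how many zeros the covariates contain, as long as the design matrix has full column rank.

#solve the OLS normal equations (X'X) b = X'y directly
X <- model.matrix(response_variable ~ ., data = final_data)
y <- final_data$response_variable
beta_hat <- solve(crossprod(X), crossprod(X, y))
beta_hat

#agrees with lm(); estimation only breaks down if a covariate is constant
#(e.g. all zeros) or perfectly collinear with another
coef(lm(response_variable ~ ., data = final_data))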