While researching this topic, I have come across different regression models which allow for the response variable to have many zeros. This includes:
- Negative Binomial Regression
- Zero Inflated Regression
- Hurdle Models
- Tweedie GLM
However, all these regression models are designed for when the response variable contains many zeros - nothing is mentioned if these regression models are designed to accommodate covariates containing many zeros.
To illustrate my question, I created an example in which the covariates and response contain many zeros (using the R programming language):
#create non-zero data
response_variable = rnorm(100,9,5)
covariate_1 = rnorm(100,10, 5)
covariate_2= rnorm(100,11, 5)
data_1 = data.frame(response_variable, covariate_1, covariate_2)
#create zero data
response_variable = abs(rnorm(1000,0.1,1))
covariate_1 = abs(rnorm(1000,0.1, 1))
covariate_2= abs(rnorm(1000,0.1, 1))
data_2 = data.frame(response_variable, covariate_1, covariate_2)
#combine both together
final_data = rbind(data_1, data_2)
#add one regular variable
final_data$covariate_3 = rnorm(1100, 5,1)
From here, several of the above regression models can be employed:
library(MASS)
library(pscl)
library(statmod)
#Negative Binomial Regression (note: this does not allow negative values, so I took the absolute value of the entire dataset)
summary(m1 <- glm.nb(response_variable ~ ., data = abs(final_data)))
#Zero Inflated Regression (note: this does not accept non-integer values or negative values, so I converted all values to integer and non-negative)
summary(m2 <- zeroinfl(response_variable ~ .,, data = lapply(abs(final_data),as.integer) ))
#Hurdle Model (note: this does not accept non-integer values or negative values, so I converted all values to integer and non-negative)
summary( m3 <- hurdle(response_variable ~ ., data = lapply(abs(final_data),as.integer)))
#tweedie glm (does not work - will try to debug later)
summary(m4 <- glm(response_variable ~., data = final_data ,family=tweedie(var.power=3,link.power=)))
My Question: (Although the above examples are probably unrealistic and a naive attempt to re-create real world problems) At first glance, none of the above regression models for "high density zero data" seem to outright "disallow" the covariates from containing many zeros - but are there any theoretical (or "logical") restrictions suggesting that the above models are unlikely to perform well on data where the covariates contain many zeros? In practice, can such regression models successfully model data in which the response variable and the covariates both contain many zeros??