My goal is to model the relationship between RETURN and SCORE from my survey dataset with the following structure:
- RETURN (numeric continuous) = company share price performance
- SCORE (numeric continuous) = company score collected via survey
- PARTICIPATION (binary) = 1 if the company participated / 0 if the score was estimated
- SIZE (numeric continuous) = company size
- COUNTRY (categorical factor 40 levels) = country of company
- INDUSTRY (categorical factor 20 levels) = industry of company
- COMPANY_ID (categorical factor 400 levels) = company identifier
- YEAR (categorical factor 10 levels) = year of survey
By design, the survey scores are biased upward for companies that participated (PARTICIPATION = 1) and for larger companies (higher SIZE).
Both RETURN and SCORE also vary with COUNTRY, INDUSTRY, COMPANY_ID (companies are surveyed repeatedly across years) and YEAR (the scoring methodology is adapted each year).
Not all companies are surveyed every year, so the total number of observations is ~2500.
To model the relationship between RETURN and SCORE I therefore need to control for the effects of the other independent variables. Because of the number of dummy variables this requires, I'd like to use a regularized regression approach, e.g. LASSO. Building the model up step by step, I started with a multiple regression:
mod1 = lm(RETURN ~ SCORE + SIZE, data = data)
I then added dummy variables for PARTICIPATION, COUNTRY, INDUSTRY and YEAR and fitted a LASSO with the glmnet package:
mod2=glmnet(x,y,alpha=1)
Here x has dimensions 2500 x 70. I can then use cross-validation to obtain the value of lambda that minimizes the MSE:
cvmod=cv.glmnet(x,y,alpha=1)
cvoptm=cvmod$lambda.min
lcoef=as.matrix(coef(mod2,s=cvoptm))
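For reference, the x and y used above could be constructed roughly as in the sketch below. It runs on simulated stand-in data (the data frame `dat` and all of its values are made up for illustration, not my real survey data) and assumes that `model.matrix` is used for the dummy coding. With 2 continuous predictors, 1 binary indicator and 39 + 19 + 9 dummy columns, this gives the 70 columns mentioned above.

```r
library(glmnet)

set.seed(1)
n <- 2500

# Simulated stand-in for the survey data; every value here is invented
dat <- data.frame(
  RETURN        = rnorm(n),
  SCORE         = rnorm(n),
  SIZE          = rnorm(n),
  PARTICIPATION = rbinom(n, 1, 0.5),
  COUNTRY       = factor(sample(sprintf("C%02d", 1:40), n, replace = TRUE)),
  INDUSTRY      = factor(sample(sprintf("I%02d", 1:20), n, replace = TRUE)),
  YEAR          = factor(sample(2001:2010, n, replace = TRUE))
)

# Dummy-code the factors (one reference level dropped per factor);
# remove the intercept column because glmnet fits its own intercept
x <- model.matrix(
  RETURN ~ SCORE + SIZE + PARTICIPATION + COUNTRY + INDUSTRY + YEAR,
  data = dat
)[, -1]
y <- dat$RETURN

# Cross-validated LASSO; coefficients at the lambda with minimum CV error
cvmod <- cv.glmnet(x, y, alpha = 1)
lcoef <- as.matrix(coef(cvmod, s = "lambda.min"))
```

Calling `coef()` on the `cv.glmnet` object with `s = "lambda.min"` is just a shortcut for extracting the coefficients at the cross-validated lambda; it should be equivalent in spirit to the two-step version above.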
How can I include the variable COMPANY_ID in the model? Surely it's not feasible to add it as a set of 400 dummy variables? Could I include it as a random effect using the glmmLasso package? Further, shouldn't both COUNTRY and INDUSTRY also be treated as random effects in that case?
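To make the random-effect idea concrete, this is roughly the kind of specification I have in mind, again as a sketch on simulated data rather than my real dataset. It assumes the `glmmLasso()` interface with a fixed-effects formula in `fix` and a named list of random-effect formulas in `rnd`. The value `lambda = 10` is arbitrary; as far as I can tell the penalty is not tuned automatically, so it would have to be chosen over a grid (e.g. by BIC or cross-validation).

```r
library(glmmLasso)

set.seed(1)
n_comp <- 40   # small hypothetical panel, not the real 400 companies
n_year <- 6

# Simulated panel with a company-specific random intercept; all values invented
sim <- data.frame(
  COMPANY_ID = factor(rep(seq_len(n_comp), each = n_year)),
  SCORE      = rnorm(n_comp * n_year),
  SIZE       = rnorm(n_comp * n_year)
)
company_effect <- rnorm(n_comp, sd = 0.5)
sim$RETURN <- 0.3 * sim$SCORE +
  company_effect[as.integer(sim$COMPANY_ID)] +
  rnorm(nrow(sim))

# LASSO-penalized fixed effects plus a random intercept per company
fit <- glmmLasso(
  fix    = RETURN ~ SCORE + SIZE,
  rnd    = list(COMPANY_ID = ~1),
  data   = sim,
  lambda = 10,
  family = gaussian(link = "identity")
)

fit$coefficients  # penalized fixed-effect estimates
```

COUNTRY and INDUSTRY would presumably enter as further elements of the `rnd` list if they are treated as random effects too, which is part of what I am unsure about.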