I have a dataset with around 40,000 rows and 36 variables, half of which are continuous and half of which are categorical. I have created dummy variables for the categorical variables and standardized the scale of the dataset. Now I am struggling with how to select features and what type of model to build. I was thinking of doing Lasso regularization to select features or a chi-square test, but it seems like neither of these can be implemented for both categorical and continuous variables if they are in one dataset. Would I need to analyze the variables separately? The dependent variable I am trying to predict is also continuous, so I am planning to try linear regression.

  • There is no issue unless particular software implementations dislike mixing categorical and numerical variables in regularized regression. A common name for a simple model that mixes categorical and continuous variables is ANCOVA: Analysis of Covariance. – Dave Aug 16 '21 at 00:15
  • Can you please explain why LASSO does not work with categorical variables? You might consider grouping the dummy variables that belong to one predictor, as explained in https://stats.stackexchange.com/a/326846, although such grouping might not be supported by every software implementation. Using ANOVA should work out of the box with statistical software (it also works with mixed categorical/numeric variables), but it is a stepwise approach and might thus yield a suboptimal solution or even contradictory results when applied in different orders. – cdalitz Aug 16 '21 at 07:04
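
To illustrate the point made in the comments, here is a minimal sketch (not a recommended production approach) showing that the lasso penalty treats dummy columns and continuous columns identically once everything is standardized. The data are synthetic, and every name here (`x1`, `x2_noise`, `cat_b`, `cat_c`, `lasso_cd`, the penalty `alpha=0.1`) is a hypothetical choice for illustration. The solver is a plain coordinate-descent implementation written out only so the example has no dependencies; in practice one would likely use a library routine instead.

```python
import random

def standardize(col):
    """Center a column and scale it to unit (population) standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    return [(v - mean) / sd for v in col]

def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_cd(columns, y, alpha, n_iter=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by coordinate descent.

    Assumes every column is standardized and y is centered.
    """
    n = len(y)
    w = [0.0] * len(columns)
    resid = list(y)  # residual y - Xw, kept up to date as w changes
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Covariance of column j with the residual that excludes its own term.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, resid)) / n
            new_w = soft_threshold(rho, alpha)
            if new_w != w[j]:
                delta = new_w - w[j]
                resid = [r - delta * c for c, r in zip(col, resid)]
                w[j] = new_w
    return w

# Synthetic data: two continuous variables and one 3-level categorical variable.
rng = random.Random(0)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2_noise = [rng.gauss(0, 1) for _ in range(n)]       # unrelated to y
cat = ["abc"[i % 3] for i in range(n)]
cat_b = [1.0 if c == "b" else 0.0 for c in cat]      # dummy columns, level "a" is the reference
cat_c = [1.0 if c == "c" else 0.0 for c in cat]

# Only x1 and level "b" of the categorical variable affect the response.
y = [3 * a + 2 * b + rng.gauss(0, 0.1) for a, b in zip(x1, cat_b)]
y_mean = sum(y) / n
y_centered = [v - y_mean for v in y]

names = ["x1", "x2_noise", "cat_b", "cat_c"]
columns = [standardize(c) for c in (x1, x2_noise, cat_b, cat_c)]

coef = dict(zip(names, lasso_cd(columns, y_centered, alpha=0.1)))
selected = [name for name, value in coef.items() if value != 0.0]
print(coef)
print(selected)
```

Note that plain lasso can keep one dummy of a categorical predictor and drop another, which is exactly why the second comment suggests penalizing the dummies of one predictor as a group.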
