How to Deal with Large Number of Dummy Variables in Machine Learning?

Question

I have a cross-sectional real estate dataset with information on roughly 100,000 properties, including rental price, square meter size, number of bedrooms, etc. In addition, the dataset contains information about the region for each property. The total number of regions in the dataset is 400.

I would like to play around with different machine learning methods in the caret package to predict rental prices using the relevant features in the dataset. I obviously need to incorporate regional information to capture region-specific price effects.

Update: To be more formal, I have the following two model specifications in mind:

$y_i=\beta_0+\beta_1x_{i,1}+\beta_2x_{i,2}+\beta_3D_{i,1}+\beta_4D_{i,2} + \epsilon_i$

where $y_i$ refers to the rental price of property $i$, $x_{i,1}$ denotes square meter size, $x_{i,2}$ denotes the number of rooms, $D_1$ is a matrix of dummy variables capturing different characteristics like balcony, kitchen, cellar and so on, and $D_2$ is a matrix containing the 399 regional dummies.

Alternatively, the model could be specified as follows:

$y_i=\beta_0+\beta_1x_{i,1}+\beta_2x_{i,2}+\beta_3D_{i,1}+\gamma d_{j} + \epsilon_i$

where $d_j$ is the median or average price per square meter in the region $j$ in which property $i$ is located. I guess inclusion of interaction effects could also make sense (see comment below by seanv507), but I am not sure about them right now.

The obvious advantage of the last equation is that I only need one variable in my model to proxy for a regional effect. However, I also do not see too much of a problem for the first equation given the relatively large number of observations in the dataset.
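
A minimal sketch of how the regional proxy $d_j$ from the second specification could be built, written in plain Python rather than caret. The field names `region`, `rent` and `sqm` and the toy rows are assumptions for illustration only. Because the proxy is derived from the response, the medians are computed on the training rows only, and a region that never appears in the training data falls back to the overall median.

```python
from collections import defaultdict
from statistics import median


def fit_region_medians(rows):
    """Median rent per square meter for each region, from training rows only."""
    per_region = defaultdict(list)
    for row in rows:
        per_region[row["region"]].append(row["rent"] / row["sqm"])
    region_median = {region: median(values) for region, values in per_region.items()}
    # Fallback value for regions that are absent from the training data.
    overall = median(x for values in per_region.values() for x in values)
    return region_median, overall


def encode_region(region, region_median, overall):
    """Replace a region label by its training-set median price per square meter."""
    return region_median.get(region, overall)


# Hypothetical toy data; these are not the asker's column names or values.
train = [
    {"region": "A", "rent": 1000.0, "sqm": 50.0},
    {"region": "A", "rent": 1500.0, "sqm": 60.0},
    {"region": "A", "rent": 900.0, "sqm": 30.0},
    {"region": "B", "rent": 600.0, "sqm": 60.0},
    {"region": "B", "rent": 480.0, "sqm": 40.0},
]

region_median, overall = fit_region_medians(train)
d_known = encode_region("A", region_median, overall)
d_unseen = encode_region("Z", region_median, overall)
```

With cross-validation, the same fit step would be repeated inside each training fold so that held-out rents never enter the proxy.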

I have two questions:

1. Which of the two specifications would be preferable or are there better alternative ways to deal with this issue?
2. Which machine learning methods could be promising for this type of model?

There are many questions on this site about similar problems, such as: http://stats.stackexchange.com/questions/118255/treatment-for-factors-with-many-levels http://stats.stackexchange.com/questions/201287/handling-datasets-with-categorical-variables-of-many-levels http://stats.stackexchange.com/questions/50636/categorical-logit-predictor-with-too-many-different-levels http://stats.stackexchange.com/questions/167697/what-is-the-general-procedure-or-general-rules-for-grouping-factor-levels continued ... — kjetil b halvorsen, Aug 19 '16 at 14:42
... http://stats.stackexchange.com/questions/122005/categorical-variables-factor-reduction-can-i-use-the-dependent-variable http://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-categories http://stats.stackexchange.com/questions/128764/automatically-classifying-user-activity-sessions-on-a-website/128765#128765 http://stats.stackexchange.com/questions/50229/summarizing-confidence-intervals-when-there-are-many-levels/50347#50347 .... But few or no good answers! — kjetil b halvorsen, Aug 19 '16 at 14:54
Which dataset is it? It's not the [Ames housing dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview). Be careful with your model's twin assumptions and lack of two-level and three-level interactions, it could be that the living-room area matters disproportionately, or number of ground-floor rooms, or area of the master bedroom, or average area of the largest two bedrooms, etc. You'll need an interaction matrix between the $D_{i,1}$ to capture those... — smci, Mar 02 '20 at 09:16
...and if there are regional preferences for those features (e.g. Arizonians like swimming pools or solar more than northerners, who (say) value basements), you'll also need an interaction matrix between the $D_{i,1}$ and $D_{j,2}$. Btw, how large is $D_{1}$? — smci, Mar 02 '20 at 09:17

score 2 · Answer 1 · answered Aug 19 '16 at 11:44

You need to think about what makes regions similar and produce an encoding that captures it. Two things spring to mind: 'legal' hierarchies (council/county/state/...) and geographic proximity.

Legal hierarchies could be handled with more dummy variables, one set per level of the hierarchy, combined with regularisation (e.g. L1/L2) to favour the high levels of the hierarchy over the low ones.
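
The hierarchy encoding could be sketched as follows in plain Python. The lookup `region_to_parents` and the level names (county, state) are hypothetical, since the question does not say which administrative levels are available. Each property activates one indicator per level, so regions in the same county share columns; the penalised fit itself is not shown.

```python
def hierarchy_features(region, region_to_parents):
    """Names of the indicator columns that are active for one property."""
    county, state = region_to_parents[region]
    return {f"state={state}", f"county={county}", f"region={region}"}


def design_row(active, columns):
    """0/1 row of the design matrix in a fixed column order."""
    return [1 if column in active else 0 for column in columns]


# Hypothetical lookup: region -> (county, state).
region_to_parents = {
    "r1": ("c1", "s1"),
    "r2": ("c1", "s1"),
    "r3": ("c2", "s1"),
}

# One column per distinct level value, in a stable (sorted) order.
columns = sorted(
    {
        name
        for region in region_to_parents
        for name in hierarchy_features(region, region_to_parents)
    }
)

X = {
    region: design_row(hierarchy_features(region, region_to_parents), columns)
    for region in region_to_parents
}
```

The columns are deliberately redundant (the county and state indicators are sums of region indicators), which is why this design relies on a penalty: an unpenalised least-squares fit would not be identified.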

Geographic proximity: I am not so sure. One way would be to use distances rather than dummy coding (i.e. for each house, input its distance from each region).
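
One possible reading of the distance idea, sketched in plain Python. It assumes that each property and each region centroid has a latitude and longitude, which the question does not confirm; the centroid values below are made up. Each house gets one continuous feature per reference point instead of a 0/1 dummy.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def distance_features(lat, lon, centroids):
    """Distance from one house to every region centroid, in centroid order."""
    return [haversine_km(lat, lon, clat, clon) for clat, clon in centroids]


# Hypothetical region centroids as (latitude, longitude).
centroids = [(0.0, 0.0), (0.0, 1.0)]

# A house located exactly at the first centroid.
features = distance_features(0.0, 0.0, centroids)
```

This does not reduce the number of columns by itself, but houses in neighbouring regions end up with similar feature vectors, which dummy coding cannot express.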

seanv507


Thank you for your valuable ideas! One idea that came to my mind in the meantime is to generate one variable containing the median price in each region which may be taken as a proxy for the regional effect. This way I could avoid using dummy variables at all. – kanimbla Aug 19 '16 at 15:32
@kanimbla That makes a lot of sense. Why don't you edit your question to write out a model that makes sense to you, e.g. rental price = median_price_per_sq_meter * sq_meter + no_of_rooms_factor ..? Then people here can suggest how to implement it. (For instance, I would imagine that the effect of the number of rooms also depends on the region: I would expect the variation between 1 and 5 bedrooms to be bigger in an expensive area than in a cheap one.) Another thought is that perhaps you should predict price per square meter (to remove basic variation). – seanv507 Aug 19 '16 at 17:02
Good suggestion! I updated the post, I hope it's more clear now. – kanimbla Aug 20 '16 at 07:29
