
We have about 50,000 mobile phone models (e.g. Galaxy S7, iPhone 9) in our database, and the data set contains about 3 million records.

We want to find the mobile phones with the lowest call success rate (the number of successful calls divided by the total number of calls). We want to run a regression model to estimate the impact of each phone model on call success rate; the model with the smallest coefficient would then be the worst phone. The dependent variable is call success rate, and the independent variables are phone model, number of subscribers, and type of cell site (cell tower). Since phone model is nominal, we use dummy variables for it, giving 49,999 dummy variables in the model.
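To illustrate why a 50,000-level nominal variable produces 49,999 dummies, here is a minimal sketch in Python with pandas (the column names and toy values are hypothetical, not from the actual data):

```python
import pandas as pd

# Toy version of the setup: one record per aggregation unit, with
# phone model as a nominal variable. All names/values are made up.
df = pd.DataFrame({
    "model": ["Galaxy S7", "iPhone 9", "Galaxy S7", "Nokia 3310"],
    "subscribers": [120, 80, 200, 50],
    "cell_type": ["macro", "micro", "macro", "macro"],
    "success_rate": [0.95, 0.90, 0.97, 0.85],
})

# One dummy per model, dropping one level as the reference category:
# k nominal levels -> k - 1 dummies (hence 49,999 dummies for 50,000 models).
X = pd.get_dummies(df[["model", "subscribers", "cell_type"]],
                   columns=["model", "cell_type"], drop_first=True)
print(X.shape)  # 4 rows; subscribers + 2 model dummies + 1 cell_type dummy
```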

That is far too many dummy variables. What do you suggest?

Is regression a good approach here? Does anyone know of other statistical models for this problem?

I attached a picture showing a sample of the data and the regression model.


kjetil b halvorsen
milad rahimi

1 Answer


All of those ~50,000 mobile phone models cannot really be different (with respect to your response variable), so a method/model that (implicitly or explicitly) performs a clustering should be fine. You need some way of "collapsing" (or fusing) categories (levels), so that the fused levels are similar with respect to the response. There are many posts on this site about this problem; for a start, see Principled way of collapsing categorical variables with many categories and its links.
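As a rough illustration of what "collapsing" levels by similarity of response could look like, here is a crude Python sketch that bins models by their empirical mean success rate (the linked posts discuss principled alternatives; the data here is invented):

```python
import pandas as pd

# Hypothetical data: several records per phone model.
df = pd.DataFrame({
    "model": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "success_rate": [0.95, 0.93, 0.94, 0.96, 0.60, 0.62, 0.61, 0.59],
})

# Mean response per model, then fuse models into a few groups of
# similar mean response. With real data one would use many more bins,
# or a method like the fused lasso instead of this naive cut.
model_means = df.groupby("model")["success_rate"].mean()
groups = pd.cut(model_means, bins=2, labels=["low", "high"])
print(groups.to_dict())
```

The fused group label can then replace the original 50,000-level factor as a predictor, drastically reducing the number of dummies.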

For the fused lasso, see The name of 'Fused' Lasso

EDIT

Answering some further questions from the comments. Some questions the OP can ask himself: 3000000/50000 = 60 is the mean number of observations per phone model. Not large! What is the minimum? You should check linearity with respect to the quantitative variables, and maybe interactions. And, since the response is a proportion, you should check whether the bound at 1 causes problems; if so, maybe transform to logits.
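The logit transform mentioned above maps a proportion in (0, 1) to the whole real line. A minimal sketch (the epsilon guard and its value are my own choice, since empirical rates can hit exactly 0 or 1):

```python
import numpy as np

def logit(p, eps=1e-6):
    """Logit transform log(p / (1 - p)), clipping p away from 0 and 1."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

rates = np.array([0.0, 0.5, 0.9, 1.0])
print(logit(rates))  # 0.5 maps to 0; the bounds map to large +/- values
```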

Yes, you can use the 50,000 dummies (but with some sort of regularization). However, memory could be a problem: 30e6 * 50e3 * 8 / 1e9 is (approximately) the number of gigabytes a dense design matrix would need, and I doubt that fits in your computer! So you will need an implementation that can use sparse matrices; I believe the R package glmnet can do that. I would try it, but I am not sure whether it can do the fused lasso.
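The answer mentions glmnet in R; the same sparse-matrix idea can be sketched in Python with scipy and scikit-learn (sizes here are scaled down stand-ins, and this is a plain lasso, not the fused lasso):

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, k = 10_000, 500  # stand-ins for 30e6 records and 50e3 phone models

# Sparse one-hot design: each row has a single 1 in its model's column,
# so only ~n entries are stored instead of n*k.
models = rng.integers(0, k, size=n)
X = sparse.csr_matrix((np.ones(n), (np.arange(n), models)), shape=(n, k))

# Dense storage would need n * k * 8 bytes:
dense_gb = n * k * 8 / 1e9

# scikit-learn's Lasso accepts scipy sparse input directly, so the
# dense design matrix is never materialized.
y = rng.random(n)
fit = Lasso(alpha=0.01).fit(X, y)
print(dense_gb, fit.coef_.shape)
```

At the full size (n = 30e6, k = 50e3) the dense matrix would be about 12,000 GB, while the sparse representation stays proportional to the number of records.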

kjetil b halvorsen
  • Thanks kjetil. I will study these methods. My data is very large (about 3 million records). Is regression a good method to solve this problem? – milad rahimi Aug 19 '18 at 13:31
  • It could be a good model, worth a try. Some questions: 3000000/50000 = 60 is the mean number of observations per phone model. Not large! What is the minimum? You should check linearity with respect to the quantitative variables, maybe interactions. And, since the response is a proportion, you should check if the bound at 1 makes problems; if so, maybe transform to logits? – kjetil b halvorsen Aug 19 '18 at 13:37
  • Suppose I can increase the size of the data to 30 million (10 days), so the mean number of observations per model will be 600. My question is: is it right to use 49,999 dummy variables? Can we increase the number of dummies as the number of observations increases? Thanks. – milad rahimi Aug 19 '18 at 19:46
  • Yes, you can use the 50,000 dummies (but with some sort of regularization). But memory could be a problem: 30e6*50e3*8/1e9 is the number of gigabytes you will need, and I doubt that fits in your computer! So you **will need** some implementation that can use sparse matrices, and I believe the R package `glmnet` can do that. I would try that, but am not sure if it can do the fused lasso. – kjetil b halvorsen Aug 20 '18 at 11:27
  • I want to find the worst model. I have found a solution: first, I find the worst models in each cell, for example the 10 models with the lowest rate. I repeat this for all cells. I think some models are bad in many cells, so with this method the number of models decreases. Then I run the regression on these models. – milad rahimi Aug 22 '18 at 22:05
  • Sorry, I don't understand this. Do you mean that you split the data into 50,000 sets, one for each level of that categorical variable? – kjetil b halvorsen Aug 22 '18 at 22:39
  • No. For cell 1, I choose the 10 models with the lowest rate; for cell 2, I choose the 10 models with the lowest rate; and so on for the other cells. Then I run the regression on these selected models. – milad rahimi Aug 23 '18 at 11:55