Previously I have created dummy variables to handle categorical independent variables. But now I have a data set where SSN is an independent variable with more than 1000 unique levels. I suspect it won't be good to create dummy variables from such a categorical variable. Can you please suggest a better way to handle a categorical variable like this?

kjetil b halvorsen
Python123
  • Do you care about the coefficients or are they nuisance parameters? – dimitriy Sep 21 '16 at 06:20
  • Yes, I do care about the coefficients. – Python123 Sep 21 '16 at 06:30
  • You can use as.factor(), assuming the cells are not too small. There is nothing problematic about this, provided you have enough data and computing power. You can also group some of the categories together based on institutional knowledge. – dimitriy Sep 21 '16 at 06:35
  • Sorry, I don't follow. Can you please explain a little so that I can understand the logic and later apply it in any other language? As I understand it, as.factor() will convert my variable to a factor type, and then I run the R code for the regression. But it would be helpful to get a generalized understanding. I realize that "using R" in my comment created this problem. I'm removing that, sorry about this. – Python123 Sep 21 '16 at 06:42
  • @RUser It doesn't matter what language. A regression on 100 levels is very hard to read and understand. You should merge the levels. – SmallChess Sep 21 '16 at 06:44
  • For our model-building purposes we need to consider the social security number, which has a huge number of unique levels. – Python123 Sep 21 '16 at 06:45
  • What could the coefficient on an SSN dummy tell you? – dimitriy Sep 21 '16 at 06:47
  • I just don't get it. Using SSN is just like using row ID for your data set. What's the point? I can also do a regression with my row IDs, but it's rubbish. – SmallChess Sep 21 '16 at 06:48
  • @StudentT: it may not be the same; maybe the same SSN appears several times (say, if several events are recorded in the study period). I wouldn't call it rubbish: it sounds like the typical case where you may want to direct someone towards RE/multilevel models (I can see how DVM's comment is nudging that way, but maybe the OP doesn't have a firm grasp of the distinction between estimating coefficients and nuisance parameters). – user603 Sep 21 '16 at 06:51
  • @user603 That is why I asked the OP to edit the question, so that we can give a better response. Without more information, we can only guess and give general advice. – SmallChess Sep 21 '16 at 06:52
  • Edited. I was unsure about using SSN in my model, for the reason Student T mentioned. That's why I used the term "simple categorical variable". Sorry for the confusion. – Python123 Sep 21 '16 at 06:57
  • @RUser Are the SSNs unique for each respondent (or client)? – SmallChess Sep 21 '16 at 07:01
  • Do you really need a fixed effect for each SSN? Your description is brief so it's hard to comment, but it sounds more like something to be modeled as a *random* effect in a linear mixed model. Check: http://stats.stackexchange.com/questions/120964/fixed-effect-vs-random-effect-when-all-possibilities-are-included-in-a-mixed-eff/137837#137837 as this may be related to what you are asking. – Tim Sep 21 '16 at 07:24
  • Yes, it is unique for each respondent, but it repeats across rows. We are doing some fraud analysis on 200,000 rows of data, with 2% unique SSNs. – Python123 Sep 21 '16 at 08:21
  • Similar question here: http://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-categories. Look out for an eventual answer there! – kjetil b halvorsen Sep 21 '16 at 08:35
  • How many other (non-SSN) variables do you have? How many among those are continuous ones? – user603 Sep 21 '16 at 10:55

1 Answer


A modern machine should be able to run a regression on the 100-level variable. Having said that, a regression with such a complicated variable is hard to interpret, and its output will overflow your R console. You should seriously consider combining some of the categorical levels. It's hard to believe you really need all 100 levels for your regression.

Have you checked the counts for each level with table()? Maybe you can remove some levels that have very low counts?
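
Since a language-independent approach was requested, here is a minimal Python sketch of the same check using only the standard library. `Counter` plays the role of R's table(), and `lump_rare_levels`, the `min_count` threshold, and the `"OTHER"` label are hypothetical names and choices made for illustration. Merging rare levels into one bucket is shown here as one alternative to removing them; it is not the only reasonable option.

```python
from collections import Counter


def lump_rare_levels(values, min_count, other_label="OTHER"):
    """Replace levels seen fewer than min_count times with other_label.

    Hypothetical helper: the threshold and the label are illustrative
    choices, not recommendations.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Toy column standing in for a high-cardinality categorical variable.
levels = ["a", "a", "a", "b", "b", "c", "d"]

# Per-level frequencies, analogous to table() in R.
counts = Counter(levels)

# Levels "c" and "d" each occur once, so they fall below min_count=2.
lumped = lump_rare_levels(levels, min_count=2)

print(counts.most_common())
print(lumped)
```

The lumped column can then be dummy-coded as usual, with far fewer columns than the raw variable would produce.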

You may also want to split the variable into several categorical variables, each with fewer levels.
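
As a rough illustration of splitting, the sketch below breaks an SSN into its three conventional parts (a three-digit area number, a two-digit group number, and a four-digit serial number). It assumes the values are nine-digit strings, optionally separated by hyphens; `split_ssn` is a hypothetical helper, and the example number is made up. Whether any of these parts carries useful signal for a given model is an assumption that would need to be checked against the data.

```python
def split_ssn(ssn):
    """Split an SSN string into (area, group, serial) parts.

    Assumes nine digits, with optional separators such as hyphens.
    Parts are returned as strings so leading zeros are preserved.
    """
    digits = "".join(ch for ch in ssn if ch.isdigit())
    if len(digits) != 9:
        raise ValueError(f"expected 9 digits, got {len(digits)}")
    return digits[:3], digits[3:5], digits[5:]


# Made-up example value, not a real SSN.
area, group, serial = split_ssn("123-45-6789")
print(area, group, serial)
```

Each part has far fewer levels than the full number (at most 1,000 areas and 100 groups), so each is easier to tabulate and, if needed, merge.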

SmallChess
  • Actually, we are planning to include the Social Security number in our model. – Python123 Sep 21 '16 at 06:44
  • @RUser What do you mean? Maybe you want to edit your question so I can better answer your question? – SmallChess Sep 21 '16 at 06:44
  • No, no, @student T, I need a generalized approach, but because of my "using R" comment I'm getting answers tied to R. That's what I meant. Sorry for the confusion. – Python123 Sep 21 '16 at 06:48
  • @Ruser Please edit your question so that others like Dimit and I can give you a better answer. Please state exactly what you want to do and what data set you have. My answer is not specific to R. – SmallChess Sep 21 '16 at 06:49
  • Edited. I was confused. I also felt it was rubbish to use SSN, but since the client insists on it, I used the term "categorical variable" instead of SSN. – Python123 Sep 21 '16 at 06:59