
In classical statistical regression analysis (e.g. linear regression), one level of a categorical variable is usually left out when creating dummy variables, so that it serves as the reference level (e.g. for gender there is only one column, gender_male). I understand why.

I noticed that many machine learning pipelines appear to keep all levels when "one-hot encoding". So gender with 2 levels results in 2 columns: gender_male and gender_female. This can make the curse of dimensionality worse, so I am not sure why ML folk do this.

Anyway, can one still leave out the redundant level, or is there a reason to use 2 columns in the simple example above?

Please note that my specific model is an ANN (this). Thus I am not using "statistical regression" (neither standard nor lasso/ridge/elastic) and I am not interested in interpretability. Collinearity should also not be an issue, AFAIK.
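To make the two encodings concrete, here is a minimal pandas sketch (toy data; the column names are only illustrative):

```python
# Illustrative only: toy data, not my actual dataset.
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

one_hot = pd.get_dummies(df["gender"])                     # 2 columns: female, male
reference = pd.get_dummies(df["gender"], drop_first=True)  # 1 column: male ("female" is the reference)

print(one_hot.columns.tolist())    # ['female', 'male']
print(reference.columns.tolist())  # ['male']
```

`drop_first=True` gives the usual statistical dummy coding with "female" as the reference level.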

PS:

I found another potentially relevant discussion on this topic, which may help someone here.

PPS:

I am now more inclined to use binary encoding for the ANN.
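A rough sketch of what I mean by binary encoding (hand-rolled with numpy, no particular encoding library assumed): map each level to an integer and write that integer in base 2, so k levels need only ceil(log2(k)) columns:

```python
# Illustrative only: toy levels, hand-rolled binary encoding.
import numpy as np
import pandas as pd

levels = pd.Series(["red", "green", "blue", "green"])
codes = levels.astype("category").cat.codes.to_numpy()   # blue=0, green=1, red=2
n_bits = int(np.ceil(np.log2(levels.nunique())))         # 3 levels -> 2 bits

binary = np.column_stack([(codes >> b) & 1 for b in range(n_bits)])
print(binary)   # each row is the level's integer code written in base 2
```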

cs0815
  • One reason for this is explained here; in short, if using some sort of regularization it is better to keep all levels: https://stats.stackexchange.com/questions/231285/dropping-one-of-the-columns-when-using-one-hot-encoding/329281#329281 – kjetil b halvorsen Aug 21 '20 at 16:30
  • Does this answer your question? [Dropping one of the columns when using one-hot encoding](https://stats.stackexchange.com/questions/231285/dropping-one-of-the-columns-when-using-one-hot-encoding) – kjetil b halvorsen Aug 21 '20 at 16:30
  • I obviously read these before but they are not satisfactory! You do not leave anything out when you do not create a column for one REDUNDANT level, so part of the answer is nonsense. The question is: does it matter if I leave out the redundant level? The evidence I have thus far says no. – cs0815 Aug 21 '20 at 16:35
  • Then please do give more context, because as it stands I have nothing more to say than in that answer! Do you use regularization? Which? Estimation method? ... – kjetil b halvorsen Aug 21 '20 at 16:53
  • @kjetilbhalvorsen funny how everyone is trigger happy to close questions and say there is already an answer. I guess you voted for closing without even having this conversation? Please explain to me why on earth removing a level would cause ridge/lasso to fail? It just does not make sense. One level is simply the reference, or the point (0, 0, ..., 0) in a high-dimensional space. I think the same is the case for any machine learning model/ANN. Happy to be proven wrong with a proper citation rather than some opinion. – cs0815 Aug 21 '20 at 17:03
  • I adapted my question to the specifics of my model and as it stands it seems to perform well leaving out one level. – cs0815 Aug 21 '20 at 17:07
  • I have not said that leaving out one level causes ridge/lasso **to fail**, but that it destroys invariance, that is, it treats levels differently, and **which level you leave out can change the predictions**. – kjetil b halvorsen Aug 21 '20 at 17:45
  • @kjetilbhalvorsen sorry I did not say this about you. I said this about the "answer" you quote. – cs0815 Aug 21 '20 at 19:24
  • There is strong disagreement about closing, for instance see this: https://stats.meta.stackexchange.com/users/7828/has-quit-anony-mousse Despite staying some hours on the front page, few people engaged with this post. Why? I guess because the question contains too little specific contextual information. Without context, we cannot say more than some probably unhelpful generalities. About the vote to close as dup: at least that makes a few more people look at the post, so it can give more engagement, not less. Maybe some see something I cannot see. But you **should edit to give more context**. – kjetil b halvorsen Aug 21 '20 at 19:48
  • @kjetilbhalvorsen I edited the question as you may have noticed?! So I really do not understand what your point is anymore. Especially as I also explained why I do not think your suggested answers, which I read before posting my question, are helpful. – cs0815 Aug 21 '20 at 19:59
  • Yes, I have read that and do not understand it. Maybe more details? Anyhow, hopefully somebody else will jump in. My comments were generic, about strategy in asking questions here. – kjetil b halvorsen Aug 21 '20 at 21:46
  • ahh thanks everyone for the close votes! question generated a few upvotes, answers and discussions but heh just close it. – cs0815 Aug 24 '20 at 15:33
  • I think the question's upvotes were generated as a result of what you originally asked, but as we've all discovered, this doesn't appear to be at all what you are really asking or are after. – StatsStudent Aug 25 '20 at 15:29
  • @StatsStudent I am really not sure why you claim this?! All I asked is whether I can stick with the statistical approach of removing redundant levels whilst pre-processing data for an ML approach (ANN quoted after criticism) - basically dummying the nominal IVs. Just read my question. This is different to one-hot encoding, which is another re-invented (data transformation) method of machine learn-ists (remember me saying re-implement the wheel, admittedly slightly out of context, confusing you?). Look, you have more points on here so I better stop arguing with you, I am just a proletarian "data scientist". – cs0815 Aug 25 '20 at 16:01
  • @cs0815, it seems that everyone is confused by your actual question. We've asked you multiple times to update your question (not place them in comments) to clearly explain what it is you are looking for and why other answers have been insufficient to address your questions (more than simply saying "I already know that"). We're trying to help. We really are. But it seems the majority of the people aren't clear what it is you are after, so if you would like the question re-opened, please update the question and we'll then vote to reopen if sufficiently clear. Thanks. – StatsStudent Aug 25 '20 at 16:43
  • @StatsStudent - let me repeat the core of my question for you: "Can one still leave out the redundant level or is there a reason to use 2 columns using the simple example?" (I noticed the typo: column instead of columns). So please tell me what is not clear about this?! Again, I am basically asking if I need to encode the redundant level for the stated modeling technique. – cs0815 Aug 25 '20 at 17:09
  • Again, you need to update the question and not the comments. This is the last time I'll try to help unless you can follow the rules of behavior of stack exchange. Please review them in the help pages before making any other posts. Thank you – StatsStudent Aug 25 '20 at 19:41
  • @StatsStudent I cannot believe you accuse me of rudeness. I quoted the part of the question in the comments and you still claim it is not part of the question? I am totally lost! – cs0815 Aug 26 '20 at 06:25
  • Yes, @cs0815, It clearly appears so. – StatsStudent Aug 26 '20 at 09:53
  • @StatsStudent did you finally realize that what I state in the comments is actually in the question contrary to what you state or are you accusing me of being lost? – cs0815 Aug 26 '20 at 11:00

2 Answers


Two things: perfect multicollinearity is not an issue in gradient-based techniques of loss minimization, and in cross-validation it is possible to end up without one of the categories. That's why one-hot encoding is used in ML and no attempt is made to avoid the dummy-variable trap.

In regression we don't like being caught in the dummy-variable trap, so we use one fewer dummy than there are categories so that the intercept can be kept. The linear-algebra techniques used in OLS dislike the perfect multicollinearity that the dummy trap creates. In ML, optimization is gradient based, so one-hot coding is not a problem.
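A toy numpy sketch of this point (simulated data, nothing to do with the OP's actual problem): plain gradient descent minimizes the MSE without complaint even when the design matrix holds the intercept plus both gender dummies, i.e. is perfectly collinear:

```python
# Toy simulation: gradient descent on the MSE with a rank-deficient design.
import numpy as np

rng = np.random.default_rng(0)
male = rng.integers(0, 2, size=200)
X = np.column_stack([np.ones(200), male, 1 - male])   # intercept + both dummies, rank 2
y = 1.0 + 0.5 * male + rng.normal(scale=0.1, size=200)

w = np.zeros(3)
for _ in range(5000):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
    w -= 0.1 * grad

print(np.mean((X @ w - y) ** 2))   # reaches the minimum MSE
# By contrast, np.linalg.solve(X.T @ X, X.T @ y) fails here: X.T @ X is singular.
```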

OK, but why not still drop the extra dummy? The reason is that when you run cross-validation it is possible to end up with a training fold that doesn't have one of the categories in it. So it's simpler to use one-hot encoding and not worry about this problem.
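A small pandas illustration of that cross-validation issue (made-up data): if a fold happens to contain only one gender, encoding that fold on its own produces a different set of columns than the full data, unless the category list is fixed up front:

```python
# Made-up data: a fold that happens to lose the "female" category.
import pandas as pd

full = pd.DataFrame({"gender": ["male", "female", "male", "male"]})
fold = full.iloc[[0, 2, 3]]                               # only "male" left in this fold

print(pd.get_dummies(fold["gender"]).columns.tolist())    # ['male'] -- the columns changed

# Declaring the categories up front keeps the encoding stable across folds:
fold = fold.assign(gender=pd.Categorical(fold["gender"], categories=["female", "male"]))
print(pd.get_dummies(fold["gender"]).columns.tolist())    # ['female', 'male']
```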

Aksakal
  • Thanks. This: "doesn't have one of the categories in it" is exactly what I had to deal with, but IMHO 3 levels can still be coded as 00, 01, and 10, thus only having 2 columns. It seems to work for me using the quoted model. I also think that, if one has the issue of highly infrequent levels, one should reconsider one's data processing. – cs0815 Aug 21 '20 at 19:30
  • ML crowd is a busy bunch, they don't want to mess with special cases; that's one reason why they fell in love with deep learning, which promises to let you forget about feature engineering. The idea being you simply toss your junk into the ML algo and it figures everything out itself, while you move on to a new task. You apply one-hot and don't bother about missing categories in CV - same attitude. – Aksakal Aug 21 '20 at 19:35
  • totally agree. So you think my "one-hot encoding whilst removing a redundant level" is OK then? I think so and cannot see why not. – cs0815 Aug 21 '20 at 20:01
  • If you use one-hot and that gives you satisfactory results, then go for it! My comments above are more principled --- but your comments seem to indicate you have very many levels; if so, look at [tag:many-categories] – kjetil b halvorsen Aug 23 '20 at 19:39
  • @kjetilbhalvorsen the point of my question was that I do not want to use one-hot as such but remove one redundant level as in statistical regression. – cs0815 Aug 24 '20 at 08:06
  • OK, but your question was not very clear---if you edit it with the clarifications you have given in comments (but more detailed than last time), I will vote to reopen (reopening happens frequently). But the short answer to the question in this comment is: removing the intercept is better than removing one redundant level, **but that only works with one factor, not with two or more** unless all interactions are included. Details: https://stats.stackexchange.com/questions/215779/removing-intercept-from-glm-for-multiple-factorial-predictors-only-works-for-fir/218034#218034 – kjetil b halvorsen Aug 24 '20 at 18:34
  • I have voted to reopen. You could still make the point even clearer, maybe simply **like this** which you get `**like this**`. – kjetil b halvorsen Aug 25 '20 at 18:00

As Aksakal has indicated in his comment, "perfect multicollinearity is not an issue in gradient based techniques of loss minimization." To explain this further, statisticians are often most concerned with uniquely estimating parameters in regression equations so they can be interpreted. When you have linearly dependent columns created with one-hot encoding, the parameter estimates cannot be uniquely determined; the estimated coefficients have an infinite number of solutions.

In machine learning, interpretation of parameters isn't really a concern. Instead, predictive models are constructed and built to minimize loss functions such as the mean squared prediction error. The value of these loss functions is identical whether one uses $c$ levels of a qualitative variable or $c-1$. However, if you were to use $c$ levels of this variable in a statistical model, there would be no unique estimate for $\hat{\beta}_2$, for example, which makes interpretation of this parameter impossible.
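A minimal numpy sketch of that point (simulated data; `np.linalg.lstsq` returns a minimum-norm least-squares solution): fitting with all $c$ dummy columns or with $c-1$ gives exactly the same fitted values, and hence the same loss, even though the coefficients in the full one-hot case are not unique:

```python
# Simulated data: the minimised loss is the same with c or c-1 dummy columns.
import numpy as np

rng = np.random.default_rng(1)
male = rng.integers(0, 2, size=100)
y = 2.0 + 0.5 * male + rng.normal(scale=0.1, size=100)

X_full = np.column_stack([np.ones(100), male, 1 - male]).astype(float)  # c columns, rank deficient
X_ref = X_full[:, :2]                                                   # c - 1 columns

beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)   # one of infinitely many solutions
beta_ref, *_ = np.linalg.lstsq(X_ref, y, rcond=None)     # unique solution

print(np.mean((X_full @ beta_full - y) ** 2))   # identical MSE ...
print(np.mean((X_ref @ beta_ref - y) ** 2))     # ... with either encoding
```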

StatsStudent
  • I know this and this is not really my question. My question was more why no one is removing redundant levels in "one-hot encoding"! – cs0815 Aug 24 '20 at 08:07
  • I just addressed this exact issue in my answer. I'm not sure where your confusion lies. Perhaps you need to add additional details to your question. I've just explained that the "redundant level" is left in because it doesn't affect the loss function. If it doesn't affect the loss function and you aren't interested in interpreting parameters, there's no need to go to the extra trouble of dropping the levels and assigning dummy variables accordingly. – StatsStudent Aug 24 '20 at 08:13
  • well it does increase the likelihood of the curse of dimensionality biting you, doesn't it? So I am sorry I do not agree with your blanked statements! You see many machine learning techniques are just implementing the wheel whilst neglecting issues like this. – cs0815 Aug 24 '20 at 08:19
  • "implementing the wheel?" "blanked statements?" huh? I'm not sure what are you referring to when you say "implementing the wheel." Note, I didn't say that everyone in machine learning uses one-hot encoding. I said, that it's a convenience since the the predictive error rate is the regardless of the coding mechanism used. – StatsStudent Aug 24 '20 at 08:29
  • Ok this was not directly aimed at you. What I meant with "re-implementing the wheel" was that machine learning folk came up with "one hot encoding" even though dummy encoding exists, and arguably one should remove the redundant level to reduce curse-of-dimensionality issues. I understand that interpretability of parameters is not the concern with most ML models. As I said, the curse of dimensionality is still an issue and IMHO you saying that it does not affect the loss function is a bit of a blanked statement. Surely this cannot be the only reason. – cs0815 Aug 24 '20 at 08:47
  • No. One hot encoding has been around way before the ML folks came to the party. They only gave it a new name. It's traditionally been called "dummy encoding" among statisticians for decades. My statement about the choice of one hot encoding versus using one fewer variable to represent the levels of a factor is a "blanket" statement because it's true -- use of either does not affect the value of the predictive loss function. I never claimed it to be the only reason; it certainly isn't, and the choice of when to use one hot depends on the ML algorithm you wish to use. For example, .... – StatsStudent Aug 24 '20 at 09:03
  • (continued) you don't want to drop a redundant column when using lasso or elastic net which use regularization. A great explanation of this can be found on CV already here: https://stats.stackexchange.com/questions/231285/dropping-one-of-the-columns-when-using-one-hot-encoding – StatsStudent Aug 24 '20 at 09:05
  • "It's traditionally been called "dummy encoding" among statisticians for decades." this is exactly what I said whilst statistician leave out redundant level! Need to check your statement reg. not leaving out redundant level for ridge/lasso/elastic regression. IMHO this is not true. – cs0815 Aug 24 '20 at 10:45
  • You actually didn't say that. – StatsStudent Aug 24 '20 at 13:08
  • emmm re-implementing the wheel comment? No point posting comments here I see ... – cs0815 Aug 24 '20 at 15:37
  • Oh, no. You simply said that "You see many machine learning techniques are just implementing the wheel whilst neglecting issues like this." You didn't mention anything about this being used for decades by statisticians. We'd have to have had a crystal ball to understand what exactly you were referring to by "many machine learning techniques" in that phrase. At any rate, it seems you've accepted this answer, so you have a satisfactory answer now and there is no need to add additional comments. Thanks and best of luck to you! – StatsStudent Aug 24 '20 at 15:41