I'm training a model and among the features, I have the language of the users. Right now I have done one-hot encoding on the language feature. But I think it would make more sense to have the language as an embedding since some of the languages are close to each other (e.g. Portuguese and Spanish) and should be taken into account in my recommendations. I'm wondering how I can do that?
Asked
Active
Viewed 371 times
1

kjetil b halvorsen
- 63,378
- 26
- 142
- 467

Erin
- 121
- 1
- 8
-
You could add a feature that specifies "Spanish or Portuguese," and set it to be $1$ if the language is either Spanish or Portuguese, and $0$ otherwise. – Open Season Nov 01 '17 at 16:53
-
1I want to this automatically. I don't want to do something similar to word2vec, but only for languages. – Erin Nov 02 '17 at 00:01
-
You should look into the [tag:fused-lasso] which will solve your problem! Or look at [this post](https://stats.stackexchange.com/questions/327412/is-there-a-fused-version-ridge-regression) which gives a ridge regression version. – kjetil b halvorsen Mar 16 '19 at 22:07
-
1To get to something like word2vec, you need some kind of task or network structure (see net2vec, if you e.g. have some kind of hierarchical structure of languages or if you can calculate some kind of word similarity as a measure of language similarity) you want to embed. Alternatively, if you have some multiple features of languages, you could try something like UMAP to map that to a lower dimensional space. – Björn Mar 16 '19 at 22:34