How to use more features in text based machine learning models beyond the text itself?

Question

I have a text dataset in which each document comes with additional features: the document's category, its sub-category, and some anonymized float features. I can build a model using only TF-IDF features from each document's text, but then I would not be using the relevant information encoded in the other features. One option would be to append these features to the document's text and then compute TF-IDF, which is obviously wrong.

Can someone help me understand how I can use these other features of the dataset along with the text to build a model?

score 1 · Answer 1 · answered Oct 18 '16 at 14:58

Concatenate the features, appropriately coded (dummy coded, etc.), to the end of your TF-IDF vector.
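As a minimal sketch of that concatenation, assuming scikit-learn and SciPy are available: the toy documents, the `categories` and `floats` arrays, and the labels `y` are all invented for illustration and are not from the question's dataset. The category column is one-hot (dummy) encoded and stacked column-wise next to the sparse TF-IDF matrix, together with the float features.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Toy data, made up for this example only.
texts = [
    "cheap flights to paris",
    "new phone released today",
    "paris hotel deals",
    "phone battery life review",
]
categories = np.array([["travel"], ["tech"], ["travel"], ["tech"]])
floats = np.array([[0.1, 2.0], [0.7, 1.5], [0.2, 2.2], [0.9, 1.1]])
y = np.array([0, 1, 0, 1])  # hypothetical labels

# TF-IDF on the text alone: sparse matrix, one row per document.
X_text = TfidfVectorizer().fit_transform(texts)

# Dummy-code the categorical feature: one column per category value.
X_cat = OneHotEncoder(handle_unknown="ignore").fit_transform(categories)

# Append the coded features to the end of each TF-IDF row.
X = hstack([X_text, X_cat, csr_matrix(floats)]).tocsr()

# Any model that accepts a sparse matrix can be fitted on the result.
model = LogisticRegression().fit(X, y)
print(X.shape)
```

Keeping everything sparse avoids materializing the full TF-IDF matrix as a dense array, which matters once the vocabulary gets large.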

Since the TF-IDF features are high-dimensional and you want to use a simpler model, you may want to try applying dimensionality reduction (say, PCA) to the TF-IDF matrix, then concatenate the other document features to the result. Note: depending on the machine learning algorithm you choose, you may need to standardize or normalize the features.
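A rough sketch of the reduce-then-concatenate variant, again assuming scikit-learn. Note the swap: it uses `TruncatedSVD` rather than PCA, since `TruncatedSVD` works directly on the sparse TF-IDF matrix without densifying it. The toy documents, the `floats` array, and the choice of `n_components=2` are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy data, made up for this example only.
texts = [
    "cheap flights to paris",
    "new phone released today",
    "paris hotel deals",
    "phone battery life review",
]
floats = np.array([[0.1, 2.0], [0.7, 1.5], [0.2, 2.2], [0.9, 1.1]])

# High-dimensional sparse TF-IDF matrix.
X_text = TfidfVectorizer().fit_transform(texts)

# Reduce the text features to a few dense components.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_text)  # dense array, shape (4, 2)

# Concatenate the other document features to the reduced text features.
X = np.hstack([X_reduced, floats])

# Put all columns on a comparable scale (zero mean, unit variance).
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.shape)
```

Dummy-coded categorical columns can be appended in the same `np.hstack` call. Scaling matters mostly for models such as regularized logistic regression; decision trees are largely insensitive to it.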

score 0 · Answer 2 · answered Oct 18 '16 at 13:17

You can dedicate some N input neurons (N being the size of the "other" features) to receive some representation of those input features and work as usual.

Gabizon

Thanks but my question is more about feature engineering than training the model. Also, I am more interested in simpler models like LR or Decision Trees etc. – silent_dev Oct 18 '16 at 13:20
