1

I have a text dataset in which each document comes with additional features: the document's category, its sub-category, and some anonymized float-valued features. I can build a model using only TF-IDF features from each document's text, but then I would not be using the relevant information encoded in the other features. One option would be to append these features to the document text and then compute TF-IDF, which is obviously wrong.

Can someone help me understand how I can use these other features alongside the text to build a model?

kjetil b halvorsen
silent_dev

2 Answers

1

Concatenate the other features, appropriately coded (dummy coded, etc.), to the end of your TF-IDF vector.
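A minimal sketch of that concatenation, using only the standard library so each step is visible. The toy documents, the `score` column, and the helper names `tfidf_matrix` and `one_hot` are hypothetical, and the TF-IDF weighting here is a plain `tf * log(N / df)` rather than any particular library's variant:

```python
import math
from collections import Counter

# Hypothetical toy data: text plus one categorical and one float feature per document.
docs = [
    {"text": "cheap flights to paris", "category": "travel", "score": 0.2},
    {"text": "paris hotel deals", "category": "travel", "score": 0.9},
    {"text": "python pandas tutorial", "category": "tech", "score": 0.5},
]

def tfidf_matrix(texts):
    """Return (vocabulary, rows) using a plain tf * log(N / df) weighting."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[t] / len(toks) * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows

def one_hot(values):
    """Dummy-code a categorical column; returns (levels, rows)."""
    levels = sorted(set(values))
    return levels, [[1.0 if v == lvl else 0.0 for lvl in levels] for v in values]

vocab, text_rows = tfidf_matrix([d["text"] for d in docs])
levels, cat_rows = one_hot([d["category"] for d in docs])

# Final design matrix: TF-IDF columns, then dummy columns, then the float feature.
X = [t + c + [d["score"]] for t, c, d in zip(text_rows, cat_rows, docs)]
```

Each row of `X` can then be fed to a model such as logistic regression or a decision tree. The text is never mixed with the other features before weighting, which avoids the problem with appending them to the document text.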

Since the TF-IDF features are high-dimensional and you want to use a simpler model, you may want to try applying dimensionality reduction (say, PCA) to the TF-IDF matrix, then concatenate the other document features to the result. Note: you may need to consider standardization or normalization, depending on the machine learning algorithm you choose.
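The reduce-then-concatenate idea could look roughly like the sketch below. It assumes NumPy is available, and uses random matrices as hypothetical stand-ins for a real TF-IDF matrix and the float features; the helper names `reduce_with_svd` and `standardize` are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a (documents x terms) TF-IDF matrix and a block of float features.
tfidf = rng.random((6, 20))
other = rng.normal(loc=50.0, scale=10.0, size=(6, 3))

def reduce_with_svd(matrix, k):
    """Project centered rows onto the top-k right singular vectors (PCA via SVD)."""
    centered = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def standardize(matrix):
    """Z-score each column; constant columns are left at zero."""
    std = matrix.std(axis=0)
    std[std == 0] = 1.0
    return (matrix - matrix.mean(axis=0)) / std

# Reduce the text block first, then put both blocks on a comparable scale.
text_reduced = reduce_with_svd(tfidf, k=4)
X = np.hstack([standardize(text_reduced), standardize(other)])
```

In a real pipeline the means, standard deviations, and components would be estimated on the training split only and then reused on held-out data. Scaling matters for regularized logistic regression, while decision trees are largely insensitive to it.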

Kwame
0

You can dedicate N input neurons (N being the number of "other" features) to receive some representation of those features, and train the network as usual.

Gabizon
  • Thanks but my question is more about feature engineering than training the model. Also, I am more interested in simpler models like LR or Decision Trees etc. – silent_dev Oct 18 '16 at 13:20