I looked here: How it's better to include non-word features into text classification model? — but there aren't any useful answers.
I have a possibly naive question: I'd like to incorporate metadata into a text classification model, but I'm not sure how to proceed.
Assume that I have a dataset that is $N \times 3$, where the columns are:
- text document - for example, an amazon review or newspaper article
- some meta_data - for example, number of words of length > 5, or time article was published
- category - either A, B or C
The goal is to use the text document and the meta_data to classify the example in the correct category.
Typically one would perform text classification on the text document alone (tokenize, lemmatize, remove stopwords, etc.) and build a sparse matrix of word counts. A model (an SVM is a popular choice) would be trained on this sparse matrix and tested on unseen data, classifying each example as A, B or C.
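To make this concrete, here's a minimal sketch of that standard pipeline in scikit-learn (the toy documents and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy data standing in for the N x 3 dataset's text and category columns.
docs = [
    "great product fast shipping",
    "terrible quality broke quickly",
    "excellent value would buy again",
    "awful experience never again",
]
labels = ["A", "B", "A", "B"]

# Tokenize, lowercase, drop stopwords, and build a sparse matrix of
# (TF-IDF weighted) word counts.
vectorizer = TfidfVectorizer(stop_words="english")
X_text = vectorizer.fit_transform(docs)  # sparse N x vocabulary matrix

# Train an SVM on the sparse matrix.
clf = LinearSVC()
clf.fit(X_text, labels)

# Classify an unseen document as A or B.
pred = clf.predict(vectorizer.transform(["fast shipping great value"]))
```

Note that nowhere in this pipeline is there an obvious slot for the metadata column — which is exactly my question.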
But what about the metadata? I'd like to incorporate it somehow, but in this paradigm it's unclear to me where I can inject it. I feel like what I want is a model of the form:
$y = \beta_0X_0 + \beta_1X_1$
where $X_0$ is the metadata and $X_1$ is the result of the NLP part. But how would I set up such a model? Can I reduce the text classification portion to a single coefficient? Or am I conflating two distinct approaches to modeling text?
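One thing I've considered (I'm not sure it's the right approach) is simply stacking the metadata column onto the sparse word-count matrix so a single model sees both. A rough sketch, again with made-up toy data:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = [
    "great product fast shipping",
    "terrible quality broke quickly",
    "excellent value would buy again",
    "awful experience never again",
]
labels = ["A", "B", "A", "B"]

# One metadata column, e.g. number of words of length > 5.
meta = np.array([[1], [2], [1], [1]])

X_text = TfidfVectorizer().fit_transform(docs)

# Append the metadata as an extra column: N x (vocabulary + 1).
X = hstack([X_text, csr_matrix(meta)])

clf = LinearSVC().fit(X, labels)
```

But I don't know whether this is sound — the metadata is on a completely different scale from the TF-IDF values, and it's just one column against thousands of word features, which is part of why I wonder whether the text portion should instead be collapsed into a single $X_1$ as in the equation above.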