
I have this dataset with some numerical and some text columns and want to create an ML forecasting model. The thing is that one column called 'diagnosis' is text (each entry is one sentence long) and has information on the diagnosis of a patient, but it's not encoded well and has about 4k different values.

How would you go about that? How can I possibly split the 4k values into broader categories, so that I don't end up with 4k columns after one-hot encoding?

Arya McCarthy
    How many different diagnoses are there, ignoring how they’re represented in natural language? – gunes Jul 16 '21 at 09:25
  • That's something that I do not know. Possibly hundreds. Of course, there are some fixed words when it comes to diagnoses, but many clinicians use their own way of describing them – hippocampus Jul 16 '21 at 09:39
  • It won't answer your question, but as an addition: once you have made your "broader categories", consider using target encoding (for example, James-Stein encoding) to avoid the one-hot multiplication of columns (if you still have a lot of broad categories) – Adept Jul 16 '21 at 10:29
  • Have a look at https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels – kjetil b halvorsen Jul 17 '21 at 00:01

1 Answer


One approach is to create a sentence embedding. Many introductory articles showcase best results from a 2017 paper, and things have moved on a lot since then: a Transformer-based model is going to be superior to a BiLSTM model in just about any NLP task. One relatively straightforward approach would be to run the sentences through a pre-trained BERT model.

That typically gives you a 768-dimensional vector (for BERT-base), which may be impractical to work with. You could use PCA to reduce that to a more manageable handful of dimensions.
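As a rough sketch of the PCA step: the random vectors below are a stand-in for real sentence embeddings, which in practice you would obtain from a pre-trained model (for example via the `sentence-transformers` library); the shapes and component count are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for real sentence embeddings: in practice these would come
# from a pre-trained model (e.g. the sentence-transformers library).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4000, 768))  # one 768-dim vector per diagnosis sentence

# Reduce to a handful of dimensions that downstream models can handle.
pca = PCA(n_components=10)
reduced = pca.fit_transform(embeddings)
print(reduced.shape)  # (4000, 10)
```

The ten reduced columns can then be appended to the rest of the tabular features in place of the raw text.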

Another approach would be to list all the distinct words used and, with the help of a domain specialist, narrow this down to N key words, where N is on the order of 20 to 50. Then you could add N columns to your data, storing the number of times each keyword is mentioned.
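A minimal sketch of the keyword-count idea, with a made-up keyword list standing in for one a domain specialist would curate:

```python
# Illustrative keyword list; in practice a domain specialist would choose these.
keywords = ["fracture", "diabetes", "hypertension", "infection"]

def keyword_counts(sentence, keywords):
    """Return how many times each keyword appears in the sentence."""
    words = sentence.lower().split()
    return {kw: words.count(kw) for kw in keywords}

row = keyword_counts("Patient has diabetes and a suspected infection", keywords)
print(row)  # {'fracture': 0, 'diabetes': 1, 'hypertension': 0, 'infection': 1}
```

Applied to every diagnosis sentence, this yields N numeric columns instead of 4k one-hot columns. A real implementation would also want to handle punctuation and simple morphological variants (e.g. "diabetic").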

A more structured way to do this would be to use an ontology for your domain. Here is one of the first Google hits I got for papers on medical ontologies, just to give you an idea: https://pubmed.ncbi.nlm.nih.gov/31094361/

Yet another approach would be to do sentiment analysis on the sentences. That will give you a single number, typically in the range -1.0 to +1.0.
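To make the -1.0 to +1.0 scale concrete, here is a deliberately crude lexicon scorer; the word lists are invented for illustration, and a real project would use an off-the-shelf tool (e.g. VADER from NLTK) instead:

```python
# Toy lexicon, purely illustrative; real sentiment tools use far richer models.
POSITIVE = {"recovering", "stable", "improved", "well"}
NEGATIVE = {"unlikely", "lost", "critical", "bruising"}

def sentiment(sentence):
    """Crude score in [-1.0, +1.0]: (positive - negative) over matched words."""
    words = sentence.lower().strip(".").split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched

print(sentiment("Patient unlikely to make it through the night"))  # -1.0
print(sentiment("Patient is stable and improved"))  # 1.0
```

Each sentence then contributes one numeric column, which is about as compact as a text feature can get.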

If the sentences are full of facts like "Patient smokes, has bruising around eyes, ..." then the keyword approach is better. But if it is more like "Patient unlikely to make it through the night" vs. "Patient lost a lot of blood, but is over the worst now." then the sentiment approach could be more useful.

Darren Cook