
I'm working on a classification problem whose features are very noisy. I have a table with the 'official' feature levels, but the actual data loosely resemble them. For example, to represent a value of "11001 BOGOTA", the following can be found in the data:

"100100BOGOT"  "100100BOGOTA" "11001" "1100100BOG" "1100100 BOGOT" "1100100BOGOT" "1100100BOGOTA" "11001Bogot\U3e31653c" "11001 BOGOTA" "11001BOGOTA," "11001BOGOTA" "11001CONV.ES" "11100100BOGO" "0 BOGOT" "1100100BGTA" "11001Bogot" "1111001 BOGOT" "BOGOT" "BOGOTA01" "BOGOTA COLOMB" "BOGOTA D.C." "00000BOGOTA" 

The number of "real" levels is around 1000, while the number of observed levels is about 4000. Before attempting any classification, these data should be de-noised back to their "official" values... What would be the best approach to achieve this?

1. By hand? (crazy, but possible)

2. Some set of rules? How could I define them?

3. Training some classifier? Which one would be the best option?

A similar problem is explained here, but it does not involve categorical data and has no "official" reference set.

Thanks a lot for reading this.

Pablo
  • I would first come up with some similarity measure(s), e.g. minimum edit distance, and then do semi-automated matching (e.g. find the best 5 candidates and check them manually). – seanv507 Jan 12 '16 at 23:40
  • What do you mean by "minimum edit distance"? Could you expand on this a little? – Pablo Jan 13 '16 at 21:53
  • https://en.wikipedia.org/wiki/Edit_distance - just the number of changes to go from one string to another – seanv507 Jan 13 '16 at 22:09
  • You can use [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) to define a set of rules and/or to automate search and replaces in the data file. – Olivier Jan 02 '18 at 18:52
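A minimal sketch of the semi-automated matching suggested in the comments: rank each noisy value against the official levels by string similarity, then review the top candidates by hand. It uses Python's standard difflib, which scores similarity rather than raw edit distance; "official_levels" and "noisy_values" are made-up placeholders for the real reference table and observations:

    import difflib

    # Made-up data: a few "official" levels and a few of the noisy observed values
    official_levels = ["11001 BOGOTA", "05001 MEDELLIN", "76001 CALI"]
    noisy_values = ["100100BOGOT", "1100100 BOGOT", "BOGOTA D.C.", "11001CONV.ES"]

    for value in noisy_values:
        # Rank the 5 closest official levels; ambiguous cases are checked manually
        candidates = difflib.get_close_matches(value, official_levels, n=5, cutoff=0.0)
        print(value, "->", candidates)

A true Levenshtein distance (e.g. from the rapidfuzz package) could be swapped in for difflib's similarity ratio, and a high-confidence threshold could be used to accept matches automatically, leaving only the doubtful ones for manual review.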

2 Answers


Maybe first try to encode your features, and then look for a denoising solution? Have a look here; it might help.

Once you have encoded the features, you can apply the denoising techniques that are common with numerical data in machine learning. For example, a simple linear regression or a neural network used for unsupervised feature learning can be useful.

That said, encoding noisy categorical data might not be easy.
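For illustration only, here is a minimal sketch of the plain encoding step using scikit-learn's OneHotEncoder; the example strings come from the question, and, as the comments below point out, this simple coding treats every noisy variant as an unrelated level:

    from sklearn.preprocessing import OneHotEncoder

    # A few noisy observations of what should all be "11001 BOGOTA"
    noisy_column = [["100100BOGOT"], ["11001 BOGOTA"], ["BOGOTA D.C."]]

    # Each distinct string becomes its own binary column
    encoder = OneHotEncoder(handle_unknown="ignore")
    encoded = encoder.fit_transform(noisy_column)
    print(encoded.toarray())

Any denoising or unsupervised feature-learning step would then operate on this numerical representation.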

PickleRick
  • Hamid, thanks for answering. But I do not quite understand why it is better to encode the variables before de-noising them. Right now, the noisy values resemble (at least visually and formally) their true values. Encoding them would lose this information, wouldn't it? – Pablo Jan 12 '16 at 21:50
  • @Pablo it depends on the coding, I guess. Somehow you need to map the features to their respective labels, and this is usually how it is done in machine learning. You will lose some information for sure, but that information is already gone because of the noise. If you have enough data, your models might figure it out. Rule-based solutions might lose less information, but they might also be more complicated. If you have enough data, you need to somehow convert these categorical data into features that machine learning models can learn from. Find a suitable coding and the model might find the similarities by itself. – PickleRick Jan 12 '16 at 21:58

Perhaps you can use n-grams. This approach is a bit crude, and you may prefer to model the noise more directly, but it might work fine for classification purposes.

See https://en.wikipedia.org/wiki/N-gram for an overview, but in short: you take every substring of length two or three ('BOG', 'OGO', 'OTA', ..., '10B'), then use one-hot encoding to turn those substrings into features:

featurename | 'BOG' | 'OGO' | 'OTA' | .. | '10B' |
row_1       |   1   |   0   |   1   | .. |   0   |
row_2       |   0   |   1   |   1   | .. |   0   |

This way, observations that have similar labels are close in terms of features, and this might work well for a classifier. An unwanted side effect is that similarly named cities such as Santa Monica and Santa Cruz will produce at least some similar n-grams.
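As a rough sketch of this idea, scikit-learn's CountVectorizer can produce the character 2- and 3-gram table above directly; the example strings are taken from the question:

    from sklearn.feature_extraction.text import CountVectorizer

    # Noisy observations taken from the question
    observations = ["100100BOGOT", "11001 BOGOTA", "BOGOTA D.C.", "1100100BGTA"]

    # Character 2- and 3-grams; binary=True gives the 0/1 table sketched above
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3), binary=True)
    X = vectorizer.fit_transform(observations)

    print(vectorizer.get_feature_names_out()[:10])  # e.g. ' B', ' BO', '.C', ...
    print(X.toarray())

The resulting feature matrix can feed a classifier directly, or a nearest-neighbour lookup against the official levels encoded in the same way.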

Gijs