I'm working on a classification problem whose features are very noisy. I have a table with the "official" feature levels, but the actual data only loosely resemble them. For example, to represent the value "11001 BOGOTA", all of the following can be found in the data:
"100100BOGOT" "100100BOGOTA" "11001" "1100100BOG" "1100100 BOGOT" "1100100BOGOT" "1100100BOGOTA" "11001Bogot\U3e31653c" "11001 BOGOTA" "11001BOGOTA," "11001BOGOTA" "11001CONV.ES" "11100100BOGO" "0 BOGOT" "1100100BGTA" "11001Bogot" "1111001 BOGOT" "BOGOT" "BOGOTA01" "BOGOTA COLOMB" "BOGOTA D.C." "00000BOGOTA"
The number of "real" levels is around 1,000, while the number of observed levels is about 4,000. Before attempting any classification, these data should be de-noised, i.e. mapped back to their "official" values. What would be the best approach to achieve this?
1.-By hand? (crazy, but possible)
2.-Some set of rules? How could I define them?
3.-Training some classifier? Which one would be the best option?
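A rule-based option (2) could start with simple normalization plus fuzzy matching against the official table. The sketch below, using only Python's standard library, strips punctuation and spaces, uppercases, and then picks the closest official level by edit similarity; the `OFFICIAL` list and the `0.6` cutoff are placeholder assumptions you would replace with your full ~1,000-entry table and a tuned threshold:

```python
import difflib
import re

# Placeholder for the ~1,000 official levels; load your real table here.
OFFICIAL = ["11001 BOGOTA", "05001 MEDELLIN", "76001 CALI"]

def normalize(s):
    # Uppercase and keep only letters and digits, so "11001BOGOTA,"
    # and "11001 BOGOTA" collapse to the same key.
    return re.sub(r"[^A-Z0-9]", "", s.upper())

# Map normalized keys back to the canonical official spelling.
NORMALIZED = {normalize(o): o for o in OFFICIAL}

def denoise(raw, cutoff=0.6):
    """Return the closest official level, or None if nothing is similar enough."""
    key = normalize(raw)
    matches = difflib.get_close_matches(key, NORMALIZED.keys(), n=1, cutoff=cutoff)
    return NORMALIZED[matches[0]] if matches else None
```

Values that fall below the cutoff (or that match several official levels almost equally well) could then be set aside for manual review, which reduces option 1 to a small residue instead of all 4,000 observed levels.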
A similar problem is explained here, but it does not involve categorical data and has no "official" reference table.
Thanks a lot for reading this.