
I asked a related question on the Mathematics site, but I think this is a better place for it.

Both the KNN algorithm (classification) and the k-means algorithm (clustering) need a distance metric, such as Euclidean distance, to compute the distance between two instances. I know there are other metrics as well.

When the training data contains both numeric and categorical attributes, the usual advice is to convert the categorical attributes to numerical values. I know there are methods for this conversion, such as binary (dummy) variables and target-based encoding.
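To make the binary-variable conversion concrete, here is a minimal sketch in plain Python. The `one_hot` helper and the `colors` attribute are hypothetical names chosen for illustration, not part of any library.

```python
def one_hot(values):
    """Map each categorical value to a list of 0/1 indicator variables."""
    categories = sorted(set(values))  # fix the column order
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Hypothetical categorical attribute, for illustration only
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)

print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each category becomes its own 0/1 column, so no artificial order is imposed on the categories.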

Let's say I have converted all the categorical data to numerical values. What about the attributes that were numerical to begin with? Should I normalize them?

Suppose I have a numerical attribute with a large range (salary between 0 and 100,000) alongside a binary variable (containing only 0 and 1). The binary variable then contributes almost nothing to the distance, so I think computing the Euclidean distance is meaningless in this case.
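The concern can be checked with a small sketch. The four instances are made up for illustration, and min-max scaling is only one possible normalization, assumed here because it maps every attribute to the range [0, 1].

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(rows):
    """Rescale each column to [0, 1] using that column's own min and max."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # A constant column has span 0; use 1 instead to avoid dividing by zero.
    spans = [(max(c) - min(c)) or 1 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


# Hypothetical instances: [salary, binary flag]
rows = [
    [0, 0],
    [50_000, 1],
    [51_000, 0],
    [100_000, 1],
]

# Raw distances from rows[1]: salary dominates, the flag is invisible.
raw_diff_flag = euclidean(rows[1], rows[2])  # salary differs by 1,000, flag differs
raw_same_flag = euclidean(rows[1], rows[3])  # salary differs by 50,000, flag equal

scaled = min_max_scale(rows)
scaled_diff_flag = euclidean(scaled[1], scaled[2])
scaled_same_flag = euclidean(scaled[1], scaled[3])

print(raw_diff_flag, raw_same_flag)
print(scaled_diff_flag, scaled_same_flag)
```

On the raw data the flag adds only 1 to a squared distance of 1,000,000, so the nearest neighbour of the second instance is decided by salary alone. After scaling, a flag mismatch costs as much as the full salary range, and the nearest neighbour changes.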

My question is: should I convert all the data to binary variables, or apply something like normalization so that all attributes have the same range?

Adel
  • Why change categorical data to numeric in the first place? Numbers have an order. 2 is less than 5. But categorical data has no order like this. – Michael R. Chernick Dec 27 '17 at 00:02
  • In order to build a model (with a machine learning algorithm) that predicts the class/group of a new instance. – Adel Dec 27 '17 at 00:34
  • You might consider using a distance metric that inherently allows mixed data types. For example, take a look at Gower's distance. – user20160 Dec 27 '17 at 00:39
  • Gower similarity does exactly that: it normalizes numeric variables by their range. https://stats.stackexchange.com/q/15287/3277. Classic k-means is not suited to categorical data, even when recoded into dummies; look for another clustering method. – ttnphns Dec 27 '17 at 08:05
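
The Gower distance suggested in the comments can be sketched as follows. This is a simplified version that assumes equal attribute weights and no missing values. The `gower_distance` helper, the `ranges` argument, and the sample records are hypothetical and written only for illustration.

```python
def gower_distance(a, b, ranges):
    """Gower dissimilarity between two records (equal weights, no missing values).

    ranges[i] is the range (max - min) of attribute i if it is numeric,
    or None if attribute i is categorical.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Categorical: 0 if the values match, 1 otherwise.
            total += 0.0 if x == y else 1.0
        else:
            # Numeric: absolute difference normalized by the attribute's range.
            total += abs(x - y) / r if r else 0.0
    return total / len(ranges)


# Hypothetical records: (salary, job title); salary range assumed to be 100,000
ranges = [100_000, None]
p = (50_000, "manager")
q = (51_000, "clerk")
s = (100_000, "manager")

d_pq = gower_distance(p, q, ranges)  # roughly (0.01 + 1) / 2
d_ps = gower_distance(p, s, ranges)  # roughly (0.5 + 0) / 2
print(d_pq, d_ps)
```

Every attribute contributes a value between 0 and 1, so a large-range attribute such as salary cannot swamp a categorical one, and the categories never need a numeric encoding.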

0 Answers