Distance metric for categorical and numerical data

Question

I have asked a related question in the Mathematics section, but I think this is a better place to ask it.

Both the KNN algorithm (classification) and the k-means algorithm (clustering) need a distance metric, such as Euclidean distance, to compute the distance between two instances. I know there are other methods too.

When the training data contains both numeric and categorical attributes, the usual advice is to convert the categorical attributes to numerical values. I know there are methods for this conversion, such as binary (dummy) variables and target-based encoding.

Let's say I have converted all the categorical data to numerical values. What about the attributes that were numerical to begin with? Should I normalize them?

Imagine I have a numerical attribute with a large range (salary between 0 and 100,000) alongside a binary variable (containing only 0 and 1). The effect of the binary variable on the distance is then very small, and I think computing the Euclidean distance is meaningless in this case.
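
To make the scale problem concrete, here is a minimal sketch in plain Python. The four instances and the `is_manager` flag are made up for illustration, and min-max scaling is only one possible normalization, not necessarily the right answer.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(rows):
    """Rescale every column to [0, 1] using its observed min and max."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical instances: (salary, is_manager). Values are illustrative only.
rows = [
    (0.0, 0),
    (50_000.0, 0),
    (50_100.0, 1),
    (100_000.0, 1),
]

# Raw features: the salary column dominates the distance.
raw_same_flag = euclidean(rows[0], rows[1])  # same flag, salaries 50,000 apart
raw_diff_flag = euclidean(rows[1], rows[2])  # different flag, salaries 100 apart

# After min-max scaling, both columns live in [0, 1].
scaled = min_max_scale(rows)
scaled_same_flag = euclidean(scaled[0], scaled[1])
scaled_diff_flag = euclidean(scaled[1], scaled[2])

print(raw_same_flag, raw_diff_flag)
print(scaled_same_flag, scaled_diff_flag)
```

On the raw data, the pair that differs in the binary flag is still far closer than the pair that shares it, because the flag adds almost nothing next to the salary difference. After rescaling, a flipped flag counts as much as the full salary range.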

My question is: should I convert all the data to binary variables, or apply something like normalization so that all attributes have the same range?

Why change categorical data to numeric in the first place? Numbers have an order. 2 is less than 5. But categorical data has no order like this. — Michael R. Chernick, Dec 27 '17 at 00:02
To build a model (with a machine learning algorithm) that predicts the class/group of a new instance. — Adel, Dec 27 '17 at 00:34
You might consider using a distance metric that inherently allows mixed data types. For example, take a look at Gower's distance. — user20160, Dec 27 '17 at 00:39
Gower similarity does exactly this normalization: it scales numeric variables by their range (see https://stats.stackexchange.com/q/15287/3277). Classic k-means is not suited to categorical data, even when it is recoded as dummy variables; look for another clustering method. — ttnphns, Dec 27 '17 at 08:05
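
As a rough illustration of the Gower distance mentioned in the comments, here is a minimal sketch in plain Python. It assumes equal feature weights and no missing values, and it takes the salary range (100,000) from the question; the function name `gower_distance`, the `city` attribute, and the sample instances are hypothetical.

```python
def gower_distance(a, b, is_categorical, ranges):
    """Gower distance between two mixed-type instances (equal weights).

    Numeric features contribute |x - y| / range; categorical features
    contribute 0 when the values match and 1 otherwise. The result is
    the mean of the per-feature contributions, so it lies in [0, 1].
    """
    total = 0.0
    for x, y, cat, r in zip(a, b, is_categorical, ranges):
        if cat:
            total += 0.0 if x == y else 1.0
        else:
            total += abs(x - y) / r if r else 0.0
    return total / len(a)


# Hypothetical instances: (salary, city). Values are illustrative only.
is_categorical = [False, True]
ranges = [100_000.0, None]  # salary range from the question; unused for city

p = (30_000.0, "Paris")
q = (80_000.0, "Paris")
s = (30_000.0, "Tokyo")

d_pq = gower_distance(p, q, is_categorical, ranges)  # differ in salary only
d_ps = gower_distance(p, s, is_categorical, ranges)  # differ in city only
d_qs = gower_distance(q, s, is_categorical, ranges)  # differ in both

print(d_pq, d_ps, d_qs)
```

Because every feature contributes a value between 0 and 1, no separate normalization step is needed and the categorical attribute never has to be turned into a number. A pairwise matrix built this way can be passed to methods that accept precomputed distances, which k-means does not.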

