How to handle clustering analysis of data which has different numbers of levels

Question

For example, if I have data along the lines of:

variable : levels within variable
x1       : {1,2}
x2       : {1,2}
x3       : {1,2}
x4       : {1,2,3}
x5       : {1,2,3}
x6       : {1,2,3,4,5}
x7       : {1,2,3,4,5}
x8       : {1,2,3,4,5}
x8       : {1,2,3,4,5,6}
x9       : {1,2,3,4,5,6}
x10      : {1,2,3,4,5,6,7,8,9,10}
x11      : {1,2,3,4,5,6,7,8,9,10}
x12      : {1,2,3,4,5,6,7,8,9,10}

Where x1,...,x12 are ordinal variables.

How would one go about treating data like the above for clustering? And which algorithms are most typically used?

I'm aware of scaling data for use with some algorithms, but I'm not sure if scaling data remains valid when there are different numbers of levels as there are above.
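
To make the concern concrete, here is a minimal sketch of one kind of scaling, assuming "scaling" means min-max rescaling of each variable's levels to [0, 1]. The helper name `rescale_level` is hypothetical, and the level counts are taken from the list above.

```python
def rescale_level(level: int, n_levels: int) -> float:
    """Map an ordinal level in 1..n_levels onto the interval [0, 1]."""
    if n_levels < 2:
        raise ValueError("need at least two levels")
    if not 1 <= level <= n_levels:
        raise ValueError("level out of range")
    return (level - 1) / (n_levels - 1)


# Number of levels for a few of the variables listed above.
n_levels = {"x1": 2, "x4": 3, "x6": 5, "x10": 10}

# Rescaled values for every level of each variable (rounded for display).
scaled = {
    name: [round(rescale_level(level, k), 3) for level in range(1, k + 1)]
    for name, k in n_levels.items()
}

for name, values in scaled.items():
    print(name, values)
```

After this rescaling, a one-level step moves a binary variable by 1.0 but a ten-level variable by only about 0.11, so the variables share a range but not a step size. That difference is what the question is about.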

edit

Following up on this comment, which says:

it may or may not be a good idea to try to encode all variables as low/high or low/typical/high based on the value distribution

I'm not too sure what is meant by this.

If I have x1 with the observations of

level  1  :  14   (2.0 %)
level  2  :  788  (98.0 %)

What would this mean with respect to encoding all variables as low/high based on the value distribution?

Another example might be having x8 with

level  1  :  274 (34.0 %)
level  2  :  264 (33.0 %)
level  3  :  180 (22.0 %)
level  4  :  50  (6.0 %)
level  5  :  10  (1.0 %)
level  6  :  24  (3.0 %)

score 1 · Answer 1 · answered Sep 26 '19 at 23:23

There is no simple method.

That is because these values presumably have some meaning, and the correct way of handling such variables depends a lot on what the data means and on how you want to approach the analysis.

Assuming this is some questionnaire, it may or may not be a good idea to try to encode all variables as low/high or low/typical/high based on the value distribution, for example.
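
One possible reading of "based on the value distribution" is to label each level by where it falls in that variable's own cumulative distribution, rather than by its raw number. The sketch below is an interpretation only, not necessarily the method meant above: the function name `encode_by_distribution`, the midpoint rule, and the cut points (1/3 and 2/3, or 0.5 for low/high) are all assumptions chosen for illustration. The counts are the ones given in the question for x1 and x8.

```python
def encode_by_distribution(counts, cuts=(1 / 3, 2 / 3),
                           labels=("low", "typical", "high")):
    """Label each ordinal level by its position in the observed distribution.

    counts: dict mapping level -> number of observations.
    cuts:   ascending cumulative-proportion thresholds (assumed values).
    labels: one more label than there are cuts.
    """
    if len(labels) != len(cuts) + 1:
        raise ValueError("need exactly len(cuts) + 1 labels")
    total = sum(counts.values())
    seen = 0
    encoding = {}
    for level in sorted(counts):
        # Midpoint cumulative proportion: share of observations below this
        # level plus half of the share at this level.
        midpoint = (seen + counts[level] / 2) / total
        seen += counts[level]
        # Index of the label = number of cut points the midpoint exceeds.
        encoding[level] = labels[sum(midpoint > c for c in cuts)]
    return encoding


# Observed counts from the question.
x1_counts = {1: 14, 2: 788}
x8_counts = {1: 274, 2: 264, 3: 180, 4: 50, 5: 10, 6: 24}

x8_encoded = encode_by_distribution(x8_counts)
x1_encoded = encode_by_distribution(x1_counts, cuts=(0.5,),
                                    labels=("low", "high"))

print(x8_encoded)
print(x1_encoded)
```

Under these assumed cut points, x8 collapses to low for level 1, typical for level 2, and high for levels 3 to 6, while the binary x1 maps to low/high. Whether such a recoding is meaningful still depends on what the survey items measure.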

Has QUIT--Anony-Mousse


thanks - yes this is related to a survey. I've edited the post a little, as I wasn't too sure what you meant by *"try to encode all variables as low/high or low/typical/high based on the value distribution"*. If you're aware of any literature that outlines this that would be appreciated. – baxx Sep 26 '19 at 23:42
if there are binary variables would this restrict the encoding to low/high across all variables using the approach you've outlined? So I would be encoding all variables to binary variables and working from there? Any links to literature that discusses this would be appreciated. – baxx Oct 06 '19 at 00:05
You can find some discussion and literature links here: https://stats.stackexchange.com/q/10/7828 – Has QUIT--Anony-Mousse Oct 06 '19 at 05:32
yes, thank you. I don't see anything that answers my question about your comment on _low/typical/high_ though. I'm interested in whether this approach would typically let the variable with the fewest levels (binary in this case) dictate how many levels the others are coded into. So, all would be coded into low/high in this case. Thanks – baxx Oct 07 '19 at 00:10
Sorry, I can't find it right now, you'll have to search yourself. There was a discussion of the psychological aspects: the values aren't used equally, and shouldn't be considered equally spaced. – Has QUIT--Anony-Mousse Oct 07 '19 at 06:00
