For example, suppose I have data along the lines of
variable : levels within variable
x1 : {1,2}
x2 : {1,2}
x3 : {1,2}
x4 : {1,2,3}
x5 : {1,2,3}
x6 : {1,2,3,4,5}
x7 : {1,2,3,4,5}
x8 : {1,2,3,4,5}
x8 : {1,2,3,4,5,6}
x9 : {1,2,3,4,5,6}
x10 : {1,2,3,4,5,6,7,8,9,10}
x11 : {1,2,3,4,5,6,7,8,9,10}
x12 : {1,2,3,4,5,6,7,8,9,10}
where x1, ..., x12 are ordinal variables.
How would one go about treating data like the above for clustering? And what sorts of algorithms are most typically used?
I'm aware of scaling data for use with some algorithms, but I'm not sure whether scaling remains valid when the variables have different numbers of levels, as they do above.
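To make the scaling question concrete, here is a minimal sketch of the kind of rescaling I have in mind: map level k of a variable with m levels to (k - 1) / (m - 1), so that every variable lies in [0, 1]. The function name and the example row are made up for illustration, and I'm assuming the levels are coded 1..m. I'm not claiming this is the right thing to do; it's just what I mean by "scaling".

```python
def scale_ordinal(level, n_levels):
    """Map an ordinal level coded 1..n_levels onto the interval [0, 1]."""
    if n_levels < 2:
        raise ValueError("need at least two levels")
    return (level - 1) / (n_levels - 1)


# Number of levels for a few of the variables listed above.
n_levels = {"x1": 2, "x4": 3, "x6": 5, "x10": 10}

# One hypothetical observation (made-up values, for illustration only).
row = {"x1": 2, "x4": 2, "x6": 3, "x10": 10}

scaled = {name: scale_ordinal(value, n_levels[name]) for name, value in row.items()}
print(scaled)
```

With this mapping a one-step change in x1 spans the whole range, while a one-step change in x10 moves the value by only 1/9, which is exactly the part I'm unsure about.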
Edit
Following up on this comment, which says:
it may or may not be a good idea to try to encode all variables as low/high or low/typical/high based on the value distribution
I'm not too sure what is meant by this.
If I have x1 with the observations
level 1 : 14 (2.0 %)
level 2 : 788 (98.0 %)
What would this mean with respect to encoding all variables as low/high based on the value distribution?
Another example might be x8, with
level 1 : 274 (34.0 %)
level 2 : 264 (33.0 %)
level 3 : 180 (22.0 %)
level 4 : 50 (6.0 %)
level 5 : 10 (1.0 %)
level 6 : 24 (3.0 %)
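My best guess at what the comment means is something like the sketch below, which assigns each level to a bin according to where the middle of that level falls in the cumulative distribution. This is purely my own interpretation, not something stated in the comment: the function name, the choice of the mid-cumulative share, and the equal-width cut points (halves or thirds) are all assumptions.

```python
def bin_levels(counts, labels=("low", "typical", "high")):
    """Assign each ordinal level to a bin by its mid-cumulative share.

    counts maps level -> number of observations; levels are taken in sorted order.
    """
    total = sum(counts.values())
    n_bins = len(labels)
    encoding = {}
    cumulative = 0
    for level in sorted(counts):
        # Share of observations below this level plus half of the level itself.
        midpoint = (cumulative + counts[level] / 2) / total
        index = min(int(midpoint * n_bins), n_bins - 1)
        encoding[level] = labels[index]
        cumulative += counts[level]
    return encoding


# Observed counts from the two examples above.
x1_counts = {1: 14, 2: 788}
x8_counts = {1: 274, 2: 264, 3: 180, 4: 50, 5: 10, 6: 24}

x1_encoding = bin_levels(x1_counts, labels=("low", "high"))
x8_encoding = bin_levels(x8_counts)
print(x1_encoding)
print(x8_encoding)
```

Under these assumptions x1 stays as low/high, while x8 becomes low for level 1, typical for level 2, and high for levels 3 to 6. Is this the sort of encoding that was meant?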