I am working on a project that requires data mining, and I have been asked to use R. I have a dataset in which all variables are categorical, and I would like to form clusters on it. I am unable to figure out how to do so in R.

Here is what I have done: I have converted all the variables to "factor" data type in R. But I am not able to see the underlying numbered levels. I also do not know how to use this with kmeans() to get the required result.
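
For reference, a minimal sketch of how the codes behind a factor can be inspected. The vector `emp_title` and its values are made up for illustration and are not taken from the real data:

```r
# Hypothetical toy column standing in for EmpTitle
emp_title <- factor(c("Analyst", "Manager", "Analyst", NA, "Director"))

levels(emp_title)       # the distinct labels, in sorted order
as.integer(emp_title)   # the underlying integer codes (NA stays NA)
str(emp_title)          # compact view showing labels and codes together
```

Note that these codes only reflect the sort order of the labels, so passing them to kmeans() as numbers would impose an ordering that nominal categories do not have.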

My question is: how do I form clusters on these factors?

Here is what the data looks like:

RowNum|EmpNum|EmpName|EmpOrganization|EmpTitle|EmpLeaderNumber|EmpDepartment|EmpAccesstoApplicaton|EmpAccessID

The entire dataset is 14 MB.

The effort is to cluster people with similar access. So people with similar Title or in similar org might have similar access. I understand kmeans() isn't the best option, but that is what I would like to use for the first draft.

I converted EmpOrganization, EmpTitle, etc. to numeric data in Excel using a simple VLOOKUP. It is easy to convert these to indicator variables using an IF statement in Excel, but I'm hoping that there is a more efficient way to do this in R itself.
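
A rough sketch of what the indicator-variable route could look like in R itself, assuming the columns are factors. The data frame `emp` and its values are invented for illustration, and recoding NA as a "(missing)" level is just one possible way to keep those rows:

```r
# Hypothetical toy data; column names follow the question, values are made up
emp <- data.frame(
  EmpOrganization = factor(c("Sales", "Sales", "IT", "IT", NA, "HR")),
  EmpTitle        = factor(c("Analyst", "Manager", "Analyst", "Engineer", "Engineer", NA))
)

# Recode NA as an explicit level so rows with missing values are not dropped
emp[] <- lapply(emp, function(x) {
  x <- as.character(x)
  x[is.na(x)] <- "(missing)"
  factor(x)
})

# One 0/1 indicator column per level of every factor (no reference level dropped)
indicators <- model.matrix(
  ~ . - 1,
  data = emp,
  contrasts.arg = lapply(emp, contrasts, contrasts = FALSE)
)

# First-draft k-means on the indicator matrix; k = 2 is an arbitrary choice here
set.seed(1)
fit <- kmeans(indicators, centers = 2, nstart = 10)
fit$cluster
```

On 0/1 indicators the cluster means are just level proportions, so this is only a rough first pass rather than a method designed for categorical data.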

Jo Bennet
  • Can you show the code you do have? – doctorlove Aug 19 '14 at 17:50
  • Surely something like `fit – doctorlove Aug 19 '14 at 17:55
  • Do you just not know how to do this in R? Or do you not know how to do this in any language? If you want suggestions for methods on clustering categorical data, you're better off asking at [stats.se]; that is not a specific programming question. –  Aug 19 '14 at 18:12
  • You have to specify what the required result is. Is there any relationship between the categorical variables (e.g. hierarchies)? What should the clusters represent? Do you want to identify groups of roughly equal frequency, or do you actually have a supervised learning problem, where you want to find 'clusters' that have a similar effect (in which case you are better off with a tree-building package such as rpart)? – seanv507 Aug 20 '14 at 07:47
  • Are the categories ordinal or nominal? – Glen_b Aug 20 '14 at 07:55
  • Doctorlove, thank you for the suggestion. I couldn't use that since my data has a lot of NAs and I would like to leave that in. Is there a way to work around that? – Jo Bennet Aug 21 '14 at 15:40
  • Hi MrFlick, I have created clusters in other languages before. Also, I know kmeans isn't the best clustering option for this data. But for right now, kmeans in R is what I should use, even though it isn't optimal. – Jo Bennet Aug 21 '14 at 16:08
  • Hi Glen_b, there are 10 rows. 2 are ordinal, 8 are nominal. – Jo Bennet Aug 21 '14 at 16:09
  • In what sense "should [you] use" kmeans? It isn't clear that k-means is even really possible w/ all categorical data. You can get various distance matrices & use other clustering algorithms, though. – gung - Reinstate Monica Aug 21 '14 at 16:52
  • Doctorlove, I just realized that fit – Jo Bennet Aug 21 '14 at 18:21
  • gung, yes, I just realized that. I used a similar procedure in Minitab for multiple dimensions, so I figured it works the same way, but I was wrong. What other algorithm could be used that 1) works on means, 2) can be used with multiple variables, and 3) can be replicated easily in any language? Thanks a lot for all your inputs. – Jo Bennet Aug 21 '14 at 18:23

1 Answer

In R's cluster package you can use daisy(), which gives you a dissimilarity matrix and also works for mixed variable types. You can then pass that matrix directly to any clustering function that accepts dissimilarities.
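
A minimal sketch of that workflow, assuming the columns are factors. The data frame `emp` and its values are made up, and both Gower's metric and the choice of two clusters are illustrative assumptions:

```r
library(cluster)

# Hypothetical toy data; column names follow the question, values are made up
emp <- data.frame(
  EmpOrganization = factor(c("Sales", "Sales", "IT", "IT", NA, "HR")),
  EmpTitle        = factor(c("Analyst", "Manager", "Analyst", "Engineer", "Engineer", NA)),
  EmpDepartment   = factor(c("East", "East", "West", "West", "West", "East"))
)

# Gower dissimilarity: handles nominal factors and leaves NA values
# out of each pairwise comparison instead of dropping the row
d <- daisy(emp, metric = "gower")

# Hierarchical clustering on the dissimilarities, cut into 2 groups
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)

# Partitioning around medoids, a k-means-like method that accepts dissimilarities
pam_fit <- pam(d, k = 2)
pam_fit$clustering
```

Two rows that share no observed variable at all get an NA dissimilarity, which hclust() and pam() will reject, so heavily incomplete rows may need separate handling. Columns stored as ordered factors are treated by daisy() as ordinal rather than nominal.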

Parag