
First of all, the data I'm using can be found here.

My code is:

library(readxl)

# load the ProUni spreadsheet (path kept as in the original question)
prouni <- read_excel("C:/Users/e270860661/Downloads/prouni.xlsx",
                     sheet = "Formatados_para_rodar")
View(prouni)
str(prouni)

# keep only the three variables of interest
coercivo <- data.frame(prouni$mensalidade, prouni$Medicina,
                       prouni$nota_integral_ampla)
summary(coercivo)

# flag and keep only complete cases (drop rows with missing values)
coercivo$dropador <- complete.cases(coercivo)
summary(coercivo$dropador)
final <- coercivo[coercivo$dropador == TRUE, ]
final$dropador <- NULL

# k-means with 4 clusters on the raw (unscaled) data
set.seed(100)
analise <- kmeans(final, centers = 4)
str(analise)

# pairs plot coloured by cluster assignment
plot(final, col = analise$cluster)

The plot I get is: [pairs plot of mensalidade, Medicina and nota_integral_ampla, coloured by cluster]

For context, "mensalidade" means "tuition", "Medicina" is a dummy variable (1 if the program is Medicine, 0 otherwise), and "nota_integral_ampla" is equivalent to the SAT score required to be admitted to the program.

My problem is that the clustering doesn't seem to be working "multivariably". The algorithm seems to have chosen tuition thresholds and classified observations considering only those thresholds. Is my intuition right, or is k-means supposed to do this? Is there a coding error?

I'm an economist by training, so this is all very new to me; sorry if it's a poorly elaborated question.

2 Answers


I think what you are seeing is an artefact of having data on different scales, using a somewhat inappropriate $k$, and potentially employing an inappropriate clustering algorithm.

As the dimensions of your sample are on substantially different scales, it is problematic to use the raw data directly: simple Euclidean distances between data points are potentially misleading when the differences in scale do not translate into differences in importance. If we think that all features are equally important, the easiest way to achieve that is to normalise the data, e.g. `newfinal <- apply(final, 2, scale)`, so that the features in the columns of `newfinal` have mean $0$ and variance $1$.

That said, I do not think that $k=4$ is a good choice, even with standardised data. I say that based on how the data look, as well as on what standard metrics for choosing $k$, such as the gap statistic (available in `cluster::clusGap`), suggest. Running something like `clusGap(newfinal, kmeans, K.max = 6, B = 100)` strongly suggests that $k=4$ is overkill and that $k=2$ is a much more reliable option. You can use other metrics too (e.g. fitting a Gaussian mixture model and examining the relevant AIC/BIC values), but I suspect these will also suggest $k<4$.
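A minimal sketch of that workflow, assuming the `final` data frame built in the question and that the `cluster` package is installed (`nstart = 25` and the object names `gap` and `analise_std` are just illustrative choices of mine):

library(cluster)

# put all three features on a common scale (mean 0, variance 1)
newfinal <- apply(final, 2, scale)

# gap statistic for k = 1..6 with 100 bootstrap reference sets;
# extra arguments such as nstart are passed on to kmeans
set.seed(100)
gap <- clusGap(newfinal, kmeans, K.max = 6, B = 100, nstart = 25)
print(gap, method = "firstSEmax")   # reports the suggested number of clusters

# re-run k-means on the standardised data with the suggested k (2 here)
analise_std <- kmeans(newfinal, centers = 2, nstart = 25)
plot(as.data.frame(newfinal), col = analise_std$cluster)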

Check out MATLAB's tutorial on Cluster Using Gaussian Mixture Models and its relevant links. It is well written and easy to follow. $k$-means is a special case of a GMM so this can help your understanding a lot.
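If you prefer to stay in R rather than MATLAB, one possible sketch uses the `mclust` package (my choice of package, not something the tutorial requires); it fits Gaussian mixture models by EM and picks the number of components by BIC, using the standardised `newfinal` from above:

library(mclust)

# fit Gaussian mixtures with 1 to 6 components on the standardised data;
# Mclust chooses the covariance structure and the number of components by BIC
gmm <- Mclust(newfinal, G = 1:6)
summary(gmm)              # selected model and number of components
plot(gmm, what = "BIC")   # BIC across the candidate numbers of components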

Aside from $k$-means, I would suggest looking into a density-based approach like DBSCAN. This algorithm can be run easily with the function `dbscan::dbscan`. The DBSCAN results suggest two major clusters and 2-3 smaller/outlier clusters. The relevant command will be something like `analise2 <- dbscan(newfinal, eps = 1)`. Notice that DBSCAN requires you to define a smallest relevant "neighbourhood size", the parameter `eps`; there is a very good thread here explaining the role of `eps` in more detail.
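A sketch of that route, assuming the `dbscan` package and the standardised `newfinal` from above; `eps = 1` and `minPts = 5` are only starting values that you should tune, for example with a k-nearest-neighbour distance plot:

library(dbscan)

# k-NN distance plot: the "elbow" suggests a reasonable eps
kNNdistplot(newfinal, k = 5)
abline(h = 1, lty = 2)              # candidate eps used below

# run DBSCAN; points labelled 0 are noise/outliers
analise2 <- dbscan(newfinal, eps = 1, minPts = 5)
table(analise2$cluster)

# colour the pairs plot by cluster (+1 so noise points get a visible colour)
plot(as.data.frame(newfinal), col = analise2$cluster + 1)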

usεr11852
  • Late by 20 sec. Oh well. – usεr11852 Jul 16 '18 at 23:57
  • I'm almost ashamed of the half-assed answer I gave comparing to yours. XD – Marcelo Ventura Jul 17 '18 at 00:02
  • We good, no worries. Practice makes perfect. (+1) – usεr11852 Jul 17 '18 at 00:04
  • Damn, that's a great answer. I have a question now, which I hope you (and Marcelo <3) can answer too. Is changing mensalidade to standard-deviations from mean mensalidade something to consider as far as clustering _tecniqués_ go? I mean, it does make sense to the economist in me. I'm more interested in relative tuition costs (thus, prices), not on the exact amount of brazilian reais someone spends on a given program. Anyway, I'll check dbscan out. Thanks for the tips! – Pedro Cavalcante Jul 17 '18 at 03:28
  • _Let me know if this should be in a different question_ I've run this code: analise2 – Pedro Cavalcante Jul 17 '18 at 08:16
  • @PedroCavalcanteOliveira: 0. I am happy I could help! 1. Yes, using relative costs would be very reasonable; it will also make your findings potentially more transferable. 2. You seem to be using the principal component scores as surrogate data. That is not a bad idea! (Actually, I would say it is a rather good one!) I am a bit confused as to which dbscan function you are using. I think this is more suitable for a separate question. – usεr11852 Jul 17 '18 at 18:50
  • As a principle: when clustering, the measure of similarity between our points is as important as, if not more important than, the actual algorithm we use. That is why we normalise our sample almost by convention; the most common distance (i.e. Euclidean) usually gives coherent similarity scores for points coming from a normalised sample. That is not always the case, though, which is why we have many different distance metrics. See for example the Gower distance (Gower 1971, *Biometrika*), which works natively with both categorical and continuous data (a small sketch follows this comment thread). – usεr11852 Jul 17 '18 at 18:50
  • Hey, I made a question specifically about PCA. Here it is, if you guys don't mind helping me out a bit more: https://stats.stackexchange.com/questions/362349/how-do-i-interpret-principal-components-in-k-means-analysis – Pedro Cavalcante Aug 16 '18 at 01:33
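As a hedged illustration of the Gower-distance idea from the comments above, assuming the `cluster` package and the `final` data frame from the question (`k = 2` is only an example, and column 2 of `final` is the Medicina dummy):

library(cluster)

# treat the 0/1 dummy as a factor so Gower handles it as categorical
misto <- final
misto[[2]] <- factor(misto[[2]])     # column 2 is the Medicina dummy

# Gower dissimilarities work natively with mixed categorical/continuous data
d <- daisy(misto, metric = "gower")

# partition around medoids on the dissimilarity matrix
pam_fit <- pam(d, k = 2)
table(pam_fit$clustering)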

Welcome to Cross Validated!

Notice that your two quantitative variables are on different scales, with mensalidade ranging from zero to ten thousand and nota_integral_ampla ranging from roughly 450 to 800. Since k-means minimizes a sum of squares involving the three variables, mensalidade's variability swallows the variability of the other two.

Standardize both quantitative variables and give it another try. (Notice that medicina already ranges between 0 and 1, so there is no need to standardize it too.)
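A minimal version of that suggestion, assuming the `final` data frame from the question (column 1 is mensalidade, column 3 is nota_integral_ampla; `final_pad` and `analise_pad` are just illustrative names):

# standardise only the two quantitative columns; leave the 0/1 dummy as it is
final_pad <- final
final_pad[, c(1, 3)] <- scale(final_pad[, c(1, 3)])

set.seed(100)
analise_pad <- kmeans(final_pad, centers = 4)   # or a smaller k, as the other answer argues
plot(final_pad, col = analise_pad$cluster)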

Marcelo Ventura
  • (+1 for spotting the scale issue.) Notice that normalising the indicator variable `medicina` is not a bad idea. If we ultimately use a technique that relies on the covariance matrix of our data, it is sensible to have variance $1$ across all our features. A feature's values being `0/1` does not guarantee unit variance. – usεr11852 Jul 17 '18 at 00:07
  • Indeed, it will not guarantee unit variance, but that is not even the point. The difference in scale is so grossly wide that the magnitude of the dummy's variance is immaterial. Once the two other variables are standardized (and brought to the same unit-variance scale), the variance of the dummy will be roughly of the same order of magnitude as that of the two quantitative variables, so rescaling it will have only a marginal effect (a quick numeric check follows these comments). – Marcelo Ventura Jul 17 '18 at 01:34
  • Marcelo, I wish I could accept your answer too, but the other one certainly took more time to be written, so I'll give user11852 the points, sorry. I'm very thankful for your answer, tho! – Pedro Cavalcante Jul 17 '18 at 03:29
  • @PedroCavalcanteOliveira I could not agree more. His answer is way more complete than mine. **I** would vote his. – Marcelo Ventura Jul 17 '18 at 03:48
  • Cross Validated will not allow me to greet Pedro with a hello. I keep trying to include that polite opener, but Cross Validated is being ruthless with my attempts. – Marcelo Ventura Jul 17 '18 at 08:34
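A quick numeric check of the point about the dummy's variance from the comments above, assuming the `final` data frame from the question (column 1 is mensalidade, column 2 is the Medicina dummy):

# a 0/1 dummy has variance p*(1-p) <= 0.25, already comparable in order of
# magnitude to the unit variance of a standardised quantitative column
p <- mean(final[[2]])                     # share of Medicine programs
c(dummy_var  = p * (1 - p),
  scaled_var = var(as.numeric(scale(final[[1]]))))   # equals 1 by construction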