First of all, the data I'm using can be found here.
My code is:
library(readxl)
# Load the formatted sheet from the workbook
prouni <- read_excel("C:/Users/e270860661/Downloads/prouni.xlsx",
                     sheet = "Formatados_para_rodar")
View(prouni)
str(prouni)
# Keep only the three variables used for clustering
coercivo <- data.frame(prouni$mensalidade, prouni$Medicina,
                       prouni$nota_integral_ampla)
summary(coercivo)
# Flag rows with no missing values and keep only those
coercivo$dropador <- complete.cases(coercivo)
summary(coercivo$dropador)
final <- coercivo[coercivo$dropador, ]
final$dropador <- NULL
# Run k-means with 4 clusters and plot the assignments
set.seed(100)
analise <- kmeans(final, centers = 4)
str(analise)
plot(final, col = analise$cluster)
For context, "mensalidade" means "tuition", "Medicina" is a dummy variable (1 if the program is Medicine, 0 otherwise), and "nota_integral_ampla" is roughly equivalent to the minimum SAT score required for admission to the program.
My problem is that the clustering doesn't seem to use all three variables. The algorithm appears to have picked tuition thresholds and assigned observations to clusters based on those thresholds alone. Is my intuition right, or is kmeans supposed to behave this way? Is there a coding error?
I'm an economist by training, so this is all very new to me; sorry if the question is poorly worded.
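One check that might help narrow this down, sketched on made-up data rather than my real file: compare the variance of each column, then see how much of the between-cluster separation each variable accounts for, with and without `scale()`. Since `kmeans` works with Euclidean distance on the raw values, I suspect the column with the largest spread could dominate. The `toy` data frame, its value ranges, and the `between_share` helper are all assumptions for illustration, not part of my actual data or of any package.

```r
# Synthetic stand-in for the real data (all values are made up for illustration)
set.seed(100)
n <- 300
toy <- data.frame(
  mensalidade         = runif(n, 300, 9000),  # tuition: assumed wide numeric range
  Medicina            = rbinom(n, 1, 0.1),    # 0/1 dummy
  nota_integral_ampla = runif(n, 450, 800)    # cutoff score: assumed narrower range
)

# Per-column variance: a much larger variance means a much larger
# contribution to squared Euclidean distance
sapply(toy, var)

# k-means on the raw columns vs. on standardized columns
raw <- kmeans(toy, centers = 4, nstart = 25)
std <- kmeans(scale(toy), centers = 4, nstart = 25)

# Hypothetical helper: share of the between-cluster sum of squares
# that comes from each column
between_share <- function(x, cl) {
  x <- as.matrix(x)
  grand <- colMeans(x)
  bss <- sapply(seq_len(ncol(x)), function(j) {
    sum(tapply(x[, j], cl, function(v) length(v) * (mean(v) - grand[j])^2))
  })
  setNames(bss / sum(bss), colnames(x))
}

between_share(toy, raw$cluster)         # raw columns
between_share(scale(toy), std$cluster)  # standardized columns
```

If the raw-column shares are concentrated almost entirely on `mensalidade` while the standardized ones are spread across variables, that would suggest the behavior comes from the variables' scales rather than from a coding error.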