
I have a dataset with no labels. I've tried k-means to see how the data separates into multiple groups, but haven't had much luck.

As an alternative approach, I was thinking of "manufacturing" target labels and then running a classification model (binary or multinomial) to detect patterns.

For example, in the Titanic dataset, say we didn't have access to the "survival" target variable. I could combine a couple of features (say, ticket fare × age) to create a new target variable, bucket it into 5 groups (quintiles), and then run a multinomial classification model against this new target.
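Roughly, the label-manufacturing step would look something like this (just a sketch; the file name and the `Fare`/`Age` column names are assumptions about a standard Titanic CSV):

```python
import pandas as pd

df = pd.read_csv("titanic.csv")             # hypothetical file/path
df = df.dropna(subset=["Fare", "Age"])      # keep rows where both features are present

# Manufacture a score from two features and cut it into quintiles (labels 0..4).
score = df["Fare"] * df["Age"]
df["target"] = pd.qcut(score, q=5, labels=False, duplicates="drop")
```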

If the resulting AUC (or some other metric) is high, could we conclude that the model predictions can be grouped into "clusters"? I know it would be hard to determine what each group represents, but can we at least conclude that these groups exhibit a pattern?
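Continuing the sketch above, the classification step would be something like the following (again an assumption-laden sketch: the feature columns are guesses, and `Fare`/`Age` are deliberately excluded so the classifier isn't just re-learning the function that built the target):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Assumed Titanic-style columns; Fare and Age are left out on purpose.
feature_cols = ["Pclass", "SibSp", "Parch"]
X, y = df[feature_cols], df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)

# One-vs-rest ROC AUC on held-out data for the 5 manufactured classes.
print("multi-class AUC:", roc_auc_score(y_test, proba, multi_class="ovr"))
```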

jjreddick
  • It seems a little artificial and perhaps biased in the sense that your "manufactured" labels are the result of some function $f$ you have defined; hence, the learning problem is simply to approximate $f$. If you could do this, why not just directly categorize/cluster them based on that $f$, rather than add the extra step? Incidentally, did you consider other clustering algorithms? – user3658307 Aug 27 '17 at 04:49
  • Thank you for your response. I tried DBSCAN in addition to k-means and it didn't perform well either. – jjreddick Aug 27 '17 at 16:21
  • How would I use $f$ directly to categorize? To get the best $f$ I tried various combinations (including polynomials), ran them through a logistic model, and took the $f$ that gives the best AUC. I agree with you that this feels a bit artificial, but I'm running out of ideas given that I don't have labels and the standard clustering algorithms are not performing well. – jjreddick Aug 27 '17 at 16:30
  • I'm not sure if this is research you want to publish or what, but I'd suggest being careful about trying to *choose* the best $f$. It sounds like you have a "result you want" in mind and are trying to get the data to fit that. If you have some domain knowledge that you can build into the algorithm, that could be useful, but generally you want the conclusion to fall out of the data, not change the algorithm until the results match a predefined desired conclusion. I'm not saying you're doing that; I'm just suggesting what a reviewer might say upon reading such a thing :) – user3658307 Aug 27 '17 at 16:41
  • One thing you *can* do, however, is optimize with respect to a [clustering quality metric](http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). For instance, test many algorithms and parameters to optimize one of these measures, which is an objective "quality" measure for a clustering result (see the sketch below these comments). See also [here](https://stats.stackexchange.com/questions/21807/evaluation-measure-of-clustering-without-having-truth-labels). Your idea is not unreasonable, but it may be easier to simply improve the clustering, IMO. – user3658307 Aug 27 '17 at 16:44
  • Makes sense and will try your suggestions. Thank you for your feedback. – jjreddick Aug 28 '17 at 01:34
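A rough sketch of that metric-driven search: sweep a few algorithms and parameter settings and keep whichever clustering maximizes the silhouette score. The parameter grids, the scaling step, and the feature matrix `X` are assumptions, not part of the original comment.

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def best_clustering(X):
    """Return (score, description, labels) for the candidate with the best silhouette."""
    X = StandardScaler().fit_transform(X)
    candidates = []
    for k in range(2, 11):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        candidates.append((silhouette_score(X, labels), f"k-means, k={k}", labels))
    for eps in (0.3, 0.5, 1.0, 2.0):      # eps grid is a guess; tune for your data
        labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
        if len(set(labels)) > 1:          # silhouette needs at least two groups;
            candidates.append(            # DBSCAN noise (-1) is counted as a group here
                (silhouette_score(X, labels), f"DBSCAN, eps={eps}", labels))
    return max(candidates, key=lambda c: c[0])
```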

0 Answers