0

I have values of attribute between 0 and 1 which i want to predict. The distribution of values is shown in fig.
enter image description here
I want to predict this attribute. The problem is there are around 15 classes in this attribute . I want to label 15 classes to 6 like value of attribute greater than 0.9 describe as excellent. Between 0.7 and 0.9 very good and between 0.5 and 0.7 is good etc. But the problem is i am unable to find the standard way of doing this type of labeling. Like why i have given 0.9 a excellent label etc. Someone told me it has something to do with mean and stddev. But i am clueless about it. Also mean and stddev of column is shown in fig.
enter image description here

james
  • 101
  • I'm having trouble understand the question. It sounds like you're talking about binning continuous values in $[0, 1]$ into discrete categories like "good", "excellent", etc. based on pre-specified ranges for each bin. Are you asking what the the justification is for this procedure? – user20160 May 30 '16 at 06:30
  • @user20160 exactly. i want justification for this. – james May 30 '16 at 06:44
  • There isn't any inherent justification. If you have some downstream system that requires discrete categories, that could be a valid reason. Absent something like that, you'll be throwing away information, and many people actually recommend against it. Search for "discretization" to read more. This post may be useful: http://stats.stackexchange.com/questions/104402/what-is-the-justification-for-unsupervised-discretization-of-continuous-variable – user20160 May 30 '16 at 06:58
  • Why not using **regression** and round the prediction value to the nearest **class** (which doesn't throw away information trough binning)? – sascha May 30 '16 at 10:24
  • @sascha how should i go about it? Could you please elaborate it more? – james May 30 '16 at 11:06

0 Answers0