
I would like to predict a LABEL (A, B or C) using a machine learning classification model.

My training data looks like this:

LABEL AGE12-18 AGE19-24 AGE25-35
A         10         30      60
B         40         20      40
C          5          5      90  

Here AGE12-18, AGE19-24 and AGE25-35 are the percentages of users in each cluster with ages in [12, 19), [19, 25) and [25, 35), so that

AGE12-18+AGE19-24+AGE25-35=100%

So I only have aggregates for A, B and C instead of the individual-level data.

I would like to transform this data so that I can predict the label of individual users with data like:

USER AGE    AGECAT
a    24   AGE19-24
b    32   AGE25-35

I was thinking of creating a new dataset in which each cluster has the same percentage of users per age category, such as:

LABEL               AGECAT
A       AGE12-18 x 10 rows
A       AGE19-24 x 30 rows
A       AGE25-35 x 60 rows

However, I don't really like this solution, as I am not sure it is going to work. I have seen a similar question about an aggregated dependent variable, but none about aggregated independent variables.

Does anybody know if this is correct, or if there is another way to build a classification model with this data? Thank you.

Paul Vbl

1 Answer


If you only have the percentages of each age category within each label, you cannot do much to predict the labels*. You would need the number** of people within each cell of your table. With those counts, creating one row per person would indeed work in the way you mention.
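As a rough sketch of the one-row-per-person idea, the snippet below expands cell counts into rows and estimates the label probabilities for a given age category by simple counting. The counts are hypothetical (they are not in the question); they were chosen only so that each label keeps the age percentages from the question's table, and the helper name `predict_proba` is made up for illustration.

```python
from collections import Counter

# HYPOTHETICAL cell counts (assumption, not from the question):
# number of people per label and age category. The row percentages
# match the question's table (A: 10/30/60, B: 40/20/40, C: 5/5/90).
counts = {
    "A": {"AGE12-18": 10, "AGE19-24": 30, "AGE25-35": 60},
    "B": {"AGE12-18": 400, "AGE19-24": 200, "AGE25-35": 400},
    "C": {"AGE12-18": 5, "AGE19-24": 5, "AGE25-35": 90},
}

# Expand the table into one (label, agecat) row per person.
rows = [
    (label, agecat)
    for label, cells in counts.items()
    for agecat, n in cells.items()
    for _ in range(n)
]


def predict_proba(agecat):
    """Estimate P(label | agecat) as the share of each label among
    the expanded rows that fall in this age category."""
    matching = Counter(label for label, cat in rows if cat == agecat)
    total = sum(matching.values())
    return {label: n / total for label, n in matching.items()}


probs = predict_proba("AGE19-24")
print(probs)
```

Any standard classifier could be fitted to `rows` instead of counting by hand; with a single categorical predictor it would essentially reproduce these proportions.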

* The problem is that you do not know how common each label is overall. Whatever the age breakdown within a particular label, your prediction still depends enormously on whether that label occurs in 0.1%, 50% or 99.9% of people overall.
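To illustrate this dependence, here is a small sketch that applies Bayes' rule to the percentages from the question under two different sets of overall label frequencies. Both sets of frequencies are assumptions made up for the example, since the question does not provide them, and the function name `posterior` is hypothetical.

```python
# P(agecat | label), taken from the percentages in the question.
likelihood = {
    "A": {"AGE12-18": 0.10, "AGE19-24": 0.30, "AGE25-35": 0.60},
    "B": {"AGE12-18": 0.40, "AGE19-24": 0.20, "AGE25-35": 0.40},
    "C": {"AGE12-18": 0.05, "AGE19-24": 0.05, "AGE25-35": 0.90},
}


def posterior(agecat, prior):
    """Bayes' rule: P(label | agecat) is proportional to
    P(label) * P(agecat | label)."""
    unnorm = {
        label: prior[label] * likelihood[label][agecat]
        for label in likelihood
    }
    total = sum(unnorm.values())
    return {label: value / total for label, value in unnorm.items()}


# ASSUMED overall label frequencies (not given in the question).
uniform = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}
c_rare = {"A": 0.4995, "B": 0.4995, "C": 0.001}

post_uniform = posterior("AGE25-35", uniform)
post_c_rare = posterior("AGE25-35", c_rare)

for name, post in [("uniform", post_uniform), ("C rare", post_c_rare)]:
    best = max(post, key=post.get)
    print(name, best, round(post[best], 3))
```

For a user in AGE25-35, the most probable label is C when the labels are equally common, but A when C covers only 0.1% of people, even though the age percentages are identical in both cases.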

** Percentages of each label within each age category would also be enough to get a prediction. Without the numbers behind those percentages, however, you would not be able to characterize the performance of your model or the uncertainty in your predictions (even if your data is a sample from the true population of interest).

Björn