
Say we have a binary classification problem that we solve with Naive Bayes. All features are categorical variables.

Say we focus on a single feature that takes one of $N$ possible values. If $N$ is high and we encode the feature with a discrete distribution, model complexity grows quickly (one $\theta$ per value, per feature, per class).

One way of reducing model complexity (and potentially improving generalization when $N$ is relatively high) would be to cluster the values of each variable, effectively using a smaller dictionary and reducing the number of $\theta$'s to estimate. This coarsens the probability mass function of every feature, but since we end up with fewer parameters to estimate, it could improve generalization performance.
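To make the idea concrete, here is a minimal sketch of one such coarsening: bin the values of a feature by their empirical class-1 rate, so values with similar class-conditional behavior share a single parameter. The function name, the data, and the choice of `n_bins` are illustrative assumptions, not part of the question.

```python
from collections import Counter


def group_values(xs, ys, n_bins=3):
    """Map each categorical value to a bin index based on its empirical
    P(y=1 | value). Values landing in the same bin would then share one
    theta. (Illustrative sketch; the binning rule is arbitrary.)"""
    counts = Counter(xs)                                   # occurrences of each value
    pos = Counter(x for x, y in zip(xs, ys) if y == 1)     # class-1 occurrences
    rate = {v: pos[v] / counts[v] for v in counts}         # empirical P(y=1 | value)
    return {v: min(int(r * n_bins), n_bins - 1) for v, r in rate.items()}


# Toy data: values "a" and "b" behave alike (always class 1), so they
# end up in the same bin and would share a single parameter.
xs = ["a", "a", "b", "b", "c", "c", "d", "d"]
ys = [1, 1, 1, 1, 0, 0, 0, 1]
mapping = group_values(xs, ys)
```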

Aside from cross-validation, what would be a principled way (good proxy) for identifying what values to group for a given variable/feature?

user30802
Amelio Vazquez-Reina

1 Answer


If the values are numeric, e.g., integers between 1 and 100, you can consider modeling this variable as continuous with a parametric distribution (e.g., Gaussian). This has fewer parameters and can hopefully generalize better.

If you are looking for more complicated models, take a look at this question.

HTH.

Amir