
Say we have a binary classification problem that we solve with Naive Bayes. All features are categorical variables.

Say we focus on a single feature that takes one of $N$ possible values. If $N$ is high and we encode the feature with a discrete distribution, model complexity grows quickly (one $\theta$ per value, per feature, per class).

One way of reducing model complexity (and potentially improving generalization when $N$ is relatively high) would be to cluster the values of each variable, effectively using a smaller dictionary and reducing the number of $\theta$'s to estimate. This coarsens the probability mass function of every feature, but since we end up with fewer parameters to estimate, it could improve generalization performance.
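To make the idea concrete, here is a minimal sketch of one such coarsening: bin the values of a feature by their empirical class-1 rate, so values with similar class-conditional behavior share a single parameter. The function name, the data, and the choice of `n_bins` are illustrative assumptions, not part of the question.

```python
from collections import Counter


def group_values(xs, ys, n_bins=3):
    """Map each categorical value to a bin index based on its empirical
    P(y=1 | value). Values landing in the same bin would then share one
    theta. (Illustrative sketch; the binning rule is arbitrary.)"""
    counts = Counter(xs)                                   # occurrences of each value
    pos = Counter(x for x, y in zip(xs, ys) if y == 1)     # class-1 occurrences
    rate = {v: pos[v] / counts[v] for v in counts}         # empirical P(y=1 | value)
    return {v: min(int(r * n_bins), n_bins - 1) for v, r in rate.items()}


# Toy data: values "a" and "b" behave alike (always class 1), so they
# end up in the same bin and would share a single parameter.
xs = ["a", "a", "b", "b", "c", "c", "d", "d"]
ys = [1, 1, 1, 1, 0, 0, 0, 1]
mapping = group_values(xs, ys)
```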

Aside from cross-validation, what would be a principled way (good proxy) for identifying what values to group for a given variable/feature?

user30802
Amelio Vazquez-Reina

1 Answer


If the values are numeric, e.g., integers between 1 and 100, you can consider modeling this variable as continuous with a parametric distribution (e.g., Gaussian). This has fewer parameters and can hopefully generalize better.

If you are looking for more complicated models, take a look at this question.

HTH.

Amir