
I am training an ML model for a supervised classification problem. For this, I have been provided with two datasets. They both contain the same number of samples; however, dataset 1 includes data for 14 features, whilst dataset 2 includes data for 46 different features.

The question I want to answer is which dataset can be used to train a better model. However, I am wary that the datasets are "imbalanced" with regard to the number of features, and I am not sure whether I need to correct for this to make a fair comparison between the two. Specifically, our hypothesis is that the 46 features in dataset 2 are individually more informative than the 14 features in dataset 1, so a "feature-balanced" dataset 2, formed from 14 of its original 46 features, could be used to train a better model than dataset 1. However, I don't want to throw away data unnecessarily. Therefore, I was wondering whether I need to perform such "feature balancing" in order to make this comparison fairly.

kathryn

1 Answer


No, there is no need to "balance" the number of input features; many algorithms will discard uninformative features anyway. Just follow good practices:

  1. Assess your models on a holdout sample, not in-sample.
  2. Choose appropriate evaluation measures; accuracy is not one of them.
  3. Consider bootstrapping the entire procedure to get an idea of the variability of your model's quality. Maybe even test the difference in quality for statistical significance, as in the sketch after this list.
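
To make these three points concrete, here is a minimal sketch in Python with scikit-learn. The names `X1`, `X2`, and `y` are hypothetical NumPy arrays holding the 14-feature data, the 46-feature data, and the shared labels; logistic regression, log loss, and 200 replicates are illustrative assumptions, not prescriptions. The out-of-bag rows of each bootstrap resample serve as the holdout, and using the same resampled rows for both datasets pairs the two evaluations, which presumes the rows of the two datasets describe the same underlying samples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def oob_log_loss(X, y, train_idx, oob_idx):
    """Fit on the bootstrap rows, score on the out-of-bag rows
    (the holdout of point 1) with log loss (a proper score, point 2)."""
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[oob_idx])
    return log_loss(y[oob_idx], proba, labels=model.classes_)

def paired_bootstrap_diff(X1, X2, y, n_boot=200, seed=0):
    """Bootstrap the whole fit-and-evaluate procedure (point 3):
    resample rows once per replicate, evaluate both datasets on the
    same split, and record the difference in out-of-bag log loss."""
    rng = np.random.default_rng(seed)
    n = len(y)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        train_idx = rng.integers(0, n, size=n)           # draw n rows with replacement
        oob_idx = np.setdiff1d(np.arange(n), train_idx)  # rows never drawn
        # diffs[b] > 0 means dataset 2 achieved the lower loss on this replicate
        diffs[b] = (oob_log_loss(X1, y, train_idx, oob_idx)
                    - oob_log_loss(X2, y, train_idx, oob_idx))
    return diffs

# Usage (X1: n x 14 array, X2: n x 46 array, y: n labels):
# diffs = paired_bootstrap_diff(X1, X2, y)
# A 95% percentile interval for the difference in log loss; if it
# excludes zero, the quality difference is unlikely to be noise:
# print(np.percentile(diffs, [2.5, 97.5]))
```

Pairing the replicates (same resampled rows for both datasets) removes split-to-split noise from the comparison, so the bootstrap distribution of the difference is much tighter than comparing two independently bootstrapped scores.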

I personally would be more concerned about whether inferences drawn from two different datasets are comparable at all. Even if a model trained on the 14 features of dataset 1 outperforms a model trained on the 46 features of dataset 2 (or on a 14-feature subset of them), this may be due only to idiosyncrasies of your two models. If you want to generalize any findings, you need to ensure you have comparable data for evaluation.

Stephan Kolassa