
I am working on predicting a cost per location. There are 8 variables, one of which is a categorical variable, postal code, with over 300 levels covering entire provinces. Will that many levels mess up my predictive model, or is it better to use another method like binning to reduce the number of levels? I'm looking for advice, as I will be using decision trees, random forest, KNN, ANN, and logistic regression to get some answers.

Postal code carries information about individuals, such as job categories and average salary. My sample size is 3000 x 13, and I am answering several questions with the data set: estimating the loan amount requested and whether it will be repaid. Thank you.
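One common approach to the binning idea mentioned above is to collapse rare postal codes into a single "OTHER" level before encoding. A minimal sketch, assuming pandas is available (the column values and threshold are illustrative, not from the question's actual data):

```python
import pandas as pd

def collapse_rare_levels(series: pd.Series, min_count: int = 30) -> pd.Series:
    """Replace levels that occur fewer than min_count times with 'OTHER'."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), "OTHER")

# Toy example: three postal codes, one of which is rare.
codes = pd.Series(["A1A", "A1A", "B2B", "B2B", "C3C"])
collapsed = collapse_rare_levels(codes, min_count=2)
# "C3C" appears only once, so it is folded into "OTHER".
```

With ~3000 rows and 300+ levels, many codes will have only a handful of observations, so a frequency cutoff like this can sharply reduce the number of dummy columns.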

kjetil b halvorsen
memile
    Unless you give us more details, this is hardly answerable. For example, language models have features for words, so we are talking about thousands of features. What exactly is the problem? – Tim Oct 18 '19 at 19:39
  • We need more detail. In addition to what @Tim asked for, 1) if you are predicting cost, how are you using these methods, some of which are for categorical variables? 2) What's the nature of the postal codes (they often carry information) 3) What's your sample size? 4) Why are you using so many methods? – Peter Flom Oct 19 '19 at 11:27
  • Look into regularization, like the fused lasso. See https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels. For random forest see https://stats.stackexchange.com/questions/390671/random-forest-regression-with-sparse-data-in-python/430127#430127. – kjetil b halvorsen Oct 20 '19 at 23:06
  • Since you've now clarified that your feature is postal code, see https://stats.stackexchange.com/q/94902/232706 and https://datascience.stackexchange.com/q/10509/55122 as starters. – Ben Reiniger Oct 21 '19 at 03:02
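The threads linked in the comments above discuss target (mean) encoding as an alternative to dummy coding for high-cardinality features. A minimal sketch of a smoothed version, with illustrative names and an assumed pandas dependency (in practice it should be fit on training folds only, to avoid target leakage):

```python
import pandas as pd

def smoothed_target_encode(df: pd.DataFrame, cat_col: str,
                           target_col: str, m: float = 10.0) -> pd.Series:
    """Encode each level by a smoothed mean of the target.

    Levels with few observations are shrunk toward the global mean;
    m controls the strength of the shrinkage.
    """
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    smooth = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return df[cat_col].map(smooth)

# Toy example: m=0.0 reduces to the plain per-level mean.
df = pd.DataFrame({"postal": ["A", "A", "B"], "cost": [10.0, 20.0, 30.0]})
encoded = smoothed_target_encode(df, "postal", "cost", m=0.0)
```

This turns the 300-level postal code into a single numeric column, which all of the listed methods (trees, KNN, ANN, logistic regression) can consume directly.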

0 Answers