
I have a dataset to which new data is added every day. The inputs include categorical variables, so I use a one-hot encoder to create dummy variables. When a new category appears, the number of features increases by 1, because one more dimension is needed to hold the 0/1 indicator for that category. However, I suspect that this approach degrades prediction performance: the test error seems to increase even though more data is coming in. Here is a source stating that increasing the number of features reduces performance, and another source describing a similar drop in performance.

My question is: if new data is associated with a new category, does that decrease prediction performance? Would it perhaps be better to quarantine such data as outliers?
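The encoding behaviour described in the question can be illustrated with a minimal pure-Python sketch. The helper names `fit_categories` and `one_hot` and the colour values are hypothetical, chosen only for illustration; a real pipeline would more likely use a library encoder.

```python
def fit_categories(values):
    """Return the sorted list of distinct categories seen in `values`."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode `value` as a 0/1 list with one position per known category."""
    return [1 if value == c else 0 for c in categories]


# Illustrative data only: the categories seen up to day 1 and day 2.
day1 = ["red", "green", "red"]
day2 = day1 + ["blue"]  # a previously unseen category arrives

cats_day1 = fit_categories(day1)
cats_day2 = fit_categories(day2)

# Refitting the encoder on the larger data set adds one column.
width_day1 = len(one_hot("red", cats_day1))
width_day2 = len(one_hot("red", cats_day2))
print(width_day1, width_day2)
```

The new column is non-zero only for the rows that carry the new category, so on the day it first appears it is supported by very few observations.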

kjetil b halvorsen
kensaii
  • 1
    It should not create any issues. – Michael R. Chernick Mar 09 '17 at 03:07
  • 1
    @MichaelChernick: Thank you for your response. Would you please elaborate? Does the so-called curse of dimensionality not apply here at all? – kensaii Mar 09 '17 at 03:10
  • 2
    While the links you cited mentioned the curse of dimensionality, I did not mention it. – Michael R. Chernick Mar 09 '17 at 03:15
  • @Michael Chernick: How can you be so sure it does not create issues? When new levels of a categorical variable arrive over time, that indicates a very open categorical variable with very many levels, and maybe few occurrences of each individual level. In the worst case this could be a model whose number of parameters grows without bound over time, invalidating the usual asymptotics, for instance. I guess such a problem needs careful thought. – kjetil b halvorsen May 21 '17 at 11:16
  • See https://stats.stackexchange.com/questions/298137/dealing-with-new-factor-levels-in-a-regression-in-r/384749#384749 for ideas to handle new factor levels in data – kjetil b halvorsen Jan 26 '20 at 16:41
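One common way to keep the feature count fixed when new levels arrive is to pool unseen or rare levels into a single catch-all column. The sketch below is an assumption for illustration and not necessarily what the linked answer proposes; the names `OTHER`, `fit_vocabulary`, `encode` and the `min_count` threshold are all hypothetical.

```python
from collections import Counter

OTHER = "__other__"  # hypothetical label for the catch-all column


def fit_vocabulary(values, min_count=2):
    """Keep levels seen at least `min_count` times; the rest share OTHER."""
    counts = Counter(values)
    kept = sorted(c for c, n in counts.items() if n >= min_count)
    return kept + [OTHER]


def encode(value, vocabulary):
    """One-hot encode `value`; unseen or rare levels map to the OTHER column."""
    target = value if value in vocabulary else OTHER
    return [1 if target == c else 0 for c in vocabulary]


# Illustrative training data: "blue" occurs only once, so it is pooled.
train = ["red", "red", "green", "green", "blue"]
vocab = fit_vocabulary(train)

known = encode("red", vocab)
rare = encode("blue", vocab)
unseen = encode("purple", vocab)  # a level that arrives after fitting
print(vocab, known, rare, unseen)
```

With this design the encoded width stays at `len(vocab)` however many new levels appear later, at the cost of treating all pooled levels as indistinguishable. Whether that trade-off helps or hurts depends on the data and the model.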

1 Answer


Depending on the data and the model and how performance is assessed, new data with a new feature could either improve or worsen the performance.

Peter Flom