Does adding new categorical data decrease prediction performance in classification?

Question

I have a dataset to which new data is added every day. The inputs include categorical variables, so I use a one-hot encoder to create dummy variables. When a new category appears, the number of features increases by 1, because one more dimension is needed to hold the 0/1 indicator for that category. However, I suspect that this approach degrades prediction performance: the test error seems to increase even though more data is coming in. Here is a source stating that increasing the number of features reduces performance, and another source describing a similar drop in performance.

My question is: if new data is associated with a new category, does it decrease prediction performance? Perhaps it is better to quarantine the outlier data?
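To make the setup concrete, here is a minimal sketch in plain Python of what happens to the feature count when a new category shows up. The `one_hot` helper, the colour values, and the two "days" of data are all hypothetical illustrations, not the asker's actual pipeline or any library's API. It contrasts two options: re-deriving the category list (the feature vector grows by one column) versus keeping the original category list fixed (the unseen category is encoded as all zeros).

```python
def one_hot(values, categories):
    """Encode each value as a 0/1 vector over a fixed, ordered category list.

    Values that are not in `categories` become an all-zero vector.
    (Hypothetical helper written for this illustration.)
    """
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        if v in index:
            row[index[v]] = 1
        rows.append(row)
    return rows


# Made-up data: "blue" first appears on day 2.
day1 = ["red", "green", "red"]
day2 = ["green", "blue"]

cats_day1 = sorted(set(day1))         # categories known on day 1
cats_both = sorted(set(day1 + day2))  # categories after seeing day 2

# Option A: rebuild the category list, so the encoding gains a column.
refit = one_hot(day2, cats_both)

# Option B: keep the day-1 category list, so the width stays the same
# and the new category is encoded as all zeros.
fixed = one_hot(day2, cats_day1)

print(len(cats_day1), len(cats_both))
print(refit)
print(fixed)
```

Under option A a model trained on day 1 must be refit, since its inputs no longer have the same width. Under option B the model can still score the new rows, but it has no information about the new category. Which of the two gives better test error depends on the data and the model.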

@MichaelChernick: thank you for your response. would you please elaborate? the so-called curse of dimensionality doesn't apply here at all? — kensaii, Mar 09 '17 at 03:10
While the links you cited mentioned the curse of dimensionality, I did not mention it. — Michael R. Chernick, Mar 09 '17 at 03:15
@Michael Chernick: How can you be so sure it does not create issues? When new levels of a categorical variable arrive over time, that indicates a very open categorical variable with very many levels, and maybe few occurrences of each individual level. In the worst case this could be a model whose number of parameters increases over time to infinity, invalidating the usual asymptotics, for instance. I guess such a problem needs some thought — kjetil b halvorsen, May 21 '17 at 11:16
See https://stats.stackexchange.com/questions/298137/dealing-with-new-factor-levels-in-a-regression-in-r/384749#384749 for ideas to handle new factor levels in data — kjetil b halvorsen, Jan 26 '20 at 16:41

score 0 · Answer 1 · answered Jan 27 '20 at 11:40

Depending on the data, the model, and how performance is assessed, new data with a new feature could either improve or worsen performance.

Peter Flom
