7

This is a weird question, I know.

I'm just a noob and trying to learn about different classifier options and how they work. So I'm asking the question:

Given a dataset with $n_1$ dimensions and $n_2$ observations, where each observation can be classified into one of $n_3$ buckets, which algorithm most efficiently (ideally with just one training iteration) produces a model (classification boundary) that perfectly classifies every observation in the dataset (i.e., completely overfits)?

In other words, how does one most easily overfit?

(Please don't lecture me on 'not overfitting'. This is just for theoretical educational purposes.)

I have a suspicion that the answer is something like this: "Well, if your number of dimensions is greater than your number of observations, use algorithm X to draw the boundary(ies); otherwise use algorithm Y."

I also have a suspicion that the answer will say: "You can draw a smooth boundary, but that is more computationally expensive than drawing straight lines between all pairs of differently classified observations."

But that's as far as my intuition will guide me. Can you help?

I have a hand-drawn example of what I think I'm talking about in 2D with binary classification.

Basically, just split the difference, right? What algorithm does this efficiently for n-dimensions?


Legit Stack
  • 453
  • 2
  • 11
  • 5
    knn with $k=1$? – shimao May 01 '19 at 20:45
  • @shimao I suppose that would work, wouldn't it? Yeah, I can't see why not. Thank you very much! – Legit Stack May 01 '19 at 20:48
  • @shimao Is that the most efficient way to encode the Boundary? Probably right? Since we don't know if the data is entirely random or not, using the data itself as the encoded model with KNN algorithm is probably the best you can generally do. Right? – Legit Stack May 01 '19 at 20:55
  • 2
    @shimao: do you want to post your comment as an answer (perhaps with a little more detail)? – Stephan Kolassa May 01 '19 at 21:07
  • The title isn't very informative about the content. The vast majority of users coming to this page looking for answers are going to find the actual question asked very different than what they expected. Please revise. – OrangeSherbet May 03 '19 at 04:17
  • @OrangeSherbet I disagree, but maybe you're right. What title would you suggest? – Legit Stack May 05 '19 at 14:10
  • @OrangeSherbet Not sure how you can disagree. "How do I overfit efficiently" or "How does one most easily overfit" are vastly different questions than "How can I overfit", which is so broad it doesn't even have a useful answer. – OrangeSherbet May 07 '19 at 05:35
  • @OrangeSherbet It is done, those were excellent suggestions. It's kind to lead with suggestions rather than "Please revise." – Legit Stack May 07 '19 at 17:18

2 Answers

7

As long as all the observations are unique, K-nearest neighbors with $K$ set to 1 (and any arbitrary valid distance metric) will give a classifier that perfectly fits the training set, since the nearest neighbor of every point in the training set is, trivially, itself. And it's probably the most efficient, since no training at all is needed.

"Is that the most efficient way to encode the Boundary? Probably right? Since we don't know if the data is entirely random or not, using the data itself as the encoded model with KNN algorithm is probably the best you can generally do. Right?"

It's the most time-efficient, but not necessarily the most space-efficient: the "model" is simply the entire training set.
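
A minimal sketch of this in R, assuming the class package's knn() (any 1-NN implementation will do; the random data and labels below are purely illustrative):

library(class)   # provides knn()

set.seed(1)
n_obs <- 100; n_dim <- 5
X <- matrix(runif(n_obs * n_dim), nrow = n_obs)               # unique random features
y <- factor(sample(c("a", "b", "c"), n_obs, replace = TRUE))  # arbitrary labels

# "Predict" the training set with itself: each point's nearest neighbor
# (at distance 0) is the point itself, so every label is reproduced.
pred <- knn(train = X, test = X, cl = y, k = 1)
mean(pred == y)   # 1, i.e. a perfect (over)fit of the training data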

shimao
  • 22,706
  • 2
  • 42
  • 81
5

You can't.

At least not in general, if you want a perfect fit for arbitrary data and arbitrary dimensionality.

As an example, suppose we have $n_1=0$ predictor dimensions (i.e., none at all) and $n_2=2$ observations classified into $n_3=2$ buckets. The two observations are classified into two different buckets, namely "chocolate" and "vanilla".

Since you don't have any predictors, you will not be able to classify them perfectly, period.
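
A minimal R sketch of that situation: with no predictors, a logistic regression reduces to an intercept-only model, so both observations necessarily get the same fitted probability.

outcome <- c(TRUE, FALSE)                        # one "chocolate", one "vanilla"
model <- glm(outcome ~ 1, family = "binomial")   # no predictors: intercept only
predict(model, type = "response")                # 0.5 for both -- they can never be separated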


If you have at least one predictor that takes different values on each observation, then you can indeed overfit arbitrarily badly, simply by using arbitrarily high polynomial orders for a numerical predictor (if the predictor is categorical with different values on each observation, you don't even need to transform). The tool or model is pretty much secondary. Yes, it's easy to overfit.

Here is an example. The 10 observations are completely independent of the single numerical predictor. We fit increasingly complex logistic regressions on higher and higher powers of the predictor and classify using a threshold of 0.5 (which, incidentally, is not good practice). Correctly fitted points are marked in green, incorrectly fitted ones in red.

[Figure: logistic regression fits of order 1, 2, 3, 5, 7, and 9; correctly classified points in green, misclassified points in red]

R code:

nn <- 10       # number of observations
set.seed(2)    # for reproducibility

predictor <- runif(nn)       # a single numerical predictor
outcome <- runif(nn)>0.5     # binary outcome, generated independently of the predictor

# a first look at the raw data
plot(predictor,outcome,pch=19,yaxt="n",ylim=c(-0.1,1.6))
axis(2,c(0,1),c("FALSE","TRUE"))

orders <- c(1,2,3,5,7,9)                        # polynomial orders to fit
xx <- seq(min(predictor),max(predictor),0.01)   # grid for plotting the fitted curves

par(mfrow=c(3,2))    # one panel per polynomial order
for ( kk in seq_along(orders) ) {
    plot(predictor,outcome,pch=19,yaxt="n",ylim=c(-0.2,1.2),main=paste("Order:",orders[kk]))
    axis(2,c(0,1),c("FALSE","TRUE"))

    # logistic regression on a polynomial of the predictor
    model <- glm(outcome~poly(predictor,orders[kk]),family="binomial")
    fits_obs <- predict(model,type="response")                               # fitted probabilities at the observations
    fits <- predict(model,newdata=data.frame(predictor=xx),type="response")  # fitted curve on the grid

    lines(xx,fits)
    # classify at a threshold of 0.5 and mark correctly (green) vs. incorrectly (red) fitted points
    correct <- (fits_obs>0.5 & outcome) | ( fits_obs<0.5 & !outcome)
    points(predictor[correct],outcome[correct],cex=1.4,col="green",pch="o")
    points(predictor[!correct],outcome[!correct],cex=1.4,col="red",pch="o")
}
Stephan Kolassa
  • 95,027
  • 13
  • 197
  • 357