How do I decide what span to use in LOESS regression in R?

Question

I am running LOESS regression models in R, and I want to compare the outputs of 12 different models with varying sample sizes. I can describe the actual models in more detail if it helps with answering the question.

Here are the sample sizes:

Fastballs vs RHH 2008-09: 2002
Fastballs vs LHH 2008-09: 2209
Fastballs vs RHH 2010: 527 
Fastballs vs LHH 2010: 449

Changeups vs RHH 2008-09: 365
Changeups vs LHH 2008-09: 824
Changeups vs RHH 2010: 201
Changeups vs LHH 2010: 330

Curveballs vs RHH 2008-09: 488
Curveballs vs LHH 2008-09: 483
Curveballs vs RHH 2010: 213
Curveballs vs LHH 2010: 162

The LOESS regression model is a surface fit, where the X location and the Y location of each baseball pitch are used to predict sw, the swinging-strike probability. I'd like to compare all 12 of these models, but setting the same span (e.g. span = 0.5) will yield different results because the sample sizes vary so widely.

My basic question is how do you determine the span of your model? A higher span smooths out the fit more, while a lower span captures more trends but introduces statistical noise if there is too little data. I use a higher span for smaller sample sizes and a lower span for larger sample sizes.

What should I do? What's a good rule of thumb when setting span for LOESS regression models in R? Thanks in advance!

Notice that the span measure would mean different window size for different number of observations. — Tal Galili, Aug 22 '10 at 04:10
Often I see loess being treated as a black box. Unfortunately, it isn't one. There is no other way than to look at the scatter plot with the superimposed loess curve and check whether it does a good job of describing the patterns in the data. **Iteration and residual checks are key in loess fitting**. — suncoolsu, Jul 17 '11 at 22:57
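To make the first comment concrete: loess uses roughly span × n of the nearest observations for each local fit, so a fixed span gives very different neighbourhood sizes across the 12 data sets. A small sketch using the sample sizes from the question; the labels and the target of 100 points per window are illustrative assumptions, not a recommendation.

```r
# Sample sizes from the question (labels are shorthand chosen here)
n <- c(FB_RHH_0809 = 2002, FB_LHH_0809 = 2209, FB_RHH_10 = 527, FB_LHH_10 = 449,
       CH_RHH_0809 = 365,  CH_LHH_0809 = 824,  CH_RHH_10 = 201, CH_LHH_10 = 330,
       CU_RHH_0809 = 488,  CU_LHH_0809 = 483,  CU_RHH_10 = 213, CU_LHH_10 = 162)

# Approximate number of observations in each local window at a fixed span
span <- 0.5
window_size <- round(span * n)

# Span that would give about the same window size everywhere
# (target is a hypothetical value, capped so the span never exceeds 1)
target <- 100
span_equal_window <- pmin(1, target / n)

print(data.frame(n = n, window_at_0.5 = window_size,
                 span_for_100 = round(span_equal_window, 3)))
```

Matching the window size this way swaps one arbitrary choice (the span) for another (the target), so it is at most a starting point for a data-driven choice such as cross-validation.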

score 18 · Answer 1 · edited Jan 19 '21 at 15:50

Cross-validation, for example k-fold, is often used if the aim is to find the fit with the lowest RMSEP (root mean squared error of prediction). Split your data into k groups and, leaving each group out in turn, fit a loess model to the remaining k-1 groups with a chosen value of the smoothing parameter, then use that model to predict for the left-out group. Store the predicted values for the left-out group and repeat until each of the k groups has been left out once. Using the full set of predicted values, compute RMSEP. Then repeat the whole thing for each value of the smoothing parameter you wish to tune over. Select the smoothing parameter that gives the lowest RMSEP under CV.

This is, as you can see, fairly computationally heavy. I would be surprised if there wasn't a generalised cross-validation (GCV) alternative to true CV that you could use with LOESS - Hastie et al (section 6.2) indicate this is quite simple to do and is covered in one of their exercises.

I suggest you read sections 6.1.1, 6.1.2 and 6.2, plus the sections on regularisation of smoothing splines in Chapter 5 (as the content applies here too), of Hastie et al. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition, Springer. The PDF can be downloaded for free.
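A GCV version can be sketched in a few lines, because a fitted loess object stores the trace of its smoother matrix in `trace.hat`. The simulated data, the span grid and the helper name `loess_gcv` are assumptions made for illustration; GCV is computed here as mean(residuals^2) / (1 - trace/n)^2.

```r
set.seed(1)
n <- 300
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

# GCV score for one span: mean squared residual, inflated by the
# effective number of parameters (trace of the smoother matrix)
loess_gcv <- function(span, x, y) {
  fit <- loess(y ~ x, span = span, degree = 2)
  n <- length(y)
  mean(residuals(fit)^2) / (1 - fit$trace.hat / n)^2
}

spans <- seq(0.1, 0.9, by = 0.05)
gcv <- sapply(spans, loess_gcv, x = x, y = y)
best_span <- spans[which.min(gcv)]
best_span
```

Each candidate span needs only one fit, which is why this is much cheaper than k-fold CV.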

Mike Lawrence · Answer 2 · 2010-08-22T03:14:30.497

I suggest checking out generalized additive models (GAM, see the mgcv package in R). I'm just learning about them myself, but they seem to automatically figure out how much "wiggly-ness" is justified by the data. I also see that you're dealing with binomial data (strike vs not a strike), so be sure to analyze the raw data (i.e. don't aggregate to proportions, use the raw pitch-by-pitch data) and use family='binomial' (assuming that you're going to use R). If you have information about what individual pitchers and hitters are contributing to the data, you can probably increase your power by doing a generalized additive mixed model (GAMM, see the gamm4 package in R) and specifying pitcher and hitter as random effects (and again, setting family='binomial'). Finally, you probably want to allow for an interaction between the smooths of X & Y, but I've never tried this myself so I don't know how to go about that. A gamm4 model without the X*Y interaction would look like:

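library(gamm4)

# In gamm4, random effects go in the separate `random` argument (lme4 syntax),
# not in the main formula
fit = gamm4(
    formula = strike ~ s(X) + s(Y) + pitch_type*batter_handedness
    , random = ~ (1|pitcher) + (1|batter)
    , data = my_data
    , family = binomial()
)
summary(fit$gam)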
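For the X-by-Y interaction, mgcv can fit a single two-dimensional smooth: `s(X, Y)` is an isotropic thin-plate smooth, and `te(X, Y)` is a tensor-product smooth for covariates on different scales. A minimal sketch on simulated pitch locations (the data frame `pitches` and its columns are made up here, and random effects are left out to keep it short):

```r
library(mgcv)

set.seed(42)
n <- 500
pitches <- data.frame(X = runif(n, -2, 2), Y = runif(n, 0, 5))
# simulated swing-and-miss probability, lowest near (0, 2.5)
p_true <- plogis(-2 + 0.5 * (pitches$X^2 + (pitches$Y - 2.5)^2))
pitches$strike <- rbinom(n, size = 1, prob = p_true)

# one smooth surface over location; smoothness is estimated by REML
fit2d <- gam(strike ~ s(X, Y), family = binomial, data = pitches, method = "REML")

# predicted probability on a grid, e.g. for a heat map
grid <- expand.grid(X = seq(-2, 2, length.out = 25), Y = seq(0, 5, length.out = 25))
grid$p_hat <- predict(fit2d, newdata = grid, type = "response")
```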

Come to think of it, you probably want to let the smooths vary within each level of pitch type and batter handedness. This makes the problem more difficult, as I've not yet found out how to let the smooths vary by multiple variables in a way that subsequently produces meaningful analytic tests (see my queries to the R-SIG-Mixed-Models list). You could try:

# one factor level per pitch type x handedness combination
my_data$dummy = factor(paste(my_data$pitch_type,my_data$batter_handedness))
fit = gamm4(
    formula = strike ~ s(X,by=dummy) + s(Y,by=dummy) + pitch_type*batter_handedness
    , random = ~ (1|pitcher) + (1|batter)
    , data = my_data
    , family = binomial()
)
summary(fit$gam)

But this won't give meaningful tests of the smooths. In attempting to solve this problem myself, I've used bootstrap resampling: on each iteration I obtain the model predictions for the full data space, then I compute the bootstrap 95% CIs for each point in the space and for any effects I care to compute.
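The bootstrap procedure described above can be sketched as follows. For brevity it uses a one-dimensional loess fit on simulated data rather than the GAMM; the data, grid, span and number of resamples are all illustrative assumptions.

```r
set.seed(7)
n <- 200
d <- data.frame(x = runif(n, 0, 10))
d$y <- sin(d$x) + rnorm(n, sd = 0.4)

grid <- data.frame(x = seq(0.5, 9.5, length.out = 50))  # the "data space"
B <- 200                                                # bootstrap iterations

boot_pred <- replicate(B, {
  idx <- sample(n, replace = TRUE)                      # resample rows
  fit <- loess(y ~ x, data = d[idx, ], span = 0.5,
               control = loess.control(surface = "direct"))
  predict(fit, newdata = grid)                          # predictions on the full grid
})

# percentile 95% interval at each grid point (rows of boot_pred)
ci <- apply(boot_pred, 1, quantile, probs = c(0.025, 0.975))
```

`surface = "direct"` is used so that predictions stay defined when a resample happens not to cover the edges of the grid.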

It appears that ggplot uses GAM for its geom_smooth function for N>1000 datapoints by default. — Learning stats by example, Jan 26 '19 at 00:32

score 7 · Answer 3 · answered Jan 12 '12 at 14:58

For a loess regression, my understanding as a non-statistician is that you can choose your span by visual interpretation (plot the fit for a range of span values and choose the one with the least smoothing that still seems appropriate), or you can use cross-validation (CV) or generalized cross-validation (GCV). Below is the code I used for GCV of a loess regression, based on code from Takezawa's excellent book, Introduction to Nonparametric Regression (from p. 219).

locv1 <- function(x1, y1, nd, span, ntrial)
{
    # GCV score of a loess fit for a single span value `sp`
    locvgcv <- function(sp, x1, y1)
    {
        nd <- length(x1)

        data1 <- data.frame(xx1 = x1, yy1 = y1)
        fit.lo <- loess(yy1 ~ xx1, data = data1, span = sp, family = "gaussian", degree = 2, surface = "direct")
        res <- residuals(fit.lo)

        # diagonal of the hat matrix: smooth each unit vector and keep the jj-th fitted value
        dhat2 <- function(x1, sp)
        {
            nd2 <- length(x1)
            diag1 <- diag(nd2)
            dhat <- rep(0, length = nd2)

            for(jj in 1:nd2){
                y2 <- diag1[, jj]
                data1 <- data.frame(xx1 = x1, yy1 = y2)
                fit.lo <- loess(yy1 ~ xx1, data = data1, span = sp, family = "gaussian", degree = 2, surface = "direct")
                ey <- fitted.values(fit.lo)
                dhat[jj] <- ey[jj]
            }
            return(dhat)
        }

        dhat <- dhat2(x1, sp)
        trhat <- sum(dhat)
        sse <- sum(res^2)

        cv <- sum((res/(1 - dhat))^2)/nd    # leave-one-out CV score (computed but not returned)
        gcv <- sse/(nd * (1 - (trhat/nd))^2)

        return(gcv)
    }

    # evaluate GCV at every candidate span (the `span` argument, not a global);
    # sapply returns a plain numeric vector
    gcv <- sapply(span, locvgcv, x1 = x1, y1 = y1)

    return(gcv)
}

and with my data, I did the following:

nd <- length(Edge2$Distance)
xx <- Edge2$Distance
yy <- lcap

ntrial <- 50
span1 <- seq(from = 0.5, by = 0.01, length = ntrial)

output.lo <- locv1(xx, yy, nd, span1, ntrial)
gcv <- unlist(output.lo)    # numeric vector of GCV scores, one per span

plot(span1, gcv, type = "n", xlab = "span", ylab = "GCV")
points(span1, gcv, pch = 3)
lines(span1, gcv, lwd = 2)
pgcvmin <- which.min(gcv)   # index of the span with the smallest GCV
spangcv <- span1[pgcvmin]
gcvmin <- gcv[pgcvmin]
points(spangcv, gcvmin, cex = 1, pch = 15)

Sorry the code is rather sloppy; this was one of my first times using R. It should still give you an idea of how to do GCV for loess regression to find the best span in a more objective way than simple visual inspection. On the above plot, you are interested in the span that minimizes the function (the lowest point on the plotted "curve").
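The visual route mentioned at the start of this answer can be sketched as below: overlay fits for a few spans on the scatter plot and look at the residuals of each. The simulated data and the three spans are arbitrary choices for illustration.

```r
set.seed(3)
x <- sort(runif(250, 0, 10))
y <- sin(x) + 0.1 * x + rnorm(250, sd = 0.3)

spans <- c(0.1, 0.4, 0.9)
cols <- c("red", "blue", "darkgreen")
fits <- lapply(spans, function(s) loess(y ~ x, span = s))

# scatter plot with the candidate curves superimposed
plot(x, y, col = "grey50", main = "loess fits for several spans")
for (i in seq_along(fits)) lines(x, fitted(fits[[i]]), col = cols[i], lwd = 2)
legend("topleft", legend = paste("span =", spans), col = cols, lwd = 2)

# residuals against x: leftover structure suggests the span is too large
op <- par(mfrow = c(1, 3))
for (i in seq_along(fits)) {
  plot(x, residuals(fits[[i]]), main = paste("span =", spans[i]), ylab = "residual")
  abline(h = 0, lty = 2)
}
par(op)
```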

score 5 · Answer 4 · edited Jul 18 '11 at 07:33

If you switch to a generalized additive model, you could use the gam() function from the mgcv package, in which the author assures us:

So, exact choice of k is not generally critical: it should be chosen to be large enough that you are reasonably sure of having enough degrees of freedom to represent the underlying ‘truth’ reasonably well, but small enough to maintain reasonable computational efficiency. Clearly ‘large’ and ‘small’ are dependent on the particular problem being addressed.

(k here is the basis dimension of the smoother, which sets the upper limit on its degrees of freedom; it plays a role akin to loess' smoothness parameter)
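A quick way to see this is to fit the same model with two values of k and compare the effective degrees of freedom the penalty settles on. A sketch with mgcv on simulated data (the data and the two k values are assumptions for illustration):

```r
library(mgcv)

set.seed(11)
n <- 400
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

fit_k20 <- gam(y ~ s(x, k = 20), method = "REML")
fit_k40 <- gam(y ~ s(x, k = 40), method = "REML")

# k only caps the flexibility; the estimated smoothness shows up as the
# effective degrees of freedom (summed here over intercept and smooth)
edf <- c(k20 = sum(fit_k20$edf), k40 = sum(fit_k40$edf))
edf
```

If the two values are close and well below the cap, the choice of k is not driving the fit; mgcv's gam.check() reports a related diagnostic.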

answered Jul 18 '11 at 02:12 by Mike Lawrence · edited Jul 18 '11 at 07:33 by Gavin Simpson

Thanks Mike :) I've seen from previous answers you are strong on GAM. I will have a look at it in the future, for sure :) – Tal Galili Jul 18 '11 at 06:21

score 4 · Answer 5 · answered Oct 13 '17 at 14:31

You could write your own cross validation loop from scratch that uses the loess() function from the stats package.

Set up a toy data frame.

set.seed(4)
x <- rnorm(n = 500)
y <- (x)^3 + (x - 3)^2 + (x - 8) - 1 + rnorm(n = 500, sd = 0.5)
plot(x, y)
df <- data.frame(x, y)

Set up useful variables to handle cross-validation loop.

span.seq <- seq(from = 0.15, to = 0.95, by = 0.05) #explores range of spans
k <- 10 #number of folds
set.seed(1) # replicate results
folds <- sample(x = 1:k, size = length(x), replace = TRUE)
cv.error.mtrx <- matrix(rep(x = NA, times = k * length(span.seq)), 
                        nrow = length(span.seq), ncol = k)

Run a nested for loop iterating over each span possibility in span.seq, and each fold in folds.

for(i in 1:length(span.seq)) {
  for(j in 1:k) {
    loess.fit <- loess(formula = y ~ x, data = df[folds != j, ], span = span.seq[i])
    preds <- predict(object = loess.fit, newdata = df[folds == j, ])
    cv.error.mtrx[i, j] <- mean((df$y[folds == j] - preds)^2, na.rm = TRUE)
    # some predictions result in `NA` because of the `x` ranges in each fold
  }
}

Calculate average cross-validation mean square error from each of the 10 folds: $$CV_{(10)} = \frac{1}{10} \sum_{i=1}^{10} MSE_i$$
```
cv.errors <- rowMeans(cv.error.mtrx)
```

Find which span resulted in the lowest $MSE$.

best.span.i <- which.min(cv.errors)
best.span.i
span.seq[best.span.i]

Plot your results.

plot(x = span.seq, y = cv.errors, type = "l", main = "CV Plot")
points(x = span.seq, y = cv.errors, 
       pch = 20, cex = 0.75, col = "blue")
points(x = span.seq[best.span.i], y = cv.errors[best.span.i], 
       pch = 20, cex = 1, col = "red")

best.loess.fit <- loess(formula = y ~ x, data = df, 
                        span = span.seq[best.span.i])

x.seq <- seq(from = min(x), to = max(x), length = 100)

plot(x = df$x, y = df$y, main = "Best Span Plot")
lines(x = x.seq, y = predict(object = best.loess.fit, 
                             newdata = data.frame(x = x.seq)), 
      col = "red", lwd = 2)

Welcome to the site, @hynso. This is a good answer (+1), & I appreciate your use of the formatting options the site affords. Note that we aren't supposed to be an R-specific site & our tolerance for questions specifically about R has diminished in the 7 years since this Q was posted. In short, it might be better if you could augment this w/ pseudocode for future viewers who don't read R. — gung - Reinstate Monica, Oct 13 '17 at 14:49
Cool, thanks for the tips @gung. I'll work on adding pseudocode. — hynso, Oct 13 '17 at 14:51

score 2 · Answer 6 · answered Jul 15 '15 at 16:44

Use the locfit package. It's a slightly modified version of loess but way faster. It also has a built-in function to calculate GCV: http://www.statistik.lmu.de/~leiten/Lehre/Material/GLM_0708/Tutorium/locfit.pdf

answered Jul 15 '15 at 16:44 by derp92

How do you tell locfit to choose the model with the best span using gcv? – skan Feb 18 '22 at 12:29

score 0 · Answer 7 · answered Jan 28 '19 at 21:11

The fANCOVA package provides an automated way to compute the ideal span using GCV or AICc:

library(fANCOVA)
FTSE.lo3 <- loess.as(Index, FTSE_close, degree = 1, criterion = "gcv", user.span = NULL, plot = FALSE)
FTSE.lo.predict3 <- predict(FTSE.lo3, data.frame(Index=Index))
