15

I have a zoo series with many missing values. I read that auto.arima can impute these missing values? Can anyone can teach me how to do it? thanks a lot!

This is what I have tried, but without success:

fit <- auto.arima(tsx)
plot(forecast(fit))
Avraham
  • 3,182
  • 21
  • 40
user3730957
  • 161
  • 1
  • 1
  • 4
  • As an addition to javlacalle and my answer below: I implemented these meanwhile in the imputeTS package. The function is called na.kalman and does Kalman smoothing on the state-space form of an ARIMA model – Steffen Moritz May 29 '18 at 23:40

2 Answers2

28

First, be aware that forecast computes out-of-sample predictions but you are interested in in-sample observations.

The Kalman filter handles missing values. Thus you can take the state space form of the ARIMA model from the output returned by forecast::auto.arima or stats::arima and pass it to KalmanRun.

Edit (fix in the code based on answer by stats0007)

In a previous version I took the column of the filtered states related to the observed series, however I should use the entire matrix and do the corresponding matrix operation of the observation equation, $y_t = Z \alpha_t$. (Thanks to @stats0007 for the comments.) Below I update the code and plot accordingly.

I use a ts object as a sample series instead of zoo, but it should be the same:

require(forecast)
# sample series
x0 <- x <- log(AirPassengers)
y <- x
# set some missing values
x[c(10,60:71,100,130)] <- NA
# fit model
fit <- auto.arima(x)
# Kalman filter
kr <- KalmanRun(x, fit$model)
# impute missing values Z %*% alpha at each missing observation
id.na <- which(is.na(x))
for (i in id.na)
  y[i] <- fit$model$Z %*% kr$states[i,]
# alternative to the explicit loop above
sapply(id.na, FUN = function(x, Z, alpha) Z %*% alpha[x,], 
  Z = fit$model$Z, alpha = kr$states)
y[id.na]
# [1] 4.767653 5.348100 5.364654 5.397167 5.523751 5.478211 5.482107 5.593442
# [9] 5.666549 5.701984 5.569021 5.463723 5.339286 5.855145 6.005067

You can plot the result (for the whole series and for the entire year with missing observations in the middle of the sample):

par(mfrow = c(2, 1), mar = c(2.2,2.2,2,2))
plot(x0, col = "gray")
lines(x)
points(time(x0)[id.na], x0[id.na], col = "blue", pch = 19)
points(time(y)[id.na], y[id.na], col = "red", pch = 17)
legend("topleft", legend = c("true values", "imputed values"), 
  col = c("blue", "red"), pch = c(19, 17))
plot(time(x0)[60:71], x0[60:71], type = "b", col = "blue", 
  pch = 19, ylim = range(x0[60:71]))
points(time(y)[60:71], y[60:71], col = "red", pch = 17)
lines(time(y)[60:71], y[60:71], col = "red")
legend("topleft", legend = c("true values", "imputed values"), 
  col = c("blue", "red"), pch = c(19, 17), lty = c(1, 1))

plot of the original series and the values imputed to missing observations

You can repeat the same example using the Kalman smoother instead of the Kalman filter. All you need to change are these lines:

kr <- KalmanSmooth(x, fit$model)
y[i] <- kr$smooth[i,]

Dealing with missing observations by means of the Kalman filter is sometimes interpreted as extrapolation of the series; when the Kalman smoother is used, missing observations are said to be filled in by interpolation in the observed series.

javlacalle
  • 11,184
  • 27
  • 53
  • Hi Javlacalle, thank you very much for your help. May I ask if there is any condition for the time series or this can apply for any? Could you explain a bit about these command lines? tmp – user3730957 Jun 25 '14 at 11:48
  • I checked again how `makeARIMA` defines the matrices of the state space form and I would say that the column taken by `id` is correct. The vector in the observation equation is defined in `makeARIMA` as: `Z – javlacalle Jun 25 '14 at 13:00
  • 1
    @user3730957 I have updated my answer fixing this issue with the indexing. – javlacalle Apr 02 '16 at 20:47
3

Here would be my solution:

# Take AirPassengers as example
data <- AirPassengers

# Set missing values
data[c(44,45,88,90,111,122,129,130,135,136)] <- NA


missindx <- is.na(data)

arimaModel <- auto.arima(data)
model <- arimaModel$model

#Kalman smoothing
kal <- KalmanSmooth(data, model, nit )
erg <- kal$smooth  

for ( i in 1:length(model$Z)) {
       erg[,i] = erg[,i] * model$Z[i]
}
karima <-rowSums(erg)

for (i in 1:length(data)) {
  if (is.na(data[i])) {
    data[i] <- karima[i]
  }
}
#Original TimeSeries with imputed values
print(data)

@ Javlacalle:

Thx for your post, very interesting!

I have two questions to your solution, hope you can help me:

  1. Why do you use KalmanRun instead of KalmanSmooth ? I read KalmanRun is considered extrapolation, while smooth would be estimation.

  2. I also do not get your id part. Why don't you use all components in .Z ? I mean for example .Z gives 1, 0,0,0,0,1,-1 -> 7 values. This would mean .smooth (in your case for KalmanRun states) gives me 7 columns. As I understand alle columns which are 1 or -1 go into the model.

    Let's say row number 5 is missing in AirPass. Then I would take the sum of row 5 like this: I would add value from column 1 (because Z gave 1), I wouldn't add column 2-4 (because Z says 0), I would add column 5 and I would add the negative value of column 7 (because Z says -1)

    Is my solution wrong? Or are they both ok? Can you perhaps explain to me further?

Steffen Moritz
  • 1,564
  • 2
  • 15
  • 22