How to use auto.arima to impute missing values

Question

I have a zoo series with many missing values. I read that auto.arima can impute these missing values? Can anyone can teach me how to do it? thanks a lot!

This is what I have tried, but without success:

fit <- auto.arima(tsx)
plot(forecast(fit))

As an addition to javlacalle and my answer below: I implemented these meanwhile in the imputeTS package. The function is called na.kalman and does Kalman smoothing on the state-space form of an ARIMA model — Steffen Moritz, May 29 '18 at 23:40

javlacalle · Answer 1 · 2016-04-02T20:35:12.883

First, be aware that forecast computes out-of-sample predictions but you are interested in in-sample observations.

The Kalman filter handles missing values. Thus you can take the state space form of the ARIMA model from the output returned by forecast::auto.arima or stats::arima and pass it to KalmanRun.

Edit (fix in the code based on answer by stats0007)

In a previous version I took the column of the filtered states related to the observed series, however I should use the entire matrix and do the corresponding matrix operation of the observation equation, $y_t = Z \alpha_t$. (Thanks to @stats0007 for the comments.) Below I update the code and plot accordingly.

I use a ts object as a sample series instead of zoo, but it should be the same:

require(forecast)
# sample series
x0 <- x <- log(AirPassengers)
y <- x
# set some missing values
x[c(10,60:71,100,130)] <- NA
# fit model
fit <- auto.arima(x)
# Kalman filter
kr <- KalmanRun(x, fit$model)
# impute missing values Z %*% alpha at each missing observation
id.na <- which(is.na(x))
for (i in id.na)
  y[i] <- fit$model$Z %*% kr$states[i,]
# alternative to the explicit loop above
sapply(id.na, FUN = function(x, Z, alpha) Z %*% alpha[x,], 
  Z = fit$model$Z, alpha = kr$states)
y[id.na]
# [1] 4.767653 5.348100 5.364654 5.397167 5.523751 5.478211 5.482107 5.593442
# [9] 5.666549 5.701984 5.569021 5.463723 5.339286 5.855145 6.005067

You can plot the result (for the whole series and for the entire year with missing observations in the middle of the sample):

par(mfrow = c(2, 1), mar = c(2.2,2.2,2,2))
plot(x0, col = "gray")
lines(x)
points(time(x0)[id.na], x0[id.na], col = "blue", pch = 19)
points(time(y)[id.na], y[id.na], col = "red", pch = 17)
legend("topleft", legend = c("true values", "imputed values"), 
  col = c("blue", "red"), pch = c(19, 17))
plot(time(x0)[60:71], x0[60:71], type = "b", col = "blue", 
  pch = 19, ylim = range(x0[60:71]))
points(time(y)[60:71], y[60:71], col = "red", pch = 17)
lines(time(y)[60:71], y[60:71], col = "red")
legend("topleft", legend = c("true values", "imputed values"), 
  col = c("blue", "red"), pch = c(19, 17), lty = c(1, 1))

You can repeat the same example using the Kalman smoother instead of the Kalman filter. All you need to change are these lines:

kr <- KalmanSmooth(x, fit$model)
y[i] <- kr$smooth[i,]

Dealing with missing observations by means of the Kalman filter is sometimes interpreted as extrapolation of the series; when the Kalman smoother is used, missing observations are said to be filled in by interpolation in the observed series.

Hi Javlacalle, thank you very much for your help. May I ask if there is any condition for the time series or this can apply for any? Could you explain a bit about these command lines? tmp — user3730957, Jun 25 '14 at 11:48
I checked again how `makeARIMA` defines the matrices of the state space form and I would say that the column taken by `id` is correct. The vector in the observation equation is defined in `makeARIMA` as: `Z — javlacalle, Jun 25 '14 at 13:00
@user3730957 I have updated my answer fixing this issue with the indexing. — javlacalle, Apr 02 '16 at 20:47

score 3 · Answer 2 · answered Apr 02 '16 at 18:36

Here would be my solution:

# Take AirPassengers as example
data <- AirPassengers

# Set missing values
data[c(44,45,88,90,111,122,129,130,135,136)] <- NA


missindx <- is.na(data)

arimaModel <- auto.arima(data)
model <- arimaModel$model

#Kalman smoothing
kal <- KalmanSmooth(data, model, nit )
erg <- kal$smooth  

for ( i in 1:length(model$Z)) {
       erg[,i] = erg[,i] * model$Z[i]
}
karima <-rowSums(erg)

for (i in 1:length(data)) {
  if (is.na(data[i])) {
    data[i] <- karima[i]
  }
}
#Original TimeSeries with imputed values
print(data)

@ Javlacalle:

Thx for your post, very interesting!

I have two questions to your solution, hope you can help me:

Why do you use KalmanRun instead of KalmanSmooth ? I read KalmanRun is considered extrapolation, while smooth would be estimation.
I also do not get your id part. Why don't you use all components in .Z ? I mean for example .Z gives 1, 0,0,0,0,1,-1 -> 7 values. This would mean .smooth (in your case for KalmanRun states) gives me 7 columns. As I understand alle columns which are 1 or -1 go into the model.

Let's say row number 5 is missing in AirPass. Then I would take the sum of row 5 like this: I would add value from column 1 (because Z gave 1), I wouldn't add column 2-4 (because Z says 0), I would add column 5 and I would add the negative value of column 7 (because Z says -1)

Is my solution wrong? Or are they both ok? Can you perhaps explain to me further?

I would recommend posting the second part of your answer as comments to @Javlacalle's post rather than within your own answer. — Patrick Coulombe, Apr 02 '16 at 19:27
tried ... but it says I have to have 50 reputation to comment — Steffen Moritz, Apr 02 '16 at 20:05

How to use auto.arima to impute missing values

2 Answers2

Linked