Here is another way to cross-validate time series that is worth sharing, especially since the question asks whether H2O can support time-series CV. The existing H2O
implementation is able to support the variant of time-series CV shown below with the help of the fold_column parameter.
Each fold holds out one contiguous block of the time-ordered observations and trains on the remaining blocks:
fold 1 : training [4 5 6 7 8 9], test [1 2 3]
fold 2 : training [1 2 3 7 8 9], test [4 5 6]
fold 3 : training [1 2 3 4 5 6], test [7 8 9]
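The block assignment behind these folds can be sketched in base R. This is only a minimal illustration: the names n, nfolds, fold_id and splits are made up for the example and are not part of H2O.

```r
# Assign each time-ordered observation to a contiguous block (fold).
n <- 9        # number of time-ordered observations (illustrative value)
nfolds <- 3   # number of folds (illustrative value)

# The remainders cycle 1,2,0,1,2,0,...; sort() groups them into
# blocks 0,0,0,1,1,1,2,2,2, and adding 1 makes the labels start at 1.
fold_id <- sort(seq_len(n) %% nfolds) + 1

# For each fold, the test set is that block and the training set is the rest.
splits <- lapply(seq_len(nfolds), function(k) {
  list(train = which(fold_id != k), test = which(fold_id == k))
})

# Print the splits in the same layout as the listing above.
for (k in seq_along(splits)) {
  cat(sprintf("fold %d : training [%s], test [%s]\n", k,
              paste(splits[[k]]$train, collapse = " "),
              paste(splits[[k]]$test, collapse = " ")))
}
```

The same sort-the-remainders trick is what builds the fold column in the solution, just with the real row count and NFOLDS.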
Solution:
library(h2o)
h2o.init()
airquality$Year <- rep(2017,nrow(airquality))
airquality$Date <- as.Date(with(airquality,paste(Year,Month,Day,sep="-")),"%Y-%m-%d")
# Sort by date so the rows are in time order (Date is already of class Date)
df <- as.h2o(airquality[order(airquality$Date), ])
df <- h2o.na_omit(df)
# Number of folds
NFOLDS <- 10
# Assign fold numbers sequentially, so each fold is a contiguous window of the data
# (plain sort() is used here, so no pipe package needs to be loaded)
fold_numbers <- as.h2o(sort((1:nrow(df)) %% NFOLDS))
# This will assign fold number randomly
#fold_numbers <- h2o.kfold_column(df, nfolds = NFOLDS)
names(fold_numbers) <- "fold_numbers"
# set the predictor names and the response column name
predictors <- c("Solar.R", "Wind", "Temp", "Month", "Day")
response <- "Ozone"
# append the fold_numbers column to the dataset
df <- h2o.cbind(df, fold_numbers)
# try using the fold_column parameter:
airquality_gbm <- h2o.gbm(x = predictors, y = response, training_frame = df,
                          fold_column = "fold_numbers", seed = 4)
# print the cross-validated RMSE for your model
# (without xval = TRUE, h2o.rmse() reports the training RMSE)
print(h2o.rmse(airquality_gbm, xval = TRUE))
Parts of code borrowed from link1 and link2