
I've read How does h2o.r cross validation work?, but for a time-series dataset, does H2O support the type of CV described in Using k-fold cross-validation for time-series model selection? In particular, something like this (a small sketch of generating such splits follows the list):

fold 1 : training [1], test [2]
fold 2 : training [1 2], test [3]
fold 3 : training [1 2 3], test [4]
fold 4 : training [1 2 3 4], test [5]
fold 5 : training [1 2 3 4 5], test [6]
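
For concreteness, the split structure above can be generated in a few lines of plain R; this is just a sketch, and the six blocks of ten rows each are an arbitrary assumption:

```
# Expanding-window splits over 6 consecutive time blocks (illustrative sizes)
n_blocks <- 6
block_id <- rep(1:n_blocks, each = 10)      # assumed: 10 rows per block
splits <- lapply(2:n_blocks, function(k) {
  list(train = which(block_id <  k),        # train on blocks 1 .. k-1
       test  = which(block_id == k))        # test on block k
})
```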

3 Answers


H2O algorithms can optionally use k-fold cross-validation. H2O does not yet support time-series (aka "walk-forward" or "rolling") cross-validation; however, there is an open ticket to implement it here.

There is an example of how you can manually implement time-series CV using the h2o R package referenced here, if you want to give that a try.
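
For illustration, here is a minimal sketch of what such a manual walk-forward loop could look like with the h2o R package. The airquality data, the five-block expanding window, and the choice of h2o.gbm are assumptions made for this sketch, not the referenced example:

```
library(h2o)
h2o.init()

# Illustrative data: treat the NA-free airquality rows as if they were in time order
df <- as.h2o(na.omit(airquality))
predictors <- c("Solar.R", "Wind", "Temp", "Month", "Day")
response   <- "Ozone"

# Split the rows into 5 consecutive blocks
n_blocks <- 5
block_id <- ceiling(seq_len(nrow(df)) * n_blocks / nrow(df))

# Walk forward: train on all earlier blocks, score on the next block
rmse_per_fold <- sapply(2:n_blocks, function(k) {
  train <- df[which(block_id <  k), ]
  test  <- df[which(block_id == k), ]
  fit   <- h2o.gbm(x = predictors, y = response, training_frame = train)
  h2o.rmse(h2o.performance(fit, newdata = test))
})
mean(rmse_per_fold)  # average hold-out RMSE across the folds
```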

Erin LeDell

I implemented it using scikit-learn's TimeSeriesSplit like this:

```
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)
from h2o.estimators import H2ORandomForestEstimator

# nfolds=0 disables H2O's own (random) k-fold CV; the splits come from TimeSeriesSplit instead
forest = H2ORandomForestEstimator(nfolds=0)

tscv = TimeSeriesSplit(n_splits=5)

Xcols = [c for c in X.names if c != 'NumberOfSales']
Ycol = 'NumberOfSales'

EVar, MAEar, MSEar, RMSEar, R2ar = [], [], [], [], []

for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    # TimeSeriesSplit yields contiguous indices, so the H2OFrame can be sliced by row range
    train = X[min(train_index):max(train_index) + 1, :]
    test = X[min(test_index):max(test_index) + 1, :]
    print(train.nrow, test.nrow)  # just to double check...
    forest.train(x=Xcols, y=Ycol,
                 training_frame=train, validation_frame=test, verbose=False)
    y_pred = forest.predict(test[Xcols])

    y_true = test[Ycol].as_data_frame()
    y_hat = y_pred.as_data_frame()
    EVar.append(explained_variance_score(y_true, y_hat))
    MAEar.append(mean_absolute_error(y_true, y_hat))
    mse = mean_squared_error(y_true, y_hat)
    MSEar.append(mse)
    RMSEar.append(np.sqrt(mse))
    R2ar.append(r2_score(y_true, y_hat))

EV = np.array(EVar).mean()
MAE = np.array(MAEar).mean()
MSE = np.array(MSEar).mean()
RMSE = np.array(RMSEar).mean()
R2 = np.array(R2ar).mean()
```

Another way to cross-validate time series is worth sharing, especially because the question asks whether H2O can support time-series CV. The existing H2O implementation can support a variant of time-series CV, shown below, with the help of the fold_column parameter.

fold 1 : training [4 5 6 7 8 9], test [1 2 3]
fold 2 : training [1 2 3 7 8 9], test [4 5 6]
fold 3 : training [1 2 3 4 5 6], test [7 8 9]

Solution:

library(h2o)
h2o.init()

airquality$Year <- rep(2017,nrow(airquality))
airquality$Date <- as.Date(with(airquality,paste(Year,Month,Day,sep="-")),"%Y-%m-%d")

df <- as.h2o(airquality[order(airquality$Date),])

df <- h2o.na_omit(df)

# Number of folds
NFOLDS <- 10

# Assign fold numbers sequentially, so each fold is a consecutive window of the data
fold_numbers <- as.h2o(sort((1:nrow(df)) %% NFOLDS))

# This would assign fold numbers randomly instead:
#fold_numbers <- h2o.kfold_column(df, nfolds = NFOLDS)

names(fold_numbers) <- "fold_numbers"

# set the predictor names and the response column name
predictors <- c("Solar.R", "Wind", "Temp", "Month", "Day")
response <- "Ozone"

# append the fold_numbers column to the dataset
df <- h2o.cbind(df, fold_numbers)

# try using the fold_column parameter:
airquality_gbm <- h2o.gbm(x = predictors, y = response, training_frame = df,
                    fold_column="fold_numbers", seed = 4)

# print the cross-validated rmse for your model
print(h2o.rmse(airquality_gbm, xval = TRUE))

Parts of the code were borrowed from link1 and link2.