I'm having trouble grasping all the information about the workflow for time series. First I'd like to check whether the data points are iid. Based on that I'd set up a workflow that includes feature selection (ranking), model selection and model evaluation. I have read about nested cross-validation and I'm not sure whether it's applicable to my problem. As I understand it, it avoids the optimistically biased performance estimates that result from using the same cross-validation both to set the model's hyper-parameters and to estimate its performance, and is therefore 'better' than regular k-fold cross-validation.
If they are iid:
- Preprocess the data, including normalization.
- Outer CV loop: estimates the performance of the model chosen by the inner loop.
- Inner CV loop: various models with different hyperparameters are fitted and evaluated on the training set. This loop does the model selection.
- Finally, fit the selected model to the entire data set. This model would be ready for deployment.
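The steps above can be sketched with scikit-learn roughly as follows. This is only a sketch under assumptions: the data are synthetic stand-ins, the hyperparameter grids and fold counts are arbitrary, and mean absolute error is just one possible score. Putting the scaler inside the pipeline means it is re-fit on each training fold, so no information from the held-out fold leaks into preprocessing.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): replace with real features and berth times.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Normalization is a pipeline step, so it is fit on training folds only.
pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])

# Candidate models; the grid values are illustrative, not recommendations.
param_grid = [
    {"model": [LinearRegression()]},
    {"model": [DecisionTreeRegressor(random_state=0)], "model__max_depth": [3, 5]},
    {"model": [RandomForestRegressor(n_estimators=30, random_state=0)],
     "model__max_depth": [3, None]},
]

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # model selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # performance estimate

search = GridSearchCV(pipe, param_grid, cv=inner_cv,
                      scoring="neg_mean_absolute_error")

# Outer loop: each outer fold re-runs the whole inner search on its training part.
outer_scores = cross_val_score(search, X, y, cv=outer_cv,
                               scoring="neg_mean_absolute_error")
print("Nested CV MAE: %.2f +/- %.2f" % (-outer_scores.mean(), outer_scores.std()))

# Final step: run the search once on all data and keep the refitted best pipeline.
final_model = search.fit(X, y).best_estimator_
```

The outer scores estimate how well the whole selection procedure performs, not one specific fitted model; `final_model` is what that procedure produces when given all the data.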
If they are not iid, can I still perform nested cross-validation? Or would that require another approach, such as the one discussed in "Using k-fold cross-validation for time-series model selection"?
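One commonly used option for ordered data (not necessarily the approach in the linked question) is to keep the nested structure but replace shuffled k-fold with forward-chaining splits, so that every test fold lies after its training data in time. A minimal sketch using scikit-learn's `TimeSeriesSplit`, with synthetic data, an arbitrary grid and a single model type as assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption); rows must already be sorted by time.
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)

outer_cv = TimeSeriesSplit(n_splits=5)  # performance estimate
inner_cv = TimeSeriesSplit(n_splits=3)  # hyperparameter selection

# Every outer split trains on the past and tests on the future.
for train_idx, test_idx in outer_cv.split(X):
    print("train up to row", train_idx.max(), "-> test rows",
          test_idx.min(), "to", test_idx.max())

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      {"max_depth": [2, 4, 8]},  # illustrative grid
                      cv=inner_cv, scoring="neg_mean_absolute_error")

ts_scores = cross_val_score(search, X, y, cv=outer_cv,
                            scoring="neg_mean_absolute_error")
print("Forward-chaining nested CV MAE: %.2f" % -ts_scores.mean())
```

Because the early folds train on little data, the scores of the first splits tend to be noisier than the later ones.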
So the real question is how to check first whether my time series data are independent. Finally, I am planning to try a linear regression model, a random forest and a regression tree. During steps 2 and 3 (the outer and inner CV loops) I want to plot learning curves, outliers and, of course, scores.
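A simple first check for serial dependence is the sample autocorrelation of the target (or of model residuals), ordered by time. Under the assumption of iid data, sample autocorrelations are approximately normal with standard deviation 1/sqrt(n), so values well outside ±1.96/sqrt(n) hint at dependence. The sketch below uses only NumPy and two synthetic series (white noise and an AR(1) process with coefficient 0.7, both assumptions for illustration). Note that it only detects linear correlation; absence of autocorrelation does not prove independence.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of x at lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom
                     for k in range(1, max_lag + 1)])

rng = np.random.default_rng(0)
n = 500
white = rng.normal(size=n)      # independent draws
ar1 = np.zeros(n)               # serially dependent series
for t in range(1, n):
    ar1[t] = 0.7 * ar1[t - 1] + white[t]

bound = 1.96 / np.sqrt(n)       # approximate 95% band under iid
for name, series in [("white noise", white), ("AR(1)", ar1)]:
    acf = sample_acf(series, max_lag=10)
    n_outside = int(np.sum(np.abs(acf) > bound))
    print("%s: lag-1 ACF = %.2f, %d of 10 lags outside +/-%.3f"
          % (name, acf[0], n_outside, bound))
```

With real data the series would be the berth times sorted by arrival time. A formal alternative is a portmanteau test such as Ljung-Box, which is available in statsmodels as `acorr_ljungbox`.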
Background: I have about 3 years of data on boats arriving at the quayside, with a varying number of boats per day. I want to predict berth time (the time between arrival and departure).
Thank you,
Regards, Kevin