
I currently have log files from 10 machines. Each log file records event types and the times at which the events occurred.

All machines' log files cover the same time frame, and I want to predict the time until a particular "critical" event occurs.

How should I split the data into train/test? I currently have a couple of ideas:

  1. For each of the 10 machines, take the first 80% of its log as the training set and let the remaining 20% be the test set.
  2. Randomly take 8 machines as the training set and predict the time to a critical event for the remaining 2 machines (both options are sketched in code after this list).
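
To make the two options concrete, here is a minimal sketch of both splits. It assumes the logs have been loaded into a dict of pandas DataFrames keyed by machine id and sorted by timestamp; the `logs` dict and the column layout are hypothetical:

```python
import random

# logs: dict mapping machine_id -> DataFrame with columns
# ["timestamp", "event_type"], sorted by timestamp (hypothetical layout)

def split_by_time(logs, train_frac=0.8):
    """Option 1: the first 80% of each machine's log trains, the rest tests."""
    train, test = {}, {}
    for machine_id, df in logs.items():
        cutoff = int(len(df) * train_frac)
        train[machine_id] = df.iloc[:cutoff]
        test[machine_id] = df.iloc[cutoff:]
    return train, test

def split_by_machine(logs, n_test=2, seed=0):
    """Option 2: hold out whole machines as the test set."""
    machine_ids = sorted(logs)
    random.Random(seed).shuffle(machine_ids)
    test_ids = set(machine_ids[:n_test])
    train = {m: df for m, df in logs.items() if m not in test_ids}
    test = {m: df for m, df in logs.items() if m in test_ids}
    return train, test
```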

Any input or feedback would be greatly appreciated!

Courier12
  • Are you confident in assuming that the frequency of events and the proportion of event types are constant over time? If not, that rules out option 1 almost immediately. – David Luke Thiessen Jun 22 '21 at 21:19
  • Also, see https://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection for another option. – David Luke Thiessen Jun 22 '21 at 21:20
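
One quick way to check the assumption raised in the first comment, assuming the pooled logs sit in a single DataFrame `df` with hypothetical `timestamp` and `event_type` columns, is to count events per time bin and look for drift:

```python
import matplotlib.pyplot as plt
import pandas as pd

# df: DataFrame with a datetime "timestamp" column and an "event_type"
# column (hypothetical names)
counts = (
    df.set_index("timestamp")
      .groupby("event_type")
      .resample("1D")           # daily bins; pick a width that suits the data
      .size()
      .unstack(level=0, fill_value=0)
)
print(counts.describe())        # large spread hints at non-constant rates
counts.plot()                   # visual check for drift in event frequency
plt.show()
```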

1 Answer


It depends on the operational characteristics of the machines. Assuming they operate under similar circumstances, it is best to pool all events into one time series, then apply rolling cross-validation, as described by Rob Hyndman, when evaluating the model (see also the question linked in the comments above).
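
Rolling (rolling-origin) cross-validation can be done with scikit-learn's `TimeSeriesSplit`, which always trains on the past and tests on the future. Here is a minimal sketch on a pooled, time-ordered feature matrix; the random `X` and `y` are stand-ins for whatever features and time-to-critical-event targets you extract from the logs:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Stand-in data: replace with features/targets built from the event logs,
# ordered by time (e.g. target = time until the next critical event).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = rng.normal(size=500)

tscv = TimeSeriesSplit(n_splits=5)  # each fold trains only on earlier data
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = RandomForestRegressor(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    print(f"fold {fold}: MAE = {mean_absolute_error(y[test_idx], pred):.3f}")
```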

msuzen