I am a beginner in rare event modeling. I am working on predicting modem failures within a network where failures occur approximately 3% of the time. Currently my data is structured as follows:
network_name    Network XXXX
cm_mac_address  XXXXXXXXXXXX
time            2016-02-22 01:05:35
status          OK
duration        6308
latency         0
down_speed      4173985869
up_speed        177881922
down_power      42
down_snr        310
up_power        502
failure_next24  1
network_name and cm_mac_address are descriptive and independent of time. status, duration, latency, down_speed, up_speed, down_power, down_snr, and up_power are the current values at the timestamp. failure_next24 is a binary variable indicating whether a failure occurs in any row within the following 24-hour window. My models have not been successful when looking at one observation at a time. I was reading this paper (https://pdfs.semanticscholar.org/895b/0b0472e1c47167d6cad1ea5436fdcf9976ca.pdf) and thought a window approach would be better.
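For concreteness, here is a minimal sketch of how a label like failure_next24 can be derived per modem, assuming pandas, invented toy values, and that any non-"OK" status counts as a failure (none of these assumptions come from my actual pipeline):

```python
import pandas as pd

def label_failure_next24(group):
    """Given one modem's time-sorted readings, mark rows that have a
    failure strictly within the following 24 hours."""
    fail_times = group.loc[group["status"] != "OK", "time"]

    def has_failure_ahead(t):
        # Any failure later than t but no more than 24 hours after it?
        window = (fail_times > t) & (fail_times <= t + pd.Timedelta(hours=24))
        return int(window.any())

    return group["time"].map(has_failure_ahead)

# Toy data for a single modem (column names follow my schema; values invented).
df = pd.DataFrame({
    "time": pd.to_datetime(["2016-02-22 00:00", "2016-02-22 12:00",
                            "2016-02-22 23:00", "2016-02-24 00:00"]),
    "status": ["OK", "OK", "DOWN", "OK"],
})
df["failure_next24"] = label_failure_next24(df)
print(df["failure_next24"].tolist())  # [1, 1, 0, 0]
```

Note that the row containing the failure itself gets a 0 here, since only strictly later failures are counted.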
To protect against link rot, the paper is "Learning to predict extremely rare events" by Gary Weiss and Haym Hirsh – ssdecontrol
I think looking at patterns in a sequence leading up to the event will allow more accurate predictions. How should I arrange my data for this? My first thought is to add columns to each row for every observation in its monitoring window. For example, my columns would look like the list below, where n is the last observation in the monitoring window:
network_name, cm_mac_address, time, status, status_2, status_3, …, status_n, …, up_power, up_power_2, …, up_power_n, failure_next24
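The wide layout above can be built with lagged copies of each feature. A minimal sketch, assuming pandas, a toy window of n = 3, one feature, and that rows are already sorted by time within each modem (all of which are illustrative assumptions, not my real data):

```python
import pandas as pd

# Toy frame: one numeric feature for two modems, time-sorted within modem.
df = pd.DataFrame({
    "cm_mac_address": ["AA", "AA", "AA", "AA", "BB", "BB"],
    "up_power": [500, 502, 505, 509, 400, 401],
})

n = 3  # window size: current reading plus n-1 earlier ones
for lag in range(1, n):
    # up_power_2 holds the previous reading, up_power_3 the one before that.
    # Grouping by modem keeps one device's history from leaking into another's.
    df[f"up_power_{lag + 1}"] = (
        df.groupby("cm_mac_address")["up_power"].shift(lag)
    )

# Rows near the start of each modem's history get NaN lags,
# e.g. the first "AA" row has no up_power_2 or up_power_3.
print(df)
```

The same loop would repeat over status, duration, latency, and the other time-varying columns, which is exactly why the column count (and storage) grows linearly with n.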
However, adding all these new columns would require a tremendous amount of memory and compute, since my data is about 1 TB in size, so I wanted to ask here before starting down this road. I've read many papers on the sliding-window approach, but I have not seen an example data set showing how the windows are actually structured.
Thanks for your time.