I am a beginner in rare event modeling. I am working on predicting modem failures within a network where failures occur approximately 3% of the time. Currently my data is structured as follows:
network_name    Network XXXX
cm_mac_address  XXXXXXXXXXXX
time            2016-02-22 01:05:35
status          OK
duration        6308
latency         0
down_speed      4173985869
up_speed        177881922
down_power      42
down_snr        310
up_power        502
failure_next24  1
network_name and cm_mac_address are descriptive and independent of time. status, duration, latency, down_speed, up_speed, down_power, down_snr, and up_power are the current values at the timestamp. failure_next24 is a binary variable indicating whether a failure occurs in any row within the following 24-hour window. My models have not been successful when looking at one observation at a time. I was reading this paper (https://pdfs.semanticscholar.org/895b/0b0472e1c47167d6cad1ea5436fdcf9976ca.pdf) and thought a window approach would be better.
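For concreteness, here is a minimal sketch of how a label like failure_next24 can be derived per modem, assuming pandas, invented toy values, and that any non-"OK" status counts as a failure (none of these assumptions come from my actual pipeline):

```python
import pandas as pd

def label_failure_next24(group):
    """Given one modem's time-sorted readings, mark rows that have a
    failure strictly within the following 24 hours."""
    fail_times = group.loc[group["status"] != "OK", "time"]

    def has_failure_ahead(t):
        # Any failure later than t but no more than 24 hours after it?
        window = (fail_times > t) & (fail_times <= t + pd.Timedelta(hours=24))
        return int(window.any())

    return group["time"].map(has_failure_ahead)

# Toy data for a single modem (column names follow my schema; values invented).
df = pd.DataFrame({
    "time": pd.to_datetime(["2016-02-22 00:00", "2016-02-22 12:00",
                            "2016-02-22 23:00", "2016-02-24 00:00"]),
    "status": ["OK", "OK", "DOWN", "OK"],
})
df["failure_next24"] = label_failure_next24(df)
print(df["failure_next24"].tolist())  # [1, 1, 0, 0]
```

Note that the row containing the failure itself gets a 0 here, since only strictly later failures are counted.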
To protect against link rot, the paper is "Learning to predict extremely rare events" by Gary Weiss and Haym Hirsh – ssdecontrol
I think looking at patterns in a sequence leading up to the event will allow more accurate predictions. How should I arrange my data for this? My first thought is to add columns to each row for every observation in its monitoring window. For example, my columns would look like the list below, where n is the last observation in the monitoring window:
network_name, cm_mac_address, time, status, status_2, status_3, …, status_n, …, up_power, up_power_2, …, up_power_n, failure_next24
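The wide layout above can be built with lagged copies of each feature. A minimal sketch, assuming pandas, a toy window of n = 3, one feature, and that rows are already sorted by time within each modem (all of which are illustrative assumptions, not my real data):

```python
import pandas as pd

# Toy frame: one numeric feature for two modems, time-sorted within modem.
df = pd.DataFrame({
    "cm_mac_address": ["AA", "AA", "AA", "AA", "BB", "BB"],
    "up_power": [500, 502, 505, 509, 400, 401],
})

n = 3  # window size: current reading plus n-1 earlier ones
for lag in range(1, n):
    # up_power_2 holds the previous reading, up_power_3 the one before that.
    # Grouping by modem keeps one device's history from leaking into another's.
    df[f"up_power_{lag + 1}"] = (
        df.groupby("cm_mac_address")["up_power"].shift(lag)
    )

# Rows near the start of each modem's history get NaN lags,
# e.g. the first "AA" row has no up_power_2 or up_power_3.
print(df)
```

The same loop would repeat over status, duration, latency, and the other time-varying columns, which is exactly why the column count (and storage) grows linearly with n.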
However, adding all these new columns would require a tremendous amount of memory and compute, since my data is about 1 TB in size, so I wanted to ask here before starting down this road. I've read many papers on the sliding-window approach, but I have not seen an example data set showing how the windows are actually structured.
Thanks for your time.