Machine learning on non-fixed-length sequential data?

Question

I have a problem to which I'd like to apply machine learning (supervised classification). However, the data is sequential and each row in the data has its own length, so the number of features per row is not constant (think time-series data, for example day-by-day data). The order of the data carries meaning, and I don't think we can simply pad with zeros to make all rows equal in length, as that would introduce spurious signals which would confuse my classifier. At least that's my current opinion.

One possible approach is to use e.g. window functions and simply compute (for each day) running sums of things. But that means I'm losing information on history, since each day would have to be represented as its own row in the matrix in order to make all rows have a fixed number of columns, so I could apply classical ML algorithms. I want to avoid this, as I believe it's a suboptimal approach - but I will listen to any arguments against my opinion.
I don't have a lot of experience with neural networks, but I believe there are architectures which support non-fixed-length sequence data, e.g., RNNs? Does anyone have any good links/resources I may consider?

I welcome thoughts and suggestions from practitioners on how to approach this modeling problem. Thank you!

Regards, M

Consider an RNN as a possibly good approximator for time-series-like processes, but that does not mean an RNN allows non-fixed-length data. It consumes fixed-length inputs; however, it can construct within itself arbitrarily long contexts from the fixed-length input strings. That property gives one a way to use inputs of pseudo-non-fixed length (as the model will optimize itself). — Alexey Burnakov, Feb 01 '18 at 10:12
@AlexeyBurnakov: Thank you for your reply. I should have mentioned in my question that my input data is numerical. Is it still possible to apply an RNN, or will the embedding become meaningless in this case? — Magnus, Feb 01 '18 at 10:33
Addendum: Indeed the data is time-series-like; however, with careful consideration I can transform it into a weakly correlated data sequence. It should be possible to categorize it as a first-order Markov process. But I guess I have to test my assumptions rigorously by cross-validation. — Magnus, Feb 01 '18 at 10:37

score 6 · Accepted Answer · answered Feb 01 '18 at 10:29

It seems you are asking two questions here:

1. How to deal with the situation where different samples have different numbers of features, i.e. when some features are either not applicable to some samples or are not available
2. How to perform supervised classification on time-series data

With regard to question 1, it depends. Each sample does need to have the same number of features. Some models, e.g. decision-tree based ones, can explicitly deal with missing/NA data. Others, like logistic regression, need numeric features and cannot handle raw categorical features or missing values directly. In this case, it may be worth introducing additional binary features (representing whether feature X is present/applicable), and choosing some appropriate value for feature X in case it is missing / not applicable. A good choice would depend on the specific problem.
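As a rough illustration of the indicator-feature idea, here is a minimal sketch in plain Python. The function name `add_missing_indicators` and the fill value of 0.0 are assumptions made for the example; a sensible fill value depends on the problem.

```python
from typing import List, Optional, Sequence


def add_missing_indicators(
    rows: Sequence[Sequence[Optional[float]]],
    fill_value: float = 0.0,  # assumed placeholder; choose per problem
) -> List[List[float]]:
    """Encode each feature as (value_or_fill, is_present).

    Every output row has exactly 2 * n_features columns, so rows with
    missing entries still fit a fixed-width feature matrix.
    """
    encoded_rows = []
    for row in rows:
        encoded = []
        for value in row:
            present = value is not None
            encoded.append(value if present else fill_value)
            encoded.append(1.0 if present else 0.0)
        encoded_rows.append(encoded)
    return encoded_rows


# Two samples, two features each; None marks "missing / not applicable".
rows = [[1.5, None], [None, 3.0]]
encoded = add_missing_indicators(rows)
print(encoded)
```

The indicator column lets a model tell a genuine 0.0 apart from a filled-in placeholder, which a plain fill cannot do.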

Question 2: you have a choice of manually engineering features, or trying a model that can attempt to deal with the temporal structure of your data automatically. Most models assume that each sample is independent of the others, so ideally you would apply some feature engineering to make your time series stationary and use your domain knowledge to decide what historical data is important for each sample and how it should be represented. Z-scores, moving averages, variances etc. could all be useful here. You may attempt to use RNNs, but in my experience that is only worth it if you have a lot of data and otherwise have no intuition about which features may be useful.
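To make the feature-engineering route concrete, here is a small sketch that turns one series into fixed-length rows using a trailing window. The function name `trailing_features`, the window of 3, and the choice of statistics (moving average, variance, z-score) are illustrative assumptions, not a recommendation for any particular dataset.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def trailing_features(series: Sequence[float], window: int = 3) -> List[List[float]]:
    """Build one fixed-length row per time step t >= window.

    The window statistics use only values strictly before t; the current
    value is only compared against them to form a z-score.
    """
    rows = []
    for t in range(window, len(series)):
        past = series[t - window:t]
        mu = mean(past)
        sigma = pstdev(past)  # population standard deviation of the window
        z = (series[t] - mu) / sigma if sigma > 0 else 0.0
        rows.append([mu, sigma ** 2, z])
    return rows


daily = [1.0, 2.0, 3.0, 4.0, 8.0]
features = trailing_features(daily, window=3)
for row in features:
    print(row)
```

Each row has the same three columns regardless of how long the underlying series is, so the result can be fed to a classical model.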

Regardless of which model you choose to use, setting up appropriate validation and testing frameworks is absolutely crucial. With time series you need to be extra careful. For example, you need to decide whether using data from the future to train your model is appropriate, whether you need to discard some data around the boundaries of your training set, etc. Do not blindly sample data at random into validation/test sets; this will likely give you wildly biased estimates that will not be useful.
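One common way to respect time order is a walk-forward split, where each test block lies strictly after its training data and a few samples in between are discarded. The sketch below is only an illustration: the function name `walk_forward_splits` and the fold, test-size and gap values are assumptions. scikit-learn's `TimeSeriesSplit` provides a comparable scheme if you prefer a library implementation.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int,
    n_folds: int = 3,
    test_size: int = 2,
    gap: int = 1,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for time-ordered samples.

    Training data always precedes the test block, and `gap` samples
    between them are dropped to limit leakage from overlapping windows.
    """
    for fold in range(n_folds):
        test_end = n_samples - (n_folds - 1 - fold) * test_size
        test_start = test_end - test_size
        train_end = test_start - gap
        if train_end <= 0:
            continue  # not enough history for this fold
        yield list(range(train_end)), list(range(test_start, test_end))


splits = list(walk_forward_splits(10, n_folds=3, test_size=2, gap=1))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```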

I would also recommend researching each question independently; both have been addressed on this Stack Exchange site before. Good luck!

Thank you for your reply. Regarding Q1: so OHE of categorical features will not allow me to use a logreg model? I thought differently... I like the indicator variable approach (feature_is_missing/not_missing), but unfortunately it will not help in my particular case. — Magnus, Feb 01 '18 at 14:09
Regarding Q2: I do have a lot of data for this problem so an RNN might be appropriate. Collecting data is inexpensive, so I can easily produce `10^6-10^7` data points. I do have a lot of domain knowledge for the problem I'm working on, and there are ways of probing even a black-box algorithm (small linear perturbations in predictor variables, or using a package such as LIME may help). — Magnus, Feb 01 '18 at 14:12

score 5 · Answer 2 · answered Feb 01 '18 at 10:18

It looks like an RNN would be a good model for your problem.

They can deal with time-series sequences of different lengths. Basically, their internal state implements a memory mechanism that allows them to "remember" what happened, so that the past is taken into account when making decisions about the future.
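To show why sequence length does not have to be fixed, here is a toy Elman-style recurrent cell in plain Python. The same weights are applied at every time step, so sequences of any length end in a hidden state of the same size. The weights are hypothetical hand-picked numbers rather than trained values, and `rnn_final_state` is a name made up for this sketch.

```python
import math
from typing import List, Sequence


def rnn_final_state(
    sequence: Sequence[Sequence[float]],
    w_in: Sequence[Sequence[float]],
    w_rec: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Run a tiny tanh RNN over a sequence and return the last hidden state.

    Each element of `sequence` is a fixed-size numeric feature vector;
    only the number of elements (time steps) varies between sequences.
    """
    hidden = [0.0] * len(bias)
    for x_t in sequence:
        new_hidden = []
        for j in range(len(bias)):
            total = bias[j]
            total += sum(w_in[j][i] * x_t[i] for i in range(len(x_t)))
            total += sum(w_rec[j][k] * hidden[k] for k in range(len(hidden)))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden
    return hidden


# Hypothetical fixed weights: 1 input feature, 2 hidden units.
W_IN = [[0.5], [-0.3]]
W_REC = [[0.1, 0.2], [0.0, 0.4]]
BIAS = [0.0, 0.1]

short_seq = [[1.0], [2.0]]                          # 2 time steps
long_seq = [[1.0], [2.0], [0.5], [3.0], [1.5]]      # 5 time steps

h_short = rnn_final_state(short_seq, W_IN, W_REC, BIAS)
h_long = rnn_final_state(long_seq, W_IN, W_REC, BIAS)
print(len(h_short), len(h_long))  # both hidden states have 2 entries
```

The final hidden state can then be passed to an ordinary classifier layer. When batching sequences of different lengths, deep-learning libraries usually pad them and then mask or pack the padding so that it does not influence the hidden state; PyTorch's `torch.nn.utils.rnn.pack_padded_sequence` is one such utility.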

However, they are quite generic models and can be used in lots of domains. There are several types of RNN; the most common ones are probably the LSTM and the GRU.

In my opinion, Colah's blog has a great explanation of LSTMs.

If, instead, you are looking for something more practical, I would suggest the PyTorch tutorial.

answered Feb 01 '18 at 10:18 by emapesce

Thank you for your reply, I will have a look. RNNs seem like a good idea, but as I forgot to point out earlier in my question, I have numerical data only. I am unsure whether RNNs are suitable in that case - but I will do some research in order to find out. – Magnus Feb 01 '18 at 10:40
What's the problem of using numerical data with RNNs? RNNs deal with numerical data – emapesce Feb 01 '18 at 10:50
No problem, I simply wasn't completely sure that it could be done, due to my limited experience with them. – Magnus Feb 01 '18 at 14:04
