Background
I have some data that looks like this:
time  apples  oranges
1     5       2
2     5       2
3     6       2
4     6       3
5     7       3
I want to create a sliding window time series from this.
I have converted this into supervised time series data for deep learning by following this guide. Suppose I set the sliding window width to 3 time steps, so each sample covers (t-2), (t-1), and (t).
So now my data looks like this:
time  apples(t-2)  oranges(t-2)  apples(t-1)  oranges(t-1)  apples(t)  oranges(t)
3     5            2             5            2             6          2
4     5            2             6            2             6          3
5     6            2             6            3             7          3
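For reference, the guide's shift-based transform looks roughly like this on the toy data above; this is my own reconstruction, so the variable names are mine:

import pandas as pd

# toy data from the first table
df = pd.DataFrame({"apples": [5, 5, 6, 6, 7],
                   "oranges": [2, 2, 2, 3, 3]},
                  index=pd.Index([1, 2, 3, 4, 5], name="time"))

# shift each column back by 2 and 1 steps, then keep only complete windows
lagged = [df.shift(lag).add_suffix(f"(t-{lag})") for lag in (2, 1)]
windowed = pd.concat(lagged + [df.add_suffix("(t)")], axis=1).dropna()

windowed reproduces the table above row for row; the trouble is purely that this concatenation explodes at scale.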
This format works very well for the problem described above where we only have a few columns and a small sliding window.
Suppose we scale this problem to the following:
time     var-1  var-2  ...  var-80
1        5      2      ...  6
2        5      2      ...  6
:        :      :      ...  :
400,000  6      1      ...  2
And now we say we want a sliding window of 10,000 time steps. Using the same form as before, the output shape will be (390,000, 800,000). This won't work: a table of that size runs to terabytes, and read/write times are far too slow.
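To put a number on it: 390,000 rows × 800,000 columns × 4 bytes (assuming float32) is about 1.25 TB for the materialized table, before any duplication from batching.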
Question
I'm looking for a different way to structure my data that does not explode its size while still allowing it to be fed into an LSTM neural network.
To provide further context:
train_X = train[:, :]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
...
model.fit(train_X, train_y, ...)
train is a table created using the transform described above: every row is a sample containing all the time steps in the sliding window for every var. train_X is the same as train in this case because our class labels are stored elsewhere. train_X is reshaped into a 3D numpy array to fit the specification of keras.models.Sequential.fit.
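(With the toy table above, dropping the time column, train_X would have shape (3, 6), and the reshape turns it into (3, 1, 6): a single timestep whose features are the flattened window.)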
Except I don't want to load all of train_X at once, because I don't have enough memory to build a table that holds all of the data for all of my experiments.
I'm looking for a way to train my time series model without needing to aggregate the entire training dataset at once. I want to aggregate only some reasonable number of rows/samples at a time, as needed, on the fly.
Ideas
I am a newbie at deep learning frameworks, so please be brutal and specific. It seems like there should be a way to pass indices to a deep learning framework. The guide I linked above uses pandas.shift() to restructure the data. I ended up writing my own implementation because the one from that link was too slow; I kept the link because the output of my program matches the output of the linked program.
My transform script relies on this line of code:
# sliding_window here is the largest lag, so each window spans sliding_window + 1 rows
for idx in range(0, numRows - sliding_window):
    newData[idx, :] = origionalData[idx:idx + sliding_window + 1, :].ravel()
This builds the new data table row by row. A row in newData is selected from origionalData all at once, and each row in newData would be one input to my LSTM. Each row by itself is manageable in time and space complexity; the space problem comes from putting all of these rows together into one table. (A zero-copy alternative is sketched below.)
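For what it's worth, one alternative I've been eyeing is to build the same windows as a zero-copy view instead of copying rows. This assumes NumPy ≥ 1.20, where sliding_window_view exists, and reuses the names from my snippet:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# origionalData has shape (numRows, nVars); each window spans
# sliding_window + 1 rows, matching the loop above
windows = sliding_window_view(origionalData, sliding_window + 1, axis=0)
# windows has shape (numRows - sliding_window, nVars, sliding_window + 1)
# and is a view into origionalData -- no data is copied
windows = windows.transpose(0, 2, 1)  # -> (samples, timesteps, features)

Slicing a batch out of windows copies only that batch, so the full table never has to exist in memory.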
Why can't I just pass indices into my deep learning framework? (Ha, what an idiot, you can...)
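Something like the following is what I am picturing: a minimal sketch using keras.utils.Sequence, which lets fit pull batches by index on the fly. The class name WindowGenerator and its arguments are my own invention, not from any guide:

import numpy as np
from tensorflow import keras

class WindowGenerator(keras.utils.Sequence):
    """Build (batch, timesteps, features) windows on the fly from the raw array."""

    def __init__(self, data, labels, sliding_window, batch_size):
        self.data = data              # shape (numRows, nVars), never windowed up front
        self.labels = labels
        self.sliding_window = sliding_window
        self.batch_size = batch_size
        # one sample per valid start index, same count as the loop above
        self.starts = np.arange(len(data) - sliding_window)

    def __len__(self):
        return int(np.ceil(len(self.starts) / self.batch_size))

    def __getitem__(self, i):
        idx = self.starts[i * self.batch_size:(i + 1) * self.batch_size]
        X = np.stack([self.data[s:s + self.sliding_window + 1] for s in idx])
        y = self.labels[idx + self.sliding_window]  # label aligned with time t
        return X, y

# model is the compiled Sequential model from above
model.fit(WindowGenerator(data, labels, sliding_window, batch_size=32), epochs=10)

Each batch then materializes only batch_size × (sliding_window + 1) × nVars values (roughly 100 MB for the 80-var, 10,000-step case at batch size 32); older Keras versions expose the same thing through fit_generator instead of fit.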
As mentioned, I am a newbie at deep learning, and this is my first large-scale deep learning project. I chose Keras because it is simple; I know that other frameworks (e.g. TensorFlow) are more flexible at the expense of being less user-friendly.
I have researched this problem broadly but haven't been able to gain any traction. I was hoping that somebody with a better understanding of deep learning frameworks might be able to point me in a helpful direction.
Thanks!