I recently learned about CNN-LSTM architectures for time series, where the CNN part of the architecture acts as a feature extractor. However, I struggle to grasp why there is still a 'time-related' aspect to the data after the CNN, because I thought the output of the CNN layers is just 'features'?
For example, sound identification can be done by such an architecture. You provide training data of sounds with shape (2048, 1) and a corresponding class label. First, the CNN extracts features from that time series, and then an LSTM followed by a dense layer predicts the label. Can somebody perhaps explain how an LSTM can still make sense of the data coming out of the CNN?
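To make my question concrete, here is a minimal Keras sketch of the kind of model I mean (my own illustration, not taken from any paper; the layer sizes and `num_classes` are hypothetical). The shape comments show what I find confusing: `Conv1D` and pooling shorten the time axis but never remove it, so the CNN output is `(timesteps', filters)`, i.e. still a sequence of feature vectors rather than a flat feature set.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 10  # hypothetical; depends on how many sound classes you have

model = keras.Sequential([
    layers.Input(shape=(2048, 1)),  # (timesteps, channels): one raw waveform
    # Conv1D slides its filters along the time axis, so the output is still
    # a sequence: one 32-dim feature vector per (strided) time step.
    layers.Conv1D(32, kernel_size=8, strides=2, activation="relu"),  # -> (1021, 32)
    layers.MaxPooling1D(pool_size=4),                                # -> (255, 32)
    layers.Conv1D(64, kernel_size=4, strides=2, activation="relu"),  # -> (126, 64)
    layers.MaxPooling1D(pool_size=4),                                # -> (31, 64)
    # The LSTM reads this shorter sequence of 64-dim feature vectors in order,
    # so the temporal structure is still there for it to model.
    layers.LSTM(64),
    layers.Dense(num_classes, activation="softmax"),
])
model.summary()
```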
An example image of how this works: [architecture diagram omitted]
An example paper where I found another take on this idea: see Figure 2 of "A CNN–LSTM model for gold price time-series forecasting" by Ioannis E. Livieris et al.