
The overall goal is to extract/engineer features from approximately 100 segments (of various lengths, but always more than 80 data points) such that each feature's values are as similar as possible across the segments, i.e. have very low dispersion. To achieve this, I want to adjust the parameters of an extraction function, for example the observation period of a certain extracted feature (the variance of the first 10 data points, of the first 30 data points, and so on).

I want to check which parameter setting results in the smallest dispersion of a given feature, but also compare different features with each other in order to "select the best ones". Something like a feature ranking, for example: the features with the lowest dispersion come first.

So far I have used the coefficient of variation (CV) as a relative measure for comparison, and it seems to work quite well. But I'm not sure whether it is the best approach or merely the simplest one, which could lead to misleading conclusions. I looked at other measures, for example the coefficient of dispersion (or MAD etc.), but I don't know enough about this field to interpret the results correctly and choose a suitable measure.
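To make the candidate measures concrete, here is a minimal sketch using only the Python standard library. The definitions are assumptions, since naming is not uniform: CV is taken as sample SD / mean, the coefficient of dispersion as mean absolute deviation from the median / median, and "MAD" is read here as the median absolute deviation, again scaled by the median. The feature values are made up for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean (unit-free)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / abs(mean)


def coefficient_of_dispersion(values):
    """Mean absolute deviation from the median, divided by the median."""
    median = statistics.median(values)
    if median == 0:
        raise ValueError("CD is undefined when the median is zero")
    mean_abs_dev = statistics.mean([abs(v - median) for v in values])
    return mean_abs_dev / abs(median)


def relative_median_abs_deviation(values):
    """Median absolute deviation from the median, divided by the median."""
    median = statistics.median(values)
    if median == 0:
        raise ValueError("undefined when the median is zero")
    return statistics.median([abs(v - median) for v in values]) / abs(median)


# Hypothetical values of ONE feature, one value per segment.
feature_values = [4.0, 5.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0]

cv_value = coefficient_of_variation(feature_values)
cd_value = coefficient_of_dispersion(feature_values)
rmad_value = relative_median_abs_deviation(feature_values)

print(f"CV={cv_value:.3f}  CD={cd_value:.3f}  relative MAD={rmad_value:.3f}")
```

All three are unit-free, so they can be computed side by side for every feature or parameter setting. They need not rank features in the same order, so it is worth checking where they disagree.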

I hope this makes my issue clear. Any approaches, ideas or explanations are highly appreciated!

EDIT:

Maybe I should clarify my goal a little more: I know that all of the 100 segments I want to extract features from are "the same". By "the same" I mean that they are time series of, e.g., power, and that I filtered the segments manually, so I know that they represent the same power consumption signature of, for example, an electronic device. The extracted features should therefore represent a segment, and because of their similarity the segments should be assigned to the same cluster in the next step.

Because the measured power in these segments fluctuates due to measurement inaccuracies etc., the extracted feature has to be as similar as possible across the "same" segments. For example, the variance of data points 1-20 of a segment varies much more from segment to segment than the variance of data points 2-20, because of different peaks at the beginning of each segment. Now I want a measure for comparing these features with each other. Maybe the coefficient of variation is sufficient for this task (of course there is no perfect solution), but I'm not sure whether I should use the CV or the coefficient of dispersion (this is what I meant: https://en.wikipedia.org/wiki/Index_of_dispersion).
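The window comparison described above could be sketched roughly as follows. Everything here is an assumption for illustration: the segments are synthetic stand-ins (a smooth pattern, small noise, and a start-up peak of varying height on the first point), and `make_segment`, `window_variance` and `rank_windows` are hypothetical helper names, not part of any library.

```python
import math
import random
import statistics


def cv(values):
    """Coefficient of variation: sample SD divided by the absolute mean."""
    return statistics.stdev(values) / abs(statistics.mean(values))


def window_variance(segment, start, stop):
    """Feature: sample variance of segment[start:stop] (0-based, stop exclusive)."""
    return statistics.variance(segment[start:stop])


def rank_windows(segments, windows):
    """Return (window, CV across segments) pairs, lowest CV first."""
    scores = []
    for start, stop in windows:
        feature = [window_variance(seg, start, stop) for seg in segments]
        scores.append(((start, stop), cv(feature)))
    return sorted(scores, key=lambda item: item[1])


def make_segment(rng, length=90):
    """Synthetic stand-in for one power segment with a start-up peak."""
    segment = [10 + 2 * math.sin(i / 3) + rng.gauss(0, 0.1) for i in range(length)]
    segment[0] += rng.uniform(5, 30)  # peak height differs per segment
    return segment


rng = random.Random(0)  # fixed seed so the sketch is reproducible
segments = [make_segment(rng) for _ in range(100)]

# "Points 1-20" and "points 2-20" from the text, written as 0-based slices.
ranking = rank_windows(segments, [(0, 20), (1, 20)])
for window, score in ranking:
    print(window, round(score, 3))
```

With this synthetic data the window that skips the first point ranks first, because the varying peak no longer enters the variance. The same `rank_windows` loop can be fed any list of candidate windows, and `cv` can be swapped for another unit-free dispersion measure.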

Checker9
  • If coefficient of variation = SD / mean and coefficient of dispersion is mean absolute deviation / median, then it's a good start that they are both unit-free. Beyond that comparisons are difficult even with some simplifying assumption such as that all your values are positive. In practice, high CV could go with low CD and vice versa. You are coy about exactly what you are doing and there could be outstandingly good reasons for coyness, but that doesn't clarify your goals for others. I can't give you more than empty advice to try CD and see whether it measures well what you want to measure. – Nick Cox Apr 27 '20 at 22:42
  • My strong prejudice is to look for a scale on which comparisons are easier, e.g. logarithmic. – Nick Cox Apr 28 '20 at 08:38
  • Sorry I added my comments to the question, I needed more characters. So you mean use the logarithmic scale (e.g.) on these segments before extracting the features? – Checker9 Apr 28 '20 at 08:53
  • That's what I am suggesting. It's standard that coefficient of variation makes fullest sense if variability on logarithmic scale is approximately constant, and the other measure isn't different in that sense, just typically more resistant to outliers or long tails. See e.g. https://stats.stackexchange.com/questions/118497/how-to-interpret-the-coefficient-of-variation – Nick Cox Apr 28 '20 at 09:06
  • Thanks Nick! Really appreciate your feedback. So your suggested procedure would be something like: 1. Log the power data of the segment 2. Extract possible features that represent the segments 3. Use CV and compare the features (of course only as an indicator not the solution for everything before clustering) – Checker9 Apr 28 '20 at 09:43
  • That is essentially it. Naturally I can have no feel of my own for how your data behave. – Nick Cox Apr 28 '20 at 09:48
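The three-step procedure agreed on in the comments (log the power data, extract features, compare them by CV) might look like the sketch below. It is only an illustration under strong assumptions: the segments are synthetic and differ purely by a multiplicative gain, and the entries of `FEATURES` are hypothetical extractors chosen for the example. Real data will behave less cleanly.

```python
import math
import statistics


def cv(values):
    """Coefficient of variation: sample SD divided by the absolute mean."""
    return statistics.stdev(values) / abs(statistics.mean(values))


def log_transform(segment):
    """Step 1: move the power values to a logarithmic scale."""
    if min(segment) <= 0:
        raise ValueError("the log scale needs strictly positive values")
    return [math.log(v) for v in segment]


# Step 2: hypothetical feature extractors, each mapping a segment to a number.
FEATURES = {
    "sd_first_20": lambda seg: statistics.stdev(seg[:20]),
    "range_first_20": lambda seg: max(seg[:20]) - min(seg[:20]),
}


def feature_cvs(segments):
    """Step 3: CV of every feature across all segments."""
    return {
        name: cv([extract(seg) for seg in segments])
        for name, extract in FEATURES.items()
    }


# Synthetic segments: same shape, different multiplicative gain (an assumption).
base = [50 + 10 * math.sin(i / 4) for i in range(90)]
gains = [0.8, 0.9, 1.0, 1.1, 1.25]
segments = [[gain * value for value in base] for gain in gains]

raw_cvs = feature_cvs(segments)
log_cvs = feature_cvs([log_transform(seg) for seg in segments])

for name in FEATURES:
    print(f"{name}: raw CV={raw_cvs[name]:.3f}  log CV={log_cvs[name]:.3g}")
```

Under this purely multiplicative assumption, a gain only shifts the logged values by a constant, so spread-type features become essentially identical across segments and their CV drops to about zero. How much of that carries over depends on how the real measurements vary.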

0 Answers