
I'm trying to cluster together different short audio files based on the zero-crossing rate (an integer) and the energy, spectral centroid, and spectral bandwidth (time-variant values). I've decided to use these time series as features, rather than just taking an average of each and using that as a feature, because one of the goals is, for example, to cluster together the audio files which rise in pitch over time but decay in energy. I'm assuming that K-means clustering might not be the best algorithm here, given that it's meant for low-dimensional spaces, and that I'll at least need to do some specific preprocessing first (maybe not just because the features are time series, but also because the data is audio). What should I look into using here for what I'm trying to do? Should I look into different features?
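For reference, here's roughly the kind of feature extraction I have in mind (a sketch using librosa; the filename is a placeholder):

```python
import numpy as np
import librosa

# Load one short drum sample at its native sample rate ("kick_01.wav" is a placeholder).
y, sr = librosa.load("kick_01.wav", sr=None)

# Scalar feature: total number of zero crossings in the file (an integer).
zcr_count = int(np.sum(librosa.zero_crossings(y)))

# Time-variant features, one value per analysis frame. The defaults share a
# hop of 512 samples, so all three sequences have the same length.
energy = librosa.feature.rms(y=y)[0]
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)[0]

# One multivariate time series of shape (n_frames, 3) per file.
seq = np.stack([energy, centroid, bandwidth], axis=1)
```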

Cheers and thanks for any help!

Jodast
  • How short/long are the files? What kind of audio is in there, and what kind of patterns do you want your clustering to find? – Jon Nordby Dec 26 '20 at 21:33
  • @jonnor they're drum samples, like a kick or a snare, so usually about 0.5s or less. I'm looking for interesting timbre patterns or anything that can further categorize these files – Jodast Dec 27 '20 at 00:57

1 Answer


TL;DR: For clustering time-series data, as a baseline, you can try using Dynamic Time Warping (DTW) as a distance metric for your favorite clustering algorithm, say $k$-means.

DTW is commonly used in applications like automatic speech recognition (ASR) and other sequence-alignment problems. I believe the Python package tslearn implements DTW, though I have not used it myself.
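For instance, here is a minimal sketch of DTW-based $k$-means with tslearn (assuming its `to_time_series_dataset` and `TimeSeriesKMeans` APIs; the data, shapes, and cluster count below are placeholders):

```python
import numpy as np
from tslearn.utils import to_time_series_dataset
from tslearn.clustering import TimeSeriesKMeans

# Hypothetical per-file feature sequences, each of shape (n_frames, 3),
# e.g. frame-wise energy, spectral centroid, and spectral bandwidth.
# In practice, standardize each feature dimension first: centroid and
# bandwidth (in Hz) are on a much larger scale than energy.
rng = np.random.default_rng(0)
sequences = [rng.random((20, 3)), rng.random((25, 3)), rng.random((18, 3))]

# Pads variable-length series (with NaN) into one 3-D array.
X = to_time_series_dataset(sequences)

# k-means with DTW as the distance between series.
model = TimeSeriesKMeans(n_clusters=2, metric="dtw", random_state=0)
labels = model.fit_predict(X)
print(labels)
```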

What is DTW?

At a very high level, DTW is a measure of how "aligned" two time series are, so it can capture patterns like "rise in pitch over time + decay in energy;" i.e. two sequences are "close" if you can stretch/compress a source sequence in the time dimension into the target sequence. To compute this, let's define $x, y$ as sequences of length $n, m$ respectively, and $\pi = [\pi_0, \pi_1, \dots, \pi_K]$, where $\pi_k = (i_k, j_k)$, a pair of indices into $x, y$ respectively. Intuitively, $\pi$ defines a "path" through sequences $x, y$ by pairing up indices. Then you minimize the following objective:

$$ \underset{\pi}{\min} \sqrt{\sum_{(i, j) \in \pi} d(x_i, y_j)^2} \quad\text{subject to}$$ $$ i_{k-1} \leq i_k \leq i_{k-1} + 1,\;\; j_{k-1} \leq j_k \leq j_{k-1} + 1 \quad\text{for}\quad 0 < k \leq K;\\ (i_0, j_0) = (0, 0);\\ (i_K, j_K) = (n - 1, m - 1).$$

where $d$ is some distance metric (say $\ell_1, \ell_2$). The final two constraints are just boundary conditions specifying that the path $\pi$ must align the beginnings and ends of $x, y$ with each other; the first constraint specifies that the indices $i_k, j_k$ must be monotonically non-decreasing in $k$, advancing by at most one step at a time.
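To make this concrete, here is a minimal dynamic-programming implementation of the objective above for scalar sequences (an illustrative sketch; in practice you would use an optimized library routine such as tslearn's):

```python
import numpy as np

def dtw_distance(x, y):
    """Minimal DTW between two scalar sequences, matching the objective
    above with squared differences as the local cost."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Each step extends the path by (1, 0), (0, 1), or (1, 1),
            # which enforces the monotonicity/step-size constraints.
            sq = (x[i - 1] - y[j - 1]) ** 2
            cost[i, j] = sq + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return np.sqrt(cost[n, m])

# Two sequences tracing the same shape at different speeds are "close":
a = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0])
b = np.array([0.0, 1.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0])
print(dtw_distance(a, b))  # 0.0 -- perfectly alignable despite different lengths
```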

This schematic from Wikipedia illustrates this quite succinctly:

Alignment of two sequences via DTW.

We can imagine each location where a dashed line originates/terminates as an element of a discrete time-series; these correspond to the various $i_k, j_k$ defined in $\pi$.

What about the curse of dimensionality?

Your intuition that $k$-means runs into the curse of dimensionality is generally correct in practice, particularly when you're using Euclidean distance (see this previous answer for more details). But the curse of dimensionality is more nuanced than "more dimensions = bad;" it affects different metrics in different ways. Very informally, DTW is less "susceptible" to the dimensionality problems that plague squared Euclidean distance, as it better preserves its "meaning" in high-dimensional spaces.

More concretely, using nearest-neighbors search as an illustrative example, under reasonable assumptions about your data-generating distribution, data points tend to become uniformly distant from one another in Euclidean distance as your dimensionality (or in the context of time-series, sequence length) increases; i.e., most points are "almost-nearest" neighbors. DTW does not suffer as severely from this problem as sequence length increases. A more mathematical way to gauge the severity of the curse of dimensionality is shown in Section 3.4 of this paper.

tchainzzz
  • Where would I go from there? I looked into DTW and tslearn, but I can't figure out how to combine it with single-element features – Jodast Dec 27 '20 at 01:37
  • That's a good question -- DTW is intended for sequential features, so it doesn't take care of combining with single-element features out of the box. Since you have a few time-series features and one scalar feature, a naive solution might be to use DTW for the sequential features and a scalar metric for the scalar feature; then, you could add/average the aggregate cost, which should give you a notion of pairwise distance (sketched in code after this thread). Essentially, this clusters each set of features independently. I don't know if this is rigorous/optimal -- this is my first instinct, as I'm not too familiar with this area. – tchainzzz Dec 27 '20 at 01:58
  • I'll try this out! It seems like there might be a better solution that I'll try to look for, but I appreciate this in-depth answer! – Jodast Dec 27 '20 at 03:18
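A minimal sketch of the combining idea from the comment above (assuming tslearn's `dtw` and scikit-learn's `AgglomerativeClustering` with a precomputed distance matrix; note that in scikit-learn versions before 1.2 the parameter is called `affinity` rather than `metric`, and the 1.0 weight on the scalar term is an arbitrary choice):

```python
import numpy as np
from tslearn.metrics import dtw
from sklearn.cluster import AgglomerativeClustering

# Hypothetical inputs: one multivariate sequence and one scalar (ZCR) per file.
rng = np.random.default_rng(0)
seqs = [rng.random((20, 3)), rng.random((25, 3)), rng.random((18, 3))]
zcrs = np.array([12.0, 30.0, 15.0])

# Pairwise distances: DTW on the sequential features plus an absolute
# difference on the scalar feature, weighted by an arbitrary 1.0.
n = len(seqs)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = dtw(seqs[i], seqs[j]) + 1.0 * abs(zcrs[i] - zcrs[j])

# Any clustering algorithm that accepts a precomputed distance matrix works here.
labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(D)
print(labels)
```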