Accidentally asked this question in the general area and was told to ask here, so...
I've been trying to develop a lightweight, relatively fast-to-decode sound compression format for use in my gaming projects (perfect reproduction isn't needed, so I only work with 16-bit data).
The idea is to split the sound data into 14-sample frames and use linear prediction so that only the residuals need to be stored. To make it even lighter, the residuals are quantized to 4 bits per sample: each one is stored as a scaled residual, with the scale dictated by the frame header. To keep the result from getting too noisy, 16 linear prediction models are generated that best suit the signal in the file.
Each frame ends up being 64 bits: 4 bits for the LP model index, 4 bits for the residual scale (the scale is 2^n, so only n is stored), and 14 × 4 bits of residual data.
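In simplified C, the per-frame encode I have in mind looks something like this (the field order, the names and the truncating quantizer are illustrative, not necessarily exactly what my code does):

    #include <stdint.h>

    #define FRAME_LEN 14

    /* Hypothetical 64-bit frame layout, matching the description above:
     *   bits 63..60  LP model index (0..15)
     *   bits 59..56  residual scale exponent n (the scale itself is 2^n)
     *   bits 55..0   14 residuals, 4 bits each, as signed values in [-8, 7]
     */
    uint64_t pack_frame(int model_index, const int32_t residual[FRAME_LEN])
    {
        /* Pick the smallest n so that every scaled residual fits in 4 signed
         * bits; for 16-bit source data n stays comfortably inside 4 bits. */
        int n = 0;
        for (int i = 0; i < FRAME_LEN; ++i)
            while (residual[i] / (1 << n) > 7 || residual[i] / (1 << n) < -8)
                ++n;

        uint64_t frame = ((uint64_t)(model_index & 0xF) << 60)
                       | ((uint64_t)(n & 0xF) << 56);

        for (int i = 0; i < FRAME_LEN; ++i) {
            int32_t q = residual[i] / (1 << n);   /* truncating quantizer */
            frame |= (uint64_t)(q & 0xF) << (4 * (FRAME_LEN - 1 - i));
        }
        return frame;
    }

Decoding just reverses this: unpack the fields, multiply each 4-bit residual by 2^n, and add it to the prediction from the selected LP model.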
What I've been doing is simply solving the Yule-Walker equations for each frame to get its coefficients and saving them in an array. Once I have coefficients for every frame, I try to reduce them to 16 models by grouping the per-frame models by Euclidean distance and averaging each group. That is to say:
Zero out the target buffer (the final 16-model array)
For i = 1; i <= 16; i++
    Create and zero out a temporary buffer for i LP models
    For each saved per-frame LP model
        Find the entry in the target buffer [up to i] with the smallest Euclidean distance to this model
        Accumulate this model's coefficients into the matching slot of the temporary buffer
    Average out the temporary buffer and store it back into the final LP model array [again up to i]
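In C-like terms, the procedure looks roughly like this (paraphrased, so the LPC order, names and buffer handling are illustrative rather than my exact code); frame_lpc is the array of per-frame coefficients from the Yule-Walker solve mentioned above:

    #include <string.h>

    #define ORDER      8    /* LPC order; illustrative, the real one may differ */
    #define NUM_MODELS 16   /* size of the final model set */

    /* Squared Euclidean distance between two coefficient vectors. */
    static double dist2(const double *a, const double *b)
    {
        double d = 0.0;
        for (int k = 0; k < ORDER; ++k) {
            double diff = a[k] - b[k];
            d += diff * diff;
        }
        return d;
    }

    /* Build the 16-entry model set from the per-frame LPC models, as in the
     * pseudocode above: for each set size i, assign every saved model to its
     * nearest current entry, then replace each entry with its cluster mean. */
    void build_models(const double frame_lpc[][ORDER], int num_frames,
                      double models[NUM_MODELS][ORDER])
    {
        memset(models, 0, NUM_MODELS * ORDER * sizeof(double));

        for (int i = 1; i <= NUM_MODELS; ++i) {
            double sum[NUM_MODELS][ORDER] = {{0}};
            int    count[NUM_MODELS]      = {0};

            for (int f = 0; f < num_frames; ++f) {
                /* nearest of the first i entries */
                int    best   = 0;
                double best_d = dist2(frame_lpc[f], models[0]);
                for (int c = 1; c < i; ++c) {
                    double d = dist2(frame_lpc[f], models[c]);
                    if (d < best_d) { best_d = d; best = c; }
                }
                for (int k = 0; k < ORDER; ++k)
                    sum[best][k] += frame_lpc[f][k];
                count[best]++;
            }

            for (int c = 0; c < i; ++c)
                if (count[c] > 0)
                    for (int k = 0; k < ORDER; ++k)
                        models[c][k] = sum[c][k] / count[c];
        }
    }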
As expected, this doesn't yield very good results at all, presumably because it's similar to uniform colour quantization: the 16 models end up having very little relation to the actual data.
After Googling for a few months, the only thing I've come up with is 'sparse convolution', but I'm not entirely sure what that means in the context of LPC, or whether it's even what I'm after.
Given these parameters (that is: generate 16 LPC models that minimize the residual error for 14-sample frames), how would you go about it?