
I have a set of integer series $S_1$, $S_2$, ..., $S_n$. Each series has 3600 data points, and each data point is a positive integer stored as an unsigned int requiring 4 bytes, so storing each series requires $4 \times 3600$ bytes. This is too large, and these series need to be compressed.

The data are queried in the following way, and any compression mechanism should minimize the error on every query.

  1. A few series are selected, say, $S_p$, $S_q$ and $S_r$ (note that any number of series might be selected; I am using 3 just as an example).
  2. The selected series are summed to get $S_T$ such that $$S_T[i] = S_p[i] + S_q[i] + S_r[i]$$ for $i = 1$ to $3600$.
  3. The minimum value and the maximum value of $S_T$ are computed.
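
For concreteness, a toy sketch of the query in Python (numpy is just one convenient choice, and all numbers are made up):

    import numpy as np

    # Toy data: n = 10 series of 3600 positive integers, one unsigned int each.
    S = np.random.randint(1, 1000, size=(10, 3600), dtype=np.uint32)

    p, q, r = 0, 3, 7                # the selected series (0-based indices here)
    S_T = S[p] + S[q] + S[r]         # step 2: pointwise sum
    print(S_T.min(), S_T.max())      # step 3: the two values the query returns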

It is possible that the peaks in the series are aligned, so the series cannot be treated as independent.

kjetil b halvorsen
Nikhil
  • It rather depends on what you want to do with your data later. – Henry Oct 04 '11 at 13:26
  • @Henry It's going to be queried as described. No other queries will be made. – Nikhil Oct 04 '11 at 13:32
  • 3
    Just to be clear: if we define $f(p,q,r)$ to be the set containing the maximum and minimum of $\{(S_p(i)+S_q(i)+S_r(i)), 1\le i\le 3600\}$, then you want to find a data structure of smallest possible size that allows for computation of $f(p,q,r)$ for all $1\le p \le q \le r \le n$. Furthermore, you appear to allow $f$ to be computed with some error: but how much error is acceptable? Is there a chance the sums could overflow? Finally, how large is $n$? (After all, if $n$ is small, we could just store *every* value of $f$!) – whuber Oct 04 '11 at 13:54
  • It is possible that in (1) the word "not" should be "note," which would appreciably change the question! – whuber Dec 14 '15 at 15:12

2 Answers


Is there some reason you can't just use a variable-size encoding with unambiguous prefixes, like Huffman coding or arithmetic coding, for your sequences? These are highly efficient lossless compression techniques. Since they produce a dictionary as part of the output, it should be easy to decompress on the fly, though there could be issues with efficient direct access to the data.
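
For illustration, here is a minimal Huffman-coding sketch in Python (the helper names are made up, and a real implementation would also serialize the code table and pack the bits into bytes):

    import heapq
    from collections import Counter

    def build_codes(data):
        # Build {symbol: bitstring}; the unique tiebreaker keeps tuple
        # comparisons from ever reaching the (non-comparable) dicts.
        heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(Counter(data).items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            f0, _, c0 = heapq.heappop(heap)     # two least-frequent subtrees
            f1, _, c1 = heapq.heappop(heap)
            merged = {s: "0" + b for s, b in c0.items()}
            merged.update({s: "1" + b for s, b in c1.items()})
            heapq.heappush(heap, (f0 + f1, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    series = [100, 101, 101, 102, 101, 100, 100, 100]    # made-up data
    codes = build_codes(series)
    encoded = "".join(codes[v] for v in series)          # the encoded bit string
    print(codes, encoded)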

John Doucette

The scope of this question is perhaps too broad, because it does not specify quantitative targets for compression, does not quantify $n$, does not say whether lossy compression is acceptable (and if so, how much loss is tolerable), and does not provide any helpful characterization of the sequences in question: these are the considerations that lead to good compression solutions.

Nevertheless, some opportunities may be available in particular cases.

Small $n$

As the question is written, exactly three series are selected for summing. If this is always the case, we can compress the data simply by precomputing the min and max for each of the $\binom{n+2}{3} = n(n+1)(n+2)/3!$ possible choices. This will require less storage (and will be far faster in execution) provided $n \le 102$, the largest $n$ for which $2\binom{n+2}{3} \le 3600\,n$. The algorithm is

Precomputation:
    For p=n to 1 by -1 {For q=n to p by -1 {For r=n to q by -1 {
         Compute the min and max of S[p](i)+S[q](i)+S[r](i), 1 <= i <= 3600.
         Let k = Binomial[n-p+2,3] + Binomial[n-q+1,2] + n-r + 1.
         Store (min, max) at words 2*k-1 and 2*k, respectively.
    }}}

Execution:
    Input p,q,r
    Assume, after an initial sort, that p <= q <= r.
    Let k = Binomial[n-p+2,3] + Binomial[n-q+1,2] + n-r + 1.
    Return the values at words 2*k-1 and 2*k as (min, max).

For example, with $n=15$ the compression is $97.5$% because only $2 \times 15 \times 16 \times 17/6 = 1360$ words are needed instead of $15 \times 3600 = 54000$.

With a trick we can handle the option to sum just two or even one sequence rather than exactly three of them: include one more sequence, of zeros, in the data structure. This increases $n$ by $1$; we now need that $n$ not exceed $101$.
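
A Python sketch of this scheme (illustrative names; arrays are 0-based, so entry $k$ lands at index $k-1$):

    from math import comb

    def precompute(S):
        # S: list of n series of equal length; returns the packed (min, max) table.
        n = len(S)
        table = [None] * comb(n + 2, 3)
        for p in range(1, n + 1):
            for q in range(p, n + 1):
                for r in range(q, n + 1):
                    total = [a + b + c for a, b, c in zip(S[p-1], S[q-1], S[r-1])]
                    k = comb(n - p + 2, 3) + comb(n - q + 1, 2) + (n - r + 1)
                    table[k - 1] = (min(total), max(total))
        return table

    def query(table, n, p, q, r):
        p, q, r = sorted((p, q, r))             # the initial sort
        k = comb(n - p + 2, 3) + comb(n - q + 1, 2) + (n - r + 1)
        return table[k - 1]

    S = [[1, 5, 2], [4, 0, 3], [2, 2, 2]]       # toy data: n = 3 short series
    table = precompute(S)
    print(query(table, len(S), 3, 1, 2))        # (7, 7): min and max of the sum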

Slowly varying sequences

Store $S_p(1)$ and the differences $S_p(i+1)-S_p(i)$, for each $p$ and $1 \le i \lt 3600$, as single signed bytes. Reserve a few byte values (such as -128, -127, and -126) to signal that the difference actually requires more bytes (e.g., -128 means two bytes are needed, -127 means three, and -126 means four), and store the true value in the immediately following bytes. An index of the beginnings of the sequences will also be needed, because they will have variable storage requirements. The encoding and decoding algorithms are straightforward; one possible realization is sketched below.
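
A sketch of one such realization in Python (the escape assignments follow the example above, and it assumes every difference fits in four signed bytes):

    ESCAPES = {2: -128, 3: -127, 4: -126}       # escape byte -> width of true value

    def encode(series):
        out = bytearray(series[0].to_bytes(4, "big"))        # S_p(1) in full
        for prev, cur in zip(series, series[1:]):
            d = cur - prev
            if -125 <= d <= 127:                             # fits in one byte
                out += d.to_bytes(1, "big", signed=True)
            else:                                            # escaped wide difference
                w = next(w for w in (2, 3, 4)
                         if -(1 << (8 * w - 1)) <= d < (1 << (8 * w - 1)))
                out += ESCAPES[w].to_bytes(1, "big", signed=True)
                out += d.to_bytes(w, "big", signed=True)
        return bytes(out)

    def decode(buf):
        widths = {v: k for k, v in ESCAPES.items()}
        value = int.from_bytes(buf[:4], "big")
        series, i = [value], 4
        while i < len(buf):
            d = int.from_bytes(buf[i:i+1], "big", signed=True)
            i += 1
            if d in widths:                                  # read the wide value
                w = widths[d]
                d = int.from_bytes(buf[i:i+w], "big", signed=True)
                i += w
            value += d
            series.append(value)
        return series

    s = [70000, 70003, 70001, 69000, 69002]
    assert decode(encode(s)) == s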

When most differences are between -125 and +127, each value needs just one byte instead of four, so the average compression is almost 75%.

General

It is unstated whether the compression is meant to save offline storage or to fit data structures into limited memory during execution. If the former, apply any good general-purpose compression algorithm. If the latter, the decompression has to be part of the execution code. As suggested in the reply by @John Doucette, a Huffman code is easy to implement and reasonably fast; the main requirement is that the source language support efficient bit-twiddling and allow bit-level storage and retrieval of data. It should be applied on top of the methods suggested above. In the second case (storing differences), the Huffman code will capitalize on the non-uniform frequencies of the bytes. Compression rates can be good but are rarely dramatic, because every value requires at least one bit of storage. Another attractive method (to be applied after one of the previous methods yet before Huffman coding) is run length encoding, which is appropriate when many of the sequences contain moderate to long stretches of constant values or constant differences; a minimal sketch appears below.
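
For example, a minimal run-length sketch over a stream of difference bytes (a made-up (count, value) format, with runs capped at 255 so each count fits in one byte):

    def rle_encode(values):
        runs = []
        for v in values:
            if runs and runs[-1][1] == v and runs[-1][0] < 255:
                runs[-1][0] += 1                 # extend the current run
            else:
                runs.append([1, v])              # start a new run
        return runs

    def rle_decode(runs):
        return [v for count, v in runs for _ in range(count)]

    diffs = [0, 0, 0, 0, 1, 1, 0, 0, 0]          # differences of a slowly varying series
    print(rle_encode(diffs))                     # [[4, 0], [2, 1], [3, 0]]
    assert rle_decode(rle_encode(diffs)) == diffs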

whuber