What is the difference between data compression and compression in cryptographic algorithms?

Question

What is the differences between data compression as used in e.g. the ZIP protocol and compression as performed in cryptographic hashes? Are there common properties as well, apart from creating a smaller representation of the data?

As usual, trying to create a go-to Q/A so we don't have to explain it over and over again :) — Maarten Bodewes, May 17 '20 at 20:39

score 6 · Accepted Answer · edited May 17 '20 at 22:33

6

In data lossless compression we want the data recoverable from the compressed form. This is usually helpful if the entropy is low like text files, so-called zipping.
In data lossy compression like JPEG, we still want to get the image (data) but we don't care about the full quality of the image. This can be considered as the original data can be recovered approximately.
The compression function of hash functions (one-way compression function) takes two inputs and produces one output - the compression ratio doesn't need to be 2:1. The transformation is required to be
- one way: i.e. given an output finding the input value should be difficult.
- and, have avalanche property: each output bit must depend on all input bits.

Being one way is not a desired property for lossless compression, however, lossy compression is not totally one way but some information is lost. JPEG or similar compression is not a good compression for hash functions.

Avalance property doesn't suit the data compression since that causes a random behavior on the compression. In avalanche property, we want each output bit to have a 50% chance of flip for a bit flip of an input bit.

In short, a single word "compression" should not be used to identify those. Use the "data lossless compression" or "data lossy compression" for data compression and "compression function" for hash functions.

edited May 17 '20 at 22:33

Maarten Bodewes

88,868
12
146
304

answered May 17 '20 at 21:01

kelalaka

45,607
9
104
179

2

Some data compression algorithms have a fair degree of avalanche property, e.g. when they use arithmetic coding. However that's accidental, not by design. – fgrieu May 18 '20 at 06:26
I would not approve using the term "compression" at all when referring to irreversible functions. Lossy compression is appropriate because they are mostly, approximately reversible; but it's arguably more clear to teach (and think) of "compression" and "hashing" as separate, distinct, non-overlapping concepts and discourage referring to one-way hashing as compression because its properties are so substantially different; and talk about compression only when you have a *pair* of functions that can not only compress but also decompress. – Peteris May 18 '20 at 09:05
@fgrieu could you name one? – kelalaka May 18 '20 at 09:09
2

@kelalaka: my remark is based on tests two decades ago, with a Lempel-Ziv + arithmetic encoder, which IIRC was LZARI. I have not idea if [this](https://github.com/ps2dev/mymc/blob/master/lzari.py) will exhibit the same effect. The diffusion is forward-only. – fgrieu May 18 '20 at 09:48
@Peteris, unfortunately, the term is heavily in use in Merkle-Damgard construction. I'm not able to change it, just restated. I think they need a term other than hash since they need to construct the hash over the compression function. Note the term compression is lossless [Google: compression](https://www.google.com/search?q=compression+definition&pws=0&gl=us&gws_rd=cr). It might be true in a sense of lossy compression. – kelalaka May 18 '20 at 10:19
"each output bit must depend on all input bits" - better worded: a change to a single bit of the input should ideally change half of the output bits. – OrangeDog May 18 '20 at 12:22
@OrangeDog I think that is already covered by the first-to-last section of the answer. I think you are right, but it would mean repetition of the same info if the current structure of the answer is kept. – Maarten Bodewes May 18 '20 at 12:27
@OrangeDog, is that even true without qualifying it with "on average" or such? – ilkkachu May 18 '20 at 12:39
@ilkkachu no, I said "should ideally". You want to get a narrow distribution around that point, not a wide average. – OrangeDog May 18 '20 at 13:17
1

Isn't it useful to point out that hash functions are a one way avalanche to a set size for all input sizes? – Hogan May 18 '20 at 14:20
@Hogan you did. The question was about the compression. I want it to keep as simple as possible. – kelalaka May 18 '20 at 14:24

score 0 · Answer 2 · edited May 17 '20 at 22:16

Think of crypto compression as information destruction. This relies on the output domain being smaller than the input domain via wacky cryptographic primitives. Thru a lot of $\oplus$ and $\boxplus$ operators.

Information is irretrievably lost and thus absolutely mathematically irreplaceable. So a one way function and thus secure (or not if the implementation is flawed, or susceptible to computational power).

The objectives of cryptography are not those of information preservation.

What is the difference between data compression and compression in cryptographic algorithms?

2 Answers2