
Does a checksum ensure absolute data integrity? That is, if a piece of data several gigabytes long has even a single bit changed, can I rely on the checksum being different, particularly with a strong hash function like SHA-256?

Phrased differently, I'm asking whether a checksum offers only an assurance of reasonable integrity, meaning a few bytes here and there could go astray without causing the checksum to fail (which would be fine for, e.g., video files or other error-tolerant data streams), or whether a correct checksum ensures the data is entirely intact.

To summarize, does a checksum provide assurance of sufficient integrity, or sufficient assurance of absolute integrity?

1 Answer


I'll give you a statistical answer. If the checksum computed from the file and the correct checksum (which I'll call the reference) disagree, we are 100% certain the file has an error. On the other hand, if the checksum and the reference agree, we are reasonably assured the file is error-free.

How sure?

Suppose the checksum value is $n$ characters long in a $k$-letter alphabet. For an ideal checksum algorithm, an error occurring in the file will produce an incorrect checksum with probability

$$P_{chk}(\text{checksum fails}\mid\text{error})=\frac{k^n-1}{k^n}=1-\frac{1}{k^n}$$

The assumption is that for any error or set of errors, the resulting checksum will be completely random. Since the correct checksum is $1$ of $k^n$ possibilities, the probability that an error or set of errors in the file randomly results in the correct checksum value is $\frac{1}{k^n}$.
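If you want to see how quickly that probability shrinks, here's a quick sketch (the function name is mine, and it simply evaluates the ideal-model formula above):

```python
from fractions import Fraction

def false_match_probability(k: int, n: int) -> Fraction:
    """Probability that a corrupted file still matches the reference,
    under the idealized model where an error produces a uniformly
    random checksum over the k**n possible values."""
    return Fraction(1, k ** n)

# A few illustrative sizes for a binary alphabet (k = 2)
for bits in (2, 16, 64, 256):
    p = false_match_probability(2, bits)
    print(f"{bits:3d}-bit checksum: miss probability = 1/{p.denominator}")
```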

To illustrate this, let's make a basic checksum. I'll choose binary data (a $k=2$ alphabet, either $1$'s or $0$'s), and let's say my file data is $M$ bits in length. We have to decide on an algorithm for our checksum. I'll propose a simple one: interpret the $M$-bit data as an integer and take its value modulo $k^n$, where $n$ is our checksum length in characters. For example, take $data=1001$, which represents $8+1=9$, and let's say our checksum is $n=2$ bits long, so we reduce modulo $2^2=4$. From above, we know this isn't a good checksum, since $P_{chk}=1-1/4=75\%$; i.e., 25% of the time our checksum will erroneously agree with the reference despite the data containing an error. Nevertheless, the reference checksum for this example would be $9 \bmod 4=1$, which in binary is $01$.

Now if we made an error in the data, for example $data=1000$, we see that the checksum of the erroneous data would be $8 \bmod 4=0$, or $00$ in our binary representation, so we'd be 100% confident that the data contains an error. On the other hand, if $data=0001$, then $1 \bmod 4=1$ and the checksum is $01$, so we'd erroneously conclude that the data were correct.
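Here's a minimal sketch of that toy checksum in Python (the name `toy_checksum` is just mine for illustration), reproducing both cases:

```python
def toy_checksum(data_bits: str, n: int = 2) -> str:
    """Interpret the bit string as an integer, reduce it mod 2**n,
    and return the result as an n-bit binary string."""
    value = int(data_bits, 2)              # e.g. "1001" -> 9
    return format(value % (2 ** n), f"0{n}b")

reference = toy_checksum("1001")           # correct data: 9 mod 4 = 1 -> "01"

print(toy_checksum("1000") == reference)   # False -> this error is detected
print(toy_checksum("0001") == reference)   # True  -> this error slips through
```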

To emphasize how much depends on the checksum algorithm itself: if you think about my toy algorithm, you'll realize it only catches mistakes in the lowest two bits. The statistical statement above is correct in the sense that an error will produce a disagreeing checksum 75% of the time, but only if errors are equally likely to land in every bit. In reality, if the lowest two bits are correct, my checksum will always check out (try it yourself).
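If you do want to try it, here's a quick self-contained check (again just an illustrative sketch) showing that only flips in the lowest two bits are caught:

```python
def toy_checksum(data_bits: str, n: int = 2) -> int:
    return int(data_bits, 2) % (2 ** n)

original = "110100"  # arbitrary 6-bit example
for i in range(len(original)):
    flipped = list(original)
    flipped[i] = "1" if flipped[i] == "0" else "0"  # flip one bit
    caught = toy_checksum("".join(flipped)) != toy_checksum(original)
    print(f"flip bit {i}: {'caught' if caught else 'missed'}")
# Only flips in the last two positions are ever caught.
```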

Lastly, I just wanted to point out that an ideal checksum algorithm is quite a remarkable tool. A gigabyte file you downloaded off the internet is $8589934592$ bits long. To be truly sure the file is correct, we'd need to compare the download to the actual file, which of course doesn't make sense (you'd already have the file!). Yet with even a modest $n=64$, $k=2$ checksum (a 64-bit checksum), we can be sure with probability $P_{chk}=1-1/2^{64}\approx1$ that if the checksum and the reference agree, our file is intact.
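In practice you wouldn't roll your own; in Python, for instance, the standard `hashlib` module gives you SHA-256 directly (the file name and the reference digest below are placeholders):

```python
import hashlib

def sha256_of_file(path: str) -> str:
    """Stream the file in chunks so even multi-gigabyte files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# 'download.iso' and the reference digest are placeholders for illustration.
reference = "digest published alongside the download"
if sha256_of_file("download.iso") == reference:
    print("checksums agree: file is almost certainly intact")
else:
    print("checksums differ: the file definitely has an error")
```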

vector07
  • So we can confirm that a matching checksum has an almost zero likelihood of permitting *any error at all*? – JamesTheAwesomeDude Apr 06 '14 at 16:16
  • @JamesTheAwesomeDude What I was trying to convey to you was the importance of understanding how the checksum algorithm works. In my example, you can see there is a distinct difference between theory and practice because my checksum algorithm is *not* ideal. Also, different checksum algorithms have different purposes. In the most basic sense, an "ideal" checksum would give impossibly small odds of agreeing with the reference given any error. In practice, the outcome depends greatly on the algorithm used. But in general, yes, a checksum is not supposed to "tolerate" some errors unless designed to do so. – vector07 Apr 06 '14 at 17:24